Statistics on data with GNU datamash

Today I needed some statistics for a memory and storage analysis. We'll be processing a significant number of records, plus an even larger set that represents every record ever observed!

The design is coming along well and we have a path to tackle this. However, I wanted to set some expectations for the memory required at runtime and the storage required over time.

Tentatively, the plan is to write out records not unlike the OCI content descriptor, serialized as line-delimited JSON. Each entry describes a file read from a filesystem and records its path, size, and content digest.

That's the background.

Now, I needed to process a set of these records to find a typical record size that I could scale up to a larger theoretical set. I was going to hack it out with awk (because I have a tendency to do that) but found myself looking into GNU datamash instead.
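A minimal sketch of how a sizes file like records.sizes could be produced, assuming each record sits on its own line. The file name records.jsonl, the field names, and the digests are made up for illustration; only the path/size/digest shape comes from the description above.

```shell
# Hypothetical sample: two records in the assumed line-delimited JSON shape.
cat > records.jsonl <<'EOF'
{"path":"/etc/hostname","size":9,"digest":"sha256:aaaa"}
{"path":"/etc/os-release","size":386,"digest":"sha256:bbbb"}
EOF

# Emit one number per line: the length of each serialized record.
# LC_ALL=C makes awk count bytes rather than multibyte characters;
# the trailing newline is not included in the count.
LC_ALL=C awk '{ print length($0) }' records.jsonl > records.sizes

cat records.sizes
```

Note that this measures the size of the serialized record itself, not the size field inside it, since the record is what has to be held in memory and stored.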

datamash does everything I want and gives me my quantiles as well!
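For a sense of what the awk route involves, here is a rough sketch of count, min, max, and median over a single column of numbers. The sample values and the sample.sizes file name are made up; this is one possible way to do it, not a replacement for what datamash computes.

```shell
# Small made-up sample; in practice this would be the real sizes file.
printf '%s\n' 69 260 452 539 101 > sample.sizes

# awk has no portable built-in sort, so sort numerically first,
# then index into the stored values once all lines are read.
awk_stats=$(sort -n sample.sizes | awk '
  { v[NR] = $1 }
  END {
    # Median: the middle value, or the mean of the two middle values.
    median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
    printf "count=%d min=%d max=%d median=%s\n", NR, v[1], v[NR], median
  }')

echo "$awk_stats"
```

This prints count=5 min=69 max=539 median=260. It works, but every extra statistic (percentiles especially) means more hand-written indexing, and the whole column has to be held in an awk array.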

datamash --headers count 1 min 1 max 1 median 1 perc:99 1 < records.sizes | column -t

You specify each stat followed by the field you want it computed for, and voilà:

count(101)  min(101)  max(101)  median(101)  perc:99(101)
6329640     69        539       260          452

One caveat: --headers reads a header line as well as writing one, so the first line of the input is taken as the column name. That is presumably where the 101 in the column labels comes from, and it means that value is left out of the stats. For input with no header row, --header-out writes the labels without consuming the first record.

Now I have readily accessible and usable data to work with! Neat.
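
Scaling these numbers up is then plain arithmetic. The sketch below assumes a target of 100 million records, which is a placeholder and not a figure from the real system, and multiplies it by the median and 99th percentile sizes from the output above.

```shell
# Hypothetical target set size (placeholder, not a real figure).
target_records=100000000

# Per-record sizes in bytes, taken from the datamash output.
median_bytes=260
p99_bytes=452

# Typical estimate uses the median; the pessimistic one pretends
# every record is as large as the 99th percentile.
typical_total=$((target_records * median_bytes))
pessimistic_total=$((target_records * p99_bytes))

gib=$((1024 * 1024 * 1024))
echo "typical:     ${typical_total} bytes (~$((typical_total / gib)) GiB)"
echo "pessimistic: ${pessimistic_total} bytes (~$((pessimistic_total / gib)) GiB)"
```

With these inputs the estimates come out at roughly 24 GiB and 42 GiB of raw record data. The p99 figure is a deliberate overestimate; adding the mean operation to the datamash call would give a per-record average that scales more faithfully than the median.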