In any real-world system there are often multiple layers of encoding and compression applied to data: a base 64 encoded image in an HTML file may be gzipped for transport; a Snappy-compressed byte array in a datastore might be base 64 encoded in a JSON message. Encoding and lossless compression are invertible transformations, and an invertible transformation must neither create nor destroy information. Yet, as the adversarial case of UUIDs shows, various textual encodings prevent compression algorithms from reaching their potential (that is, the information content of the data).

Base 64 encoding translates arbitrary binary data to valid UTF-8 text, and is used for representing binary data in JSON messages, images embedded in HTML, and various tokens used for authentication and authorisation. Base 64 maps each 6-bit subword of a byte stream to one of 64 valid bytes, so it needs four output bytes for every three input bytes: a 33% overhead.
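To make the overhead concrete, here is a minimal sketch using the JDK's java.util.Base64; the 1 KiB buffer size is an arbitrary choice, not taken from the benchmarks below.

```java
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        // 1 KiB of arbitrary binary data; the contents don't affect the size calculation
        byte[] binary = new byte[1024];
        byte[] encoded = Base64.getEncoder().encode(binary);
        // Each 3-byte group becomes 4 output bytes, plus padding for the final group
        System.out.printf("binary: %d bytes, base 64: %d bytes (%.1f%% overhead)%n",
                binary.length, encoded.length,
                100.0 * (encoded.length - binary.length) / binary.length);
    }
}
```

With padding included, 1024 bytes encode to 1368 bytes, a little over 33% larger.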

What would happen if we base 64 encode incompressible binary data (already compressed data, say, or an encryption key) in JSON, and then apply transport-level compression to the JSON? Can any compression algorithm undo the 33% inflation? Of GZIP, LZ4 and Snappy, only GZIP is capable of doing so (albeit extremely slowly). The transformation is opaque to the much faster LZ4 and Snappy algorithms.

| Compression  | Encoding | Count   | Compressed Size (MB) |
|--------------|----------|---------|----------------------|
| Uncompressed | binary   | 1000000 | 976.56               |
| Uncompressed | base64   | 1000000 | 1304.63              |
| GZIP         | binary   | 1000000 | 976.86               |
| GZIP         | base64   | 1000000 | 988.74               |
| LZ4          | binary   | 1000000 | 980.39               |
| LZ4          | base64   | 1000000 | 1309.74              |
| Snappy       | binary   | 1000000 | 976.61               |
| Snappy       | base64   | 1000000 | 1304.69               |
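These numbers come from the benchmark code linked at the end of the post. A simplified sketch of the same comparison, using only JDK classes and random bytes as a stand-in for incompressible data, might look like this; the class and method names are illustrative, and the LZ4 and Snappy variants would additionally need third-party libraries such as lz4-java and snappy-java.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.concurrent.ThreadLocalRandom;
import java.util.zip.GZIPOutputStream;

public class GzipVsBase64 {

    // Returns the GZIP-compressed size of the input in bytes.
    static int gzippedSize(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(input);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] binary = new byte[1024];
        ThreadLocalRandom.current().nextBytes(binary); // incompressible payload
        byte[] base64 = Base64.getEncoder().encode(binary);

        System.out.printf("binary: %d raw, %d gzipped%n", binary.length, gzippedSize(binary));
        System.out.printf("base64: %d raw, %d gzipped%n", base64.length, gzippedSize(base64));
        // GZIP claws back most of the base 64 inflation because each base 64
        // byte carries only 6 bits of information; LZ4 and Snappy do not.
    }
}
```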

Writing monotonically increasing numbers into a byte[] prior to base 64 encoding tells another story. GZIP, as usual, takes a very long time to achieve a reasonable compression ratio, and here the ratio depends on the encoding. LZ4 and Snappy are largely insensitive to the base 64 encoding, but neither compresses much below the size of the original binary data.
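The exact generator is in the benchmark code; one plausible shape for the data, inferred from the roughly 1 KiB per record implied by the tables, is 256 consecutive ints packed into a byte[]. The layout here is an assumption for illustration, not taken from the benchmarks.

```java
import java.nio.ByteBuffer;

public class MonotonicData {
    // Packs 256 consecutive ints (1 KiB) into a byte[]. The record layout is a
    // guess based on the per-record sizes in the tables, not the actual benchmark data.
    static byte[] monotonic(int start) {
        ByteBuffer buffer = ByteBuffer.allocate(256 * Integer.BYTES);
        for (int i = 0; i < 256; i++) {
            buffer.putInt(start + i);
        }
        return buffer.array();
    }
}
```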

| Compression  | Encoding | Count   | Compressed Size (MB) |
|--------------|----------|---------|----------------------|
| Uncompressed | binary   | 1000000 | 976.56               |
| Uncompressed | base64   | 1000000 | 1304.63              |
| GZIP         | binary   | 1000000 | 337.67               |
| GZIP         | base64   | 1000000 | 523.80               |
| LZ4          | binary   | 1000000 | 980.39               |
| LZ4          | base64   | 1000000 | 987.99               |
| Snappy       | binary   | 1000000 | 976.61               |
| Snappy       | base64   | 1000000 | 948.17               |

There is only a single bit of difference between each consecutive four byte sequence in this data: it could be abbreviated by storing the first value, the difference of the differences between consecutive four byte sequences, and the position of each inflection point. This particular data could take up fewer than 100 bits, but in order to exploit that, we would have to be expecting this pattern, and hope it hasn't been scrambled by base 64 encoding.
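To illustrate, here is a hedged sketch of that idea under the assumption that the increments within a block are constant, as they are for consecutive ints: the whole kilobyte collapses to a first value, a single delta and a count. The helper name is hypothetical, not from the benchmark code.

```java
import java.nio.ByteBuffer;

public class DeltaOfDelta {
    // Summarises a block of ints with a constant increment as {first, delta, count}:
    // a few dozen bits instead of 1 KiB. Throws if the increment ever changes,
    // i.e. this sketch does not handle inflection points.
    static long[] summarise(byte[] block) {
        ByteBuffer buffer = ByteBuffer.wrap(block);
        int count = block.length / Integer.BYTES;
        int first = buffer.getInt();
        int previous = first;
        int delta = 0;
        for (int i = 1; i < count; i++) {
            int current = buffer.getInt();
            if (i == 1) {
                delta = current - previous;
            } else if (current - previous != delta) {
                throw new IllegalStateException("increment changed at index " + i);
            }
            previous = current;
        }
        return new long[] {first, delta, count};
    }
}
```

Applied to the output of monotonic(0) above, this returns {0, 1, 256}: a tiny description of the block. A general-purpose compressor scanning base 64 text has no way to see that structure.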

Text encoding is a good way to confound a compression algorithm, but how do compression algorithms compose? This isn't such an odd question: imagine you store a compressed BLOB in some kind of datastore and provide an HTTP interface to that datastore. Perhaps without anyone realising it, the compression algorithm used for the BLOBs will likely be composed with GZIP for the transport, and if the interchange format is JSON, there will be a layer of base 64 encoding involved too.
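As a sketch of that accidental pipeline (assuming the snappy-java library for the BLOB layer; the JSON assembly is deliberately crude and the method name is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

import org.xerial.snappy.Snappy; // assumed dependency: snappy-java

public class LayeredEncoding {
    // Snappy-compressed BLOB -> base 64 inside a JSON message -> GZIP for transport.
    static byte[] respond(byte[] blob) throws IOException {
        byte[] stored = Snappy.compress(blob);                        // datastore layer
        String json = "{\"blob\":\""
                + Base64.getEncoder().encodeToString(stored) + "\"}"; // interchange layer
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {     // transport layer
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }
}
```

The table below omits the base 64 layer and measures only the composition of the compression algorithms themselves.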

For monotonic integers, neither LZ4 nor Snappy compresses the data, but at least they don't get in GZIP's way.

| Compression  | Pre-Compression | Count   | Compressed Size (MB) |
|--------------|-----------------|---------|----------------------|
| Uncompressed | Uncompressed    | 1000000 | 976.56               |
| Uncompressed | Snappy          | 1000000 | 981.33               |
| Uncompressed | LZ4             | 1000000 | 981.33               |
| LZ4          | Uncompressed    | 1000000 | 980.39               |
| LZ4          | Snappy          | 1000000 | 981.32               |
| LZ4          | LZ4             | 1000000 | 981.32               |
| Snappy       | Uncompressed    | 1000000 | 976.61               |
| Snappy       | Snappy          | 1000000 | 981.37               |
| Snappy       | LZ4             | 1000000 | 981.37               |
| GZIP         | Uncompressed    | 1000000 | 337.67               |
| GZIP         | Snappy          | 1000000 | 339.95               |
| GZIP         | LZ4             | 1000000 | 339.95               |

It’s worth stressing that both LZ4 and Snappy can compress some data very well; they just aren’t well suited to monotonic integers. Both algorithms are much faster than GZIP, prioritising speed over compression ratio. I haven’t measured it, but the GZIP code used in these benchmarks is too slow to be practical, and its compression ratio is still a long way from optimal.

I often see decisions to use a compression algorithm taken rather arbitrarily, without due consideration of the data being compressed. This is often driven by the quest for modularity. Modularity entails composition, which can occasionally be toxic, and often requires that we forget aspects of domain knowledge that could be exploited to great effect.

My code for this blog post is on GitHub.