Collaborative Compression
I have recently become interested in how the effects of compression algorithms and text encoding compose. I started looking at this in my last post. Base 64 encoding maps each 6-bit subword of a byte stream to one of 64 possible bytes, which is guaranteed to waste 2 bits per output byte, but can encode any binary data as UTF-8 text. On a block of incompressible binary data encoded as base 64, neither LZ4 nor Snappy can compress the text back down to the size of the original binary data, whereas GZIP can undo the 33% inflation. With monotonically increasing integers, LZ4 and Snappy achieve size parity with the uncompressed binary data, whereas GZIP can be less effective on base 64 encoded text than on the equivalent binary data.
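To make the composition concrete, here is a minimal sketch of the incompressible case, using random bytes as a stand-in for incompressible data and the JDK's `Base64` and `GZIPOutputStream`. It isn't the benchmark harness, just an illustration of how the layers are stacked.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class Base64Inflation {

  public static void main(String[] args) throws IOException {
    // random bytes as a stand-in for incompressible binary data
    byte[] binary = new byte[1 << 20];
    new Random(0).nextBytes(binary);

    // base 64 maps each 6-bit group to one of 64 ASCII bytes,
    // so the encoded text is ~33% larger than the input
    byte[] base64 = Base64.getEncoder().encode(binary);

    System.out.println("binary:       " + binary.length);
    System.out.println("base64:       " + base64.length);
    System.out.println("gzip(binary): " + gzip(binary).length);
    System.out.println("gzip(base64): " + gzip(base64).length);
  }

  static byte[] gzip(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzos = new GZIPOutputStream(bos)) {
      gzos.write(input);
    }
    return bos.toByteArray();
  }
}
```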
I was interested to see whether using LZ4 or Snappy as an intermediate step between base 64 and GZIP would make a difference. Compressing monotonic integers again, my expectation was that LZ4 or Snappy would “undo” the base 64 bloat, so that the composition would reach parity with GZIP applied to the raw binary data, but that’s not what happened:
| Compression | Transformation | Count | Compressed Size (MB) |
|---|---|---|---|
| Uncompressed | binary | 1000000 | 976.56 |
| Uncompressed | base64 | 1000000 | 1304.63 |
| Uncompressed | base64/snappy | 1000000 | 969.20 |
| Uncompressed | base64/lz4 | 1000000 | 993.98 |
| GZIP | binary | 1000000 | 337.67 |
| GZIP | base64 | 1000000 | 523.80 |
| GZIP | base64/snappy | 1000000 | 61.60 |
| GZIP | base64/lz4 | 1000000 | 31.53 |
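The layered rows above come from composing the transformations in sequence: encode as base 64, compress with Snappy or LZ4, then GZIP the result. The sketch below follows that shape, but the payload (a single array of monotonically increasing longs) and the library entry points (snappy-java’s `Snappy.compress`, lz4-java’s `LZ4Factory`) are assumptions for illustration rather than the benchmark setup.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;
import net.jpountz.lz4.LZ4Factory;   // lz4-java (assumed dependency)
import org.xerial.snappy.Snappy;     // snappy-java (assumed dependency)

public class LayeredCompression {

  public static void main(String[] args) throws IOException {
    // a single array of monotonically increasing longs
    // (an assumption about the payload, not the benchmark's exact layout)
    ByteBuffer buffer = ByteBuffer.allocate(1_000_000 * Long.BYTES);
    for (long i = 0; i < 1_000_000; i++) {
      buffer.putLong(i);
    }
    byte[] binary = buffer.array();

    // layer 1: base 64 text
    byte[] base64 = Base64.getEncoder().encode(binary);
    // layer 2: a fast compressor over the base 64 text
    byte[] base64Snappy = Snappy.compress(base64);
    byte[] base64Lz4 = LZ4Factory.fastestInstance().fastCompressor().compress(base64);

    // layer 3: GZIP over each representation
    System.out.println("gzip(binary):        " + gzip(binary).length);
    System.out.println("gzip(base64):        " + gzip(base64).length);
    System.out.println("gzip(base64/snappy): " + gzip(base64Snappy).length);
    System.out.println("gzip(base64/lz4):    " + gzip(base64Lz4).length);
  }

  static byte[] gzip(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzos = new GZIPOutputStream(bos)) {
      gzos.write(input);
    }
    return bos.toByteArray();
  }
}
```

The absolute sizes from a toy payload like this won’t match the table; the point is only the order in which the layers are applied.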
I had already noted that all three algorithms had poor compression ratios for this case, which is better suited to delta encoding, yet applying base 64 encoding prior to a fast compression algorithm seems to prime the input for GZIP. This result confused me enough that I spent some time on sanity checks, verifying that the compressed output really can be decompressed back to the original data.
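By sanity check I mean a full round trip: push the data through every layer, then reverse each one and compare against the original bytes. A minimal sketch along those lines, again leaning on snappy-java and the JDK, with an arbitrary payload standing in for the real data:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.xerial.snappy.Snappy;   // snappy-java (assumed dependency)

public class RoundTripCheck {

  public static void main(String[] args) throws IOException {
    // any payload will do for a round-trip check
    byte[] original = new byte[1 << 20];
    new Random(0).nextBytes(original);

    // forward: base 64 -> snappy -> gzip
    byte[] compressed = gzip(Snappy.compress(Base64.getEncoder().encode(original)));

    // backward: gunzip -> snappy uncompress -> base 64 decode
    byte[] restored = Base64.getDecoder().decode(Snappy.uncompress(gunzip(compressed)));

    System.out.println("round trip ok: " + Arrays.equals(original, restored));
  }

  static byte[] gzip(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzos = new GZIPOutputStream(bos)) {
      gzos.write(input);
    }
    return bos.toByteArray();
  }

  static byte[] gunzip(byte[] input) throws IOException {
    try (GZIPInputStream gzis = new GZIPInputStream(new ByteArrayInputStream(input))) {
      return gzis.readAllBytes();
    }
  }
}
```

The LZ4 path is analogous, except that lz4-java’s fast decompressor needs the decompressed length up front.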
This isn’t very exciting: the result is not competitive with delta encoding, and when I ran the same test with a sinusoidal byte stream I saw the opposite effect:
| Compression | Transformation | Count | Compressed Size (MB) |
|---|---|---|---|
| Uncompressed | binary | 1000000 | 976.56 |
| Uncompressed | base64 | 1000000 | 1304.63 |
| Uncompressed | base64/snappy | 1000000 | 1087.73 |
| Uncompressed | base64/lz4 | 1000000 | 1089.67 |
| GZIP | binary | 1000000 | 27.74 |
| GZIP | base64 | 1000000 | 69.12 |
| GZIP | base64/snappy | 1000000 | 282.52 |
| GZIP | base64/lz4 | 1000000 | 283.81 |
Still, it’s hard to predict how multiple layers of data representation will interact.