UUIDs and Compressibility
Universally unique identifiers, or UUIDs, are often used for database primary keys in scenarios where coordination of key generation is either impossible or impractical. UUIDs offer very good probabilistic guarantees of collision avoidance, at the cost of 128 bits per key. 128 bits per key is quite problematic in key scans and joins: with appropriately structured data, these operations can benefit significantly from vector processing, but at 128 bits per element, vectorisation is probably a lost cause. The 128-bit cost is also a fixed cost, even if your database has fewer than one quintilliard rows. Because the keys must be random enough to avoid collisions when generated in a distributed system, there is no way to compress UUID keys down to the number of bits actually required to identify each record in the dataset. Data engineering is about tradeoffs, and none of this is to say UUIDs should never be used, but it is prudent to be aware of the costs. All of this applies in the best case: assuming the keys are stored in their binary format! How bad can it get if UUIDs are stored as text? Can compression undo the damage?
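For concreteness, this is all the binary format amounts to in Java: a UUID is just two longs, which pack into 16 bytes (the class and helper names below are mine, for illustration only):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidToBytes {
    // A UUID is 128 bits: two 64-bit halves, which fit in exactly 16 bytes.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                         .putLong(uuid.getMostSignificantBits())
                         .putLong(uuid.getLeastSignificantBits())
                         .array();
    }

    public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        System.out.println(uuid + " -> " + toBytes(uuid).length + " bytes");
    }
}
```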
If you work with a relational database like Postgres, you can use an implementation-specific uuid type to ensure UUIDs take up as little space as possible. However, having worked on several projects using NoSQL databases, I have seen people store UUIDs as text on at least two occasions (though this is not the fault of the databases!). How harmful this is depends on the character encoding used, but UTF-8 is quite common (for the characters found in a UUID, this is equivalent to ISO-8859-1). A UUID represented by a string like “9289f568-c33f-4667-8a3d-092aa4e21262” can take up the following sizes, depending on the encoding used.
| Format | Size (bytes) | Ratio |
|---|---|---|
| binary | 16 | 1 |
| ISO-8859-1 | 36 | 2.25 |
| UTF-8 | 36 | 2.25 |
| UTF-16 | 74 | 4.625 |
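These numbers are easy to reproduce in Java; note that the 74 bytes for UTF-16 come from the encoder prepending a 2-byte byte order mark to the 72 bytes of character data (the class name below is just for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidEncodingSizes {
    public static void main(String[] args) {
        // A UUID's string form is always 36 characters, e.g. "9289f568-c33f-4667-8a3d-092aa4e21262"
        String text = UUID.randomUUID().toString();
        System.out.println("binary: 16 bytes (two 64-bit halves)");
        for (Charset charset : new Charset[] {
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, StandardCharsets.UTF_16}) {
            // UTF-16 reports 74 bytes because getBytes prepends a 2-byte byte order mark
            System.out.println(charset + ": " + text.getBytes(charset).length + " bytes");
        }
    }
}
```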
The real issue here is not so much the extra storage burden, because keys are usually much smaller than the values, but that the keys are used to process queries. A representation requiring 2-5x more space means 2-5x more data passing through the processor when evaluating queries. Many NoSQL databases offer succinct compression options for keys, such as the prefix and delta encoding available in HBase, which allow the keys to be processed without decompression at a small computational cost. This approach can work wonders with well structured keys, but it will do absolutely nothing if your keys are random.
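To see why, here is a toy illustration (not HBase's implementation) of how much prefix encoding could save on sorted, well structured keys compared to random UUIDs; the key format and class names are invented for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

public class PrefixEncodingIntuition {
    // Characters saved by prefix encoding: for each key after the first,
    // store only the suffix that differs from the previous (sorted) key.
    static long savedByPrefixEncoding(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        long saved = 0;
        for (int i = 1; i < sorted.size(); i++) {
            saved += commonPrefixLength(sorted.get(i - 1), sorted.get(i));
        }
        return saved;
    }

    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> structured = new ArrayList<>();
        List<String> random = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            structured.add(String.format("customer:%010d", i)); // long shared prefixes
            random.add(UUID.randomUUID().toString());           // barely any shared prefix
        }
        System.out.println("structured keys, chars saved: " + savedByPrefixEncoding(structured));
        System.out.println("random UUID keys, chars saved: " + savedByPrefixEncoding(random));
    }
}
```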
Even heavyweight compression techniques requiring global or block-level decompression before evaluation can’t recover the bloat in a textual representation of a UUID, because the text is almost as random as the bytes. Using the code on github, I compressed collections of 1M UUIDs with compression algorithms typically reserved for “cold data”: GZIP, LZ4 and Snappy.
| Compression | Encoding | Count | Size (MB) |
|---|---|---|---|
| Uncompressed | binary | 1000000 | 15.26 |
| Uncompressed | ISO-8859-1 | 1000000 | 34.33 |
| Uncompressed | UTF-16 | 1000000 | 70.57 |
| GZIP | binary | 1000000 | 15.26 |
| GZIP | ISO-8859-1 | 1000000 | 19.50 |
| GZIP | UTF-16 | 1000000 | 23.73 |
| LZ4 | binary | 1000000 | 15.32 |
| LZ4 | ISO-8859-1 | 1000000 | 32.56 |
| LZ4 | UTF-16 | 1000000 | 50.16 |
| Snappy | binary | 1000000 | 15.26 |
| Snappy | ISO-8859-1 | 1000000 | 33.99 |
| Snappy | UTF-16 | 1000000 | 37.97 |
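For reference, here is a minimal sketch of how this kind of measurement can be made with GZIP from the JDK; the full benchmark, including Snappy and LZ4, is in the code on github, and the class and method names below are mine:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.GZIPOutputStream;

public class UuidCompressionSizes {
    public static void main(String[] args) throws IOException {
        int count = 1_000_000;
        ByteBuffer binary = ByteBuffer.allocate(count * 16);
        StringBuilder text = new StringBuilder(count * 36);
        for (int i = 0; i < count; i++) {
            UUID uuid = UUID.randomUUID();
            binary.putLong(uuid.getMostSignificantBits())
                  .putLong(uuid.getLeastSignificantBits());
            text.append(uuid);
        }
        report("binary", binary.array());
        report("ISO-8859-1", text.toString().getBytes(StandardCharsets.ISO_8859_1));
    }

    static void report(String encoding, byte[] uncompressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(uncompressed);
        }
        System.out.printf("%s: uncompressed %.2f MB, gzip %.2f MB%n",
                encoding, uncompressed.length / (1024.0 * 1024.0), sink.size() / (1024.0 * 1024.0));
    }
}
```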
Assuming you are OK with treating your keys as cold data, none of these algorithms will undo the inflation. What’s interesting, if you’ve never thought about it before, is that none of these algorithms can compress the binary representation of the UUIDs. This is because the UUIDs are random: random enough to be taken as unique in any given trillion-year epoch. Even though there are only one million values, which could be identified by just 20 bits per value (2^20 is slightly more than a million), none of the compression algorithms improves on 128 bits per value. This reminds me of a passage from *Theories of Everything* by John D. Barrow:
> The goal of science is to make sense of the diversity of nature. [Science] employs observation to gather information about the world and to test predictions about how the world will react to new circumstances, but in between these two procedures lies the heart of the scientific process. This is nothing more than the transformation of lists of observational data into abbreviated form by the recognition of patterns. The recognition of such a pattern allows the information content of the observed sequence of events to be replaced by a shorthand formula which possesses the same, or almost the same, information content... We can extend this image of science in a manner that sharpens its focus. Suppose we are presented with any string of symbols. They do not have to be numbers but let us assume for the sake of illustration that they are. We say that the string is 'random' if there is no other representation of the string which is shorter than itself. But we say it is 'non-random' if there does exist an abbreviated representation.
Random data can’t be compressed, and inflated textual representations of random bits are almost as random as the bits themselves, as far as any existing compression algorithm is concerned. Efficient representation of data requires the identification and exploitation of patterns, and using UUIDs instead of naturally occurring composite keys introduces randomness where there could often be order.