Shrinking BSON Documents
I recently saw a tweet about optimising a DynamoDB table by shortening attribute names.
#NoSQL databases include both attribute names and values in every item they store. In one design review today we estimated 70% of @DynamoDB IOPS and storage is being consumed by attribute names. Short names will eliminate 14K WCU at peak and reduce table size by 3.5TB. pic.twitter.com/hHN4XIVjWY
— Rick Houlihan (@houlihan_rick) May 12, 2020
The tweet generalised to all NoSQL databases, but when I was asked to reduce the resource requirements of a large MongoDB cluster, I reached the conclusion that the most obvious target - attribute names - wouldn’t lead to the kind of impact I wanted. This post reproduces the analysis I undertook before radically reducing the cost of running that MongoDB cluster. The incentive to reduce document size is for the sake of cache utilisation rather than storage; databases with lots of documents in cache are fast databases, and block level compression algorithms are more effective for storage on disk. I don’t know how much of this applies to DynamoDB.
BSON Specification
If you’re working with a NoSQL database, the data model will be a bit different from a traditional relational database’s, and it makes trade-offs which incur space overhead. Understanding the storage format helps when thinking about ways to reduce the size. Fortunately, BSON, the format MongoDB stores its records in, has a specification, so you can find out how large everything you store in your documents is. The Backus-Naur definition below, taken from the BSON specification, is fairly self-explanatory.
document ::= int32 e_list "\x00" BSON Document. int32 is the total number of bytes comprising the document.
e_list ::= element e_list
| ""
element ::= "\x01" e_name double 64-bit binary floating point
| "\x02" e_name string UTF-8 string
| "\x03" e_name document Embedded document
| "\x04" e_name document Array
| "\x05" e_name binary Binary data
| "\x06" e_name Undefined (value) — Deprecated
| "\x07" e_name (byte*12) ObjectId
| "\x08" e_name "\x00" Boolean "false"
| "\x08" e_name "\x01" Boolean "true"
| "\x09" e_name int64 UTC datetime
| "\x0A" e_name Null value
| "\x0B" e_name cstring cstring Regular expression - The first cstring is the regex pattern, the second is the regex options string. Options are identified by characters, which must be stored in alphabetical order. Valid options are 'i' for case insensitive matching, 'm' for multiline matching, 'x' for verbose mode, 'l' to make \w, \W, etc. locale dependent, 's' for dotall mode ('.' matches everything), and 'u' to make \w, \W, etc. match unicode.
| "\x0C" e_name string (byte*12) DBPointer — Deprecated
| "\x0D" e_name string JavaScript code
| "\x0E" e_name string Symbol. Deprecated
| "\x0F" e_name code_w_s JavaScript code w/ scope
| "\x10" e_name int32 32-bit integer
| "\x11" e_name uint64 Timestamp
| "\x12" e_name int64 64-bit integer
| "\x13" e_name decimal128 128-bit decimal floating point
| "\xFF" e_name Min key
| "\x7F" e_name Max key
e_name ::= cstring Key name
string ::= int32 (byte*) "\x00" String - The int32 is the number of bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.
cstring ::= (byte*) "\x00" Zero or more modified UTF-8 encoded characters followed by '\x00'. The (byte*) MUST NOT contain '\x00', hence it is not full UTF-8.
binary ::= int32 subtype (byte*) Binary - The int32 is the number of bytes in the (byte*).
subtype ::= "\x00" Generic binary subtype
| "\x01" Function
| "\x02" Binary (Old)
| "\x03" UUID (Old)
| "\x04" UUID
| "\x05" MD5
| "\x06" Encrypted BSON value
| "\x80" User defined
code_w_s ::= int32 string document Code w/ scope
Firstly, every element has a type byte, and there are null terminators in all kinds of places. One byte at a time, they add up. Nested documents, of which arrays are a special type, have a four byte length at the start and a null terminator at the end, which means every nested value costs at least six bytes of structure (a type byte, a length prefix and a terminator) before its key is counted. There are two kinds of strings: keys, which are null-terminated c-strings, and values, which are length-prefixed, null terminated strings (a redundant combination). Arrays are documents, which means they must have keys. What are those keys? The array index in c-string form (which is useful for MongoDB’s aggregation engine, but you might not have expected it)!
There are a few obvious consequences: strings take up less space if they can be modelled as keys rather than values, and eliminating a level of nesting saves about as much space as renaming metrics to m.
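As a worked example (my arithmetic, not the analyser’s output): the key metrics is stored as the c-string metrics\x00, 8 bytes, whereas m\x00 is 2 bytes, so the rename saves 6 bytes everywhere the key appears; a nested document under that key costs a type byte, a four byte length prefix and a trailing \x00 on top of the key, 6 bytes of pure structure, so flattening one level of nesting claws back a similar amount.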
BSON Analyser
The best way I found to experiment with schema changes is to encode the information about BSON element sizes into a tool which can parse BSON. MongoDB’s Java driver’s BsonWriter is easy to implement and easy to compose with a BSON reader, so I implemented a BsonWriter which updates counters during document parsing. A BSON or JSON extract can be fed into the overhead analyser to get a breakdown of size by document feature, which makes it easy to tweak a document slightly and see what impact the change made.
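That tool isn’t reproduced here, but a rough sketch of the same bookkeeping is below. Rather than implementing the driver’s BsonWriter interface, it walks a parsed BsonDocument and tallies bytes into a few buckets using the sizes from the specification; the class, bucket names and accounting choices are mine, and several BSON types are omitted.

import org.bson.BsonArray;
import org.bson.BsonDocument;
import org.bson.BsonValue;

import java.nio.charset.StandardCharsets;

// Rough size accounting for a parsed BSON document, following the spec above:
// every element costs a type byte plus a null-terminated key, and documents
// and arrays carry a 4-byte length prefix and a trailing "\x00".
public final class BsonSizeProfile {
    long keyBytes;        // cstring key names, including their null terminators
    long valueBytes;      // raw payloads: doubles, int64s, string contents, binary data
    long structureBytes;  // type bytes, length prefixes, document/string terminators

    void visitDocument(BsonDocument doc) {
        structureBytes += 5; // int32 length prefix + trailing "\x00"
        doc.forEach((name, value) -> {
            structureBytes += 1; // element type byte
            keyBytes += name.getBytes(StandardCharsets.UTF_8).length + 1; // key + "\x00"
            visitValue(value);
        });
    }

    void visitArray(BsonArray array) {
        structureBytes += 5; // arrays are documents too
        int i = 0;
        for (BsonValue element : array) {
            structureBytes += 1;
            keyBytes += Integer.toString(i++).length() + 1; // index rendered as a cstring key
            visitValue(element);
        }
    }

    void visitValue(BsonValue value) {
        switch (value.getBsonType()) {
            case DOCUMENT -> visitDocument(value.asDocument());
            case ARRAY -> visitArray(value.asArray());
            case STRING -> {
                structureBytes += 5; // int32 length prefix + trailing "\x00"
                valueBytes += value.asString().getValue().getBytes(StandardCharsets.UTF_8).length;
            }
            case BINARY -> {
                structureBytes += 5; // int32 length + subtype byte
                valueBytes += value.asBinary().getData().length;
            }
            case DOUBLE, INT64, DATE_TIME -> valueBytes += 8;
            case INT32 -> valueBytes += 4;
            case BOOLEAN -> valueBytes += 1;
            default -> { } // other types omitted from this sketch
        }
    }

    public static void main(String[] args) {
        BsonSizeProfile profile = new BsonSizeProfile();
        profile.visitDocument(BsonDocument.parse("{\"metrics\": [{\"value\": 1.0}]}"));
        System.out.printf("keys=%dB values=%dB structure=%dB%n",
                profile.keyBytes, profile.valueBytes, profile.structureBytes);
    }
}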
Metrics Example
I followed the process below on a past project whose data model I can’t share, so I replicate it here with a made up data model. The example documents contain metrics taken by sensors. Each step preserves all relationships within the document but makes trade-offs, and whether they are worth it depends on the application, but each step also creates a budget for additional features in the documents. If you reduce the size of a document by 1KB, spending 200 bytes of that saving on easier indexing comes better than free.
Baseline
Here’s the starting point, a document which contains a snapshot of 3 metrics at different percentiles (the JSON examples get shorter as the post goes on).
{
"id": "1000000000000",
"metrics": [
{
"tags": [ "tag1","tag2", "tag3" ],
"percentile": "p50",
"metric": "metric1",
"timestamp": 1029831028310928,
"value": 1029831.102938
},
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p90",
"metric": "metric1",
"timestamp": 1029831028310928,
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p95",
"metric": "metric1",
"timestamp": 1029831028310928,
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p99",
"metric": "metric1",
"timestamp": 1029831028310928,
"value": 1229831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p50",
"metric": "metric2",
"timestamp": 1029831028310928,
"value": 1029831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p90",
"metric": "metric2",
"timestamp": 1029831028310928,
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p95",
"metric": "metric2",
"timestamp": 1029831028310928,
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p99",
"metric": "metric2",
"timestamp": 1029831028310928,
"value": 1229831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p50",
"metric": "metric3",
"timestamp": 1029831028310928,
"value": 1029831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p90",
"metric": "metric3",
"timestamp": 1029831028310928,
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p95",
"metric": "metric3",
"timestamp": 1029831028310928,
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p99",
"metric": "metric3",
"timestamp": 1029831028310928,
"value": 1229831.102938
}
]
}
When converted to BSON, this document is fairly large: 1.5KB with over 50% overhead.
raw.json
+----------------------------------------------+
data size by type (total): 738B
DOUBLE: 96B
STRING: 546B
INT64: 96B
binary type markers (total): 0B
+----------------------------------------------+
attribute (total): 463B
null terminators: 189B
+----------------------------------------------+
document lengths (total): 104B
metrics: 52B
root: 4B
tags: 48B
+----------------------------------------------+
data (total): 738B
id: 18B
metric: 144B
percentile: 96B
tags: 288B
timestamp: 96B
value: 96B
-------------------------------------------+
DOUBLE: 96B
STRING: 546B
INT64: 96B
+----------------------------------------------+
document size (total): 1543B
Overhead: 52.17%
Shrinking Attributes
Let’s chase the biggest item in the profile (this is what you’re supposed to do, right?) and replace the attribute names with single characters. This saves over 300 bytes. Not bad.
raw-minified.json
+----------------------------------------------+
data size by type (total): 747B
DOUBLE: 96B
STRING: 555B
INT64: 96B
binary type markers (total): 0B
+----------------------------------------------+
attribute (total): 109B
null terminators: 191B
+----------------------------------------------+
document lengths (total): 104B
b: 52B
c: 48B
root: 4B
+----------------------------------------------+
data (total): 747B
a: 18B
c: 297B
d: 96B
e: 144B
f: 96B
g: 96B
-------------------------------------------+
DOUBLE: 96B
STRING: 555B
INT64: 96B
+----------------------------------------------+
document size (total): 1200B
Overhead: 37.75%
If you’ve ever set up indexes on a database, this will horrify you, so it’s worth avoiding.
Normalisation
There’s some duplication in the document: the timestamp can be moved up to the root level, and the obviously numeric identifier is stored as a string. Changing the identifier to an int64 saves a little space, and if it ends up in an index it will speed up queries a lot for various reasons. It might not be possible to change the data type, but if you’re not in control of it, you can probably have a conversation with someone who is, and they may well come round if they can be convinced they stand to gain from the change.
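To put a number on the identifier change (my arithmetic): "1000000000000" stored as a BSON string costs 18 bytes, a four byte length prefix plus 13 characters plus a terminator, whereas an int64 is a fixed 8 bytes; that is exactly the drop from id: 18B in the baseline profile to id: 8B after the change.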
{
"id": 1000000000000,
"timestamp": 1029831028310928,
"metrics": [
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p50",
"metric": "metric1",
"value": 1029831.102938
},
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p90",
"metric": "metric1",
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p95",
"metric": "metric1",
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2", "tag3" ],
"percentile": "p99",
"metric": "metric1",
"value": 1229831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p50",
"metric": "metric2",
"value": 1029831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p90",
"metric": "metric2",
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p95",
"metric": "metric2",
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag3", "tag4" ],
"percentile": "p99",
"metric": "metric2",
"value": 1229831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p50",
"metric": "metric3",
"value": 1029831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p90",
"metric": "metric3",
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p95",
"metric": "metric3",
"value": 1129831.102938
},
{
"tags": [ "tag1", "tag2" ],
"percentile": "p99",
"metric": "metric3",
"value": 1229831.102938
}
]
}
Just normalising the document and using the right data type gets close to the saving from minifying the attribute names.
normalised.json
+----------------------------------------------+
data size by type (total): 640B
DOUBLE: 96B
STRING: 528B
INT64: 16B
binary type markers (total): 0B
+----------------------------------------------+
attribute (total): 364B
null terminators: 177B
+----------------------------------------------+
document lengths (total): 104B
metrics: 52B
root: 4B
tags: 48B
+----------------------------------------------+
data (total): 640B
id: 8B
metric: 144B
percentile: 96B
tags: 288B
timestamp: 8B
value: 96B
-------------------------------------------+
DOUBLE: 96B
STRING: 528B
INT64: 16B
+----------------------------------------------+
document size (total): 1324B
Overhead: 51.66%
Inverted Tags Index
Pulling the timestamps up may have been a little contrived, but removing duplication from the document is a good strategy. Next are the tags: there’s a lot of repetition, and strings take up less space as keys than as values, so why not add a sub document with an inverted index by tag over the metrics? Below, $.index.tag1 contains all the indices into $.metrics which have "tag1".
{
"id": 1000000000000,
"timestamp": 1029831028310928,
"index": {
"tag1": [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ],
"tag2": [ 0, 1, 2, 3, 8, 9, 10, 11 ],
"tag3": [ 0, 1, 2, 3, 4, 5, 6, 7 ],
"tag4": [ 4, 5, 6, 7 ]
},
"metrics": [
{
"percentile": "p50",
"metric": "metric1",
"value": 1029831.102938
},
{
"percentile": "p90",
"metric": "metric1",
"value": 1129831.102938
},
{
"percentile": "p95",
"metric": "metric1",
"value": 1129831.102938
},
{
"percentile": "p99",
"metric": "metric1",
"value": 1229831.102938
},
{
"percentile": "p50",
"metric": "metric2",
"value": 1029831.102938
},
{
"percentile": "p90",
"metric": "metric2",
"value": 1129831.102938
},
{
"percentile": "p95",
"metric": "metric2",
"value": 1129831.102938
},
{
"percentile": "p99",
"metric": "metric2",
"value": 1229831.102938
},
{
"percentile": "p50",
"metric": "metric3",
"value": 1029831.102938
},
{
"percentile": "p90",
"metric": "metric3",
"value": 1129831.102938
},
{
"percentile": "p95",
"metric": "metric3",
"value": 1129831.102938
},
{
"percentile": "p99",
"metric": "metric3",
"value": 1229831.102938
}
]
}
This sheds over 200B, and makes sense unless you want to index by tag; the document has been rearranged into a structure only the application can interpret. As data moves from values into keys, the overhead percentage starts to make less sense; it actually increases in this step.
indexed.json
+----------------------------------------------+
data size by type (total): 480B
DOUBLE: 96B
STRING: 240B
INT32: 128B
INT64: 16B
binary type markers (total): 0B
+----------------------------------------------+
attribute (total): 339B
null terminators: 131B
+----------------------------------------------+
document lengths (total): 76B
index: 4B
metrics: 52B
root: 4B
tag1: 4B
tag2: 4B
tag3: 4B
tag4: 4B
+----------------------------------------------+
data (total): 480B
id: 8B
metric: 144B
percentile: 96B
tag1: 48B
tag2: 32B
tag3: 32B
tag4: 16B
timestamp: 8B
value: 96B
-------------------------------------------+
DOUBLE: 96B
STRING: 240B
INT32: 128B
INT64: 16B
+----------------------------------------------+
document size (total): 1090B
Overhead: 55.96%
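Reading the tags back out is now the application’s job. A rough sketch of what that might look like with the driver’s BsonDocument API (the class and helper name are hypothetical):

import org.bson.BsonDocument;
import org.bson.BsonValue;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: recover the tags attached to $.metrics[i]
// from the inverted index in the document shape shown above.
public final class TagIndex {
    static List<String> tagsForMetric(BsonDocument doc, int i) {
        List<String> tags = new ArrayList<>();
        BsonDocument index = doc.getDocument("index");
        for (String tag : index.keySet()) {
            for (BsonValue position : index.getArray(tag)) {
                if (position.asInt32().getValue() == i) {
                    tags.add(tag);
                    break;
                }
            }
        }
        return tags;
    }
}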
Binary Tag Format
BSON arrays aren’t as dense as they look: each element carries a type byte and its index rendered as a c-string key, so the arrays could be replaced by bitsets in binary format.
{
"id": 1000000000000,
"timestamp": 1029831028310928,
"index": {
"tag1": { "$binary": "AAA=", "$type": "00" },
"tag2": { "$binary": "AAA=", "$type": "00" },
"tag3": { "$binary": "AA==", "$type": "00" },
"tag4": { "$binary": "AA==", "$type": "00" }
},
"metrics": [
{
"percentile": "p50",
"metric": "metric1",
"value": 1029831.102938
},
{
"percentile": "p90",
"metric": "metric1",
"value": 1129831.102938
},
{
"percentile": "p95",
"metric": "metric1",
"value": 1129831.102938
},
{
"percentile": "p99",
"metric": "metric1",
"value": 1229831.102938
},
{
"percentile": "p50",
"metric": "metric2",
"value": 1029831.102938
},
{
"percentile": "p90",
"metric": "metric2",
"value": 1129831.102938
},
{
"percentile": "p95",
"metric": "metric2",
"value": 1129831.102938
},
{
"percentile": "p99",
"metric": "metric2",
"value": 1229831.102938
},
{
"percentile": "p50",
"metric": "metric3",
"value": 1029831.102938
},
{
"percentile": "p90",
"metric": "metric3",
"value": 1129831.102938
},
{
"percentile": "p95",
"metric": "metric3",
"value": 1129831.102938
},
{
"percentile": "p99",
"metric": "metric3",
"value": 1229831.102938
}
]
}
This saves another 200B, but absolutely rules out indexing on tag.
indexed-bitset.json
+----------------------------------------------+
data size by type (total): 374B
DOUBLE: 96B
STRING: 240B
BINARY: 22B
INT64: 16B
binary type markers (total): 4B
+----------------------------------------------+
attribute (total): 305B
null terminators: 95B
+----------------------------------------------+
document lengths (total): 60B
index: 4B
metrics: 52B
root: 4B
+----------------------------------------------+
data (total): 374B
id: 8B
metric: 144B
percentile: 96B
tag1: 6B
tag2: 6B
tag3: 5B
tag4: 5B
timestamp: 8B
value: 96B
-------------------------------------------+
DOUBLE: 96B
STRING: 240B
BINARY: 22B
INT64: 16B
+----------------------------------------------+
document size (total): 870B
Overhead: 57.01%
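For illustration, here’s one way the bitsets could be produced and queried with java.util.BitSet; the class is mine, and the bit order is whatever BitSet.toByteArray produces, which the application has to treat as a convention.

import org.bson.BsonBinary;

import java.util.BitSet;

// Sketch: pack the metric positions carrying a tag into a bitset stored as
// generic binary (subtype "\x00"), and test membership on the way back out.
public final class TagBitset {
    static BsonBinary pack(int... metricPositions) {
        BitSet bits = new BitSet();
        for (int position : metricPositions) {
            bits.set(position);
        }
        return new BsonBinary(bits.toByteArray());
    }

    static boolean hasTag(BsonBinary packed, int metricPosition) {
        return BitSet.valueOf(packed.getData()).get(metricPosition);
    }

    public static void main(String[] args) {
        BsonBinary tag4 = pack(4, 5, 6, 7); // "tag4" applies to metrics 4..7
        System.out.println(hasTag(tag4, 5)); // true
        System.out.println(hasTag(tag4, 0)); // false
    }
}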
Factorisation
The percentile names and metric names are repeated a lot.
This duplication can be removed by thinking of the metrics as a matrix, with percentile on the columns and metric on the rows. The values are ordered by metric and then percentile. To get the value for ("metric1", "p50"), take the zero based position of "metric1" within the metrics, m, and the zero based position of "p50" within the percentiles, p; the value is at index m * 4 + p.
This is possible because the association between tags and metrics is no longer represented by an embedding, and because there is an application collaborating with the database on schema management.
{
"id": 1000000000000,
"timestamp": 1029831028310928,
"index": {
"tag1": { "$binary": "AAA=", "$type": "00" },
"tag2": { "$binary": "AAA=", "$type": "00" },
"tag3": { "$binary": "AA==", "$type": "00" },
"tag4": { "$binary": "AA==", "$type": "00" }
},
"percentiles": [ "p50", "p90", "p95", "p99" ],
"metrics": [ "metric1", "metric2", "metric3" ],
"values": [
1029831.102938, 1129831.102938, 1129831.102938, 1229831.102938, 1029831.102938, 1129831.102938, 1129831.102938, 1229831.102938, 1029831.102938, 1129831.102938, 1129831.102938, 1229831.102938
]
}
This gets below 400B; more than 1KB of space has been saved relative to the baseline.
matrix.json
+----------------------------------------------+
data size by type (total): 202B
DOUBLE: 96B
STRING: 68B
BINARY: 22B
INT64: 16B
binary type markers (total): 4B
+----------------------------------------------+
attribute (total): 77B
null terminators: 41B
+----------------------------------------------+
document lengths (total): 20B
index: 4B
metrics: 4B
percentiles: 4B
root: 4B
values: 4B
+----------------------------------------------+
data (total): 202B
id: 8B
metrics: 36B
percentiles: 32B
tag1: 6B
tag2: 6B
tag3: 5B
tag4: 5B
timestamp: 8B
values: 96B
-------------------------------------------+
DOUBLE: 96B
STRING: 68B
BINARY: 22B
INT64: 16B
+----------------------------------------------+
document size (total): 366B
Overhead: 44.81%
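Resolving a single value from the factorised layout is a few lines of application code. A sketch against the driver’s BsonDocument API, with hypothetical class and method names:

import org.bson.BsonDocument;
import org.bson.BsonString;

// Sketch of the application-side lookup for the factorised layout: values are
// laid out metric-major, so (metric m, percentile p) lives at m * width + p,
// where width is the number of percentiles.
public final class MatrixLookup {
    static double value(BsonDocument doc, String metric, String percentile) {
        int m = doc.getArray("metrics").indexOf(new BsonString(metric));
        int p = doc.getArray("percentiles").indexOf(new BsonString(percentile));
        int width = doc.getArray("percentiles").size();
        return doc.getArray("values").get(m * width + p).asDouble().getValue();
    }

    public static void main(String[] args) {
        BsonDocument doc = BsonDocument.parse(
                "{\"percentiles\": [\"p50\", \"p90\"], "
                + "\"metrics\": [\"metric1\", \"metric2\"], "
                + "\"values\": [1.0, 2.0, 3.0, 4.0]}");
        System.out.println(value(doc, "metric2", "p90")); // 4.0
    }
}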
Binary Values Array
Again, arrays aren’t as dense as they look; it’s unlikely the values need to be indexed, and binary values are easy to unpack in the application layer.
{
"id": 1000000000000,
"timestamp": 1029831028310928,
"index": {
"tag1": { "$binary": "AAA=", "$type": "00" },
"tag2": { "$binary": "AAA=", "$type": "00" },
"tag3": { "$binary": "AA==", "$type": "00" },
"tag4": { "$binary": "AA==", "$type": "00" }
},
"percentiles": [ "p50", "p90", "p95", "p99" ],
"metrics": [ "metric1", "metric2", "metric3" ],
"values": {"$binary": "QS9tjjS0Sh9BMT1nGlolEEExPWcaWiUQQTLEBxpaJRBBL22ONLRKH0ExPWcaWiUQQTE9ZxpaJRBBMsQHGlolEEEvbY40tEofQTE9ZxpaJRBBMT1nGlolEEEyxAcaWiUQ", "$type": "00"}
}
This is getting into barrel-scraping territory, and only a further 40B or so is saved:
binary.json
+----------------------------------------------+
data size by type (total): 206B
STRING: 68B
BINARY: 122B
INT64: 16B
binary type markers (total): 5B
+----------------------------------------------+
attribute (total): 63B
null terminators: 28B
+----------------------------------------------+
document lengths (total): 16B
index: 4B
metrics: 4B
percentiles: 4B
root: 4B
+----------------------------------------------+
data (total): 206B
id: 8B
metrics: 36B
percentiles: 32B
tag1: 6B
tag2: 6B
tag3: 5B
tag4: 5B
timestamp: 8B
values: 100B
-------------------------------------------+
STRING: 68B
BINARY: 122B
INT64: 16B
+----------------------------------------------+
document size (total): 328B
Overhead: 37.20%
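Packing and unpacking the values in the application layer is a few lines with ByteBuffer. A sketch, assuming big-endian byte order (ByteBuffer’s default) as the application’s convention:

import org.bson.BsonBinary;

import java.nio.ByteBuffer;

// Sketch: pack the values matrix into a single generic-binary blob,
// and unpack it again on the way out of the database.
public final class PackedValues {
    static BsonBinary pack(double[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Double.BYTES);
        for (double value : values) {
            buffer.putDouble(value);
        }
        return new BsonBinary(buffer.array());
    }

    static double[] unpack(BsonBinary binary) {
        ByteBuffer buffer = ByteBuffer.wrap(binary.getData());
        double[] values = new double[binary.getData().length / Double.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        double[] roundTripped = unpack(pack(new double[] { 1029831.102938, 1129831.102938 }));
        System.out.println(roundTripped[1]); // 1129831.102938
    }
}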
Shrinking Remaining Attributes
Finally, it’s true that document databases do like short attribute names, but you need to be able to write index configuration too. The names can be shortened, but not so far that database configuration becomes painful.
{
"id": 1000000000000,
"ts": 1029831028310928,
"idx": {
"tag1": { "$binary": "AAA=", "$type": "00" },
"tag2": { "$binary": "AAA=", "$type": "00" },
"tag3": { "$binary": "AA==", "$type": "00" },
"tag4": { "$binary": "AA==", "$type": "00" }
},
"pctl": [ "p50", "p90", "p95", "p99" ],
"name": [ "metric1", "metric2", "metric3"],
"v": {"$binary": "QS9tjjS0Sh9BMT1nGlolEEExPWcaWiUQQTLEBxpaJRBBL22ONLRKH0ExPWcaWiUQQTE9ZxpaJRBBMsQHGlolEEEvbY40tEofQTE9ZxpaJRBBMT1nGlolEEEyxAcaWiUQ", "$type": "00"}
}
This saves another 25B.
minimised.json
+----------------------------------------------+
data size by type (total): 206B
STRING: 68B
BINARY: 122B
INT64: 16B
binary type markers (total): 5B
+----------------------------------------------+
attribute (total): 39B
null terminators: 28B
+----------------------------------------------+
document lengths (total): 16B
idx: 4B
name: 4B
pctl: 4B
root: 4B
+----------------------------------------------+
data (total): 206B
id: 8B
name: 36B
pctl: 32B
tag1: 6B
tag2: 6B
tag3: 5B
tag4: 5B
ts: 8B
v: 100B
-------------------------------------------+
STRING: 68B
BINARY: 122B
INT64: 16B
+----------------------------------------------+
document size (total): 304B
Overhead: 32.24%
Improvements Create Feature Budget
The improvements can be summarised with a bar chart, sorted by document size, with shrinking the attribute names towards the left.
MongoDB experts might be screaming at this point: it’s impossible to index the documents by tag, for instance. If that’s a requirement, more than enough budget has been created by now to add a set of tags back at the root level of the document, whereas shrinking the attribute names alone would not have come close to justifying that duplication for the sake of indexing.