Data Encoding Techniques used by Columnar and Time series databases
A detailed overview about the common data encoding techniques used columnar database and time series database to reduce the data size and improve performance.
Int Encoding The Basics seq = [255, 16777215, 65535, 4294967295] for value in seq: writer.write_uint32(value) 255 16777215 65535 4294967295 4 bytes 255 16777215 65535 4294967295 1 3 2 4 Variable N bytes 1byte for value in seq: length = bytes_length(value) writer.write_uint8(bytes_length) writer.write_uint(value, bytes_length) 16 bytes 4 + 10 bytes We can do better! def write_int_le(value, nbytes): buf = bytearray() for i in range(nbytes): buf.append((value >> (i * 8)) & 0xff) return buf def read_int_le(data): result = 0 for i in range(len(data)): result |= (data[i] & 0xff) << (i * 8) return result
int48 int56 int64 8 bytes 255 16777215 65535 4294967295 (1 + 10) bytes 0 1 2 3 4 5 6 7 0b00 0 0b10 2 0b01 1 0b11 3 1 byte (1 byte) (3 bytes) (2 bytes) (4 bytes) Using 2bit we can store the int “bytes required” between 1-4, for the int32 In 1byte we can fit the length of 4 values seq = [255, 16777215, 65535, 4294967295] # Compute Lengths lengths = [bytes_length(v) for v in seq] # Write Header bit_writer = BitWriter(writer, width=2) for vlen in lengths: bit_writer.write(vlen - 1) bit_writer.flush() # Write Values for vlen, v in zip(lengths, seq): writer.write_uint(v, vlen)
7 8 9 10 11 12 13 14 15 60 values of 1bit 30 values of 2bit 20 values of 3bit 15 values of 4bit 12 values of 5bit 10 values of 6bit 8 values of 7bit (4bit wasted) 7 values of 8bit (4bit wasted) 6 values of 10bit 5 values of 12bit 4 values of 15bit 3 values of 20bit 2 values of 30bit 1 values of 60bit 240 values of 0bit (60bit wasted) 120 values of 0bit (60bit wasted) 60 bit selector 4bit LIMITATION: Max Value (2^60 - 1) compressing multiple integers into a 64-bit storage structure
2 bits: size in bits of the length of the encoded sequence 6 bits: size in bits of the “repeated" value N bits: length of the encoded sequence N bits: value xx xxxxxx n-items value 2 bits: encoding type (LIN) 2 bits: size in bits of the length of the encoded sequence 6 bits: size in bits of the “base” value N bits: length of the encoded sequence N bits: base value 01 xx xxxxxx xxxxxx 6 bits: size in bits of the "delta" values n-items base value N bits: delta value delta value Store Values or Sequences Using as few bit as possible xx 2 bits: encoding type (RLE, LIN, Delta) 2 bits: size in bits of the length of the encoded sequence xx Aligned to 4 (0:4bit, 1:8bit, 2:12bit, 3:16bit) delta_seq = [1688313600, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] nitems = 11 (stored using 4bit) base_value = 1688313600 (stored using 31bit) delta_value = 5 (stored using 3bit) Example: Timestamp at a fixed interval seq = [8079931567495, 8079931567495, 8079931567495, 8079931567495] nitems = 4 (stored using 3bit) value = 8079931567495 (stored using 43bit) Example: Repeated Value
bits: size in bits of the length of the encoded sequence 6 bits: size in bits of the “base” value N bits: length of the encoded sequence `min` bits: min value 10 xx xxxxxx xxxxxx 6 bits: size in bits of the "delta" values n-items min value `delta` bits: deltas delta[0] delta[1] delta[N] ... Store Values or Sequences Using as few bit as possible delta_seq = [31079011597204624, 1, 7, 3, 5, 0, 2, 4, 0, 2, 7] nitems = 10 # (stored using 4bit) base_value = 31079011597204624 # (stored using 55bit) delta_values_bits = 3 Example: Delta Sequence with same bit width 11 2 bits: encoding type (Fixed Width) 2 bits: size in bits of the length of the encoded sequence 6 bits: size in bits of values N bits: length of the encoded sequence N bits: value xx xxxxxx n-items value[0] value[1] value[N] ... delta_seq = [627, 652, 434, 579, 242, 458, 839, 801, 815, 77, 280] nitems = 11 # (stored using 4bit) values_bits = 10 Example: Fixed Width Values
= [] for v in seq: index = dictionary.get(v) if index is None: index = len(dictionary) dictionary[v] = index indexes.append(index) keys = [k for k, _ in sorted(dictionary.items(), key=itemgetter(1))] return keys, indexes def dict_decoder(keys, indexes): for v in indexes: yield keys[v] Encoded 0 1 1 0 2 1 1 1 0 Red Blue Blue Red Green Blue Blue Blue Red Input seq = ['Red', 'Blue', ‘Blue', 'Red', ‘Green', 'Blue', 'Blue', ‘Blue', 'Red'] ['Red', 'Blue', ‘Green'] [0, 1, 1, 0, 2, 1, 1, 1, 0]
251 252 253 254 255 8bytes 7bytes 6bytes 5bytes 4bytes 3bytes A1 A1 A2 value = 2288 + 256 * A1 + A2 value = 240 + 256 * (A0 - 241) + A1 2287 67823 A0 = ((value - 240) / 256 + 241), A1 = ((value - 240) % 256) A1 = ((value - 2288) / 256), A2 = ((value - 2288) % 256) 1 byte The length can be determined by looking at the fi rst byte Lexicographical ordering Is equal to numeric ordering. They can be used as keys in the key/value backend storage. Great for Small Value