Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Encoding Techniques used by Columnar and T...

Data Encoding Techniques used by Columnar and Time series databases

A detailed overview about the common data encoding techniques used columnar database and time series database to reduce the data size and improve performance.

Matteo Bertozzi

July 05, 2023
Tweet

More Decks by Matteo Bertozzi

Other Decks in Programming

Transcript

  1. Data Compression Algorithms, Explained Data Size Plain Generic Algorithm Type

    Speci fi c A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Row storage A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 Columnar storage A0 B0 C0 A1 B1 C1 Hybrid storage B2 C2 B3 C3 A2 A3
  2. Data Compression/Encoding Where is used? A0 B0 C0 A1 B1

    C1 A2 B2 C2 A3 B3 C3 A B C A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Row-wise storage row A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Column-wise (Columnar) storage column A0 B0 C0 A1 B1 C1 B2 C2 B3 C3 Hybrid storage A2 A3 row group page Table “visually” CPU RAM Disk Network compression is a bandwidth-computation tradeoff
  3. int8 int16 int24 int32 int40 int48 int56 int64 8 bytes

    Int Encoding The Basics seq = [255, 16777215, 65535, 4294967295] for value in seq: writer.write_uint32(value) 255 16777215 65535 4294967295 4 bytes 255 16777215 65535 4294967295 1 3 2 4 Variable N bytes 1byte for value in seq: length = bytes_length(value) writer.write_uint8(bytes_length) writer.write_uint(value, bytes_length) 16 bytes 4 + 10 bytes We can do better! def write_int_le(value, nbytes): buf = bytearray() for i in range(nbytes): buf.append((value >> (i * 8)) & 0xff) return buf def read_int_le(data): result = 0 for i in range(len(data)): result |= (data[i] & 0xff) << (i * 8) return result
  4. Int Encoding Group VarInt Encoding int8 int16 int24 int32 int40

    int48 int56 int64 8 bytes 255 16777215 65535 4294967295 (1 + 10) bytes 0 1 2 3 4 5 6 7 0b00 0 0b10 2 0b01 1 0b11 3 1 byte (1 byte) (3 bytes) (2 bytes) (4 bytes) Using 2bit we can store the int “bytes required” between 1-4, for the int32 In 1byte we can fit the length of 4 values seq = [255, 16777215, 65535, 4294967295] # Compute Lengths lengths = [bytes_length(v) for v in seq] # Write Header bit_writer = BitWriter(writer, width=2) for vlen in lengths: bit_writer.write(vlen - 1) bit_writer.flush() # Write Values for vlen, v in zip(lengths, seq): writer.write_uint(v, vlen)
  5. Yeah, I know, Doc. It's not to scale. It's okay

    - Marty McFly Int Encoding Group VarInt Encoding int8 int16 int24 int32 int40 int48 int56 int64 8 bytes 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0b111 0 0b010 2 0b001 1 0b011 3 0b111 7 0b100 4 0b110 6 0b101 5 255 16777215 65535 4294967295 (1 byte) (3 bytes) (2 bytes) (4 bytes) (8 bytes) 1099511627775 (5 bytes) 72057594037927935 (7 bytes) 281474976710655 (6 bytes) 18446744073709551615 1st byte 2nd byte 3rd byte (3 + 36) bytes Using 3bit we can store the int “bytes required” between 1-8, for the int64 In 3bytes we can fit the length of 8 values
  6. Int Encoding Simple-8b 0 1 2 3 4 5 6

    7 8 9 10 11 12 13 14 15 60 values of 1bit 30 values of 2bit 20 values of 3bit 15 values of 4bit 12 values of 5bit 10 values of 6bit 8 values of 7bit (4bit wasted) 7 values of 8bit (4bit wasted) 6 values of 10bit 5 values of 12bit 4 values of 15bit 3 values of 20bit 2 values of 30bit 1 values of 60bit 240 values of 0bit (60bit wasted) 120 values of 0bit (60bit wasted) 60 bit selector 4bit LIMITATION: Max Value (2^60 - 1) compressing multiple integers into a 64-bit storage structure
  7. Int Encoding Delta Encoding seq = [1500, 1510, 1520, 1535,

    1542, 1550] delta_seq = delta_encode(seq) # delta_seq = [1500, 10, 10, 15, 7, 8] int8 int16 int24 int32 int40 int48 int56 int64 8 bytes def delta_encode(seq): last = 0 for v in seq: yield v - last last = v def delta_decode(seq): last = 0 for v in seq: last += v yield last 1500 1510 1520 1535 1542 1550 2 bytes 2 bytes 2 bytes 2 bytes 2 bytes 2 bytes 1500 10 10 15 7 8 2 bytes 1byte 1byte 1byte 1byte 1byte Encoding data by storing the difference between consecutive values. reducing storage requirements. 1688313600 1688313605 1688313610 1688313615 1688313620 1688313600 5 5 5 5 Timestamp at a fixed interval 57 0 0 5 57 57 57 62 62 62 62 62 73 0 0 0 0 11 Measure value (e.g. counter, temperature, …)
  8. Int Encoding Run-Length Encoding (RLE) seq = [0, 0, 0,

    1, 2, 2, 2, 2, 3] def rle_encoder(seq): count = 1 for i in range(1, len(seq)): if seq[i] == seq[i - 1]: count += 1 else: yield count, seq[i - 1] count = 1 yield count, seq[-1] def rle_decoder(data): for count, v in data: for _ in range(count): yield v 0 0 1 1 2 2 2 2 2 4 repetitions 5 repetitions 4 0 2 1 5 2 1 3 0 0 3 SELECT count(*) WHERE v = 2 Does not require decode
  9. Int Encoding Bit Packing 00 2 bits: encoding type (RLE)

    2 bits: size in bits of the length of the encoded sequence 6 bits: size in bits of the “repeated" value N bits: length of the encoded sequence N bits: value xx xxxxxx n-items value 2 bits: encoding type (LIN) 2 bits: size in bits of the length of the encoded sequence 6 bits: size in bits of the “base” value N bits: length of the encoded sequence N bits: base value 01 xx xxxxxx xxxxxx 6 bits: size in bits of the "delta" values n-items base value N bits: delta value delta value Store Values or Sequences Using as few bit as possible xx 2 bits: encoding type (RLE, LIN, Delta) 2 bits: size in bits of the length of the encoded sequence xx Aligned to 4 (0:4bit, 1:8bit, 2:12bit, 3:16bit) delta_seq = [1688313600, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] nitems = 11 (stored using 4bit) base_value = 1688313600 (stored using 31bit) delta_value = 5 (stored using 3bit) Example: Timestamp at a fixed interval seq = [8079931567495, 8079931567495, 8079931567495, 8079931567495] nitems = 4 (stored using 3bit) value = 8079931567495 (stored using 43bit) Example: Repeated Value
  10. Int Encoding Bit Packing 2 bits: encoding type (Delta) 2

    bits: size in bits of the length of the encoded sequence 6 bits: size in bits of the “base” value N bits: length of the encoded sequence `min` bits: min value 10 xx xxxxxx xxxxxx 6 bits: size in bits of the "delta" values n-items min value `delta` bits: deltas delta[0] delta[1] delta[N] ... Store Values or Sequences Using as few bit as possible delta_seq = [31079011597204624, 1, 7, 3, 5, 0, 2, 4, 0, 2, 7] nitems = 10 # (stored using 4bit) base_value = 31079011597204624 # (stored using 55bit) delta_values_bits = 3 Example: Delta Sequence with same bit width 11 2 bits: encoding type (Fixed Width) 2 bits: size in bits of the length of the encoded sequence 6 bits: size in bits of values N bits: length of the encoded sequence N bits: value xx xxxxxx n-items value[0] value[1] value[N] ... delta_seq = [627, 652, 434, 579, 242, 458, 839, 801, 815, 77, 280] nitems = 11 # (stored using 4bit) values_bits = 10 Example: Fixed Width Values
  11. Int Encoding Sorting the Data 0 1 2 3 4

    5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 1st byte 2nd byte 3rd byte 0 1 2 3 4 5 6 7 4th byte 0b0111 7 0b1111 15 0b1011 11 0b0010 2 0b1010 10 0b0000 0 0b1110 14 0b0011 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 5th byte 6th byte 7th byte 0 1 2 3 4 5 6 7 8th byte 0b1101 13 0b1100 12 0b0001 1 0b1000 8 0b0110 6 0b0100 4 0b1001 9 0b0101 5 entry[ 0] = 106 entry[ 1] = 111 entry[ 2] = 104 entry[ 3] = 108 entry[ 4] = 114 entry[ 5] = 116 entry[ 6] = 113 entry[ 7] = 100 entry[ 8] = 112 entry[ 9] = 115 entry[10] = 105 entry[11] = 103 entry[12] = 110 entry[13] = 109 entry[14] = 107 entry[15] = 101 When data is not sorted sorted values, encoded/compressed The index for 16 entries can fit into 8 bytes, 4bit (0-15) per entry. We can sort (chunks) and get an index. Write the data sorted, And on read restore the original order. seq = [106, 111, 104, 108, ...] index = sorted_index(seq) # index = [7, 15, 11, 2, ...] bit_writer = BitWriter(width=4) for idx in index: bit_writer.write(idx) bit_reader = BitReader(stream, width=4) index = reader.read(16) sorted_seq = decoder(…) seq = [sorted_seq[idx] for idx in index]
  12. String Encoding Incremental Encoding seq = [ 'christine', 'christophe', 'darnell',

    'emily', 'emma', 'joey', 'john', 'johnny', ‘marky' ] def prefix_encoder(seq): last = '' for v in seq: prefix = common_prefix(last, v) yield prefix, v[prefix:] last = v def prefix_decoder(data): last = '' for prefix, v in data: v = last[:prefix] + v yield v last = v Prefix Length Encoded 0 9 christine 6 4 ophe 0 7 darnell 0 5 emily 1 2 ma 0 4 joey 2 2 hn 4 2 ny 0 5 marky christine christophe darnell emily emma joey john johnny marky Input Delta([0, 6, 0, 0, 2, 0, 2, 4, 0]) Delta([9, 4, 7, 5, 2, 4, 2, 2, 5]) 'christineophedarnellemilymajoeyhnnymarky' christine christophe Prefix 6 bytes Remaining Length 4 bytes Removes Common Pre fi xes
  13. Data-agnostic Encoding Dictionary Encoding def dict_encoder(seq): dictionary = {} indexes

    = [] for v in seq: index = dictionary.get(v) if index is None: index = len(dictionary) dictionary[v] = index indexes.append(index) keys = [k for k, _ in sorted(dictionary.items(), key=itemgetter(1))] return keys, indexes def dict_decoder(keys, indexes): for v in indexes: yield keys[v] Encoded 0 1 1 0 2 1 1 1 0 Red Blue Blue Red Green Blue Blue Blue Red Input seq = ['Red', 'Blue', ‘Blue', 'Red', ‘Green', 'Blue', 'Blue', ‘Blue', 'Red'] ['Red', 'Blue', ‘Green'] [0, 1, 1, 0, 2, 1, 1, 1, 0]
  14. Single Int Encoding Var Int (VLQ) 0 1 2 3

    4 5 6 7 1 1 byte 0 1 2 3 4 5 6 7 1 1 byte 0 1 2 3 4 5 6 7 0 1 byte def write_varint(value: int): while (value & 0xffffff80) != 0: write_uint8((value & 0x7f) | 0x80) value >>= 7 write_uint8(value) def read_varint(): shift = 0 result = 0 while True: v = read_uint8() result |= (v & 0x7f) << shift shift += 7 if not (v & 0x80): break return result 1bytes - 0-127 2bytes - 128 - 16384 3bytes - 16385 - 2097151 4bytes - 2097152 - 268435455 ...
  15. Single Int Encoding Sqlite4 Variable-Length Integer 0-240 241-248 249 250

    251 252 253 254 255 8bytes 7bytes 6bytes 5bytes 4bytes 3bytes A1 A1 A2 value = 2288 + 256 * A1 + A2 value = 240 + 256 * (A0 - 241) + A1 2287 67823 A0 = ((value - 240) / 256 + 241), A1 = ((value - 240) % 256) A1 = ((value - 2288) / 256), A2 = ((value - 2288) % 256) 1 byte The length can be determined by looking at the fi rst byte Lexicographical ordering Is equal to numeric ordering. They can be used as keys in the key/value backend storage. Great for Small Value
  16. Data Encoding Int Encoding Group VarInt Simple 8b Delta Encoding

    Run Length Encoding Bit Packing String Encoding Incremental Encoding Generic Encoding Dictionary Compression 16711823 = 0b11111111 00000000 10001111 Little Endian Big Endian 0x8F 0x00 0xFF 0x00 0x00 0xFF 0x00 0x8f ZigZag Encoding 0 = 0, -1 = 1, 1 = 2, -2 = 3, 2 = 4, -3 = 5, 3 = 6 ... Maps negative values to positive values Encode: (v << 1) ^ (v >> (bit_length - 1)) Decode: (v >> 1) ^ -(v & 1) int8 int16 int24 int32 int40 int48 int56 int64 8 bytes