Data Encodings and Representations

Slide 1

Slide 1 text

Data Encodings and Representations Michael Ficarra

Slide 2

Slide 2 text

Part 1 of 2 ● integers ● real numbers ● text

Slide 3

Slide 3 text

Integers

Slide 4

Slide 4 text

Integers positional numeral systems ● decimal ○ base/radix: 10 ○ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ● hexadecimal (hex) ○ base/radix: 16 ○ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F ● binary ○ base/radix: 2 ○ 0, 1

Slide 5

Slide 5 text

Integers positional numeral systems

Dec   Hex   Bin
24    18    11000
41    29    101001
94    5E    1011110
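The conversions above can be checked with a minimal Python sketch. It assumes only the built-ins `int` (with an explicit radix) and `format`; the helper name `in_bases` is made up for illustration.

```python
# Parse digit strings in an explicit base; all three spell the same integer.
n = int("94", 10)
assert n == int("5E", 16) == int("1011110", 2)

def in_bases(value):
    """Return (decimal, hex, binary) digit strings for a non-negative int."""
    return format(value, "d"), format(value, "X"), format(value, "b")

# One row per number shown on the slide.
rows = [in_bases(v) for v in (24, 41, 94)]
```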

Slide 6

Slide 6 text

Integers fixed width Problem: computers are finite Number: 94 (decimal) ● Word size: 8 bits 0101 1110 ● Word size: 16 bits 0000 0000 0101 1110 ● Word size: 32 bits 0000 0000 0000 0000 0000 0000 0101 1110

Slide 7

Slide 7 text

Integers overflow / underflow Word size: 8 bits ● Number: 254 (decimal) 1111 1110 ● Number: 255 (decimal) 1111 1111 ● Number: 256 (decimal) ???? ????
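What happens at 256 depends on the language: some wrap around, some trap, some saturate, and some leave it undefined. The sketch below models only the wrap-around case for unsigned 8-bit values; Python integers are arbitrary width, so the masking is explicit, and the function names are illustrative.

```python
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1  # 0b1111_1111

def add_wrapping(a, b):
    """Unsigned 8-bit addition that keeps only the low 8 bits."""
    return (a + b) & MASK

def sub_wrapping(a, b):
    """Unsigned 8-bit subtraction that wraps below zero."""
    return (a - b) & MASK
```

With these definitions, `add_wrapping(255, 1)` is 0 and `sub_wrapping(0, 1)` is 255.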

Slide 8

Slide 8 text

Integers negatives Number: -94 (decimal) Word size: 16 bits ● sign and magnitude: 1000 0000 0101 1110 ● two's complement: 1111 1111 1010 0010 ○ 2^16 - 94 = 65442 ○ 2^16 - 65442 = 94
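A small Python sketch of the two's complement arithmetic above, assuming a 16-bit word by default. The function names are hypothetical.

```python
def to_twos_complement(value, bits=16):
    """Return the unsigned bit pattern that represents a signed value."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)   # same as 2**bits + value for negatives

def from_twos_complement(pattern, bits=16):
    """Interpret an unsigned bit pattern as a signed value."""
    sign_bit = 1 << (bits - 1)
    return pattern - (1 << bits) if pattern & sign_bit else pattern

pattern = to_twos_complement(-94)
```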

Slide 9

Slide 9 text

Binary Data byte ordering Number: 123456 (decimal) 00000001 11100010 01000000 [ 01 ] [ E2 ] [ 40 ] ● Little Endian 40 E2 01 ● Big Endian 01 E2 40 (aside)
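The two orderings can be reproduced in Python with `int.to_bytes` and `int.from_bytes`. This is a sketch using a 3-byte width to match the slide.

```python
n = 123456  # 0x01E240

little = n.to_bytes(3, "little")   # least significant byte first
big = n.to_bytes(3, "big")         # most significant byte first

# Reading bytes back with the wrong byte order silently yields another number.
misread = int.from_bytes(little, "big")
```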

Slide 10

Slide 10 text

Integers arbitrary width (LEB128)
Number: 123456 (decimal) = 0001 1110 0010 0100 0000
1. group into 7-bit segments: 0000111 1000100 1000000
2. add continuation bits: 00000111 11000100 11000000
3. little endian: 11000000 11000100 00000111
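The three steps can be sketched in Python for unsigned values as follows. The function names are illustrative, and the signed variant of LEB128 is not covered.

```python
def leb128_encode(n):
    """Encode a non-negative int as unsigned LEB128 bytes."""
    if n < 0:
        raise ValueError("unsigned LEB128 only")
    out = bytearray()
    while True:
        segment = n & 0x7F          # lowest 7 bits go out first (little endian)
        n >>= 7
        if n:
            out.append(segment | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(segment)          # final byte has the top bit clear
            return bytes(out)

def leb128_decode(data):
    """Decode unsigned LEB128 bytes back into an int."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            break
    return result

encoded = leb128_encode(123456)
```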

Slide 11

Slide 11 text

Integers in programming languages (literal notation) ● bases ○ decimal: 123456 ○ hex: 0x789ABC ○ octal (base 8) ■ leading zero: 0765 ■ 0o765 ○ binary: 0b110000 ● sized literals ○ long: 123456789L ○ long long: 123456789LL ● separators ○ 1_234_567_890 ○ 0x4996_02D2 ○ 0b00000011_11101000 ● BigInt literals in JavaScript ○ 123456789n

Slide 12

Slide 12 text

Integers in programming languages (sized types) ● absolutely sized types: ○ uint8_t, int32_t, etc ○ uint_least8_t, int_fast32_t, etc ○ integer types in Java: ■ byte: signed, 8 bit ■ short: signed, 16 bit ■ int: signed, 32 bit ■ long: signed, 64 bit ● platform-relative sized types: ○ short: usually half an int ○ int: usually one word or ½word ○ long: usually two ints ○ long long: usually two longs ○ word, dword, qword, etc.

Slide 13

Slide 13 text

Integers pitfalls ● always consider overflow & underflow when performing arithmetic on fixed-width integers ● do not transfer integers to another machine without accounting for byte ordering

Slide 14

Slide 14 text

ℝeal Numbers

Slide 15

Slide 15 text

Real Numbers inherent loss of precision ● problem: ○ real numbers are uncountably infinite ○ computers are finite ● a solution: ○ choose some rationals to be representable on computers ○ approximate your real number by using one of those values instead ○ logarithmically space them out ■ low magnitude, high precision ■ high magnitude, low precision

Slide 16

Slide 16 text

Real Numbers IEEE-754 floats
double precision (binary64)
● all integers with magnitude < 2^53
● some rationals up to 1.79 × 10^308
● rule of thumb: ~15 decimal digits of precision
single precision (binary32)
● all integers with magnitude < 2^24
● some rationals up to 3.40 × 10^38
● rule of thumb: ~7 decimal digits of precision

Slide 17

Slide 17 text

Real Numbers IEEE-754 floats (-1)^sign × 2^(exponent - bias) × 1.fraction = (-1)^0 × 2^(124 - 127) × 1.25 = 1 × 2^-3 × 1.25 = 0.15625
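A hedged Python sketch of decoding a binary32 value. It assumes the standard layout (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits) and only handles normal numbers. With exponent field 124 and significand 1.25 the value is 2^(124 - 127) × 1.25 = 0.15625.

```python
import struct

def decode_binary32(x):
    """Split a value, stored as binary32, into (sign, exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # biased exponent field
    fraction = bits & 0x7FFFFF         # 23 explicit fraction bits
    return sign, exponent, fraction

sign, exponent, fraction = decode_binary32(0.15625)
significand = 1 + fraction / 2**23     # implicit leading 1 (normal numbers only)
value = (-1) ** sign * 2.0 ** (exponent - 127) * significand
```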

Slide 18

Slide 18 text

Real Numbers IEEE-754 oddities ● NaN: not a number ○ exponent all 1, fraction non-0 ○ represents computation error ○ 2^53 - 2 different double-precision NaNs ○ generally propagates ○ NaN ≠ NaN ● infinities ○ exponent all 1, fraction all 0 ○ positive and negative, based on sign ● zeroes ○ exponent all 0, fraction all 0 ○ positive and negative, based on sign ● subnormal numbers ○ exponent all 0, fraction non-0 ○ represent the differences between adjacent "normal" numbers
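These oddities can be observed from Python, whose `float` is binary64 on mainstream platforms. This is a sketch; the helper `bits64` is a made-up name that exposes the raw bit pattern.

```python
import math
import struct

def bits64(x):
    """Return the raw 64-bit pattern of a Python float (binary64)."""
    (pattern,) = struct.unpack(">Q", struct.pack(">d", x))
    return pattern

nan = float("nan")
inf = float("inf")

nan_is_unequal = nan != nan                  # NaN compares unequal to itself
nan_propagates = math.isnan(nan * 0.0 + 1)   # arithmetic on NaN yields NaN
zeros_equal = 0.0 == -0.0                    # equal, yet distinct bit patterns
smallest_subnormal = bits64(5e-324)          # exponent all 0, fraction = 1
```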

Slide 19

Slide 19 text

Real Numbers numerical instability ● occurs when precision losses accumulate ● example: summing a list of floats ○ see Kahan's algorithm ● can be re-introduced by naïve compiler optimisations ● detect by changing rounding scheme ○ if round-to-infinity and round-to-negative-infinity differ greatly, it's likely unstable
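A sketch of Kahan's compensated summation, compared against a plain running total. The input list is a contrived assumption chosen so that every small term is lost by the naive loop. `math.fsum` from the standard library serves as the reference.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation; rounding errors pile up."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan summation: carry the rounding error of each step forward."""
    total = 0.0
    compensation = 0.0                   # running estimate of lost low-order bits
    for v in values:
        y = v - compensation             # re-inject what was lost last time
        t = total + y                    # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was just lost
        total = t
    return total

values = [1.0] + [1e-16] * 100_000       # each small term is below half an ulp of 1.0
naive = naive_sum(values)
compensated = kahan_sum(values)
reference = math.fsum(values)
```

Note that an optimiser allowed to assume real-number algebra could simplify `(t - total) - y` to zero, which is how the instability can be re-introduced.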

Slide 20

Slide 20 text

Real Numbers alternatives to IEEE-754 ● Posits ○ similar goals as IEEE-754, done better ○ shown in studies to be more accurate ○ still relatively new; not widely available in hardware ● Rationals ○ pair of arbitrary width integers ○ very large magnitude numbers have very large representations ● Decimals ○ pair of integers ○ fixed or unlimited precision ● BigFloats ○ specified precision

Slide 21

Slide 21 text

Real Numbers in programming languages ● literal notation ○ 1.234 or 123. or .123 ○ 1D or 1F ○ 1.2e34 ○ ~1.23 ● loss of precision in notation ○ 9007199254740993D ○ typically rounds to nearest, ties to even ○ some languages disallow/warn ○ infinity: 2e308 / 4e38
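The precision loss in notation can be reproduced in Python, assuming its `float` is binary64. The value 9007199254740993 is 2^53 + 1, which lies exactly between two representable doubles.

```python
exact = 9007199254740993                 # 2**53 + 1, not representable in binary64
rounded = float("9007199254740993")      # rounds to nearest, ties to even

lost_one = int(rounded) == exact - 1     # the trailing 3 became a 2
overflowed = float("2e308")              # beyond the largest finite double
```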

Slide 22

Slide 22 text

Real Numbers pitfalls ● do not use floats to represent money ● compare floats using an epsilon ● do not mix very big and very small floats ● avoid accumulating arithmetic precision losses ● consider the appropriate representation before starting
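A few of these pitfalls, sketched in Python. The tolerance passed to `math.isclose` and the use of `decimal.Decimal` for currency are assumptions for illustration; the right tolerance and money representation depend on the application.

```python
import math
from decimal import Decimal

# Exact comparison fails because 0.1, 0.2 and 0.3 are all approximations.
exactly_equal = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# Mixing magnitudes: the small addend is absorbed entirely.
absorbed = (1e16 + 1.0 == 1e16)

# Money: a decimal representation (or integer cents) keeps amounts exact.
total = Decimal("0.10") + Decimal("0.20")
```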

Slide 23

Slide 23 text

Text

Slide 24

Slide 24 text

Text the bad old days 7-bit ASCII

Slide 25

Slide 25 text

Text the bad old days ISO 8859-1 / Windows-1252

Slide 26

Slide 26 text

Text the modern Unicode era ● goal: to represent all writing systems of the world ○ includes historical writing systems ○ includes symbols and punctuation ○ includes emoji ○ over 150 scripts supported, representing over 780 languages ● sequences of code points ○ 1,114,112 code points ■ 0 through 0x10FFFF ○ ~138,000 assigned so far (non-PUA) ● split into 17 "planes" ○ 1 basic multilingual plane (BMP) ■ U+0000 through U+FFFF ○ 16 supplementary / "astral" planes ● ASCII compatible ○ overlaps very beginning of BMP

Slide 27

Slide 27 text

code point allocation (left: all planes, right: BMP in detail)

Slide 28

Slide 28 text

Unicode Han unification just kidding

Slide 29

Slide 29 text

Unicode grapheme clusters
● combining characters:
  ○ А U+0410 : CYRILLIC CAPITAL LETTER A
  ○ ҉ U+0489 : COMBINING CYRILLIC MILLIONS SIGN
● ZWJ sequences: 👨‍👨‍👧
  ○ 👨 U+1F468 : MAN
  ○ U+200D : ZERO WIDTH JOINER [ZWJ]
  ○ 👨 U+1F468 : MAN
  ○ U+200D : ZERO WIDTH JOINER [ZWJ]
  ○ 👧 U+1F467 : GIRL
● emoji skin tone modifiers: 🤏🏼
  ○ 🤏 U+1F90F : PINCHING HAND
  ○ 🏼 U+1F3FC : EMOJI MODIFIER FITZPATRICK TYPE-3 (MEDIUM-LIGHT)

Slide 30

Slide 30 text

Unicode grapheme clusters
● variation selectors:
  ○ ➡ U+27A1 : BLACK RIGHTWARDS ARROW
  ○ U+FE0F : VARIATION SELECTOR-16
● flags: 🇫🇯
  ○ 🇫 U+1F1EB : REGIONAL INDICATOR SYMBOL LETTER F
  ○ 🇯 U+1F1EF : REGIONAL INDICATOR SYMBOL LETTER J

Slide 31

Slide 31 text

Unicode decomposition inconsistencies
● ä
  ○ a U+0061 : LATIN SMALL LETTER A
  ○ ̈ U+0308 : COMBINING DIAERESIS
● ä
  ○ ä U+00E4 : LATIN SMALL LETTER A WITH DIAERESIS
● ⁉️
  ○ ⁉ U+2049 : EXCLAMATION QUESTION MARK
  ○ U+FE0F : VARIATION SELECTOR-16
● ❓
  ○ ❓ U+2753 : BLACK QUESTION MARK ORNAMENT

Slide 32

Slide 32 text

Unicode decomposition inconsistencies
● Man Health Worker: 👨🏾‍⚕️
  ○ 👨 U+1F468 : MAN
  ○ 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK)
  ○ U+200D : ZERO WIDTH JOINER [ZWJ]
  ○ ⚕ U+2695 : STAFF OF AESCULAPIUS
  ○ U+FE0F : VARIATION SELECTOR-16
● Merman: 🧜🏾‍♂️
  ○ 🧜 U+1F9DC : MERPERSON
  ○ 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK)
  ○ U+200D : ZERO WIDTH JOINER [ZWJ]
  ○ ♂ U+2642 : MALE SIGN
  ○ U+FE0F : VARIATION SELECTOR-16

Slide 33

Slide 33 text

Unicode normalisation ● NFD: Normalization Form Canonical Decomposition 1. decompose by canonical equivalence 2. sort any combining characters ● NFC: Normalization Form Canonical Composition 1. decompose by canonical equivalence 2. recompose by canonical equivalence ● NFKD: Normalization Form Compatibility Decomposition 1. decompose by compatibility 2. sort any combining characters ● NFKC: Normalization Form Compatibility Composition 1. decompose by compatibility 2. recompose by canonical equivalence
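The four forms are available in Python's standard `unicodedata` module. The sketch below uses "ä" for canonical equivalence and the ligature U+FB01 as an assumed example of a compatibility mapping.

```python
import unicodedata

decomposed = "a\u0308"     # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00e4"     # LATIN SMALL LETTER A WITH DIAERESIS

differ_as_code_points = decomposed != precomposed
nfc = unicodedata.normalize("NFC", decomposed)     # composes to U+00E4
nfd = unicodedata.normalize("NFD", precomposed)    # decomposes to a + U+0308

ligature = "\ufb01"        # LATIN SMALL LIGATURE FI
nfc_ligature = unicodedata.normalize("NFC", ligature)     # canonical: unchanged
nfkc_ligature = unicodedata.normalize("NFKC", ligature)   # compatibility: "fi"
```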

Slide 34

Slide 34 text

example normalisations

Slide 35

Slide 35 text

Unicode case mappings ● four mappings ○ lowercasing ○ uppercasing ○ titlecasing ○ case folding (used for case-insensitive comparisons) ● case mapping is string-to-string operation ○ string length may change ○ context-dependent mappings ● does not necessarily round-trip ● can be locale sensitive
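Some of these properties show up with Python's built-in string methods. This is a sketch: Python's mappings are locale-independent, so the locale-sensitive cases are not demonstrated.

```python
# String length may change: one code point uppercases to two.
upper = "\u00df".upper()                 # "ß" -> "SS"

# No round trip: lowercasing "SS" does not give "ß" back.
round_trip = upper.lower()               # "ss"

# Case folding is the mapping intended for case-insensitive comparison.
same_when_folded = "Stra\u00dfe".casefold() == "STRASSE".casefold()

# Context-dependent: capital sigma lowercases differently at the end of a word.
final_sigma = "\u039f\u03a3".lower()     # "ΟΣ" -> "ος" (U+03C2 final sigma)
lone_sigma = "\u03a3".lower()            # "Σ" -> "σ" (U+03C3)
```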

Slide 36

Slide 36 text

Text encoding 7-bit ASCII ● one byte per character "ASCII" 41 53 43 49 49 byte values shown in hex notation

Slide 37

Slide 37 text

Text encoding Unicode ● UTF-32 ○ 32-bit (4 byte) code units ○ 1 code unit per code point ○ extremely wasteful ○ multi-byte, so endianness aware "waste of 🌌" w a s t e ␠ o f ␠ 🌌

Slide 38

Slide 38 text

Text encoding Unicode ● UTF-32 ○ 32-bit (4 byte) code units ○ 1 code unit per code point ○ extremely wasteful ○ multi-byte, so endianness aware "waste of 🌌" 00 00 00 77 00 00 00 61 00 00 00 73 00 00 00 74 00 00 00 65 00 00 00 20 00 00 00 6F 00 00 00 66 00 00 00 20 00 01 F3 0C byte values shown in hex notation; big endian byte ordering

Slide 39

Slide 39 text

Text encoding Unicode ● UTF-16 ○ 16-bit (2 byte) code units ○ 1 code unit for BMP code points ○ 2 code units for non-BMP code points ■ called "surrogate pair" ■ lead from range D800 to DBFF ■ trail from range DC00 to DFFF ■ lone and out-of-order surrogates disallowed "UTF-16 is 🌟neat" U T F - 1 6 ␠ i s ␠ 🌟 n e a t
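The surrogate pair for a non-BMP code point can be computed as in the following Python sketch. The function name is illustrative; U+1F31F is the star used in the example string.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    lead = 0xD800 + (offset >> 10)         # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return lead, trail

lead, trail = to_surrogate_pair(0x1F31F)
as_bytes = lead.to_bytes(2, "big") + trail.to_bytes(2, "big")
```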

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Text encoding Unicode
● UTF-8
  ○ 8-bit (single byte) code units

min CP     max CP      1st byte   2nd byte   3rd byte   4th byte
U+0000     U+007F      0-------
U+0080     U+07FF      110-----   10------
U+0800     U+FFFF      1110----   10------   10------
U+10000    U+10FFFF    11110---   10------   10------   10------

"UTF-8 is 💯 for ASCII-compatible text!"
U T F - 8 ␠ i s ␠ 💯 ␠ f o r ␠ A S C I I - c o m p a t i b l e ␠ t e x t !

Slide 42

Slide 42 text

Text encoding Unicode
● UTF-8
  ○ 8-bit (single byte) code units

min CP     max CP      1st byte   2nd byte   3rd byte   4th byte
U+0000     U+007F      0-------
U+0080     U+07FF      110-----   10------
U+0800     U+FFFF      1110----   10------   10------
U+10000    U+10FFFF    11110---   10------   10------   10------

"UTF-8 is 💯 for ASCII-compatible text!"
55 54 46 2D 38 20 69 73 20 F0 9F 92 AF 20 66 6F 72 20 41 53 43 49 49 2D 63 6F 6D 70 61 74 69 62 6C 65 20 74 65 78 74 21
byte values shown in hex notation
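The table translates fairly directly into code. Below is a Python sketch of an encoder for a single code point; the function name is illustrative, and in practice `str.encode("utf-8")` does this for you.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes, following the table."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                       # 0-------
        return bytes([cp])
    if cp <= 0x7FF:                      # 110----- 10------
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110---- 10------ 10------
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110--- 10------ 10------ 10------
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

hundred_points = utf8_encode(0x1F4AF)    # the emoji in the example string
```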

Slide 43

Slide 43 text

Text / Strings in programming languages ● escape sequences ○ do not appear literally; represent the referenced code point ○ allows inclusion of code points that are invalid in the string grammar ○ single-character escapes ■ \n, \t, \0, \", \\, etc ○ fixed-width escapes ■ \x00, \u0000 ○ bracketed escapes ■ \u{1F42B} ○ dynamic-width escapes & null escape ■ \uE50A\&

Slide 44

Slide 44 text

Text / Strings in programming languages ● NUL termination ○ UTF-8 overlong encoding ● "length" could mean ○ number of bytes ○ number of code units ○ number of code points ○ number of grapheme clusters ● indexing/iteration could operate on any of the above ○ not dependent on representation, just more/less performant ● byte-order mark: U+FEFF ○ an ignored code point at the beginning of a string encoding ○ will be read as U+FFFE (a noncharacter) if the byte order (endianness) is incorrect
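The different meanings of "length" can be compared for one short string in Python, where `len` on a `str` counts code points. The sample text is an assumption for illustration. Counting grapheme clusters would need a segmentation library, so it is only noted in a comment.

```python
text = "e\u0301\U0001F4AF"  # 'e' + COMBINING ACUTE ACCENT, then U+1F4AF

lengths = {
    "utf8_bytes": len(text.encode("utf-8")),                 # 1 + 2 + 4
    "utf16_code_units": len(text.encode("utf-16-le")) // 2,  # 1 + 1 + 2
    "code_points": len(text),                                # 3
}
# A reader would perceive 2 grapheme clusters here.

# A byte-order mark decoded with the wrong endianness shows up as U+FFFE.
swapped_bom = "\ufeff".encode("utf-16-be").decode("utf-16-le")
```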

Slide 45

Slide 45 text

Text character encoding communication on the web ● Accept-Charset HTTP request header ○ Accept-Charset: utf-8, iso-8859-1;q=0.5 ● Content-Type HTTP request header ○ Content-Type: multipart/form-data; boundary=... ● Content-Type HTTP response header ○ Content-Type: text/html; charset=UTF-8 ● ○ HTML character encoding detection algorithm

Slide 46

Slide 46 text

Strings performance & optimisations ● variable encodings ○ e.g. ISO-8859-1 when possible, UTF-16 otherwise ○ requires tagging; concatenation could trigger re-encoding ● slices ○ zero-copy "view" of substring ○ requires immutable strings ● ropes ○ binary tree of strings ○ good for frequently-edited long strings

Slide 47

Slide 47 text

Strings pitfalls ● do not "convert" a string to/from bytes without explicitly specifying the encoding ● do not iterate, index, or take the length of a string without considering the units ● normalise strings before comparison or hashing ● use case folding for case-insensitive comparisons ● do not NUL-terminate strings whose encoding may contain NUL ● always know the "type" of the data in your strings ● avoid strings whenever possible

Slide 48

Slide 48 text

Part 2 of 2 ● structured data ○ in-memory representations ■ sequences ■ records/structs/tuples ■ maps ○ serialisation or transmission ■ schema-free ■ schema-directed ○ parsing ● binary data ○ typed views ○ slices ● other useful & common encodings ● parting notes

Slide 49

Slide 49 text

Recall from part 1 concept strategies for representation on computers integers ● fixed-width / machine integers ● two's complement ● VLQs / big integers ... real numbers ● IEEE-754 floats ● Posits ... text ● Code pages: ASCII, ISO-8859-1 ● Unicode ● Unicode encodings: UTF-8, UTF-16, UTF-32 ● Ropes, slices, mixed encodings ...

Slide 50

Slide 50 text

Structured Data

Slide 51

Slide 51 text

Structured Data examples ● generic collections ○ sequences (arrays / lists) ○ bags & sets ○ records / structs / tuples ○ maps & multi-maps ● domain-specific structures ○ dates ○ URLs ○ postal addresses

Slide 52

Slide 52 text

In-memory Representations

Slide 53

Slide 53 text

Sequences, Bags, & Sets representations ● arrays ○ contiguous in memory ○ fast indexing ○ insert / remove via copying only ● linked lists ○ slow indexing ○ mutable: fast insert, remove ○ immutable: shared suffixes, cons, drop

Slide 54

Slide 54 text

Bags representations, continued ● binary heaps ○ binary tree ○ constraint: elements have an ordering ○ fast access to smallest element ○ heap property: parent ≤ children ○ binary trees can be efficiently laid out in memory (left: 2n+1, right: 2n+2)
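Python's standard `heapq` module stores a binary min-heap in a plain list using the same index arithmetic. The sketch below checks the heap property directly; the sample values are arbitrary.

```python
import heapq

heap = []
for item in [5, 3, 8, 1, 9, 2]:
    heapq.heappush(heap, item)

def satisfies_heap_property(items):
    """Check parent <= children using left = 2n+1 and right = 2n+2."""
    for n in range(len(items)):
        for child in (2 * n + 1, 2 * n + 2):
            if child < len(items) and items[n] > items[child]:
                return False
    return True

is_valid = satisfies_heap_property(heap)
smallest = heap[0]                  # fast access to the smallest element
popped = heapq.heappop(heap)        # removes it and restores the heap property
```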

Slide 55

Slide 55 text

Records structs, tuples, same thing
● laid out contiguously (like array)
● compiler translates fields to offsets
  ○ packed
  ○ unpacked (word aligned)

record: { a: 12345, b: 'p', c: true, }
tuple: ( 12345, 'p', true )
packed layout: [ 12345 ][ 'p' ][ true ]
unpacked layout: [ 12345 ][ 'p' ][ padding ][ true ][ padding ]

Slide 56

Slide 56 text

Maps representations ● unlike records, keys exist at runtime ○ key must be present in representation ● simple & slow: any representation as sequence of pairs [ ("key 1", "value 1"), ("key 2", "value 2"), ... ]

Slide 57

Slide 57 text

Maps hash table representation ● keys must be hashable ● digest size = number of "buckets" ● inevitable hash collisions ● collision resolution: matching bucket searched for correct key ● possibly catastrophic performance
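A minimal sketch of a hash table with separate chaining, one of several collision-resolution strategies. The class name and the fixed bucket count are assumptions for illustration; real implementations resize as they fill.

```python
class ChainedHashMap:
    """Toy hash map: a fixed list of buckets, each a list of (key, value)."""

    def __init__(self, bucket_count=8):
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # Reduce the hash to a bucket index; distinct keys may collide here.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)    # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        # Collision resolution: search the matching bucket for the right key.
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        raise KeyError(key)

table = ChainedHashMap()
table.put(1, "one")
table.put(9, "nine")    # 1 and 9 land in the same bucket when there are 8
```

If every key lands in one bucket, each lookup degrades to a linear search, which is the catastrophic case an adversary can aim for.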

Slide 58

Slide 58 text

Considerations about in-memory representations ● will we have biased usage frequency or usage patterns? ● is there a useful constraint or invariant for the contained elements? ● is our application performance sensitive to CPU caches or locality? ● will we be using it enough to justify a complex representation? ● can an adversary take advantage of catastrophic performance?

Slide 59

Slide 59 text

Representations for Serialisation / Transmission

Slide 60

Slide 60 text

Representations for Serialisation / Transmission
text-based, schema-free
● JSON
● YAML
● edn
binary, schema-free
● UBJSON / BSON / Smile
● MessagePack
● SuperPack
binary, schema-directed
● Protocol Buffers (protobufs)
● Cap'n Proto / FlatBuffers
● Apache Avro
● Apache Thrift

Slide 61

Slide 61 text

Serialisation schema-free, text-based ● JSON ○ very limited data model ○ just a formal grammar, no associated semantics ○ case study: Twitter snowflake bug ● YAML ○ internal references ○ multi-document streams ○ don't use yaml ● edn ○ extensibility ○ coordination between encoder/decoder

Slide 62

Slide 62 text

Serialisation schema-free, binary ● UBJSON / BSON / Smile ● MessagePack ● SuperPack ○ built-in optimisations ○ extensibility

Slide 63

Slide 63 text

Serialisation schema-directed (binary)
● schemas describe shape of data
● allows us to omit tags
  ○ schema-free encoding: tag 1, value 1, tag 2, value 2, tag 3, value 3
  ○ schema-directed encoding: value 1, value 2, value 3
● Protocol Buffers (protobufs)
  ○ limitation: schemas don't support arbitrary nesting, so shape of your data may need to change
● Cap'n Proto / FlatBuffers
  ○ zero-copy deserialisation
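The size difference between self-describing and schema-directed encodings can be sketched in Python. This does not reproduce the wire format of Protocol Buffers or any other real system: the record and the `SCHEMA` format string are hypothetical, and Python's `struct` module stands in for a schema shared by both sides.

```python
import json
import struct

record = {"id": 7, "score": 2.5, "active": True}

# Schema-free: field names travel with every value.
self_describing = json.dumps(record).encode("utf-8")

# Schema-directed: both sides agree on the layout ahead of time.
# Hypothetical schema: little-endian uint32 id, float64 score, bool active.
SCHEMA = "<Id?"
compact = struct.pack(SCHEMA, record["id"], record["score"], record["active"])

# Decoding is impossible without the same schema.
decoded_id, decoded_score, decoded_active = struct.unpack(SCHEMA, compact)
```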

Slide 64

Slide 64 text

Deserialisation security ● arbitrary code execution ○ in Ruby and Python by default ● compression ● creating cycles where unexpected ○ generally, violating any structural invariants ● data validation and/or budgeting

Slide 65

Slide 65 text

Considerations about representations for serialisation ● is serialisation / deserialisation performance important? ● does my data have a fixed schema? ● does it need to be canonicalising? ● human-readability / editing ● portable across different languages? ● portable across processes / machines? ● exotic types of data? ● adversary-controlled? ● limitations of underlying transmission protocol ● impossible to represent nonsensical structures?

Slide 66

Slide 66 text

Binary Data

Slide 67

Slide 67 text

Binary Data typed views ● shared buffer abstraction ● can support reads and writes

Slide 68

Slide 68 text

Binary Data slices ● shared buffer abstraction ● lightweight: just translates offsets ● composable ● read/write or read-only ● combine with typed views for principled handling of binary data
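Python's built-in `memoryview` offers both ideas: `cast` gives a typed view and slicing gives a zero-copy slice over the same buffer. The sketch assumes the native `"I"` format is 4 bytes wide, which holds on mainstream platforms.

```python
import sys

buffer = bytearray(8)
view = memoryview(buffer)

# Typed view: the same 8 bytes seen as two native unsigned 32-bit integers.
as_u32 = view.cast("I")
as_u32[0] = 0x01020304          # writes through to the underlying buffer

# Slice: no copy is made, offsets are translated onto the same buffer.
tail = view[4:]
tail[0] = 0xFF                  # modifies buffer[4]

# Slices compose, and can be made read-only.
read_only = tail[:2].toreadonly()
```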

Slide 69

Slide 69 text

Other Common Encodings

Slide 70

Slide 70 text

Common Encoding: URLs
https://user:[email protected]:443/a/b/c?x=0&y=1&z=2#fragment
● scheme
● user info
● host (punycode-encoded)
● port
● path
  ○ path components (percent-encoded)
● query
  ○ query parameters (percent-encoded)
● fragment (percent-encoded)
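The components can be pulled apart with Python's standard `urllib.parse`. The URL below is a hypothetical stand-in built for this sketch, with an assumed password and a non-ASCII host name to show punycode via the built-in `idna` codec.

```python
from urllib.parse import urlsplit, parse_qs, quote

# Hypothetical host with a non-ASCII label, encoded to its ASCII form.
host = "b\u00fccher.example".encode("idna").decode("ascii")

url = f"https://user:secret@{host}:443/a/b/c?x=0&y=1&z=2#fragment"
parts = urlsplit(url)
query = parse_qs(parts.query)

# Percent-encoding of a single path component (nothing treated as safe).
encoded_component = quote("a b/c", safe="")
```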

Slide 71

Slide 71 text

Other Common Encodings ● UUIDs (RFC 4122) ○ version & variant ○ time, MAC address, user/group ID ● User-Agent headers ○ browser family & version ○ OS & version ○ CPU architecture ○ mobile provider ● Dates ○ Thu Feb 04 2021 10:00:00 GMT-0800 (Pacific Standard Time) ○ year, month, day ○ hours, minutes, seconds ○ day of week ○ time zone & offset

Slide 72

Slide 72 text

Parting Notes

Slide 73

Slide 73 text

Parting Notes ● distinguish abstractions from concrete representations in communication ● try working in abstractions, but be aware of your representations ○ abstractions overpromise ○ interface may be possible but with costs that are not evident upfront ○ easy to forget about potential scenarios in which representation matters ● constraints or invariants are opportunities for optimisations ● work with structured data through structured interfaces

Slide 74

Slide 74 text

end

Slide 75

Slide 75 text

Structured Data: Serialisation Strategies
schema-free
● examples: JSON, YAML, MessagePack
● shape of data may be unknown
● encoded data is self-describing
● encoded data may be human-readable
● size overhead to describe contained data
schema-directed
● examples: Protocol Buffers, Apache Avro, Apache Thrift
● shape of data known ahead of time
● schema is required for both encoding and decoding
● encoded data typically not human-readable
● compact encoding

Slide 76

Slide 76 text

Binary Data base-64 encoding ● more convenient representation in certain contexts that prefer working with printable ASCII ● 4/3 size ratio: 64^4 = 256^3, so 4 output characters carry 3 input bytes ● printable ASCII has at least 64 unique symbols (62 alphanums!) ● padding ● URL-safe alphabet
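Python's standard `base64` module shows the 4-for-3 ratio, the padding, and the URL-safe alphabet. The sample bytes are chosen so that the two non-alphanumeric symbols appear.

```python
import base64

raw = bytes([0xFB, 0xFF, 0xFE])               # 3 bytes = 24 bits = four 6-bit groups

standard = base64.b64encode(raw)              # uses '+' and '/'
url_safe = base64.urlsafe_b64encode(raw)      # uses '-' and '_' instead

# Inputs that are not a multiple of 3 bytes are padded with '='.
one_byte = base64.b64encode(b"a")
two_bytes = base64.b64encode(b"ab")
```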