Michael Ficarra
February 03, 2021

# Data Encodings and Representations

With no background required, you will learn how the numbers, strings, and other data structures on our computers are represented by ones and zeroes underneath, and how that representation leaks through our abstractions to affect how we use them. From there, we will move on to other, more conspicuous changes of representation that are common in our daily lives, and review what should be considered when dealing with them.

## Transcript

4. ### Integers: positional numeral systems

- decimal
  - base/radix: 10
  - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
- hexadecimal (hex)
  - base/radix: 16
  - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
- binary
  - base/radix: 2
  - 0, 1
5. ### Integers: positional numeral systems

- Dec: 24, Hex: 18, Bin: 11000
- Dec: 41, Hex: 29, Bin: 101001
- Dec: 94, Hex: 5E, Bin: 1011110
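The conversions on this slide can be reproduced with repeated division by the base. This is a minimal sketch; the helper name `to_base` is an assumption for illustration, not something from the talk.

```python
DIGITS = "0123456789ABCDEF"


def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2 to 16)."""
    if n < 0 or not 2 <= base <= 16:
        raise ValueError("expected n >= 0 and 2 <= base <= 16")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        # The remainder is the next digit, least significant first.
        n, remainder = divmod(n, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


for value in (24, 41, 94):
    print(value, to_base(value, 16), to_base(value, 2))
```

Python's built-in `int("5E", 16)` goes the other way, from digits back to a number.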
6. ### Integers: fixed width

Problem: computers are finite

Number: 94 (decimal)

- Word size: 8 bits
  - 0101 1110
- Word size: 16 bits
  - 0000 0000 0101 1110
- Word size: 32 bits
  - 0000 0000 0000 0000 0000 0000 0101 1110
7. ### Integers: overflow / underflow

Word size: 8 bits

- Number: 254 (decimal)
  - 1111 1110
- Number: 255 (decimal)
  - 1111 1111
- Number: 256 (decimal)
  - ???? ????
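One common answer to the `???? ????` above is wrap-around: the bits that do not fit are dropped. Python integers are arbitrary width, so this sketch simulates an 8-bit word with a mask. The helper `wrap_unsigned` is hypothetical, and other languages may trap or leave overflow undefined instead of wrapping.

```python
def wrap_unsigned(value: int, bits: int = 8) -> int:
    """Keep only the low `bits` bits, as a wrapping unsigned word would."""
    return value & ((1 << bits) - 1)


for n in (254, 255, 256):
    # 256 needs a ninth bit, so only the low eight (all zero) survive.
    print(n, format(wrap_unsigned(n), "08b"))
```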
8. ### Integers: negatives

Number: -94 (decimal)

Word size: 16 bits

- sign and magnitude: 1000 0000 0101 1110
- two's complement: 1111 1111 1010 0010
  - $2^{16} - 94 = 65442$
  - $2^{16} - 65442 = 94$
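The two's complement arithmetic can be checked with a short sketch. The function names here are assumptions made for illustration.

```python
def to_twos_complement(value: int, bits: int = 16) -> int:
    """Return the unsigned bit pattern that represents `value`."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise OverflowError("value does not fit in the given width")
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern: int, bits: int = 16) -> int:
    """Interpret an unsigned bit pattern as a signed value."""
    if pattern & (1 << (bits - 1)):  # sign bit set: subtract 2**bits
        return pattern - (1 << bits)
    return pattern


pattern = to_twos_complement(-94)
print(pattern, format(pattern, "016b"))  # 65442 1111111110100010
print(from_twos_complement(pattern))     # -94
```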
9. ### Binary Data: byte ordering

Number: 123456 (decimal)

- binary: 00000001 11100010 01000000
- hex bytes: [ 01 ] [ E2 ] [ 40 ]
- Little Endian: 40 E2 01
- Big Endian: 01 E2 40

(aside)

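Both byte orders for 123456 can be produced with `int.to_bytes`. The three-byte width below is chosen only to match the slide.

```python
n = 123456

big = n.to_bytes(3, "big")
little = n.to_bytes(3, "little")

print(big.hex(" "))     # 01 e2 40
print(little.hex(" "))  # 40 e2 01

# Reading bytes with the wrong byte order silently yields a different number.
misread = int.from_bytes(little, "big")
print(misread == n)     # False
```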
10. ### Integers: arbitrary width (LEB128)

Number: 123456 (decimal)

0001 1110 0010 0100 0000

1. group into 7-bit segments: 0000111 1000100 1000000
2. add continuation bits: 00000111 11000100 11000000
3. little endian: 11000000 11000100 00000111

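The three steps above can be sketched as an unsigned LEB128 encoder and decoder. The function names are assumptions, and signed LEB128 is not covered here.

```python
def encode_uleb128(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if n < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        group = n & 0x7F      # lowest 7 bits come first (little endian)
        n >>= 7
        if n:
            out.append(group | 0x80)  # continuation bit: more groups follow
        else:
            out.append(group)         # final group: continuation bit clear
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    """Decode one unsigned LEB128 value from the start of `data`."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            break
    return result


encoded = encode_uleb128(123456)
print(" ".join(format(b, "08b") for b in encoded))
# 11000000 11000100 00000111
```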
11. ### Integers: in programming languages (literal notation)

- bases
  - decimal: 123456
  - hex: 0x789ABC
  - octal (base 8)
    - leading zero: 0765
    - 0o765
  - binary: 0b110000
- sized literals
  - long: 123456789L
  - long long: 123456789LL
- separators
  - 1_234_567_890
  - 0x4996_02D2
  - 0b00000011_11101000
- BigInt literals in JavaScript
  - 123456789n
12. ### Integers: in programming languages (sized types)

- absolutely sized types:
  - uint8_t, int32_t, etc.
  - uint_least8_t, int_fast32_t, etc.
  - integer types in Java:
    - byte: signed, 8 bit
    - short: signed, 16 bit
    - int: signed, 32 bit
    - long: signed, 64 bit
- platform-relative sized types:
  - short: usually half an int
  - int: usually one word or ½ word
  - long: usually two ints
  - long long: usually two longs
  - word, dword, qword, etc.
13. ### Integers: pitfalls

- always consider overflow & underflow when performing arithmetic on fixed-width integers
- do not transfer integers to another machine without accounting for byte ordering

15. ### Real Numbers: inherent loss of precision

- problem:
  - real numbers are uncountably infinite
  - computers are finite
- a solution:
  - choose some rationals to be representable on computers
  - approximate your real number by using one of those values instead
  - logarithmically space them out
    - low magnitude, high precision
    - high magnitude, low precision
16. ### Real Numbers: IEEE-754 floats

- double precision (binary64)
  - all integers with magnitude < $2^{53}$
  - some rationals up to $1.79 \times 10^{308}$
  - rule of thumb: ~15 decimal digits of precision
- single precision (binary32)
  - all integers with magnitude < $2^{24}$
  - some rationals up to $3.40 \times 10^{38}$
  - rule of thumb: ~7 decimal digits of precision
17. ### Real Numbers: IEEE-754 floats

$$(-1)^{\text{sign}} \times 2^{(\text{exponent} - \text{bias})} \times 1.\text{fraction}$$

$$= (-1)^{0} \times 2^{(130 - 127)} \times 1.25 = 1 \times 2^{3} \times 1.25 = 10$$

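To see the three fields of a real binary32 value, the bits can be pulled apart with the `struct` module. This is a sketch for normal numbers only; the helper names are assumptions, and zeroes, subnormals, infinities and NaNs would need extra cases.

```python
import struct

BIAS = 127  # exponent bias for binary32


def decode_binary32(x: float) -> tuple[int, int, int]:
    """Split a float, stored as binary32, into (sign, exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def value_of(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normal binary32 value from its fields."""
    significand = 1 + fraction / 2**23  # the implicit leading 1 plus the fraction
    return (-1) ** sign * 2.0 ** (exponent - BIAS) * significand


fields = decode_binary32(10.0)
print(fields)             # (0, 130, 2097152)
print(value_of(*fields))  # 10.0
```

The fraction field 2097152 is $2^{21}$, which is 0.25 of $2^{23}$, giving the significand 1.25.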
18. ### Real Numbers: IEEE-754 oddities

- NaN: not a number
  - exponent all 1, fraction non-0
  - represents computation error
  - $2^{53} - 2$ different double-precision NaNs
  - generally propagates
  - NaN ≠ NaN
- infinities
  - exponent all 1, fraction all 0
  - positive and negative, based on sign
- zeroes
  - exponent all 0, fraction all 0
  - positive and negative, based on sign
- subnormal numbers
  - exponent all 0, fraction non-0
  - represent the differences between adjacent "normal" numbers
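Most of these oddities are observable from Python, whose `float` is a binary64 value on the platforms I am aware of. A short sketch:

```python
import math
import sys

nan = float("nan")
print(nan == nan)                          # False: NaN is not equal to itself
print(math.isnan(nan * 0 + 1))             # True: NaN propagates through arithmetic

print(float("inf") > sys.float_info.max)   # True
print(0.0 == -0.0)                         # True: the two zeroes compare equal
print(math.copysign(1.0, -0.0))            # -1.0: but the sign bit is still there

# Halving the smallest normal number gives a subnormal, not zero.
smallest_normal = sys.float_info.min
subnormal = smallest_normal / 2
print(0 < subnormal < smallest_normal)     # True
```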
19. ### Real Numbers: numerical instability

- occurs when precision losses accumulate
- example: summing a list of floats
  - see Kahan's algorithm
- can be re-introduced by naïve compiler optimisations
- detect by changing rounding scheme
  - if round-to-infinity and round-to-negative-infinity differ greatly, it's likely unstable
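A minimal sketch of Kahan's compensated summation, compared against a plain loop. The input list is a contrived assumption, chosen so that every small addend is lost by the naive loop. A hand-written loop is used for the naive case because recent Python versions already compensate inside the built-in `sum`.

```python
def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y  # what was actually added, minus what we meant to add
        total = t
    return total


# 1e-16 is below half the spacing of doubles near 1.0, so adding it to 1.0 is a no-op.
values = [1.0] + [1e-16] * 100_000

print(naive_sum(values))  # 1.0
print(kahan_sum(values))  # close to 1.00000000001
```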
20. ### Real Numbers: alternatives to IEEE-754

- Posits
  - similar goals as IEEE-754, done better
  - shown in studies to be more accurate
  - still relatively new; not widely available in hardware
- Rationals
  - pair of arbitrary width integers
  - very large magnitude numbers have very large representations
- Decimals
  - pair of integers
  - fixed or unlimited precision
- BigFloats
  - specified precision
21. ### Real Numbers: in programming languages

- literal notation
  - 1.234 or 123. or .123
  - 1D or 1F
  - 1.2e34
  - ~1.23
- loss of precision in notation
  - 9007199254740993D
  - typically rounds to nearest, ties to even
  - some languages disallow/warn
  - infinity: 2e308 / 4e38
22. ### Real Numbers: pitfalls

- do not use floats to represent money
- compare floats using an epsilon
- do not mix very big and very small floats
- avoid accumulating arithmetic precision losses
- consider the appropriate representation before starting
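A few of these pitfalls in a short sketch. Using `decimal.Decimal` or integer cents for money is one possible alternative representation, not a recommendation made in the talk.

```python
import math
from decimal import Decimal

# Exact equality fails because 0.1, 0.2 and 0.3 are all approximations.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True: comparison with a tolerance

# Mixing very big and very small magnitudes: the small addend vanishes.
print(1e17 + 1.0 == 1e17)            # True

# Money: a decimal type (or integer cents) keeps these amounts exact.
total = Decimal("0.10") + Decimal("0.20")
print(total == Decimal("0.30"))      # True
total_cents = 10 + 20
print(total_cents)                   # 30
```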

26. ### Text: the modern Unicode era

- goal: to represent all writing systems of the world
  - includes historical writing systems
  - includes symbols and punctuation
  - includes emoji
  - over 150 scripts supported, representing over 780 languages
- sequences of code points
  - 1,114,112 code points
    - 0 through 0x10FFFF
  - ~138,000 assigned so far (non-PUA)
- split into 17 "planes"
  - 1 basic multilingual plane (BMP)
    - U+0000 through U+FFFF
  - 16 supplementary / "astral" planes
- ASCII compatible
  - overlaps very beginning of BMP

29. ### Unicode: grapheme clusters

- combining characters:
  - А U+0410 : CYRILLIC CAPITAL LETTER A
  - ҉ U+0489 : COMBINING CYRILLIC MILLIONS SIGN
- ZWJ sequences: 👨‍👨‍👧
  - 👨 U+1F468 : MAN
  - U+200D : ZERO WIDTH JOINER [ZWJ]
  - 👨 U+1F468 : MAN
  - U+200D : ZERO WIDTH JOINER [ZWJ]
  - 👧 U+1F467 : GIRL
- emoji skin tone modifiers: 🤏🏼
  - 🤏 U+1F90F : PINCHING HAND
  - 🏼 U+1F3FC : EMOJI MODIFIER FITZPATRICK TYPE-3 (MEDIUM-LIGHT)
30. ### Unicode: grapheme clusters

- variation selectors:
  - ➡ U+27A1 : BLACK RIGHTWARDS ARROW
  - U+FE0F : VARIATION SELECTOR-16
- flags: 🇫🇯
  - 🇫 U+1F1EB : REGIONAL INDICATOR SYMBOL LETTER F
  - 🇯 U+1F1EF : REGIONAL INDICATOR SYMBOL LETTER J
31. ### Unicode: decomposition inconsistencies

- ä
  - a U+0061 : LATIN SMALL LETTER A
  - ̈ U+0308 : COMBINING DIAERESIS
- ä
  - ä U+00E4 : LATIN SMALL LETTER A WITH DIAERESIS
- ⁉️
  - ⁉ U+2049 : EXCLAMATION QUESTION MARK
  - U+FE0F : VARIATION SELECTOR-16
- ❓
  - ❓ U+2753 : BLACK QUESTION MARK ORNAMENT
32. ### Unicode: decomposition inconsistencies

- Man Health Worker: 👨🏾‍⚕️
  - 👨 U+1F468 : MAN
  - 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK)
  - U+200D : ZERO WIDTH JOINER [ZWJ]
  - ⚕ U+2695 : STAFF OF AESCULAPIUS
  - U+FE0F : VARIATION SELECTOR-16
- Merman: 🧜🏾‍♂️
  - 🧜 U+1F9DC : MERPERSON
  - 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK)
  - U+200D : ZERO WIDTH JOINER [ZWJ]
  - ♂ U+2642 : MALE SIGN
  - U+FE0F : VARIATION SELECTOR-16
33. ### Unicode: normalisation

- NFD: Normalization Form Canonical Decomposition
  1. decompose by canonical equivalence
  2. sort any combining characters
- NFC: Normalization Form Canonical Composition
  1. decompose by canonical equivalence
  2. recompose by canonical equivalence
- NFKD: Normalization Form Compatibility Decomposition
  1. decompose by compatibility
  2. sort any combining characters
- NFKC: Normalization Form Compatibility Composition
  1. decompose by compatibility
  2. recompose by canonical equivalence
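The four forms are available in Python through `unicodedata.normalize`. The sketch below uses the two spellings of ä from the decomposition slide, plus the ligature U+FB01 as an assumed example of a compatibility-only decomposition.

```python
import unicodedata

decomposed = "a\u0308"  # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = "\u00E4"     # LATIN SMALL LETTER A WITH DIAERESIS

print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Canonical forms leave the "fi" ligature alone; compatibility forms expand it.
ligature = "\uFB01"
print(unicodedata.normalize("NFC", ligature) == ligature)    # True
print(unicodedata.normalize("NFKC", ligature))               # fi
```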

35. ### Unicode: case mappings

- four mappings
  - lowercasing
  - uppercasing
  - titlecasing
  - case folding (used for case-insensitive comparisons)
- case mapping is string-to-string operation
  - string length may change
  - context-dependent mappings
- does not necessarily round-trip
- can be locale sensitive
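Several of these properties show up with Python's built-in string methods. The sample strings are assumptions chosen for illustration; Python's methods are not locale sensitive, so that last point is not demonstrated here.

```python
sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S

# String length may change, and the mapping does not round-trip.
print(sharp_s.upper())          # SS
print(sharp_s.upper().lower())  # ss

# Case folding is the mapping meant for case-insensitive comparison.
print("Stra\u00DFe".casefold() == "STRASSE".casefold())  # True

# Context-dependent mapping: capital sigma lowercases differently at the end of a word.
print("\u03A3\u0391\u03A3".lower() == "\u03C3\u03B1\u03C2")  # True

# Titlecase is its own mapping: U+01C6 has distinct upper and title forms.
dz = "\u01C6"
print(dz.upper() == "\u01C4", dz.title() == "\u01C5")  # True True
```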
36. ### Text encoding: 7-bit ASCII

- one byte per character
- "ASCII": 41 53 43 49 49 (byte values shown in hex notation)
37. ### Text encoding: Unicode

- UTF-32
  - 32-bit (4 byte) code units
  - 1 code unit per code point
  - extremely wasteful
  - multi-byte, so endianness aware
- "waste of 🌌": w a s t e ␠ o f ␠ 🌌
38. ### Text encoding: Unicode

- "waste of 🌌" encoded as UTF-32:
  - 00 00 00 77 00 00 00 61 00 00 00 73 00 00 00 74 00 00 00 65 00 00 00 20 00 00 00 6F 00 00 00 66 00 00 00 20 00 01 F3 0C
  - byte values shown in hex notation; big endian byte ordering
39. ### Text encoding: Unicode

- UTF-16
  - 16-bit (2 byte) code units
  - 1 code unit for BMP code points
  - 2 code units for non-BMP code points
    - called "surrogate pair"
    - lead from range D800 to DBFF
    - trail from range DC00 to DFFF
    - lone and out-of-order surrogates disallowed
- "UTF-16 is 🌟neat": U T F - 1 6 ␠ i s ␠ 🌟 n e a t
40. ### Text encoding: Unicode

- "UTF-16 is 🌟neat" encoded as UTF-16:
  - 00 55 00 54 00 46 00 2D 00 31 00 36 00 20 00 69 00 73 00 20 D8 3C DF 1F 00 6E 00 65 00 61 00 74
  - byte values shown in hex notation; big endian byte ordering
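The `D8 3C DF 1F` in the byte dump is the surrogate pair for 🌟 (U+1F31F). A sketch of how a pair is derived, with a hypothetical helper name:

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into UTF-16 lead and trail surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain
    lead = 0xD800 + (offset >> 10)     # high 10 bits, range D800 to DBFF
    trail = 0xDC00 + (offset & 0x3FF)  # low 10 bits, range DC00 to DFFF
    return lead, trail


lead, trail = to_surrogate_pair(0x1F31F)
print(f"{lead:04X} {trail:04X}")                # D83C DF1F
print("\U0001F31F".encode("utf-16-be").hex())   # d83cdf1f
```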
41. ### Text encoding: Unicode

- UTF-8
  - 8-bit (single byte) code units

| min CP  | max CP   | 1st byte | 2nd byte | 3rd byte | 4th byte |
|---------|----------|----------|----------|----------|----------|
| U+0000  | U+007F   | 0------- |          |          |          |
| U+0080  | U+07FF   | 110----- | 10------ |          |          |
| U+0800  | U+FFFF   | 1110---- | 10------ | 10------ |          |
| U+10000 | U+10FFFF | 11110--- | 10------ | 10------ | 10------ |

- "UTF-8 is 💯 for ASCII-compatible text!": U T F - 8 ␠ i s ␠ 💯 ␠ f o r ␠ A S C I I - c o m p a t i b l e ␠ t e x t !
42. ### Text encoding: Unicode

- "UTF-8 is 💯 for ASCII-compatible text!" encoded as UTF-8:
  - 55 54 46 2D 38 20 69 73 20 F0 9F 92 AF 20 66 6F 72 20 41 53 43 49 49 2D 63 6F 6D 70 61 74 69 62 6C 65 20 74 65 78 74 21
  - byte values shown in hex notation
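The byte patterns of UTF-8 translate directly into a small encoder for a single code point. This is an illustrative sketch with an assumed function name; in practice `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                      # 0-------
        return bytes([cp])
    if cp <= 0x7FF:                     # 110----- 10------
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110---- 10------ 10------
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110--- 10------ 10------ 10------
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x1F4AF).hex(" "))  # f0 9f 92 af
print(utf8_encode(ord("U")).hex())    # 55
```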
43. ### Text / Strings: in programming languages

- escape sequences
  - do not appear literally; represent the referenced code point
  - allows inclusion of code points that are invalid in the string grammar
  - single-character escapes
    - `\n`, `\t`, `\0`, `\"`, `\\`, etc.
  - fixed-width escapes
    - `\x00`, `\u0000`
  - bracketed escapes
    - `\u{1F42B}`
  - dynamic-width escapes & null escape
    - `\uE50A\&`
44. ### Text / Strings: in programming languages

- NUL termination
  - UTF-8 overlong encoding
- "length" could mean
  - number of bytes
  - number of code units
  - number of code points
  - number of grapheme clusters
- indexing/iteration could operate on any of the above
  - not dependent on representation, just more/less performant
- byte-order mark: U+FEFF
  - an ignored code point at the beginning of a string encoding
  - will be read as U+FFFE if the byte order (endianness) is incorrect
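The different meanings of "length" can be compared on a single ZWJ sequence (MAN, ZWJ, MAN, ZWJ, GIRL). This sketch does not count grapheme clusters, which needs a text segmentation algorithm; a renderer would typically show this string as one cluster.

```python
family = "\U0001F468\u200D\U0001F468\u200D\U0001F467"

utf8_bytes = len(family.encode("utf-8"))            # 4 + 3 + 4 + 3 + 4
utf16_units = len(family.encode("utf-16-be")) // 2  # 2 + 1 + 2 + 1 + 2
code_points = len(family)                           # Python's len() counts code points

print(utf8_bytes, utf16_units, code_points)  # 18 8 5

# A byte-order mark decoded with the wrong endianness becomes a different code point.
bom_bytes = "\uFEFF".encode("utf-16-be")
misread = bom_bytes.decode("utf-16-le")
print(f"U+{ord(misread):04X}")               # U+FFFE
```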
45. ### Text: character encoding communication on the web

- Accept-Charset HTTP request header
  - `Accept-Charset: utf-8, iso-8859-1;q=0.5`
- Content-Type HTTP request header
  - `Content-Type: multipart/form-data; boundary=...`
- Content-Type HTTP response header
  - `Content-Type: text/html; charset=UTF-8`
- `<meta charset=utf-8>`
  - HTML character encoding detection algorithm
46. ### Strings: performance & optimisations

- variable encodings
  - e.g. ISO-8859-1 when possible, UTF-16 otherwise
  - requires tagging; concatenation could trigger re-encoding
- slices
  - zero-copy "view" of substring
  - requires immutable strings
- ropes
  - binary tree of strings
  - good for frequently-edited long strings
47. ### Strings: pitfalls

- do not "convert" a string to/from bytes without explicitly specifying the encoding
- do not iterate, index, or take the length of a string without considering the units
- normalise strings before comparison or hashing
- use case folding for case-insensitive comparisons
- do not NUL-terminate strings whose encoding may contain NUL
- always know the "type" of the data in your strings
- avoid strings whenever possible
48. ### Part 2 of 2

- structured data
  - in-memory representations
    - sequences
    - records/structs/tuples
    - maps
  - serialisation or transmission
    - schema-free
    - schema-directed
  - parsing
- binary data
  - typed views
  - slices
- other useful & common encodings
- parting notes
49. ### Recall from part 1

Concepts and their strategies for representation on computers:

- integers
  - fixed-width / machine integers
  - two's complement
  - VLQs / big integers
  - ...
- real numbers
  - IEEE-754 floats
  - Posits
  - ...
- text
  - Code pages: ASCII, ISO-8859-1
  - Unicode
  - Unicode encodings: UTF-8, UTF-16, UTF-32
  - Ropes, slices, mixed encodings
  - ...

51. ### Structured Data: examples

- generic collections
  - sequences (arrays / lists)
  - bags & sets
  - records / structs / tuples
  - maps & multi-maps
- domain-specific structures
  - dates
  - URLs
  - postal addresses

53. ### Sequences, Bags, & Sets: representations

- arrays
  - contiguous in memory
  - fast indexing
  - insert / remove via copying only
- linked lists
  - slow indexing
  - mutable: fast insert, remove
  - immutable: shared suffixes, cons, drop
54. ### Bags: representations, continued

- binary heaps
  - binary tree
  - constraint: elements have an ordering
  - fast access to smallest element
  - heap property: parent ≤ children
  - binary trees can be efficiently laid out in memory (left: 2n+1, right: 2n+2)
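Python's `heapq` module stores a binary min-heap in a plain list using the same index arithmetic, which makes the layout easy to inspect. The helper `is_heap` is an assumption added for the demonstration.

```python
import heapq


def children(n: int) -> tuple[int, int]:
    """Indices of the left and right children of the node at index n."""
    return 2 * n + 1, 2 * n + 2


def is_heap(items: list) -> bool:
    """Check the heap property: every parent <= each of its children."""
    return all(
        items[n] <= items[c]
        for n in range(len(items))
        for c in children(n)
        if c < len(items)
    )


heap: list[int] = []
for value in [5, 3, 8, 1, 9, 2]:
    heapq.heappush(heap, value)

print(heap[0])                  # 1: the smallest element is always at the root
print(is_heap(heap))            # True
print(is_heap([5, 3, 8]))       # False: the root is larger than a child
print(heapq.heappop(heap))      # 1
print(heap[0], is_heap(heap))   # 2 True
```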
55. ### Records: structs, tuples, same thing

- laid out contiguously (like array)
- compiler translates fields to offsets
  - packed: 12345 'p' true
  - unpacked (word aligned): 12345 'p' padding true padding
- examples: { a: 12345, b: 'p', c: true, } and ( 12345, 'p', true )
56. ### Maps: representations

- unlike records, keys exist at runtime
  - key must be present in representation
- simple & slow: any representation as sequence of pairs
  - [ ("key 1", "value 1"), ("key 2", "value 2"), ... ]
57. ### Maps: hash table representation

- keys must be hashable
- digest size = number of "buckets"
- inevitable hash collisions
- collision resolution: matching bucket searched for correct key
- possibly catastrophic performance
58. ### Considerations about in-memory representations

- will we have biased usage frequency or usage patterns?
- is there a useful constraint or invariant for the contained elements?
- is our application performance sensitive to CPU caches or locality?
- will we be using it enough to justify a complex representation?
- can an adversary take advantage of catastrophic performance?

60. ### Representations for Serialisation / Transmission

|                 | text-based      | binary                                                                         |
|-----------------|-----------------|--------------------------------------------------------------------------------|
| schema-free     | JSON, YAML, edn | UBJSON / BSON / Smile, MessagePack, SuperPack                                  |
| schema-directed |                 | Protocol Buffers (protobuffs), Cap'n Proto / FlatBuffers, Apache Avro, Apache Thrift |

61. ### Serialisation: schema-free, text-based

- JSON
  - very limited data model
  - just a formal grammar, no associated semantics
  - case study: Twitter snowflake bug
- YAML
  - internal references
  - multi-document streams
  - don't use yaml
- edn
  - extensibility
  - coordination between encoder/decoder
62. ### Serialisation: schema-free, binary

- UBJSON / BSON / Smile
- MessagePack
- SuperPack
  - built-in optimisations
  - extensibility
63. ### Serialisation: schema-directed (binary)

- schemas describe shape of data
- allows us to omit tags
  - schema-free encoding: tag 1, value 1, tag 2, value 2, tag 3, value 3
  - schema-directed encoding: value 1, value 2, value 3
- Protocol Buffers (protobuffs)
  - limitation: schemas don't support arbitrary nesting, so shape of your data may need to change
- Cap'n Proto / FlatBuffers
  - zero-copy deserialisation
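The size difference between tagged and untagged encodings can be sketched with the standard library. JSON stands in for a schema-free format, and a fixed `struct` layout stands in for a schema; this is only an analogy, not how Protocol Buffers or Cap'n Proto actually encode data. The record and the layout `"<Id?"` are assumptions.

```python
import json
import struct

record = {"id": 123456, "score": 1.5, "active": True}

# Schema-free: field names (tags) travel with every record.
schema_free = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Schema-directed: both sides agree on the layout ahead of time.
# Hypothetical schema: little-endian uint32 id, float64 score, bool active.
SCHEMA = struct.Struct("<Id?")
schema_directed = SCHEMA.pack(record["id"], record["score"], record["active"])

print(len(schema_free), len(schema_directed))
print(SCHEMA.unpack(schema_directed))  # (123456, 1.5, True)
```

Without the schema, the 13 packed bytes cannot be interpreted, which is the trade-off for the smaller size.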
64. ### Deserialisation: security

- arbitrary code execution
  - in Ruby and Python by default
- compression
- creating cycles where unexpected
  - generally, violating any structural invariants
- data validation and/or budgeting
65. ### Considerations about representations for serialisation

- is serialisation / deserialisation performance important?
- does my data have a fixed schema?
- does it need to be canonicalising?
- human-readability / editing
- portable across different languages?
- portable across processes / machines?
- exotic types of data?
- adversary-controlled?
- limitations of underlying transmission protocol
- impossible to represent nonsensical structures?

68. ### Binary Data: slices

- shared buffer abstraction
- lightweight: just translates offsets
- composable
- read/write or read-only
- combine with typed views for principled handling of binary data
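Python's `memoryview` is one concrete instance of this idea: slicing it shares the underlying buffer instead of copying, and `cast` gives a typed view. A sketch, assuming a Python version with `memoryview.toreadonly` (3.8 or later):

```python
buffer = bytearray(range(8))   # 00 01 02 03 04 05 06 07
view = memoryview(buffer)

tail = view[4:]                # a slice: no bytes are copied
tail[0] = 0xFF                 # writes through to the shared buffer
print(buffer.hex(" "))         # 00 01 02 03 ff 05 06 07

inner = tail[1:3]              # slices compose: offsets 5 and 6 of the buffer
print(bytes(inner).hex(" "))   # 05 06

words = view.cast("H")         # typed view: unsigned 16-bit, native byte order
print(len(words))              # 4

frozen = view.toreadonly()     # read-only view over the same bytes
try:
    frozen[0] = 1
except TypeError as error:
    print("read-only:", error)
```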

70. ### Common Encoding: URLs

`https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment`

- scheme: https
- user info: user:pass
- host (punycode-encoded): example.org
- port: 443
- path: /a/b/c
  - path components (percent-encoded)
- query: x=0&y=1&z=2
  - query parameters (percent-encoded)
- fragment (percent-encoded): fragment
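The same URL can be taken apart with `urllib.parse`, which is one example of working with structured data through a structured interface. The percent-encoding and punycode inputs at the end are assumed examples, not from the slide.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit

parts = urlsplit("https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment")

print(parts.scheme)                    # https
print(parts.username, parts.password)  # user pass
print(parts.hostname, parts.port)      # example.org 443
print(parts.path.split("/")[1:])       # ['a', 'b', 'c']
print(parse_qs(parts.query))           # {'x': ['0'], 'y': ['1'], 'z': ['2']}
print(parts.fragment)                  # fragment

# Percent-encoding lets a path component contain reserved characters.
component = quote("a b/c", safe="")
print(component, unquote(component))   # a%20b%2Fc a b/c

# Non-ASCII host names are carried in punycode form (hypothetical host).
print("b\u00FCcher.example".encode("idna"))  # b'xn--bcher-kva.example'
```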
71. ### Other Common Encodings

- UUIDs (RFC 4122)
  - version & variant
  - time, MAC address, user/group ID
- User-Agent headers
  - browser family & version
  - OS & version
  - CPU architecture
  - mobile provider
- Dates
  - Thu Feb 04 2021 10:00:00 GMT-0800 (Pacific Standard Time)
  - year, month, day
  - hours, minutes, seconds
  - day of week
  - time zone & offset

73. ### Parting Notes

- distinguish abstractions from concrete representations in communication
- try working in abstractions, but be aware of your representations
  - abstractions overpromise
  - interface may be possible but with costs that are not evident upfront
  - easy to forget about potential scenarios in which representation matters
- constraints or invariants are opportunities for optimisations
- work with structured data through structured interfaces

75. ### Structured Data: Serialisation Strategies

| schema-free                              | schema-directed                                        |
|------------------------------------------|--------------------------------------------------------|
| examples: JSON, YAML, MessagePack        | examples: Protocol Buffers, Apache Avro, Apache Thrift |
| shape of data may be unknown             | shape of data known ahead of time                      |
| encoded data is self-describing          | schema is required for both encoding and decoding      |
| encoded data may be human-readable       | encoded data typically not human-readable              |
| size overhead to describe contained data | compact encoding                                       |

76. ### Binary Data: base-64 encoding

- more convenient representation in certain contexts that prefer working with printable ASCII
- 4/3: $64^4 = 256^3$
- printable ASCII has at least 64 unique symbols (62 alphanums!)
- padding
- URL-safe alphabet
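The 4/3 ratio, the padding, and the URL-safe alphabet can all be seen with Python's `base64` module. The input bytes are assumed examples, picked so that the two alphabets differ.

```python
import base64

raw = bytes([0xFB, 0xFF, 0xFE])          # 3 bytes = 24 bits = four 6-bit symbols

print(base64.b64encode(raw))             # b'+//+'
print(base64.urlsafe_b64encode(raw))     # b'-__-'  ('+' and '/' are replaced)

# Inputs that are not a multiple of 3 bytes are padded with '='.
print(base64.b64encode(b"a"))            # b'YQ=='
print(base64.b64encode(b"ab"))           # b'YWI='

# Every 3 input bytes become 4 output characters.
print(len(base64.b64encode(bytes(300)))) # 400
print(64**4 == 256**3)                   # True
```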