
Data Encodings and Representations

No background is required: you will learn how the numbers, strings, and other data structures on our computers are represented by ones and zeroes underneath, and how that representation leaks through our abstractions to affect how we use them. From there, we will move on to other, more conspicuous changes of representation that are common in our daily lives, and review what should be considered when dealing with them.

Michael Ficarra

February 03, 2021

Transcript

  1. Data Encodings and Representations Michael Ficarra

  2. Part 1 of 2 • integers • real numbers • text
  3. Integers

  4. Integers positional numeral systems • decimal ◦ base/radix: 10 ◦

    0, 1, 2, 3, 4, 5, 6, 7, 8, 9 • hexadecimal (hex) ◦ base/radix: 16 ◦ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F • binary ◦ base/radix: 2 ◦ 0, 1
  5. Integers positional numeral systems Dec: 24 Hex: 18 Bin: 11000

    Dec: 41 Hex: 29 Bin: 101001 Dec: 94 Hex: 5E Bin: 1011110
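As a quick check of the conversions on this slide, here is a minimal Python sketch. The names `values` and `rows` are illustrative and not from the talk.

```python
# Render each decimal value from the slide in hexadecimal and binary.
values = [24, 41, 94]
rows = [(n, format(n, "X"), format(n, "b")) for n in values]

for dec, hex_digits, bin_digits in rows:
    print(f"Dec: {dec}  Hex: {hex_digits}  Bin: {bin_digits}")

# Parsing goes the other way: int() takes the base/radix as its second argument.
from_hex = int("5E", 16)
from_bin = int("1011110", 2)
print(from_hex, from_bin)
```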
  6. Integers fixed width Problem: computers are finite Number: 94 (decimal)

    • Word size: 8 bits 0101 1110 • Word size: 16 bits 0000 0000 0101 1110 • Word size: 32 bits 0000 0000 0000 0000 0000 0000 0101 1110
  7. Integers overflow / underflow Word size: 8 bits • Number:

    254 (decimal) 1111 1110 • Number: 255 (decimal) 1111 1111 • Number: 256 (decimal) ???? ????
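What happens at 256 depends on the language and the type: wrapping, trapping, and saturating are all seen in practice. The sketch below assumes wrap-around (modular) behaviour and simulates an 8-bit word with a mask, because Python's own integers are arbitrary width. `add_u8` and `sub_u8` are hypothetical helper names.

```python
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1  # 0b1111_1111

def add_u8(a: int, b: int) -> int:
    """Add two unsigned 8-bit values, keeping only the low 8 bits."""
    return (a + b) & MASK

def sub_u8(a: int, b: int) -> int:
    """Subtract with the same wrap-around, so going below zero wraps to the top."""
    return (a - b) & MASK

print(add_u8(254, 1))  # 255: still fits
print(add_u8(255, 1))  # 0: the ninth bit is discarded
print(sub_u8(0, 1))    # 255: wrap-around in the other direction
```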
  8. Integers negatives Number: -94 (decimal) Word size: 16 bits • sign and magnitude: 1000 0000 0101 1110 • two's complement: 2^16 - 94 = 65442, i.e. 1111 1111 1010 0010; negating again gives 2^16 - 65442 = 94
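The two's complement arithmetic on this slide can be reproduced with a short sketch. The function names are hypothetical, and the 16-bit default simply mirrors the slide's word size.

```python
def to_twos_complement(value: int, bits: int = 16) -> int:
    """Return the unsigned bit pattern that stores `value` in `bits` bits."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)  # same as 2**bits + value for negatives

def from_twos_complement(pattern: int, bits: int = 16) -> int:
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit

pattern = to_twos_complement(-94)
print(pattern, format(pattern, "016b"))  # 65442 1111111110100010
print(from_twos_complement(pattern))     # -94
```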
  9. Binary Data byte ordering Number: 123456 (decimal) 00000001 11100010 01000000

    [ 01 ] [ E2 ] [ 40 ] • Little Endian 40 E2 01 • Big Endian 01 E2 40 (aside)
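Python's `int.to_bytes` and `int.from_bytes` make the byte order explicit, which is a convenient way to see both layouts. This is only a sketch; the 3-byte width is chosen to match the slide.

```python
n = 123456

big = n.to_bytes(3, "big")        # most significant byte first
little = n.to_bytes(3, "little")  # least significant byte first

print(big.hex())     # 01e240
print(little.hex())  # 40e201

# Reading bytes back with the wrong byte order silently yields a different number.
misread = int.from_bytes(little, "big")
print(misread == n)  # False
```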
  10. Integers arbitrary width (LEB128) Number: 123456 (decimal) 0001 1110 0010 0100 0000 1. group into 7-bit segments: 0000111 1000100 1000000 2. add continuation bits: 00000111 11000100 11000000 3. little endian: 11000000 11000100 00000111
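The three steps above can be sketched as a small encoder and decoder for unsigned LEB128. The function names are illustrative, and signed LEB128 is not covered here.

```python
def leb128_encode(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = n & 0x7F              # take the lowest 7 bits
        n >>= 7
        if n:
            out.append(group | 0x80)  # continuation bit: more groups follow
        else:
            out.append(group)         # final group: continuation bit clear
            return bytes(out)

def leb128_decode(data: bytes) -> int:
    """Decode unsigned LEB128, least significant group first."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated LEB128 sequence")

encoded = leb128_encode(123456)
print(encoded.hex())           # c0c407
print(leb128_decode(encoded))  # 123456
```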
  11. Integers in programming languages (literal notation) • bases ◦ decimal:

    123456 ◦ hex: 0x789ABC ◦ octal (base 8) ▪ leading zero: 0765 ▪ 0o765 ◦ binary: 0b110000 • sized literals ◦ long: 123456789L ◦ long long: 123456789LL • separators ◦ 1_234_567_890 ◦ 0x4996_02D2 ◦ 0b00000011_11101000 • BigInt literals in JavaScript ◦ 123456789n
  12. Integers in programming languages (sized types) • absolutely sized types:

    ◦ uint8_t, int32_t, etc ◦ uint_least8_t, int_fast32_t, etc ◦ integer types in Java: ▪ byte: signed, 8 bit ▪ short: signed, 16 bit ▪ int: signed, 32 bit ▪ long: signed, 64 bit • platform-relative sized types: ◦ short: usually half an int ◦ int: usually one word or ½word ◦ long: usually two ints ◦ long long: usually two longs ◦ word, dword, qword, etc.
  13. Integers pitfalls • always consider overflow & underflow when performing

    arithmetic on fixed-width integers • do not transfer integers to another machine without accounting for byte ordering
  14. ℝeal Numbers

  15. Real Numbers inherent loss of precision • problem: ◦ real

    numbers are uncountably infinite ◦ computers are finite • a solution: ◦ choose some rationals to be representable on computers ◦ approximate your real number by using one of those values instead ◦ logarithmically space them out ▪ low magnitude, high precision ▪ high magnitude, low precision
  16. Real Numbers IEEE-754 floats • single precision (binary32) ◦ all integers with magnitude < 2^24 ◦ some rationals up to 3.40 × 10^38 ◦ rule of thumb: ~7 decimal digits of precision • double precision (binary64) ◦ all integers with magnitude < 2^53 ◦ some rationals up to 1.79 × 10^308 ◦ rule of thumb: ~15 decimal digits of precision
  17. Real Numbers IEEE-754 floats (single precision, bias 127) (-1)^sign × 2^(exponent - bias) × 1.fraction = (-1)^0 × 2^(130 - 127) × 1.25 = 1 × 2^3 × 1.25 = 10
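To see the three fields for a concrete value, the following sketch pulls apart the binary32 encoding of 10.0 using the standard `struct` module. In this encoding the stored exponent field is 130 and the bias of 127 is subtracted from it. `decode_binary32` is a hypothetical helper, and the reconstruction only holds for normal numbers (not zeroes, subnormals, infinities, or NaNs).

```python
import struct

def decode_binary32(x: float):
    """Split a single-precision float into (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 bits after the implicit leading 1
    return sign, exponent, fraction

sign, exponent, fraction = decode_binary32(10.0)
significand = 1 + fraction / 2**23   # the "1.fraction" part
value = (-1) ** sign * 2.0 ** (exponent - 127) * significand

print(sign, exponent, significand)   # 0 130 1.25
print(value)                         # 10.0
```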
  18. Real Numbers IEEE-754 oddities • NaN: not a number ◦ exponent all 1, fraction non-0 ◦ represents computation error ◦ 2^53 - 2 different double-precision NaNs ◦ generally propagates ◦ NaN ≠ NaN • infinities ◦ exponent all 1, fraction all 0 ◦ positive and negative, based on sign • zeroes ◦ exponent all 0, fraction all 0 ◦ positive and negative, based on sign • subnormal numbers ◦ exponent all 0, fraction non-0 ◦ represent the differences between adjacent "normal" numbers
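Several of these oddities are easy to observe from Python, whose `float` is a double on mainstream platforms (an assumption of this sketch). Variable names are illustrative.

```python
import math
import struct
import sys

nan = float("inf") - float("inf")  # an invalid operation yields NaN
print(nan == nan)                  # False: NaN compares unequal to itself
print(math.isnan(nan + 1.0))       # True: NaN propagates through arithmetic

# Signed zeroes compare equal but carry different sign bits.
print(0.0 == -0.0)                 # True
print(math.copysign(1.0, -0.0))    # -1.0

# Exponent bits all 0 and fraction = 1: the smallest positive subnormal.
(tiny,) = struct.unpack(">d", (1).to_bytes(8, "big"))
print(0.0 < tiny < sys.float_info.min)  # True: smaller than any normal double
```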
  19. Real Numbers numerical instability • occurs when precision losses accumulate

    • example: summing a list of floats ◦ see Kahan's algorithm • can be re-introduced by naïve compiler optimisations • detect by changing rounding scheme ◦ if round-to-infinity and round-to-negative-infinity differ greatly, it's likely unstable
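Kahan's algorithm, mentioned above, carries a running compensation term for the low-order bits lost at each addition. The sketch below compares it against a plain loop, using `math.fsum` as the reference; the input list is an arbitrary illustration. A hand-written loop is used for the naive sum because recent Python versions already compensate inside the built-in `sum`.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation; rounding errors pile up."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Compensated summation: track the error lost at each step and feed it back."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # apply the correction from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total

values = [0.1] * 100_000
exact = math.fsum(values)               # correctly rounded reference sum
naive_error = abs(naive_sum(values) - exact)
kahan_error = abs(kahan_sum(values) - exact)
print(naive_error, kahan_error)
```

Note that an optimiser permitted to treat float arithmetic as associative could simplify `(t - total) - y` to zero, which is the compiler pitfall the slide refers to.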
  20. Real Numbers alternatives to IEEE-754 • Posits ◦ similar goals

    as IEEE-754, done better ◦ shown in studies to be more accurate ◦ still relatively new; not widely available in hardware • Rationals ◦ pair of arbitrary width integers ◦ very large magnitude numbers have very large representations • Decimals ◦ pair of integers ◦ fixed or unlimited precision • BigFloats ◦ specified precision
  21. Real Numbers in programming languages • literal notation ◦ 1.234

    or 123. or .123 ◦ 1D or 1F ◦ 1.2e34 ◦ ~1.23 • loss of precision in notation ◦ 9007199254740993D ◦ typically rounds to nearest, ties to even ◦ some languages disallow/warn ◦ infinity: 2e308 / 4e38
  22. Real Numbers pitfalls • do not use floats to represent

    money • compare floats using an epsilon • do not mix very big and very small floats • avoid accumulating arithmetic precision losses • consider the appropriate representation before starting
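A few of these pitfalls in miniature. This is a sketch: `math.isclose` with its default tolerance is only one possible epsilon policy, and integer cents or `Decimal` are only two of several reasonable representations for money.

```python
import math
from decimal import Decimal

# Exact equality fails because 0.1, 0.2, and 0.3 are all approximations.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True: comparison within a tolerance

# Money: use integer minor units or a decimal type instead of binary floats.
price_cents = 1999
total_cents = 3 * price_cents
decimal_total = Decimal("0.10") + Decimal("0.20")
print(total_cents, decimal_total)

# Mixing very big and very small floats: the small addend is rounded away.
absorbed = 1e16 + 1.0 == 1e16
print(absorbed)                      # True
```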
  23. Text

  24. Text the bad old days 7-bit ASCII

  25. Text the bad old days ISO 8859-1 / Windows-1252

  26. Text the modern Unicode era • goal: to represent all

    writing systems of the world ◦ includes historical writing systems ◦ includes symbols and punctuation ◦ includes emoji ◦ over 150 scripts supported, representing over 780 languages • sequences of code points ◦ 1,114,112 code points ▪ 0 through 0x10FFFF ◦ ~138,000 assigned so far (non-PUA) • split into 17 "planes" ◦ 1 basic multilingual plane (BMP) ▪ U+0000 through U+FFFF ◦ 16 supplementary / "astral" planes • ASCII compatible ◦ overlaps very beginning of BMP
  27. code point allocation (left: all planes, right: BMP in detail)

  28. Unicode Han unification just kidding

  29. Unicode grapheme clusters • combining characters: ◦ А U+0410 : CYRILLIC CAPITAL LETTER A ◦ ҉ U+0489 : COMBINING CYRILLIC MILLIONS SIGN • ZWJ sequences: 👨‍👨‍👧 ◦ 👨 U+1F468 : MAN ◦ U+200D : ZERO WIDTH JOINER [ZWJ] ◦ 👨 U+1F468 : MAN ◦ U+200D : ZERO WIDTH JOINER [ZWJ] ◦ 👧 U+1F467 : GIRL • emoji skin tone modifiers: 🤏🏼 ◦ 🤏 U+1F90F : PINCHING HAND ◦ 🏼 U+1F3FC : EMOJI MODIFIER FITZPATRICK TYPE-3 (MEDIUM-LIGHT)
  30. Unicode grapheme clusters • variation selectors: ➡️ ◦ ➡ U+27A1 : BLACK RIGHTWARDS ARROW ◦ U+FE0F : VARIATION SELECTOR-16 • flags: 🇫🇯 ◦ 🇫 U+1F1EB : REGIONAL INDICATOR SYMBOL LETTER F ◦ 🇯 U+1F1EF : REGIONAL INDICATOR SYMBOL LETTER J
  31. Unicode decomposition inconsistencies • ä ◦ a U+0061 : LATIN SMALL LETTER A ◦ ̈ U+0308 : COMBINING DIAERESIS • ä ◦ ä U+00E4 : LATIN SMALL LETTER A WITH DIAERESIS • ⁉️ ◦ ⁉ U+2049 : EXCLAMATION QUESTION MARK ◦ U+FE0F : VARIATION SELECTOR-16 • ❓ ◦ ❓ U+2753 : BLACK QUESTION MARK ORNAMENT
  32. Unicode decomposition inconsistencies • Man Health Worker: 👨🏾‍⚕️ ◦ 👨 U+1F468 : MAN ◦ 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK) ◦ U+200D : ZERO WIDTH JOINER [ZWJ] ◦ ⚕ U+2695 : STAFF OF AESCULAPIUS ◦ U+FE0F : VARIATION SELECTOR-16 • Merman: 🧜🏾‍♂️ ◦ 🧜 U+1F9DC : MERPERSON ◦ 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK) ◦ U+200D : ZERO WIDTH JOINER [ZWJ] ◦ ♂ U+2642 : MALE SIGN ◦ U+FE0F : VARIATION SELECTOR-16
  33. Unicode normalisation • NFD: Normalization Form Canonical Decomposition 1. decompose by canonical equivalence 2. sort any combining characters • NFC: Normalization Form Canonical Composition 1. decompose by canonical equivalence 2. recompose by canonical equivalence • NFKD: Normalization Form Compatibility Decomposition 1. decompose by compatibility 2. sort any combining characters • NFKC: Normalization Form Compatibility Composition 1. decompose by compatibility 2. recompose by canonical equivalence
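Python exposes all four forms through `unicodedata.normalize`. The sketch below uses the two spellings of "ä" from the earlier slide, plus the "ﬁ" ligature (U+FB01), an example chosen here to show a compatibility mapping.

```python
import unicodedata

decomposed = "a\u0308"  # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

print(decomposed == precomposed)  # False: different code point sequences
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)  # True True

# Compatibility forms also fold "presentation" variants such as ligatures.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature, nfkc_ligature)
```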
  34. example normalisations

  35. Unicode case mappings • four mappings ◦ lowercasing ◦ uppercasing

    ◦ titlecasing ◦ case folding (used for case-insensitive comparisons) • case mapping is string-to-string operation ◦ string length may change ◦ context-dependent mappings • does not necessarily round-trip • can be locale sensitive
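Some of these properties can be observed with Python's built-in string methods, which implement the locale-independent Unicode mappings (so the locale-sensitive cases, such as Turkish dotted and dotless i, are not shown). The example words are chosen here for illustration.

```python
# String length may change: one code point uppercases to two.
upper_sharp_s = "\u00df".upper()  # "ß" -> "SS"
print(upper_sharp_s, len(upper_sharp_s))

# Mappings do not necessarily round-trip.
round_trip = "\u00df".upper().lower()  # "ss", not "ß"
print(round_trip == "\u00df")          # False

# Case folding is the mapping intended for case-insensitive comparison.
same = "Stra\u00dfe".casefold() == "STRASSE".casefold()
print(same)                            # True

# Context-dependent mapping: Greek capital sigma lowercases differently at word end.
lower_sigma = "\u03a3\u0391\u03a3".lower()  # "ΣΑΣ"
print(lower_sigma)
```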
  36. Text encoding 7-bit ASCII • one byte per character "ASCII"

    41 53 43 49 49 byte values shown in hex notation
  37. Text encoding Unicode • UTF-32 ◦ 32-bit (4 byte) code

    units ◦ 1 code unit per code point ◦ extremely wasteful ◦ multi-byte, so endianness aware "waste of 🌌" w a s t e ␠ o f ␠ 🌌
  38. Text encoding Unicode • UTF-32 ◦ 32-bit (4 byte) code

    units ◦ 1 code unit per code point ◦ extremely wasteful ◦ multi-byte, so endianness aware "waste of 🌌" 00 00 00 77 00 00 00 61 00 00 00 73 00 00 00 74 00 00 00 65 00 00 00 20 00 00 00 6F 00 00 00 66 00 00 00 20 00 01 F3 0C byte values shown in hex notation; big endian byte ordering
  39. Text encoding Unicode • UTF-16 ◦ 16-bit (2 byte) code

    units ◦ 1 code unit for BMP code points ◦ 2 code units for non-BMP code points ▪ called "surrogate pair" ▪ lead from range D800 to DBFF ▪ trail from range DC00 to DFFF ▪ lone and out-of-order surrogates disallowed "UTF-16 is 🌟neat" U T F - 1 6 ␠ i s ␠ 🌟 n e a t
  40. Text encoding Unicode • UTF-16 ◦ 16-bit (2 byte) code

    units ◦ 1 code unit for BMP code points ◦ 2 code units for non-BMP code points ▪ called "surrogate pair" ▪ lead from range D800 to DBFF ▪ trail from range DC00 to DFFF ▪ lone and out-of-order surrogates disallowed "UTF-16 is 🌟neat" 00 55 00 54 00 46 00 2D 00 31 00 36 00 20 00 69 00 73 00 20 D8 3C DF 1F 00 6E 00 65 00 61 00 74 byte values shown in hex notation; big endian byte ordering
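The surrogate pair for 🌟 (U+1F31F) shown in the bytes above can be derived by hand. This sketch uses a hypothetical helper, `surrogate_pair`, and checks it against Python's own UTF-16 encoder; it also reproduces the UTF-32 bytes for 🌌 (U+1F30C) from the earlier slide.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a non-BMP code point into (lead, trail) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    lead = 0xD800 + (offset >> 10)      # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return lead, trail

lead, trail = surrogate_pair(0x1F31F)
print(f"{lead:04X} {trail:04X}")        # D83C DF1F

utf16 = "\U0001F31F".encode("utf-16-be")
utf32 = "\U0001F30C".encode("utf-32-be")
print(utf16.hex(), utf32.hex())
```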
  41. Text encoding Unicode • UTF-8 ◦ 8-bit (single byte) code units

    min CP    max CP     1st byte  2nd byte  3rd byte  4th byte
    U+0000    U+007F     0-------
    U+0080    U+07FF     110-----  10------
    U+0800    U+FFFF     1110----  10------  10------
    U+10000   U+10FFFF   11110---  10------  10------  10------

    "UTF-8 is 💯 for ASCII-compatible text!"
    U T F - 8 ␠ i s ␠ 💯 ␠ f o r ␠ A S C I I - c o m p a t i b l e ␠ t e x t !
  42. Text encoding Unicode • UTF-8 ◦ 8-bit (single byte) code units

    min CP    max CP     1st byte  2nd byte  3rd byte  4th byte
    U+0000    U+007F     0-------
    U+0080    U+07FF     110-----  10------
    U+0800    U+FFFF     1110----  10------  10------
    U+10000   U+10FFFF   11110---  10------  10------  10------

    "UTF-8 is 💯 for ASCII-compatible text!"
    55 54 46 2D 38 20 69 73 20 F0 9F 92 AF 20 66 6F 72 20 41 53 43 49 49 2D 63 6F 6D 70 61 74 69 62 6C 65 20 74 65 78 74 21
    byte values shown in hex notation
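The table translates almost directly into code. The following is a minimal encoder sketch, with `utf8_encode` as a hypothetical name; real code would normally call `str.encode("utf-8")`, which is used here only to cross-check.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the four rows of the UTF-8 table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp <= 0x7F:        # 0-------
        return bytes([cp])
    if cp <= 0x7FF:       # 110----- 10------
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110---- 10------ 10------
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),  # 11110--- 10------ 10------ 10------
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

hundred = utf8_encode(0x1F4AF)  # the 💯 code point from the slide
print(hundred.hex())            # f09f92af
```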
  43. Text / Strings in programming languages • escape sequences ◦

    do not appear literally; represent the referenced code point ◦ allows inclusion of code points that are invalid in the string grammar ◦ single-character escapes ▪ \n, \t, \0, \", \\, etc ◦ fixed-width escapes ▪ \x00, \u0000 ◦ bracketed escapes ▪ \u{1F42B} ◦ dynamic-width escapes & null escape ▪ \uE50A\&
  44. Text / Strings in programming languages • NUL termination ◦ UTF-8 overlong encoding • "length" could mean ◦ number of bytes ◦ number of code units ◦ number of code points ◦ number of grapheme clusters • indexing/iteration could operate on any of the above ◦ not dependent on representation, just more/less performant • byte-order mark: U+FEFF ◦ an ignored code point at the beginning of a string encoding ◦ will be read as U+FFFE if the byte order (endianness) is incorrect
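The different meanings of "length" are easy to see on the Man Health Worker sequence from the earlier slide. In Python, `len` on a `str` counts code points; counting grapheme clusters needs a segmentation library and is left out of this sketch. Variable names are illustrative.

```python
# MAN + skin tone modifier + ZWJ + STAFF OF AESCULAPIUS + VARIATION SELECTOR-16
health_worker = "\U0001F468\U0001F3FE\u200D\u2695\uFE0F"

n_code_points = len(health_worker)
n_utf8_bytes = len(health_worker.encode("utf-8"))
n_utf16_units = len(health_worker.encode("utf-16-le")) // 2
n_utf32_units = len(health_worker.encode("utf-32-le")) // 4

# One user-perceived character, four different "lengths".
print(n_code_points, n_utf8_bytes, n_utf16_units, n_utf32_units)  # 5 17 7 5

# Byte-order mark: the bytes of U+FEFF read with the wrong endianness.
bom_bytes = "\ufeff".encode("utf-16-be")
swapped = int.from_bytes(bom_bytes, "little")
print(f"U+{swapped:04X}")
```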
  45. Text character encoding communication on the web • Accept-Charset HTTP

    request header ◦ Accept-Charset: utf-8, iso-8859-1;q=0.5 • Content-Type HTTP request header ◦ Content-Type: multipart/form-data; boundary=... • Content-Type HTTP response header ◦ Content-Type: text/html; charset=UTF-8 • <meta charset=utf-8> ◦ HTML character encoding detection algorithm
  46. Strings performance & optimisations • variable encodings ◦ e.g. ISO-8859-1

    when possible, UTF-16 otherwise ◦ requires tagging; concatenation could trigger re-encoding • slices ◦ zero-copy "view" of substring ◦ requires immutable strings • ropes ◦ binary tree of strings ◦ good for frequently-edited long strings
  47. Strings pitfalls • do not "convert" a string to/from bytes

    without explicitly specifying the encoding • do not iterate, index, or take the length of a string without considering the units • normalise strings before comparison or hashing • use case folding for case-insensitive comparisons • do not NUL-terminate strings whose encoding may contain NUL • always know the "type" of the data in your strings • avoid strings whenever possible
  48. Part 2 of 2 • structured data ◦ in-memory representations

    ▪ sequences ▪ records/structs/tuples ▪ maps ◦ serialisation or transmission ▪ schema-free ▪ schema-directed ◦ parsing • binary data ◦ typed views ◦ slices • other useful & common encodings • parting notes
  49. Recall from part 1 concept strategies for representation on computers

    integers • fixed-width / machine integers • two's complement • VLQs / big integers ... real numbers • IEEE-754 floats • Posits ... text • Code pages: ASCII, ISO-8859-1 • Unicode • Unicode encodings: UTF-8, UTF-16, UTF-32 • Ropes, slices, mixed encodings ...
  50. Structured Data

  51. Structured Data examples • generic collections ◦ sequences (arrays /

    lists) ◦ bags & sets ◦ records / structs / tuples ◦ maps & multi-maps • domain-specific structures ◦ dates ◦ URLs ◦ postal addresses
  52. In-memory Representations

  53. Sequences, Bags, & Sets representations • arrays ◦ contiguous in

    memory ◦ fast indexing ◦ insert / remove via copying only • linked lists ◦ slow indexing ◦ mutable: fast insert, remove ◦ immutable: shared suffixes, cons, drop
  54. Bags representations, continued • binary heaps ◦ binary tree ◦

    constraint: elements have an ordering ◦ fast access to smallest element ◦ heap property: parent ≤ children ◦ binary trees can be efficiently laid out in memory (left: 2n+1, right: 2n+2)
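Python's `heapq` module stores a binary min-heap in a plain list using exactly this layout, so the heap property can be checked with the index formulas from the slide. The sample data is arbitrary.

```python
import heapq

heap = []
for item in [5, 3, 8, 1, 9, 2]:
    heapq.heappush(heap, item)

def children(n: int):
    """Indices of the left and right children of the node stored at index n."""
    return 2 * n + 1, 2 * n + 2

# Heap property: every parent is <= each of its children.
heap_property_holds = all(
    heap[n] <= heap[c]
    for n in range(len(heap))
    for c in children(n)
    if c < len(heap)
)
print(heap_property_holds)  # True
print(heap[0])              # 1: the smallest element sits at the root

drained = [heapq.heappop(heap) for _ in range(len(heap))]
print(drained)              # [1, 2, 3, 5, 8, 9]
```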
  55. Records structs, tuples, same thing • laid out contiguously (like

    array) • compiler translates fields to offsets ◦ packed ◦ unpacked (word aligned) { a: 12345, b: 'p', c: true, } 12345 'p' true 12345 'p' padding true padding ( 12345, 'p', true )
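The two layouts can be imitated with the `struct` module. This sketch assumes a 32-bit integer, a 1-byte character, a 1-byte boolean, and a 4-byte word; the explicit `x` pad bytes stand in for the padding a compiler would insert, and the offset tables are what "fields translated to offsets" amounts to.

```python
import struct

# Hypothetical layouts for { a: 12345, b: 'p', c: true }.
PACKED = struct.Struct("<i c ?")         # 4 + 1 + 1 bytes, no padding
ALIGNED = struct.Struct("<i c 3x ? 3x")  # every field starts on a 4-byte boundary

PACKED_OFFSETS = {"a": 0, "b": 4, "c": 5}
ALIGNED_OFFSETS = {"a": 0, "b": 4, "c": 8}

record = (12345, b"p", True)
packed = PACKED.pack(*record)
aligned = ALIGNED.pack(*record)
print(len(packed), len(aligned))  # 6 12

# Field access is a read at a fixed offset; no field names exist at runtime.
(c_packed,) = struct.unpack_from("<?", packed, PACKED_OFFSETS["c"])
(c_aligned,) = struct.unpack_from("<?", aligned, ALIGNED_OFFSETS["c"])
print(c_packed, c_aligned)        # True True
```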
  56. Maps representations • unlike records, keys exist at runtime ◦

    key must be present in representation • simple & slow: any representation as sequence of pairs [ ("key 1", "value 1"), ("key 2", "value 2"), ... ]
  57. Maps hash table representation • keys must be hashable •

    digest size = number of "buckets" • inevitable hash collisions • collision resolution: matching bucket searched for correct key • possibly catastrophic performance
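A minimal separate-chaining table makes the collision behaviour concrete. `ChainedMap` is a hypothetical class written for this sketch; it relies on CPython hashing small integers to themselves, so keys that are multiples of the bucket count all land in the same bucket.

```python
class ChainedMap:
    """Hash table where each bucket is a list of (key, value) pairs."""

    def __init__(self, bucket_count: int = 8):
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (possibly a collision)

    def get(self, key):
        for existing, value in self._bucket(key):  # linear search in the bucket
            if existing == key:
                return value
        raise KeyError(key)

    def bucket_sizes(self):
        return [len(b) for b in self._buckets]

table = ChainedMap()
for i in range(100):
    table.set(8 * i, i)                  # adversarial keys: all hash to bucket 0

print(table.bucket_sizes())              # one bucket holds everything
print(table.get(792))                    # 99, found only after a long scan
```

With every key in one bucket, each lookup degrades to a linear scan, which is the catastrophic case the slide warns about.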
  58. Considerations about in-memory representations • will we have biased usage

    frequency or usage patterns? • is there a useful constraint or invariant for the contained elements? • is our application performance sensitive to CPU caches or locality? • will we be using it enough to justify a complex representation? • can an adversary take advantage of catastrophic performance?
  59. Representations for Serialisation / Transmission

  60. Representations for Serialisation / Transmission • schema-free, text-based: JSON, YAML, edn • schema-free, binary: UBJSON / BSON / Smile, MessagePack, SuperPack • schema-directed, binary: Protocol Buffers (protobuffs), Cap'n Proto / FlatBuffers, Apache Avro, Apache Thrift
  61. Serialisation schema-free, text-based • JSON ◦ very limited data model

    ◦ just a formal grammar, no associated semantics ◦ case study: Twitter snowflake bug • YAML ◦ internal references ◦ multi-document streams ◦ don't use yaml • edn ◦ extensibility ◦ coordination between encoder/decoder
  62. Serialisation schema-free, binary • UBJSON / BSON / Smile •

    MessagePack • SuperPack ◦ built-in optimisations ◦ extensibility
  63. Serialisation schema-directed (binary) • schemas describe shape of data • allows us to omit tags ◦ schema-free encoding: tag 1, value 1, tag 2, value 2, tag 3, value 3 ◦ schema-directed encoding: value 1, value 2, value 3 • Protocol Buffers (protobuffs) ◦ limitation: schemas don't support arbitrary nesting, so shape of your data may need to change • Cap'n Proto / FlatBuffers ◦ zero-copy deserialisation
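To illustrate the difference between carrying tags and omitting them, the sketch below encodes the same record as JSON (self-describing) and as a fixed `struct` layout that both sides must agree on. The record, field names, and layout are invented for this example; it is not the wire format of any of the formats named on the slide.

```python
import json
import struct

record = {"id": 7, "score": 2.5, "ok": True}

# Schema-free: field names travel with the data.
schema_free = json.dumps(record).encode("utf-8")

# Schema-directed (hypothetical schema): id uint32, score float64, ok bool.
FIELDS = ("id", "score", "ok")
SCHEMA = struct.Struct("<I d ?")
schema_directed = SCHEMA.pack(*(record[name] for name in FIELDS))

print(len(schema_free), len(schema_directed))

# Decoding the compact form is impossible without the same schema.
decoded = dict(zip(FIELDS, SCHEMA.unpack(schema_directed)))
print(decoded == record)  # True
```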
  64. Deserialisation security • arbitrary code execution ◦ in Ruby and

    Python by default • compression • creating cycles where unexpected ◦ generally, violating any structural invariants • data validation and/or budgeting
  65. Considerations about representations for serialisation • is serialisation / deserialisation

    performance important? • does my data have a fixed schema? • does it need to be canonicalising? • human-readability / editing • portable across different languages? • portable across processes / machines? • exotic types of data? • adversary-controlled? • limitations of underlying transmission protocol • impossible to represent nonsensical structures?
  66. Binary Data

  67. Binary Data typed views • shared buffer abstraction • can

    support reads and writes
  68. Binary Data slices • shared buffer abstraction • lightweight: just

    translates offsets • composable • read/write or read-only • combine with typed views for principled handling of binary data
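Python's `memoryview` offers both ideas over a shared buffer: slicing gives a zero-copy window, and `cast` gives a typed view. This is a sketch that assumes Python 3.8+ for `toreadonly`; the typed view uses native byte order, so the check is written against `sys.byteorder`.

```python
import sys

buf = bytearray(8)        # the shared buffer
whole = memoryview(buf)

tail = whole[4:]          # slice: no copy, offsets are translated
tail[0] = 0xFF
inner = tail[1:3]         # slices compose: this covers buf[5:7]
inner[0] = 0xAA
print(buf[4], buf[5])     # 255 170: writes went through to the buffer

u16 = whole.cast("H")     # typed view: four unsigned 16-bit items
u16[0] = 0x0102
first_two = bytes(buf[0:2])
print(first_two == (0x0102).to_bytes(2, sys.byteorder))  # True

read_only = whole.toreadonly()
try:
    read_only[0] = 1
    write_rejected = False
except TypeError:
    write_rejected = True
print(write_rejected)     # True
```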
  69. Other Common Encodings

  70. Common Encoding: URLs https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment • scheme • user info • host (punycode-encoded) • port • path ◦ path components (percent-encoded) • query ◦ query parameters (percent-encoded) • fragment (percent-encoded)
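The standard `urllib.parse` module splits the example URL into these components and handles percent-encoding. The punycode line uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) on a made-up host name, purely as an illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit

parts = urlsplit("https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment")
print(parts.scheme, parts.username, parts.password, parts.hostname, parts.port)
print(parts.path, parts.query, parts.fragment)

params = parse_qs(parts.query)       # structured access to query parameters
print(params)

# Percent-encoding a single path component, including its "/".
component = quote("a b/c", safe="")
print(component, unquote(component))

# Punycode form of a non-ASCII host name (hypothetical host).
host = "b\u00fccher.example".encode("idna")
print(host)
```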
  71. Other Common Encodings • UUIDs (RFC 4122) ◦ version &

    variant ◦ time, MAC address, user/group ID • User-Agent headers ◦ browser family & version ◦ OS & version ◦ CPU architecture ◦ mobile provider • Dates ◦ Thu Feb 04 2021 10:00:00 GMT-0800 (Pacific Standard Time) ◦ year, month, day ◦ hours, minutes, seconds ◦ day of week ◦ time zone & offset
  72. Parting Notes

  73. Parting Notes • distinguish abstractions from concrete representations in communication

    • try working in abstractions, but be aware of your representations ◦ abstractions overpromise ◦ interface may be possible but with costs that are not evident upfront ◦ easy to forget about potential scenarios in which representation matters • constraints or invariants are opportunities for optimisations • work with structured data through structured interfaces
  74. end

  75. Structured Data: Serialisation Strategies • schema-free ◦ examples: JSON, YAML, MessagePack ◦ shape of data may be unknown ◦ encoded data is self-describing ◦ encoded data may be human-readable ◦ size overhead to describe contained data • schema-directed ◦ examples: Protocol Buffers, Apache Avro, Apache Thrift ◦ shape of data known ahead of time ◦ schema is required for both encoding and decoding ◦ encoded data typically not human-readable ◦ compact encoding
  76. Binary Data base-64 encoding • more convenient representation in certain contexts that prefer working with printable ASCII • 4/3 size ratio: 64^4 = 256^3, so 4 output characters encode 3 input bytes • printable ASCII has at least 64 unique symbols (62 alphanums!) • padding • URL-safe alphabet
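The size ratio, padding, and URL-safe alphabet can all be seen with the standard `base64` module. The inputs are arbitrary examples.

```python
import base64

# Three input bytes map to exactly four output characters.
print(64 ** 4 == 256 ** 3)  # True
full = base64.b64encode(b"Man")
print(full)                 # b'TWFu'

# Inputs that are not a multiple of three bytes are padded with "=".
two_bytes = base64.b64encode(b"Ma")
one_byte = base64.b64encode(b"M")
print(two_bytes, one_byte)  # b'TWE=' b'TQ=='

# The URL-safe alphabet swaps "+" and "/" for "-" and "_".
raw = b"\xfb\xff"
standard = base64.b64encode(raw)
url_safe = base64.urlsafe_b64encode(raw)
print(standard, url_safe)   # b'+/8=' b'-_8='
```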