Data Encodings and Representations

Michael Ficarra

Part 1 of 2
• integers
• real numbers
• text

Integers

Integers: positional numeral systems
• decimal
  ◦ base/radix: 10
  ◦ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• hexadecimal (hex)
  ◦ base/radix: 16
  ◦ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
• binary
  ◦ base/radix: 2
  ◦ 0, 1

Integers: positional numeral systems
Dec: 24   Hex: 18   Bin: 11000
Dec: 41   Hex: 29   Bin: 101001
Dec: 94   Hex: 5E   Bin: 1011110
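The conversions above can be reproduced by repeated division. This is a minimal sketch; the helper name `to_base` is made up for this example and is not from the deck.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2 through 16)."""
    digits = "0123456789ABCDEF"
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the least significant digit
        out.append(digits[remainder])
    return "".join(reversed(out))       # digits were produced low-to-high


for number in (24, 41, 94):
    print(number, to_base(number, 16), to_base(number, 2))
```

Going the other way, Python's built-in `int("5E", 16)` parses a digit string in a given base.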

Integers: fixed width
Problem: computers are finite
Number: 94 (decimal)
• Word size: 8 bits
  0101 1110
• Word size: 16 bits
  0000 0000 0101 1110
• Word size: 32 bits
  0000 0000 0000 0000 0000 0000 0101 1110

Integers: overflow / underflow
Word size: 8 bits
• Number: 254 (decimal)
  1111 1110
• Number: 255 (decimal)
  1111 1111
• Number: 256 (decimal)
  ???? ????
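What happens at 256 depends on the language and hardware: some wrap around, some trap, some leave it undefined. The sketch below models two of those choices for unsigned words. Python integers are arbitrary width, so the word size is simulated with a mask. Both helper names are assumptions made for this example.

```python
def wrap_unsigned(n: int, bits: int) -> int:
    """Keep only the low `bits` bits, as a wrapping unsigned word would."""
    return n & ((1 << bits) - 1)


def checked_add_unsigned(a: int, b: int, bits: int) -> int:
    """Add two non-negative values, raising instead of silently wrapping."""
    total = a + b
    if total >> bits:  # any bit set above the word means it did not fit
        raise OverflowError(f"{a} + {b} does not fit in {bits} bits")
    return total


print(wrap_unsigned(255 + 1, 8))  # wraps around to 0
print(wrap_unsigned(0 - 1, 8))    # underflow wraps to 255
```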

Integers: negatives
Number: -94 (decimal)
Word size: 16 bits
• sign and magnitude:
  1000 0000 0101 1110
• two's complement:
  2^16 - 94 = 65442
  1111 1111 1010 0010
  2^16 - 65442 = 94
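The two's complement arithmetic above can be sketched as a pair of helpers. The function names are hypothetical and the word size is passed in explicitly.

```python
def to_twos_complement(n: int, bits: int) -> int:
    """Return the bit pattern (as an unsigned int) of n in a `bits`-wide word."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= n <= high:
        raise OverflowError(f"{n} does not fit in {bits} signed bits")
    return n & ((1 << bits) - 1)  # same as 2**bits + n when n is negative


def from_twos_complement(pattern: int, bits: int) -> int:
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    if pattern & (1 << (bits - 1)):   # sign bit set: value is negative
        return pattern - (1 << bits)
    return pattern


pattern = to_twos_complement(-94, 16)
print(pattern, format(pattern, "016b"))
print(from_twos_complement(pattern, 16))
```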

Binary Data: byte ordering (aside)
Number: 123456 (decimal)
00000001 11100010 01000000
[ 01 ]   [ E2 ]   [ 40 ]
• Little Endian: 40 E2 01
• Big Endian: 01 E2 40
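A short sketch of the same number in both byte orders, using Python's built-in `int.to_bytes` and `int.from_bytes`. The 3-byte width is chosen only to match the slide.

```python
number = 123456

little = number.to_bytes(3, "little")  # least significant byte first
big = number.to_bytes(3, "big")        # most significant byte first

print(little.hex(" "))
print(big.hex(" "))

# Reading bytes with the wrong byte order yields a different number.
misread = int.from_bytes(big, "little")
print(misread != number)
```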

Integers: arbitrary width (LEB128)
Number: 123456 (decimal)
0001 1110 0010 0100 0000
1. group into 7-bit segments
   0000111 1000100 1000000
2. add continuation bits
   00000111 11000100 11000000
3. little endian
   11000000 11000100 00000111
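The three steps above can be sketched for unsigned values as follows. The function names are assumptions for this example, and signed LEB128 is not covered.

```python
def leb128_encode(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if n < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        group = n & 0x7F           # low 7 bits come first (little endian)
        n >>= 7
        if n:
            out.append(group | 0x80)  # continuation bit: more groups follow
        else:
            out.append(group)         # final group: continuation bit clear
            return bytes(out)


def leb128_decode(data: bytes) -> int:
    """Decode one unsigned LEB128 value from the start of `data`."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated LEB128 value")


encoded = leb128_encode(123456)
print(" ".join(format(b, "08b") for b in encoded))
```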

Integers: in programming languages (literal notation)
• bases
  ◦ decimal: 123456
  ◦ hex: 0x789ABC
  ◦ octal (base 8)
    ▪ leading zero: 0765
    ▪ 0o765
  ◦ binary: 0b110000
• sized literals
  ◦ long: 123456789L
  ◦ long long: 123456789LL
• separators
  ◦ 1_234_567_890
  ◦ 0x4996_02D2
  ◦ 0b00000011_11101000
• BigInt literals in JavaScript
  ◦ 123456789n

Integers: in programming languages (sized types)
• absolutely sized types:
  ◦ uint8_t, int32_t, etc
  ◦ uint_least8_t, int_fast32_t, etc
  ◦ integer types in Java:
    ▪ byte: signed, 8 bit
    ▪ short: signed, 16 bit
    ▪ int: signed, 32 bit
    ▪ long: signed, 64 bit
• platform-relative sized types:
  ◦ short: usually half an int
  ◦ int: usually one word or ½ word
  ◦ long: usually two ints
  ◦ long long: usually two longs
  ◦ word, dword, qword, etc.

Integers: pitfalls
• always consider overflow & underflow when performing arithmetic on fixed-width integers
• do not transfer integers to another machine without accounting for byte ordering

ℝeal Numbers

Real Numbers: inherent loss of precision
• problem:
  ◦ real numbers are uncountably infinite
  ◦ computers are finite
• a solution:
  ◦ choose some rationals to be representable on computers
  ◦ approximate your real number by using one of those values instead
  ◦ logarithmically space them out
    ▪ low magnitude, high precision
    ▪ high magnitude, low precision

Real Numbers: IEEE-754 floats
double precision (binary64)
• all integers with magnitude < 2^53
• some rationals up to 1.79 × 10^308
• rule of thumb: ~15 decimal digits of precision
single precision (binary32)
• all integers with magnitude < 2^24
• some rationals up to 3.40 × 10^38
• rule of thumb: ~7 decimal digits of precision

Real Numbers: IEEE-754 floats
(-1)^sign × 2^(exponent - bias) × 1.fraction
= (-1)^0 × 2^(130 - 127) × 1.25
= 1 × 2^3 × 1.25
= 10
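To check the worked example, the sketch below pulls the three fields out of a single-precision value with the standard `struct` module. The helper names are hypothetical, and the reconstruction only handles normal numbers (not zeroes, subnormals, infinities, or NaNs).

```python
import struct


def decode_binary32(x: float) -> tuple[int, int, int]:
    """Split a binary32 value into (sign, biased exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with bias 127
    fraction = bits & 0x7FFFFF       # 23 fraction bits
    return sign, exponent, fraction


def value_of_normal(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normal binary32 value from its fields."""
    significand = 1 + fraction / 2**23  # implicit leading 1
    return (-1) ** sign * 2.0 ** (exponent - 127) * significand


fields = decode_binary32(10.0)
print(fields, value_of_normal(*fields))
```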

Real Numbers: IEEE-754 oddities
• NaN: not a number
  ◦ exponent all 1, fraction non-0
  ◦ represents computation error
  ◦ 2^53 - 2 different double-precision NaNs
  ◦ generally propagates
  ◦ NaN ≠ NaN
• infinities
  ◦ exponent all 1, fraction all 0
  ◦ positive and negative, based on sign
• zeroes
  ◦ exponent all 0, fraction all 0
  ◦ positive and negative, based on sign
• subnormal numbers
  ◦ exponent all 0, fraction non-0
  ◦ represent the differences between adjacent "normal" numbers
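A few of these oddities can be observed directly. This sketch uses Python floats, which are binary64 on common platforms. The variable names are illustrative only.

```python
import math
import sys

nan = float("nan")
inf = float("inf")

nan_equals_itself = nan == nan                  # NaN compares unequal to itself
nan_propagates = math.isnan(nan + 1.0)          # arithmetic on NaN yields NaN
infinities_differ = inf != -inf                 # the sign bit distinguishes them
zeros_compare_equal = 0.0 == -0.0               # but the two zeroes compare equal
negative_zero_sign = math.copysign(1.0, -0.0)   # the sign is still observable

smallest_normal = sys.float_info.min            # smallest positive normal double
smallest_subnormal = 5e-324                     # smallest positive subnormal double

print(nan_equals_itself, nan_propagates, infinities_differ)
print(zeros_compare_equal, negative_zero_sign)
print(smallest_subnormal < smallest_normal)
```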

Real Numbers: numerical instability
• occurs when precision losses accumulate
• example: summing a list of floats
  ◦ see Kahan's algorithm
• can be re-introduced by naïve compiler optimisations
• detect by changing rounding scheme
  ◦ if round-to-infinity and round-to-negative-infinity differ greatly, it's likely unstable
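A minimal sketch of Kahan's compensated summation next to a naive loop. `math.fsum` from the standard library serves as the reference for the correctly rounded sum. The function names and the test data are assumptions for illustration.

```python
import math


def naive_sum(values):
    """Left-to-right summation; rounding errors accumulate."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0                    # running estimate of lost low-order bits
    for v in values:
        y = v - compensation              # apply the correction to the next term
        t = total + y                     # low-order bits of y may be lost here
        compensation = (t - total) - y    # recover what was lost
        total = t
    return total


values = [0.1] * 100_000
reference = math.fsum(values)
print(abs(naive_sum(values) - reference), abs(kahan_sum(values) - reference))
```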

Real Numbers: alternatives to IEEE-754
• Posits
  ◦ similar goals as IEEE-754, done better
  ◦ shown in studies to be more accurate
  ◦ still relatively new; not widely available in hardware
• Rationals
  ◦ pair of arbitrary width integers
  ◦ very large magnitude numbers have very large representations
• Decimals
  ◦ pair of integers
  ◦ fixed or unlimited precision
• BigFloats
  ◦ specified precision

Real Numbers: in programming languages
• literal notation
  ◦ 1.234 or 123. or .123
  ◦ 1D or 1F
  ◦ 1.2e34
  ◦ ~1.23
• loss of precision in notation
  ◦ 9007199254740993D
  ◦ typically rounds to nearest, ties to even
  ◦ some languages disallow/warn
  ◦ infinity: 2e308 / 4e38

Real Numbers: pitfalls
• do not use floats to represent money
• compare floats using an epsilon
• do not mix very big and very small floats
• avoid accumulating arithmetic precision losses
• consider the appropriate representation before starting
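The first three pitfalls can be demonstrated in a few lines. This sketch uses the standard `math.isclose` for tolerance-based comparison and `decimal.Decimal` as one possible representation for money (integer cents would be another). The variable names are illustrative.

```python
import math
from decimal import Decimal

# Exact equality fails because 0.1, 0.2 and 0.3 are not representable in binary.
exactly_equal = (0.1 + 0.2) == 0.3
close_enough = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# A decimal representation keeps money amounts exact.
price_total = Decimal("0.10") + Decimal("0.20")

# Mixing magnitudes: the small addend is lost entirely.
big_plus_small = 1e16 + 1.0

print(exactly_equal, close_enough)
print(price_total == Decimal("0.30"))
print(big_plus_small == 1e16)
```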

Text the bad old days 7-bit ASCII

Text the bad old days ISO 8859-1 / Windows-1252

Text: the modern Unicode era
• goal: to represent all writing systems of the world
  ◦ includes historical writing systems
  ◦ includes symbols and punctuation
  ◦ includes emoji
  ◦ over 150 scripts supported, representing over 780 languages
• sequences of code points
  ◦ 1,114,112 code points
    ▪ 0 through 0x10FFFF
  ◦ ~138,000 assigned so far (non-PUA)
• split into 17 "planes"
  ◦ 1 basic multilingual plane (BMP)
    ▪ U+0000 through U+FFFF
  ◦ 16 supplementary / "astral" planes
• ASCII compatible
  ◦ overlaps very beginning of BMP

code point allocation (left: all planes, right: BMP in detail)

Unicode Han unification just kidding

Unicode: grapheme clusters
• combining characters:
  ◦ А U+0410 : CYRILLIC CAPITAL LETTER A
  ◦ ҉ U+0489 : COMBINING CYRILLIC MILLIONS SIGN
• ZWJ sequences: 👨‍👨‍👧
  ◦ 👨 U+1F468 : MAN
  ◦ U+200D : ZERO WIDTH JOINER [ZWJ]
  ◦ 👨 U+1F468 : MAN
  ◦ U+200D : ZERO WIDTH JOINER [ZWJ]
  ◦ 👧 U+1F467 : GIRL
• emoji skin tone modifiers: 🤏🏼
  ◦ 🤏 U+1F90F : PINCHING HAND
  ◦ 🏼 U+1F3FC : EMOJI MODIFIER FITZPATRICK TYPE-3 (MEDIUM-LIGHT)

Unicode: grapheme clusters
• variation selectors:
  ◦ ➡ U+27A1 : BLACK RIGHTWARDS ARROW
  ◦ U+FE0F : VARIATION SELECTOR-16
• flags: 🇫🇯
  ◦ 🇫 U+1F1EB : REGIONAL INDICATOR SYMBOL LETTER F
  ◦ 🇯 U+1F1EF : REGIONAL INDICATOR SYMBOL LETTER J

Unicode: decomposition inconsistencies
• ä
  ◦ a U+0061 : LATIN SMALL LETTER A
  ◦ ̈ U+0308 : COMBINING DIAERESIS
• ä
  ◦ ä U+00E4 : LATIN SMALL LETTER A WITH DIAERESIS
• ⁉️
  ◦ ⁉ U+2049 : EXCLAMATION QUESTION MARK
  ◦ U+FE0F : VARIATION SELECTOR-16
• ❓
  ◦ ❓ U+2753 : BLACK QUESTION MARK ORNAMENT

Unicode: decomposition inconsistencies
• Man Health Worker: 👨🏾‍⚕️
  ◦ 👨 U+1F468 : MAN
  ◦ 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK)
  ◦ U+200D : ZERO WIDTH JOINER [ZWJ]
  ◦ ⚕ U+2695 : STAFF OF AESCULAPIUS
  ◦ U+FE0F : VARIATION SELECTOR-16
• Merman: 🧜🏾‍♂️
  ◦ 🧜 U+1F9DC : MERPERSON
  ◦ 🏾 U+1F3FE : EMOJI MODIFIER FITZPATRICK TYPE-5 (MEDIUM-DARK)
  ◦ U+200D : ZERO WIDTH JOINER [ZWJ]
  ◦ ♂ U+2642 : MALE SIGN
  ◦ U+FE0F : VARIATION SELECTOR-16

Unicode: normalisation
• NFD: Normalization Form Canonical Decomposition
  1. decompose by canonical equivalence
  2. sort any combining characters
• NFC: Normalization Form Canonical Composition
  1. decompose by canonical equivalence
  2. recompose by canonical equivalence
• NFKD: Normalization Form Compatibility Decomposition
  1. decompose by compatibility
  2. sort any combining characters
• NFKC: Normalization Form Compatibility Composition
  1. decompose by compatibility
  2. recompose by canonical equivalence
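The four forms are available in Python through the standard `unicodedata.normalize`. The sketch below applies them to the "ä" pair from the earlier slide and to a compatibility character (the "ﬁ" ligature, chosen here only as an example).

```python
import unicodedata

decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00E4"   # LATIN SMALL LETTER A WITH DIAERESIS
ligature = "\uFB01"      # LATIN SMALL LIGATURE FI (compatibility character)

# The two spellings render alike but compare unequal until normalised.
raw_equal = decomposed == precomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# Canonical forms leave the ligature alone; compatibility forms expand it.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(raw_equal, nfc_equal, nfd_equal)
print(nfc_ligature == ligature, nfkc_ligature)
```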

example normalisations

Unicode: case mappings
• four mappings
  ◦ lowercasing
  ◦ uppercasing
  ◦ titlecasing
  ◦ case folding (used for case-insensitive comparisons)
• case mapping is a string-to-string operation
  ◦ string length may change
  ◦ context-dependent mappings
• does not necessarily round-trip
• can be locale sensitive
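Some of these properties show up with Python's built-in string methods. The examples (German "ß" and Greek final sigma) are chosen for this sketch. Python's methods are not locale sensitive, so that last point is not shown.

```python
sharp_s = "\u00DF"                       # ß, LATIN SMALL LETTER SHARP S

# String length may change: one code point uppercases to two.
upper = sharp_s.upper()

# Does not necessarily round-trip: lowercasing gives "ss", not "ß".
round_trip = upper.lower()

# Case folding makes the case-insensitive comparison succeed.
folded_equal = "Stra\u00DFe".casefold() == "STRASSE".casefold()

# Context-dependent mapping: capital sigma lowercases to final sigma (ς)
# at the end of a word and to σ elsewhere.
greek_lower = "\u039F\u0394\u039F\u03A3".lower()   # ΟΔΟΣ

print(upper, round_trip, folded_equal, greek_lower)
```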

Text encoding: 7-bit ASCII
• one byte per character

"ASCII"
41 53 43 49 49
byte values shown in hex notation

Text encoding: Unicode
• UTF-32
  ◦ 32-bit (4 byte) code units
  ◦ 1 code unit per code point
  ◦ extremely wasteful
  ◦ multi-byte, so endianness aware

"waste of 🌌"
w a s t e ␠ o f ␠ 🌌
00 00 00 77  00 00 00 61  00 00 00 73  00 00 00 74  00 00 00 65  00 00 00 20  00 00 00 6F  00 00 00 66  00 00 00 20  00 01 F3 0C
byte values shown in hex notation; big endian byte ordering

Text encoding: Unicode
• UTF-16
  ◦ 16-bit (2 byte) code units
  ◦ 1 code unit for BMP code points
  ◦ 2 code units for non-BMP code points
    ▪ called "surrogate pair"
    ▪ lead from range D800 to DBFF
    ▪ trail from range DC00 to DFFF
    ▪ lone and out-of-order surrogates disallowed

"UTF-16 is 🌟neat"
U T F - 1 6 ␠ i s ␠ 🌟 n e a t
00 55  00 54  00 46  00 2D  00 31  00 36  00 20  00 69  00 73  00 20  D8 3C DF 1F  00 6E  00 65  00 61  00 74
byte values shown in hex notation; big endian byte ordering
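The surrogate pair for 🌟 (U+1F31F) in the byte dump can be derived as follows. This is a sketch with a hypothetical helper name; it subtracts 0x10000 and splits the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into (lead, trail) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    lead = 0xD800 + (offset >> 10)       # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)    # low 10 bits
    return lead, trail


lead, trail = to_surrogate_pair(0x1F31F)
print(format(lead, "04X"), format(trail, "04X"))
```

The result agrees with Python's own encoder: `"\U0001F31F".encode("utf-16-be")` produces the same four bytes.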

Text encoding: Unicode
• UTF-8
  ◦ 8-bit (single byte) code units

min CP    max CP     1st byte  2nd byte  3rd byte  4th byte
U+0000    U+007F     0-------
U+0080    U+07FF     110-----  10------
U+0800    U+FFFF     1110----  10------  10------
U+10000   U+10FFFF   11110---  10------  10------  10------

"UTF-8 is 💯 for ASCII-compatible text!"
U T F - 8 ␠ i s ␠ 💯 ␠ f o r ␠ A S C I I - c o m p a t i b l e ␠ t e x t !
55 54 46 2D 38 20 69 73 20 F0 9F 92 AF 20 66 6F 72 20 41 53 43 49 49 2D 63 6F 6D 70 61 74 69 62 6C 65 20 74 65 78 74 21
byte values shown in hex notation
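The table can be turned into a small encoder for a single code point. This is a sketch only; the function name is an assumption, and real code should use the language's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8, following the table row by row."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point <= 0x7F:                       # 0-------
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110----- 10------
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110---- 10------ 10------
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 11110--- then three 10------
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x1F4AF).hex(" ").upper())
```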

Text / Strings: in programming languages
• escape sequences
  ◦ do not appear literally; represent the referenced code point
  ◦ allows inclusion of code points that are invalid in the string grammar
  ◦ single-character escapes
    ▪ \n, \t, \0, \", \\, etc
  ◦ fixed-width escapes
    ▪ \x00, \u0000
  ◦ bracketed escapes
    ▪ \u{1F42B}
  ◦ dynamic-width escapes & null escape
    ▪ \uE50A\&

Text / Strings: in programming languages
• NUL termination
  ◦ UTF-8 overlong encoding
• "length" could mean
  ◦ number of bytes
  ◦ number of code units
  ◦ number of code points
  ◦ number of grapheme clusters
• indexing/iteration could operate on any of the above
  ◦ not dependent on representation, just more/less performant
• byte-order mark: U+FEFF
  ◦ an ignored code point at the beginning of a string encoding
  ◦ will be read as U+FFFE if the byte order (endianness) is incorrect
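The different meanings of "length" can be measured on a single grapheme cluster, here the family ZWJ sequence from the earlier slide. The sketch also shows the byte-order mark being misread. Counting grapheme clusters needs a Unicode segmentation library, which the Python standard library does not include, so it is left out.

```python
# MAN, ZWJ, MAN, ZWJ, GIRL: one grapheme cluster when rendered.
family = "\U0001F468\u200D\U0001F468\u200D\U0001F467"

byte_count = len(family.encode("utf-8"))                # UTF-8 bytes
code_unit_count = len(family.encode("utf-16-be")) // 2  # UTF-16 code units
code_point_count = len(family)                          # Python counts code points

# A big endian BOM decoded as little endian comes out byte-swapped.
bom_big_endian = "\uFEFF".encode("utf-16-be")
misread = bom_big_endian.decode("utf-16-le")

print(byte_count, code_unit_count, code_point_count)
print(bom_big_endian.hex(" "), format(ord(misread), "04X"))
```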

Text: character encoding communication on the web
• Accept-Charset HTTP request header
  ◦ Accept-Charset: utf-8, iso-8859-1;q=0.5
• Content-Type HTTP request header
  ◦ Content-Type: multipart/form-data; boundary=...
• Content-Type HTTP response header
  ◦ Content-Type: text/html; charset=UTF-8
• <meta charset=utf-8>
  ◦ HTML character encoding detection algorithm

Strings: performance & optimisations
• variable encodings
  ◦ e.g. ISO-8859-1 when possible, UTF-16 otherwise
  ◦ requires tagging; concatenation could trigger re-encoding
• slices
  ◦ zero-copy "view" of substring
  ◦ requires immutable strings
• ropes
  ◦ binary tree of strings
  ◦ good for frequently-edited long strings

Strings: pitfalls
• do not "convert" a string to/from bytes without explicitly specifying the encoding
• do not iterate, index, or take the length of a string without considering the units
• normalise strings before comparison or hashing
• use case folding for case-insensitive comparisons
• do not NUL-terminate strings whose encoding may contain NUL
• always know the "type" of the data in your strings
• avoid strings whenever possible

Part 2 of 2
• structured data
  ◦ in-memory representations
    ▪ sequences
    ▪ records/structs/tuples
    ▪ maps
  ◦ serialisation or transmission
    ▪ schema-free
    ▪ schema-directed
  ◦ parsing
• binary data
  ◦ typed views
  ◦ slices
• other useful & common encodings
• parting notes

Recall from part 1
concept: strategies for representation on computers
• integers
  ◦ fixed-width / machine integers
  ◦ two's complement
  ◦ VLQs / big integers
  ◦ ...
• real numbers
  ◦ IEEE-754 floats
  ◦ Posits
  ◦ ...
• text
  ◦ Code pages: ASCII, ISO-8859-1
  ◦ Unicode
  ◦ Unicode encodings: UTF-8, UTF-16, UTF-32
  ◦ Ropes, slices, mixed encodings
  ◦ ...

Structured Data

Structured Data: examples
• generic collections
  ◦ sequences (arrays / lists)
  ◦ bags & sets
  ◦ records / structs / tuples
  ◦ maps & multi-maps
• domain-specific structures
  ◦ dates
  ◦ URLs
  ◦ postal addresses

In-memory Representations

Sequences, Bags, & Sets: representations
• arrays
  ◦ contiguous in memory
  ◦ fast indexing
  ◦ insert / remove via copying only
• linked lists
  ◦ slow indexing
  ◦ mutable: fast insert, remove
  ◦ immutable: shared suffixes, cons, drop

Bags: representations, continued
• binary heaps
  ◦ binary tree
  ◦ constraint: elements have an ordering
  ◦ fast access to smallest element
  ◦ heap property: parent ≤ children
  ◦ binary trees can be efficiently laid out in memory (left: 2n+1, right: 2n+2)
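The array layout and the heap property can be checked with a few lines. This sketch uses the standard `heapq` module, which stores a min-heap in a list with the same index arithmetic. The checker function is a hypothetical helper written for this example.

```python
import heapq


def is_min_heap(array: list) -> bool:
    """Check parent <= children, with children of index n at 2n+1 and 2n+2."""
    for n in range(len(array)):
        for child in (2 * n + 1, 2 * n + 2):
            if child < len(array) and array[n] > array[child]:
                return False
    return True


items = [5, 3, 8, 1, 9, 2]
heapq.heapify(items)                 # rearranges the list in place into a heap
print(items[0], is_min_heap(items))  # the smallest element sits at index 0

drained = [heapq.heappop(items) for _ in range(len(items))]
print(drained)                       # popping repeatedly yields sorted order
```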

Records: structs, tuples, same thing
• laid out contiguously (like array)
• compiler translates fields to offsets
  ◦ packed
  ◦ unpacked (word aligned)

{ a: 12345, b: 'p', c: true, }
( 12345, 'p', true )

packed:   | 12345 | 'p' | true |
unpacked: | 12345 | 'p' | padding | true | padding |
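The two layouts can be imitated with the standard `struct` module. The format strings are assumptions for this sketch: a 4-byte integer, a 1-byte character, and a 1-byte boolean, with explicit pad bytes (`x`) standing in for word alignment on a machine with 4-byte words.

```python
import struct

PACKED = struct.Struct("<ic?")        # fields back to back, no padding
ALIGNED = struct.Struct("<ic3x?3x")   # each field padded out to a 4-byte word

record = (12345, b"p", True)

packed_bytes = PACKED.pack(*record)
aligned_bytes = ALIGNED.pack(*record)

print(PACKED.size, packed_bytes.hex(" "))
print(ALIGNED.size, aligned_bytes.hex(" "))

# Field offsets differ between layouts, but both decode to the same record.
print(PACKED.unpack(packed_bytes) == ALIGNED.unpack(aligned_bytes))
```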

Maps: representations
• unlike records, keys exist at runtime
  ◦ key must be present in representation
• simple & slow: any representation as sequence of pairs

[ ("key 1", "value 1"), ("key 2", "value 2"), ... ]

Maps: hash table representation
• keys must be hashable
• digest size = number of "buckets"
• inevitable hash collisions
• collision resolution: matching bucket searched for correct key
• possibly catastrophic performance
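A minimal sketch of a hash table with per-bucket search (separate chaining), assuming Python's built-in `hash` as the hash function. The class and method names are made up for this example. With a single bucket every key collides and each lookup degrades to a linear scan, which is the catastrophic case.

```python
class ChainedHashMap:
    """Toy hash map: a fixed number of buckets, each a list of (key, value)."""

    def __init__(self, bucket_count: int = 8):
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket_for(self, key):
        # Reduce the hash to a bucket index; distinct keys may collide.
        return self._buckets[hash(key) % len(self._buckets)]

    def set(self, key, value) -> None:
        bucket = self._bucket_for(key)
        for index, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[index] = (key, value)  # overwrite existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        # Collision resolution: search the matching bucket for the key.
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashMap(bucket_count=1)  # worst case: everything collides
table.set("key 1", "value 1")
table.set("key 2", "value 2")
print(table.get("key 2"))
```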

Considerations about in-memory representations
• will we have biased usage frequency or usage patterns?
• is there a useful constraint or invariant for the contained elements?
• is our application performance sensitive to CPU caches or locality?
• will we be using it enough to justify a complex representation?
• can an adversary take advantage of catastrophic performance?

Representations for Serialisation / Transmission

Representations for Serialisation / Transmission
• schema-free, text-based
  ◦ JSON
  ◦ YAML
  ◦ edn
• schema-free, binary
  ◦ UBJSON / BSON / Smile
  ◦ MessagePack
  ◦ SuperPack
• schema-directed, binary
  ◦ Protocol Buffers (protobuf)
  ◦ Cap'n Proto / FlatBuffers
  ◦ Apache Avro
  ◦ Apache Thrift

Serialisation: schema-free, text-based
• JSON
  ◦ very limited data model
  ◦ just a formal grammar, no associated semantics
  ◦ case study: Twitter snowflake bug
• YAML
  ◦ internal references
  ◦ multi-document streams
  ◦ don't use YAML
• edn
  ◦ extensibility
  ◦ coordination between encoder/decoder

Serialisation: schema-free, binary
• UBJSON / BSON / Smile
• MessagePack
• SuperPack
  ◦ built-in optimisations
  ◦ extensibility

Serialisation: schema-directed (binary)
• schemas describe shape of data
• allows us to omit tags
  ◦ schema-free encoding: tag 1, value 1, tag 2, value 2, tag 3, value 3
  ◦ schema-directed encoding: value 1, value 2, value 3
• Protocol Buffers (protobuf)
  ◦ limitation: schemas don't support arbitrary nesting, so shape of your data may need to change
• Cap'n Proto / FlatBuffers
  ◦ zero-copy deserialisation
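The effect of omitting tags can be illustrated by encoding the same record twice: once as self-describing JSON, and once as bare values whose order and types come from a schema known to both sides. The record and the `struct` format used as a "schema" here are assumptions for this sketch. This is not the wire format of Protocol Buffers or any other listed system.

```python
import json
import struct

record = {"id": 7, "score": 2.5, "active": True}

# Schema-free: field names (tags) travel with every value.
schema_free = json.dumps(record).encode("utf-8")

# Schema-directed: both sides agree on field order and types up front.
FIELDS = ("id", "score", "active")
SCHEMA = struct.Struct("<Id?")   # uint32, float64, bool; hypothetical schema
schema_directed = SCHEMA.pack(*(record[name] for name in FIELDS))

# Decoding needs the schema to give the bare values their names back.
decoded = dict(zip(FIELDS, SCHEMA.unpack(schema_directed)))

print(len(schema_free), len(schema_directed))
print(decoded == record)
```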

Deserialisation: security
• arbitrary code execution
  ◦ in Ruby and Python by default
• compression
• creating cycles where unexpected
  ◦ generally, violating any structural invariants
• data validation and/or budgeting

Considerations about representations for serialisation
• is serialisation / deserialisation performance important?
• does my data have a fixed schema?
• does it need to be canonicalising?
• human-readability / editing
• portable across different languages?
• portable across processes / machines?
• exotic types of data?
• adversary-controlled?
• limitations of underlying transmission protocol
• impossible to represent nonsensical structures?

Binary Data

Binary Data: typed views
• shared buffer abstraction
• can support reads and writes

Binary Data: slices
• shared buffer abstraction
• lightweight: just translates offsets
• composable
• read/write or read-only
• combine with typed views for principled handling of binary data
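Python's built-in `memoryview` offers both ideas over one shared buffer, so it works as a small sketch of typed views and slices. The buffer size and variable names are arbitrary choices for this example.

```python
buffer = bytearray(8)              # the shared underlying storage
view = memoryview(buffer)

# Typed view: the same bytes seen as four unsigned 16-bit integers.
as_uint16 = view.cast("H")
as_uint16[1] = 0xFFFF              # writes bytes 2 and 3 of the buffer

# Slice: a zero-copy window that only translates offsets.
window = view[2:6]
window[0] = 0x00                   # writes through to buffer[2]

# Read-only view over the same storage.
read_only = view.toreadonly()
try:
    read_only[0] = 1
    write_rejected = False
except TypeError:
    write_rejected = True

print(buffer.hex(" "), write_rejected)
```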

Other Common Encodings

Common Encoding: URLs
https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment
• scheme: https
• user info: user:pass
• host (punycode-encoded): example.org
• port: 443
• path: /a/b/c
  ◦ path components (percent-encoded): a, b, c
• query: x=0&y=1&z=2
  ◦ query parameters (percent-encoded): x=0, y=1, z=2
• fragment (percent-encoded): fragment
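The same decomposition can be obtained from Python's standard `urllib.parse`. The sketch also shows percent-encoding and punycode on inputs made up for this example. The built-in `idna` codec implements the older IDNA 2003 rules.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit

url = "https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment"
parts = urlsplit(url)

print(parts.scheme, parts.username, parts.password, parts.hostname, parts.port)
print(parts.path.split("/")[1:], parse_qs(parts.query), parts.fragment)

# Percent-encoding: reserved or non-ASCII characters become %XX byte escapes.
encoded_component = quote("a b/c", safe="")
decoded_component = unquote(encoded_component)

# Punycode: a non-ASCII host label is carried as ASCII with an xn-- prefix.
ascii_host = "b\u00FCcher.example".encode("idna")

print(encoded_component, decoded_component, ascii_host)
```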

Other Common Encodings
• UUIDs (RFC 4122)
  ◦ version & variant
  ◦ time, MAC address, user/group ID
• User-Agent headers
  ◦ browser family & version
  ◦ OS & version
  ◦ CPU architecture
  ◦ mobile provider
• Dates
  ◦ Thu Feb 04 2021 10:00:00 GMT-0800 (Pacific Standard Time)
  ◦ year, month, day
  ◦ hours, minutes, seconds
  ◦ day of week
  ◦ time zone & offset

Parting Notes

Parting Notes
• distinguish abstractions from concrete representations in communication
• try working in abstractions, but be aware of your representations
  ◦ abstractions overpromise
  ◦ interface may be possible but with costs that are not evident upfront
  ◦ easy to forget about potential scenarios in which representation matters
• constraints or invariants are opportunities for optimisations
• work with structured data through structured interfaces

Structured Data: Serialisation Strategies

schema-free
• examples: JSON, YAML, MessagePack
• shape of data may be unknown
• encoded data is self-describing
• encoded data may be human-readable
• size overhead to describe contained data

schema-directed
• examples: Protocol Buffers, Apache Avro, Apache Thrift
• shape of data known ahead of time
• schema is required for both encoding and decoding
• encoded data typically not human-readable
• compact encoding

Binary Data: base-64 encoding
• more convenient representation in certain contexts that prefer working with printable ASCII
• 4/3 size ratio: 64^4 = 256^3
• printable ASCII has at least 64 unique symbols (62 alphanumerics!)
• padding
• URL-safe alphabet
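The 4/3 ratio, the padding, and the URL-safe alphabet can all be seen with the standard `base64` module. The input bytes are arbitrary examples.

```python
import base64

# Three input bytes become four output symbols (64^4 = 256^3).
three_bytes = base64.b64encode(b"Man")

# Inputs that are not a multiple of three bytes are padded with "=".
two_bytes = base64.b64encode(b"Ma")
one_byte = base64.b64encode(b"M")

# The standard alphabet uses "+" and "/"; the URL-safe one uses "-" and "_".
standard = base64.b64encode(b"\xfb\xff")
url_safe = base64.urlsafe_b64encode(b"\xfb\xff")

print(three_bytes, two_bytes, one_byte)
print(standard, url_safe)
```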
