Integers
in programming languages
(sized types)
● absolutely sized types:
○ uint8_t, int32_t, etc
○ uint_least8_t, int_fast32_t,
etc
○ integer types in Java:
■ byte: signed, 8 bit
■ short: signed, 16 bit
■ int: signed, 32 bit
■ long: signed, 64 bit
● platform-relative sized types:
○ short: usually half an int
○ int: usually one word or ½ word
○ long: usually two ints
○ long long: usually two longs
○ word, dword, qword, etc.
Slide 13
Integers
pitfalls
● always consider overflow &
underflow when performing
arithmetic on fixed-width integers
● do not transfer integers to
another machine without
accounting for byte ordering
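A minimal Python sketch of both pitfalls. Python's own integers are arbitrary-width, so the helper `wrap_int32` (a hypothetical name, not from the slides) imitates 32-bit two's-complement wraparound by masking; `struct` is used to make the byte order explicit.

```python
import struct

def wrap_int32(n: int) -> int:
    """Reduce n to the value a signed 32-bit two's-complement register would hold."""
    n &= 0xFFFFFFFF                       # keep only the low 32 bits
    return n - (1 << 32) if n & 0x80000000 else n

INT32_MAX = 2**31 - 1
overflowed = wrap_int32(INT32_MAX + 1)    # wraps around to the most negative value

# The same integer serialised with two explicit byte orders.
big = struct.pack(">I", 0x01020304)       # big endian (network order)
little = struct.pack("<I", 0x01020304)    # little endian

# Reading big-endian bytes as little endian silently yields a different number.
misread = struct.unpack("<I", big)[0]
```

Writing the byte order into the format string (`>` or `<`) rather than relying on the native order is what makes the bytes safe to send to another machine.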
Slide 14
ℝeal Numbers
Slide 15
Real Numbers
inherent loss of precision
● problem:
○ real numbers are uncountably infinite
○ computers are finite
● a solution:
○ choose some rationals to be
representable on computers
○ approximate your real number by
using one of those values instead
○ logarithmically space them out
■ low magnitude, high precision
■ high magnitude, low precision
Slide 16
Real Numbers
IEEE-754 floats
double precision (binary64)
● all integers with magnitude < 2^53
● some rationals up to 1.79 × 10^308
● rule of thumb: ~15 decimal digits of precision
single precision (binary32)
● all integers with magnitude < 2^24
● some rationals up to 3.40 × 10^38
● rule of thumb: ~7 decimal digits of precision
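The integer limits of the two formats can be checked directly. This is a sketch, assuming CPython's `float` is binary64 (true on mainstream platforms); `to_binary32` is a hypothetical helper that rounds to single precision by round-tripping through `struct`.

```python
import struct
import sys

def to_binary32(x: float) -> float:
    """Round x to single precision by packing it as a 4-byte float and unpacking."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# binary64: 2**53 + 1 is the first positive integer with no exact representation.
collapsed64 = float(2**53 + 1) == float(2**53)

# binary32: the same happens at 2**24 + 1.
collapsed32 = to_binary32(2**24 + 1) == float(2**24)

largest_double = sys.float_info.max     # largest finite binary64 value
decimal_digits = sys.float_info.dig     # decimal digits that always survive a round trip
```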
Real Numbers
IEEE-754 oddities
● NaN: not a number
○ exponent all 1, fraction non-0
○ represents computation error
○ 2^53 − 2 different double-precision NaNs
○ generally propagates
○ NaN ≠ NaN
● infinities
○ exponent all 1, fraction all 0
○ positive and negative, based on sign
● zeroes
○ exponent all 0, fraction all 0
○ positive and negative, based on sign
● subnormal numbers
○ exponent all 0, fraction non-0
○ fill the gap between zero and the smallest
"normal" number, so the differences between
adjacent small "normal" numbers stay representable
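These special values can be inspected from Python. The sketch below assumes binary64 floats; `bits64` is a hypothetical helper that splits a value into its sign, exponent and fraction bit fields.

```python
import math
import struct

def bits64(x: float) -> str:
    """Return the sign|exponent|fraction bit fields of a binary64 value."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    b = f"{raw:064b}"
    return f"{b[0]}|{b[1:12]}|{b[12:]}"

nan = float("nan")
nan_is_unequal = nan != nan                 # NaN compares unequal even to itself
nan_propagates = math.isnan(nan + 1.0)      # arithmetic on NaN yields NaN

inf_sign, inf_exponent, inf_fraction = bits64(math.inf).split("|")

zeros_equal = 0.0 == -0.0                   # equal, yet the sign bits differ
neg_zero_sign = bits64(-0.0)[0]

smallest_subnormal = 5e-324                 # parses to the smallest positive subnormal
sub_exponent = bits64(smallest_subnormal).split("|")[1]
```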
Slide 19
Real Numbers
numerical instability
● occurs when precision losses
accumulate
● example: summing a list of floats
○ see Kahan's algorithm
● can be re-introduced by naïve
compiler optimisations
● detect by changing rounding
scheme
○ if round-toward-positive-infinity and
round-toward-negative-infinity results differ
greatly, it's likely unstable
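A sketch of Kahan's compensated summation, compared against a plain left-to-right loop. The function names and test data are illustrative; `math.fsum` serves as the accurate reference. The naive loop is written out by hand because recent CPython versions already compensate inside the built-in `sum`.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation; rounding errors pile up."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Sum floats while carrying a running compensation for lost low-order bits."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation              # apply the correction from the last step
        t = total + y                     # low-order bits of y may be lost here
        compensation = (t - total) - y    # recover what was just lost
        total = t
    return total

data = [0.1] * 100_000
reference = math.fsum(data)
naive_error = abs(naive_sum(data) - reference)
kahan_error = abs(kahan_sum(data) - reference)
```

Note that the line computing `compensation` is algebraically zero, which is why an optimiser that reorders floating-point arithmetic can undo the algorithm.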
Slide 20
Real Numbers
alternatives to IEEE-754
● Posits
○ similar goals as IEEE-754, done better
○ shown in studies to be more accurate
○ still relatively new; not widely
available in hardware
● Rationals
○ pair of arbitrary width integers
○ very large magnitude numbers have
very large representations
● Decimals
○ pair of integers
○ fixed or unlimited precision
● BigFloats
○ specified precision
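Two of these alternatives ship with Python's standard library, which makes for a quick illustration (posits do not, and are left out). The values below are arbitrary examples.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

# Rationals: a pair of arbitrary-width integers, exact under + - * /.
third = Fraction(1, 3)
exact_one = third + third + third == 1

# The representation grows with the computation, not just with the magnitude.
harmonic = sum(Fraction(1, n) for n in range(1, 30))
denominator_digits = len(str(harmonic.denominator))

# Decimals: an integer coefficient plus a power-of-ten exponent.
price = Decimal("0.10") + Decimal("0.20")

# Specified precision: here 50 significant digits for one computation.
with localcontext() as ctx:
    ctx.prec = 50
    root = Decimal(2).sqrt()
```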
Slide 21
Real Numbers
in programming languages
● literal notation
○ 1.234 or 123. or .123
○ 1D or 1F
○ 1.2e34
○ ~1.23
● loss of precision in notation
○ 9007199254740993D (2^53 + 1) has no exact
double representation
○ typically rounds to nearest, ties to
even
○ some languages disallow/warn
○ out-of-range literals become infinity:
2e308 (double) / 4e38 (single)
Slide 22
Real Numbers
pitfalls
● do not use floats to represent
money
● compare floats using an epsilon
● do not mix very big and very small
floats
● avoid accumulating arithmetic
precision losses
● consider the appropriate
representation before starting
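A short sketch of three of these pitfalls in Python. The tolerance and the amounts are arbitrary choices for illustration, not recommendations from the slides.

```python
import math
from decimal import Decimal

# Compare with a tolerance instead of ==.
a = 0.1 + 0.2
equal_exactly = a == 0.3
equal_within_eps = math.isclose(a, 0.3, rel_tol=1e-9)

# Mixing magnitudes: the small addend is absorbed entirely.
absorbed = (1e16 + 1.0) == 1e16

# Money: keep exact decimal amounts (or integer cents), not binary floats.
total = sum([Decimal("19.99")] * 3, Decimal("0"))
```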
Slide 23
Text
Slide 24
Text
the bad old days
7-bit ASCII
Slide 25
Text
the bad old days
ISO 8859-1 / Windows-1252
Slide 26
Text
the modern Unicode era
● goal: to represent all writing
systems of the world
○ includes historical writing systems
○ includes symbols and punctuation
○ includes emoji
○ over 150 scripts supported,
representing over 780 languages
● sequences of code points
○ 1,114,112 code points
■ 0 through 0x10FFFF
○ ~138,000 assigned so far (non-PUA)
● split into 17 "planes"
○ 1 basic multilingual plane (BMP)
■ U+0000 through U+FFFF
○ 16 supplementary / "astral" planes
● ASCII compatible
○ overlaps very beginning of BMP
Slide 27
code point allocation (left: all planes, right: BMP in detail)
Slide 28
Unicode
Han unification
just kidding
Slide 29
Unicode
grapheme clusters
● combining characters:
○ А U+0410 : CYRILLIC CAPITAL LETTER A
○ ҉ U+0489 : COMBINING CYRILLIC
MILLIONS SIGN
● ZWJ sequences:
○ 👨 U+1F468 : MAN
○ U+200D : ZERO WIDTH JOINER [ZWJ]
○ 👨 U+1F468 : MAN
○ U+200D : ZERO WIDTH JOINER [ZWJ]
○ 👧 U+1F467 : GIRL
● emoji skin tone modifiers:
○ 🤏 U+1F90F : PINCHING HAND
○ 🏼 U+1F3FC : EMOJI MODIFIER
FITZPATRICK TYPE-3 (MEDIUM-LIGHT)
Slide 30
Unicode
grapheme clusters
● variation selectors:
○ ➡ U+27A1 : BLACK RIGHTWARDS
ARROW
○ U+FE0F : VARIATION SELECTOR-16
● flags:
○ 🇫 U+1F1EB : REGIONAL INDICATOR
SYMBOL LETTER F
○ 🇯 U+1F1EF : REGIONAL INDICATOR
SYMBOL LETTER J
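Each sequence on the grapheme-cluster slides renders as one user-perceived character but consists of several code points. Python's standard library has no grapheme segmentation, so this sketch only shows that `len` counts code points; the variable names are illustrative.

```python
import unicodedata

family = "\U0001F468\u200D\U0001F468\u200D\U0001F467"   # man, ZWJ, man, ZWJ, girl
flag = "\U0001F1EB\U0001F1EF"                           # regional indicators F + J
combined = "\u0410\u0489"                               # Cyrillic A + combining millions sign

code_points = {"family": len(family), "flag": len(flag), "combined": len(combined)}

flag_names = [unicodedata.name(ch) for ch in flag]
mark_category = unicodedata.category("\u0489")          # general category of the combining sign
```

Counting or splitting by grapheme cluster needs the segmentation rules from the Unicode standard, typically via a third-party library.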
Slide 31
Unicode
decomposition inconsistencies
● ä
○ a U+0061 : LATIN SMALL LETTER A
○ ̈ U+0308 : COMBINING DIAERESIS
● ä
○ ä U+00E4 : LATIN SMALL LETTER A
WITH DIAERESIS
● ⁉️
○ ⁉ U+2049 : EXCLAMATION QUESTION
MARK
○ U+FE0F : VARIATION SELECTOR-16
● ❓
○ ❓ U+2753 : BLACK QUESTION MARK
ORNAMENT
Unicode normalisation
● NFD: Normalization Form Canonical
Decomposition
1. decompose by canonical equivalence
2. sort any combining characters
● NFC: Normalization Form Canonical
Composition
1. decompose by canonical equivalence
2. recompose by canonical equivalence
● NFKD: Normalization Form
Compatibility Decomposition
1. decompose by compatibility
2. sort any combining characters
● NFKC: Normalization Form
Compatibility Composition
1. decompose by compatibility
2. recompose by canonical equivalence
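The four forms are available through Python's `unicodedata.normalize`. A minimal sketch using the ä example from the decomposition slide, plus a ligature (chosen here as an illustration) to show where the compatibility forms differ:

```python
import unicodedata

decomposed = "a\u0308"       # a + COMBINING DIAERESIS
precomposed = "\u00E4"       # ä as a single code point

differ_before = decomposed != precomposed
nfc = unicodedata.normalize("NFC", decomposed)     # composes to U+00E4
nfd = unicodedata.normalize("NFD", precomposed)    # decomposes to a + U+0308

# Compatibility forms additionally replace "compatibility" characters.
ligature = "\uFB01"          # LATIN SMALL LIGATURE FI
nfkc = unicodedata.normalize("NFKC", ligature)
nfc_of_ligature = unicodedata.normalize("NFC", ligature)
```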
Slide 34
example normalisations
Slide 35
Unicode
case mappings
● four mappings
○ lowercasing
○ uppercasing
○ titlecasing
○ case folding (used for case-insensitive
comparisons)
● case mapping is a string-to-string
operation
○ string length may change
○ context-dependent mappings
● does not necessarily round-trip
● can be locale sensitive
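Several of these properties show up in Python's built-in string methods. The example strings are illustrative; note that Python's methods are locale-independent, so locale-sensitive mappings (such as Turkish dotless i) are not demonstrated.

```python
sharp_s = "\u00DF"                           # ß
upper = sharp_s.upper()                      # one code point becomes two
round_trip = upper.lower()                   # does not return to ß

# Context-dependent: capital sigma lowercases differently at the end of a word.
odos = "\u039F\u0394\u039F\u03A3".lower()    # ΟΔΟΣ

# Titlecasing is its own mapping: ǆ (U+01C6) titlecases to ǅ (U+01C5).
dz_title = "\u01C6".title()
dz_upper = "\u01C6".upper()

def caseless_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison via case folding (no normalisation applied)."""
    return a.casefold() == b.casefold()
```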
Slide 36
Text
encoding 7-bit ASCII
● one byte per character
"ASCII"
41 53 43 49 49
byte values shown in hex notation
Slide 37
Text
encoding Unicode
● UTF-32
○ 32-bit (4 byte) code units
○ 1 code unit per code point
○ extremely wasteful
○ multi-byte code units, so byte order (endianness) matters
"waste of 🌌"
code units: w a s t e ␠ o f ␠ 🌌
00 00 00 77 00 00 00 61
00 00 00 73 00 00 00 74
00 00 00 65 00 00 00 20
00 00 00 6F 00 00 00 66
00 00 00 20 00 01 F3 0C
byte values shown in hex notation;
big endian byte ordering
Slide 39
Text
encoding Unicode
● UTF-16
○ 16-bit (2 byte) code units
○ 1 code unit for BMP code points
○ 2 code units for non-BMP code points
■ called "surrogate pair"
■ lead from range D800 to DBFF
■ trail from range DC00 to DFFF
■ lone and out-of-order
surrogates disallowed
"UTF-16 is 🌟neat"
code units (🌟 spans two): U T F - 1 6 ␠ i s ␠ 🌟 n e a t
00 55 00 54 00 46 00 2D
00 31 00 36 00 20 00 69
00 73 00 20 D8 3C DF 1F
00 6E 00 65 00 61 00 74
byte values shown in hex notation;
big endian byte ordering
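The `D8 3C DF 1F` bytes for 🌟 (U+1F31F) follow from the surrogate-pair arithmetic. A sketch, with `to_surrogate_pair` as a hypothetical helper name, checked against Python's own UTF-16 encoder:

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into UTF-16 lead and trail code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    lead = 0xD800 + (offset >> 10)         # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # low 10 bits
    return lead, trail

lead, trail = to_surrogate_pair(0x1F31F)   # 🌟
encoded = "\U0001F31F".encode("utf-16-be")
```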
Slide 41
Text
encoding Unicode
● UTF-8
○ 8-bit (single byte) code units
min CP    max CP     1st byte  2nd byte  3rd byte  4th byte
U+0000    U+007F     0-------
U+0080    U+07FF     110-----  10------
U+0800    U+FFFF     1110----  10------  10------
U+10000   U+10FFFF   11110---  10------  10------  10------
"UTF-8 is 💯 for ASCII-compatible text!"
U T F - 8 ␠ i s ␠ 💯 ␠ f o r ␠ A S C I I - c o m p a t i b l e ␠ t e x t !
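The four rows of the table above translate almost directly into code. This is a sketch of an encoder for a single code point (`utf8_encode` is a hypothetical name); in practice `str.encode("utf-8")` does this job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the four rows of the UTF-8 table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:                                  # 0-------
        return bytes([code_point])
    if code_point <= 0x7FF:                                 # 110----- 10------
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                                # 1110---- 10------ 10------
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    return bytes([0b11110000 | (code_point >> 18),          # 11110--- plus three 10------
                  0b10000000 | ((code_point >> 12) & 0x3F),
                  0b10000000 | ((code_point >> 6) & 0x3F),
                  0b10000000 | (code_point & 0x3F)])

sample = "UTF-8 is \U0001F4AF for ASCII-compatible text!"
encoded = b"".join(utf8_encode(ord(ch)) for ch in sample)
```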
Text / Strings
in programming languages
● escape sequences
○ do not appear literally; represent the
referenced code point
○ allows inclusion of code points that
are invalid in the string grammar
○ single-character escapes
■ \n, \t, \0, \", \\, etc
○ fixed-width escapes
■ \x00, \u0000
○ bracketed escapes
■ \u{1F42B}
○ dynamic-width escapes & null escape
■ \uE50A\&
Slide 44
Text / Strings
in programming languages
● NUL termination
○ UTF-8 overlong encoding
● "length" could mean
○ number of bytes
○ number of code units
○ number of code points
○ number of grapheme clusters
● indexing/iteration could operate
on any of the above
○ not dependent on representation,
just more/less performant
● byte-order mark: U+FEFF
○ an ignored code point at the
beginning of a string encoding
○ will be read as U+FFFE (a noncharacter) if the
byte order (endianness) is incorrect
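The different meanings of "length", the byte-order mark, and the overlong NUL can all be seen with Python's codecs. A sketch with an arbitrary example string; grapheme clusters are omitted because the standard library cannot count them.

```python
import codecs

text = "nai\u0308ve \U0001F31F"      # "naïve 🌟" with a decomposed ï

lengths = {
    "utf8_bytes": len(text.encode("utf-8")),
    "utf16_code_units": len(text.encode("utf-16-be")) // 2,
    "code_points": len(text),
}

# The generic "utf-16" decoder uses a leading BOM to pick the byte order, then drops it.
with_bom = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
decoded = with_bom.decode("utf-16")

# The overlong two-byte form of NUL (C0 80) is rejected by a strict UTF-8 decoder.
try:
    b"\xC0\x80".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
```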
Slide 45
Text
character encoding
communication on the web
● Accept-Charset HTTP request
header
○ Accept-Charset: utf-8,
iso-8859-1;q=0.5
● Content-Type HTTP request
header
○ Content-Type:
multipart/form-data;
boundary=...
● Content-Type HTTP response
header
○ Content-Type: text/html;
charset=UTF-8
●
○ HTML character encoding detection
algorithm
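On the receiving side, the declared charset has to be extracted from the `Content-Type` header before decoding the body. A sketch using the standard library's header parser; the helper name and the UTF-8 fallback are assumptions made for this example, not part of any specification.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, or a fallback."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)

declared = charset_from_content_type("text/html; charset=UTF-8")
fallback = charset_from_content_type("multipart/form-data; boundary=abc")

body = "caf\u00e9".encode(declared)
text = body.decode(declared)         # decode with the declared encoding, never a guess
```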
Slide 46
Strings
performance & optimisations
● variable encodings
○ e.g. ISO-8859-1 when possible,
UTF-16 otherwise
○ requires tagging; concatenation could
trigger re-encoding
● slices
○ zero-copy "view" of substring
○ requires immutable strings
● ropes
○ binary tree of strings
○ good for frequently-edited long
strings
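A rope can be sketched as a binary tree whose leaves hold string fragments and whose inner nodes cache the total length. This is a minimal, unbalanced illustration with invented names (`Leaf`, `Concat`, `char_at`); real implementations also rebalance and support splitting.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Leaf:
    text: str

    @property
    def length(self) -> int:
        return len(self.text)

@dataclass(frozen=True)
class Concat:
    left: "Rope"
    right: "Rope"
    length: int            # cached so indexing never has to visit both subtrees

Rope = Union[Leaf, Concat]

def concat(left: Rope, right: Rope) -> Rope:
    """Concatenate in O(1): no characters are copied."""
    return Concat(left, right, left.length + right.length)

def char_at(rope: Rope, index: int) -> str:
    """Walk down the tree, steering by the cached lengths."""
    while isinstance(rope, Concat):
        if index < rope.left.length:
            rope = rope.left
        else:
            index -= rope.left.length
            rope = rope.right
    return rope.text[index]

def flatten(rope: Rope) -> str:
    if isinstance(rope, Leaf):
        return rope.text
    return flatten(rope.left) + flatten(rope.right)

greeting = concat(concat(Leaf("hello"), Leaf(", ")), Leaf("world"))
```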
Slide 47
Strings
pitfalls
● do not "convert" a string to/from
bytes without explicitly specifying the
encoding
● do not iterate, index, or take the
length of a string without considering
the units
● normalise strings before comparison
or hashing
● use case folding for case-insensitive
comparisons
● do not NUL-terminate strings whose
encoding may contain NUL
● always know the "type" of the data in
your strings
● avoid strings whenever possible
Slide 48
Part 2 of 2
● structured data
○ in-memory representations
■ sequences
■ records/structs/tuples
■ maps
○ serialisation or transmission
■ schema-free
■ schema-directed
○ parsing
● binary data
○ typed views
○ slices
● other useful & common encodings
● parting notes
Slide 49
Recall from part 1
concept: strategies for representation on computers
integers
● fixed-width / machine integers
● two's complement
● VLQs / big integers
...
real numbers
● IEEE-754 floats
● Posits
...
text
● Code pages: ASCII, ISO-8859-1
● Unicode
● Unicode encodings: UTF-8, UTF-16, UTF-32
● Ropes, slices, mixed encodings
...
Sequences,
Bags, & Sets
representations
● arrays
○ contiguous in memory
○ fast indexing
○ insert / remove via copying only
● linked lists
○ slow indexing
○ mutable: fast insert, remove
○ immutable: shared suffixes, cons, drop
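The "shared suffixes, cons, drop" point for immutable linked lists can be illustrated with a small cons-cell sketch. The names are illustrative; `None` stands for the empty list.

```python
from typing import NamedTuple, Optional

class Cons(NamedTuple):
    head: object
    tail: Optional["Cons"]

def cons(head, tail=None) -> Cons:
    """Prepend in O(1) without copying or modifying the existing list."""
    return Cons(head, tail)

def drop(lst, n: int):
    """Skip the first n cells; the result is the existing suffix, not a copy."""
    while n > 0 and lst is not None:
        lst = lst.tail
        n -= 1
    return lst

def to_list(lst) -> list:
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out

shared = cons(2, cons(3))
a = cons(1, shared)        # 1 -> 2 -> 3
b = cons(0, shared)        # 0 -> 2 -> 3, reusing the same suffix cells
```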
Slide 54
Bags
representations,
continued
● binary heaps
○ binary tree
○ constraint: elements have an ordering
○ fast access to smallest element
○ heap property: parent ≤ children
○ binary trees can be efficiently laid out
in memory (left: 2n+1, right: 2n+2)
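The array layout and the heap property can be sketched as follows. Python's `heapq` module uses the same 2n+1 / 2n+2 layout on a plain list; the helper names and sample values are illustrative.

```python
import heapq

def children(n: int) -> tuple[int, int]:
    """Indices of the left and right child of the node stored at index n."""
    return 2 * n + 1, 2 * n + 2

def parent(n: int) -> int:
    return (n - 1) // 2

def is_min_heap(items: list) -> bool:
    """Check the heap property: every parent is <= each of its children."""
    return all(items[parent(i)] <= items[i] for i in range(1, len(items)))

heap = [9, 4, 7, 1, 8, 2]
heapq.heapify(heap)            # rearranges the list in place into heap order
smallest = heap[0]             # the smallest element is always at index 0
```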
Slide 55
Records
structs, tuples,
same thing
● laid out contiguously (like array)
● compiler translates fields to offsets
○ packed: 12345 'p' true
○ unpacked (word aligned): 12345 'p' padding true padding
record: { a: 12345, b: 'p', c: true }
tuple: ( 12345, 'p', true )
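The packed and word-aligned layouts can be imitated with Python's `struct` module. This sketch assumes a 4-byte integer, 4-byte words and little-endian order, and writes the padding explicitly with `x` pad bytes; a real compiler chooses these details per platform.

```python
import struct

record = (12345, b"p", True)                 # fields a, b, c

PACKED = struct.Struct("<ic?")               # fields back to back
WORD_ALIGNED = struct.Struct("<ic3x?3x")     # each field padded out to a 4-byte word

packed = PACKED.pack(*record)
aligned = WORD_ALIGNED.pack(*record)

# A field access is just a read at a fixed offset known ahead of time.
C_OFFSET_PACKED, C_OFFSET_ALIGNED = 5, 8
c_from_packed = struct.unpack_from("<?", packed, C_OFFSET_PACKED)[0]
c_from_aligned = struct.unpack_from("<?", aligned, C_OFFSET_ALIGNED)[0]
```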
Slide 56
Maps
representations
● unlike records, keys exist at
runtime
○ key must be present in representation
● simple & slow: any representation
as sequence of pairs
[
("key 1", "value 1"),
("key 2", "value 2"),
...
]
Slide 57
Maps
hash table representation
● keys must be hashable
● digest size = number of "buckets"
● inevitable hash collisions
● collision resolution: matching
bucket searched for correct key
● possibly catastrophic performance
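A sketch of a hash table with separate chaining, where each bucket is itself a small sequence of pairs. The class is illustrative (fixed bucket count, no resizing); `Colliding` is a deliberately bad key type used to show the catastrophic case.

```python
class ChainedHashMap:
    def __init__(self, bucket_count: int = 8):
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # The hash selects the bucket; colliding keys land in the same list.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value) -> None:
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Collision resolution: search the matching bucket for the exact key.
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return default

class Colliding:
    """A key whose hash is constant, so every instance collides."""
    def __init__(self, n: int):
        self.n = n
    def __hash__(self) -> int:
        return 0
    def __eq__(self, other) -> bool:
        return isinstance(other, Colliding) and self.n == other.n

table = ChainedHashMap()
for i in range(50):
    table.put(Colliding(i), i)
longest_chain = max(len(bucket) for bucket in table._buckets)
```

With all 50 keys in one bucket, every lookup degrades to a linear search, which is the behaviour an adversary who controls the keys can try to provoke.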
Slide 58
Considerations
about in-memory
representations
● will we have biased usage
frequency or usage patterns?
● is there a useful constraint or
invariant for the contained
elements?
● is our application performance
sensitive to CPU caches or
locality?
● will we be using it enough to
justify a complex representation?
● can an adversary take advantage
of catastrophic performance?
Serialisation
schema-free, text-based
● JSON
○ very limited data model
○ just a formal grammar, no associated
semantics
○ case study: Twitter snowflake bug
● YAML
○ internal references
○ multi-document streams
○ don't use yaml
● edn
○ extensibility
○ coordination between
encoder/decoder
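The slide does not spell out the snowflake case study. As general background (an assumption about what is meant, not taken from the slides): JSON's grammar does not say how wide a number is, so a parser that reads every number as a double corrupts 64-bit IDs. The sketch below imitates such a parser with `parse_int=float`; the ID is an arbitrary made-up value.

```python
import json

payload = '{"id": 1357432167890123457, "id_str": "1357432167890123457"}'

exact = json.loads(payload)                         # Python keeps arbitrary-width ints
as_doubles = json.loads(payload, parse_int=float)   # imitate a double-only parser

lost_precision = int(as_doubles["id"]) != exact["id"]
string_survives = int(as_doubles["id_str"]) == exact["id"]
```

Carrying large identifiers as strings sidesteps the ambiguity, at the cost of the consumer having to know that the string holds a number.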
Serialisation
schema-directed
(binary)
● schemas describe shape of data
● allows us to omit tags
○ schema-free encoding: tag 1, value 1, tag 2, value 2, tag 3, value 3
○ schema-directed encoding: value 1, value 2, value 3
● Protocol Buffers (protobuf)
○ limitation: schemas don't support
arbitrary nesting, so shape of your
data may need to change
● Cap'n Proto / FlatBuffers
○ zero-copy deserialisation
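The tagged and untagged layouts can be compared with a toy binary format. Everything here is invented for illustration (the tag numbers, field widths and values); it is not the wire format of any of the libraries named above.

```python
import struct

values = (7, 513, 1)

# Schema-directed: both sides agree up front on order and widths (u32, u16, u8).
SCHEMA = struct.Struct("<IHB")
schema_directed = SCHEMA.pack(*values)

# Schema-free: each value is preceded by a one-byte tag that names its type.
FORMAT_BY_TAG = {1: "<I", 2: "<H", 3: "<B"}
schema_free = b"".join(
    bytes([tag]) + struct.pack(FORMAT_BY_TAG[tag], value)
    for tag, value in zip((1, 2, 3), values)
)

def decode_tagged(data: bytes) -> list:
    """Decode without a schema: each tag says how to read the value that follows."""
    decoded, pos = [], 0
    while pos < len(data):
        fmt = FORMAT_BY_TAG[data[pos]]
        (value,) = struct.unpack_from(fmt, data, pos + 1)
        decoded.append(value)
        pos += 1 + struct.calcsize(fmt)
    return decoded
```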
Slide 64
Deserialisation
security
● arbitrary code execution
○ in Ruby and Python by default
● compression (e.g. decompression bombs)
● creating cycles where unexpected
○ generally, violating any structural
invariants
● data validation and/or budgeting
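Budgeting can be sketched for the compression case: decompress under a hard output limit instead of trusting the input. The function name, exception and limits are assumptions made for this example.

```python
import zlib

class BudgetExceeded(ValueError):
    """Raised when decompressed output would exceed the caller's budget."""

def bounded_decompress(data: bytes, limit: int) -> bytes:
    decompressor = zlib.decompressobj()
    # Ask for at most limit + 1 bytes, so "too large" is detectable
    # without ever materialising the full output.
    output = decompressor.decompress(data, limit + 1)
    if len(output) > limit:
        raise BudgetExceeded(f"output exceeds {limit} bytes")
    if not decompressor.eof:
        raise ValueError("compressed stream is truncated")
    return output

# Ten megabytes of zeros compress to a tiny payload.
bomb = zlib.compress(b"\0" * 10_000_000)
```

The same idea applies to nesting depth, element counts and total allocation when deserialising structured data from an untrusted source.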
Slide 65
Considerations
about representations
for serialisation
● is serialisation / deserialisation
performance important?
● does my data have a fixed schema?
● does it need to be canonicalising?
● human-readability / editing
● portable across different languages?
● portable across processes / machines?
● exotic types of data?
● adversary-controlled?
● limitations of underlying transmission
protocol
● impossible to represent nonsensical
structures?
Slide 66
Binary Data
Slide 67
Binary Data
typed views
● shared buffer abstraction
● can support reads and writes
Slide 68
Binary Data
slices
● shared buffer abstraction
● lightweight: just translates offsets
● composable
● read/write or read-only
● combine with typed views for
principled handling of binary data
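Python's `memoryview` offers both ideas over a shared buffer. A sketch assuming a platform where the native `unsigned int` (format `"I"`) is 4 bytes:

```python
import sys

buffer = bytearray(8)                  # the shared buffer
view = memoryview(buffer)

words = view.cast("I")                 # typed view: native-endian unsigned ints
words[1] = 0xDEADBEEF                  # a write through the typed view

tail = view[4:8]                       # slice: same memory, offsets shifted by 4
inner = tail[1:3]                      # slices compose: bytes 5 and 6 of the buffer
read_only = tail.toreadonly()          # read-only variant of the same slice

buffer[5] = 0x00                       # a write to the buffer is visible through every view
as_int = int.from_bytes(tail, sys.byteorder)
```

None of these operations copies the underlying bytes; each view only translates indices into the one shared buffer.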
Slide 69
Other Common Encodings
Slide 70
Common Encoding: URLs
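https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment
● scheme: https
● user info: user:pass
● host: example.org (punycode-encoded)
● port: 443
● path: /a/b/c
○ path components (percent-encoded): a, b, c
● query: x=0&y=1&z=2
○ query parameters (percent-encoded): x=0, y=1, z=2
● fragment: fragment (percent-encoded)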
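The same URL can be taken apart with `urllib.parse` rather than string slicing. The segment and host name used for the two encodings at the end are made-up examples.

```python
from urllib.parse import parse_qs, quote, urlsplit

url = "https://user:pass@example.org:443/a/b/c?x=0&y=1&z=2#fragment"
parts = urlsplit(url)

components = {
    "scheme": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path.split("/")[1:],
    "query": parse_qs(parts.query),
    "fragment": parts.fragment,
}

# Percent-encoding: characters with structural meaning are escaped per component.
encoded_segment = quote("a b/c", safe="")

# Punycode: non-ASCII host names, here via Python's built-in IDNA codec.
encoded_host = "b\u00fccher.example".encode("idna")
```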
Slide 71
Other Common
Encodings
● UUIDs (RFC 4122)
○ version & variant
○ time, MAC address, user/group ID
● User-Agent headers
○ browser family & version
○ OS & version
○ CPU architecture
○ mobile provider
● Dates
○ Thu Feb 04 2021 10:00:00 GMT-0800
(Pacific Standard Time)
○ year, month, day
○ hours, minutes, seconds
○ day of week
○ time zone & offset
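UUIDs and dates both pack several fields into one value, and the standard library can pull them apart. The UUID below is an arbitrary example value; the date is the one shown above, with a fixed -08:00 offset standing in for Pacific Standard Time.

```python
import uuid
from datetime import datetime, timedelta, timezone

u = uuid.UUID("c232ab00-9414-11ec-b3c8-9f6bdeced846")   # arbitrary example value
uuid_fields = {"version": u.version, "variant": u.variant, "node": u.node}

pst = timezone(timedelta(hours=-8), "PST")
moment = datetime(2021, 2, 4, 10, 0, 0, tzinfo=pst)
date_fields = {
    "date": (moment.year, moment.month, moment.day),
    "time": (moment.hour, moment.minute, moment.second),
    "weekday": moment.weekday(),                        # Monday is 0
    "utc_offset": moment.utcoffset(),
}
iso = moment.isoformat()                                # an unambiguous interchange form
```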
Slide 72
Parting Notes
Slide 73
Parting Notes
● distinguish abstractions from
concrete representations in
communication
● try working in abstractions, but be
aware of your representations
○ abstractions overpromise
○ interface may be possible but with
costs that are not evident upfront
○ easy to forget about potential
scenarios in which representation
matters
● constraints or invariants are
opportunities for optimisations
● work with structured data through
structured interfaces
Slide 74
end
Slide 75
Structured Data: Serialisation Strategies
schema-free
● examples: JSON, YAML, MessagePack
● shape of data may be unknown
● encoded data is self-describing
● encoded data may be human-readable
● size overhead to describe contained data
schema-directed
● examples: Protocol Buffers, Apache Avro, Apache Thrift
● shape of data known ahead of time
● schema is required for both encoding and decoding
● encoded data typically not human-readable
● compact encoding
Slide 76
Binary Data
base-64 encoding
● more convenient representation
in certain contexts that prefer
working with printable ASCII
● 4/3 size ratio: 64^4 = 256^3, so 4 characters
encode 3 bytes
● printable ASCII has at least 64
unique symbols (62 alphanums!)
● padding
● URL-safe alphabet
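Python's `base64` module covers the standard and URL-safe alphabets. A short sketch with arbitrary input bytes, chosen so that the two alphabets differ and padding is needed:

```python
import base64

data = bytes([0xFB, 0xFF])

standard = base64.b64encode(data)            # uses "+" and "/", pads with "="
url_safe = base64.urlsafe_b64encode(data)    # uses "-" and "_" instead

# Every 3 input bytes become exactly 4 output characters, so no padding here.
three_bytes = base64.b64encode(b"abc")

round_trip = base64.urlsafe_b64decode(url_safe)
```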