Slide 1

Slide 1 text

zip code: unpacking data compression by Scott Vokes @silentbicycle

Slide 2

Slide 2 text

Why Compression Matters: It’s essential infrastructure. Content-Encoding: gzip; Send to compressed (zipped) folder; multimedia encoding

Slide 3

Slide 3 text

Design for Compression curses XML

Slide 4

Slide 4 text

What is Data Compression? Heuristically detecting patterns; reducing duplication (FLICKR: WALLSTALKING.ORG)

Slide 5

Slide 5 text

What is Data Compression? “Friendly reminder: Compression is machine learning.” - Paul Snively

Slide 6

Slide 6 text

Patterns & Repetition OBVIOUS PATTERN: abababababababab NO PATTERN: hhluabsolsgtcoor

Slide 7

Slide 7 text

Patterns & Repetition OBVIOUS PATTERN: abababababababab NO PATTERN: hhluabsolsgtcoor Kolmogorov Complexity: the smallest way to describe something with 100% accuracy
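Kolmogorov complexity itself cannot be computed, but a general-purpose compressor gives a rough upper bound on it. The sketch below is an illustration only: it uses Python's standard `zlib` module and longer inputs than the slide (an assumption made so the fixed header overhead does not dominate) to compare patterned data with patternless data.

```python
import random
import zlib

# 1000 bytes with an obvious pattern.
patterned = b"ab" * 500

# 1000 bytes with no pattern; the fixed seed only makes the run repeatable.
rng = random.Random(0)
patternless = bytes(rng.randrange(256) for _ in range(1000))

patterned_size = len(zlib.compress(patterned))
patternless_size = len(zlib.compress(patternless))

# The patterned input shrinks to a few dozen bytes; the patternless one does not shrink.
print(patterned_size, patternless_size)
```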

Slide 8

Slide 8 text

Flavors of Compression LOSSLESS: gzip, zlib, GIF LOSSY: JPEG, MP3

Slide 9

Slide 9 text

Lossless Compression Run-Length Coding Delta Coding Huffman Coding LZ77 family (e.g., LZSS, DEFLATE) LZ78 family (e.g., LZW) The Burrows-Wheeler Transform

Slide 11

Slide 11 text

Run-Length Coding a b b b b b c d d d d d d d d d d e

Slide 12

Slide 12 text

Run-Length Coding a b b b b b c d d d d d d d d d d e a

Slide 13

Slide 13 text

Run-Length Coding a b b b b b c d d d d d d d d d d e a, 5 x b

Slide 14

Slide 14 text

Run-Length Coding a b b b b b c d d d d d d d d d d e a, 5 x b, c

Slide 15

Slide 15 text

Run-Length Coding a b b b b b c d d d d d d d d d d e a, 5 x b, c, 10 x d

Slide 16

Slide 16 text

Run-Length Coding a b b b b b c d d d d d d d d d d e a, 5 x b, c, 10 x d, e

Slide 17

Slide 17 text

Run-Length Coding a b b b b b c d d d d d d d d d d e a, 5 x b, c, 10 x d, e or 1a 5b 1c 10d 1e
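The walk-through above can be sketched in a few lines of Python. This is a minimal illustration, not a production format: the function names are made up for this example, and runs are kept as `(count, symbol)` pairs rather than packed into bytes.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical symbols into a (count, symbol) pair."""
    return [(len(list(run)), symbol) for symbol, run in groupby(text)]

def rle_decode(pairs):
    """Expand (count, symbol) pairs back into the original text."""
    return "".join(symbol * count for count, symbol in pairs)

slide_input = "a" + "b" * 5 + "c" + "d" * 10 + "e"
encoded = rle_encode(slide_input)
print(" ".join(f"{count}{symbol}" for count, symbol in encoded))  # 1a 5b 1c 10d 1e
```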

Slide 18

Slide 18 text

Lossless Compression Run-Length Coding Delta Coding Huffman Coding LZ77 family (e.g., LZSS, DEFLATE) LZ78 family (e.g., LZW) The Burrows-Wheeler Transform

Slide 19

Slide 19 text

Delta Coding 32491 32492 32495 32500 32507 32516 32527

Slide 20

Slide 20 text

Delta Coding 32491 32492 32495 32500 32507 32516 32527 @32491 +1 +3 +5 +7 +9 +11
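A minimal sketch of the same idea in Python, assuming the first value is stored as-is and every later value is stored as its difference from the previous one (the function names are illustrative):

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store each value as a difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Running sums of the deltas rebuild the original values."""
    return list(accumulate(deltas))

readings = [32491, 32492, 32495, 32500, 32507, 32516, 32527]
deltas = delta_encode(readings)
print(deltas)  # [32491, 1, 3, 5, 7, 9, 11]
```

The small deltas fit in fewer bits than the original values, which is where the saving comes from once they are packed or passed to another coder.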

Slide 21

Slide 21 text

Lossless Compression Run-Length Coding Delta Coding Huffman Coding LZ77 family (e.g., LZSS, DEFLATE) LZ78 family (e.g., LZW) The Burrows-Wheeler Transform

Slide 22

Slide 22 text

Huffman Coding (1952) Variable-length bit patterns; the most common symbols get the shortest codes. Compare Morse Code. COMMON: E = "." T = "-" RARE: X = "-..-" J = ".---"

Slide 23

Slide 23 text

Huffman Coding: sort tokens by frequency; merge the nodes with the lowest frequencies; build an unbalanced binary tree

Slide 24

Slide 24 text

Huffman Coding: adaptive to frequencies in the data. COMMON: "the cat in the hat" UNUSUAL: "syzygy of zephyrs" NARROW: "humulus lupulus"

Slide 25

Slide 25 text

Huffman Coding

Slide 26

Slide 26 text

Huffman Coding

Slide 27

Slide 27 text

Huffman Coding

Slide 28

Slide 28 text

Huffman Coding

Slide 29

Slide 29 text

Huffman Coding

Slide 30

Slide 30 text

Huffman Coding

Slide 31

Slide 31 text

Huffman Coding

Slide 32

Slide 32 text

Huffman Coding T = 00, H = 01, _ = 10, N = 1100, I = 1101, C = 1110, A = 1111
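The tree-building steps (count frequencies, repeatedly merge the two least frequent nodes, read codes off the tree) can be sketched with Python's `heapq`. This is an illustrative sketch: the function names are made up, and the exact bit patterns depend on how ties are broken, so they need not match the table on the slide.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Return a dict mapping each symbol to its bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    order = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(n, next(order), symbol) for symbol, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(order), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a single symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

def huffman_decode(bits, codes):
    """Decode a bit string; this works because no code is a prefix of another."""
    inverse = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

message = "the cat in the hat"
codes = huffman_codes(message)
bits = "".join(codes[ch] for ch in message)
print(len(bits), "bits instead of", 8 * len(message))
```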

Slide 33

Slide 33 text

Lossless Compression Run-Length Coding Delta Coding Huffman Coding LZ77 family (e.g., LZSS, DEFLATE) LZ78 family (e.g., LZW) The Burrows-Wheeler Transform

Slide 34

Slide 34 text

LZ77 (1977): sliding window compression, invented by Jacob Ziv and Abraham Lempel (SLIDER BURGER BY MARCO ANGELES, WWW.MARCOANGELES.COM)

Slide 35

Slide 35 text

LZ77 abcabcdabcdefghabcabchij

Slide 36

Slide 36 text

LZ77 abcabcdabcdefghabcabchij abc

Slide 37

Slide 37 text

LZ77 abcabcdabcdefghabcabchij abc# (back-reference) (-3,+3)

Slide 38

Slide 38 text

LZ77 abcabcdabcdefghabcabchij abc#d (-3,+3)

Slide 39

Slide 39 text

LZ77 abcabcdabcdefghabcabchij abc#d# (-3,+3) (-4,+4)

Slide 40

Slide 40 text

LZ77 abcabcdabcdefghabcabchij abc#d#efgh (-3,+3) (-4,+4)

Slide 41

Slide 41 text

LZ77 abcabcdabcdefghabcabchij abc#d#efgh# (-3,+3) (-4,+4) (-15,+6)

Slide 42

Slide 42 text

LZ77 abcabcdabcdefghabcabchij abc#d#efgh#hij (-3,+3) (-4,+4) (-15,+6)
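The walk-through above can be reproduced with a small greedy encoder. This is a sketch under stated assumptions, not any particular tool's format: tokens are kept as Python objects, a back-reference is an `(offset, length)` tuple with positive numbers (the slide writes the same thing as `(-3,+3)`), and the `window` and `min_match` parameters are illustrative choices.

```python
def lz77_encode(data, window=4096, min_match=3):
    """Greedy LZ77: emit literals, or (offset, length) back-references."""
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Scan the window from nearest to farthest; keep the longest match.
        for j in range(i - 1, max(0, i - window) - 1, -1):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

def lz77_decode(tokens):
    """Replay literals and back-references, copying one symbol at a time."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])
        else:
            out.append(token)
    return "".join(out)

tokens = lz77_encode("abcabcdabcdefghabcabchij")
print(tokens)
```

On the slide's input this produces three literals, `(3, 3)`, a literal, `(4, 4)`, four literals, `(15, 6)`, and three final literals, matching `abc#d#efgh#hij`.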

Slide 43

Slide 43 text

LZ77 arrrrrrrrrr

Slide 44

Slide 44 text

LZ77 arrrrrrrrrr ar

Slide 46

Slide 46 text

LZ77 arrrrrrrrrr ar# (-1,+9): repeating the past into the future
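A back-reference may be longer than its offset, because the copy can read symbols it has only just written. A minimal sketch of that copy loop (the helper name `copy_back` is made up for this example):

```python
def copy_back(output, offset, length):
    """Append `length` symbols, each read `offset` positions back from the end."""
    for _ in range(length):
        # output grows on every step, so the source position keeps moving forward too.
        output.append(output[-offset])
    return output

out = list("ar")
copy_back(out, offset=1, length=9)
result = "".join(out)
print(result)  # arrrrrrrrrr
```

Copying one symbol at a time is what makes this work; a single bulk copy of the source range would fail because the range does not fully exist yet.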

Slide 47

Slide 47 text

Lossless Compression Run-Length Coding Delta Coding Huffman Coding LZ77 family (e.g., LZSS, DEFLATE) LZ78 family (e.g., LZW) The Burrows-Wheeler Transform

Slide 48

Slide 48 text

LZ78: dictionary compression. Find the longest match; useful patterns grow in the dictionary (FLICKR: @S4XTON)

Slide 49

Slide 49 text

LZ78: when the dictionary is “too full”, throw it out and start over; can run in constant space
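A rough sketch of LZ78 with the "throw it out and start over" policy. Output is the classic list of `(dictionary index, next symbol)` pairs, with index 0 meaning the empty phrase. The `max_entries` limit and the function names are assumptions for illustration; real implementations differ in how and when they reset.

```python
def lz78_encode(text, max_entries=4096):
    """Emit (index of longest known phrase, next symbol) pairs."""
    dictionary, out, phrase = {}, [], ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate  # keep extending the match
            continue
        out.append((dictionary.get(phrase, 0), ch))
        if len(dictionary) >= max_entries:
            dictionary.clear()  # "too full": start over
        else:
            dictionary[candidate] = len(dictionary) + 1
        phrase = ""
    if phrase:  # input ended in the middle of a known phrase
        out.append((dictionary[phrase], ""))
    return out

def lz78_decode(pairs, max_entries=4096):
    """Rebuild the text by mirroring the encoder's dictionary updates."""
    phrases, out = [""], []
    for index, ch in pairs:
        entry = phrases[index] + ch
        out.append(entry)
        if len(phrases) - 1 >= max_entries:
            phrases = [""]  # same reset rule as the encoder
        else:
            phrases.append(entry)
    return "".join(out)

sample = "abababababababab" * 4
pairs = lz78_encode(sample, max_entries=4)
print(len(pairs), "pairs for", len(sample), "symbols")
```

Because the dictionary never holds more than `max_entries` phrases, memory use stays bounded no matter how long the input is.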

Slide 50

Slide 50 text

Variants: many things in common use are variations on LZ77 or LZ78, often combined with Huffman Coding or with other simple adaptations

Slide 51

Slide 51 text

LZSS (1982): Lempel-Ziv-Storer-Szymanski. Sliding window based; only makes substitutions that break even; single bit markers distinguish literals from back-references
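The break-even rule is simple arithmetic on bit costs. In the sketch below, the field widths (`offset_bits`, `length_bits`) are illustrative assumptions rather than values from any specific implementation; each token is assumed to start with a one-bit marker.

```python
def worth_substituting(match_len, offset_bits=11, length_bits=4):
    """True if one back-reference costs fewer bits than the literals it replaces."""
    literal_cost = match_len * (1 + 8)            # marker bit + 8 data bits per literal
    backref_cost = 1 + offset_bits + length_bits  # marker bit + offset + length fields
    return backref_cost < literal_cost

for n in (1, 2, 3):
    print(n, worth_substituting(n))
```

With these widths a back-reference costs 16 bits, so a one-symbol match (9 bits as a literal) is not worth replacing, while a two-symbol match (18 bits) is.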

Slide 52

Slide 52 text

LZW (1984): Lempel-Ziv-Welch. A better LZ78 (dictionary-based); starts with a smart default dictionary
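A sketch of LZW, assuming the default dictionary is the 256 single-byte values and that codes are kept as plain integers instead of being packed into variable-width bit fields (function names are illustrative):

```python
def lzw_encode(data):
    """Encode bytes as a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}  # default dictionary
    out, phrase = [], b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in table:
            phrase = candidate
        else:
            out.append(table[phrase])
            table[candidate] = len(table)  # learn the new phrase
            phrase = bytes([byte])
    if phrase:
        out.append(table[phrase])
    return out

def lzw_decode(codes):
    """Rebuild the bytes; the decoder learns the same phrases one step behind."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)

data = b"abababababababab"
codes = lzw_encode(data)
print(len(codes), "codes for", len(data), "bytes")
```

Unlike LZ78, no explicit "next symbol" is sent with each code, because every single byte is already in the dictionary.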

Slide 53

Slide 53 text

LZW (1984): nasty patent situation (the US patent expired in June 2003)

Slide 54

Slide 54 text

DEFLATE: LZSS + Huffman. Used by PKZIP. Also, gzip.
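DEFLATE is available in Python's standard library through the `zlib` and `gzip` modules, which wrap the same compressed stream in different headers. A short sketch (the sample text is made up for the example):

```python
import gzip
import zlib

text = b"the cat in the hat " * 50

zlib_bytes = zlib.compress(text, 9)  # DEFLATE inside a zlib wrapper
gzip_bytes = gzip.compress(text)     # DEFLATE inside a gzip wrapper

print(len(text), len(zlib_bytes), len(gzip_bytes))

# Both round-trip losslessly.
restored_from_zlib = zlib.decompress(zlib_bytes)
restored_from_gzip = gzip.decompress(gzip_bytes)
```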

Slide 55

Slide 55 text

Lossless Compression Run-Length Coding Delta Coding Huffman Coding LZ77 family (e.g., LZSS, DEFLATE) LZ78 family (e.g., LZW) The Burrows-Wheeler Transform

Slide 56

Slide 56 text

Transformations: sorted data compresses much better. NORMAL: the_cat_in_the_hat SORTED: ____aaceehhhintttt

Slide 57

Slide 57 text

WELLSORTEDVERSION.COM

Slide 58

Slide 58 text

Transformations: sorted data compresses much better. NORMAL: the_cat_in_the_hat SORTED: ____aaceehhhintttt Unfortunately, sorting is a one-way process...
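One rough way to see why sorted data compresses better is to count runs, since each run is something a run-length coder can collapse. A small sketch (the helper name `count_runs` is made up for this example):

```python
from itertools import groupby

def count_runs(text):
    """Number of maximal runs of identical adjacent symbols."""
    return sum(1 for _ in groupby(text))

normal = "the_cat_in_the_hat"
sorted_text = "".join(sorted(normal))

print(sorted_text)              # ____aaceehhhintttt
print(count_runs(normal))       # 18
print(count_runs(sorted_text))  # 8
```

The sorted string has far fewer runs, but nothing in it records the original order of the symbols, so the original cannot be recovered from it.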

Slide 59

Slide 59 text

Burrows-Wheeler Transform (1994): a reversible, partial sort; collates together common substrings; transformed data compresses better; used in bzip

Slide 60

Slide 60 text

Burrows-Wheeler Transform repetition

Slide 61

Slide 61 text

Burrows-Wheeler Transform ^repetition repetition^ epetition^r petition^re etition^rep tition^repe ition^repet tion^repeti ion^repetit on^repetiti n^repetitio

Slide 62

Slide 62 text

Burrows-Wheeler Transform ^repetition epetition^r etition^rep ion^repetit ition^repet n^repetitio on^repetiti petition^re repetition^ tion^repeti tition^repe

Slide 63

Slide 63 text

Burrows-Wheeler Transform ..........n ..........r ..........p ..........t ..........t ..........o ..........i ..........e ..........^ ..........i ..........e last column: nrpttoie^ie
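The forward transform shown on these slides (prepend a marker, list every rotation, sort, keep the last column) fits in a few lines. This is the naive version for illustration; it assumes the marker `^` does not occur in the input and sorts before every other symbol, which holds for lowercase ASCII text.

```python
def bwt(text, marker="^"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    s = marker + text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

transformed = bwt("repetition")
print(transformed)  # nrpttoie^ie
```

Building every rotation takes quadratic memory, so practical implementations use suffix sorting instead; the output is the same.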

Slide 64

Slide 64 text

Burrows-Wheeler Transform ..........^ ..........e ..........e ..........i ..........i ..........n ..........o ..........p ..........r ..........t ..........t

Slide 65

Slide 65 text

Burrows-Wheeler Transform .........n^ .........re .........pe .........ti .........ti .........on .........io .........ep .........^r .........it .........et

Slide 66

Slide 66 text

Burrows-Wheeler Transform .........^r .........ep .........et .........io .........it .........n^ .........on .........pe .........re .........ti .........ti

Slide 67

Slide 67 text

Burrows-Wheeler Transform ........n^r ........rep ........pet ........tio ........tit ........on^ ........ion ........epe ........^re ........iti ........eti

Slide 68

Slide 68 text

Burrows-Wheeler Transform ........^re ........epe ........eti ........ion ........iti ........n^r ........on^ ........pet ........rep ........tio ........tit

Slide 69

Slide 69 text

Burrows-Wheeler Transform .......n^re .......repe .......peti .......tion .......titi .......on^r .......ion^ .......epet .......^rep .......itio .......etit

Slide 70

Slide 70 text

Burrows-Wheeler Transform .......^rep .......epet .......etit .......ion^ .......itio .......n^re .......on^r .......peti .......repe .......tion .......titi

Slide 71

Slide 71 text

Burrows-Wheeler Transform ......n^rep ......repet ......petit ......tion^ ......titio ......on^re ......ion^r ......epeti ......^repe ......ition ......etiti

Slide 72

Slide 72 text

Burrows-Wheeler Transform ......^repe ......epeti ......etiti ......ion^r ......ition ......n^rep ......on^re ......petit ......repet ......tion^ ......titio

Slide 73

Slide 73 text

Burrows-Wheeler Transform .....n^repe .....repeti .....petiti .....tion^r .....tition .....on^rep .....ion^re .....epetit .....^repet .....ition^ .....etitio

Slide 74

Slide 74 text

Burrows-Wheeler Transform .....^repet .....epetit .....etitio .....ion^re .....ition^ .....n^repe .....on^rep .....petiti .....repeti .....tion^r .....tition

Slide 75

Slide 75 text

Burrows-Wheeler Transform ....n^repet ....repetit ....petitio ....tion^re ....tition^ ....on^repe ....ion^rep ....epetiti ....^repeti ....ition^r ....etition

Slide 76

Slide 76 text

Burrows-Wheeler Transform ....^repeti ....epetiti ....etition ....ion^rep ....ition^r ....n^repet ....on^repe ....petitio ....repetit ....tion^re ....tition^

Slide 77

Slide 77 text

Burrows-Wheeler Transform ...n^repeti ...repetiti ...petition ...tion^rep ...tition^r ...on^repet ...ion^repe ...epetitio ...^repetit ...ition^re ...etition^

Slide 78

Slide 78 text

Burrows-Wheeler Transform ...^repetit ...epetitio ...etition^ ...ion^repe ...ition^re ...n^repeti ...on^repet ...petition ...repetiti ...tion^rep ...tition^r

Slide 79

Slide 79 text

Burrows-Wheeler Transform ..n^repetit ..repetitio ..petition^ ..tion^repe ..tition^re ..on^repeti ..ion^repet ..epetition ..^repetiti ..ition^rep ..etition^r

Slide 80

Slide 80 text

Burrows-Wheeler Transform ..^repetiti ..epetition ..etition^r ..ion^repet ..ition^rep ..n^repetit ..on^repeti ..petition^ ..repetitio ..tion^repe ..tition^re

Slide 81

Slide 81 text

Burrows-Wheeler Transform .n^repetiti .repetition .petition^r .tion^repet .tition^rep .on^repetit .ion^repeti .epetition^ .^repetitio .ition^repe .etition^re

Slide 82

Slide 82 text

Burrows-Wheeler Transform .^repetitio .epetition^ .etition^re .ion^repeti .ition^repe .n^repetiti .on^repetit .petition^r .repetition .tion^repet .tition^rep

Slide 83

Slide 83 text

Burrows-Wheeler Transform n^repetitio repetition^ petition^re tion^repeti tition^repe on^repetiti ion^repetit epetition^r ^repetition ition^repet etition^rep

Slide 84

Slide 84 text

Burrows-Wheeler Transform ^repetition epetition^r etition^rep ion^repetit ition^repet n^repetitio on^repetiti petition^re repetition^ tion^repeti tition^repe

Slide 85

Slide 85 text

Burrows-Wheeler Transform ^repetition
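The reconstruction above (prepend the transformed column to every row, sort, and repeat until the rows are full length) can be written directly. This is the naive approach for illustration, again assuming `^` is a marker that does not occur in the original text.

```python
def inverse_bwt(last_column, marker="^"):
    """Undo the Burrows-Wheeler Transform by repeated prepend-and-sort."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the known last column to each row, then sort the rows again.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The row that starts with the marker is the original text.
    original = next(row for row in table if row.startswith(marker))
    return original[len(marker):]

restored = inverse_bwt("nrpttoie^ie")
print(restored)  # repetition
```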

Slide 86

Slide 86 text

Compression for embedded? FLICKR: @BARNOID

Slide 87

Slide 87 text

heatshrink: LZSS (sliding window); hard real-time: suspend/resume at any bit of I/O; decompress in < 50 bytes RAM; compress in < 100 bytes RAM; BSD-style license (FLICKR: @MIGHTYOHM)

Slide 88

Slide 88 text

LZSS, for embedded.

Slide 89

Slide 89 text

LZSS, for embedded. suspend / resume loops input output done

Slide 90

Slide 90 text

LZSS, for embedded. bug!

Slide 91

Slide 91 text

heatshrink (LZSS) demo g. You may not use or otherwise export or re-export the Licensed Application except as authorized by United States law and the laws of the jurisdiction in which the Licensed Application was obtained. In particular, but without limitation, the Licensed Application may not be exported or re-exported (a) into any U.S.-embargoed countries or (b) to anyone on the U.S. Treasury Department's Specially Designated Nationals List or the U.S. Department of Commerce Denied Persons List or Entity List. By using the Licensed Application, you represent and warrant that you are not located in any such country or on any such list. You also agree that you will not use these products for any purposes prohibited by United States law, including, without limitation, the development, design, manufacture, or production of nuclear, missile, or chemical or biological weapons.

Slide 92

Slide 92 text

heatshrink (LZSS) demo g#######-#########U#########isdi####w####,##########(a)##U.S.- embargo##(b)####20T#s####'####Design#N#s######Commer#D####E## ###############l###########h######d#lop###,#uf#u###nuc###l#hem# ##og#wea#

Slide 93

Slide 93 text

heatshrink (LZSS) demo g#######-#########U#########isdi####w####,##########(a)##U.S.- embargo##(b)####20T#s####'####Design#N#s######Commer#D####E## ###############l###########h######d#lop###,#uf#u###nuc###l#hem# ##og#wea# Some substitutions: "the Licensed Application" "may not be " "United States law" "ithout limitation, the "

Slide 94

Slide 94 text

heatshrink (LZSS) demo g#######-#########U#########isdi####w####,##########(a)##U.S.- embargo##(b)####20T#s####'####Design#N#s######Commer#D####E## ###############l###########h######d#lop###,#uf#u###nuc###l#hem# ##og#wea# Some substitutions: "ction " "ational" "ed in any "

Slide 95

Slide 95 text

Lossy Compression: inherently data-specific; smart degradation; usually a quality/size continuum; common in multimedia
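The quality/size continuum can be illustrated with plain quantization, which is far simpler than what JPEG or MP3 actually do. Everything in this sketch is an assumption made for the example: the signal is a synthetic noisy sine wave, the `step` values are arbitrary, and `zlib` stands in for the lossless stage that follows the lossy one.

```python
import math
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [
    min(255, max(0, int(128 + 100 * math.sin(i / 50) + rng.gauss(0, 3))))
    for i in range(2000)
]

def quantize(values, step):
    """Lossy step: round each sample down to a multiple of `step`."""
    return [(v // step) * step for v in values]

def compressed_size(values):
    """Lossless step: DEFLATE the byte values and measure the result."""
    return len(zlib.compress(bytes(values)))

for step in (1, 8, 64):
    quantized = quantize(samples, step)
    worst_error = max(abs(a - b) for a, b in zip(samples, quantized))
    print(step, compressed_size(quantized), worst_error)
```

A larger step throws away more detail, so the error grows while the compressed size shrinks.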

Slide 96

Slide 96 text

Lossy Compression example: JPEG 51 KB 27 KB 14 KB

Slide 97

Slide 97 text

Closing: why compression matters; designing for compression; examples of lossless compression; case study: heatshrink; examples of lossy compression

Slide 98

Slide 98 text

To learn more

Slide 99

Slide 99 text

We’re Hiring! Detroit & Ann Arbor, MI

Slide 100

Slide 100 text

Questions? @silentbicycle github.com/silentbicycle