Slide 1

Slide 1 text

FINITE STATE TRANSDUCERS
CAPITAL GO 2017

Slide 2

Slide 2 text

MINIMAL DETERMINISTIC ACYCLIC FINITE STATE TRANSDUCERS

Slide 3

Slide 3 text

FINITE STATE MACHINE

Slide 4

Slide 4 text

FINITE STATE AUTOMATA (FSA)
MATCHES:
• cat
• cats
• dog
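A minimal sketch of how an automaton for these three keys could be represented and walked, assuming a hand-built transition table. The state numbers and the `accepts` helper are illustrative only and are not taken from the slides or from any library. "cats" and "dog" end in the same final state, which is the kind of suffix sharing a minimal automaton relies on.

```go
package main

import "fmt"

// transitions[s][b] is the state reached from state s on input byte b.
// State numbers are arbitrary and chosen for this illustration only.
var transitions = map[int]map[byte]int{
	0: {'c': 1, 'd': 5},
	1: {'a': 2},
	2: {'t': 3},
	3: {'s': 4},
	5: {'o': 6},
	6: {'g': 4}, // "dog" ends in the same final state as "cats"
}

// final marks accepting states: 3 accepts "cat", 4 accepts "cats" and "dog".
var final = map[int]bool{3: true, 4: true}

// accepts reports whether key is in the set encoded by the automaton.
func accepts(key string) bool {
	s := 0
	for i := 0; i < len(key); i++ {
		next, ok := transitions[s][key[i]]
		if !ok {
			return false // no transition for this byte: key is not in the set
		}
		s = next
	}
	return final[s]
}

func main() {
	for _, k := range []string{"cat", "cats", "dog", "ca", "dogs"} {
		fmt.Printf("%q: %v\n", k, accepts(k))
	}
}
```

A lookup costs one table step per input byte, independent of how many keys the set holds.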

Slide 5

Slide 5 text

REGULAR EXPRESSION
^(cats?|dog)$
MATCHES:
• cat
• cats
• dog
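Anchor placement matters for a pattern like this, because alternation binds more loosely than `^` and `$`. The sketch below uses Go's standard `regexp` package (not an automaton-based FST matcher) to compare an ungrouped alternation with a grouped one. The variable names `loose` and `strict` are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Parsed as (^cat(s)?) OR (dog$): each anchor applies to one branch only.
	loose = regexp.MustCompile(`^cat(s)?|dog$`)
	// Both anchors apply to the whole alternation.
	strict = regexp.MustCompile(`^(cats?|dog)$`)
)

func main() {
	for _, w := range []string{"cat", "cats", "dog", "catalog", "hotdog"} {
		fmt.Printf("%-8s loose=%-5v strict=%v\n",
			w, loose.MatchString(w), strict.MatchString(w))
	}
}
```

Only the grouped form accepts exactly cat, cats, and dog. The ungrouped form also accepts anything that starts with "cat" or ends with "dog".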

Slide 6

Slide 6 text

FINITE STATE AUTOMATA (FSA) - ORDERED SET
▸ Does it contain a key?
▸ Enumerate Ranges of Keys
  ▸ in lexicographic order

Slide 7

Slide 7 text

FINITE STATE TRANSDUCER (FST)
▸ Each transition carries a value
▸ A traversal collects the values along its path, combining them with SUM()
▸ In the example, "cat" has value 9
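A minimal sketch of a transducer lookup that sums the values along the path. The diagram's actual per-transition values are not recoverable from the slide text, so the arc outputs below are invented so that "cat" sums to 9. The values for "cats" and "dog", the state numbers, and the `get` helper are likewise hypothetical.

```go
package main

import "fmt"

// arc is one transition: the target state and the value emitted on the way.
type arc struct {
	next   int
	output uint64
}

// arcs[s][b] is the transition from state s on byte b.
// All outputs are made up for this illustration.
var arcs = map[int]map[byte]arc{
	0: {'c': {1, 9}, 'd': {5, 7}},
	1: {'a': {2, 0}},
	2: {'t': {3, 0}},
	3: {'s': {4, 3}},
	5: {'o': {6, 0}},
	6: {'g': {4, 0}},
}

var final = map[int]bool{3: true, 4: true}

// get walks the key byte by byte, summing arc outputs, and reports
// whether the walk ended in a final state.
func get(key string) (uint64, bool) {
	s := 0
	var sum uint64
	for i := 0; i < len(key); i++ {
		a, ok := arcs[s][key[i]]
		if !ok {
			return 0, false
		}
		sum += a.output
		s = a.next
	}
	if !final[s] {
		return 0, false
	}
	return sum, true
}

func main() {
	for _, k := range []string{"cat", "cats", "dog", "ca"} {
		v, ok := get(k)
		fmt.Printf("%q: value=%d found=%v\n", k, v, ok)
	}
}
```

Because "cats" extends the path for "cat", it reuses the 9 already collected and adds only the output on its last arc.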

Slide 8

Slide 8 text

FINITE STATE TRANSDUCER (FST) - ORDERED MAP
▸ Does it contain a key?
▸ What is the value associated with a key?
▸ Enumerate Ranges of Key/Value Pairs
  ▸ in lexicographic order
* values must satisfy a particular algebra (currently just uint64)
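The three operations above describe an ordered-map interface. The sketch below shows only those semantics, using a sorted slice and binary search as a stand-in. It is not an FST and shares none of its space properties. The `orderedMap` type, its methods, and the half-open `[start, end)` range convention are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

type entry struct {
	key string
	val uint64
}

// orderedMap holds entries sorted by key in lexicographic (byte) order.
type orderedMap []entry

// get answers both "does it contain a key?" and "what is its value?".
func (m orderedMap) get(key string) (uint64, bool) {
	i := sort.Search(len(m), func(i int) bool { return m[i].key >= key })
	if i < len(m) && m[i].key == key {
		return m[i].val, true
	}
	return 0, false
}

// scan returns the entries with start <= key < end, in key order.
func (m orderedMap) scan(start, end string) []entry {
	lo := sort.Search(len(m), func(i int) bool { return m[i].key >= start })
	hi := sort.Search(len(m), func(i int) bool { return m[i].key >= end })
	if hi < lo {
		return nil // empty range when end sorts before start
	}
	return m[lo:hi]
}

func main() {
	m := orderedMap{{"one", 1}, {"three", 3}, {"two", 2}}
	v, ok := m.get("three")
	fmt.Println(v, ok)
	for _, e := range m.scan("o", "tr") {
		fmt.Println(e.key, e.val)
	}
}
```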

Slide 9

Slide 9 text

FST LIBRARY IN GO: Vellum
▸ Build
  ▸ Insert Keys in Lexicographic Order
  ▸ Bounded Memory
  ▸ Stream Output while Building
▸ Use
  ▸ Immutable
  ▸ FST data is memory mapped (mmap)
  ▸ References are slice offsets, not pointers (less garbage collector impact)

Slide 10

Slide 10 text

CONSTRUCTING THE FST
one - 1
two - 2
three - 3
Image: https://www.flickr.com/photos/wfyurasko/5573962244

Slide 11

Slide 11 text

CREATING A BUILDER
builder, err := vellum.New(f, nil)
▸ f: any io.Writer
▸ nil: options (pass nil for the defaults)

Slide 12

Slide 12 text

INSERT
err = builder.Insert([]byte("one"), 1)

Slide 13

Slide 13 text

INSERT
err = builder.Insert([]byte("three"), 3)
Legend: frozen

Slide 14

Slide 14 text

INSERT
err = builder.Insert([]byte("three"), 3)
Legend: frozen, flushed to disk

Slide 15

Slide 15 text

INSERT
err = builder.Insert([]byte("two"), 2)
Legend: frozen, flushed to disk
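The keys in this walkthrough arrive as "one", "three", "two" because the builder expects lexicographic (byte) order, not numeric order of the values. Below is a sketch of preparing input that way using only the standard library. The `pair` type and both helpers are hypothetical, and no builder is actually called.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// pair is a key/value item waiting to be inserted.
type pair struct {
	key []byte
	val uint64
}

// sortForInsert orders pairs by key, comparing raw bytes.
func sortForInsert(pairs []pair) {
	sort.Slice(pairs, func(i, j int) bool {
		return bytes.Compare(pairs[i].key, pairs[j].key) < 0
	})
}

// inInsertOrder reports whether the keys never decrease from one pair to the next.
func inInsertOrder(pairs []pair) bool {
	for i := 1; i < len(pairs); i++ {
		if bytes.Compare(pairs[i-1].key, pairs[i].key) > 0 {
			return false
		}
	}
	return true
}

func main() {
	pairs := []pair{
		{[]byte("one"), 1},
		{[]byte("two"), 2},
		{[]byte("three"), 3},
	}
	fmt.Println("ordered before sort:", inInsertOrder(pairs))
	sortForInsert(pairs)
	for _, p := range pairs {
		fmt.Printf("%s -> %d\n", p.key, p.val)
	}
}
```

Sorted input is what lets a builder freeze and flush the finished part of the structure: once a key with a different prefix arrives, nothing inserted later can change the earlier paths.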

Slide 19

Slide 19 text

CLOSE
err = builder.Close()
Legend: frozen, flushed to disk

Slide 22

Slide 22 text

USING THE FST
Image: https://commons.wikimedia.org/wiki/File:Metro_Blur.JPG

Slide 23

Slide 23 text

OPEN
fst, err = vellum.Open(path)
Does NOT read the entire FST into memory.
* To load an FST that is already in memory, use vellum.Load(data).

Slide 24

Slide 24 text

GET
val, exists, err = fst.Get(key)

Slide 25

Slide 25 text

ITERATE
itr, err := fst.Iterator(start, end)
for err == nil {
    key, val := itr.Current()
    // do something
    err = itr.Next()
}

Slide 26

Slide 26 text

EFFICIENT SEARCHING WITH OTHER AUTOMATA
itr, err := fst.Search(automaton, start, end)
Just like an iterator, with an additional filter.

Slide 27

Slide 27 text

REGULAR EXPRESSION MATCHING
r, err := regexp.New(`c.*t`)
itr, err := fst.Search(r, start, end)

Slide 28

Slide 28 text

FUZZY MATCHING
fuzzy, err := levenshtein.New("cat", 1)
itr, err := fst.Search(fuzzy, start, end)
Matches: at, bat, cats
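To make "within edit distance 1 of cat" concrete, the sketch below filters a small word list with the textbook dynamic-programming Levenshtein distance. It is a brute-force stand-in that checks every key one at a time, not the automaton-driven search the slide describes. The word list and the function names are made up for illustration.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// distance returns the Levenshtein distance between a and b, counted in
// code points, keeping only two rows of the DP table at a time.
func distance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution (or match)
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// fuzzy returns the words within maxDist edits of query, in input order.
func fuzzy(words []string, query string, maxDist int) []string {
	var out []string
	for _, w := range words {
		if distance(w, query) <= maxDist {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	words := []string{"at", "bat", "cats", "data", "dog"}
	fmt.Println(fuzzy(words, "cat", 1))
}
```

This version does work proportional to the whole key set for each query. Driving the search with an automaton allows whole branches of keys to be skipped.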

Slide 29

Slide 29 text

WHAT ABOUT UNICODE?
c  a  f  é    63 61 66 c3 a9
c  a  f  e    63 61 66 65
▸ Edit distance: 2 bytes
▸ Edit distance: 1 code point
Levenshtein/Regex automata have integrated UTF-8 decoding
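The gap between the two distances comes from "é" occupying two bytes in UTF-8. A short sketch using only the standard library shows the byte view and the code-point view of both words. The `describe` helper is illustrative.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe renders a string with its UTF-8 bytes in hex, its byte length,
// and its length in code points.
func describe(s string) string {
	return fmt.Sprintf("%s: % x (%d bytes, %d code points)",
		s, s, len(s), utf8.RuneCountInString(s))
}

func main() {
	fmt.Println(describe("caf\u00e9")) // é written as an escape to fix the encoding
	fmt.Println(describe("cafe"))
}
```

Turning c3 a9 into 65 at the byte level takes one substitution plus one deletion, while at the code-point level it is a single substitution.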

Slide 30

Slide 30 text

SYSTEM DICTIONARY
/usr/share/dict/words, 235,886 words

       bytes      % of original
txt    2,493,109  -
fst    1,224,433  49%
gz       747,078  30%

Not competing on compression ratio!
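The percentage column is each size divided by the plain-text size, rounded to a whole percent. A small sketch of that arithmetic using the byte counts from the slide; the `percentOfOriginal` helper is illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// percentOfOriginal returns size as a whole-number percentage of original.
func percentOfOriginal(size, original int64) int {
	return int(math.Round(100 * float64(size) / float64(original)))
}

func main() {
	const txt = 2493109 // bytes in the plain-text word list
	fmt.Println("fst:", percentOfOriginal(1224433, txt), "%")
	fmt.Println("gz: ", percentOfOriginal(747078, txt), "%")
}
```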

Slide 31

Slide 31 text

REGULAR EXPRESSION SEARCH DICTIONARY https://asciinema.org/a/az29w7apf7kdiioa1yf150g1u

Slide 32

Slide 32 text

FUZZY SEARCH DICTIONARY https://asciinema.org/a/avqyqbo3uw7y653rlnv7bqivv

Slide 33

Slide 33 text

INDEXING WIKIPEDIA ARTICLE TITLES 6,726,078 TITLES

Slide 34

Slide 34 text

FUTURE WORK
▸ Bleve Full Text Search Library
  ▸ Next Generation Term Dictionary Format based on Vellum FST
▸ Add Support for storing []byte values (not just uint64)
▸ Continued Optimizations
  ▸ Less Garbage Created

Slide 35

Slide 35 text

FOR THOUGHT
NOT GENERAL PURPOSE
A DIFFERENT KIND OF MAP
EFFICIENT ACCESS BY AUTOMATA MATCHING KEYS

Slide 36

Slide 36 text

THANKS
▸ Marty Schoch
  ▸ @mschoch - marty@couchbase.com
▸ Vellum GitHub - https://github.com/couchbaselabs/vellum
▸ Andrew Gallant's Blog
  ▸ Index 1,600,000,000 Keys with Automata and Rust
▸ Papers
  ▸ Direct Construction of Minimal Acyclic Subsequential Transducers