MINIMAL DETERMINISTIC ACYCLIC FINITE STATE TRANSDUCERS
Slide 3
FINITE STATE MACHINE
Slide 4
FINITE STATE AUTOMATA (FSA)
MATCHES:
• cat
• cats
• dog
Slide 5
REGULAR EXPRESSION
^(cat(s)?|dog)$
MATCHES:
• cat
• cats
• dog
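A regular expression that accepts exactly these three words needs its anchors grouped around the alternation, because `|` binds more loosely than `^` and `$`. The sketch below uses Go's standard `regexp` package to compare an ungrouped pattern with a grouped one; the extra test words ("catalog", "hotdog") are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// ungrouped is parsed as (^cat(s)?) | (dog$): each anchor applies to one branch only.
var ungrouped = regexp.MustCompile(`^cat(s)?|dog$`)

// grouped applies both anchors to the whole alternation.
var grouped = regexp.MustCompile(`^(cat(s)?|dog)$`)

func main() {
	for _, w := range []string{"cat", "cats", "dog", "catalog", "hotdog"} {
		fmt.Printf("%-8s ungrouped=%-5t grouped=%t\n",
			w, ungrouped.MatchString(w), grouped.MatchString(w))
	}
}
```

Only the grouped form rejects "catalog" and "hotdog", so it is the one that corresponds to the three-word automaton.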
Slide 6
FINITE STATE AUTOMATA (FSA) - ORDERED SET
▸ Does it contain a key?
▸ Enumerate Ranges of Keys
▸ in lexicographic order
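The two set operations can be sketched with a plain trie, which is a deterministic acyclic automaton that is not minimized. This is an illustration only, not how Vellum stores its data; the names `node`, `buildSet`, `contains` and `keysInRange` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// node is one automaton state; final marks the end of an accepted key.
type node struct {
	next  map[byte]*node
	final bool
}

// buildSet creates a trie with one transition per key byte.
func buildSet(keys ...string) *node {
	root := &node{next: map[byte]*node{}}
	for _, key := range keys {
		cur := root
		for i := 0; i < len(key); i++ {
			child, ok := cur.next[key[i]]
			if !ok {
				child = &node{next: map[byte]*node{}}
				cur.next[key[i]] = child
			}
			cur = child
		}
		cur.final = true
	}
	return root
}

// contains reports whether following the key's bytes ends in a final state.
func (n *node) contains(key string) bool {
	for i := 0; i < len(key); i++ {
		child, ok := n.next[key[i]]
		if !ok {
			return false
		}
		n = child
	}
	return n.final
}

// keysInRange returns every key k with start <= k < end in lexicographic order.
// An empty end means no upper bound. For brevity it walks the whole trie;
// a real implementation would skip subtrees outside the range.
func (n *node) keysInRange(start, end string) []string {
	var out []string
	var walk func(cur *node, prefix []byte)
	walk = func(cur *node, prefix []byte) {
		k := string(prefix)
		if cur.final && k >= start && (end == "" || k < end) {
			out = append(out, k)
		}
		labels := make([]int, 0, len(cur.next))
		for b := range cur.next {
			labels = append(labels, int(b))
		}
		sort.Ints(labels) // visit transitions in byte order
		for _, b := range labels {
			walk(cur.next[byte(b)], append(prefix, byte(b)))
		}
	}
	walk(n, nil)
	return out
}

func main() {
	set := buildSet("dog", "cat", "cats")
	fmt.Println(set.contains("cat"), set.contains("ca"))
	fmt.Println(set.keysInRange("cats", ""))
}
```

Visiting a state before its outgoing transitions, and the transitions in byte order, is what yields lexicographic order without sorting the keys themselves.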
Slide 7
FINITE STATE TRANSDUCER (FST)
Transition with value
Traversal collects values
Using SUM() "cat" has value 9
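A minimal sketch of that traversal: each transition carries a `uint64` output, and looking up a key sums the outputs along its path. The per-transition values (4, 3, 2) are assumed for illustration, since only the total of 9 is given here, and the types `state` and `edge` are hypothetical.

```go
package main

import "fmt"

// edge is a transition that carries an output value.
type edge struct {
	to  *state
	out uint64
}

// state is one transducer state; final marks the end of an accepted key.
type state struct {
	next  map[byte]edge
	final bool
}

// get follows the key byte by byte and sums the outputs it collects.
// It reports false if the key is not accepted.
func get(start *state, key string) (uint64, bool) {
	var sum uint64
	cur := start
	for i := 0; i < len(key); i++ {
		e, ok := cur.next[key[i]]
		if !ok {
			return 0, false
		}
		sum += e.out
		cur = e.to
	}
	if !cur.final {
		return 0, false
	}
	return sum, true
}

// catFST hand-builds a transducer that accepts only "cat".
// The outputs 4 + 3 + 2 are assumed; they are chosen to total 9.
func catFST() *state {
	s3 := &state{final: true}
	s2 := &state{next: map[byte]edge{'t': {to: s3, out: 2}}}
	s1 := &state{next: map[byte]edge{'a': {to: s2, out: 3}}}
	return &state{next: map[byte]edge{'c': {to: s1, out: 4}}}
}

func main() {
	v, ok := get(catFST(), "cat")
	fmt.Println(v, ok)
}
```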
Slide 8
FINITE STATE TRANSDUCER (FST) - ORDERED MAP
▸ Does it contain a key?
▸ What is the value associated with a key?
▸ Enumerate Ranges of Key/Value Pairs
▸ in lexicographic order
* Values must satisfy a particular algebra (currently only uint64 is supported)
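One way to read the algebra requirement, assuming outputs are combined by addition: keys that share a prefix can share part of their value on the common transitions, which needs a "common prefix" operation (minimum), a way to remove it (subtraction), and a way to recombine (addition). The sketch below shows those three operations for `uint64`; the function names and the example values 9 and 12 are hypothetical, and this is not Vellum's API.

```go
package main

import "fmt"

// prefix returns the largest output two values can share under addition.
func prefix(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

// subtract removes an already-shared prefix p from v; it requires p <= v.
func subtract(v, p uint64) uint64 { return v - p }

// concat recombines outputs collected along a path.
func concat(a, b uint64) uint64 { return a + b }

// split divides two values into a shared part and the two remainders.
func split(a, b uint64) (shared, restA, restB uint64) {
	shared = prefix(a, b)
	return shared, subtract(a, shared), subtract(b, shared)
}

func main() {
	// Two keys with a common prefix and values 9 and 12 (assumed values).
	shared, restA, restB := split(9, 12)
	fmt.Println(shared, restA, restB)
	fmt.Println(concat(shared, restA), concat(shared, restB))
}
```

The shared part can sit on the common transitions and the remainders on the transitions where the keys diverge, so each lookup still sums to the original value.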
Slide 9
FST LIBRARY IN GO
▸ Build
  ▸ Insert Keys in Lexicographic Order
  ▸ Bounded Memory
  ▸ Stream Output while Building
▸ Use
  ▸ Immutable
  ▸ FST data is memory mapped (mmap)
  ▸ References are slice offsets, not pointers (less garbage collector impact)
Vellum
Slide 10
CONSTRUCTING THE FST
one - 1
two - 2
three - 3
https://www.flickr.com/photos/wfyurasko/5573962244
Slide 11
CREATING A BUILDER
builder, err := vellum.New(f, nil)
▸ f: Any io.Writer
▸ nil: Options, default nil
FUZZY MATCHING
fuzzy, err := levenshtein.New("cat", 1)
itr, err := fst.Search(fuzzy, start, end)
MATCHES:
• at
• bat
• cats
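To see why these three keys match "cat" at distance 1, the brute-force sketch below computes the Levenshtein distance from the query to every key and keeps those within the limit. It is a reference check only: an automaton-based search avoids visiting every key, and the key set and function names here are assumptions.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in code points.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution (or match)
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// fuzzyFilter returns the keys within maxDist edits of query, in input order.
func fuzzyFilter(keys []string, query string, maxDist int) []string {
	var out []string
	for _, k := range keys {
		if levenshtein(query, k) <= maxDist {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"at", "bat", "catalog", "cats", "dog"} // assumed key set
	fmt.Println(fuzzyFilter(keys, "cat", 1))
}
```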
Slide 29
WHAT ABOUT UNICODE?
c a f é → 63 61 66 c3 a9
c a f e → 63 61 66 65
▸ Edit distance 2 bytes
▸ Edit distance 1 code point
▸ Levenshtein/Regex automata have integrated UTF-8 decoding
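The difference between the two distances can be checked with a generic edit-distance function applied once to the raw bytes and once to the decoded code points. This is a plain dynamic-programming sketch with a hypothetical function name; it assumes Go 1.18 or later for generics and is not the automaton construction itself.

```go
package main

import "fmt"

// editDistance returns the Levenshtein distance between two sequences
// of any comparable element type (bytes, runes, ...).
func editDistance[T comparable](a, b []T) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // match, no edit needed
			if a[i-1] != b[j-1] {
				best++ // substitution
			}
			if d := prev[j] + 1; d < best {
				best = d // deletion
			}
			if d := cur[j-1] + 1; d < best {
				best = d // insertion
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(b)]
}

func main() {
	withAccent, plain := "caf\u00e9", "cafe"
	fmt.Printf("% x\n% x\n", withAccent, plain)
	fmt.Println("bytes:", editDistance([]byte(withAccent), []byte(plain)))
	fmt.Println("code points:", editDistance([]rune(withAccent), []rune(plain)))
}
```

Measured over bytes, replacing é costs one substitution plus one deletion; measured over code points it is a single substitution, which is the distance a user expects.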
Slide 30
SYSTEM DICTIONARY
/usr/share/dict/words (235,886 WORDS)
      bytes       %orig
txt   2,493,109   -
fst   1,224,433   49%
gz    747,078     30%
Not competing on compression ratio!
INDEXING WIKIPEDIA ARTICLE TITLES
6,726,078 TITLES
Slide 34
FUTURE WORK
▸ Bleve Full Text Search Library
  ▸ Next Generation Term Dictionary Format based on Vellum FST
▸ Add Support for storing []byte values (not just uint64)
▸ Continued Optimizations
  ▸ Less Garbage Created
Slide 35
FOR THOUGHT
NOT GENERAL PURPOSE
A DIFFERENT KIND OF MAP
EFFICIENT ACCESS BY AUTOMATA MATCHING KEYS
Slide 36
THANKS
▸ Marty Schoch
▸ @mschoch - marty@couchbase.com
▸ Vellum GitHub - https://github.com/couchbaselabs/vellum
▸ Andrew Gallant's Blog
▸ Index 1,600,000,000 Keys with Automata and Rust
▸ Papers
▸ Direct Construction of Minimal Acyclic Subsequential Transducers