Finite State Transducers in Go

In this talk the audience will learn about the utility and applications of finite state transducers. First, we'll review finite state machines, a concept many are already familiar with. Then we'll look at finite state automata, and their relationship with regular expressions. Finally, we'll build up to finite state transducers, and discuss a new library named Vellum, which implements them in Go.

https://github.com/couchbaselabs/vellum

Marty Schoch

April 25, 2017

Transcript

  1. FINITE STATE AUTOMATA (FSA) - ORDERED SET
     ▸ Does it contain a key?
     ▸ Enumerate Ranges of Keys
       ▸ in lexicographic order
  2. FINITE STATE TRANSDUCER (FST) - ORDERED MAP
     ▸ Does it contain a key?
     ▸ What is the value associated with a key?
     ▸ Enumerate Ranges of Key/Value Pairs
       ▸ in lexicographic order
     * values must satisfy particular algebra (currently just uint64)
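The three ordered-map operations listed on this slide can be illustrated with a sorted slice and binary search. This is only a sketch of the interface an FST exposes, not of how an FST stores data; the names `orderedMap`, `Get`, and `Range` are hypothetical, and the half-open range (start inclusive, end exclusive) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one key/value pair; uint64 mirrors the value type named on the slide.
type entry struct {
	key string
	val uint64
}

// orderedMap is a hypothetical stand-in: a slice of entries sorted by key.
type orderedMap []entry

// newOrderedMap copies m into a slice sorted in lexicographic (byte) order.
func newOrderedMap(m map[string]uint64) orderedMap {
	om := make(orderedMap, 0, len(m))
	for k, v := range m {
		om = append(om, entry{key: k, val: v})
	}
	sort.Slice(om, func(i, j int) bool { return om[i].key < om[j].key })
	return om
}

// lowerBound returns the index of the first entry whose key is >= key.
func (om orderedMap) lowerBound(key string) int {
	return sort.Search(len(om), func(i int) bool { return om[i].key >= key })
}

// Get answers "does it contain a key?" and "what is its value?".
func (om orderedMap) Get(key string) (uint64, bool) {
	i := om.lowerBound(key)
	if i < len(om) && om[i].key == key {
		return om[i].val, true
	}
	return 0, false
}

// Range returns the entries with start <= key < end, in lexicographic order.
func (om orderedMap) Range(start, end string) []entry {
	lo, hi := om.lowerBound(start), om.lowerBound(end)
	if lo >= hi {
		return nil
	}
	return om[lo:hi]
}

func main() {
	om := newOrderedMap(map[string]uint64{"one": 1, "two": 2, "three": 3})
	v, ok := om.Get("two")
	fmt.Println(v, ok)
	for _, e := range om.Range("p", "u") {
		fmt.Println(e.key, e.val) // "three" sorts before "two"
	}
}
```

A sorted slice stores every key in full. The FST answers the same questions over a compact, shared representation of the keys.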
  3. FST LIBRARY IN GO
     ▸ Build
       ▸ Insert Keys in Lexicographic Order
       ▸ Bounded Memory
       ▸ Stream Output while Building
     ▸ Use
       ▸ Immutable
       ▸ FST data is memory mapped (mmap)
       ▸ References are slice offsets not pointers (less garbage collector impact)
     Vellum
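The build-side constraints (sorted inserts, bounded memory, streamed output) can be sketched with a toy builder that remembers only the previous key and writes each record as soon as it arrives. This is not Vellum's builder and it does not produce an FST. The type `orderedBuilder`, the helper `buildAll`, the flat record layout, and the rejection of duplicate keys are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

var errOutOfOrder = errors.New("keys must arrive in increasing lexicographic order")

// orderedBuilder is a hypothetical builder: it holds only the last key in
// memory and streams every accepted record straight to w.
type orderedBuilder struct {
	w       io.Writer
	lastKey []byte
	started bool
}

// Insert rejects keys that do not sort strictly after the previous key, then
// writes a flat record: uvarint(key length), key bytes, uvarint(value).
func (b *orderedBuilder) Insert(key []byte, val uint64) error {
	if b.started && bytes.Compare(key, b.lastKey) <= 0 {
		return errOutOfOrder
	}
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], uint64(len(key)))
	if _, err := b.w.Write(buf[:n]); err != nil {
		return err
	}
	if _, err := b.w.Write(key); err != nil {
		return err
	}
	n = binary.PutUvarint(buf[:], val)
	if _, err := b.w.Write(buf[:n]); err != nil {
		return err
	}
	b.lastKey = append(b.lastKey[:0], key...) // reuse the buffer, keep memory bounded
	b.started = true
	return nil
}

// buildAll inserts keys in the given order with values 1, 2, 3, ...
func buildAll(keys []string) ([]byte, error) {
	var out bytes.Buffer
	b := &orderedBuilder{w: &out}
	for i, k := range keys {
		if err := b.Insert([]byte(k), uint64(i+1)); err != nil {
			return nil, err
		}
	}
	return out.Bytes(), nil
}

func main() {
	out, err := buildAll([]string{"one", "three", "two"}) // already sorted
	fmt.Println(len(out), err)
	_, err = buildAll([]string{"one", "two", "three"}) // "three" sorts before "two"
	fmt.Println(err)
}
```

Requiring sorted input is what lets a builder of this shape forget everything except the most recent key.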
  4. CONSTRUCTING THE FST
     one - 1
     two - 2
     three - 3
     https://www.flickr.com/photos/wfyurasko/5573962244
  5. OPEN
     fst, err = vellum.Open(path)
     Does NOT read entire FST into memory.
     * load an FST already in memory with vellum.Load(data)
  6. ITERATE
     itr, err := fst.Iterator(start, end)
     for err == nil {
         key, val := itr.Current()
         // do something
         err = itr.Next()
     }
  7. FUZZY MATCHING
     fuzzy, err := levenshtein.New("cat", 1)
     itr, err := fst.Search(fuzzy, start, end)
     Example matches: at, bat, cats
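To see why "at", "bat", and "cats" all match "cat" at distance 1, the edit distance can be computed by brute force for every key. The sketch below uses the classic dynamic-programming algorithm and a linear scan, which is a different technique from the Levenshtein automaton the slide passes to `fst.Search`. The function names `levenshteinDistance` and `fuzzyMatch` and the key list are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshteinDistance counts the insertions, deletions, and substitutions of
// code points needed to turn a into b, keeping two rows of the DP table.
func levenshteinDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

// fuzzyMatch returns the keys within maxDist edits of query, in input order.
func fuzzyMatch(keys []string, query string, maxDist int) []string {
	var out []string
	for _, k := range keys {
		if levenshteinDistance(k, query) <= maxDist {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"at", "bat", "cats", "dog", "zebra"}
	fmt.Println(fuzzyMatch(keys, "cat", 1)) // [at bat cats]
}
```

The scan visits every key. An automaton-driven search over an FST can instead skip whole subtrees of keys that can no longer match.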
  8. WHAT ABOUT UNICODE?
     c  a  f  é    63 61 66 c3 a9
     c  a  f  e    63 61 66 65
     ▸ Edit distance 2 bytes
     ▸ Edit distance 1 code point
     ▸ Levenshtein/Regex automata have integrated UTF-8 decoding
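The byte versus code point difference on this slide can be reproduced with one edit-distance routine applied to two views of the same strings. This is a minimal sketch that assumes Go 1.18 or later for generics. The helper names `editDistance`, `byteDistance`, and `runeDistance` are hypothetical and unrelated to Vellum's automata.

```go
package main

import "fmt"

// editDistance is the two-row Levenshtein DP over any comparable element type,
// so the same code can count edits in bytes or in code points.
func editDistance[T comparable](a, b []T) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, free when the elements match
			if a[i-1] != b[j-1] {
				best++
			}
			if d := prev[j] + 1; d < best { // deletion
				best = d
			}
			if d := cur[j-1] + 1; d < best { // insertion
				best = d
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// byteDistance compares the raw UTF-8 bytes of the two strings.
func byteDistance(a, b string) int { return editDistance([]byte(a), []byte(b)) }

// runeDistance compares the decoded Unicode code points.
func runeDistance(a, b string) int { return editDistance([]rune(a), []rune(b)) }

func main() {
	a, b := "caf\u00e9", "cafe" // \u00e9 is é, encoded in UTF-8 as c3 a9
	fmt.Printf("% x\n", []byte(a)) // 63 61 66 c3 a9
	fmt.Printf("% x\n", []byte(b)) // 63 61 66 65
	fmt.Println(byteDistance(a, b), runeDistance(a, b)) // 2 1
}
```

A user asking for "distance 1" almost always means one character, which is why decoding UTF-8 inside the automaton matters.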
  9. SYSTEM DICTIONARY
     /usr/share/dict/words, 235,886 words

     format   bytes       % of original
     txt      2,493,109   -
     fst      1,224,433   49%
     gz       747,078     30%

     Not competing on compression ratio!
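The percentages on this slide follow directly from the byte counts. A small sketch of the arithmetic, using only the sizes quoted above (the helper name `percentOfOriginal` is hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// percentOfOriginal returns size as a percentage of orig, rounded to the
// nearest whole number.
func percentOfOriginal(size, orig int64) int {
	return int(math.Round(float64(size) * 100 / float64(orig)))
}

func main() {
	const txt = 2493109 // plain text word list, in bytes
	const fst = 1224433 // FST, in bytes
	const gz = 747078   // gzip-compressed text, in bytes

	fmt.Printf("fst: %d%%\n", percentOfOriginal(fst, txt)) // fst: 49%
	fmt.Printf("gz:  %d%%\n", percentOfOriginal(gz, txt))  // gz:  30%
}
```

The gzip output is smaller but cannot be queried without decompressing it, while the FST supports lookups and range scans in place.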
  10. FUTURE WORK
     ▸ Bleve Full Text Search Library
       ▸ Next Generation Term Dictionary Format based on Vellum FST
     ▸ Add Support for storing []byte values (not just uint64)
     ▸ Continued Optimizations
       ▸ Less Garbage Created
  11. FOR THOUGHT
     NOT GENERAL PURPOSE
     A DIFFERENT KIND OF MAP
     EFFICIENT ACCESS BY AUTOMATA MATCHING KEYS
  12. THANKS
     ▸ Marty Schoch
       ▸ @mschoch
     ▸ Vellum Github - https://github.com/couchbaselabs/vellum
     ▸ Andrew Gallant's Blog
       ▸ Index 1,600,000,000 Keys with Automata and Rust
     ▸ Papers
       ▸ Direct Construction of Minimal Acyclic Subsequential Transducers