
Finite State Transducers in Go

In this talk the audience will learn about the utility and applications of finite state transducers. First, we'll review finite state machines, a concept many are already familiar with. Then we'll look at finite state automata, and their relationship with regular expressions. Finally, we'll build up to finite state transducers, and discuss a new library named Vellum, which implements them in Go.

https://github.com/couchbaselabs/vellum

Marty Schoch

April 25, 2017

Transcript

  1. FINITE STATE TRANSDUCERS - CAPITAL GO 2017

  2. MINIMAL DETERMINISTIC ACYCLIC FINITE STATE TRANSDUCERS

  3. FINITE STATE MACHINE

  4. FINITE STATE AUTOMATA (FSA)
    MATCHES:
    • cat
    • cats
    • dog

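The automaton on this slide can be sketched as a small transition table. The state numbers below are arbitrary choices for illustration, and the table is not minimized (a minimal automaton would share the two dead-end final states).

```go
package main

import "fmt"

// transitions maps (state, input byte) to the next state.
// The state numbering is an assumption made for this sketch.
var transitions = map[int]map[byte]int{
	0: {'c': 1, 'd': 5},
	1: {'a': 2},
	2: {'t': 3},
	3: {'s': 4},
	5: {'o': 6},
	6: {'g': 7},
}

// final marks the accepting states: "cat" (3), "cats" (4), "dog" (7).
var final = map[int]bool{3: true, 4: true, 7: true}

// accepts feeds key to the automaton one byte at a time and reports
// whether it ends in an accepting state.
func accepts(key string) bool {
	state := 0
	for i := 0; i < len(key); i++ {
		next, ok := transitions[state][key[i]]
		if !ok {
			return false // no transition for this byte
		}
		state = next
	}
	return final[state]
}

func main() {
	for _, k := range []string{"cat", "cats", "dog", "ca", "dogs"} {
		fmt.Printf("%-5s %v\n", k, accepts(k))
	}
}
```

Only the three keys from the slide are accepted; prefixes such as "ca" and extensions such as "dogs" are rejected.
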
  5. REGULAR EXPRESSION
    MATCHES: ^(cat(s)?|dog)$
    • cat
    • cats
    • dog
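The same key set can be checked with Go's standard regexp package. One detail worth illustrating: alternation binds more loosely than the anchors, so an ungrouped pattern like `^cat(s)?|dog$` reads as "starts with cat, or ends with dog". The sketch below compares it with a grouped form; the extra test words are made up for the demonstration.

```go
package main

import (
	"fmt"
	"regexp"
)

// exact groups the alternation so both anchors apply to every branch.
var exact = regexp.MustCompile(`^(cats?|dog)$`)

// loose is the ungrouped form: it means (^cats?) OR (dog$).
var loose = regexp.MustCompile(`^cats?|dog$`)

func main() {
	for _, w := range []string{"cat", "cats", "dog", "catalog", "hotdog"} {
		fmt.Printf("%-8s exact=%v loose=%v\n", w, exact.MatchString(w), loose.MatchString(w))
	}
}
```

With the grouped pattern only cat, cats and dog match; the ungrouped pattern also accepts words like "catalog" and "hotdog".
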

  6. FINITE STATE AUTOMATA (FSA) - ORDERED SET
    ▸ Does it contain a key?
    ▸ Enumerate Ranges of Keys
      ▸ in lexicographic order

  7. FINITE STATE TRANSDUCER (FST)
    Transition with value
    Traversal collects values
    Using SUM() "cat" has value 9

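A minimal sketch of the idea on this slide: each transition carries a value, and a lookup sums the values along the path. The transducer below is hand-built for illustration. Only "cat" = 9 comes from the slide; where the 9 sits on the path and every other number are assumptions.

```go
package main

import "fmt"

// arc is a transition that carries an output value.
type arc struct {
	next int
	out  uint64
}

// arcs is a hand-built transducer for the keys cat, cats and dog.
// Only "cat" = 9 is taken from the slide; the rest is made up.
var arcs = map[int]map[byte]arc{
	0: {'c': {next: 1, out: 9}, 'd': {next: 5, out: 3}},
	1: {'a': {next: 2, out: 0}},
	2: {'t': {next: 3, out: 0}},
	3: {'s': {next: 4, out: 1}},
	5: {'o': {next: 6, out: 0}},
	6: {'g': {next: 7, out: 0}},
}

// final marks the states where a key ends.
var final = map[int]bool{3: true, 4: true, 7: true}

// get follows key through the transducer and sums the outputs on the way.
func get(key string) (uint64, bool) {
	state, sum := 0, uint64(0)
	for i := 0; i < len(key); i++ {
		a, ok := arcs[state][key[i]]
		if !ok {
			return 0, false
		}
		sum += a.out
		state = a.next
	}
	if !final[state] {
		return 0, false
	}
	return sum, true
}

func main() {
	for _, k := range []string{"cat", "cats", "dog", "ca"} {
		v, ok := get(k)
		fmt.Println(k, v, ok)
	}
}
```

Because "cats" shares the whole path of "cat", its value is the shared 9 plus whatever the final `s` transition adds, which is how prefixes share storage for both keys and values.
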
  8. FINITE STATE TRANSDUCER (FST) - ORDERED MAP
    ▸ Does it contain a key?
    ▸ What is the value associated with a key?
    ▸ Enumerate Ranges of Key/Value Pairs
      ▸ in lexicographic order
    * values must satisfy particular algebra (currently just uint64)

  9. FST LIBRARY IN GO (Vellum)
    ▸ Build
      ▸ Insert Keys in Lexicographic Order
      ▸ Bounded Memory
      ▸ Stream Output while Building
    ▸ Use
      ▸ Immutable
      ▸ FST data is memory mapped (mmap)
      ▸ References are slice offsets not pointers (less garbage collector impact)

  10. CONSTRUCTING THE FST
    one - 1
    two - 2
    three - 3
    https://www.flickr.com/photos/wfyurasko/5573962244

  11. CREATING A BUILDER
    builder, err := vellum.New(f, nil)
    f: Any io.Writer
    nil: Options, default nil

  12. INSERT err = builder.Insert([]byte("one"), 1)

  13. INSERT err = builder.Insert([]byte("three"), 3) frozen

  14. INSERT err = builder.Insert([]byte("three"), 3) frozen flushed to disk

  15. INSERT err = builder.Insert([]byte("two"), 2) frozen flushed to disk

  19. CLOSE err = builder.Close() frozen flushed to disk

  22. USING THE FST https://commons.wikimedia.org/wiki/File:Metro_Blur.JPG

  23. OPEN
    fst, err = vellum.Open(path)
    Does NOT read entire FST into memory.
    * load an FST already in memory with vellum.Load(data)

  24. GET val, exists, err = fst.Get(key)

  25. ITERATE
    itr, err := fst.Iterator(start, end)
    for err == nil {
        key, val := itr.Current()
        // do something
        err = itr.Next()
    }

  26. EFFICIENT SEARCHING WITH OTHER AUTOMATA
    itr, err := fst.Search(automaton, start, end)
    Just like iterator, with additional filter.

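One way to picture "iterator with an additional filter" is to walk the key structure and a second automaton in lockstep, skipping any branch the automaton can no longer match. The sketch below uses a plain trie in place of an FST. The `automaton` interface, `prefixAutomaton`, and all helper names are hypothetical and chosen for this illustration; they are not Vellum's API.

```go
package main

import (
	"fmt"
	"sort"
)

// automaton is a hypothetical filter interface for this sketch.
type automaton interface {
	Start() int
	Accept(state int, b byte) int // next state after consuming b
	IsMatch(state int) bool       // key ending here matches
	CanMatch(state int) bool      // false once no extension can match
}

// prefixAutomaton matches every key that begins with prefix.
type prefixAutomaton struct{ prefix string }

func (p prefixAutomaton) Start() int { return 0 }
func (p prefixAutomaton) Accept(s int, b byte) int {
	switch {
	case s < 0:
		return -1 // already dead
	case s == len(p.prefix):
		return s // prefix consumed, any further byte is fine
	case p.prefix[s] == b:
		return s + 1
	default:
		return -1
	}
}
func (p prefixAutomaton) IsMatch(s int) bool  { return s == len(p.prefix) }
func (p prefixAutomaton) CanMatch(s int) bool { return s >= 0 }

// node is a plain trie node standing in for an FST state.
type node struct {
	children map[byte]*node
	final    bool
}

func insert(root *node, key string) {
	n := root
	for i := 0; i < len(key); i++ {
		if n.children == nil {
			n.children = map[byte]*node{}
		}
		c, ok := n.children[key[i]]
		if !ok {
			c = &node{}
			n.children[key[i]] = c
		}
		n = c
	}
	n.final = true
}

// search visits children in byte order, so results come out in
// lexicographic order, and prunes branches the automaton rejects.
func search(n *node, a automaton, state int, prefix []byte, out *[]string) {
	if n.final && a.IsMatch(state) {
		*out = append(*out, string(prefix))
	}
	labels := make([]int, 0, len(n.children))
	for b := range n.children {
		labels = append(labels, int(b))
	}
	sort.Ints(labels)
	for _, l := range labels {
		next := a.Accept(state, byte(l))
		if !a.CanMatch(next) {
			continue // whole subtree skipped
		}
		search(n.children[byte(l)], a, next, append(prefix, byte(l)), out)
	}
}

func main() {
	root := &node{}
	for _, k := range []string{"dog", "cats", "car", "cat", "do"} {
		insert(root, k)
	}
	var hits []string
	a := prefixAutomaton{prefix: "ca"}
	search(root, a, a.Start(), nil, &hits)
	fmt.Println(hits)
}
```

The pruning step is the point of the design: the "d" subtree is never explored for the prefix "ca", so the cost depends on the matching region rather than on the total number of keys.
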
  27. REGULAR EXPRESSION MATCHING
    r, err := regexp.New(`c.*t`)
    itr, err := fst.Search(r, start, end)

  28. FUZZY MATCHING
    fuzzy, err := levenshtein.New("cat", 1)
    itr, err := fst.Search(fuzzy, start, end)
    Matches such as: at, bat, cats

  29. WHAT ABOUT UNICODE?
    c  a  f  e    63 61 66 65
    c  a  f  é    63 61 66 c3 a9
    Edit distance 2 bytes
    Edit distance 1 code point
    Levenshtein/Regex automata have integrated UTF-8 decoding

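The byte versus code point difference can be reproduced with a plain dynamic-programming Levenshtein distance. This is a brute-force sketch for illustration only, not the automaton construction the library uses; the function names are made up here.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// distance is the classic two-row Levenshtein distance over symbols.
func distance(a, b []rune) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(b)]
}

// byteSymbols treats every UTF-8 byte of s as its own symbol.
func byteSymbols(s string) []rune {
	out := make([]rune, len(s))
	for i := 0; i < len(s); i++ {
		out[i] = rune(s[i])
	}
	return out
}

// byteDistance counts edits over raw bytes.
func byteDistance(a, b string) int { return distance(byteSymbols(a), byteSymbols(b)) }

// runeDistance counts edits over decoded code points.
func runeDistance(a, b string) int { return distance([]rune(a), []rune(b)) }

func main() {
	cafe, cafeAcute := "cafe", "caf\u00e9" // second string ends in é (c3 a9)
	fmt.Println("bytes:", byteDistance(cafe, cafeAcute))
	fmt.Println("code points:", runeDistance(cafe, cafeAcute))
}
```

Over bytes, turning `65` into `c3 a9` takes one substitution plus one insertion; over code points it is a single substitution, which is why decoding UTF-8 inside the automaton matters for fuzzy search.
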
  30. SYSTEM DICTIONARY
    /usr/share/dict/words, 235,886 WORDS
    format  bytes      %orig
    txt     2,493,109  -
    fst     1,224,433  49%
    gz      747,078    30%
    Not competing on compression ratio!

  31. REGULAR EXPRESSION SEARCH DICTIONARY https://asciinema.org/a/az29w7apf7kdiioa1yf150g1u

  32. FUZZY SEARCH DICTIONARY https://asciinema.org/a/avqyqbo3uw7y653rlnv7bqivv

  33. INDEXING WIKIPEDIA ARTICLE TITLES 6,726,078 TITLES

  34. FUTURE WORK
    ▸ Bleve Full Text Search Library
      ▸ Next Generation Term Dictionary Format based on Vellum FST
    ▸ Add Support for storing []byte values (not just uint64)
    ▸ Continued Optimizations
      ▸ Less Garbage Created

  35. FOR THOUGHT
    NOT GENERAL PURPOSE
    A DIFFERENT KIND OF MAP
    EFFICIENT ACCESS BY AUTOMATA MATCHING KEYS

  36. THANKS
    ▸ Marty Schoch
      ▸ @mschoch - marty@couchbase.com
    ▸ Vellum GitHub - https://github.com/couchbaselabs/vellum
    ▸ Andrew Gallant's Blog
      ▸ Index 1,600,000,000 Keys with Automata and Rust
    ▸ Papers
      ▸ Direct Construction of Minimal Acyclic Subsequential Transducers