Finite State Transducers in Go

In this talk the audience will learn about the utility and applications of finite state transducers. First, we'll review finite state machines, a concept many are already familiar with. Then we'll look at finite state automata, and their relationship with regular expressions. Finally, we'll build up to finite state transducers, and discuss a new library named Vellum, which implements them in Go.

https://github.com/couchbaselabs/vellum

Marty Schoch

April 25, 2017

Transcript

  1. FINITE STATE TRANSDUCERS
    CAPITAL GO 2017

  2. MINIMAL
    DETERMINISTIC
    ACYCLIC
    FINITE
    STATE
    TRANSDUCERS

  3. FINITE STATE MACHINE

  4. FINITE STATE AUTOMATA (FSA)
    MATCHES:
    • cat
    • cats
    • dog

  5. REGULAR EXPRESSION
    ^(cats?|dog)$
    MATCHES:
    • cat
    • cats
    • dog
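
As a quick check of the automaton/regular-expression correspondence, the sketch below uses Go's standard `regexp` package (not Vellum) to test a few strings against a pattern for the three keys. The grouping in the pattern is an assumption added here so that both anchors apply to every alternative; `accepts` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordSet accepts exactly "cat", "cats" and "dog".
// The parentheses make ^ and $ apply to both alternatives.
var wordSet = regexp.MustCompile(`^(cats?|dog)$`)

// accepts reports whether s is one of the keys the pattern recognizes.
func accepts(s string) bool {
	return wordSet.MatchString(s)
}

func main() {
	for _, s := range []string{"cat", "cats", "dog", "ca", "dogs", "catsdog"} {
		fmt.Printf("%-10q %v\n", s, accepts(s))
	}
}
```

Only the first three inputs are accepted, which is the same language the automaton on the previous slide recognizes.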

  6. FINITE STATE AUTOMATA (FSA) - ORDERED SET
    ▸ Does it contain a key?
    ▸ Enumerate Ranges of Keys
    ▸ in lexicographic order
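
The ordered-set view can be sketched with a small hand-rolled automaton. This is an illustration under stated assumptions, not Vellum's implementation: the `state` type and the `buildSet`, `contains` and `keysInRange` names are hypothetical, the automaton is a plain trie (not minimized), and the range walk visits every state rather than pruning.

```go
package main

import (
	"fmt"
	"sort"
)

// state is one node of a deterministic acyclic automaton.
type state struct {
	final bool
	next  map[byte]*state
}

// buildSet creates a trie-shaped automaton accepting exactly the given keys.
func buildSet(keys ...string) *state {
	root := &state{next: map[byte]*state{}}
	for _, key := range keys {
		cur := root
		for i := 0; i < len(key); i++ {
			n, ok := cur.next[key[i]]
			if !ok {
				n = &state{next: map[byte]*state{}}
				cur.next[key[i]] = n
			}
			cur = n
		}
		cur.final = true
	}
	return root
}

// contains follows one transition per byte and checks for a final state.
func (s *state) contains(key string) bool {
	cur := s
	for i := 0; i < len(key); i++ {
		n, ok := cur.next[key[i]]
		if !ok {
			return false
		}
		cur = n
	}
	return cur.final
}

// keysInRange returns every key k with start <= k < end. Visiting the
// transitions in byte order yields the keys in lexicographic order.
func (s *state) keysInRange(start, end string) []string {
	var out []string
	var walk func(n *state, prefix []byte)
	walk = func(n *state, prefix []byte) {
		if k := string(prefix); n.final && k >= start && k < end {
			out = append(out, k)
		}
		labels := make([]int, 0, len(n.next))
		for b := range n.next {
			labels = append(labels, int(b))
		}
		sort.Ints(labels)
		for _, b := range labels {
			walk(n.next[byte(b)], append(prefix, byte(b)))
		}
	}
	walk(s, nil)
	return out
}

func main() {
	set := buildSet("dog", "cats", "cat")
	fmt.Println(set.contains("cat"), set.contains("ca"))
	fmt.Println(set.keysInRange("cat", "d"))
}
```

The insertion order does not matter for this toy version; the ordering comes entirely from walking transitions in byte order.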

  7. FINITE STATE TRANSDUCER (FST)
    Transition with value
    Traversal collects values
    Using SUM() "cat" has value 9
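
To make "traversal collects values" concrete, here is a minimal hand-built transducer in Go. It is a sketch, not Vellum's data structure: the `arc` and `fst` types and `exampleFST` are hypothetical, and the per-arc outputs are made up for illustration. Only the total of 9 for "cat" comes from the slide; the values for "cats" and "dog" are assumptions.

```go
package main

import "fmt"

// arc is a transition that carries an output value.
type arc struct {
	to  int    // target state
	out uint64 // value added to the running sum when the arc is taken
}

// fst is a tiny transducer; state 0 is the start state.
type fst struct {
	arcs  []map[byte]arc // arcs[s] holds the transitions leaving state s
	final map[int]bool
}

// get walks the key one byte at a time and sums the arc outputs.
// The sum is only meaningful if the walk ends in a final state.
func (f *fst) get(key string) (uint64, bool) {
	state, sum := 0, uint64(0)
	for i := 0; i < len(key); i++ {
		a, ok := f.arcs[state][key[i]]
		if !ok {
			return 0, false
		}
		sum += a.out
		state = a.to
	}
	if !f.final[state] {
		return 0, false
	}
	return sum, true
}

// exampleFST maps cat -> 9, cats -> 11, dog -> 4 (illustrative values).
// "cats" and "dog" end in the same final state, showing suffix sharing.
func exampleFST() *fst {
	return &fst{
		arcs: []map[byte]arc{
			0: {'c': {to: 1, out: 9}, 'd': {to: 5, out: 4}},
			1: {'a': {to: 2}},
			2: {'t': {to: 3}},
			3: {'s': {to: 4, out: 2}},
			4: {},
			5: {'o': {to: 6}},
			6: {'g': {to: 4}},
		},
		final: map[int]bool{3: true, 4: true},
	}
}

func main() {
	f := exampleFST()
	for _, k := range []string{"cat", "cats", "dog", "ca"} {
		v, ok := f.get(k)
		fmt.Println(k, v, ok)
	}
}
```

Because outputs are combined with addition, a shared prefix can carry the common part of the value and later arcs carry only the difference.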

  8. FINITE STATE TRANSDUCER (FST) - ORDERED MAP
    ▸ Does it contain a key?
    ▸ What is the value associated with a key?
    ▸ Enumerate Ranges of Key/Value Pairs
    ▸ in lexicographic order
    * values must satisfy a particular algebra (currently just uint64)

  9. FST LIBRARY IN GO
    Vellum
    ▸ Build
      ▸ Insert Keys in Lexicographic Order
      ▸ Bounded Memory
      ▸ Stream Output while Building
    ▸ Use
      ▸ Immutable
      ▸ FST data is memory mapped (mmap)
      ▸ References are slice offsets not pointers (less garbage collector impact)

  10. CONSTRUCTING
    THE
    FST
    one - 1
    two - 2
    three - 3
    https://www.flickr.com/photos/wfyurasko/5573962244
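
Since keys must be inserted in lexicographic order, the three example pairs need sorting before they reach the builder. The sketch below uses only the standard library; the `entry` type and `sortedEntries` helper are hypothetical names, and byte-wise comparison via `bytes.Compare` is assumed to be the ordering the builder expects.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// entry is a key/value pair waiting to be inserted.
type entry struct {
	key []byte
	val uint64
}

// sortedEntries returns a copy of in, ordered by raw byte comparison of keys.
func sortedEntries(in []entry) []entry {
	out := append([]entry(nil), in...)
	sort.Slice(out, func(i, j int) bool {
		return bytes.Compare(out[i].key, out[j].key) < 0
	})
	return out
}

func main() {
	entries := []entry{
		{[]byte("one"), 1},
		{[]byte("two"), 2},
		{[]byte("three"), 3},
	}
	for _, e := range sortedEntries(entries) {
		fmt.Printf("%s - %d\n", e.key, e.val)
	}
	// Prints:
	// one - 1
	// three - 3
	// two - 2
}
```

This is why "three" is inserted before "two": the keys are ordered as byte strings, not by their values.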

  11. CREATING A BUILDER
    builder, err := vellum.New(f, nil)
    f: any io.Writer
    nil: builder options (nil selects the defaults)

  12. INSERT
    err = builder.Insert([]byte("one"), 1)

  13. INSERT
    err = builder.Insert([]byte("three"), 3)
    frozen

  14. INSERT
    err = builder.Insert([]byte("three"), 3)
    frozen flushed to disk

  15. INSERT
    err = builder.Insert([]byte("two"), 2)
    frozen flushed to disk

  16. INSERT
    err = builder.Insert([]byte("two"), 2)
    frozen flushed to disk

  17. INSERT
    err = builder.Insert([]byte("two"), 2)
    frozen flushed to disk

  18. INSERT
    err = builder.Insert([]byte("two"), 2)
    frozen flushed to disk

  19. CLOSE
    err = builder.Close()
    frozen flushed to disk

  20. CLOSE
    err = builder.Close()
    frozen flushed to disk

  21. CLOSE
    err = builder.Close()
    frozen flushed to disk

  22. USING
    THE
    FST
    https://commons.wikimedia.org/wiki/File:Metro_Blur.JPG

  23. OPEN
    fst, err = vellum.Open(path)
    Does NOT read entire FST into memory.
    * load an FST already in memory with vellum.Load(data)

  24. GET
    val, exists, err = fst.Get(key)

  25. ITERATE
    itr, err := fst.Iterator(start, end)
    for err == nil {
        key, val := itr.Current()
        // do something
        err = itr.Next()
    }

  26. EFFICIENT SEARCHING WITH OTHER AUTOMATA
    itr, err := fst.Search(automaton, start, end)
    Just like iterator,
    with additional filter.

  27. REGULAR EXPRESSION MATCHING
    r, err := regexp.New(`c.*t`)
    itr, err := fst.Search(r, start, end)

  28. FUZZY MATCHING
    fuzzy, err := levenshtein.New("cat", 1)
    itr, err := fst.Search(fuzzy, start, end)
    Matches within edit distance 1 of "cat":
    at
    bat
    cats
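
For intuition about what the fuzzy search returns, the sketch below computes Levenshtein distance directly and filters a small word list. This brute-force approach is swapped in for illustration only; it is not the automaton-based search the slide describes, which avoids scoring every key. The `distance` and `fuzzyFilter` names and the word list are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// distance is the classic dynamic-programming Levenshtein distance,
// counting insertions, deletions and substitutions of single runes.
func distance(a, b []rune) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// fuzzyFilter keeps the words within maxDist edits of query.
func fuzzyFilter(words []string, query string, maxDist int) []string {
	var out []string
	for _, w := range words {
		if distance([]rune(w), []rune(query)) <= maxDist {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	words := []string{"at", "bat", "cats", "coats", "dog"}
	fmt.Println(fuzzyFilter(words, "cat", 1)) // [at bat cats]
}
```

Each match is one edit away from "cat": a deletion ("at"), a substitution ("bat") and an insertion ("cats").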

  29. WHAT ABOUT UNICODE?
    c  a  f  é      63 61 66 c3 a9
    c  a  f  e      63 61 66 65
    Edit distance 2 bytes
    Edit distance 1 code point
    Levenshtein/Regex automata have integrated UTF-8 decoding
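
The byte versus code point difference can be reproduced with a short standard-library sketch. It compares the same Levenshtein computation over UTF-8 bytes and over decoded runes; the helper names (`distance`, `byteDistance`, `runeDistance`) are hypothetical, and this is a plain dynamic-programming calculation rather than the automaton Vellum builds.

```go
package main

import "fmt"

// distance is a Levenshtein distance over arbitrary symbol sequences,
// represented here as []rune.
func distance(a, b []rune) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, or a free match
			if a[i-1] != b[j-1] {
				best++
			}
			if prev[j]+1 < best { // deletion
				best = prev[j] + 1
			}
			if cur[j-1]+1 < best { // insertion
				best = cur[j-1] + 1
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// bytesAsSymbols treats every UTF-8 byte as its own symbol.
func bytesAsSymbols(s string) []rune {
	out := make([]rune, len(s))
	for i := 0; i < len(s); i++ {
		out[i] = rune(s[i])
	}
	return out
}

// byteDistance counts edits over raw UTF-8 bytes.
func byteDistance(a, b string) int {
	return distance(bytesAsSymbols(a), bytesAsSymbols(b))
}

// runeDistance counts edits over decoded Unicode code points.
func runeDistance(a, b string) int {
	return distance([]rune(a), []rune(b))
}

func main() {
	accented := "caf\u00e9" // "café": bytes 63 61 66 c3 a9
	plain := "cafe"         // bytes 63 61 66 65
	fmt.Println(byteDistance(accented, plain)) // 2
	fmt.Println(runeDistance(accented, plain)) // 1
}
```

A search for "cafe" with distance 1 would miss "café" if edits were counted in bytes, which is why decoding UTF-8 inside the automaton matters.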

  30. SYSTEM DICTIONARY
    /usr/share/dict/words (235,886 words)

           bytes       %orig
    txt    2,493,109   -
    fst    1,224,433   49%
    gz       747,078   30%

    Not competing on compression ratio!

  31. REGULAR EXPRESSION SEARCH DICTIONARY
    https://asciinema.org/a/az29w7apf7kdiioa1yf150g1u

  32. FUZZY SEARCH DICTIONARY
    https://asciinema.org/a/avqyqbo3uw7y653rlnv7bqivv

  33. INDEXING WIKIPEDIA ARTICLE TITLES
    6,726,078 TITLES

  34. FUTURE WORK
    ▸ Bleve Full Text Search Library
      ▸ Next Generation Term Dictionary Format based on Vellum FST
    ▸ Add Support for storing []byte values (not just uint64)
    ▸ Continued Optimizations
      ▸ Less Garbage Created

  35. FOR THOUGHT
    NOT GENERAL PURPOSE
    A DIFFERENT KIND OF MAP
    EFFICIENT ACCESS BY
    AUTOMATA MATCHING
    KEYS

  36. THANKS
    ▸ Marty Schoch
    ▸ @mschoch
    ▸ Vellum Github - https://github.com/couchbaselabs/vellum
    ▸ Andrew Gallant's Blog
    ▸ Index 1,600,000,000 Keys with Automata and Rust
    ▸ Papers
    ▸ Direct Construction of Minimal Acyclic Subsequential Transducers
