Minimal Acyclic Subsequential Transducers Even a Dog Can Understand / Introduction to Minimal Acyclic Subsequential Transducer

Slides from a lightning talk (LT) given at a Hatena tech study meeting.

Takuya Asano

June 27, 2019

Transcript

  1. Minimal Acyclic Subsequential Transducers Even a Dog Can Understand
    2019-06-27, Hatena Tech Study Meeting, id:takuya-a

  2. FSA and FST
    • FSA (Finite State Automaton)
      • Returns a bool telling whether the input sequence is accepted
    • FST (Finite State Transducer)
      • A kind of FSA
      • Returns an output sequence when the input sequence is accepted
    • A Minimal Acyclic Subsequential Transducer is a kind of FST
    Examples: FSA { "onk" }, FST { "onk" => "おんく" }
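To make the difference concrete, here is a minimal sketch of both machines for the single key "onk" from the slide. The state numbering, the table names, and the choice to emit the whole output on the first transition are assumptions made for illustration, not something taken from the talk.

```python
# Hypothetical hand-built machine for the single key "onk".
# States are integers; state 3 is the only final state.
TRANSITIONS = {(0, "o"): 1, (1, "n"): 2, (2, "k"): 3}
# Assumption: the whole output is attached to the first transition.
OUTPUTS = {(0, "o"): "おんく"}
FINAL_STATES = {3}


def fsa_accepts(word):
    """FSA: return True if the word is accepted, False otherwise."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in FINAL_STATES


def fst_transduce(word):
    """FST: return the output string if the word is accepted, else None."""
    state, out = 0, []
    for ch in word:
        nxt = TRANSITIONS.get((state, ch))
        if nxt is None:
            return None
        out.append(OUTPUTS.get((state, ch), ""))
        state = nxt
    return "".join(out) if state in FINAL_STATES else None
```

Both functions walk the same transitions; the FST additionally concatenates the outputs it meets along the way.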
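  3. What FSTs are good for
    • Usable for so-called "dictionary lookup"
      • Stores key/value pairs (usable like a hash in Perl)
      • Lookup is fast because it only follows states
        • Especially advantageous for common prefix search
        • Exact match search works too, of course
    • Memory-efficient because prefixes and suffixes are shared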
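Common prefix search and exact match can be sketched as follows. For brevity this uses a plain dict-based trie rather than an FST; the lookup loop (follow one transition per character, report a hit at every final state passed) has the same shape. The function names and the use of the empty string as a value marker are assumptions for this sketch.

```python
def build_trie(pairs):
    """Build a dict-based trie; the "" entry of a node holds the stored value."""
    root = {}
    for key, value in pairs:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = value
    return root


def common_prefix_search(root, text):
    """Return (prefix, value) for every stored key that is a prefix of text."""
    results, node = [], root
    for i, ch in enumerate(text):
        node = node.get(ch)
        if node is None:
            break
        if "" in node:
            results.append((text[: i + 1], node[""]))
    return results


def exact_match(root, key):
    """Return the value stored for key, or None if key is absent."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("")
```

A single left-to-right pass over the text finds every matching key, which is why this operation suits dictionary lookup in a morphological analyzer.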
  4. Where FSTs are applied
    • As a search engine dictionary
      • Used in many places as a core algorithm of Apache Lucene
      • Mainly used to look up terms
    • As a morphological analyzer dictionary
      • Adopted by Janome (Python) and Kuromoji (Java)
      • Fast common prefix search is required
    • As a language model for speech recognition
      • Weighted FSTs (WFST) are used
      • https://www.slideshare.net/JiroNishitoba/wfst-61929888
  5. Minimal Acyclic Subsequential Transducer
    • Minimal: smallest
    • Acyclic: without loops
    • Subsequential: of sub-sequences (substrings)
    • Transducer: converter
    "takuya" => "a"
    "takaya" => "n"
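  6. TRIE
    • A data structure that shares prefixes only
      • It forms a tree
    • Suffixes cannot be shared
      • Partial sharing is possible with a technique called the TAIL array
    (Figure: FST and TRIE side by side)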
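A small sketch shows why a trie shares prefixes but not suffixes. It builds a trie for the two keys from the earlier slide, "takuya" and "takaya", and counts states and transitions. The nested-dict representation and the helper names are assumptions for illustration.

```python
def build_trie(keys):
    """Return a trie as nested dicts, one dict per state."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
    return root


def count_states_and_transitions(node):
    """Count states (dicts) and transitions (dict entries) from node downward."""
    states, transitions = 1, len(node)
    for child in node.values():
        s, t = count_states_and_transitions(child)
        states += s
        transitions += t
    return states, transitions
```

For these two keys the shared prefix "tak" is stored once, but the common suffix "ya" is stored twice, giving 10 states and 9 transitions. Because a trie is a tree, the number of transitions is always the number of states minus one.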
  7. Building a Minimal Acyclic Subsequential Transducer
    • There is an algorithm that incrementally builds the theoretically minimal FST
    • For details, read the following paper!
      • Mihov & Maurel (2001), Direct Construction of Minimal Acyclic Subsequential Transducers
        http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
      • Watch out: line 46 of the pseudocode in the paper is wrong
        • Wrong: SET_OUTPUT
        • Correct: SET_STATE_OUTPUT
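The incremental algorithm from the paper is too long to reproduce here. As a much simpler stand-in, the sketch below first builds a full trie and then merges equivalent states bottom-up. This is not the Mihov & Maurel algorithm: it is offline rather than incremental, and it handles keys only, with no outputs. It only illustrates what "minimal" means, namely that states with the same set of remaining suffixes are stored once. All names are hypothetical.

```python
def build_trie(keys):
    """Build a trie of nested dicts; the "" entry flags a final state."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def minimize(node, registry):
    """Return the canonical id of node, merging states with equal signatures.

    A signature is (is_final, sorted outgoing (label, canonical child id)).
    Children are canonicalized first, so equal signatures mean equal suffix sets.
    """
    signature = (
        "" in node,
        tuple(sorted((ch, minimize(child, registry))
                     for ch, child in node.items() if ch != "")),
    )
    # setdefault assigns a fresh id only for a signature not seen before.
    return registry.setdefault(signature, len(registry))


def minimal_size(keys):
    """Return (states, transitions) of the minimized acyclic automaton."""
    registry = {}
    minimize(build_trie(keys), registry)
    states = len(registry)
    transitions = sum(len(edges) for _, edges in registry)
    return states, transitions
```

For "takuya" and "takaya" this gives 7 states and 7 transitions, because the two "ya" tails collapse into one. The incremental algorithm reaches a minimal machine without ever holding the whole trie in memory, and it also decides where outputs are placed.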
  8. Implementations of Minimal Acyclic Subsequential Transducer
    • https://github.com/takuyaa/cdarts
      • Written in Java
      • Made to compare against Lucene's FST and jdartsclone
    • Other implementations
      • Java: https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/fst
      • Go: https://github.com/ikawaha/mast
      • Python: https://github.com/mocobeta/janome/blob/master/janome/fst.py
      • Rust: https://github.com/BurntSushi/fst
  9. Experiments!

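  10. TRIE and FST for frequent English words
    • Built with Lucene's stop words as keys and sequential numbers as values
    • Total keys: 33
    • Total characters: 97
    • TRIE
      • States: 58
      • Transitions: 57
    • FST (Minimal Acyclic Subsequential Transducer)
      • States: 25
      • Transitions: 51
    (Figure: FST and TRIE side by side)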
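  11. TRIE and FST for a Pokémon English-to-Japanese transducer
    • Built with the English Pokémon names as keys and the Japanese names as values
    • Total keys: 151
    • Total characters: 1103
    • TRIE
      • States: 809
      • Transitions: 808
    • FST (Minimal Acyclic Subsequential Transducer)
      • States: 459
      • Transitions: 604
    (Figure: FST and TRIE side by side)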
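The size reductions implied by the counts on slides 10 and 11 can be worked out with a few lines of arithmetic. The numbers are copied from the slides; the dictionary layout and function names are just for this sketch.

```python
# (states, transitions) as reported on slides 10 and 11.
RESULTS = {
    "Lucene stop words": {"TRIE": (58, 57), "FST": (25, 51)},
    "Pokémon names": {"TRIE": (809, 808), "FST": (459, 604)},
}


def reduction(before, after):
    """Percentage reduction from before to after, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


def summarize(results):
    """Map each data set to (state reduction %, transition reduction %)."""
    summary = {}
    for name, r in results.items():
        (trie_states, trie_trans), (fst_states, fst_trans) = r["TRIE"], r["FST"]
        summary[name] = (
            reduction(trie_states, fst_states),
            reduction(trie_trans, fst_trans),
        )
    return summary
```

By this calculation the FST has about 56.9% fewer states and 10.5% fewer transitions than the trie for the stop words, and about 43.3% fewer states and 25.2% fewer transitions for the Pokémon names. In both tries the transition count is one less than the state count, as expected for a tree.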
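  12. A magnified view of the FST
    Note: strings are encoded in UTF-8, and sometimes only the first byte of a character is shared, so the labels look garbled in the rendering.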

  13. References
    • Mihov & Maurel (2001), Direct Construction of Minimal Acyclic Subsequential Transducers
      http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
    • Finite-state automata and directed acyclic graphs
      http://www.jandaciuk.pl/Fsm_algorithms/
    • Changing Bits: Using Finite State Transducers in Lucene
      http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
    • moco(beta)'s backup: [Translation] Using Finite State Transducers in Lucene
      https://mocobeta-backup.tumblr.com/post/105777650158/using-finite-state-transducers-in-lucene
    • Index 1,600,000,000 Keys with Automata and Rust - Andrew Gallant's Blog
      https://blog.burntsushi.net/transducers/
    • moco(beta)'s backup: The Lucene FST algorithm (1): illustrated edition
      https://mocobeta-backup.tumblr.com/post/111076688132/lucene-fst-1
    • moco(beta)'s backup: The Lucene FST algorithm (2): implementation edition
      https://mocobeta-backup.tumblr.com/post/113693778372/lucene-fst-2
    • I tried implementing the FST used in Lucene (regular expression matching: an invitation to the VM approach) - Qiita
      https://qiita.com/ikawaha/items/be95304a803020e1b2d1
    • Playing with Minimal Acyclic Subsequential Transducer - Negative/Positive Thinking
      https://jetbead.hatenablog.com/entry/20151014/1444756877