
Learning Morphological Analysis by Building It in Python


Slides from a talk given at PyCon JP 2015.


Tomoko Uchida

October 11, 2015


Transcript

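  1. Who am I
    Tomoko Uchida (打田智子) @moco_beta
    Formerly: Python engineer at a web services company
    Currently: deployment and operations support for the search engines Solr and Elasticsearch, at Rondhuit Co., Ltd. (ロンウイット)
    Day to day I mostly write Java, with some Scala. It would be nice to get some Python work :-)
    Interests: information retrieval and related technologies; natural language processing and machine learning. Currently studying to graduate from being a wannabe 🐥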
  2. What is Janome?
    http://mocobeta.github.io/janome/
    "janome (蛇の目) is a morphological analyzer written in pure Python, with a built-in dictionary."
    (venv) $ pip install janome
    (venv) $ python
    >>> from janome.tokenizer import Tokenizer
    >>> t = Tokenizer()
    >>> for token in t.tokenize("すもももももももものうち"):
    ...     print(token)
    ...
    すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
    も 助詞,係助詞,*,*,*,*,も,モ,モ
    もも 名詞,一般,*,*,*,*,もも,モモ,モモ
    も 助詞,係助詞,*,*,*,*,も,モ,モ
    もも 名詞,一般,*,*,*,*,もも,モモ,モモ
    の 助詞,連体化,*,*,*,*,の,ノ,ノ
    うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
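  3. Features
    Uses mecab-ipadic-2.7.0-20070801 as the dictionary and language model. Results are mostly the same as MeCab's; differences show up in unknown-word handling.
    Pure Python, using only the standard library. It should run anywhere, regardless of environment... in theory.
    User dictionary support (MeCab dictionary format), so you can try adding words.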
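  4. The algorithms behind Janome
    すもも / も / もも / も / もも / の / うち
    noun / particle / noun / particle / noun / particle / noun
    Knowledge required:
    • Vocabulary: 「すもも」 and 「もも」 are nouns; 「も」 and 「の」 are particles (the dictionary)
    • Japanese-likeness: a noun tends to be followed by a particle (the language model)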
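  5. Dictionary lookup
    The dictionary has to be compact and fast to look up.
    A hash map (Python's dict) would work, but common prefix matching reduces the number of lookups, so it is more efficient.
    Input: さくら
    Output (from the dictionary):
    さ: conjugated form of the verb 「する」
    さく: base form of the verb 「さく」
    さく: conjugated form of the adjective 「さくい」
    さくら: noun
    ... and 11 other words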
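
To make the idea of common prefix matching concrete, here is a minimal sketch using a plain nested-dict trie. This is not Janome's implementation (Janome uses an FST); the function names `build_trie` and `common_prefix_search` are made up for illustration, and the toy dictionary holds only the four entries shown on the slide.

```python
def build_trie(entries):
    """Build a nested-dict trie; the key None holds the entries ending at a node."""
    root = {}
    for surface, info in entries:
        node = root
        for ch in surface:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(info)
    return root


def common_prefix_search(trie, text, start=0):
    """Return every (surface, info) whose surface is a prefix of text[start:]."""
    results = []
    node = trie
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:  # no dictionary word continues with this character
            break
        for info in node.get(None, []):
            results.append((text[start:end + 1], info))
    return results


# Toy entries taken from the slide (the real dictionary returns more words).
ENTRIES = [
    ("さ", "verb する, conjugated form"),
    ("さく", "verb さく, base form"),
    ("さく", "adjective さくい, conjugated form"),
    ("さくら", "noun"),
]

trie = build_trie(ENTRIES)
matches = common_prefix_search(trie, "さくら")
for surface, info in matches:
    print(surface, info)
```

A single walk over the input collects all four entries. With a dict you would instead probe 「さ」, 「さく」 and 「さくら」 separately, one lookup per candidate length.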
  6. Dictionary lookup: data structures and algorithms
    Patricia tree (JUMAN)
    Double array (ChaSen, MeCab)
    FST (Kuromoji/Lucene version, Janome)
    http://taku910.github.io/mecab/
    http://www.slideshare.net/lucenerevolution/automaton-invasionlucenerevolution2012
  7. FST?
    Finite State Transducers (a kind of deterministic finite automaton)
    More precisely, Minimal Acyclic Subsequential Transducers
    Because both the prefixes and the suffixes of the inputs are shared, an FST is more compact than a trie.
    (Paper) http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
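
The compactness claim can be checked on a toy example. The sketch below builds a trie for seven three-letter month keys, then merges nodes that accept exactly the same remaining suffixes and counts the states left. It is a simplification under stated assumptions: it minimizes only an acceptor (no outputs on the arcs, unlike a real FST), it minimizes after building the whole trie rather than incrementally, and the helper names are hypothetical.

```python
def build_trie(words):
    """Plain trie: each node is {"final": bool, "edges": {char: node}}."""
    root = {"final": False, "edges": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"final": False, "edges": {}})
        node["final"] = True
    return root


def count_trie_nodes(node):
    return 1 + sum(count_trie_nodes(child) for child in node["edges"].values())


def minimize(node, registry):
    """Return a state id; nodes that accept the same suffixes share one id."""
    signature = (
        node["final"],
        tuple(sorted((ch, minimize(child, registry))
                     for ch, child in node["edges"].items())),
    )
    # A signature seen before means an equivalent state already exists.
    return registry.setdefault(signature, len(registry))


WORDS = ["apr", "aug", "dec", "feb", "jan", "jul", "jun"]
trie = build_trie(WORDS)
registry = {}
minimize(trie, registry)

trie_nodes = count_trie_nodes(trie)
shared_states = len(registry)
print(trie_nodes, shared_states)
```

For these keys the trie has 18 nodes, while the suffix-sharing automaton needs 12 states, because all seven end-of-word nodes collapse into one.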
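  8. FST illustrated
    A small set of dictionary entries:
    { 'apr': '30', 'aug': '31', 'dec': '31', 'feb': ['28', '29'], 'jan': '31', 'jul': '31', 'jun': '30' }
    We build an FST from these seven input => output pairs.
    Think of the input as a morpheme's string (its surface form) and the output as the ID of the matching morpheme.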
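  9. Analysis
    There are many ways to segment 「すもももももももものうち」 (once again):
    1. すもも / も / もも / も / もも / の / うち
       noun / particle / noun / particle / noun / particle / noun
    2. すもも / も / もも / もも / も / の / うち
       noun / particle / noun / noun / particle / particle / noun
    3. すもも / もも / も / もも / も / の / うち
       noun / noun / particle / noun / particle / particle / noun
    and so on...
    While looking words up in the dictionary, first enumerate all candidate segmentations.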
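
The enumeration step can be sketched with a small recursive function. The five-word dictionary below is a toy assumption taken from the words on the slides; a real analyzer looks candidates up in the full dictionary and stores them in a lattice instead of materializing every path.

```python
# Toy dictionary (assumption): only the words that appear on the slides.
DICTIONARY = {
    "すもも": "noun",
    "もも": "noun",
    "うち": "noun",
    "も": "particle",
    "の": "particle",
}


def enumerate_segmentations(text, start=0):
    """Return every way to split text[start:] into dictionary words."""
    if start == len(text):
        return [[]]
    results = []
    for surface in DICTIONARY:
        if text.startswith(surface, start):
            for rest in enumerate_segmentations(text, start + len(surface)):
                results.append([surface] + rest)
    return results


segmentations = enumerate_segmentations("すもももももももものうち")
for tokens in segmentations:
    print(" / ".join(tokens))
```

With this toy dictionary the sentence has 13 candidate segmentations, including the three listed on the slide.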
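  10. Analysis
    From the candidates, choose the pattern that looks most like Japanese.
    1. すもも / も / もも / も / もも / の / うち
       noun / particle / noun / particle / noun / particle / noun
    2. すもも / も / もも / もも / も / の / うち
       noun / particle / noun / noun / particle / particle / noun
    How can we compute that "1. is more natural than 2."?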
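
One common answer, and the one MeCab-style analyzers use, is to give every word a cost and every pair of adjacent parts of speech a connection cost, then pick the cheapest path with the Viterbi algorithm. The sketch below assumes made-up costs: each word costs 1, and noun-noun or particle-particle adjacency is penalized. Real analyzers take both kinds of cost from the dictionary (mecab-ipadic in Janome's case), and the names here are hypothetical.

```python
# Hypothetical toy dictionary: surface -> (part of speech, word cost).
DICTIONARY = {
    "すもも": ("noun", 1),
    "もも": ("noun", 1),
    "うち": ("noun", 1),
    "も": ("particle", 1),
    "の": ("particle", 1),
}

# Hypothetical connection costs between adjacent parts of speech.
CONNECTION = {
    ("BOS", "noun"): 0, ("BOS", "particle"): 5, ("BOS", "EOS"): 0,
    ("noun", "particle"): 0, ("particle", "noun"): 0,
    ("noun", "noun"): 5, ("particle", "particle"): 5,
    ("noun", "EOS"): 0, ("particle", "EOS"): 0,
}


def viterbi(text):
    """Return (total cost, tokens) of the cheapest segmentation of text."""
    # best[i][pos] = (cost, tokens) of the cheapest path that ends at
    # offset i with a last token whose part of speech is pos.
    best = [{} for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        for prev_pos, (prev_cost, tokens) in best[start].items():
            for surface, (pos, word_cost) in DICTIONARY.items():
                if not text.startswith(surface, start):
                    continue
                end = start + len(surface)
                cost = prev_cost + CONNECTION[(prev_pos, pos)] + word_cost
                if pos not in best[end] or cost < best[end][pos][0]:
                    best[end][pos] = (cost, tokens + [(surface, pos)])
    candidates = [
        (cost + CONNECTION[(pos, "EOS")], tokens)
        for pos, (cost, tokens) in best[-1].items()
    ]
    if not candidates:
        raise ValueError("no segmentation found")
    return min(candidates, key=lambda c: c[0])


cost, tokens = viterbi("すもももももももものうち")
print(cost, " / ".join(surface for surface, _ in tokens))
```

Under these toy costs, segmentation 1 wins with a total of 7, because it is the only candidate in which nouns and particles strictly alternate.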
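  11. Looking back on Janome's development
    People occasionally ask how long development took, so I traced the commit history of the GitHub repository (https://github.com/mocobeta/janome).
    Around 2015/1/20: preparation (started reading the FST paper and the Lucene source)
    Around 2015/2/14: started building the FST base
    Around 2015/3/14: started building the system dictionary (mecab-ipadic)
    Around 2015/4/1: started building the lattice
    2015/4/7: user dictionary support added
    2015/4/8: registered on PyPI and released
    Rough development effort:
    Understanding and implementing the FST: 2 months
    Turning mecab-ipadic into a bundled dictionary: 0.5 months
    Implementing Viterbi: a few days
    Release work: 1 day
    Python skill level: roughly beginner to intermediate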
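  12. What prompted the development
    Q: By the way, why did you decide to build it?
    A: Implementing a morphological analyzer is natural language processing 101 ^^
    (by @ikawaha) (author of kuromoji.js) (author of kagome)
    Could it be that this is trending...? (no)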
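  13. Implementing the FST and the bundled system dictionary
    Once this part works, you are practically done! (probably)
    Treat the FST (automaton) as a "collection of arcs" and flatten it into a byte array (the Apache Lucene approach).
    A string is packed into bytes with encode() and decode().
    An int is packed into bytes with struct.pack() and unpack().
    http://mocobeta-backup.tumblr.com/post/113693778372/lucene-fst-2
    Encoding example in janome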
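
As a rough illustration of flattening arcs into bytes, the sketch below packs one arc as a length-prefixed UTF-8 label followed by two fixed-width integers. The layout and the names `encode_arc` and `decode_arc` are assumptions made for this example; they are not janome's or Lucene's actual binary format.

```python
import struct


def encode_arc(label, output, target):
    """Pack one arc: 1-byte label length, UTF-8 label, int32 output, uint32 target."""
    label_bytes = label.encode("utf-8")
    return (
        struct.pack("<B", len(label_bytes))
        + label_bytes
        + struct.pack("<iI", output, target)
    )


def decode_arc(buf, offset=0):
    """Unpack one arc starting at offset; return (arc, offset of the next arc)."""
    (label_len,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    label = buf[offset:offset + label_len].decode("utf-8")
    offset += label_len
    output, target = struct.unpack_from("<iI", buf, offset)
    offset += struct.calcsize("<iI")
    return (label, output, target), offset


# Two hypothetical arcs written back to back into one byte array.
blob = encode_arc("す", 42, 7) + encode_arc("a", -1, 0)
first, pos = decode_arc(blob)
second, pos = decode_arc(blob, pos)
print(first, second, len(blob))
```

The length prefix matters because UTF-8 labels vary in width: 「す」 takes three bytes while "a" takes one.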
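  14. Profile to find the slow spots, then experiment by trial and error, inexperienced as I am
    (Slow spot 1) Dictionary lookup (FST search) takes half of the execution time.
    Added a cache, among other things (probably because my FST implementation is not great).
    Stopped creating objects and turned various things into tuples.
    (Slow spot 2) Connection cost lookup takes 1/4 of the execution time.
    The connection costs were first held in a dict (hash table); switching to a two-dimensional array, following MeCab, made lookup faster. Data structures really matter ^^;
    (Slow spot 3) Creating lattice nodes takes 1/4 of the execution time (and eats memory too).
    TODO: tuning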
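
The change described for slow spot 2 can be sketched as follows. The table size and cost values are invented for illustration (a real table comes from the dictionary's connection cost data), and the timing is only there so you can measure on your own machine; no particular speedup is claimed here.

```python
import timeit

SIZE = 200  # hypothetical number of left/right context ids

# Variant 1: dict keyed by (left_id, right_id) tuples.
costs_dict = {
    (left, right): (left * 31 + right * 17) % 1000
    for left in range(SIZE)
    for right in range(SIZE)
}

# Variant 2: two-dimensional array (list of lists) indexed by the two ids.
costs_matrix = [
    [costs_dict[(left, right)] for right in range(SIZE)]
    for left in range(SIZE)
]


def cost_from_dict(left_id, right_id):
    return costs_dict[(left_id, right_id)]


def cost_from_matrix(left_id, right_id):
    return costs_matrix[left_id][right_id]


# Both variants return the same cost; only the access path differs.
t_dict = timeit.timeit(lambda: cost_from_dict(123, 45), number=100_000)
t_matrix = timeit.timeit(lambda: cost_from_matrix(123, 45), number=100_000)
print(f"dict: {t_dict:.4f}s  matrix: {t_matrix:.4f}s")
```

The dict variant has to build and hash a tuple key on every lookup, whereas the matrix variant indexes two lists directly.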
  15. Appendix: a little bit of history
    Various morphological analyzers that are clones of MeCab (C++) or that borrow the MeCab dictionary (model):
    Igo (Java), igo-python (Python), igo-ruby (Ruby), igo-javascript (JavaScript)
    Kuromoji (Java), kuromoji.js (JavaScript)
    kagome (Go), janome (Python)
    and others...