
Learning Morphological Analysis by Building It in Python


Slides from my talk at PyCon JP 2015.

Tomoko Uchida

October 11, 2015

Transcript

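  1. Who am I: Tomoko Uchida (@moco_beta). Formerly: Python engineer at a web services company. Currently: I help teams adopt and operate the search engines Solr and Elasticsearch.

    I work at RONDHUIT Co., Ltd. Day to day I mostly write Java and occasionally Scala; I would love to get some Python work :-) Interests: information retrieval and related technologies, natural language processing, and machine learning. Studying hard to get past the wannabe stage 🐥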
  2. What is Janome? http://mocobeta.github.io/janome/ "Janome (蛇の目) is a morphological analyzer written in pure Python, with a built-in dictionary."

    (venv) $ pip install janome
    (venv) $ python
    >>> from janome.tokenizer import Tokenizer
    >>> t = Tokenizer()
    >>> for token in t.tokenize("すもももももももものうち"):
    ...     print(token)
    ...
    すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
    も 助詞,係助詞,*,*,*,*,も,モ,モ
    もも 名詞,一般,*,*,*,*,もも,モモ,モモ
    も 助詞,係助詞,*,*,*,*,も,モ,モ
    もも 名詞,一般,*,*,*,*,もも,モモ,モモ
    の 助詞,連体化,*,*,*,*,の,ノ,ノ
    うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
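  3. Features: The dictionary and language model come from mecab-ipadic-2.7.0-20070801. Results mostly match MeCab; differences show up in unknown-word handling.

    Pure Python, using only the standard library, so it should run anywhere regardless of environment. User dictionary support (MeCab dictionary format), so you can try adding words.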
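  4. The algorithms behind Janome: すもも (noun) / も (particle) / もも (noun) / も (particle) / もも (noun) / の (particle) / うち (noun)

    Knowledge required:
    • Vocabulary: すもも and もも are nouns; も and の are particles (the dictionary)
    • What sounds like Japanese: a noun tends to be followed by a particle (the language model)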
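  5. Dictionary lookup: The dictionary has to be compact and fast to look up. A hash map (Python's dict) would do, but common prefix matching reduces the number of lookups and is therefore more efficient.

    Input: さくら → dictionary → output: さ (conjugated form of the verb する), さく (base form of the verb さく), さく (conjugated form of the adjective さくい), さくら (noun), … and 11 more words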
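The common prefix lookup on this slide can be sketched as follows. This is a minimal illustration using a nested-dict trie, not Janome's actual FST-based dictionary; the entries, `build_trie`, and `common_prefix_search` are hypothetical names made up for this example.

```python
# Toy entries, assumed for illustration only: surface form -> descriptions.
ENTRIES = {
    "さ": ["verb する (conjugated form)"],
    "さく": ["verb さく (base form)", "adjective さくい (conjugated form)"],
    "さくら": ["noun"],
}


def build_trie(entries):
    """Build a character trie; the key None marks the end of an entry."""
    root = {}
    for surface, values in entries.items():
        node = root
        for ch in surface:
            node = node.setdefault(ch, {})
        node[None] = values
    return root


def common_prefix_search(trie, text):
    """Return every dictionary entry that is a prefix of text, in one pass."""
    matches = []
    node = trie
    for i, ch in enumerate(text):
        if ch not in node:
            break
        node = node[ch]
        if None in node:
            matches.append((text[: i + 1], node[None]))
    return matches


TRIE = build_trie(ENTRIES)
for surface, values in common_prefix_search(TRIE, "さくらが咲く"):
    print(surface, values)
```

With a plain dict, finding the same matches would take one lookup per candidate substring (さ, さく, さくら, さくらが, ...), whereas the trie walk stops as soon as no entry can continue.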
  6. Dictionary lookup: data structures and algorithms. Patricia tree (JUMAN); double array (ChaSen, MeCab); FST (the Lucene version of Kuromoji, Janome)

    http://taku910.github.io/mecab/ http://www.slideshare.net/lucenerevolution/automaton-invasionlucenerevolution2012
  7. FST? Finite State Transducers (a kind of deterministic finite automaton). More precisely, Minimal Acyclic Subsequential Transducers.

    Because they share both the prefixes and the suffixes of the inputs, they are more compact than a trie. (Paper) http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
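  8. FST illustrated: a small set of dictionary entries

    { 'apr': '30', 'aug': '31', 'dec': '31', 'feb': ['28', '29'], 'jan': '31', 'jul': '31', 'jun': '30' }

    We build an FST from these seven input => output pairs. Think of the input as the morpheme's string (its surface form) and the output as the ID of the matching morpheme.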
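To get a feel for why sharing suffixes as well as prefixes saves space, here is a rough sketch that counts states for the seven keys above. It is an assumption-laden simplification: it ignores the outputs entirely (so it measures a minimal acyclic automaton, not a real transducer), and the helper names are hypothetical rather than taken from Janome or Lucene.

```python
KEYS = ["apr", "aug", "dec", "feb", "jan", "jul", "jun"]


def build_trie(words):
    """Plain trie: prefixes are shared, suffixes are not."""
    root = {"final": False, "children": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(
                ch, {"final": False, "children": {}}
            )
        node["final"] = True
    return root


def count_trie_nodes(node):
    return 1 + sum(count_trie_nodes(c) for c in node["children"].values())


def count_minimal_states(root):
    """Merge identical subtrees bottom-up, which also shares suffixes."""
    registry = {}  # subtree signature -> state id

    def state_id(node):
        signature = (
            node["final"],
            tuple(sorted(
                (ch, state_id(child))
                for ch, child in node["children"].items()
            )),
        )
        return registry.setdefault(signature, len(registry))

    state_id(root)
    return len(registry)


trie = build_trie(KEYS)
print("trie nodes:", count_trie_nodes(trie))
print("states after merging equal subtrees:", count_minimal_states(trie))
```

For these keys the trie has 18 nodes while the merged automaton needs 12 states. In a real FST the outputs are attached to the arcs, which constrains what can be merged, so the saving would differ.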
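  9. Analysis: すもももももももものうち (again) can be segmented in many ways:

    1. すもも / も / もも / も / もも / の / うち (noun / particle / noun / particle / noun / particle / noun)
    2. すもも / も / もも / もも / も / の / うち (noun / particle / noun / noun / particle / particle / noun)
    3. すもも / もも / も / もも / も / の / うち (noun / noun / particle / noun / particle / particle / noun)
    and so on. While looking up the dictionary, first enumerate all candidate segmentations.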
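Enumerating the candidates can be sketched with a small recursive generator. The five-word vocabulary is a toy assumption taken from the example sentence, and `segmentations` is a hypothetical helper; a real analyzer builds a lattice instead of listing every path explicitly.

```python
# Toy vocabulary, assumed for illustration (the real dictionary is mecab-ipadic).
WORDS = {"すもも", "もも", "も", "の", "うち"}

# す followed by eight も, then のうち.
TEXT = "す" + "も" * 8 + "のうち"


def segmentations(text, start=0):
    """Yield every way to split text[start:] into dictionary words."""
    if start == len(text):
        yield []
        return
    for end in range(start + 1, len(text) + 1):
        surface = text[start:end]
        if surface in WORDS:
            for rest in segmentations(text, end):
                yield [surface] + rest


candidates = list(segmentations(TEXT))
for words in candidates:
    print(" / ".join(words))
print(len(candidates), "candidates")
```

Even with this tiny vocabulary there are 13 candidates, which is why the next step needs a way to score them.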
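  10. Analysis: from the candidates, choose the pattern that looks most like Japanese.

    1. すもも / も / もも / も / もも / の / うち (noun / particle / noun / particle / noun / particle / noun)
    2. すもも / も / もも / もも / も / の / うち (noun / particle / noun / noun / particle / particle / noun)
    How can we compute that 1. is more natural than 2.?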
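One common way to answer this question is to give each word a cost and each pair of adjacent parts of speech a connection cost, then pick the cheapest path with the Viterbi algorithm (the deck later mentions a lattice, connection costs, and a Viterbi implementation). The sketch below uses invented costs: every word costs 1, and two adjacent words with the same part of speech cost an extra 5. None of these numbers come from mecab-ipadic, and `viterbi` is a hypothetical helper.

```python
# Toy dictionary: surface -> (part of speech, word cost). Costs are invented.
DICTIONARY = {
    "すもも": ("noun", 1),
    "もも": ("noun", 1),
    "も": ("particle", 1),
    "の": ("particle", 1),
    "うち": ("noun", 1),
}

TEXT = "す" + "も" * 8 + "のうち"


def connection_cost(left_pos, right_pos):
    """Invented rule: the same part of speech twice in a row is penalized."""
    if left_pos == "BOS":
        return 0
    return 5 if left_pos == right_pos else 0


def viterbi(text):
    """Return (total cost, [(surface, pos), ...]) of the cheapest path."""
    # best[i][pos] = cheapest (cost, path) ending at offset i with tag pos
    best = {0: {"BOS": (0, [])}}
    for start in range(len(text)):
        if start not in best:
            continue  # no path reaches this offset
        for surface, (pos, word_cost) in DICTIONARY.items():
            if not text.startswith(surface, start):
                continue
            end = start + len(surface)
            slot = best.setdefault(end, {})
            for prev_pos, (prev_cost, prev_path) in best[start].items():
                cost = prev_cost + connection_cost(prev_pos, pos) + word_cost
                if pos not in slot or cost < slot[pos][0]:
                    slot[pos] = (cost, prev_path + [(surface, pos)])
    if len(text) not in best:
        return None
    return min(best[len(text)].values(), key=lambda item: item[0])


cost, path = viterbi(TEXT)
print(cost, " / ".join(surface for surface, _ in path))
```

Under these toy costs the alternating noun/particle pattern of candidate 1 wins. With real dictionary costs the principle is the same, only the numbers are learned from data.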
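  11. Looking back on Janome's development: people sometimes ask me how long it took to build, so I traced the commit history of the GitHub repository (https://github.com/mocobeta/janome).

    Around 2015/1/20: preparation (started reading the FST paper and the Lucene source)
    Around 2015/2/14: started building the FST base
    Around 2015/3/14: started building the system dictionary (mecab-ipadic)
    Around 2015/4/1: started building the lattice
    2015/4/7: user dictionary added
    2015/4/8: registered on PyPI and released
    Rough effort: 2 months to understand and implement the FST; 0.5 months to turn mecab-ipadic into a bundled dictionary; a few days to implement Viterbi; 1 day for release work. My Python skill is roughly beginner to intermediate.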
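  12. What got me started. Q: By the way, why did you decide to build one? A: Implementing a morphological analyzer is the 101 of natural language processing ^^

    (by @ikawaha) (kuromoji.js author) (kagome author) Could it be that this is a trend...? (no)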
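  13. Implementing the FST and the bundled system dictionary: once this part works, you are practically done! (Probably.) Treat the FST (automaton) as a "collection of arcs" and serialize it into a byte array (the Apache Lucene approach).

    Strings are packed into bytes with encode() and decode(). Ints are packed into bytes with struct.pack() and unpack(). http://mocobeta-backup.tumblr.com/post/113693778372/lucene-fst-2 (Figure: an encoding example in janome)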
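As a rough sketch of the packing idea, the following serializes one arc into bytes and reads it back. The byte layout (a length-prefixed UTF-8 label followed by an output and a target address) and the function names are assumptions made up for this example; Janome's and Lucene's real formats differ.

```python
import struct


def pack_arc(label: str, output: int, target: int) -> bytes:
    """Hypothetical layout: 1-byte label length, label, int32 output, uint32 target."""
    label_bytes = label.encode("utf-8")
    header = struct.pack("!B", len(label_bytes))
    return header + label_bytes + struct.pack("!iI", output, target)


def unpack_arc(data: bytes):
    """Inverse of pack_arc for a buffer that starts with one arc."""
    (label_len,) = struct.unpack_from("!B", data, 0)
    label = data[1 : 1 + label_len].decode("utf-8")
    output, target = struct.unpack_from("!iI", data, 1 + label_len)
    return label, output, target


packed = pack_arc("も", 42, 7)
print(len(packed), unpack_arc(packed))
```

The "!" prefix selects network byte order with no padding, which keeps the layout the same on every platform.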
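  14. Profile, find the slow parts, and experiment by trial and error even without much experience. (Slow part 1) Dictionary lookup (FST search) takes half of the run time: added a cache (probably because my FST implementation is not great) and stopped creating objects, switching various things to tuples. (Slow part 2) Connection cost lookup takes 1/4 of the run time.

    The connection costs were first held in a dict (hash table); switching to a two-dimensional array (following MeCab) made it faster. Data structures really matter ^^; (Slow part 3) Creating lattice nodes takes 1/4 of the run time (and eats memory too): TODO tuning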
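The two ways of holding connection costs can be sketched like this. The 3x3 table and its values are invented for illustration (the real mecab-ipadic matrix is far larger), and the function names are hypothetical.

```python
# Hypothetical tiny table of (left id, right id) -> cost.
SIZE = 3
costs_dict = {(left, right): 10 * left + right
              for left in range(SIZE) for right in range(SIZE)}

# The same data as a two-dimensional array (a list of rows).
costs_matrix = [[costs_dict[(left, right)] for right in range(SIZE)]
                for left in range(SIZE)]


def cost_from_dict(left_id, right_id):
    # Builds a tuple key and hashes it on every call.
    return costs_dict[(left_id, right_id)]


def cost_from_matrix(left_id, right_id):
    # Two plain index operations, no key object and no hashing.
    return costs_matrix[left_id][right_id]


print(cost_from_dict(1, 2), cost_from_matrix(1, 2))
```

Both functions return the same values; how much faster the array version is depends on the interpreter and the table size, so it is worth measuring with a profiler or timeit rather than assuming.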
  15. Various morphological analyzers that are clones of MeCab (C++) or borrow the MeCab dictionary (model): Igo (Java), igo-python (Python), igo-ruby (Ruby),

    igo-javascript (JavaScript), Kuromoji (Java), kuromoji.js (JavaScript), kagome (Go), janome (Python), and others... Appendix: a little bit of history