Notes on Working Through 言語処理100本ノック (NLP 100 Exercise) in Ruby

himkt
August 06, 2016

Transcript

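  1. Self-introduction and what I did
     • B4 (fourth-year undergraduate) at the University of Tsukuba (natural language processing? machine learning?)
     • Research: information extraction (probabilistic models)
     • My part: trying to solve 言語処理100本ノック in Ruby
     • Package user https://github.com/himkt/nlp-100knock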
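  2. Feature extraction
     • What serves as features in natural language processing: words (in most cases)
     • The number of distinct words that occur is very large (tens of thousands to hundreds of thousands)
     • Using every word as a feature makes training go poorly
     • Efficient feature extraction is needed
     • Python: scikit-learn::feature_extraction
     • Ruby: no de facto standard library exists
     • Hand-rolled this time (https://github.com/himkt/rblearn)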
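The idea on this slide, keeping only words that occur often enough as features, can be sketched in plain Ruby. This is not the rblearn implementation: the class name `CountVectorizer`, the `min_df` parameter (a name borrowed from scikit-learn), and the toy documents are all assumptions made for illustration.

```ruby
# Minimal bag-of-words feature extractor (illustrative; not rblearn's API).
class CountVectorizer
  attr_reader :vocabulary

  # min_df: keep only words that appear in at least this many documents.
  def initialize(min_df: 1)
    @min_df = min_df
    @vocabulary = {}
  end

  # Build the word -> column index mapping from tokenized documents.
  def fit(docs)
    doc_freq = Hash.new(0)
    docs.each { |tokens| tokens.uniq.each { |w| doc_freq[w] += 1 } }
    kept = doc_freq.select { |_, df| df >= @min_df }.keys.sort
    @vocabulary = kept.each_with_index.to_h
    self
  end

  # Turn each document into a vector of word counts.
  def transform(docs)
    docs.map do |tokens|
      row = Array.new(@vocabulary.size, 0)
      tokens.each do |w|
        idx = @vocabulary[w]
        row[idx] += 1 if idx # words outside the vocabulary are ignored
      end
      row
    end
  end
end

# Hypothetical, already tokenized documents.
docs = [
  %w[ruby is fun],
  %w[ruby is fast],
  %w[python is popular]
]
vectorizer = CountVectorizer.new(min_df: 2).fit(docs)
features = vectorizer.transform(docs)
```

With `min_df: 2`, only words that appear in two or more documents survive, which shrinks the feature space before training.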
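  3. Logistic regression
     • Libraries
     • Statsample-glm: apparently meant to be used together with Daru?
     • Liblinear-Ruby: does not support NMatrix or NArray
     • Data frames: not suited to data with many columns?*
     • This may just be my assumption (the data this time is about 10000 * 10000)
     • Implemented it with NArray
     • What is needed: the cost function and its gradient
     • Both can be expressed as matrix products (implementable with NArray's features alone)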
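As a rough sketch of the "cost function and gradient" that the slide mentions, the following uses plain Ruby Arrays rather than NArray so that it runs without extra gems. That substitution, the toy data, the learning rate, and the iteration count are all assumptions. It is not the author's code.

```ruby
# Logistic regression cost and gradient with plain Arrays
# (assumption: the slides use NArray; plain Ruby is used here instead).

def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Matrix-vector product X * w followed by the sigmoid, one row at a time.
def predict_proba(x, w)
  x.map { |row| sigmoid(row.zip(w).sum { |xi, wi| xi * wi }) }
end

# Mean negative log-likelihood (cross-entropy) cost.
def cost(x, y, w)
  probs = predict_proba(x, w)
  total = probs.zip(y).sum do |p, t|
    -(t * Math.log(p) + (1 - t) * Math.log(1 - p))
  end
  total / x.size
end

# Gradient X^T (p - y) / n, which is again a matrix product.
def gradient(x, y, w)
  errors = predict_proba(x, w).zip(y).map { |p, t| p - t }
  x.transpose.map do |column|
    column.zip(errors).sum { |xi, e| xi * e } / x.size
  end
end

# Batch gradient descent on a toy data set (first column is the bias term).
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
w = [0.0, 0.0]
initial_cost = cost(x, y, w)
500.times do
  g = gradient(x, y, w)
  w = w.zip(g).map { |wi, gi| wi - 0.5 * gi }
end
final_cost = cost(x, y, w)
```

The cost starts at log 2 for zero weights and decreases as the weights are updated.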
  4. Cross-validation
     • Split the data set and train multiple times to measure the generalization performance of a predictive model
     • Python: sklearn::cross_validation
     • It only returns array indices
     • Integer array indexing (masking ?)
     • NArray has it; NMatrix does not
     Image: https://pydata.tokyo/ipynb/tutorial-1/ml.html
     Reference: http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/cross-val.html
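The point that cross-validation only needs to hand back array indices can be illustrated with a small sketch. The helper name `k_fold_indices` is hypothetical, and `Array#values_at` stands in for the integer array indexing that NArray offers. Shuffling is left out to keep the output deterministic.

```ruby
# Split the indices 0...n into k folds and return [train, test] index pairs.
# Hypothetical helper; neither NArray nor rblearn is used here.
def k_fold_indices(n, k)
  folds = (0...n).group_by { |i| i % k }.values
  folds.map do |test|
    train = (0...n).to_a - test
    [train, test]
  end
end

data = %w[a b c d e f]
splits = k_fold_indices(data.size, 3)

# Select the rows of one split by index, as integer array indexing would.
train_idx, test_idx = splits.first
train_data = data.values_at(*train_idx)
test_data = data.values_at(*test_idx)
```

Because only indices are produced, the same splits can be applied to the feature matrix and the label vector alike.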
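  5. Cross-validation
     • Ruby: at present it is sometimes implemented inside individual libraries
     • e.g. Liblinear.cross_validation (liblinear-ruby)
     • Python: scikit-learn::cross_validation
     • The model (Logistic Regression) only receives training data and learns from it
     Wrote a library that does cross-validation (https://github.com/himkt/rblearn)
     The cross-validation logic and the training logic are separated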
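One way to picture the separation described here is a cross-validation routine that relies only on a model responding to `fit` and `predict`. The sketch below is an assumption about how such a split of responsibilities might look, not rblearn's actual interface. `MajorityClassifier` and `cross_validate` are hypothetical names.

```ruby
# Toy model: always predicts the most frequent training label.
class MajorityClassifier
  def fit(_features, labels)
    @label = labels.group_by(&:itself).max_by { |_, v| v.size }.first
    self
  end

  def predict(features)
    features.map { @label }
  end
end

# Cross-validation that knows nothing about the model beyond fit/predict.
# Returns the accuracy measured on each held-out fold.
def cross_validate(model_class, features, labels, k)
  all = (0...features.size).to_a
  folds = all.group_by { |i| i % k }.values
  folds.map do |test_idx|
    train_idx = all - test_idx
    model = model_class.new.fit(features.values_at(*train_idx),
                                labels.values_at(*train_idx))
    predicted = model.predict(features.values_at(*test_idx))
    expected = labels.values_at(*test_idx)
    correct = predicted.zip(expected).count { |p, e| p == e }
    correct.to_f / test_idx.size
  end
end

features = [[0], [1], [2], [3], [4], [5]]
labels = [1, 1, 1, 1, 0, 1]
accuracies = cross_validate(MajorityClassifier, features, labels, 3)
```

Any other class with the same two methods, such as a logistic regression model, could be passed in without touching `cross_validate`.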
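  6. Principal component analysis
     • Libraries
     • Ruby: statsample
     • The data is large, so it has to be handled as a sparse matrix
     • Need to build a DataFrame?
     • Solve it as an eigenvalue/eigenvector computation
     • NArray, NMatrix (s.t. sparse matrices)
     • NArray: sparse matrices are not supported yet
     • NMatrix: eigenvalue/eigenvector computation for sparse matrices is not implemented -> put on hold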
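The slide leaves this exercise on hold, so the following is only a small illustration of what treating PCA as an eigenvector computation means. It uses dense plain Arrays and power iteration on tiny toy data. Both choices are assumptions made for this sketch, and it does not address the sparse, large-scale case the slide is concerned with.

```ruby
# Center each column, then form the covariance matrix X^T X / (n - 1).
def covariance(rows)
  n = rows.size
  means = rows.transpose.map { |col| col.sum / n.to_f }
  centered = rows.map { |row| row.zip(means).map { |v, m| v - m } }
  cols = centered.transpose
  cols.map do |a|
    cols.map { |b| a.zip(b).sum { |x, y| x * y } / (n - 1) }
  end
end

# Power iteration: repeatedly multiply by the matrix and normalize, which
# approaches the eigenvector with the largest eigenvalue (the first
# principal component) unless the start vector is orthogonal to it.
def first_component(cov, iterations: 100)
  v = Array.new(cov.size, 1.0)
  iterations.times do
    w = cov.map { |row| row.zip(v).sum { |c, x| c * x } }
    norm = Math.sqrt(w.sum { |x| x * x })
    v = w.map { |x| x / norm }
  end
  v
end

# Toy data lying exactly on the line y = x.
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
component = first_component(covariance(data))
```

For this data the returned direction points along the diagonal, with both entries equal.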
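  7. word2vec
     • Libraries
     • Python: gensim
     • Ruby: none (probably)
     • Implemented with NArray
     • With word2vec, all that matters is having the word vectors once the model is trained
     • What is actually needed is just computing cosine similarity between vectors (the features of NArray or NMatrix are enough)
     • NArray was faster, so I used NArray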
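Cosine similarity between word vectors, the one operation this slide says is actually needed, can be sketched as follows. Plain Arrays replace NArray here, and the three-dimensional vectors are made-up values rather than trained word2vec output.

```ruby
# Cosine similarity: dot product divided by the product of the norms.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Rank all other words by similarity to the query word, most similar first.
def most_similar(vectors, query)
  vectors.reject { |word, _| word == query }
         .map { |word, vec| [word, cosine_similarity(vectors[query], vec)] }
         .sort_by { |_, sim| -sim }
end

# Hypothetical 3-dimensional word vectors; trained vectors usually
# have far more dimensions.
vectors = {
  "king"  => [0.9, 0.1, 0.3],
  "queen" => [0.8, 0.2, 0.4],
  "apple" => [0.1, 0.9, 0.2]
}
ranking = most_similar(vectors, "king")
```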
  8. k-means t-SNE
     • Libraries
     • Python: sklearn.clustering
     • Ruby: AI4R (http://ai4r.org/)
     • No NArray or NMatrix support
     • Updates have stopped?
     • Implemented with NArray alone (NArray is faster)
     • Can be implemented without getting particularly stuck
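For the k-means half of this slide, a compact version of Lloyd's algorithm might look like the sketch below (t-SNE is not covered). It uses plain Arrays instead of NArray and starts the centroids at the first k points. Both are assumptions chosen to keep the example short and deterministic.

```ruby
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Lloyd's algorithm with a fixed number of iterations.
def k_means(points, k, iterations: 10)
  centroids = points.first(k) # assumption: deterministic initialization
  assignments = []
  iterations.times do
    # Assignment step: each point goes to its nearest centroid.
    assignments = points.map do |p|
      (0...k).min_by { |c| squared_distance(p, centroids[c]) }
    end
    # Update step: each centroid moves to the mean of its members.
    centroids = (0...k).map do |c|
      members = points.each_index.select { |i| assignments[i] == c }
                      .map { |i| points[i] }
      next centroids[c] if members.empty? # keep an empty cluster in place
      members.transpose.map { |col| col.sum / members.size.to_f }
    end
  end
  [centroids, assignments]
end

points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
centroids, assignments = k_means(points, 2)
```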
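  9. Summary
     • Tried solving 言語処理100本ノック
     • Most of it can be solved with NArray and NMatrix
     • Things like principal component analysis on large-scale data are not possible
     • Is a scikit-learn-like library needed?
     • Looked at the past logs (yesterday)
     • Would be happy to have one (I think Ruby is well suited to natural language processing)
     • Libraries: whether it is NArray, NMatrix, or Daru's Vector?, I would like some agreed-upon data structure to be usable uniformly
     • For cross-validation, feature extraction, and so on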
  10. Wish list
     • NArray: sparse matrix support
     • NMatrix: sparse matrix support in linalg
     • NArray, NMatrix: object serialization
     • NMatrix: Integer Array indexing
     • Feature Extractor, Feature Vectorizer