#javajo Getting Started with Machine Learning in Java/Scala

Slides presented at https://javajo.doorkeeper.jp/events/27588.

KOMIYA Atsushi

July 23, 2015

Transcript

  1. Getting Started with Machine Learning in Java/Scala
    2015-07-23 Javajo
    @komiya_atsushi

  2. Self-introduction
    ("Who are you, anyway?")

  3. KOMIYA Atsushi
    @komiya_atsushi

  4. bit.ly/WeLoveSmartNews

  5. We're hiring!
    iOS engineers / Android engineers
    / Web application engineers
    / Productivity engineers
    / Machine learning / natural language processing engineers
    / Growth hack engineers
    / Server-side engineers
    / Ad engineers…

  6. Today's topic

  7. I had overlooked it...
    Sorry m(_ _)m

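  8. Machine learning! Machine learning!
    • This is a rehash of my talk at the JJUG Night Seminar "Machine Learning / Natural Language Processing Special"
      • If you have heard it before, feel free to sleep through it
    • I will give an introduction to machine learning and talk about doing machine learning casually in Java/Scala
      • If you are a hardcore practitioner, feel free to sleep through it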

  9. Things to know before getting started with machine learning

  10. What is machine learning, anyway?
    Quoted from "Machine Learning Tutorial @ Jubatus Casual Talks"
    http://www.slideshare.net/unnonouno/jubatus-casual-talks

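  11. What can machine learning do?
    Classification / discrimination
    Prediction / regression
    Pattern mining
    Association rules
    Clustering
    Spam mail detection
    News article categorization
    Product recommendation
    Demand and sales forecasting
    …and so on (these are just examples)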

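  12. What can machine learning do?
    Pattern mining
    Association rules
    Clustering
    Spam mail detection
    News article categorization
    Product recommendation
    Demand and sales forecasting
    Supervised learning
      ・There are correct answers
      ・Builds a "model"
      Classification / discrimination
      Prediction / regression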

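  13. What can machine learning do?
    Discrimination / classification
    Regression / prediction
    Pattern mining
    Association rules
    Clustering
    Spam mail detection
    News article categorization
    Product recommendation
    Demand and sales forecasting
    Unsupervised learning
      There is no clear-cut correct answer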

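  14. What to use as input data
    • Only numeric sequences (vectors) can be handled
    • Unstructured data (images, audio, text, access logs, etc.) cannot be handled as-is
      • Features must be extracted and turned into vectors
      • So-called feature engineering
    • For the training data of supervised learning, additionally attach the correct-answer information that serves as the "label"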

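  15. What to use as input data
    • Extracting and transforming features
      • Categorical variables: One-hot encoding
      • Natural language: Term frequency, Word2vec, etc.
      • Images: SIFT, SURF, AKAZE, etc.
      • Recently, deep learning is sometimes used for feature extraction as well
    • Representing high-dimensional, sparse feature vectors
      • Feature hashing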
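
As an illustration of the one-hot encoding mentioned on this slide, here is a minimal plain-Java sketch. The class name `OneHotSketch` and the category list are made up for this example and do not come from the talk; unknown values are mapped to an all-zero vector, which is only one of several possible conventions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns a categorical value into a 0/1 vector.
final class OneHotSketch {
    private final List<String> categories;

    OneHotSketch(List<String> categories) {
        this.categories = categories;
    }

    // Returns a vector with 1.0 at the index of `value` and 0.0 elsewhere.
    // Values that are not in the category list yield an all-zero vector.
    double[] encode(String value) {
        double[] vector = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index >= 0) {
            vector[index] = 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        OneHotSketch encoder = new OneHotSketch(Arrays.asList("sports", "politics", "tech"));
        System.out.println(Arrays.toString(encoder.encode("politics")));
    }
}
```

Each category gets its own dimension, so the vector grows with the number of distinct values; that growth is what motivates the sparse representations and feature hashing listed on the slide.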

  16. Are the results correct?
    • Verifying correctness
      • k-fold cross validation
    • Measuring correctness
      • Classification / discrimination
        • AUC, Precision, Recall, F-measure
      • Prediction / regression
        • Correlation coefficient, coefficient of determination, MAE, RMSE, LogLoss
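
To make some of the metrics named above concrete, the following plain-Java sketch computes precision, recall, and F-measure (taken here as the F1 score, which is an assumption) for binary labels, plus MAE and RMSE for numeric predictions. The class and method names are hypothetical; returning 0.0 when a denominator is zero is a choice made for this sketch, not a universal rule.

```java
// Hypothetical helper with a few of the evaluation metrics listed on the slide.
final class EvalMetricsSketch {
    // Returns {precision, recall, F-measure} for binary labels (true = positive class).
    static double[] precisionRecallF(boolean[] actual, boolean[] predicted) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] && actual[i]) tp++;   // correctly flagged as positive
            else if (predicted[i]) fp++;           // flagged, but actually negative
            else if (actual[i]) fn++;              // positive that was missed
        }
        double precision = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double f = (precision + recall == 0.0) ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f};
    }

    // Mean absolute error.
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) sum += Math.abs(actual[i] - predicted[i]);
        return sum / actual.length;
    }

    // Root mean squared error.
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }
}
```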

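  17. Linearly separable vs. non-linear
    • Can the points (samples) be separated by a single line (hyperplane)?
    Figure labels: linearly separable / not linearly separable / non-linear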

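  18. Online learning and offline learning
    • Online learning
      • Updates the model continuously, based on data as it arrives
      • Think of it as stream processing
      • Data can be discarded after use, without being stored
      • (Grab one, toss it, grab one, toss it…)
    • Offline learning
      • Updates the model in one go, based on accumulated data
      • Corresponds to batch processing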
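
A rough way to picture the online-learning side of this slide: the model sees one labeled sample, updates itself, and the sample can then be thrown away. The sketch below uses a simple perceptron for this purpose. The perceptron is a stand-in chosen for brevity, not an algorithm named in the talk, and the class name is hypothetical.

```java
// Hypothetical online learner: a perceptron that is updated one sample at a time.
final class OnlinePerceptronSketch {
    private final double[] weights;
    private double bias;

    OnlinePerceptronSketch(int dimensions) {
        this.weights = new double[dimensions];
    }

    // Predicts +1 or -1 from the sign of the linear score.
    int predict(double[] features) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score >= 0 ? 1 : -1;
    }

    // Consumes a single sample (label must be +1 or -1) and adjusts the model
    // only when the current prediction is wrong. Nothing is stored afterwards.
    void update(double[] features, int label) {
        if (predict(features) != label) {
            for (int i = 0; i < weights.length; i++) {
                weights[i] += label * features[i];
            }
            bias += label;
        }
    }
}
```

An offline (batch) learner would instead keep the whole dataset and fit the model over all of it at once.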

  19. Doing machine learning in Java/Scala

  20. Let me say this up front

  21. Stop reinventing the wheel and
    make use of existing libraries

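  22. Implementing machine learning is nothing but pain
    • Testing machine learning algorithms is painful, period
      • "You don't write tests? Could you say the same thing in front of ○○?"
    • Time- and space-efficient implementations are hard
    • I would rather write my own implementation only when existing libraries simply cannot solve the problem
      • For example, when squeezing out the last bit of accuracy has economic value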

  23. Machine learning in Java:
    what it suits and what it does not

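  24. A workflow like this, for example
    1. Identify the problem to solve
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system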

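  25. A workflow like this, for example
    1. Identify the problem to solve
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    Callout: machine learning is put to use around here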

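  26. A workflow like this, for example
    1. Identify the problem to solve
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    Callout: this part requires ad hoc analysis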

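  27. A workflow like this, for example
    1. Identify the problem to solve
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    Callout: this is the part Java is suited for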

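  28. A workflow like this, for example
    1. Identify the problem to solve
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    Callout: with Spark (Scala), this part might be covered as well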

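  29. Use the right tool for the job
    • Do ad hoc analysis and model building in R or Python
      • Interactive operation is easy
      • Repeated trial and error is easy
      • Spark's interactive shell might be a good option too
    • Use Python, Java, or Scala for the system-integration part
      • Cases that demand performance are exactly where Java and Scala come in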

  30. Machine learning libraries & frameworks
    usable from Java/Scala

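  31. Spark / MLlib
    • gradle 'org.apache.spark:spark-mllib_2.10:1.1.1'
    • https://github.com/apache/spark
    • ★ 2,336 → 4,813
    • MLlib is a machine learning library designed to be used on the distributed processing framework Spark
    • New features and improvements are still being added actively
    • Can also be used as an environment for ad hoc analysis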

  32. liblinear-java
    • gradle 'de.bwaldvogel:liblinear:1.95'
    • https://github.com/bwaldvogel/liblinear-java
    • ★ 121 → 144
    • A Java port of the library that specializes LibSVM for linear classification and regression
    • Library
    • Makes a fair effort to keep up with the latest version of the original (C++ version)

  33. Weka
    • gradle 'nz.ac.waikato.cms.weka:weka-stable:3.6.11'
    • An analysis platform that provides a wide variety of machine learning algorithms
    • Can also be used as a library
    • If you just want to try machine learning for now, using this is not a bad idea

  34. Mahout
    • gradle 'org.apache.mahout:mahout-core:0.9'
    • https://github.com/apache/mahout
    • ★ 229 → 507
    • A machine learning library on the distributed processing framework Hadoop
    • Development appears to be saying "Goodbye MapReduce" and moving toward better affinity with Spark and h2o
      • https://issues.apache.org/jira/browse/MAHOUT-1510
    • Still, there is a faint air of being past its prime…

  35. SAMOA
    • https://github.com/yahoo/samoa
    • ★ 363 → 397
    • A machine learning library that runs on distributed streaming frameworks such as Storm
    • Made by Yahoo! Labs
    • Development does not seem very active lately?

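  36. Jubatus
    • https://github.com/jubatus/jubatus
    • ★ 389 → 453
    • A distributed processing framework & online machine learning library
    • The core is implemented in C++, but a Java client library is provided
    • Bandit algorithms have been implemented, among other things; development is still ongoing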

  37. h2o
    • https://github.com/h2oai/h2o
    • ★ 1,333 → 1,741
    • A machine learning library that runs on the distributed processing framework Hadoop
    • If you want to do the much-talked-about deep learning in Java, is this the only choice!?

  38. Let's get started with machine learning

  39. Today's exercise:
    let's try spam mail classification

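  40. Spam mail classification
    • Decide whether a given text is spam mail or not
      • Corresponds to a discrimination / classification task in supervised learning
    • Extract term frequency from the text as features
    • This time we use logistic regression (rather than the usual Naive Bayes)
    • Tune parameters with grid search, and evaluate the model with k-fold cross validation & AUC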

  41. As a flow, it looks like this
    Text → Feature extraction → Training → Cross validation → Model
    Diagram label: Grid search

  42. bit.ly/javajo-ml

  43. Dataset

  44. UCI Machine learning repository
    • https://archive.ics.uci.edu/ml/datasets.html
    • Provided in formats such as CSV files
    • It also has Iris, the "Hello world" dataset of the data analysis and machine learning community
    • This time we use the SMS Spam collection
      • https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

  45. SMS Spam collection
    ham   Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
    ham   Ok lar... Joking wif u oni...
    spam  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    (TSV file)

  46. SMS Spam collection
    ham   Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
    ham   Ok lar... Joking wif u oni...
    spam  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    (TSV file)
    Label
      ham → not spam
      spam → spam
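
For readers who want to see how one record of this TSV file breaks down, here is a small plain-Java sketch that splits a line into its label and message text. It assumes each line has the form label, a single tab, then the message, as the sample above suggests. The class name `SmsRecordSketch` is hypothetical, and the talk itself loads the file through Spark rather than with code like this.

```java
// Hypothetical value class for one line of the SMS Spam collection file.
final class SmsRecordSketch {
    final boolean spam;
    final String text;

    private SmsRecordSketch(boolean spam, String text) {
        this.spam = spam;
        this.text = text;
    }

    // Parses "<label>\t<text>", where the label is either "ham" or "spam".
    static SmsRecordSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected <label><TAB><text>: " + line);
        }
        String label = line.substring(0, tab);
        if (!label.equals("ham") && !label.equals("spam")) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        return new SmsRecordSketch(label.equals("spam"), line.substring(tab + 1));
    }
}
```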

  47. Machine learning environment

  48. Spark / MLlib
    • The usual way is to set up a Spark cluster and use that
    • That is a hassle, so this time I will make do with a self-contained application
      • https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
    • In the spirit of a modern Spark application (?), I will try the ML Pipeline API and the DataFrame API

  49. ML Pipeline API
    • Expresses a machine learning workflow as a sequence of its tasks
      • ※ Feature extraction, training, parameter tuning, validation, etc.

  50. ML Pipeline API
    • It is merely a mechanism that makes MLlib's various algorithms easier to use
    • Note that not all of MLlib's algorithms are available through it
      • "Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml."
      • K-Means, for example, cannot be used

  51. ML Pipeline API
    • Still very much under development
      • It seems the best parameters found by parameter tuning cannot be retrieved
      • Models cannot be persisted
    • Still tough to use in production

  52. DataFrame API
    • A dataset that carries schema information
    • SQL-like operations can be applied to the dataset
      • select / join / filter / aggregation, and so on
    • Compared with RDD, the performance difference between language bindings is small
    • For details, see Ishikawa-san's slides on SlideShare
      • http://www.slideshare.net/yuishikawa/2015-0312-lt2-spark-dataframe-introduction

  53. Loading data

  54. CSV / TSV to DataFrame
    • Use spark-csv to load a CSV / TSV file as a DataFrame
      • https://github.com/databricks/spark-csv
    • It seems better to specify the schema explicitly
      • Without one, columns are treated as strings, so take particular care when the file contains numeric columns

  55. Feature extraction

  56. Feature extraction from text data
    • Tokenize the text on white space
      • org.apache.spark.ml.feature.Tokenizer
    • Take the frequency of each word (TF, term frequency)
      • org.apache.spark.ml.feature.HashingTF
      • Uses the hashing trick
    • Prepare a label column and a features column so that the DataFrame can be fed to LogisticRegression
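
The two steps on this slide (white-space tokenization, then term frequencies via the hashing trick) can be pictured with the plain-Java sketch below. This is not Spark's implementation: the class name is hypothetical, and `String.hashCode()` is used as the hash function purely for illustration, so bucket assignments will not match what `HashingTF` produces.

```java
import java.util.Locale;

// Hypothetical illustration of tokenization followed by hashed term frequencies.
final class HashingTfSketch {
    // Lower-cases the text and splits it on runs of white space.
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Hashing trick: each token is mapped to the bucket hash(token) mod numFeatures,
    // and the count in that bucket is incremented. Different tokens may collide.
    static double[] termFrequencies(String[] tokens, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String token : tokens) {
            int bucket = Math.floorMod(token.hashCode(), numFeatures);
            vector[bucket] += 1.0;
        }
        return vector;
    }
}
```

The vector length is fixed up front (the `numFeatures` parameter), so no vocabulary has to be built; the price is that unrelated words can share a bucket when the size is too small.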

  57. Logistic regression

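  58. Logistic regression
    • Supervised machine learning
    • It is called "regression", but it is arguably used more often for "classification / discrimination"?
      • Getting a class probability is a nice bonus
    • An example of its use for "prediction / regression"
      • Predicting click-through rates of Web ads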
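
The "class probability" mentioned above comes from passing a linear score through the logistic (sigmoid) function. The following plain-Java sketch shows only that prediction step, with hypothetical names and hand-picked weights; fitting the weights from data is a separate matter that this sketch does not cover.

```java
// Hypothetical illustration of how a logistic regression model turns features
// into a probability of the positive class.
final class LogisticSketch {
    // Logistic function: maps any real number into the range (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive class: sigmoid(w . x + b).
    static double probability(double[] weights, double bias, double[] features) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return sigmoid(score);
    }
}
```

Thresholding the probability (commonly at 0.5) gives a class decision, while using the probability itself gives something like a click-through-rate estimate.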

  59. Logistic regression in Spark / MLlib
    • org.apache.spark.ml.classification.LogisticRegression
      • The only optimizer (for now?) is LBFGS; SGD cannot be used
    • Parameters
      • regParam: regularization parameter
      • elasticNetParam: 0.0 gives L2 regularization, 1.0 gives L1 regularization
      • maxIter: maximum number of iterations to run before convergence
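
As a reading aid for how regParam and elasticNetParam interact, the elastic net penalty is usually written as below, with $\lambda$ standing for regParam, $\alpha$ for elasticNetParam, and $w$ for the weight vector. This follows the form given in Spark's own documentation as far as I recall, so treat the exact scaling as an assumption to verify there.

```latex
\lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right),
\qquad \alpha \in [0, 1],\ \lambda \ge 0
```

Setting $\alpha = 0$ leaves only the L2 term and $\alpha = 1$ leaves only the L1 term, which matches the two cases listed on the slide.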

  60. Parameter tuning

  61. Parameter tuning by grid search
    • org.apache.spark.ml.tuning.ParamGridBuilder
      • List the parameter values to try with #addGrid()
      • Pass the object obtained from #build() to CrossValidator#setEstimatorParamMaps()
    • CrossValidator works out the best parameters for you, based on the metrics obtained from the Evaluator
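
Conceptually, a parameter grid is just every combination of the candidate values. The plain-Java sketch below expands named candidate lists into that set of combinations. It is an illustration of the idea only, not how `ParamGridBuilder` is implemented, and the class name and candidate values in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of what a parameter grid contains.
final class ParamGridSketch {
    // Expands {name -> candidate values} into the list of all combinations.
    static List<Map<String, Double>> build(Map<String, double[]> candidates) {
        List<Map<String, Double>> grid = new ArrayList<Map<String, Double>>();
        grid.add(new LinkedHashMap<String, Double>());
        for (Map.Entry<String, double[]> entry : candidates.entrySet()) {
            List<Map<String, Double>> expanded = new ArrayList<Map<String, Double>>();
            for (Map<String, Double> partial : grid) {
                for (double value : entry.getValue()) {
                    // Copy the partial assignment and add one more parameter to it.
                    Map<String, Double> next = new LinkedHashMap<String, Double>(partial);
                    next.put(entry.getKey(), value);
                    expanded.add(next);
                }
            }
            grid = expanded;
        }
        return grid;
    }
}
```

With three candidate values for regParam and two for maxIter, for example, the grid has six entries, and each entry is trained and scored once per cross-validation fold. That multiplication is why the number of candidates is worth keeping small.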

  62. What we tune this time
    • Feature extraction
      • numFeatures: number of dimensions after feature hashing
    • Logistic regression
      • regParam: regularization parameter
      • maxIter: number of iterations until convergence

  63. k-fold cross validation

  64. K-fold cross validation
    Split the dataset into k parts; train on k-1 of them and take metrics, then repeat
    Dataset
    p1, p2, …, pk
    Compute the average: Σ p_i / k
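
The procedure on this slide can be sketched in plain Java as follows. Samples are assigned to folds by index modulo k, with no shuffling, which is a simplification made for this sketch. `FoldScorer` is a hypothetical callback that would train on the given indices and return a metric measured on the held-out indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of k-fold cross validation over sample indices.
final class KFoldSketch {
    // Callback: train on trainIndices, evaluate on testIndices, return the metric.
    interface FoldScorer {
        double score(List<Integer> trainIndices, List<Integer> testIndices);
    }

    // Assigns sample i to fold (i % k). No shuffling is done here.
    static List<List<Integer>> foldIndices(int sampleCount, int k) {
        List<List<Integer>> folds = new ArrayList<List<Integer>>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < sampleCount; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    // Holds out each fold in turn, trains on the other k-1 folds,
    // and returns the average of the k metrics.
    static double crossValidate(int sampleCount, int k, FoldScorer scorer) {
        List<List<Integer>> folds = foldIndices(sampleCount, k);
        double total = 0.0;
        for (int heldOut = 0; heldOut < k; heldOut++) {
            List<Integer> train = new ArrayList<Integer>();
            for (int f = 0; f < k; f++) {
                if (f != heldOut) {
                    train.addAll(folds.get(f));
                }
            }
            total += scorer.score(train, folds.get(heldOut));
        }
        return total / k;
    }
}
```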

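  65. AUC (Area under the curve)
    Image quoted from Prof. Okumura's "ROC curve"
    https://oku.edu.mie-u.ac.jp/~okumura/stat/ROC.html
    The area here is the AUC
    The larger the area (the closer to 1), the better the accuracy
    True positive
      Probability of correctly detecting spam
    False positive
      Probability of mistakenly judging a message to be spam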
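
One way to compute the area under the ROC curve without drawing the curve is the pairwise form: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as one half. The plain-Java sketch below uses that form. It is quadratic in the number of samples and meant only to show the definition; the class name is hypothetical.

```java
// Hypothetical illustration of AUC as a pairwise ranking statistic.
final class AucSketch {
    // scores[i] is the predicted score of sample i; positive[i] is its true label.
    static double auc(double[] scores, boolean[] positive) {
        long pairs = 0;
        double wins = 0.0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;      // positive ranked above negative
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;      // tie counts as half
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative sample");
        }
        return wins / pairs;
    }
}
```

A model that ranks every spam message above every ham message scores 1.0, and one that assigns the same score to everything scores 0.5.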

  66. Cross validation & metrics
    • Cross validation
      • org.apache.spark.ml.tuning.CrossValidator
      • #fit() returns the model trained on the given training data, together with the best parameters
    • Metric for parameter selection
      • org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      • Computes the AUC for you

  67. Thank you very much!