#javajo Getting Started with Machine Learning in Java/Scala

Slides presented at https://javajo.doorkeeper.jp/events/27588.

KOMIYA Atsushi

July 23, 2015

Transcript

  1. Getting Started with Machine Learning in Java/Scala
    2015-07-23 Javajo
    @komiya_atsushi

  2. About me
    ("Who are you?")

  3. KOMIYA Atsushi
    [email protected]

  6. bit.ly/WeLoveSmartNews

  7. We're hiring!
    iOS engineers / Android engineers
    / Web application engineers
    / Productivity engineers
    / Machine learning and natural language processing engineers
    / Growth hack engineers
    / Server-side engineers
    / Ad engineers…

  8. Today's topic

  10. I overlooked it...
    Sorry m(_ _)m

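  11. Machine learning! Machine learning!
    • This is a rehash of my talk at the JJUG Night Seminar "Machine Learning and Natural Language Processing Special"
    • If you have already heard it, feel free to sleep
    • I will give an introduction to machine learning and talk about doing machine learning casually in Java/Scala
    • If you are a hardcore practitioner, feel free to sleep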

  12. Things to know before getting started with machine learning

  13. What is machine learning?
    Quoted from "Machine Learning Tutorial @ Jubatus Casual Talks"
    http://www.slideshare.net/unnonouno/jubatus-casual-talks

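  14. What can machine learning do?
    Classification / discrimination
    Prediction / regression
    Pattern mining
    Association rules
    Clustering
    Spam email detection
    Category classification of news articles
    Product recommendation
    Demand / sales forecasting
    …and so on (just a few examples)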

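  15. What can machine learning do?
    Pattern mining
    Association rules
    Clustering
    Spam email detection
    Category classification of news articles
    Product recommendation
    Demand / sales forecasting
    Supervised learning
    ・There is a correct answer
    ・Builds a "model"
    Classification / discrimination
    Prediction / regression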

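  16. What can machine learning do?
    Discrimination / classification
    Regression / prediction
    Pattern mining
    Association rules
    Clustering
    Spam email detection
    Category classification of news articles
    Product recommendation
    Demand / sales forecasting
    Unsupervised learning
    There is no clear-cut correct answer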

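  17. What do we use as input data?
    • Only sequences of numbers (vectors) can be handled
    • Unstructured data (images, audio, text, access logs, etc.) cannot be handled as-is
    • Features must be extracted and turned into vectors
    • This is so-called feature engineering
    • For the training data of supervised learning, the correct-answer information that serves as the "label" is attached as well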

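  18. What do we use as input data?
    • Extracting and transforming features
      • Categorical variables: One-hot encoding
      • Natural language: Term frequency, Word2vec, etc.
      • Images: SIFT, SURF, AKAZE, etc.
      • Recently, deep learning is sometimes used for feature extraction too
    • Representing high-dimensional & sparse feature vectors
      • Feature hashing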
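
To make the one-hot encoding item concrete, here is a minimal plain-Java sketch. The class name `OneHotEncoderSketch` and the choice to encode unseen categories as an all-zero vector are assumptions made for illustration; they are not taken from the talk or from any library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy one-hot encoder: each distinct category gets its own vector position. */
final class OneHotEncoderSketch {
    private final Map<String, Integer> index = new LinkedHashMap<>();

    /** Learns the vocabulary; categories are numbered in order of first appearance. */
    OneHotEncoderSketch(List<String> categories) {
        for (String category : categories) {
            index.putIfAbsent(category, index.size());
        }
    }

    /** Returns a vector with a single 1.0 at the category's position (all zeros if unseen). */
    double[] encode(String category) {
        double[] vector = new double[index.size()];
        Integer position = index.get(category);
        if (position != null) {
            vector[position] = 1.0;
        }
        return vector;
    }
}
```

With many distinct categories the vectors become long and mostly zero, which is the "high-dimensional & sparse" situation that feature hashing addresses.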

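  19. Are the results we obtained correct?
    • Verifying correctness
      • k-fold cross validation
    • Measuring correctness
      • Classification / discrimination
        • AUC, Precision, Recall, F-measure
      • Prediction / regression
        • Correlation coefficient, coefficient of determination, MAE, RMSE, LogLoss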
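
A few of the metrics named above can be written down in a handful of lines. The following is a sketch in plain Java, with a hypothetical class name `MetricsSketch`; returning 0.0 when a denominator is zero is an assumption of this sketch, not a universal convention.

```java
/** A few of the evaluation metrics listed on the slide, in plain Java. */
final class MetricsSketch {
    private MetricsSketch() {}

    /** Returns {precision, recall, F-measure} for predicted vs. actual boolean labels. */
    static double[] precisionRecallF(boolean[] actual, boolean[] predicted) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] && actual[i]) tp++;       // true positive
            else if (predicted[i]) fp++;               // false positive
            else if (actual[i]) fn++;                  // false negative
        }
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        double f = precision + recall == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f};
    }

    /** Mean absolute error for regression. */
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) sum += Math.abs(actual[i] - predicted[i]);
        return sum / actual.length;
    }

    /** Root mean squared error for regression. */
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }
}
```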

  20. Linear separability and nonlinearity
    • Can the points (samples) be separated by a single line (hyperplane)?
    Linearly separable / Not linearly separable / Nonlinear

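  21. Online learning and offline learning
    • Online learning
      • The model is updated continually from data that arrives sequentially
      • Think of it as stream processing
      • Data can be discarded after use instead of being accumulated
      • (tear one off and throw it, tear one off and throw it…)
    • Offline learning
      • The model is updated in one go from accumulated data
      • Corresponds to batch processing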
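
The contrast between the two styles can be sketched with a toy logistic regression in plain Java. Everything here (the class name `LogisticRegressionSketch`, the fixed learning rate, plain gradient descent) is an assumption chosen for illustration and is not how any particular library implements it.

```java
/** Toy logistic regression that can be trained online (per sample) or offline (per batch). */
final class LogisticRegressionSketch {
    private final double[] weights;
    private double bias;
    private final double learningRate;

    LogisticRegressionSketch(int dimensions, double learningRate) {
        this.weights = new double[dimensions];
        this.learningRate = learningRate;
    }

    /** Estimated probability that x belongs to the positive class. */
    double probability(double[] x) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) z += weights[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Online learning: one gradient step from a single sample, which can then be discarded. */
    void updateOnline(double[] x, int label) {
        double error = probability(x) - label;
        for (int i = 0; i < weights.length; i++) weights[i] -= learningRate * error * x[i];
        bias -= learningRate * error;
    }

    /** Offline learning: each epoch takes one step using the gradient averaged over all stored data. */
    void fitBatch(double[][] xs, int[] labels, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] gradW = new double[weights.length];
            double gradB = 0.0;
            for (int n = 0; n < xs.length; n++) {
                double error = probability(xs[n]) - labels[n];
                for (int i = 0; i < weights.length; i++) gradW[i] += error * xs[n][i];
                gradB += error;
            }
            for (int i = 0; i < weights.length; i++) weights[i] -= learningRate * gradW[i] / xs.length;
            bias -= learningRate * gradB / xs.length;
        }
    }
}
```

`updateOnline` needs only the current sample, so nothing has to be stored; `fitBatch` needs the whole dataset in memory for every epoch.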

  22. Doing machine learning in Java/Scala

  23. Let me say this up front

  24. Stop reinventing the wheel and make use of existing libraries

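  25. Implementing machine learning yourself is nothing but pain
    • Testing machine learning algorithms is painful above all
    • "Not writing tests? Could you say the same thing in front of ○○?"
    • Implementations that are efficient in time and space are hard to write
    • Write your own implementation only when existing libraries simply cannot solve the problem
    • For example, when squeezing out the last bit of accuracy has economic value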

  26. Machine learning in Java
    Where it fits and where it does not

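  27. An example workflow
    1. Identify the problem to be solved
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system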

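  28. An example workflow
    1. Identify the problem to be solved
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    (Annotation) Machine learning is put to use around here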

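  29. An example workflow
    1. Identify the problem to be solved
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    (Annotation) This part requires ad hoc analysis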

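  30. An example workflow
    1. Identify the problem to be solved
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    (Annotation) This is the part Java is suited for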

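  31. An example workflow
    1. Identify the problem to be solved
      • What kind of task fits it?
    2. Deepen your understanding of the data you have
      • What kind of features can be extracted?
    3. Build a model
      • Which algorithm should be used?
      • Which features should be used?
    4. Evaluate the model you built
      • How accurate is it?
    5. Integrate it into a system / turn it into a system
    (Annotation) With Spark (Scala), this part might be covered as well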

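  32. Use the right tool for the job
    • Do ad hoc analysis and model building in R or Python
      • Interactive operation is easy
      • Repeated trial and error is easy
      • Spark's interactive shell may also be a good option
    • Use Python, Java, or Scala for the part that turns it into a system
      • Cases that demand performance are exactly where Java and Scala come in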

  33. Machine learning libraries & frameworks usable from Java/Scala

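  34. Spark / MLlib
    • gradle 'org.apache.spark:spark-mllib_2.10:1.1.1'
    • https://github.com/apache/spark
    • ★ 2,336 → 4,813
    • MLlib is a machine learning library designed to be used on the distributed processing framework Spark
    • Features are still actively being added and improved
    • Can also be used as an environment for ad hoc analysis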

  35. liblinear-java
    • gradle 'de.bwaldvogel:liblinear:1.95'
    • https://github.com/bwaldvogel/liblinear-java
    • ★ 121 → 144
    • A Java port of the library that specializes LibSVM for linear classification and regression
    • Makes a fair effort to keep up with the latest version of the original (C++ version)

  36. Weka
    • gradle 'nz.ac.waikato.cms.weka:weka-stable:3.6.11'
    • An analysis platform that provides a wide variety of machine learning algorithms
    • Can also be used as a library
    • If you just want to try machine learning for now, using this is not a bad idea

  37. Mahout
    • gradle 'org.apache.mahout:mahout-core:0.9'
    • https://github.com/apache/mahout
    • ★ 229 → 507
    • A machine learning library on the distributed processing framework Hadoop
    • Development appears to be saying "Goodbye MapReduce" and improving affinity with Spark and h2o
    • https://issues.apache.org/jira/browse/MAHOUT-1510
    • Still, there is a faint sense that its time has passed…

  38. SAMOA
    • https://github.com/yahoo/samoa
    • ★ 363 → 397
    • A machine learning library usable on distributed streaming frameworks such as Storm
    • Made by Yahoo! Labs
    • Development does not seem very active lately?

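  39. Jubatus
    • https://github.com/jubatus/jubatus
    • ★ 389 → 453
    • A distributed processing framework & online machine learning library
    • The core is implemented in C++, but a Java client library is provided
    • Development is still ongoing; for example, a bandit algorithm has been implemented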

  40. h2o
    • https://github.com/h2oai/h2o
    • ★ 1,333 → 1,741
    • A machine learning library usable on the distributed processing framework Hadoop
    • If you want to do the much-talked-about deep learning in Java, this is the only choice!?

  41. Let's get started with machine learning

  42. Today's task:
    Let's try spam message detection

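  43. Spam message detection
    • Decide whether a given text is spam or not
    • Corresponds to a supervised discrimination / classification task
    • Extract term frequency from the text as features
    • This time we use logistic regression (rather than the usual Naive Bayes)
    • Tune parameters with grid search while evaluating the model with k-fold cross validation & AUC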

  44. As a flow, it looks like this
    Text → Feature extraction → Training → Cross validation → Model
    Grid search

  45. bit.ly/javajo-ml

  46. Dataset

  47. UCI Machine learning repository
    • https://archive.ics.uci.edu/ml/datasets.html
    • Datasets are provided in formats such as CSV files
    • It also has Iris, the "Hello world" dataset of the data analysis and machine learning community
    • This time we use the SMS Spam collection
    • https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

  48. SMS Spam collection
    ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
    ham Ok lar... Joking wif u oni...
    spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    TSV file

  49. SMS Spam collection
    ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
    ham Ok lar... Joking wif u oni...
    spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    TSV file
    Label
    ham → not spam
    spam → spam
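
As a rough illustration of how such records could be read, the sketch below parses one line into a numeric label and the message text. It assumes each line is `<label><TAB><message>`, as the "TSV file" note suggests; the class name `SmsRecordSketch` and the 1.0 / 0.0 label encoding are hypothetical choices for this example.

```java
/** One labeled SMS message: label is 1.0 for spam and 0.0 for ham. */
final class SmsRecordSketch {
    final double label;
    final String text;

    private SmsRecordSketch(double label, String text) {
        this.label = label;
        this.text = text;
    }

    /** Parses one "<label><TAB><message>" line; the message itself may contain spaces. */
    static SmsRecordSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected <label><TAB><message>: " + line);
        }
        String rawLabel = line.substring(0, tab);
        String text = line.substring(tab + 1);
        switch (rawLabel) {
            case "spam":
                return new SmsRecordSketch(1.0, text);
            case "ham":
                return new SmsRecordSketch(0.0, text);
            default:
                throw new IllegalArgumentException("unknown label: " + rawLabel);
        }
    }
}
```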

  50. Machine learning environment

  51. Spark / MLlib
    • Normally you would build a Spark cluster and use that
    • That is a hassle, so this time we make do with a self-contained application
    • https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
    • In keeping with a modern Spark application (?), we try the ML Pipeline API and the DataFrame API

  52. ML Pipeline API
    • Expresses a machine learning workflow as a list of its tasks
    • Note: feature extraction, training, parameter tuning, validation, etc.

  53. ML Pipeline API
    • It is merely a mechanism that makes the various MLlib algorithms easier to use
    • Note that not all MLlib algorithms can be used through it
    • "Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml."
    • K-Means, for example, cannot be used

  54. ML Pipeline API
    • Still very much under development
    • It seems the best parameters found by parameter tuning cannot be retrieved
    • Models cannot be persisted
    • Still tough to use in production

  55. DataFrame API
    • A dataset that carries schema information
    • SQL-like operations can be applied to the dataset
    • select / join / filter / aggregation and so on
    • Compared with RDD, the performance difference between language bindings is small
    • For details, see Ishikawa-san's slideshare
    • http://www.slideshare.net/yuishikawa/2015-0312-lt2-spark-dataframe-introduction

  56. Data loading

  57. CSV / TSV to DataFrame
    • Use spark-csv to load CSV / TSV files as a DataFrame
    • https://github.com/databricks/spark-csv
    • It seems better to specify the schema explicitly
    • If you do not, every column is treated as a string, so take particular care when the data contains numeric columns

  58. Feature extraction

  59. Feature extraction from text data
    • Tokenize the text on white space
      • org.apache.spark.ml.feature.Tokenizer
    • Take the frequency of each word (TF, term frequency)
      • org.apache.spark.ml.feature.HashingTF
      • Uses the hashing trick
    • Prepare a label column and a features column so that the DataFrame can be fed to LogisticRegression
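
To show what whitespace tokenization followed by a hashed term-frequency vector amounts to, here is a plain-Java sketch. It does not use Spark: the class name `HashingTfSketch` is hypothetical, and using `String.hashCode()` as the hash function is an assumption of this sketch, not necessarily what `HashingTF` does internally.

```java
import java.util.Locale;

/** Whitespace tokenization followed by a hashed term-frequency vector. */
final class HashingTfSketch {
    private HashingTfSketch() {}

    /** Lowercases the text and splits it on runs of whitespace. */
    static String[] tokenize(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        return normalized.isEmpty() ? new String[0] : normalized.split("\\s+");
    }

    /** Counts terms into a fixed-size vector; the bucket is the term's hash modulo numFeatures. */
    static double[] termFrequencies(String[] tokens, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String token : tokens) {
            // floorMod keeps the bucket non-negative even when hashCode() is negative.
            tf[Math.floorMod(token.hashCode(), numFeatures)] += 1.0;
        }
        return tf;
    }
}
```

Because the vector length is fixed up front, no vocabulary needs to be stored; the price is that two different terms can collide in the same bucket, which is why `numFeatures` is worth tuning.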

  60. Logistic regression

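  61. Logistic regression
    • Supervised machine learning
    • Although it is called "regression", it is arguably used more often for "classification / discrimination"?
      • It is nice that it yields class probabilities
    • An example of its use as "prediction / regression"
      • Click-through rate prediction for Web ads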

  62. Logistic regression in Spark / MLlib
    • org.apache.spark.ml.classification.LogisticRegression
    • The only optimizer (for now?) is LBFGS; SGD cannot be used
    • Parameters
      • regParam: regularization parameter
      • elasticNetParam: 0.0 gives L2 regularization, 1.0 gives L1 regularization
      • maxIter: maximum number of iterations until convergence
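
The roles of regParam and elasticNetParam can be written as a formula. The following follows the elastic-net form described in the Spark documentation as I understand it; the symbols are notation introduced here, with $\lambda$ standing for regParam, $\alpha$ for elasticNetParam, $w$ for the weight vector, and $L$ for the logistic loss.

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} L(w;\, x_i, y_i)
  \;+\; \lambda\left( \alpha\,\lVert w\rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert w\rVert_2^2 \right),
\qquad \alpha \in [0, 1],\; \lambda \ge 0
```

Setting $\alpha = 0$ leaves only the $\lVert w\rVert_2^2$ term (L2 regularization) and $\alpha = 1$ leaves only the $\lVert w\rVert_1$ term (L1 regularization), which matches the two cases listed for elasticNetParam.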

  63. Parameter tuning

  64. Parameter tuning with grid search
    • org.apache.spark.ml.tuning.ParamGridBuilder
    • Use #addGrid() to list the parameter values to try
    • Pass the object returned by #build() to CrossValidator#setEstimatorParamMaps()
    • CrossValidator works out the best parameters on its own, based on the metrics produced by the Evaluator

  65. What we tune this time
    • Feature extraction
      • numFeatures: number of dimensions after feature hashing
    • Logistic regression
      • regParam: regularization parameter
      • maxIter: number of iterations until convergence
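
Grid search simply tries every combination of the candidate values. The plain-Java sketch below expands such a grid into its Cartesian product; the class name `ParamGridSketch` is hypothetical, and it only mimics the idea behind a parameter grid rather than reproducing Spark's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Expands a parameter grid into every combination of candidate values. */
final class ParamGridSketch {
    private ParamGridSketch() {}

    /** Turns {parameter -> candidate values} into the Cartesian product of all candidates. */
    static List<Map<String, Double>> build(Map<String, double[]> grid) {
        List<Map<String, Double>> combinations = new ArrayList<>();
        combinations.add(new LinkedHashMap<>()); // start from one empty combination
        for (Map.Entry<String, double[]> entry : grid.entrySet()) {
            List<Map<String, Double>> next = new ArrayList<>();
            for (Map<String, Double> partial : combinations) {
                for (double value : entry.getValue()) {
                    Map<String, Double> extended = new LinkedHashMap<>(partial);
                    extended.put(entry.getKey(), value);
                    next.add(extended);
                }
            }
            combinations = next;
        }
        return combinations;
    }
}
```

With two candidate values each for numFeatures, regParam, and maxIter this yields 2 × 2 × 2 = 8 combinations, and each one is cross-validated in turn, so the cost grows multiplicatively with every parameter added.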

  66. k-fold cross validation

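  67. K-fold cross validation
    [Figure: the Dataset is split into k parts; training on k-1 parts and taking metrics on the remaining part is repeated, giving p_1, p_2, …, p_k; the mean Σ p_i / k is then computed.]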
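
The procedure in the figure can be sketched in plain Java as follows. The class name `KFoldSketch`, the round-robin fold assignment, and the `trainAndEvaluate` callback are assumptions made for this illustration; a real setup would usually shuffle the samples before splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** k-fold cross validation over sample indices 0..n-1. */
final class KFoldSketch {
    private KFoldSketch() {}

    /** Assigns sample indices to k folds in round-robin order. */
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i);
        return folds;
    }

    /**
     * For each fold: train on the other k-1 folds, evaluate on the held-out fold,
     * and finally return the mean of the k metric values.
     */
    static double crossValidate(int n, int k,
            ToDoubleBiFunction<List<Integer>, List<Integer>> trainAndEvaluate) {
        List<List<Integer>> folds = folds(n, k);
        double sum = 0.0;
        for (int heldOut = 0; heldOut < k; heldOut++) {
            List<Integer> train = new ArrayList<>();
            for (int f = 0; f < k; f++) {
                if (f != heldOut) train.addAll(folds.get(f));
            }
            sum += trainAndEvaluate.applyAsDouble(train, folds.get(heldOut));
        }
        return sum / k;
    }
}
```

The callback receives the training indices and the held-out indices and returns one metric value (for example an AUC), so the sketch stays independent of any particular model.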

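  68. AUC (Area under the curve)
    Image quoted from Prof. Okumura's "ROC curve"
    https://oku.edu.mie-u.ac.jp/~okumura/stat/ROC.html
    The area here is the AUC
    The larger the area (the closer to 1), the better the accuracy
    True positive: the probability that spam is correctly detected
    False positive: the probability that a message is wrongly judged to be spam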
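
One way to compute the AUC without drawing the ROC curve is to use its rank interpretation: the AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below takes that route in plain Java; the class name `AucSketch` and the simple pairwise loop are choices made for clarity, not for speed.

```java
/** AUC via its rank interpretation: how often a positive outranks a negative. */
final class AucSketch {
    private AucSketch() {}

    /** Fraction of (positive, negative) pairs ranked correctly; ties count as half. */
    static double auc(double[] scores, boolean[] isPositive) {
        double correct = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!isPositive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (isPositive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) correct += 1.0;
                else if (scores[i] == scores[j]) correct += 0.5;
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative sample");
        }
        return correct / pairs;
    }
}
```

A perfect ranking gives 1.0 and scores unrelated to the labels give about 0.5, which is why a value close to 1 indicates a good model.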

  69. Cross validation & metrics
    • Cross validation
      • org.apache.spark.ml.tuning.CrossValidator
      • #fit() returns the model trained on the given training data together with the best parameters
    • Metric for parameter selection
      • org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      • Computes the AUC for you

  70. Demo

  71. Thank you very much!