Slide 1

Slide 1 text

Getting Started with Machine Learning in Java/Scala 2015-07-23 Javajo @komiya_atsushi

Slide 2

Slide 2 text

"Omadare" (Who are you?)

Slide 3

Slide 3 text

KOMIYA Atsushi @komiya_atsushi

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

bit.ly/WeLoveSmartNews

Slide 7

Slide 7 text

We're hiring! iOS engineers / Android engineers / Web application engineers / Productivity engineers / Machine learning / natural language processing engineers / Growth hack engineers / Server-side engineers / Ad engineers…

Slide 8

Slide 8 text

Today's topic

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

I overlooked it... Sorry m(_ _)m

Slide 11

Slide 11 text

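Machine learning! Machine learning!
• This is a rehash of my talk at the JJUG Night Seminar "Machine Learning and Natural Language Processing Special"
• If you have already heard it, feel free to sleep through this one
• I will give an introduction to machine learning and talk about doing machine learning casually in Java/Scala
• If you are a hardcore practitioner, feel free to sleep as well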

Slide 12

Slide 12 text

Things to know before getting started with machine learning

Slide 13

Slide 13 text

What is machine learning? Quoted from "Machine Learning Tutorial @ Jubatus Casual Talks": http://www.slideshare.net/unnonouno/jubatus-casual-talks

Slide 14

Slide 14 text

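What can machine learning do?
• Task types: classification / identification, prediction / regression, pattern mining, association rules, clustering
• Applications: spam email detection, categorizing news articles, product recommendation, demand and sales forecasting
• ...and so on (just a few examples)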

Slide 15

Slide 15 text

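What can machine learning do?
• Task types: classification / identification, prediction / regression, pattern mining, association rules, clustering
• Applications: spam email detection, categorizing news articles, product recommendation, demand and sales forecasting
• Supervised learning (classification / identification, prediction / regression): there is a correct answer, and you build a "model"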

Slide 16

Slide 16 text

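What can machine learning do?
• Task types: identification / classification, regression / prediction, pattern mining, association rules, clustering
• Applications: spam email detection, categorizing news articles, product recommendation, demand and sales forecasting
• Unsupervised learning (pattern mining, association rules, clustering): there is no clear correct answer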

Slide 17

Slide 17 text

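What do we use as input data?
• Only sequences of numbers (vectors) can be handled
• Unstructured data (images, audio, text, access logs, etc.) cannot be used as is
• Features must be extracted and turned into vectors
• This is what is known as feature engineering
• For the training data of supervised learning, you also attach the correct answer as a "label"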

Slide 18

Slide 18 text

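What do we use as input data?
• Feature extraction and transformation
  • Categorical variables: One-hot encoding
  • Natural language: Term frequency, Word2vec, etc.
  • Images: SIFT, SURF, AKAZE, etc.
  • Recently, Deep learning is also used for feature extraction
• Representing high-dimensional, sparse feature vectors
  • Feature hashing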
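As a small illustration of the one-hot encoding mentioned above, here is a minimal sketch in plain Java. The class name `OneHotExample`, the method `oneHot`, and the example category list are assumptions made for this illustration; they do not come from the talk or from any library.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal one-hot encoder for a fixed, known list of categories (illustrative only). */
final class OneHotExample {

    /** Returns a vector with 1.0 at the index of the given category and 0.0 elsewhere. */
    static double[] oneHot(List<String> categories, String value) {
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        double[] vector = new double[categories.size()];
        vector[index] = 1.0;
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical news categories, chosen only for the example.
        List<String> categories = Arrays.asList("sports", "politics", "tech");
        System.out.println(Arrays.toString(oneHot(categories, "politics")));
    }
}
```

With many categories the vector becomes long and mostly zero, which is why sparse representations and feature hashing matter in practice.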

Slide 19

Slide 19 text

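Are the results correct?
• Verifying correctness
  • k-fold cross validation
• Measuring correctness
  • Classification / identification
    • AUC, Precision, Recall, F-measure
  • Prediction / regression
    • Correlation coefficient, coefficient of determination, MAE, RMSE, LogLoss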
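To make the classification metrics concrete, the following sketch computes precision, recall, and the F-measure (here the balanced F1 variant) from actual and predicted binary labels. The class `BinaryMetrics` and its methods are hypothetical names for this example; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```java
/** Precision, recall and F1 for binary labels (illustrative sketch). */
final class BinaryMetrics {
    private final int truePositives;
    private final int falsePositives;
    private final int falseNegatives;

    private BinaryMetrics(int tp, int fp, int fn) {
        this.truePositives = tp;
        this.falsePositives = fp;
        this.falseNegatives = fn;
    }

    /** Counts true positives, false positives and false negatives. */
    static BinaryMetrics of(boolean[] actual, boolean[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("Length mismatch");
        }
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] && actual[i]) tp++;
            else if (predicted[i]) fp++;
            else if (actual[i]) fn++;
        }
        return new BinaryMetrics(tp, fp, fn);
    }

    /** Fraction of predicted positives that are really positive. */
    double precision() {
        int denominator = truePositives + falsePositives;
        return denominator == 0 ? 0.0 : (double) truePositives / denominator;
    }

    /** Fraction of real positives that were found. */
    double recall() {
        int denominator = truePositives + falseNegatives;
        return denominator == 0 ? 0.0 : (double) truePositives / denominator;
    }

    /** Harmonic mean of precision and recall. */
    double f1() {
        double p = precision();
        double r = recall();
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        boolean[] actual = {true, true, true, false, false};
        boolean[] predicted = {true, true, false, true, false};
        BinaryMetrics m = BinaryMetrics.of(actual, predicted);
        System.out.println(m.precision() + " " + m.recall() + " " + m.f1());
    }
}
```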

Slide 20

Slide 20 text

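Linearly separable vs. non-linear
• Can the points (samples) be separated by a single line (hyperplane)?
[Figure: a linearly separable example / a non-linearly-separable (non-linear) example]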

Slide 21

Slide 21 text

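Online learning vs. offline learning
• Online learning
  • Updates the model as data arrives, one piece at a time
  • Think of stream processing
  • Data can be discarded after use instead of being stored
  • (Tear off a piece and throw it away, tear off a piece and throw it away…)
• Offline learning
  • Updates the model in one go from accumulated data
  • Corresponds to batch processing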
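The online style can be sketched with a classic perceptron, which updates its weights from one example at a time and never needs to keep the example afterwards. This is a generic textbook algorithm picked for illustration, not something the talk implements; the class name `OnlinePerceptron` and the toy data are assumptions.

```java
/** Online perceptron: the model is updated per example, and the example can then be dropped. */
final class OnlinePerceptron {
    private final double[] weights;
    private double bias;

    OnlinePerceptron(int dimensions) {
        this.weights = new double[dimensions];
    }

    /** Raw linear score w . x + b. */
    double score(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    /** Predicted label: +1 or -1. */
    int predict(double[] x) {
        return score(x) >= 0.0 ? 1 : -1;
    }

    /** Consumes one labeled example (label is +1 or -1); updates only on a mistake. */
    void update(double[] x, int label) {
        if (label * score(x) <= 0.0) {
            for (int i = 0; i < weights.length; i++) {
                weights[i] += label * x[i];
            }
            bias += label;
        }
    }

    public static void main(String[] args) {
        OnlinePerceptron model = new OnlinePerceptron(2);
        // Examples arrive one by one, as in a stream.
        model.update(new double[] {2.0, 1.0}, 1);
        model.update(new double[] {-1.0, -2.0}, -1);
        System.out.println(model.predict(new double[] {3.0, 3.0}));
    }
}
```

An offline learner would instead collect all examples first and fit the model over the whole set, typically in several passes.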

Slide 22

Slide 22 text

Doing machine learning in Java/Scala

Slide 23

Slide 23 text

Let me say this up front

Slide 24

Slide 24 text

Stop reinventing the wheel and make use of existing libraries

Slide 25

Slide 25 text

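Implementing machine learning yourself is nothing but pain
• Testing machine learning algorithms is painful, period
  • "Not writing tests? Could you say the same thing in front of ○○?"
• Implementations that are efficient in time and space are hard to write
• Implement it yourself only when existing libraries really cannot solve the problem
  • For example, when squeezing out the last bit of accuracy has economic value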

Slide 26

Slide 26 text

Machine learning in Java: what it is and is not suited for

Slide 27

Slide 27 text

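An example workflow
1. Identify the problem to solve
  • Which kind of task fits?
2. Deepen your understanding of the data you have
  • Which features can be extracted?
3. Build a model
  • Which algorithm should be used?
  • Which features should be used?
4. Evaluate the model you built
  • How accurate is it?
5. Integrate it into a system / turn it into a system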

Slide 28

Slide 28 text

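An example workflow
1. Identify the problem to solve
  • Which kind of task fits?
2. Deepen your understanding of the data you have
  • Which features can be extracted?
3. Build a model
  • Which algorithm should be used?
  • Which features should be used?
4. Evaluate the model you built
  • How accurate is it?
5. Integrate it into a system / turn it into a system
(Callout: machine learning is put to use around here)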

Slide 29

Slide 29 text

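An example workflow
1. Identify the problem to solve
  • Which kind of task fits?
2. Deepen your understanding of the data you have
  • Which features can be extracted?
3. Build a model
  • Which algorithm should be used?
  • Which features should be used?
4. Evaluate the model you built
  • How accurate is it?
5. Integrate it into a system / turn it into a system
(Callout: this part requires ad hoc analysis)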

Slide 30

Slide 30 text

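An example workflow
1. Identify the problem to solve
  • Which kind of task fits?
2. Deepen your understanding of the data you have
  • Which features can be extracted?
3. Build a model
  • Which algorithm should be used?
  • Which features should be used?
4. Evaluate the model you built
  • How accurate is it?
5. Integrate it into a system / turn it into a system
(Callout: this is the part Java is suited for)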

Slide 31

Slide 31 text

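An example workflow
1. Identify the problem to solve
  • Which kind of task fits?
2. Deepen your understanding of the data you have
  • Which features can be extracted?
3. Build a model
  • Which algorithm should be used?
  • Which features should be used?
4. Evaluate the model you built
  • How accurate is it?
5. Integrate it into a system / turn it into a system
(Callout: with Spark (Scala), this part may be covered as well)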

Slide 32

Slide 32 text

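Use the right tool for the job
• Do ad hoc analysis and model building in R or Python
  • Interactive operation is easy
  • Repeated trial and error is easy
  • The Spark interactive shell may also be a good option
• Use Python, Java, or Scala for the system integration part
  • Cases that demand performance are exactly where Java and Scala come in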

Slide 33

Slide 33 text

Machine learning libraries & frameworks usable from Java/Scala

Slide 34

Slide 34 text

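Spark / MLlib
• gradle 'org.apache.spark:spark-mllib_2.10:1.1.1'
• https://github.com/apache/spark
• ★ 2,336 → 4,813
• MLlib is a machine learning library designed to run on the distributed processing framework Spark
• New features and improvements are still being added actively
• Can also be used as an environment for ad hoc analysis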

Slide 35

Slide 35 text

liblinear-java
• gradle 'de.bwaldvogel:liblinear:1.95'
• https://github.com/bwaldvogel/liblinear-java
• ★ 121 → 144
• A Java port of the library that specializes LibSVM for linear classification and regression
• Makes a fair effort to keep up with the latest version of the original (C++ version)

Slide 36

Slide 36 text

Weka
• gradle 'nz.ac.waikato.cms.weka:weka-stable:3.6.11'
• An analysis platform that provides a wide variety of machine learning algorithms
• Can also be used as a library
• If you just want to try machine learning for now, using this is not a bad choice

Slide 37

Slide 37 text

Mahout
• gradle 'org.apache.mahout:mahout-core:0.9'
• https://github.com/apache/mahout
• ★ 229 → 507
• A machine learning library on the distributed processing framework Hadoop
• Appears to be saying "Goodbye MapReduce" and moving toward better affinity with Spark and h2o
  • https://issues.apache.org/jira/browse/MAHOUT-1510
• Still, there is a faint sense that its time has passed…

Slide 38

Slide 38 text

SAMOA
• https://github.com/yahoo/samoa
• ★ 363 → 397
• A machine learning library that runs on distributed streaming frameworks such as Storm
• Made by Yahoo! Labs
• Development does not seem very active lately?

Slide 39

Slide 39 text

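Jubatus
• https://github.com/jubatus/jubatus
• ★ 389 → 453
• A distributed processing framework & online machine learning library
• The core is implemented in C++, but a Java client library is provided
• Still under active development; for example, a Bandit algorithm has been implemented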

Slide 40

Slide 40 text

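h2o
• https://github.com/h2oai/h2o
• ★ 1,333 → 1,741
• A machine learning library that can run on the distributed processing framework Hadoop
• If you want to do the much-talked-about Deep learning in Java, this may be the only choice!?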

Slide 41

Slide 41 text

Let's get started with machine learning

Slide 42

Slide 42 text

Exercise: let's try spam detection

Slide 43

Slide 43 text

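Spam detection
• Decide whether a given text is spam or not
• This is a supervised identification / classification task
• Extract term frequency from the text as the features
• This time we use logistic regression (rather than the usual Naive Bayes)
• Tune parameters with grid search while evaluating the model with k-fold cross validation & AUC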

Slide 44

Slide 44 text

As a flow, it looks like this
[Diagram: Text → Feature extraction → Training → Cross validation → Model, with Grid search]

Slide 45

Slide 45 text

bit.ly/javajo-ml

Slide 46

Slide 46 text

Dataset

Slide 47

Slide 47 text

UCI Machine learning repository
• https://archive.ics.uci.edu/ml/datasets.html
• Datasets are provided in formats such as CSV files
• It also includes Iris, the "Hello world" dataset of the data analysis and machine learning world
• This time we use the SMS Spam collection
  • https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Slide 48

Slide 48 text

SMS Spam collection (TSV file)
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

Slide 49

Slide 49 text

SMS Spam collection (TSV file)
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Label: ham → not spam, spam → spam
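Assuming each line of the TSV file is the label, a tab, and then the message text (as the sample above suggests), one record could be parsed in plain Java as sketched below. The class `SmsExample` and the numeric encoding (spam = 1.0, ham = 0.0) are assumptions for illustration; the talk itself loads the file with spark-csv.

```java
/** One labeled SMS message parsed from a "label<TAB>text" line (illustrative sketch). */
final class SmsExample {
    final double label; // 1.0 for spam, 0.0 for ham (encoding assumed for this example)
    final String text;

    private SmsExample(double label, String text) {
        this.label = label;
        this.text = text;
    }

    /** Splits at the first tab only, so tabs inside the message body are preserved. */
    static SmsExample parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected label<TAB>text: " + line);
        }
        String rawLabel = line.substring(0, tab);
        String text = line.substring(tab + 1);
        switch (rawLabel) {
            case "spam":
                return new SmsExample(1.0, text);
            case "ham":
                return new SmsExample(0.0, text);
            default:
                throw new IllegalArgumentException("Unknown label: " + rawLabel);
        }
    }

    public static void main(String[] args) {
        SmsExample example = parse("ham\tOk lar... Joking wif u oni...");
        System.out.println(example.label + " | " + example.text);
    }
}
```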

Slide 50

Slide 50 text

Machine learning environment

Slide 51

Slide 51 text

Spark / MLlib
• Normally you would set up a Spark cluster and use that
• That is a hassle, so this time we make do with a self-contained application
  • https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
• In the spirit of a modern Spark application (?), we will use the ML Pipeline API and the DataFrame API

Slide 52

Slide 52 text

ML Pipeline API
• Expresses a machine learning workflow as a list of its tasks
• Note: feature extraction, training, parameter tuning, validation, etc.

Slide 53

Slide 53 text

ML Pipeline API
• Nothing more than a mechanism that makes the various MLlib algorithms easier to use
• Note that not all MLlib algorithms are available through it
  • "Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml."
  • K-Means and others cannot be used

Slide 54

Slide 54 text

ML Pipeline API
• Still very much under development
  • It seems the best parameters found by parameter tuning cannot be retrieved
  • Models cannot be persisted
• Still tough to use in production

Slide 55

Slide 55 text

DataFrame API
• A dataset that carries schema information
• SQL-like operations can be applied to the dataset
  • select / join / filter / aggregation, and so on
• Compared with RDD, the performance differences between language bindings are small
• For details, see Ishikawa-san's slideshare
  • http://www.slideshare.net/yuishikawa/2015-0312-lt2-spark-dataframe-introduction

Slide 56

Slide 56 text

Loading the data

Slide 57

Slide 57 text

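CSV / TSV to DataFrame
• Use spark-csv to load a CSV / TSV file as a DataFrame
  • https://github.com/databricks/spark-csv
• It seems better to specify the schema explicitly
  • Without a schema every column is treated as a string, so take particular care when the file contains numeric columns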

Slide 58

Slide 58 text

Feature extraction

Slide 59

Slide 59 text

Feature extraction from text data
• Tokenize the text on white space
  • org.apache.spark.ml.feature.Tokenizer
• Count how often each word occurs (TF, term frequency)
  • org.apache.spark.ml.feature.HashingTF
  • Uses the hashing trick
• Prepare a label column and a features column so that the DataFrame can be fed to LogisticRegression
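To show what whitespace tokenization plus the hashing trick amounts to, here is a plain-Java sketch that maps a text to a fixed-length term-frequency vector. It is not Spark's implementation: the class `HashedTermFrequency`, the lowercasing step, and the use of `String.hashCode()` are assumptions for this illustration, so the bucket indices will not necessarily match those produced by `HashingTF`.

```java
import java.util.Arrays;

/** Whitespace tokenization followed by hashed term-frequency counting (illustrative sketch). */
final class HashedTermFrequency {

    /**
     * Maps each token to a bucket in [0, numFeatures) via its hash code and counts occurrences.
     * Different tokens may collide in the same bucket; that is the price of the fixed size.
     */
    static double[] transform(String text, int numFeatures) {
        double[] features = new double[numFeatures];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // produced by leading whitespace
            }
            int index = Math.floorMod(token.hashCode(), numFeatures);
            features[index] += 1.0;
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(transform("Free entry free tkts", 8)));
    }
}
```

A larger `numFeatures` reduces collisions at the cost of a longer (but sparser) vector, which is why it appears later as a tuning parameter.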

Slide 60

Slide 60 text

Logistic regression

Slide 61

Slide 61 text

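Logistic regression
• Supervised machine learning
• It is called "regression", but if anything it is used more often for "classification / identification"?
  • Getting a classification probability is a nice property
• An example of using it for "prediction / regression"
  • Click-through rate prediction for Web ads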
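The "classification probability" comes from passing a linear score through the logistic (sigmoid) function. The sketch below shows only the prediction step for already-learned weights; the class `LogisticModel`, its method names, and the weights used in `main` are hypothetical values for illustration.

```java
/** Prediction step of logistic regression: probability = sigmoid(w . x + b) (illustrative sketch). */
final class LogisticModel {

    /** Logistic function, mapping any real number into (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Probability that the example belongs to the positive class (for example, spam). */
    static double probability(double[] weights, double bias, double[] features) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        double[] weights = {1.5, -0.5}; // made-up weights, not learned from real data
        double p = probability(weights, -0.2, new double[] {2.0, 1.0});
        System.out.println(p >= 0.5 ? "spam" : "ham");
    }
}
```

Thresholding the probability (0.5 here) turns it into a class decision, while using the probability directly suits tasks such as click-through rate prediction.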

Slide 62

Slide 62 text

Logistic regression in Spark / MLlib
• org.apache.spark.ml.classification.LogisticRegression
• The optimizer is LBFGS only (for now?); SGD cannot be used
• Parameters
  • regParam: regularization parameter
  • elasticNetParam: 0.0 gives L2 regularization, 1.0 gives L1 regularization
  • maxIter: number of iterations until convergence
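For reference, the way regParam and elasticNetParam combine can be written as the usual elastic net objective. This is a sketch under the assumption that $\lambda$ stands for regParam, $\alpha$ for elasticNetParam, and $L$ for the per-example logistic loss; these symbols are introduced here and do not appear in the slides.

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i)
  + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right),
\qquad \alpha \in [0, 1], \; \lambda \ge 0
```

Setting $\alpha = 0$ leaves only the squared L2 term and $\alpha = 1$ leaves only the L1 term, which matches the parameter description above; values in between mix the two.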

Slide 63

Slide 63 text

Parameter tuning

Slide 64

Slide 64 text

Parameter tuning with grid search
• org.apache.spark.ml.tuning.ParamGridBuilder
  • List the parameter values to try with #addGrid()
  • Pass the object obtained from #build() to CrossValidator#setEstimatorParamMaps()
• CrossValidator finds the best parameters on its own, based on the metrics obtained from the Evaluator
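Conceptually, grid search just evaluates every combination of candidate values and keeps the best-scoring one. The plain-Java sketch below illustrates that idea for two parameters; it is not the Spark API. `GridSearch`, `Best`, and the scoring function passed in are hypothetical, and in a real setting the scorer would train a model and return a cross-validated metric such as AUC.

```java
import java.util.function.ToDoubleBiFunction;

/** Exhaustive search over a two-parameter grid, keeping the highest score (illustrative sketch). */
final class GridSearch {

    /** The best combination found and its score. */
    static final class Best {
        final int numFeatures;
        final double regParam;
        final double score;

        Best(int numFeatures, double regParam, double score) {
            this.numFeatures = numFeatures;
            this.regParam = regParam;
            this.score = score;
        }
    }

    /** Tries every (numFeatures, regParam) pair; returns null if either grid is empty. */
    static Best search(int[] numFeaturesGrid, double[] regParamGrid,
                       ToDoubleBiFunction<Integer, Double> scorer) {
        Best best = null;
        for (int numFeatures : numFeaturesGrid) {
            for (double regParam : regParamGrid) {
                double score = scorer.applyAsDouble(numFeatures, regParam);
                if (best == null || score > best.score) {
                    best = new Best(numFeatures, regParam, score);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in scorer: a made-up function that peaks at (100, 0.1).
        Best best = search(new int[] {10, 100, 1000}, new double[] {0.01, 0.1, 1.0},
                (nf, rp) -> -Math.abs(nf - 100) - Math.abs(rp - 0.1));
        System.out.println(best.numFeatures + " " + best.regParam);
    }
}
```

The number of evaluations is the product of the grid sizes, and each evaluation is multiplied again by the number of folds when combined with cross validation.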

Slide 65

Slide 65 text

What we tune this time
• Feature extraction
  • numFeatures: number of dimensions after feature hashing
• Logistic regression
  • regParam: regularization parameter
  • maxIter: number of iterations until convergence

Slide 66

Slide 66 text

k-fold cross validation

Slide 67

Slide 67 text

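K-fold cross validation
• Split the dataset into k parts
• Train on k-1 parts and measure the metric on the remaining part; repeat for each part
• Take the mean of the per-fold metrics p1, p2, …, pk: (Σ pk) / k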
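The splitting and averaging steps can be sketched in plain Java as follows. This is an illustration under simplifying assumptions: `KFold` is a hypothetical class, indices are assigned to folds round-robin without shuffling (real implementations usually randomize), and the training step itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Index bookkeeping for k-fold cross validation (illustrative sketch). */
final class KFold {

    /**
     * Assigns sample indices 0..n-1 to k folds round-robin.
     * In round f, fold f is held out for validation and the other k-1 folds are used for training.
     */
    static List<List<Integer>> folds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("Need 2 <= k <= n");
        }
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    /** Mean of the per-fold metrics p1..pk. */
    static double mean(double[] metrics) {
        double sum = 0.0;
        for (double metric : metrics) {
            sum += metric;
        }
        return sum / metrics.length;
    }

    public static void main(String[] args) {
        System.out.println(folds(10, 3));
        System.out.println(mean(new double[] {0.9, 0.8, 0.7}));
    }
}
```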

Slide 68

Slide 68 text

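AUC (Area under the curve)
• Figure quoted from Prof. Okumura's "ROC curve" page: https://oku.edu.mie-u.ac.jp/~okumura/stat/ROC.html
• The area under the curve is the AUC
• The larger the area (the closer to 1), the better the accuracy
• True positive: the probability that spam is correctly detected
• False positive: the probability that a message is wrongly judged to be spam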
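One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below does this by brute force over all pairs; `Auc` is a hypothetical class, and the quadratic loop is chosen for clarity rather than speed.

```java
/** AUC computed by comparing every positive/negative pair of scores (illustrative sketch). */
final class Auc {

    /** Returns the fraction of positive/negative pairs ranked correctly, counting ties as 0.5. */
    static double auc(double[] scores, boolean[] isPositive) {
        if (scores.length != isPositive.length) {
            throw new IllegalArgumentException("Length mismatch");
        }
        long pairs = 0;
        double wins = 0.0;
        for (int i = 0; i < scores.length; i++) {
            if (!isPositive[i]) {
                continue;
            }
            for (int j = 0; j < scores.length; j++) {
                if (isPositive[j]) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("Need at least one positive and one negative example");
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.3, 0.1};
        boolean[] spam = {true, false, true, false};
        System.out.println(auc(scores, spam));
    }
}
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of spam above non-spam.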

Slide 69

Slide 69 text

Cross validation & metrics
• Cross validation
  • org.apache.spark.ml.tuning.CrossValidator
  • #fit() returns the model trained on the given training data, together with the best parameters
• Metric for parameter selection
  • org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  • Calculates the AUC for you

Slide 70

Slide 70 text

Demo

Slide 71

Slide 71 text

Thank you very much!