Slide 1

Slide 1 text

Python and Data Analysis Competitions in Practice
Kohei Ozaki (@smly)
FIT tutorial slides

Slide 2

Slide 2 text

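About this tutorial
This tutorial uses data analysis competitions as its subject and introduces examples of using Python packages.
▶ Part 1: The predictive-modeling process and related packages
– An overview of the process of building a predictive model and the packages involved
– Pandas, Scikit-Learn
▶ Part 2: Data analysis competitions in practice
– Concrete usage examples based on a data analysis competition
– Pandas, seaborn, XGBoost, PyTables
▶ Part 3: Advanced topics
– Usage examples specialized by purpose (large-scale data, image data)
– Pandas, Redshift, BigQuery, OpenCV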

Slide 3

Slide 3 text

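Supplement: handouts and reproducibility with Docker
The code used in these slides is published as Jupyter notebooks at the links below.
▶ Slides: http://goo.gl/MeMZyO
▶ Part […] materials: http://goo.gl/[…] (legible characters of the link code: yZmKr)
▶ Part […] materials: http://goo.gl/[…] (legible characters of the link code: ATCyv)
The Jupyter notebooks used in these materials run in an environment built from a published Docker image. The same environment can be tried with the following command.
  $ docker run --rm -ti \
      -p 8888:8888 \
      -v /path/to/data_directory:/mnt \
      -v /path/to/working_directory:/home/kohei/work \
      smly/notebook:0.4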

Slide 4

Slide 4 text

PART 1: The predictive-modeling process and related packages

Slide 5

Slide 5 text

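The general process of building a predictive model
▶ In contests such as Kaggle and KDD Cup, participants are ranked by their score under a fixed evaluation metric on a given dataset and task.
▶ The following slides introduce modules that are useful at each stage of the process.
Stages: task design, data creation, preprocessing / feature extraction, building the predictive model, evaluating the predictive model, integration into a system
Packages shown in the diagram: pandas, psycopg2, sklearn.feature_extraction, sklearn.preprocessing, pytables, seaborn, sklearn.linear_model, sklearn.svm, sklearn.ensemble, xgboost, statsmodels, sklearn.metrics, sklearn.cross_validation, ml_metrics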

Slide 6

Slide 6 text

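pandas: data manipulation and I/O with data frames
▶ A data structure for manipulating data, similar to NumPy's ndarray
▶ Has columns and an index; data can be manipulated through named columns
▶ Each column can have a different type
▶ A mutable object: rows can be inserted, columns deleted, and so on
▶ Rich data manipulation functions and I/O interfaces
Stage: preprocessing / feature extraction
[Figure: internal layout of an ndarray (dimensions, dim count, dtype, strides, data) next to the internal layout of a DataFrame (an index plus columns of different types: float, string, integer)]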

Slide 7

Slide 7 text

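Encoding features
Models that cannot handle categorical variables directly, such as linear models, need categorical variables to be represented as numerical variables.
scikit-learn provides a variety of feature encoding methods and implementations.
▶ sklearn.preprocessing.OneHotEncoder
▶ sklearn.preprocessing.LabelEncoder
▶ sklearn.feature_extraction.DictVectorizer
▶ sklearn.feature_extraction.FeatureHasher
▶ sklearn.feature_extraction.text.TfidfVectorizer
▶ …
The following slides introduce representative classes.
Stage: preprocessing / feature extraction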

Slide 8

Slide 8 text

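OneHotEncoder
Encodes a categorical variable in 1-of-K notation.
If the cardinality of a categorical variable is K, the feature is treated as K columns. Each column corresponds to one specific value of the categorical variable; the corresponding column is set to 1 and all other columns to 0.
(The pd.get_dummies() function makes the equivalent operation possible in pandas.)
Stage: preprocessing / feature extraction
Example (K=5; the output is an ndarray or csr_matrix):
     WeekDay     encoded row
  0  Friday      1 0 0 0 0
  1  Monday      0 1 0 0 0
  2  Monday      0 1 0 0 0
  3  Sunday      0 0 1 0 0
  4  Tuesday     0 0 0 1 0
  5  Saturday    0 0 0 0 1
  6  Monday      0 1 0 0 0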
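The pandas route mentioned on this slide can be sketched as follows. The toy column reproduces the WeekDay example; note that pd.get_dummies() orders the output columns by sorted value, whereas the diagram above lists them in order of first appearance.

```python
import pandas as pd

# Toy column matching the WeekDay example on the slide.
weekday = pd.Series(
    ["Friday", "Monday", "Monday", "Sunday", "Tuesday", "Saturday", "Monday"],
    name="WeekDay",
)

# One column per distinct value (K = 5); exactly one 1 in every row.
onehot = pd.get_dummies(weekday).astype(int)
print(onehot)
```

Passing sparse=True to pd.get_dummies() is one way to keep memory use down when K is large.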

Slide 9

Slide 9 text

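LabelEncoder
Encodes a categorical variable as integer values.
If the cardinality of the categorical variable is K, its values are replaced by integers in [0, K-1].
(The pd.factorize() function makes the equivalent operation possible in pandas.)
Stage: preprocessing / feature extraction
Example (the output is an ndarray):
     WeekDay     code
  0  Friday      0
  1  Monday      1
  2  Monday      1
  3  Sunday      2
  4  Tuesday     3
  5  Saturday    4
  6  Monday      1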
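A minimal sketch of the pandas equivalent. One detail worth knowing: pd.factorize() numbers values in order of first appearance (which is what the diagram shows), while LabelEncoder numbers its classes in sorted order; pd.factorize(sort=True) is used below to show that second ordering without importing scikit-learn.

```python
import pandas as pd

weekday = pd.Series(
    ["Friday", "Monday", "Monday", "Sunday", "Tuesday", "Saturday", "Monday"]
)

# Codes in order of first appearance: Friday=0, Monday=1, Sunday=2, ...
codes, uniques = pd.factorize(weekday)

# Codes in sorted order of the values: Friday=0, Monday=1, Saturday=2, ...
sorted_codes, sorted_uniques = pd.factorize(weekday, sort=True)

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```

The second return value (uniques) is needed later to map integer codes back to the original labels.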

Slide 10

Slide 10 text

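DictVectorizer
Converts a list of dict objects into a SciPy sparse matrix object or an ndarray object.
Each dict key corresponds one-to-one to a specific column of the matrix, and each value becomes the feature value. (The equivalent operation is possible in pandas by passing the list to the DataFrame constructor and zero-filling and converting it with fillna(0).as_matrix(); in current pandas, to_numpy() replaces as_matrix().)
Stage: preprocessing / feature extraction
Example (list of dict → ndarray):
  [
    {'like': 1, 'rt': 9},
    {'like': 2, 'rt': 2},
    {'like': 4},
  ]
  →
  1  9
  2  2
  4  0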
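The pandas variant described in the parenthetical can be sketched like this, using the same three dicts as the diagram. It uses to_numpy(), on the assumption that a current pandas version is installed (the slide's as_matrix() belongs to older releases).

```python
import pandas as pd

records = [
    {"like": 1, "rt": 9},
    {"like": 2, "rt": 2},
    {"like": 4},          # 'rt' is missing here
]

df = pd.DataFrame(records)

# Fix the column order explicitly, fill missing keys with zero, convert to ndarray.
X = df[["like", "rt"]].fillna(0).to_numpy()
print(X)
```

Unlike DictVectorizer, this always produces a dense array, so it suits data with a modest number of distinct keys.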

Slide 11

Slide 11 text

Building predictive models with scikit-learn and XGBoost
scikit-learn models share a common interface: fit() fits the model to data, predict() makes predictions, and predict_proba() returns predicted probabilities (for classification).
  >>> from sklearn.linear_model import LogisticRegression
  >>> clf = LogisticRegression()
  >>> clf.fit(X_train, y_train)
  >>> y_pred = clf.predict_proba(X_test)
XGBoost has a different interface from scikit-learn, but offers features such as per-iteration evaluation, early stopping, and customizable objective functions.
XGBoost handles data as DMatrix, an internal data structure optimized for memory efficiency and training speed. A DMatrix object can be created from an ndarray.
  >>> import xgboost as xgb
  >>> dtrain = xgb.DMatrix(X_train, label=y_train)
  >>> dtest = xgb.DMatrix(X_test)
  >>> watchlist = [(dtrain, 'train')]
  >>> gbtree = xgb.train(params, dtrain, evals=watchlist)
  >>> y_pred = gbtree.predict(dtest)
Stage: building the predictive model
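To make the shared fit() / predict() / predict_proba() interface concrete, here is a small self-contained sketch on synthetic data. The data, the split sizes, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test = X[:150], X[150:]
y_train = y[:150]

clf = LogisticRegression()
clf.fit(X_train, y_train)              # fit the model to the training data

y_label = clf.predict(X_test)          # hard class labels, shape (50,)
y_proba = clf.predict_proba(X_test)    # class probabilities, shape (50, 2)
print(y_label[:5], y_proba[:2])
```

Each row of y_proba sums to 1, and its columns follow the order of clf.classes_.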

Slide 12

Slide 12 text

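Evaluating model performance
The sklearn.cross_validation module provides helper functions and iterators for evaluating the performance of a predictive model.
Stage: evaluating the predictive model
  >>> from sklearn import cross_validation
  >>> clf = LogisticRegression()
  >>> scores = cross_validation.cross_val_score(clf, X, y, cv=5)
  >>> scores   # accuracy of each fold
  array([ 0.92..., 1. ..., 0.92..., 1. ])
If the predictive model is a class that inherits from sklearn.base.ClassifierMixin, cross-validation can be written concisely with the helper function.
The iterator returns the index lists of the training data and the test data for each validation split.
  >>> from sklearn.cross_validation import KFold
  >>> kf = KFold(n_samples, n_folds=5)
  >>> for idx_train, idx_test in kf:
  ...     y_train, y_test = y[idx_train], y[idx_test]
  ...     X_train, X_test = X[idx_train], X[idx_test]
  ...     clf.fit(X_train, y_train)
  ...     y_pred = clf.predict(X_test)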
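In scikit-learn releases after the one used in these slides, the same helpers live in sklearn.model_selection and KFold takes n_splits and yields indices through split(). The following sketch shows that newer spelling on synthetic data; the data and the shuffle settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression()

# Helper function: one accuracy value per fold.
scores = cross_val_score(clf, X, y, cv=5)

# Iterator: explicit index lists for each split.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracy = []
for idx_train, idx_test in kf.split(X):
    clf.fit(X[idx_train], y[idx_train])
    y_pred = clf.predict(X[idx_test])
    fold_accuracy.append(float((y_pred == y[idx_test]).mean()))

print(scores, fold_accuracy)
```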

Slide 13

Slide 13 text

PART 2: Data analysis competitions in practice

Slide 14

Slide 14 text

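Checking the data
▶ The "Walmart Recruiting: Trip Type Classification" contest is used as the subject.
▶ The task is multiclass classification of shoppers based on the scan history of the items in their shopping trips.
▶ Three files are provided: train.csv, test.csv, and sampleSubmission.csv.
The training data, train.csv, is the item scan history of shopping trips.
  Field name             | Description                                                    | Train | Test
  TripType               | Target variable: categorical ID of the shopper                 | ✔     | ✗
  VisitNumber            | Instance ID: the ID of a single customer's shopping trip       | ✔     | ✔
  Weekday                | Day of the week of the shopping trip                           | ✔     | ✔
  Upc                    | UPC number of the purchased item                               | ✔     | ✔
  ScanCount              | Number of items purchased (negative values are returned items) | ✔     | ✔
  DepartmentDescription  | Category of the purchased item                                 | ✔     | ✔
  Fineline               | ID of a finer-grained category of the purchased item           | ✔     | ✔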

Slide 15

Slide 15 text

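Checking the data
▶ The training data gives, for each shopper (VisitNumber), the shopping basket history and the category (TripType) assigned to that visit.
▶ The test data does not include the category. The task is to predict the category of the test data based on the training data.
[Screenshot: in Jupyter notebook the data frame is displayed as an easy-to-read table]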

Slide 16

Slide 16 text

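Checking the data
▶ The submission format for predictions is a table whose columns are the visit number and each category. For every visit number, the answer is the probability of belonging to each category.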

Slide 17

Slide 17 text

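Visualizing data with seaborn
seaborn is a visualization tool with a high-level interface. It can take a data frame as input and display statistical graphs.
The plot confirms that the target variable TripType is unevenly distributed across its values.
[Figure: count of each TripType value. Records are dropped beforehand so that VisitNumber is unique.]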
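The preparation behind this plot (deduplicating by VisitNumber, then counting per TripType) can be sketched in pandas. The toy values below are assumptions; only the column names come from the slides.

```python
import pandas as pd

# Toy scan history: several rows per visit (illustrative values).
df_train = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 8, 8, 8, 9],
    "TripType":    [30, 30, 26, 8, 8, 8, 30],
})

# Keep one record per visit so that every visit is counted once.
visits = df_train.drop_duplicates(subset="VisitNumber")

# Number of visits per TripType value.
counts = visits["TripType"].value_counts().sort_index()
print(counts)
```

With seaborn installed, a call along the lines of sns.countplot(x="TripType", data=visits) would draw the same counts as a bar chart.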

Slide 18

Slide 18 text

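Visualizing data with seaborn
▶ The plots show that the training and test examples were sampled so that their distribution over time (day of the week) is similar.
▶ Given these distributions, it is reasonable to expect that using the day of the week as a feature will not cause the model to overfit the training examples.
[Figure: counts per Weekday value; the training data is the upper plot and the test data is the lower plot]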

Slide 19

Slide 19 text

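Preparation: defining the evaluation metric
Implementations of evaluation metrics exist in the sklearn.metrics module, the ml_metrics package, and elsewhere. Normally these packages are all that is needed.
▶ The sklearn.metrics.log_loss() function raises an error when the number of classes in the training examples and the number of classes in the test examples do not match. In multiclass classification, a rare class may simply not appear among the test examples.
▶ That is the case here, so the metric is defined without using a package.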
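A hand-written multiclass log loss along these lines might look like the sketch below. The function name and the clipping constant are assumptions, not the notebook's actual implementation; the point is that the number of classes is taken from the probability matrix, so a class missing from the evaluated fold causes no error.

```python
import numpy as np

def multiclass_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood.

    y_true: integer class codes in [0, K-1], shape (n,)
    y_prob: predicted probabilities, shape (n, K)
    """
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)   # renormalize rows
    y_true = np.asarray(y_true, dtype=int)
    picked = y_prob[np.arange(len(y_true)), y_true]       # p(true class) per row
    return float(-np.mean(np.log(picked)))

K = 4
y_true = np.array([0, 1, 1, 3])                  # class 2 never appears
uniform = np.full((len(y_true), K), 1.0 / K)     # equal probability everywhere
baseline = multiclass_logloss(y_true, uniform)
print(baseline)
```

Answering an equal probability for all K classes always scores ln K under this metric, which gives a natural baseline to beat.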

Slide 20

Slide 20 text

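Preparation: setting up a method for model evaluation
▶ This time k-fold cross-validation (sklearn.cross_validation.KFold) is used.
▶ The baseline that answers an equal probability for every category scores […].
[Annotation: scoring uses the multilogloss defined earlier]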

Slide 21

Slide 21 text

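Preparation: compressing features into intermediate files with PyTables
▶ PyTables is a package that provides an interface to the HDF5 format.
▶ blosc is recommended for compressing the data. blosc supports SIMD instructions and allows fast serialization and deserialization.
※ For serializing and deserializing data frames, the pandas HDFStore class (which uses PyTables) is useful.
[Annotations on the code screenshot: blosc is specified as the compression method; pd.factorize() encodes the target variable as zero-based integers]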

Slide 22

Slide 22 text

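Creating features
Example: "In a given shopping trip, how many items of each product category were scanned?"
The pd.pivot_table() function aggregates the transactions.
  >>> df_long = pd.concat([
  ...     df_train,
  ...     df_test,
  ... ]).fillna("_NA_")
  >>> df = pd.pivot_table(
  ...     df_long,
  ...     index="VisitNumber",
  ...     columns=["DepartmentDescription"],
  ...     values=["ScanCount"],
  ...     aggfunc=[np.sum],
  ... )['sum']['ScanCount']
[Figure: pd.pivot_table(df_long, index="VisitNumber", columns=..., values=..., aggfunc=...) turns the long format, in which column/value pairs are stacked vertically, into the wide format, in which the aggregated column/value pairs are laid out horizontally]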
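The long-to-wide aggregation can be tried on a toy table. The department names and counts below are made up for illustration; the column names follow the slides.

```python
import pandas as pd

# Long format: one row per scanned item line (illustrative values).
df_long = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 7, 7],
    "DepartmentDescription": ["DAIRY", "DAIRY", "DAIRY", "SHOES", "SHOES"],
    "ScanCount": [1, 2, 1, 1, -1],
})

# Wide format: one row per visit, one column per department, summed ScanCount.
df_wide = pd.pivot_table(
    df_long,
    index="VisitNumber",
    columns="DepartmentDescription",
    values="ScanCount",
    aggfunc="sum",
)
print(df_wide)
```

Visit/department combinations that never occur come out as NaN, which is why the later step fills them with 0 before converting to an ndarray.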

Slide 23

Slide 23 text

Deciding the storage format for the features
▶ Here the feature matrix is stored as an ndarray in which the first […] rows are the training examples and the following […] rows are the test examples.
▶ The index of df is VisitNumber. The rows are reordered with the loc method so that the training block and the test block are each in VisitNumber order.
  >>> visit_order = df_train.VisitNumber.drop_duplicates().append(
  ...     df_test.VisitNumber.drop_duplicates())
  >>> df = df.loc[visit_order]
  >>> X = df.fillna(0).as_matrix()
  >>> X.shape
  (191348, 69)
[Figure: the training-example matrix (ndarray object, order by "VisitNumber") stacked on top of the test-example matrix (ndarray object, order by "VisitNumber"); VisitNumber is the index]
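For readers on a current pandas, where Series.append() and as_matrix() are no longer available, the same reordering can be sketched with pd.concat() and to_numpy(). The toy frames and the explicit sort_values() call are assumptions added for illustration.

```python
import pandas as pd

# Toy inputs: raw scan rows for train/test and a wide feature table.
df_train = pd.DataFrame({"VisitNumber": [7, 7, 5]})
df_test = pd.DataFrame({"VisitNumber": [6, 4, 4]})
df = pd.DataFrame(
    {"DAIRY": [1.0, None, 2.0, 3.0]},
    index=pd.Index([4, 5, 6, 7], name="VisitNumber"),
)

# Training visits first, then test visits, each block sorted by VisitNumber.
visit_order = pd.concat([
    df_train["VisitNumber"].drop_duplicates().sort_values(),
    df_test["VisitNumber"].drop_duplicates().sort_values(),
])

X = df.loc[visit_order.tolist()].fillna(0).to_numpy()
n_train = df_train["VisitNumber"].nunique()   # rows [0, n_train) are training rows
print(X, n_train)
```

Slicing X[:n_train] and X[n_train:] then recovers the training and test matrices.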

Slide 24

Slide 24 text

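Building a predictive model and evaluating it by cross-validation (scikit-learn)
A scikit-learn logistic regression model is built from the features created so far.
Evaluating the model by cross-validation gave a score of […].
[Annotation: cross-validation runs over the n_samples training examples only; the test data is not used]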

Slide 25

Slide 25 text

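The cross_val_score() function: a limitation and a workaround
[Limitation] The current cross_val_score() helper function cannot call the predict_proba() method; it calls the predict() method.
In the scikit-learn development version (master HEAD), the sklearn.model_selection.cross_val_score() helper function has been reworked so that the predict_proba() method can be called by specifying it as an option.
→ [Workaround] Define a subclass whose predict() method calls predict_proba(); the helper can then be used as is.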
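The workaround can be sketched as a thin subclass; the class name is hypothetical. For comparison, the sketch also uses cross_val_predict() with method="predict_proba", which is how later scikit-learn releases expose out-of-fold probabilities as far as I am aware. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

class ProbaLogisticRegression(LogisticRegression):
    """Hypothetical wrapper: predict() returns class probabilities."""

    def predict(self, X):
        return self.predict_proba(X)

rng = np.random.RandomState(0)
X = rng.randn(120, 3)
y = rng.randint(0, 3, size=120)

# Workaround: anything that calls predict() now receives probabilities.
clf = ProbaLogisticRegression()
clf.fit(X, y)
proba_via_predict = clf.predict(X)

# Newer API: ask for predict_proba explicitly.
proba_cv = cross_val_predict(
    LogisticRegression(), X, y, cv=3, method="predict_proba"
)
print(proba_via_predict.shape, proba_cv.shape)
```

A scorer that expects class labels will not work with the wrapper, so it should be paired with a probability-based metric such as log loss.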

Slide 26

Slide 26 text

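Building a predictive model and evaluating it by cross-validation (XGBoost)
The model is replaced with the boosting model implemented in XGBoost and cross-validated again. Score: […] → […]
[Annotations on the code screenshot: a DMatrix object can be defined together with its corresponding target variable; the datasets registered in the watchlist are evaluated each time a regression tree is added, and the result is printed to the terminal]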

Slide 27

Slide 27 text

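Submitting the predictions to Kaggle
A data frame is built from the predictions and submitted to Kaggle.
[Annotations on the code screenshot: the second return value of pd.factorize() gives the encoding order of the target variable; to_csv() writes the CSV file]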

Slide 28

Slide 28 text

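Submitting the predictions to Kaggle
A data frame is built from the predictions and submitted to Kaggle.
[Annotations on the code screenshot: the second return value of pd.factorize() gives the encoding order of the target variable; to_csv() writes the CSV file]
Evaluation on the holdout set starts immediately, and the rank within the contest is shown.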

Slide 29

Slide 29 text

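Adding more features
▶ Day of the week of the shopping trip → pd.factorize()
▶ Number of transaction records, number of transaction records for returned items, and number of unique DepartmentDescription values per shopping trip → groupby()
[Annotation: the aggregation functions can be customized]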
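The three per-visit aggregates listed above can be sketched with groupby(). The toy rows and the output column names (n_records, n_returns, n_departments) are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "VisitNumber": [5, 5, 5, 7, 7],
    "ScanCount": [1, -1, 2, 1, 1],
    "DepartmentDescription": ["DAIRY", "DAIRY", "SHOES", "SHOES", "SHOES"],
})

grouped = df.groupby("VisitNumber")

features = pd.DataFrame({
    # number of transaction records in the visit
    "n_records": grouped.size(),
    # number of records with a negative ScanCount (returned items),
    # computed with a custom aggregation function
    "n_returns": grouped["ScanCount"].agg(lambda s: int((s < 0).sum())),
    # number of distinct departments in the visit
    "n_departments": grouped["DepartmentDescription"].nunique(),
})
print(features)
```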

Slide 30

Slide 30 text

Cross-validating the model with more features
The feature matrices saved in the intermediate files are combined with the np.hstack() function into a single ndarray.
The predictive model reuses the parameters defined earlier as they are (→ CV score: […]).
[Figure: the one-hot department matrix and the additional feature columns are joined side by side with np.hstack()]

Slide 31

Slide 31 text

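Cross-validating the model with more features
The feature matrices saved in the intermediate files are joined column-wise with the np.hstack() function into a single ndarray.
The predictive model reuses the parameters defined earlier as they are (→ CV score: […]).
[Figure: the one-hot department matrix and the additional feature columns are joined side by side with np.hstack()]
→ The result improved on the previous predictive model and the rank went up.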
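Column-wise joining with np.hstack() requires the blocks to have the same number of rows in the same row order. A minimal sketch with two made-up blocks:

```python
import numpy as np

# Block 1: department features (illustrative values), 3 visits x 3 columns.
X_dept = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Block 2: extra per-visit features, 3 visits x 2 columns.
X_extra = np.array([
    [2, 3],
    [1, 1],
    [1, 2],
])

# Same rows, columns placed side by side: result is 3 x 5.
X = np.hstack([X_dept, X_extra])
print(X)
```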

Slide 32

Slide 32 text

PART 3: Advanced topics

Slide 33

Slide 33 text

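Advanced topics: topics specific to the kind of data
Packages that are useful for particular kinds of data are introduced.
▶ Large-scale data ※ A relational database of […] GB in total is assumed here.
– Redshift
– BigQuery
– Benchmark (Redshift vs BigQuery)
▶ Image data
– Sampling feature points
– Projective transformation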

Slide 34

Slide 34 text

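Creating a pandas data frame from a SQL query
pandas supports SQL as one of its I/O interfaces. With the pandas.read_sql() function, the result of a query can be received as a data frame.
▶ When working with a complex RDB, aggregation and similar processing can be written concisely in SQL.
▶ When working with large-scale data, the processing can run on the SQL server's compute resources without loading all of the data into the local environment.
※ Image source: https://www.kaggle.com/c/cervical-cancer-screening/data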
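The read_sql() round trip can be tried without any server by using an in-memory SQLite database. The table contents are made up for illustration; the same call pattern applies to other DB-API connections.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train (VisitNumber INTEGER, TripType INTEGER, ScanCount INTEGER)"
)
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [(5, 30, 1), (5, 30, 2), (7, 26, 1), (8, 30, -1)],
)

# The aggregation runs inside the database; only the small result is returned.
query = """
SELECT TripType, COUNT(DISTINCT VisitNumber) AS n_visits
FROM train
GROUP BY TripType
ORDER BY TripType
"""
df = pd.read_sql(query, conn)
conn.close()
print(df)
```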

Slide 35

Slide 35 text

Processing large-scale data with cloud resources (Redshift)
Redshift runs on a cluster of dense storage nodes or dense compute nodes launched on AWS.
  ### 1. Upload the data to Amazon S3
  $ aws s3 sync data/input s3://kaggle-kohei/walmart_triptype/input
  ### 2. Create the tables and load the data from S3 into Redshift
  $ psql < schema.sql
  ### 3. Issue a SQL query and create a pandas data frame
  >>> import os
  >>> import psycopg2 as pg
  >>> import pandas as pd
  >>> # Connection settings of the running Redshift cluster
  >>> conn_string = ' '.join([
  ...     "dbname='dwh'",
  ...     "port='5439'",
  ...     "user='kohei_ozaki'",
  ...     "password='{}'".format(os.environ['REDSHIFT_PWD']),
  ...     "host='{}'".format(os.environ['REDSHIFT_HOST']),
  ... ])
  >>> conn = pg.connect(conn_string)
  >>> # Create a data frame from SQL
  >>> pd.read_sql("SELECT * FROM train WHERE TripType = 999", conn)

Slide 36

Slide 36 text

Processing large-scale data with cloud resources (BigQuery)
  ### 1. Upload the data to Cloud Storage
  $ gsutil rsync -r data/input gs://kaggle-kohei.appspot.com/walmart_triptype
  ### 2. Import the data as a BigQuery table (train.json defines the table schema in JSON format)
  $ bq load --skip_leading_rows 1 \
      kaggle-kohei:walmart_triptype.train \
      gs://kaggle-kohei.appspot.com/walmart_triptype/train.csv \
      train.json
  ### 3. Issue a query from pandas and receive the result as a data frame
  >>> df = pd.read_gbq("""
  ... SELECT t1.trip_type, COUNT(1) AS n_visitors
  ... FROM (
  ...   SELECT FIRST(trip_type) AS trip_type
  ...   FROM walmart_triptype.train
  ...   GROUP BY visit_number
  ... ) t1
  ... GROUP BY t1.trip_type""", "kaggle-kohei")
pandas can issue queries to BigQuery through the pd.read_gbq() function.
Because BigQuery is a fully managed service, it has the advantage that there is no cluster to think about.

Slide 37

Slide 37 text

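Benchmarking Redshift and BigQuery (off topic)
▶ Table data from a […] GB CSV file ([…] GB when gzip-compressed; […] hundred million rows) was loaded, and the execution time of a query with aggregate functions was compared.
▶ Measured systems: BigQuery and three Redshift cluster sizes (Redshift x […], Redshift x […], Redshift x […]).
With these settings, Redshift performance could be raised by adding more nodes.
※ Redshift ([…] nodes, on-demand, dc[…].large instances) costs […] USD/hour. The BigQuery query cost was under […] yen.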

Slide 38

Slide 38 text

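Advice for using cloud resources (off topic)
▶ Loading the data takes time, but it is cost-effective when there is a strong requirement to run ad hoc queries many times and to keep the turnaround time of each ad hoc query short.
▶ BigQuery cannot batch-load a huge compressed file in one go. In that case the file can be split by line, compressed, and uploaded to Cloud Storage.
  $ zcat prescription_head.csv.gz | \
      split -d -C 1G --filter='gzip > $FILE.gz' - prescription_head.csv.part
[Figure: prescription_head.csv.gz is split with zcat & split into prescription_head.csv.part01.gz, prescription_head.csv.part02.gz, and prescription_head.csv.part03.gz, which are then uploaded with gsutil rsync]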
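Where GNU split is not available, the split-and-recompress step can be approximated in Python. This sketch splits by a fixed number of lines per part rather than by size as the command above does, and the function name, file names, and part size are assumptions for illustration.

```python
import gzip
import os
import tempfile

def split_gzip_by_lines(src_path, prefix, lines_per_part):
    """Split a gzip text file into gzip parts of at most lines_per_part lines."""
    part_paths = []
    out = None
    with gzip.open(src_path, "rt") as src:
        for i, line in enumerate(src):
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                path = "{}{:02d}.gz".format(prefix, len(part_paths))
                out = gzip.open(path, "wt")
                part_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return part_paths

# Usage on a small temporary file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.csv.gz")
with gzip.open(src, "wt") as f:
    for i in range(10):
        f.write("{},x\n".format(i))

parts = split_gzip_by_lines(src, os.path.join(workdir, "sample.csv.part"), 4)
print(parts)
```

Lines are never cut in half, so each part remains a valid CSV fragment; a header row, if present, would need to be handled separately.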

Slide 39

Slide 39 text

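Processing image data with Python
Packages for image processing in Python include scikit-image and the Python bindings of OpenCV. Both treat an image as a matrix held in an ndarray object.
▶ Transformations (rotation, scaling, blurring)
▶ Drawing
▶ Histogram equalization
▶ Image segmentation
▶ Edge extraction, computing image feature points
▶ Sampling and matching image feature points
▶ etc.
[Figure: example of a BGR-channel image representation, with planes im[:, :, 2], im[:, :, 1], im[:, :, 0]]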

Slide 40

Slide 40 text

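Computing image feature points with OpenCV
The cv2.BRISK_create() function is the OpenCV implementation of the BRISK algorithm. It creates a detector object for computing BRISK features.
Data from the DRAPER Satellite Image Chronology contest is used as the example.
https://www.kaggle.com/c/draper-satellite-image-chronology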

Slide 41

Slide 41 text

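Matching feature points and applying a projective transformation
Feature points are matched with a matcher created by the cv2.DescriptorMatcher_create() function.
Using the matching result, differences in shooting angle and scale are corrected with a projective transformation.
▶ cv2.findHomography() computes the homography matrix from the matching result.
▶ cv2.perspectiveTransform() applies the projective transformation based on the homography matrix.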
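What applying a homography to points means can be shown in plain NumPy: append a 1 to each point, multiply by the 3x3 matrix, and divide by the last coordinate. The matrix below is written by hand purely for illustration; in the workflow above it would come from cv2.findHomography(), and the helper name is hypothetical.

```python
import numpy as np

def apply_homography(H, points):
    """Map 2-D points through a 3x3 homography matrix H."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y) -> (x, y, 1)
    mapped = homog @ H.T                               # apply H to each point
    return mapped[:, :2] / mapped[:, 2:3]              # divide by the last coordinate

# Hand-written example matrix (assumption, not estimated from real matches).
H = np.array([
    [2.0, 0.0, 10.0],
    [0.0, 2.0, -5.0],
    [0.5, 0.0, 1.0],
])

points = [(0, 0), (2, 0), (2, 4)]
warped = apply_homography(H, points)
print(warped)
```

The non-zero entry in the bottom row is what makes the mapping projective rather than affine: points with larger x are scaled down more strongly.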

Slide 42

Slide 42 text

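Visualizing image differences
The projective transformation aligns images taken at different shooting angles and scales.
Comparing the binarized images pixel by pixel makes the differences visible.
[Figure legend: pixels that appear only in image A; pixels that appear only in image B]
Ref: Tutorial: visualizing difference between three pictures
https://www.kaggle.com/c/draper-satellite-image-chronology/forums/t/…/tutorial-visualizing-difference-between-three-pictures
※ For the code and details, see the Jupyter notebook in the materials.
※ Computing feature points and similar operations require the OpenCV extension modules to be installed.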
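The pixel-wise comparison of two binarized images can be sketched with boolean masks in NumPy. The tiny arrays and the threshold of 128 are assumptions for illustration; real inputs would be the aligned grayscale images.

```python
import numpy as np

# Two aligned grayscale "images" (illustrative 2x2 arrays).
a = np.array([[200, 10], [190, 250]], dtype=np.uint8)
b = np.array([[210, 180], [20, 240]], dtype=np.uint8)

threshold = 128                 # assumed binarization threshold
bin_a = a >= threshold
bin_b = b >= threshold

only_a = bin_a & ~bin_b         # bright in A but not in B
only_b = bin_b & ~bin_a         # bright in B but not in A
print(only_a)
print(only_b)
```

Painting only_a and only_b in two different colors over the original image gives the kind of difference overlay described on this slide.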

Slide 43

Slide 43 text

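About this tutorial (reprise)
This tutorial uses data analysis competitions as its subject and introduces examples of using Python packages.
▶ Part 1: The predictive-modeling process and related packages
– An overview of the process of building a predictive model and the packages involved
– Pandas, Scikit-Learn
▶ Part 2: Data analysis competitions in practice
– Concrete usage examples based on a data analysis competition
– Pandas, seaborn, XGBoost, PyTables
▶ Part 3: Advanced topics
– Usage examples specialized by purpose (large-scale data, image data)
– Pandas, Redshift, BigQuery, OpenCV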

Slide 44

Slide 44 text

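Supplement: handouts and reproducibility with Docker (reprise)
The code used in these slides is published as Jupyter notebooks at the links below.
▶ Slides: http://goo.gl/MeMZyO
▶ Part […] materials: http://goo.gl/[…] (legible characters of the link code: yZmKr)
▶ Part […] materials: http://goo.gl/[…] (legible characters of the link code: ATCyv)
The Jupyter notebooks used in these materials run in an environment built from a published Docker image. The same environment can be tried with the following command.
  $ docker run --rm -ti \
      -p 8888:8888 \
      -v /path/to/data_directory:/mnt \
      -v /path/to/working_directory:/home/kohei/work \
      smly/notebook:0.4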