Python and Data Analysis Competitions in Practice

@smly
September 08, 2016

FIT 2016 tutorial materials
Slides: http://goo.gl/MeMZyO
Part 2 materials: http://goo.gl/y6ZmKr
Part 3 materials: http://goo.gl/ATC9yv


Transcript

  1. Python and Data Analysis Competitions in Practice. Kohei Ozaki (@smly). FIT 2016 tutorial slides.

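  2. About this tutorial
    Using data analysis competitions as the subject, this tutorial introduces examples of using Python packages.
    ▶ Part 1: The predictive modeling process and related packages (… min)
      – An overview of the process of building a predictive model and the packages involved
      – Pandas, Scikit-Learn
    ▶ Part 2: Data analysis competitions in practice (… min)
      – Concrete usage examples, with a data analysis competition as the subject
      – Pandas, seaborn, XGBoost, PyTables
    ▶ Part 3: Advanced topics (… min)
      – Usage examples specialized by purpose (large-scale data, image data)
      – Pandas, Redshift, BigQuery, OpenCV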
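  3. Supplement: handouts and reproducibility with Docker
    The code used in these slides is published as Jupyter notebooks at the following locations.
    ▶ Slides: http://goo.gl/MeMZyO
    ▶ Part 2 materials: http://goo.gl/y6ZmKr
    ▶ Part 3 materials: http://goo.gl/ATC9yv
    The Jupyter notebooks in these materials run in an environment built from a published Docker image. The same environment can be tried with the following command.
    $ docker run --rm -ti \
        -p 8888:8888 \
        -v /path/to/data_directory:/mnt \
        -v /path/to/working_directory:/home/kohei/work \
        smly/notebook:0.4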
  4. PART 1: The predictive modeling process and related packages

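  5. The general process of building a predictive model
    ▶ In competitions such as Kaggle and KDD Cup, participants are ranked by their score on a fixed evaluation metric for a given dataset and task.
    ▶ The following slides introduce modules that are useful at each stage of the process.
    Stages: task design → data creation → preprocessing and feature extraction → model building → model evaluation → integration into a system
    Preprocessing and feature extraction: pandas, psycopg2, sklearn.feature_extraction, sklearn.preprocessing, pytables, seaborn
    Model building: sklearn.linear_model, sklearn.svm, sklearn.ensemble, xgboost, statsmodels
    Model evaluation: sklearn.metrics, sklearn.cross_validation, ml_metrics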
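  6. pandas: data manipulation and I/O with data frames (preprocessing, feature extraction)
    ▶ A data structure for manipulating data, similar to NumPy's ndarray
    ▶ Has columns and an index; data can be manipulated through named columns
    ▶ Each column can have a different type
    ▶ A mutable object: rows can be inserted, columns can be deleted, and so on
    ▶ Rich data manipulation features and I/O interfaces
    [Figure: internals of an NDArray (dimensions, dim count, dtype, strides, data) compared with internals of a DataFrame (an index plus columns of type float, string, and integer)]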
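  7. Encoding features (preprocessing, feature extraction)
    Models that cannot handle categorical variables directly, such as linear models, need categorical variables to be represented as numeric variables.
    scikit-learn provides a variety of feature encoding methods and implementations.
    ▶ sklearn.preprocessing.OneHotEncoder
    ▶ sklearn.preprocessing.LabelEncoder
    ▶ sklearn.feature_extraction.DictVectorizer
    ▶ sklearn.feature_extraction.FeatureHasher
    ▶ sklearn.feature_extraction.text.TfidfVectorizer
    ▶ …
    The next slides introduce the representative classes.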
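  8. OneHotEncoder (preprocessing, feature extraction)
    Encodes a categorical variable in 1-of-K notation.
    If the cardinality of a categorical variable is K, the feature is treated as K columns. Each column corresponds to one specific value of the categorical variable; the matching column is set to 1 and all others to 0.
    (The pd.get_dummies() function performs the equivalent operation in pandas.)
    [Figure: a WeekDay column is converted to an ndarray or csr_matrix with K=5 columns]
        index  WeekDay    encoded row
        0      Friday     1 0 0 0 0
        1      Monday     0 1 0 0 0
        2      Monday     0 1 0 0 0
        3      Sunday     0 0 1 0 0
        4      Tuesday    0 0 0 1 0
        5      Saturday   0 0 0 0 1
        6      Monday     0 1 0 0 0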
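The pandas equivalent mentioned on the slide can be sketched as follows. This is a minimal illustration that reuses the WeekDay values from the figure; the variable names are assumptions made for this example, and the column order (alphabetical) differs from the order drawn on the slide.

```python
import pandas as pd

# Same WeekDay values as in the figure on the slide.
weekday = pd.Series(
    ["Friday", "Monday", "Monday", "Sunday", "Tuesday", "Saturday", "Monday"],
    name="WeekDay",
)

# One column per distinct value (K = 5). The cast to int is there because
# newer pandas versions return boolean columns.
one_hot = pd.get_dummies(weekday).astype(int)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches the row's weekday.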
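  9. LabelEncoder (preprocessing, feature extraction)
    Encodes a categorical variable as integer values.
    If the cardinality of the categorical variable is K, its values are replaced with integers in [0, K-1].
    (The pd.factorize() function performs the equivalent operation in pandas.)
    [Figure: a WeekDay column is converted to an ndarray of integer codes]
        index  WeekDay    code
        0      Friday     0
        1      Monday     1
        2      Monday     1
        3      Sunday     2
        4      Tuesday    3
        5      Saturday   4
        6      Monday     1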
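A small sketch of the pd.factorize() route, assuming the same WeekDay values as in the figure. Note that pd.factorize() numbers values in order of first appearance, which matches the figure, whereas scikit-learn's LabelEncoder assigns codes in sorted order of the labels.

```python
import pandas as pd

weekday = pd.Series(
    ["Friday", "Monday", "Monday", "Sunday", "Tuesday", "Saturday", "Monday"],
    name="WeekDay",
)

# codes: integer code per row; uniques: the label that each code stands for.
codes, uniques = pd.factorize(weekday)
print(codes)
print(list(uniques))
```

The second return value is what lets you map integer codes back to the original labels later.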
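  10. DictVectorizer (preprocessing, feature extraction)
    Converts a list of dictionary objects into a SciPy sparse matrix object or an ndarray object.
    Each key of the dictionaries corresponds one-to-one to a specific column of the matrix, and each value corresponds to a feature value.
    (The equivalent operation is possible in pandas by passing the list to the DataFrame constructor, then zero-filling and converting with fillna(0).as_matrix().)
    [Figure: list of dict → ndarray]
        [{'like': 1, 'rt': 9},      1 9
         {'like': 2, 'rt': 2},  →   2 2
         {'like': 4}]               4 0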
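The pandas route described above can be sketched like this, using the list of dicts from the figure. One assumption to flag: the slide uses as_matrix(), which has since been removed from pandas, so this sketch uses to_numpy() instead.

```python
import pandas as pd

# The list of dicts shown in the figure; the third record has no 'rt' key.
records = [
    {"like": 1, "rt": 9},
    {"like": 2, "rt": 2},
    {"like": 4},
]

# Missing keys become NaN in the DataFrame, then are zero-filled.
X = pd.DataFrame(records).fillna(0).to_numpy()
print(X)
```

Unlike DictVectorizer, this always produces a dense array, so it only suits data whose feature matrix fits comfortably in memory.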
  11. Building predictive models with scikit-learn and XGBoost (model building)
    scikit-learn models share a common interface: fit() fits the model to data, predict() predicts, and predict_proba() returns predicted probabilities (for classification).
    >>> from sklearn.linear_model import LogisticRegression
    >>> clf = LogisticRegression()
    >>> clf.fit(X_train, y_train)
    >>> y_pred = clf.predict_proba(X_test)
    XGBoost has a different interface from scikit-learn, but offers many features such as per-iteration evaluation, early stopping, and customization of the objective function.
    XGBoost handles data in DMatrix, an internal data structure optimized for memory efficiency and training speed. A DMatrix object can be created from an ndarray.
    >>> import xgboost as xgb
    >>> dtrain = xgb.DMatrix(X_train, label=y_train)
    >>> dtest = xgb.DMatrix(X_test)
    >>> watchlist = [(dtrain, 'train')]  # datasets evaluated at every iteration
    >>> gbtree = xgb.train(params, dtrain, evals=watchlist)
    >>> y_pred = gbtree.predict(dtest)
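To make the shared scikit-learn interface concrete, here is a minimal runnable sketch. The synthetic two-cluster data and all variable names are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated two-class data (illustrative only).
rng = np.random.RandomState(0)
X_train = np.vstack([
    rng.normal(-2.0, 1.0, size=(20, 2)),
    rng.normal(2.0, 1.0, size=(20, 2)),
])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[-2.0, -2.0], [2.0, 2.0]])

clf = LogisticRegression()
clf.fit(X_train, y_train)             # fit the model to the training data
y_label = clf.predict(X_test)         # hard class labels
y_proba = clf.predict_proba(X_test)   # one probability column per class
print(y_label, y_proba.shape)
```

Any other scikit-learn classifier can be swapped in for LogisticRegression without changing the three calls.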
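  12. Evaluating model performance (model evaluation)
    The sklearn.cross_validation module provides helper functions and iterators for evaluating the performance of a predictive model.
    If the model is a class that inherits from sklearn.base.ClassifierMixin, cross-validation can be written concisely with the helper function. The resulting array holds the accuracy of each fold.
    >>> from sklearn import cross_validation
    >>> clf = LogisticRegression()
    >>> scores = cross_validation.cross_val_score(clf, X, y, cv=5)
    >>> scores
    array([ 0.92..., 1. ..., 0.92..., 1. ])
    The iterator returns lists of indices for the training data and test data of each validation round.
    >>> from sklearn.cross_validation import KFold
    >>> kf = KFold(n_samples, n_folds=5)
    >>> for idx_train, idx_test in kf:
    ...     y_train, y_test = y[idx_train], y[idx_test]
    ...     X_train, X_test = X[idx_train], X[idx_test]
    ...     clf.fit(X_train, y_train)
    ...     y_pred = clf.predict(X_test)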
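The fold iterator can be sketched with current scikit-learn as below. This is an assumption-laden illustration: it uses sklearn.model_selection.KFold (the successor of the sklearn.cross_validation module named on the slide), and the synthetic data is made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic two-class data (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(30, 2)),
    rng.normal(2.0, 1.0, size=(30, 2)),
])
y = np.array([0] * 30 + [1] * 30)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
seen = []
for idx_train, idx_test in kf.split(X):
    clf = LogisticRegression()
    clf.fit(X[idx_train], y[idx_train])
    y_pred = clf.predict(X[idx_test])
    # Accuracy on the held-out fold.
    scores.append(float((y_pred == y[idx_test]).mean()))
    seen.extend(idx_test.tolist())
print(scores)
```

Every sample lands in exactly one test fold, so the mean of `scores` is an estimate built from predictions on data the model did not see during fitting.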
  13. PART 2: Data analysis competitions in practice

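  14. Checking the data
    ▶ The subject is the "Walmart Recruiting: Trip Type Classification" competition.
    ▶ The task is multi-class classification of shoppers based on the scan history of the items in their shopping trip.
    ▶ Three files are provided: train.csv, test.csv, and sampleSubmission.csv.
    The training data, train.csv, is a history of item scans during shopping trips.
        Field                   Description                                              Train  Test
        TripType                Target variable: categorical ID of the shopper           ✔      ✗
        VisitNumber             Instance ID: ID of a single shopping trip by a customer  ✔      ✔
        Weekday                 Day of the week of the shopping trip                     ✔      ✔
        Upc                     UPC number of the purchased item                         ✔      ✔
        ScanCount               Quantity purchased (negative values are returned items)  ✔      ✔
        DepartmentDescription   Category of the purchased item                           ✔      ✔
        Fineline                ID of a finer-grained category of the purchased item     ✔      ✔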
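  15. Checking the data
    ▶ The training data gives the shopping basket history of each shopper (VisitNumber) and the category assigned to that visit (TripType).
    ▶ The test data has no category. The task is to predict the category of each test record based on the training data.
    [Figure: the data displayed as an easy-to-read table in a Jupyter notebook]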

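  16. Checking the data
    ▶ The submission format is a table with the visit number and one column per category. For each visit number, the answer is the probability of belonging to each category.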

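  17. Visualizing data with seaborn
    seaborn is a visualization tool with a high-level interface. It can draw statistical plots directly from a data frame.
    [Figure: count of records per TripType value, after dropping records so that VisitNumber is unique. The plot confirms that the target variable TripType is unevenly distributed.]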

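  18. Visualizing data with seaborn
    ▶ The plots show that the training and test examples were sampled so that their distribution over time (day of the week) is similar.
    ▶ Given these distributions, we can infer that using the day of the week as a feature should not cause the model to overfit to the training examples.
    [Figure: count per Weekday value; training data in the upper plot, test data in the lower plot]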

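  19. Preparation: defining the evaluation metric
    Implementations of evaluation metrics exist in the sklearn.metrics module, the ml_metrics package, and elsewhere. Normally these packages are all you need.
    ▶ The sklearn.metrics.log_loss() function raises an error when the number of classes in the training examples and the number of classes in the test examples do not match. In multi-class classification, it can happen that a rare class does not appear in the test examples.
    ▶ This competition falls into that case, so the metric is defined here without using a package.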
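A hand-written multi-class log loss along these lines might look as follows. This is only a sketch under stated assumptions: the function name `multiclass_log_loss`, the clipping constant, and the toy inputs are illustrative and are not the notebook's actual code.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (n_samples,)
    y_prob: predicted probabilities, shape (n_samples, n_classes)
    The number of columns fixes the class count, so classes that never
    occur in y_true cause no error.
    """
    y_true = np.asarray(y_true)
    # Clip to avoid log(0), then renormalize each row to sum to 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    n = y_prob.shape[0]
    return float(-np.log(y_prob[np.arange(n), y_true]).mean())

# Only classes 0 and 1 appear in y_true, although there are 4 classes.
y_true = np.array([0, 1, 1, 0])
uniform = np.full((4, 4), 0.25)
baseline = multiclass_log_loss(y_true, uniform)
print(baseline)
```

A uniform prediction over K classes always scores ln(K), which is a handy sanity check and the natural baseline to beat.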

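  20. Preparation: setting up the model evaluation method
    ▶ This tutorial uses K-fold cross-validation (sklearn.cross_validation.KFold).
    ▶ A baseline that answers with equal probability for every category is scored with the multilogloss() function defined earlier; its score is ….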

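  21. Preparation: compressing and saving features to intermediate files with PyTables
    ▶ PyTables is a package that provides an interface to the HDF5 format.
    ▶ blosc is recommended for data compression. blosc supports SIMD instructions and allows fast serialization and deserialization.
    Note: to serialize and deserialize data frames, the pandas HDFStore class (which uses PyTables) is useful.
    [Figure annotations: blosc is specified as the compression method; pd.factorize() encodes the target variable as 0-indexed integers]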
  22. Creating features
    Example: "In a given shopping trip, how many items of each product category were scanned?"
    The pd.pivot_table() function aggregates the transactions.
    >>> df_long = pd.concat([
    ...     df_train,
    ...     df_test,
    ... ]).fillna("_NA_")
    >>> df = pd.pivot_table(
    ...     df_long,
    ...     index="VisitNumber",
    ...     columns=["DepartmentDescription"],
    ...     values=["ScanCount"],
    ...     aggfunc=[np.sum],
    ... )['sum']['ScanCount']
    Before: column-value pairs are stacked vertically (long format).
    After: the column-value pairs are aggregated and laid out horizontally (wide format).
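The long-to-wide aggregation can be tried on a toy table. The rows below are invented for illustration, and the sketch passes `aggfunc="sum"` and `fill_value=0` directly rather than reproducing the slide's exact call.

```python
import pandas as pd

# Long format: one row per scanned item category in a visit (made-up data).
df_long = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 7, 7, 8],
    "DepartmentDescription": ["DAIRY", "DAIRY", "DAIRY", "SHOES", "SHOES", "PRODUCE"],
    "ScanCount": [1, 2, 1, 1, -1, 3],
})

# Wide format: one row per visit, one column per department,
# each cell holding the summed ScanCount (0 where nothing was scanned).
wide = pd.pivot_table(
    df_long,
    index="VisitNumber",
    columns="DepartmentDescription",
    values="ScanCount",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

Visit 7 bought and returned one SHOES item, so its summed count is 0 even though two transactions exist.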
  23. Deciding the storage format for features
    ▶ The feature matrix is saved as an ndarray in which the first rows are the training examples and the following rows are the test examples.
    ▶ The index of df is VisitNumber. The rows are reordered with the loc method so that both the training block and the test block are in VisitNumber order.
    >>> visit_number_order = (
    ...     df_train.VisitNumber.drop_duplicates().append(
    ...         df_test.VisitNumber.drop_duplicates()
    ...     )
    ... )
    >>> df = df.loc[visit_number_order]
    >>> X = df.fillna(0).as_matrix()
    >>> X.shape
    (191348, 69)
    [Figure: a matrix indexed by VisitNumber, split into the training rows (ndarray object) followed by the test rows (ndarray object), each ordered by VisitNumber]
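The reordering step can be sketched on toy data as follows. The frames and values are invented, and the sketch uses pd.concat() and to_numpy() because Series.append() and as_matrix() are no longer available in recent pandas versions.

```python
import pandas as pd

# Toy train/test transaction tables (made-up visit numbers).
df_train = pd.DataFrame({"VisitNumber": [5, 5, 8]})
df_test = pd.DataFrame({"VisitNumber": [6, 7, 7]})

# Wide feature table indexed by VisitNumber, as produced by a pivot.
df = pd.DataFrame(
    {"DAIRY": [3, 0, 1, 0], "SHOES": [0, 2, 0, 1]},
    index=pd.Index([5, 6, 7, 8], name="VisitNumber"),
)

# Training visits first, then test visits, each without duplicates.
visit_number_order = pd.concat([
    df_train["VisitNumber"].drop_duplicates(),
    df_test["VisitNumber"].drop_duplicates(),
])
X = df.loc[visit_number_order.tolist()].fillna(0).to_numpy()

# Split the single matrix back into its two blocks.
n_train = df_train["VisitNumber"].nunique()
X_train, X_test = X[:n_train], X[n_train:]
print(X_train, X_test)
```

Keeping both blocks in one array means every feature matrix saved later can be split with the same `n_train` offset.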
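  24. Building a model and evaluating it by cross-validation (scikit-learn)
    A scikit-learn logistic regression model is built from the features created above. Evaluating the model by cross-validation gave a score of ….
    Cross-validation uses only the n_samples training examples (the test data is not used).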

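  25. A limitation of the cross_val_score() function and a workaround
    [Issue] The current cross_val_score() helper function cannot call the predict_proba() method; it calls the predict() method.
    In scikit-learn dev (master HEAD), the sklearn.model_selection.cross_val_score() helper function has been revised so that the predict_proba() method can be called by specifying it as an option.
    → [Workaround] Define a subclass whose predict() method calls predict_proba(); such a class can then be used with the helper.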
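The workaround can be sketched as a thin subclass. The class name `ProbaLogisticRegression` and the synthetic data are hypothetical. For comparison, the sketch also uses the `method="predict_proba"` option of `cross_val_predict`, which is the option I am certain exists in current scikit-learn; whether it corresponds exactly to the change the slide refers to is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

class ProbaLogisticRegression(LogisticRegression):
    """Hypothetical wrapper: predict() returns class probabilities."""

    def predict(self, X):
        return self.predict_proba(X)

# Synthetic two-class data (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(15, 2)),
    rng.normal(2.0, 1.0, size=(15, 2)),
])
y = np.array([0] * 15 + [1] * 15)

# Workaround: the helper calls predict(), which now yields probabilities.
proba_wrapper = cross_val_predict(ProbaLogisticRegression(), X, y, cv=3)

# Built-in option in current scikit-learn.
proba_option = cross_val_predict(
    LogisticRegression(), X, y, cv=3, method="predict_proba"
)
print(proba_wrapper.shape, proba_option.shape)
```

Both calls return out-of-fold probabilities with one row per sample, which is the input a probability-based metric such as log loss needs.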
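  26. Building a model and evaluating it by cross-validation (XGBoost)
    The model is replaced with XGBoost's boosting implementation and cross-validated again. Score: … → …
    [Figure annotations: datasets registered in the watchlist are evaluated and printed to the terminal each time a regression tree is added; a DMatrix object can be defined together with its corresponding target variable]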

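  27. Submitting predictions to Kaggle
    A data frame is built from the predictions and submitted to Kaggle.
    [Figure annotations: the second return value of pd.factorize() gives the encoding order of the target variable; to_csv() writes the CSV file]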

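  28. Submitting predictions to Kaggle
    A data frame is built from the predictions and submitted to Kaggle.
    [Figure annotations: the second return value of pd.factorize() gives the encoding order of the target variable; to_csv() writes the CSV file]
    Evaluation on the holdout set starts immediately, and the rank within the competition is shown.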

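  29. Adding more features
    ▶ Day of the week of the shopping trip → pd.factorize()
    ▶ Number of transaction records, number of transaction records for returned items, and number of unique DepartmentDescription values per trip → groupby()
    The aggregation functions can be customized.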
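The groupby() features listed above can be sketched with pandas named aggregation. The toy transactions and the output column names (`n_records`, `n_returns`, `n_departments`) are assumptions made for this example.

```python
import pandas as pd

# Toy transaction log (made-up rows); a negative ScanCount is a return.
df = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 7, 7, 8],
    "DepartmentDescription": ["DAIRY", "DAIRY", "DAIRY", "SHOES", "SHOES", "PRODUCE"],
    "ScanCount": [1, 2, 1, 1, -1, 3],
})

features = df.groupby("VisitNumber").agg(
    # Number of transaction records in the visit.
    n_records=("ScanCount", "size"),
    # Custom aggregation: number of records that are returns.
    n_returns=("ScanCount", lambda s: int((s < 0).sum())),
    # Number of distinct departments touched in the visit.
    n_departments=("DepartmentDescription", "nunique"),
)
print(features)
```

The lambda shows the customization point: any function that reduces a group's column to a single value can be used.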

  30. Cross-validating the model with more features
    The feature matrices saved in intermediate files are combined with the np.hstack() function into a single ndarray.
    The model reuses the parameters defined earlier as they are (→ CV score: …).
    [Figure: two feature matrices joined side by side by np.hstack()]
  31. Cross-validating the model with more features
    The feature matrices saved in intermediate files are concatenated column-wise with the np.hstack() function into a single ndarray.
    The model reuses the parameters defined earlier as they are (→ CV score: …).
    [Figure: two feature matrices joined side by side by np.hstack()]
    → The result improved on the previous model and the rank went up.
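Column-wise concatenation with np.hstack() can be sketched as below. The two small matrices are invented stand-ins for feature blocks loaded from intermediate files.

```python
import numpy as np

# Two feature blocks for the same three visits, in the same row order.
X_department = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
])
X_counts = np.array([
    [2, 3],
    [1, 1],
    [1, 2],
])

# Join the blocks side by side: rows stay aligned, columns are appended.
X = np.hstack([X_department, X_counts])
print(X.shape)
```

This only works when every block has the same number of rows in the same order, which is why a fixed row ordering by VisitNumber matters.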
  32. PART 3: Advanced topics

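  33. Advanced topics: topics specific to the kind of data
    This part introduces packages that are useful for different kinds of data.
    ▶ Large-scale data (here, an RDB of roughly … GB in total is assumed)
      – Redshift
      – BigQuery
      – Benchmark (Redshift vs. BigQuery)
    ▶ Image data
      – Sampling feature points
      – Projective transformation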
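  34. Creating a pandas data frame from a SQL query
    pandas supports SQL as one of its I/O interfaces. With the pandas.read_sql() function, the result of a query can be received as a data frame.
    ▶ When working with a complex RDB, aggregation and similar processing can be written concisely in SQL.
    ▶ When working with large-scale data, processing can run on the SQL server's compute resources without loading all the data into the local environment.
    Image source: https://www.kaggle.com/c/cervical-cancer-screening/data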
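The read_sql() pattern can be tried locally without a database server. This sketch swaps in an in-memory SQLite database from the standard library in place of a real RDB; the table contents are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train (VisitNumber INTEGER, TripType INTEGER, ScanCount INTEGER)"
)
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [(5, 999, -1), (7, 30, 1), (7, 30, 2), (8, 26, 1)],
)

# The aggregation runs inside the database; only the result is loaded.
df = pd.read_sql(
    "SELECT VisitNumber, SUM(ScanCount) AS total_scans "
    "FROM train GROUP BY VisitNumber ORDER BY VisitNumber",
    conn,
)
conn.close()
print(df)
```

With a server-side database the call is the same; only the connection object changes.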
  35. Processing large-scale data with cloud resources (Redshift)
    Redshift runs on a cluster of dense storage nodes or dense compute nodes launched on AWS.
    ### 1. Upload the data to Amazon S3
    $ aws s3 sync data/input s3://kaggle-kohei/walmart_triptype/input
    ### 2. Create the tables and load the data from S3 into Redshift
    $ psql < schema.sql
    ### 3. Issue a SQL query and create a pandas data frame
    >>> import os
    >>> import psycopg2 as pg
    >>> import pandas as pd
    >>> # Connection settings for the launched Redshift cluster
    >>> conn_string = ' '.join([
    ...     "dbname='dwh'",
    ...     "port='5439'",
    ...     "user='kohei_ozaki'",
    ...     "password='{}'".format(os.environ['REDSHIFT_PWD']),
    ...     "host='{}'".format(os.environ['REDSHIFT_HOST']),
    ... ])
    >>> conn = pg.connect(conn_string)
    >>> # Create a data frame from SQL
    >>> pd.read_sql("SELECT * FROM train WHERE TripType = 999", conn)
  36. Processing large-scale data with cloud resources (BigQuery)
    ### 1. Upload the data to Cloud Storage
    $ gsutil rsync -r data/input gs://kaggle-kohei.appspot.com/walmart_triptype
    ### 2. Import the data as a BigQuery table (train.json defines the table schema in JSON format)
    $ bq load --skip_leading_rows 1 \
        kaggle-kohei:walmart_triptype.train \
        gs://kaggle-kohei.appspot.com/walmart_triptype/train.csv \
        train.json
    ### 3. Issue a query from pandas and receive the result as a data frame
    >>> df = pd.read_gbq("""
    ... SELECT t1.trip_type, COUNT(1) AS n_visitors
    ... FROM (
    ...   SELECT FIRST(trip_type) AS trip_type
    ...   FROM walmart_triptype.train
    ...   GROUP BY visit_number
    ... ) t1
    ... GROUP BY t1.trip_type""", "kaggle-kohei")
    Queries can be issued to BigQuery from pandas with the pd.read_gbq() function.
    Because BigQuery is a fully managed service, it has the advantage that you do not need to think about clusters.
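  37. Benchmark: Redshift vs. BigQuery
    ▶ A table of … hundred million rows (a … GB CSV, … GB after gzip compression) was loaded, and the execution time of queries using aggregate functions was compared.
    ▶ Systems measured: BigQuery, Redshift x…, Redshift x…, Redshift x…
    (off-topic) With these settings, Redshift performance could be improved by adding nodes.
    Note: the cost of Redshift (… nodes, on-demand dc….large instances) is … USD/hour. The BigQuery query cost is under … yen.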
  38. Advice on using cloud resources (off-topic)
    ▶ Loading the data takes time, but the approach is cost-effective if there is a strong need to run ad hoc queries many times and to shorten the turnaround time of each ad hoc query.
    ▶ BigQuery cannot load a huge compressed file in a single batch. In that case, the file can be split by lines, compressed, and uploaded to Cloud Storage.
    $ zcat prescription_head.csv.gz | \
        split -d -C 1G --filter='gzip > $FILE.gz' - prescription_head.csv.part
    [Figure: prescription_head.csv.gz → (zcat & split) → prescription_head.csv.part01.gz, prescription_head.csv.part02.gz, prescription_head.csv.part03.gz → (gsutil rsync)]
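  39. Processing image data with Python
    Packages for image processing in Python include scikit-image and the Python bindings of OpenCV. Each package handles an image as a matrix in an ndarray object.
    ▶ Transformations (rotation, scaling, blurring)
    ▶ Drawing
    ▶ Histogram equalization
    ▶ Image segmentation
    ▶ Edge extraction, computing image feature points
    ▶ Sampling and matching image feature points
    ▶ etc.
    [Figure: example of a 3-channel BGR image representation, with planes im[:, :, 0], im[:, :, 1], and im[:, :, 2]]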
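The ndarray view of an image can be sketched with plain NumPy. The tiny array below is an invented stand-in for an image that a library such as OpenCV would load in BGR channel order.

```python
import numpy as np

# A 2x3 pixel image with 3 channels in BGR order (illustrative stand-in
# for an array returned by an image-loading function).
im = np.zeros((2, 3, 3), dtype=np.uint8)
im[:, :, 2] = 255  # fill the red channel, which is index 2 in BGR

# Each channel is a 2-D plane of the same height and width.
blue, green, red = im[:, :, 0], im[:, :, 1], im[:, :, 2]

# Reversing the last axis converts BGR to RGB, e.g. for plotting libraries
# that expect RGB.
rgb = im[:, :, ::-1]
print(im.shape, rgb[0, 0])
```

Because the image is an ordinary array, slicing, masking, and arithmetic all work exactly as they do for any other ndarray.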
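  40. Computing image feature points with OpenCV
    The cv2.BRISK_create() function is OpenCV's implementation of the BRISK algorithm. It creates a detector object for computing BRISK features.
    The data of the Draper Satellite Image Chronology competition is used as the example.
    https://www.kaggle.com/c/draper-satellite-image-chronology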

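  41. Matching feature points and using the result for a projective transformation
    Feature points are matched with the cv2.DescriptorMatcher_create() function. The matching result is used to correct differences in shooting angle and scale by a projective transformation.
    ▶ The cv2.findHomography() function computes the homography matrix from the matching result.
    ▶ The cv2.perspectiveTransform() function applies the projective transformation based on the homography matrix.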
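What applying a homography matrix to points means can be sketched in plain NumPy. This is an illustration of the underlying arithmetic only, not OpenCV's implementation; the helper `apply_homography` and the example matrix are assumptions made for this sketch.

```python
import numpy as np

def apply_homography(H, points):
    """Map 2-D points through a 3x3 homography matrix H.

    Points are lifted to homogeneous coordinates (x, y, 1), multiplied
    by H, and divided by the resulting third coordinate.
    """
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Example matrix: scale by 2, then shift by (10, -5).
H = np.array([
    [2.0, 0.0, 10.0],
    [0.0, 2.0, -5.0],
    [0.0, 0.0, 1.0],
])
corners = [[0, 0], [1, 0], [1, 1], [0, 1]]
warped = apply_homography(H, corners)
print(warped)
```

A homography estimated from matched feature points has a non-trivial last row, which is what produces perspective effects; the division step handles that case.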
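  42. Visualizing image differences
    The projective transformation aligns images taken at different angles and scales. Comparing the binarized images pixel by pixel then makes the differences visible.
    Ref: Tutorial: visualizing difference between three pictures
    https://www.kaggle.com/c/draper-satellite-image-chronology/forums/t/…/tutorial-visualizing-difference-between-three-pictures
    [Figure: pixels that appear only in image A and pixels that appear only in image B are highlighted]
    Note: for the code and details, see the Jupyter notebooks in the materials.
    Note: computing feature points and similar steps requires installing OpenCV extensions.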
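The pixel-wise comparison of binarized images can be sketched in NumPy, assuming the two images are already aligned and grayscale. The function name `binary_difference`, the threshold of 128, and the 2x2 arrays are illustrative assumptions.

```python
import numpy as np

def binary_difference(img_a, img_b, threshold=128):
    """Return boolean masks of pixels set only in A and only in B."""
    a = np.asarray(img_a) >= threshold  # binarize image A
    b = np.asarray(img_b) >= threshold  # binarize image B
    only_a = a & ~b
    only_b = b & ~a
    return only_a, only_b

# Two tiny aligned grayscale images (made-up values).
img_a = np.array([[200, 10], [200, 10]], dtype=np.uint8)
img_b = np.array([[200, 220], [10, 10]], dtype=np.uint8)
only_a, only_b = binary_difference(img_a, img_b)
print(only_a)
print(only_b)
```

The two masks can be painted in different colors over the original image to show what appeared or disappeared between shots.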
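  43. About this tutorial (recap)
    Using data analysis competitions as the subject, this tutorial introduces examples of using Python packages.
    ▶ Part 1: The predictive modeling process and related packages (… min)
      – An overview of the process of building a predictive model and the packages involved
      – Pandas, Scikit-Learn
    ▶ Part 2: Data analysis competitions in practice (… min)
      – Concrete usage examples, with a data analysis competition as the subject
      – Pandas, seaborn, XGBoost, PyTables
    ▶ Part 3: Advanced topics (… min)
      – Usage examples specialized by purpose (large-scale data, image data)
      – Pandas, Redshift, BigQuery, OpenCV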
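  44. Supplement: handouts and reproducibility with Docker (recap)
    The code used in these slides is published as Jupyter notebooks at the following locations.
    ▶ Slides: http://goo.gl/MeMZyO
    ▶ Part 2 materials: http://goo.gl/y6ZmKr
    ▶ Part 3 materials: http://goo.gl/ATC9yv
    The Jupyter notebooks in these materials run in an environment built from a published Docker image. The same environment can be tried with the following command.
    $ docker run --rm -ti \
        -p 8888:8888 \
        -v /path/to/data_directory:/mnt \
        -v /path/to/working_directory:/home/kohei/work \
        smly/notebook:0.4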