Python and Data Analysis Competitions in Practice

@smly
September 08, 2016

FIT 2016 tutorial materials
Slides: http://goo.gl/MeMZyO
Part 2 materials: http://goo.gl/y6ZmKr
Part 3 materials: http://goo.gl/ATC9yv

Transcript

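  1. About this tutorial

    Using data analysis competitions as the running example, this tutorial introduces practical uses of Python packages.

    ▶ Part 1: The predictive modeling process and related packages (… min)
      – An overview of the process of building a predictive model and the packages involved
      – Pandas, Scikit-Learn
    ▶ Part 2: Practice in data analysis competitions (… min)
      – Concrete usage examples, with a data analysis competition as the subject
      – Pandas, seaborn, XGBoost, PyTables
    ▶ Part 3: Advanced topics (… min)
      – Usage examples specialized by purpose (large-scale data, image data)
      – Pandas, Redshift, BigQuery, OpenCV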
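  2. The general process of building a predictive model

    ▶ In competitions such as Kaggle and KDD Cup, participants are ranked by their score under a fixed evaluation metric for a given dataset and task.
    ▶ This part introduces modules that are useful at each stage of the process.

    Process: task design → data creation → preprocessing / feature extraction → building the predictive model → evaluating the predictive model → integration into a system

    ・ Preprocessing / feature extraction: pandas, psycopg2, sklearn.feature_extraction, sklearn.preprocessing, pytables, seaborn
    ・ Building the predictive model: sklearn.linear_model, sklearn.svm, sklearn.ensemble, xgboost, statsmodels
    ・ Evaluating the predictive model: sklearn.metrics, sklearn.cross_validation, ml_metrics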
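  3. Data manipulation and I/O with pandas data frames (preprocessing / feature extraction)

    ▶ A data structure for manipulating data, similar to NumPy's ndarray
    ▶ Has columns and an index; data can be manipulated through named columns
    ▶ Each column can hold a different type
    ▶ A mutable object: rows can be inserted, columns can be deleted, and so on
    ▶ Rich data manipulation features and I/O interfaces

    [Figure: internal layout of an NDArray (dimensions, dim count, dtype, strides, data) compared with a DataFrame (an index plus columns of float, string and integer type)]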
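
The properties listed on this slide can be sketched with a tiny frame. The column names and values below are made up for illustration and are not part of the tutorial data.

```python
import pandas as pd

# Hypothetical toy data: each column holds a different dtype.
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],                     # float
        "department": ["DAIRY", "PRODUCE", "DAIRY"],   # string
        "scan_count": [1, 2, -1],                      # integer
    },
    index=[10, 11, 12],                                # explicit row index
)

# Per-column dtypes, looked up by column name.
dtypes = df.dtypes.astype(str).to_dict()

# DataFrames are mutable: add a derived column, then drop another in place.
df["returned"] = df["scan_count"] < 0
df.drop(columns=["price"], inplace=True)
```

After these steps `df` keeps its index and holds the columns `department`, `scan_count` and `returned`.
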
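  4. Building predictive models with scikit-learn and XGBoost (building the predictive model)

    scikit-learn estimators share a common interface: fit() fits the model to data, predict() makes predictions, and predict_proba() returns predicted probabilities (for classification).

    >>> from sklearn.linear_model import LogisticRegression
    >>> clf = LogisticRegression()
    >>> clf.fit(X_train, y_train)
    >>> y_pred = clf.predict_proba(X_test)

    XGBoost has a different interface from scikit-learn, but offers many features such as per-iteration evaluation, early stopping, and customization of the objective function.
    XGBoost handles data as DMatrix, an internal data structure optimized for memory efficiency and training speed. A DMatrix object can be created from an ndarray.

    >>> import xgboost as xgb
    >>> dtrain = xgb.DMatrix(X_train, label=y_train)
    >>> dtest = xgb.DMatrix(X_test)
    >>> watchlist = [(dtrain, 'train')]
    >>> # params: dict of booster settings, defined beforehand
    >>> gbtree = xgb.train(params, dtrain, evals=watchlist)
    >>> y_pred = gbtree.predict(dtest)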
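
As a self-contained illustration of the common fit() / predict() / predict_proba() interface, here is a minimal sketch on synthetic two-class data. The data and the random seed are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated two-class data (an assumption for illustration).
rng = np.random.RandomState(0)
X_train = np.vstack([
    rng.normal(-2.0, 0.5, (20, 2)),
    rng.normal(2.0, 0.5, (20, 2)),
])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[-2.0, -2.0], [2.0, 2.0]])

clf = LogisticRegression()
clf.fit(X_train, y_train)            # fit the model to the training data
labels = clf.predict(X_test)         # hard class labels, shape (2,)
proba = clf.predict_proba(X_test)    # class probabilities, shape (2, 2)
```

Each row of `proba` sums to one, with one column per class in the order given by `clf.classes_`.
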
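  5. Evaluating model performance (evaluating the predictive model)

    The sklearn.cross_validation module provides helper functions and iterators for evaluating the performance of a predictive model.

    If the predictive model is a class that inherits from scikit-learn's ClassifierMixin, cross-validation can be written concisely with the helper function. The returned array holds the accuracy of each fold.

    >>> from sklearn import cross_validation
    >>> from sklearn.linear_model import LogisticRegression
    >>> clf = LogisticRegression()
    >>> scores = cross_validation.cross_val_score(clf, X, y, cv=5)
    >>> scores
    array([ 0.92..., 1. ..., 0.92..., 1. ])

    The iterator returns lists of indices for the training data and the test data used for validation.

    >>> from sklearn.cross_validation import KFold
    >>> kf = KFold(n_samples, n_folds=5)
    >>> for idx_train, idx_test in kf:
    ...     y_train, y_test = y[idx_train], y[idx_test]
    ...     X_train, X_test = X[idx_train], X[idx_test]
    ...     clf.fit(X_train, y_train)
    ...     y_pred = clf.predict(X_test)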
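
In later scikit-learn releases the same helpers live in `sklearn.model_selection`, where `KFold` takes `n_splits` and yields indices through `split()`. The following is a minimal sketch of the equivalent workflow on synthetic data; the data, seeds and the `fold_sizes` bookkeeping are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic two-class data (an assumption for illustration).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-2.0, 0.5, (25, 2)),
    rng.normal(2.0, 0.5, (25, 2)),
])
y = np.array([0] * 25 + [1] * 25)

clf = LogisticRegression()

# Helper function: one accuracy score per fold.
scores = cross_val_score(clf, X, y, cv=5)

# Iterator: yields index arrays for the training and test part of each fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for idx_train, idx_test in kf.split(X):
    clf.fit(X[idx_train], y[idx_train])
    y_pred = clf.predict(X[idx_test])
    fold_sizes.append((len(idx_train), len(idx_test)))
```
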
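  6. Inspecting the data

    ▶ The "Walmart Recruiting: Trip Type Classification" competition is used as the subject.
    ▶ The task is multi-class classification of shoppers based on the scan history of the items in their shopping trip.
    ▶ Three files are provided: train.csv, test.csv, and sampleSubmission.csv.

    The training data, train.csv, is a scan history of purchased items.

    Field name                    Description                                                      Train  Test
    TripType (target variable)    Categorical ID of the shopper type                               ✔      ✗
    VisitNumber (instance ID)     ID corresponding to one shopping trip by one customer            ✔      ✔
    Weekday                       Day of the week of the shopping trip                             ✔      ✔
    Upc                           UPC number of the purchased item                                 ✔      ✔
    ScanCount                     Number of items purchased (negative values are returned items)   ✔      ✔
    DepartmentDescription         Category of the purchased item                                   ✔      ✔
    Fineline                      ID of a finer-grained category of the purchased item             ✔      ✔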
  7. Creating features

    Example: "In a given shopping trip, how many items of each product category were scanned?"
    The pd.pivot_table() function aggregates the transactions.

    >>> df_long = pd.concat([
    ...     df_train,
    ...     df_test,
    ... ]).fillna("_NA_")
    >>> df = pd.pivot_table(
    ...     df_long,
    ...     index="VisitNumber",
    ...     columns=["DepartmentDescription"],
    ...     values=["ScanCount"],
    ...     aggfunc=[np.sum],
    ... )['sum']['ScanCount']

    [Figure: before the pivot, (Column, Value) pairs are stacked vertically (long format); after the pivot, the pairs are aggregated and laid out horizontally (wide format)]
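
To make the long-to-wide transformation concrete, here is a minimal sketch on a made-up scan log. The five rows are hypothetical; only the column names follow the competition data.

```python
import pandas as pd

# Hypothetical long-format scan log: one row per scanned item category.
df_long = pd.DataFrame(
    {
        "VisitNumber": [5, 5, 5, 7, 7],
        "DepartmentDescription": ["DAIRY", "DAIRY", "PRODUCE", "DAIRY", "SHOES"],
        "ScanCount": [1, 2, 1, -1, 1],
    }
)

# One row per visit, one column per department, summed scan counts as values.
df_wide = pd.pivot_table(
    df_long,
    index="VisitNumber",
    columns="DepartmentDescription",
    values="ScanCount",
    aggfunc="sum",
)
```

Departments that never appear in a visit come out as NaN, which is why the slides fill missing values with 0 before converting to an ndarray.
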
  8. Deciding how to store the features

    ▶ Here we consider storing the feature matrix as an ndarray in which the first … rows are training instances and the following … rows are test instances.
    ▶ The index of df is VisitNumber. The loc method reorders the rows so that the training matrix and the test matrix each follow VisitNumber order.

    >>> visit_number_order = pd.concat([
    ...     df_train.VisitNumber.drop_duplicates(),
    ...     df_test.VisitNumber.drop_duplicates(),
    ... ])
    >>> df = df.loc[visit_number_order]
    >>> X = df.fillna(0).values
    >>> X.shape
    (191348, 69)

    [Figure: the data frame indexed by VisitNumber is split into a training matrix (ndarray object) and a test matrix (ndarray object), each in ORDER BY "VisitNumber" order]
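
The reorder-then-split step can be sketched on toy data as follows. The visit IDs, department columns and values are hypothetical; real inputs would come from train.csv and test.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical visit IDs as they appear (with repeats) in the raw files.
train_visits = pd.Series([5, 5, 9, 9, 9], name="VisitNumber")
test_visits = pd.Series([2, 7, 7], name="VisitNumber")

# Wide feature table indexed by VisitNumber, here in sorted order.
df = pd.DataFrame(
    {"DAIRY": [0.0, 3.0, 1.0, np.nan], "SHOES": [1.0, np.nan, 1.0, 2.0]},
    index=pd.Index([2, 5, 7, 9], name="VisitNumber"),
)

# Training visits first, then test visits, each in file order.
order = pd.concat([train_visits.drop_duplicates(), test_visits.drop_duplicates()])
X = df.loc[order.values].fillna(0).values

# The first n_train rows are training instances, the rest are test instances.
n_train = train_visits.nunique()
X_train, X_test = X[:n_train], X[n_train:]
```

Keeping a single matrix with a known split point means every later feature block can be stored and sliced the same way.
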
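  9. A limitation of the cross_val_score() function and a workaround

    [Issue] The current cross_val_score() helper function cannot call the predict_proba() method; it calls the predict() method.

    In scikit-learn dev (master HEAD), the sklearn.model_selection.cross_val_score() helper function has been reworked so that the predict_proba() method can be called by specifying it as an option.

    → [Workaround] Define a subclass whose predict() method calls predict_proba(); the helper function can then be used with that class.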
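
A minimal sketch of the workaround described above, assuming logistic regression as the base model and log loss as the metric. The class name `ProbaLogisticRegression`, the synthetic data and the choice of scorer are assumptions for illustration, not the author's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import cross_val_score


class ProbaLogisticRegression(LogisticRegression):
    """Hypothetical wrapper: predict() returns class probabilities."""

    def predict(self, X):
        return self.predict_proba(X)


# Synthetic two-class data (an assumption for illustration).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-1.0, 1.0, (25, 2)),
    rng.normal(1.0, 1.0, (25, 2)),
])
y = np.array([0] * 25 + [1] * 25)

# The scorer calls predict(), which now yields probabilities for log_loss.
# greater_is_better=False makes the returned scores negated log losses.
scorer = make_scorer(log_loss, greater_is_better=False)
scores = cross_val_score(ProbaLogisticRegression(), X, y, cv=5, scoring=scorer)
```

Note that such a class no longer works with label-based metrics such as accuracy, so it is only suitable alongside a probability-based scorer.
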
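  10. Adding features and cross-validating the model

    The feature matrices saved to intermediate files are joined column-wise with the np.hstack() function to create a single ndarray.
    The predictive model reuses the previously defined parameters unchanged (→ CV score: …).

    [Figure: several feature matrices concatenated side by side by np.hstack() into one wider matrix]

    → The result improved on the previous predictive model and the ranking went up.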
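
A minimal sketch of joining feature blocks saved as intermediate files. The file names, the temporary directory and the matrix values are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Two hypothetical feature blocks with the same number of rows (instances).
X_dept = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
X_extra = np.array([[2, 3], [1, 1], [1, 2], [0, 1]])

# Save each block to an intermediate file, as in the workflow described above.
workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, name) for name in ("dept.npy", "extra.npy")]
for path, block in zip(paths, (X_dept, X_extra)):
    np.save(path, block)

# Load the blocks back and join them column-wise into one feature matrix.
X = np.hstack([np.load(path) for path in paths])
```

`np.hstack()` requires every block to have the same row count and row order, which is what the fixed VisitNumber ordering guarantees.
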
  11. Processing large-scale data with cloud resources (Redshift)

    Redshift runs on a cluster of dense storage nodes or dense compute nodes launched on AWS.

    ### 1. Upload the data to Amazon S3
    $ aws s3 sync data/input s3://kaggle-kohei/walmart_triptype/input

    ### 2. Create the tables and load the data from S3 into Redshift
    $ psql < schema.sql

    ### 3. Issue an SQL query and create a Pandas data frame
    >>> import os
    >>> import psycopg2 as pg
    >>> import pandas as pd
    >>> # Connection settings for the Redshift cluster that was launched
    >>> conn_string = ' '.join([
    ...     "dbname='dwh'",
    ...     "port='5439'",
    ...     "user='kohei_ozaki'",
    ...     "password='{}'".format(os.environ['REDSHIFT_PWD']),
    ...     "host='{}'".format(os.environ['REDSHIFT_HOST']),
    ... ])
    >>> conn = pg.connect(conn_string)
    >>> # Create a data frame from SQL
    >>> pd.read_sql("SELECT * FROM train WHERE TripType = 999", conn)
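
`pd.read_sql()` works the same way with other database connections, so the pattern can be tried without a Redshift cluster. The sketch below uses an in-memory SQLite database as a stand-in; the table contents are made up and SQLite is an assumption for illustration only.

```python
import sqlite3

import pandas as pd

# Stand-in for the Redshift connection: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train (TripType INTEGER, VisitNumber INTEGER, ScanCount INTEGER)"
)
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [(999, 5, 1), (30, 7, 2), (999, 8, -1)],   # hypothetical rows
)
conn.commit()

# Same call shape as on the slide: SQL string plus an open connection.
df = pd.read_sql("SELECT * FROM train WHERE TripType = 999", conn)
conn.close()
```
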
  12. Processing large-scale data with cloud resources (BigQuery)

    ### 1. Upload the data to Cloud Storage
    $ gsutil rsync -r data/input gs://kaggle-kohei.appspot.com/walmart_triptype

    ### 2. Import the data as a BigQuery table (train.json defines the table schema in JSON format)
    $ bq load --skip_leading_rows 1 \
        kaggle-kohei:walmart_triptype.train \
        gs://kaggle-kohei.appspot.com/walmart_triptype/train.csv \
        train.json

    ### 3. Issue a query from Pandas and receive the result as a data frame
    >>> df = pd.read_gbq("""
    ... SELECT
    ...   t1.trip_type,
    ...   COUNT(1) AS n_visitors
    ... FROM (
    ...   SELECT FIRST(trip_type) AS trip_type
    ...   FROM walmart_triptype.train
    ...   GROUP BY visit_number
    ... ) t1
    ... GROUP BY t1.trip_type""", "kaggle-kohei")

    Queries can be issued to BigQuery from Pandas through the pd.read_gbq() function.
    Because BigQuery is a fully managed service, it has the advantage that there is no need to think about clusters.
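
To clarify what the query computes, the same aggregation can be sketched locally in pandas: take one trip type per visit, then count visits per trip type. The rows below are hypothetical.

```python
import pandas as pd

# Hypothetical scan log: several rows per visit, one trip type per visit.
train = pd.DataFrame(
    {
        "visit_number": [5, 5, 7, 8, 8, 8],
        "trip_type": [999, 999, 30, 999, 999, 999],
    }
)

# Inner query: FIRST(trip_type) ... GROUP BY visit_number
per_visit = train.groupby("visit_number")["trip_type"].first()

# Outer query: COUNT(1) ... GROUP BY trip_type
n_visitors = per_visit.value_counts().to_dict()
```
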
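  13. Benchmark of Redshift and BigQuery (off-topic)

    ▶ Table data of … GB as CSV (… GB gzip-compressed, … hundred million rows) was loaded, and the execution time of queries (aggregate functions) was compared.
    ▶ The systems measured were BigQuery, Redshift x…, Redshift x…, and Redshift x….

    With this setup, Redshift performance could be improved by adding more nodes.

    ※ The cost of Redshift (… nodes, on-demand dc….large instances) is … USD/hour; the BigQuery query cost is under … yen.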
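  14. Advice on using cloud resources (off-topic)

    ▶ Loading the data takes time, but it is cost-effective when there is a strong need to run ad hoc queries many times and to keep the turnaround time of each ad hoc query short.
    ▶ BigQuery cannot load a huge compressed file in a single batch. In that case, the file can be split on line boundaries, recompressed, and uploaded to Cloud Storage.

    $ zcat prescription_head.csv.gz | \
        split -d -C 1G --filter='gzip > $FILE.gz' - prescription_head.csv.part

    [Figure: prescription_head.csv.gz is passed through zcat & split into prescription_head.csv.part01.gz, prescription_head.csv.part02.gz, prescription_head.csv.part03.gz, which are then uploaded with gsutil rsync]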
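
Where GNU split is not available, the same line-preserving split can be sketched in Python with only the standard library. The function name `split_gzip_by_lines` and the two-digit numbering that starts at 00 are assumptions for illustration.

```python
import gzip


def split_gzip_by_lines(src, prefix, max_bytes):
    """Split a gzip file into gzip parts of at most max_bytes uncompressed
    bytes each without breaking a line (an over-long line gets its own part)."""
    parts = []
    out = None
    written = 0
    with gzip.open(src, "rb") as fin:
        for line in fin:
            # Start a new part when the current one would exceed the limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = "{}{:02d}.gz".format(prefix, len(parts))
                out = gzip.open(path, "wb")
                parts.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

For example, `split_gzip_by_lines("prescription_head.csv.gz", "prescription_head.csv.part", 2**30)` would write parts of at most 1 GiB of uncompressed data each, ready to upload.
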
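  15. Processing image data with Python

    Packages for image processing in Python include scikit-image and the Python bindings for OpenCV.
    Each package handles an image as a matrix held in an ndarray object.

    ▶ Transformations (rotation, scaling, blurring)
    ▶ Drawing
    ▶ Histogram equalization
    ▶ Image segmentation
    ▶ Edge detection, computation of image feature points
    ▶ Sampling and matching of image feature points
    ▶ etc.

    [Figure: example of an image represented as BGR channels: im[:, :, 2], im[:, :, 1], im[:, :, 0]]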
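
The channel layout can be sketched with NumPy alone. The 2x3 image below is synthetic; OpenCV's image loader returns arrays in this BGR order, whereas scikit-image and matplotlib expect RGB.

```python
import numpy as np

# Hypothetical 2x3 image in BGR channel order: height x width x channel.
im = np.zeros((2, 3, 3), dtype=np.uint8)
im[:, :, 2] = 255          # channel index 2 is red in BGR order

# Each channel is a plain 2-D slice of the array.
blue, green, red = im[:, :, 0], im[:, :, 1], im[:, :, 2]

# Reversing the last axis converts BGR to RGB.
im_rgb = im[:, :, ::-1]
```
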
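  16. About this tutorial (recap)

    Using data analysis competitions as the running example, this tutorial introduces practical uses of Python packages.

    ▶ Part 1: The predictive modeling process and related packages (… min)
      – An overview of the process of building a predictive model and the packages involved
      – Pandas, Scikit-Learn
    ▶ Part 2: Practice in data analysis competitions (… min)
      – Concrete usage examples, with a data analysis competition as the subject
      – Pandas, seaborn, XGBoost, PyTables
    ▶ Part 3: Advanced topics (… min)
      – Usage examples specialized by purpose (large-scale data, image data)
      – Pandas, Redshift, BigQuery, OpenCV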