
Python and Data Analysis Competitions in Practice (Python とデータ分析コンテストの実践)

@smly
September 08, 2016

FIT 2016 Tutorial materials
Slides: http://goo.gl/MeMZyO
Part 2 materials: http://goo.gl/y6ZmKr
Part 3 materials: http://goo.gl/ATC9yv


Transcript

  1. Python and Data Analysis Competitions in Practice
    Kohei Ozaki (@smly)

    FIT 2016 tutorial slides

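  2. About this tutorial
    This tutorial introduces examples of Python package usage, with data analysis competitions as the subject matter.
    ▶ Part 1: The predictive modeling process and related packages
    –  An overview of the process of building a predictive model and the packages involved
    –  Pandas, Scikit-Learn
    ▶ Part 2: Practice in a data analysis competition
    –  Concrete usage examples taken from a data analysis competition
    –  Pandas, seaborn, XGBoost, PyTables
    ▶ Part 3: Advanced topics
    –  Usage examples specialized by purpose (large-scale data, image data)
    –  Pandas, Redshift, BigQuery, OpenCV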

  3. Supplement: handouts and reproducibility with Docker
    The code used in these slides is published as Jupyter notebooks at the following locations.
    ▶ Slides: http://goo.gl/MeMZyO
    ▶ Part 2 materials: http://goo.gl/y6ZmKr
    ▶ Part 3 materials: http://goo.gl/ATC9yv

    The Jupyter notebooks in these materials run in an environment built from a published Docker image. The following command starts the same environment.
    $ docker run --rm -ti \
        -p 8888:8888 \
        -v /path/to/data_directory:/mnt \
        -v /path/to/working_directory:/home/kohei/work \
        smly/notebook:0.4

  4. PART 1
    The predictive modeling process and related packages

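  5. The general process of building a predictive model
    ▶ In competitions such as Kaggle and KDD Cup, entries are ranked by a score computed with a fixed evaluation metric on a given dataset and task.
    ▶ This part introduces modules that are useful at each stage of the process.
    Process stages: task design → data preparation → preprocessing / feature extraction → building the predictive model → evaluating the predictive model → integration into a system
    Preprocessing / feature extraction:
    ・ pandas
    ・ psycopg2
    ・ sklearn.feature_extraction
    ・ sklearn.preprocessing
    ・ pytables
    ・ seaborn
    Building the predictive model:
    ・ sklearn.linear_model
    ・ sklearn.svm
    ・ sklearn.ensemble
    ・ xgboost
    ・ statsmodels
    Evaluating the predictive model:
    ・ sklearn.metrics
    ・ sklearn.cross_validation
    ・ ml_metrics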

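  6. pandas: data manipulation and I/O with data frames
    (Stage: preprocessing / feature extraction)
    ▶ A data structure for manipulating data, similar to NumPy's ndarray
    ▶ Has columns and an index; data can be manipulated through named columns
    ▶ Each column can have a different type
    ▶ A mutable object: rows can be inserted, columns can be deleted, and so on
    ▶ Rich data manipulation features and I/O interfaces
    [Figure: "NDArray Internal" (data pointer, dim count, dimensions, dtype, strides) for a 3×3 float32 array holding the values 0 to 8, next to "DataFrame Internal" (an index plus columns of float, string, and integer type).]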
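The properties listed on this slide can be illustrated with a minimal sketch. The column names and values below are hypothetical toy data, not taken from the tutorial's notebooks.

```python
import pandas as pd

# Hypothetical toy data: each column carries its own dtype.
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],                  # float
        "weekday": ["Friday", "Monday", "Sunday"],  # string
        "scan_count": [1, 2, -1],                   # integer
    },
    index=[101, 102, 103],                          # row labels (the index)
)

# Named-column access; adding a derived column mutates the frame in place.
df["total"] = df["price"] * df["scan_count"]

# Dropping a column; drop() returns a new frame by default.
df_small = df.drop(columns=["weekday"])

dtype_kinds = {name: dtype.kind for name, dtype in df.dtypes.items()}
print(dtype_kinds)
print(df_small.columns.tolist())
```

The `dtype.kind` codes printed at the end show that the float and integer columns keep distinct types inside one frame, which a single homogeneous ndarray cannot do.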

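  7. Encoding features
    (Stage: preprocessing / feature extraction)
    Models that cannot handle categorical variables directly, such as linear models, need categorical variables to be represented as numerical variables.

    scikit-learn provides a variety of feature encoding methods and implementations.

    ▶ sklearn.preprocessing.OneHotEncoder
    ▶ sklearn.preprocessing.LabelEncoder
    ▶ sklearn.feature_extraction.DictVectorizer
    ▶ sklearn.feature_extraction.FeatureHasher
    ▶ sklearn.feature_extraction.text.TfidfVectorizer
    ▶ …

    The following slides introduce representative classes.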

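  8. OneHotEncoder
    (Stage: preprocessing / feature extraction)
    Encodes a categorical variable in 1-of-K notation.

    If the cardinality of a categorical variable is K, the feature is treated as K columns. Each column corresponds to one specific value of the categorical variable; the matching column is set to 1 and all other columns to 0.
    (The pd.get_dummies() function performs the equivalent operation in pandas.)

    Example (K=5; the output is an ndarray or a csr_matrix):
       WeekDay     Friday  Monday  Sunday  Tuesday  Saturday
    0  Friday        1       0       0       0        0
    1  Monday        0       1       0       0        0
    2  Monday        0       1       0       0        0
    3  Sunday        0       0       1       0        0
    4  Tuesday       0       0       0       1        0
    5  Saturday      0       0       0       0        1
    6  Monday        0       1       0       0        0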
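A minimal sketch of the two routes mentioned on this slide, assuming a recent pandas and scikit-learn. The `WeekDay` values are copied from the slide's example. Both calls order the output columns by sorted label, so the column order differs from the order of first appearance used in the slide's figure.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

weekday = pd.Series(
    ["Friday", "Monday", "Monday", "Sunday", "Tuesday", "Saturday", "Monday"],
    name="WeekDay",
)

# pandas: one indicator column per distinct value.
dummies = pd.get_dummies(weekday, dtype=int)

# scikit-learn: expects 2-D input and returns a SciPy sparse matrix by default,
# which is converted to a dense ndarray here for inspection.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(weekday.to_frame()).toarray()

print(dummies.columns.tolist())
print(onehot.shape)
```

For high-cardinality variables the sparse output is usually the better choice, since most entries of a 1-of-K matrix are zero.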

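  9. LabelEncoder
    (Stage: preprocessing / feature extraction)
    Encodes a categorical variable as integer values.

    If the cardinality of the categorical variable is K, each value of the variable is replaced with an integer in [0, K-1].
    (The pd.factorize() function performs the equivalent operation in pandas.)

    Example:
       WeekDay     Encoded
    0  Friday      0
    1  Monday      1
    2  Monday      1
    3  Sunday      2
    4  Tuesday     3
    5  Saturday    4
    6  Monday      1
    The output is the ndarray [0 1 1 2 3 4 1].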
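The sketch below runs both encoders on the slide's `WeekDay` values. One detail worth knowing: `LabelEncoder` assigns codes by sorted label order, while `pd.factorize` assigns them by order of first appearance, so the two produce different integers for the same input. The codes shown in the slide's example match the `pd.factorize` ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

weekday = pd.Series(
    ["Friday", "Monday", "Monday", "Sunday", "Tuesday", "Saturday", "Monday"]
)

# LabelEncoder: codes follow the sorted order of the labels.
le = LabelEncoder()
codes_sklearn = le.fit_transform(weekday)

# pd.factorize: codes follow the order of first appearance; the second
# return value holds the unique labels, so codes can be mapped back.
codes_pandas, uniques = pd.factorize(weekday)

print(codes_sklearn.tolist())
print(codes_pandas.tolist())
print(list(uniques))
```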

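  10. DictVectorizer
    (Stage: preprocessing / feature extraction)
    Converts a list of dictionary objects into a SciPy sparse matrix object or an ndarray object.

    Each key of the dictionaries corresponds one-to-one to a specific column of the matrix, and each value becomes the feature value. (The equivalent operation is possible in pandas by passing the list to the DataFrame constructor and then zero-filling and converting with fillna(0).as_matrix().)

    Example:
    list of dict
    [
    {'like': 1, 'rt': 9},
    {'like': 2, 'rt': 2},
    {'like': 4},
    ]
    is converted to the ndarray
    1  9
    2  2
    4  0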
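A minimal sketch of the conversion using the slide's example records. It assumes a current pandas, where `to_numpy()` replaces the `as_matrix()` method named on the slide (`as_matrix()` was removed in pandas 1.0).

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

records = [{"like": 1, "rt": 9}, {"like": 2, "rt": 2}, {"like": 4}]

# sparse=False returns a dense ndarray instead of a SciPy sparse matrix.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# pandas route: missing keys become NaN, which is then zero-filled.
X_pandas = pd.DataFrame(records).fillna(0).to_numpy()

print(vec.feature_names_)
print(X)
```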

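  11. Building predictive models with scikit-learn and XGBoost
    (Stage: building the predictive model)
    scikit-learn estimators share a common interface: fit() fits the model to data, predict() makes predictions, and predict_proba() returns predicted probabilities (※ for classification).
    >>> from sklearn.linear_model import LogisticRegression
    >>> clf = LogisticRegression()
    >>> clf.fit(X_train, y_train)
    >>> y_pred = clf.predict_proba(X_test)
    XGBoost has a different interface from scikit-learn, but it offers many features such as per-iteration evaluation, early stopping, and customization of the objective function.
    XGBoost handles data as DMatrix, an internal data structure optimized for memory efficiency and training speed. A DMatrix object can be created from an ndarray.
    >>> import xgboost as xgb
    >>> # Wrap the ndarrays in DMatrix objects (labels are needed for training only)
    >>> dtrain = xgb.DMatrix(X_train, label=y_train)
    >>> dtest = xgb.DMatrix(X_test)
    >>> watchlist = [(dtrain, 'train')]
    >>> gbtree = xgb.train(params, dtrain, evals=watchlist)
    >>> y_pred = gbtree.predict(dtest)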

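  12. Evaluating model performance
    (Stage: evaluating the predictive model)
    The sklearn.cross_validation module provides helper functions and iterators for evaluating the performance of a predictive model.
    If the predictive model is a class that inherits from sklearn.base.ClassifierMixin, cross-validation can be written concisely with the helper function. The scores are the accuracy of each fold.
    >>> from sklearn import cross_validation
    >>> clf = LogisticRegression()
    >>> scores = cross_validation.cross_val_score(clf, X, y, cv=5)
    >>> scores
    array([ 0.92..., 1. ..., 0.92..., 1. ])
    The iterator returns index lists for the training data and the test data of each validation split.
    >>> from sklearn.cross_validation import KFold
    >>> kf = KFold(n_samples, n_folds=5)
    >>> for idx_train, idx_test in kf:
    ...     y_train, y_test = y[idx_train], y[idx_test]
    ...     X_train, X_test = X[idx_train], X[idx_test]
    ...     clf.fit(X_train, y_train)
    ...     y_pred = clf.predict(X_test)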
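In current scikit-learn releases these helpers live in `sklearn.model_selection`; the `sklearn.cross_validation` module shown on the slide was removed in version 0.20. The sketch below shows the same two patterns with the newer API. The iris dataset, the `max_iter` setting, and the shuffling seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: iris stands in for the competition data.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# KFold now takes n_splits and yields indices through split().
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Helper function: one accuracy value per fold.
scores = cross_val_score(clf, X, y, cv=kf)

# Iterator: the same folds written out by hand.
manual_scores = []
for idx_train, idx_test in kf.split(X):
    clf.fit(X[idx_train], y[idx_train])
    y_pred = clf.predict(X[idx_test])
    manual_scores.append(np.mean(y_pred == y[idx_test]))

print(scores)
print(manual_scores)
```

The explicit loop is more verbose, but it is the form to use when each fold needs custom handling such as a custom metric or per-fold feature construction.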

  13. PART 2
    Practice in a data analysis competition

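  14. Inspecting the data
    ▶ The "Walmart Recruiting: Trip Type Classification" competition is used as the subject matter
    ▶ The task is multi-class classification of shoppers based on the history of products scanned during their shopping trips
    ▶ Three files are provided: train.csv, test.csv, and sampleSubmission.csv

    The training data train.csv is the history of products scanned during shopping trips.
    Field name              Description                                                     Train  Test
    TripType                Target variable; categorical ID of the shopper                  ✔      ✗
    VisitNumber             Instance ID; ID corresponding to one customer's shopping trip   ✔      ✔
    Weekday                 Day of the week of the shopping trip                            ✔      ✔
    Upc                     UPC number of the purchased product                             ✔      ✔
    ScanCount               Number of items purchased (negative values are returned items)  ✔      ✔
    DepartmentDescription   Category of the purchased product                               ✔      ✔
    Fineline                ID of a finer-grained category of the purchased product         ✔      ✔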

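  15. Inspecting the data
    ▶ The training data gives the shopping basket history of each shopper (VisitNumber) and the category (TripType) assigned to that visit.
    ▶ The test data does not include the category. The task is to predict the category of the test data based on the training data.
    [Figure: the data frame rendered as a readable table in Jupyter notebook.]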

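  16. Inspecting the data
    ▶ The submission format for predictions is a table with the visit number and each category as columns. For each visit number, the answer is the probability of belonging to each category.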

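  17. Visualizing data with seaborn
    seaborn is a visualization tool with a high-level interface.
    It can display statistical plots directly from a data frame.
    The plot confirms that the target variable TripType is unevenly distributed across its values.
    [Figure: count of records per TripType value, after dropping records so that VisitNumber is unique.]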

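  18. Visualizing data with seaborn
    ▶ The training examples and the test examples appear to have been sampled so that their distribution over time (day of the week) is the same
    ▶ Given these distributions, using the day of the week as a feature in the model is unlikely to cause overfitting to the training examples
    [Figure: count of records per Weekday value; the upper plot is the training data and the lower plot is the test data.]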

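  19. Preparation: defining the evaluation metric
    Implementations of evaluation metrics exist in the sklearn.metrics module, the ml_metrics package, and elsewhere. Normally these packages are sufficient.
    ▶ The sklearn.metrics.log_loss() function raises an error when the number of classes in the training examples and the number of classes in the test examples do not match. In multi-class classification, a rare class may not appear in the test examples at all.
    ▶ That case applies here, so the metric is defined without using a package.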
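A hand-written multi-class log loss along the lines described on this slide might look like the following. This is a minimal sketch, not the definition from the tutorial's notebook: the function name `multiclass_log_loss`, the `eps` clipping, and the toy inputs are assumptions. It indexes the probability matrix by true class, so classes that never occur in `y_true` cause no error.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (n,)
    y_prob: predicted probabilities, shape (n, K); column k is class k
    """
    y_true = np.asarray(y_true)
    # Clip to avoid log(0), then renormalize each row to sum to 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    rows = np.arange(y_prob.shape[0])
    return -np.mean(np.log(y_prob[rows, y_true]))

# Toy example with K=4 classes where class 3 never occurs in y_true.
y_true = [0, 2]
y_prob = [[0.7, 0.1, 0.1, 0.1],
          [0.2, 0.2, 0.5, 0.1]]
loss = multiclass_log_loss(y_true, y_prob)
print(loss)
```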

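  20. Preparation: setting up the model evaluation method
    ▶ K-fold cross-validation (sklearn.cross_validation.KFold) is used here
    ▶ A baseline that answers an equal probability for every category is scored with the multilogloss defined earlier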
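For reference, the multi-class log loss of an equal-probability baseline over K classes is ln K regardless of the labels. The sketch below checks this numerically; the sample count, the class count, and the random labels are hypothetical and do not reflect the competition data.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_samples, n_classes = 1000, 5
rng = np.random.default_rng(0)
y_true = rng.integers(0, n_classes, size=n_samples)

# Baseline: every class gets probability 1/K for every sample.
uniform = np.full((n_samples, n_classes), 1.0 / n_classes)

# Log loss = mean negative log-probability of the true class.
baseline = -np.mean(np.log(uniform[np.arange(n_samples), y_true]))

print(baseline, np.log(n_classes))
```

Any model worth keeping should score below this value under cross-validation.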

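  21. Preparation: compressing and saving features to intermediate files with PyTables
    ▶ PyTables is a package that provides an interface to the HDF5 format
    ▶ blosc is recommended for data compression. blosc supports SIMD instructions and offers fast serialization and deserialization.
    ※ For serializing and deserializing data frames, the pandas HDFStore class (which uses PyTables) is useful.

    [Code annotations: blosc is specified as the compression method; pd.factorize() encodes the target variable as indexed integers.]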

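  22. Creating features
    Example: "For a given shopping trip, how many items were scanned in each product category?"

    The pd.pivot_table() function aggregates the transactions.
    >>> import numpy as np
    >>> import pandas as pd
    >>> df_long = pd.concat([
    ...     df_train, df_test,
    ... ]).fillna("_NA_")
    >>> df = pd.pivot_table(
    ...     df_long,
    ...     index="VisitNumber",
    ...     columns=["DepartmentDescription"],
    ...     values=["ScanCount"],
    ...     aggfunc=[np.sum],
    ... )['sum']['ScanCount']
    [Figure: pd.pivot_table(df_long, index="VisitNumber", columns=..., values=..., aggfunc=...) turns the long format, where Column and Value pairs are stacked vertically, into the wide format, where the Column and Value pairs are aggregated and laid out horizontally.]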
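A small runnable version of the long-to-wide aggregation, assuming a current pandas. The rows and department names below are hypothetical toy data. Passing scalar `values` and a string `aggfunc` gives flat column labels, so the `['sum']['ScanCount']` indexing used on the slide is not needed here.

```python
import pandas as pd

# Hypothetical long-format transactions: one row per scanned product line.
df_long = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 7, 7, 8],
    "DepartmentDescription": ["DAIRY", "PRODUCE", "DAIRY", "DAIRY", "SHOES", "PRODUCE"],
    "ScanCount": [1, 2, 1, 1, -1, 3],
})

# Wide format: one row per visit, one column per department,
# cells hold the summed ScanCount (NaN where a visit has no such items).
df_wide = pd.pivot_table(
    df_long,
    index="VisitNumber",
    columns="DepartmentDescription",
    values="ScanCount",
    aggfunc="sum",
)

print(df_wide)
```

The NaN cells are what the later `fillna(0)` step turns into zeros before the matrix is handed to a model.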

  23. Deciding the storage layout of the features
    ▶ The feature matrix is saved as an ndarray in which the first rows are the training examples and the following rows are the test examples
    ▶ The index of df is VisitNumber. The rows are reordered with the loc method so that both the training matrix and the test matrix are in VisitNumber order (order by "VisitNumber")
    >>> visit_number_order = (
    ...     df_train.VisitNumber.drop_duplicates().append(
    ...         df_test.VisitNumber.drop_duplicates()
    ...     )
    ... )
    >>> df = df.loc[visit_number_order]
    >>> X = df.fillna(0).as_matrix()
    >>> X.shape
    (191348, 69)
    [Figure: the data frame indexed by VisitNumber is split into a training-example matrix (ndarray object) followed by a test-example matrix (ndarray object).]
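The reordering step can be sketched on toy data as follows. The frames below are hypothetical. `pd.concat` and `to_numpy()` are used because `Series.append` and `as_matrix()` are no longer available in current pandas releases.

```python
import pandas as pd

# Hypothetical transaction tables, already sorted by VisitNumber.
df_train = pd.DataFrame({"VisitNumber": [5, 5, 7]})
df_test = pd.DataFrame({"VisitNumber": [6, 8, 8]})

# Hypothetical wide feature frame indexed by VisitNumber (train and test mixed).
df = pd.DataFrame(
    {"f0": [1.0, 2.0, 3.0, 4.0]},
    index=pd.Index([5, 6, 7, 8], name="VisitNumber"),
)

# Training visits first, then test visits, each without duplicates.
visit_number_order = pd.concat([
    df_train["VisitNumber"].drop_duplicates(),
    df_test["VisitNumber"].drop_duplicates(),
])

# loc reorders the rows by label; the result is stored as one ndarray.
X = df.loc[visit_number_order].fillna(0).to_numpy()

# The first n_train rows are training examples, the rest are test examples.
n_train = df_train["VisitNumber"].nunique()
X_train, X_test = X[:n_train], X[n_train:]

print(X_train.ravel(), X_test.ravel())
```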

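  24. Building a predictive model and evaluating it with cross-validation (scikit-learn)
    A scikit-learn logistic regression model is built from the features created so far.
    The model is then evaluated with cross-validation to obtain a score.
    [Code annotation: cross-validation runs over the n_samples training examples only; the test data is not used.]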

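  25. Limitations of the cross_val_score() function and a workaround
    [Issue] The current cross_val_score() helper function cannot call the predict_proba() method; it calls the predict() method.
    In scikit-learn dev (master HEAD), the sklearn.model_selection.cross_val_score() helper function has been reworked so that the predict_proba() method can be selected through an option.
    → [Workaround] Define a subclass whose predict() method calls predict_proba(); that class can then be used with the helper function.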
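A minimal sketch of the workaround, written against a current scikit-learn rather than the version used in the tutorial. The class name `ProbaLogisticRegression`, the metric function, the iris dataset, and the `cv=3` setting are all assumptions made for illustration. `make_scorer` calls `predict()` by default, which the subclass redirects to `predict_proba()`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

class ProbaLogisticRegression(LogisticRegression):
    """Hypothetical helper class: predict() returns class probabilities."""

    def predict(self, X):
        return self.predict_proba(X)

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    # Mean negative log-probability of the true class (labels are 0..K-1).
    y_prob = np.clip(y_prob, eps, 1 - eps)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(y_prob[rows, y_true]))

# Assumption: iris stands in for the competition data.
X, y = load_iris(return_X_y=True)

# greater_is_better=False makes the scorer return the negated loss.
scorer = make_scorer(multiclass_log_loss, greater_is_better=False)
scores = cross_val_score(
    ProbaLogisticRegression(max_iter=1000), X, y, cv=3, scoring=scorer
)

print(scores)
```

The trade-off of this design is that the subclass no longer returns class labels from `predict()`, so it should be kept separate from any code that expects the standard behavior.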

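  26. Building a predictive model and evaluating it with cross-validation (XGBoost)
    The model is replaced with XGBoost's boosting implementation and cross-validated again to obtain a score.
    [Code annotations: datasets registered in the watchlist are evaluated every time a regression tree is added, and the result is printed to the terminal. A DMatrix object can be defined together with its corresponding target variable.]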

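  27. Submitting predictions to Kaggle
    A data frame is created from the predictions and submitted to Kaggle.
    [Code annotations: the second return value of pd.factorize() gives the encoding order of the target variable; to_csv() writes the CSV file.]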

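  28. Submitting predictions to Kaggle
    A data frame is created from the predictions and submitted to Kaggle.
    [Code annotations: the second return value of pd.factorize() gives the encoding order of the target variable; to_csv() writes the CSV file.]
    Evaluation on the holdout set starts immediately, and the rank within the competition is shown.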

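  29. Adding more features
    ▶ Day of the week of the shopping trip → pd.factorize()
    ▶ Number of transaction records, number of transaction records for returned items, and number of unique DepartmentDescription values tied to the trip → groupby()
    The aggregation function passed to groupby() can be customized.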
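The three per-visit aggregates named on this slide can be sketched with `groupby()` as follows. The rows below are hypothetical toy data, and the output column names are assumptions. The `n_returns` column uses a custom aggregation function, which is the customization the slide refers to.

```python
import pandas as pd

# Hypothetical transactions: one row per scanned product line.
df = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 7, 7],
    "ScanCount": [1, -1, 2, 1, 1],
    "DepartmentDescription": ["DAIRY", "DAIRY", "SHOES", "DAIRY", "PRODUCE"],
})

grouped = df.groupby("VisitNumber")

features = pd.DataFrame({
    # Number of transaction records per visit.
    "n_records": grouped.size(),
    # Custom aggregation: records with a negative ScanCount are returns.
    "n_returns": grouped["ScanCount"].agg(lambda s: int((s < 0).sum())),
    # Number of distinct departments per visit.
    "n_departments": grouped["DepartmentDescription"].nunique(),
})

print(features)
```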

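  30. Adding features and cross-validating the model
    The feature matrices saved in the intermediate files are combined with the np.hstack() function into a single ndarray.
    The predictive model reuses the previously defined parameters as they are (→ CV score).
    [Figure: np.hstack() joins a 4×5 one-hot block with four single columns (2 3 1 1), (1 1 2 1), (0 0 0 1), and (0 0 0 1) into one 4×9 matrix:]
    1 0 0 0 0 2 1 0 0
    0 1 0 0 0 3 1 0 0
    0 1 0 0 0 1 2 0 0
    0 0 1 0 0 1 1 1 1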

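  31. Adding features and cross-validating the model
    The feature matrices saved in the intermediate files are joined column-wise with the np.hstack() function into a single ndarray.
    The predictive model reuses the previously defined parameters as they are (→ CV score).
    [Figure: np.hstack() joins a 4×5 one-hot block with four single columns (2 3 1 1), (1 1 2 1), (0 0 0 1), and (0 0 0 1) into one 4×9 matrix:]
    1 0 0 0 0 2 1 0 0
    0 1 0 0 0 3 1 0 0
    0 1 0 0 0 1 2 0 0
    0 0 1 0 0 1 1 1 1

    → The result improved on the previous predictive model, and the rank went up.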
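The column-wise join in the figure can be reproduced directly. The values below are the ones shown in the slide's figure; the variable names are assumptions.

```python
import numpy as np

# One-hot block from the figure (4 rows, 5 columns).
onehot = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
])

# Four additional single-column feature matrices from the figure.
extra = [
    np.array([[2], [3], [1], [1]]),
    np.array([[1], [1], [2], [1]]),
    np.array([[0], [0], [0], [1]]),
    np.array([[0], [0], [0], [1]]),
]

# hstack concatenates along the column axis; all inputs must share
# the same number of rows, in the same row order.
X = np.hstack([onehot] + extra)

print(X.shape)
print(X)
```

Because the blocks are joined purely by position, every intermediate file has to use the same row order, which is why the earlier step fixes the VisitNumber ordering.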

  32. PART 3
    Advanced topics

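  33. Advanced topics: topics specific to the kind of data
    This part introduces packages that are useful for particular kinds of data.

    ▶ Large-scale data (※ here this means an RDB whose total size is measured in GB)
    –  Redshift
    –  BigQuery
    –  Benchmark: Redshift vs BigQuery

    ▶ Image data
    –  Sampling feature points
    –  Projective transformation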

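  34. Creating a pandas data frame from an SQL query
    pandas supports SQL as one of its I/O interfaces.
    With the pandas.read_sql() function, the result of a query is received as a data frame.
    ▶ When working with a complex RDB, aggregation and similar processing can be written concisely in SQL
    ▶ When working with large-scale data, the processing can run on the SQL server's computing resources without loading all the data into the local environment
    ※ Image source: https://www.kaggle.com/c/cervical-cancer-screening/data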
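A self-contained sketch of `pd.read_sql()`. An in-memory SQLite database stands in for a real SQL server here, and the table contents are hypothetical. The aggregation runs inside the database, so only the summarized rows reach pandas.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the SQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train (VisitNumber INTEGER, TripType INTEGER, ScanCount INTEGER)"
)
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [(5, 999, -1), (7, 30, 1), (7, 30, 2), (8, 26, 1)],
)
conn.commit()

# The database does the grouping; pandas receives the result as a data frame.
query = """
    SELECT TripType, COUNT(DISTINCT VisitNumber) AS n_visits
    FROM train
    GROUP BY TripType
    ORDER BY TripType
"""
df = pd.read_sql(query, conn)
conn.close()

print(df)
```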

  35. Processing large-scale data with cloud resources (Redshift)
    Redshift runs on a cluster of dense storage nodes or dense compute nodes launched on AWS.
    ### 1. Upload the data to Amazon S3
    $ aws s3 sync data/input s3://kaggle-kohei/walmart_triptype/input
    ### 2. Create the tables and load the data from S3 into Redshift
    $ psql < schema.sql
    ### 3. Issue an SQL query and create a Pandas data frame
    >>> import os
    >>> import psycopg2 as pg
    >>> import pandas as pd
    >>> # Connection settings for the launched Redshift cluster
    >>> conn_string = ' '.join([
    ...     "dbname='dwh'",
    ...     "port='5439'",
    ...     "user='kohei_ozaki'",
    ...     "password='{}'".format(os.environ['REDSHIFT_PWD']),
    ...     "host='{}'".format(os.environ['REDSHIFT_HOST']), ])
    >>> conn = pg.connect(conn_string)
    >>> # Create a data frame from SQL
    >>> pd.read_sql("SELECT * FROM train WHERE TripType = 999", conn)

  36. Processing large-scale data with cloud resources (BigQuery)
    pandas can issue queries to BigQuery through the pd.read_gbq() function.
    Because BigQuery is a fully managed service, it has the advantage that there is no cluster to think about.
    ### 1. Upload the data to Cloud Storage
    $ gsutil rsync -r data/input gs://kaggle-kohei.appspot.com/walmart_triptype
    ### 2. Import the data as a BigQuery table (train.json defines the table schema in JSON format)
    $ bq load --skip_leading_rows 1 \
        kaggle-kohei:walmart_triptype.train \
        gs://kaggle-kohei.appspot.com/walmart_triptype/train.csv \
        train.json
    ### 3. Issue a query from Pandas and receive the result as a data frame
    >>> df = pd.read_gbq("""
    ... SELECT t1.trip_type, COUNT(1) AS n_visitors FROM (
    ...   SELECT FIRST(trip_type) AS trip_type FROM walmart_triptype.train
    ...   GROUP BY visit_number
    ... ) t1 GROUP BY t1.trip_type""", "kaggle-kohei")

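  37. Benchmark: Redshift and BigQuery (off topic)
    ▶ Table data is loaded from a CSV file that is GB-scale both raw and after gzip compression, with on the order of 10^8 rows, and the execution time of a query (aggregate function) is compared
    ▶ The systems measured are BigQuery and three Redshift configurations of different sizes

    With the settings used here, Redshift performance could be raised by adding more nodes.
    ※ Cost: Redshift (on-demand, dc1.large instances) is priced in USD per hour; the BigQuery query cost is given as an upper bound in yen.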

  38. Advice on using cloud resources (off topic)
    ▶ Loading the data takes time, but the approach is cost-effective when there is a strong need to run ad hoc queries many times and to keep the turnaround time of each ad hoc query short
    ▶ BigQuery cannot load a huge compressed file in a single batch. In that case the file can be split on line boundaries, recompressed, and uploaded to Cloud Storage
    $ zcat prescription_head.csv.gz | \
        split -d -C 1G --filter='gzip > $FILE.gz' - prescription_head.csv.part
    [Figure: prescription_head.csv.gz → zcat & split → prescription_head.csv.part01.gz, prescription_head.csv.part02.gz, prescription_head.csv.part03.gz → gsutil rsync]

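  39. Processing image data with Python
    Packages for image processing in Python include scikit-image and the Python bindings of OpenCV.
    Each of these packages handles an image as a matrix stored in an ndarray object.

    ▶ Transformations (rotation, scaling, blurring)
    ▶ Drawing
    ▶ Histogram equalization
    ▶ Image segmentation
    ▶ Edge extraction, computing image feature points
    ▶ Sampling and matching image feature points
    ▶ etc.
    [Figure: example of an image represented as BGR channels: im[:, :, 0], im[:, :, 1], im[:, :, 2].]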
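The channel indexing in the figure can be tried on a synthetic array with NumPy alone. The tiny image below is hypothetical. OpenCV stores color images in BGR order, so index 2 on the last axis is the red channel.

```python
import numpy as np

# Hypothetical 2x3 color image in OpenCV's layout: (height, width, channel),
# channel order B, G, R, dtype uint8.
im = np.zeros((2, 3, 3), dtype=np.uint8)
im[:, :, 2] = 255  # fill the red channel

# Each channel is a 2-D view into the same array.
blue, green, red = im[:, :, 0], im[:, :, 1], im[:, :, 2]

# Reversing the last axis converts BGR to RGB, the order that
# plotting libraries such as matplotlib expect.
im_rgb = im[:, :, ::-1]

print(im.shape, red.shape)
print(im_rgb[0, 0])
```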

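  40. Computing image feature points with OpenCV
    The cv2.BRISK_create() function is the OpenCV implementation of the BRISK algorithm. It creates a detector object for computing BRISK features.
    Data from the Draper Satellite Image Chronology competition is used as the example.
    https://www.kaggle.com/c/draper-satellite-image-chronology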

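  41. Matching feature points and using the result for a projective transformation
    The cv2.DescriptorMatcher_create() function creates a matcher for the feature points.
    The matching result is used to correct differences in shooting angle and scale with a projective mapping.
    [Code annotations: cv2.findHomography() computes the homography matrix from the matching result; cv2.perspectiveTransform() applies the projective transformation based on the homography matrix.]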
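The arithmetic behind applying a homography to points can be written out in plain NumPy. This is an illustrative sketch of the math, not OpenCV's implementation: the function name `apply_homography` and the matrix `H` are hypothetical. Each point is lifted to homogeneous coordinates, multiplied by the 3×3 matrix, and divided by its third component.

```python
import numpy as np

def apply_homography(H, points):
    """Map 2-D points through a 3x3 homography matrix H."""
    pts = np.asarray(points, dtype=float)
    # Lift (x, y) to homogeneous coordinates (x, y, 1).
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ H.T
    # Divide by the third component to return to 2-D coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical homography: scale by 2, then shift by (10, -5).
H = np.array([
    [2.0, 0.0, 10.0],
    [0.0, 2.0, -5.0],
    [0.0, 0.0, 1.0],
])

corners = [[0, 0], [1, 0], [1, 1]]
warped = apply_homography(H, corners)
print(warped)
```

In practice the matrix would come from matched feature points rather than being written by hand, and a robust estimator is typically used because some matches are wrong.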

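  42. Visualizing image differences
    The projective transformation aligns images taken at different shooting angles and scales.
    Comparing the binarized images pixel by pixel makes the differences visible.
    Ref: "Tutorial: visualizing difference between three pictures"
    https://www.kaggle.com/c/draper-satellite-image-chronology/forums/t/…/tutorial-visualizing-difference-between-three-pictures
    [Figure: difference image highlighting pixels that appear only in image A and pixels that appear only in image B.]
    ※ See the Jupyter notebook in the materials for the code and details.
    ※ Computing feature points and similar operations requires installing the OpenCV extension modules.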
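The pixel-wise comparison of binarized images can be sketched with NumPy alone. The two tiny grayscale arrays and the threshold below are hypothetical; in the tutorial's setting the inputs would be the aligned images.

```python
import numpy as np

# Hypothetical aligned grayscale images (uint8 intensities).
a = np.array([[10, 200], [220, 30]], dtype=np.uint8)
b = np.array([[15, 20], [210, 250]], dtype=np.uint8)

# Binarize with a fixed threshold (an assumption; any thresholding rule works).
threshold = 128
bin_a = a > threshold
bin_b = b > threshold

# Pixels set in exactly one of the two binarized images.
only_a = bin_a & ~bin_b
only_b = bin_b & ~bin_a

print(only_a)
print(only_b)
```

Coloring `only_a` and `only_b` differently and overlaying them on one of the images gives a difference view like the one in the figure.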

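  43. About this tutorial (reprise)
    This tutorial introduces examples of Python package usage, with data analysis competitions as the subject matter.
    ▶ Part 1: The predictive modeling process and related packages
    –  An overview of the process of building a predictive model and the packages involved
    –  Pandas, Scikit-Learn
    ▶ Part 2: Practice in a data analysis competition
    –  Concrete usage examples taken from a data analysis competition
    –  Pandas, seaborn, XGBoost, PyTables
    ▶ Part 3: Advanced topics
    –  Usage examples specialized by purpose (large-scale data, image data)
    –  Pandas, Redshift, BigQuery, OpenCV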

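  44. Supplement: handouts and reproducibility with Docker (reprise)
    The code used in these slides is published as Jupyter notebooks at the following locations.
    ▶ Slides: http://goo.gl/MeMZyO
    ▶ Part 2 materials: http://goo.gl/y6ZmKr
    ▶ Part 3 materials: http://goo.gl/ATC9yv

    The Jupyter notebooks in these materials run in an environment built from a published Docker image. The following command starts the same environment.
    $ docker run --rm -ti \
        -p 8888:8888 \
        -v /path/to/data_directory:/mnt \
        -v /path/to/working_directory:/home/kohei/work \
        smly/notebook:0.4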