Feature Engineering and Baseball Player Performance Prediction - Getting Started with Machine Learning through Baseball / Baseball Player Performance Prediction Using Feature Engineering with Machine Learning and Python

Slides for the PyCon JP 2020 talk on 8/28, "Feature engineering with sports data and baseball player performance prediction - going back and forth between Python and R"

https://pycon.jp/2020/timetable/?id=203110

#Baseball #SABRmetrics #Python #MachineLearning #Datascience #PyConJP

Shinichi Nakagawa

August 28, 2020

Transcript

  1. Baseball Player Performance Prediction Using Feature Engineering with Python & R
    Shinichi Nakagawa (@shinyorke), PyCon JP 2020 Online, 8/28
    Feature engineering with sports data and baseball player performance prediction - going back and forth between Python and R
  2. Who am I?
    • Shinichi Nakagawa (中川 伸一)
    • I go by "shinyorke" on most social networks.
    • Senior Engineer, JX Press Corporation
    • Baseball Engineer, Data Scientist (a baseball engineer and data scientist "in the wild")
    • #Python #DataScience #Baseball⚾ #SABRmetrics #データ基盤 (data platform) #技術顧問 (technical advisor)
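  3. JX Press Corporation and Python
    • We have long used Python for server-side development, machine learning, SRE, and more.
    • We have sponsored PyCon JP several times: 2016, 2017, 2019, 2020 (New!)
    • This year we also have two speakers! (@YAMITZKY, @shinyorke)
    • We work hard on our tech blog, so please read it: https://tech.jxpress.net/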
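  4. Python Mokumoku Self-Study Room (Python もくもく自習室) #jisyupy
    • A meetup for self-directed study while enjoying "technology and good food".
    • It started in 2017 as #rettypy, and I have been an organizer ever since.
    • The flavor changed slightly in 2020, partly because the organizing body changed.
    • Currently held online at irregular intervals; I would like to hold in-person sessions eventually.
    • Next session: 9/19 https://jisyupy.connpass.com/event/186611/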
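  5. Feature engineering and ⚾
    • Numeric -> numeric
      • Many values can be used as they are, for example hits, walks, and strikeouts.
      • Normalize or scale them: sabermetrics metrics such as RC, wRAA, and wOBA.
    • Non-numeric data -> numeric
      • Throwing hand, batting side (left or right), etc.
      • Turn useful data that characterizes a player into numbers in some form.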
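The two conversions on slide 5 (numeric to scaled numeric, and non-numeric to numeric) can be sketched in plain Python as follows. This is a minimal illustration, not the talk's code: the player rows are made-up numbers, and the `L -> 0`, `R -> 1` encoding and the helper names are assumptions chosen for the example.

```python
from statistics import mean, pstdev

# Toy rows with made-up numbers (not real player data).
players = [
    {"name": "A", "hits": 150, "walks": 60, "strikeouts": 90, "bats": "L", "throws": "R"},
    {"name": "B", "hits": 120, "walks": 30, "strikeouts": 140, "bats": "R", "throws": "R"},
    {"name": "C", "hits": 180, "walks": 45, "strikeouts": 70, "bats": "R", "throws": "L"},
]

NUMERIC = ["hits", "walks", "strikeouts"]


def zscore_columns(rows, columns):
    """Return copies of the rows with each numeric column standardised (z-score)."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        stats[col] = (mean(values), pstdev(values))
    scaled_rows = []
    for row in rows:
        scaled = dict(row)
        for col in columns:
            mu, sigma = stats[col]
            # A constant column carries no information, so map it to 0.0.
            scaled[col] = 0.0 if sigma == 0 else (row[col] - mu) / sigma
        scaled_rows.append(scaled)
    return scaled_rows


def encode_side(value):
    """Hypothetical encoding of a left/right label: L -> 0, R -> 1."""
    return {"L": 0, "R": 1}[value]


def to_feature_vector(row):
    """Scaled numeric stats followed by the encoded batting and throwing sides."""
    return [row[col] for col in NUMERIC] + [encode_side(row["bats"]), encode_side(row["throws"])]


features = [to_feature_vector(row) for row in zscore_columns(players, NUMERIC)]
```

Each player ends up as a five-element vector, which is the shape that distance-based methods and classifiers expect.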
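  6. How to think about features in ⚾ - DIPS, RC, LWTS
    • DIPS: classify pitching and batting plays into the player's "own responsibility" and "others' responsibility" (see the table below).
    • RC: explain run-scoring ability with the model "(on-base ability + advancing ability) / opportunities".
    • LWTS: convert each individual play into "runs" and evaluate pitching and batting plays on that basis.
    • These are statistical-model ways of thinking about baseball known as "sabermetrics". For details, see my blog: https://shinyorke.hatenablog.com/entry/sabr-metrics-batting-stats

    | Category | Characteristics | Main metrics |
    |---|---|---|
    | Own responsibility | Depends on individual ability: power, speed, plate discipline, etc. | Home runs, strikeouts, walks and hit-by-pitches |
    | Others' responsibility | Many explanatory variables: team, ballpark, umpire, etc. | Earned runs (ERA), errors |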
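Slide 6 treats home runs, strikeouts, and walks and hit-by-pitches as the pitcher's own responsibility. One widely used DIPS-style metric built from those events is FIP (Fielding Independent Pitching). The talk does not mention FIP, so the sketch below only illustrates the idea. The default `constant` of 3.10 is a placeholder assumption (the real constant is recomputed per league and season), and the sample line is made up.

```python
def fip(hr, bb, hbp, so, ip, constant=3.10):
    """Fielding Independent Pitching from events the pitcher controls.

    hr: home runs allowed, bb: walks, hbp: hit batters,
    so: strikeouts, ip: innings pitched.
    constant: league/season scaling constant (placeholder default, an assumption).
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + constant


# Made-up season line: 20 HR, 50 BB, 5 HBP, 200 SO over 200 innings.
sample_fip = fip(hr=20, bb=50, hbp=5, so=200, ip=200)
```

Errors and earned runs, which the slide puts on the "others' responsibility" side, do not appear in the formula at all.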
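  7. Feature cooking engineering cheat sheet
    • There is no big difference in what Python, R, and SQL can do (each has strengths and weaknesses).
    • Choose among them by amount of code, purpose of the functions, and processing speed (whether distributed processing is available).
    Reference: https://shinyorke.hatenablog.com/entry/r-to-python

    | Item | Python | R | SQL |
    |---|---|---|---|
    | Amount of code (for the same task) | Neither good nor bad | Relatively simple when it comes to handling data | Somewhat verbose compared with Python and R |
    | Functions (computation) | OOP-style and other approaches; programmable | Mathematical models can be written just as the formulas are | Functions chain together like a Rube Goldberg machine; misuse them and it is hell |
    | Processing speed, distributed processing | Strong at overall tuning through parallel and distributed processing | Ultimately a calculation and statistics tool; speed and distribution are moderate | The approach to speed and distribution differs by DB engine |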
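  8. Getting started with machine learning through baseball ⚾
    1. Planning - research and planning
    2. Data Engineering - data acquisition
    3. Feature Engineering - feature extraction
    4. Clustering - grouping players by clustering
    5. Predict - performance prediction
    A book with a similar title exists, you say? You must be imagining it (whispered).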
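  9. Analyzing Baseball Data with R
    • The standard book on baseball data analysis using sabermetrics.
    • Published in the United States; the Second Edition came out in 2018. In English, of course.
    • As the name says, it is a book on ⚾ data analysis, and all of the code is in R.
    To understand the contents, I read the R code and rewrote it in Python by hand.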
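  10. What I learned from porting R to Python
    • I came to understand that the results are the same whether you use R or Python.
    • Considering the libraries I will use from now on (and my own proficiency), Python is the choice.
    • Still, R helped me understand the formulas really well, so I am grateful to R.
    My work log from that time (done in 2019/11) is on my blog: https://shinyorke.hatenablog.com/entry/r-to-python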
  11. Where is the ⚾ data?
    • Lahman’s Baseball Database
      • Career statistics and biographical data for all MLB players, in CSV format.
      • http://www.seanlahman.com/baseball-archive/statistics/
      • https://github.com/chadwickbureau/baseballdatabank
    • Retrosheet
      • A dataset that records game information at the plate-appearance level.
      • https://www.retrosheet.org/
      • https://github.com/chadwickbureau/retrosheet
    I (shinyorke) have relied on this data since PyCon JP 2014, and it is now conveniently available on GitHub too!
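Because the Lahman data ships as CSV, career totals can be built with nothing more than the standard library. The sketch below assumes the batting file's usual column names (`playerID`, `yearID`, `AB`, `H`, `HR`, `BB`, `SO`). The rows are made-up toy data held in a string, and `career_totals` is a hypothetical helper, not part of any library.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the shape of a Lahman batting CSV (made-up players and numbers).
SAMPLE = """playerID,yearID,AB,H,HR,BB,SO
toyplay01,2018,500,150,20,60,90
toyplay01,2019,520,140,25,55,100
toyplay02,2019,300,70,5,20,80
"""


def career_totals(csv_file, columns=("AB", "H", "HR", "BB", "SO")):
    """Sum the given stat columns per playerID across all seasons."""
    totals = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in csv.DictReader(csv_file):
        for col in columns:
            # Treat empty cells as zero so older seasons with gaps still load.
            totals[row["playerID"]][col] += int(row[col] or 0)
    return dict(totals)


totals = career_totals(io.StringIO(SAMPLE))
```

With a real download, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file such as `open(path, newline="")`.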
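  12. [Reference] What about BigQuery costs?
    • Speaking only from my own experience, the data volume of a personal project fits within the free tier. (Note: usage of about tens of GB, and a few hundred MB per query.)
    • If you stick to the basics, it is efficient and very easy to use even at company scale. The official GCP documentation and many other people cover this.
    • Do not fetch unnecessary columns, and insert data in bulk.
    • Use a partition key for efficient access.
    • Monitor costs, and send an alert to Slack or similar if something happens.
    • JX Press Corporation also makes heavy use of BigQuery: https://tech.jxpress.net/entry/kowakunai-bigquery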
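  13. Working comfortably with BigQuery from JupyterLab
    • An environment based on JupyterLab
      • Pandas
      • scikit-learn
      • plotly
    • Results from the BigQuery client go straight into a DataFrame and are processed there.
    • The clustering and everything else that follows is all done in this environment.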
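  14. Finding distances with ANN (approximate nearest neighbor search)
    • Headline stats such as hits and home runs. The details are a secret.
    • Less glamorous stats such as games played. These are also a secret.
    • Using the above as features, I used ANN (approximate nearest neighbor search) to compute Euclidean distances and gather players who are close to each other.
    • I had tried this on a topic other than batter prediction, and the results were good, so I adopted it as is: https://shinyorke.hatenablog.com/entry/feature-faridyu-san
    • The implementation uses Annoy, an extremely handy library.
    • Breaking the experiment code while tinkering with it would be painful, so tests run automatically on GitHub Actions.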
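The "gather players that are close in Euclidean distance" step on slide 14 can be sketched as below. The talk used Annoy for approximate search. This example instead does an exact brute-force search in plain Python, which is the result an ANN index approximates and is enough for a small player pool. The feature vectors are made up (the talk keeps the real features secret), and `nearest_players` is a hypothetical helper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_players(target, candidates, n=2):
    """Return the n (name, distance) pairs closest to the target vector.

    candidates maps a player name to that player's feature vector.
    """
    ranked = sorted(
        ((name, euclidean(target, vector)) for name, vector in candidates.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:n]


# Made-up, already-scaled feature vectors.
player_x = [1.0, 0.5]
pool = {
    "A": [1.1, 0.4],
    "B": [-1.0, 0.0],
    "C": [0.9, 0.7],
}
similar = nearest_players(player_x, pool, n=2)
```

Brute force costs one distance computation per candidate for every query; an index such as Annoy trades a little accuracy for much faster lookups once the pool grows large.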
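  15. How did I actually do it?
    1. Collect the age-25 stats of the players Y (several of them) who are similar to player X.
    2. Based on the stats from step 1, attach labels such as "strong", "average", and "weak".
    3. Run a classification task with step 1 as the training data and step 2 as the labels.
    4. Predict using player X's age-25 stats.
    5. Build the predicted stats from the stats of the players who carry the same label as the one returned.
    Classification used naive Bayes, implemented quickly with scikit-learn (code omitted).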
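Since the talk omits its code, here is a minimal sketch of the five steps on slide 15 under stated assumptions. The talk used scikit-learn's naive Bayes; to stay dependency-free, this example hand-rolls a small Gaussian naive Bayes (`TinyGaussianNB`, a hypothetical class). The toy data, the labelling thresholds in `label_for`, and averaging next-season hits for step 5 are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance

# Step 1: age-25 features [hits, home runs] of players similar to X, paired with
# their next-season hits. All numbers are made up.
similar_players = [
    ([170, 30], 175), ([160, 28], 165), ([140, 15], 150),
    ([135, 12], 140), ([100, 5], 95), ([90, 4], 100),
]


def label_for(features):
    """Step 2: hypothetical labelling rule based on age-25 hits."""
    hits = features[0]
    if hits >= 150:
        return "strong"
    if hits >= 120:
        return "average"
    return "weak"


class TinyGaussianNB:
    """Minimal Gaussian naive Bayes: one independent normal per feature and class."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for x, label in zip(X, y):
            grouped[label].append(x)
        self.priors = {c: len(rows) / len(X) for c, rows in grouped.items()}
        # Per class, a (mean, variance) pair per feature; the tiny floor avoids
        # division by zero when a feature is constant within a class.
        self.params = {
            c: [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
            for c, rows in grouped.items()
        }
        return self

    def predict(self, x):
        def log_posterior(c):
            total = math.log(self.priors[c])
            for value, (mu, var) in zip(x, self.params[c]):
                total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
            return total
        return max(self.priors, key=log_posterior)


# Step 3: train the classifier on the similar players and their labels.
X = [features for features, _ in similar_players]
y = [label_for(features) for features in X]
model = TinyGaussianNB().fit(X, y)

# Step 4: classify player X from made-up age-25 stats.
player_x = [165, 27]
predicted_label = model.predict(player_x)

# Step 5: build the prediction from players who share the predicted label.
predicted_hits = mean(
    nxt for (_, nxt), label in zip(similar_players, y) if label == predicted_label
)
```

With scikit-learn available, `TinyGaussianNB` could be swapped for `sklearn.naive_bayes.GaussianNB`, whose `predict` expects a list of rows rather than a single row.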
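  16. Data scientists, of all people, should do personal projects
    • I strongly recommend "machine learning as a personal project" as a way to sharpen your statistics and machine learning skills and to grow your engineering skills!
    • Almost all of the work for this talk was done during the stay-at-home period under the state of emergency. Thanks to the Stay Home period, I was able to practice "Done is better than perfect".
    • By the way, I wrote a while ago about how to keep a personal project going without giving up: https://shinyorke.hatenablog.com/entry/botti-development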