Sports Data Analysis Basics for Python Users - with PySpark and Major League Baseball Data #PyConJP 2022

Shinichi Nakagawa

October 15, 2022

Transcript

  1. No Baseball, No Engineering!
    High Performance Data Platform
    Knowledge of PySpark, Cloud and ⚾
    Sports Data Analysis Basics for Python Users - with PySpark and Major League Baseball Data
    Shinichi Nakagawa@shinyorke 2022/10/15 PyConJP 2022 Talk Session

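  2. Onboarding (about this session)
    • This talk is about using Python, Spark (PySpark), and a public cloud (Google Cloud) to process and handle several GB or more of data nicely.
    • The content is aimed at intermediate to advanced users; I would be glad if it also serves as a guide for beginners.
      (≈ Think of anything you don't understand or know as your own room to grow.)
    • The data subject is Major League Baseball ⚾, plus a little about sports data in general.
    • I hope people with no interest in (or no fondness for) baseball can enjoy it too.
      I'll do my best to make this talk a reason to get interested in baseball.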

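  3. Prerequisite knowledge and motivation expected of the audience
    • [Must] You have done hands-on data processing and analysis with Pandas or SQL.
    • [Must] You have used Python on a public cloud such as Google Cloud (GCP), AWS, or Azure. Any service counts (EC2, App Engine, etc.).
    • Development experience in a fully managed serverless environment (having touched one is enough). AWS Lambda, AWS App Runner, App Engine, Cloud Run, and so on qualify.
    • (Like him or not) you know the rules of baseball and who Ohtani-san is.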

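  4. Who am I?
    (Who are you, anyway?)
    • Shinichi Nakagawa@shinyorke
    • Manager at a major foreign-affiliated IT consulting firm
      (formerly a full-cycle engineer at an operating company)
    • Manager of a team that handles cloud infrastructure
    • I do personal development as both a hobby and a practical benefit
      (mainly for baseball and physical care)
    • I love baseball, and programming over a drink.
    • Favorites: Tsuyoshi Shinjo, Chusei Mannami, Kenta Tanigawara (his strong arm)
    #Python #Serverless #GoogleCloud #Baseball
    #DataScience #SABRmetrics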

  5. Today's starting lineup
    • Let's play with Major League big data
    • A nice serverless data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "nasty ◯◯" that baseball big data recommends

  6. Let's play with Major League big data

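  7. Major League big data
    • Major League Baseball records all kinds of data with a system called "Statcast".
      (Recorded by cameras and radar; some values are statistical or entered by hand.)
    • For example, broadcast calls like the following all come from this "Statcast" big data.
      • Ohtani-san! Home run No. ○! Exit velocity 180 km/h, distance 130 m
      • Ohtani-san! Strikeout looking on a 162 km/h fastball!!!
    • Every move on the field, every pitch and every batted ball, is recorded.
    • A regular season (30 teams, 162 games each) comes to roughly 700,000 to 800,000 pitches. Postseason and spring training data exist too.
    • The data consists of 91 fields (!?); one regular season is roughly 400 MB to 600 MB.
    • Anyone can browse and download it (CSV format) at baseballsavant.mlb.com.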
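As a back-of-the-envelope check on the pitch count quoted above, the arithmetic can be sketched as follows. The figure of about 300 pitches per game (both teams combined) is an assumption made for illustration; it does not come from the slides.

```python
# Rough sanity check of the regular-season pitch count.
TEAMS = 30
GAMES_PER_TEAM = 162
PITCHES_PER_GAME = 300  # assumption: both teams combined, not from the slides

# Every game involves two teams, so halve the team-game total.
games = TEAMS * GAMES_PER_TEAM // 2
pitches = games * PITCHES_PER_GAME

print(games)    # 2430
print(pitches)  # 729000
```

The result falls inside the 700,000 to 800,000 range given on the slide.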

  8. The (official) data specification is here.
    https://baseballsavant.mlb.com/csv-docs
    My commentary and translation is here.
    https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
    Here is a quick look at each data field.

  9. (image only)

  10. (image only)

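  11. ???: "It hurts... because I can't make sense of the fields and what they mean."
    All 91 fields, along with their units and measurement criteria, are brutal for first-timers (& congrats to Arai-san on becoming Hiroshima's manager)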

  12. Looking back on "Ohtani-san's 2022" with Statcast data
    Let's look at Statcast data using this as the example.

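  13. https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
    The explanation is based on the Statcast sample above. The code is public, so please play with it.
    Note: the original data is in mile/h and feet, but it has been converted to km/h and m in advance (the converted values are not in the original data, so take care).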
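The unit conversion mentioned above can be sketched like this. It is a minimal illustration, not the code from the repository: `release_speed` and `hit_distance_sc` are Statcast CSV columns, while the output names `release_speed_kmh` and `hit_distance_m` are hypothetical.

```python
MPH_TO_KMH = 1.609344  # 1 mile = 1.609344 km (exact)
FEET_TO_M = 0.3048     # 1 foot = 0.3048 m (exact)


def mph_to_kmh(mph: float) -> float:
    """Convert miles per hour to kilometres per hour."""
    return mph * MPH_TO_KMH


def feet_to_m(feet: float) -> float:
    """Convert feet to metres."""
    return feet * FEET_TO_M


def add_metric_columns(row: dict) -> dict:
    """Return a copy of one Statcast-like row with metric columns added.

    The output column names are hypothetical, chosen for this sketch.
    """
    out = dict(row)
    out["release_speed_kmh"] = mph_to_kmh(row["release_speed"])
    out["hit_distance_m"] = feet_to_m(row["hit_distance_sc"])
    return out


converted = add_metric_columns({"release_speed": 100.0, "hit_distance_sc": 400.0})
```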

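  14. Ohtani-san in 2022 becomes a slider, two-seam, and cutter specialist
    • This year Ohtani-san is throwing a ton of sliders.
    • Did you notice? In the second half, his two-seamers (Sinker in the data) are increasing!?
    • With the slider, two-seamer, and cutter, he has changed character into a guy who throws quirky breaking pitches.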

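  15. One of Ohtani-san's starts (2022/9/29: 8 innings, 10 strikeouts, no runs allowed)
    The picture: nearly half sliders, pressing with the two-seamer and cutter.
    Pitch locations (catcher's view); release points (catcher's view)
    Pitch type ratio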
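A pitch-type ratio like the one charted on this slide can be computed with nothing but the standard library. This is a sketch with made-up sample data; the codes follow Statcast's `pitch_type` convention (`SL` slider, `SI` sinker, `FC` cutter, `FF` four-seam fastball).

```python
from collections import Counter


def pitch_mix(pitch_types: list[str]) -> dict[str, float]:
    """Return each pitch type's share of all pitches, most frequent first."""
    counts = Counter(pitch_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pitch: n / total for pitch, n in counts.most_common()}


# Made-up sample, not real game data.
sample = ["SL", "SL", "SI", "FC", "SL", "FF", "SI", "SL"]
mix = pitch_mix(sample)
```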

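  16. Looking at Statcast data with Jupyter Lab + Plotly
    • There is a wide variety of data, so you can learn quite a lot.
    • It is time-series data, so you can also catch changes in performance.
      • For example: the velocity differs from before, or two-seamers suddenly increased.
    • Ball velocity and coordinate data are all there.
      • If you work hard on the coordinate math, you can even do 3D rendering.
        (Translation: I had no time to work that hard this time, so I didn't.)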

  17. Me: "I want a setup for checking every game, every day"
    https://baseballsavant.mlb.com/ is also slightly awkward to use... lol
    One day the idea hit me: just turn it into an easy-to-use data platform!

  18. So I went and built a little something.

  19. A nice serverless data platform
    built with Python and Google Cloud
    (baseball edition)

  20. Architecture overview

  21. (image only)

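  22. Architecture notes (≈ the points I cared about)
    • To check and update the data every day without fuss, the system is built and operated entirely on "fully managed serverless cloud services".
    • What is a "fully managed serverless cloud service"?
      • It comes up with a few clicks in the CLI or console.
      • No infrastructure or server maintenance is needed (the cloud provider does it, not you).
      • More concretely, you don't have to stand up K8s clusters or VMs yourself (network settings and the like are still required).
      • It can be deployed and scaled from a CI/CD pipeline such as GitHub Actions.
    Billing is basically pay-as-you-go, so it is easy on the wallet too.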

  23. Use cases and the services used

  24. (image only)

  25. Dashboard app
    • The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway.
    • Firestore is the main DB, with MemoryStore (Redis) in front as a cache.
    • Spark (PySpark) does not appear here.

  26. (image only)

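  27. Data collection & saving to BigQuery
    • A crawler (Cloud Functions) periodically collects data from the source site (Baseball Savant).
    • The results are saved to Google Cloud Storage (GCS) as CSV. This is the source-of-truth data (data lake).
    • A PySpark script running on Dataproc Serverless summarizes the CSV files on GCS and saves them to BigQuery.

  28. (image only)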

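  29. Loading into Firestore (moving data to the database)
    • A PySpark script running on Dataproc Serverless converts the BigQuery data into the dashboard's data format (JSON).
    • A Python script loads the results (saved on GCS as JSON) into Firestore.
    • Both are run manually (the reason and the workaround come later).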

  30. Serverless data processing
    with PySpark + Dataproc
    Note: this is the main topic of the talk.

  31. Scope of this part

  32. (image only)

  33. Spark and PySpark, understood
    (or at least it feels that way) in 33.4 seconds

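  34. Spark and PySpark
    • A framework for "processing large data nicely in a distributed way".
    • Spark itself runs on the JVM (it is written mainly in Scala), but its Python interface, "PySpark", is what people use most often (other languages such as R also work).
    • Run it as a batch data-processing program, or execute it in a notebook with Jupyter Lab or Zeppelin.
    • It has a DataFrame interface that is very familiar to Python users.
      • Spark has its own DataFrame, which can be converted to a Pandas DataFrame.
      • Pandas API (Pandas functionality is available from Spark 3.2 onward, with some restrictions).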

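  35. Where to build and operate Spark
    | Environment / approach | Setup effort | Ease of operation | Notes |
    | --- | --- | --- | --- |
    | Build and operate everything yourself on-premises | Everything must be configured yourself | You must look after everything yourself | The hardest pattern; tough work even for professional infrastructure engineers |
    | Build and operate yourself on cloud VMs / K8s | Everything must be configured yourself | You get some of the benefits of cloud services | Building a Spark environment yourself is fairly difficult |
    | Use a managed service offered by the cloud provider (the most recommended approach) | Click through the GUI, or run it via CLI/API | Monitor CPU and other resources; maintain as the situation requires | The easiest and smartest way; AWS, Google Cloud, and other vendors offer such services |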

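  36. Options for operating Spark on Google Cloud
    | Environment / approach | Setup | Operation | Available features | Notes |
    | --- | --- | --- | --- | --- |
    | Create and operate an environment on GCE or GKE | Build it yourself, then install Spark | Operate everything yourself; you have to look after it | All features | Not recommended, since Dataproc can do the same in the end |
    | Dataproc | Build with the gcloud command, the API, or the console | Monitor and operate the GKE or GCE environment that Dataproc creates | All features | The most standard setup |
    | Dataproc Serverless | Build with the gcloud command, the API, or the console | Monitoring during execution only; the environment is deleted automatically after processing | Batch processing only; notebooks are not available | The best choice for periodic batch processing |
    Note: "Spark in BigQuery", a feature for running Spark as a BigQuery stored procedure, is planned (announced at Google Cloud Next '22).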

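  37. Dataproc and Dataproc Serverless
    • Google Cloud has a managed Spark (Hadoop) service called Dataproc.
    • Until now it could only be operated on the premise that a host or cluster exists on GCE or GKE (K8s), but a Serverless option arrived very recently.
    • For batch operation such as "once a day" or "every 30 minutes", Serverless works!
      Notebook execution (Jupyter and so on) is not supported, so it cannot be used for ad hoc work.
    • Serverless is pay-as-you-go, so it is easy on the wallet too.
    • The version is Spark 3.2, so the Pandas API is available from PySpark (though it is not used this time).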

  38. Tasks done with PySpark
    • Data collection & loading data into BigQuery
    • Loading data into the dashboard app's DB (Firestore)

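  39. [Recap] Data collection & saving to BigQuery
    • A crawler (Cloud Functions) periodically collects data from the source site (Baseball Savant).
    • The results are saved to Google Cloud Storage (GCS) as CSV. This is the source-of-truth data (data lake).
    • A PySpark script running on Dataproc Serverless summarizes the CSV files on GCS and saves them to BigQuery.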

  40. Data collection
    (not Spark)
    • Web scraping is not something Spark should be doing.
    • The task is implemented with requests-html and operated on Cloud Functions.
    • It runs periodically via a Cloud Scheduler cron setting and saves to GCS.
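The "save to GCS" step of such a crawler might look roughly like the sketch below. The scraping itself is left out. The bucket name and the object naming scheme are hypothetical; the upload uses the `google-cloud-storage` client (`Client.bucket`, `Bucket.blob`, `Blob.upload_from_string`), imported lazily so the helper can be tried without the package installed.

```python
import datetime

BUCKET_NAME = "example-statcast-datalake"  # hypothetical bucket name


def object_name(game_date: datetime.date) -> str:
    """Build a date-partitioned object name (hypothetical naming scheme)."""
    return f"statcast/{game_date:%Y/%m/%d}.csv"


def save_csv(csv_text: str, game_date: datetime.date) -> None:
    """Upload one day's CSV text to GCS."""
    # Imported here so the module loads without google-cloud-storage.
    from google.cloud import storage

    bucket = storage.Client().bucket(BUCKET_NAME)
    blob = bucket.blob(object_name(game_date))
    blob.upload_from_string(csv_text, content_type="text/csv")
```

Partitioning object names by date keeps each scheduled run idempotent: rerunning a day overwrites that day's file only.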

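  41. Loading CSV data
    into BigQuery
    • One of the tasks whose scope and processing suit Dataproc.
    • Pick up files from a GCS path, process them with Spark SQL, and send them to BigQuery.
    • If you know DataFrames and SQL, you can implement and operate it comfortably.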

  42. Let's use Dataproc
    • Working through Google Cloud's documentation and samples by retyping them is a good way to learn.
      • https://cloud.google.com/dataproc
      • https://cloud.google.com/dataproc-serverless/docs
      • https://github.com/GoogleCloudDataproc/cloud-dataproc
    • For Serverless, you must create a VPC subnet in advance and specify it at run time.
    • The next pages show a few samples of doing this with PySpark.
      • Batch processing of the "read data, transform it, write it" kind, based on Spark DataFrames.
      • Implemented with type hints to make it clear which class is which.

  43. A first implementation
    1. Create a session
    • Something like a DB connection
    • Create a SparkSession object
    • Specify parameters when creating the object
    • Note that specifying the JAR is mandatory when using BigQuery
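The slide's own code is not part of this transcript. As a minimal sketch of the step, a session with the BigQuery connector JAR attached could be created as below. The JAR location is an assumption (pick the connector build that matches your Spark and Scala versions), and `create_session` is a hypothetical helper name.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import SparkSession

# Assumption: a publicly hosted spark-bigquery-connector build.
BIGQUERY_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def create_session(app_name: str, jar: str = BIGQUERY_JAR) -> "SparkSession":
    """Create (or reuse) a SparkSession with the BigQuery connector JAR."""
    # Imported here so the module loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", jar)
        .getOrCreate()
    )
```

On Dataproc the JAR can alternatively be passed at submit time instead of in code.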

  44. A first implementation
    2. Create a schema
    • For CSV, you create a schema
    • Absolutely required, because it gives types to the resulting DataFrame
    • This time I wrote a schema for all 91 fields (with tears)
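One compact way to express such a schema, shown here as a sketch rather than the speaker's actual code, is a DDL string, which `DataFrameReader.schema` accepts in place of a `StructType`. Only four of the 91 columns are listed; the column names are Statcast CSV fields and the type choices are assumptions.

```python
# (column name, Spark SQL type): a small subset; the real file has 91 columns.
COLUMNS: list[tuple[str, str]] = [
    ("pitch_type", "STRING"),
    ("game_date", "DATE"),
    ("release_speed", "DOUBLE"),
    ("launch_speed", "DOUBLE"),
]


def to_ddl(columns: list[tuple[str, str]]) -> str:
    """Join (name, type) pairs into a Spark DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


SCHEMA_DDL = to_ddl(COLUMNS)
print(SCHEMA_DDL)
```

Keeping the columns in a plain list makes a 91-entry schema easier to review than a long chain of `StructField` calls.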

  45. A first implementation
    3. Read the CSV
    • Use the Spark session's read and specify CSV as the format
    • Specify the schema from the previous step for the header
    • Specify the full GCS path
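A sketch of the read step, under the assumption that the files carry a header row and live under one GCS prefix. `read_statcast_csv` and `gcs_path` are hypothetical helper names; the reader calls (`format`, `option`, `schema`, `load`) are standard PySpark.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame, SparkSession


def gcs_path(bucket: str, prefix: str) -> str:
    """Build a full gs:// glob for all CSV files under a prefix."""
    return f"gs://{bucket}/{prefix.strip('/')}/*.csv"


def read_statcast_csv(spark: "SparkSession", path: str, schema_ddl: str) -> "DataFrame":
    """Read CSV files with an explicit schema instead of schema inference."""
    return (
        spark.read.format("csv")
        .option("header", "true")  # skip the header row in each file
        .schema(schema_ddl)        # a DDL string or a StructType both work
        .load(path)
    )
```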

  46. A first implementation
    4. Save to BigQuery
    • The DataFrame's write function
    • Specify bigquery
    • The example appends to an existing table
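An append to an existing table through the spark-bigquery-connector might look like this sketch. `temporaryGcsBucket` is the connector option for the staging bucket used by indirect writes; the helper names and the `project.dataset.table` form of the identifier are choices made for this example.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame


def table_id(project: str, dataset: str, table: str) -> str:
    """Build a fully qualified BigQuery table identifier."""
    return f"{project}.{dataset}.{table}"


def append_to_bigquery(df: "DataFrame", table: str, temp_bucket: str) -> None:
    """Append a DataFrame's rows to an existing BigQuery table."""
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)  # staging area for the load
        .mode("append")
        .save(table)
    )
```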

  47. Running it with Dataproc Serverless
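The command shown on this slide is not in the transcript. As a sketch, a Dataproc Serverless batch is submitted with `gcloud dataproc batches submit pyspark`; the snippet below only assembles that command line. The script name, region, subnet, JAR, and bucket are placeholders, and `batch_submit_command` is a hypothetical helper.

```python
import shlex


def batch_submit_command(
    script: str, region: str, subnet: str, jars: str, deps_bucket: str
) -> list[str]:
    """Assemble the gcloud command for a Dataproc Serverless PySpark batch."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        f"--region={region}",
        f"--subnet={subnet}",            # the VPC subnet created beforehand
        f"--jars={jars}",                # e.g. the BigQuery connector JAR
        f"--deps-bucket={deps_bucket}",  # staging bucket for local files
    ]


cmd = batch_submit_command(
    script="main.py",
    region="asia-northeast1",
    subnet="example-subnet",
    jars="gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar",
    deps_bucket="gs://example-deps-bucket",
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where the gcloud CLI is installed and authenticated.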

  48. File output from BigQuery to GCS for Dataproc
    • Load BigQuery data into a Spark DataFrame
    • Write the Spark DataFrame out to files
    The way to run it (gcloud CLI) is the same as before, so it is omitted.

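  49. [Recap] Loading into Firestore (moving data to the database)
    • A PySpark script running on Dataproc Serverless converts the BigQuery data into the dashboard's data format (JSON).
    • A Python script loads the results (saved on GCS as JSON) into Firestore.
    • Both are run manually (the reason and the workaround come later).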
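The Firestore loading script itself is not shown in the deck. A rough sketch of the idea, assuming the Spark job wrote JSON Lines (one object per line) and that each record carries a field usable as a document ID, could be the following. The function names and parameters are hypothetical; the Firestore calls (`Client`, `collection`, `document`, `set`) come from the `google-cloud-firestore` client.

```python
import json


def parse_json_lines(text: str) -> list[dict]:
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_into_firestore(records: list[dict], collection: str, id_field: str) -> None:
    """Write each record as one Firestore document keyed by `id_field`."""
    # Imported here so the parser can be used without google-cloud-firestore.
    from google.cloud import firestore

    db = firestore.Client()
    for record in records:
        doc_id = str(record[id_field])
        db.collection(collection).document(doc_id).set(record)


records = parse_json_lines('{"player": "A", "hr": 1}\n\n{"player": "B", "hr": 2}\n')
```

For larger volumes, batched writes would cut the number of round trips; the one-by-one loop is kept here for readability.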

  51. A first implementation
    5. Read from BigQuery
    • Specify the BigQuery JAR, the same as when saving
    • Specify BigQuery in spark read
    • An option is required when reading from a BigQuery view
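The option mentioned for views is, in the spark-bigquery-connector, `viewsEnabled`, usually paired with `materializationDataset` (the dataset where the connector materializes the view before reading it). The sketch below assumes those two options; the helper names are hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame, SparkSession


def view_options(materialization_dataset: str) -> dict[str, str]:
    """Connector options needed to read from a BigQuery view."""
    return {
        "viewsEnabled": "true",
        "materializationDataset": materialization_dataset,
    }


def read_bigquery_view(
    spark: "SparkSession", view: str, materialization_dataset: str
) -> "DataFrame":
    """Load a BigQuery view into a Spark DataFrame."""
    return (
        spark.read.format("bigquery")
        .options(**view_options(materialization_dataset))
        .load(view)
    )
```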

  52. A first implementation
    6. Save to GCS
    • The DataFrame's write function
    • Specify json
    • Specify the final path
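A sketch of the JSON output step. Note that Spark writes a directory of part files in JSON Lines format at the given path, not a single file. The overwrite mode and the path layout are assumptions made for this example.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame


def output_path(bucket: str, name: str) -> str:
    """Build the destination directory on GCS (hypothetical layout)."""
    return f"gs://{bucket}/dashboard/{name}"


def write_json(df: "DataFrame", path: str) -> None:
    """Write a DataFrame as JSON Lines part files under `path`."""
    df.write.mode("overwrite").format("json").save(path)
```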

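  53. PySpark and Dataproc Serverless
    • It makes the "use Spark only when you need it" use case possible.
      This is the biggest reason to use a serverless service.
    • At this application's data size (under 1 GB per year) the benefit does not show, but for a use case like "quickly batch-process a few GB of data per day" it seems quite handy (preprocessing, cleansing, and so on).
    • The code is lightweight, of the "run only when processing" kind, so it pairs extremely well with PySpark.
    • Automating the processing has a quirk: it needs Cloud Composer (Airflow).
      See the Appendix of this deck for details.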

  54. The technical part ends here.
    From here on...

  55. It's baseball tiiiiime ⚾

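  56. Japanese pro baseball 2022: the 5 moments that moved me most
    1. Fighters: Mannami, Kiyomiya, Tamiya, and other physically gifted youngsters emerge
    2. FIGHTERS GIRL 2022: the razor-sharp "Fox Dance" becomes a huge hit,
       crushing even the many ceremonial first pitch videos in Pacific League TV view counts
    3. The retirements of Yoshio Itoi and Kosuke Fukudome, who defined an era with power, strong arms, speed, and fine defense
    4. A wild bird that settled in at the ballpark loses to Chiba Lotte outfielder Kakunaka swinging his bat around
    5. Roki Sasaki's perfect game; Munetaka Murakami's Triple Crown plus the home run record. Isn't that nasty?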

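  58. The "nasty outfielders"
    that Statcast (& I) recommend
    • Freakish athleticism and razor-sharp movement
    • Power, a strong arm, fine defense, and legs are the selling points
    • The wildness of swinging the bat hard
    • Three favorites, picked by exit velocity
    • Wonderful alien-like outfielders, reminiscent of Tsuyoshi Shinjo in his playing days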

  59. Today's nasty outfielders
    • Judge, Aaron
    • Rodríguez, Julio
    • Buxton, Byron
    Three players who are outfielders with at least 300 plate appearances, hit the ball hard with plenty of extra-base hits,
    and, as a rule, play center field.

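  60. Aaron Judge
    (2022 home run champion)
    • The Yankees' slugger and Ohtani-san's rival
    • The strongest active home run hitter
    • Not just power: his selling points are outfield defense that uses his 2 m height
      and the mobility to play center field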

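  61. Julio Rodríguez
    (Seattle's rising star)
    • A rising star who burst onto the scene with the Mariners;
      a rookie this year, by the way
    • Numbers like a young BIG BOSS;
      his appeal is play built on athleticism
    • If his launch angle rises and his barrels increase,
      might he edge closer to Ichiro?
      Hoping for performances worthy of his 10-year contract!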

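  62. Byron Buxton
    (the Mannami of Minnesota)
    • The Minnesota Twins' fixture in center field
    • "Could he make it in another sport too?"
      That kind of outrageous leg speed and arm,
      and yet his launch angle is that of a power hitter
    • With his rough edges and athletic build,
      he resembles Chusei Mannami (Fighters).
      Manchu, become the Buxton of the North!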

  63. He isn't playing the outfield this year,
    but this man was a nasty batter too

  64. Ohtani-san!! Be still, my heart
    He ranked 2nd in maximum exit velocity among hitters with 300 or more plate appearances (by my own count)

  65. Closing

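  66. [Recap] Today's starting lineup
    • Let's play with Major League big data
    • A nice serverless data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "nasty athletic outfielders" that big data recommends
    Did you enjoy it? There was a lot of information, so it may take a while to digest.
    The slides will be published, so please read them as a review.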

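  67. To sum up today's talk...
    • Baseball is a fun subject for sports data analysis!
      The tracking data on Baseball Savant is a good place to start.
    • PySpark is a good way to process data nicely in Python.
    • PySpark runs in the cloud; today I introduced Dataproc.
    • Once you can use the cloud serverlessly, many things get easier (though there are limits).
    • The majors have nasty outfielders, but Ohtani-san, the slider and two-seam monster, is strong.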

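  68. For those thinking of using this as a reference at work
    • The approach and setup shown here are not the absolute answer or a best practice.
      For example, there are definitely situations where you should, or should not, go with a serverless architecture.
    • This is the sum of what I (shinyorke) wanted to do, packed with things I like (& things I wanted to try),
      and it is only one way of arriving at an answer.
    • Put another way, I built it as a prototype to see "how far can serverless and PySpark take me?",
      and I actually plan to remove Spark going forward (see the Appendix for details).
    • If you copy it as-is (with only a half understanding of the context), you will crash and burn.
      Start by getting hands-on, and use this as a reference while you learn, run things, and find what works for you!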

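  69. [Continued] Appendix - a few more details
    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds, 2022
    • Basics of larger-scale data processing without Spark, for Google Cloud
    • Building a nice data visualization app with Dash + Cloud Run
    If you're curious, read on in the slides; if you're at the venue, let's talk in the Q&A!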

  70. Thank you for listening ⚾
    Shinichi Nakagawa@shinyorke

  71. Sports Data Analysis Basics for Python Users - with PySpark and Major League Baseball Data
    Bonus part: "All the tips & references I didn't cover in the main talk, in one go"

  72. Appendix - a few more details
    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds, 2022
    • Basics of larger-scale data processing without Spark, for Google Cloud
    • Building a nice data visualization app with Dash + Cloud Run
    • References

  73. Running Dataproc Serverless automatically

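  74. Automating Dataproc Serverless
    • There are three conceivable approaches.
      1. Build a Docker image that runs the job via the API,
         and run it as a container in some way (K8s, for example).
      2. Since it can be run from the CLI (gcloud command), build a Docker image
         with the gcloud command (the rest is the same as 1).
      3. Run Dataproc Serverless with an Airflow Operator.
    • 1 and 2 are a slog and may defeat the purpose of serverless (and 1 and 2 say almost the same thing).
      Running them on Cloud Run or similar would be fine, but both setup and operation look risky.
    • The model answer that looks like best practice is "3. Run Dataproc Serverless with an Airflow Operator".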

  75. [Model answer] Run it through an Airflow Operator
    Google Cloud's managed service "Cloud Composer" looks like a good fit.

  76. Automating Dataproc Serverless processing
    • On Google Cloud, the best practice for automating cron-like processing is
      Pub/Sub + Scheduler (or Cloud Tasks).
    • However, as of October 2022, Dataproc Serverless has no way to be triggered with Pub/Sub as its interface,
      so unfortunately this method cannot be used.
    • The smartest way is therefore to run it with Airflow's Dataproc Operators,
      standing up and operating an Airflow cluster with Cloud Composer.
      • https://cloud.google.com/composer/docs/composer-2/run-dataproc-workloads
    • Note that Cloud Composer is not serverless (though it is fully managed),
      and it stands up a K8s (GKE) cluster, so watch the cost (fine for business use, expensive for personal use).
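A sketch of what such a DAG could look like with `DataprocCreateBatchOperator` from the Google provider package. Everything named here (project, region, bucket, subnet, DAG ID, schedule) is a placeholder, and `schedule_interval` reflects Airflow 2.x as of the talk's date; newer Airflow versions use `schedule`. Airflow is imported inside `build_dag` so the batch configuration can be inspected without Airflow installed.

```python
import datetime

PROJECT_ID = "example-project"  # placeholder
REGION = "asia-northeast1"      # placeholder


def batch_config(main_python_file_uri: str, jar_file_uris: list[str],
                 subnetwork_uri: str) -> dict:
    """Build a Dataproc Serverless Batch resource for a PySpark job."""
    return {
        "pyspark_batch": {
            "main_python_file_uri": main_python_file_uri,
            "jar_file_uris": jar_file_uris,
        },
        "environment_config": {
            "execution_config": {"subnetwork_uri": subnetwork_uri},
        },
    }


def build_dag():
    """Create a daily DAG that submits one Dataproc Serverless batch."""
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    with DAG(
        dag_id="statcast_daily_batch",
        start_date=datetime.datetime(2022, 10, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        DataprocCreateBatchOperator(
            task_id="run_pyspark_batch",
            project_id=PROJECT_ID,
            region=REGION,
            batch=batch_config(
                "gs://example-bucket/jobs/main.py",
                ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
                "example-subnet",
            ),
            batch_id="statcast-{{ ds_nodash }}",  # batch IDs must be unique
        )
    return dag
```

In a real DAG file, `dag = build_dag()` at module level lets the scheduler discover it.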

  77. Using Spark in the cloud
    outside Google Cloud

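  78. Spark service options outside Google Cloud
    • The candidates are AWS, Azure, and (in a sense the original) Databricks.
    • For use cases that treat the public cloud as infrastructure,
      Databricks becomes the leading candidate (when you want to go multi-cloud, for example).
    • AWS is actually well equipped in this area; choosing between EMR and Glue
      according to the use case seems like a good approach.
    • I have never touched Azure, so I can't say...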

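  79. Spark service options outside Google Cloud
    | Cloud service (not exhaustive) | URL | Summary |
    | --- | --- | --- |
    | Databricks | https://www.databricks.com/jp | An option if you expect to go multi-cloud; developed and offered by the creators of Spark |
    | AWS EMR | https://aws.amazon.com/jp/emr | AWS's managed Spark/Hadoop; the one to pick for using Spark as Spark |
    | AWS Glue | https://aws.amazon.com/jp/glue | For using Spark as ETL; Glue is a better choice than EMR here |
    | Azure HDInsight | https://azure.microsoft.com/ja-jp/services/hdinsight/#overview | The option on Azure (though I have never used it...) |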

  80. Nice data processing
    without Spark (Dataproc),
    for Google Cloud

  81. Nice data processing for Google Cloud
    • Dataflow
    • DataFusion
    • Dataprep
    • Cloud Run
    • Cloud Functions

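  82. Choose according to your use case!
    | Google Cloud Service | URL | Summary |
    | --- | --- | --- |
    | Dataflow | https://cloud.google.com/dataflow?hl=ja | Based on Apache Beam; the one for streaming processing |
    | DataFusion | https://cloud.google.com/data-fusion/docs?hl=ja | An ETL-style service for ingesting existing data, including on-premises data |
    | Dataprep | https://cloud.google.com/dataprep?hl=ja | Centered on data preprocessing and cleansing; leans toward low-code |
    | Cloud Run | https://cloud.google.com/run?hl=ja | The one for building with your favorite language or framework; triggered via Pub/Sub and the like |
    | Cloud Functions | https://cloud.google.com/functions?hl=ja | More constrained than Cloud Run, but good for building and running something quickly |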

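  83. Realistic options and rules of thumb
    • For real-time processing, Dataflow is the leading option.
    • For integrating or consolidating existing data, DataFusion.
    • For data preprocessing for machine learning and the like, Dataprep.
    • For building and running something yourself, in Python or otherwise, Cloud Run.
    • For something on the order of "Pandas plus BigQuery and GCS", Cloud Functions
      gets it done quickly (this may well be the most common use case).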

  84. A data visualization dashboard
    operated with Dash + Cloud Run
    Note: Spark and Dataproc do not appear here.

  85. Dashboard app (skipped in the main talk)
    • The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway.
    • Firestore is the main DB, with MemoryStore (Redis) in front as a cache.
    • Spark (PySpark) does not appear here.

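  87. Hosting with
    Dash + Cloud Run
    • Dash is built on Flask, so it can be hosted
      the usual way, by running it under gunicorn.
    • This is serverless too, so it runs only for the time and resources you use.
      If you want to build your own visualization app,
      it may be worth a try.
    • On AWS, I believe the same approach works with App Runner
      (though I haven't tried it).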
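A minimal sketch of the gunicorn approach, not the speaker's actual app. Dash exposes its underlying Flask application as `app.server`, which is the WSGI object gunicorn serves, and Cloud Run passes the listening port in the `PORT` environment variable. The module name `app` in the command below and the layout are placeholders.

```python
import os
from typing import Mapping


def bind_address(environ: Mapping[str, str] = os.environ) -> str:
    """Return the host:port gunicorn should bind to on Cloud Run."""
    return f"0.0.0.0:{environ.get('PORT', '8080')}"


def create_server():
    """Build the Dash app and return its Flask server (a WSGI callable)."""
    # Imported here so bind_address can be used without Dash installed.
    from dash import Dash, html

    app = Dash(__name__)
    app.layout = html.Div("Hello, Statcast")  # placeholder layout
    return app.server
```

If this file is saved as `app.py`, a container entry point along the lines of `gunicorn --bind 0.0.0.0:$PORT "app:create_server()"` would serve it.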

  88. The CI/CD workflow looks like this.
    • A push to the GitHub repository fires GitHub Actions: test -> Docker build -> deploy to Cloud Run
    • Tests run pytest, flake8, and mypy on GitHub Actions (the idea is to cover unit through integration tests)
    • The Docker build follows the standard Cloud Run procedure.
      • Build on Cloud Build
      • Push to Artifact Registry
    • Deployment to Cloud Run uses the official GitHub Actions action.

  89. References

  90. Spark / PySpark
    • PySpark Documents
      https://spark.apache.org/docs/latest/api/python/
    • 入門 PySpark (Introduction to PySpark). Note: a somewhat old book, so treat the content with care.
      https://www.oreilly.co.jp/books/9784873118185/
    • Processing large volumes of data with Python! Basics of data processing and analysis with PySpark
      (PyCon JP 2017)
      https://speakerdeck.com/chie8842/pythondeda-liang-detachu-li-pysparkwoyong-itadetachu-li-tofen-xi-falsekihon

  91. Google Cloud (Dataproc)
    • Official documentation
      https://cloud.google.com/dataproc/docs?hl=ja
    • Official PySpark samples (retyping from here is the easy route)
      https://github.com/googleapis/python-dataproc
    • Official samples, part 2 (more practical)
      https://github.com/GoogleCloudDataproc/cloud-dataproc

  92. Google Cloud (for beginners and those who want to start using it)
    • Official documentation
      https://cloud.google.com/docs?hl=ja
    • Certifications
      https://cloud.google.com/certification?hl=ja
    • エンタープライズのための Google Cloud (Google Cloud for the Enterprise; a book I recommend)
      https://www.shoeisha.co.jp/book/detail/9784798175256

  93. My own blog posts (PySpark / data)
    • Making baseball big data nicely usable with GCP and PySpark
      https://shinyorke.hatenablog.com/entry/dataproc-baseball
    • How to use Spark without managing servers
      https://shinyorke.hatenablog.com/entry/dataproc-serverless
    • Quickly getting an environment for using Spark on Google Cloud
      https://shinyorke.hatenablog.com/entry/dataproc-terraform
    • Practices for quickly standing up a web app and a data platform
      https://shinyorke.hatenablog.com/entry/cloud-arch-serverless

  94. Baseball-related reference blogs and code
    • An introduction to Statcast data for baseball fans and data fans
      https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
    • Visualizing "where batted balls land" with Statcast data and Plotly
      https://shinyorke.hatenablog.com/entry/statcast-visualization-for-batting
    • A sample for looking at Ohtani-san's data from Baseball Savant
      https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
    • R によるセイバーメトリクス入門 (Introduction to Sabermetrics with R)
      https://gihyo.jp/book/2020/978-4-297-11684-2

  95. Done.
    Thank you for reading to the end.