Sports Data Analysis Basics for Python Users: Served with PySpark and Major League Data #PyConJP 2022

Shinichi Nakagawa

October 15, 2022

Transcript

  1. No Baseball, No Engineering! High Performance Data Platform Knowledge of PySpark, Cloud and ⚾
    Sports Data Analysis Basics for Python Users: Served with PySpark and Major League Data
    Shinichi Nakagawa (@shinyorke), 2022/10/15, PyConJP 2022 Talk Session
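  2. Onboarding (about this session)
    • This talk is about handling several GB or more of data nicely with Python, Spark (PySpark), and a public cloud (Google Cloud).
    • The content is aimed at intermediate to advanced users, and I hope it also works as a guide for beginners. (Treat anything you don't understand or don't know yet as your own room to grow.)
    • The data comes from Major League Baseball ⚾, plus a little about sports data in general.
    • I hope people with no interest in (or no love for) baseball can enjoy this too. I'll do my best to make this talk a reason to get interested in baseball.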
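  3. Prerequisites and motivation I expect from you
    • [Must] You have done hands-on data processing and analysis with Pandas or SQL.
    • [Must] You have used Python on a public cloud such as Google Cloud (GCP), AWS, or Azure. Any service counts (EC2, App Engine, etc.).
    • Development experience on a fully managed serverless environment (having touched one is enough): AWS Lambda, AWS App Runner, App Engine, Cloud Run, and so on.
    • (Like him or not) you know the rules of baseball and who Ohtani-san is.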
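  4. Who am I? (Who are you, anyway?)
    • Shinichi Nakagawa (@shinyorke)
    • Manager at a major foreign-owned IT consulting firm (formerly a full-cycle engineer at an operating company)
    • Manager of a team that handles cloud infrastructure
    • I do personal development for both fun and profit (mainly for baseball and physical care).
    • I love baseball, and programming over a drink.
    • Favorites: Tsuyoshi Shinjo, Chusei Mannami, Kenta Tanigawara (and his strong arm)
    #Python #Serverless #GoogleCloud #Baseball #DataScience #SABRmetrics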
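  5. Today's starting lineup
    • Playing with Major League big data
    • A nice serverless data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "insanely good players" that baseball big data recommends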
  6. Playing with Major League big data

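  7. Major League big data
    • Major League Baseball records all kinds of data with a system called "Statcast". It is captured by cameras and radar, with some values derived statistically or entered by hand.
    • For example, play-by-play calls like these are all sourced from the Statcast data:
    • "Ohtani-san! Home run number N! Exit velocity 180 km/h, distance 130 m."
    • "Ohtani-san! A called third strike on a 162 km/h fastball!!!"
    • Every move on the field is recorded: every pitch and every batted ball.
    • A regular season (30 teams, 162 games) comes to roughly 700,000 to 800,000 pitches. Postseason and spring training data exist too.
    • Each record has 91 fields (!?), and one regular season is roughly 400 MB to 600 MB of data.
    • Anyone can browse and download it (in CSV format) at baseballsavant.mlb.com.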
  8. The official data specification is here: https://baseballsavant.mlb.com/csv-docs
    My annotated Japanese translation is here: https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
    Here is a quick look at each data field.

  9. None
  10. None
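  11. ???: "This is tough... because I can't tell what the fields are or what they mean." All 91 fields, with units and measurement conventions that trip up first-timers (and congrats to Arai-san on becoming the Hiroshima manager).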

  12. Looking back on "Ohtani-san's 2022" with Statcast data. Let's use this as our example for exploring the Statcast data.

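  13. https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
    The walkthrough is based on this Statcast sample. The code is public, so please play with it.
    Note: the original data is in mile/h and feet. It has been converted to km/h and m in advance, and those converted values do not exist in the original data.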
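The unit conversion mentioned on this slide can be sketched as follows. This is a minimal illustration, not the code from the repository; the function names are made up for this example, and only the conversion factors (exact by definition) are fixed facts.

```python
# Exact conversion factors for the international mile and foot.
KM_PER_MILE = 1.609344
M_PER_FOOT = 0.3048


def mph_to_kmh(mph: float) -> float:
    """Convert a speed in miles per hour to kilometres per hour."""
    return mph * KM_PER_MILE


def feet_to_m(feet: float) -> float:
    """Convert a length in feet to metres."""
    return feet * M_PER_FOOT


if __name__ == "__main__":
    # Hypothetical inputs: a 100 mph pitch and a 400 ft fly ball.
    print(round(mph_to_kmh(100.0), 1))
    print(round(feet_to_m(400.0), 1))
```

In practice the same functions would be applied to the speed and distance columns of the CSV before plotting.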

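  14. Ohtani-san in 2022 becomes a slider, two-seamer, and cutter specialist
    • This year's Ohtani-san throws a ton of sliders.
    • Did you notice? In the second half, his two-seamers (labeled Sinker in the data) increased.
    • With the slider, two-seamer, and cutter, he has reinvented himself as a pitcher who throws nasty breaking balls.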
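  15. One of Ohtani-san's starts (2022/9/29: 8 innings, 10 strikeouts, no runs allowed). Nearly half his pitches were sliders, and he attacked with the two-seamer and cutter.
    Figure panels: pitch locations (catcher's view), release points (catcher's view), pitch-type share.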

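  16. Exploring Statcast data with Jupyter Lab + Plotly
    • There is a wide variety of data, so you can learn quite a lot.
    • It is time-series data, so you can also catch changes in performance.
    • For example, "his velocity is different from before" or "he suddenly throws more two-seamers".
    • Ball velocity and coordinate data are all there.
    • With some effort on the coordinate math you could even do 3D rendering. (Translation: I had no time for that, so I didn't do it this time.)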
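A pitch-mix summary like the one discussed above can be computed with the standard library alone. The sketch below assumes you already have a list of Statcast `pitch_type` codes (for example `SL` for slider, `SI` for sinker, `FC` for cutter); the function name and the sample list are hypothetical.

```python
from collections import Counter


def pitch_mix(pitch_types: list[str]) -> dict[str, float]:
    """Return each pitch type's share of all pitches, most frequent first."""
    counts = Counter(pitch_types)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.most_common()}


if __name__ == "__main__":
    # Hypothetical sample, not real game data.
    sample = ["SL", "SL", "SI", "FC"]
    for code, share in pitch_mix(sample).items():
        print(f"{code}: {share:.0%}")
```

Running the same function per game date gives the kind of time series in which a sudden rise in two-seamers becomes visible.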
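  17. Me: "I want a setup for checking every game, every day." https://baseballsavant.mlb.com/ is also a bit awkward to use, so one day the idea hit me: just turn it into an easy-to-use data platform!

  18. So I built a little something.

  19. A nice serverless data platform built with Python and Google Cloud (baseball edition)

  20. Architecture overview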

  21. None
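  22. Architecture notes (roughly, the points I cared about)
    • To check and update the data every day without fuss, the platform is built and operated entirely on fully managed, serverless cloud services.
    • What is a "fully managed serverless cloud service"?
    • It comes up after just a few clicks or commands in the console or CLI.
    • No infrastructure or server maintenance is needed (the cloud provider does it, not you).
    • More concretely, you don't have to stand up a K8s cluster or VMs yourself (some network configuration is still required).
    • It can be wired into a CI/CD pipeline such as GitHub Actions for deployment and scaling, and billing is basically pay-for-what-you-use, so it is easy on the wallet.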
  23. Use cases and the services used

  24. None
  25. Dashboard app
    • The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway.
    • Firestore is the main DB, with MemoryStore (Redis) in front as a cache.
    • Spark (PySpark) does not appear here.
  26. None
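  27. Data collection & saving to BigQuery
    • A crawler (Cloud Functions) periodically collects data from the source site (Baseball Savant).
    • The results are saved to Google Cloud Storage (GCS) as CSV. This is the source-of-truth data (the data lake).
    • A PySpark script running on Dataproc Serverless summarizes the CSV files on GCS and saves them to BigQuery.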
  28. None
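  29. Loading into Firestore (moving data into the database)
    • A PySpark script running on Dataproc Serverless converts the BigQuery data into the format used by the dashboard (JSON).
    • A Python script loads the results (saved on GCS as JSON) into Firestore.
    • Both are run manually (the reason and a workaround come later).

  30. Serverless data processing with PySpark + Dataproc. Note: this is the main topic of the talk.

  31. Scope of this part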

  32. None
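  33. Spark and PySpark, understood (or so it feels) in 33.4 seconds

  34. Spark and PySpark
    • A framework for "distributing and processing large data nicely".
    • Spark itself is implemented in Scala and runs on the JVM, but it is often used through "PySpark", its Python interface (other languages such as R are supported too).
    • Run it as a batch data processing program, or run notebooks in Jupyter Lab or Zeppelin.
    • It has a DataFrame interface, which feels very familiar to Python users.
    • This is Spark's own DataFrame. It can be converted to a Pandas DataFrame.
    • Pandas API (Pandas functionality is available from Spark 3.2 onward, with some limitations).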
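  35. Where should you build and operate Spark?
    Environment / approach | Setup effort | Ease of operation | Notes
    On-premises, fully self-built and self-operated | Everything must be configured yourself | You have to look after everything yourself | The hardest pattern; tough work even for professional infrastructure engineers
    Self-built and self-operated on VMs or K8s in the cloud | Everything must be configured yourself | You get some of the benefits of cloud services | Building a Spark environment yourself is fairly difficult
    Managed service from a cloud provider (most recommended) | Click through the GUI, or run it via CLI / API | Monitor resources such as CPU and do maintenance as needed | The easiest and smartest way; AWS, Google Cloud, and others all offer one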
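  36. Options for running Spark on Google Cloud
    Environment / approach | Setup | Operation | Available features | Notes
    Build your own environment on GCE or GKE | Build it yourself, then install Spark | Fully self-operated; you have to look after it | All features | Not recommended, since Dataproc can do all of this anyway
    Dataproc | Create via the gcloud command, API, or console | Monitor and operate the GKE or GCE environment that Dataproc created | All features | The most standard setup
    Dataproc Serverless | Create via the gcloud command, API, or console | Monitoring during execution only; the environment is deleted automatically after processing | Batch processing only; notebooks are not available | The best choice for periodic batch processing
    Note: "Spark in BigQuery", a feature for running Spark as a BigQuery stored procedure, has been announced as upcoming (at Google Cloud Next '22).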
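  37. Dataproc and Dataproc Serverless
    • Google Cloud has a managed Spark (Hadoop) service called Dataproc.
    • Until now it could only be operated on the premise that a host or cluster exists on GCE or GKE (K8s), but very recently a Serverless option arrived.
    • For batch operations such as "once a day" or "every 30 minutes", Serverless is usable! Notebook execution (Jupyter and the like) is not supported, so it cannot be used for ad hoc work.
    • Serverless is pay-for-what-you-use, so it is easy on the wallet.
    • The version is Spark 3.2, so the Pandas API is available from PySpark (though I didn't use it this time).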
  38. Tasks I did with PySpark
    • Data collection & loading data into BigQuery
    • Loading data into the dashboard app's DB (Firestore)

  39. [Recap] Data collection & saving to BigQuery (same bullets as slide 27)
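  40. Data collection (not Spark)
    • Web scraping is not something Spark should be doing.
    • The task is implemented with requests-html and run on Cloud Functions.
    • A cron setting in Cloud Scheduler runs it periodically and the output is saved to GCS.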
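  41. Loading CSV data into BigQuery
    • This is an appropriate scope and type of processing for a Dataproc task.
    • Pick up the files from a GCS path, process them with Spark SQL, and send them to BigQuery.
    • If you know DataFrames and SQL, you can implement and operate this comfortably.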
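  42. Let's use Dataproc
    • Working through the Google Cloud documentation and samples by typing them out yourself is a good way to start.
    • https://cloud.google.com/dataproc
    • https://cloud.google.com/dataproc-serverless/docs
    • https://github.com/GoogleCloudDataproc/cloud-dataproc
    • For Serverless, you must create a VPC subnet in advance and specify it at run time.
    • From the next slide on, I'll show a few samples of doing this with PySpark.
    • They are batch jobs built on Spark DataFrames that "read data, transform it, and write it out".
    • To make it clear which class is which, the code is written with type hints.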
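  43. A first implementation 1. Create a Session
    • Think of it as something like a DB connection.
    • Create a SparkSession object.
    • Specify parameters when creating the object.
    • Note that specifying a JAR is mandatory when using BigQuery.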
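The session step can be sketched roughly as below. The slide's actual code is not in the transcript, so this is an assumption-laden illustration: the function name, the application name, and the connector JAR path are placeholders chosen for this example, and you should pin a connector version that matches your Spark and Scala build.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported only for the type hint below
    from pyspark.sql import SparkSession

# Assumption: a publicly hosted BigQuery connector JAR; replace with the one you use.
BIGQUERY_CONNECTOR_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def create_session(app_name: str, jars: str = BIGQUERY_CONNECTOR_JAR) -> "SparkSession":
    """Create (or reuse) a SparkSession with the BigQuery connector on the classpath."""
    # Deferred import so this module can be loaded where Spark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", jars)  # parameters are given at build time
        .getOrCreate()
    )
```

A batch script would call `create_session("statcast-batch")` once at startup and pass the session to the read and write steps.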
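  44. A first implementation 2. Create a Schema
    • For CSV, you create a Schema.
    • It is absolutely required, because it gives types to the resulting DataFrame.
    • This time I wrote out a Schema for all 91 fields, with tears in my eyes.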
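One compact way to write such a schema is a DDL-formatted string, which Spark's `DataFrameReader.schema()` accepts as an alternative to a `StructType`. The sketch below is not the author's 91-field schema: it lists only a handful of Statcast columns, and the type chosen for each is an assumption for illustration.

```python
# A small, illustrative subset of Statcast columns (the real CSV has 91 fields).
# The Spark SQL types are assumptions, not taken from the talk.
STATCAST_COLUMNS: list[tuple[str, str]] = [
    ("pitch_type", "STRING"),
    ("game_date", "DATE"),
    ("release_speed", "DOUBLE"),
    ("player_name", "STRING"),
    ("launch_speed", "DOUBLE"),
    ("launch_angle", "DOUBLE"),
]


def to_ddl(columns: list[tuple[str, str]]) -> str:
    """Build a DDL schema string such as 'a STRING, b DOUBLE'."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns)


if __name__ == "__main__":
    print(to_ddl(STATCAST_COLUMNS))
```

Keeping the columns in a plain list makes the schema easy to review against the official CSV documentation. Note that the columns must be listed in the same order as in the file.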
  45. A first implementation 3. Read the CSV
    • Use read on the Spark session and specify CSV as the format.
    • Pass the Schema defined earlier for the header.
    • Specify the full GCS path.
  46. A first implementation 4. Save to BigQuery
    • Use the DataFrame's write function.
    • Specify bigquery.
    • The example appends to an existing table.
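Steps 3 and 4 might look like the following. This is a hedged sketch rather than the slide's code: the function names, the `header` option, and the use of a temporary GCS bucket for the BigQuery write are assumptions, while `format("csv")`, `format("bigquery")`, and `mode("append")` are standard Spark and spark-bigquery-connector calls.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported only for type hints
    from pyspark.sql import DataFrame, SparkSession


def read_statcast_csv(spark: "SparkSession", gcs_path: str, schema_ddl: str) -> "DataFrame":
    """Read CSV files from a full GCS path using an explicit schema."""
    return (
        spark.read.format("csv")
        .option("header", True)  # assumption: the first row holds column names
        .schema(schema_ddl)
        .load(gcs_path)
    )


def append_to_bigquery(df: "DataFrame", table: str, temp_bucket: str) -> None:
    """Append the DataFrame to an existing BigQuery table ('dataset.table')."""
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)  # staging bucket for the load job
        .mode("append")
        .save(table)
    )
```

Any Spark SQL summarization would sit between the two calls, operating on the DataFrame returned by the reader.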

  47. Running it with Dataproc Serverless
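The transcript does not preserve the command shown on this slide. As a rough sketch, a Dataproc Serverless batch is submitted with `gcloud dataproc batches submit pyspark`; the helper below only assembles that command line, and the script URI, region, subnet, and JAR values are placeholders.

```python
import shlex


def build_submit_command(script_uri: str, region: str, subnet: str, jars: list[str]) -> list[str]:
    """Assemble the gcloud command for submitting a PySpark batch to Dataproc Serverless."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script_uri,
        f"--region={region}",
        f"--subnet={subnet}",  # the VPC subnet created in advance
        f"--jars={','.join(jars)}",
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmd = build_submit_command(
        "gs://example-bucket/jobs/load_statcast.py",
        "asia-northeast1",
        "example-subnet",
        ["gs://example-bucket/jars/bigquery-connector.jar"],
    )
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.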

  48. Writing files from BigQuery to GCS, for Dataproc
    • Load the BigQuery data into a Spark DataFrame.
    • Write the Spark DataFrame out as files.
    The way to run it (gcloud CLI) is unchanged, so I'll skip it.
  49. [Recap] Loading into Firestore (moving data into the database) (same bullets as slide 29)

  50. [Recap] Loading into Firestore (moving data into the database) (same bullets as slide 29)

  51. A first implementation 5. Read from BigQuery
    • As with saving, specify the BigQuery JAR.
    • Specify BigQuery in spark read.
    • Reading from a BigQuery view requires extra options.
  52. A first implementation 6. Save to GCS
    • Use the DataFrame's write function.
    • Specify json.
    • Specify the final output path.
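Steps 5 and 6 can be sketched in the same style. Again this is an illustration under assumptions: the function names and the `overwrite` mode are choices made for this example, and the view-related options shown (`viewsEnabled`, `materializationDataset`) are the ones the spark-bigquery-connector documents for reading views, which may or may not be what the slide used.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # imported only for type hints
    from pyspark.sql import DataFrame, SparkSession


def read_bigquery(
    spark: "SparkSession", table: str, materialization_dataset: Optional[str] = None
) -> "DataFrame":
    """Read a BigQuery table, or a view when a materialization dataset is given."""
    reader = spark.read.format("bigquery")
    if materialization_dataset is not None:
        # Views are materialized into a temporary table before being read.
        reader = (
            reader.option("viewsEnabled", "true")
            .option("materializationDataset", materialization_dataset)
        )
    return reader.load(table)


def write_json(df: "DataFrame", gcs_path: str) -> None:
    """Write the DataFrame as JSON files under the given GCS path."""
    # Assumption: each run replaces the previous output.
    df.write.mode("overwrite").json(gcs_path)
```

The JSON files written here are what the separate Python script would then pick up and load into Firestore.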

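  53. PySpark and Dataproc Serverless
    • It makes the "use Spark only when you need it" use case possible. This is the biggest reason to use a serverless service.
    • At this application's data size (under 1 GB per year) you don't really reap the benefit, but for a use case like "quickly batch-process a few GB per day" (preprocessing, cleansing, and so on) it seems quite handy.
    • The code is lightweight and "runs only when there is something to process", which pairs very well with PySpark.
    • Automating the processing is a little quirky and requires Cloud Composer (Airflow). See the Appendix of this deck for details.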
  54. The technical part ends here. From here on...

  55. It's baseball time!!! ⚾

  56. 2022 Japanese pro baseball: the top 5 moments that got me
    1. On the Fighters, athletic young players such as Mannami, Kiyomiya, and Tamiya broke through.
    2. FIGHTERS GIRL 2022: the razor-sharp Kitsune Dance became a huge hit, beating even the many ceremonial first pitch videos in view counts on Pacific League TV.
    3. The retirements of Yoshio Itoi and Kosuke Fukudome, who defined an era with power, strong arms, speed, and defense.
    4. A wild bird that had settled in at the ballpark loses to Chiba Lotte outfielder Kakunaka and his swinging bat.
    5. Roki Sasaki's perfect game, and Munetaka Murakami's Triple Crown plus home run record. Insane, right?

  57. 2022 Japanese pro baseball: the top 5 moments that got me (same list as slide 56)
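  58. The "insanely good outfielders" that Statcast (and I) recommend
    • Razor-sharp freaks of athleticism
    • Their selling points are power, a strong arm, good defense, and legs.
    • A certain wildness in how they swing the bat
    • I'll introduce three favorites, chosen by exit velocity.
    • Wonderful, out-of-this-world outfielders who remind me of Tsuyoshi Shinjo in his playing days.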
  59. Today's insanely good outfielders
    • Judge, Aaron
    • Rodríguez, Julio
    • Buxton, Byron
    I'll introduce three outfielders with at least 300 plate appearances who hit the ball hard, pile up extra-base hits, and, as a rule, play center field.
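  60. Aaron Judge (2022 home run king)
    • The Yankees' slugger and Ohtani-san's rival
    • The strongest active home run hitter
    • Not just power: his outfield defense makes use of his 2 m height, and his selling point is mobility good enough to play center field.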
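  61. Julio Rodríguez (Seattle's rising star)
    • A rising star who burst onto the scene with the Mariners. He is a rookie this year, by the way.
    • His numbers resemble a young BIG BOSS, and his appeal is play that makes full use of his athleticism.
    • If his launch angle rises and he barrels more balls, might he get a little closer to Ichiro? Here's hoping his play lives up to that 10-year contract!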
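  62. Byron Buxton (the Mannami of Minnesota)
    • The Minnesota Twins' fixture in center field
    • He has such outrageous legs and arm that he could probably play other sports, and yet his launch angle is that of a power hitter.
    • With his rough edges and great build, he resembles Chusei Mannami (Fighters). Manchu, become the Buxton of the North!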
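  63. He didn't play the outfield this year, but this man was an insanely good hitter too.

  64. Ohtani-san!! Be still, my heart. He ranked 2nd in maximum exit velocity among hitters with at least 300 plate appearances (according to my own research).

  65. Conclusion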

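  66. [Recap] Today's starting lineup
    • Playing with Major League big data
    • A nice serverless data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "insanely good athletic outfielders" that big data recommends
    Did you enjoy it? There was a lot of information, so it may take a while to digest. The slides will be published, so please read them again as a review.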
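  67. To sum up today's talk...
    • Baseball is a fun subject for sports data analysis! The tracking data on Baseball Savant is a good place to start.
    • PySpark is a good way to do data processing nicely in Python.
    • PySpark runs in the cloud. Today I introduced Dataproc.
    • Once you can use the cloud in a serverless way, a lot of things get easier (though there are limitations).
    • The majors have insanely good outfielders, but Ohtani-san, the slider and two-seamer monster, is strong.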
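  68. To those thinking of using this as a reference at work
    • The approach and architecture shown here are not the definitive answer or a best practice. For example, there are definitely situations where you should, or should not, go with a serverless architecture.
    • This is the culmination of what I (shinyorke) wanted to do, packed with things I think are good (and things I wanted to try). It is just one way of arriving at an answer.
    • In fact, I built it as a prototype to see how far serverless and PySpark could take me, and I actually plan to remove Spark later (details in the Appendix).
    • Copying it as-is (with only a half understanding of the context) will blow up on you. Get hands-on first, learn and run things, and use this as a reference for finding what works for you!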
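  69. [Continued] Appendix: a bit more detail
    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds, 2022
    • Basics of processing largish data without Spark, for Google Cloud
    • Building a nice data visualization app with Dash + Cloud Run
    If you're curious, read the rest of the deck. If you're at the venue, let's talk in the Q&A!

  70. Thank you for listening ⚾ Shinichi Nakagawa (@shinyorke)

  71. Sports Data Analysis Basics for Python Users: Served with PySpark and Major League Data. Bonus edition: "Tips and references I didn't cover in the main talk, all at once"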

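  72. Appendix: a bit more detail
    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds, 2022
    • Basics of processing largish data without Spark, for Google Cloud
    • Building a nice data visualization app with Dash + Cloud Run
    • References
  73. Running Dataproc Serverless automatically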

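  74. Automating Dataproc Serverless
    • There are three conceivable approaches.
    1. Build a Docker image that runs the job through the API, and run it as a container somehow (on K8s, for example).
    2. Since it can be run from the CLI (the gcloud command), build a Docker image with the gcloud command (the rest is the same as 1).
    3. Use an Airflow Operator to run Dataproc Serverless.
    • Options 1 and 2 are painful and may defeat the purpose of going serverless (and they are saying almost the same thing). Running them on Cloud Run or similar would be fine, but I sense risk in both building and operating that.
    • The best-practice-looking model answer is option 3: use an Airflow Operator to run Dataproc Serverless.
  75. [Model answer] Run it via an Airflow Operator. Using Google Cloud's managed service "Cloud Composer" looks like a good fit.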

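  76. Automating Dataproc Serverless processing
    • On Google Cloud, the best practice for automating cron-like processing is Pub/Sub + Scheduler (or Cloud Tasks).
    • However, as of October 2022, Dataproc Serverless cannot be triggered through Pub/Sub as an interface, so unfortunately this approach is not available.
    • The smartest approach is therefore to run it with Airflow's Dataproc Operators, standing up and operating an Airflow cluster with Cloud Composer.
    • https://cloud.google.com/composer/docs/composer-2/run-dataproc-workloads
    • Note that Cloud Composer is not serverless (though it is fully managed), and it stands up a K8s (GKE) cluster, so watch the cost (fine for work, but expensive for personal use).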
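As a rough sketch of the Airflow route, the Google provider package ships a `DataprocCreateBatchOperator` that takes a batch definition. The helper names, IDs, and URIs below are hypothetical, and the batch fields shown are the ones I believe correspond to a PySpark batch with a custom subnet; check them against the current provider and Dataproc API documentation before relying on them.

```python
from typing import Any


def build_batch(main_python_file_uri: str, subnetwork_uri: str, jar_file_uris: list[str]) -> dict[str, Any]:
    """Build a Dataproc Serverless batch definition for a PySpark job."""
    return {
        "pyspark_batch": {
            "main_python_file_uri": main_python_file_uri,
            "jar_file_uris": list(jar_file_uris),
        },
        "environment_config": {
            "execution_config": {"subnetwork_uri": subnetwork_uri},
        },
    }


def make_batch_task(task_id: str, project_id: str, region: str, batch_id: str, batch: dict[str, Any]):
    """Create the Airflow task; call this inside a DAG definition."""
    # Deferred import so the helper above works without Airflow installed.
    from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

    return DataprocCreateBatchOperator(
        task_id=task_id,
        project_id=project_id,
        region=region,
        batch=batch,
        batch_id=batch_id,
    )
```

In a Cloud Composer environment, a DAG with a daily schedule would call `make_batch_task` with a unique `batch_id` per run, which replaces the manual gcloud submission.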
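  77. Using Spark in the cloud: outside Google Cloud

  78. Spark service options outside Google Cloud
    • The candidates are AWS, Azure, and Databricks (in a sense, the original).
    • When you treat public clouds as interchangeable infrastructure (for example, when you want to go multi-cloud), Databricks is the leading candidate.
    • AWS is actually well equipped in this area, and choosing between EMR and Glue according to the use case seems like a good approach.
    • I have never touched Azure, so I can't say...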
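  79. Spark service options outside Google Cloud
    Cloud service (not an exhaustive list) | URL | Summary
    Databricks | https://www.databricks.com/jp | An option if you are planning for multi-cloud; developed and offered by the creators of Spark
    AWS EMR | https://aws.amazon.com/jp/emr/ | AWS's managed Spark / Hadoop; the one to pick if you want Spark as Spark
    AWS Glue | https://aws.amazon.com/jp/glue/ | If you use Spark for ETL, Glue is a better fit than EMR
    Azure HDInsight | https://azure.microsoft.com/ja-jp/services/hdinsight/ (overview) | The option on Azure (though I have never used it...)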
  80. Nice data processing without Spark (Dataproc), for Google Cloud

  81. Nice data processing for Google Cloud
    • Dataflow
    • DataFusion
    • Dataprep
    • Cloud Run
    • Cloud Functions
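  82. Pick the right one for the job!
    Google Cloud Service | URL | Summary
    Dataflow | https://cloud.google.com/dataflow?hl=ja | Based on Apache Beam; the choice for streaming processing
    DataFusion | https://cloud.google.com/data-fusion/docs?hl=ja | An ETL-style service for ingesting existing data, including on-premises data
    Dataprep | https://cloud.google.com/dataprep?hl=ja | Centered on data preprocessing and cleansing; closer to low-code
    Cloud Run | https://cloud.google.com/run?hl=ja | The choice if you want to build with your favorite language and framework; trigger it with Pub/Sub and the like
    Cloud Functions | https://cloud.google.com/functions?hl=ja | More constrained than Cloud Run, but good for building and running something quickly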
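  83. Realistic options and rules of thumb
    • For real-time processing, Dataflow is the leading choice.
    • For integrating or consolidating existing data, DataFusion.
    • For data preprocessing for machine learning and the like, Dataprep.
    • If you build and run it yourself, in Python or anything else, Cloud Run.
    • If it is just "Pandas with BigQuery and GCS", Cloud Functions gets it done quickly (perhaps this is actually the most common use case?).
  84. A data visualization dashboard operated with Dash + Cloud Run. Note: Spark and Dataproc do not appear here.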

  85. Dashboard app (the part skipped in the main talk)
    • The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway.
    • Firestore is the main DB, with MemoryStore (Redis) in front as a cache.
    • Spark (PySpark) does not appear here.
  86. Dashboard app (the part skipped in the main talk) (same bullets as slide 85)
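  87. Hosting with Dash + Cloud Run
    • Dash is built on Flask, so it can be hosted the usual way by running it under gunicorn.
    • This is serverless too, so it runs only for the time and resources you actually use. If you want to build your own visualization app, it may be worth a try.
    • On AWS, I think the same approach works with App Runner (though I haven't tried it).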
  88. The CI/CD workflow looks like this.
    • A push to the GitHub repository triggers GitHub Actions: test -> Docker build -> deploy to Cloud Run.
    • Tests run pytest, flake8, and mypy on GitHub Actions (the idea is to cover unit and integration tests).
    • The Docker build follows the standard Cloud Run procedure.
    • Build on Cloud Build.
    • Push to Artifact Registry.
    • Deployment to Cloud Run uses the official GitHub Actions action.
  89. References

  90. Spark / PySpark
    • PySpark Documents https://spark.apache.org/docs/latest/api/python/
    • Learning PySpark (Japanese edition). Note: the book is a little dated, so read it with care. https://www.oreilly.co.jp/books/9784873118185/
    • Large-scale data processing with Python! Basics of data processing and analysis with PySpark (PyCon JP 2017) https://speakerdeck.com/chie8842/pythondeda-liang-detachu-li-pysparkwoyong-itadetachu-li-tofen-xi-falsekihon
  91. Google Cloud (Dataproc)
    • Official documentation https://cloud.google.com/dataproc/docs?hl=ja
    • Official PySpark samples (typing these out is the easy way in) https://github.com/googleapis/python-dataproc
    • Official samples, part 2 (more practical) https://github.com/GoogleCloudDataproc/cloud-dataproc
  92. Google Cloud (for beginners and people who want to start using it)
    • Official documentation https://cloud.google.com/docs?hl=ja
    • Certifications https://cloud.google.com/certification?hl=ja
    • Google Cloud for the Enterprise (a book I recommend) https://www.shoeisha.co.jp/book/detail/9784798175256
  93. My own blog posts (PySpark / data)
    • Making baseball big data nicely usable with GCP and PySpark https://shinyorke.hatenablog.com/entry/dataproc-baseball
    • How to use Spark without managing servers https://shinyorke.hatenablog.com/entry/dataproc-serverless
    • Quickly getting an environment for Spark on Google Cloud https://shinyorke.hatenablog.com/entry/dataproc-terraform
    • Practices for quickly standing up a web app and a data platform https://shinyorke.hatenablog.com/entry/cloud-arch-serverless
  94. Baseball-related blog posts and code
    • An introduction to Statcast data for baseball lovers and data lovers https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
    • Visualizing where batted balls land with Statcast data and Plotly https://shinyorke.hatenablog.com/entry/statcast-visualization-for-batting
    • A sample for exploring Ohtani-san's data from Baseball Savant https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
    • Analyzing Baseball Data with R (Japanese edition) https://gihyo.jp/book/2020/978-4-297-11684-2
  95. Done. Thank you for reading to the end.