PyCon JP 2022 10/15 Talk Session Material
# Reference
https://shinyorke.hatenablog.com/entry/baseball-data-visualization-app
https://shinyorke.hatenablog.com/entry/ohtani-san-pitch-2022
No Baseball, No Engineering!
High Performance Data Platform: Knowledge of PySpark, Cloud and ⚾
The Basics of Sports Data Analysis for Python Users - with PySpark and Major League Data
Shinichi Nakagawa (@shinyorke), 2022/10/15, PyCon JP 2022 Talk Session
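## Onboarding (a guide to this session)
• This talk is about handling GB-plus data nicely with Python, Spark (PySpark), and a public cloud (Google Cloud).
• The content is aimed at intermediate to advanced users, but I would be glad if it also serves as a guidepost for beginners. (Think of what you do not understand or do not know yet as your room to grow.)
• The subject data is Major League Baseball ⚾, plus a little about sports data in general.
• I hope people who are not interested in (or do not like) baseball can enjoy it too. I will do my best to make this talk a reason to get interested in baseball.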
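## Prerequisites and motivation I expect from the audience
• [Must] You have done hands-on data processing and analysis with Pandas or SQL.
• [Must] You have used Python on a public cloud such as Google Cloud (GCP), AWS, or Azure. *Any service counts (EC2, App Engine, etc.).
• Development experience in a fully managed serverless environment (having touched one is enough): AWS Lambda, AWS App Runner, App Engine, Cloud Run, and the like.
• (Whether you like it or not) you know the rules of baseball and who Ohtani-san is.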
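## Who am I?
• Shinichi Nakagawa (@shinyorke)
• Manager at a major foreign-affiliated IT consulting firm (formerly a full-cycle engineer at an operating company)
• Manager of a team that handles cloud infrastructure
• I also do personal development as a hobby that pays off (mainly baseball and physical-care projects)
• I love baseball and programming over a drink.
• Favorite players: Tsuyoshi Shinjo, Chusei Mannami, Kenta Tanigawara (for his strong arm)
#Python #Serverless #GoogleCloud #Baseball #DataScience #SABRmetrics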
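## Today's starting lineup
• Playing with Major League big data
• A nice serverless data platform built with Python and Google Cloud
• Serverless data processing with PySpark + Dataproc
• The "nasty XX" recommended by baseball big data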
# Playing with Major League big data
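## Major League big data
• Major League Baseball records all kinds of data with a system called "Statcast". *Recorded by cameras and radar; some items are recorded statistically or by hand.
• For example, the source of play-by-play calls like the following is always this big data called "Statcast":
  • "Ohtani-san! Home run No. XX! Exit velocity 180 km/h, distance 130 m"
  • "Ohtani-san! A called third strike on a 162 km/h fastball!!!"
• Every single move in the game is recorded: all pitch data and all batted-ball data.
• A regular season (30 teams, 162 games) is roughly 700,000 to 800,000 pitches. Postseason and spring training data exist as well.
• Each record consists of 91 fields (!?), and a regular season comes to roughly 400 MB to 600 MB of data.
• Anyone can browse and download it (in CSV format) at baseballsavant.mlb.com.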
The (official) data specification is here: https://baseballsavant.mlb.com/csv-docs
My commentary and Japanese translation is here: https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
Let me give you a quick look at the data fields.
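???: "This is tough... because I can't tell what the fields are or what they mean."
All 91 fields: the units and measurement conventions are brutal on first sight (& congratulations to Arai-san on becoming Hiroshima's manager).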
## Looking back on "Ohtani-san's 2022" with Statcast data
Let's look at Statcast data using this as the example.
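https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
The explanation is based on the Statcast sample above, and the code is public, so please play with it.
*The original data is in mile/h and feet, but it has been converted to km/h and m beforehand (note that the converted values do not exist in the original data).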
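The unit conversion mentioned above might look like the following. This is a minimal sketch, not the code from the repository: the function names are hypothetical, and only the conversion factors (1 mile = 1.609344 km, 1 foot = 0.3048 m) are fixed by definition.

```python
# Hypothetical helpers; the sample repository may implement this differently.
KM_PER_MILE = 1.609344  # exact by definition
M_PER_FOOT = 0.3048     # exact by definition


def mph_to_kmh(mph: float) -> float:
    """Convert a speed from miles per hour to kilometres per hour."""
    return mph * KM_PER_MILE


def feet_to_m(feet: float) -> float:
    """Convert a distance from feet to metres."""
    return feet * M_PER_FOOT


if __name__ == "__main__":
    # A 100 mph pitch and a 400 ft fly ball expressed in metric units.
    print(f"{mph_to_kmh(100.0):.1f} km/h")  # 160.9 km/h
    print(f"{feet_to_m(400.0):.1f} m")      # 121.9 m
```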
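## Ohtani-san in 2022 becomes "Mr. Slider, Two-Seamer, and Cutter"
• This year's Ohtani-san throws a LOT of sliders.
• Did you notice? In the second half of the season his two-seamer (labeled Sinker in the data) increased!?
• With the slider, two-seamer, and cutter, he changed character into a guy who throws breaking balls with nasty movement.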
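## One of Ohtani-san's starts (2022/9/29: 8 innings, 10 strikeouts, no runs allowed)
The picture: leaning heavily on the slider while pushing with the two-seamer and cutter.
[Figures: pitch locations (catcher's view), release points (catcher's view), pitch-type share]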
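## Viewing Statcast data with Jupyter Lab + Plotly
• There is a wide variety of data, so you can probably learn quite a lot from it.
• It is time-series data, so you can also capture changes in performance.
  • "His velocity is different from before", "Did the two-seamer suddenly increase?", and so on.
• The ball-tracking and coordinate data are all there.
  • With some effort on the coordinate math, 3D rendering should also be possible. (Loose translation: I did not have the time to go that far, so I have not done it.)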
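As a rough illustration of spotting a change in pitch mix over time, the sketch below computes the share of each pitch type per month. It is dependency-free on purpose (in a notebook, a pandas `groupby` would be the natural tool), the rows are made-up illustrative values rather than real Statcast records, and only the column names `game_date` and `pitch_name` come from the Statcast CSV.

```python
from collections import Counter, defaultdict

# Illustrative rows only (not real Statcast records).
ROWS = [
    {"game_date": "2022-04-07", "pitch_name": "4-Seam Fastball"},
    {"game_date": "2022-04-07", "pitch_name": "Slider"},
    {"game_date": "2022-09-29", "pitch_name": "Slider"},
    {"game_date": "2022-09-29", "pitch_name": "Sinker"},
    {"game_date": "2022-09-29", "pitch_name": "Slider"},
    {"game_date": "2022-09-29", "pitch_name": "Cutter"},
]


def pitch_mix_by_month(rows):
    """Return {month: {pitch_name: share}}; shares sum to 1 within a month."""
    counts = defaultdict(Counter)
    for row in rows:
        month = row["game_date"][:7]  # "YYYY-MM"
        counts[month][row["pitch_name"]] += 1
    return {
        month: {name: n / sum(c.values()) for name, n in c.items()}
        for month, c in counts.items()
    }


if __name__ == "__main__":
    for month, mix in sorted(pitch_mix_by_month(ROWS).items()):
        print(month, {name: round(share, 2) for name, share in mix.items()})
```

A pitch type that appears only in later months (here, "Sinker") shows up as a new key, which is the kind of change the slide describes.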
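Me: "I want a setup that lets me check the games every single day."
Partly because https://baseballsavant.mlb.com/ is slightly awkward to use... one day the idea hit me: just turn it into an easy-to-use data platform!
So, I went ahead and built one.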
# A nice serverless data platform built with Python and Google Cloud (baseball edition)
## Architecture overview
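## Architecture notes (roughly: the points I cared about)
• To check the data daily and update the data daily without fuss, the platform is built and operated entirely on "fully managed, serverless cloud services".
• What is a "fully managed, serverless cloud service"?
  • It comes up after a few clicks in the console or a few CLI commands.
  • No infrastructure or server maintenance is needed (the cloud provider does it, not you).
  • More concretely, you do not have to stand up a K8s cluster or VMs yourself (network configuration is still required).
  • It can be built into a GitHub Actions CI/CD pipeline for deployment and scaling.
• Billing is basically pay-for-what-you-use, so it is easy on the wallet.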
## Use cases and the services used
## Dashboard app
• The app itself is hosted on Cloud Run and accesses the backend (Cloud Functions) through API Gateway.
• Firestore is the main DB, with MemoryStore (Redis) in place as a cache.
• Spark (PySpark) does not appear here.
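## Data collection & saving to BigQuery
• A crawler (Cloud Functions) runs periodically to collect data from the source site (Baseball Savant).
• The results are saved as CSV in Google Cloud Storage (GCS). This is the source-of-truth data (data lake).
• A PySpark script running on Dataproc Serverless summarizes the CSV files on GCS and saves the result to BigQuery.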
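## Loading into Firestore (moving data to the database)
• A PySpark script running on Dataproc Serverless converts the BigQuery data into the format used by the dashboard (JSON).
• A Python script loads the results (saved on GCS as JSON) into Firestore.
• Both are currently run by hand (the reason and the workaround come later).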
# Serverless data processing with PySpark + Dataproc
*This is the main topic of the talk.
## Scope of this part
## Spark and PySpark, understood (or so it feels) in 33.4 seconds
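## Spark and PySpark
• A framework for "processing large data nicely in a distributed way".
• Spark itself is implemented mainly in Scala and runs on the JVM, but it is most often used through "PySpark", its Python interface (R is also available, among other languages).
• It runs either as a batch data-processing program or in a notebook (Jupyter Lab, Zeppelin).
• It has a DataFrame interface that feels very familiar to Python users.
  • This is Spark's own DataFrame. It can also be converted to a Pandas DataFrame.
  • Pandas API (Pandas functionality is available from Spark 3.2 onward, with some restrictions).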
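The two DataFrame bridges mentioned above can be sketched as follows. This is an illustrative sketch rather than code from the talk: the function names are hypothetical, while `DataFrame.toPandas()` and the `pyspark.pandas` module (Spark 3.2 or later) are existing PySpark APIs. The PySpark imports are deferred so the module can be loaded without a Spark installation.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    import pandas as pd
    from pyspark.sql import DataFrame


def to_pandas(sdf: "DataFrame") -> "pd.DataFrame":
    """Collect a Spark DataFrame to the driver as a pandas DataFrame.

    Only sensible when the result fits in the driver's memory.
    """
    return sdf.toPandas()


def read_csv_with_pandas_api(path: str):
    """Read a CSV through the pandas API on Spark (Spark 3.2 or later)."""
    import pyspark.pandas as ps  # lazy import: requires a Spark installation

    return ps.read_csv(path)
```

`toPandas()` pulls all rows onto a single machine, so it suits small, already-aggregated results; the pandas API on Spark keeps the data distributed while offering pandas-style calls.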
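## Where to build and operate Spark

| Environment / approach | Build effort | Ease of operation | Notes |
|---|---|---|---|
| Build and operate everything yourself on-premises | Everything must be configured yourself | You have to look after everything yourself | The hardest pattern; tough work even for a full-time infrastructure engineer |
| Build and operate it yourself on cloud VMs / K8s | Everything must be configured yourself | You get some of the benefits of the cloud service | Building a Spark environment yourself is fairly difficult |
| Use a managed service offered by a cloud provider (*most recommended) | Click through the GUI, or run it via CLI/API | Monitor resources such as CPU and maintain as the situation requires | The easiest and smartest way; AWS, Google Cloud, and other vendors offer such services |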
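## Options for operating Spark on Google Cloud

| Environment / approach | Build | Operation | Available features | Notes |
|---|---|---|---|---|
| Create and operate an environment on GCE or GKE | Build it yourself, then install Spark | You have to look after all operations yourself | All features | Not recommended, since in the end it is what Dataproc already does |
| Dataproc | Build with the gcloud command, the API, or the console | Monitor and operate the GKE or GCE environment that Dataproc created | All features | The most standard setup |
| Dataproc Serverless | Build with the gcloud command, the API, or the console | Only monitoring while a job runs; the environment is deleted automatically after processing | Batch processing only; notebooks are not available | The best choice for periodic batch processing |

*There are also plans to offer "Spark in BigQuery", a feature that runs Spark as a BigQuery stored procedure (announced at Google Cloud Next '22).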
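## Dataproc and Dataproc Serverless
• Google Cloud has a managed Spark (Hadoop) service called Dataproc.
• Until now it could only be operated on the premise that a host or cluster exists, on GCE or GKE (K8s), but quite recently a Serverless option was born.
• Serverless works for batch operation such as "once a day" or "every 30 minutes". Note that running notebooks (Jupyter, etc.) is not supported, so it cannot be used ad hoc.
• Serverless is billed only for what you use, so it is easy on the wallet.
• The version is Spark 3.2, and the Pandas API is available from PySpark (though I did not use it this time).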
## Tasks I did with PySpark
• Data collection & loading data into BigQuery
• Loading data into the DB for the dashboard app (Firestore)
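## [Recap] Data collection & saving to BigQuery
• A crawler (Cloud Functions) runs periodically to collect data from the source site (Baseball Savant).
• The results are saved as CSV in Google Cloud Storage (GCS). This is the source-of-truth data (data lake).
• A PySpark script running on Dataproc Serverless summarizes the CSV files on GCS and saves the result to BigQuery.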
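## Data collection (not Spark)
• Web scraping is not something Spark should be doing.
• The task is implemented with requests-html and operated on Cloud Functions.
• It runs periodically via a cron setting in Cloud Scheduler and saves to GCS.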
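## Loading CSV data into BigQuery
• One of the tasks whose scope and processing are a good fit for Dataproc.
• Pick up files from a GCS path, process them with Spark SQL, and send them to BigQuery.
• If you know DataFrames and SQL, you can implement and operate it comfortably.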
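## Let's use Dataproc
• A good way to start is to type along with the Google Cloud documentation and samples.
  • https://cloud.google.com/dataproc
  • https://cloud.google.com/dataproc-serverless/docs
  • https://github.com/GoogleCloudDataproc/cloud-dataproc
• For Serverless, you need to create a VPC subnet in advance and specify it at run time.
• From the next page on, I introduce a few samples of doing this with PySpark.
  • Batch processing of the "read data, transform it, write it" kind, based on Spark DataFrames.
  • To make it clear which class is which, the code is implemented with type hints.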
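## Implementation, step 1: create a session
• Something like a DB connection.
• Create a SparkSession object.
• Specify parameters when creating the object.
• Note that specifying the JAR is mandatory when using BigQuery.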
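A session of this kind might be created as sketched below. This is not the talk's code: the function name is hypothetical, and the JAR location is an assumption (the connector build that Google hosts publicly; in real use, pin a version that matches your Spark and Scala runtime). `SparkSession.builder`, `appName`, `config`, and `getOrCreate` are existing PySpark APIs.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import SparkSession

# Assumption: publicly hosted spark-bigquery-connector JAR.
BIGQUERY_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def create_session(app_name: str, jar: str = BIGQUERY_JAR) -> "SparkSession":
    """Create (or reuse) a SparkSession that can load the BigQuery connector."""
    from pyspark.sql import SparkSession  # lazy import: requires pyspark

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", jar)  # JAR needed for format("bigquery")
        .getOrCreate()
    )
```

On Dataproc the JAR can alternatively be supplied when the job is submitted, in which case the `config` call becomes redundant.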
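## Implementation, step 2: create a schema
• For CSV, you create a schema.
• It is absolutely necessary in order to give types to the resulting DataFrame.
• This time I wrote out the schema for all 91 fields by hand (tears).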
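One compact way to write such a schema is a DDL string, which Spark's reader accepts in place of a `StructType`. The sketch below is an assumption-laden illustration, not the talk's 91-field schema: it lists only the first few Statcast columns (order and types as I understand them from the CSV docs), and the helper name is hypothetical. Because a CSV schema is applied by position, a real schema has to list every column in file order.

```python
from typing import List, Sequence, Tuple

# First few Statcast columns only: (column name, Spark SQL type).
# Assumed order and types; verify against the official CSV docs.
STATCAST_COLUMNS: List[Tuple[str, str]] = [
    ("pitch_type", "STRING"),
    ("game_date", "DATE"),
    ("release_speed", "DOUBLE"),
    ("release_pos_x", "DOUBLE"),
    ("release_pos_z", "DOUBLE"),
    ("player_name", "STRING"),
    ("batter", "INT"),
    ("pitcher", "INT"),
    ("events", "STRING"),
    ("description", "STRING"),
]


def to_ddl(columns: Sequence[Tuple[str, str]]) -> str:
    """Build a DDL schema string such as 'a INT, b STRING'."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns)


if __name__ == "__main__":
    print(to_ddl(STATCAST_COLUMNS))
```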
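## Implementation, step 3: read the CSV
• Use `read` on the Spark session and specify CSV as the format.
• Specify the schema created earlier for the header.
• Specify the full GCS path.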
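A read of this shape might look like the sketch below. The function name and the example path are hypothetical; `spark.read.format("csv")`, `option("header", ...)`, `schema(...)` (which accepts a `StructType` or a DDL string), and `load(...)` are existing PySpark reader calls.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def read_statcast_csv(spark: "SparkSession", path: str, schema: str) -> "DataFrame":
    """Read CSV files from a full GCS path with an explicit schema.

    `path` is something like "gs://<bucket>/<prefix>/*.csv" (hypothetical).
    """
    return (
        spark.read.format("csv")
        .option("header", True)  # the first line of each file is a header row
        .schema(schema)          # typed columns instead of all-string inference
        .load(path)
    )
```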
## Implementation, step 4: save to BigQuery
• Use the DataFrame's `write`.
• Specify `bigquery` as the format.
• The example appends records to an existing table.
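An append of this kind could be sketched as follows, assuming the spark-bigquery-connector is on the classpath. The function and parameter names are hypothetical; `temporaryGcsBucket` is the connector option used for indirect writes, and the table is given as `dataset.table` or `project.dataset.table`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def append_to_bigquery(df: "DataFrame", table: str, temp_bucket: str) -> None:
    """Append the rows of a Spark DataFrame to an existing BigQuery table."""
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)  # staging area for the load
        .mode("append")                             # add rows, keep existing ones
        .save(table)
    )
```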
## Running it with Dataproc Serverless
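Submitting a PySpark script as a Dataproc Serverless batch is done with `gcloud dataproc batches submit pyspark`. The sketch below only assembles and runs that command from Python; the helper names and all argument values are hypothetical, and it assumes the gcloud CLI is installed and authenticated. The `--subnet` flag carries the VPC subnet that Serverless requires.

```python
import subprocess
from typing import List, Sequence


def build_submit_command(
    script_uri: str, region: str, subnet: str, jars: Sequence[str]
) -> List[str]:
    """Assemble the gcloud command for a Dataproc Serverless PySpark batch."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script_uri,
        f"--region={region}",
        f"--subnet={subnet}",
        f"--jars={','.join(jars)}",
    ]


def submit(command: Sequence[str]) -> None:
    """Run the command and raise CalledProcessError if gcloud fails."""
    subprocess.run(list(command), check=True)
```

Keeping command construction separate from execution makes the command easy to log or review before anything is actually submitted.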
## Writing files from BigQuery to GCS, for Dataproc
• Load the BigQuery data into a Spark DataFrame.
• Write the Spark DataFrame out to files.
The way to run it (gcloud CLI) is the same as before, so it is omitted here.
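## [Recap] Loading into Firestore (moving data to the database)
• A PySpark script running on Dataproc Serverless converts the BigQuery data into the format used by the dashboard (JSON).
• A Python script loads the results (saved on GCS as JSON) into Firestore.
• Both are currently run by hand (the reason and the workaround come later).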
## Implementation, step 5: read from BigQuery
• As with saving, specify the BigQuery JAR.
• Specify BigQuery in `spark.read`.
• When reading from a BigQuery view, extra options are required.
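Reading from a view might be sketched like this, again assuming the spark-bigquery-connector. The function and parameter names are hypothetical; `viewsEnabled` and `materializationDataset` are connector options (the connector materializes the view into a temporary table in that dataset before reading it).

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def read_bigquery_view(
    spark: "SparkSession", view: str, materialization_dataset: str
) -> "DataFrame":
    """Load a BigQuery view into a Spark DataFrame."""
    return (
        spark.read.format("bigquery")
        .option("viewsEnabled", "true")  # required to read views
        .option("materializationDataset", materialization_dataset)
        .load(view)  # "project.dataset.view"
    )
```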
## Implementation, step 6: save to GCS
• Use the DataFrame's `write`.
• Specify `json` as the format.
• Specify the final destination path.
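A JSON export of this kind could be sketched as below. The function name is hypothetical and the `overwrite` mode is my assumption, not something the slide states. Note that Spark writes a directory of part files in JSON Lines form under the given path, not a single JSON document.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def write_json(df: "DataFrame", path: str) -> None:
    """Write a Spark DataFrame as JSON Lines files under a GCS path."""
    (
        df.write.format("json")
        .mode("overwrite")  # assumption: replace output from a previous run
        .save(path)         # e.g. "gs://<bucket>/<prefix>/" (hypothetical)
    )
```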
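## PySpark and Dataproc Serverless
• It makes the use case "use Spark only when I want to use it" possible. This is the biggest reason to use a serverless service.
• At the data size of this application (under 1 GB per year) the benefit is hard to feel, but for a use case like "quickly batch-process GB-scale data" (preprocessing, cleansing, and so on) it should be quite handy.
• The code is lightweight, of the "run it only when there is something to process" kind, so it goes extremely well with PySpark.
• Automating the processing is a bit quirky: Cloud Composer (Airflow) is required. *See the Appendix of these slides for details.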
The technical part ends here. From here on...
It's baseball tiiiiime ⚾
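## Japanese pro baseball 2022: the BEST 5 events that moved me
1. Fighters: Mannami, Kiyomiya, Tamiya, and other young players with outstanding athleticism came to the fore.
2. FIGHTERS GIRL 2022: the razor-sharp "fox dance" became a huge hit, easily beating many ceremonial-first-pitch videos in Pacific League TV view counts.
3. The retirements of Yoshio Itoi, who defined an era with power, a strong arm, speed, and defense, and of Kosuke Fukudome.
4. A wild bird that had settled in at the ballpark loses to Chiba Lotte outfielder Kakunaka swinging his bat around.
5. Roki Sasaki's perfect game, Munetaka Murakami's Triple Crown plus home run record. Isn't that insane?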
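## The "nasty outfielders" recommended by Statcast (& me)
• Freakish athleticism, razor sharp.
• Power, a strong arm, good defense, and legs are their selling points.
• A certain wildness in how they swing the bat.
• I introduce three favorites, picked on the basis of exit velocity.
• Alien-like players reminiscent of Tsuyoshi Shinjo in his playing days make wonderful outfielders.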
## The nasty outfielders introduced today
• Judge, Aaron
• Rodríguez, Julio
• Buxton, Byron
Three outfielders with 300 or more plate appearances who hit the ball hard, rack up extra-base hits, and as a rule play center field.
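## Aaron Judge (2022 home run king)
• A Yankees slugger and Ohtani-san's rival.
• The strongest home run hitter currently active.
• Not just power: his selling points include outfield defense that makes use of his 2 m frame and the mobility to play center field.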
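## Julio Rodríguez (Seattle's rising star)
• A rising star who burst onto the scene with the Mariners. He was a rookie this year, by the way.
• His appeal is athletic play reminiscent of BIG BOSS in his younger days.
• If his launch angle rises and his barrels increase, might he gradually get closer to Ichiro? I hope he lives up to his decade-scale contract!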
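## Byron Buxton (the Mannami of Minnesota)
• The undisputed center fielder of the Minnesota Twins.
• Outrageous legs and arm that make you think he could succeed in sports other than baseball, and yet his launch angle is that of a power hitter.
• He resembles Chusei Mannami (Fighters) in being a bit rough around the edges and in his great build. Manchu, become the Buxton of Japan!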
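He did not play the outfield this year, but this man, too, was a nasty hitter after all.
Ohtani-san!! He makes my heart skip a beat. He ranked 2nd in maximum exit velocity among hitters with 300 or more plate appearances (by my own research).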
# Closing
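## [Recap] Today's starting lineup
• Playing with Major League big data
• A nice serverless data platform built with Python and Google Cloud
• Serverless data processing with PySpark + Dataproc
• The "nasty athletic outfielders" recommended by big data
Did you enjoy it? There was a lot of information, so it may be hard to digest right away. The slides will be published, so please read them as a review.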
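## Today's talk in summary
• Baseball is a fun subject for sports data analysis! The tracking data on Baseball Savant is a good place to start.
• PySpark is a good choice for processing data nicely with Python.
• PySpark runs in the cloud; today I introduced Dataproc.
• Once you can use the cloud in a serverless way, many things get easier (though there are limitations).
• The majors have nasty outfielders, but Ohtani-san, the slider and two-seamer monster, is strong too.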
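## For those thinking of using this as a reference at work
• The approach and architecture introduced here are not the definitive answer or a best practice. For example, there are certainly situations where you should, and should not, go with a serverless architecture.
• This is the culmination of what I (shinyorke) wanted to do, built by packing in the things I like (and the things I wanted to try); it is just one way of arriving at an answer.
• More to the point, I built it as a prototype to see "how far can you get with serverless and PySpark?", and I actually plan to remove Spark in the future (details in the Appendix).
• If you copy it as-is (with only a half-understanding of the context), it will blow up in your face. Get your hands dirty first, and use this as a reference for finding what works for you by learning and running things yourself!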
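## [Continued] Appendix - a slightly more detailed discussion
• Running Dataproc Serverless automatically
• The state of Spark services on AWS and other clouds, 2022
• The basics of processing largish data without Spark, for Google Cloud
• Building a nice data visualization app with Dash + Cloud Run
If you are curious, read the rest of the slides, and if you are at the venue, let's talk in the Q&A!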
Thank you for listening ⚾
Shinichi Nakagawa (@shinyorke)
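The Basics of Sports Data Analysis for Python Users - with PySpark and Major League Data
Bonus edition: "Tips and reference material not covered in the main talk, released all at once"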
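## Appendix - a slightly more detailed discussion
• Running Dataproc Serverless automatically
• The state of Spark services on AWS and other clouds, 2022
• The basics of processing largish data without Spark, for Google Cloud
• Building a nice data visualization app with Dash + Cloud Run
• References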
# Running Dataproc Serverless automatically
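## Automating Dataproc Serverless
• There are three conceivable approaches:
  1. Build a Docker image that runs the job through the API, then run it as a container in some way (K8s, etc.).
  2. Since it can be run from the CLI (gcloud command), build a Docker image with the gcloud command (the rest is the same as 1).
  3. Run Dataproc Serverless with an Airflow Operator.
• Options 1 and 2 are painful and may defeat the point of going serverless (and 1 and 2 say almost the same thing). It would be fine if it could run on Cloud Run, but I sense risk in both building and operating it.
• The answer that looks like best practice is "3. Run Dataproc Serverless with an Airflow Operator".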
## [Model answer] Run it via an Airflow Operator
Using Google Cloud's managed service "Cloud Composer" looks like a good way to do this.
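## Automating Dataproc Serverless processing
• On Google Cloud there is a best practice of automating cron-like processing with Pub/Sub + Cloud Scheduler (or Cloud Tasks).
• However, as of October 2022, there is no way to run Dataproc Serverless with Pub/Sub as the interface, so unfortunately this method cannot be used.
• The smartest way is therefore to run it with Airflow's Dataproc operators: bring up an Airflow cluster with Cloud Composer and operate it.
  • https://cloud.google.com/composer/docs/composer-2/run-dataproc-workloads
• Note that Cloud Composer is not serverless (although it is fully managed), and it stands up a K8s (GKE) cluster, so watch the cost (fine for work, perhaps, but expensive for personal use).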
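An Airflow DAG along these lines might look like the sketch below. It is an illustration under assumptions, not the author's DAG: the DAG id, task id, schedule, and all argument values are hypothetical, and `schedule_interval` assumes an Airflow 2.x release of the period (newer releases name this argument `schedule`). `DataprocCreateBatchOperator` comes from the Google provider package, and the `batch` dictionary follows the field names of the Dataproc Batch resource.

```python
from datetime import datetime
from typing import Dict, Sequence


def build_batch(main_uri: str, jar_uris: Sequence[str], subnetwork_uri: str) -> Dict:
    """Describe a Dataproc Serverless PySpark batch (Batch resource fields)."""
    return {
        "pyspark_batch": {
            "main_python_file_uri": main_uri,
            "jar_file_uris": list(jar_uris),
        },
        "environment_config": {
            "execution_config": {"subnetwork_uri": subnetwork_uri},
        },
    }


def build_dag(project_id: str, region: str, batch: Dict):
    """Create a daily DAG that submits the batch (requires Airflow)."""
    from airflow import DAG  # lazy imports: only available inside Airflow
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    with DAG(
        dag_id="statcast_daily_batch",   # hypothetical name
        schedule_interval="0 9 * * *",   # hypothetical schedule
        start_date=datetime(2022, 10, 1),
        catchup=False,
    ) as dag:
        DataprocCreateBatchOperator(
            task_id="run_pyspark_batch",
            project_id=project_id,
            region=region,
            batch=batch,
            batch_id="statcast-{{ ds_nodash }}",  # one batch ID per run date
        )
    return dag
```

In a DAG file on Cloud Composer, the result would be assigned at module level (for example `dag = build_dag(...)`) so that the scheduler can discover it.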
# Using Spark in the cloud: options other than Google Cloud
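## Spark service options outside Google Cloud
• The candidates are AWS, Azure, and (in a sense the original) Databricks.
• When the use case treats public clouds as infrastructure, Databricks becomes the leading candidate (for example, when you want to go multi-cloud).
• AWS is actually well equipped in this area; choosing between EMR and Glue according to the use case seems like a good approach.
• I have never touched Azure, so I cannot say...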
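## Spark service options outside Google Cloud

| Cloud service (*not exhaustive) | URL | Summary |
|---|---|---|
| Databricks | https://www.databricks.com/jp | An option when multi-cloud is in view; developed and provided by the creators of Spark |
| AWS EMR | https://aws.amazon.com/jp/emr/ | AWS's managed Spark/Hadoop; the one to pick if you want to use it as Spark |
| AWS Glue | https://aws.amazon.com/jp/glue/ | When using Spark for ETL, Glue is a better choice than EMR |
| Azure HDInsight | https://azure.microsoft.com/ja-jp/services/hdinsight/#overview | The option on Azure (though I have never used it...) |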
# Nice data processing without Spark (Dataproc), for Google Cloud
## Nice data processing for Google Cloud
• Dataflow
• DataFusion
• Dataprep
• Cloud Run
• Cloud Functions
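## Pick the right one for the job!

| Google Cloud service | URL | Summary |
|---|---|---|
| Dataflow | https://cloud.google.com/dataflow?hl=ja | Based on Apache Beam; the one to use for streaming |
| DataFusion | https://cloud.google.com/data-fusion/docs?hl=ja | An ETL-style service for ingesting existing data, including on-premises data |
| Dataprep | https://cloud.google.com/dataprep?hl=ja | Centered on data preprocessing and cleansing; closer to low-code |
| Cloud Run | https://cloud.google.com/run?hl=ja | The one to use when building with your favorite language and framework; triggered via Pub/Sub |
| Cloud Functions | https://cloud.google.com/functions?hl=ja | More restricted than Cloud Run, but good for quickly building and running something |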
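## Realistic options and rules of thumb
• For real-time processing, Dataflow is the leading option.
• For integrating or consolidating with existing data, DataFusion.
• For preprocessing data for machine learning, Dataprep.
• If you build and run it yourself, in Python or any other language, Cloud Run.
• If it is just "Pandas with BigQuery and GCS", Cloud Functions gets it done quickly (this may well be the most common use case).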
# A data visualization dashboard operated with Dash + Cloud Run
*Spark and Dataproc do not appear here.
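## Dashboard app (the part skipped in the main talk)
• The app itself is hosted on Cloud Run and accesses the backend (Cloud Functions) through API Gateway.
• Firestore is the main DB, with MemoryStore (Redis) in place as a cache.
• Spark (PySpark) does not appear here.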
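## Hosting with Dash + Cloud Run
• Dash is built on Flask, so it can be hosted the usual way by running it under gunicorn.
• This is serverless too, so it runs only for the time and resources actually used. If you want to build your own visualization app, it may be worth a try.
• On AWS, I think the same approach works with App Runner (though I have not tried it).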
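The hosting approach could be sketched as follows. This is a minimal illustration, not the author's app: the layout, function names, and module name `app` are hypothetical. It relies on two existing behaviors: a Dash app exposes its underlying Flask (WSGI) application as `app.server`, and gunicorn can call an application factory given as `module:function()`. Cloud Run tells the container which port to listen on through the `PORT` environment variable.

```python
import os
from typing import List


def create_server():
    """App factory for gunicorn: returns the Flask app behind a Dash app."""
    from dash import Dash, html  # lazy import: requires the dash package

    app = Dash(__name__)
    app.layout = html.Div([html.H1("Statcast dashboard")])  # placeholder layout
    return app.server  # the WSGI application gunicorn serves


def gunicorn_command(port: int, module: str = "app") -> List[str]:
    """Command a container entrypoint could run to serve the dashboard."""
    return ["gunicorn", "--bind", f"0.0.0.0:{port}", f"{module}:create_server()"]


if __name__ == "__main__":
    # Cloud Run provides PORT; 8080 is used here as a local fallback.
    print(" ".join(gunicorn_command(int(os.environ.get("PORT", "8080")))))
```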
## The CI/CD workflow looks like this
• A push to the GitHub repository triggers GitHub Actions: test -> Docker build -> deploy to Cloud Run.
• Tests run pytest, flake8, and mypy on GitHub Actions (the idea is to cover unit through integration tests).
• The Docker build follows the standard Cloud Run procedure.
  • Build on Cloud Build.
  • Push to Artifact Registry.
• Deployment to Cloud Run uses the official GitHub Actions action.
# References
## Spark / PySpark
• PySpark Documents: https://spark.apache.org/docs/latest/api/python/
• 入門PySpark (*a somewhat old book, so take care with the content): https://www.oreilly.co.jp/books/9784873118185/
• Processing Large Data with Python! The Basics of Data Processing and Analysis with PySpark (PyCon JP 2017): https://speakerdeck.com/chie8842/pythondeda-liang-detachu-li-pysparkwoyong-itadetachu-li-tofen-xi-falsekihon
## Google Cloud (Dataproc)
• Official documentation: https://cloud.google.com/dataproc/docs?hl=ja
• Official PySpark samples (typing along from here is the easy route): https://github.com/googleapis/python-dataproc
• Official samples, part 2 (more practical): https://github.com/GoogleCloudDataproc/cloud-dataproc
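## Google Cloud (for beginners and those who want to start using it)
• Official documentation: https://cloud.google.com/docs?hl=ja
• Certifications: https://cloud.google.com/certification?hl=ja
• エンタープライズのためのGoogle Cloud (Google Cloud for the Enterprise; a book I recommend): https://www.shoeisha.co.jp/book/detail/9784798175256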
## My own blog posts (PySpark / data)
• Making baseball big data nice and easy to use with GCP and PySpark: https://shinyorke.hatenablog.com/entry/dataproc-baseball
• How to use Spark without managing servers: https://shinyorke.hatenablog.com/entry/dataproc-serverless
• Quickly getting an environment for using Spark on Google Cloud: https://shinyorke.hatenablog.com/entry/dataproc-terraform
• Practices for quickly standing up a web app and a data platform: https://shinyorke.hatenablog.com/entry/cloud-arch-serverless
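## Baseball-related reference blogs and code
• An introduction to Statcast data for baseball lovers and data lovers: https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
• Visualizing "where batted balls land" with Statcast data and Plotly: https://shinyorke.hatenablog.com/entry/statcast-visualization-for-batting
• A sample for viewing Ohtani-san's data from Baseball Savant: https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
• Rによるセイバーメトリクス入門 (An Introduction to Sabermetrics with R): https://gihyo.jp/book/2020/978-4-297-11684-2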
Done. Thank you for reading to the end.