Slide 1

Slide 1 text

No Baseball, No Engineering! High Performance Data Platform Knowledge of PySpark, Cloud and ⚾
Sports Data Analysis Basics for Python Users - with PySpark and Major League Baseball Data
Shinichi Nakagawa@shinyorke
2022/10/15 PyConJP 2022 Talk Session

Slide 2

Slide 2 text

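Onboarding (about this session)
• This talk is about using Python, Spark (PySpark) and a public cloud (Google Cloud) to process several GB or more of data nicely.
• The content is aimed at intermediate to advanced users, but I'd be glad if it also gives beginners a direction to head in. (Anything you don't understand or don't know yet is your room to grow.)
• The data comes from Major League Baseball ⚾, with a little talk about sports data in general.
• I hope people who aren't interested in baseball (or don't like it) can enjoy this too. I'll do my best to make today's talk a reason to get interested in baseball.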

Slide 3

Slide 3 text

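Prerequisites and motivation I expect from you
• [Must] You have done hands-on data processing and analysis with Pandas or SQL.
• [Must] You have used Python on a public cloud such as Google Cloud (GCP), AWS or Azure. Note: any service counts (EC2, App Engine, etc.).
• Development experience in a fully managed serverless environment (having touched one is enough). AWS Lambda, AWS App Runner, App Engine, Cloud Run and the like qualify.
• (Whether you like it or not) you know the rules of baseball and who Ohtani-san is.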

Slide 4

Slide 4 text

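Who am I? (Who are you, anyway?)
• Shinichi Nakagawa@shinyorke
• Manager at a major foreign-owned IT consulting firm (formerly a full-cycle engineer at an operating company)
• Manager of a team that handles cloud infrastructure
• I do personal development as both a hobby and a practical pursuit (mainly for baseball and physical care)
• I love baseball, and programming over a drink.
• Favorites: Tsuyoshi Shinjo, Chusei Mannami, Kenta Tanigawara (and his strong arm)
#Python #Serverless #GoogleCloud #Baseball #DataScience #SABRmetrics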

Slide 5

Slide 5 text

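Today's starting lineup
• Playing with Major League Baseball big data
• A nice serverless data platform built with Python and Google Cloud
• Serverless data processing with PySpark + Dataproc
• The "nasty ○○s" that baseball big data recommends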

Slide 6

Slide 6 text

Playing with Major League Baseball big data

Slide 7

Slide 7 text

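Major League Baseball big data
• MLB records all kinds of data with a system called "Statcast". Note: recorded by cameras and radar; some values are derived statistics or entered by hand.
• For example, commentary like the following is all sourced from this Statcast big data.
  • Ohtani-san! Home run No. ○! Exit velocity 180 km/h, distance 130 m
  • Ohtani-san! Strikeout looking on a 162 km/h fastball!!!
• Every single move in a game, and every pitch and batted ball, is recorded.
• A regular season (30 teams, 162 games) comes to roughly 700,000 to 800,000 pitches. Postseason and spring training data exist too.
• Each record has 91 fields (!?), and a regular season amounts to roughly 400 MB to 600 MB of data.
• Anyone can browse and download it (in CSV format) at baseballsavant.mlb.com.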

Slide 8

Slide 8 text

The (official) data specification is here: https://baseballsavant.mlb.com/csv-docs
My commentary and Japanese translation is here: https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
Let me give you a quick glimpse of the data fields.

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

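???: "It's painful... because I can't tell what the fields mean."
All 91 fields, along with their units and measurement standards, are a trap for first-timers (& congrats to Arai-san on becoming Hiroshima's manager).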

Slide 12

Slide 12 text

Looking back on "Ohtani-san's 2022" through Statcast data
Let's walk through the Statcast data using this as an example.

Slide 13

Slide 13 text

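https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
The explanation is based on the Statcast sample above. The code is public, so please play with it.
Note: the original data is in mile/h and feet, but it has been converted to km/h and m in advance (these converted values are not in the original data, so take care).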
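The unit conversion mentioned in the note can be sketched as below. This is only an illustration, not the code from the repository: the function names are made up here, and only the conversion factors (1 mile = 1.609344 km, 1 foot = 0.3048 m) are fixed facts.

```python
# Minimal sketch of the mph -> km/h and feet -> m conversion.
# Function names are hypothetical; they are not taken from the sample repository.

MPH_TO_KMH = 1.609344  # 1 mile is exactly 1.609344 km
FEET_TO_M = 0.3048     # 1 foot is exactly 0.3048 m


def mph_to_kmh(mph: float) -> float:
    """Convert a speed in miles per hour to kilometres per hour."""
    return mph * MPH_TO_KMH


def feet_to_m(feet: float) -> float:
    """Convert a distance in feet to metres."""
    return feet * FEET_TO_M


if __name__ == "__main__":
    # A 100 mph pitch and a 400 ft fly ball, as round example inputs.
    print(round(mph_to_kmh(100.0), 1))
    print(round(feet_to_m(400.0), 1))
```

In a DataFrame the same factors would simply be multiplied into the speed and distance columns before plotting.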

Slide 14

Slide 14 text

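Ohtani-san in 2022: turning into a slider, two-seam and cutter specialist
• This year's Ohtani-san is throwing a LOT of sliders.
• Did you notice? In the second half, the two-seamer (Sinker in the data) is on the rise!?
• With the slider, two-seamer and cutter, he has changed character into a guy who throws nasty breaking stuff.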

Slide 15

Slide 15 text

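One of Ohtani-san's starts (2022/9/29: 8 innings, 10 strikeouts, no runs)
The picture: nearly half sliders, then pushing through with the two-seamer and cutter.
[Figure panels: pitch locations (catcher's view) · release points (catcher's view) · pitch-type share]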

Slide 16

Slide 16 text

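Looking at Statcast data with Jupyter Lab + Plotly
• There is a wide variety of data, so you can learn quite a lot.
• It's time-series data, so you can also catch changes in performance.
  • "His velocity is different from before", "suddenly more two-seamers?", and so on.
• Ball velocity and coordinate data are all there.
  • With some effort on the coordinate calculations, 3D rendering should be possible. (Translation: I didn't have the time to make that effort this time, so I haven't done it.)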
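As a rough idea of the "pitch-type share" view, here is a small standard-library sketch. It assumes the CSV has a `pitch_type` column (the Statcast export does); the sample rows are invented for illustration and are not real pitch counts, and the notebook in the talk uses Plotly rather than this code.

```python
import csv
import io
from collections import Counter

# Invented sample rows, only to make the sketch runnable (not real data).
SAMPLE_CSV = """game_date,pitch_type
2022-09-29,SL
2022-09-29,SL
2022-09-29,SI
2022-09-29,FC
"""


def pitch_mix(csv_text: str) -> dict[str, float]:
    """Return the share of each pitch type, e.g. {"SL": 0.5, ...}."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the pitch type is empty.
    counts = Counter(row["pitch_type"] for row in rows if row["pitch_type"])
    total = sum(counts.values())
    return {pitch: n / total for pitch, n in counts.items()}


if __name__ == "__main__":
    for pitch, share in pitch_mix(SAMPLE_CSV).items():
        print(f"{pitch}: {share:.0%}")
```

Grouping by `game_date` first would give the per-start breakdown needed to spot changes over the season.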

Slide 17

Slide 17 text

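Me: "I want a setup for checking every game, every day."
https://baseballsavant.mlb.com/ is also slightly awkward to use at times... lol
So one day the idea hit me: just turn it into an easy-to-use data platform!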

Slide 18

Slide 18 text

So, I built a little something.

Slide 19

Slide 19 text

A nice serverless data platform built with Python and Google Cloud (baseball edition)

Slide 20

Slide 20 text

Architecture overview

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

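Architecture notes (≈ the points I cared about)
• To check and update the data every day without fuss, the platform is built and operated entirely on "fully managed, serverless cloud services".
• What is a "fully managed, serverless cloud service"?
  • It comes up after a few clicks or commands in the CLI or console.
  • No infrastructure or server maintenance is needed (the cloud service does it, not you).
  • More concretely, you don't have to stand up a K8s cluster or VMs yourself (network and similar settings are still required).
  • It can be wired into a CI/CD pipeline such as GitHub Actions for deployment and scaling.
• Billing is basically pay-as-you-go, so it's easy on the wallet too.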

Slide 23

Slide 23 text

Use cases and the services used

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Dashboard app
• The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway.
• Firestore is the main DB, with MemoryStore (Redis) placed in front as a cache.
• Spark (PySpark) does not appear here.

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

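Data collection & saving to BigQuery
• A crawler (Cloud Functions) runs periodically to collect data from the source site (Baseball Savant).
• The results are saved as CSV in Google Cloud Storage (GCS). This is the source data (Datalake).
• A PySpark script running on Dataproc Serverless summarizes and tidies the CSV on GCS and saves it to BigQuery.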

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

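Loading into Firestore (moving data to the Database)
• A PySpark script running on Dataproc Serverless converts the BigQuery data into the dashboard's data format (JSON).
• A Python script then loads the result (saved on GCS in JSON format) into Firestore.
• Both are run manually for now (the reason and the workaround come later).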

Slide 30

Slide 30 text

Serverless data processing with PySpark + Dataproc
Note: this is the main topic of the talk.

Slide 31

Slide 31 text

Scope of this part

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Spark and PySpark in 33.4 seconds (enough to feel like you get it)

Slide 34

Slide 34 text

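Spark and PySpark
• A framework for "processing large data nicely in a distributed way".
• Spark itself is implemented on the JVM (mainly in Scala), but it is most often used through "PySpark", its Python interface (R is available among other languages).
• Run it as a batch data-processing program, or interactively in a notebook with Jupyter Lab or Zeppelin.
• It has a DataFrame interface that feels very familiar to Python users.
  • Spark has its own DataFrame, which can be converted to a Pandas DataFrame.
  • Pandas API (Pandas functionality is available from Spark 3.2 onward, with some restrictions).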

Slide 35

Slide 35 text

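Where to build and operate Spark
Environment / approach | Setup effort | Ease of operation | Notes
On-premises, everything built and operated in-house | You have to configure everything yourself | You have to look after everything yourself | The hardest pattern; tough work even for a professional infrastructure engineer
VM / K8s in the cloud, built and operated in-house | You have to configure everything yourself | You get some of the benefits of cloud services | Building a Spark environment yourself is fairly difficult
Managed service offered by a cloud provider (the most recommended approach) | Click through the GUI, or run it via CLI / API | Monitor CPU and other resources; maintain as the situation requires | The easiest and smartest way; AWS, Google Cloud and other vendors offer such services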

Slide 36

Slide 36 text

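Options for operating Spark on Google Cloud
Environment / approach | Setup | Operation | Available features | Notes
Build your own environment on GCE or GKE | Build it yourself, then install Spark | You operate and look after everything yourself | All features | In the end Dataproc can do the same thing, so not recommended
Dataproc | Build with the gcloud command, the API or the console | Monitor and operate the GKE or GCE environment that Dataproc created | All features | The most standard configuration
Dataproc Serverless | Build with the gcloud command, the API or the console | Only monitor the job while it runs; the environment is deleted automatically afterwards | Batch processing only; notebooks are not available | The best choice for periodic batch processing
Note: "Spark in BigQuery", a feature that runs Spark as BigQuery stored procedures, has been announced (at Google Cloud Next '22).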

Slide 37

Slide 37 text

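Dataproc and Dataproc Serverless
• Google Cloud has a managed Spark (Hadoop) service called Dataproc.
• Until now it could only be operated on the premise that a host or cluster exists on GCE or GKE (K8s), but a Serverless option appeared very recently.
• For batch operation such as "once a day" or "every 30 minutes", Serverless works! Note that running notebooks (Jupyter etc.) is not supported, so it cannot be used ad hoc.
• Serverless is pay-as-you-go, so it's easy on the wallet too.
• The version is Spark 3.2, so the Pandas API is available from PySpark (though it isn't used this time).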

Slide 38

Slide 38 text

Tasks I did with PySpark
• Data collection & loading data into BigQuery
• Loading data into the dashboard app's DB (Firestore)

Slide 39

Slide 39 text

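[Recap] Data collection & saving to BigQuery
• A crawler (Cloud Functions) runs periodically to collect data from the source site (Baseball Savant).
• The results are saved as CSV in Google Cloud Storage (GCS). This is the source data (Datalake).
• A PySpark script running on Dataproc Serverless summarizes and tidies the CSV on GCS and saves it to BigQuery.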

Slide 40

Slide 40 text

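Data collection (not Spark)
• Web scraping is not something Spark should be doing.
• The task is implemented with requests-html and operated on Cloud Functions.
• A Cron setting in Cloud Scheduler runs it periodically, and the output is saved to GCS.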
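The flow above might look roughly like the following. This is a sketch under assumptions, not the crawler from the talk: `BUCKET_NAME`, `SOURCE_URL`, the object naming scheme and the function names are hypothetical placeholders, and the real download URL and query parameters for Baseball Savant are deliberately not guessed here.

```python
from datetime import date

# Hypothetical placeholders (assumptions, not values from the talk).
BUCKET_NAME = "my-statcast-datalake"
SOURCE_URL = "https://example.com/statcast.csv"


def object_name_for(day: date) -> str:
    """Build a dated GCS object name such as statcast/2022/09/29.csv."""
    return f"statcast/{day:%Y/%m/%d}.csv"


def crawl(request=None) -> str:
    """HTTP entry point that Cloud Scheduler would call on a cron schedule."""
    # Imported lazily so the pure helper above works without these packages.
    from google.cloud import storage
    from requests_html import HTMLSession

    response = HTMLSession().get(SOURCE_URL)
    response.raise_for_status()

    # Save the raw CSV as-is; GCS acts as the data lake.
    blob = storage.Client().bucket(BUCKET_NAME).blob(object_name_for(date.today()))
    blob.upload_from_string(response.text, content_type="text/csv")
    return f"saved gs://{BUCKET_NAME}/{blob.name}"
```

Keeping the raw CSV untouched in GCS means the later PySpark step can be rerun without crawling the site again.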

Slide 41

Slide 41 text

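Loading CSV data into BigQuery
• An appropriate scope and kind of processing for a task on Dataproc.
• Pick up files from a GCS path, process them with Spark SQL, and send them to BigQuery.
• If you know DataFrames and SQL, you can implement and operate this comfortably.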

Slide 42

Slide 42 text

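Let's use Dataproc
• A good way to start is to type along with Google Cloud's documentation and samples.
  • https://cloud.google.com/dataproc
  • https://cloud.google.com/dataproc-serverless/docs
  • https://github.com/GoogleCloudDataproc/cloud-dataproc
• For Serverless, you need to create a VPC subnet in advance and specify it at run time.
• From the next page, I'll show a few samples of doing this with PySpark.
  • Batch processing along the lines of "read data, transform it, write it out", based on the Spark DataFrame.
  • Implemented with Type Hints to make it easier to see which class is which.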

Slide 43

Slide 43 text

Implementing it for now 1. Create a Session
• Something like a DB connection
• Create a SparkSession object
• Specify parameters when creating the object
• Note that specifying the JAR is mandatory when using BigQuery
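A minimal sketch of this step, assuming the BigQuery connector is supplied through the `spark.jars` setting. The JAR path below is the public "latest" connector location as I understand it from Google's documentation (pin a specific version in real use), and the function names are hypothetical; the slide's actual code is not reproduced here.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hint below
    from pyspark.sql import SparkSession

# Assumed connector location; check the current documentation before relying on it.
BIGQUERY_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def session_conf(app_name: str) -> dict[str, str]:
    """Parameters passed when the session object is created."""
    return {"spark.app.name": app_name, "spark.jars": BIGQUERY_JAR}


def create_session(app_name: str) -> "SparkSession":
    """Create (or reuse) a SparkSession, roughly like opening a DB connection."""
    # Imported lazily so session_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in session_conf(app_name).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On Dataproc the JAR can alternatively be passed at submit time instead of in the code; either way the connector has to be on the classpath before `format("bigquery")` is used.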

Slide 44

Slide 44 text

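Implementing it for now 2. Create a Schema
• For CSV, create a Schema
• Absolutely necessary in order to give types to the resulting DataFrame
• This time I wrote out a Schema for all 91 fields, with tears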
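A shortened sketch of such a schema. Only five columns are listed, on the assumption that they are the first five columns of the Statcast CSV in file order; the helper name and the name-to-type table are made up for illustration, and the real schema would continue through all 91 fields.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hint below
    from pyspark.sql.types import StructType

# (column name, type name) pairs. Assumed to be the leading Statcast columns;
# a CSV schema is applied by position, so the full list must follow file order.
STATCAST_COLUMNS = [
    ("pitch_type", "string"),
    ("game_date", "date"),
    ("release_speed", "double"),
    ("release_pos_x", "double"),
    ("release_pos_z", "double"),
]


def build_schema(columns: list[tuple[str, str]]) -> "StructType":
    """Turn (name, type name) pairs into a Spark StructType."""
    # Imported lazily so the column list can be inspected without PySpark.
    from pyspark.sql.types import DateType, DoubleType, StringType, StructField, StructType

    type_map = {"string": StringType(), "date": DateType(), "double": DoubleType()}
    # nullable=True because Statcast rows often have missing measurements.
    return StructType([StructField(name, type_map[kind], True) for name, kind in columns])
```

Writing the columns as plain data first keeps the 91-entry list easy to review against the CSV documentation.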

Slide 45

Slide 45 text

Implementing it for now 3. Read the CSV
• Use read on the spark session, with CSV as the format
• Pass the Schema from the previous step as the header
• Specify the full GCS path
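The read step could be sketched like this. The option values, the wildcard path layout and the function names are assumptions for illustration; `spark` and `schema` are expected to come from the two previous steps.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql.types import StructType


def csv_reader_options() -> dict[str, str]:
    """Reader options: the files have a header row and ISO-style dates."""
    return {"header": "true", "dateFormat": "yyyy-MM-dd"}


def gcs_path(bucket: str, prefix: str) -> str:
    """Full GCS path matching every CSV under the given prefix."""
    return f"gs://{bucket}/{prefix.strip('/')}/*.csv"


def read_statcast_csv(
    spark: "SparkSession", schema: "StructType", bucket: str, prefix: str
) -> "DataFrame":
    """Read the CSV files into a typed Spark DataFrame."""
    return (
        spark.read.format("csv")
        .options(**csv_reader_options())
        .schema(schema)  # explicit schema instead of inferring types
        .load(gcs_path(bucket, prefix))
    )
```

Passing the schema explicitly avoids a second pass over the files for type inference and keeps the column types stable from day to day.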

Slide 46

Slide 46 text

Implementing it for now 4. Save to BigQuery
• The DataFrame's write function
• Specify bigquery
• The example appends to an existing table
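A sketch of the append, assuming the spark-bigquery connector with its indirect write path, which stages data in a temporary GCS bucket. The function name and the bucket and table arguments are placeholders, not values from the talk.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hint below
    from pyspark.sql import DataFrame


def write_to_bigquery(df: "DataFrame", table: str, temp_bucket: str) -> None:
    """Append the DataFrame to an existing BigQuery table ("dataset.table")."""
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)  # staging area for the load
        .mode("append")  # add rows; keep what is already in the table
        .save(table)
    )
```

With `mode("append")`, rerunning the same day's batch inserts the rows twice, so deduplication has to be handled either before the write or in a BigQuery view.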

Slide 47

Slide 47 text

Running it with Dataproc Serverless
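Submitting the script as a Dataproc Serverless batch goes through `gcloud dataproc batches submit pyspark`. The sketch below only assembles that command in Python so the pieces are visible; the region, subnet, bucket and file names in the usage example are placeholders, and the exact command shown in the talk may differ.

```python
import subprocess


def build_submit_command(
    script: str, region: str, subnet: str, jars: list[str], deps_bucket: str
) -> list[str]:
    """Assemble the gcloud command for a Dataproc Serverless PySpark batch."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        f"--region={region}",
        f"--subnet={subnet}",            # the VPC subnet created in advance
        f"--jars={','.join(jars)}",      # e.g. the BigQuery connector JAR
        f"--deps-bucket={deps_bucket}",  # bucket used to stage the local script
    ]


def submit(command: list[str]) -> None:
    """Run the command; raises CalledProcessError if gcloud fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(build_submit_command(
        "main.py", "asia-northeast1", "my-subnet",
        ["gs://my-bucket/connector.jar"], "my-bucket",
    )))
```

The environment is created for the batch and removed when it finishes, which is what makes the pay-as-you-go model work.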

Slide 48

Slide 48 text

File output from BigQuery to GCS, for Dataproc
• Load BigQuery data into a Spark DataFrame
• Write the Spark DataFrame out to files
The way to run it (gcloud CLI) is the same as before, so I'll skip it.

Slide 49

Slide 49 text

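[Recap] Loading into Firestore (moving data to the Database)
• A PySpark script running on Dataproc Serverless converts the BigQuery data into the dashboard's data format (JSON).
• A Python script then loads the result (saved on GCS in JSON format) into Firestore.
• Both are run manually for now (the reason and the workaround come later).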
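The second bullet, the plain Python loader, might be sketched as follows. It assumes the Spark output is JSON Lines (one object per line) and that each record carries a field usable as a document ID; the function names, the `id_field` parameter and the chunk size of 500 (a conservative choice for batched writes) are assumptions, not the author's script.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def parse_json_lines(text: str) -> list[dict]:
    """Parse JSON Lines text (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def chunked(items: Iterable, size: int = 500) -> Iterator[list]:
    """Yield lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def load_into_firestore(records: list[dict], collection: str, id_field: str) -> None:
    """Write records into a Firestore collection using batched writes."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import firestore

    db = firestore.Client()
    for chunk in chunked(records):
        batch = db.batch()
        for record in chunk:
            ref = db.collection(collection).document(str(record[id_field]))
            batch.set(ref, record)  # overwrite the document with the new record
        batch.commit()
```

Because each record is written with `set` under a stable ID, rerunning the loader overwrites documents instead of duplicating them.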

Slide 51

Slide 51 text

Implementing it for now 5. Read from BigQuery
• As with saving, specify the BigQuery JAR
• Specify BigQuery in spark read
• When reading from a BigQuery View, an option is required
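A sketch of reading a view, assuming the option in question is the connector's `viewsEnabled` setting together with `materializationDataset`, the dataset where the connector materializes the view before reading. Which options the original code sets is not shown on the slide, and the function name and arguments here are placeholders.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame, SparkSession


def read_bigquery_view(
    spark: "SparkSession", view: str, materialization_dataset: str
) -> "DataFrame":
    """Load a BigQuery view ("dataset.view") into a Spark DataFrame."""
    return (
        spark.read.format("bigquery")
        .option("viewsEnabled", "true")  # views are not readable by default
        .option("materializationDataset", materialization_dataset)
        .load(view)
    )
```

For a plain table the two options can be dropped and only `format("bigquery")` and `load` remain.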

Slide 52

Slide 52 text

Implementing it for now 6. Save to GCS
• The DataFrame's write function
• Specify json
• Specify the final path
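The final write could look like this. The `overwrite` mode and the function name are my assumptions; the slide only states that the write function is used with json and a destination path.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hint below
    from pyspark.sql import DataFrame


def write_json_to_gcs(df: "DataFrame", path: str) -> None:
    """Write the DataFrame as JSON under a GCS path such as gs://bucket/dir."""
    # Spark writes a directory of part files, one JSON object per line.
    df.write.format("json").mode("overwrite").save(path)
```

Since the output is a directory of part files rather than a single file, the Firestore loader downstream needs to read every part file under the path.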

Slide 53

Slide 53 text

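PySpark and Dataproc Serverless
• It makes the "use Spark only when you want to" use case possible. This is the biggest reason to use a serverless service.
• At this application's data size (less than 1 GB per year) the benefit doesn't really show, but for a use case like "quickly batch-process a few GB of data per day" (preprocessing, cleansing and so on) it should be quite handy.
• The code is lightweight, in the spirit of "run it only when there is something to process", so it pairs very well with PySpark.
• Automating the processing has some quirks: Cloud Composer (Airflow) is required. Note: see the Appendix of this deck for details.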

Slide 54

Slide 54 text

That's the end of the technical part. From here on...

Slide 55

Slide 55 text

It's baseball tiiiiime! ⚾

Slide 56

Slide 56 text

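Japanese pro baseball 2022: the 5 moments that got me
1. Fighters: Mannami, Kiyomiya, Tamiya and other young players with outstanding physical ability come to the fore
2. FIGHTERS GIRL 2022: the razor-sharp Kitsune Dance becomes a huge hit, beating many ceremonial first pitch videos in Pacific League TV view counts
3. The retirements of Yoshio Itoi and Kosuke Fukudome, who defined an era with power, strong arms, speed and defense
4. A wild bird that settled in at the ballpark loses to Chiba Lotte outfielder Kakunaka, who swung his bat at it
5. Roki Sasaki's perfect game, Munetaka Murakami's Triple Crown + home run record. Isn't that nasty?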

Slide 58

Slide 58 text

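The "nasty outfielders" that Statcast (& I) recommend
• Physical freaks with razor-sharp movement
• Power, strong arm, good defense, and speed as selling points
• A wild streak of swinging the bat hard
• I'll introduce my 3 favorites based on exit velocity
• Out-of-this-world, wonderful outfielders, reminiscent of Tsuyoshi Shinjo in his playing days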

Slide 59

Slide 59 text

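The nasty outfielders I'll introduce today
• Judge, Aaron
• Rodríguez, Julio
• Buxton, Byron
I'll introduce 3 players who are outfielders with 300 or more plate appearances, have high exit velocity and hit a lot of extra-base hits, and as a rule play center field.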

Slide 60

Slide 60 text

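Aaron Judge (2022 home run king)
• Yankees slugger and Ohtani-san's rival
• The strongest active home run hitter
• Not just power: outfield defense that makes use of his 2 m height, plus the mobility to play center field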

Slide 61

Slide 61 text

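Julio Rodríguez (Seattle's rising star)
• The rising star who burst onto the scene with the Mariners. A rookie this year, by the way.
• Numbers like BIG BOSS in his younger days; his appeal is play that draws on his physical ability.
• If his launch angle goes up and the barrels increase, might he gradually approach Ichiro? Hoping for performances worthy of his 10-year contract!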

Slide 62

Slide 62 text

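Byron Buxton (the Mannami of Minnesota)
• The Minnesota Twins' fixture in center field
• Outrageous speed and arm strength that make you think he could compete in sports other than baseball, and yet his launch angle is that of a power hitter
• With his rough edges and great physique, he resembles Chusei Mannami (Fighters). Manchu, become the Buxton of the North!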

Slide 63

Slide 63 text

He didn't play the outfield this year, but this man was, as expected, also a nasty hitter.

Slide 64

Slide 64 text

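Ohtani-san!! My heart skips a beat.
He ranked 2nd in maximum exit velocity among batters with 300 or more plate appearances (according to my own research).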

Slide 65

Slide 65 text

Closing

Slide 66

Slide 66 text

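[Recap] Today's starting lineup
• Playing with Major League Baseball big data
• A nice serverless data platform built with Python and Google Cloud
• Serverless data processing with PySpark + Dataproc
• The "nasty athletic outfielders" that big data recommends
Did you enjoy it? There was a lot of information, so it may take a while to digest.
The slides will be published, so please read through them as a review.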

Slide 67

Slide 67 text

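To summarize today's talk...
• Baseball is a fun subject for sports data analysis! The tracking data on Baseball Savant is a good place to start.
• PySpark is a good way to do data processing nicely in Python.
• PySpark runs in the cloud; today I introduced Dataproc.
• Once you can use the cloud serverlessly, many things get easier (though there are limitations).
• MLB has nasty outfielders, but Ohtani-san, the slider and two-seam monster, is strong.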

Slide 68

Slide 68 text

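To those thinking of using this as a reference at work
• The approach and configuration shown here are not the definitive answer or a best practice. For example, there certainly are situations where you should, and should not, adopt a serverless architecture.
• This talk is the culmination of what I (shinyorke) wanted to do, packed with things I think are good (& things I wanted to touch). It is just one way of arriving at an answer.
• To go further, I built it as a prototype to see "how far can you get with serverless and PySpark?", and I actually plan to remove Spark in the future (details in the Appendix).
• If you copy it as-is (with only a half understanding of the context), you will crash and burn. Get hands-on first, and use this as a reference while you learn, run things, and find what works for you!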

Slide 69

Slide 69 text

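[To be continued] Appendix - a slightly more detailed look
• Running Dataproc Serverless automatically
• The state of Spark services on AWS and other clouds, 2022
• Basics of larger-scale data processing without Spark, for Google Cloud
• Building a nice data visualization app with Dash + Cloud Run
If you're curious, read on in the slides, and if you're at the venue, let's talk in the Q&A!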

Slide 70

Slide 70 text

Thank you for listening ⚾
Shinichi Nakagawa@shinyorke

Slide 71

Slide 71 text

Sports Data Analysis Basics for Python Users - with PySpark and Major League Baseball Data
Bonus edition: "Tips & references not covered in the main talk, all released at once"

Slide 72

Slide 72 text

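Appendix - a slightly more detailed look
• Running Dataproc Serverless automatically
• The state of Spark services on AWS and other clouds, 2022
• Basics of larger-scale data processing without Spark, for Google Cloud
• Building a nice data visualization app with Dash + Cloud Run
• References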

Slide 73

Slide 73 text

Running Dataproc Serverless automatically

Slide 74

Slide 74 text

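Automating Dataproc Serverless
• There are three conceivable approaches.
  1. Create a Docker image that calls the API, and run it as a Container in some way (K8s etc.)
  2. Since it can be run from the CLI (gcloud command), create a Docker image of the gcloud command (the rest is the same as 1.)
  3. Drive Dataproc Serverless with an Airflow Operator
• 1. and 2. are painful and may defeat the purpose of going serverless (and 1 and 2 say almost the same thing). Running them on Cloud Run or similar would be fine, but both building and operating it look risky.
• The best-practice-like model answer is "3. Drive Dataproc Serverless with an Airflow Operator".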

Slide 75

Slide 75 text

[Model answer] Run it via an Airflow Operator
Google Cloud's managed service "Cloud Composer" looks like a good fit.

Slide 76

Slide 76 text

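Automating Dataproc Serverless processing
• For automating cron-like processing on Google Cloud, there is an established best practice of using Pub/Sub + Scheduler (or Cloud Tasks).
• However, as of October 2022, Dataproc Serverless has no way to be triggered with Pub/Sub as the interface, so unfortunately this approach cannot be used.
• So the smartest approach is to run it with Airflow's Dataproc Operators, standing up and operating an Airflow cluster on Cloud Composer.
  • https://cloud.google.com/composer/docs/composer-2/run-dataproc-workloads
• Note that Cloud Composer is not serverless (though it is fully managed), and it stands up a K8s (GKE) cluster, so watch the cost (fine for work, but expensive for personal use).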
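A sketch of what such a DAG could look like, assuming Airflow 2.x with the Google provider package and its `DataprocCreateBatchOperator`. The project, region, bucket, subnet and DAG names are placeholders, and this is not the DAG used in the talk; the linked Cloud Composer guide is the authoritative reference.

```python
from datetime import datetime

# Placeholder values (assumptions for illustration).
PROJECT_ID = "my-project"
REGION = "asia-northeast1"


def build_batch(main_uri: str, jar_uris: list[str], subnetwork_uri: str) -> dict:
    """Describe a Dataproc Serverless PySpark batch as a plain dict."""
    return {
        "pyspark_batch": {
            "main_python_file_uri": main_uri,
            "jar_file_uris": list(jar_uris),
        },
        "environment_config": {
            "execution_config": {"subnetwork_uri": subnetwork_uri},
        },
    }


def build_dag():
    """Build a daily DAG that submits the batch."""
    # Imported lazily so build_batch() is usable without Airflow installed.
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    with DAG(
        dag_id="statcast_daily_batch",
        start_date=datetime(2022, 10, 1),
        schedule_interval="@daily",  # Airflow 2.x style scheduling argument
        catchup=False,
    ) as dag:
        DataprocCreateBatchOperator(
            task_id="run_pyspark_batch",
            project_id=PROJECT_ID,
            region=REGION,
            batch=build_batch(
                "gs://my-bucket/jobs/main.py",
                ["gs://my-bucket/jars/connector.jar"],
                "my-subnet",
            ),
            # Batch IDs must be unique, so the run date is templated in.
            batch_id="statcast-{{ ds_nodash }}",
        )
    return dag
```

In an actual DAG file, `dag = build_dag()` would be called at module level so that the scheduler can discover it.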

Slide 77

Slide 77 text

Using Spark in the cloud: outside Google Cloud

Slide 78

Slide 78 text

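Spark service options outside Google Cloud
• The candidates are AWS, Azure, and Databricks (in a sense the original home of Spark).
• For use cases that treat public clouds as infrastructure, Databricks becomes the leading candidate (for example, when you want to go multi-cloud).
• AWS is actually well equipped in this area; choosing between EMR and Glue according to the use case seems like a good approach.
• I've never touched Azure, so I can't say...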

Slide 79

Slide 79 text

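Spark service options outside Google Cloud
Cloud service (not an exhaustive list) | URL | Summary
Databricks | https://www.databricks.com/jp | An option when multi-cloud is in view; developed and offered by the creators of Spark
AWS EMR | https://aws.amazon.com/jp/emr | AWS's managed Spark / Hadoop; the one to pick if you want to use it as Spark
AWS Glue | https://aws.amazon.com/jp/glue | When using Spark for ETL, Glue is a better choice than EMR
Azure HDInsight | https://azure.microsoft.com/ja-jp/services/hdinsight/#overview | The option on Azure (though I've never touched it...)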

Slide 80

Slide 80 text

Nice data processing without Spark (Dataproc), for Google Cloud

Slide 81

Slide 81 text

Nice data processing for Google Cloud
• Dataflow
• DataFusion
• Dataprep
• Cloud Run
• Cloud Functions

Slide 82

Slide 82 text

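Choose according to your use case!
Google Cloud Service | URL | Summary
Dataflow | https://cloud.google.com/dataflow?hl=ja | Based on Apache Beam; the one for streaming processing
DataFusion | https://cloud.google.com/data-fusion/docs?hl=ja | An ETL-style service for ingesting existing data, including on-premises data
Dataprep | https://cloud.google.com/dataprep?hl=ja | Centered on data preprocessing and cleansing; leans toward low-code
Cloud Run | https://cloud.google.com/run?hl=ja | The one for building with your favorite language and framework; trigger it with Pub/Sub or similar
Cloud Functions | https://cloud.google.com/functions?hl=ja | More constrained than Cloud Run, but good for building and running something quickly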

Slide 83

Slide 83 text

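Realistic choices and rules of thumb
• For real-time processing, Dataflow is the leading option.
• For integrating or consolidating with existing data, DataFusion.
• For data preprocessing for machine learning and the like, Dataprep.
• If you build and run it yourself, in Python or otherwise, Cloud Run.
• For something on the level of "use Pandas with BigQuery and GCS", Cloud Functions gets it done quickly (isn't this actually the most common use case?).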

Slide 84

Slide 84 text

A data visualization dashboard operated with Dash + Cloud Run
Note: Spark and Dataproc do not appear here.

Slide 85

Slide 85 text

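Dashboard app (the part skipped in the main talk)
• The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway.
• Firestore is the main DB, with MemoryStore (Redis) placed in front as a cache.
• Spark (PySpark) does not appear here.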

Slide 87

Slide 87 text

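Hosting with Dash + Cloud Run
• Dash is built on Flask, so it can be hosted the usual way by running it under gunicorn.
• This is serverless too, so it runs only for the time and resources actually used. If you want to build your own visualization app, it may be worth a try.
• On AWS, I think the same approach works with App Runner (though I haven't tried it).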
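A bare-bones sketch of the hosting idea, assuming Dash 2.x and a gunicorn version that accepts an application factory such as `gunicorn --bind :$PORT "main:create_server()"` (with this code saved as `main.py`). The layout, the module name and the function names are placeholders; Cloud Run passes the port to listen on through the `PORT` environment variable.

```python
import os
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ) -> int:
    """Port to listen on: Cloud Run injects PORT, 8080 is the usual default."""
    return int(env.get("PORT", "8080"))


def create_app():
    """Build the Dash app with a placeholder layout."""
    # Imported lazily so resolve_port() works without Dash installed.
    from dash import Dash, html

    app = Dash(__name__)
    app.layout = html.Div([html.H1("Statcast dashboard")])
    return app


def create_server():
    """Return the underlying Flask (WSGI) app, which is what gunicorn serves."""
    return create_app().server
```

gunicorn only needs the Flask object behind the Dash app, which is why the factory returns `app.server` rather than the Dash instance itself.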

Slide 88

Slide 88 text

The CI/CD workflow, by the way, looks like this.
• A push to the GitHub Repository fires GitHub Actions: test -> Docker Build -> deploy to Cloud Run
• Tests run pytest, flake8 and mypy on GitHub Actions (the idea is to cover unit and integration tests)
• Docker build follows the standard Cloud Run way.
  • Build on Cloud Build
  • Push to Artifact Registry
• Deployment to Cloud Run uses the official GitHub Actions action.

Slide 89

Slide 89 text

References

Slide 90

Slide 90 text

Spark / PySpark
• PySpark Documents https://spark.apache.org/docs/latest/api/python/
• Introduction to PySpark (Japanese edition). Note: a somewhat old book, so be careful with the content. https://www.oreilly.co.jp/books/9784873118185/
• Large-scale data processing in Python! Basics of data processing and analysis with PySpark (PyCon JP 2017) https://speakerdeck.com/chie8842/pythondeda-liang-detachu-li-pysparkwoyong-itadetachu-li-tofen-xi-falsekihon

Slide 91

Slide 91 text

Google Cloud (Dataproc)
• Official documentation https://cloud.google.com/dataproc/docs?hl=ja
• Official PySpark samples (typing along from here is easiest) https://github.com/googleapis/python-dataproc
• Official samples, part 2 (more practical) https://github.com/GoogleCloudDataproc/cloud-dataproc

Slide 92

Slide 92 text

Google Cloud (for beginners and those who want to use it)
• Official documentation https://cloud.google.com/docs?hl=ja
• Certifications https://cloud.google.com/certification?hl=ja
• Google Cloud for the Enterprise (a book I recommend) https://www.shoeisha.co.jp/book/detail/9784798175256

Slide 93

Slide 93 text

My own blog posts (PySpark / data related)
• Making baseball big data nicely usable with GCP and PySpark https://shinyorke.hatenablog.com/entry/dataproc-baseball
• How to use Spark without managing servers https://shinyorke.hatenablog.com/entry/dataproc-serverless
• Quickly getting an environment for using Spark on Google Cloud https://shinyorke.hatenablog.com/entry/dataproc-terraform
• Practices for quickly standing up a web app and a data platform https://shinyorke.hatenablog.com/entry/cloud-arch-serverless

Slide 94

Slide 94 text

Baseball-related reference blogs and code
• An introduction to Statcast data for baseball lovers and data lovers https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
• Visualizing "where batted balls land" with Statcast data and Plotly https://shinyorke.hatenablog.com/entry/statcast-visualization-for-batting
• A sample for looking at Ohtani-san's data with Baseball Savant https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
• An introduction to sabermetrics with R https://gihyo.jp/book/2020/978-4-297-11684-2

Slide 95

Slide 95 text

Done. Thank you for reading to the end.