Sports Data Analysis Basics for Python Users, with PySpark and Major League Baseball Data (Python使いのためのスポーツデータ解析のきほん - PySparkとメジャーリーグデータを添えて) #PyConJP 2022

Shinichi Nakagawa

October 15, 2022

Transcript

  1. No Baseball, No Engineering! High Performance Data Platform Knowledge of PySpark, Cloud and ⚾

    Sports Data Analysis Basics for Python Users, with PySpark and Major League Baseball Data
    Shinichi Nakagawa@shinyorke 2022/10/15 PyConJP 2022 Talk Session
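  2. Prerequisite knowledge and motivation expected of the audience

    • [Must] You have done hands-on data processing and analysis with Pandas or SQL.
    • [Must] You have used Python on a public cloud such as Google Cloud (GCP), AWS, or Azure. Any service counts (EC2, App Engine, etc.).
    • Development experience in a fully managed serverless environment (having touched one is enough). AWS Lambda, AWS App Runner, App Engine, Cloud Run, and similar services qualify.
    • (Whether you like baseball or not) you know the rules of baseball and who Ohtani-san is.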
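  3. Who am I? (Who are you, anyway?)

    • Shinichi Nakagawa@shinyorke
    • Manager at a major foreign-owned IT consulting firm (formerly a full-cycle engineer at an operating company)
    • Manager of a team that handles cloud infrastructure
    • Builds personal projects for both fun and profit (mainly for baseball and physical care)
    • Loves baseball, and programming over a drink.
    • Favorites: Tsuyoshi Shinjo, Chusei Mannami, Kenta Tanigawara (his strong arm)
    #Python #Serverless #GoogleCloud #Baseball #DataScience #SABRmetrics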
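  4. Major League Baseball's big data

    • MLB records all kinds of data with a system called "Statcast" (recorded by cameras and radar; some statistics are entered manually).
    • For example, the numbers behind play-by-play calls like the following all come from the Statcast data set:
        • "Ohtani-san! Home run No. ○! Exit velocity 180 km/h, distance 130 m"
        • "Ohtani-san! A called third strike on a 162 km/h fastball!!!"
    • Every move on the field is captured: all pitch and batted-ball data are recorded.
    • A regular season (30 teams, 162 games) comes to roughly 700,000 to 800,000 pitches. Postseason and spring training data are also available.
    • Each record has 91 fields (!?), and one regular season amounts to roughly 400 MB to 600 MB of data.
    • Anyone can browse and download the data (in CSV format) at baseballsavant.mlb.com.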
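    A minimal sketch of working with such a CSV export, using only the Python standard library. It assumes the export has `player_name` and `launch_speed` columns with `launch_speed` in mph; the sample rows and the helper names (`mph_to_kmh`, `hardest_hit`) are made up for illustration and are not real Statcast records.

```python
import csv
import io
from typing import Optional, Tuple

MPH_TO_KMH = 1.609344  # exact by definition: 1 mile = 1.609344 km

# Made-up sample rows for illustration only; a real export has many more columns.
SAMPLE_CSV = '''player_name,events,launch_speed
"Judge, Aaron",home_run,111.8
"Buxton, Byron",single,98.4
"Rodríguez, Julio",strikeout,
'''


def mph_to_kmh(mph: float) -> float:
    """Convert miles per hour to kilometres per hour."""
    return mph * MPH_TO_KMH


def hardest_hit(csv_text: str) -> Optional[Tuple[str, float]]:
    """Return (player_name, exit velocity in km/h) for the hardest-hit ball."""
    best: Optional[Tuple[str, float]] = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get("launch_speed") or ""
        if not raw:
            # Pitches that were not put in play have no exit velocity.
            continue
        kmh = mph_to_kmh(float(raw))
        if best is None or kmh > best[1]:
            best = (row["player_name"], kmh)
    return best


if __name__ == "__main__":
    result = hardest_hit(SAMPLE_CSV)
    if result is not None:
        name, kmh = result
        print(f"{name}: {kmh:.1f} km/h")
```

    The "180 km/h" figures quoted in broadcasts correspond to roughly 112 mph in the raw data, which is why the conversion is kept as an explicit helper.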
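  5. Looking at Statcast data with Jupyter Lab + Plotly

    • There is such a variety of data that quite a lot can be learned from it.
    • Because it is time-series data, changes in performance are visible too, such as "his velocity is different from before" or "has he suddenly started throwing more two-seamers?".
    • Ball velocity and coordinate data are all there.
    • With some effort on the coordinate calculations, even 3D rendering is within reach (in other words: there was no time for that effort, so it was not done this time).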
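  6. Architecture overview (≈ the points I care about)

    • To check and update the data every day without fuss, the platform is built and operated entirely on fully managed, serverless cloud services.
    • What "fully managed, serverless cloud services" means here:
        • A few clicks or commands in the console or CLI are enough to get something running.
        • No infrastructure or server maintenance is needed (the cloud provider does it, not you).
        • More concretely, you do not have to stand up K8s clusters or VMs yourself (network and similar settings are still required).
        • They can be wired into a CI/CD pipeline such as GitHub Actions for deployment and scaling.
    • Billing is basically pay-per-use, so it is easy on the wallet.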
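  7. Spark and PySpark

    • A framework for processing large data in a distributed way.
    • Spark itself is implemented on the JVM (mostly in Scala), but it is most often used through PySpark, its Python interface (R is available as well).
    • It runs either as a batch data-processing program or interactively in a notebook on Jupyter Lab or Zeppelin.
    • It offers a DataFrame interface that feels very familiar to Python users.
        • Spark has its own DataFrame, which can be converted to a Pandas DataFrame.
        • Pandas API (Pandas functionality is available from Spark 3.2 onward, with some restrictions).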
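  8. Where to build and operate Spark

    Each option is listed as: setup effort / ease of operation / notes.
    • On-premises, fully self-built and self-operated: everything must be configured yourself / you have to look after everything yourself / the hardest pattern, tough work even for a professional infrastructure engineer.
    • Self-built and self-operated on VMs or K8s in the cloud: everything must be configured yourself / you benefit from the cloud service to some degree / building a Spark environment yourself is fairly difficult.
    • Managed service offered by a cloud provider (the most recommended approach): click through the GUI or run it via CLI/API / monitor resources such as CPU and do maintenance as the situation requires / the easiest and smartest way; AWS, Google Cloud, and other providers offer such services.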
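  9. Options for operating Spark on Google Cloud

    Each option is listed as: setup / operation / available features / notes.
    • Build and operate your own environment on GCE or GKE: build it yourself, then install Spark / operate and look after everything yourself / all features / not recommended, since Dataproc can do the same thing in the end.
    • Dataproc: set up via the gcloud command, the API, or the console / monitor and operate the GKE or GCE environment that Dataproc creates / all features / the most standard configuration.
    • Dataproc Serverless: set up via the gcloud command, the API, or the console / only monitoring while a job runs, and the environment is deleted automatically afterwards / batch processing only, notebooks are not available / the best choice for periodic batch processing.
    Note: "Spark in BigQuery", a feature that runs Spark as a BigQuery stored procedure, is planned (announced at Google Cloud Next '22).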
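  10. Dataproc and Dataproc Serverless

    • Google Cloud has a managed Spark (Hadoop) service called Dataproc.
    • Until now it could only be operated on the premise that a host or cluster exists on GCE or GKE (K8s), but a Serverless option arrived very recently.
    • Serverless works for batch operation such as "once a day" or "every 30 minutes". Notebook execution (Jupyter and the like) is not supported, so it cannot be used for ad hoc work.
    • Serverless is pay-per-use, so it is easy on the wallet.
    • The version is Spark 3.2, so the Pandas API is available from PySpark (though it is not used this time).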
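    As a rough illustration of what submitting such a batch looks like, the sketch below builds a `gcloud dataproc batches submit pyspark` command line in Python. The bucket, region, subnet, and batch ID are hypothetical placeholders, and `build_submit_command` is a helper invented for this example; a subnet is passed because Dataproc Serverless workloads run inside a VPC subnetwork.

```python
import shlex
from typing import List, Sequence


def build_submit_command(
    main_py_uri: str,
    *,
    region: str,
    subnet: str,
    batch_id: str,
    job_args: Sequence[str] = (),
) -> List[str]:
    """Build the argv for submitting a PySpark batch to Dataproc Serverless."""
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_py_uri,
        f"--region={region}",
        f"--subnet={subnet}",
        f"--batch={batch_id}",
    ]
    if job_args:
        # Everything after "--" is passed through to the PySpark script itself.
        cmd += ["--", *job_args]
    return cmd


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    command = build_submit_command(
        "gs://example-bucket/jobs/main.py",
        region="asia-northeast1",
        subnet="example-subnet",
        batch_id="statcast-20221015",
        job_args=["--date", "2022-10-15"],
    )
    print(shlex.join(command))
    # To actually submit, run the command, e.g. subprocess.run(command, check=True).
```

    Building the argument list instead of a shell string avoids quoting problems and makes the command easy to inspect before it is run.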
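  11. Let's use Dataproc

    • Working through the Google Cloud documentation and samples by typing them out yourself is a good way to learn.
        • https://cloud.google.com/dataproc
        • https://cloud.google.com/dataproc-serverless/docs
        • https://github.com/GoogleCloudDataproc/cloud-dataproc
    • With Serverless, a VPC subnet must be created beforehand and specified at run time.
    • The following pages show a few samples using PySpark.
        • A batch job based on Spark DataFrame that reads data, transforms it, and writes it out.
        • Implemented with type hints to make it clear which class is which.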
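    A minimal sketch of a read, transform, write batch job with type hints, in the spirit described above; it is not the speaker's original sample. The column names `player_name` and `launch_speed` (in mph) are assumed from the Statcast CSV export, and the function names, output columns, and command-line arguments are invented for illustration.

```python
from __future__ import annotations

import argparse
from typing import TYPE_CHECKING, List, Optional

if TYPE_CHECKING:  # imported for type hints only; PySpark is needed at run time
    from pyspark.sql import DataFrame, SparkSession

MPH_TO_KMH = 1.609344  # 1 mile = 1.609344 km


def read_statcast(spark: SparkSession, path: str) -> DataFrame:
    """Read a Statcast CSV export into a Spark DataFrame."""
    return spark.read.csv(path, header=True, inferSchema=True)


def summarize_exit_velocity(df: DataFrame) -> DataFrame:
    """Aggregate batted balls per player, with exit velocity in km/h."""
    from pyspark.sql import functions as F  # lazy import keeps helpers importable

    kmh = F.col("launch_speed") * MPH_TO_KMH
    return (
        df.where(F.col("launch_speed").isNotNull())
        .groupBy("player_name")
        .agg(
            F.count("*").alias("batted_balls"),
            F.round(F.avg(kmh), 1).alias("avg_exit_velocity_kmh"),
            F.round(F.max(kmh), 1).alias("max_exit_velocity_kmh"),
        )
    )


def write_result(df: DataFrame, path: str) -> None:
    """Write the summary as Parquet, replacing any previous output."""
    df.write.mode("overwrite").parquet(path)


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Statcast batch summary")
    parser.add_argument("input", help="input CSV path, e.g. a gs:// URI")
    parser.add_argument("output", help="output Parquet path")
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> None:
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    spark = SparkSession.builder.appName("statcast-batch").getOrCreate()
    try:
        write_result(summarize_exit_velocity(read_statcast(spark, args.input)), args.output)
    finally:
        spark.stop()


if __name__ == "__main__":
    main()
```

    Keeping read, transform, and write as separate typed functions makes it possible to test the transformation against a small local DataFrame before submitting the job as a batch.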
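  12. The five most exciting events in Japanese pro baseball in 2022

    1. Fighters: Mannami, Kiyomiya, Tamiya, and other young, athletically gifted players come to the fore.
    2. FIGHTERS GIRL 2022: the razor-sharp "Kitsune Dance" (fox dance) becomes a huge hit, comfortably beating the many ceremonial-first-pitch videos in Pacific League TV view counts.
    3. The retirements of Yoshio Itoi and Kosuke Fukudome, who defined an era with power, strong arms, speed, and good defense.
    4. A wild bird that had settled in at the ballpark loses out to Chiba Lotte outfielder Kakunaka, who swung his bat at it.
    5. Roki Sasaki's perfect game, and Munetaka Murakami's Triple Crown plus home run record. Isn't that insane?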
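  14. The "insane outfielders" that Statcast (and I) recommend

    • Freakishly athletic and razor-sharp
    • Their selling points are power, a strong arm, good defense, and speed
    • A wildness in the way they swing the bat
    • Three favorites, picked on the basis of exit velocity
    • Out-of-this-world, wonderful outfielders reminiscent of Tsuyoshi Shinjo in his playing days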
  15. The insane outfielders introduced today

    • Judge, Aaron
    • Rodríguez, Julio
    • Buxton, Byron
    These are three outfielders with at least 300 plate appearances who hit the ball hard, rack up extra-base hits, and mainly play center field.
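  16. [Recap] Today's starting lineup

    • Playing with Major League Baseball's big data
    • A serverless, pleasant data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "insane athletic outfielders" that the big data recommends
    Did you enjoy it? There was a lot of information, so it may take a while to digest. The slides will be published, so please read them again as a recap.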
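  17. For those thinking of using this as a reference at work

    • The approach and architecture shown here are not the definitive answer or a best practice. For example, there are certainly situations where a serverless architecture is the right choice and situations where it is not.
    • This talk is the sum of what I (shinyorke) wanted to do, the things I think are good, and the things I wanted to try. It is only one way of arriving at an answer.
    • To put it more bluntly, it was built as a prototype to see how far serverless and PySpark can take you. In fact, I plan to remove Spark in the future (see the Appendix for details).
    • Copying it as-is, with only a half-formed understanding of the context, will blow up in your face. Get hands-on first, and use this as a reference while you learn, experiment, and find what works for you.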
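  18. [Continued] Appendix: a slightly more detailed look

    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds in 2022
    • Basics of processing largish data without Spark, for Google Cloud
    • Building a data visualization app with Dash + Cloud Run
    If you are interested, read on in the slides; if you are at the venue, let's talk in the Q&A.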
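  19. Appendix: a slightly more detailed look

    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds in 2022
    • Basics of processing largish data without Spark, for Google Cloud
    • Building a data visualization app with Dash + Cloud Run
    • References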
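  20. Automating Dataproc Serverless

    • There are three conceivable approaches:
        1. Build a Docker image that runs the job through the API, and run it as a container in some way (K8s, for example).
        2. Since the job can be run from the CLI (the gcloud command), build a Docker image containing the gcloud command (the rest is the same as 1).
        3. Run Dataproc Serverless with an Airflow Operator.
    • Options 1 and 2 are painful and may defeat the purpose of going serverless (they also say almost the same thing). Running them on Cloud Run or similar would be fine, but both setup and operation look risky.
    • The answer closest to a best practice is option 3: run Dataproc Serverless with an Airflow Operator.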
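  21. Automating Dataproc Serverless processing

    • On Google Cloud, the established best practice for automating cron-like jobs is Pub/Sub + Cloud Scheduler (or Cloud Tasks).
    • However, as of October 2022 Dataproc Serverless has no way to be triggered through Pub/Sub, so unfortunately this approach cannot be used.
    • The smartest approach is therefore to run it with Airflow's Dataproc operators, on an Airflow cluster started and operated with Cloud Composer.
        • https://cloud.google.com/composer/docs/composer-2/run-dataproc-workloads
    • Note that Cloud Composer is not serverless (although it is fully managed). It also starts a K8s (GKE) cluster, so watch the cost: fine for work, but expensive for personal use.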
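    The Airflow approach can be sketched roughly as follows, assuming Airflow 2.x with the Google provider package installed (as on Cloud Composer 2). `DataprocCreateBatchOperator` is the provider's operator for Dataproc Serverless batches; the project, region, bucket, subnet, DAG ID, and schedule are hypothetical placeholders, and `build_batch` and `build_dag` are helper names invented for this example.

```python
from datetime import datetime
from typing import Any, Dict


def build_batch(main_py_uri: str, subnetwork_uri: str) -> Dict[str, Any]:
    """Build a Dataproc Serverless batch definition for a PySpark job."""
    return {
        "pyspark_batch": {"main_python_file_uri": main_py_uri},
        "environment_config": {
            "execution_config": {"subnetwork_uri": subnetwork_uri},
        },
    }


def build_dag(project_id: str, region: str, batch: Dict[str, Any]):
    """Create a daily DAG that submits the batch (requires Airflow at run time)."""
    # Imported lazily so build_batch stays usable without Airflow installed.
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    with DAG(
        dag_id="statcast_daily_batch",
        schedule_interval="@daily",  # Airflow 2.x name; newer releases use `schedule`
        start_date=datetime(2022, 10, 1),
        catchup=False,
    ) as dag:
        DataprocCreateBatchOperator(
            task_id="run_pyspark_batch",
            project_id=project_id,
            region=region,
            batch=batch,
            # Batch IDs must be unique, so the run date is appended.
            batch_id="statcast-{{ ds_nodash }}",
        )
    return dag


# In a real DAG file, assign the result to a module-level name so that the
# scheduler can discover it (all values here are placeholders):
# dag = build_dag(
#     "example-project",
#     "asia-northeast1",
#     build_batch("gs://example-bucket/jobs/main.py", "example-subnet"),
# )
```

    Separating the batch definition from the DAG keeps the Dataproc-specific payload easy to check on its own, independent of the scheduler.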
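  22. Spark service options outside Google Cloud

    Cloud services (not an exhaustive list), each with URL and summary:
    • Databricks (https://www.databricks.com/jp): an option when multi-cloud is in scope; developed and offered by the creators of Spark.
    • AWS EMR (https://aws.amazon.com/jp/emr): AWS's managed Spark/Hadoop; the one to choose when using it as Spark.
    • AWS Glue (https://aws.amazon.com/jp/glue): when Spark is used for ETL, Glue is a better choice than EMR.
    • Azure HDInsight (https://azure.microsoft.com/ja-jp/services/hdinsight/#overview): the option on Azure (though I have never used it myself).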
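  23. Choose the right tool for the job!

    Google Cloud services, each with URL and summary:
    • Dataflow (https://cloud.google.com/dataflow?hl=ja): based on Apache Beam; the choice for stream processing.
    • Data Fusion (https://cloud.google.com/data-fusion/docs?hl=ja): an ETL-style service for ingesting existing data, including on-premises data.
    • Dataprep (https://cloud.google.com/dataprep?hl=ja): focused on data preprocessing and cleansing; leans toward low-code.
    • Cloud Run (https://cloud.google.com/run?hl=ja): the choice when building with your favorite language and framework; triggered via Pub/Sub and similar.
    • Cloud Functions (https://cloud.google.com/functions?hl=ja): more constrained than Cloud Run, but good for building and running something quickly.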
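  24. Hosting with Dash + Cloud Run

    • Dash is built on Flask, so it can be hosted the usual way by running it under gunicorn.
    • This is serverless too, so it only runs for the time and resources actually used. Worth trying if you want to build your own visualization app.
    • On AWS, the same approach should work with App Runner (though I have not tried it).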
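    A minimal sketch of that hosting pattern, assuming Dash 2.x and a module named `app.py`. The layout is a placeholder, and `resolve_port` and `create_server` are helper names invented for this example; Cloud Run passes the listening port in the `PORT` environment variable.

```python
import os
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ) -> int:
    """Return the port to listen on; Cloud Run provides it via PORT."""
    return int(env.get("PORT", "8080"))


def create_server():
    """Build the Dash app and return its underlying Flask (WSGI) server."""
    # Imported lazily so resolve_port stays usable without Dash installed.
    from dash import Dash, html

    app = Dash(__name__)
    app.layout = html.Div(
        [
            html.H1("Statcast dashboard"),
            html.P("Placeholder layout; charts would go here."),
        ]
    )
    return app.server  # the Flask instance that gunicorn serves


if __name__ == "__main__":
    # Local development only; in a container, gunicorn starts the app instead,
    # for example: gunicorn --bind :$PORT "app:create_server()"
    create_server().run(host="0.0.0.0", port=resolve_port())
```

    Exposing the Flask server through a factory function lets gunicorn create the app in each worker process, and the same container image works unchanged on any platform that sets `PORT`.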
  25. The CI/CD workflow looks like this.

    • A push to the GitHub repository triggers GitHub Actions: test -> Docker build -> deploy to Cloud Run.
    • Tests run pytest, flake8, and mypy on GitHub Actions (the idea is to cover unit and integration tests).
    • The Docker build follows the standard Cloud Run approach:
        • Build on Cloud Build
        • Push to Artifact Registry
    • Deployment to Cloud Run uses the official GitHub Actions action.
  26. Spark / PySpark references

    • PySpark Documents
      https://spark.apache.org/docs/latest/api/python/
    • 入門PySpark (Introduction to PySpark). Note: this is a slightly dated book, so check the contents against current versions.
      https://www.oreilly.co.jp/books/9784873118185/
    • Large-scale data processing in Python! Basics of data processing and analysis with PySpark (PyCon JP 2017)
      https://speakerdeck.com/chie8842/pythondeda-liang-detachu-li-pysparkwoyong-itadetachu-li-tofen-xi-falsekihon