Slide 1

Slide 1 text

Python࢖͍ͷͨΊͷ εϙʔπσʔλղੳͷ͖΄Μ β൛ʢ΋͘͠͸༧ࠂฤʣ PyCon JP 2022ͷຊ൪લʹͪΐͬͱ͓࿩ͯ͠Έ·ͨ͠. Shinichi Nakagawa@shinyorke 2022/09/28 kawasaki.rb #112

Slide 2

Slide 2 text

Onboardingʢ͜ͷࢿྉɾൃදͷҐஔ͚ͮʣ • PythonͱSparkͱύϒϦοΫΫϥ΢υʢGoogle CloudʣͰ 
 ਺ඦGBҎ্ͷσʔλΛ͍͍ײ͡ʹॲཧͯ͠ѻ͓͏ͥʂ, ͱ͍͏࿩. • PyCon JP 2022ʮPython࢖͍ͷͨΊͷεϙʔπσʔλղੳͷ͖΄Μʯ 
 Ͱ΍Δલʹͪΐͬͱ͚͓ͩ൸࿐໨͢΂͘࡞੒ͨ͠΋ͷͰ͢. • ݱ࣌఺Ͱͷ׬੒൛Ͱ, ຊ൪Ͱ͸಺༰͕มΘΔՄೳੑ͕͋Γ·͢.

Slide 3

Slide 3 text

Who am ɹ? ʢ͓લ୭Α?ʣ • Shinichi Nakagawa@shinyorke • ֎ࢿITίϯαϧاۀϚωʔδϟʔ • ࢓ࣄ͸Ϋϥ΢υαʔϏεΛѻ͏ΤϯδχΞ. 
 ʢͷνʔϜΛవΊΔϚωʔδϟʔʣ • ࠷ۙ͸ެࢲͱ΋ʹGoogle CloudΛ࢖͏ਓ. • झຯͱ࣮ӹΛ݉Ͷͯݸਓ։ൃͯ͠·͢🍺 #Python #GoogleCloud #Baseball #DataScience #SABRmetrics

Slide 4

Slide 4 text

ຊ೔ͷελʔςΟϯάϝϯόʔ • ϝδϟʔϦʔάͷϏοάσʔλͰ༡΅͏ ※PyCon౰೔ʹ࿩͢ • PythonͱGoogle CloudͰ࡞ΔαʔόϨεͰ͍͍ײ͡ͳσʔλج൫ • PySpark + DataprocͰ࣮ݱ͢ΔαʔόϨεͳσʔλॲཧ • ࠓ೥ͷϝδϟʔϦʔάΛσʔλͰৼΓฦΔ ※PyCon౰೔ʹ࿩͢

Slide 5

Slide 5 text

ࠓճѻ͏σʔλʮMLBͷϏοάσʔλʯ • ϝδϟʔϦʔά͸ʮStatcastʯͱ͍͏γεςϜͰ৭ΜͳσʔλΛه࿥͍ͯ͠·͢. 
 ※ΧϝϥɾϨʔμʔͰه࿥, Ұ෦౷ܭ஋ɾਓྗͰه࿥ • ྫ͑͹, ͜ͷลͷ࣮گͷݩωλ͸͢΂ͯ͜ͷʮStatcastʯͱ͍͏Ϗοάσʔλ͕ݩωλʹͳ͍ͬͯ·͢. • ΦΦλχαϯʂ˓߸ຊྥଧʂଧٿ଎౓180km/h, ඈڑ཭130m • ΦΦλχαϯʂ162km/hͷਅͬ௚͙Ͱݟಀ͠ࡾৼʂʂʂ • ໺ٿͷҰڍखҰ౤଍, ͢΂ͯͷ౤ٿɾଧٿσʔλ͕ه࿥͞ΕΔ. • ϨΪϡϥʔγʔζϯʢ30νʔϜɾ162ࢼ߹ʣͰ͓͓Αͦ70ʙ80ສٿલޙ. ϙετγʔζϯɾय़Ωϟϯϓσʔλ΋͋Δ. • σʔλ͸91ݸͷ߲໨ʢ!?ʣͰߏ੒͞ΕΔ, ϨΪϡϥʔγʔζϯ෼Ͱ͓͓Αͦ400MBʙ600MB͙Β͍ͷσʔλ. • baseballsavant.mlb.com ͱ͍͏αΠτͰ୭Ͱ΋Ӿཡɾμ΢ϯϩʔυʢCSV ϑΥʔϚοτʣͰ͖·͢.

Slide 6

Slide 6 text

ϫΠʮຖ೔ຖࢼ߹ݟΔ࢓૊Έཉ͍͠ʯ ຖ೔σʔλऩू -> BigQueryʹอଘ -> ՄࢹԽ, తͳج൫࡞ͬͪΌ͑ʂ

Slide 7

Slide 7 text

ͱ͍͏Θ͚Ͱ, ͪΐͬ͜ͱ࡞ͬͯΈ·ͨ͠.

Slide 8

Slide 8 text

PythonͱGoogle CloudͰ࡞Δ αʔόϨεͰ͍͍ײ͡ͳ σʔλج൫ʢ໺ٿฤʣ

Slide 9

Slide 9 text

ΞʔΩςΫνϟͷશମ૾

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

ΞʔΩςΫνϟղઆʢ㲈ͩ͜ΘΓϙΠϯτʣ • ຖ೔σʔλ֬ೝɾຖ೔σʔλߋ৽Λ͍͍ײ͡ʹ࣮ݱ͢ΔͨΊ, 
 ʮϑϧϚωʔδυͳαʔόϨεܥΫϥ΢υαʔϏεʯΛશ໘తʹ׆༻ͯ͠ߏஙɾӡ༻. • ʮϑϧϚωʔδυͳαʔόϨεܥΫϥ΢υαʔϏεʯ #ͱ͸ • CLI΍ίϯιʔϧͰϙνϙν͢Δ͚ͩͰͻͱ·্ཱ͕ͣͪΔ • Πϯϑϥɾαʔόʔͷϝϯςφϯε͕ෆཁʢࣗ෼͡Όͳͯ͘, Ϋϥ΢υαʔϏεଆ͕΍Δʣ • ΑΓ۩ମతʹ͸, ࣗ෼ͰK8sΫϥελ΍VMΛݐͯͳͯ͘΋ྑ͍ʢωοτϫʔΫ౳ͷઃఆ͸ൃੜʣ • GitHub Actions౳ͷCI/CDͷύΠϓϥΠϯʹ૊ΈࠐΜͰσϓϩΠɾεέʔϧͰ͖ͨΓ 
 جຊతʹ͸ʮ࢖ͬͨ෼͚ͩ՝ۚʯʹͳΔͷͰ͓ࡒ෍ʹ΋༏͍͠👛

Slide 12

Slide 12 text

Ϣʔεέʔεͱ࢖ͬͨαʔϏε

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

μογϡϘʔυΞϓϦ • ΞϓϦຊମ͸Cloud RunͰϗεςΟϯά, API GatewayΛ௨ͯ͠όοΫΤϯυʢCloud FunctionsʣʹΞΫηε • Firestore͕ϝΠϯͷDB, Cache໾ͷMemoryStoreʢRedisʣΛஔ͍͍ͯΔ • ͜͜Ͱ͸SparkʢPySparkʣ͸ొ৔͠·ͤΜ

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

σʔλऩू&BigQueryอଘ • σʔλݩαΠτʢBaseball Savantʣ͔Βఆظతʹσʔλऩू͢ΔΫϩʔϥʔʢCloud Functionsʣ࣮ߦ • ࣮ߦ݁Ռ͸Google Cloud StorageʢGCSʣʹCSVͱͯ͠อଘ. ͜Ε͕ݯઘͷσʔλʢDatalakeʣ • GCS্ͷCSVΛαϚϦʔ͍͍ͯ͠ײ͡ʹͯ͠BigQueryʹอଘ͢ΔPySparkεΫϦϓτΛDataproc Serverless্Ͱ࣮ߦ

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Firestore౤ೖʢDatabaseʹσʔλҠૹʣ • BigQueryσʔλΛμογϡϘʔυ༻σʔλͷܗࣜʢJSONʣʹม׵͢ΔPySparkεΫϦϓτΛDataproc Serverless্Ͱ࣮ߦ • ࣮ߦ݁ՌʢGCS্ʹJSONܗࣜͰอଘʣΛFirestoreʹೖΕΔͨΊͷPythonεΫϦϓτΛ࣮ߦ • ͳ͓͍ͣΕ΋खಈͰͷ࣮ߦʢཧ༝&ରԠࡦ͸ޙ΄Ͳʣ

Slide 19

Slide 19 text

PySpark + DataprocͰ࣮ݱ͢Δ αʔόϨεͳσʔλॲཧ ※͕͜͜͜ͷτʔΫͷຊ୊ͱͳΓ·͢.

Slide 20

Slide 20 text

͜ͷ࿩ͷείʔϓ

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

33.4ඵͰΘ͔ͬͨʢؾʹͳΔʣ🐯 SparkͱPySpark

Slide 23

Slide 23 text

SparkͱPySpark • ʮେ͖͍σʔλΛ͍͍ײ͡ʹ෼ࢄͯ͠ॲཧ͢ΔʯͨΊͷFramework • ྺ࢙తͳ࿩Ͱ͍͑͹HadoopͷޙܧͰ, ར༻ࣄྫ͸݁ߏ͋Δ. • ϥΠϒϥϦɾAPIͷ࣮૷͸Java͕ͩ, ʮPySparkʯͱ͍͏ 
 PythonͷInterface͕͋Δ. • DataFrame APIͱ͍͏, PandasͰ͓ೃછΈͷDataFrameͳInterface 
 ͕͋Γ, ͜ΕΛϝΠϯʹ࢖͏ͱ͍͍ײ͡ʹ࢖͑Δ.

Slide 24

Slide 24 text

SparkʢͱPySparkʣͷ࢖͍ํ • σʔληοτ • େݩͷσʔλ͸RDDʢResilient Distributed DatasetʣͰߏ੒ɾ؅ཧ. • RDDΛ͍͍ײ͡ʹѻ͏Interfaceͱͯ͠DataFrame API͕ଘࡏ͢Δ. • DataFrame͸PandasͬΆ͍࢖͍ํ͕ग़དྷΔ. • ࢖͍ํ • εΫϦϓτΛॻ͍ͯSpark؀ڥ্Ͱ࣮ߦ. • Jupyter Lab, Zeppelin notebookͰΞυϗοΫʹ࢖͏.

Slide 25

Slide 25 text

؀ڥΛͲ͜Ͱߏஙɾӡ༻͢Δ͔🤔 ؀ڥɾखஈ ߏஙͷखؒ ӡ༻͠΍͢͞ උߟ ΦϯϓϨϛεͰ શͯࣗલߏஙɾӡ༻ શͯࣗલͰઃఆ͢Δ ඞཁ͕͋Δ Կ͔ΒԿ·Ͱ ࣗ෼ͰݟΔඞཁ͕͋Δ Ұ൪େมͳύλʔϯ ຊ৬ͷΠϯϑϥΤϯδχΞ Ͱ΋͖͍ͭ࢓ࣄ Ϋϥ΢υ্ͷ7.,Tʹ ࣗલͰߏஙɾӡ༻ શͯࣗલͰઃఆ͢Δ ඞཁ͕͋Δ ͋Δఔ౓Ϋϥ΢υαʔϏε ͷԸܙʹत͔ΕΔ 4QBSL؀ڥͷࣗલߏங͸ ׂͱ೉қ౓͕ߴ͍ Ϋϥ΢υαʔϏεఏڙͷ ϚωʔδυαʔϏεΛ࢖͏ ˞࠷΋ਪ঑͢Δํࣜ (6*Ͱϙνϙν͢Δ ΋͘͠͸$-*"1*Ͱ ͍͍ײ͡ʹ࣮ߦ $16౳ͷϦιʔεΛ؂ࢹ ঢ়گʹԠͯ͡ϝϯςφϯε ࠷΋ָ͔ͭεϚʔτͳํ๏ "84 (PPHMF$MPVEଞ ֤ࣾαʔϏε༗

Slide 26

Slide 26 text

Google Cloudʹ͓͚ΔSparkӡ༻ͷબ୒ࢶ ؀ڥɾखஈ ߏங ӡ༻ ࢖͑Δػೳ උߟ ($&PS(,&ʹ ؀ڥΛ࡞ͬͯӡ༻ ࣗલͰߏஙޙ 4QBSLΛಋೖ શͯࣗલͰӡ༻ ໘౗ΛݟΔඞཁ༗ શͯͷػೳ ݁ہͷॴ%BUBQSPDͰ ग़དྷΔ͜ͱͳͷͰ ͓͢͢Ί͠ͳ͍ %BUBQSPD HDMPVEίϚϯυ "1* ίϯιʔϧͷ ͲΕ͔Ͱߏங %BUBQSPD͕࡞ͬͨ (,&PS($&؀ڥ Λ؂ࢹɾӡ༻ શͯͷػೳ Ұ൪ඪ४తͳߏ੒ %BUBQSPD 4FSWFSMFTT HDMPVEίϚϯυ "1* ίϯιʔϧ ্هͷͲΕ͔Ͱߏங ࣮ߦதͷ؂ࢹͷΈ ؀ڥ͸ॲཧޙʹ ࣗಈ࡟আ όονॲཧͷΈରԠ OPUFCPPL࢖͑ͳ͍ ఆظతͳόονॲཧ ͸͜Ε͕Ұ൪͍͍

Slide 27

Slide 27 text

DataprocͱDataproc Serverless • Google Cloudʹ͸Dataprocͱ͍͏SparkʢHadoopʣͷ 
 ϚωʔδυαʔϏε͕ଘࡏ͢Δ. • ࠓ·Ͱ͸GCE΍GKEʢK8sʣͰʮϗετɾCluster͕ଘࡏʯલఏͷ 
 ӡ༻͔͠Ͱ͖ͳ͔͕ͬͨ, ͍ͭ࠷ۙServerlessͱ͍͏બ୒ࢶ͕ര஀ • ʮ1೔1ճʯʮ30෼͓͖ʯΈ͍ͨͳόονӡ༻Ͱ͋Ε͹ 
 Serverless͕࢖͑Δʂͨͩ͠notebookܥͷ࣮ߦʢJupyterͳͲʣ͸ະରԠ • Serverless͸࢖ͬͨ෼͚ͩ՝ۚͳͷͰ͓ࡒ෍ʹ΋༏͍͠👛

Slide 28

Slide 28 text

PySparkΛ࢖ͬͯ΍ͬͨλεΫΛ঺հ • σʔλऩू&BigQuery΁ͷσʔλ౤ೖ • μογϡϘʔυΞϓϦ༻DBʢFirestoreʣ΁ͷσʔλ౤ೖ

Slide 29

Slide 29 text

ʲ࠶ܝʳσʔλऩू&BigQueryอଘ • σʔλݩαΠτʢBaseball Savantʣ͔Βఆظతʹσʔλऩू͢ΔΫϩʔϥʔʢCloud Functionsʣ࣮ߦ • ࣮ߦ݁Ռ͸Google Cloud StorageʢGCSʣʹCSVͱͯ͠อଘ. ͜Ε͕ݯઘͷσʔλʢDatalakeʣ • GCS্ͷCSVΛαϚϦʔ͍͍ͯ͠ײ͡ʹͯ͠BigQueryʹอଘ͢ΔPySparkεΫϦϓτΛDataproc Serverless্Ͱ࣮ߦ

Slide 30

Slide 30 text

σʔλऩू ʢnot Sparkʣ • WebεΫϨΠϐϯά͸SparkͰ 
 ΍Δ΂͖͜ͱͰ͸ͳ͍. • λεΫΛrequests-htmlͰ࣮૷, 
 Cloud FunctionsͰӡ༻ͯ͠ରॲ. • Cloud SchedulerͷCronઃఆͰ 
 ఆظ࣮ߦ, GCSʹอଘ

Slide 31

Slide 31 text

CSVσʔλΛ BigQueryʹ౤ೖ • Dataproc্Ͱ΍ΔλεΫͱͯ͠ 
 ద੾ͳൣғɾॲཧͷҰͭ • GCSͷύε͔ΒϑΝΠϧநग़ 
 Spark SQLͰॲཧͯ͠BigQuery΁ • DataFrameͱSQL͕Θ͔Ε͹ 
 ͍͍ײ͡ʹ࣮૷ɾӡ༻Մೳ

Slide 32

Slide 32 text

DataprocΛ࢖͓͏ • Google CloudͷυΩϡϝϯτɾαϯϓϧΛࣸܦ͠ͳ͕Β΍Δͱྑ͖ • https://cloud.google.com/dataproc • https://cloud.google.com/dataproc-serverless/docs • https://github.com/GoogleCloudDataproc/cloud-dataproc • Serverlessͷ৔߹, ࣄલʹVPC subnetΛ࡞Δඞཁ͋Γ • ࣍ϖʔδ͔Β, PySparkΛ࢖ͬͯ΍Δ৔߹ͷαϯϓϧΛগ͠঺հ • αϯϓϧίʔυ͸Type Hints࢖࣮ͬͯ૷ͯ͠·͕͢ผʹແͯ͘΋͍͍Ͱ͢. 
 ʢͲ͜ͷInterfaceͳͷ͔Θ͔Γ΍͘͢͢ΔͨΊ, ׶͑ͯType HintsΛ࢖ͬͯॻ͍͍ͯ·͢ʣ.

Slide 33

Slide 33 text

ͻͱ·࣮ͣ૷ 1. SessionΛ࡞Δ from pyspark.sql import SparkSession # BigQueryΛ࢖͏&StringͷൣғΛ޿͛Δ spark: SparkSession = SparkSession \ .builder \ .appName(‘app_yakiu')\ .config('spark.jars', 'gs://spark-lib/bigquery/spark- bigquery-with-dependencies_2.12-0.25.2.jar') \ .config('spark.sql.debug.maxToStringFields', 2000) \ .getOrCreate() # ͜Ε͸Dataprocಛ༗ͷఆٛ, ࣮ߦ࣌ͷtemporaryྖҬ spark.conf.set('temporaryGcsBucket', 'GCSͷόέοτ໊') • ద౰ͳ.pyϑΝΠϧΛ࡞Δ • SparkSessionΛੜ੒͢Δ • ඞཁͳJAR΍ఆٛΛઃఆ 
 BigQueryΛ࢖͏࣌͸ 
 JARͷࢦఆ͕ඞਢ

Slide 34

Slide 34 text

ͻͱ·࣮ͣ૷ 2. SchemaΛ࡞Δ from pyspark.sql.types import StructType, StructField, DoubleType, DateType, StringType, LongType # schemaઃఆʢͪΐͬͱ௕͍ʣ STATCAST_SCHEMA: StructType = StructType( [ StructField("pitch_type", StringType(), False), StructField("game_date", DateType(), False), StructField("release_speed", DoubleType(), False), StructField("release_pos_x", DoubleType(), False), StructField("release_pos_z", DoubleType(), False), # ௕͗͢ΔͷͰলུʢ91߲໨෼͋Δʣ StructField("spin_axis", DoubleType(), False), StructField("delta_home_win_exp", DoubleType(), False), StructField("delta_run_exp", DoubleType(), False) ] ) • CSVͷ৔߹SchemaΛ࡞Δ • ͜Ε͕ແ͍ͱҙਤ௨Γʹ 
 ಈ͔ͳ͍ • 91߲໨෼ͷSchema 
 ؤுͬͯॻ͖·ͨ͠ྦ

Slide 35

Slide 35 text

ͻͱ·࣮ͣ૷ 3. CSVಡΈࠐΉ from pyspark.sql import DataFrame as SparkDataFrame from schema import STATCAST_SCHEMA # ઌ΄ͲͷεΩʔϚఆٛ # ࠷ॳʹੜ੒ͨ͠sparkηογϣϯ͔Βreadؔ਺ΛCSV FormatࢦఆͰಈ͔͢ def read_csv(date: str, filename: str, schema: StructType = None) -> SparkDataFrame: try: return spark.read \ .format('csv') \ .options(header="true", inferSchema="true") \ .load(f'gs://your-bucket-name/path/{date}/ {filename}', schema=schema) except AnalysisException: return None # type: ignore sdf: SparkDataFrame = read_csv('2022-10-15', 'batter.csv', schema=STATCAST_SCHEMA) • sparkηογϣϯͷreadΛ 
 ࢖͏, formatʹCSVΛࢦఆ • ϔομʔͱͯ͠ઌ΄Ͳͷ 
 SchemaΛࢦఆ • GCSͷϑϧύεΛࢦఆ

Slide 36

Slide 36 text

ͻͱ·࣮ͣ૷ 4. BigQueryอଘ from pyspark.sql import DataFrame as SparkDataFrame # BigQueryʹอଘ͢Δؔ਺ def save_bigquery(sdf: SparkDataFrame, table_name: str) -> None: sdf.write\ .mode('append') \ .format('bigquery') \ .option('project', 'your project name') \ .option('table', f'dataset.{table_name}') \ .option('temporaryGcsBucket', 'GCSͷόέοτ໊') \ .option('createDisposition', 'CREATE_NEVER') \ .save() # Spark DataFrameͱBigQueryͷςʔϒϧ໊Λࢦఆ save_bigquery(sdf, 'batting_data') • DataFrameͷwriteؔ਺ • bigqueryΛࢦఆ • ྫ͸طଘςʔϒϧ΁ͷ 
 ௥هॻ͖ࠐΈ

Slide 37

Slide 37 text

Dataproc ServerlessΛ࢖࣮ͬͯߦ # ࣮ߦϑΝΠϧΛGCSʹΞοϓϩʔυ gsutil -m cp -r app gs://your-gcs-bucket/script # bata optionΛ͚ͭΔͱαʔόϨε࣮ߦʹͳΓ·͢ gcloud beta dataproc batches submit \ --project your-project \ --region asia-northeast1 pyspark \ --batch hello-world gs://your-gcs-bucket/script/app/csv2bq_daily.py \ --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar \ --subnet your-subnet \ --py-files gs://your-gcs-bucket/script/app/environment.py,gs://your-gcs-bucket/script/app/query.py,gs://your-gcs- bucket/script/app/schema.py

Slide 38

Slide 38 text

΍ͬͯΈͨײ૝ • ͻͱ·ͣ࡞Γ͍ͨ΋ͷ͕εϚʔτʹ࡞ΕͨͷͰྑ͔ͬͨ. • PySparkͷ࣮૷ͦͷ΋ͷ͸Dataprocͷํݴ͕গͳͯͦ͘͜΋ྑ͖. • ࣮ߦݖݶʢService Account & IAMʣ͸σϑΥϧτͰ͍͍ײ͡ʹߦ͚ͨ. • BigQuery΋GCS΋࠷ॳ͔ΒϑϧͰ৮ΕΔ. • ͨͩ, ݖݶڧ͗͢Δઆ͋ΔͷͰ࣮຿ͷ࣌͸νϡʔχϯά͕ඞཁ͔΋. • ͋Δఔ౓SparkͱGoogle Cloudʹ׳ΕͯΔਓ͡Όͳ͍ͱΩπΠઆ? 
 SparkΛॳΊͯ৮Γ·͢ʂͱ͍͏ํ͸ͪΐͬͱ֮ޛ͕ඞཁ͔΋͠Εͳ͍.

Slide 39

Slide 39 text

ʲ࠶ܝʳFirestore౤ೖʢDatabaseʹσʔλҠૹʣ • BigQueryσʔλΛμογϡϘʔυ༻σʔλͷܗࣜʢJSONʣʹม׵͢ΔPySparkεΫϦϓτΛDataproc Serverless্Ͱ࣮ߦ • ࣮ߦ݁ՌʢGCS্ʹJSONܗࣜͰอଘʣΛFirestoreʹೖΕΔͨΊͷPythonεΫϦϓτΛ࣮ߦ • ͳ͓͍ͣΕ΋खಈͰͷ࣮ߦʢཧ༝&ରԠࡦ͸ޙ΄Ͳʣ

Slide 40

Slide 40 text

ʲ࠶ܝʳFirestore౤ೖʢDatabaseʹσʔλҠૹʣ • BigQueryσʔλΛμογϡϘʔυ༻σʔλͷܗࣜʢJSONʣʹม׵͢ΔPySparkεΫϦϓτΛDataproc Serverless্Ͱ࣮ߦ • ࣮ߦ݁ՌʢGCS্ʹJSONܗࣜͰอଘʣΛFirestoreʹೖΕΔͨΊͷPythonεΫϦϓτΛ࣮ߦ • ͳ͓͍ͣΕ΋खಈͰͷ࣮ߦʢཧ༝&ରԠࡦ͸ޙ΄Ͳʣ

Slide 41

Slide 41 text

BigQuery͔ΒGCSʹϑΝΠϧग़ྗ for Dataproc • BigQueryͷσʔλΛSpark DataFrameʹ • Spark DataFrameΛϑΝΠϧग़ྗ ͪͳΈʹ࣮ߦํ๏ʢgcloud CLIʣ͸มΘΒͳ͍ͷͰׂѪ͠·͢.

Slide 42

Slide 42 text

ͻͱ·࣮ͣ૷ 5. BigQueryಡࠐ from pyspark.sql import SparkSession from pyspark.sql import DataFrame as SparkDataFrame from pyspark.sql.utils import AnalysisException # ͜͜͸ಉ͡ spark: SparkSession = SparkSession \ .builder \ .appName('your app')\ .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with- dependencies_2.12-0.25.2.jar') \ .config('spark.sql.debug.maxToStringFields', 2000) \ .getOrCreate() spark.conf.set('temporaryGcsBucket', GCS_BUCKET) # ͜͜ͰViewಡΈऔΓΛ༗ޮԽ͠ͳ͍ͱΤϥʔʹͳΔͷͰ஫ҙ spark.conf.set("viewsEnabled", "true") def read_bq() -> SparkDataFrame: """ Read Dashboard data BigQuery View to SparkDataFrame """ try: df: SparkDataFrame = spark.read.format('bigquery') \ .option('project', 'your project') \ .option('table', f'your_project.view_baseball') \ .load() return df except AnalysisException: return None # type: ignore sdf: SparkDataFrame = read_bq() • อଘͱಉ͘͡BigQueryͷ 
 JARΛࢦఆ • spark readͰBigQueryΛࢦఆ • BigQueryͷViewʹରͯ͠ 
 ߦ͏৔߹, Φϓγϣϯ͕ඞཁ

Slide 43

Slide 43 text

ͻͱ·࣮ͣ૷ 6. GCSอଘ from pyspark.sql import DataFrame as SparkDataFrame def save_json(sdf: SparkDataFrame) -> None: """ Save as JSON dataset SparkDataFrame to GCS Bucket """ sdf.write \ .format("json") \ .mode("overwrite") \ .option("path", "gs://your-gcs-bucket/filepath/hoge") \ .save() save_json(sdf) • DataFrameͷwriteؔ਺ • jsonΛࢦఆ • ࠷ऴతͳύεΛࢦఆ

Slide 44

Slide 44 text

PySparkͱDataproc Serverless • ʮ࢖͍͍ͨͱ͖͚ͩSparkΛ࢖͏ʯͱ͍͏ϢʔεέʔεΛ࣮ݱՄೳ. 
 ͜Ε͕αʔόϨεαʔϏεΛ࢖͏΂͖࠷େͷཧ༝. • ࠓճͷΞϓϦέʔγϣϯͷσʔλαΠζʢ1೥Ͱ1GB͍͔ͳ͍ʣͩͱ 
 Ըܙʹत͔Εͳ͍͕, ʮ਺GB/೔ఔ౓ͷσʔλΛαΫοͱόονॲཧʯ 
 Έ͍ͨͳϢʔεέʔεʹͳΔͱ݁ߏศརͳؾ͕͠·͢ʢલॲཧɾΫϨϯδϯά͢Δͱ͔ʣ. • ʮॲཧ͢Δͱ͖͚ͩಈ͔͢ʯͱ͍͏ײ͡ͷ͍ܰίʔυͳͷͰPySparkͱ΋૬ੑόπάϯ. • ͳ͓, ॲཧͷࣗಈԽ͸ͪΐͬͱบ͕͋Γ·͢, खஈ੍͕ݶ͞ΕΔʢͷͰखಈʹͨ͠ʣ. 
 ࣗಈԽ͸Cloud ComposerʢAir fl owʣͰ΍Δͷ͕ࠓͷॴͷϕεϓϥͬΆ͍.

Slide 45

Slide 45 text

ࠓ೔ͷ࿩ͷ·ͱΊ • PythonͰ͍͍ײ͡ʹσʔλॲཧΛ͢ΔͷʹPySpark͸ྑ͍ͧ. • PySpark͸Ϋϥ΢υͰಈ͔ͤ·͢, ࠓ೔͸DataprocΛ঺հ͠·ͨ͠. • αʔόϨεʹΫϥ΢υΛ࢖͑ΔΑ͏ʹͳΔͱ,৭ʑͱָ 
 ͨͩ͠, ࢓্༷ͷ੍ݶ΋͋ΔͷͰ༰ྔ༻๏Λਖ਼͘͠ཧղͯ͠࢖͓͏.

Slide 46

Slide 46 text

Done. ͝ਗ਼ௌ͋Γ͕ͱ͏͍͟͝·ͨ͠⽁