Introduction of GCP Dataflow
(In Japanese)
soymsk
April 12, 2018
Transcript
Introduction of GCP Dataflow 2018-04 @soymsk
What is Dataflow?
What can it do?
It does distributed parallel processing.
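There are plenty of things you would want to distribute
• Data aggregation: Sum, Average, etc.
• Data transformation: JSON parsing, MapMatch
• I/O: input and output across multiple resources
• Matrix operations
• Machine learning: distributed training (CV, RL)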
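Do any of these sound familiar?
• Implementing distributed processing is hard...
• I want batch processing for now, but streaming processing in the future!
• Operating a distributed processing platform (Hadoop, YARN, Spark) is painful...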
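Do any of these sound familiar?
• Implementing distributed processing is hard...
• → You can write it quickly (*) in Python
• I want batch processing for now, but real-time processing in the future!
• → The same code (*) runs in both batch and streaming mode
• Operating a distributed processing platform (Hadoop, YARN, Spark) is painful...
• → Fully managed, so no operations work is needed (*)
(*) Almost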
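Distributed vs. parallel
• (*) The definitions are not clearly settled
• Implementing distributed processing yourself is close to impossible
• How do you keep processing alive on a cluster where failures and network partitions can happen? Re-execution?
• How do you spread load appropriately across multiple machines with different specs? Proper scheduling
• Who takes care of all of the above? A master ← what if the master goes down? Multi-master?
• Master election algorithms, consensus algorithms (Raft, Paxos)
• http://block-chain.jp/blockchain/distributed-consensus-algorithm-protocol/
Parallel vs. Distributed
• Definition (*): Parallel = parallel processing on a single machine; Distributed = parallel processing across multiple machines
• How to achieve it: Parallel = multi-processing, multi-threading, CPU instructions; Distributed = implement it yourself, or implement it on a "distributed processing platform"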
Distributed processing platforms: streaming / batch
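Distributed processing platforms
• Each one is good at different things
• Batch or streaming
• Implementation language (Java, Python)
• Which layer it is responsible for: some handle everything down to cluster management, others delegate cluster management to a separate component (on YARN)
• The implementation style differs per framework, so there is a learning cost and the code is not portable
• Operations are hard: debugging the code is already difficult, and the platform needs maintenance too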
That is where this comes in
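What is Dataflow (Apache Beam)?
• A fully managed distributed processing service on GCP
• Can be implemented in Python, Java, (Scala), (Go)
• Implement once, run on any runtime
• The framework itself is published as an open implementation under the name Apache Beam
• Code written with Apache Beam can run on a variety of runtimes
• Dataflow is in fact positioned as one of the runtimes on which Apache Beam runs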
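Apache Beam
• With Apache Beam you only define what the processing does
• Executing the processing is the runtime's job
• Apache Beam abstracts away the runtime and the input data for you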
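Apache Beam Concept
• Define each processing step as a function and chain the steps together
• PCollection: an object that wraps the data passed between pipeline steps
• Transform: a data processing step (a Python function); the smallest unit of distribution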
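The "chain of functions" idea above can be modeled in plain Python. This is only a toy sketch and not the Beam API: ordinary lists stand in for PCollections, and the helper names (`flat_map`, `map_each`, `group_by_key`, `run_pipeline`) are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def flat_map(fn):
    """Build a transform that applies fn to each element and flattens the results."""
    return lambda elements: [out for element in elements for out in fn(element)]


def map_each(fn):
    """Build a transform that applies fn to each element (1:1)."""
    return lambda elements: [fn(element) for element in elements]


def group_by_key():
    """Build a transform that groups (key, value) pairs into (key, [values])."""
    def transform(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return sorted(groups.items())
    return transform


def run_pipeline(source, *transforms):
    """Feed the source collection through each transform in order."""
    return reduce(lambda collection, transform: transform(collection), transforms, source)


lines = ["to be or not to be", "to beam"]
word_counts = run_pipeline(
    lines,
    flat_map(str.split),                       # line -> words
    map_each(lambda word: (word, 1)),          # word -> (word, 1)
    group_by_key(),                            # (word, 1)... -> (word, [1, 1, ...])
    map_each(lambda kv: (kv[0], sum(kv[1]))),  # (word, [1, ...]) -> (word, count)
)
print(word_counts)
```

Each step only sees the output of the previous one, which is what lets a real runtime decide where and how to execute each step.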
Explained using batch processing as the example
Hands-on
I/O
Java
• File-based: Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems. FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO
• Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud PubSub, JMS, MQTT
• Database: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Solr, Elasticsearch (v2.x and v5.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis
Python
• File-based: Beam Python supports Google Cloud Storage and local filesystems. avroio, textio, tfrecordio, vcfio
• Database: Google BigQuery, Google Cloud Datastore
Transforms (diagram): ParDo (Map), GroupBy, Side Input, Multiple Output, Combine, each drawn as Data boxes with cardinality labels (1, N, M)
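Streaming?
• Just switch the input to a messaging service (PubSub, Kafka) and the pipeline automatically runs as streaming processing
• The Transforms can be exactly the same as in batch processing
• For processing whose input and output are 1:1, that is all there is to it
• → But what about aggregation (GroupBy)?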
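Windowing
• The concepts of "Bounded data" and "Unbounded data" exist as a way to abstract over streaming and batch processing
• Bounded: the beginning and end of the data are defined (batch)
• Unbounded: the end is undetermined (streaming)
• Bounded data can be grouped with GroupBy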
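Windowing
• Unbounded data just needs to be converted into bounded data
• "Windowing" is the mechanism that slices data up to make it bounded
• In fact, batch processing can be thought of as equivalent to streaming over data that all belongs to a single window called the "GlobalWindow"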
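Windowing
• A sliding time window covers most of what you will want to do
• For example, every 5 minutes, output the average of the last 30 minutes of data
Figure: window (width: 60s, offset: 30s)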
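To make the sliding window concrete, here is a minimal plain-Python sketch (not the Beam API) that assigns timestamped events to overlapping windows and averages each window. The 60-second width and 30-second period follow the figure label; the function names and the sample events are assumptions made up for illustration.

```python
from collections import defaultdict


def sliding_window_starts(timestamp, size, period):
    """Return the start times of every sliding window that contains timestamp.

    A window covers the half-open interval [start, start + size).
    """
    start = timestamp - (timestamp % period)  # latest window that has already opened
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= period
    return sorted(starts)


def windowed_average(events, size, period):
    """Average the values of (timestamp, value) events per sliding window."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        for start in sliding_window_starts(timestamp, size, period):
            buckets[start].append(value)
    return {start: sum(values) / len(values) for start, values in sorted(buckets.items())}


# Hypothetical events: (timestamp in seconds, measured value)
events = [(5, 10.0), (45, 20.0), (70, 60.0)]
averages = windowed_average(events, size=60, period=30)
print(averages)
```

Because the width is twice the period, every event lands in two windows, and each window is a bounded slice of the stream that can be grouped and aggregated.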
Dataflow pitfalls
(1) It worked locally, but it does not work on Dataflow
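Possible causes
• In a distributed environment, memory space is not shared at all
• Be careful with stateful logic
• State shared between workers has to go through an external component (memcached, GCS, etc.)
• Insufficient worker resources
• A Transform runs at least once per record
• It may be re-executed, for example when a worker goes down
• If you increment a processed-record counter in an external DB, the count may not match the number of input records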
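The counter mismatch described above can be reproduced with a small simulation. This is a sketch in plain Python, not Dataflow code: `deliver_at_least_once` is a hypothetical stand-in for a runner that retries a record after a worker failure, and the record ids are made up.

```python
def deliver_at_least_once(records, redelivered):
    """Yield each record once, plus a second time for ids listed in redelivered."""
    for record in records:
        yield record
        if record["id"] in redelivered:
            yield record  # simulated retry after a worker failure


records = [{"id": "a"}, {"id": "b"}, {"id": "c"}]

naive_counter = 0  # stands in for "increment a counter in an external DB"
seen_ids = set()   # stands in for an idempotent write keyed by record id

for record in deliver_at_least_once(records, redelivered={"b"}):
    naive_counter += 1
    seen_ids.add(record["id"])

print(naive_counter, len(seen_ids))
```

The increment counts the retried record twice, while the write keyed by record id is unaffected by the retry. Keying side effects by a stable id is one common way to make them safe under at-least-once execution.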
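Possible causes 2
• Module dependencies
• By default, workers only have basic packages installed
• For anything installable with pip, just write a requirements.txt, pass it as an option, and deploy
• Modules you wrote yourself need a setup.py and have to be packaged
• Adding binaries (libs) is a thorny path
• Dataflow workers seem to be ChromeOS-like (an Ubuntu variant)
• apt is available, but libs built on your own machine will not run
• Put the build code into setup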
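As a rough illustration of "pass it as an option", the sketch below assembles the dependency-related command-line flags for a pipeline launch. `--requirements_file` and `--setup_file` are, to my knowledge, the option names used by the Beam Python SDK; the helper `dependency_flags` and the file paths are hypothetical.

```python
def dependency_flags(requirements_file=None, setup_file=None):
    """Build the pipeline flags that tell workers which dependencies to install."""
    flags = []
    if requirements_file:
        # Packages installable with pip, listed one per line in the file.
        flags.append(f"--requirements_file={requirements_file}")
    if setup_file:
        # Your own modules, packaged through a setup.py.
        flags.append(f"--setup_file={setup_file}")
    return flags


flags = dependency_flags("requirements.txt", "./setup.py")
print(" ".join(flags))
```

The resulting list would be appended to the other arguments used when starting the pipeline.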
(2) It is slow for some reason
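Dataflow tuning
• There are not many tuning knobs
• Region (keep data and instances as close together as possible)
• Instance type x "maximum number of workers"
• Basically, watch the UI and improve the code of whichever step is clogged
• Is input/output the bottleneck?
• Raise the PubSub quota
• Reading a large number of GCS files was slow
• Combine them before reading
• Improved recently?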
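Tuning - Fusion -
• Fusion
• Dataflow analyzes the structure of the pipeline and has a mechanism that fuses multiple Transforms and runs them as a single Transform
• For fused Transforms, the wrapping PCollection is not materialized and the input data flows straight into the next Transform, so performance improves
• A fused chain of Transforms is executed as a single Transform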
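Tuning - Fusion - Word Count example
• Without fusion: the intermediate data (PCollection) is materialized at every step, which is expensive
• With fusion: the steps are merged and materialization of the intermediate data is skipped, which makes it faster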
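The difference between materializing every intermediate collection and fusing the steps can be sketched in plain Python. This is an analogy, not how Dataflow is implemented: lists play the role of materialized PCollections, and a single generator plays the role of the fused Transform.

```python
def materialized(lines):
    """Run each step separately, building a full intermediate list every time."""
    words = [word for line in lines for word in line.split()]  # intermediate data 1
    lowered = [word.lower() for word in words]                 # intermediate data 2
    return [(word, 1) for word in lowered]


def fused(lines):
    """Run all steps per element in one pass, with no intermediate lists."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


lines = ["To be or not", "to BE"]
print(list(fused(lines)) == materialized(lines))
```

Both versions produce the same pairs; the fused one simply never stores the intermediate collections.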
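Tuning - Fusion -
• Fusion is basically harmless, but when the pipeline contains a ParDo where "number of input records" <<< "number of output records" (a "high fan-out" ParDo), processing can clog in the downstream Transforms
• As a result of fusion, the intermediate Transforms are not distributed
(*) The worker in charge is decided by the input record
Figure: four Transforms collapsed into one fused Transform, compared with the distribution originally intended (labels: record: 1, record: 10000, record: 1)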
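Tuning - Fusion -
• For example, a ParDo that takes a list of file names as input and emits the records inside those files is dangerous (and very common)
• Workarounds
• Insert a GroupBy -> UnGroup step after the high fan-out ParDo
• Feed the PCollection emitted by one step into the next step as a "Side Input" (a Side Input is always materialized)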
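The effect of the workaround can be illustrated with a toy load simulation in plain Python. It is not Dataflow's scheduler: the worker count, the round-robin assignment, and the names `fused_load` and `redistributed_load` are all assumptions chosen to keep the sketch small.

```python
from collections import Counter

NUM_WORKERS = 4  # hypothetical cluster size


def expand(file_name, records_per_file):
    """High fan-out step: one file name becomes many records."""
    return [f"{file_name}:{i}" for i in range(records_per_file)]


def fused_load(file_names, records_per_file):
    """Fused case: the worker is chosen per input element and keeps all downstream work."""
    load = Counter()
    for index, file_name in enumerate(file_names):
        worker = index % NUM_WORKERS
        load[worker] += len(expand(file_name, records_per_file))
    return load


def redistributed_load(file_names, records_per_file):
    """With a regrouping step after the fan-out, records are spread again across workers."""
    load = Counter()
    position = 0
    for file_name in file_names:
        for _record in expand(file_name, records_per_file):
            load[position % NUM_WORKERS] += 1
            position += 1
    return load


fused = fused_load(["big.log"], 10000)
spread = redistributed_load(["big.log"], 10000)
print(dict(fused), dict(spread))
```

With a single input file, the fused pipeline leaves all 10000 downstream records on one worker, while the version with a redistribution step gives each of the four workers 2500.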
Prevent Fusion
• The built-in file-reading transforms already have this countermeasure in place
(3) Debugging is painful
That is the fate of distributed parallel processing
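Dataflow debugging
• Stack traces are not shown when an exception occurs
• Catch exceptions inside each step and output them yourself
• Save the input data of failed records somewhere such as GCS
• Careless logging produces huge amounts of logs (on the order of GB)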
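The "catch inside each step and save the failing input" advice can be sketched as a small wrapper. This is plain Python rather than Beam code: `safe_apply` is a hypothetical helper, and an in-memory list stands in for the GCS location where failed inputs would be saved.

```python
import json
import traceback


def safe_apply(fn, element, dead_letters):
    """Apply fn to element; on failure, stash the input and traceback instead of raising."""
    try:
        return [fn(element)]
    except Exception:
        # Keep the failing input and the stack trace so the record can be inspected later.
        dead_letters.append({"input": element, "error": traceback.format_exc()})
        return []


raw_lines = ['{"user": "a"}', "not json", '{"user": "b"}']
dead_letters = []  # stand-in for a GCS path or similar external store
parsed = [out for line in raw_lines for out in safe_apply(json.loads, line, dead_letters)]

print(parsed)
print(dead_letters[0]["input"])
```

Only failing records are recorded, which also keeps the log volume far below what logging every element would produce.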
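Summary
• Dataflow (Apache Beam) appears to be Google's effort to unify the crowded field of distributed processing platforms
• It has its pain points, but with features such as autoscaling it may be almost the only choice for a managed distributed processing platform
• For real-time log collection, PubSub -> (DF) -> BQ looks like it is becoming the de facto standard
• Can be implemented in Python, Go, and Java
• However, Java is the most stable
• Especially for streaming mode, it is better to go with Java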