Introduction of GCP Dataflow
soymsk
April 12, 2018
(In Japanese)
Transcript
Introduction of GCP Dataflow 2018-04 @soymsk
What is Dataflow?
What can it do?
It does distributed parallel processing.
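There are plenty of things in the world you want to distribute
• Data aggregation: Sum, Average, etc.
• Data processing: JSON parsing, MapMatch
• I/O: input and output across multiple resources
• Matrix operations
• Machine learning: distributed training (CV, RL)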
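Do any of these problems sound familiar?
• Distributed processing is hard to implement.
• We want batch processing today, but streaming processing in the future.
• Operating a distributed processing platform (Hadoop, YARN, Spark) is painful.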
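Do any of these problems sound familiar?
• Distributed processing is hard to implement.
• → You can write it quickly in Python (*).
• We want batch processing today, but real-time processing in the future.
• → The same code (*) runs in both batch and streaming mode.
• Operating a distributed processing platform (Hadoop, YARN, Spark) is painful.
• → It is fully managed, so no operations work is needed (*).
*: mostly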
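Distributed vs. parallel
• * The definitions are not firmly settled.
• Parallel
  • Definition (*): parallel processing on a single machine
  • How: multiprocessing, multithreading, CPU instructions
• Distributed
  • Definition (*): parallel processing across multiple machines
  • How: implement it yourself, or implement it on a "distributed processing platform"
• Implementing distributed processing yourself is close to impossible.
• How do you keep processing alive on a cluster where failures and network partitions can happen?
  • Re-execution?
• How do you spread load appropriately across multiple machines with different specs?
  • Appropriate scheduling
• Who does all of the above?
  • A master ← what if the master goes down?
  • Multi-master?
  • Master election algorithms, consensus algorithms (Raft, Paxos)
  • http://block-chain.jp/blockchain/distributed-consensus-algorithm-protocol/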
Distributed processing platforms (figure: platforms grouped into streaming and batch)
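Distributed processing platforms
• Each one is good at different things.
  • Batch or streaming
  • Implementation language (Java, Python)
  • Which layers it covers
    • Handles everything through cluster management itself
    • Delegates cluster management to a separate component (on YARN)
• Each framework is programmed differently, so there is a learning cost and code is not portable.
• Operations are hard.
  • Debugging the code is hard enough, and the platform itself also needs maintenance.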
And so...
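What is Dataflow (Apache Beam)?
• A fully managed distributed processing service on GCP
• Can be programmed in Python, Java, (Scala), (Go)
• Implement once, run on any runtime.
• The framework itself is published under the name Apache Beam.
• Code written with Apache Beam can run on a variety of runtimes.
• Dataflow is in fact positioned as one of the runtimes that Apache Beam runs on.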
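Apache Beam
• With Apache Beam you only define what the processing does.
• How the processing is executed is up to the runtime.
• Apache Beam abstracts away the runtime and the input data.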
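Apache Beam Concept
• Define each processing step as a function and chain the steps together.
• PCollection: an object that wraps the data passed between pipeline steps
• Transform: a data processing step (a Python function). It is the smallest unit of distribution.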
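As a rough, pure-Python analogy of "functions chained together" (this is not the Beam API; `split_words`, `lowercase`, and `run_pipeline` are hypothetical names chosen for illustration), each step takes an iterable of records and returns another one, which plays the role of the data handed between steps:

```python
from typing import Callable, Iterable, List

# In this sketch a "transform" is any function from records to records.
Step = Callable[[Iterable], Iterable]


def split_words(lines: Iterable[str]) -> Iterable[str]:
    """Emit every whitespace-separated word of every line."""
    for line in lines:
        yield from line.split()


def lowercase(words: Iterable[str]) -> Iterable[str]:
    """Normalize each word to lower case."""
    for word in words:
        yield word.lower()


def run_pipeline(source: Iterable, steps: List[Step]) -> list:
    """Feed the output of each step into the next one, in order."""
    data = source
    for step in steps:
        data = step(data)
    return list(data)


result = run_pipeline(["Hello Beam", "hello Dataflow"], [split_words, lowercase])
print(result)
```

In a real runner the steps would be distributed across workers; here they simply run in one process.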
The explanation uses batch processing as the example.
Hands-on
I/O
• Java
  • File-based: Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems. FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO
  • Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud PubSub, JMS, MQTT
  • Database: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Solr, Elasticsearch (v2.x and v5.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis
• Python
  • File-based: Beam Python supports Google Cloud Storage and local filesystems. avroio, textio, tfrecordio, vcfio
  • Database: Google BigQuery, Google Cloud Datastore
Transforms (figure: data flowing through ParDo (Map), GroupBy, Side Input, Multiple Output, and Combine, annotated with the cardinalities 1:1, N:M, and N:1)
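Streaming?
• Just switch the input to a messaging service (PubSub, Kafka) and the pipeline automatically runs as streaming.
• Transforms can stay exactly the same as in batch.
• For processing whose input and output are 1:1, that is all there is to it.
• → But what about aggregation (GroupBy)?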
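Windowing
• To abstract over streaming and batch, Beam has the concepts of "bounded data" and "unbounded data".
• Bounded: the start and the end of the data are defined (batch).
• Unbounded: the end is undetermined (streaming).
• Bounded data can be grouped with GroupBy.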
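Windowing
• So unbounded data just needs to be turned into bounded data.
• "Windowing" is the mechanism that cuts the data into bounded pieces.
• In fact, batch processing can be thought of as equivalent to streaming over data that all sits in a single window called the "GlobalWindow".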
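The idea of cutting an unbounded stream into bounded pieces can be sketched in plain Python (not the Beam API). The sketch assumes each event carries a numeric timestamp in seconds and uses fixed, non-overlapping windows; `fixed_window_start` and `group_by_window` are hypothetical helper names:

```python
from collections import defaultdict


def fixed_window_start(timestamp: int, size: int) -> int:
    """Start of the fixed window of length `size` that contains `timestamp`."""
    return timestamp - (timestamp % size)


def group_by_window(events, size: int) -> dict:
    """Group (timestamp, value) pairs by the window they fall into.

    Each window is a bounded chunk, so it can be aggregated like batch data.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        windows[fixed_window_start(timestamp, size)].append(value)
    return dict(windows)


events = [(1, "a"), (59, "b"), (60, "c"), (125, "d")]
grouped = group_by_window(events, size=60)
print(grouped)
```

A real streaming runner also has to decide when a window is complete, for example when late data arrives; this sketch ignores that.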
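Windowing
• Sliding time windows cover most of what you are likely to need.
• For example: every 5 minutes, output the average of the last 30 minutes of data.
(figure: sliding window with width 60s and offset 30s)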
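A minimal sketch of sliding-window assignment in plain Python (not the Beam API), using the width of 60s and offset of 30s from the figure. With sliding windows one event belongs to several overlapping windows; `sliding_windows` and `sliding_averages` are hypothetical names:

```python
from collections import defaultdict


def sliding_windows(timestamp: int, size: int, period: int) -> list:
    """Return the [start, end) bounds of every window containing `timestamp`."""
    start = timestamp - (timestamp % period)  # latest window that starts at or before the event
    windows = []
    while start > timestamp - size:  # the window still covers the event
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def sliding_averages(events, size: int, period: int) -> dict:
    """Average the values of (timestamp, value) pairs per sliding window."""
    totals = defaultdict(lambda: [0.0, 0])  # window -> [sum, count]
    for timestamp, value in events:
        for window in sliding_windows(timestamp, size, period):
            totals[window][0] += value
            totals[window][1] += 1
    return {window: total / count for window, (total, count) in totals.items()}


events = [(10, 2.0), (40, 4.0), (70, 6.0)]
averages = sliding_averages(events, size=60, period=30)
print(averages)
```

The event at 40s lands in both the 0-60 and the 30-90 window, which is why its value contributes to two averages.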
Dataflow pitfalls
(1) It worked locally, but it does not work on Dataflow
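Possible causes
• In a distributed environment, no memory is shared at all.
  • Stateful logic needs care.
  • State shared between workers has to go through an external component (memcached, GCS, etc.).
• Workers are running out of resources.
• A Transform runs at least once per record.
  • Records may be re-executed, for example when a worker dies.
  • If you increment a processed-record counter in an external DB, the count may not match the number of input records.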
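The counter mismatch under at-least-once execution can be illustrated with a small simulation. It assumes every record carries a stable ID, and it uses an in-memory set as a stand-in for the external store; `count_naively` and `count_idempotently` are hypothetical names:

```python
def count_naively(deliveries) -> int:
    """Increment once per delivery; retried records are counted twice."""
    count = 0
    for _record_id in deliveries:
        count += 1
    return count


def count_idempotently(deliveries) -> int:
    """Remember which record IDs were already handled, so a retry changes nothing."""
    processed_ids = set()  # stand-in for an external store keyed by record ID
    for record_id in deliveries:
        processed_ids.add(record_id)
    return len(processed_ids)


# "r2" is delivered twice, as could happen when a worker dies and the record is retried.
deliveries = ["r1", "r2", "r2", "r3"]
naive = count_naively(deliveries)
idempotent = count_idempotently(deliveries)
print(naive, idempotent)
```

Keying side effects by record ID is one common way to make them safe to repeat; whether it fits depends on the external system.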
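Possible causes, part 2
• Module dependencies
  • By default, workers only have basic packages installed.
  • For anything installable with pip, just write a requirements.txt, pass it as an option, and deploy.
  • Modules you wrote yourself need a setup.py and have to be packaged.
• Adding binaries (libs) is a thorny path.
  • Dataflow workers appear to run something ChromeOS-like (a Ubuntu variant).
  • apt is available, but libs built on your own machine do not work.
  • Put the build commands into setup.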
(2) It is slow for some reason
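Dataflow tuning
• There are not many tuning knobs.
  • Region (keep the data and the instances as close as possible)
  • Instance type × "max number of workers"
• The basic approach is to watch the UI and improve the code wherever the pipeline is backing up.
• Is input or output the bottleneck?
  • Raise the PubSub quota.
  • Reading a large number of GCS files was slow.
    • Combine the files before reading them.
    • This may have improved recently.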
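Tuning: Fusion
• Fusion
• Dataflow analyzes the structure of the pipeline and can fuse multiple Transforms so that they execute as a single Transform.
• For fused Transforms the wrapping PCollection is not materialized; input data flows straight into the next Transform, which improves performance.
• A fused chain of Transforms is executed as one Transform.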
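Tuning: Fusion (Word Count example)
• Without fusion: the intermediate data (PCollection) is materialized at every step, which is expensive.
• With fusion: the steps are merged and the intermediate data is not materialized, which is faster.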
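The difference between materializing every intermediate result and running the steps fused can be mimicked in plain Python. This is only an analogy for what the slide describes, not how Dataflow is implemented; `tokenize`, `to_pair`, `run_unfused`, and `run_fused` are hypothetical names:

```python
def tokenize(line: str) -> list:
    """Split one line into words."""
    return line.split()


def to_pair(word: str) -> tuple:
    """Turn a word into a (word, 1) pair, as in Word Count."""
    return (word, 1)


def run_unfused(lines) -> list:
    """Build a full intermediate list after every step."""
    words = [word for line in lines for word in tokenize(line)]  # materialized
    pairs = [to_pair(word) for word in words]  # materialized again
    return pairs


def run_fused(lines):
    """Push each record through both steps at once; nothing is stored in between."""
    for line in lines:
        for word in tokenize(line):
            yield to_pair(word)


lines = ["to be or", "not to be"]
unfused_result = run_unfused(lines)
fused_result = list(run_fused(lines))
print(fused_result)
```

Both versions produce the same pairs; the fused one avoids holding the intermediate word list.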
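Tuning: Fusion
• Fusion is normally harmless, but when the pipeline contains a ParDo whose number of output records is far larger than its number of input records (a "high fan-out" ParDo), processing can back up in the Transforms that follow it.
• As a result of fusion, the intermediate Transforms are not distributed. (Note: the worker in charge is decided per input record.)
(figure: four separate Transforms versus one fused Transform; the distribution that was intended; record counts 1 → 10000 → 1)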
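Tuning: Fusion
• For example, a ParDo that takes a list of file names as input and outputs the records inside those files is dangerous (and common).
• Workarounds
  • Insert a GroupBy -> UnGroup step after the high fan-out ParDo.
  • Feed the PCollection output by one step into the next step as a "Side Input" (a Side Input is always materialized).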
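Why a high fan-out step hurts when fused, and why redistributing its output helps, can be shown with a toy simulation of worker load. It is a sketch under simplifying assumptions: work is assigned round-robin, and round-robin over the output records stands in for the GroupBy -> UnGroup step. `read_file`, `fused_load`, and `redistributed_load` are hypothetical names:

```python
from collections import Counter


def read_file(file_name: str, records_per_file: int) -> list:
    """Hypothetical high fan-out step: one file name in, many records out."""
    return [f"{file_name}#{i}" for i in range(records_per_file)]


def fused_load(file_names, num_workers: int, records_per_file: int) -> Counter:
    """Fused case: whichever worker got the file name also handles all its records."""
    load = Counter()
    for index, name in enumerate(file_names):
        worker = index % num_workers
        load[worker] += len(read_file(name, records_per_file))
    return load


def redistributed_load(file_names, num_workers: int, records_per_file: int) -> Counter:
    """Fusion broken after the fan-out: output records are spread over all workers."""
    records = [r for name in file_names for r in read_file(name, records_per_file)]
    load = Counter()
    for index, _record in enumerate(records):
        load[index % num_workers] += 1
    return load


fused = fused_load(["a.csv"], num_workers=4, records_per_file=10000)
spread = redistributed_load(["a.csv"], num_workers=4, records_per_file=10000)
print(dict(fused), dict(spread))
```

With one input file, the fused run puts all 10000 downstream records on a single worker, while the redistributed run gives each of the four workers 2500.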
Prevent Fusion
• The built-in file-reading transforms already have a countermeasure in place.
(3) Debugging is painful
That is the fate of distributed parallel processing.
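Dataflow debugging
• Stack traces are not shown when an exception occurs.
  • Catch exceptions inside each step and emit the trace yourself.
  • Save the input data of failed records somewhere such as GCS.
• Careless logging produces an enormous volume of logs (gigabytes).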
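The "catch inside each step and set the failing input aside" advice can be sketched as a small wrapper in plain Python. The sketch keeps failures in a list; writing them to GCS or another store is left out. `safe_apply` and `parse_record` are hypothetical names:

```python
import json
import traceback


def parse_record(raw: str) -> dict:
    """Example step that fails on malformed input."""
    return json.loads(raw)


def safe_apply(step, records):
    """Run `step` on each record; collect failures instead of crashing.

    Returns (results, failures). Each failure keeps the original input and the
    stack trace text, so it can be stored and inspected later.
    """
    results, failures = [], []
    for record in records:
        try:
            results.append(step(record))
        except Exception:
            failures.append({"input": record, "traceback": traceback.format_exc()})
    return results, failures


good, dead_letters = safe_apply(parse_record, ['{"a": 1}', "not json"])
print(good, len(dead_letters))
```

Recording one entry per failed record, rather than logging every record, also keeps the log volume manageable.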
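Summary
• Dataflow (Apache Beam) is apparently Google's effort to unify the crowded field of distributed processing platforms.
• It has its pain points, but with features such as autoscaling it may be close to the only option for a managed distributed processing platform.
• For real-time log collection, PubSub -> (DF) -> BQ looks like it is becoming the de facto standard.
• It can be programmed in Python, Go, and Java.
  • However, Java is the most stable.
  • Java is the better choice especially for streaming mode.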