[PyCon JP 2019] An Introduction to Airflow and Use Cases, from a Novice Pythonista
Naoki Matsuda
September 17, 2019
Slides from the PyCon JP 2019 talk.
Transcript
An Introduction to Airflow and Use Cases, from a Novice Pythonista
PyCon JP 2019, 2019.9.17, Naoki Matsuda
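Agenda
0. Self-introduction
1. Overview of Airflow
2. In-house Airflow case study
- Product overview and challenges, the Airflow environment
3. Where I stumbled with Airflow
- Passing data between tasks
- Verifying DAG behavior: building a local development environment with docker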
Self-introduction
Naoki Matsuda
- Affiliation: Dentsu Digital Inc.
- Work: backend services and ETL-related development
- Joined the company in 2018
1. Overview of Airflow
Apache Airflow overview
A platform for scheduling and monitoring workflows made up of batch jobs.
- Originated at Airbnb
- Open source (an Apache Software Foundation incubation project) [1, 2]
- Implemented in Python [3]
- Active development community [3]
[1] https://incubator.apache.org/projects/airflow.html
[2] https://airflow.apache.org/license.html
[3] https://github.com/apache/airflow
Activity of the development community (as of 2019.9.9)
What you can do with Apache Airflow
- Define workflows (DAGs) in Python code
- Run tasks according to their dependencies
- Schedule workflows
- Rich Web UI
  - Monitor DAG run status
  - Check task logs
  - Visualize dependencies, and more
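Defining a workflow in Python code (creating a DAG)
[Code slide; annotations:]
- Common DAG settings: run frequency, run period, timeout, and so on
- Task 1
- Task 2
- Definition of the dependencies between tasks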
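The code on this slide is not part of the transcript. As a rough illustration of the idea only (tasks plus dependencies, run in dependency order), here is a plain-Python sketch that uses the standard library's `graphlib` rather than Airflow itself. The task names `extract`, `transform`, and `load` and the helper `run_in_dependency_order` are hypothetical and do not come from the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a task, each value is the set of
# upstream tasks that must finish before it may start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_in_dependency_order(deps, actions):
    """Run every task after all of its upstream tasks; return the run order."""
    order = []
    for task_id in TopologicalSorter(deps).static_order():
        actions[task_id]()
        order.append(task_id)
    return order


log = []
# One trivial action per task; a real task would do actual work here.
actions = {name: (lambda name=name: log.append(f"ran {name}")) for name in dependencies}

order = run_in_dependency_order(dependencies, actions)
print(order)
```

In Airflow the same information is declared with a DAG object and Operators, and the scheduler decides when each task runs. This sketch only shows the ordering constraint.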
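Creating the tasks that make up a workflow
- A workflow is made up of tasks called Operators [1]
- One Operator describes one task
[Code slide annotation:] Operator arguments. See the documentation for the arguments each Operator takes [2]
[1] https://airflow.apache.org/concepts.html#operators
[2] https://airflow.apache.org/_api/airflow/operators/index.html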
Creating the tasks that make up a workflow
- Run a Bash command: BashOperator
- Run a Python function: PythonOperator
- Run SQL: MySqlOperator, PostgresOperator, …
- Send an HTTP request: SimpleHttpOperator
- Others, such as cloud services: BigQueryOperator, AWSAthenaOperator, …
- Sense a specific condition: Sensor
https://airflow.apache.org/concepts.html#operators
https://airflow.apache.org/_api/airflow/operators/index.html
2. In-house case study - Product overview and challenges, Airflow architecture
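About the development organization and the product
Development organization
- Team size: data engineers [count missing], data scientists and others [count missing]
Product
- An in-house digital advertising planning tool
- Visualizes and forecasts ad delivery performance data for the media and clients the company handles
- Media: Google, Yahoo, Twitter, Facebook, LINE
[Diagram: planners and sales staff query the ad delivery data, e.g. "July to September last year, GoogleDisplay, beverage industry", and get back imp 100000, clicks 5000, cv 700, cost …]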
Challenges in the product
- The data is not in an RDB: delivery report data sits in a column-oriented database used for analytics, master data sits in spreadsheets, and so on
- Many relations have to be traversed to attach the required information
→ Build a data mart for the product: create fact and dimension tables in an RDB via ETL
Airflow architecture
- apache-airflow 1.10.2
- Airflow is deployed on AWS Fargate
[Architecture diagram; components shown: Elastic Load Balancing, Airflow (web, scheduler, worker, flower) on AWS Fargate, Amazon ElastiCache (Redis), Amazon RDS, Amazon S3, DAGs]
docker-airflow https://github.com/puckel/docker-airflow
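The data flow we built
[Diagram labels: ad delivery data for each media; relation tables; client information; …; JOIN; column names, name reconciliation, and so on; Amazon Athena; INSERT; Backend service; tasks executed by Airflow]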
3. Where I stumbled with Airflow - Passing data between tasks - Verifying DAG behavior: building a local development environment with docker
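Passing data between tasks
Use XCom to pass data between tasks. XCom values are stored in the metadata database.
- How to use XCom
  - Push data to XCom
    - return it from the function
    - call kwargs['task_instance'].xcom_push(value=hoge, key='huga') in the function
    - use an Operator that returns a value, e.g. BigQueryGetDataOperator
  - Pull data from XCom
    - kwargs['task_instance'].xcom_pull()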
Passing data between tasks
Example: fetch table data from BigQuery and process it
[Code slide; annotations:]
- task1: fetch data from a BQ table (# pushed to XCom)
- Next task: run the transpose_data function
  1. Pull the XCom data pushed by task1
  2. Transpose the table data
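The example code on these slides did not survive in the transcript. To make the push/pull flow concrete, here is a plain-Python model of it, not Airflow code. `ToyXCom`, `get_table`, and the sample rows are hypothetical stand-ins. The only detail borrowed from Airflow is that a callable's return value is stored under the key `return_value` and that a pull with no match yields `None`.

```python
class ToyXCom:
    """In-memory stand-in for the XCom rows kept in the metadata database."""

    RETURN_KEY = "return_value"  # key used for a callable's return value

    def __init__(self):
        self._rows = {}

    def push(self, task_id, value, key=RETURN_KEY):
        # One value per (task, key) pair; a later push overwrites it.
        self._rows[(task_id, key)] = value

    def pull(self, task_id, key=RETURN_KEY):
        # Returns None when nothing was pushed for this task and key.
        return self._rows.get((task_id, key))


def get_table(xcom):
    """Stand-in for the task that fetches rows from a BigQuery table."""
    rows = [["imp", 100000], ["clicks", 5000]]  # made-up sample rows
    xcom.push("task1", rows)


def transpose_data(xcom):
    """Pull the rows pushed by task1 and transpose them."""
    rows = xcom.pull("task1")
    return [list(column) for column in zip(*rows)]


xcom = ToyXCom()
get_table(xcom)
print(transpose_data(xcom))  # [['imp', 'clicks'], [100000, 5000]]
```

Because XCom values live in the metadata database, they suit small values such as a handful of rows rather than large datasets.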
Passing data between tasks
Unless provide_context is set to True, kwargs['task_instance'] raises KeyError
- provide_context=False (default)
  kwargs: {}
- provide_context=True
  kwargs: { 'dag': <DAG: sample>, 'ds': '2019-09-10', 'next_ds': '2019-09-10', … 'task_instance': <TaskInstance: sample.task1_2 …> … }
Passing data between tasks
- Set provide_context to True in the PythonOperator arguments, OR
- Set it in default_args
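The KeyError follows directly from how keyword arguments work. The sketch below imitates the behavior in plain Python. `call_like_python_operator` is a hypothetical helper written for illustration, not an Airflow function, and the context dictionary is a cut-down stand-in for the real one.

```python
def transpose_data(**kwargs):
    # Fails with KeyError if the context was not passed in.
    return kwargs["task_instance"]


def call_like_python_operator(python_callable, context, provide_context=False):
    """Toy imitation: hand the context over as kwargs only when asked to."""
    kwargs = dict(context) if provide_context else {}
    return python_callable(**kwargs)


context = {"ds": "2019-09-10", "task_instance": "<TaskInstance placeholder>"}

# With provide_context=True the callable can reach the task instance.
print(call_like_python_operator(transpose_data, context, provide_context=True))

# With the default (False) kwargs is empty, so the lookup fails.
try:
    call_like_python_operator(transpose_data, context)
except KeyError as missing:
    print("KeyError:", missing)
```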
Verifying DAG behavior: building a local development environment
- How do we test the workflows (DAGs) we write?
Background:
- Checking a DAG in the cloud dev environment means the hassle of uploading to S3
- It is awkward when other people are developing at the same time
→ We want to verify DAG behavior locally!
→ Launch Airflow locally with docker
Verifying DAG behavior: building a local development environment
[Configuration slide; annotations:]
- Use LocalExecutor
- Mount the dags directory as a volume
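Verifying DAG behavior: building a local development environment
- The dags directory is mounted as a docker volume, so edits are reflected immediately
- Only the DAGs you wrote yourself appear in the Web UI
- By pulling the image from ECR, you can verify behavior in the same environment as production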
Summary
- We used Airflow in an in-house product that needs ETL.
- We deployed the production Airflow on ECS Fargate.
- Using docker for the local development environment made development easier.
We are hiring ! https://bit.ly/2UqWPGO
supplementary information
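Defining dependencies between tasks
- Use the bit-shift operators (>>, <<) to express task dependencies
- By default, a downstream task runs once all of its upstream tasks have succeeded [1]
- The trigger_rule argument, which every Operator has, changes that condition [1]
- Simple dependency: task1 >> task2
- Dependency with a group of tasks: task1 >> [task2_1, task2_2] >> task3
[1] https://airflow.apache.org/concepts.html#trigger-rules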
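The bit-shift syntax relies on ordinary Python operator overloading. Below is a minimal sketch of the mechanism with a hypothetical `Task` class. It is not Airflow's operator base class, and the names `task2_1` and `task2_2` are illustrative identifiers.

```python
class Task:
    """Toy task that records which tasks run after it."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def set_downstream(self, others):
        # Accept either a single task or a list of tasks.
        for other in others if isinstance(others, list) else [others]:
            self.downstream.append(other.task_id)

    def __rshift__(self, other):
        # Handles `task >> other` and `task >> [a, b]`.
        self.set_downstream(other)
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, others):
        # Handles `[a, b] >> task`: list has no __rshift__, so Python
        # falls back to the right operand's reflected method.
        for upstream in others:
            upstream.set_downstream(self)
        return self


task1, task2_1, task2_2, task3 = (
    Task(name) for name in ("task1", "task2_1", "task2_2", "task3")
)
task1 >> [task2_1, task2_2] >> task3

print(task1.downstream)    # ['task2_1', 'task2_2']
print(task2_1.downstream)  # ['task3']
```

The fan-out step returns the list itself, which is why the following `>> task3` is resolved by the reflected method on `task3`.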
Web UI: monitoring - Tree View - Gantt Chart
Web UI: Variable
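Parallelism settings
Configuring parallelism in airflow.cfg
- parallelism: maximum number of task instances that can run at the same time across the whole Airflow installation
- dag_concurrency: maximum number of task instances that can run at the same time within a single DAG
- max_active_runs_per_dag: maximum number of DAG runs that can be active at the same time for a single DAG
- worker_concurrency: maximum number of task processes a single Celery worker runs at the same time
https://analytics.livesense.co.jp/entry/…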
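To show where these four settings sit, here is a small sketch that parses a sample airflow.cfg fragment with the standard library's `configparser`. It assumes the Airflow 1.10 layout, with the first three keys under `[core]` and `worker_concurrency` under `[celery]`. The values and the helper `read_parallelism_settings` are illustrative only.

```python
import configparser

# Illustrative fragment; the numbers are sample values, not recommendations.
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16

[celery]
worker_concurrency = 16
"""


def read_parallelism_settings(cfg_text):
    """Return the four concurrency-related settings as integers."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "parallelism": parser.getint("core", "parallelism"),
        "dag_concurrency": parser.getint("core", "dag_concurrency"),
        "max_active_runs_per_dag": parser.getint("core", "max_active_runs_per_dag"),
        "worker_concurrency": parser.getint("celery", "worker_concurrency"),
    }


settings = read_parallelism_settings(SAMPLE_CFG)
print(settings)
```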