Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
[PyCon JP 2019] 新米Pythonistaが贈るAirflow入門&活用事例紹介
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Naoki Matsuda
September 17, 2019
Technology
7k
2
Share
[PyCon JP 2019] 新米Pythonistaが贈るAirflow入門&活用事例紹介
PyCon JP 2019の発表資料です。
Naoki Matsuda
September 17, 2019
More Decks by Naoki Matsuda
See All by Naoki Matsuda
Tech x Marketing #4 Airflowでもサブワークフロー単位で分割開発したい!
matsudan
0
220
Other Decks in Technology
See All in Technology
AIガバナンス実践 - 生成AIコネクタのデータ漏洩リスクと実務対策
knishioka
0
180
Spring Boot における AOT Cache 活用テクニックと 起動時間改善事例
ntt_dsol_java
0
210
先取りMaven4 ~16年ぶりのメジャーアップデート、その進化とは?~
ogiwarat
0
140
TypeScript Compiler APIとPHP-Parserを活用し、TypeScriptとPHPで型を共有する
shuta13
0
350
ポスター発表&デモと総括 / Poster Presentations & Demonstrations and Summary
ks91
PRO
0
190
Mastering Ruby Box
tagomoris
3
140
地元にいないローカルオーガナイザーの立ち回り
uvb_76
1
460
ポケモンの型をTypeScriptの型システムで表現してみた
subroh0508
0
270
AI Testing Talks: Challenges of Applying AI in Software Testing: From Hype to Practical Use
exactpro
PRO
1
110
AIを「創る」と「使う」の循環 — HRテックが実践するリアルなAI組織実装
taketo957
0
1.4k
サプライチェーンセキュリティの空白地帯 - 信頼できる”依存性”の未来を考える
rung
PRO
2
670
AI フレンドリーなエラー監視を TypeScript で実現する
shinyaigeek
2
250
Featured
See All Featured
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
320
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1.2k
Reality Check: Gamification 10 Years Later
codingconduct
0
2.2k
Bioeconomy Workshop: Dr. Julius Ecuru, Opportunities for a Bioeconomy in West Africa
akademiya2063
PRO
1
130
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.9k
The B2B funnel & how to create a winning content strategy
katarinadahlin
PRO
1
380
Leo the Paperboy
mayatellez
7
1.8k
The Spectacular Lies of Maps
axbom
PRO
1
790
Everyday Curiosity
cassininazir
0
220
Crafting Experiences
bethany
1
160
The Curse of the Amulet
leimatthew05
1
13k
Ten Tips & Tricks for a 🌱 transition
stuffmc
0
130
Transcript
৽ถPythonista͕ଃΔAirflowೖ & ׆༻ࣄྫհ PyCon JP 2019 2019.9.17 Naoki Matsuda
Agenda 0. ࣗݾհ 1. Airflowͷ֓ཁ 2. Airflowͷࣾࣄྫհ - ։ൃϓϩμΫτ֓ཁͱ՝, Airflowͷڥ
3. AirflowͰͭ·͍ͮͨ - λεΫؒͷσʔλͷΓͱΓ - DAGಈ࡞֬ೝ ~ dockerͰϩʔΧϧ։ൃڥߏங
ࣗݾհ দా थ (·ͭͩ ͳ͓͖) - ॴଐɿגࣜձࣾ ి௨σδλϧ - ۀɿόοΫΤϯυαʔϏεɺETLपΓͷ։ൃ
- 2018ೖࣾ
1. Airflowͷ֓ཁ
Apache Airflow֓ཁ όονॲཧ͔ΒͳΔϫʔΫϑϩʔͷεέδϡʔϦϯάˍϞχλ Ϧϯά͕ՄೳͳϓϥοτϑΥʔϜ - Airbnbࣾ - Φʔϓϯιʔε (Apache software
foundationͷincubation project)1,2 - PythonͰ࣮͞Ε͍ͯΔ3 - ։ൃίϛϡχςΟ͕׆ൃ3 IUUQTJODVCBUPSBQBDIFPSHQSPKFDUTBJSGMPXIUNM IUUQTBJSGMPXBQBDIFPSHMJDFOTFIUNM IUUQTHJUIVCDPNBQBDIFBJSGMPX
։ൃίϛϡχςΟͷ׆ൃ͞(2019.9.9࣌)
Apache AirflowͰͰ͖Δ͜ͱ - PythonίʔυͰϫʔΫϑϩʔ(DAG)Λఆٛ - ґଘؔʹج͍ͮͨλεΫͷ࣮ߦ - ϫʔΫϑϩʔͷεέδϡʔϦϯά - ϦονͳWeb
UI - DAG࣮ߦεςʔλεͷϞχλϦϯά - λεΫͷϩά֬ೝ - ґଘؔͷՄࢹԽ ͳͲ
PythonίʔυͰϫʔΫϑϩʔఆٛ(DAGͷ࡞) λεΫؒґଘؔͷఆٛ λεΫ1 λεΫ2 DAGͷڞ௨ઃఆ ࣮ߦස, ࣮ߦظؒ, λΠϜΞτ࣌ؒͳͲ
ϫʔΫϑϩʔΛߏ͢ΔλεΫͷ࡞ ≈ - ϫʔΫϑϩʔOperatorͱݺΕΔλεΫʹΑΓߏ͞ΕΔ1 - 1ͭͷOperatorͰ1ͭͷλεΫΛهड़ OperatorͷҾɻ ֤Operator͕ԿͷҾΛ ͱΔ͔υΩϡϝϯτࢀর2 IUUQTBJSGMPXBQBDIFPSHDPODFQUTIUNMPQFSBUPST
IUUQTBJSGMPXBQBDIFPSH@BQJBJSGMPXPQFSBUPSTJOEFYIUNM
ϫʔΫϑϩʔΛߏ͢ΔλεΫͷ࡞ - BashίϚϯυ࣮ߦ: BashOperator - Python࣮ؔߦ: PythonOperator - SQL࣮ߦ: MySqlOperator,
PostgresOperator, … - HTTPϦΫΤετૹ৴: SimpleHttpOperator - ͦͷଞΫϥυܥͳͲ: BigQueryOperator, AWSAthenaOperator, … - ಛఆ݅Ληϯγϯά: Sensor IUUQTBJSGMPXBQBDIFPSHDPODFQUTIUNMPQFSBUPST IUUQTBJSGMPXBQBDIFPSH@BQJBJSGMPXPQFSBUPSTJOEFYIUNM
2. ࣾࣄྫհ - ։ൃϓϩμΫτ֓ཁͱ՝, Airflowͷߏ
։ൃ৫ͱϓϩμΫτʹ͍ͭͯ ɾɾɾ ࠂ৴ σʔλ ϓϥϯφʔ Ӧۀ - ڈ7~9݄, - GoogleDispla
y - ҿྉۀք imp 100000 clicks 5000 cv 700 cost ɾɾɾ ϝσΟΞ ։ൃ৫ νʔϜنɿσʔλΤϯδχΞ ໊ σʔλαΠΤϯςΟετͳͲ໊ d ϓϩμΫτ ͚ࣾσδλϧࠂϓϥϯχϯάπʔϧ ࣾͰѻ͏ϝσΟΞɾΫϥΠΞϯτͷࠂ৴࣮σʔλΛՄࢹԽˍ༧ଌ ϓϩμΫτ (PPHMF :BIPP 5XJUUFS 'BDFCPPL -*/&
։ൃϓϩμΫτʹ͓͚Δ՝ - σʔλ͕RDBʹೖͬͯͳ͍ ৴Ϩϙʔτσʔλ͕ੳ༻ͷྻࢤσʔλϕʔεʹ ͋ͬͨΓɺϚελσʔλ͕εϓϨουγʔτʹ͋ͬͨΓ… - ඞཁͳใΛՃ͢ΔͨΊʹଟ͘ͷϦϨʔγϣϯΛͨͲΔ
։ൃϓϩμΫτʹ͓͚Δ՝ - σʔλ͕RDBʹೖͬͯͳ͍ ৴Ϩϙʔτσʔλ͕ੳ༻ͷྻࢤσʔλϕʔεʹ ͋ͬͨΓɺϚελσʔλ͕εϓϨουγʔτʹ͋ͬͨΓ… - ඞཁͳใΛՃ͢ΔͨΊʹଟ͘ͷϦϨʔγϣϯΛͨͲΔ → ։ൃϓϩμΫτ༻ʹσʔλϚʔτ࡞ RDBʹϑΝΫτ,
σΟϝϯγϣϯςʔϒϧΛETLͰ࡞
Airflowߏ apache-airflow 1.10.2 web worker scheduler Amazon S3 Amazon RDS
Airflow AWS Fargate Amazon ElastiCache Redis Elastic Load Balancing flower DAGs - AWS FargateʹAirflowΛσϓϩΠ ≈ ≈
docker-airflow https://github.com/puckel/docker-airflow
ߏஙͨ͠σʔλϑϩʔ ֤ϝσΟΞࠂ৴σʔλ ϦϨʔγϣϯςʔϒϧܥ JOIN ΫϥΠΞϯτใܥ ʜ ΧϥϜ໊ دͤͳͲ ≈ Amazon
Athena Backend service INSERT "JSGMPX͕࣮ߦ͢ΔλεΫ INSERT
3. AirflowͰͭ·͍ͮͨ - λεΫؒͷσʔλͷΓͱΓ - DAGಈ࡞֬ೝ ~dockerͰϩʔΧϧ։ൃڥߏங
λεΫؒͷσʔλͷΓͱΓ λεΫؒͷσʔλΓͱΓXComΛ͏ - XComͷ͍ํ - XComσʔλΛpush - ؔͰreturn - ؔͰkwargs['task_instance’].
xcom_push(value=hoge, key=‘huga’) - Λฦ͢Operator ྫ: BigqueryGetDataOperator - XCom͔ΒσʔλΛpull - kwargs['task_instance’].xcom_pull() metadata database
λεΫؒͷσʔλͷΓͱΓ BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ BQςʔϒϧ͔Βσʔλऔಘ # XComʹpush͞ΕΔ BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ BQςʔϒϧ͔Βσʔλऔಘ # XComʹpush͞ΕΔ transpose_dataؔΛ࣮ߦ BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ 1. task1Ͱpush͞ΕͨXcomͷ σʔλΛpullͯ͠ 2. ςʔϒϧͷσʔλΛసஔ BQςʔϒϧ͔Βσʔλऔಘ # XComʹpush͞ΕΔ transpose_dataؔΛ࣮ߦ
BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ provide_contextΛTrueʹ͠ͳ͍ͱkwargs[‘task_intance’]ͰKeyError - provide_context=False (default) kwargs : {} - provide_context=True
kwargs: { 'dag': <DAG: sample>, 'ds': '2019-09-10’, 'next_ds': '2019-09-10’, … 'task_instance’: <TaskInstance: sample.task1_2 …> … }
λεΫؒͷσʔλͷΓͱΓ - PythonOperatorͷҾͰTrue OR - default_argsͰઃఆ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங - ࡞ͨ͠ϫʔΫϑϩʔ(DAG)ͷςετͲ͏Δʁ എܠɿ - Ϋϥυ্devڥͰͷDAGಈ࡞֬ೝͰS3upload͢Δखؒ - ଞͷਓ͕ಉ͡λΠϛϯάͰ։ൃ͍ͯ͠ΔͱΓͮΒ͍… →
ϩʔΧϧͰDAGͷಈ࡞֬ೝ͍ͨ͠ʂ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங → dockerͰAirflowΛϩʔΧϧʹ্ཱͪ͛Δ - ࡞ͨ͠ϫʔΫϑϩʔ(DAG)ͷςετͲ͏Δʁ എܠɿ - Ϋϥυ্devڥͰͷDAGಈ࡞֬ೝͰS3upload͢Δखؒ -
ଞͷਓ͕ಉ͡λΠϛϯάͰ։ൃ͍ͯ͠ΔͱΓͮΒ͍… → ϩʔΧϧͰDAGͷಈ࡞֬ೝ͍ͨ͠ʂ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங LocalExecutorΛ༻
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங LocalExecutorΛ༻ dagsσΟϨΫτϦΛvolumeͱ͠ ͯϚϯτ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங - dockerͷvolumeͱͯ͠dagsσΟϨΫτϦΛϚϯ τ͍ͯ͠ΔͷͰॻ͖͑ͨΒ͙͢ʹө - Web UIʹ͕ࣗ࡞ͨ͠DAGͷΈ͕දࣔ͞ΕΔ - ECR͔ΒimageΛऔͬͯ͘ΔΑ͏ʹͯ͠ຊ൪ͱಉ͡
ڥͰಈ࡞֬ೝͰ͖Δ
·ͱΊ - ETL͕ඞཁͳࣾ։ൃϓϩμΫτʹ͓͍ͯAirflowΛ ͍·ͨ͠ɻ - ຊ൪ڥͷAirflowECS FargateʹσϓϩΠ͠·ͨ͠ɻ - ϩʔΧϧ։ൃڥʹdockerΛ༻ͯ͠։ൃָ͕ʹͳΓ ·ͨ͠ɻ
We are hiring ! https://bit.ly/2UqWPGO
supplementary information
λεΫؒґଘؔͷఆٛ - Ϗοτγϑτԋࢉࢠ(>>, <<)Λ͍λεΫͷґଘؔΛද͢ - ޙଓλεΫͷ࣮ߦ݅શͯͷઌߦλεΫޭ͕σϑΥϧτઃఆ1 - શOperator͕࣋ͭtrigger_ruleҾͰ࣮ߦ݅ΛมߋՄೳ1 IUUQTBJSGMPXBQBDIFPSHDPODFQUTIUNMUSJHHFSSVMFT
- γϯϓϧͳґଘؔ task1 >> task2 - λεΫάϧʔϓ͕͋Δґଘؔ task1 >> [task2-1,task2-2] >> task3
Web UI: ϞχλϦϯά - Tree View - Gantt Chart
Web UI: Variable
ฒྻઃఆ Configuring parallelism in airflow.cfg - parallelism : ࢄॲཧΫϥελશମͰ࣮ߦՄೳͳϓϩηε -
dag_concurrency : ҰͭͷϫʔΧͰಉ࣮࣌ߦՄೳͳ࠷େϓϩηε - max_active_runs_per_dag : DAG෦Ͱಉ࣮࣌ߦՄೳͳ࠷େλε Ϋ - worker_concurrency : ҰͭͷCeleryϫʔΧͰಉ࣮࣌ߦՄೳͳ࠷େ ϓϩηε IUUQTBOBMZUJDTMJWFTFOTFDPKQFOUSZ