[PyCon JP 2019] An Introduction to Airflow and a Case Study, from a Novice Pythonista (新米Pythonistaが贈るAirflow入門&活用事例紹介)
Naoki Matsuda
September 17, 2019
Slides from a talk given at PyCon JP 2019.
Transcript
An Introduction to Airflow and a Case Study, from a Novice Pythonista. PyCon JP 2019, 2019.9.17, Naoki Matsuda
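Agenda
0. About me
1. Overview of Airflow
2. An in-house Airflow case study
   - Product overview and challenges, our Airflow environment
3. Where I stumbled with Airflow
   - Passing data between tasks
   - Verifying DAG behavior: building a local development environment with docker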
About me
Naoki Matsuda (松田 直樹)
- Affiliation: Dentsu Digital Inc.
- Work: backend services and ETL development
- Joined the company in 2018
1. Overview of Airflow
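Overview of Apache Airflow
A platform for scheduling and monitoring workflows made up of batch jobs.
- Originated at Airbnb
- Open source (an Apache Software Foundation project that went through the Apache Incubator) [1, 2]
- Implemented in Python [3]
- Active developer community [3]
[1] https://incubator.apache.org/projects/airflow.html
[2] https://airflow.apache.org/license.html
[3] https://github.com/apache/airflow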
Developer community activity (as of 2019.9.9)
What you can do with Apache Airflow
- Define workflows (DAGs) in Python code
- Run tasks according to their dependencies
- Schedule workflows
- Rich Web UI
  - Monitor DAG run status
  - Check task logs
  - Visualize dependencies, and more
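Defining a workflow in Python code (creating a DAG)
The slide annotates a DAG file with four parts:
- Common DAG settings: run frequency, run period, timeout, and so on
- Task 1
- Task 2
- The dependency between the tasks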
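The four annotated parts can be sketched roughly as follows. This is a minimal sketch, not the speaker's code: it assumes the Airflow 1.10 import paths (matching the apache-airflow 1.10.2 named later in the deck), and the DAG id, task ids, commands, and setting values are all hypothetical. Airflow is imported inside the function only so that the settings can be inspected on a machine without Airflow installed.

```python
from datetime import datetime, timedelta

# Common DAG settings shared by every task (illustrative values).
default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 9, 1),           # start of the run period
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),   # per-task timeout
}


def create_dag(dag_id="sample"):
    """Build a two-task DAG. Requires Airflow 1.10.x to be installed."""
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Run frequency is set on the DAG itself.
    dag = DAG(dag_id, default_args=default_args, schedule_interval="@daily")

    # Task 1 and task 2 (hypothetical commands).
    task1 = BashOperator(task_id="task1", bash_command="echo task1", dag=dag)
    task2 = BashOperator(task_id="task2", bash_command="echo task2", dag=dag)

    # Dependency between the tasks: task2 runs after task1.
    task1 >> task2
    return dag


# In a real DAG file, Airflow discovers the DAG through a module-level variable:
# dag = create_dag()
```

Keeping the shared settings in `default_args` means each operator only has to state what is specific to its own task.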
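Creating the tasks that make up a workflow
- A workflow is built from tasks called Operators [1]
- One Operator describes one task
- Each Operator takes its own arguments; see the documentation for what each one accepts [2]
[1] https://airflow.apache.org/concepts.html#operators
[2] https://airflow.apache.org/_api/airflow/operators/index.html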
Creating the tasks that make up a workflow
- Run a Bash command: BashOperator
- Run a Python function: PythonOperator
- Run SQL: MySqlOperator, PostgresOperator, …
- Send an HTTP request: SimpleHttpOperator
- Cloud services and others: BigQueryOperator, AWSAthenaOperator, …
- Wait for a specific condition: Sensor
https://airflow.apache.org/concepts.html#operators
https://airflow.apache.org/_api/airflow/operators/index.html
2. In-house case study: product overview and challenges, Airflow architecture
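About the development team and the product
Development team
- Data engineers, plus data scientists and others
Product
- An in-house planning tool for digital advertising
- Visualizes and forecasts ad delivery performance data for the media and clients the company handles
- Media: Google, Yahoo, Twitter, Facebook, LINE, ...
Diagram labels: ad delivery data; planner; sales; example conditions (July to September of last year, GoogleDisplay, beverage industry); example metrics (imp 100000, clicks 5000, cv 700, cost, ...); media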
Challenges in the product
- The data is not in an RDB: delivery report data sits in a column-oriented database for analytics, master data sits in spreadsheets, and so on
- Adding the information we need means following many relations
→ Build a data mart for the product: create fact and dimension tables in an RDB with ETL
Airflow architecture
- apache-airflow 1.10.2
- Airflow is deployed on AWS Fargate
- Diagram components: web, worker, scheduler, flower, DAGs, Elastic Load Balancing, Amazon RDS, Amazon ElastiCache (Redis), Amazon S3
docker-airflow https://github.com/puckel/docker-airflow
The data flow we built
Diagram labels: ad delivery data from each media; relation tables; client information; JOIN; column names, name matching, and so on; Amazon Athena; Backend service; INSERT; tasks run by Airflow
3. Where I stumbled with Airflow
- Passing data between tasks
- Verifying DAG behavior: building a local development environment with docker
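Passing data between tasks
Use XCom to pass data between tasks; XCom values are stored in the metadata database.
- How to use XCom
  - Push data to XCom
    - return a value from the function
    - call kwargs['task_instance'].xcom_push(value=hoge, key='huga') inside the function
    - use an Operator that returns a value, e.g. BigQueryGetDataOperator
  - Pull data from XCom
    - kwargs['task_instance'].xcom_pull()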
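The push and pull calls above can be tried outside Airflow with a small stand-in. In this sketch `InMemoryTaskInstance` is a hypothetical class written for illustration, not part of Airflow; it imitates only the `xcom_push(key, value)` and `xcom_pull(task_ids, key)` calls, with a dict playing the role of the metadata database. The task ids are also made up. The one Airflow behavior it relies on is that a PythonOperator callable's return value is pushed under the key `"return_value"`.

```python
class InMemoryTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance (illustration only)."""

    _store = {}  # shared by all instances, playing the role of the metadata database

    def __init__(self, task_id):
        self.task_id = task_id

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids=None, key="return_value"):
        return self._store.get((task_ids, key))


def push_data(**kwargs):
    # Explicit push under a custom key.
    kwargs["task_instance"].xcom_push(key="huga", value="hoge")
    # Returning a value is the other way to push: Airflow stores it
    # under the key "return_value".
    return [1, 2, 3]


def pull_data(**kwargs):
    ti = kwargs["task_instance"]
    explicit = ti.xcom_pull(task_ids="push_task", key="huga")
    returned = ti.xcom_pull(task_ids="push_task")  # default key: "return_value"
    return explicit, returned


# Simulate what the operator does: run the callable, then push its return value.
push_ti = InMemoryTaskInstance("push_task")
result = push_data(task_instance=push_ti)
push_ti.xcom_push(key="return_value", value=result)

pulled = pull_data(task_instance=InMemoryTaskInstance("pull_task"))
```

In a real DAG the two functions would be passed as `python_callable` to PythonOperator tasks, and Airflow would supply `task_instance` itself.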
Passing data between tasks
Example: fetch table data from BigQuery and process it
- task1: fetch data from a BQ table (the result is pushed to XCom)
- Next task: run the transpose_data function
  1. Pull the XCom data pushed by task1
  2. Transpose the table data
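The slide's code did not survive in the transcript, so the following is only a guess at the shape of `transpose_data`. It assumes task1 pushed the table as a list of rows, each row itself a list, and that the upstream task id is `task1`; the sample rows and the `SimpleNamespace` stand-in for the task instance are hypothetical and exist only so the function can be run locally.

```python
from types import SimpleNamespace


def transpose_data(**kwargs):
    """Pull the rows pushed by task1 and return them transposed (columns become rows)."""
    rows = kwargs["task_instance"].xcom_pull(task_ids="task1")
    return [list(column) for column in zip(*rows)]


# Local check with a hypothetical stand-in for the task instance.
fake_ti = SimpleNamespace(
    xcom_pull=lambda task_ids: [["2019-09-01", 100], ["2019-09-02", 120]]
)
transposed = transpose_data(task_instance=fake_ti)
```

Because the function returns its result, Airflow would push the transposed table to XCom again, making it available to any later task.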
Passing data between tasks
Unless provide_context is set to True, kwargs['task_instance'] raises a KeyError.
- provide_context=False (default)
  kwargs: {}
- provide_context=True
  kwargs: { 'dag': <DAG: sample>, 'ds': '2019-09-10', 'next_ds': '2019-09-10', … 'task_instance': <TaskInstance: sample.task1_2 …> … }
Passing data between tasks
Set provide_context to True either
- in the PythonOperator arguments, or
- in default_args
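The failure and the two fixes can be reproduced with plain Python. This is a sketch under assumptions: `report_task` and the task id `report` are hypothetical, and the two dicts only show where the flag would go in Airflow 1.10 (as keyword arguments to PythonOperator, or in the DAG's `default_args`); nothing here imports Airflow.

```python
def report_task(**kwargs):
    """Return the id of the running task; needs the Airflow context in kwargs."""
    return kwargs["task_instance"].task_id


# provide_context=False (the default): the callable receives no context,
# so kwargs is {} and the lookup fails.
try:
    report_task()
    outcome = "ok"
except KeyError as err:
    outcome = "KeyError: {}".format(err)

# Option 1: pass the flag to the operator,
# e.g. PythonOperator(dag=dag, **operator_kwargs).
operator_kwargs = {
    "task_id": "report",
    "python_callable": report_task,
    "provide_context": True,
}

# Option 2: set it once for every operator created with these default_args.
default_args = {"provide_context": True}
```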
Verifying DAG behavior: building a local development environment
- How do we test the workflows (DAGs) we write?
Background:
- Checking a DAG on the cloud dev environment means uploading it to S3 every time
- It gets awkward when someone else is developing at the same time
→ We want to check DAGs locally!
→ Start Airflow locally with docker
Verifying DAG behavior: building a local development environment
The configuration shown on the slide:
- Uses LocalExecutor
- Mounts the dags directory as a volume
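Verifying DAG behavior: building a local development environment
- The dags directory is mounted as a docker volume, so edits are reflected immediately
- The Web UI shows only the DAGs you created yourself
- Pulling the image from ECR lets you test in the same environment as production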
Summary
- We used Airflow in an in-house product that needed ETL.
- The production Airflow is deployed on ECS Fargate.
- Using docker for the local development environment made development easier.
We are hiring ! https://bit.ly/2UqWPGO
supplementary information
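Defining dependencies between tasks
- Use the bitshift operators (>>, <<) to express task dependencies
- By default, a downstream task runs only when all of its upstream tasks have succeeded [1]
- The trigger_rule argument, available on every Operator, changes that condition [1]
- Simple dependency: task1 >> task2
- Dependency with a group of tasks: task1 >> [task2_1, task2_2] >> task3
[1] https://airflow.apache.org/concepts.html#trigger-rules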
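To see why an expression like `task1 >> [task2_1, task2_2] >> task3` works as plain Python, here is a toy model of the operator overloading involved. `Task` is a hypothetical class for illustration, not Airflow's BaseOperator, and it records only downstream task ids; Airflow's operators similarly map `>>` to `set_downstream` and `<<` to `set_upstream`.

```python
class Task:
    """Toy task that records its downstream task ids (not Airflow's BaseOperator)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream_ids = []

    def set_downstream(self, others):
        for other in (others if isinstance(others, list) else [others]):
            self.downstream_ids.append(other.task_id)

    def __rshift__(self, other):
        # task >> other: other runs after task. Returning other allows chaining.
        self.set_downstream(other)
        return other

    def __rrshift__(self, others):
        # [a, b] >> task: Python falls back to this method because
        # list does not define >> itself.
        for upstream in others:
            upstream.set_downstream(self)
        return self

    def __lshift__(self, other):
        # task << other: other runs before task.
        for upstream in (other if isinstance(other, list) else [other]):
            upstream.set_downstream(self)
        return other


task1, task2_1, task2_2, task3 = (
    Task(name) for name in ["task1", "task2_1", "task2_2", "task3"]
)

# Fan out from task1 to two tasks, then fan back in to task3.
task1 >> [task2_1, task2_2] >> task3
```

The chain is evaluated left to right: `task1 >> [...]` returns the list, and the list is then shifted into `task3`, which is why both middle tasks end up upstream of `task3`.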
Web UI: monitoring
- Tree View
- Gantt Chart
Web UI: Variable
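Parallelism settings
Configuring parallelism in airflow.cfg
- parallelism: maximum number of task instances that can run at once across the whole Airflow installation
- dag_concurrency: maximum number of task instances that can run at once within a single DAG
- max_active_runs_per_dag: maximum number of active DAG runs per DAG
- worker_concurrency: maximum number of task instances a single Celery worker runs at once
https://analytics.livesense.co.jp/entry…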