Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
[PyCon JP 2019] 新米Pythonistaが贈るAirflow入門&活用事例紹介
Search
Naoki Matsuda
September 17, 2019
Technology
7k
2
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
[PyCon JP 2019] 新米Pythonistaが贈るAirflow入門&活用事例紹介
PyCon JP 2019の発表資料です。
Naoki Matsuda
September 17, 2019
More Decks by Naoki Matsuda
See All by Naoki Matsuda
Tech x Marketing #4 Airflowでもサブワークフロー単位で分割開発したい!
matsudan
0
220
Other Decks in Technology
See All in Technology
Chainlitで作るお手軽チャットUI
ynt0485
0
230
SONiCの統計情報を取得したい
sonic
0
120
AIの性能が向上しても未解決な組織の重大問題は何か?/An Unsolved Organizational Problem in the Age of AI
moriyuya
4
660
Bedrock AgentCore RuntimeでAuth0 Changelog調査AIをアップグレードした話
t5u8a5a
1
120
RAG を使わないという選択肢
tatsutaka
1
220
MUSUBI 田中裕一『AIと共に行う「しごとのリデザイン」- スモールバックオフィス編』AI Ops Lab #4
musubi
0
110
2026 TECHFRESH 畢業分享會 - AI-Native 重塑軟體工程與虛擬講師
line_developers_tw
PRO
0
960
LayerXにおけるセキュリティ管理の現在地と次の一手
tosho
0
130
10倍の生産性を実現するAI駆動並列エージェントのすべて
kumaiu
5
1.4k
Android の公式 Skill / Android skills
yanzm
0
140
AIのReact習熟度を測る
uhyo
2
400
チームで進めるAI駆動アジャイル×ウォーターフォール
kumaiu
0
160
Featured
See All Featured
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.5k
How To Speak Unicorn (iThemes Webinar)
marktimemedia
1
480
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.3k
svc-hook: hooking system calls on ARM64 by binary rewriting
retrage
2
300
Jess Joyce - The Pitfalls of Following Frameworks
techseoconnect
PRO
1
170
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
65
56k
The Cult of Friendly URLs
andyhume
79
6.9k
16th Malabo Montpellier Forum Presentation
akademiya2063
PRO
0
140
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.9k
A Modern Web Designer's Workflow
chriscoyier
698
190k
Highjacked: Video Game Concept Design
rkendrick25
PRO
1
390
Transcript
৽ถPythonista͕ଃΔAirflowೖ & ׆༻ࣄྫհ PyCon JP 2019 2019.9.17 Naoki Matsuda
Agenda 0. ࣗݾհ 1. Airflowͷ֓ཁ 2. Airflowͷࣾࣄྫհ - ։ൃϓϩμΫτ֓ཁͱ՝, Airflowͷڥ
3. AirflowͰͭ·͍ͮͨ - λεΫؒͷσʔλͷΓͱΓ - DAGಈ࡞֬ೝ ~ dockerͰϩʔΧϧ։ൃڥߏங
ࣗݾհ দా थ (·ͭͩ ͳ͓͖) - ॴଐɿגࣜձࣾ ి௨σδλϧ - ۀɿόοΫΤϯυαʔϏεɺETLपΓͷ։ൃ
- 2018ೖࣾ
1. Airflowͷ֓ཁ
Apache Airflow֓ཁ όονॲཧ͔ΒͳΔϫʔΫϑϩʔͷεέδϡʔϦϯάˍϞχλ Ϧϯά͕ՄೳͳϓϥοτϑΥʔϜ - Airbnbࣾ - Φʔϓϯιʔε (Apache software
foundationͷincubation project)1,2 - PythonͰ࣮͞Ε͍ͯΔ3 - ։ൃίϛϡχςΟ͕׆ൃ3 IUUQTJODVCBUPSBQBDIFPSHQSPKFDUTBJSGMPXIUNM IUUQTBJSGMPXBQBDIFPSHMJDFOTFIUNM IUUQTHJUIVCDPNBQBDIFBJSGMPX
։ൃίϛϡχςΟͷ׆ൃ͞(2019.9.9࣌)
Apache AirflowͰͰ͖Δ͜ͱ - PythonίʔυͰϫʔΫϑϩʔ(DAG)Λఆٛ - ґଘؔʹج͍ͮͨλεΫͷ࣮ߦ - ϫʔΫϑϩʔͷεέδϡʔϦϯά - ϦονͳWeb
UI - DAG࣮ߦεςʔλεͷϞχλϦϯά - λεΫͷϩά֬ೝ - ґଘؔͷՄࢹԽ ͳͲ
PythonίʔυͰϫʔΫϑϩʔఆٛ(DAGͷ࡞) λεΫؒґଘؔͷఆٛ λεΫ1 λεΫ2 DAGͷڞ௨ઃఆ ࣮ߦස, ࣮ߦظؒ, λΠϜΞτ࣌ؒͳͲ
ϫʔΫϑϩʔΛߏ͢ΔλεΫͷ࡞ ≈ - ϫʔΫϑϩʔOperatorͱݺΕΔλεΫʹΑΓߏ͞ΕΔ1 - 1ͭͷOperatorͰ1ͭͷλεΫΛهड़ OperatorͷҾɻ ֤Operator͕ԿͷҾΛ ͱΔ͔υΩϡϝϯτࢀর2 IUUQTBJSGMPXBQBDIFPSHDPODFQUTIUNMPQFSBUPST
IUUQTBJSGMPXBQBDIFPSH@BQJBJSGMPXPQFSBUPSTJOEFYIUNM
ϫʔΫϑϩʔΛߏ͢ΔλεΫͷ࡞ - BashίϚϯυ࣮ߦ: BashOperator - Python࣮ؔߦ: PythonOperator - SQL࣮ߦ: MySqlOperator,
PostgresOperator, … - HTTPϦΫΤετૹ৴: SimpleHttpOperator - ͦͷଞΫϥυܥͳͲ: BigQueryOperator, AWSAthenaOperator, … - ಛఆ݅Ληϯγϯά: Sensor IUUQTBJSGMPXBQBDIFPSHDPODFQUTIUNMPQFSBUPST IUUQTBJSGMPXBQBDIFPSH@BQJBJSGMPXPQFSBUPSTJOEFYIUNM
2. ࣾࣄྫհ - ։ൃϓϩμΫτ֓ཁͱ՝, Airflowͷߏ
։ൃ৫ͱϓϩμΫτʹ͍ͭͯ ɾɾɾ ࠂ৴ σʔλ ϓϥϯφʔ Ӧۀ - ڈ7~9݄, - GoogleDispla
y - ҿྉۀք imp 100000 clicks 5000 cv 700 cost ɾɾɾ ϝσΟΞ ։ൃ৫ νʔϜنɿσʔλΤϯδχΞ ໊ σʔλαΠΤϯςΟετͳͲ໊ d ϓϩμΫτ ͚ࣾσδλϧࠂϓϥϯχϯάπʔϧ ࣾͰѻ͏ϝσΟΞɾΫϥΠΞϯτͷࠂ৴࣮σʔλΛՄࢹԽˍ༧ଌ ϓϩμΫτ (PPHMF :BIPP 5XJUUFS 'BDFCPPL -*/&
։ൃϓϩμΫτʹ͓͚Δ՝ - σʔλ͕RDBʹೖͬͯͳ͍ ৴Ϩϙʔτσʔλ͕ੳ༻ͷྻࢤσʔλϕʔεʹ ͋ͬͨΓɺϚελσʔλ͕εϓϨουγʔτʹ͋ͬͨΓ… - ඞཁͳใΛՃ͢ΔͨΊʹଟ͘ͷϦϨʔγϣϯΛͨͲΔ
։ൃϓϩμΫτʹ͓͚Δ՝ - σʔλ͕RDBʹೖͬͯͳ͍ ৴Ϩϙʔτσʔλ͕ੳ༻ͷྻࢤσʔλϕʔεʹ ͋ͬͨΓɺϚελσʔλ͕εϓϨουγʔτʹ͋ͬͨΓ… - ඞཁͳใΛՃ͢ΔͨΊʹଟ͘ͷϦϨʔγϣϯΛͨͲΔ → ։ൃϓϩμΫτ༻ʹσʔλϚʔτ࡞ RDBʹϑΝΫτ,
σΟϝϯγϣϯςʔϒϧΛETLͰ࡞
Airflowߏ apache-airflow 1.10.2 web worker scheduler Amazon S3 Amazon RDS
Airflow AWS Fargate Amazon ElastiCache Redis Elastic Load Balancing flower DAGs - AWS FargateʹAirflowΛσϓϩΠ ≈ ≈
docker-airflow https://github.com/puckel/docker-airflow
ߏஙͨ͠σʔλϑϩʔ ֤ϝσΟΞࠂ৴σʔλ ϦϨʔγϣϯςʔϒϧܥ JOIN ΫϥΠΞϯτใܥ ʜ ΧϥϜ໊ دͤͳͲ ≈ Amazon
Athena Backend service INSERT "JSGMPX͕࣮ߦ͢ΔλεΫ INSERT
3. AirflowͰͭ·͍ͮͨ - λεΫؒͷσʔλͷΓͱΓ - DAGಈ࡞֬ೝ ~dockerͰϩʔΧϧ։ൃڥߏங
λεΫؒͷσʔλͷΓͱΓ λεΫؒͷσʔλΓͱΓXComΛ͏ - XComͷ͍ํ - XComσʔλΛpush - ؔͰreturn - ؔͰkwargs['task_instance’].
xcom_push(value=hoge, key=‘huga’) - Λฦ͢Operator ྫ: BigqueryGetDataOperator - XCom͔ΒσʔλΛpull - kwargs['task_instance’].xcom_pull() metadata database
λεΫؒͷσʔλͷΓͱΓ BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ BQςʔϒϧ͔Βσʔλऔಘ # XComʹpush͞ΕΔ BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ BQςʔϒϧ͔Βσʔλऔಘ # XComʹpush͞ΕΔ transpose_dataؔΛ࣮ߦ BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ 1. task1Ͱpush͞ΕͨXcomͷ σʔλΛpullͯ͠ 2. ςʔϒϧͷσʔλΛసஔ BQςʔϒϧ͔Βσʔλऔಘ # XComʹpush͞ΕΔ transpose_dataؔΛ࣮ߦ
BigQuery͔ΒςʔϒϧσʔλΛऔಘͯͦ͠ͷσʔλΛՃ͢Δྫ
λεΫؒͷσʔλͷΓͱΓ provide_contextΛTrueʹ͠ͳ͍ͱkwargs[‘task_intance’]ͰKeyError - provide_context=False (default) kwargs : {} - provide_context=True
kwargs: { 'dag': <DAG: sample>, 'ds': '2019-09-10’, 'next_ds': '2019-09-10’, … 'task_instance’: <TaskInstance: sample.task1_2 …> … }
λεΫؒͷσʔλͷΓͱΓ - PythonOperatorͷҾͰTrue OR - default_argsͰઃఆ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங - ࡞ͨ͠ϫʔΫϑϩʔ(DAG)ͷςετͲ͏Δʁ എܠɿ - Ϋϥυ্devڥͰͷDAGಈ࡞֬ೝͰS3upload͢Δखؒ - ଞͷਓ͕ಉ͡λΠϛϯάͰ։ൃ͍ͯ͠ΔͱΓͮΒ͍… →
ϩʔΧϧͰDAGͷಈ࡞֬ೝ͍ͨ͠ʂ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங → dockerͰAirflowΛϩʔΧϧʹ্ཱͪ͛Δ - ࡞ͨ͠ϫʔΫϑϩʔ(DAG)ͷςετͲ͏Δʁ എܠɿ - Ϋϥυ্devڥͰͷDAGಈ࡞֬ೝͰS3upload͢Δखؒ -
ଞͷਓ͕ಉ͡λΠϛϯάͰ։ൃ͍ͯ͠ΔͱΓͮΒ͍… → ϩʔΧϧͰDAGͷಈ࡞֬ೝ͍ͨ͠ʂ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங LocalExecutorΛ༻
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங LocalExecutorΛ༻ dagsσΟϨΫτϦΛvolumeͱ͠ ͯϚϯτ
DAGಈ࡞֬ೝ ~ϩʔΧϧ։ൃڥߏங - dockerͷvolumeͱͯ͠dagsσΟϨΫτϦΛϚϯ τ͍ͯ͠ΔͷͰॻ͖͑ͨΒ͙͢ʹө - Web UIʹ͕ࣗ࡞ͨ͠DAGͷΈ͕දࣔ͞ΕΔ - ECR͔ΒimageΛऔͬͯ͘ΔΑ͏ʹͯ͠ຊ൪ͱಉ͡
ڥͰಈ࡞֬ೝͰ͖Δ
·ͱΊ - ETL͕ඞཁͳࣾ։ൃϓϩμΫτʹ͓͍ͯAirflowΛ ͍·ͨ͠ɻ - ຊ൪ڥͷAirflowECS FargateʹσϓϩΠ͠·ͨ͠ɻ - ϩʔΧϧ։ൃڥʹdockerΛ༻ͯ͠։ൃָ͕ʹͳΓ ·ͨ͠ɻ
We are hiring ! https://bit.ly/2UqWPGO
supplementary information
λεΫؒґଘؔͷఆٛ - Ϗοτγϑτԋࢉࢠ(>>, <<)Λ͍λεΫͷґଘؔΛද͢ - ޙଓλεΫͷ࣮ߦ݅શͯͷઌߦλεΫޭ͕σϑΥϧτઃఆ1 - શOperator͕࣋ͭtrigger_ruleҾͰ࣮ߦ݅ΛมߋՄೳ1 IUUQTBJSGMPXBQBDIFPSHDPODFQUTIUNMUSJHHFSSVMFT
- γϯϓϧͳґଘؔ task1 >> task2 - λεΫάϧʔϓ͕͋Δґଘؔ task1 >> [task2-1,task2-2] >> task3
Web UI: ϞχλϦϯά - Tree View - Gantt Chart
Web UI: Variable
ฒྻઃఆ Configuring parallelism in airflow.cfg - parallelism : ࢄॲཧΫϥελશମͰ࣮ߦՄೳͳϓϩηε -
dag_concurrency : ҰͭͷϫʔΧͰಉ࣮࣌ߦՄೳͳ࠷େϓϩηε - max_active_runs_per_dag : DAG෦Ͱಉ࣮࣌ߦՄೳͳ࠷େλε Ϋ - worker_concurrency : ҰͭͷCeleryϫʔΧͰಉ࣮࣌ߦՄೳͳ࠷େ ϓϩηε IUUQTBOBMZUJDTMJWFTFOTFDPKQFOUSZ