Slide 1

Slide 1 text

An Introduction to Airflow and a Case Study, Presented by a Novice Pythonista
PyCon JP 2019, 2019.9.17
Naoki Matsuda

Slide 2

Slide 2 text

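Agenda
0. Self-introduction
1. Overview of Airflow
2. In-house Airflow case study
   - Product overview and challenges, Airflow environment
3. Where we stumbled with Airflow
   - Passing data between tasks
   - Verifying DAG behavior: building a local development environment with docker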

Slide 3

Slide 3 text

Self-introduction
Naoki Matsuda
- Affiliation: Dentsu Digital Inc.
- Work: backend services and ETL-related development
- Joined the company in 2018

Slide 4

Slide 4 text

1. Overview of Airflow

Slide 5

Slide 5 text

Apache Airflow overview
A platform for scheduling and monitoring workflows made up of batch jobs
- Created at Airbnb
- Open source (an Apache Software Foundation incubation project) [1, 2]
- Implemented in Python [3]
- Active developer community [3]
[1] https://incubator.apache.org/projects/airflow.html
[2] https://airflow.apache.org/license.html
[3] https://github.com/apache/airflow

Slide 6

Slide 6 text

Developer community activity (as of 2019.9.9)

Slide 7

Slide 7 text

What Apache Airflow can do
- Define workflows (DAGs) in Python code
- Run tasks according to their dependencies
- Schedule workflows
- Rich Web UI
  - Monitor DAG run status
  - Check task logs
  - Visualize dependencies
  - and more

Slide 8

Slide 8 text

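Defining a workflow in Python code (creating a DAG)
Annotations on the code shown in the slide:
- Common DAG settings: execution frequency, execution period, timeout, etc.
- Task 1
- Task 2
- Definition of the dependencies between tasks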
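The code on this slide did not survive in the transcript. As a minimal sketch of the structure the annotations describe (common settings, two tasks, one dependency), assuming Airflow 1.10-style import paths: the DAG id, task ids, schedule, and all setting values are illustrative assumptions, not the speaker's code.

```python
from datetime import datetime, timedelta

# Common DAG settings (all values here are illustrative assumptions).
default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 9, 1),          # start of the execution period
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),  # per-task timeout
}


def say_hello():
    """Hypothetical Python callable run by task 2."""
    return "hello"


def build_dag():
    # Airflow 1.10-style import paths, imported here so the settings above
    # can still be inspected where Airflow is not installed.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        dag_id="example_dag",
        default_args=default_args,
        schedule_interval="@daily",  # execution frequency
    )
    task1 = BashOperator(task_id="task1", bash_command="echo start", dag=dag)
    task2 = PythonOperator(task_id="task2", python_callable=say_hello, dag=dag)
    task1 >> task2  # task2 runs after task1
    return dag


try:
    dag = build_dag()  # Airflow picks up DAG objects found at module level
except ImportError:
    dag = None  # Airflow is not installed in this environment
```

Settings placed in default_args are passed to every Operator in the DAG, which is why they suit values shared by all tasks.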

Slide 9

Slide 9 text

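Creating the tasks that make up a workflow
- A workflow is composed of tasks called Operators [1]
- One Operator describes one task
Annotation on the code shown in the slide: the Operator's arguments. See the documentation for the arguments each Operator takes [2].
[1] https://airflow.apache.org/concepts.html#operators
[2] https://airflow.apache.org/_api/airflow/operators/index.html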

Slide 10

Slide 10 text

Creating the tasks that make up a workflow
- Run a Bash command: BashOperator
- Run a Python function: PythonOperator
- Run SQL: MySqlOperator, PostgresOperator, …
- Send an HTTP request: SimpleHttpOperator
- Cloud services and others: BigQueryOperator, AWSAthenaOperator, …
- Wait for a specific condition: Sensor
https://airflow.apache.org/concepts.html#operators
https://airflow.apache.org/_api/airflow/operators/index.html

Slide 11

Slide 11 text

2. In-house case study
- Product overview and challenges, Airflow architecture

Slide 12

Slide 12 text

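About the development organization and the product
Development organization
- Team: data engineers, data scientists, and others
Product
- An in-house digital advertising planning tool
- Visualizes and forecasts ad delivery performance data for the media and clients handled in-house
[Diagram labels: ad delivery data from media (Google, Yahoo, Twitter, Facebook, LINE, ...) flows into the product; planners and sales staff give conditions such as "last year, July to September", "Google Display", "beverage industry" and get back imp 100000, clicks 5000, cv 700, cost ...]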

Slide 13

Slide 13 text

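Challenges in the product
- The data is not in an RDB
  Delivery report data sits in a column-oriented database for analytics, master data sits in spreadsheets, and so on
- Many relations have to be traversed to attach the required information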

Slide 14

Slide 14 text

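Challenges in the product
- The data is not in an RDB
  Delivery report data sits in a column-oriented database for analytics, master data sits in spreadsheets, and so on
- Many relations have to be traversed to attach the required information
→ Build a data mart for the product
  Create fact and dimension tables in an RDB via ETL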

Slide 15

Slide 15 text

Airflow architecture
apache-airflow 1.10.2
- Airflow is deployed on AWS Fargate
[Diagram labels: Elastic Load Balancing; Airflow on AWS Fargate (web, flower, scheduler, worker); Amazon ElastiCache (Redis); Amazon RDS; Amazon S3 (DAGs)]

Slide 16

Slide 16 text

docker-airflow https://github.com/puckel/docker-airflow

Slide 17

Slide 17 text

The data flow we built
[Diagram labels: per-media ad delivery data, relation tables, client information, …; JOIN, column-name normalization, etc. (Amazon Athena); INSERT; Backend service; tasks executed by Airflow]

Slide 18

Slide 18 text

3. Where we stumbled with Airflow
- Passing data between tasks
- Verifying DAG behavior: building a local development environment with docker

Slide 19

Slide 19 text

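Passing data between tasks
Use XCom to pass data between tasks. XCom values are stored in the metadata database.
- How to use XCom
  - Push data to XCom
    - return a value from the function
    - call kwargs['task_instance'].xcom_push(value=hoge, key='huga') inside the function
    - use an Operator that returns a value, e.g. BigQueryGetDataOperator
  - Pull data from XCom
    - kwargs['task_instance'].xcom_pull()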
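The push and pull calls above can be tried without a running Airflow. The sketch below writes two callables in the style expected by PythonOperator and runs them against FakeTaskInstance, a hypothetical in-memory stand-in invented here for illustration. It is not Airflow's TaskInstance, and the shared dict only plays the role of the metadata database.

```python
class FakeTaskInstance:
    """Hypothetical in-memory stand-in for Airflow's TaskInstance (illustration only)."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self.store = store  # shared dict playing the role of the metadata database

    def xcom_push(self, key, value):
        self.store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids=None, key="return_value"):
        return self.store.get((task_ids, key))


def push_data(**kwargs):
    # Explicit push under a custom key.
    kwargs["task_instance"].xcom_push(key="huga", value="hoge")
    # In Airflow, PythonOperator also pushes the return value,
    # under the key "return_value".
    return [1, 2, 3]


def pull_data(**kwargs):
    ti = kwargs["task_instance"]
    explicit = ti.xcom_pull(task_ids="task1", key="huga")
    returned = ti.xcom_pull(task_ids="task1")  # default key: "return_value"
    return explicit, returned


store = {}
ti1 = FakeTaskInstance("task1", store)
result = push_data(task_instance=ti1)
ti1.xcom_push(key="return_value", value=result)  # Airflow does this step itself
pulled = pull_data(task_instance=FakeTaskInstance("task2", store))
print(pulled)
```

In a real DAG the same callables would be handed to PythonOperator with provide_context=True, and Airflow would supply the task instance.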

Slide 20

Slide 20 text

Passing data between tasks
Example: fetch table data from BigQuery and process it

Slide 21

Slide 21 text

Passing data between tasks
Example: fetch table data from BigQuery and process it
- Fetch data from a BQ table  # pushed to XCom

Slide 22

Slide 22 text

Passing data between tasks
Example: fetch table data from BigQuery and process it
- Fetch data from a BQ table  # pushed to XCom
- Run the transpose_data function

Slide 23

Slide 23 text

Passing data between tasks
Example: fetch table data from BigQuery and process it
- Fetch data from a BQ table  # pushed to XCom
- Run the transpose_data function
  1. Pull the XCom data pushed by task1
  2. Transpose the table data
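The slide's code is not in the transcript, so the following is only a sketch of what a transpose_data callable along these lines might look like. It assumes task1 pushed the table as a list of rows, and it uses a SimpleNamespace stub with made-up rows in place of the real task instance so it can run locally.

```python
from types import SimpleNamespace


def transpose_data(**kwargs):
    # 1. Pull the rows that task1 pushed to XCom.
    rows = kwargs["task_instance"].xcom_pull(task_ids="task1")
    # 2. Transpose: each output list holds one column of the input table.
    return [list(column) for column in zip(*rows)]


# Local check with a stub task instance (the rows are made-up sample data).
stub_ti = SimpleNamespace(xcom_pull=lambda task_ids: [["a", 1], ["b", 2], ["c", 3]])
transposed = transpose_data(task_instance=stub_ti)
print(transposed)
```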

Slide 24

Slide 24 text

Passing data between tasks
Unless provide_context is set to True, kwargs['task_instance'] raises a KeyError
- provide_context=False (default)
  kwargs: {}
- provide_context=True
  kwargs: { 'dag': …, 'ds': '2019-09-10', 'next_ds': '2019-09-10', … 'task_instance': … }
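The failure is easy to reproduce in plain Python. The helper call_like_python_operator below is a hypothetical, simplified model of how PythonOperator decides what to pass to the callable. It is not Airflow code, and the context values are placeholders.

```python
def use_task_instance(**kwargs):
    return kwargs["task_instance"]


def call_like_python_operator(python_callable, provide_context, context):
    """Simplified model: the context reaches kwargs only if provide_context is True."""
    kwargs = dict(context) if provide_context else {}
    return python_callable(**kwargs)


# Placeholder context; a real one holds the DAG, dates, the TaskInstance, etc.
context = {"ds": "2019-09-10", "task_instance": "<TaskInstance stand-in>"}

try:
    call_like_python_operator(use_task_instance, False, context)
    outcome = "no error"
except KeyError as exc:
    outcome = "KeyError: {}".format(exc)

with_context = call_like_python_operator(use_task_instance, True, context)
print(outcome)
print(with_context)
```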

Slide 25

Slide 25 text

Passing data between tasks
- Pass provide_context=True as a PythonOperator argument
OR
- Set it in default_args

Slide 26

Slide 26 text

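Verifying DAG behavior: building a local development environment
- How do we test the workflows (DAGs) we write?
Background:
- Verifying a DAG in the cloud dev environment means the hassle of uploading it to S3
- It is awkward when someone else is developing at the same time
→ We want to verify DAGs locally!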

Slide 27

Slide 27 text

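Verifying DAG behavior: building a local development environment
- How do we test the workflows (DAGs) we write?
Background:
- Verifying a DAG in the cloud dev environment means the hassle of uploading it to S3
- It is awkward when someone else is developing at the same time
→ We want to verify DAGs locally!
→ Launch Airflow locally with docker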

Slide 28

Slide 28 text

Verifying DAG behavior: building a local development environment

Slide 29

Slide 29 text

Verifying DAG behavior: building a local development environment
- Use LocalExecutor

Slide 30

Slide 30 text

Verifying DAG behavior: building a local development environment
- Use LocalExecutor
- Mount the dags directory as a volume

Slide 31

Slide 31 text

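Verifying DAG behavior: building a local development environment
- Because the dags directory is mounted as a docker volume, edits are reflected immediately
- Only the DAGs you wrote yourself appear in the Web UI
- By pulling the image from ECR, you can verify behavior in the same environment as production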

Slide 32

Slide 32 text

Summary
- We used Airflow for an in-house product that needed ETL.
- We deployed the production Airflow on ECS Fargate.
- Using docker for the local development environment made development easier.

Slide 33

Slide 33 text

We are hiring ! https://bit.ly/2UqWPGO

Slide 34

Slide 34 text

supplementary information

Slide 35

Slide 35 text

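Defining dependencies between tasks
- Use the bit-shift operators (>>, <<) to express dependencies between tasks
- By default, a downstream task runs only when all of its upstream tasks have succeeded [1]
- The trigger_rule argument, available on every Operator, changes that condition [1]
- A simple dependency
  task1 >> task2
- A dependency involving a group of tasks
  task1 >> [task2-1, task2-2] >> task3
[1] https://airflow.apache.org/concepts.html#trigger-rules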
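To show why a chain such as task1 >> [task2-1, task2-2] >> task3 works as Python syntax, here is a toy Task class that overloads the same operators. It is an illustrative model only, not Airflow's BaseOperator. The names task2_1 and task2_2 stand in for task2-1 and task2-2, because a hyphen is not valid in a Python identifier.

```python
class Task:
    """Toy model of operator overloading for dependencies (not Airflow's class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # ids of tasks that run after this one

    @staticmethod
    def _as_list(tasks):
        return tasks if isinstance(tasks, list) else [tasks]

    def __rshift__(self, other):   # self >> other
        for task in self._as_list(other):
            self.downstream.append(task.task_id)
        return other               # returning other lets chains continue

    def __lshift__(self, other):   # self << other  (other runs first)
        for task in self._as_list(other):
            task.downstream.append(self.task_id)
        return other

    def __rrshift__(self, other):  # [a, b] >> self  (a list is on the left)
        for task in self._as_list(other):
            task.downstream.append(self.task_id)
        return self


task1, task2_1, task2_2, task3 = (
    Task(name) for name in ["task1", "task2_1", "task2_2", "task3"]
)
task1 >> [task2_1, task2_2] >> task3

print(task1.downstream)
print(task2_1.downstream, task2_2.downstream)
```

The middle step matters: task1 >> [...] returns the list, and because a list has no >> for Task objects, Python falls back to task3.__rrshift__, which links every task in the list to task3.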

Slide 36

Slide 36 text

Web UI: Monitoring
- Tree View
- Gantt Chart

Slide 37

Slide 37 text

Web UI: Variable

Slide 38

Slide 38 text

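Parallelism settings
Configuring parallelism in airflow.cfg
- parallelism: maximum number of task instances that can run at once across the whole Airflow installation
- dag_concurrency: maximum number of task instances that can run concurrently within a single DAG
- max_active_runs_per_dag: maximum number of active DAG runs per DAG
- worker_concurrency: maximum number of task instances that one Celery worker runs concurrently
https://analytics.livesense.co.jp/entry/…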
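As a small sketch of where these keys sit, the snippet below parses an airflow.cfg-style fragment with the standard library. The section names follow the Airflow 1.10 layout ([core] and [celery]). The numeric values are illustrative assumptions, not recommendations from the talk.

```python
import configparser

# Fragment in the style of airflow.cfg (values are illustrative assumptions).
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16

[celery]
worker_concurrency = 16
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

settings = {
    "parallelism": parser.getint("core", "parallelism"),
    "dag_concurrency": parser.getint("core", "dag_concurrency"),
    "max_active_runs_per_dag": parser.getint("core", "max_active_runs_per_dag"),
    "worker_concurrency": parser.getint("celery", "worker_concurrency"),
}

for name, value in settings.items():
    print("{} = {}".format(name, value))
```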