[PyCon JP 2019] An Introduction to Airflow and Case Studies from a Novice Pythonista

Naoki Matsuda
September 17, 2019

Slides from my talk at PyCon JP 2019.


Transcript

  1. An Introduction to Airflow and Case Studies from a Novice Pythonista. PyCon JP 2019, 2019.9.17, Naoki Matsuda

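  2. Agenda
     0. Self-introduction
     1. Overview of Airflow
     2. In-house Airflow case study
        - Product overview and challenges, Airflow environment
     3. Where I stumbled with Airflow
        - Passing data between tasks
        - Verifying DAGs: building a local development environment with docker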
  3. Self-introduction: Naoki Matsuda
     - Affiliation: Dentsu Digital Inc.
     - Work: backend services and ETL-related development
     - Joined the company in 2018
  4. 1. Overview of Airflow

  5. Apache Airflow overview: a platform for scheduling and monitoring workflows made up of batch jobs
     - Created at Airbnb
     - Open source (an incubation project of the Apache Software Foundation) [1, 2]
     - Implemented in Python [3]
     - Active developer community [3]
     [1] https://incubator.apache.org/projects/airflow.html
     [2] https://airflow.apache.org/license.html
     [3] https://github.com/apache/airflow
  6. How active the developer community is (as of 2019.9.9)

  7. What you can do with Apache Airflow
     - Define workflows (DAGs) in Python code
     - Run tasks according to their dependencies
     - Schedule workflows
     - Rich Web UI
       - Monitor DAG run status
       - Check task logs
       - Visualize dependencies, and more
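  8. Defining a workflow in Python code (creating a DAG). Code callouts: common DAG settings (run frequency, run period, timeout, etc.); task 1; task 2; definition of the dependencies between tasks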
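The code shown on this slide was not captured in the transcript. As a rough sketch of the "common DAG settings" callout only, the dictionaries below collect the kinds of values the slide lists (frequency, period, timeout) using standard-library types. The names `schedule_interval`, `start_date`, `end_date`, `dagrun_timeout` and `default_args` are arguments of Airflow's `DAG` class; the concrete values, the `owner` name and the retry settings are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Defaults applied to every task in the DAG (illustrative values).
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# DAG-level settings: how often to run, over which period, and for how long.
dag_settings = {
    "dag_id": "sample",
    "default_args": default_args,
    "schedule_interval": "@daily",         # run frequency
    "start_date": datetime(2019, 9, 1),    # start of the run period
    "end_date": datetime(2019, 12, 31),    # end of the run period
    "dagrun_timeout": timedelta(hours=1),  # timeout for one DAG run
}

# With Airflow installed, the settings would be passed on like this:
#   from airflow import DAG
#   dag = DAG(**dag_settings)
```

Keeping the settings in plain dictionaries makes them easy to inspect or reuse across several DAG files before Airflow itself is involved.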

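  9. Creating the tasks that make up a workflow
     - A workflow is composed of tasks called Operators [1]
     - One Operator describes one task
     - Code callout: the Operator's arguments. See the documentation for which arguments each Operator takes [2]
     [1] https://airflow.apache.org/concepts.html#operators
     [2] https://airflow.apache.org/_api/airflow/operators/index.html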
  10. Creating the tasks that make up a workflow
      - Run a Bash command: BashOperator
      - Run a Python function: PythonOperator
      - Run SQL: MySqlOperator, PostgresOperator, …
      - Send an HTTP request: SimpleHttpOperator
      - Cloud services and others: BigQueryOperator, AWSAthenaOperator, …
      - Sense a specific condition: Sensor
      https://airflow.apache.org/concepts.html#operators
      https://airflow.apache.org/_api/airflow/operators/index.html
  11. 2. In-house case study - Product overview and challenges, Airflow setup

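  12. About the development organization and the product
      - Development organization: a team of data engineers, plus data scientists and others (the head counts are missing from the transcript)
      - Product: an in-house digital advertising planning tool that visualizes and forecasts ad delivery results for the media and clients the company handles
      - Media: Google, Yahoo, Twitter, Facebook, LINE, ...
      - Diagram labels: planner, sales; conditions such as "July to September of last year", "Google Display", "beverage industry"; ad delivery data such as imp 100000, clicks 5000, cv 700, cost ...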
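  13. Challenges in the product
      - The data is not in an RDB: delivery report data sits in a column-oriented analytics database, master data sits in spreadsheets, and so on
      - Attaching the required information means following many relations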

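  14. Challenges in the product
      - The data is not in an RDB: delivery report data sits in a column-oriented analytics database, master data sits in spreadsheets, and so on
      - Attaching the required information means following many relations
      → Build a data mart for the product: create fact and dimension tables in an RDB via ETL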
  15. Airflow architecture (apache-airflow 1.10.2)
      - Airflow is deployed on AWS Fargate
      - Diagram labels: Elastic Load Balancing; Airflow web, worker, scheduler, flower; AWS Fargate; Amazon ElastiCache (Redis); Amazon RDS; Amazon S3 (DAGs)
  16. docker-airflow https://github.com/puckel/docker-airflow

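  17. The data flow we built. Diagram labels: ad delivery data from each media; relation tables; client information; …; JOIN; column-name normalization, etc.; Amazon Athena; INSERT; Backend service. Legend: tasks executed by Airflow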
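The diagram itself is not in the transcript, so the following is only a toy sketch of the kind of flow its labels suggest: normalize column names that differ per media, insert the rows into a fact table, and join against a dimension table. It uses an in-memory SQLite database in place of the real RDB, and every table name, column alias and media name here is hypothetical.

```python
import sqlite3

# Hypothetical per-media column names mapped to one shared naming scheme.
COLUMN_ALIASES = {"impressions": "imp", "imps": "imp", "click": "clicks", "conversions": "cv"}


def normalize(row):
    """Rename media-specific column names to the shared ones."""
    return {COLUMN_ALIASES.get(name, name): value for name, value in row.items()}


# Hypothetical delivery reports as two media might export them.
reports = [
    {"media": "MediaA", "client_id": 1, "impressions": 100000, "click": 5000, "conversions": 700},
    {"media": "MediaB", "client_id": 1, "imps": 40000, "clicks": 1200, "cv": 90},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_client (client_id INTEGER PRIMARY KEY, industry TEXT)")
conn.execute(
    "CREATE TABLE fact_delivery "
    "(media TEXT, client_id INTEGER, imp INTEGER, clicks INTEGER, cv INTEGER)"
)
conn.execute("INSERT INTO dim_client VALUES (1, 'beverages')")
conn.executemany(
    "INSERT INTO fact_delivery VALUES (:media, :client_id, :imp, :clicks, :cv)",
    [normalize(report) for report in reports],
)

# A single join now answers "results per industry".
totals = conn.execute(
    "SELECT d.industry, SUM(f.imp), SUM(f.clicks), SUM(f.cv) "
    "FROM fact_delivery f JOIN dim_client d USING (client_id) "
    "GROUP BY d.industry"
).fetchall()
print(totals)
```

In the setup the slides describe, each of these steps would be a task run by Airflow rather than one script.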
  18. 3. Where I stumbled with Airflow - Passing data between tasks - Verifying DAGs: building a local development environment with docker

  19. Passing data between tasks: use XCom (stored in the metadata database)
      - How to use XCom
        - Push data to XCom
          - Return a value from the function
          - Call kwargs['task_instance'].xcom_push(value=hoge, key='huga') inside the function
          - Use an Operator that returns a value, e.g. BigQueryGetDataOperator
        - Pull data from XCom
          - kwargs['task_instance'].xcom_pull()
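To make the push and pull paths concrete, here is a toy in-memory model of XCom. It is not Airflow's implementation: `ToyXComStore`, `ToyTaskInstance` and `run_task` are hypothetical stand-ins, and only the method names `xcom_push` and `xcom_pull` and the `return_value` key mirror the real API.

```python
XCOM_RETURN_KEY = "return_value"  # key Airflow uses for a returned value


class ToyXComStore:
    """In-memory stand-in for the XCom table in the metadata database."""

    def __init__(self):
        self._rows = {}

    def push(self, task_id, key, value):
        self._rows[(task_id, key)] = value

    def pull(self, task_id, key=XCOM_RETURN_KEY):
        return self._rows.get((task_id, key))


class ToyTaskInstance:
    """Mimics only the two XCom methods of a task instance."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store

    def xcom_push(self, key, value):
        self._store.push(self.task_id, key, value)

    def xcom_pull(self, task_ids, key=XCOM_RETURN_KEY):
        return self._store.pull(task_ids, key)


def run_task(task_id, func, store):
    """Run a callable and push its return value under the default key."""
    ti = ToyTaskInstance(task_id, store)
    result = func(task_instance=ti)
    if result is not None:
        ti.xcom_push(XCOM_RETURN_KEY, result)
    return result


def producer(**kwargs):
    kwargs["task_instance"].xcom_push(key="huga", value="hoge")  # explicit push
    return [1, 2, 3]                                             # implicit push


def consumer(**kwargs):
    ti = kwargs["task_instance"]
    return ti.xcom_pull(task_ids="task1"), ti.xcom_pull(task_ids="task1", key="huga")


store = ToyXComStore()
run_task("task1", producer, store)
pulled = run_task("task2", consumer, store)
print(pulled)
```

The point of the model is that both push styles end up as rows keyed by task and key, which is why a later task can pull either one by naming the upstream task.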
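  20. Passing data between tasks: an example that fetches table data from BigQuery and processes it

  21. Passing data between tasks: an example that fetches table data from BigQuery and processes it. Code callout: fetch data from the BQ table (the result is pushed to XCom)

  22. Passing data between tasks: an example that fetches table data from BigQuery and processes it. Code callouts: fetch data from the BQ table (the result is pushed to XCom); run the transpose_data function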

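  23. Passing data between tasks: an example that fetches table data from BigQuery and processes it. Code callouts: fetch data from the BQ table (the result is pushed to XCom); run the transpose_data function, which (1) pulls the XCom data pushed by task1 and (2) transposes the table data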
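The slide's code is missing from the transcript, so this is a guess at the shape of `transpose_data`, assuming the upstream task pushed its rows as a list of lists (which is how BigQueryGetDataOperator returns them). `StubTaskInstance` and the sample rows are hypothetical and exist only so the function can be run without Airflow.

```python
def transpose_data(**kwargs):
    """Pull the rows pushed by task1 and swap rows and columns."""
    rows = kwargs["task_instance"].xcom_pull(task_ids="task1")
    return [list(column) for column in zip(*rows)]


class StubTaskInstance:
    """Test double that returns canned rows instead of reading XCom."""

    def __init__(self, rows):
        self._rows = rows

    def xcom_pull(self, task_ids=None, key="return_value"):
        return self._rows


# Hypothetical table contents: one inner list per table row.
rows = [["2019-09-01", 100000, 5000], ["2019-09-02", 120000, 6500]]
transposed = transpose_data(task_instance=StubTaskInstance(rows))
print(transposed)
```

Because the function only touches `kwargs["task_instance"]`, it can be unit-tested with a stub like this before it is handed to a PythonOperator.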
  24. Passing data between tasks: unless provide_context is set to True, kwargs['task_instance'] raises a KeyError
      - provide_context=False (default): kwargs: {}
      - provide_context=True: kwargs: { 'dag': <DAG: sample>, 'ds': '2019-09-10', 'next_ds': '2019-09-10', … 'task_instance': <TaskInstance: sample.task1_2 …> … }
  25. Passing data between tasks: two ways to enable provide_context
      - Pass provide_context=True as a PythonOperator argument, or
      - Set it in default_args
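The KeyError is easier to see with a small model of how the callable gets invoked. This sketch approximates what PythonOperator does in Airflow 1.10 but is not its actual code; `call_python_callable` and the shortened context dictionary are hypothetical.

```python
def call_python_callable(python_callable, context, provide_context=False, op_kwargs=None):
    """Call the function roughly the way PythonOperator does in Airflow 1.10."""
    kwargs = dict(op_kwargs or {})
    if provide_context:
        # The runtime context (dag, ds, task_instance, ...) is merged in.
        kwargs = {**context, **kwargs}
    return python_callable(**kwargs)


def use_task_instance(**kwargs):
    return kwargs["task_instance"]


# Hypothetical context; a real one carries many more entries.
context = {"ds": "2019-09-10", "task_instance": "<TaskInstance: sample.task1_2>"}

try:
    call_python_callable(use_task_instance, context)  # provide_context=False
    outcome = "ok"
except KeyError as err:
    outcome = "KeyError: {}".format(err)

with_context = call_python_callable(use_task_instance, context, provide_context=True)
print(outcome)
print(with_context)
```

With the flag off the callable receives an empty kwargs dictionary, so the lookup fails inside the user's function rather than in Airflow, which is why the error can be confusing at first.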

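  26. Verifying DAGs: building a local development environment
      - How do you test a workflow (DAG) you have written?
      - Background:
        - Verifying a DAG in the cloud dev environment means uploading it to S3 every time
        - It is awkward when someone else is developing at the same time
      → I want to verify DAGs locally!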
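  27. Verifying DAGs: building a local development environment
      - How do you test a workflow (DAG) you have written?
      - Background:
        - Verifying a DAG in the cloud dev environment means uploading it to S3 every time
        - It is awkward when someone else is developing at the same time
      → I want to verify DAGs locally!
      → Start Airflow locally with docker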
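  28. Verifying DAGs: building a local development environment

  29. Verifying DAGs: building a local development environment. Code callout: use LocalExecutor

  30. Verifying DAGs: building a local development environment. Code callouts: use LocalExecutor; mount the dags directory as a volume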

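  31. Verifying DAGs: building a local development environment
      - The dags directory is mounted as a docker volume, so edits show up immediately
      - Only the DAGs you wrote yourself appear in the Web UI
      - Pulling the image from ECR lets you verify in the same environment as production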
  32. Summary
      - We used Airflow in an in-house product that needed ETL.
      - We deployed the production Airflow to ECS Fargate.
      - Using docker for the local development environment made development easier.

  33. We are hiring ! https://bit.ly/2UqWPGO

  34. supplementary information

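  35. Defining dependencies between tasks
      - Use the bitshift operators (>>, <<) to express task dependencies
      - By default a downstream task runs only after all of its upstream tasks have succeeded [1]
      - The trigger_rule argument, available on every Operator, changes that condition [1]
      - Simple dependency: task1 >> task2
      - Dependency with a group of tasks: task1 >> [task2_1, task2_2] >> task3
      [1] https://airflow.apache.org/concepts.html#trigger-rules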
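The bitshift syntax and the trigger rules can be illustrated with a toy class. This is not Airflow's `BaseOperator`, only a sketch of the same operator-overloading idea; `ToyTask`, its `ready` method and the `cleanup` task are hypothetical, while `all_success`, `one_success` and `all_done` are real trigger rule names.

```python
def _as_list(item):
    return item if isinstance(item, list) else [item]


class ToyTask:
    """Toy task that records its upstream tasks."""

    def __init__(self, task_id, trigger_rule="all_success"):
        self.task_id = task_id
        self.trigger_rule = trigger_rule
        self.upstream = []

    def __rshift__(self, other):   # self >> other
        for downstream in _as_list(other):
            downstream.upstream.append(self)
        return other

    def __lshift__(self, other):   # self << other
        self.upstream.extend(_as_list(other))
        return other

    def __rrshift__(self, other):  # [a, b] >> self
        self.__lshift__(other)
        return self

    def ready(self, states):
        """Decide whether to run, given {task_id: state} of finished tasks."""
        upstream_states = [states[task.task_id] for task in self.upstream]
        if self.trigger_rule == "all_success":
            return all(state == "success" for state in upstream_states)
        if self.trigger_rule == "one_success":
            return any(state == "success" for state in upstream_states)
        if self.trigger_rule == "all_done":
            return True
        raise ValueError("rule not modelled: " + self.trigger_rule)


task1 = ToyTask("task1")
task2_1 = ToyTask("task2_1")
task2_2 = ToyTask("task2_2")
task3 = ToyTask("task3")
cleanup = ToyTask("cleanup", trigger_rule="all_done")

task1 >> [task2_1, task2_2] >> task3
[task2_1, task2_2] >> cleanup

states = {"task1": "success", "task2_1": "success", "task2_2": "failed"}
print([task.task_id for task in task3.upstream])
print(task3.ready(states), cleanup.ready(states))
```

A plain Python list has no `>>` of its own, so the expression `[task2_1, task2_2] >> task3` falls through to `__rrshift__` on the right-hand task, which is what makes the grouped form work.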
  36. Web UI: monitoring - Tree View - Gantt Chart

  37. Web UI: Variable

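  38. Parallelism settings (Configuring parallelism in airflow.cfg)
      - parallelism: maximum number of task instances that can run at once across the whole Airflow installation
      - dag_concurrency: maximum number of task instances that can run at once within a single DAG
      - max_active_runs_per_dag: maximum number of active runs of a single DAG
      - worker_concurrency: maximum number of task processes that a single Celery worker runs at once
      https://analytics.livesense.co.jp/entry/… (the rest of the URL was lost in extraction)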
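As a sketch of where these keys live, the fragment below follows the airflow.cfg layout used by Airflow 1.10 (the first three keys under `[core]`, `worker_concurrency` under `[celery]`) and is read with Python's `configparser`. The numbers are example values, not tuning advice, and the final calculation is a simplification that ignores the per-DAG limits.

```python
import configparser

# Fragment in airflow.cfg layout; the numbers are examples only.
AIRFLOW_CFG_FRAGMENT = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16

[celery]
worker_concurrency = 16
"""

config = configparser.ConfigParser()
config.read_string(AIRFLOW_CFG_FRAGMENT)

parallelism = config.getint("core", "parallelism")
worker_concurrency = config.getint("celery", "worker_concurrency")

# Number of Celery workers needed before the installation-wide cap,
# rather than worker capacity, becomes the limiting factor.
workers_to_reach_cap = -(-parallelism // worker_concurrency)  # ceiling division
print(workers_to_reach_cap)
```

Reading the four values side by side makes it easier to see which limit applies at which level: installation, DAG, DAG run count, or worker.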