
Building Data Pipelines with Apache Airflow


Takumi Sakamoto

May 11, 2017

Transcript

  1. Building Data Pipelines
    with Apache Airflow
    Takumi Sakamoto
    Tokyo Airflow Meetup #1
    2017.05.11


  2. About me
    Takumi Sakamoto
    @takus
    Product Manager at KAIZEN Platform
    I’m here
    http://www.mindtheproduct.com/2011/10/what-exactly-is-a-product-manager/


  3. Why I’m here
    I used Airflow as a data engineer before my recent job change
    https://goo.gl/wn6PCO


  4. Joined Kaizen Platform this month
    What does “Kaizen” mean?


  5. Kaizen Platform
    UX optimization platform for websites
    https://kaizenplatform.com/en/


  6. Continuous Improvement Flow
    Data processing is important for identifying UX issues & reporting after tests
    Embed JS tag: all clients have to do is embed a tag; data is collected and
    issues are identified automatically.
    Collect design variations: over 4,600 optimizers on Kaizen Platform come up
    with optimized design variations.
    Execute tests: clients select multiple variations out of all the submissions
    by optimizers, conduct A/B tests, and keep replacing the design with the
    better-performing variation (JS is used to display the variations).
    [Flow diagram: Customer Success decides what to optimize, requests design
    variations, and identifies issues from the collected data]


  7. Data & its key numbers
    User activity logs (PV, click …) on our customer websites
    100M+ records per day from 230+ enterprises across various industries:
    Finance, Media, EC, Travel, Education, Infrastructure, Job Hunting,
    Real Estate, Used Cars, Matchmaking/Wedding, Lead Generation


  8. Data pipelines
    Transform data from one representation to another through a series of steps
    https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.html


  9. Why do data pipelines matter?
    • Analytics & batch processing are mission critical
    • serve decision makers
    • power machine learning models that can feed into production
    • Data pipelines become more complex every day
    • add new data transfers for new business logic
    • support new data sources


  10. Example of data pipelines
    https://medium.com/@dustinstansbury/beyond-cron-an-introduction-to-workflow-management-systems-19987afcdb5e
    Figure 1.1: An Example Workflow: Reporting and Predicting Online Gaming Revenue


  11. Using cron to manage data pipelines
    5 0 * * * app extract_ad.sh
    5 0 * * * app extract_appstore.sh
    30 0 * * * app extract_cv_rate.sh
    30 0 * * * app transform.sh
    0 2 * * * app combine.sh
    0 3 * * * app import_into_db.sh


  12. Using cron becomes a headache
    • It cannot handle dependencies between tasks, so you are often forced to
    set up fixed execution times with ad-hoc guard times.
    • It's very difficult to add new jobs to a complex crontab. When should a new
    heavy task be scheduled? Some independent tasks share a common resource
    (e.g. a database), so it's best not to overlap them.
    • Hard to debug and maintain. The crontab is just a text file.
    • Rich logging has to be handled externally.
    • Lack of stats
    https://danidelvalle.me/2016/09/12/im-sorry-cron-ive-met-airbnbs-airflow/


  13. Workflow management system (WMS)
    • Manages scheduling and running of tasks in data pipelines
    • Ensures jobs are ordered correctly based on dependencies
    • Manages allocation of scarce resources
    • Provides a mechanism for tracking the state of tasks and recovering from failure


  14. Apache Airflow
    • A workflow management system
    • define workflows as code
    • a lot of useful features
    • built-in shiny Web UI & rich CLI


  15. Workflow as code
    More maintainable, versionable, testable, and collaborative than configuration
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # default_args (owner, start_date, retries, ...) is defined earlier and omitted here
    dag = DAG('tutorial', default_args=default_args)

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)

    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)

    t3 = BashOperator(
        task_id='templated',
        bash_command="""
        {% for i in range(5) %}
            echo "{{ ds }}"
            echo "{{ macros.ds_add(ds, 7) }}"
            echo "{{ params.my_param }}"
        {% endfor %}
        """,
        params={'my_param': 'Parameter I passed in'},
        dag=dag)

    t2.set_upstream(t1)
    t3.set_upstream(t1)

    Tasks, their dependencies, and the DAG (workflow) itself are all plain Python code.


  16. Workflow as code
    Dynamic workflow for dynamic infrastructure
    # Create ETL tasks for ELB access logs
    # Aggregate PVs after converting JSON to Parquet format
    for elb in c.describe_load_balancers():
        task = HiveOperator(
            task_id='to_parquet_{}'.format(elb.LoadBalancerName),
            hql=etl_query,
            params={
                'name': elb.LoadBalancerName,
                's3_bucket': elb.AccessLog.S3BucketName,
                's3_path': elb.AccessLog.S3BucketPrefix,
            },
            dag=dag)
        task.set_upstream(aggregation_task)
    NOTE: this example doesn't consider deleted ELBs


  17. Useful feature
    Resource management with “pools” to avoid putting too much load on shared resources
    [Diagram: many HiveOperator tasks pass through an Airflow pool that limits
    task concurrency before reaching YARN, a dynamically scaling resource manager;
    a MySQL operator task runs against MySQL]
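    A minimal sketch of how a pool is applied, reusing the HiveOperator, etl_query,
    and dag from the earlier slide (the pool name, slot count, and priority value
    are illustrative, not from the deck): create a pool in the Web UI (Admin > Pools)
    or via the CLI, then reference it from any operator with the pool argument.

    from airflow.operators.hive_operator import HiveOperator

    # Cap concurrent Hive tasks by routing them through a shared pool
    # (assumes a pool named "hive_pool" with e.g. 4 slots already exists)
    hive_task = HiveOperator(
        task_id='to_parquet_example',
        hql=etl_query,
        pool='hive_pool',          # at most <slots> such tasks run at once
        priority_weight=10,        # higher weight wins when pool slots are scarce
        dag=dag)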


  18. Useful feature
    Task callbacks for success / failure / SLA miss
    https://www.slideshare.net/r39132/airflow-agari-63072756
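    A minimal sketch of wiring up such callbacks in Airflow 1.x (the notify_failure
    function and operator values are illustrative; on_failure_callback,
    on_success_callback, and sla are standard operator arguments, and the dag is
    assumed to be defined as in the earlier tutorial example):

    from datetime import timedelta
    from airflow.operators.bash_operator import BashOperator

    def notify_failure(context):
        # context carries the task instance, execution date, exception, etc.
        print("Task {} failed for {}".format(
            context['task_instance'].task_id, context['execution_date']))

    t = BashOperator(
        task_id='load_warehouse',
        bash_command='load.sh',
        on_failure_callback=notify_failure,  # fired when the task fails
        sla=timedelta(hours=1),              # misses trigger the DAG's sla_miss_callback
        dag=dag)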


  19. WebUI: Workflow status
    Which workflows (DAGs) or tasks failed?
    https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned


  20. WebUI: Graph view
    Visualize task dependencies


  21. WebUI: Gantt
    Which task is the blocker?
    https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned


  22. WebUI: Task detail
    See task metadata, rendered templates, execution logs, etc. for debugging


  23. Rich CLI
    Useful for re-running tasks after fixing bugs in the ETL process
    # Clear task execution histories from 2017-05-01 onwards
    airflow clear etl \
      --task_regex insight_ \
      --downstream \
      --start_date 2017-05-01
    # Backfill the cleared tasks
    airflow backfill etl \
      --start_date 2017-05-01


  24. Tips: Jupyter Notebook
    Useful for developing workflows interactively
    Use Airflow's BigQueryHook from the notebook
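    A minimal sketch of a notebook cell using the hook (the connection ID and the
    query/table names are placeholders):

    from airflow.contrib.hooks.bigquery_hook import BigQueryHook

    # Uses the BigQuery connection configured in Airflow
    bq = BigQueryHook(bigquery_conn_id='bigquery_default')

    # Prototype the query interactively, then move it into the DAG definition
    df = bq.get_pandas_df(bql='SELECT dt, COUNT(*) AS pv FROM logs.pageviews GROUP BY dt')
    df.head()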


  25. Tips: Data quality check
    Anomaly detection with Apache Airflow and Datadog
    [Diagram: a BigQuery operator's on_success / on_failure callbacks send metrics
    to Datadog, which applies anomaly detection to them]
    https://www.datadoghq.com/blog/introducing-anomaly-detection-datadog/


  26. Tips: Data quality check
    Anomaly detection with Apache Airflow and Datadog
    from airflow.contrib.hooks.bigquery_hook import BigQueryHook
    from airflow.contrib.hooks.datadog_hook import DatadogHook

    bq = BigQueryHook()
    dd = DatadogHook()

    def dd_callback(context):
        # get_validate_bql returns a query whose result is a dataframe of
        # (metric_name, timestamp, numeric_value) rows
        df = bq.get_pandas_df(bql=get_validate_bql(context['dag'], context['task']))
        for i, c in df.iterrows():
            dd.send_metric(
                c[0],
                datapoint=(c[1], c[2]),
                tags=[
                    "dag:{}".format(context['dag']),
                    "task:{}".format(context['task'])
                ]
            )
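    To connect this with the previous slide, the callback is attached to the
    BigQuery task via on_success_callback (the task_id and query variable below
    are illustrative, not from the deck):

    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    validate_task = BigQueryOperator(
        task_id='daily_pv',
        bql=daily_pv_query,
        on_success_callback=dd_callback,  # push validation metrics to Datadog
        dag=dag)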


  27. What didn’t work well?
    • Not mature enough in v1.6.1; we hit some bugs
    • may be fixed in v1.8.1?
    • History tables become huge with high-frequency DAGs
    • https://github.com/teamclairvoyant/airflow-maintenance-dags
    • Sensor tasks fill up all available slots
    • need to limit concurrency by pool or priority
    • Timezone: Airflow uses UTC internally
    • define a macro to convert UTC to JST (sketch below)
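    A minimal sketch of such a macro, registered on the DAG via user_defined_macros
    (the macro name utc_to_jst is illustrative, and default_args is assumed to be
    defined as in the earlier examples):

    from datetime import timedelta
    from airflow import DAG

    def utc_to_jst(dt):
        # JST is UTC+9 and has no daylight saving time
        return dt + timedelta(hours=9)

    dag = DAG(
        'etl',
        default_args=default_args,
        user_defined_macros={'utc_to_jst': utc_to_jst},
    )

    # Usage inside a templated field:
    #   echo "{{ utc_to_jst(execution_date).strftime('%Y-%m-%d %H:%M') }}"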


  28. Questions?
