Tips and Tricks on Building Serverless Data Lake Pipeline

Aditya Satrya
December 16, 2020

My presentation at the AWS DevAx Connect sharing session. We shared lessons learned from building a data lake pipeline on top of AWS services, and how it helps the Data team at Mekari deliver business value faster.

Recording video: https://www.twitch.tv/videos/873990234

Transcript

  1. Tips and Tricks on Building Serverless Data Lake Pipeline Aditya

    Satrya Data Engineering Tech Lead at Mekari linkedin.com/in/asatrya 16 December 2020
  2. Introduction 2 • Aditya Satrya • Data Engineering Tech Lead

    at Mekari • Lives in Bandung • linkedin.com/in/asatrya/
  3. 3 Mekari is Indonesia's #1 Software-as-a-Service (SaaS) company. Our mission

    is to empower the progress of businesses and professionals. https://www.mekari.com
  4. Agenda 4 • Basic Concepts • Mekari Case Study •

    Lessons Learned • Common Pitfalls
  5. What is a data pipeline? 6 A series of steps or actions to move and combine data from
     various sources for analysis or visualization. Image source: https://martinfowler.com/articles/data-mesh-principles.html
     OLAP: • Analytical purposes • Deals with many records at a time, but few columns
     OLTP: • Operational purposes • Deals with one record at a time, but all columns
  6. Data Lake 9: • Centralized repository • Raw format • Schemaless • Any scale
     Data Warehouse: • Centralized repository • Transformed • Single consistent schema • Minimum data growth
  7. Challenges 12
     1. Data is scattered in many places (DBs, spreadsheets, 3rd-party applications)
     2. Different departments have different values for the same metric (many reconciliation meetings needed)
     3. Little knowledge about the data at the beginning
     4. High demand (in terms of quality and speed) for data/analysis from business users for decision making
     5. Need to build a foundation for data products (giving value to customers using analytics/AI)
     6. The existing solution (querying a replica database) is not adequate for heavy analysis
     These map to the solution criteria: single source of truth; fast data movement, speedy insights, democratizing data; support for new types of analytics (ML) & new sources/formats; performance.
  8. Criteria of Solution 13
     • Single source of truth: Data Lake ✓, Data Warehouse ✓
     • Fast data movement, speedy insights, democratizing data: Data Lake ✓, Data Warehouse ✗
     • Support new types of analytics (ML) & new sources/formats: Data Lake ✓, Data Warehouse ✗
     • Performance: Data Lake ✓, Data Warehouse ✓ (with high cost)
  9. • Separating the lake into zones creates a clear logical separation and supports quality
     assurance and access control ◦ Prevents the data lake from becoming a data swamp ◦ Ensures
     the quality of data received by downstream consumers while still retaining the ability to
     quickly and easily ingest new sources of data • Ingest and store data in an as-raw-as-possible
     format ◦ Prevents losing data because of process failures ◦ Enables reprocessing data when
     business rules change 1) Tips on Data Lake Organization 17
  10. • Address conventions as early as possible ◦ Zone names ◦ Storage paths (good references:
      this and this) ◦ Data types allowed ◦ Database and table names • Hide the underlying storage
      path from developers ◦ Minimizes human error ◦ Create a library/helper to resolve any
      location in the data lake (see the sketch below) 1) Tips on Data Lake Organization 18
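      A minimal sketch of such a helper, assuming a simple s3://<bucket>/<zone>/<database>/<table>/
      layout; the bucket name, zone names, and function name are hypothetical, not the actual
      Mekari convention:

      ```python
      # Hypothetical path helper: jobs call lake_path() instead of hard-coding S3 paths.
      DATA_LAKE_BUCKET = "my-data-lake-bucket"   # assumed bucket name
      ZONES = {"raw", "clean", "curated"}        # assumed zone names

      def lake_path(zone: str, database: str, table: str) -> str:
          """Return the canonical S3 location of a table in the data lake."""
          if zone not in ZONES:
              raise ValueError(f"Unknown zone '{zone}', expected one of {sorted(ZONES)}")
          return f"s3://{DATA_LAKE_BUCKET}/{zone}/{database}/{table}/"

      # Usage inside a job:
      # df.write.parquet(lake_path("raw", "sales", "orders"))
      ```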
  11. • Use the Write-Validate-Publish pattern (see the sketch below) ◦ Write data (to storage)
      between tasks ◦ Validate data after each step ◦ Only publish data when it passes validation
      ◦ Prevents publishing wrong data ◦ Define data tests in code ◦ Consider an alternative:
      greatexpectations.io 1) Tips on Data Lake Organization 19
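      A minimal sketch of the pattern in PySpark; the staging/published locations and the
      validation checks (non-empty data, no NULL ids) are illustrative, and greatexpectations.io
      would replace the validate() step with richer expectations:

      ```python
      from pyspark.sql import DataFrame, SparkSession, functions as F

      def validate(df: DataFrame) -> None:
          """Raise instead of publishing bad data; the checks here are illustrative."""
          if df.count() == 0:
              raise ValueError("Validation failed: dataset is empty")
          null_ids = df.filter(F.col("id").isNull()).count()
          if null_ids > 0:
              raise ValueError(f"Validation failed: {null_ids} rows with NULL id")

      def write_validate_publish(spark: SparkSession, df: DataFrame,
                                 staging_path: str, published_path: str) -> None:
          df.write.mode("overwrite").parquet(staging_path)        # 1) write between tasks
          staged = spark.read.parquet(staging_path)
          validate(staged)                                        # 2) validate what was written
          staged.write.mode("overwrite").parquet(published_path)  # 3) publish only on success
      ```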
  12. 2) Tips on Data Processing Engine (Glue, EMR, Athena) 20
      Ops: Bitbucket Pipelines, AWS CloudFormation, AWS CloudWatch
  13. Glue is a managed Spark ETL service with the following main high-level features:
      • Data Catalogue: a place to logically organise your data into databases and tables
      • Crawlers: populate the catalogue • Jobs: author ETL jobs in Python or Scala and execute
      them on a managed cluster • Workflows: orchestrate a series of jobs with triggers.
      AWS Glue Overview 21
  14. • Use Glue 2.0 ◦ Faster provisioning time ◦ Smaller billing time unit • Write code with
      portability in mind (see the sketch below) ◦ Minimize use of GlueContext and DynamicFrame
      ◦ Glue is great to start with, but as you grow you will need to move • Beware of waiting
      time inside the job run (under-utilized cluster) ◦ Slow queries caused by tables/queries
      that are not index-optimized ◦ S3 operations (write, read, copy) 2a) Tips on Using Glue 22
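      A minimal sketch of what portability means here: keep the transformation in plain PySpark so
      the same script can later run on EMR without changes; the table and path names are
      illustrative:

      ```python
      from pyspark.sql import SparkSession, DataFrame, functions as F

      def transform(df: DataFrame) -> DataFrame:
          """Plain Spark logic, no GlueContext or DynamicFrame, so it ports to EMR as-is."""
          return (df
                  .filter(F.col("status") == "active")
                  .withColumn("ingested_date", F.current_date()))

      if __name__ == "__main__":
          spark = SparkSession.builder.appName("portable-job").getOrCreate()
          source = spark.read.parquet("s3://my-data-lake-bucket/raw/sales/orders/")  # illustrative path
          (transform(source)
           .write.mode("overwrite")
           .parquet("s3://my-data-lake-bucket/clean/sales/orders/"))
      ```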
  15. • Optimize worker type and number of DPUs on a per-job basis ◦ Enable Job Metrics
      ◦ Understand the graphs and allocate just enough capacity for each job ◦ Good ref:
      Monitoring for DPU Capacity Planning • Glue Workflows are adequate only if you have small
      jobs ◦ IMO, jump directly to Airflow • Use a CD pipeline and automatic provisioning tools
      (e.g., CloudFormation) to create multiple environments (production, staging, development)
      → more on this later 2a) Tips on Using Glue 23
  16. Amazon EMR is a managed cluster platform that simplifies running

    big data frameworks, such as Apache Hadoop and Apache Spark. • Storage: ◦ S3 ◦ HDFS ◦ Local Disk • Cluster Resource Management: ◦ YARN • Data Processing Framework: ◦ MapReduce ◦ Spark AWS EMR Overview 24
  17. 2b) Tips on Using Spark EMR 25 • We moved from Glue to EMR as we managed more jobs and
      needed better cost-efficiency • Use Spot Instances for cost-efficiency • Create a custom
      Airflow operator to (see the sketch below): ◦ Provision the cluster ◦ Submit jobs
      ◦ Terminate the cluster
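      The custom operator itself is internal, so as an illustration here is a minimal sketch of the
      same provision, submit, and terminate flow built from the stock Amazon provider operators on
      Airflow 2 (import paths differ on older releases); the cluster configuration, Spark step, and
      names are illustrative, and IAM roles, logging, and networking are omitted:

      ```python
      from datetime import datetime
      from airflow import DAG
      from airflow.providers.amazon.aws.operators.emr import (
          EmrCreateJobFlowOperator,
          EmrAddStepsOperator,
          EmrTerminateJobFlowOperator,
      )
      from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

      JOB_FLOW_OVERRIDES = {
          "Name": "transient-etl-cluster",
          "ReleaseLabel": "emr-6.2.0",
          "Applications": [{"Name": "Spark"}],
          "Instances": {
              "InstanceGroups": [
                  {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                  {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2,
                   "Market": "SPOT"},  # Spot Instances for cost-efficiency
              ],
              "KeepJobFlowAliveWhenNoSteps": True,
          },
      }

      SPARK_STEP = [{
          "Name": "etl_job",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
              "Jar": "command-runner.jar",
              "Args": ["spark-submit", "s3://my-data-lake-bucket/jobs/etl_job.py"],
          },
      }]

      with DAG("emr_transient_cluster", start_date=datetime(2020, 12, 1),
               schedule_interval="@daily", catchup=False) as dag:
          create_cluster = EmrCreateJobFlowOperator(
              task_id="create_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES)
          submit_job = EmrAddStepsOperator(
              task_id="submit_job", job_flow_id=create_cluster.output, steps=SPARK_STEP)
          wait_for_job = EmrStepSensor(
              task_id="wait_for_job", job_flow_id=create_cluster.output,
              step_id="{{ task_instance.xcom_pull(task_ids='submit_job', key='return_value')[0] }}")
          terminate_cluster = EmrTerminateJobFlowOperator(
              task_id="terminate_cluster", job_flow_id=create_cluster.output,
              trigger_rule="all_done")  # always clean up the cluster, even if the job failed
          create_cluster >> submit_job >> wait_for_job >> terminate_cluster
      ```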
  18. • Understand basic Spark performance tuning (see the sketch below) ◦ Understand the metrics
      in the Spark UI (more here) ◦ Understand the Spark defaults set by EMR (more here) ◦ A rule
      of thumb is to set spark.executor.cores = 5 ◦ Set spark.dynamicAllocation.enabled to true
      only if the spark.dynamicAllocation.initialExecutors/minExecutors/maxExecutors parameters
      are properly determined ◦ Set the right configuration (especially num_executors) for each
      job 2b) Tips on Using Spark EMR 26
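      A minimal sketch of pinning these settings explicitly for one job instead of relying on the
      EMR defaults; the numbers are purely illustrative and should be sized per job from the Spark
      UI and Job Metrics:

      ```python
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .appName("tuned-etl-job")
          .config("spark.executor.cores", "5")                 # rule of thumb from this slide
          .config("spark.executor.memory", "18g")              # illustrative value
          .config("spark.executor.instances", "4")             # num_executors, chosen per job
          .config("spark.dynamicAllocation.enabled", "false")  # keep off unless min/max are tuned
          .getOrCreate()
      )
      ```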
  19. • Amazon Athena is an interactive query service that makes

    it easy to analyze data in Amazon S3 using standard SQL. • Serverless ◦ No infrastructure to manage ◦ Pay only for the queries that you run AWS Athena Overview 27
  20. 2c) Tips on Using Athena 28 • In our case: ◦ Most transformations can be done using SQL
      ◦ Athena is really meant for interactive queries, but in our case it is still sufficient for
      analytics transformation jobs ◦ Our Data Analysts are very fluent in SQL • Create a custom
      Airflow operator for Data Analysts so that they can create a transformation task just by
      specifying the SQL script (see the sketch below) • Table dependencies are defined as Airflow
      DAG dependencies • Consider an alternative: getdbt.com
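      The custom Athena operator is likewise internal; a minimal sketch of the same idea with the
      stock AthenaOperator from the Amazon provider (named AWSAthenaOperator in older releases),
      where the database, SQL, and S3 locations are illustrative:

      ```python
      from datetime import datetime
      from airflow import DAG
      from airflow.providers.amazon.aws.operators.athena import AthenaOperator

      with DAG("athena_transformations", start_date=datetime(2020, 12, 1),
               schedule_interval="@daily", catchup=False) as dag:
          # In practice an analyst would only supply the SQL script; the operator
          # wiring below would be hidden behind the custom operator.
          build_daily_revenue = AthenaOperator(
              task_id="build_daily_revenue",
              database="analytics",                                    # illustrative database
              query="""
                  INSERT INTO analytics.daily_revenue
                  SELECT order_date, SUM(amount) AS revenue
                  FROM clean.orders
                  GROUP BY order_date
              """,
              output_location="s3://my-athena-query-results/airflow/",  # illustrative bucket
          )
      ```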
  21. • We use the Glue Data Catalogue and Glue Crawler • Run the crawler every time a group of
      jobs finishes ◦ Anticipates schema changes and new partitions • Create a custom Airflow
      operator to run the Glue Crawler • We insert last_refreshed_time information as a column
      description so that downstream users can be confident about the freshness of the data
      (see the sketch below) 3) Tips on Metadata Store 30
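      A minimal boto3 sketch of both ideas, triggering the crawler and stamping last_refreshed_time
      into column descriptions; the crawler, database, and table names are illustrative, and the
      custom Airflow operator presumably wraps calls like these:

      ```python
      from datetime import datetime, timezone
      import boto3

      glue = boto3.client("glue")  # credentials and region come from the environment

      def refresh_catalog(crawler_name: str, database: str, table: str) -> None:
          # 1) Re-crawl to pick up schema changes and new partitions
          #    (in practice, poll get_crawler() until the run finishes before stamping)
          glue.start_crawler(Name=crawler_name)

          # 2) Stamp freshness where downstream users will see it
          current = glue.get_table(DatabaseName=database, Name=table)["Table"]
          allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
                     "PartitionKeys", "TableType", "Parameters"}
          table_input = {k: v for k, v in current.items() if k in allowed}
          stamp = datetime.now(timezone.utc).isoformat()
          for col in table_input["StorageDescriptor"]["Columns"]:
              col["Comment"] = f"last_refreshed_time={stamp}"  # one possible convention
          glue.update_table(DatabaseName=database, TableInput=table_input)

      # refresh_catalog("sales-crawler", "clean", "orders")
      ```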
  22. Airflow Overview 32 Airflow is a platform to programmatically author,

    schedule and monitor workflows (https://airflow.apache.org/)
  23. • Airflow is very flexible, so we need to address coding standards and naming conventions
      as early as possible ◦ DAG names ◦ Task names ◦ Variable names ◦ etc. • Define logic in
      operators, not DAGs • Create custom operators for any purpose needed ◦ Data Engineers create
      the custom operators ◦ Data Analysts and Scientists create tasks using those operators
      • Keep the workload on the Airflow nodes minimal, only for orchestrating ◦ The actual work
      is done outside Airflow (Glue, EMR) • Utilize sensors and markers to define inter-DAG
      dependencies (see the sketch below) 4) Tips on Using Airflow 33
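      A minimal sketch of an inter-DAG dependency using the built-in ExternalTaskSensor (Airflow 2
      import paths); DAG and task names are illustrative, and both DAGs are assumed to share the
      same daily schedule:

      ```python
      from datetime import datetime
      from airflow import DAG
      from airflow.operators.dummy import DummyOperator
      from airflow.sensors.external_task import ExternalTaskSensor

      with DAG("reporting", start_date=datetime(2020, 12, 1),
               schedule_interval="@daily", catchup=False) as dag:
          wait_for_ingestion = ExternalTaskSensor(
              task_id="wait_for_ingestion",
              external_dag_id="ingestion",        # upstream DAG (illustrative)
              external_task_id="publish_orders",  # its final task (illustrative)
              mode="reschedule",                  # free the worker slot while waiting
          )
          build_reports = DummyOperator(task_id="build_reports")  # placeholder for the real work
          wait_for_ingestion >> build_reports
      ```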
  24. • Implement CI/CD as early as possible • Implement Infrastructure as Code (e.g., using
      CloudFormation) to provision resources ◦ EC2 instances ◦ EMR clusters ◦ Glue Jobs and
      Crawlers • Deploy to multiple environments (prod, staging, dev) • Ensure every aspect of the
      pipeline (DAGs, tables, resources) is defined as code and versioned in Git • Practice code
      review for every change 5) Tips on Ops 35
  25. • Create email alerts for job failures • Use logging to monitor SLAs (see the sketch below)
      ◦ Use Airflow's on_success_callback and on_failure_callback to send a log to CloudWatch
      every time a task finishes ◦ Regularly export the logs to S3 ◦ Analyze the logs to calculate
      SLA achievement • Consider alternatives: ◦ Send job failures to PagerDuty ◦ Send events to
      Prometheus for real-time monitoring 5) Tips on Ops 36
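      A minimal sketch of the callback idea: each finished task emits a small JSON event to
      CloudWatch Logs that can later be exported to S3 and analyzed for SLA reporting; the log
      group and stream names are illustrative and assumed to already exist:

      ```python
      import json
      import time
      import boto3

      logs = boto3.client("logs")  # credentials and region come from the environment

      def _emit(status: str, context: dict) -> None:
          ti = context["task_instance"]
          event = {
              "dag_id": ti.dag_id,
              "task_id": ti.task_id,
              "execution_date": context["ds"],
              "duration_sec": ti.duration,
              "status": status,
          }
          logs.put_log_events(
              logGroupName="/data-pipeline/task-metrics",   # illustrative names
              logStreamName="airflow",
              logEvents=[{"timestamp": int(time.time() * 1000),
                          "message": json.dumps(event)}],
          )

      # Wire these into default_args of every DAG:
      def task_success_alert(context):
          _emit("success", context)

      def task_failure_alert(context):
          _emit("failed", context)
      ```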
  26. • Enable remote logging in Airflow to export logs to S3 • Troubleshooting can be done by
      looking at the logs in the Airflow UI • Make your pipeline break when errors happen
      ◦ Prevents silent errors 5) Tips on Ops 38
  27. • The process and data are not reproducible ◦ Define everything as code ◦ Use a version
      control system ◦ Create idempotent and deterministic processes • Data errors are not
      self-exposed ◦ Create data tests/validation ◦ Break the pipeline run when errors happen
      • Being unaware of, or ignoring, technical debt ◦ Like financial debt, having technical debt
      can be good ◦ Make sure it is intentional and paid back in time Common Pitfalls 39
  28. Common Pitfalls 40 • Not leveraging automation ◦ Too many manual processes ◦ When you do a
      manual task for the second time, make sure you automate it before the third • Fire-fighting
      never ends ◦ Do proper root-cause analysis ◦ Plan preventive actions • Knowledge isolated in
      one or two people on the team ◦ Practice code review ◦ Write clean code