Tips and Tricks on Building Serverless Data Lake Pipeline

Aditya Satrya
December 16, 2020

My presentation at the AWS DevAx Connect sharing session. We shared lessons learned from building a data lake pipeline on top of AWS services, and how it helps the Data team at Mekari deliver business value faster.

Recording video: https://www.twitch.tv/videos/873990234

Transcript

  1. Tips and Tricks on Building Serverless Data Lake Pipeline Aditya

    Satrya Data Engineering Tech Lead at Mekari linkedin.com/in/asatrya 16 December 2020
  2. Introduction 2 • Aditya Satrya • Data Engineering Tech Lead

    at Mekari • Lives in Bandung • linkedin.com/in/asatrya/
  3. 3 Mekari is Indonesia's #1 Software-as-a-Service (SaaS) company. Our mission

    is to empower the progress of businesses and professionals. https://www.mekari.com
  4. Agenda 4 • Basic Concepts • Mekari Case Study •

    Lessons Learned • Common Pitfalls
  5. What is a data pipeline? 6 A series of steps or actions to move and combine data from
     various sources for analysis or visualization. Image source: https://martinfowler.com/articles/data-mesh-principles.html
     OLAP: • Analytical purposes • Deals with many records at a time, but few columns
     OLTP: • Operational purposes • Deals with one record at a time, but all columns
  6. Data Lake 9: • Centralized repository • Raw format • Schemaless • Any scale
     Data Warehouse: • Centralized repository • Transformed • Single consistent schema • Minimum data growth
  7. Challenges 12
     1. Data is scattered in many places (DBs, spreadsheets, 3rd-party applications)
     2. Different departments have different values for the same metric (many reconciliation meetings needed)
     3. Little knowledge about the data at the beginning
     4. High demand (in terms of quality and speed) for data/analysis from business users for decision making
     5. Need to build a foundation for data products (giving value to customers using analytics/AI)
     6. The existing solution (querying a replica database) is not adequate for heavy analysis
     These map to the solution criteria: single source of truth; fast data movement, speedy insights, democratizing data; support for new types of analytics (ML) & new sources/formats; performance.
  8. Criteria of Solution 13
     • Single source of truth: Data Lake ✓, Data Warehouse ✓
     • Fast data movement, speedy insights, democratizing data: Data Lake ✓, Data Warehouse ✗
     • Support new types of analytics (ML) & new sources/formats: Data Lake ✓, Data Warehouse ✗
     • Performance: Data Lake ✓, Data Warehouse ✓ (with high cost)
  9. • Separating the lake into zones creates a clear logical separation and supports quality
     assurance and access control ◦ Prevents the data lake from becoming a data swamp ◦ Ensures
     the quality of data received by downstream consumers while still retaining the ability to
     quickly and easily ingest new sources of data • Ingest and store data in an as-raw-as-possible
     format ◦ Prevents losing data because of process failures ◦ Enables reprocessing data when
     business rules change 1) Tips on Data Lake Organization 17
  10. • Address conventions as early as possible ◦ Zone names ◦ Storage paths (good references:
      this and this) ◦ Data types allowed ◦ Database and table names • Hide the underlying storage
      path from developers ◦ Minimizes human error ◦ Create a library/helper to resolve any
      location in the data lake (see the sketch below) 1) Tips on Data Lake Organization 18
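      A minimal sketch of such a helper, assuming a simple s3://<bucket>/<zone>/<database>/<table>/
      layout; the bucket name, zone names, and function name are hypothetical, not the actual
      Mekari convention:

      ```python
      # Hypothetical path helper: jobs call lake_path() instead of hard-coding S3 paths.
      DATA_LAKE_BUCKET = "my-data-lake-bucket"   # assumed bucket name
      ZONES = {"raw", "clean", "curated"}        # assumed zone names

      def lake_path(zone: str, database: str, table: str) -> str:
          """Return the canonical S3 location of a table in the data lake."""
          if zone not in ZONES:
              raise ValueError(f"Unknown zone '{zone}', expected one of {sorted(ZONES)}")
          return f"s3://{DATA_LAKE_BUCKET}/{zone}/{database}/{table}/"

      # Usage inside a job:
      # df.write.parquet(lake_path("raw", "sales", "orders"))
      ```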
  11. • Use the Write-Validate-Publish pattern (see the sketch below) ◦ Write data (to storage)
      between tasks ◦ Validate data after each step ◦ Only publish data when it passes validation
      ◦ Prevents publishing wrong data ◦ Define data tests in code ◦ Consider an alternative:
      greatexpectations.io 1) Tips on Data Lake Organization 19
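      A minimal sketch of the pattern in PySpark; the staging/published locations and the
      validation checks (non-empty data, no NULL ids) are illustrative, and greatexpectations.io
      would replace the validate() step with richer expectations:

      ```python
      from pyspark.sql import DataFrame, SparkSession, functions as F

      def validate(df: DataFrame) -> None:
          """Raise instead of publishing bad data; the checks here are illustrative."""
          if df.count() == 0:
              raise ValueError("Validation failed: dataset is empty")
          null_ids = df.filter(F.col("id").isNull()).count()
          if null_ids > 0:
              raise ValueError(f"Validation failed: {null_ids} rows with NULL id")

      def write_validate_publish(spark: SparkSession, df: DataFrame,
                                 staging_path: str, published_path: str) -> None:
          df.write.mode("overwrite").parquet(staging_path)        # 1) write between tasks
          staged = spark.read.parquet(staging_path)
          validate(staged)                                        # 2) validate what was written
          staged.write.mode("overwrite").parquet(published_path)  # 3) publish only on success
      ```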
  12. 2) Tips on Data Processing Engine (Glue, EMR, Athena) 20
      Ops: Bitbucket Pipelines, AWS CloudFormation, AWS CloudWatch
  13. Glue is a managed Spark ETL service with the following main high-level features:
      • Data Catalogue: a place to logically organise your data into databases and tables
      • Crawlers: populate the catalogue • Jobs: author ETL jobs in Python or Scala and execute
      them on a managed cluster • Workflows: orchestrate a series of jobs with triggers.
      AWS Glue Overview 21
  14. • Use Glue 2.0 ◦ Faster provisioning time ◦ Smaller billing time unit • Write code with
      portability in mind (see the sketch below) ◦ Minimize use of GlueContext and DynamicFrame
      ◦ Glue is great to start with, but as you grow you will need to move • Beware of waiting
      time inside the job run (under-utilized cluster) ◦ Slow queries caused by tables/queries
      that are not index-optimized ◦ S3 operations (write, read, copy) 2a) Tips on Using Glue 22
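      A minimal sketch of what portability means here: keep the transformation in plain PySpark so
      the same script can later run on EMR without changes; the table and path names are
      illustrative:

      ```python
      from pyspark.sql import SparkSession, DataFrame, functions as F

      def transform(df: DataFrame) -> DataFrame:
          """Plain Spark logic, no GlueContext or DynamicFrame, so it ports to EMR as-is."""
          return (df
                  .filter(F.col("status") == "active")
                  .withColumn("ingested_date", F.current_date()))

      if __name__ == "__main__":
          spark = SparkSession.builder.appName("portable-job").getOrCreate()
          source = spark.read.parquet("s3://my-data-lake-bucket/raw/sales/orders/")  # illustrative path
          (transform(source)
           .write.mode("overwrite")
           .parquet("s3://my-data-lake-bucket/clean/sales/orders/"))
      ```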
  15. • Optimize worker type and number of DPUs on a per-job basis ◦ Enable Job Metrics
      ◦ Understand the graphs and allocate just enough capacity for each job ◦ Good ref:
      Monitoring for DPU Capacity Planning • Glue Workflows are adequate only if you have small
      jobs ◦ IMO, jump directly to Airflow • Use a CD pipeline and automatic provisioning tools
      (e.g., CloudFormation) to create multiple environments (production, staging, development)
      → more on this later 2a) Tips on Using Glue 23
  16. Amazon EMR is a managed cluster platform that simplifies running

    big data frameworks, such as Apache Hadoop and Apache Spark. • Storage: ◦ S3 ◦ HDFS ◦ Local Disk • Cluster Resource Management: ◦ YARN • Data Processing Framework: ◦ MapReduce ◦ Spark AWS EMR Overview 24
  17. 2b) Tips on Using Spark EMR 25 • We moved from Glue to EMR as we managed more jobs and
      needed better cost-efficiency • Use Spot Instances for cost-efficiency • Create a custom
      Airflow operator to (see the sketch below): ◦ Provision the cluster ◦ Submit jobs
      ◦ Terminate the cluster
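      The custom operator itself is internal, so as an illustration here is a minimal sketch of the
      same provision, submit, and terminate flow built from the stock Amazon provider operators on
      Airflow 2 (import paths differ on older releases); the cluster configuration, Spark step, and
      names are illustrative, and IAM roles, logging, and networking are omitted:

      ```python
      from datetime import datetime
      from airflow import DAG
      from airflow.providers.amazon.aws.operators.emr import (
          EmrCreateJobFlowOperator,
          EmrAddStepsOperator,
          EmrTerminateJobFlowOperator,
      )
      from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

      JOB_FLOW_OVERRIDES = {
          "Name": "transient-etl-cluster",
          "ReleaseLabel": "emr-6.2.0",
          "Applications": [{"Name": "Spark"}],
          "Instances": {
              "InstanceGroups": [
                  {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                  {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2,
                   "Market": "SPOT"},  # Spot Instances for cost-efficiency
              ],
              "KeepJobFlowAliveWhenNoSteps": True,
          },
      }

      SPARK_STEP = [{
          "Name": "etl_job",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
              "Jar": "command-runner.jar",
              "Args": ["spark-submit", "s3://my-data-lake-bucket/jobs/etl_job.py"],
          },
      }]

      with DAG("emr_transient_cluster", start_date=datetime(2020, 12, 1),
               schedule_interval="@daily", catchup=False) as dag:
          create_cluster = EmrCreateJobFlowOperator(
              task_id="create_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES)
          submit_job = EmrAddStepsOperator(
              task_id="submit_job", job_flow_id=create_cluster.output, steps=SPARK_STEP)
          wait_for_job = EmrStepSensor(
              task_id="wait_for_job", job_flow_id=create_cluster.output,
              step_id="{{ task_instance.xcom_pull(task_ids='submit_job', key='return_value')[0] }}")
          terminate_cluster = EmrTerminateJobFlowOperator(
              task_id="terminate_cluster", job_flow_id=create_cluster.output,
              trigger_rule="all_done")  # always clean up the cluster, even if the job failed
          create_cluster >> submit_job >> wait_for_job >> terminate_cluster
      ```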
  18. • Understand basic Spark performance tuning (see the sketch below) ◦ Understand the metrics
      in the Spark UI (more here) ◦ Understand the Spark defaults set by EMR (more here) ◦ A rule
      of thumb is to set spark.executor.cores = 5 ◦ Set spark.dynamicAllocation.enabled to true
      only if the spark.dynamicAllocation.initialExecutors/minExecutors/maxExecutors parameters
      are properly determined ◦ Set the right configuration (especially num_executors) for each
      job 2b) Tips on Using Spark EMR 26
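      A minimal sketch of pinning these settings explicitly for one job instead of relying on the
      EMR defaults; the numbers are purely illustrative and should be sized per job from the Spark
      UI and Job Metrics:

      ```python
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .appName("tuned-etl-job")
          .config("spark.executor.cores", "5")                 # rule of thumb from this slide
          .config("spark.executor.memory", "18g")              # illustrative value
          .config("spark.executor.instances", "4")             # num_executors, chosen per job
          .config("spark.dynamicAllocation.enabled", "false")  # keep off unless min/max are tuned
          .getOrCreate()
      )
      ```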
  19. • Amazon Athena is an interactive query service that makes

    it easy to analyze data in Amazon S3 using standard SQL. • Serverless ◦ No infrastructure to manage ◦ Pay only for the queries that you run AWS Athena Overview 27
  20. 2c) Tips on Using Athena 28 • In our case: ◦ Most transformations can be done using SQL
      ◦ Athena is really meant for interactive queries, but in our case it is still sufficient for
      analytics transformation jobs ◦ Our Data Analysts are very fluent in SQL • Create a custom
      Airflow operator for Data Analysts so that they can create a transformation task just by
      specifying the SQL script (see the sketch below) • Table dependencies are defined as Airflow
      DAG dependencies • Consider an alternative: getdbt.com
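      The custom Athena operator is likewise internal; a minimal sketch of the same idea with the
      stock AthenaOperator from the Amazon provider (named AWSAthenaOperator in older releases),
      where the database, SQL, and S3 locations are illustrative:

      ```python
      from datetime import datetime
      from airflow import DAG
      from airflow.providers.amazon.aws.operators.athena import AthenaOperator

      with DAG("athena_transformations", start_date=datetime(2020, 12, 1),
               schedule_interval="@daily", catchup=False) as dag:
          # In practice an analyst would only supply the SQL script; the operator
          # wiring below would be hidden behind the custom operator.
          build_daily_revenue = AthenaOperator(
              task_id="build_daily_revenue",
              database="analytics",                                    # illustrative database
              query="""
                  INSERT INTO analytics.daily_revenue
                  SELECT order_date, SUM(amount) AS revenue
                  FROM clean.orders
                  GROUP BY order_date
              """,
              output_location="s3://my-athena-query-results/airflow/",  # illustrative bucket
          )
      ```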
  21. • We use the Glue Data Catalogue and Glue Crawler • Run the crawler every time a group of
      jobs finishes ◦ Anticipates schema changes and new partitions • Create a custom Airflow
      operator to run the Glue Crawler • We insert last_refreshed_time information as a column
      description so that downstream users can be confident about the freshness of the data
      (see the sketch below) 3) Tips on Metadata Store 30
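      A minimal boto3 sketch of both ideas, triggering the crawler and stamping last_refreshed_time
      into column descriptions; the crawler, database, and table names are illustrative, and the
      custom Airflow operator presumably wraps calls like these:

      ```python
      from datetime import datetime, timezone
      import boto3

      glue = boto3.client("glue")  # credentials and region come from the environment

      def refresh_catalog(crawler_name: str, database: str, table: str) -> None:
          # 1) Re-crawl to pick up schema changes and new partitions
          #    (in practice, poll get_crawler() until the run finishes before stamping)
          glue.start_crawler(Name=crawler_name)

          # 2) Stamp freshness where downstream users will see it
          current = glue.get_table(DatabaseName=database, Name=table)["Table"]
          allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
                     "PartitionKeys", "TableType", "Parameters"}
          table_input = {k: v for k, v in current.items() if k in allowed}
          stamp = datetime.now(timezone.utc).isoformat()
          for col in table_input["StorageDescriptor"]["Columns"]:
              col["Comment"] = f"last_refreshed_time={stamp}"  # one possible convention
          glue.update_table(DatabaseName=database, TableInput=table_input)

      # refresh_catalog("sales-crawler", "clean", "orders")
      ```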
  22. Airflow Overview 32 Airflow is a platform to programmatically author,

    schedule and monitor workflows (https://airflow.apache.org/)
  23. • Airflow is very flexible, so we need to address coding standards and naming conventions
      as early as possible ◦ DAG names ◦ Task names ◦ Variable names ◦ etc. • Define logic in
      operators, not DAGs • Create custom operators for any purpose needed ◦ Data Engineers create
      the custom operators ◦ Data Analysts and Scientists create tasks using those operators
      • Keep the workload on the Airflow nodes minimal, only for orchestrating ◦ The actual work
      is done outside Airflow (Glue, EMR) • Utilize sensors and markers to define inter-DAG
      dependencies (see the sketch below) 4) Tips on Using Airflow 33
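      A minimal sketch of an inter-DAG dependency using the built-in ExternalTaskSensor (Airflow 2
      import paths); DAG and task names are illustrative, and both DAGs are assumed to share the
      same daily schedule:

      ```python
      from datetime import datetime
      from airflow import DAG
      from airflow.operators.dummy import DummyOperator
      from airflow.sensors.external_task import ExternalTaskSensor

      with DAG("reporting", start_date=datetime(2020, 12, 1),
               schedule_interval="@daily", catchup=False) as dag:
          wait_for_ingestion = ExternalTaskSensor(
              task_id="wait_for_ingestion",
              external_dag_id="ingestion",        # upstream DAG (illustrative)
              external_task_id="publish_orders",  # its final task (illustrative)
              mode="reschedule",                  # free the worker slot while waiting
          )
          build_reports = DummyOperator(task_id="build_reports")  # placeholder for the real work
          wait_for_ingestion >> build_reports
      ```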
  24. • Implement CI/CD as early as possible • Implement Infrastructure as Code (e.g., using
      CloudFormation) to provision resources ◦ EC2 instances ◦ EMR clusters ◦ Glue Jobs and
      Crawlers • Deploy to multiple environments (prod, staging, dev) • Ensure every aspect of the
      pipeline (DAGs, tables, resources) is defined as code and versioned in Git • Practice code
      review for every change 5) Tips on Ops 35
  25. • Create email alerts for job failures • Use logging to monitor SLAs (see the sketch below)
      ◦ Use Airflow's on_success_callback and on_failure_callback to send a log to CloudWatch
      every time a task finishes ◦ Regularly export the logs to S3 ◦ Analyze the logs to calculate
      SLA achievement • Consider alternatives: ◦ Send job failures to PagerDuty ◦ Send events to
      Prometheus for real-time monitoring 5) Tips on Ops 36
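      A minimal sketch of the callback idea: each finished task emits a small JSON event to
      CloudWatch Logs that can later be exported to S3 and analyzed for SLA reporting; the log
      group and stream names are illustrative and assumed to already exist:

      ```python
      import json
      import time
      import boto3

      logs = boto3.client("logs")  # credentials and region come from the environment

      def _emit(status: str, context: dict) -> None:
          ti = context["task_instance"]
          event = {
              "dag_id": ti.dag_id,
              "task_id": ti.task_id,
              "execution_date": context["ds"],
              "duration_sec": ti.duration,
              "status": status,
          }
          logs.put_log_events(
              logGroupName="/data-pipeline/task-metrics",   # illustrative names
              logStreamName="airflow",
              logEvents=[{"timestamp": int(time.time() * 1000),
                          "message": json.dumps(event)}],
          )

      # Wire these into default_args of every DAG:
      def task_success_alert(context):
          _emit("success", context)

      def task_failure_alert(context):
          _emit("failed", context)
      ```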
  26. • Enable remote logging in Airflow to export logs to S3 • Troubleshooting can be done by
      looking at the logs in the Airflow UI • Make your pipeline break when errors happen
      ◦ Prevents silent errors 5) Tips on Ops 38
  27. • The process and data are not reproducible ◦ Define everything as code ◦ Use a version
      control system ◦ Create idempotent and deterministic processes • Data errors are not
      self-exposed ◦ Create data tests/validation ◦ Break the pipeline run when errors happen
      • Being unaware of, or ignoring, technical debt ◦ Like financial debt, having technical debt
      can be good ◦ Make sure it is intentional and paid back in time Common Pitfalls 39
  28. Common Pitfalls 40 • Not leveraging automation ◦ Too many manual processes ◦ When you do a
      manual task for the second time, make sure you automate it before the third • Fire-fighting
      never ends ◦ Do proper root-cause analysis ◦ Plan preventive actions • Knowledge isolated in
      one or two people on the team ◦ Practice code review ◦ Write clean code