Slide 1

Tips and Tricks on Building Serverless Data Lake Pipeline
Aditya Satrya, Data Engineering Tech Lead at Mekari
linkedin.com/in/asatrya
16 December 2020

Slide 2

Introduction

● Aditya Satrya
● Data Engineering Tech Lead at Mekari
● Lives in Bandung
● linkedin.com/in/asatrya/

Slide 3

Mekari is Indonesia's #1 Software-as-a-Service (SaaS) company. Our mission is to empower the progress of businesses and professionals. https://www.mekari.com

Slide 4

Agenda

● Basic Concepts
● Mekari Case Study
● Lessons Learned
● Common Pitfalls

Slide 5

Basic Concepts

Slide 6

What is a data pipeline?

A series of steps to move and combine data from various sources for analysis or visualization.

Image source: https://martinfowler.com/articles/data-mesh-principles.html

OLAP:
● Analytical purposes
● Deals with many records at a time, but few columns

OLTP:
● Operational purposes
● Deals with one record at a time, but all columns

Slide 7

Traditional Data Architecture

Image source: https://martinfowler.com/articles/data-mesh-principles.html

Characteristics:
● Schema-on-write
● ETL
● High-cost storage

Slide 8

Modern Data Architecture

Image source: https://martinfowler.com/articles/data-mesh-principles.html

Characteristics:
● Schema-on-read
● ELT
● Lower-cost storage
● Distributed computing

Slide 9

Data Lake vs Data Warehouse

Data Lake:
● Centralized repository
● Raw format
● Schemaless
● Any scale

Data Warehouse:
● Centralized repository
● Transformed
● Single consistent schema
● Minimum data growth

Slide 10

What is Serverless?

● No infra management
● Fully managed security
● Pay-per-usage

Slide 11

Mekari Case Study

Slide 12

Challenges

1. Data is scattered in many places (DBs, spreadsheets, 3rd-party applications)
2. Different departments have different values for the same metric (requiring many reconciliation meetings)
3. Little knowledge about the data at the beginning
4. High demand (in terms of quality and speed) for data/analysis from business users for decision making
5. Need to build a foundation for data products (giving value to customers using analytics/AI)
6. The existing solution (querying a replica database) is not adequate for heavy analysis

What the solution must provide:
● Single source of truth
● Fast data movement, speedy insights, democratizing data
● Support for new types of analytics (ML) & new sources/formats
● Performance

Slide 13

Criteria of Solution

Criteria                                                     Data Lake   Data Warehouse
Single source of truth                                       ✓           ✓
Fast data movement, speedy insights, democratizing data     ✓           ✗
Support new types of analytics (ML) & new sources/formats   ✓           ✗
Performance                                                  ✓           ✓ (with high cost)

Slide 14

Data Platform Architecture

Ops: Bitbucket Pipelines, AWS CloudFormation, AWS CloudWatch

Slide 15

Lessons Learned

Slide 16

1) Tips on Data Lake Organization

Slide 17

1) Tips on Data Lake Organization

● Separating the lake into zones creates a clear logical separation and supports quality assurance and access control
  ○ Prevents the data lake from becoming a data swamp
  ○ Ensures the quality of data received by downstream consumers while still retaining the ability to quickly and easily ingest new sources of data
● Ingest and store data in an as-raw-as-possible format
  ○ Prevents losing data because of process failures
  ○ Enables reprocessing data when business rules change

Slide 18

1) Tips on Data Lake Organization

● Address conventions as early as possible
  ○ Zone names
  ○ Storage path (good references: this and this)
  ○ Data types allowed
  ○ Database and table names
● Abstract the underlying storage path away from developers
  ○ Minimizes human error
  ○ Create a library/helper to resolve any location in the data lake (a sketch follows below)
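As a rough illustration of such a helper, here is a minimal sketch; the bucket name, zone names, and path convention are assumptions made for the example, not the actual layout used at Mekari.

# Minimal sketch of a data-lake path helper (hypothetical bucket and zone names).
DATA_LAKE_BUCKET = "s3://example-data-lake"   # assumption: a single lake bucket
ZONES = ("raw", "clean", "curated")           # assumption: three zones

def lake_path(zone: str, database: str, table: str, partition: str = "") -> str:
    """Build the canonical storage path for a table in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"Unknown zone: {zone}")
    path = f"{DATA_LAKE_BUCKET}/{zone}/{database}/{table}"
    return f"{path}/{partition}" if partition else path

# Example usage:
# lake_path("raw", "sales_db", "orders", "ds=2020-12-16")
# -> "s3://example-data-lake/raw/sales_db/orders/ds=2020-12-16"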

Slide 19

1) Tips on Data Lake Organization

● Use the Write-Validate-Publish pattern (a sketch follows below)
  ○ Write data (to storage) between tasks
  ○ Validate data after each step
  ○ Only publish data when it passes validation
  ○ Prevents publishing wrong data
  ○ Define data tests in the code
  ○ Consider an alternative: greatexpectations.io
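A minimal PySpark sketch of the pattern, assuming hypothetical staging and published locations and toy validation rules; a real pipeline would carry richer checks (or greatexpectations.io suites).

# Write to staging, validate, and only then promote to the published location.
from pyspark.sql import DataFrame, SparkSession

STAGING = "s3://example-data-lake/staging/orders/"    # hypothetical path
PUBLISHED = "s3://example-data-lake/clean/orders/"    # hypothetical path

def validate(df: DataFrame) -> None:
    """Fail loudly so the pipeline breaks instead of publishing bad data."""
    if df.count() == 0:
        raise ValueError("Validation failed: empty dataset")
    if df.filter(df["order_id"].isNull()).count() > 0:
        raise ValueError("Validation failed: null order_id found")

def write_validate_publish(df: DataFrame, spark: SparkSession) -> None:
    df.write.mode("overwrite").parquet(STAGING)        # 1) write
    staged = spark.read.parquet(STAGING)
    validate(staged)                                   # 2) validate
    staged.write.mode("overwrite").parquet(PUBLISHED)  # 3) publish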

Slide 20

2) Tips on Data Processing Engine (Glue, EMR, Athena)

Slide 21

AWS Glue Overview

Glue is a managed Spark ETL service and includes the following high-level features:
● Data catalogue: a place to logically organise your data into Databases and Tables
● Crawlers: to populate the catalogue
● Jobs: the ability to author ETL jobs in Python or Scala and execute them on a managed cluster (a skeleton follows below)
● Workflows: to orchestrate a series of Jobs with triggers
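For orientation, a Glue job authored in Python follows the standard skeleton below; the catalog database, table, and output path are placeholders.

# Skeleton of a Glue ETL job in Python (PySpark). Names and paths are placeholders.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Data Catalog (populated by a crawler).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)
df = dyf.toDF()  # switch to a plain Spark DataFrame for transformations

df.write.mode("overwrite").parquet("s3://example-data-lake/clean/example_table/")
job.commit()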

Slide 22

2a) Tips on Using Glue

● Use Glue 2.0
  ○ Faster provisioning time
  ○ Smaller billing time unit
● Write code with portability in mind (see the sketch below)
  ○ Minimize use of GlueContext and DynamicFrame
  ○ Glue is great to start with, but as you grow you will need to move
● Beware of waiting time inside the job run (an under-utilized cluster)
  ○ Slow queries caused by tables/queries that are not index-optimized
  ○ S3 operations (write, read, copy)
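A sketch of what portability means in practice: keep the transformation logic in plain PySpark so the same function runs unchanged on Glue or on EMR. The paths, column names, and function name are illustrative.

# Pure PySpark logic: no GlueContext or DynamicFrame involved, so the same
# function can be called from a thin Glue wrapper today and from a
# spark-submit entry point on EMR later.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def transform_orders(df: DataFrame) -> DataFrame:
    """Business logic expressed only against the DataFrame API."""
    return (
        df.filter(F.col("status") == "paid")
          .withColumn("order_date", F.to_date("created_at"))
    )

if __name__ == "__main__":
    spark = SparkSession.builder.appName("orders_transform").getOrCreate()
    orders = spark.read.parquet("s3://example-data-lake/raw/orders/")
    transform_orders(orders).write.mode("overwrite").parquet(
        "s3://example-data-lake/clean/orders/"
    )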

Slide 23

2a) Tips on Using Glue

● Optimize the worker type and number of DPUs on a per-job basis
  ○ Enable Job Metrics
  ○ Understand the graphs and allocate just enough capacity for each job
  ○ Good reference: Monitoring for DPU Capacity Planning
● Glue Workflows are adequate only if you have small jobs
  ○ IMO, jump directly to Airflow
● Use a CD pipeline and automatic provisioning tools (e.g., CloudFormation) to create multiple environments (production, staging, development) → more on this later

Slide 24

AWS EMR Overview

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark.

● Storage:
  ○ S3
  ○ HDFS
  ○ Local Disk
● Cluster Resource Management:
  ○ YARN
● Data Processing Framework:
  ○ MapReduce
  ○ Spark

Slide 25

2b) Tips on Using Spark on EMR

● We moved from Glue to EMR as we managed more jobs and needed better cost-efficiency
● Use Spot Instances for cost-efficiency
● Create a custom Airflow operator (a sketch follows below) to:
  ○ Provision the cluster
  ○ Submit jobs
  ○ Terminate the cluster
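The slides do not show the operator itself; as a rough sketch, assuming boto3 is used directly, the provision-submit-terminate flow could look like this. The cluster configuration, step arguments, polling, and error handling are simplified placeholders.

# Hedged sketch of a transient-EMR operator: create a cluster, run one
# spark-submit step, wait for it, then terminate the cluster.
import time

import boto3
from airflow.models import BaseOperator

class TransientEmrSparkOperator(BaseOperator):
    """Run one spark-submit step on a short-lived EMR cluster."""

    def __init__(self, job_flow_overrides: dict, spark_args: list, **kwargs):
        super().__init__(**kwargs)
        # Full run_job_flow request: name, release label, Spot instance fleets, etc.
        self.job_flow_overrides = job_flow_overrides
        # e.g. ["spark-submit", "--deploy-mode", "cluster", "s3://example/jobs/job.py"]
        self.spark_args = spark_args

    def execute(self, context):
        emr = boto3.client("emr")
        cluster_id = emr.run_job_flow(**self.job_flow_overrides)["JobFlowId"]  # 1) provision
        try:
            step_id = emr.add_job_flow_steps(                                  # 2) submit
                JobFlowId=cluster_id,
                Steps=[{
                    "Name": "spark-job",
                    "ActionOnFailure": "TERMINATE_CLUSTER",
                    "HadoopJarStep": {"Jar": "command-runner.jar",
                                      "Args": self.spark_args},
                }],
            )["StepIds"][0]
            while True:                                                        # poll the step
                state = emr.describe_step(
                    ClusterId=cluster_id, StepId=step_id)["Step"]["Status"]["State"]
                if state in ("COMPLETED", "FAILED", "CANCELLED", "INTERRUPTED"):
                    break
                time.sleep(60)
            if state != "COMPLETED":
                raise RuntimeError(f"EMR step ended in state {state}")
        finally:
            emr.terminate_job_flows(JobFlowIds=[cluster_id])                   # 3) terminate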

Slide 26

2b) Tips on Using Spark on EMR

● Understand basic Spark performance tuning (a worked sizing example follows below)
  ○ Understand the metrics in the Spark UI (more here)
  ○ Understand the Spark defaults set by EMR (more here)
  ○ A rule of thumb is to set spark.executor.cores = 5
  ○ Set spark.dynamicAllocation.enabled to true only if the spark.dynamicAllocation.initialExecutors/minExecutors/maxExecutors parameters are properly determined
  ○ Set the right configuration (especially num_executors) for each job
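As a worked example of the rule of thumb, assume a hypothetical cluster of 4 m5.4xlarge core nodes (16 vCPUs, 64 GiB each); the numbers are illustrative and should always be checked against the EMR defaults and the Spark UI for your own jobs.

# Executor sizing derived from the "5 cores per executor" rule of thumb.
NODES = 4
VCPUS_PER_NODE = 16
MEM_GIB_PER_NODE = 64

EXECUTOR_CORES = 5                                              # rule of thumb
EXECUTORS_PER_NODE = (VCPUS_PER_NODE - 1) // EXECUTOR_CORES     # 1 vCPU left for YARN/OS -> 3
NUM_EXECUTORS = NODES * EXECUTORS_PER_NODE - 1                  # one slot reserved for the driver -> 11
EXECUTOR_MEM_GIB = int(MEM_GIB_PER_NODE / EXECUTORS_PER_NODE * 0.9)  # ~10% kept for memoryOverhead -> 19

SPARK_CONF = {
    "spark.executor.cores": str(EXECUTOR_CORES),
    "spark.executor.memory": f"{EXECUTOR_MEM_GIB}g",
    "spark.executor.instances": str(NUM_EXECUTORS),
    # With explicit executor instances, either disable dynamic allocation or
    # set initialExecutors/minExecutors/maxExecutors deliberately instead.
    "spark.dynamicAllocation.enabled": "false",
}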

Slide 27

AWS Athena Overview

● Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
● Serverless
  ○ No infrastructure to manage
  ○ Pay only for the queries that you run

Slide 28

2c) Tips on Using Athena

● In our case:
  ○ Most transformations can be done using SQL
  ○ Athena is meant for interactive queries, but in our case it is still sufficient for analytics transformation jobs
  ○ Our Data Analysts are really fluent in SQL
● Create a custom Airflow operator for Data Analysts so that they can create a transformation task by specifying only the SQL script (a sketch follows below)
● Table dependencies are defined as Airflow DAG dependencies
● Consider an alternative: getdbt.com
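A hedged sketch of what such an operator could look like with boto3; the results bucket is a placeholder, and a production version would add SQL templating, retries, and smarter polling.

# Run a SQL script on Athena and fail the Airflow task if the query fails.
import time

import boto3
from airflow.models import BaseOperator

class AthenaSqlOperator(BaseOperator):
    """Execute one Athena query and wait for it to finish."""

    template_fields = ("sql",)  # allow Airflow macros inside the SQL

    def __init__(self, sql: str, database: str,
                 output_location: str = "s3://example-athena-results/", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.database = database
        self.output_location = output_location

    def execute(self, context):
        athena = boto3.client("athena")
        qid = athena.start_query_execution(
            QueryString=self.sql,
            QueryExecutionContext={"Database": self.database},
            ResultConfiguration={"OutputLocation": self.output_location},
        )["QueryExecutionId"]
        while True:
            state = athena.get_query_execution(
                QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(10)
        if state != "SUCCEEDED":
            raise RuntimeError(f"Athena query {qid} ended in state {state}")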

Slide 29

3) Tips on Metadata Store

Slide 30

3) Tips on Metadata Store

● We use the Glue Data Catalogue and Glue Crawler
● Run the crawler every time a group of jobs finishes
  ○ Anticipates schema changes and new partitions
● Create a custom Airflow operator to run the Glue Crawler (a sketch follows below)
● We insert last_refreshed_time information as a column description so that downstream users can be confident about the freshness of the data
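A hedged sketch of such a crawler operator using boto3; the crawler name comes from the DAG and the polling interval is arbitrary.

# Start a Glue Crawler after a group of jobs finishes and wait for it to return
# to the READY state, so new partitions and schema changes are picked up.
import time

import boto3
from airflow.models import BaseOperator

class GlueCrawlerOperator(BaseOperator):
    """Trigger a Glue Crawler and block until it has finished."""

    def __init__(self, crawler_name: str, poll_interval: int = 30, **kwargs):
        super().__init__(**kwargs)
        self.crawler_name = crawler_name
        self.poll_interval = poll_interval

    def execute(self, context):
        glue = boto3.client("glue")
        glue.start_crawler(Name=self.crawler_name)
        while True:
            state = glue.get_crawler(Name=self.crawler_name)["Crawler"]["State"]
            if state == "READY":  # RUNNING -> STOPPING -> READY when done
                break
            time.sleep(self.poll_interval)
        # The last_refreshed_time column description mentioned above could then
        # be written via glue.update_table(); omitted here for brevity.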

Slide 31

4) Tips on Using Airflow

Slide 32

Airflow Overview

Airflow is a platform to programmatically author, schedule and monitor workflows (https://airflow.apache.org/).

Slide 33

4) Tips on Using Airflow

● Airflow is very flexible, so address coding standards and naming conventions as early as possible
  ○ DAG names
  ○ Task names
  ○ Variable names
  ○ etc.
● Define logic in operators, not DAGs
● Create custom operators for any purpose needed
  ○ Data Engineers create the custom operators
  ○ Data Analysts and Scientists create tasks using those operators
● Keep the workload on the Airflow nodes minimal; use them only for orchestrating
  ○ The actual work is done outside Airflow (Glue, EMR)
● Utilize Sensors and Markers to define inter-DAG dependencies (a sketch follows below)
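A minimal sketch of an inter-DAG dependency expressed with a sensor: the transformation DAG waits for the ingestion DAG's final task before it starts. DAG and task names are placeholders, and the import paths shown are the Airflow 1.10.x ones.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="transform_orders",
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",
) as dag:
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion",
        external_dag_id="ingest_orders",        # hypothetical upstream DAG
        external_task_id="publish_raw_orders",  # its final/marker task
        timeout=60 * 60,
        poke_interval=5 * 60,
        mode="reschedule",  # free the worker slot while waiting
    )
    run_transformation = DummyOperator(task_id="run_transformation")  # placeholder task

    wait_for_ingestion >> run_transformation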

Slide 34

5) Tips on Ops

Slide 35

5) Tips on Ops

● Implement CI/CD as early as possible
● Implement Infrastructure as Code (e.g., using CloudFormation) to provision resources
  ○ EC2 instances
  ○ EMR clusters
  ○ Glue Jobs and Crawlers
● Deploy to multiple environments (prod, staging, dev)
● Ensure every aspect of the pipeline (DAGs, tables, resources) is defined as code and versioned in Git
● Practice code review for every change

Slide 36

5) Tips on Ops

● Create email alerts for job failures
● Use logging to monitor SLA (a sketch follows below)
  ○ Use Airflow's on_success_callback and on_failure_callback to send a log record to CloudWatch every time a task finishes
  ○ Regularly export the logs to S3
  ○ Analyze the logs to calculate SLA achievement
● Consider alternatives:
  ○ Send job failures to PagerDuty
  ○ Send events to Prometheus for real-time monitoring
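A hedged sketch of the callback approach: every finished task writes one structured record to CloudWatch Logs, which can later be exported to S3 and analyzed. The log group name is a placeholder and must already exist; writing to a fresh log stream per task run keeps the example free of sequence-token handling.

import json
import time

import boto3

LOG_GROUP = "/data-pipeline/task-results"  # hypothetical, pre-created log group

def _log_task_result(context, status: str) -> None:
    ti = context["task_instance"]
    record = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "execution_date": str(context["execution_date"]),
        "duration_seconds": ti.duration,
        "status": status,
    }
    logs = boto3.client("logs")
    stream = f"{ti.dag_id}.{ti.task_id}.{int(time.time())}"
    logs.create_log_stream(logGroupName=LOG_GROUP, logStreamName=stream)
    logs.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=stream,
        logEvents=[{"timestamp": int(time.time() * 1000),
                    "message": json.dumps(record)}],
    )

def on_success_callback(context):
    _log_task_result(context, "success")

def on_failure_callback(context):
    _log_task_result(context, "failed")

# Usage: pass these via default_args or per task, e.g.
# default_args = {"on_success_callback": on_success_callback,
#                 "on_failure_callback": on_failure_callback}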

Slide 37

Common Pitfalls

Slide 38

5) Tips on Ops

● Enable remote logging in Airflow to export logs to S3
● Troubleshooting can be done by looking at the logs in the Airflow UI
● Make your pipeline break when errors happen
  ○ Prevents silent errors

Slide 39

Common Pitfalls

● The process and data are not reproducible
  ○ Define everything as code
  ○ Use a version control system
  ○ Create idempotent and deterministic processes (a sketch follows below)
● Data errors are not self-exposed
  ○ Create data tests/validation
  ○ Break the pipeline run when an error happens
● Not being aware of, or ignoring, technical debt
  ○ Like financial debt, taking on technical debt can be good
  ○ Make sure it is intentional and paid back in time
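For the idempotency point above, a minimal PySpark sketch under the assumption of Spark 2.3+ dynamic partition overwrite: each run replaces exactly the partition(s) it produces, so re-running the same day yields the same result. The path and partition column are illustrative.

from pyspark.sql import DataFrame, SparkSession

spark = (
    SparkSession.builder
    # Only the partitions present in the incoming DataFrame are replaced.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

def load_daily_partition(df: DataFrame, target_path: str) -> None:
    """Safe to re-run: the same input always produces the same output partition."""
    (df.write
       .mode("overwrite")
       .partitionBy("ds")   # e.g. ds=2020-12-16
       .parquet(target_path))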

Slide 40

Common Pitfalls

● Not leveraging automation
  ○ Too much manual process
  ○ When you do a manual task for the second time, make sure you automate it before the third
● Fire-fighting never ends
  ○ Do proper root-cause analysis
  ○ Plan preventive actions
● Knowledge isolated in one or two people on the team
  ○ Practice code review
  ○ Write clean code

Slide 41

Thank you!

Slide 42

mekari.com/careers