Data Lake Implementation in Traveloka

Slide 1

Slide 1 text

Data Lake Implementation on Traveloka
Andi N. Dirgantara
Lead Data Engineer

Slide 2

Slide 2 text

Speaker Profile
● I’m Andi Nugroho Dirgantara
● 5+ years as a software engineer
● 3+ years as a data engineer (big data)
● Lead Data Engineer, Traveloka
● Lead, FB DevC Malang
● Big Data and JavaScript lover
● Father of a 3+ year old son
● Gamer
  ○ Steam Account: hellowin_cavemen
  ○ Battle Tag: Hellowin#11826

Slide 3

Slide 3 text

How we use our data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ads optimization
● Cross selling
● A/B Test
● etc.

Slide 4

Slide 4 text

Problems

Overly simplified data architecture on Traveloka:
● Product Side
  ○ Client (Web, Android, etc.)
  ○ Backend
  ○ Database
● Data Side
  ○ Big Data Platform (?)
  ○ Data Processing (Analytics, Machine Learning, etc.)

How to accommodate the following without disrupting the production side?
● Data Scientists
● Data Analysts
● Business Intelligence Tools

It should be:
● Scalable
● Query-able
● Fault tolerant (reliable)

Slide 5

Slide 5 text

Solutions exist, but ...
source: mattturck.com/bigdata2017

Slide 6

Slide 6 text

We need a Data Lake. But what is it?

Slide 7

Slide 7 text

Data Lake by Definitions
● A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. - http://searchaws.techtarget.com
● A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. - Tamara Dull, (SAS), https://www.kdnuggets.com
● It stores the data in its native/raw format
● The schema is applied at query time
● Sometimes it’s also just a “marketing label” that makes it simpler to refer to technology compatible with Hadoop, just like the term “big data” is used for distributed storage and query engines
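To make "the schema is applied at query time" concrete, here is a minimal sketch in plain Python. It is not Traveloka's implementation: the sample events, the `BOOKING_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names invented for illustration. The raw records are stored untouched, and a schema is only imposed when a query reads them.

```python
import json

# Raw events are kept exactly as they arrived (hypothetical sample data).
# Records may have missing or extra fields; nothing is enforced at write time.
RAW_EVENTS = [
    '{"user_id": 1, "event": "search", "price": "120.5"}',
    '{"user_id": 2, "event": "booking", "price": "300", "currency": "IDR"}',
    '{"user_id": 3, "event": "search"}',
]

# The schema belongs to the query, not to the storage layer (assumed schema).
BOOKING_SCHEMA = {"user_id": int, "event": str, "price": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project them onto a schema at read time."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None; fields outside the schema are ignored.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, BOOKING_SCHEMA))
total_price = sum(r["price"] for r in rows if r["price"] is not None)
```

A different consumer could read the same `RAW_EVENTS` with another schema (for example, one that includes `currency`) without rewriting the stored data, which is the flexibility the definitions above describe.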

Slide 8

Slide 8 text

Data Lake implementation on Data Team Side

● Input: Backend / Data Source
  ○ Stream Processing (Kafka, PubSub, etc.)
  ○ DBs
  ○ Data Warehouse
  ○ etc.
● Big Data Platform: Hive (S3), Presto, BigQuery
● Output: Data Processing
  ○ Analytics
  ○ Machine Learning
  ○ etc.

Hive + Presto
● Deployed on Amazon Web Services (AWS)
● Self-hosted and self-managed
● Hadoop family

BigQuery
● Deployed on Google Cloud Platform (GCP)
● Managed service
● GCP family
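The slide does not show how the Hive tables over S3 are defined. As a rough sketch, a Hive external table lets the data files stay in S3 while Hive (and Presto, through its Hive connector) query them in place. The helper below only builds the DDL text; the table name, columns, bucket path, and the choice of Parquet are assumptions made for illustration, not details from the talk.

```python
def hive_external_table_ddl(table, columns, location, partition_columns=None):
    """Build a Hive CREATE EXTERNAL TABLE statement as a string.

    `columns` and `partition_columns` are lists of (name, hive_type) pairs.
    The storage format (Parquet) is an assumption for this sketch.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partition_columns:
        parts = ", ".join(
            f"`{name}` {hive_type}" for name, hive_type in partition_columns
        )
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS PARQUET\nLOCATION '{location}'"
    return ddl


# Hypothetical table and bucket names, used only to show the output shape.
ddl = hive_external_table_ddl(
    table="datalake.search_events",
    columns=[("user_id", "BIGINT"), ("event", "STRING"), ("price", "DOUBLE")],
    location="s3://example-datalake-bucket/search_events/",
    partition_columns=[("dt", "STRING")],
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata; the files under the S3 location are left in place.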

Slide 9

Slide 9 text

Hive + Presto Pros and Cons

Pros
● More flexible to manage (self-managed)
  ○ Able to define nodes, replication factor, cluster, etc.
  ○ Able to specify node specs.
● Good integration with the rest of the Hadoop ecosystem
  ○ Spark
  ○ Kafka
  ○ Impala
● More mature
● Open source

Cons
● Harder to maintain (also because it is self-managed)

Slide 10

Slide 10 text

BigQuery Pros and Cons

Pros
● Easier to maintain (managed by GCP)
● Good integration with other GCP managed tools
  ○ Dataflow
  ○ PubSub
  ○ Cloud Storage
● Enterprise ready, support is 24/7

Cons
● Less mature compared to the Hadoop ecosystem
● API is still limited (no Scala API support)
● Unable to store data on S3; it needs to be on Cloud Storage
● Closed source

Slide 11

Slide 11 text

Conclusions

Slide 12

Slide 12 text

Conclusions
● We still use AWS and GCP side by side
● Maintainability is one thing, but in industry, value is everything
● The Big Data stack is moving very fast
● It’s the Data Engineer’s responsibility to make migration agile
● There’s no “one size fits all” solution

Slide 13

Slide 13 text

References and Other Presentations
● How Big Data Platform Handle big Things (https://speakerdeck.com/hellowin/how-big-data-platform-handle-big-things)
● How to Improve Data Warehouse Efficiency using S3 over HDFS on Hive (https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c)