Data Lake Implementation in Traveloka

Andi N. Dirgantara

January 23, 2018

Transcript

  1. Speaker Profile
     • I’m Andi Nugroho Dirgantara
     • 5+ years as a software engineer
     • 3+ years as a data engineer (big data)
     • Lead Data Engineer, Traveloka
     • Lead, FB DevC Malang
     • Big Data and JavaScript lover
     • Father of a 3+ year old son
     • Gamer
       ◦ Steam Account: hellowin_cavemen
       ◦ Battle Tag: Hellowin#11826
  2. How we use our data
     • Business Intelligence
     • Analytics
     • Personalization
     • Fraud Detection
     • Ads optimization
     • Cross selling
     • AB Test
     • etc.
  3. Problems
     Overly simplified data architecture on Traveloka:
     • Product Side: Client (Web, Android, etc.), Backend, Database
     • Data Side: Big Data Platform (?), Data Processing (Analytics, Machine Learning, etc.)
     How to accommodate the following without disrupting the production side?
     • Data Scientists
     • Data Analysts
     • Business Intelligence Tools
     It should be:
     • Scalable
     • Query-able
     • Fault tolerant (reliable)
  4. Data Lake by Definitions
     • A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. - http://searchaws.techtarget.com
     • A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. - Tamara Dull (SAS), https://www.kdnuggets.com
     • It stores the data in its native/raw format
     • The schema is applied at query time
     • Sometimes it’s also just a “marketing label” that simplifies how people refer to technology compatible with Hadoop, just as “big data” is used as a term for distributed storage and query engines
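The "schema applied at query time" idea can be illustrated with a minimal sketch. It is not taken from the deck: the sample events, the field names, and the `read_with_schema` helper are all hypothetical, and plain Python stands in for an engine such as Hive or Presto. The point is only that the raw records are stored untouched and a schema is chosen by the reader.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user_id": "42", "event": "search", "price": "120.5"}',
    '{"user_id": "7", "event": "booking", "price": "300", "extra": {"ab_group": "B"}}',
    '{"user_id": "13", "event": "search"}',
]

# The schema belongs to the query, not to the storage (hypothetical fields).
EVENT_SCHEMA = {"user_id": int, "event": str, "price": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project/cast them to `schema` at read time."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields missing from a raw record become None instead of failing.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, EVENT_SCHEMA))
total_price = sum(r["price"] for r in rows if r["price"] is not None)
print(total_price)
```

A different consumer could read the same raw lines with another schema (for example, one that also extracts `extra`) without rewriting the stored data.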
  5. Data Lake implementation on Data Team Side
     • Input: Backend, Data Source (Stream Processing (Kafka, PubSub, etc.), DBs, Data Warehouse, etc.)
     • Big Data Platform (?): Hive (S3), Presto, BigQuery
     • Output: Data Processing (Analytics, Machine Learning, etc.)
     Hive + Presto
     • Deployed on Amazon Web Services (AWS)
     • Self hosted and self managed
     • Hadoop family
     BigQuery
     • Deployed on Google Cloud Platform (GCP)
     • Managed service
     • GCP family
  6. Hive + Presto Pros and Cons
     Pros
     • More flexible to manage (self managed)
       ◦ Able to define nodes, replication factor, cluster, etc.
       ◦ Able to specify node specs
     • Good integration with the rest of the Hadoop ecosystem
       ◦ Spark
       ◦ Kafka
       ◦ Impala
     • More mature
     • Open source
     Cons
     • Harder to maintain (also because it is self managed)
  7. BigQuery
     Pros
     • Easier to maintain (managed by GCP)
     • Good integration with other GCP managed tools
       ◦ Dataflow
       ◦ PubSub
       ◦ Cloud Storage
     • Enterprise ready, with 24/7 support
     Cons
     • Less mature compared to the Hadoop ecosystem
     • API is still limited (no Scala API support)
     • Unable to store data on S3; data needs to be on Cloud Storage
     • Closed source
  8. Conclusions
     • We still use AWS and GCP side by side
     • Maintainability is one thing, but in industry its value is everything
     • The Big Data stack is moving so fast
     • It’s the Data Engineer’s responsibility to make the migration agile
     • There’s no “one size fits all” solution
  9. References and Other Presentations
     • How Big Data Platform Handle big Things (https://speakerdeck.com/hellowin/how-big-data-platform-handle-big-things)
     • How to Improve Data Warehouse Efficiency using S3 over HDFS on Hive (https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c)