Slide 1

How Big Data Works in Sale Stock
Andi N. Dirgantara
Data Engineer @ Sale Stock

Slide 2

Why?

Slide 3

Problems
▸ 1 table/day = 40+ million rows
▸ 1 table/day = 4+ GB (Parquet + gzipped)*
▸ Some queries take 3+ hours**
* 10+ GB as raw CSV
** without partitioning, before implementing a well-structured big data framework

Slide 4

Let’s do some simple math...
▸ 1 month = 1,200,000,000 rows
▸ 1 month = 120+ GB
That is for a single table. We have hundreds of tables.
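
The jump from the daily figures to these monthly ones is plain multiplication; a minimal sketch in Python, assuming a 30-day month:

```python
# Back-of-the-envelope growth for a single table, assuming a 30-day month
# and the per-day figures from the previous slide.
ROWS_PER_DAY = 40_000_000   # 40+ million rows/day
GB_PER_DAY = 4              # 4+ GB/day (Parquet + gzip)
DAYS_PER_MONTH = 30

rows_per_month = ROWS_PER_DAY * DAYS_PER_MONTH  # 1,200,000,000 rows
gb_per_month = GB_PER_DAY * DAYS_PER_MONTH      # 120 GB

print(f"{rows_per_month:,} rows/month, {gb_per_month} GB/month per table")
```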

Slide 5

Example: row count on a single table

Slide 6

Expectations
▸ Query-able: 3+ hours for a single query doesn’t make any sense
▸ Integrated: can be joined with other tables
▸ Scalable: performs well regardless of how much data there is
We need a FAST and SCALABLE data pool.

Slide 7

What?

Slide 8

Big Data Tools and Platform Choices

Slide 9

What do we use?

Slide 10

Data Sources Pipeline
Diagram: microservices on the Kubernetes cluster publish to Kafka, the RDBMS (SQL) is ingested via Sqoop, and Spark processes data on its way into the Data Pool.
Examples of microservices: Order, Cart, Chat, Product, etc.
● Kafka: provides the message queue and an event-driven data pipeline.
● Spark: used when data needs to be calculated before being passed to the next stage of the pipeline, either batch or stream.
● Sqoop: provides ETL from SQL into the Hadoop ecosystem.
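
As a rough illustration of the Kafka-to-Spark leg of this pipeline, a minimal PySpark Structured Streaming job could look like the sketch below; the broker address, topic name, and storage paths are hypothetical, not Sale Stock’s actual configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch: consume microservice events from Kafka and land them in
# the data pool as Parquet. Requires the spark-sql-kafka connector on the
# classpath; broker, topic, and paths below are hypothetical.
spark = (SparkSession.builder
         .appName("orders-to-data-pool")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "orders")  # e.g. one topic per microservice
          .load()
          .selectExpr("CAST(value AS STRING) AS json", "timestamp"))

# Write the raw stream into the pool; downstream jobs can parse/aggregate it.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://data-pool/raw/orders")
         .option("checkpointLocation", "s3a://data-pool/checkpoints/orders")
         .start())

query.awaitTermination()
```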

Slide 11

Data Pool Usage
Diagram: the Data Pool feeds artificial intelligence, raw queries, business analysts, batch processing, etc., alongside the microservices on the Kubernetes cluster and the RDBMS (SQL).
● Any data processor can consume data from the pool to provide enhanced data.
● Some of it is given back to the pool; the rest is provided back to the microservice cluster.
Example usage:
● Data visualization
● Recommendation engine
● Fraud detection system
● etc.
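
A hedged sketch of one such consumer, a batch Spark job that reads raw events from the pool, aggregates them, and writes the enhanced data back; the paths and column names are illustrative assumptions, not the real schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative batch consumer: read raw order events from the pool,
# aggregate daily totals, and write the result back for BI tools.
spark = SparkSession.builder.appName("daily-order-totals").getOrCreate()

orders = spark.read.parquet("s3a://data-pool/raw/orders")

daily_totals = (orders
                .groupBy(F.to_date("timestamp").alias("dt"))
                .agg(F.count("*").alias("order_count")))

# Partition the enhanced output by date so later queries can prune.
(daily_totals.write
 .mode("overwrite")
 .partitionBy("dt")
 .parquet("s3a://data-pool/enhanced/daily_order_totals"))
```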

Slide 12

Data Pool (Big Data Cluster)
● HDFS for the file system
● YARN for the resource manager
● Cloud persistent storage, without replication*
● GUI for the Hadoop ecosystem
● Data warehouse
● Distributed data processor
● Columnar storage format
*https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c
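
One way these pieces fit together, sketched through PySpark’s Hive interface: a warehouse table defined over cloud object storage in a columnar format. The database, table, and bucket names are hypothetical.

```python
from pyspark.sql import SparkSession

# Sketch: define a warehouse table over cloud object storage (avoiding
# HDFS replication) using the columnar Parquet format. Names below are
# hypothetical examples.
spark = (SparkSession.builder
         .appName("define-warehouse-table")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS pool")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS pool.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3a://data-pool/warehouse/orders'
""")
```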

Slide 13

Profit!
Exactly the same data source, the same query, and the same partitioning: Hive took 26.60 seconds, while Impala took 0.23 seconds.

Slide 14

How?

Slide 15

Basic Approach
Distributed over Centralized: a scalable approach has to be built on a distributed system. Technically, “big data” refers to a distributed platform.
● Partition: provides scalable indexing within a distributed system. https://blog.andi.dirgantara.co/konsep-partisi-fitur-dasar-big-data-5b531a777fdd
● Replication Factor (Availability): provides fault tolerance / high availability in a distributed system.
● Coordination (Consistency): provides consistency for both writes and reads in a distributed system.
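
A conceptual sketch of those three properties in plain Python, with hypothetical node and partition counts; real systems like HDFS or Kafka implement them far more carefully.

```python
import zlib

# Hypothetical cluster layout for illustration only.
NUM_PARTITIONS = 8
REPLICATION_FACTOR = 3
NODES = [f"node-{i}" for i in range(6)]

def partition_for(key: str) -> int:
    # Partition: a deterministic key-to-partition mapping is the only
    # "index" a distributed store gives you, but it scales horizontally.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def replicas_for(partition: int) -> list:
    # Replication factor: each partition lives on several nodes, so losing
    # one node does not lose the data (availability).
    start = partition % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# Coordination: acknowledging a write only after a quorum of replicas has
# it is one common way to keep reads and writes consistent.
WRITE_QUORUM = REPLICATION_FACTOR // 2 + 1

print(replicas_for(partition_for("order-12345")), "write quorum:", WRITE_QUORUM)
```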

Slide 16

Watch the demonstration

Slide 17

Trade-offs
▸ Less Flexible: the only indexing available is through partitions. https://blog.andi.dirgantara.co/keterbatasan-teknologi-big-data-terkait-partisi-5613d0795956
▸ High Cost: the default replication factor is usually 3, which means storage is 3 times more expensive. There are plenty of other costs as well, such as the monitoring system, additional tooling, etc.
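
To make the replication cost concrete, a quick sketch reusing the 120 GB/month-per-table figure from earlier; the table count here is a hypothetical round number standing in for “hundreds of tables”.

```python
# Rough storage multiplier from replication, using the earlier figure of
# 120 GB/month per table. NUM_TABLES is a hypothetical example value.
GB_PER_TABLE_PER_MONTH = 120
REPLICATION_FACTOR = 3
NUM_TABLES = 100

logical_gb = GB_PER_TABLE_PER_MONTH * NUM_TABLES   # data you actually have
physical_gb = logical_gb * REPLICATION_FACTOR      # data you actually store

print(f"logical: {logical_gb:,} GB/month, on disk: {physical_gb:,} GB/month")
```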

Slide 18

When?

Slide 19

When should we use big data? When you need a scalable data infrastructure.*
*There is no formal guideline or de facto approach to help answer that question.

Slide 20

Example 1
We want to build a big database of all senior high school students in Indonesia. Seems “big” enough, doesn’t it? There are fewer than 13,000 senior high schools in Indonesia*, so let’s round that up. Assume a single school has 1,000 pupils, for a total of 13,000,000 pupils. Each following year, every school takes in around 300 new pupils, so about 3,900,000 pupils are added per year. Given those numbers, an RDBMS like MySQL is still capable of storing that amount of data.
*https://www.bps.go.id/linkTabelStatis/view/id/1837
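
The arithmetic behind example 1, as a sketch; note that the 300 new pupils are per school per year, which is what yields the 3,900,000 total.

```python
# Example 1: student profiles only.
SCHOOLS = 13_000                       # < 13,000 schools, rounded up
PUPILS_PER_SCHOOL = 1_000
NEW_PUPILS_PER_SCHOOL_PER_YEAR = 300

total_pupils = SCHOOLS * PUPILS_PER_SCHOOL                # 13,000,000
yearly_growth = SCHOOLS * NEW_PUPILS_PER_SCHOOL_PER_YEAR  # 3,900,000

print(f"{total_pupils:,} pupils, +{yearly_growth:,} per year")
```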

Slide 21

Example 2
Starting from example 1, we change the approach: instead of just recording the number and profiles of students, we want to record every student activity, such as paying monthly tuition fees, attendance, etc. Assume we already have 15,000,000 pupils recorded. A single pupil averages 50 activities per month, so there are 750,000,000 activities per month. Given those numbers, an RDBMS like MySQL is no longer capable, so we can try the big data approach.
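
And the arithmetic behind example 2, extended one step to a yearly figure; the 12-month extrapolation is my addition, not from the slide.

```python
# Example 2: full activity history instead of profiles only.
PUPILS = 15_000_000
ACTIVITIES_PER_PUPIL_PER_MONTH = 50

per_month = PUPILS * ACTIVITIES_PER_PUPIL_PER_MONTH  # 750,000,000 rows/month
per_year = per_month * 12                            # 9,000,000,000 rows/year

print(f"{per_month:,} rows/month, {per_year:,} rows/year")
```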

Slide 22

Questions?