
How Big Data Works in Sale Stock

This presentation was given at the Facebook Developer Circles Malang meetup in June 2017.

Andi N. Dirgantara

June 17, 2017
Transcript

  1. Problems
     ▸ 1 table/day = 40+ million rows
     ▸ 1 table/day = 4+ GB (Parquet + gzipped)*
     ▸ Some queries take 3+ hours**
     * 10+ GB of raw CSV
     ** without partitioning, before a well-structured big data framework was in place
  2. Let’s do some simple math...
     ▸ 1 month = 1,200,000,000 rows (40+ million rows/day × 30 days)
     ▸ 1 month = 120+ GB (4+ GB/day × 30 days)
     And that is a single table. We have hundreds of tables.
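A quick sanity check on those figures (not part of the deck; the monthly numbers follow directly from slide 1's daily ones):

    # Back-of-the-envelope check of the monthly volume, per table.
    rows_per_day = 40_000_000   # 40+ million rows/day (slide 1)
    gb_per_day = 4              # 4+ GB Parquet + gzip per day (slide 1)
    days = 30

    print(rows_per_day * days)  # 1,200,000,000 rows per month
    print(gb_per_day * days)    # 120 GB per month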
  3. Expectations
     ▸ Queryable: 3+ hours for a single query doesn’t make any sense
     ▸ Integrated: can be joined with other tables
     ▸ Scalable: performs well regardless of how much data there is
     We need a FAST and SCALABLE data pool.
  4. Data Sources → Pipeline → Data Pool
     [Diagram: microservices on a Kubernetes cluster and an RDBMS (SQL) feed the pipeline (Kafka, Spark, Sqoop), which loads the data pool]
     Example microservices: Order, Cart, Chat, Product, etc.
     • Kafka: provides the message queue and the event-driven data pipeline.
     • Spark: used when data needs to be computed before being passed to the next pipeline stage, either in batch or as a stream.
     • Sqoop: provides ETL from SQL into the Hadoop ecosystem.
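The deck shows this pipeline as a diagram only; below is a minimal PySpark Structured Streaming sketch of the Kafka-to-data-pool leg. The "orders" topic, broker address, and HDFS paths are assumptions, and the job needs the spark-sql-kafka connector on its classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orders-to-pool").getOrCreate()

    # Subscribe to the (hypothetical) "orders" topic on Kafka.
    orders = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "orders")
              .load())

    # Kafka delivers binary key/value pairs; keep the payload as a string.
    events = orders.selectExpr("CAST(value AS STRING) AS json")

    # Append the raw events to the data pool as Parquet files.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///pool/orders_raw")
             .option("checkpointLocation", "hdfs:///pool/_chk/orders_raw")
             .start())

    query.awaitTermination()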
  5. Data Pool Usage
     [Diagram: the data pool feeds artificial intelligence, raw queries, business analysts, batch processing, etc., which connect back to the pool or to the microservices (Kubernetes cluster) and the RDBMS (SQL)]
     • Any data processor can consume data from the pool to produce enhanced data.
     • Some processors write their results back to the pool; others return data to the microservice cluster.
     Example usage: data visualization, recommendation engine, fraud detection system, etc.
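As an illustration of a consumer that reads from the pool and writes enhanced data back, here is a hedged PySpark batch sketch; the paths and column names are made up:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-order-stats").getOrCreate()

    # Read raw orders that the pipeline landed in the pool.
    orders = spark.read.parquet("hdfs:///pool/orders_raw")

    # "Enhanced data": order counts per product per day.
    stats = (orders
             .groupBy("product_id", "order_date")
             .agg(F.count("*").alias("order_count")))

    # Write the result back to the pool for other consumers
    # (a recommendation engine, a dashboard, ...).
    stats.write.mode("overwrite").parquet("hdfs:///pool/orders_daily_stats")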
  6. Data Pool (Big Data Cluster)
     • HDFS for the file system
     • YARN for the resource manager
     • Cloud persistent storage, without replication*
     • GUI for the Hadoop ecosystem
     • Data warehouse
     • Distributed data processor
     • Columnar storage format
     * https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c
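The slide lists the building blocks without showing how they meet; a minimal sketch of a partitioned, Parquet-backed warehouse table created through PySpark with Hive support enabled (the database, table, and columns are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pool-ddl")
             .enableHiveSupport()   # keep table metadata in the Hive metastore
             .getOrCreate())

    spark.sql("""
        CREATE TABLE IF NOT EXISTS pool.orders (
            order_id   BIGINT,
            product_id BIGINT,
            amount     DOUBLE
        )
        PARTITIONED BY (dt STRING)  -- date partition, the table's only "index"
        STORED AS PARQUET           -- columnar storage format
    """)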
  7. Profit!
     With exactly the same data source, the same query, and the same partitioning, Hive took 26.60 seconds while Impala took 0.23 seconds.
  8. Basic Approach: Distributed over Centralized
     A scalable approach calls for a distributed system; technically, “big data” refers to a distributed platform.
     • Partition: provides scalable indexing within a distributed system (see the sketch after this list). https://blog.andi.dirgantara.co/konsep-partisi-fitur-dasar-big-data-5b531a777fdd
     • Replication Factor (Availability): provides fault tolerance / high availability in a distributed system.
     • Coordination (Consistency): provides consistency for both writes and reads in a distributed system.
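To make the partitioning point concrete, a small PySpark sketch (paths and columns invented): data written with a date partition lands in one directory per date, and a filter on the partition column prunes the scan to the matching directory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "2017-06-01", 10.0), (2, "2017-06-02", 25.0)],
        ["order_id", "dt", "amount"])

    # Each dt value becomes its own directory, e.g. .../orders/dt=2017-06-01/.
    orders.write.partitionBy("dt").mode("overwrite").parquet("hdfs:///tmp/orders")

    # The filter on dt touches only the 2017-06-01 directory,
    # not the whole table.
    one_day = spark.read.parquet("hdfs:///tmp/orders").filter("dt = '2017-06-01'")
    print(one_day.count())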
  9. Trade-offs
     ▸ Less flexible: the only indexing available is through partitioning. https://blog.andi.dirgantara.co/keterbatasan-teknologi-big-data-terkait-partisi-5613d0795956
     ▸ High cost: the default replication factor is usually 3, which makes storage 3 times more expensive. There are many other costs on top of that: a monitoring system, additional tooling, etc.
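Combining the replication factor with the monthly volume from slide 2 shows where the 3x cost comes from; a one-line check:

    replication_factor = 3   # the usual HDFS default
    logical_gb = 120         # one month of the single table from slide 2
    print(logical_gb * replication_factor)  # 360 GB physically stored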
  10. When should we use big data?
      When you need scalable data infrastructure.*
      * There is no formal guideline or de facto approach to help answer that question.
  11. Example 1
      We want to build a big database of all senior high school students in Indonesia. Seems “big” enough, doesn’t it? There are fewer than 13,000 senior high schools in Indonesia*; let’s round up to 13,000. Assume each school has 1,000 pupils, giving 13,000,000 pupils in total. Each year every school takes in roughly 300 new pupils, so about 3,900,000 pupils are added per year. Given those numbers, an RDBMS like MySQL is still perfectly capable of storing the data.
      * https://www.bps.go.id/linkTabelStatis/view/id/1837
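A quick check of the arithmetic in this example, with the figures taken from the slide:

    schools = 13_000
    pupils_per_school = 1_000
    new_pupils_per_school_per_year = 300

    print(schools * pupils_per_school)               # 13,000,000 pupils
    print(schools * new_pupils_per_school_per_year)  # 3,900,000 added per year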
  12. Example 2
      Building on example 1, we change the approach: instead of just recording the number and profiles of students, we also want to record every student activity, such as paying the monthly tuition fee, attendance, etc. Assume we already have 15,000,000 pupils recorded. A single pupil generates 50 activities per month on average, so there are 750,000,000 new activity rows per month. Given those numbers, an RDBMS like MySQL is no longer adequate, so we can try a big data approach.
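And the arithmetic for example 2, extended one year out (my extrapolation, not the deck's) to show how quickly the activity table outgrows example 1:

    pupils = 15_000_000
    activities_per_pupil_per_month = 50

    per_month = pupils * activities_per_pupil_per_month
    print(per_month)       # 750,000,000 activity rows per month

    print(per_month * 12)  # 9,000,000,000 rows after just one year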