
How Big Data Works in Sale Stock

This presentation was given at the Facebook Developer Circles Malang meetup in June 2017.

Andi N. Dirgantara

June 17, 2017
Transcript

  1. Problems
     ▸ 1 table/day = 40+ million rows
     ▸ 1 table/day = 4+ GB (Parquet + gzipped)*
     ▸ Some queries take 3+ hours**
     * 10+ GB of raw CSV
     ** without partitioning, before a well-structured big data framework was in place
  2. Let’s do some simple math...
     ▸ 1 month = 1,200,000,000 rows (40+ million rows/day × 30 days)
     ▸ 1 month = 120+ GB (4+ GB/day × 30 days)
     And that is a single table. We have hundreds of tables.
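A quick sanity check on those figures (not part of the deck; the monthly numbers follow directly from slide 1's daily ones):

    # Back-of-the-envelope check of the monthly volume, per table.
    rows_per_day = 40_000_000   # 40+ million rows/day (slide 1)
    gb_per_day = 4              # 4+ GB Parquet + gzip per day (slide 1)
    days = 30

    print(rows_per_day * days)  # 1,200,000,000 rows per month
    print(gb_per_day * days)    # 120 GB per month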
  3. Expectations
     ▸ Queryable: 3+ hours for a single query doesn’t make any sense
     ▸ Integrated: can be joined with other tables
     ▸ Scalable: performs well regardless of how much data there is
     We need a FAST and SCALABLE data pool.
  4. Data Sources → Pipeline → Data Pool
     [Diagram: microservices on a Kubernetes cluster and an RDBMS (SQL) feed the pipeline (Kafka, Spark, Sqoop), which loads the data pool]
     Example microservices: Order, Cart, Chat, Product, etc.
     • Kafka: provides the message queue and the event-driven data pipeline.
     • Spark: used when data needs to be computed before being passed to the next pipeline stage, either in batch or as a stream.
     • Sqoop: provides ETL from SQL into the Hadoop ecosystem.
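The deck shows this pipeline as a diagram only; below is a minimal PySpark Structured Streaming sketch of the Kafka-to-data-pool leg. The "orders" topic, broker address, and HDFS paths are assumptions, and the job needs the spark-sql-kafka connector on its classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orders-to-pool").getOrCreate()

    # Subscribe to the (hypothetical) "orders" topic on Kafka.
    orders = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "orders")
              .load())

    # Kafka delivers binary key/value pairs; keep the payload as a string.
    events = orders.selectExpr("CAST(value AS STRING) AS json")

    # Append the raw events to the data pool as Parquet files.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///pool/orders_raw")
             .option("checkpointLocation", "hdfs:///pool/_chk/orders_raw")
             .start())

    query.awaitTermination()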
  5. Data Pool Usage
     [Diagram: the data pool feeds artificial intelligence, raw queries, business analysts, batch processing, etc., which connect back to the pool or to the microservices (Kubernetes cluster) and the RDBMS (SQL)]
     • Any data processor can consume data from the pool to produce enhanced data.
     • Some processors write their results back to the pool; others return data to the microservice cluster.
     Example usage: data visualization, recommendation engine, fraud detection system, etc.
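As an illustration of a consumer that reads from the pool and writes enhanced data back, here is a hedged PySpark batch sketch; the paths and column names are made up:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-order-stats").getOrCreate()

    # Read raw orders that the pipeline landed in the pool.
    orders = spark.read.parquet("hdfs:///pool/orders_raw")

    # "Enhanced data": order counts per product per day.
    stats = (orders
             .groupBy("product_id", "order_date")
             .agg(F.count("*").alias("order_count")))

    # Write the result back to the pool for other consumers
    # (a recommendation engine, a dashboard, ...).
    stats.write.mode("overwrite").parquet("hdfs:///pool/orders_daily_stats")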
  6. Data Pool (Big Data Cluster)
     • HDFS for the file system
     • YARN for the resource manager
     • Cloud persistent storage, without replication*
     • GUI for the Hadoop ecosystem
     • Data warehouse
     • Distributed data processor
     • Columnar storage format
     * https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c
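The slide lists the building blocks without showing how they meet; a minimal sketch of a partitioned, Parquet-backed warehouse table created through PySpark with Hive support enabled (the database, table, and columns are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pool-ddl")
             .enableHiveSupport()   # keep table metadata in the Hive metastore
             .getOrCreate())

    spark.sql("""
        CREATE TABLE IF NOT EXISTS pool.orders (
            order_id   BIGINT,
            product_id BIGINT,
            amount     DOUBLE
        )
        PARTITIONED BY (dt STRING)  -- date partition, the table's only "index"
        STORED AS PARQUET           -- columnar storage format
    """)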
  7. Profit!
     With exactly the same data source, the same query, and the same partitioning, Hive took 26.60 seconds while Impala took 0.23 seconds.
  8. Basic Approach: Distributed over Centralized
     A scalable approach calls for a distributed system; technically, “big data” refers to a distributed platform.
     • Partition: provides scalable indexing within a distributed system (see the sketch after this list). https://blog.andi.dirgantara.co/konsep-partisi-fitur-dasar-big-data-5b531a777fdd
     • Replication Factor (Availability): provides fault tolerance / high availability in a distributed system.
     • Coordination (Consistency): provides consistency for both writes and reads in a distributed system.
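To make the partitioning point concrete, a small PySpark sketch (paths and columns invented): data written with a date partition lands in one directory per date, and a filter on the partition column prunes the scan to the matching directory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "2017-06-01", 10.0), (2, "2017-06-02", 25.0)],
        ["order_id", "dt", "amount"])

    # Each dt value becomes its own directory, e.g. .../orders/dt=2017-06-01/.
    orders.write.partitionBy("dt").mode("overwrite").parquet("hdfs:///tmp/orders")

    # The filter on dt touches only the 2017-06-01 directory,
    # not the whole table.
    one_day = spark.read.parquet("hdfs:///tmp/orders").filter("dt = '2017-06-01'")
    print(one_day.count())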
  9. Trade-offs
     ▸ Less flexible: the only indexing available is through partitioning. https://blog.andi.dirgantara.co/keterbatasan-teknologi-big-data-terkait-partisi-5613d0795956
     ▸ High cost: the default replication factor is usually 3, which makes storage 3 times more expensive. There are many other costs on top of that: a monitoring system, additional tooling, etc.
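Combining the replication factor with the monthly volume from slide 2 shows where the 3x cost comes from; a one-line check:

    replication_factor = 3   # the usual HDFS default
    logical_gb = 120         # one month of the single table from slide 2
    print(logical_gb * replication_factor)  # 360 GB physically stored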
  10. When should we use big data?
      When you need scalable data infrastructure.*
      * There is no formal guideline or de facto approach to help answer that question.
  11. Example 1
      We want to build a big database of all senior high school students in Indonesia. Seems “big” enough, doesn’t it? There are fewer than 13,000 senior high schools in Indonesia*; let’s round up to 13,000. Assume each school has 1,000 pupils, giving 13,000,000 pupils in total. Each year every school takes in roughly 300 new pupils, so about 3,900,000 pupils are added per year. Given those numbers, an RDBMS like MySQL is still perfectly capable of storing the data.
      * https://www.bps.go.id/linkTabelStatis/view/id/1837
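A quick check of the arithmetic in this example, with the figures taken from the slide:

    schools = 13_000
    pupils_per_school = 1_000
    new_pupils_per_school_per_year = 300

    print(schools * pupils_per_school)               # 13,000,000 pupils
    print(schools * new_pupils_per_school_per_year)  # 3,900,000 added per year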
  12. Example 2
      Building on example 1, we change the approach: instead of just recording the number and profiles of students, we also want to record every student activity, such as paying the monthly tuition fee, attendance, etc. Assume we already have 15,000,000 pupils recorded. A single pupil generates 50 activities per month on average, so there are 750,000,000 new activity rows per month. Given those numbers, an RDBMS like MySQL is no longer adequate, so we can try a big data approach.
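And the arithmetic for example 2, extended one year out (my extrapolation, not the deck's) to show how quickly the activity table outgrows example 1:

    pupils = 15_000_000
    activities_per_pupil_per_month = 50

    per_month = pupils * activities_per_pupil_per_month
    print(per_month)       # 750,000,000 activity rows per month

    print(per_month * 12)  # 9,000,000,000 rows after just one year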