Slide 1

Slide 1 text

How Big Data Platforms Handle Big Things
Andi N. Dirgantara - Data Engineer, Sale Stock

Slide 2

Slide 2 text

Introduction
I’m Andi N. Dirgantara
● 5+ years as software engineer
● 3+ years as data engineer (big data)
● Currently Data Engineer at Sale Stock
● Lead, FB DevC Malang
● Big Data and JavaScript lover
● Father of a 3-year-old son
● My Steam account: hellowin_cavemen

Slide 3

Slide 3 text

Outline
● Problems before big data
● How people solve those problems
● Technology behind big data platforms
● Example implementation
● Conclusion

Slide 4

Slide 4 text

Imagine this problem...
We have MySQL installed on a cloud instance, used by our application/service. Everything went well until…
● We faced 50,000 rows per second (180 million rows per hour)
● Storage grew by more than 100 GB each day
● A single query could take more than 5 hours

Slide 5

Slide 5 text

Then we searched for a solution...

Slide 6

Slide 6 text

source: mattturck.com/bigdata2017

Slide 7

Slide 7 text

Let’s dive deeper into the technology behind it...

Slide 8

Slide 8 text

A distributed system is the way
Partitioning is the key
● Throughput
● Storage
● Query
These problems all arise because a single machine does everything. So let’s break the work down across multiple machines instead.
Reads and writes are spread across the machines:
● Machine 1: data 1
● Machine 2: data 2
● Machine n: data n

Slide 9

Slide 9 text

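How? Storing in a distributed system
Example: storing user profiles, partitioned by name
● Machine 1: data A-C
● Machine 2: data D-F
● Machine n: data nx-ny
● Read Christine’s profile: goes to Machine 1
● Write Dony’s profile: goes to Machine 2
Example: storing data partitioned by date
● Machine 1: dates 1-5
● Machine 2: dates 6-10
● Machine n: dates nx-ny
● Scan dates 2-9: touches Machine 1 and Machine 2
● Scan dates 6-7: touches only Machine 2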
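
The routing on this slide can be sketched in a few lines of Python. This is only an illustration of range partitioning, not how any particular database does it. The machine names, the `UPPER_BOUNDS` lists, and the assumption that the last machine takes every remaining key are made up for the example.

```python
from bisect import bisect_left

# Hypothetical cluster: the names and ranges below are assumptions for illustration.
MACHINES = ["machine-1", "machine-2", "machine-n"]

# Inclusive upper bound of each machine's range.
NAME_UPPER_BOUNDS = ["C", "F", "Z"]   # A-C, D-F, and (assumed) everything after F
DATE_UPPER_BOUNDS = [5, 10, 15]       # days 1-5, 6-10, and (assumed) 11-15


def _partition_index(bounds, key):
    """Index of the first range whose upper bound is >= key (clamped to the last range)."""
    return min(bisect_left(bounds, key), len(bounds) - 1)


def machine_for_name(name):
    """A point read or write touches exactly one machine."""
    return MACHINES[_partition_index(NAME_UPPER_BOUNDS, name[0].upper())]


def machines_for_date_scan(first_day, last_day):
    """A range scan touches every machine whose range overlaps the scan."""
    lo = _partition_index(DATE_UPPER_BOUNDS, first_day)
    hi = _partition_index(DATE_UPPER_BOUNDS, last_day)
    return MACHINES[lo:hi + 1]


print(machine_for_name("Christine"))   # read Christine's profile
print(machine_for_name("Dony"))        # write Dony's profile
print(machines_for_date_scan(2, 9))    # wide scan
print(machines_for_date_scan(6, 7))    # narrow scan
```

The point of the sketch: a lookup by key never has to ask every machine, and a scan only pays for the ranges it overlaps.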

Slide 10

Slide 10 text

How? Processing in a distributed system
Example: doing a word count
● Input: a bunch of text data
● First stage, split by file: processing file 1, processing file 2, up to processing file n
● Second stage, split by word: processing word 1, processing word 2, up to processing word n
● Result
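
A single-process Python sketch of the two stages, assuming each "machine" is just a function call. The function names and the toy hash (sum of character codes) are assumptions for illustration; a real platform would run the stages on different machines and use its own partitioner.

```python
from collections import Counter, defaultdict


def count_one_file(text):
    """Stage 1: each file is counted on its own, independent of the others."""
    return Counter(text.split())


def route_by_word(partial_counts, n_workers):
    """Send every count for the same word to the same stage-2 worker."""
    buckets = [defaultdict(list) for _ in range(n_workers)]
    for partial in partial_counts:
        for word, count in partial.items():
            # Toy deterministic hash, used here only for illustration.
            worker = sum(map(ord, word)) % n_workers
            buckets[worker][word].append(count)
    return buckets


def sum_one_bucket(bucket):
    """Stage 2: each worker adds up the partial counts for its own words."""
    return {word: sum(counts) for word, counts in bucket.items()}


def word_count(files, n_workers=2):
    partials = [count_one_file(text) for text in files]
    result = {}
    for bucket in route_by_word(partials, n_workers):
        result.update(sum_one_bucket(bucket))
    return result


print(word_count(["big data big", "data platform"]))
```

Because a word always lands on the same stage-2 worker, no worker needs to see the whole data set to produce a correct total.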

Slide 11

Slide 11 text

Our problems are solved!
● Throughput
● Storage
● Query
All of those problems were solved, but then another problem arose...

Slide 12

Slide 12 text

What if?
Machine 1, Machine 2, Machine 3: what if one machine goes down?

Slide 13

Slide 13 text

We need high availability
Replication will solve it
Example: replication factor 3
● Input data 1, input data 2: stored on Machine 1, Machine 2, and Machine 3
● Input data n: stored on Machine n, Machine n+1, and Machine n+2
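
One simple way to get the "n, n+1, n+2" placement is to walk a ring of machines. The Python sketch below assumes that scheme and uses 0-based machine indexes; real systems choose replicas with their own, usually more elaborate, strategies.

```python
def replicas_for(partition, n_machines, replication_factor=3):
    """Machines holding a copy: the owner plus the next machines on the ring."""
    if replication_factor > n_machines:
        raise ValueError("replication factor cannot exceed the number of machines")
    return [(partition + i) % n_machines for i in range(replication_factor)]


def live_replicas(partition, n_machines, down, replication_factor=3):
    """Copies that can still serve the data when the machines in `down` have failed."""
    return [m for m in replicas_for(partition, n_machines, replication_factor)
            if m not in down]


print(replicas_for(0, n_machines=5))             # owner 0 and its two neighbours
print(replicas_for(4, n_machines=5))             # wraps around the ring
print(live_replicas(0, n_machines=5, down={1}))  # machine 1 is down
```

With a replication factor of 3, the data stays readable until all three machines in its replica list are down at the same time.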

Slide 14

Slide 14 text

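High availability solves hardware failure
● Machine 1: data x, y
● Machine 2: data x, z
● Machine 3: data y, z
● Query data x, query data z
Every piece of data lives on two machines, so a query can still be answered when one machine is down.
But then another problem arose...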

Slide 15

Slide 15 text

Replication slows down I/O
● Machine 1: data x, y
● Machine 2: data x, z
● Machine 3: data y, z
● Query data x, query data z
● Write data x, write data y, write data z
Every write has to go to two machines: 2x the write work.

Slide 16

Slide 16 text

We need consistency control
We can choose when a write counts as successful: when all replicas, a quorum of replicas, or only one replica reports success.
Input data 1 and input data 2 are written to Machine 1, Machine 2, and Machine 3 (the diagram shows this twice).
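
The three choices boil down to "how many acknowledgements do we wait for". A minimal Python sketch, assuming the level names ONE, QUORUM, and ALL and a quorum defined as a strict majority of the replicas:

```python
def required_acks(level, replication_factor):
    """Number of replicas that must report success before the write is accepted."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def write_succeeded(acks, level, replication_factor):
    """True when enough replicas acknowledged the write for the chosen level."""
    return acks >= required_acks(level, replication_factor)


# Replication factor 3, and one replica is slow or down, so only 2 acks arrive.
for level in ("ONE", "QUORUM", "ALL"):
    print(level, write_succeeded(acks=2, level=level, replication_factor=3))
```

Waiting for fewer acknowledgements makes writes faster and more tolerant of a failed machine, at the cost of a higher chance that a later read sees stale data.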

Slide 17

Slide 17 text

Now we have
● Partition system
● Replication factor
● Consistency control
How do we use them?

Slide 18

Slide 18 text

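Cassandra
Set the partition key and clustering key in the primary key (groupname is the partition key, username is the clustering key):
    CREATE TABLE groups (
      groupname text,
      username text,
      email text,
      age int,
      PRIMARY KEY (groupname, username)
    );
Change consistency on query (CONSISTENCY is a cqlsh command):
    CONSISTENCY ALL;
    SELECT * FROM demo_table WHERE id = 0;
Set replication factor (the replication map also needs a 'class' entry):
    CREATE KEYSPACE Excelsior
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };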
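
To see what the two key parts do, here is a rough Python model of the groups table: rows with the same partition key are kept together, and inside a partition they are ordered by the clustering key. The sample rows are invented for the example, and this models only the logical layout, not Cassandra's storage engine.

```python
from collections import defaultdict

# Hypothetical rows for the groups table above.
ROWS = [
    {"groupname": "admins", "username": "zoe", "email": "zoe@example.com", "age": 31},
    {"groupname": "users", "username": "dony", "email": "dony@example.com", "age": 27},
    {"groupname": "admins", "username": "andi", "email": "andi@example.com", "age": 29},
]


def lay_out(rows):
    """Group rows by partition key, then sort each group by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["groupname"]].append(row)      # partition key
    for group in partitions.values():
        group.sort(key=lambda row: row["username"])   # clustering key
    return dict(partitions)


layout = lay_out(ROWS)
for groupname, group in layout.items():
    print(groupname, [row["username"] for row in group])
```

A query that fixes groupname only has to visit one partition, and it gets the usernames back already sorted.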

Slide 19

Slide 19 text

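Hive using HDFS
Set replication factor on HDFS (hdfs-site.xml):
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
Set consistency control on HDFS (hdfs-site.xml), the minimum number of replicas a block must reach:
    <property>
      <name>dfs.namenode.replication.min</name>
      <value>1</value>
    </property>
Set partitions when creating a table:
    -- Partition columns are declared only in PARTITIONED BY;
    -- Hive rejects a column that appears in both lists.
    CREATE TABLE user_profile (
      name STRING,
      address STRING)
    PARTITIONED BY (usergroup STRING, userid BIGINT)
    STORED AS SEQUENCEFILE;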

Slide 20

Slide 20 text

HBase
Set replication factor on HDFS (hdfs-site.xml):
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
HBase uses strong consistency by default
HBase partitioning is based on each row’s “key”

Slide 21

Slide 21 text

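Spark
Partition on Spark (partition by word):
    // textFile is assumed to be an RDD[String] of input lines defined earlier
    val counts = textFile.flatMap(line => line.split(" "))  // split lines into words
                         .map(word => (word, 1))            // pair each word with 1
                         .reduceByKey(_ + _)                // sum per word
Spark doesn’t need consistency control or replication.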

Slide 22

Slide 22 text

and many more...

Slide 23

Slide 23 text

Conclusion
● Technically, big data is always related to distributed systems.
● Big data platforms have partitioning, replication, and consistency control, which make them capable of handling large amounts of data.
● There’s no silver bullet; choose wisely which technology will solve your problems.

Slide 24

Slide 24 text

Thank you