How Big Data Platform Handle Big Things

Andi N. Dirgantara
Data Engineer, Sale Stock

Introduction
• I’m Andi N. Dirgantara
• 5+ years as a software engineer
• 3+ years as a data engineer (big data)
• Currently Data Engineer at Sale Stock
• Lead, FB DevC Malang
• Big data and JavaScript lover
• Father of a 3-year-old son
• My Steam account: hellowin_cavemen

Outline
• Problems before big data
• How people solve those problems
• Technology behind big data platforms
• Example implementation
• Conclusion

Imagine this problem...
• We have MySQL installed on a cloud instance, used by our application/service
• Everything went well until…
• We faced 50,000 rows per second (180 million rows per hour)
• Storage consumption grew by more than 100 GB each day
• A single query could take more than 5 hours

Then we searched for a solution...

source: mattturck.com/bigdata2017

Let’s dive deeper into the technology behind it...

Distributed systems are the way, partitioning is the key
• Throughput, storage, and query problems arise because a single machine does everything
• So let’s break the work down across multiple machines instead
[Diagram: writes and reads are spread across Machine 1 (Data 1), Machine 2 (Data 2), ..., Machine n (Data n)]

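How? Storing in a distributed system
Example: storing user profiles
• Partitioned by name: Machine 1 holds data A-C, Machine 2 holds data D-F, Machine n holds data nx-ny
• "Read Christine profile" goes to the machine holding A-C; "Write Dony profile" goes to the machine holding D-F
• Partitioned by date: Machine 1 holds dates 1-5, Machine 2 holds dates 6-10, Machine n holds dates nx-ny
• "Scan data date 2-9" touches Machine 1 and Machine 2; "Scan data date 6-7" touches only Machine 2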
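The routing on this slide can be sketched in a few lines of Python. This is only an illustration of range partitioning, not how any particular database does it: the machine names, the three-machine layout, and the range boundaries below are assumptions chosen to match the A-C / D-F and 1-5 / 6-10 ranges in the example.

```python
from bisect import bisect_right

# Hypothetical cluster: three machines, each owning one contiguous range.
MACHINES = ["machine-1", "machine-2", "machine-3"]

# First letter owned by each machine: A-C, D-F, G and later (assumed layout).
NAME_LOWER_BOUNDS = ["A", "D", "G"]

# First day of month owned by each machine: 1-5, 6-10, 11 and later (assumed layout).
DATE_LOWER_BOUNDS = [1, 6, 11]


def machine_for_name(name: str) -> str:
    """Route a single-profile read or write to the machine owning its first letter."""
    index = bisect_right(NAME_LOWER_BOUNDS, name[0].upper()) - 1
    return MACHINES[max(index, 0)]


def machines_for_date_scan(start_day: int, end_day: int) -> list:
    """Return only the machines whose date range overlaps the scanned range."""
    first = max(bisect_right(DATE_LOWER_BOUNDS, start_day) - 1, 0)
    last = max(bisect_right(DATE_LOWER_BOUNDS, end_day) - 1, 0)
    return MACHINES[first:last + 1]


if __name__ == "__main__":
    print(machine_for_name("Christine"))
    print(machine_for_name("Dony"))
    print(machines_for_date_scan(2, 9))
    print(machines_for_date_scan(6, 7))
```

The point of the sketch is that a well-chosen partition key lets a request touch one or two machines instead of all of them.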

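How? Processing in a distributed system
Example: doing a word count
[Diagram: a bunch of text data is split into "Processing file 1", "Processing file 2", ..., "Processing file n"; their output is regrouped into "Processing word 1", "Processing word 2", ..., "Processing word n"; those are combined into the result]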
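The two stages in the word count diagram (per-file processing, then per-word processing) can be imitated on one machine. The sketch below is a toy, single-process illustration; the function names and the byte-sum partitioning rule are assumptions made for the example, and a real platform would run each step on different machines.

```python
from collections import Counter, defaultdict


def map_file(text: str) -> Counter:
    """Per-file step: count words in one file, independently of other files."""
    return Counter(text.lower().split())


def shuffle(partial_counts: list, num_reducers: int) -> list:
    """Regroup partial counts so every occurrence of a word lands in one bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in partial_counts:
        for word, n in counts.items():
            # Deterministic toy partitioner: sum of the word's bytes.
            index = sum(word.encode("utf-8")) % num_reducers
            buckets[index][word].append(n)
    return buckets


def reduce_bucket(bucket: dict) -> dict:
    """Per-word step: add up the partial counts collected for each word."""
    return {word: sum(ns) for word, ns in bucket.items()}


def word_count(files: list, num_reducers: int = 2) -> dict:
    partial = [map_file(text) for text in files]
    result = {}
    for bucket in shuffle(partial, num_reducers):
        result.update(reduce_bucket(bucket))
    return result


if __name__ == "__main__":
    print(word_count(["big data big", "data platform"]))
```

Because each file and each word bucket is handled independently, both stages can be spread over as many machines as there are files or buckets.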

Our problems are solved!
• Throughput, storage, and query: all of those problems were solved
• But then another problem arose...

What if one machine goes down?
[Diagram: Machine 1, Machine 2, and Machine 3, with one machine down]

We need high availability
• Replication will solve it
• Example: replication factor 3
[Diagram: Input data 1 and Input data 2 are stored on Machine 1, Machine 2, and Machine 3; Input data n is stored on Machine n, Machine n+1, and Machine n+2]
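One simple way to pick the machines for each piece of data, matching the "Machine n, n+1, n+2" pattern on the slide, is to take consecutive machines and wrap around at the end of the cluster. This is a minimal sketch under that assumption; the function names and zero-based numbering are illustrative, and real systems use more elaborate placement strategies.

```python
def replica_machines(partition_index: int, num_machines: int,
                     replication_factor: int = 3) -> list:
    """Return the zero-based machine indexes that hold copies of one partition."""
    if replication_factor > num_machines:
        raise ValueError("replication factor cannot exceed the number of machines")
    return [(partition_index + offset) % num_machines
            for offset in range(replication_factor)]


def live_replicas(replicas: list, down_machines: set) -> list:
    """Return the replicas that can still serve the data after some machines fail."""
    return [m for m in replicas if m not in down_machines]


if __name__ == "__main__":
    replicas = replica_machines(partition_index=4, num_machines=5)
    print(replicas)
    print(live_replicas(replicas, down_machines={0}))
```

With a replication factor of 3, the data stays readable as long as at least one of its three machines is up.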

High availability solves hardware failure
[Diagram: Machine 1 holds data x, y; Machine 2 holds data x, z; Machine 3 holds data y, z; queries for data z and data x]
But another problem arises again...

Replication slows down I/O
[Diagram: Machine 1 holds data x, y; Machine 2 holds data x, z; Machine 3 holds data y, z; queries for data z and data x; writes of data z, data x, and data y]
• Every write goes to two machines: a 2x write process

We need consistency control
• We can choose when an operation counts as successful: when all replica machines, a quorum of them, or only one of them reports success
[Diagram, shown twice: Input data 1 and Input data 2 written to Machine 1, Machine 2, and Machine 3]
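The all / quorum / one choice comes down to how many acknowledgements the coordinator waits for. The sketch below assumes a quorum means a strict majority of replicas; the level names and function names are illustrative and not tied to a specific product's API.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed before reporting success."""
    if level == "ALL":
        return replication_factor
    if level == "QUORUM":
        # Assumption: quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ONE":
        return 1
    raise ValueError(f"unknown consistency level: {level}")


def write_succeeded(level: str, acks: list) -> bool:
    """acks holds one boolean per replica: True if that replica confirmed the write."""
    return sum(acks) >= required_acks(level, len(acks))


if __name__ == "__main__":
    acks = [True, True, False]  # one of three replicas did not answer
    for level in ("ALL", "QUORUM", "ONE"):
        print(level, write_succeeded(level, acks))
```

Stricter levels wait for more machines, so they trade latency and availability for stronger guarantees.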

Now we have
• Partition system
• Replication factor
• Consistency control
How to use it?

Cassandra

Partition key and clustering key:
    CREATE TABLE groups (
      groupname text,
      username text,
      email text,
      age int,
      PRIMARY KEY (groupname, username)  -- groupname: partition key, username: clustering key
    );

Change consistency on query (in cqlsh):
    CONSISTENCY ALL;
    SELECT * FROM demo_table WHERE id = 0;

Set replication factor (CREATE KEYSPACE also requires a replication 'class'):
    CREATE KEYSPACE Excelsior
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

Hive using HDFS

Set replication factor on HDFS (hdfs-site.xml):
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

Set consistency control on HDFS (hdfs-site.xml):
    <property>
      <name>dfs.namenode.replication.min</name>
      <value>1</value>
    </property>

Set partitions when creating the table:
    -- Hive rejects partition columns that are repeated in the regular column list,
    -- so usergroup and userid are declared only in PARTITIONED BY.
    CREATE TABLE user_profile(
      name STRING,
      address STRING)
    PARTITIONED BY(usergroup STRING, userid BIGINT)
    STORED AS SEQUENCEFILE;

HBase

Set replication factor on HDFS (hdfs-site.xml):
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

• HBase is always set to strong consistency
• HBase partitioning is based on each row’s “key”

Spark

Partition on Spark:
    // textFile is an RDD of text lines
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // partition by word

• Spark doesn’t need consistency control or replication

and many more...

Conclusion
• Technically, the term "big data" is always related to distributed systems.
• Big data platforms have partitioning, replication, and consistency control, which make them capable of handling large amounts of data.
• There’s no silver bullet; choose wisely which technology will solve your problems.

Thank you
