How Big Data Platform Handle Big Things

Andi N. Dirgantara

November 28, 2017

Transcript

  1. Introduction
    I’m Andi N. Dirgantara. 5+ years as a software engineer. 3+ years as a data engineer (big data). Currently Data Engineer at Sale Stock. Lead, FB DevC Malang. Big data and JavaScript lover. Father of a 3-year-old son. My Steam account: hellowin_cavemen.
  2. Outline
    Problems before big data. How people solve those problems. Technology behind big data platforms. Example implementations. Conclusion.
  3. Imagine this problem...
    We have MySQL installed on a cloud instance, used by our application/service. Everything went well until we faced: 50,000 rows per second (180 million rows per hour); storage consumption of more than 100 GB each day; a single query that can take more than 5 hours.
  4. Distributed system is the way. Partition is the key.
    The throughput, storage, and query problems all occur because a single machine does everything by itself. So let’s break the work down across multiple machines instead. [Diagram: writes and reads go to Machine 1 (Data 1), Machine 2 (Data 2), ..., Machine n (Data n).]
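  5. How? Storing in a distributed system
    Example: storing user profiles. Machine 1 holds data A-C, Machine 2 holds data D-F, ..., Machine n holds data nx-ny. "Read Christine profile" goes to Machine 1 and "Write Dony profile" goes to Machine 2.
    Example: storing by date. Machine 1 holds data for dates 1-5, Machine 2 holds data for dates 6-10, ..., Machine n holds data for dates nx-ny. "Scan data date 2-9" touches Machine 1 and Machine 2, while "Scan data date 6-7" touches only Machine 2.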
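The routing on this slide can be sketched as range partitioning. This is a minimal illustration, not any particular platform's implementation: the machine names, the third range (G-Z, days 11-31), and the helper functions are assumptions added to make the example complete.

```python
from bisect import bisect_left

# Hypothetical layout: inclusive upper bound of the first letter per machine.
LETTER_UPPER_BOUNDS = ["C", "F", "Z"]   # machine-1: A-C, machine-2: D-F, machine-3: G-Z
# Hypothetical layout: inclusive upper bound of the day of month per machine.
DAY_UPPER_BOUNDS = [5, 10, 31]          # machine-1: 1-5, machine-2: 6-10, machine-3: 11-31
MACHINES = ["machine-1", "machine-2", "machine-3"]


def _partition_index(bounds, key):
    """Index of the first partition whose upper bound is >= key (clamped to the last one)."""
    return min(bisect_left(bounds, key), len(bounds) - 1)


def machine_for_profile(name):
    """Route a single-profile read or write by the first letter of the name."""
    return MACHINES[_partition_index(LETTER_UPPER_BOUNDS, name[0].upper())]


def machines_for_scan(first_day, last_day):
    """Return every machine whose date range overlaps the scanned range."""
    lo = _partition_index(DAY_UPPER_BOUNDS, first_day)
    hi = _partition_index(DAY_UPPER_BOUNDS, last_day)
    return MACHINES[lo:hi + 1]


print(machine_for_profile("Christine"))  # routed by the letter "C"
print(machines_for_scan(2, 9))           # spans two date ranges
print(machines_for_scan(6, 7))           # stays inside one date range
```

A point lookup touches exactly one machine, while a scan touches only the machines whose ranges overlap the query, which is why the choice of partition key matters so much.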
  6. How? Processing in a distributed system
    Example: doing a word count. A bunch of text data is first processed per file (file 1, file 2, ..., file n), then processed per word (word 1, word 2, ..., word n), and finally combined into the result.
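The two stages on this slide (per file, then per word) can be sketched as a map, shuffle, and reduce pipeline. This is a single-process illustration with hypothetical function names; in a real platform each `map_file` call and each `reduce_word` call could run on a different machine.

```python
from collections import defaultdict


def map_file(text):
    """Per-file stage: emit a (word, 1) pair for every word in one file."""
    return [(word, 1) for word in text.split()]


def shuffle(pairs_per_file):
    """Group the emitted pairs by word so each word can be reduced independently."""
    grouped = defaultdict(list)
    for pairs in pairs_per_file:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_word(counts):
    """Per-word stage: sum the counts collected for one word."""
    return sum(counts)


def word_count(files):
    mapped = [map_file(text) for text in files]
    grouped = shuffle(mapped)
    return {word: reduce_word(counts) for word, counts in grouped.items()}


print(word_count(["big data", "big things"]))
```

Because the per-file stage never looks at another file and the per-word stage never looks at another word, both stages can be spread over as many machines as there are files or words.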
  7. Our problems are solved!
    Throughput, storage, and query: all of those problems were solved, but then another problem arose...
  8. We need high availability. Replication will solve it.
    Example with replication factor 3. [Diagram: Input data 1, Input data 2 → Machine 1, Machine 2, Machine 3; Input data n → Machine n, Machine n+1, Machine n+2.]
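One simple way to picture the placement on this slide is to put each item on its "home" machine and the next machines after it. This is a sketch under stated assumptions: machines are numbered from 0, the placement wraps around at the end of the cluster, and `replicas_for` is a hypothetical helper rather than any platform's actual placement policy.

```python
def replicas_for(item_index, machine_count, replication_factor=3):
    """Return the machine indexes that hold copies of one item.

    The item goes to machine n, n+1, ..., wrapping around the cluster.
    """
    if replication_factor > machine_count:
        raise ValueError("replication factor cannot exceed the number of machines")
    return [(item_index + k) % machine_count for k in range(replication_factor)]


for item in range(5):
    print(item, replicas_for(item, machine_count=5))
```

With replication factor 3, any two machines can fail and every item still has at least one live copy.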
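  9. High availability solves hardware failure
    Machine 1 holds data x, y. Machine 2 holds data x, z. Machine 3 holds data y, z. Every item lives on two machines, so queries for data z and data x can still be answered when one machine fails. But then another problem arose...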
  10. Replication slows down the I/O process
    Machine 1 holds data x, y. Machine 2 holds data x, z. Machine 3 holds data y, z. Queries for data z and data x are unchanged, but writing data z, data x, and data y now means writing each item to two machines: a 2x write process.
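The trade-off shown on the last two slides can be checked with the same x, y, z layout. This is a minimal sketch: the machine names and helper functions are hypothetical, and only the placement itself comes from the slides.

```python
# Placement from the slides: every item is stored on two of the three machines.
PLACEMENT = {
    "machine-1": {"x", "y"},
    "machine-2": {"x", "z"},
    "machine-3": {"y", "z"},
}


def machines_holding(item, down=frozenset()):
    """Machines that hold the item and are still alive."""
    return sorted(
        machine
        for machine, items in PLACEMENT.items()
        if item in items and machine not in down
    )


def physical_writes(items):
    """Number of machine-level writes needed to store the given logical items."""
    return sum(len(machines_holding(item)) for item in items)


# Reads survive a single machine failure...
print(machines_holding("z", down={"machine-3"}))
# ...but three logical writes cost six physical writes.
print(physical_writes(["x", "y", "z"]))
```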
  11. We need consistency control
    We can choose what counts as success: an operation succeeds when all, a quorum, or only one of the replica machines report success. [Diagram, shown twice: Input data 1, Input data 2 → Machine 1, Machine 2, Machine 3.]
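The three choices can be expressed as the number of acknowledgements a coordinator waits for. This is a minimal sketch: the level names follow the slide, quorum is taken to mean a strict majority of replicas, and both functions are hypothetical helpers.

```python
def required_acks(level, replication_factor):
    """How many replicas must report success for the chosen level."""
    if level == "ALL":
        return replication_factor
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of the replicas
    if level == "ONE":
        return 1
    raise ValueError(f"unknown consistency level: {level}")


def operation_succeeded(acks, level, replication_factor):
    """True when enough replicas have acknowledged the operation."""
    return acks >= required_acks(level, replication_factor)


for level in ("ALL", "QUORUM", "ONE"):
    print(level, required_acks(level, replication_factor=3))
```

Waiting for fewer acknowledgements makes an operation faster and more tolerant of failed machines, at the cost of a weaker guarantee that every replica has the latest data.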
  12. Cassandra
    Partition key and clustering key: CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ); Here groupname is the partition key and username is the clustering key.
    Change consistency on query (in cqlsh): CONSISTENCY ALL; SELECT * FROM demo_table WHERE id = 0;
    Set replication factor: CREATE KEYSPACE Excelsior WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
  13. Hive using HDFS
    Set replication factor on HDFS (hdfs-site.xml): <property> <name>dfs.replication</name> <value>3</value> </property>
    Set consistency control on HDFS (hdfs-site.xml): <property> <name>dfs.namenode.replication.min</name> <value>1</value> </property>
    Set partitions when creating a table (partition columns are declared only in PARTITIONED BY): CREATE TABLE user_profile( name STRING, address STRING) PARTITIONED BY(usergroup STRING, userid BIGINT) STORED AS SEQUENCEFILE;
  14. HBase
    Set replication factor on HDFS (hdfs-site.xml): <property> <name>dfs.replication</name> <value>3</value> </property>
    HBase is always set to strong consistency.
    HBase partitioning is based on each row’s “key”.
  15. Spark
    Partition on Spark (partitioned by word): val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    Spark is a processing engine, so it doesn’t need consistency and replication settings of its own.
  16. Conclusion
    Technically, the term "big data" is always related to distributed systems. Big data platforms have partitioning, replication, and consistency control, which make them capable of handling large amounts of data. There’s no silver bullet; choose wisely which technology will solve your problems.