Upgrade to Pro — share decks privately, control downloads, hide ads and more …

MySQL & Hadoop

Avatar for Sagar Jauhari Sagar Jauhari
November 11, 2012

MySQL & Hadoop

Avatar for Sagar Jauhari

Sagar Jauhari

November 11, 2012
Tweet

Other Decks in Technology

Transcript

  1. 2 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Improving MySQL Performance with Hadoop Sagar Jauhari, Manish Kumar
  2. 3 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. India May 03 – May 04, 2012 San Francisco September 30 – October 4, 2012
  3. 4 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Program Agenda • Introduction • Inside Hadoop! • Integration with MySQL • Facebook's usage of MySQL & Hadoop • Twitter's usage of MySQL &Hadoop
  4. 5 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Introduction • 12 million product installations • 65,000 downloads each day • Part of the rapidly growing open source LAMP stack • MySQL Commercial Editions Available MySQL
  5. 6 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Introduction • Highly scalable Distributed Framework o Yahoo! has a 4000 node cluster! • Extremely powerful in terms of computation o Sorts a TB of random integers in 62 seconds! Hadoop
  6. 7 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Introduction • A scalable system for data storage and processing. • Fault tolerant • Parallelizes data processing across many nodes • Leverages its distributed file system (HDFS)* to cheaply and reliably replicate chunks of data. Hadoop is ..
  7. 8 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Introduction • Yahoo:  Ad Systems and Web Search. • Facebook:  Reporting/analytics and machine learning. • Twitter:  Data warehousing, data analysis. • Netflix:  Movie recommendation algorithm uses Hive ( which uses Hadoop, HDFS & MapReduce underneath) Who uses Hadoop?
  8. 9 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Introduction MySQL Hadoop Data Capacity TB+ (may require sharding) PB+ Data per query GB? PB+ Read/Write Random read/write Sequential scans, Append - only Query Language SQL Java MapReduce, scripting languages, Hive QL Transaction Yes No Indexes Yes No Latence Sub-second (hopefully) Minutes to hours Data structure Structured Structured or unstructured MySQL Vs Hadoop Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  9. 10 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. A shallow Deep Dive Inside Hadoop
  10. 11 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • A distributed, scalable, and portable file system written in Java • Each node in a Hadoop instance typically has a single name-node; a cluster of data-nodes form the HDFS cluster. HDFS Name Node HDFS Map / Reduce Workers
  11. 12 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Uses the TCP/IP layer for communication • Stores large files across multiple machines • Single name node stores metadata in-memory. HDFS Name Node HDFS Map / Reduce Workers
  12. 14 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Design Goals o Scalability o Cost Efficiency • Implementation o User Jobs are executed as 'map' and 'reduce' functions o Work distribution and fault tolerance are managed Map Reduce Input Map Shuffle and sort Reduce Output
  13. 15 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Map o Map Reduce job splits input data into independent chunks o Each chunk is processed by the map task in a parallel manner o Generic key-value computation Map Reduce Input Map Shuffle and sort Reduce Output
  14. 16 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Reduce o Data from data nodes is merge sorted so that the key-value pairs for a given key are contiguous o The merged data is read sequentially and the values are passed to the reduce method with an iterator reading the input file until the next key value is encountered Map Reduce Input Map Shuffle and sort Reduce Output
  15. 17 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop Map Reduce Word Hadoop MySQL Hive Sqoop Pig Hadoop Reduce Map Map Map Reduce Word Count Hadoop 2 MySQL 1 Hive 1 Sqoop 1 Pig 1 Input Map Shuffle and sort Reduce Output
  16. 18 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. • Master o Schedules the jobs' component tasks on the slaves o Monitors the jobs o Re-executes the failed tasks • Slave o Executes the tasks as directed by the master. How does hadoop use Map-Reduce
  17. 19 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Language support o Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R) . • Scales Horizontally • Programmer is isolated from individual failed tasks • Tasks are restarted on another node Why Map Reduce ?
  18. 20 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Inside Hadoop • Not a good fit for problems that exhibit task-driven parallelism. • Requires a particular form of input - a set of (key, pair) pairs. • A lot of MapReduce applications end up sharing data one way or another. Map Reduce Limitations
  19. 21 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Leveraging Hadoop to Improve MySQL performance
  20. 22 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL • The benefits of MySQL to developers is the speed, reliability, data integrity and scalability it provides. • It can successfully process large amounts of data (in petabytes). • But for applications that require a massive parallel processing we may need the benefits of a parallel processing system, such as hadoop.
  21. 23 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  22. 24 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Problem Statement Word Count Problem • In a large set of documents, find the number of occurrences of each word.
  23. 25 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Word count problem Word Hadoop MySQL Hive Sqoop Pig Hadoop Reduce Map Map Map Reduce Word Count Hadoop 2 MySQL 1 Hive 1 Sqoop 1 Pig 1 Input Map Shuffle and sort Reduce Output
  24. 26 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Mapping Map(key, value) foreach(word in the value) output(word,1) Key and Value represent a row of data: key is the byte office, value in a line. Intermediate Output <word1>, 1 <word2>, 1 <word3>, 1
  25. 27 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Reducing Reduce(key, list) sum the list Output(key, sum) Hadoop aggregates the keys and calls reduce for each unique key: <word1>, (1,1,1,1,1,1…1) <word2>, (1,1,1) <word3>, (1,1,1,1,1,1) . Final result: <word1>, 45823 <word2>, 1204 <word3>, 2693
  26. 28 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Writing Map Reduce jobs in Java
  27. 29 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Integration with MySQL Writing Map Reduce jobs in Java
  28. 32 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Facebook's usage of MySQL & Hadoop • Facebook collects TB of data everyday from around 800 million users. • MySQL handles pretty much every user interaction: likes, shares, status updates, alerts, requests, etc. • Hadoop/Hive Warehouse – 4800 cores, 2 PetaBytes (July 2009) – 4800 cores, 12 PetaBytes (Sept 2009) • Hadoop Archival Store – 200 TB
  29. 33 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Facebook's usage of MySQL & Hadoop • Data warehouse system for Hadoop. • Facilitates easy data summarization. • Hive translates HiveQL to MapReduce code. • Querying o Provides a mechanism to project structure onto this data o Allows querying the data using a SQL-like language called HiveQL Hive
  30. 34 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010 Facebook's usage of MySQL & Hadoop
  31. 35 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Hive Vs SQL RDBMS HIVE Language SQL-92 standard (maybe) Subset of SQL-92 plus Hive- specific extension Update Capabilities INSERT, UPDATE and DELETE INSERT but not UPDATE or DELETE Transactions Yes No Latency Sub-Second Minutes or more Indexes Any number of indexes, very important for performance No indexes, data is always scanned (in parallel) Data size TBs PBs Data per query GBs PBs Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  32. 36 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Hadoop Implementation • > 12 terabytes of new data per day! • Most stored data is LZ0 compressed • Uses Scribe to write logs to Hadoop o Scribe: a log collection framework created and open- sourced by Facebook. • Hadoop used for data warehousing, data analysis. At Twitter
  33. 37 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. References • Leveraging Hadoop to Augment MySQL Deployments - Sarah Sproehnle, Cloudera • http://engineering.twitter.com/2010/04/hadoop-at-twitter.html • http://semanticvoid.com • http://michael-noll.com • http://hadoop.apache.org/
  34. 38 Copyright © 2012, Oracle and/or its affiliates. All rights

    reserved. Legal Disclaimer • All other products, company names, brand names, trademarks and logos are the property of their respective owners.