Alex Holmes
September 25, 2013

JavaOne 2013 Presentation: Next Generation Hadoop: It's Not Just Batch!

Hadoop is rapidly becoming the kernel for distributed computing. Hadoop is known as the de facto tool for anything related to batch processing, but the little-known secret is that in recent years it has become much more than batch. The introduction of tools and technologies such as HBase, Impala, and the next-generation MapReduce architecture has brought real-time capabilities to Hadoop, and it now offers a complete ecosystem that can be used to address a wide range of challenges. This presentation explores several real-world big data problems and identifies key parts of Hadoop that can be used to solve them. It also examines how real-time and batch processing can be fused to play to their strengths.


Transcript

  1. Who am I?
    • Alex Holmes
    • Software engineer
    • Working with Hadoop since 2008
    • @grep_alex / grepalex.com
  2. Never throw out your data
    • tweets, likes, upload video, photographs, web pages, deposits/withdrawals, reviews
    • clickstream
      - How are users responding to your search results?
    • web server logs
      - Who's hacking into your site?
      - What are legitimate users doing?
    • application logs
      - What's the state of your system?
      - Any errors being generated right now?
    • system logs
      - CPU/disk/network utilization
  3. Hadoop
    [Diagram labels: GFS (distributed storage); MapReduce (batch distributed compute); GFS / HDFS (distributed storage); Inputs, Map, Shuffle, Reduce, Outputs; PageRank, Trending Searches, Indexing, Analytics; Pig / Hive (MapReduce DSLs).]
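The Inputs, Map, Shuffle, Reduce, Outputs flow on this slide can be sketched in plain Java. This is a minimal in-memory illustration, not the Hadoop API: the class name `MiniMapReduce` and its methods are hypothetical, and the job (counting search terms) is chosen only to echo the "Trending Searches" label.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of map, shuffle and reduce (not the Hadoop API). */
final class MiniMapReduce {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group every emitted value under its key, with keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases over all input lines and returns word -> count. */
    static Map<String, Integer> run(List<String> inputs) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : inputs) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> outputs = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            outputs.put(group.getKey(), reduce(group.getValue()));
        }
        return outputs;
    }
}
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sketch only shows the data flow.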
  4. [Ecosystem: MapReduce, HDFS, Pig, Hive, Cascading, Crunch, Sqoop, RHIPE, Flume, HBase, Mahout, Solr, Cascalog, Hue, Ambari, WebHDFS, Oozie, Azkaban, Impala, Splunk, Blur]
  5. Use Cases
    • ETL
      • Complex - large data volumes, data adapters for disparate data, N-way joins, coordination
      • Hadoop provides a scalable architecture and rich ecosystem
    • Data Warehousing
      • System of record
      • HDFS is a scalable, fault-tolerant, tried and tested storage solution
    • Batch ingress/egress
    • Research/BI
  6. [Diagram: Hadoop 1 and Hadoop 2 stacks]
    • Hadoop 1: MapReduce (resource management and computational processing) on HDFS (storage)
    • Hadoop 2, the "Big Data Kernel": MapReduce, Tez, Giraph, Storm, ... on YARN (resource management) on HDFS (storage)
  7. [Diagram: YARN application flow]
    Components:
    • Client (submits work)
    • Resource Manager (resource arbiter)
    • Application Master (framework-specific resource negotiator)
    • Node Manager (container management)
    • Containers
    Steps:
    1. Submit application
    2. A new application master is created for each application
    3. Ask for containers: priority, hostname, resources, number of containers
    4. Create containers
    5. Application-specific communication
  8. Blueprints
    • NoSQL with HBase
    • Stream processing with Storm
    • Graph processing with Giraph
    • SQL-on-Hadoop with Impala
    • Columnar Data Formats
  9. HBase
    • Based on Google BigTable
    • Low-latency, persistent store
    • Distributed, sorted, multi-dimensional map
    • Massively parallel reads and writes
  10. HBase architecture
    [Diagram labels: slave node; HRegionServer with Regions and their HFiles; HDFS; MapReduce Mappers and Reducers; Client; Container; YARN Node Manager; Hoya Application Master; YARN Resource Manager.]
  11. Data model
    [Diagram: a table keyed by rowkey with two column families.]
    • Rowkeys: com.example.blog:/a/b/c, com.example.www:/index.html, com.twitter:/posts/bgates
    • Column family "raw": labels html, dirty; sample cells <!doctype ..., <html><bod..., <!DOCTYPE ..., 1
    • Column family "meta": content type (application/xhtml+xml, text/html, application/octet-stream, text/xml) and status code (200, 301, 202)
    • Rows are lexicographically sorted
    • Cells can have multiple "Versions"
    • Column families are stored in different files
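The "distributed, sorted, multi-dimensional map" idea can be modelled on one machine with nested sorted maps. The sketch below is an assumption-laden toy, not the HBase client API: the class `SortedCellMap`, its methods, and the "family:qualifier" column naming are all hypothetical. It only shows why lexicographically sorted rowkeys make prefix scans cheap and how a cell can hold several timestamped versions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical single-machine model of a sorted, versioned cell map (not the HBase API). */
final class SortedCellMap {

    // rowkey -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores one version of a cell; older versions of the same cell are kept. */
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        NavigableMap<Long, String> versions = row.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    /** Range scan: because rowkeys are sorted, all keys sharing a prefix are adjacent. */
    List<String> scanPrefix(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : rows.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break; // left the contiguous block of matching keys
            }
            keys.add(key);
        }
        return keys;
    }
}
```

With the slide's reversed-domain rowkeys, `scanPrefix("com.example.")` would return the blog and www rows together and skip com.twitter, which is the point of choosing rowkeys that sort related pages next to each other.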
  12. Indexing in MapReduce
    [Diagram labels: Internet; Crawler (download web content); HDFS (distributed storage); MapReduce (batch distributed compute); Indexing; Inverted index.]
    • Data is usually buffered and written by an intermediary
    • Latency involved in running MapReduce jobs
    • Crawl data and metadata only available in HDFS
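For readers unfamiliar with the term, an inverted index maps each term to the documents that contain it. A minimal sketch in plain Java follows; the class `InvertedIndexSketch`, the tokenization rule, and the page URLs in the usage note are assumptions for illustration, not part of the talk.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index builder (illustration only). */
final class InvertedIndexSketch {

    /**
     * Builds term -> sorted set of page URLs.
     * @param pages map of page URL to page text
     */
    static Map<String, SortedSet<String>> build(Map<String, String> pages) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            // Split on runs of non-word characters; a crude stand-in for real tokenization.
            for (String term : page.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<String>()).add(page.getKey());
            }
        }
        return index;
    }
}
```

Given two hypothetical pages, a.html containing "Hadoop is batch" and b.html containing "Hadoop is real-time", the entry for "hadoop" lists both pages and the entry for "batch" lists only a.html. In the MapReduce version the map step would emit (term, URL) pairs and the reduce step would collect the URLs per term.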
  13. Indexing using HBase
    [Diagram labels: Internet; Crawler (download web content); HBase (real-time column store); coprocessors; Inverted index; MapReduce (batch distributed compute); Analytics; Dashboard; Joins.]
  14. See also ...
    • Accumulo
      • NSA-developed secure BigTable derivative
      • Eventually open-sourced as an Apache project
      • Offers HDFS storage, cell-level security
    • ElephantDB
      • Developed by Nathan Marz to support his Lambda Architecture
      • A simple read-only KV store that sources data from HDFS
      • Only 3K lines of code
  15. Use cases
    • Capturing system metrics
    • User-interaction data (messages, impressions, clickstream, ...)
    • Facebook selected HBase for messaging and user analytics over Cassandra
    • Content serving
    • Google uses BigTable for GMail, analytics, personalized search
  16. Stream processing
    • Data enters our systems in real time, and we want to make real-time decisions (e.g. data aggregations, anomaly detection)
    • The approach used to be custom MQs and workers, which was complex and hard to maintain
    • Stream processing offers simple programming interfaces, with the framework taking care of fault tolerance and reliability
  17. Storm
    • Most mature of the systems available
    • Created by Twitter, maintained by Nathan Marz
    • Used at Twitter, Yahoo!, slew of others
    • https://github.com/nathanmarz/storm/wiki/Powered-By
  18. Trending Search
    [Diagram labels: Search App; Kafka; Storm; HBase. Sample search: "I saw a pussy cat I did, I did!"]
    • <spout>: Tokenize the search strings and emit a stream of words.
    • <bolt>: Maintain a sliding window of words and their occurrences.
    • <bolt>: Keep a top N list of word frequencies.
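The three stages on this slide (tokenize, sliding-window counts, top N) can be sketched without Storm. The class `TrendingWords` below is hypothetical and single-threaded; it does not use the Storm spout/bolt API. The window is modelled as a fixed number of slots, where the caller is assumed to invoke `advance()` on a timer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-threaded sketch of tokenize -> sliding window -> top N (not the Storm API). */
final class TrendingWords {

    private final int slots;
    private final Deque<Map<String, Integer>> window = new ArrayDeque<>();

    /** @param slots number of time slots the sliding window covers */
    TrendingWords(int slots) {
        this.slots = slots;
        window.addLast(new HashMap<String, Integer>());
    }

    /** Stage 1: split a search string into lower-cased words. */
    static List<String> tokenize(String search) {
        List<String> words = new ArrayList<>();
        for (String word : search.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Stage 2: count each word of a search in the newest slot. */
    void record(String search) {
        Map<String, Integer> current = window.peekLast();
        for (String word : tokenize(search)) {
            current.merge(word, 1, Integer::sum);
        }
    }

    /** Opens a new slot and drops the oldest one once the window is full. */
    void advance() {
        window.addLast(new HashMap<String, Integer>());
        if (window.size() > slots) {
            window.removeFirst();
        }
    }

    /** Word counts summed over every slot still inside the window. */
    Map<String, Integer> totals() {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> slot : window) {
            for (Map.Entry<String, Integer> entry : slot.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    /** Stage 3: the n most frequent words, ties broken alphabetically. */
    List<String> topN(int n) {
        final Map<String, Integer> totals = totals();
        List<String> words = new ArrayList<>(totals.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(totals.get(b), totals.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }
}
```

In a Storm topology each stage would be a separate component running in parallel, with the framework handling delivery and failure; the sketch keeps only the counting logic.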
  19. Storm-on-YARN
    • Yahoo uses Storm for ad targeting, fraud detection, trending topics
    • Wanted to integrate it into existing YARN infrastructure, and use HDFS as a data source and sink
    • Added 2 features missing in Storm
      • Auto-scaling for load balancing
      • Security to allow Storm to access secure HDFS data
  20. See also ...
    • Samza
      • Developed by LinkedIn
      • Leverages Kafka for reliable messaging
    • Morphlines
      • A library for streaming ETL
      • Flume, HBase and MapReduce implementations
      • Can use SolrCloud as a data sink
  21. Use cases
    • Trending topics (e.g. Google Zeitgeist, Twitter trending hashtags)
    • Analytics aggregations (e.g. Ad providers)
    • Image processing (e.g. panorama image generation in Google Street View)
  22. PageRank
    [Diagram: a graph of six "www" pages processed by chained Map and Reduce stages, separated by write barriers.]
  23. Giraph
    • Inspired by Google's Pregel, a graph processing architecture
    • Based on the Bulk Synchronous Parallel model of distributed computing
    • Runs on MapReduce v1 and YARN
  24. MapReduce vs. Giraph
    [Diagram: MapReduce shown as chained Map and Reduce stages separated by write barriers; Giraph shown as Map tasks with a sync barrier.]
  25. PageRank algorithm

        public void compute(Iterable<DoubleWritable> messages) {
          double pageRank;
          if (getSuperstep() == 0) {
            // First superstep: every vertex starts with an equal share of rank.
            pageRank = 1.0 / getTotalNumVertices();
          } else {
            // Later supersteps: combine the rank received from incoming edges.
            double rankFromNeighbors = MathUtils.sum(messages);
            double dampingFactor = ((1.0 - DAMPING_FACTOR) / (double) getTotalNumVertices());
            pageRank = dampingFactor + (DAMPING_FACTOR * rankFromNeighbors);
          }
          setValue(pageRank);
          // Split this vertex's rank evenly across its outgoing edges.
          for (Edge<LongWritable, FloatWritable> edge : getEdges()) {
            sendMessage(edge.getTargetVertexId(), new DoubleWritable(pageRank / getNumEdges()));
          }
        }

    [Diagram: a graph of six "www" pages.]
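To see how the per-vertex compute() above behaves across supersteps, the same arithmetic can be run on a tiny graph in plain Java. This is a single-process sketch, not Giraph: the class `BspPageRank`, the adjacency-array input, and the DAMPING_FACTOR value of 0.85 are assumptions (the slide does not state the constant). Swapping the outbox into the inbox stands in for the sync barrier.

```java
/** Hypothetical single-process simulation of the slide's vertex program (not Giraph). */
final class BspPageRank {

    // Assumed value; the slide does not say what DAMPING_FACTOR is set to.
    static final double DAMPING_FACTOR = 0.85;

    /**
     * @param edges      edges[v] lists the target vertices of v's outgoing edges
     * @param supersteps number of supersteps to run (at least 1)
     * @return the rank of each vertex after the last superstep
     */
    static double[] run(int[][] edges, int supersteps) {
        int n = edges.length;
        double[] rank = new double[n];
        double[] inbox = new double[n]; // sum of the messages delivered to each vertex
        for (int step = 0; step < supersteps; step++) {
            double[] outbox = new double[n];
            for (int v = 0; v < n; v++) {
                if (step == 0) {
                    rank[v] = 1.0 / n;
                } else {
                    rank[v] = (1.0 - DAMPING_FACTOR) / n + DAMPING_FACTOR * inbox[v];
                }
                // Send an equal share of this vertex's rank along each outgoing edge.
                for (int target : edges[v]) {
                    outbox[target] += rank[v] / edges[v].length;
                }
            }
            // Sync barrier: messages sent in this superstep are only visible in the next one.
            inbox = outbox;
        }
        return rank;
    }
}
```

Like the slide's code, the sketch ignores vertices with no outgoing edges, so rank held by such vertices is not redistributed.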
  26. Use Cases
    • Analyze user social graphs (popularity, personalized rankings, shared connections, shortest path)
    • Web graphs - PageRank and variants
    • Networking/transportation (shortest path)
  27. Impala
    • Cloudera’s implementation of Google Dremel
    • Interactive SQL on HDFS and HBase
    • Written in C++; up to 100x faster than Hive
  28. Impala
    [Diagram: a Client and three impalad nodes, each with a Query Planner, Query Coordinator and Query Executor on top of HDFS / HBase.]
    1. Submit query
    2. Push plan fragments
    3. MPP distributed
    4. Stream intermediary results
    5. Stream results
  29. Use Cases
    • Data science research
    • Exposing data to organization
    • Find the next killer feature in your data!
  30. Parquet and ORC File
    • Both support repeating nested data
    • And bit packing
    • And run length encoding
    • And dictionary encoding
    • And compression
    • And are extensible
    [Illustration, run length encoding: aaaaaabbbccccc becomes 6a3b5c]
    [Illustration, bit packing: the integers 0 to 7 stored as 32-bit values (e.g. 00000000 00000000 00000000 00000001 for 1) pack into three bytes: 00000101 00111001 01110111]
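The two illustrations on this slide can be reproduced with a few lines of Java. This is a sketch of the general techniques only: the class `ColumnEncodings` is hypothetical, and the most-significant-bit-first packing order is chosen to match the slide's picture, not taken from the on-disk layout of either Parquet or ORC.

```java
/** Hypothetical illustration of run length encoding and bit packing (not a file-format implementation). */
final class ColumnEncodings {

    /** Run length encoding: each run of a repeated character becomes count + character. */
    static String runLengthEncode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int run = 1;
            while (i + run < input.length() && input.charAt(i + run) == c) {
                run++;
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    /**
     * Bit packing: stores each value in bitWidth bits instead of a full 32-bit int.
     * Bits are written most significant first; unused trailing bits stay zero.
     * Values are assumed to fit in bitWidth bits.
     */
    static byte[] bitPack(int[] values, int bitWidth) {
        int totalBits = values.length * bitWidth;
        byte[] out = new byte[(totalBits + 7) / 8];
        int bitPos = 0;
        for (int value : values) {
            for (int b = bitWidth - 1; b >= 0; b--) {
                if (((value >>> b) & 1) == 1) {
                    out[bitPos / 8] |= (byte) (0x80 >>> (bitPos % 8));
                }
                bitPos++;
            }
        }
        return out;
    }
}
```

`runLengthEncode("aaaaaabbbccccc")` returns "6a3b5c", and `bitPack(new int[] {0, 1, 2, 3, 4, 5, 6, 7}, 3)` returns the three bytes 00000101 00111001 01110111, so eight values that would occupy 32 bytes as ints fit in 3.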
  31. Mexican Standoff between Parquet and ORC File
    [Image credit: Martin SoulStealer, http://en.wikipedia.org/wiki/File:Mexican_Standoff.jpg]
    • Compatible systems - ORC supports Hive; Parquet supports MR, Impala, Hive, Pig - winner Parquet
    • Query optimization - ORC has stripe metadata for min/max/sum/count for each column - winner ORC
    • Parquet is on version 1.0, already deployed in production @ Twitter with 30% space savings - winner Parquet
  32. Conclusion
    • MapReduce still relevant for ETL, data warehousing, bulk ingest/egress
    • With Hadoop 2 we have a Big Data Kernel, where multiple systems can use HDFS as common storage, and YARN for resource management
    • Fusion of batch + real-time workloads in the same cluster
  33. What else?
    • Hadoop 2.2 (GA) expected in a few weeks
    • More apps coming soon on YARN (e.g. Tez)
    • Mesos
    • Lambda Architecture + Summingbird
    • Spark, Spark Streaming, Shark, GraphX, ...