Search in Hadoop

Doug Cutting @Cloudera, talk at Data Science London, June 12 2013

Data Science London

July 13, 2013

Transcript

  1. Why add Search?
     Easy
     • An interactive experience without technical knowledge
     • Single data set for multiple computing frameworks
     Fast
     • Exploratory analysis, esp. unstructured data
     • Broad range of indexing options to accommodate needs
     Efficient
     • Single scalable platform; no incremental investment
     • No need for separate systems, storage
     Experts know MapReduce. Savvy people know SQL. Everyone knows Search.
  2. How to add Search?
     An Integrated Part of the Hadoop System
     • One pool of data
     • One security framework
     • One set of system resources
     • One management interface
  3. Indexing
     • Apache Tika for format conversion
     • Additional file formats
       ◦ Sequence File
       ◦ Avro
     • Morphlines
       ◦ configuration-based transformation pipelines
     • Flume
     • MapReduce
     • HBase
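To make "configuration-based transformation pipelines" on slide 3 concrete, here is a minimal Python sketch of the general idea: an ordered list of configured commands, each transforming a record before it is handed to the indexer. It is not Morphlines syntax and does not use the Morphlines API; the command names, the `COMMANDS` registry, and the sample log line are all hypothetical.

```python
# Illustrative only: mimics the idea of a configuration-driven record
# pipeline. This is NOT Morphlines syntax or its API.

def split_line(record, field, separator, names):
    """Split record[field] on a separator and store the parts under `names`."""
    out = dict(record)
    out.update(zip(names, record[field].split(separator)))
    return out

def lowercase(record, field):
    """Lowercase a single field."""
    out = dict(record)
    out[field] = out[field].lower()
    return out

def drop_field(record, field):
    """Remove a field that should not reach the index."""
    out = dict(record)
    out.pop(field, None)
    return out

# Hypothetical registry of available commands.
COMMANDS = {"split_line": split_line, "lowercase": lowercase, "drop_field": drop_field}

# The "configuration": which commands run, in which order, with which parameters.
PIPELINE_CONFIG = [
    {"command": "split_line",
     "params": {"field": "message", "separator": " ", "names": ["level", "host", "text"]}},
    {"command": "lowercase", "params": {"field": "level"}},
    {"command": "drop_field", "params": {"field": "message"}},
]

def run_pipeline(config, record):
    """Apply each configured command to the record in order."""
    for step in config:
        record = COMMANDS[step["command"]](record, **step["params"])
    return record

doc = run_pipeline(PIPELINE_CONFIG, {"message": "WARN web01 disk-nearly-full"})
print(doc)
```

Changing what gets indexed then means editing `PIPELINE_CONFIG` rather than the code, which is the property the slide attributes to this style of pipeline.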
  4. Flume: Near Real Time Indexing at Ingest
     Log file events piped through a Flume hierarchy
     Diagram labels: Log File, Flume Agent, Indexer (each shown twice); HDFS; Other
  5. MapReduce: Scalable Batch Indexing
     Diagram stages:
     • Mapper: Parse input into indexable document (three mappers shown)
     • Arbitrary reducing steps of indexing and merging
     • End-Reducer (shard 1): Index document → Index shard 1
     • End-Reducer (shard 2): Index document → Index shard 2
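The batch flow on slide 5 (mappers parse input into documents, end-reducers each build one index shard) can be simulated in a few lines of plain Python. This is a single-process sketch under stated assumptions, not the actual Hadoop job: the tab-separated input format, the CRC-based partitioner, and the toy inverted index are all illustrative choices, and the intermediate "arbitrary reducing steps" are omitted.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 2  # matches the two shards drawn on the slide

def mapper(line):
    """Parse one raw line 'id<TAB>text' into an indexable document."""
    doc_id, text = line.split("\t", 1)
    return {"id": doc_id, "text": text}

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Stable partitioner: the same id is always routed to the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def end_reducer(docs):
    """Build one shard's toy inverted index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}

def batch_index(lines):
    """Map every line, shuffle documents by shard, then reduce each shard."""
    shuffled = defaultdict(list)
    for line in lines:
        doc = mapper(line)
        shuffled[shard_for(doc["id"])].append(doc)
    return {shard: end_reducer(shuffled[shard]) for shard in range(NUM_SHARDS)}

lines = ["a1\tHadoop search", "b2\tSolr search", "c3\tHadoop batch indexing"]
shards = batch_index(lines)
for shard, index in shards.items():
    print(shard, index)
```

Because each document lands in exactly one shard, the reducers never need to coordinate, which is what lets this kind of indexing scale out with the cluster.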
  6. MapReduce: Scalable Batch Indexing
     Diagram labels: HDFS; Files; Indexer; Index shard; Solr server; GOLIVE
  7. Feeding Search Indexes from HBase
     Using HBase replication for indexing triggering
     1. HBase Side Effect Proc. (SEP)
        • Indexing trigger mechanism
        • light-weight process
        • zero impact on write path
     2. HBase Indexer
        • Maps HBase row updates into Solr index updates
     Diagram labels: HBase; SEP + Indexer; Index (Solr)
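As a rough illustration of "maps HBase row updates into Solr index updates" on slide 7, the sketch below turns a row-level change event into a command in the shape accepted by Solr's JSON update handler (`add` / `delete`). The event dictionary, the `FIELD_MAPPING` table, and the function name are assumptions made for this example; the real HBase Indexer is configured differently and nothing here calls HBase or Solr.

```python
import json

# Hypothetical mapping from HBase "family:qualifier" columns to Solr fields.
FIELD_MAPPING = {"info:title": "title", "info:body": "body"}

def row_update_to_solr(event, mapping=FIELD_MAPPING):
    """Translate one row change event into a Solr JSON update command."""
    if event["type"] == "delete":
        # A deleted row removes the document with the same id from the index.
        return {"delete": {"id": event["row"]}}
    # A put becomes an add; the row key doubles as the Solr document id.
    doc = {"id": event["row"]}
    for column, value in event["columns"].items():
        if column in mapping:  # unmapped columns are simply not indexed
            doc[mapping[column]] = value
    return {"add": {"doc": doc}}

put = {"type": "put", "row": "row-17",
       "columns": {"info:title": "Search in Hadoop", "meta:ts": "1371000000"}}
delete = {"type": "delete", "row": "row-17"}

put_cmd = row_update_to_solr(put)
delete_cmd = row_update_to_solr(delete)
print(json.dumps(put_cmd))
print(json.dumps(delete_cmd))
```

In the design the slide describes, events like these arrive through HBase replication, so producing the index update happens off the write path.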
  8. Cloudera Manager: Simplified Management
     • Install, configure, deploy Solr services on the cluster
     • Unified management and monitoring
     • Over time, more and more advanced index management and monitoring
     • Shard resource utilization
     • Rule-based re-sharding
     • Social and statistical relevance optimizations
     • And much more…
  9. What is it?
     Interactive search for Hadoop
     • Full-text and faceted navigation
     • Batch, near real-time, and on-demand indexing
     Apache Solr integrated with CDH
     • Established, mature search with vibrant community
     • Separate runtime like MapReduce, Impala
     • Incorporated into the Hadoop platform
     Open Source
     • Apache licensed
     • Standard Solr APIs
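Since slide 9 mentions full-text search, faceted navigation, and standard Solr APIs, here is a small sketch of what such a request might look like. It only builds the URL for Solr's `select` handler using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters; the host, port, collection name `logs`, and field names are placeholders assumed for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

def build_facet_query(base_url, collection, text, facet_fields, rows=10):
    """Build a Solr select URL for a full-text query with field facets."""
    params = [
        ("q", text),        # full-text query string
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # ask for a JSON response
        ("facet", "true"),  # enable faceting
    ]
    # One facet.field parameter per field to count values for.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Placeholder host, collection, and fields.
url = build_facet_query("http://localhost:8983/solr", "logs",
                        "disk full", ["host", "level"])
print(url)
```

The facet counts returned alongside the hits are what a UI would use to offer drill-down links such as "by host" or "by level".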