Search in Hadoop - Speaker Deck

Slide 1

Slide 1 text

Search in Hadoop
Doug Cutting
Chief Architect, Cloudera
Chairman, Apache Software Foundation

Slide 2

Slide 2 text

Why add Search?

Easy
● An interactive experience without technical knowledge
● Single data set for multiple computing frameworks

Fast
● Exploratory analysis, esp. unstructured data
● Broad range of indexing options to accommodate needs

Efficient
● Single scalable platform; no incremental investment
● No need for separate systems, storage

Experts know MapReduce. Savvy people know SQL. Everyone knows Search.

Slide 3

Slide 3 text

How to add Search?

An Integrated Part of the Hadoop System
● One pool of data
● One security framework
● One set of system resources
● One management interface

Slide 4

Slide 4 text

HDFS: Scalable and Robust Index Storage

[Diagram labels: HDFS, ZooKeeper, Lucene, Solr, SolrCloud, Querying API, Indexing API]

Slide 5

Slide 5 text

Indexing
● Apache Tika for format conversion
● Additional file formats
  ○ Sequence File
  ○ Avro
● Morphlines
  ○ configuration-based transformation pipelines
● Flume
● MapReduce
● HBase
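To make "configuration-based transformation pipelines" concrete, here is a minimal Python sketch of the general idea: an ordered list of commands in a config, each applied to a record in turn. The command names (`split_line`, `lowercase`, `drop_field`) and the config layout are invented for illustration; this is not the Morphlines command set or its configuration syntax.

```python
# Hypothetical illustration of a configuration-driven transformation pipeline.
# Command names and config layout are assumptions made for this sketch only;
# they are NOT the Morphlines commands or configuration format.

def split_line(record, separator, fields):
    """Split record['message'] on a separator and store the parts as named fields."""
    values = record["message"].split(separator)
    return {**record, **dict(zip(fields, values))}

def lowercase(record, field):
    """Lowercase the value of one field."""
    return {**record, field: record[field].lower()}

def drop_field(record, field):
    """Remove one field from the record."""
    return {k: v for k, v in record.items() if k != field}

# Registry that maps command names used in the config to functions.
COMMANDS = {"split_line": split_line, "lowercase": lowercase, "drop_field": drop_field}

# The pipeline itself is data: changing the behavior means editing this list.
PIPELINE_CONFIG = [
    {"command": "split_line",
     "args": {"separator": "|", "fields": ["level", "host", "text"]}},
    {"command": "lowercase", "args": {"field": "level"}},
    {"command": "drop_field", "args": {"field": "message"}},
]

def run_pipeline(config, record):
    """Apply each configured command to the record, in order."""
    for step in config:
        record = COMMANDS[step["command"]](record, **step["args"])
    return record

doc = run_pipeline(PIPELINE_CONFIG, {"message": "WARN|node7|disk nearly full"})
print(doc)
```

Because the steps live in configuration rather than code, the same runner can in principle be reused for different input formats by swapping the config.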

Slide 6

Slide 6 text

Flume: Near Real Time Indexing at Ingest

Events piped through a Flume hierarchy

[Diagram labels: Log File, Flume Agent, Indexer, HDFS, Other]

Slide 7

Slide 7 text

MapReduce: Scalable Batch Indexing
● Mapper: Parse input into indexable document (three mappers shown)
● Arbitrary reducing steps of indexing and merging
● End-Reducer (shard 1): Index document, producing Index shard 1
● End-Reducer (shard 2): Index document, producing Index shard 2
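The mapper and end-reducer roles on this slide can be sketched in plain Python, without Hadoop. This is an illustration under stated assumptions: the tab-separated input format, the `crc32`-based shard assignment, and the function names are all hypothetical, and an in-memory inverted index stands in for a real Lucene/Solr index shard.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 2  # assumption: two shards, matching the two end-reducers on the slide

def map_parse(raw_line):
    """Mapper role: parse one raw input line into an indexable (doc_id, terms) pair.

    Assumes a hypothetical 'doc_id<TAB>text' input format.
    """
    doc_id, text = raw_line.split("\t", 1)
    return doc_id, text.lower().split()

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Partitioner role: pick a shard with a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def reduce_index(docs):
    """End-reducer role: build one shard's inverted index (term -> sorted doc ids)."""
    index = defaultdict(set)
    for doc_id, terms in docs:
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def batch_index(raw_lines):
    """Run map, partition, and reduce in sequence; returns {shard: index}."""
    by_shard = defaultdict(list)
    for line in raw_lines:
        doc_id, terms = map_parse(line)
        by_shard[shard_for(doc_id)].append((doc_id, terms))
    return {shard: reduce_index(docs) for shard, docs in by_shard.items()}

shards = batch_index([
    "d1\tHadoop search",
    "d2\tSearch in Solr",
    "d3\tHadoop and Solr",
])
for shard, index in sorted(shards.items()):
    print(shard, index)
```

Each document lands in exactly one shard, so a query has to consult every shard and merge the results; the sketch omits the intermediate merging steps the slide mentions.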

Slide 8

Slide 8 text

MapReduce: Scalable Batch Indexing

[Diagram labels: HDFS, Files, Indexer, Index shard, Solr server, GOLIVE]

Slide 9

Slide 9 text

Feeding Search Indexes from HBase

Using HBase replication for Indexing triggering

1. HBase Side Effect Proc. (SEP)
● Indexing trigger mechanism
● light-weight process
● zero impact on write path

2. HBase Indexer
● Maps HBase row updates into Solr index updates

[Diagram labels: HBase, SEP + Indexer, Index (Solr)]
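The idea of mapping HBase row updates into Solr index updates can be illustrated with a small Python sketch. The column names, the `FIELD_MAPPING` table, and both functions are hypothetical and do not reflect the HBase Indexer's actual configuration or API; the sketch only shows the shape of the translation.

```python
# Hypothetical mapping from HBase 'family:qualifier' columns to Solr field names.
# Both the columns and the field names are assumptions for this sketch.
FIELD_MAPPING = {
    "info:title": "title",
    "info:body": "body",
}

def row_update_to_solr_doc(row_key, changed_cells, mapping=FIELD_MAPPING):
    """Translate one row update into a Solr-style document keyed by the row key.

    Columns without a mapping entry are ignored, so unrelated writes
    do not leak into the index.
    """
    doc = {"id": row_key}
    for column, value in changed_cells.items():
        field = mapping.get(column)
        if field is not None:
            doc[field] = value
    return doc

def row_delete_to_solr_delete(row_key):
    """Translate a row deletion into a delete-by-id command body."""
    return {"delete": {"id": row_key}}

update = row_update_to_solr_doc(
    "row-42",
    {"info:title": "Search in Hadoop", "meta:checksum": "abc123"},
)
print(update)
print(row_delete_to_solr_delete("row-42"))
```

Since the trigger is replication-based according to the slide, a translation like this would run outside the write path, consuming replicated updates rather than intercepting writes.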

Slide 10

Slide 10 text

Customized Search in Hue

Slide 11

Slide 11 text

Cloudera Manager: Simplified Management
● Install, configure, deploy Solr services on the cluster
● Unified management and monitoring
● Over time, more and more advanced index management and monitoring
● Shard resource utilization
● Rule-based re-sharding
● Social and statistical relevance optimizations
● And much more…

Slide 12

Slide 12 text

What is it?

Interactive search for Hadoop
● Full-text and faceted navigation
● Batch, near real-time, and on-demand indexing

Apache Solr integrated with CDH
● Established, mature search with vibrant community
● Separate runtime like MapReduce, Impala
● Incorporated into the Hadoop platform

Open Source
● Apache licensed
● Standard Solr APIs
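Since the slide mentions standard Solr APIs and faceted navigation, here is a minimal Python sketch that builds a faceted full-text query URL for Solr's `select` handler using the common `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host, the collection name `logs`, and the field names `host` and `level` are assumptions for illustration; the sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode

def build_faceted_query(base_url, collection, text, facet_fields, rows=10):
    """Build a Solr select URL for a full-text query with field facets."""
    params = [
        ("q", text),        # full-text query string
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # response format
        ("facet", "true"),  # enable faceting
    ]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical host, collection, and field names.
url = build_faceted_query(
    "http://localhost:8983/solr", "logs", "disk full", ["host", "level"]
)
print(url)
```

The response to such a query would typically carry both the matching documents and per-field value counts, which is what drives faceted navigation in a search UI.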

Slide 13

Slide 13 text

@cutting