Search in Hadoop

Doug Cutting
Chief Architect, Cloudera
Chairman, Apache Software Foundation

Why add Search?

Easy
• An interactive experience without technical knowledge
• Single data set for multiple computing frameworks

Fast
• Exploratory analysis, esp. unstructured data
• Broad range of indexing options to accommodate needs

Efficient
• Single scalable platform; no incremental investment
• No need for separate systems, storage

Experts know MapReduce. Savvy people know SQL. Everyone knows Search.

How to add Search? An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface

HDFS: Scalable and Robust Index Storage
[Diagram labels: HDFS, ZooKeeper, Lucene, Solr, SolrCloud, Querying API, Indexing API]

Indexing
• Apache Tika for format conversion
• Additional file formats
  ◦ Sequence File
  ◦ Avro
• Morphlines
  ◦ configuration-based transformation pipelines
• Flume
• MapReduce
• HBase
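The idea of a configuration-based transformation pipeline can be sketched in a few lines. This is a minimal illustration of the concept only, not Morphlines syntax or its API: the command names (`drop_if_missing`, `rename`, `lowercase`), the config layout, and the record shape are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
# A command takes a record and returns a new record, or None to drop it.
Command = Callable[[Record], Optional[Record]]


def drop_if_missing(field: str) -> Command:
    """Hypothetical command: drop records that lack a non-empty field."""
    def run(record: Record) -> Optional[Record]:
        return record if record.get(field) else None
    return run


def rename(src: str, dst: str) -> Command:
    """Hypothetical command: rename a field if it is present."""
    def run(record: Record) -> Optional[Record]:
        out = dict(record)
        if src in out:
            out[dst] = out.pop(src)
        return out
    return run


def lowercase(field: str) -> Command:
    """Hypothetical command: lowercase the value of a field."""
    def run(record: Record) -> Optional[Record]:
        out = dict(record)
        if field in out:
            out[field] = out[field].lower()
        return out
    return run


COMMANDS = {"drop_if_missing": drop_if_missing, "rename": rename, "lowercase": lowercase}


def build_pipeline(config: List[dict]) -> Command:
    """Turn a list of {"command": ..., "args": [...]} entries into one callable."""
    steps = [COMMANDS[step["command"]](*step.get("args", [])) for step in config]

    def run(record: Record) -> Optional[Record]:
        current: Optional[Record] = record
        for step in steps:
            current = step(current)
            if current is None:  # a step dropped the record
                return None
        return current
    return run


# The pipeline is described by data, not by code.
config = [
    {"command": "drop_if_missing", "args": ["msg"]},
    {"command": "rename", "args": ["msg", "text"]},
    {"command": "lowercase", "args": ["text"]},
]
pipeline = build_pipeline(config)
```

Because the steps live in `config`, the same code can be reused for a different input format by changing only the configuration.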

Flume: Near Real Time Indexing at Ingest
• Log file events piped through a Flume hierarchy
[Diagram labels: Log File, Flume Agent, Indexer; Other Log File, Flume Agent, Indexer; HDFS]

MapReduce: Scalable Batch Indexing
• Mapper: Parse input into indexable document (shown three times in the diagram)
• Arbitrary reducing steps of indexing and merging
• End-Reducer (shard 1): Index document (Index shard 1)
• End-Reducer (shard 2): Index document (Index shard 2)
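The batch flow on this slide can be sketched in memory. This is a toy model under stated assumptions, not the actual MapReduce indexer job: the function names, the whitespace tokenizer, and the CRC-based shard assignment are all hypothetical, a plain inverted index stands in for a Lucene index shard, and the intermediate merging steps are omitted.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 2  # the slide shows two shards


def map_parse(doc_id, raw):
    """Mapper: parse raw input into an indexable document."""
    return {"id": doc_id, "tokens": raw.lower().split()}


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Assign a document to a shard with a stable hash (assumed scheme)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def reduce_index(docs):
    """End-reducer: build the inverted index (token -> doc ids) for one shard."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["tokens"]:
            index[token].add(doc["id"])
    return dict(index)


def batch_index(raw_inputs, num_shards=NUM_SHARDS):
    """Run the map, shuffle-by-shard, and reduce phases over a dict of inputs."""
    grouped = defaultdict(list)
    for doc_id, raw in raw_inputs.items():
        doc = map_parse(doc_id, raw)                      # map phase
        grouped[shard_for(doc_id, num_shards)].append(doc)  # shuffle by shard
    # reduce phase: one index per shard, including empty shards
    return {shard: reduce_index(grouped[shard]) for shard in range(num_shards)}


shards = batch_index({"a": "Hello world", "b": "hello Hadoop"})
```

Each document lands in exactly one shard, so a query has to consult every shard and combine the results.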

MapReduce: Scalable Batch Indexing
[Diagram labels: HDFS, Files, Indexer, Index shard, Solr server, GOLIVE]

Feeding Search Indexes from HBase
Using HBase replication for indexing triggering

1. HBase Side Effect Proc. (SEP)
• Indexing trigger mechanism
• light-weight process
• zero impact on write path
2. HBase Indexer
• Maps HBase row updates into Solr index updates

[Diagram labels: HBase, SEP + Indexer, Index (Solr)]
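The mapping from an HBase row update to a Solr index update can be sketched as a pure function. This is an illustration under stated assumptions, not the HBase Indexer's actual configuration or code: the function name, the `field_mapping` layout, and the convention that `cells=None` means a deleted row are hypothetical. The returned dictionaries loosely follow the style of Solr's JSON update commands (`add` and `delete`).

```python
def row_update_to_solr_update(row_key, cells, field_mapping):
    """Translate one row update into one Solr-style update command.

    row_key:       HBase row key, used as the Solr document id (assumption).
    cells:         dict of "family:qualifier" -> value, or None if the row was deleted.
    field_mapping: dict of "family:qualifier" -> Solr field name (hypothetical layout).
    """
    if cells is None:
        # Row deleted in HBase: remove the matching document from the index.
        return {"delete": {"id": row_key}}

    doc = {"id": row_key}
    for column, solr_field in field_mapping.items():
        if column in cells:  # only mapped columns reach the index
            doc[solr_field] = cells[column]
    return {"add": {"doc": doc}}


mapping = {"info:title": "title", "info:body": "text"}
put_update = row_update_to_solr_update(
    "row-1", {"info:title": "Search in Hadoop", "meta:ts": "123"}, mapping
)
delete_update = row_update_to_solr_update("row-1", None, mapping)
```

Because the function only reads the replicated update, it runs outside the HBase write path, which matches the "zero impact on write path" point above.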

Customized Search in Hue

Cloudera Manager: Simplified Management
• Install, configure, deploy Solr services on the cluster
• Unified management and monitoring
• Over time, more and more advanced index management and monitoring
• Shard resource utilization
• Rule-based re-sharding
• Social and statistical relevance optimizations
• And much more…

What is it?

Interactive search for Hadoop
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing

Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated into the Hadoop platform

Open Source
• Apache licensed
• Standard Solr APIs
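Since the standard Solr APIs are exposed, a full-text query with faceting is an ordinary HTTP request to Solr's select handler. The sketch below only builds the request URL. The host, port, collection name (`logs`), and field name (`level`) are hypothetical; `q`, `rows`, `facet`, `facet.field`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, collection, query, facet_field, rows=10):
    """Build a Solr select URL for a full-text query with one facet field."""
    params = [
        ("q", query),                 # full-text query string
        ("rows", str(rows)),          # number of documents to return
        ("facet", "true"),            # turn faceting on
        ("facet.field", facet_field),  # field to count values for
        ("wt", "json"),               # ask for a JSON response
    ]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_facet_query("http://localhost:8983", "logs", "error timeout", "level")
```

Fetching this URL from a running Solr server would return matching documents plus per-value counts for the facet field, which is what drives faceted navigation in a search UI.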

@cutting

Data Science London
