Real-Time Search and Big Data using Elasticsearch

Quick presentation on real-time search and Hadoop using Elasticsearch as presented at the October Elasticsearch User Group meet-up in Amsterdam:
http://www.meetup.com/ElasticSearch-NL/events/138812982/

Elasticsearch Inc

October 10, 2013

Transcript

  1. Real-Time Search & Big Data
    Costin Leau @costinl
    Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
  2. Hadoop as a Big Data Platform
    Hadoop Distributed File System (HDFS)
    Map Reduce Framework (MapRed)
  3. What about Elasticsearch?
    Hadoop is just ETL
      • Batch-based
      • Used for preparing & loading data
      • Fails at data look-ups
    ES
      • Excels at search
      • Used as a "hot/fast" store
      • Maps nicely onto the Map/Reduce paradigm
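
The contrast between batch look-ups and search can be sketched with a toy example. This is only an illustration of the general idea, not how Hadoop or Elasticsearch are implemented: the documents, the `scan`, `build_index` and `lookup` names, and the whitespace tokenizer are all assumptions made for the sketch.

```python
from collections import defaultdict


def scan(docs, term):
    """Batch-style look-up: read every document and test it for the term."""
    return [doc_id for doc_id, text in docs.items()
            if term in text.lower().split()]


def build_index(docs):
    """Index-time work: map each lowercased token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, term):
    """Query-time work: a single dictionary access instead of a full scan."""
    return sorted(index.get(term, ()))


# Made-up sample documents, for illustration only.
docs = {
    1: "Bright blue light moving fast",
    2: "Slow disk hovering",
    3: "Fast sphere over the bay",
}
index = build_index(docs)

print(scan(docs, "fast"))     # [1, 3]
print(lookup(index, "fast"))  # [1, 3]
```

The scan touches every document on every query, while the index pays that cost once up front and then answers each term query directly.
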
  4. A Holistic View of a Big Data System
    ETL
    Real Time Streams
    Unstructured Data (HDFS)
    Real Time Structured Database (HBase, GemFire, Cassandra, ES, Mongo)
    Big SQL (Greenplum, AsterData, etc.)
    Batch Processing
    Real-Time Processing (S4, Storm)
    Analytics
  6. Elasticsearch-Hadoop
    Read and write data to/from Hadoop
      • Input/Output Format
      • Cascading Tap/Sink
      • Hive StorageHandler
      • Pig Load/StoreFunc
  7. Some numbers
    UFO sightings data set
    National UFO Center http://www.nuforc.org/
    ~60000 entries / ~72 MB data
    Local Pig/Hive/Elasticsearch

    Date Sighted | Date Reported | Location          | Shape     | Duration | Description
    19980710     | 19980721      | Brick, NJ         | sphere    | 5 min    | On the evening of July 10, 1998, I was walking near my home...
    19940815     | 19980625      | Rancho Mirage, CA | rectangle | 45 sec   | An extreemly close sighting of a ...
    19970527     | 19981207      | Arlington, VA     | disk      | 4 sec    | As I was on my way home, ...
    19980828     | 19980830      | Bend, OR          | fireball  | 20 sec   | Bright Blue (as that of an arc welder) light, that lit up ...
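
A minimal sketch of reading one record of this data set, assuming the file is tab-separated (the default delimiter of the `PigStorage()` loader used later in the deck). The `FIELDS` names and the `parse_sighting` helper are hypothetical; the sample line is the first row shown on the slide.

```python
# Hypothetical field names, following the column headers on the slide.
FIELDS = ("sighted", "reported", "location", "shape", "duration", "description")


def parse_sighting(line):
    """Split one tab-separated line into a dict; the two dates become integers."""
    parts = line.rstrip("\n").split("\t", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["sighted"] = int(record["sighted"])
    record["reported"] = int(record["reported"])
    return record


sample = ("19980710\t19980721\tBrick, NJ\tsphere\t5 min\t"
          "On the evening of July 10, 1998, I was walking near my home...")
record = parse_sighting(sample)
print(record["location"], record["shape"], record["duration"])
```

Keeping the dates as integers mirrors the `BIGINT` / `long` column types used in the Hive and Pig scripts on the following slides.
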
  8. Hive - Loading
        -- ES (~22s): external table backed by Elasticsearch
        ADD JAR /path_to_jar/es-hadoop.jar;
        CREATE TEMPORARY FUNCTION geohash AS 'org.es.example.hive.GeoHash';
        CREATE EXTERNAL TABLE ufo (sight BIGINT, report BIGINT, location STRING,
                                   shape STRING, duration STRING, description STRING)
          STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
          TBLPROPERTIES('es.location' = 'ufo/sightings');
        INSERT OVERWRITE TABLE ufo
          SELECT s.sight, s.report, geohash(s.location), s.shape, s.duration, s.description
          FROM source s;
        -- Vanilla Hive (~3s/20s): plain Hive tables
        CREATE TABLE source (sight BIGINT, report BIGINT, location STRING,
                             shape STRING, duration STRING, description STRING);
        LOAD DATA INPATH 'ufo.dat' OVERWRITE INTO TABLE source;
        INSERT OVERWRITE TABLE geoSource
          SELECT s.sight, s.report, geohash(s.location), s.duration, s.description
          FROM source s;
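
The `geohash` function used in the loading script is a custom UDF whose source is not part of the deck. As a rough idea of what such a function computes, here is a sketch of the standard geohash encoding. It assumes the location has already been resolved to latitude/longitude, which the slide's UDF would have to do itself for names like "Brick, NJ"; `geohash_encode` and its `precision` parameter are names chosen for this sketch.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


# Widely used reference point (not from the deck): 57.64911, 10.40744
print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

Nearby points share a common prefix, which is what makes a geohash usable as an indexed term for coarse location look-ups.
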
  9. Hive - Querying
        -- Read ES as views (~15s)
        CREATE EXTERNAL TABLE ufo(... )
          STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
          TBLPROPERTIES('es.location' = 'ufo/sightings/_search?q=fast');
        SELECT * FROM ufo;
        SELECT COUNT(*) FROM ufo;
        -- Vanilla Hive (~55s)
        SELECT * FROM source WHERE description LIKE '%fast%';
        SELECT COUNT(*) FROM source WHERE description LIKE '%fast%';
  10. Pig
    • Use ES to store and index data
    Vanilla Pig (~50s) vs Search UFOs instantaneously (~8s)
        REGISTER /path_to_jar/es-hadoop-<version>.jar;
        %declare ESSTORAGE 'org.elasticsearch.hadoop.pig.ESStorage()'
        DEFINE geohash org.es.example.pig.GeoHash();
        data = LOAD 'ufo.dat' USING PigStorage()
               AS (sight:long, report:long, location, name, duration, description);
        sightings = FOREACH data GENERATE sight, report, geohash(location), name, duration, description;
        STORE sightings INTO 'ufo/sightings' USING $ESSTORAGE;
        -- Search through ES (~8s)
        fast_ufos = LOAD 'ufo/sightings/_search?q=fast' USING $ESSTORAGE;
        total = LOAD 'ufo/sightings/_count?q=fast' USING $ESSTORAGE;
        -- Vanilla Pig (~50s)
        fast_ufos = FILTER sightings BY (description MATCHES '.*fast.*');
        all_fast = GROUP fast_ufos ALL;
        total = FOREACH all_fast GENERATE COUNT(fast_ufos);
  11. Elasticsearch + Hadoop
    [Two bar charts, "Writing" and "Reading / Querying", each comparing "Raw" and "w/ ES" for M/R, Pig and Hive; axis from 0 to 60]
  12. Rich Querying
    • Find UFOs around Area 51 (~8s)
        area51UFO = LOAD 'ufo/sightings/_query?
            { "term" : { "pin.geohash" : geohash("area 51") } ...
            { "geo_distance" : { "distance" : "10km" } }
          USING $ESSTORAGE;
    • Vanilla Pig / Hive
        ???
  13. More than just fast text search…
    • Geo-location
    • Custom scoring
    • Terms
    • Analytics
  14. Partitioning
    [Diagram: node 1, node 2 and node 3 holding shards 1P, 2P, 3P and 1R, 2R, 3R; five markers labeled C]
  15. Dynamic splitting
    [Diagram: node 1, node 2 and node 3 holding shards 1P, 2P, 3P and 1R, 2R, 3R; seven markers labeled C]
  16. Co-location
    [Diagram: node 1, node 2 and node 3 holding shards 1P, 2P, 3P and 1R, 2R, 3R; seven markers labeled C]
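
One way to read the three diagrams above is that each shard becomes one unit of parallel work, and that each unit is scheduled on a node that already holds a copy of that shard. The sketch below illustrates that idea only. It is not the elasticsearch-hadoop implementation: the `plan_splits` helper, the least-loaded tie-breaking rule and the shard layout are assumptions, since the exact placement of shards on nodes cannot be recovered from the transcript.

```python
def plan_splits(shard_copies):
    """Assign each shard to one of the nodes holding a copy of it.

    shard_copies maps a shard id to the list of nodes that hold a copy
    (primary or replica). Returns a list of (shard_id, node) pairs, one
    per shard, preferring the node with the fewest assignments so far.
    """
    load = {}
    splits = []
    for shard_id in sorted(shard_copies):
        nodes = shard_copies[shard_id]
        node = min(nodes, key=lambda n: load.get(n, 0))  # first node wins ties
        load[node] = load.get(node, 0) + 1
        splits.append((shard_id, node))
    return splits


# Hypothetical layout: three shards, each with a primary and one replica.
layout = {
    1: ["node1", "node2"],
    2: ["node2", "node3"],
    3: ["node3", "node1"],
}
for shard_id, node in plan_splits(layout):
    print(f"shard {shard_id} -> task on {node}")
```

With this layout every shard is read on a node that stores it, and the three tasks end up spread across the three nodes.
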
  17. Road map
    • Snapshot/Restore on HDFS
    • Pushdown operation
    • HBase secondary indices*
  18. Go get it!
    • elasticsearch.org/hadoop
      • docs
      • download – get the jar, add it to your job, that's it
    • github.com/elasticsearch/elasticsearch-hadoop
      • all things code related