What about Elasticsearch?
• Hadoop is just ETL
    • Batch-based
    • Used for preparing & loading data
    • Fails at data look-ups
• ES
    • Excels at search
    • Used as a “hot/fast” store
    • Maps nicely onto the Map/Reduce paradigm
A Holistic View of a Big Data System
[Diagram. Component labels:]
• ETL
• Real Time Streams
• Unstructured Data (HDFS)
• Real Time Structured Database (HBase, GemFire, Cassandra, ES, Mongo)
• Big SQL (Greenplum, AsterData, etc.)
• Batch Processing
• Real-Time Processing (S4, Storm)
• Analytics
Elasticsearch-Hadoop
• Read and write data to/from Hadoop
    • Input/Output Format
    • Cascading Tap/Sink
    • Hive StorageHandler
    • Pig Load/StoreFunc
Some numbers
• UFO sightings data set
    • National UFO Reporting Center (NUFORC), http://www.nuforc.org/
    • ~60000 entries / ~72 MB data
• Local Pig/Hive/Elasticsearch

Date Sighted | Date Reported | Location          | Shape     | Duration | Description
19980710     | 19980721      | Brick, NJ         | sphere    | 5 min    | On the evening of July 10, 1998, I was walking near my home...
19940815     | 19980625      | Rancho Mirage, CA | rectangle | 45 sec   | An extreemly close sighting of a ...
19970527     | 19981207      | Arlington, VA     | disk      | 4 sec    | As I was on my way home, ...
19980828     | 19980830      | Bend, OR          | fireball  | 20 sec   | Bright Blue (as that of an arc welder) light, that lit up ...
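To make the record layout concrete, here is a minimal Python sketch that parses rows like the ones above. It assumes the file is tab-separated with six fields per line (the default delimiter of PigStorage(), which the Pig slide uses to load the same data) and that dates are YYYYMMDD. The names `Sighting` and `parse_sightings` are hypothetical and exist only for this illustration.

```python
from datetime import date, datetime
from typing import Iterable, Iterator, NamedTuple


class Sighting(NamedTuple):
    sighted: date
    reported: date
    location: str
    shape: str
    duration: str
    description: str


def parse_yyyymmdd(value: str) -> date:
    """Parse a date written as YYYYMMDD, e.g. 19980710."""
    return datetime.strptime(value, "%Y%m%d").date()


def parse_sightings(lines: Iterable[str]) -> Iterator[Sighting]:
    """Yield one Sighting per well-formed line; skip malformed lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumption: tab-separated
        if len(fields) != 6:
            continue
        sighted, reported, location, shape, duration, description = fields
        try:
            yield Sighting(parse_yyyymmdd(sighted), parse_yyyymmdd(reported),
                           location, shape, duration, description)
        except ValueError:
            continue  # unparseable date: skip the row


# One row taken from the slide, joined with tabs.
SAMPLE = ("19980710\t19980721\tBrick, NJ\tsphere\t5 min\t"
          "On the evening of July 10, 1998, I was walking near my home...")

rows = list(parse_sightings([SAMPLE]))
print(rows[0].location, rows[0].shape, (rows[0].reported - rows[0].sighted).days)
```

Skipping malformed rows is a design choice for the sketch; a real load job might log or count them instead.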
Hive - Querying

Vanilla Hive (~55s)
    SELECT * FROM source WHERE description LIKE '%fast%';
    SELECT COUNT(*) FROM source WHERE description LIKE '%fast%';

Read ES as views (~15s)
    CREATE EXTERNAL TABLE ufos(... )
    STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
    TBLPROPERTIES('es.location' = 'ufo/sightings/_search?q=fast');
    SELECT * FROM ufos;
    SELECT COUNT(*) FROM ufos;
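The gap between the two timings comes from how each side finds "fast": the LIKE query scans every description, while the ES-backed table asks a pre-built index. The toy Python sketch below contrasts the two approaches. It is an illustration only, not how Hive or Elasticsearch are implemented; the sample documents and the function names are made up, and real Elasticsearch analysis (tokenizing, lowercasing, stemming) depends on the configured analyzer.

```python
import re
from collections import defaultdict

# Made-up descriptions, standing in for the UFO data set.
DOCS = [
    "A fast moving light crossed the sky",
    "Slow hovering disk over the lake",
    "It moved faster than any plane",
    "Bright fireball, very fast",
]


def scan(docs, needle):
    """Full scan, like LIKE '%needle%': substring test on every document."""
    return [i for i, text in enumerate(docs) if needle in text]


def build_index(docs):
    """Toy inverted index: lowercase term -> set of document ids."""
    index = defaultdict(set)
    for i, text in enumerate(docs):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(i)
    return index


def search(index, term):
    """Term lookup, like q=term: one dictionary access, no scan."""
    return sorted(index.get(term.lower(), ()))


index = build_index(DOCS)   # paid once, at indexing time
print(scan(DOCS, "fast"))   # [0, 2, 3]
print(search(index, "fast"))  # [0, 3]
```

Note that the two are not strictly equivalent: the substring scan also matches "faster", while the term lookup matches only the whole word, so result counts from the two query styles may differ.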
Pig
• Use ES to store and index data

    REGISTER /path_to_jar/es-hadoop-<version>.jar;
    %declare ESSTORAGE 'org.elasticsearch.hadoop.pig.ESStorage()'
    DEFINE geohash org.es.example.pig.GeoHash();
    data = LOAD 'ufo.dat' USING PigStorage() AS (sight:long, report:long, location, name, duration, description);
    sightings = FOREACH data GENERATE sight, report, geohash(location), name, duration, description;
    STORE sightings INTO 'ufo/sightings' USING $ESSTORAGE;

Vanilla Pig (~50s)
    fast_ufos = FILTER sightings BY (description MATCHES '.*fast.*');
    all_fast = GROUP fast_ufos ALL;
    total = FOREACH all_fast GENERATE COUNT(fast_ufos);

Search UFOs instantaneously (~8s)
    fast_ufos = LOAD 'ufo/sightings/_search?q=fast' USING $ESSTORAGE;
    total = LOAD 'ufo/sightings/_count?q=fast' USING $ESSTORAGE;
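The Pig script passes the location field through a `geohash` UDF, but the slides do not show what that UDF does internally. As a rough illustration only, the standard geohash encoding of a latitude/longitude pair can be sketched in Python as follows. `geohash_encode` is a hypothetical name, and how a place name such as "Brick, NJ" would first be turned into coordinates is not covered by the slides or by this sketch.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 12) -> str:
    """Encode a latitude/longitude pair as a geohash string.

    Bits alternate between longitude and latitude, starting with
    longitude; each bit halves the remaining interval. Every 5 bits
    become one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

Nearby points share a common prefix, which is what makes a geohash string convenient to index and filter as plain text.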
Go get it!
• elasticsearch.org/hadoop
    • docs
    • download: get the jar, add it to your job, that’s it
• github.com/elasticsearch/elasticsearch-hadoop
    • all things code related