Elasticsearch, Hadoop, and Friends: Spark, Storm, and More

Elastic Co
March 10, 2015

A practical overview of using Elasticsearch within a Hadoop environment to perform real-time indexing, search, and data analysis.

In this session, Costin will take a deep dive into Elasticsearch for Apache Hadoop, showing off our rich integrations with the various Hadoop libraries, whether batch (Map/Reduce, Pig, Hive) or stream-oriented (such as Apache Spark). He'll also touch on YARN support and the HDFS snapshot/restore plugin.

Transcript

  1. Hadoop 0.20.x/1.x (CC-BY-ND 4.0)

     Compute: Map/Reduce Framework
     Storage: Hadoop Distributed File System (HDFS)
     Machines: Machine, Machine, Machine, Machine, Machine
  2. Hadoop 2.x / NextGen

     Compute: Map/Reduce Framework, Other
     Resource Mgmt.: Yet Another Resource Negotiator (YARN)
     Storage: Hadoop Distributed File System (HDFS)
     Machines: Machine, Machine, Machine, Machine, Machine
  3. Hadoop 2.x / NextGen

     Stack: Hadoop Distributed File System (HDFS), YARN, Map/Reduce, Other
  4. Partition-to-partition architecture

     Shard layout (P = primary, R = replica):
     Node1: 2P, 1R
     Node2: 1P, 3R
     Node3: 2R, 3P
  5. Dynamic runtime matching

     Node1: 2P, 1R
     Node2: 1P, 3R
     Node3: 2R, 3P
  6. Failure handling

     Node1: 2P, 1R
     Node2: 1P, 3R
     Node3: 2R, 3P
  7. Co-location

     Node1: 2P, 1R
     Node2: 1P, 3R
     Node3: 2R, 3P
  8. Native integration - Map / Reduce

     // Read from Elasticsearch
     JobConf conf = new JobConf();
     conf.setInputFormat(EsInputFormat.class);
     conf.set("es.resource", "radio/artists");
     conf.set("es.query", "?q=me*");
     JobClient.runJob(conf);

     // Write to Elasticsearch
     JobConf conf = new JobConf();
     conf.setOutputFormat(EsOutputFormat.class);
     conf.set("es.resource", "radio/artists");
     JobClient.runJob(conf);
  9. Native integration - Cascading

     // Read from Elasticsearch (local mode)
     Tap in = new EsTap("radio/artists", "?q=me*");
     Tap out = new StdOut(new TextLine());
     new LocalFlowConnector().
         connect(in, out, new Pipe("pipe")).complete();

     // Write to Elasticsearch (Hadoop mode)
     Tap in = new Lfs(new TextDelimited(
         new Fields("id", "name", "url", "picture")), "artists.dat");
     Tap out = new EsTap("radio/artists",
         new Fields("name", "url", "picture"));
     new HadoopFlowConnector().
         connect(in, out, new Pipe("pipe")).complete();
  10. Native integration - Apache Pig

      -- Read from Elasticsearch
      A = LOAD 'radio/artists' USING
          org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
      DUMP A;

      -- Write to Elasticsearch
      A = LOAD 'src/artists.dat' USING PigStorage() AS
          (id:long, name, url:chararray, picture:chararray);
      B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
      STORE B INTO 'radio/artists' USING
          org.elasticsearch.hadoop.pig.EsStorage();
  11. Native integration - Apache Hive

      -- Read from Elasticsearch
      CREATE EXTERNAL TABLE artists (
          id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>)
      STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
      TBLPROPERTIES('es.resource'='radio/artists', 'es.query'='?q=me*');

      SELECT * FROM artists;

      -- Write to Elasticsearch
      CREATE EXTERNAL TABLE artists (
          id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>)
      STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
      TBLPROPERTIES('es.resource'='radio/artists');

      INSERT OVERWRITE TABLE artists SELECT
          s.name, named_struct('url', s.url, 'picture', s.pic) FROM source s;
  12. Native integration - Apache Spark

      // Read from Elasticsearch
      import org.elasticsearch.spark._

      val sc = new SparkContext(new SparkConf())
      val rdd = sc.esRDD("radio/artists", "?q=me*")

      // Write to Elasticsearch
      import org.elasticsearch.spark._

      case class Artist(name: String, albums: Int)

      val u2 = Artist("U2", 12)
      val bh = Map("name" -> "Buckethead", "albums" -> 95, "age" -> 45)

      sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")
  13. Native integration - Spark SQL

      import org.elasticsearch.hadoop.mr._

      val conf = new Configuration()
      conf.set("es.resource", "radio/artists")
      conf.set("es.query", "?q=me*")

      // New (mapreduce) Hadoop API
      val mrNewApiRDD = sc.newAPIHadoopRDD(conf,
          classOf[EsInputFormat[Text, MapWritable]],
          classOf[Text], classOf[MapWritable])

      // Old (mapred) Hadoop API; hadoopRDD expects a JobConf
      val mrOldApiRDD = sc.hadoopRDD(new JobConf(conf),
          classOf[EsInputFormat[Text, MapWritable]],
          classOf[Text], classOf[MapWritable])
  14. Native integration - Spark SQL

      // Load an index as a DataFrame
      val sql = new SQLContext...
      val df = sql.load("radio/artists", "org.elasticsearch.spark.sql")
      df.filter(df("age") > 40)

      // Register an index as a temporary table
      val sql = new SQLContext...
      val table = sql.sql("CREATE TEMPORARY TABLE artists " +
          "USING org.elasticsearch.spark.sql " +
          "OPTIONS (resource 'radio/artists')")

      val names = sql.sql("SELECT name FROM artists")
  15. Native integration - Apache Storm

      // Write to Elasticsearch
      TopologyBuilder builder = new TopologyBuilder();
      builder.setBolt("esBolt", new EsBolt("twitter/tweets"));

      // Read from Elasticsearch
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("esSpout", new EsSpout("twitter/tweets", "?q=nfl*"), 5);
      builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("esSpout");
  16. YARN support – In Beta

      Run Elasticsearch on YARN*

      * YARN doesn't support long-lived services:
      •  No provisioning
      •  No IP/network guarantees
      •  Data/node affinity

      Next YARN releases plan to address this
  17. HDFS integration

      Snapshot / Restore
      •  Use HDFS as a shared storage
      •  Backup and recover data
      •  Works great with immutable snapshot data

      HDFS as a File-System – not recommended / tread carefully
      •  Incomplete FS semantics (last-delete-on-close, fsync)
      •  NFSv3 (metadata issues)
      •  See Elasticsearch issue #9072
  18. What's next

      Beta 1 - Apache Spark Java/Scala DSL
      Beta 2 - Apache Storm
      Beta 3 - YARN and SSL/TLS
      Beta 4 - Client-node routing, Spark Sources + Data Frame
      2.1 – in development
      2.2
      Marvel integration
      Machine Learning – MLlib
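
Slides 4 to 7 give only titles (partition-to-partition architecture, dynamic runtime matching, failure handling, co-location) over the same three-node shard layout. As a rough illustration of how those four ideas could fit together, the sketch below plans one read split per shard at run time, skips copies on failed nodes, and prefers a copy on the task's own node. It is a toy model in plain Java, not elasticsearch-hadoop code: `ShardSplitSketch`, `ShardCopy`, `planSplits`, and the scoring rule are hypothetical names and assumptions, and reading "2P" and "1R" as "primary of shard 2" and "replica of shard 1" is an interpretation of the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ShardSplitSketch {

    // Hypothetical model of one shard copy; not an elasticsearch-hadoop class.
    static final class ShardCopy {
        final int shard;
        final String node;
        final boolean primary;

        ShardCopy(int shard, String node, boolean primary) {
            this.shard = shard;
            this.node = node;
            this.primary = primary;
        }
    }

    // The layout drawn on slides 4 to 7: Node1 = 2P 1R, Node2 = 1P 3R, Node3 = 2R 3P.
    static List<ShardCopy> slideLayout() {
        List<ShardCopy> copies = new ArrayList<>();
        copies.add(new ShardCopy(2, "Node1", true));
        copies.add(new ShardCopy(1, "Node1", false));
        copies.add(new ShardCopy(1, "Node2", true));
        copies.add(new ShardCopy(3, "Node2", false));
        copies.add(new ShardCopy(2, "Node3", false));
        copies.add(new ShardCopy(3, "Node3", true));
        return copies;
    }

    // Assumed preference: a copy on the task's own node beats a remote primary,
    // and a primary beats a remote replica.
    static int score(ShardCopy copy, String localNode) {
        int score = 0;
        if (copy.node.equals(localNode)) score += 2;
        if (copy.primary) score += 1;
        return score;
    }

    // Plans one split per shard (shard id -> node to read from), computed from
    // whatever copies are alive at call time. A shard with no live copy is
    // simply absent from the plan.
    static Map<Integer, String> planSplits(List<ShardCopy> copies,
                                           Set<String> downNodes,
                                           String localNode) {
        Map<Integer, ShardCopy> best = new TreeMap<>();
        for (ShardCopy copy : copies) {
            if (downNodes.contains(copy.node)) continue; // failure handling
            ShardCopy current = best.get(copy.shard);
            if (current == null || score(copy, localNode) > score(current, localNode)) {
                best.put(copy.shard, copy);
            }
        }
        Map<Integer, String> plan = new TreeMap<>();
        for (Map.Entry<Integer, ShardCopy> e : best.entrySet()) {
            plan.put(e.getKey(), e.getValue().node);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<ShardCopy> layout = slideLayout();
        System.out.println(planSplits(layout, Set.of(), null));        // {1=Node2, 2=Node1, 3=Node3}
        System.out.println(planSplits(layout, Set.of("Node2"), null)); // {1=Node1, 2=Node1, 3=Node3}
        System.out.println(planSplits(layout, Set.of(), "Node3"));     // {1=Node2, 2=Node3, 3=Node3}
    }
}
```

With every node up and no locality preference, each shard is read from its primary. Marking Node2 as down moves shard 1 to its replica on Node1, and running the task on Node3 moves shard 2 to the local replica there. Whether the real connector prefers a local replica over a remote primary is not stated in the slides.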