Slide 1

Slide 1 text

Elasticsearch, Hadoop & Friends: Spark, Storm and more
Costin Leau, @costinl

Slide 2

Slide 2 text

{ } CC-BY-ND 4.0 How to count words At scale; In real-time!

Slide 3

Slide 3 text


Slide 4

Slide 4 text

Hadoop

Slide 5

Slide 5 text

Hadoop 0.20.x/1.x
Diagram: Compute = Map / Reduce Framework; Storage = Hadoop Distributed File System (HDFS); both layers span a row of five machines

Slide 6

Slide 6 text

Map / Reduce overview
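
As a rough illustration of the Map / Reduce flow applied to the word-counting example, here is a minimal single-JVM sketch in plain Java. It does not use the Hadoop API. The names WordCountSketch, map, shuffle, reduce and countWords are hypothetical and chosen only for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM illustration of the map -> shuffle -> reduce flow.
// This is not Hadoop code; it only mirrors the three phases.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run the three phases over all input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getValue()));
        }
        return totals;
    }
}
```

Calling WordCountSketch.countWords(Arrays.asList("to be or not to be")) groups the emitted pairs by word and sums them, so "to" and "be" each map to 2. In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network.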

Slide 7

Slide 7 text

Hadoop 0.20.x/1.x
Diagram: Map / Reduce Framework on top of Hadoop Distributed File System (HDFS)

Slide 8

Slide 8 text

Hadoop 2.x / NextGen
Diagram: Compute = Map / Reduce Framework, Other; Resource Mgmt. = Yet Another Resource Negotiator (YARN); Storage = Hadoop Distributed File System (HDFS); all layers span a row of five machines

Slide 9

Slide 9 text

Hadoop 2.x / NextGen
Diagram: Map / Reduce and Other on top of YARN, on top of Hadoop Distributed File System (HDFS)

Slide 10

Slide 10 text

Elasticsearch Hadoop

Slide 11

Slide 11 text

Elasticsearch for Apache Hadoop™

Slide 12

Slide 12 text

Certified to work

Slide 13

Slide 13 text

Compute

Slide 14

Slide 14 text

Partition-to-partition architecture
Diagram: Node1 (2P, 1R), Node2 (1P, 3R), Node3 (2R, 3P)
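
One way to picture the partition-to-partition idea: each Elasticsearch shard becomes one unit of parallel work on the compute side. The sketch below is a plain-Java model, not the connector's code. ShardSplitSketch, ShardCopy, slideLayout and splits are hypothetical names, and reading "2P" as "primary of shard 2" and "1R" as "replica of shard 1" is an assumption about the diagram.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: one compute-side split (task) per Elasticsearch shard.
class ShardSplitSketch {

    // One copy (primary or replica) of a shard, hosted on a node.
    static final class ShardCopy {
        final int shardId;
        final String node;
        final boolean primary;

        ShardCopy(int shardId, String node, boolean primary) {
            this.shardId = shardId;
            this.node = node;
            this.primary = primary;
        }
    }

    // Assumed reading of the slide: Node1 = 2P 1R, Node2 = 1P 3R, Node3 = 2R 3P.
    static List<ShardCopy> slideLayout() {
        List<ShardCopy> copies = new ArrayList<>();
        copies.add(new ShardCopy(2, "Node1", true));
        copies.add(new ShardCopy(1, "Node1", false));
        copies.add(new ShardCopy(1, "Node2", true));
        copies.add(new ShardCopy(3, "Node2", false));
        copies.add(new ShardCopy(2, "Node3", false));
        copies.add(new ShardCopy(3, "Node3", true));
        return copies;
    }

    // Group copies by shard id; each key stands for exactly one split.
    static Map<Integer, List<ShardCopy>> splits(List<ShardCopy> copies) {
        Map<Integer, List<ShardCopy>> byShard = new LinkedHashMap<>();
        for (ShardCopy copy : copies) {
            byShard.computeIfAbsent(copy.shardId, id -> new ArrayList<>()).add(copy);
        }
        return byShard;
    }
}
```

With the layout above, splits(slideLayout()) has three entries, one per shard, and each entry lists the two nodes that can serve that shard.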

Slide 15

Slide 15 text

Dynamic runtime matching
Diagram: Node1 (2P, 1R), Node2 (1P, 3R), Node3 (2R, 3P)

Slide 16

Slide 16 text

Failure handling
Diagram: Node1 (2P, 1R), Node2 (1P, 3R), Node3 (2R, 3P)
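
A possible reading of the failure-handling slide is that a task whose target node is down can be pointed at another copy of the same shard. The sketch below models only that selection step. ShardFailoverSketch and pickNode are hypothetical names, and the actual retry behaviour of the connector is not described on the slide, so treat this as an assumption.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical failover helper: choose a live node among those holding a shard copy.
class ShardFailoverSketch {

    // Returns the first node that holds a copy and is not known to have failed,
    // or Optional.empty() when every copy of the shard is unreachable.
    static Optional<String> pickNode(List<String> nodesWithCopy, Set<String> failedNodes) {
        for (String node : nodesWithCopy) {
            if (!failedNodes.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }
}
```

For example, if shard 2 has copies on Node1 and Node3 and Node1 is marked as failed, pickNode returns Node3.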

Slide 17

Slide 17 text

Co-location
Diagram: Node1 (2P, 1R), Node2 (1P, 3R), Node3 (2R, 3P)
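
Co-location suggests that a task should prefer a shard copy hosted on the machine where the task itself runs, so the data does not cross the network. The following is a minimal sketch of that preference, assuming the task knows its own node name. LocalityHintSketch and chooseNode are hypothetical names and not part of any library.

```java
import java.util.List;

// Hypothetical locality preference: read from the local copy when one exists.
class LocalityHintSketch {

    // Returns the task's own node if it hosts a copy of the shard,
    // otherwise falls back to the first node in the list.
    static String chooseNode(List<String> nodesWithCopy, String taskNode) {
        if (nodesWithCopy.isEmpty()) {
            throw new IllegalArgumentException("shard has no copies");
        }
        return nodesWithCopy.contains(taskNode) ? taskNode : nodesWithCopy.get(0);
    }
}
```

A task running on Node3 that needs a shard with copies on Node1 and Node3 would read locally from Node3. A task on Node2 would fall back to Node1.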

Slide 18

Slide 18 text

Native integration - Map / Reduce

    // Read: use Elasticsearch as the job input
    JobConf conf = new JobConf();
    conf.setInputFormat(EsInputFormat.class);
    conf.set("es.resource", "radio/artists");
    conf.set("es.query", "?q=me*");
    JobClient.runJob(conf);

    // Write: use Elasticsearch as the job output
    JobConf conf = new JobConf();
    conf.setOutputFormat(EsOutputFormat.class);
    conf.set("es.resource", "radio/artists");
    JobClient.runJob(conf);

Slide 19

Slide 19 text

Native integration - Cascading

    // Read: print documents matching the query (local mode)
    Tap in = new EsTap("radio/artists", "?q=me*");
    Tap out = new StdOut(new TextLine());
    new LocalFlowConnector().connect(in, out, new Pipe("pipe")).complete();

    // Write: index a delimited file into Elasticsearch (Hadoop mode)
    Tap in = new Lfs(new TextDelimited(
            new Fields("id", "name", "url", "picture")), "artists.dat");
    Tap out = new EsTap("radio/artists",
            new Fields("name", "url", "picture"));
    new HadoopFlowConnector().connect(in, out, new Pipe("pipe")).complete();

Slide 20

Slide 20 text

Native integration - Apache Pig

    -- Read: load documents matching the query
    A = LOAD 'radio/artists' USING
        org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
    DUMP A;

    -- Write: index a local file into Elasticsearch
    A = LOAD 'src/artists.dat' USING PigStorage() AS
        (id:long, name, url:chararray, picture:chararray);
    B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
    STORE B INTO 'radio/artists' USING
        org.elasticsearch.hadoop.pig.EsStorage();

Slide 21

Slide 21 text

Native integration - Apache Hive

    -- Read: external table backed by an Elasticsearch query
    CREATE EXTERNAL TABLE artists (
        id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource'='radio/artists', 'es.query'='?q=me*');

    SELECT * FROM artists;

    -- Write: external table backed by an Elasticsearch index
    CREATE EXTERNAL TABLE artists (
        id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource'='radio/artists');

    INSERT OVERWRITE TABLE artists SELECT
        s.name, named_struct('url', s.url, 'picture', s.pic) FROM source s;

Slide 22

Slide 22 text

Native integration - Apache Spark

    // Read: documents matching the query as an RDD
    import org.elasticsearch.spark._

    val sc = new SparkContext(new SparkConf())
    val rdd = sc.esRDD("radio/artists", "?q=me*")

    // Write: index case classes and maps
    import org.elasticsearch.spark._

    case class Artist(name: String, albums: Int)

    val u2 = Artist("U2", 12)
    val bh = Map("name" -> "Buckethead", "albums" -> 95, "age" -> 45)

    sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")

Slide 23

Slide 23 text

Native integration - Spark SQL

    import org.elasticsearch.hadoop.mr._

    val conf = new Configuration()
    conf.set("es.resource", "radio/artists")
    conf.set("es.query", "?q=me*")

    // New (mapreduce) Hadoop API
    val mrNewApiRDD = sc.newAPIHadoopRDD(conf,
        classOf[EsInputFormat[Text, MapWritable]],
        classOf[Text], classOf[MapWritable])

    // Old (mapred) Hadoop API; hadoopRDD expects a JobConf
    val mrOldApiRDD = sc.hadoopRDD(new JobConf(conf),
        classOf[EsInputFormat[Text, MapWritable]],
        classOf[Text], classOf[MapWritable])

Slide 24

Slide 24 text

Native integration - Spark SQL

    // DataFrame API
    val sql = new SQLContext...
    val df = sql.load("radio/artists", "org.elasticsearch.spark.sql")
    df.filter(df("age") > 40)

    // Declarative: register the index as a temporary table
    val sql = new SQLContext...
    val table = sql.sql("CREATE TEMPORARY TABLE artists " +
        "USING org.elasticsearch.spark.sql " +
        "OPTIONS (resource 'radio/artists')")

    val names = sql.sql("SELECT name FROM artists")

Slide 25

Slide 25 text

Native integration - Apache Storm

    // Write: a bolt that indexes tuples into Elasticsearch
    TopologyBuilder builder = new TopologyBuilder();
    builder.setBolt("esBolt", new EsBolt("twitter/tweets"));

    // Read: a spout that streams query results (parallelism hint of 5)
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("esSpout", new EsSpout("twitter/tweets", "?q=nfl*"), 5);
    builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("esSpout");

Slide 26

Slide 26 text

Resource Management

Slide 27

Slide 27 text

YARN support – In Beta
Run Elasticsearch on YARN*
* YARN doesn’t support long-lived services:
• No provisioning
• No IP/network guarantees
• Data/node affinity
Next YARN releases plan to address this

Slide 28

Slide 28 text

Storage

Slide 29

Slide 29 text

HDFS integration
Snapshot / Restore
• Use HDFS as shared storage
• Back up and recover data
• Works great with immutable snapshot data
HDFS as a File-System – not recommended / tread carefully
• Incomplete FS semantics (last-delete-on-close, fsync)
• NFSv3 (metadata issues)
• See Elasticsearch issue #9072

Slide 30

Slide 30 text

What’s next
Beta 1 - Apache Spark Java/Scala DSL
Beta 2 - Apache Storm
Beta 3 - YARN and SSL/TLS
Beta 4 - Client-node routing, Spark Sources + Data Frame
2.1 – in development
2.2
• Marvel integration
• Machine Learning – MLlib

Slide 31

Slide 31 text

Thank you! @costinl