Elasticsearch, Hadoop, and Friends: Spark, Storm, and More

Slide 1

Slide 1 text

Elasticsearch, Hadoop & Friends: Spark, Storm and more Costin Leau, @costinl

Slide 2

Slide 2 text

{ } CC-BY-ND 4.0 How to count words At scale; In real-time!

Slide 3

Slide 3 text


Slide 4

Slide 4 text

Hadoop

Slide 5

Slide 5 text

Hadoop 0.20.x/1.x
Compute: Map / Reduce Framework
Storage: Hadoop Distributed File System (HDFS)
Machine Machine Machine Machine Machine

Slide 6

Slide 6 text

Map / Reduce overview
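As a rough illustration of the word-count problem the talk opens with, the following is a minimal single-process sketch of the map and reduce phases in plain Java. It does not use the Hadoop API; the class and method names (WordCount, map, reduce, count) are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration; not part of Hadoop.
class WordCount {

    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum their counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run both phases over all input lines.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Prints {be=2, count=1, not=1, or=1, to=3, words=1}
        System.out.println(count(List.of("to be or not to be", "to count words")));
    }
}
```

In a real cluster the map calls run in parallel on different machines and the framework performs the grouping between the two phases; here both steps run in one process to keep the idea visible.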

Slide 7

Slide 7 text

Hadoop 0.20.x/1.x
Hadoop Distributed File System (HDFS)
Map / Reduce Framework

Slide 8

Slide 8 text

Hadoop 2.x / NextGen
Compute: Map / Reduce Framework, Other
Resource Mgmt.: Yet Another Resource Negotiator (YARN)
Storage: Hadoop Distributed File System (HDFS)
Machine Machine Machine Machine Machine

Slide 9

Slide 9 text

Hadoop 2.x / NextGen
Hadoop Distributed File System (HDFS)
YARN
Map / Reduce, Other

Slide 10

Slide 10 text

Elasticsearch Hadoop

Slide 11

Slide 11 text

Elasticsearch for Apache Hadoop™

Slide 12

Slide 12 text

Certified to work

Slide 13

Slide 13 text

Compute

Slide 14

Slide 14 text

Partition-to-partition architecture
Node1: 2P, 1R
Node2: 1P, 3R
Node3: 2R, 3P
(number = shard, P = primary, R = replica)
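The partition-to-partition idea can be sketched as follows: each Elasticsearch shard becomes one unit of parallel work on the Hadoop side. This is a plain-Java sketch that assumes one split per shard and simply reads from the first listed copy; the ShardSplits class and its methods are hypothetical and are not es-hadoop classes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; these are not es-hadoop classes.
class ShardSplits {

    // Shard layout from the slide: shard number -> nodes holding a copy,
    // primary first (shard 1 has its primary on Node2 and a replica on Node1).
    static Map<Integer, List<String>> layout() {
        Map<Integer, List<String>> copies = new LinkedHashMap<>();
        copies.put(1, List.of("Node2", "Node1"));
        copies.put(2, List.of("Node1", "Node3"));
        copies.put(3, List.of("Node3", "Node2"));
        return copies;
    }

    // Create exactly one split per shard (shard -> node to read from), so the
    // number of tasks follows the number of shards, not the number of nodes.
    // Assumption: the first listed copy is used.
    static Map<Integer, String> splits(Map<Integer, List<String>> copies) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> e : copies.entrySet()) {
            result.put(e.getKey(), e.getValue().get(0));
        }
        return result;
    }

    public static void main(String[] args) {
        splits(layout()).forEach((shard, node) ->
                System.out.println("split for shard " + shard + " -> " + node));
    }
}
```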

Slide 15

Slide 15 text

Dynamic runtime matching
Node1: 2P, 1R
Node2: 1P, 3R
Node3: 2R, 3P

Slide 16

Slide 16 text

Failure handling
Node1: 2P, 1R
Node2: 1P, 3R
Node3: 2R, 3P
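One way to picture failure handling is that a task whose target node is down retries against another copy of the same shard. The sketch below assumes exactly that simplified behaviour; FailoverSketch and pickLiveCopy are hypothetical names, not an es-hadoop API.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper for illustration; not an es-hadoop API.
class FailoverSketch {

    // Return the first node holding a copy of the shard that is not marked as failed.
    static Optional<String> pickLiveCopy(List<String> copies, Set<String> failedNodes) {
        return copies.stream()
                .filter(node -> !failedNodes.contains(node))
                .findFirst();
    }

    public static void main(String[] args) {
        // Shard 3 from the slide: primary on Node3, replica on Node2.
        List<String> shard3 = List.of("Node3", "Node2");
        System.out.println(pickLiveCopy(shard3, Set.of()));                 // Optional[Node3]
        System.out.println(pickLiveCopy(shard3, Set.of("Node3")));          // Optional[Node2]
        System.out.println(pickLiveCopy(shard3, Set.of("Node3", "Node2"))); // Optional.empty
    }
}
```

An empty result means every copy of the shard is unavailable, which is the point where a job would have to fail or wait.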

Slide 17

Slide 17 text

Co-location
Node1: 2P, 1R
Node2: 1P, 3R
Node3: 2R, 3P
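Co-location can be read as a preference for data that lives on the same machine as the task. A minimal sketch, assuming the task knows its own host name and the hosts of all copies of a shard; CoLocationSketch and chooseCopy are hypothetical names used only here.

```java
import java.util.List;

// Hypothetical helper for illustration; not an es-hadoop API.
class CoLocationSketch {

    // Prefer a shard copy on the task's own host; otherwise fall back to the first copy.
    static String chooseCopy(List<String> copies, String taskHost) {
        return copies.contains(taskHost) ? taskHost : copies.get(0);
    }

    public static void main(String[] args) {
        // Shard 1 from the slide: primary on Node2, replica on Node1.
        List<String> shard1 = List.of("Node2", "Node1");
        System.out.println(chooseCopy(shard1, "Node1")); // Node1 (local replica)
        System.out.println(chooseCopy(shard1, "Node3")); // Node2 (no local copy)
    }
}
```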

Slide 18

Slide 18 text

Native integration - Map / Reduce

    // Read from Elasticsearch
    JobConf conf = new JobConf();
    conf.setInputFormat(EsInputFormat.class);
    conf.set("es.resource", "radio/artists");
    conf.set("es.query", "?q=me*");
    JobClient.runJob(conf);

    // Write to Elasticsearch
    JobConf conf = new JobConf();
    conf.setOutputFormat(EsOutputFormat.class);
    conf.set("es.resource", "radio/artists");
    JobClient.runJob(conf);

Slide 19

Slide 19 text

Native integration - Cascading

    // Read from Elasticsearch and print to stdout
    Tap in = new EsTap("radio/artists", "?q=me*");
    Tap out = new StdOut(new TextLine());
    new LocalFlowConnector().connect(in, out, new Pipe("pipe")).complete();

    // Write a delimited file to Elasticsearch
    Tap in = new Lfs(new TextDelimited(
        new Fields("id", "name", "url", "picture")), "artists.dat");
    Tap out = new EsTap("radio/artists", new Fields("name", "url", "picture"));
    new HadoopFlowConnector().connect(in, out, new Pipe("pipe")).complete();

Slide 20

Slide 20 text

Native integration - Apache Pig

    -- Read from Elasticsearch
    A = LOAD 'radio/artists' USING
        org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
    DUMP A;

    -- Write to Elasticsearch
    A = LOAD 'src/artists.dat' USING PigStorage() AS
        (id:long, name, url:chararray, picture:chararray);
    B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
    STORE B INTO 'radio/artists' USING
        org.elasticsearch.hadoop.pig.EsStorage();

Slide 21

Slide 21 text

Native integration - Apache Hive

    -- Read from Elasticsearch
    CREATE EXTERNAL TABLE artists (
        id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');
    SELECT * FROM artists;

    -- Write to Elasticsearch
    CREATE EXTERNAL TABLE artists (
        id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'radio/artists');
    INSERT OVERWRITE TABLE artists
        SELECT s.name, named_struct('url', s.url, 'picture', s.pic)
        FROM source s;

Slide 22

Slide 22 text

Native integration - Apache Spark

    // Read from Elasticsearch
    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._
    val sc = new SparkContext(new SparkConf())
    val rdd = sc.esRDD("radio/artists", "?q=me*")

    // Write to Elasticsearch (case classes and Maps are both accepted)
    import org.elasticsearch.spark._
    case class Artist(name: String, albums: Int)
    val u2 = Artist("U2", 12)
    val bh = Map("name" -> "Buckethead", "albums" -> 95, "age" -> 45)
    sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")

Slide 23

Slide 23 text

Native integration - Apache Spark (Map / Reduce InputFormat)

    import org.apache.hadoop.io.{MapWritable, Text}
    import org.apache.hadoop.mapred.JobConf
    import org.elasticsearch.hadoop.mr._
    // JobConf extends Configuration, so it serves both Hadoop APIs
    val conf = new JobConf()
    conf.set("es.resource", "radio/artists")
    conf.set("es.query", "?q=me*")
    // New (mapreduce) API
    val mrNewApiRDD = sc.newAPIHadoopRDD(conf,
        classOf[EsInputFormat[Text, MapWritable]],
        classOf[Text], classOf[MapWritable])
    // Old (mapred) API
    val mrOldApiRDD = sc.hadoopRDD(conf,
        classOf[EsInputFormat[Text, MapWritable]],
        classOf[Text], classOf[MapWritable])

Slide 24

Slide 24 text

Native integration - Spark SQL

    // DataFrame API
    val sql = new SQLContext...
    val df = sql.load("radio/artists", "org.elasticsearch.spark.sql")
    df.filter(df("age") > 40)

    // SQL API
    val sql = new SQLContext...
    val table = sql.sql("CREATE TEMPORARY TABLE artists " +
        "USING org.elasticsearch.spark.sql " +
        "OPTIONS (resource 'radio/artists')")
    val names = sql.sql("SELECT name FROM artists")

Slide 25

Slide 25 text

Native integration - Apache Storm

    // Write to Elasticsearch
    TopologyBuilder builder = new TopologyBuilder();
    builder.setBolt("esBolt", new EsBolt("twitter/tweets"));

    // Read from Elasticsearch
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("esSpout", new EsSpout("twitter/tweets", "?q=nfl*"), 5);
    builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("esSpout");

Slide 26

Slide 26 text

Resource Management

Slide 27

Slide 27 text

YARN support – In Beta
Run Elasticsearch on YARN*
* YARN doesn't support long-lived services:
•  No provisioning
•  No IP/network guarantees
•  Data/node affinity
Upcoming YARN releases plan to address this

Slide 28

Slide 28 text

Storage

Slide 29

Slide 29 text

HDFS integration
Snapshot / Restore
•  Use HDFS as shared storage
•  Back up and recover data
•  Works well with immutable snapshot data
HDFS as a File-System – not recommended / tread carefully
•  Incomplete FS semantics (delete-on-last-close, fsync)
•  NFSv3 (metadata issues)
•  See Elasticsearch issue #9072

Slide 30

Slide 30 text

What’s next
Beta 1 - Apache Spark Java/Scala DSL
Beta 2 - Apache Storm
Beta 3 - YARN and SSL/TLS
Beta 4 - Client-node routing, Spark Sources + Data Frame
2.1 – in development
2.2
Marvel integration
Machine Learning – MLlib

Slide 31

Slide 31 text

Thank you! @costinl