
What's Happening in Hadoop and Spark

Elastic Co
February 18, 2016


An overview of the Elastic Hadoop ecosystem with a focus on what has happened over the last year with Elasticsearch for Apache Hadoop. You can expect to hear about Apache Storm, Apache Spark, DataFrames, and other Hadoop goodies.


Transcript

  1. 5 Hadoop overview

    Storage: Hadoop Distributed File System (HDFS) and other stores. Resource Management: YARN. Compute: Map/Reduce and other frameworks.
  2. 6 ES-Hadoop components

    Compute: Spark, Hive, Storm, M/R, Cascading, Pig running against ES. Resource Mgmt: ES on YARN. Storage: Snapshot/Restore on HDFS.
  3. 14 ES-Hadoop compute integrations

    Library / API → ES-Hadoop exposed as: Map/Reduce → Input/OutputFormat; Cascading → Tap/Sink; Apache Pig → Loader/Storage; Apache Hive → (EXTERNAL) TABLE; Apache Storm → Spout/Bolt; Apache Spark → RDD, DataFrame, DataSource.
  4. 15 Apache Hive

    CREATE EXTERNAL TABLE playlist (name STRING, year BIGINT) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource'='buckethead/albums','es.query'='?q=pikes'); SELECT name FROM playlist WHERE year > 2010; CREATE EXTERNAL TABLE playlist (name STRING, year BIGINT) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource'='buckethead/albums'); INSERT INTO TABLE playlist VALUES ('Buildor', 2016), ('Florrmat', 2016);
  5. 16 Query Analysis

    SELECT name FROM playlist WHERE year > 2010; { "name" : "shapeless", "year" : 2015 } Target fields: name. Filtering/Operating fields: year.
  6. 17 Query DSL Conversion SELECT name FROM playlist WHERE year

    > 2010; { "fields" : ["name"], "query" : { "range" : {"year" : { "gt" : "2010" }}} }
  7. 18 Query DSL Conversion

    SELECT name FROM playlist WHERE year > 2010; { "fields" : ["name"], "query" : { "range" : {"year" : { "gt" : "2010" }}} } Projection → "fields"; Push-down → "query".
  8. 19 Query optimization

    Where possible, convert the API query to an ES query and return only the results, without any intermediate data. Operation awareness by Library / API: Map/Reduce → does not apply; Cascading → Projection; Apache Pig → Projection; Apache Hive → Projection; Apache Storm → Projection; Apache Spark → Projection & Push-Down.
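The projection vs. push-down split above can be sketched as the request bodies the integrations end up sending (illustrative Python, not ES-Hadoop's actual code; the helper name is hypothetical, the field/query values come from the playlist example):

```python
def es_request(projection, query=None):
    """Build an Elasticsearch request body.

    Projection-only integrations (Cascading, Pig, Hive, Storm) ask ES to
    return just the needed fields; push-down (Spark) additionally converts
    the WHERE clause into Query DSL so ES does the filtering too.
    """
    body = {"fields": projection}
    if query is not None:
        body["query"] = query
    return body

# Projection only: every matching document comes back, filtering happens client-side.
projection_only = es_request(["name"])

# Projection + push-down: the range condition moves into the query itself.
push_down = es_request(["name"], {"range": {"year": {"gt": 2010}}})
```

With push-down, documents failing `year > 2010` never leave the cluster, which is the "no intermediate data" point from the slide.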
  9. 20 Apache Hive Queries

    Projection is supported, but Hive offers no push-down hooks for operators or type conversion.
  10. 21 Apache Hive - Trivia

    Versions 0.10 through 1.2.x supported •  0.13 broke bwc (HiveOutputFormat) •  0.14 broke bwc (removed interface SerDe) ‒  Also released with SNAPSHOT deps (HIVE-8857/8906) •  1.0-1.2 required a rewrite of the testing infrastructure Apache Tez •  Not 100% compatible with M/R jobs •  ES-Hadoop falls back / mimics the environment
  11. 23 ES-Hadoop native Spark integration

    Offers both Scala & Java APIs Understands Scala & Java types ‒  Case classes ‒  Java Beans Available as a Spark package Supports Spark Core & SQL, all 1.x versions (1.0-1.6) Available for Scala 2.10 and 2.11 spark-packages.org/package/elastic/elasticsearch-hadoop
  12. 24 Apache Spark – Resilient Distributed Dataset (RDD)

    import org.elasticsearch.spark._ val sc = new SparkContext() sc.esRDD("buckethead/albums", "?q=pikes") import org.elasticsearch.spark._ case class Album(name: String, year: Long) val lent = Map("name" -> "Celery", "year" -> 2014) val onMyDesk = Album("Electric Tears", 2002) sc.makeRDD(Seq(lent, onMyDesk)).saveToEs("buckethead/albums")
  13. 25 Apache Spark – JSON RDDs

    val jsonRDD: RDD[(String, String)] = sc.esJsonRDD("buckethead/albums", "?q=pikes") val p95 = """{"name" : "Hold Me Forever", "year" : 2014 }""" val p180 = """{"name" : "Heaven Is Your Home", "year" : 2015}""" sc.makeRDD(Seq(p95, p180)).saveJsonToEs("buckethead/albums")
  14. 26 Apache Spark SQL – DataFrames “Spark SQL is Spark’s

    module for working with structured data” RDD + schema = DataFrame (inspired by Python Pandas) Allows usage of SQL Integrates with Hive* * trivia – the project was initially based on Hive (Shark)
  15. 27 Apache Spark SQL Support

    import org.elasticsearch.spark.sql._ val df = sqlCtx.read.format("es").load("buckethead/albums") df.filter(df("category").equalTo("pikes").and(df("year").gt(2015))) CREATE TEMPORARY TABLE dfAsTable USING org.elasticsearch.spark.sql OPTIONS ('path' = 'buckethead/albums'); SELECT name FROM dfAsTable WHERE year > 2015 AND category = 'pikes'; val df = sqlContext.read.json("buckethead/2015/albums.json") df.saveToEs("buckethead/albums")
  16. 28 Spark SQL to Query DSL

    •  Example of translation df.filter(df("category").equalTo("pikes").and(df("year").geq(2015))) { "query" : { "bool" : { "must" : [ { "match" : { "category" : "pikes" } } ], "filter" : [ { "range" : { "year" : { "gte" : "2015" }}} ] }} }
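The translation above can be sketched as a toy mapper from Spark SQL filter names to Query DSL clauses (a hypothetical Python helper mirroring the slide's example, not ES-Hadoop's real translator):

```python
def filters_to_dsl(filters):
    """Translate (op, field, value) triples into a bool query, the way the
    slide shows EqualTo becoming a match clause and GreaterThanOrEqual a
    range filter."""
    must, filter_clauses = [], []
    for op, field, value in filters:
        if op == "EqualTo":
            must.append({"match": {field: value}})
        elif op == "GreaterThanOrEqual":
            filter_clauses.append({"range": {field: {"gte": value}}})
        else:
            raise ValueError("unsupported filter: " + op)
    return {"query": {"bool": {"must": must, "filter": filter_clauses}}}

dsl = filters_to_dsl([
    ("EqualTo", "category", "pikes"),
    ("GreaterThanOrEqual", "year", 2015),
])
```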
  17. 29 Advanced Spark SQL features in ES-Hadoop

    Spark SQL 1.3 - 1.6 (DataFrame) Spark SQL 1.1 - 1.2 (SchemaRDD) Supports all filters in Spark SQL: -  EqualTo/EqualNullSafe -  GreaterThan/GreaterThanOrEqual/LessThan/LessThanOrEqual -  In/IsNull/IsNotNull -  And/Or/Not -  StringStartsWith/StringEndsWith/StringContains
  18. 30 Advanced Spark SQL features in ES-Hadoop

    DataSource (Spark 1.3) DataSource Reader (Spark 1.4) DataSourceRegister (Spark 1.5) DataSource.unhandledFilters (Spark 1.6) - tells Spark that a certain filter is already handled (by ES/ES-Hadoop) val df = sql.load("spark/index", "org.elasticsearch.spark.sql") val df = sql.read.format("org.elasticsearch.spark.sql").load("spark/index") val df = sql.read.format("es").load("spark/index")
  19. 31 Advanced Mapping

    { "album" : { "mapping" : { "year" : "int", "name" : "string" }}} What is the type of field "year"? 1.  Int 2.  Array of Int 3.  Array of Array of Int
  20. 32 Advanced Mapping – Inferring schema

    { "album" : { "mapping" : { "year" : "int", "name" : "string" }}} What is the type of field "year"? 1.  Int 2.  Array of Int 3.  Array of Array of Int Any of the above. One can tell ES-Hadoop which fields are arrays (and their depth).
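The array hint above is expressed through job configuration; a sketch using ES-Hadoop's array settings (option name as in the ES-Hadoop configuration reference; the exact depth syntax may vary by release):

```properties
# Tell ES-Hadoop that "year" should be read back as an array
es.read.field.as.array.include = year

# Nested arrays declare their depth as a suffix, e.g. an array of arrays
es.read.field.as.array.include = year:2
```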
  21. 33 Advanced Mapping – Geo Types Elasticsearch supports two geo

    types: •  geo_point  –  4 formats •  geo_shape  –  9 shapes To avoid configuration madness, ES-Hadoop performs sampling "pin" : {"location" : { "lat" : 41.12, "lon" : -71.34 } } "pin" : { "location" : "41.12,-71.34" } "pin" : { "location" : "drm3btev3e86" } "pin" : { "location" : [-71.34, 41.12] }
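The sampling step can be pictured as a format check over the four geo_point shapes listed above (an illustrative Python helper, not the actual ES-Hadoop implementation):

```python
def geo_point_format(value):
    """Classify which of the four geo_point formats a sampled value uses:
    a lat/lon object, a "lat,lon" string, a geohash string, or a
    [lon, lat] array."""
    if isinstance(value, dict) and {"lat", "lon"} <= value.keys():
        return "object"
    if isinstance(value, (list, tuple)) and len(value) == 2:
        return "array"  # note the [lon, lat] ordering in this format
    if isinstance(value, str):
        # A comma separates lat,lon pairs; geohashes contain none.
        return "lat_lon_string" if "," in value else "geohash"
    raise ValueError("not a recognised geo_point format")

fmt = geo_point_format({"lat": 41.12, "lon": -71.34})
```

Sampling one document per field lets ES-Hadoop pick the right decoder without requiring the user to configure each of the 4 point formats (or 9 shapes) by hand.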
  22. 35 Advanced Spark – Dealing with parallelism mismatch

    INFO (sparkDriver-akka.actor.default-dispatcher-3) BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB) ERROR (Executor task launch worker-0) Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:509) org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427) org.apache.spark.storage.BlockManager.get(BlockManager.scala:615) org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154) Node1 2P 3R :( SPARK-6235
  23. 36 Advanced Spark – Dealing with parallelism mismatch

    Node1 2P 3R Sub-sharding: - a future workaround - done through scripts - has an impact on security - expensive
  24. 38 HDFS Repository Plugin Snapshot/Restore

    Use HDFS as shared storage to back up and recover data. Issues: the FileSystem API is not a file system; incomplete semantics (last-delete-on-close, fsync, atomic operations); NFSv3 bridge (not v4, metadata issues) https://github.com/elastic/elasticsearch/issues/9072
  25. 39 HDFS Repository Update Moved to ES Master Hadoop/HDFS 2

    only Backed by FileContext API •  Locked to hdfs:// only •  Atomic support •  Hsync support EnumSet<CreateFlag> flags = EnumSet.of(CreateFlag.CREATE, CreateFlag.SYNC_BLOCK); FSDataOutputStream stream = fileContext.create(blob, flags); // write stream stream.hsync(); https://github.com/elastic/elasticsearch/issues/15191
  26. 40 HDFS Repository - Security Lots of work to be

    done with security List of permissions at http://j.mp/hdfs-client-jvm-perms permission java.lang.RuntimePermission "getClassLoader"; // UserGroupInformation (UGI) Metrics clinit permission java.lang.RuntimePermission "accessDeclaredMembers"; permission java.lang.reflect.ReflectPermission "suppressAccessChecks"; // org.apache.hadoop.util.StringUtils clinit permission java.util.PropertyPermission "*", "read,write"; // org.apache.hadoop.util.ShutdownHookManager clinit permission java.lang.RuntimePermission "shutdownHooks"; // JAAS is used always, we use a fake subject, hurts nobody permission javax.security.auth.AuthPermission "getSubject"; permission javax.security.auth.AuthPermission "doAs"; permission javax.security.auth.AuthPermission "modifyPrivateCredentials";
  27. 44 YARN – In Beta $ hadoop jar elasticsearch-yarn-<version>.jar -start

    containers=2 Launched a 2 nodes Elasticsearch-YARN cluster [application_1415921358606_0006@http://hadoop:8088/proxy/ application_1415921358606_0006/] at Sun Feb 14 02:23:21 EET 2016   Run Elasticsearch on YARN* * YARN still doesn’t support long-lived services: •  No provisioning •  No ip/network guarantees •  Data/node affinity Next YARN releases plan to address this
  28. 49 Please attribute Elastic with a link to elastic.co Except

    where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/ Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.