
What's Happening in Hadoop and Spark

Elastic Co
February 18, 2016


An overview of the Elastic Hadoop ecosystem with a focus on what has happened over the last year with Elasticsearch for Apache Hadoop. You can expect to hear about Apache Storm, Apache Spark, DataFrames, and other Hadoop goodies.


Transcript

  1. 5 Hadoop overview

    Storage: Hadoop Distributed File System (HDFS) and other stores. Resource Management: YARN. Compute: Map/Reduce and other frameworks.
  2. 6 ES-Hadoop components

    Compute: Spark, Hive, Storm, M/R, Cascading, Pig running against ES. Resource Mgmt: ES on YARN. Storage: Snapshot/Restore on HDFS.
  3. 14 ES-Hadoop compute integrations

    Library / API → ES-Hadoop exposed as: Map/Reduce → Input/OutputFormat; Cascading → Tap/Sink; Apache Pig → Loader/Storage; Apache Hive → (EXTERNAL) TABLE; Apache Storm → Spout/Bolt; Apache Spark → RDD, DataFrame, DataSource.
  4. 15 Apache Hive

    CREATE EXTERNAL TABLE playlist (name STRING, year BIGINT) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource'='buckethead/albums','es.query'='?q=pikes'); SELECT name FROM playlist WHERE year > 2010; CREATE EXTERNAL TABLE playlist (name STRING, year BIGINT) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource'='buckethead/albums'); INSERT INTO TABLE playlist VALUES ('Buildor', 2016), ('Florrmat', 2016);
  5. 16 Query Analysis

    SELECT name FROM playlist WHERE year > 2010; { "name" : "shapeless", "year" : 2015 } Target fields: name. Filtering/Operating fields: year.
  6. 17 Query DSL Conversion SELECT name FROM playlist WHERE year

    > 2010; { "fields" : ["name"], "query" : { "range" : {"year" : { "gt" : "2010" }}} }
  7. 18 Query DSL Conversion

    SELECT name FROM playlist WHERE year > 2010; { "fields" : ["name"], "query" : { "range" : {"year" : { "gt" : "2010" }}} } Projection → "fields"; Push-down → "query".
  8. 19 Query optimization

    Where possible, convert the API query to an ES query and return only the results, without any intermediate data. Operation awareness by Library / API: Map/Reduce → does not apply; Cascading → Projection; Apache Pig → Projection; Apache Hive → Projection; Apache Storm → Projection; Apache Spark → Projection & Push-Down.
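The projection vs. push-down split above can be sketched as the request bodies the integrations end up sending (illustrative Python, not ES-Hadoop's actual code; the helper name is hypothetical, the field/query values come from the playlist example):

```python
def es_request(projection, query=None):
    """Build an Elasticsearch request body.

    Projection-only integrations (Cascading, Pig, Hive, Storm) ask ES to
    return just the needed fields; push-down (Spark) additionally converts
    the WHERE clause into Query DSL so ES does the filtering too.
    """
    body = {"fields": projection}
    if query is not None:
        body["query"] = query
    return body

# Projection only: every matching document comes back, filtering happens client-side.
projection_only = es_request(["name"])

# Projection + push-down: the range condition moves into the query itself.
push_down = es_request(["name"], {"range": {"year": {"gt": 2010}}})
```

With push-down, documents failing `year > 2010` never leave the cluster, which is the "no intermediate data" point from the slide.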
  9. 20 Apache Hive Queries

    Projection is supported, but Hive offers no push-down hooks for operators or type conversion.
  10. 21 Apache Hive - Trivia

    Versions 0.10 through 1.2.x supported •  0.13 broke bwc (HiveOutputFormat) •  0.14 broke bwc (removed interface SerDe) ‒  Also released with SNAPSHOT deps (HIVE-8857/8906) •  1.0-1.2 required a rewrite of the testing infrastructure Apache Tez •  Not 100% compatible with M/R jobs •  ES-Hadoop falls back / mimics the environment
  11. 23 ES-Hadoop native Spark integration

    Offers both Scala & Java APIs Understands Scala & Java types ‒  Case classes ‒  Java Beans Available as a Spark package Supports Spark Core & SQL, all 1.x versions (1.0-1.6) Available for Scala 2.10 and 2.11 spark-packages.org/package/elastic/elasticsearch-hadoop
  12. 24 Apache Spark – Resilient Distributed Dataset (RDD)

    import org.elasticsearch.spark._ val sc = new SparkContext() sc.esRDD("buckethead/albums", "?q=pikes") import org.elasticsearch.spark._ case class Album(name: String, year: Long) val lent = Map("name" -> "Celery", "year" -> 2014) val onMyDesk = Album("Electric Tears", 2002) sc.makeRDD(Seq(lent, onMyDesk)).saveToEs("buckethead/albums")
  13. 25 Apache Spark – JSON RDDs

    val jsonRDD: RDD[(String, String)] = sc.esJsonRDD("buckethead/albums", "?q=pikes") val p95 = """{"name" : "Hold Me Forever", "year" : 2014 }""" val p180 = """{"name" : "Heaven Is Your Home", "year" : 2015}""" sc.makeRDD(Seq(p95, p180)).saveJsonToEs("buckethead/albums")
  14. 26 Apache Spark SQL – DataFrames “Spark SQL is Spark’s

    module for working with structured data” RDD + schema = DataFrame (inspired by Python Pandas) Allows usage of SQL Integrates with Hive* * trivia – the project was initially based on Hive (Shark)
  15. 27 Apache Spark SQL Support

    import org.elasticsearch.spark.sql._ val df = sqlCtx.read.format("es").load("buckethead/albums") df.filter(df("category").equalTo("pikes").and(df("year").gt(2015))) CREATE TEMPORARY TABLE dfAsTable USING org.elasticsearch.spark.sql OPTIONS ('path' = 'buckethead/albums'); SELECT name FROM dfAsTable WHERE year > 2015 AND category = 'pikes'; val df = sqlContext.read.json("buckethead/2015/albums.json") df.saveToEs("buckethead/albums")
  16. 28 Spark SQL to Query DSL

    •  Example of translation df.filter(df("category").equalTo("pikes").and(df("year").geq(2015))) { "query" : { "bool" : { "must" : [ { "match" : { "category" : "pikes" } } ], "filter" : [ { "range" : { "year" : { "gte" : "2015" }}} ] }} }
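The translation above can be sketched as a toy mapper from Spark SQL filter names to Query DSL clauses (a hypothetical Python helper mirroring the slide's example, not ES-Hadoop's real translator):

```python
def filters_to_dsl(filters):
    """Translate (op, field, value) triples into a bool query, the way the
    slide shows EqualTo becoming a match clause and GreaterThanOrEqual a
    range filter."""
    must, filter_clauses = [], []
    for op, field, value in filters:
        if op == "EqualTo":
            must.append({"match": {field: value}})
        elif op == "GreaterThanOrEqual":
            filter_clauses.append({"range": {field: {"gte": value}}})
        else:
            raise ValueError("unsupported filter: " + op)
    return {"query": {"bool": {"must": must, "filter": filter_clauses}}}

dsl = filters_to_dsl([
    ("EqualTo", "category", "pikes"),
    ("GreaterThanOrEqual", "year", 2015),
])
```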
  17. 29 Advanced Spark SQL features in ES-Hadoop

    Spark SQL 1.3 - 1.6 (DataFrame) Spark SQL 1.1 - 1.2 (SchemaRDD) Supports all filters in Spark SQL: -  EqualTo/EqualNullSafe -  GreaterThan/GreaterThanOrEqual/LessThan/LessThanOrEqual -  In/IsNull/IsNotNull -  And/Or/Not -  StringStartsWith/StringEndsWith/StringContains
  18. 30 Advanced Spark SQL features in ES-Hadoop

    DataSource (Spark 1.3) DataSource Reader (Spark 1.4) DataSourceRegister (Spark 1.5) DataSource.unhandledFilters (Spark 1.6) - tells Spark that a certain filter is already handled (by ES/ES-Hadoop) val df = sql.load("spark/index", "org.elasticsearch.spark.sql") val df = sql.read.format("org.elasticsearch.spark.sql").load("spark/index") val df = sql.read.format("es").load("spark/index")
  19. 31 Advanced Mapping

    { "album" : { "mapping" : { "year" : "int", "name" : "string" }}} What is the type of field "year"? 1.  Int 2.  Array of Int 3.  Array of Array of Int
  20. 32 Advanced Mapping – Inferring schema

    { "album" : { "mapping" : { "year" : "int", "name" : "string" }}} What is the type of field "year"? 1.  Int 2.  Array of Int 3.  Array of Array of Int Any of the above. One can tell ES-Hadoop which fields are arrays (and their depth).
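The array hint above is expressed through job configuration; a sketch using ES-Hadoop's array settings (option name as in the ES-Hadoop configuration reference; the exact depth syntax may vary by release):

```properties
# Tell ES-Hadoop that "year" should be read back as an array
es.read.field.as.array.include = year

# Nested arrays declare their depth as a suffix, e.g. an array of arrays
es.read.field.as.array.include = year:2
```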
  21. 33 Advanced Mapping – Geo Types Elasticsearch supports two geo

    types: •  geo_point  –  4 formats •  geo_shape  –  9 shapes To avoid configuration madness, ES-Hadoop performs sampling "pin" : {"location" : { "lat" : 41.12, "lon" : -71.34 } } "pin" : { "location" : "41.12,-71.34" } "pin" : { "location" : "drm3btev3e86" } "pin" : { "location" : [-71.34, 41.12] }
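The sampling step can be pictured as a format check over the four geo_point shapes listed above (an illustrative Python helper, not the actual ES-Hadoop implementation):

```python
def geo_point_format(value):
    """Classify which of the four geo_point formats a sampled value uses:
    a lat/lon object, a "lat,lon" string, a geohash string, or a
    [lon, lat] array."""
    if isinstance(value, dict) and {"lat", "lon"} <= value.keys():
        return "object"
    if isinstance(value, (list, tuple)) and len(value) == 2:
        return "array"  # note the [lon, lat] ordering in this format
    if isinstance(value, str):
        # A comma separates lat,lon pairs; geohashes contain none.
        return "lat_lon_string" if "," in value else "geohash"
    raise ValueError("not a recognised geo_point format")

fmt = geo_point_format({"lat": 41.12, "lon": -71.34})
```

Sampling one document per field lets ES-Hadoop pick the right decoder without requiring the user to configure each of the 4 point formats (or 9 shapes) by hand.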
  22. 35 Advanced Spark – Dealing with parallelism mismatch

    INFO (sparkDriver-akka.actor.default-dispatcher-3) BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB) ERROR (Executor task launch worker-0) Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:509) org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427) org.apache.spark.storage.BlockManager.get(BlockManager.scala:615) org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154) Node1 2P 3R :( SPARK-6235
  23. 36 Advanced Spark – Dealing with parallelism mismatch

    Node1 2P 3R Sub-sharding: - a future workaround - done through scripts - has an impact on security - expensive
  24. 38 HDFS Repository Plugin Snapshot/Restore

    Use HDFS as shared storage to back up and recover data. Issues: the FileSystem API is not a file system; incomplete semantics (last-delete-on-close, fsync, atomic operations); NFSv3 bridge (not v4, metadata issues) https://github.com/elastic/elasticsearch/issues/9072
  25. 39 HDFS Repository Update Moved to ES Master Hadoop/HDFS 2

    only Backed by FileContext API •  Locked to hdfs:// only •  Atomic support •  Hsync support EnumSet<CreateFlag> flags = EnumSet.of(CreateFlag.CREATE, CreateFlag.SYNC_BLOCK); FSDataOutputStream stream = fileContext.create(blob, flags); // write stream stream.hsync(); https://github.com/elastic/elasticsearch/issues/15191
  26. 40 HDFS Repository - Security Lots of work to be

    done with security List of permissions at http://j.mp/hdfs-client-jvm-perms permission java.lang.RuntimePermission "getClassLoader"; // UserGroupInformation (UGI) Metrics clinit permission java.lang.RuntimePermission "accessDeclaredMembers"; permission java.lang.reflect.ReflectPermission "suppressAccessChecks"; // org.apache.hadoop.util.StringUtils clinit permission java.util.PropertyPermission "*", "read,write"; // org.apache.hadoop.util.ShutdownHookManager clinit permission java.lang.RuntimePermission "shutdownHooks"; // JAAS is used always, we use a fake subject, hurts nobody permission javax.security.auth.AuthPermission "getSubject"; permission javax.security.auth.AuthPermission "doAs"; permission javax.security.auth.AuthPermission "modifyPrivateCredentials";
  27. 44 YARN – In Beta $ hadoop jar elasticsearch-yarn-<version>.jar -start

    containers=2 Launched a 2 nodes Elasticsearch-YARN cluster [application_1415921358606_0006@http://hadoop:8088/proxy/ application_1415921358606_0006/] at Sun Feb 14 02:23:21 EET 2016   Run Elasticsearch on YARN* * YARN still doesn’t support long-lived services: •  No provisioning •  No ip/network guarantees •  Data/node affinity Next YARN releases plan to address this
  28. 49 Please attribute Elastic with a link to elastic.co Except

    where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/ Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.