Slide 1

Harness the power of Spark SQL with the Data Source API
Stefano Baghino
Data Engineer at eBay

Slide 2

Motivation

Slide 3

Motivation
● I am a Spark user
● Getting the API right can be hard
  ○ Language support?
  ○ Will it withstand changing requirements?
● SQL is arguably the most popular language to query data
● Spark SQL enables the expressivity of SQL on top of Spark
  ○ Automatically supports all languages supported by Spark
  ○ Evolving the API becomes a schema evolution problem
● Limited number of natively supported formats
  ○ The storage engine knows things about the data layout that Spark cannot be aware of

Slide 4

Aspiration

val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val df = spark.read.format("example").option("some-opt", true).load(source)
val result = trulyAmazingAndIncredibleProcessing(df)
result.write.format("example").save(sink)

Slide 5

The Data Source API allows you to extend the native capabilities of Spark SQL to query new data sources

Slide 6

Spark and Spark SQL

Slide 7

Spark in 1 slide
● Exposes a (mostly) functional API (think Scala Collections or Java Streams)
● Translates your code into a directed acyclic graph of tasks
● Executes tasks on top of atomic units of concurrency called partitions
● The computation lineage is run lazily and re-run in case of failure
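Not from the slides: a minimal sketch of what lazy lineage over partitions looks like in code. The dataset and the partition count are made up for illustration.

import org.apache.spark.sql.SparkSession

object LazyLineageExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("lineage").getOrCreate()
  val sc = spark.sparkContext

  // Nothing runs yet: filter and map only extend the lineage.
  val numbers = sc.parallelize(1 to 1000000, numSlices = 8) // 8 partitions
  val evens   = numbers.filter(_ % 2 == 0)
  val squared = evens.map(n => n.toLong * n)

  // The action triggers execution; a lost partition is recomputed from the lineage.
  println(squared.count())

  spark.stop()
}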

Slide 8

Spark SQL in 1 slide
● Exposes a (mostly) declarative API (think SQL or Pandas)
● Optimizes the query plan, both logical and physical
● Produces an optimized directed acyclic graph
● Runs on top of the existing system
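Not from the slides: a small sketch of the declarative API and of how to inspect the plans the optimizer produces. The tiny dataset mirrors the example used later in the talk.

import org.apache.spark.sql.SparkSession

object DeclarativeExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("spark-sql").getOrCreate()
  import spark.implicits._

  val people = Seq(("alice", 42), ("bob", 47), ("charlie", 33)).toDF("name", "age")

  // You declare *what* you want; Catalyst decides *how* to compute it.
  val adults = people.filter($"age" > 40).select($"name")

  adults.explain(true) // prints the logical, optimized and physical plans
  adults.show()

  spark.stop()
}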

Slide 9

The Data Source API

Slide 10

The relation
● At the core of your data source is a so-called relation
● A relation defines what gets read when you load and what gets written when you save
● Your implementation can express the following capabilities (sketched below):
  ○ Retrieve the schema (mandatory)
  ○ Reading
    ■ Perform a full table scan
    ■ Column pruning
    ■ Column pruning and predicate pushdown
  ○ Writing
    ■ Insertion
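A skeleton (not from the talk) showing how these capabilities map onto the traits in org.apache.spark.sql.sources; in practice you implement only the scan trait you need, the class and method bodies here are placeholders.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType

final class SkeletonRelation(override val sqlContext: SQLContext)
    extends BaseRelation        // schema (mandatory)
    with TableScan              // full table scan
    with PrunedScan             // column pruning
    with PrunedFilteredScan     // column pruning and predicate pushdown
    with InsertableRelation {   // insertion

  override def schema: StructType = ???

  override def buildScan(): RDD[Row] = ???
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = ???
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = ???

  override def insert(data: DataFrame, overwrite: Boolean): Unit = ???
}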

Slide 11

The relation provider
● The relation must be instantiated by a relation provider
● Spark retrieves the provider at runtime
● Can define how a new relation gets created
● Can define a short name for your data source
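Not on the slide: how the provider is resolved at runtime. Spark always accepts the provider's fully-qualified class name in format(...); the short name from DataSourceRegister only works if the provider is registered via the JVM ServiceLoader mechanism. The package name below is hypothetical.

// Always works, no registration needed:
spark.read.format("com.example.ExampleRelationProvider").load("data.example")

// Works once the provider is listed in
//   src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// with its fully-qualified class name as the file's content:
spark.read.format("example").load("data.example")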

Slide 12

An example

Slide 13

An example format

The raw file: the first line is the number of rows, followed by each column (its name, then its values):

3
name
alice
bob
charlie
age
42
47
33

The table it represents:

name     age
alice    42
bob      47
charlie  33
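To connect the format to the relation code on the next slide, here is how the file lines map to indexes in the content sequence (a sketch, not from the slides):

// The file above, as the relation sees it (one element per line):
val content = IndexedSeq("3", "name", "alice", "bob", "charlie", "age", "42", "47", "33")

val numRows = content(0).toInt                                     // 3
val columnStartingIndexes = 1 until content.size by (numRows + 1)  // 1, 5
val columnNames = columnStartingIndexes.map(content)               // Vector("name", "age")

// The value of column "age" for row 2 sits at index 5 + 2 = 7:
assert(content(columnNames.zip(columnStartingIndexes).toMap("age") + 2) == "47")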

Slide 14

An example relation

final class ExampleRelation(override val sqlContext: SQLContext, content: IndexedSeq[String])
    extends BaseRelation with PrunedScan {

  // The first line of the file holds the number of rows.
  private val numRows: Int = content(0).toInt
  // Each column occupies (numRows + 1) lines: its name followed by its values.
  private val columnStartingIndexes = 1 until content.size by (numRows + 1)
  private val columnNames = columnStartingIndexes.map(content)
  private val columnIndex = columnNames.zip(columnStartingIndexes).toMap

  // Every column is exposed as a string field.
  override def schema: StructType =
    StructType(columnNames.map(StructField(_, StringType)))

  // PrunedScan: only the requested columns (cs) are materialized into rows.
  override def buildScan(cs: Array[String]): RDD[Row] =
    sqlContext.sparkContext.parallelize(
      (1 to numRows).map(i => Row(cs.collect(columnIndex).map(_ + i).map(content): _*))
    )
}

Slide 15

An example relation provider

final class ExampleRelationProvider extends RelationProvider
    with CreatableRelationProvider with DataSourceRegister {

  // Lets users refer to this source as "example" instead of its fully-qualified class name.
  def shortName(): String = "example"

  // Called on read: spark.read.format("example").load(path)
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) = {
    val content: Array[String] = readFile(parameters("path"))
    new ExampleRelation(sqlContext, content)
  }

  // Called on write: df.write.format("example").save(path)
  def createRelation(s: SQLContext, m: SaveMode, p: Map[String, String], df: DataFrame) = {
    val content = toWritableFormat(df)
    writeFile(content, p("path"))
    new ExampleRelation(s, content)
  }
}
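The slide leaves readFile, toWritableFormat and writeFile undefined; one possible way to fill them in for the format above (a sketch under my own assumptions, not the author's code):

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import org.apache.spark.sql.DataFrame

object ExampleFormat {
  // One element per line of the file.
  def readFile(path: String): Array[String] =
    Files.readAllLines(Paths.get(path)).asScala.toArray

  // Row count first, then each column as its name followed by its values.
  def toWritableFormat(df: DataFrame): IndexedSeq[String] = {
    val rows = df.collect()            // fine for a toy format, not for real data
    val header = rows.length.toString
    val columns = df.schema.fieldNames.zipWithIndex.flatMap {
      case (name, i) => name +: rows.map(_.get(i).toString)
    }
    (header +: columns).toIndexedSeq
  }

  def writeFile(content: IndexedSeq[String], path: String): Unit =
    Files.write(Paths.get(path), content.asJava)
}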

Slide 16

Conclusions

Slide 17

Advantages over custom APIs
● Write in Java or Scala, enable usage from Python, SQL and R as well
● Take advantage of your data source's native capabilities (predicate pushdown, column pruning)
● Abstract users away from the underlying data source
  ○ Why not swap different storage backends behind the same API?
  ○ Why not use different storage backends depending on the kind of query to be run?

Slide 18

Not covered in this talk
● Catalyst integration
● Streaming support
● Spark 2.3 (released on February 28th) introduces a new Data Source API
  ○ Not stable yet
  ○ The described API is still valid and supported
  ○ Easier to implement data sources in Java
  ○ Abstracts away from core Spark concepts
● Alternatives (like Drill or Presto)

Slide 19

Thanks!
Code: https://github.com/stefanobaghino/spark-data-source-api-examples
Find me on Twitter: @stefanobaghino