Harness the power of Spark SQL with the Data Source API

An introduction to the Spark Data Source API, which allows new data sources to be exposed for consumption through Spark SQL itself. You will be able to go beyond natively supported formats like JSON or CSV and query your favorite non-relational database, or even your own web services, through the convenient and familiar interface of SQL, using Spark itself as a distributed query engine.


Stefano Baghino

March 08, 2018


  1. Harness the power of Spark SQL with the Data Source API
     Stefano Baghino, Data Engineer at eBay
  2. Motivation
     • I am a Spark user
     • Getting the API right can be hard
       ◦ Language support?
       ◦ Will it withstand changing requirements?
     • SQL is arguably the most popular language for querying data
     • Spark SQL enables the expressivity of SQL on top of Spark
       ◦ Automatically supports all languages supported by Spark
       ◦ Evolving the API becomes a schema evolution problem
     • Limited number of natively supported formats
       ◦ The storage engine knows things about the data layout that Spark cannot be aware of
  3. Aspiration

     val spark = SparkSession.builder.config(new SparkConf).getOrCreate
     val df = spark.read.format("example").option("some-opt", true).load(source)
     val result = trulyAmazingAndIncredibleProcessing(df)
     result.write.format("example").save(sink)
  4. The Data Source API allows you to extend the native capabilities of Spark SQL to query new data sources
  5. Spark in 1 slide
     • Exposes a (mostly) functional API (think Scala Collections or Java Streams)
     • Translates your code into a directed acyclic graph of tasks
     • Executes tasks on top of atomic units of concurrency called partitions
     • The computation lineage is run lazily and re-run in case of failure
  6. Spark SQL in 1 slide
     • Exposes a (mostly) declarative API (think SQL or Pandas)
     • Optimizes the query plan, both logical and physical
     • Produces an optimized directed acyclic graph
     • Runs on top of the existing system
  7. The relation
     • At the core of your data source is a so-called relation
     • A relation defines what gets loaded when you load and what gets written when you save
     • Your implementation can express the following capabilities:
       ◦ Retrieving the schema (mandatory)
       ◦ Reading
         ▪ Perform a full table scan
         ▪ Column pruning
         ▪ Column pruning and predicate pushdown
       ◦ Writing
         ▪ Insertion
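To make column pruning and predicate pushdown concrete, here is a toy, plain-Scala illustration of what a storage engine can do when the query engine hands it the required columns and the pushable filters. This is not the Spark API: the names `Table` and `scan` are made up for this sketch.

```scala
// Illustration only: an in-memory "storage engine" demonstrating column
// pruning (materialize only requested columns) and predicate pushdown
// (filter rows inside the storage layer). Not part of the Spark API.
final case class Table(columns: Map[String, IndexedSeq[String]]) {
  def scan(
      requiredColumns: Seq[String],
      pushedFilters: Map[String, String => Boolean]
  ): Seq[Seq[String]] = {
    val numRows = columns.values.headOption.map(_.length).getOrElse(0)
    (0 until numRows)
      // Predicate pushdown: rows are discarded before reaching the engine.
      .filter(i => pushedFilters.forall { case (col, p) => p(columns(col)(i)) })
      // Column pruning: only the requested columns are materialized.
      .map(i => requiredColumns.map(col => columns(col)(i)))
  }
}

object PushdownDemo {
  def main(args: Array[String]): Unit = {
    val table = Table(Map(
      "name" -> IndexedSeq("alice", "bob", "charlie"),
      "age"  -> IndexedSeq("42", "47", "33")
    ))
    // Roughly: SELECT name FROM table WHERE age > 40
    val rows = table.scan(Seq("name"), Map("age" -> ((v: String) => v.toInt > 40)))
    println(rows) // Vector(List(alice), List(bob))
  }
}
```

The point of the interface split is exactly this: Spark decides *what* it needs, while the storage layer decides *how* to fetch it cheaply.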
  8. The relation provider
     • The relation must be instantiated by a relation provider
     • Spark retrieves the provider at runtime
     • Can define how a new relation gets created
     • Can define a short name for your data source
  9. An example format
     The file stores the row count first, then each column name followed by its values:

       3
       name
       alice
       bob
       charlie
       age
       42
       47
       33

     which corresponds to the table:

       name     age
       alice    42
       bob      47
       charlie  33
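Before looking at the relation itself, it may help to see the index arithmetic on this format in isolation. A minimal plain-Scala sketch (the name `parseExample` is made up for this illustration; the relation on the next slide does the same arithmetic, but lazily and per requested column):

```scala
// Parses the column-oriented example format: line 0 is the row count,
// then each column occupies numRows + 1 lines (its name, then its values).
object ExampleFormat {
  def parseExample(content: IndexedSeq[String]): (Seq[String], Seq[Seq[String]]) = {
    val numRows = content(0).toInt
    // Starting index of each column block: 1, 1 + (numRows + 1), ...
    val columnStarts = 1 until content.size by (numRows + 1)
    val columnNames = columnStarts.map(content)
    // Row i is built by taking the i-th value of every column block.
    val rows = (1 to numRows).map(i => columnStarts.map(start => content(start + i)))
    (columnNames, rows)
  }

  def main(args: Array[String]): Unit = {
    val content = IndexedSeq("3", "name", "alice", "bob", "charlie", "age", "42", "47", "33")
    val (names, rows) = parseExample(content)
    println(names) // Vector(name, age)
    println(rows)  // Vector(Vector(alice, 42), Vector(bob, 47), Vector(charlie, 33))
  }
}
```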
  10. An example relation

      final class ExampleRelation(override val sqlContext: SQLContext, content: IndexedSeq[String])
          extends BaseRelation with PrunedScan {

        // File layout: line 0 is the row count, then each column name
        // followed by its values.
        private val numRows: Int = content(0).toInt
        private val columnStartingIndexes = 1 until content.size by (numRows + 1)
        private val columnNames = columnStartingIndexes.map(content)
        private val columnIndex = columnNames.zip(columnStartingIndexes).toMap

        // Every column is exposed as a string.
        override def schema: StructType =
          StructType(columnNames.map(StructField(_, StringType)))

        // PrunedScan: Spark passes in only the columns the query needs.
        override def buildScan(cs: Array[String]): RDD[Row] =
          sqlContext.sparkContext.parallelize(
            (1 to numRows).map(i => Row(cs.collect(columnIndex).map(_ + i).map(content): _*))
          )
      }
  11. An example relation provider

      final class ExampleRelationProvider extends RelationProvider
          with CreatableRelationProvider with DataSourceRegister {

        // The short name used with spark.read.format("example")
        def shortName(): String = "example"

        // RelationProvider: creates a relation for reading
        def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) = {
          val content: Array[String] = readFile(parameters("path"))
          new ExampleRelation(sqlContext, content)
        }

        // CreatableRelationProvider: writes a DataFrame, then returns a relation over it
        def createRelation(s: SQLContext, m: SaveMode, p: Map[String, String], df: DataFrame) = {
          val content = toWritableFormat(df)
          writeFile(content, p("path"))
          new ExampleRelation(s, content)
        }
      }
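A note on how "Spark retrieves the provider at runtime": `DataSourceRegister` implementations are discovered via Java's `ServiceLoader`, so for `format("example")` to resolve the short name, the provider class must be listed in a services file on the classpath. A sketch, assuming the provider lives in a hypothetical `com.example` package:

```
# File: META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
com.example.ExampleRelationProvider
```

Without this file, the data source can still be used by passing the fully qualified class name to `format(...)`.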
  12. Advantages over custom APIs
      • Write in Java or Scala, enable usage from Python, SQL and R as well
      • Take advantage of your data source's native capabilities (predicate pushdown, column pruning)
      • Abstract users away from the underlying data source
        ◦ Why not swap different storage backends behind the same API?
        ◦ Why not use different storage backends, depending on the kind of query that has to be run?
  13. Not covered in this talk
      • Catalyst integration
      • Streaming support
      • Spark 2.3 (released on February 28th) introduces a new Data Source API
        ◦ Not stable yet
        ◦ The described API is still valid and supported
        ◦ Easier to implement data sources in Java
        ◦ Abstracts away from core Spark concepts
      • Alternatives (like Drill or Presto)