Harness the power of Spark SQL with the Data Source API

Stefano Baghino, Data Engineer at eBay

Motivation

• I am a Spark user
• Getting the API right can be hard
  ◦ Language support?
  ◦ Will it withstand changing requirements?
• SQL is arguably the most popular language to query data
• Spark SQL brings the expressivity of SQL on top of Spark
  ◦ Automatically supports all languages supported by Spark
  ◦ Evolving the API becomes a schema evolution problem
• Limited number of natively supported formats
  ◦ The storage engine knows things about the data layout that Spark cannot be aware of

Aspiration
    val spark = SparkSession.builder.config(new SparkConf).getOrCreate
    val df = spark.read.format("example").option("some-opt", true).load(source)
    val result = trulyAmazingAndIncredibleProcessing(df)
    result.write.format("example").save(sink)

The Data Source API lets you extend the native capabilities of Spark SQL to query new data sources

Spark and Spark SQL

Spark in 1 slide
• Exposes a (mostly) functional API (think Scala Collections or Java Streams)
• Translates your code into a directed acyclic graph of tasks
• Executes tasks on top of atomic units of concurrency called partitions
• The computation lineage is run lazily and re-run in case of failure

Spark SQL in 1 slide
• Exposes a (mostly) declarative API (think SQL or Pandas)
• Optimizes the query plan, both logical and physical
• Produces an optimized directed acyclic graph
• Runs on top of the existing system

The Data Source API

The relation
• At the core of your data source is a so-called relation
• A relation defines what gets loaded when you load and written when you save
• Your implementation can express the following capabilities:
  ◦ Retrieve the schema (mandatory)
  ◦ Reading
    ▪ Perform a full table scan
    ▪ Column pruning
    ▪ Column pruning and predicate pushdown
  ◦ Writing
    ▪ Insertion
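The three reading capabilities can be illustrated without Spark. What follows is a minimal sketch in plain Scala, not the Spark API: the object PushdownSketch, its Filter and EqualTo types and its three methods are hypothetical stand-ins. In Spark itself, the capabilities are expressed by mixing traits from org.apache.spark.sql.sources (TableScan, PrunedScan, PrunedFilteredScan, InsertableRelation) into a BaseRelation.

```scala
// Hypothetical, Spark-free model of the three reading capabilities.
object PushdownSketch {
  type Record = Map[String, String]

  // Stand-in for the filters that Spark can hand over to a data source.
  sealed trait Filter
  final case class EqualTo(column: String, value: String) extends Filter

  private val columns = Seq("name", "age")

  private val table: Seq[Record] = Seq(
    Map("name" -> "alice", "age" -> "42"),
    Map("name" -> "bob", "age" -> "47"),
    Map("name" -> "charlie", "age" -> "33")
  )

  // Full table scan: every column of every row.
  def fullScan(): Seq[Seq[String]] =
    prunedFilteredScan(columns, Seq.empty)

  // Column pruning: only the requested columns, in the requested order.
  def prunedScan(requiredColumns: Seq[String]): Seq[Seq[String]] =
    prunedFilteredScan(requiredColumns, Seq.empty)

  // Column pruning and predicate pushdown: rows are filtered by the source
  // itself before anything is handed back to the caller.
  def prunedFilteredScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[String]] =
    table
      .filter(record => filters.forall { case EqualTo(c, v) => record.get(c).contains(v) })
      .map(record => requiredColumns.map(record))
}
```

With this sketch, PushdownSketch.prunedFilteredScan(Seq("name"), Seq(PushdownSketch.EqualTo("age", "47"))) returns only bob's name: the source never materializes the other rows or the age column for the caller.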

The relation provider
• The relation must be instantiated by a relation provider
• Spark retrieves the provider at runtime
• Can define how a new relation gets created
• Can define a short name for your data source

An example

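An example format
The content is a flat sequence of values: first the number of rows, then each column as its name followed by its values.
    3 name alice bob charlie age 42 47 33
It is read as the following table:
    name     age
    alice    42
    bob      47
    charlie  33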
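The index arithmetic behind this format can be tried out without Spark. The following is a minimal sketch in plain Scala, assuming the content has already been split into one string per value; the object ExampleFormat and its parse method are hypothetical names introduced only for this illustration.

```scala
// Spark-free sketch of decoding the example format.
object ExampleFormat {
  // Returns the column names and the rows of the table.
  def parse(content: IndexedSeq[String]): (Seq[String], Seq[Seq[String]]) = {
    // The first entry holds the number of rows.
    val numRows = content(0).toInt
    // Each column occupies numRows + 1 entries: its name, then its values.
    val columnStarts = 1 until content.size by (numRows + 1)
    val names = columnStarts.map(content)
    // Row i is made of the i-th value after each column's name.
    val rows = (1 to numRows).map(i => columnStarts.map(start => content(start + i)))
    (names, rows)
  }
}
```

For the content shown above, ExampleFormat.parse yields the names name and age, and the rows (alice, 42), (bob, 47) and (charlie, 33).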

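An example relation
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    final class ExampleRelation(override val sqlContext: SQLContext, content: IndexedSeq[String])
        extends BaseRelation with PrunedScan {
      // The first entry holds the number of rows
      private val numRows: Int = content(0).toInt
      // Each column takes numRows + 1 entries: its name, then its values
      private val columnStartingIndexes = 1 until content.size by (numRows + 1)
      private val columnNames = columnStartingIndexes.map(content)
      private val columnIndex = columnNames.zip(columnStartingIndexes).toMap
      // Every column is exposed as a string
      override def schema: StructType = StructType(columnNames.map(StructField(_, StringType)))
      // Column pruning: only the requested columns are read, in the requested order
      override def buildScan(cs: Array[String]): RDD[Row] =
        sqlContext.sparkContext.parallelize(
          (1 to numRows).map(i => Row(cs.collect(columnIndex).map(_ + i).map(content): _*))
        )
    }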

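An example relation provider
    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
    import org.apache.spark.sql.sources.{CreatableRelationProvider, DataSourceRegister, RelationProvider}

    // readFile, toWritableFormat and writeFile are helpers not shown on the slide
    final class ExampleRelationProvider
        extends RelationProvider with CreatableRelationProvider with DataSourceRegister {
      // The name passed to format(...)
      def shortName(): String = "example"
      // Called on load
      def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) = {
        val content: Array[String] = readFile(parameters("path"))
        new ExampleRelation(sqlContext, content)
      }
      // Called on save; the save mode is ignored in this example
      def createRelation(s: SQLContext, m: SaveMode, p: Map[String, String], df: DataFrame) = {
        val content = toWritableFormat(df)
        writeFile(content, p("path"))
        new ExampleRelation(s, content)
      }
    }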

Conclusions

Advantages over custom APIs
• Write in Java or Scala, enable usage from Python, SQL and R as well
• Take advantage of your data source's native capabilities (predicate pushdown, column pruning)
• Shield users from the underlying data source
  ◦ Why not swap different storage backends behind the same API?
  ◦ Why not use different storage backends, depending on the kind of query that has to be run?

Not covered in this talk
• Catalyst integration
• Streaming support
• Spark 2.3 (released on February 28th) introduces a new Data Source API
  ◦ Not stable yet
  ◦ The described API is still valid and supported
  ◦ Easier to implement data sources in Java
  ◦ Abstracts away from core Spark concepts
• Alternatives (like Drill or Presto)

Thanks!
Code: https://github.com/stefanobaghino/spark-data-source-api-examples
Find me on Twitter: @stefanobaghino
