
Testing batch and streaming Spark applications

Apache Spark is a general engine for processing data on a large scale. Employing this tool in a distributed environment to process large data sets is undeniably beneficial.

But what about a fast feedback loop while developing such an application with Apache Spark? Testing it on a cluster is essential, but it is not what most developers accustomed to a TDD workflow would like to do.

In this talk, Łukasz will share some tips on how to write unit and integration tests, and how Docker can be applied to test a Spark application on a local machine.

Examples will be presented using the ScalaTest framework and should be easy to grasp for people who know Scala or other JVM languages.

Łukasz Gawron

June 23, 2018

Transcript

  1. Overview
     • Why run an application outside of a cluster?
     • Spark in a nutshell
     • Unit and integration tests
     • Tools
     • Spark Streaming integration tests
     • Best practices and pitfalls
  2. Why do we want to test?
     • safety / regression
     • fast feedback
     • communication
  3. Why do we want to test?
     • safety / regression
     • fast feedback
     • communication
     • best possible design
  4. Example – word count
     WordCount maps (extracts) words from an input source and reduces (summarizes) the results, returning a count of each word.
  5. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
  6. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
         val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
         val wordsRDD: RDD[String] = sc.parallelize(words)
  7. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
         val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
         val wordsRDD: RDD[String] = sc.parallelize(words)
         wordsRDD
           .flatMap((line: String) => line.split(" "))
           .map((word: String) => (word, 1))
           .reduceByKey((occurence1: Int, occurence2: Int) => {
             occurence1 + occurence2
           })
  8. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
         val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
         val wordsRDD: RDD[String] = sc.parallelize(words)
         wordsRDD
           .flatMap((line: String) => line.split(" "))
           .map((word: String) => (word, 1))
           .reduceByKey((occurence1: Int, occurence2: Int) => {
             occurence1 + occurence2
           })
           .saveAsTextFile("/tmp/output")
  10. object App {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setMaster("local[4]")
            .setAppName("Quality Excites")
          val sc = new SparkContext(conf)
          val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
          val wordsRDD: RDD[String] = sc.parallelize(words)
          wordsRDD
            .flatMap(WordsCount.extractWords)
            .map((word: String) => (word, 1))
            .reduceByKey((occurence1: Int, occurence2: Int) => {
              occurence1 + occurence2
            })
            .saveAsTextFile("/tmp/output")
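      The extracted helper itself is not shown on any slide (the deck refers to it both as WordsCount and WordCount). A minimal sketch of what such an object might look like, with the name and signature inferred from the calls above:

      object WordCount {
        // Pure function extracted from the Spark job: splits a line of text into words.
        // It can be unit tested with plain ScalaTest, no SparkContext required.
        def extractWords(line: String): Array[String] = line.split(" ")
      }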
  11. Example unit test
      class S00_UnitTest extends FunSpec with Matchers {
        it("should split a sentence into words") {
          val line = "Ala ma kota"
          val words: Array[String] = WordCount.extractWords(line = line)
          val expected = Array("Ala", "ma", "kota")
          words should be (expected)
        }
      }
  13. Example unit test
      class S00_UnitTest extends BasicScalaTest {
        it("should split a sentence into words") {
          val line = "Ala ma kota"
          val words: Array[String] = WordCount.extractWords(line = line)
          val expected = Array("Ala", "ma", "kota")
          words should be (expected)
        }
      }
  14. Things to note
      • Extract anonymous functions so they become testable
      • What can be unit tested?
        • Executor and driver code not related to Spark
        • UDF functions (see the sketch below)
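      One way to keep a UDF unit-testable is to hold the logic in a plain Scala function and wrap it in udf() separately, so the function can be exercised without Spark. A minimal sketch; the names here are illustrative, not from the deck:

      import org.apache.spark.sql.functions.udf

      object Normalizer {
        // Plain function carrying the logic; this is what the unit test exercises.
        val toLowerCase: String => String = (word: String) => word.toLowerCase
        // Thin UDF wrapper used from DataFrame code.
        val toLowerCaseUdf = udf(toLowerCase)
      }

      class NormalizerTest extends FunSpec with Matchers {
        it("lower-cases a word without starting Spark") {
          Normalizer.toLowerCase("ALA") should be ("ala")
        }
      }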
  15. Production code vs test code
      Production code:
      • distributed mode
      • RDD from storage
      Test code:
      • local mode
      • RDD from resources/memory
  16. Production code vs test code
      Production code:
      • distributed mode
      • RDD from storage
      • Evaluate transformations on RDD or DStream API
      Test code:
      • local mode
      • RDD from resources/memory
      • Evaluate transformations on RDD or DStream API
  17. Production code vs test code
      Production code:
      • distributed mode
      • RDD from storage
      • Evaluate transformations on RDD or DStream API
      • Store outcomes
      Test code:
      • local mode
      • RDD from resources/memory
      • Evaluate transformations on RDD or DStream API
      • Assert outcomes
  18. What to test in integration tests?
      val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
      val wordsRDD: RDD[String] = sc.parallelize(words)
      wordsRDD
        .flatMap((line: String) => line.split(" "))
        .map((word: String) => (word, 1))
        .reduceByKey((occurence1: Int, occurence2: Int) => {
          occurence1 + occurence2
        })
        .saveAsTextFile("/tmp/output")
  20. Integration test
      def extractAndCountWords(wordsRDD: RDD[String]): RDD[(String, Int)] = {
        wordsRDD
          .flatMap(WordCount.extractWords)
          .map((word: String) => (word, 1))
          .reduceByKey((occurence1: Int, occurence2: Int) => {
            occurence1 + occurence2
          })
      }
  21. class S01_IntegrationTest extends SparkSessionBase {
        it("should count words occurence in all lines") {
          Given("RDD of sentences")
          val linesRdd: RDD[String] = ss.sparkContext.parallelize(
            List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))

          When("extract and count words")
          val wordsCountRdd: RDD[(String, Int)] = WordsCount.extractAndCountWords(linesRdd)
          val actual: Map[String, Int] = wordsCountRdd.collectAsMap()

          Then("words should be counted")
          val expected = Map(
            "Ala" -> 2,
            "ma" -> 2,
            "kota" -> 1,
            ................
          )
          actual should be(expected)
  23. class SparkSessionBase extends FunSpec
        with BeforeAndAfterAll with Matchers with GivenWhenThen {

        var ss: SparkSession = _

        override def beforeAll() {
          val conf = new SparkConf()
            .setMaster("local[4]")
          ss = SparkSession.builder()
            .appName("TestApp" + System.currentTimeMillis())
            .config(conf)
            .getOrCreate()
        }

        override def afterAll() {
          ss.stop()
          ss = null
        }
  24. class S01_IntegrationTest extends SparkSessionBase {
        it("should count words occurence in all lines") {
          Given("RDD of sentences")
          val linesRdd: RDD[String] = ss.sparkContext.parallelize(
            List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))

          When("extract and count words")
          val wordsCountRdd: RDD[(String, Int)] = WordsCount.extractAndCountWords(linesRdd)
          val actual: Map[String, Int] = wordsCountRdd.collectAsMap()

          Then("words should be counted")
          val expected = Map(
            "Ala" -> 2,
            "ma" -> 2,
            "kota" -> 1,
            ................
          )
          actual should equal(expected)
  25. Integration test – DataFrame
      def extractFilterAndCountWords(wordsDf: DataFrame): DataFrame = {
        val words: Column = explode(split(col("line"), " ")).as("word")
        wordsDf
          .select(words)
          .where(col("word").equalTo("Ala").or(col("word").equalTo("Bolek")))
          .groupBy("word")
          .count()
      }
  26. it("should count words occurence in all lines") { Given("few lines

    of sentences") val schema = StructType(List( StructField("line", StringType, true) )) val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json")) When("extract and count words") val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf) val wordCount: Array[Row] = wordsCountDf.collect() Then("filtered words should be counted") val actualWordCount = wordCount .map((row: Row) =>Tuple2(row.getAs[String]("word"), row.getAs[Long]("count"))) .toMap val expectedWordCount = Map("Ala" -> 2,"Bolek" -> 1) actualWordCount should be(expectedWordCount) }
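      getResourcePath is not shown in the deck; presumably it resolves a file under src/test/resources to a path Spark can read. A minimal sketch of such a helper (implementation assumed):

      // Assumed helper, e.g. defined in SparkSessionBase: resolves a test resource to a path.
      def getResourcePath(resource: String): String =
        getClass.getResource(resource).getPath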
  27. it("should count words occurence in all lines") { Given("few lines

    of sentences") val schema = StructType(List( StructField("line", StringType, true) )) val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json")) When("extract and count words") val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf) val wordCount: Array[Row] = wordsCountDf.collect() Then("filtered words should be counted") val actualWordCount = wordCount .map((row: Row) =>Tuple2(row.getAs[String]("word"), row.getAs[Long]("count"))) .toMap val expectedWordCount = Map("Ala" -> 2,"Bolek" -> 1) actualWordCount should be(expectedWordCount) }
  28. Integration test – Dataset
      def extractFilterAndCountWordsDataset(wordsDs: Dataset[Line]): Dataset[WordCount] = {
        import wordsDs.sparkSession.implicits._
        wordsDs
          .flatMap((line: Line) => line.text.split(" "))
          .filter((word: String) => word == "Ala" || word == "Bolek")
          .groupBy(col("word"))
          .agg(count("word").as("count"))
          .as[WordCount]
      }
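      The Line and WordCount case classes used by the Dataset examples are not shown in the deck; the shapes below are inferred from how they are used on the surrounding slides:

      // Inferred from usage: a line of input text and an aggregated word count.
      // Defined at top level (not inside a test class) so Spark can serialize them.
      case class Line(text: String)
      case class WordCount(word: String, count: Long)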
  29. it("should return total count of Ala and Bolek words in

    all lines of text") { Given("few sentences") implicit val lineEncoder = product[Line] val lines = List( Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa")) val linesDs: Dataset[Line] = ss.createDataset(lines) When("extract and count words") val wordsCountDs: Dataset[WordCount] = WordsCount .extractFilterAndCountWordsDataset(linesDs) val actualWordCount: Array[WordCount] = wordsCountDs.collect() Then("filtered words should be counted") val expectedWordCount = Array(WordCount("Ala", 2),WordCount("Bolek", 1)) actualWordCount should contain theSameElementsAs expectedWordCount }
  30. it("should return total count of Ala and Bolek words in

    all lines of text") { import spark.implicits._ Given("few sentences") implicit val lineEncoder = product[Line] val linesDs: Dataset[Lines] = List( Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa")).toDS() When("extract and count words") val wordsCountDs: Dataset[WordCount] = WordsCount .extractFilterAndCountWordsDataset(linesDs) val actualWordCount: Array[WordCount] = wordsCountDs.collect() Then("filtered words should be counted") val expectedWordCount = Array(WordCount("Ala", 2),WordCount("Bolek", 1)) actualWordCount should contain theSameElementsAs expectedWordCount }
  31. Things to note
      • What can be tested in integration tests?
        • Single transformation on Spark abstractions
        • Chain of transformations
        • Integration with external services, e.g. Kafka, HDFS, YARN
          • Embedded instances
          • Docker environment
      • Prefer Datasets over RDDs or DataFrames
  32. spark-fast-tests
      class S04_IntegrationDatasetFastTest extends SparkSessionBase with DatasetComparer {
        it("should return total count of Ala and Bolek words in all lines of text") {
          Given("few lines of sentences")
          implicit val lineEncoder = product[Line]
          implicit val wordEncoder = product[WordCount]
          val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
          val linesDs: Dataset[Line] = ss.createDataset(lines)

          When("extract and count words")
          val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)

          Then("filtered words should be counted")
          val expectedDs = ss.createDataset(Array(WordCount("Ala", 2), WordCount("Bolek", 1)))
          assertSmallDatasetEquality(wordsCountDs, expectedDs, orderedComparison = false)
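      spark-fast-tests is pulled in as a test-scoped dependency; in sbt that looks roughly as below (group id and version are from memory and may differ, check the project's README):

      // build.sbt (sketch) – verify the exact coordinates against the spark-fast-tests README
      libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "<version>" % "test"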
  33. Spark Testing Base
      class S06_01_IntegrationDatasetSparkTestingBaseTest extends FunSpec
        with DatasetSuiteBase with GivenWhenThen {

        it("counting word occurences on few lines of text should return count Ala and Bolek words in this text") {
          Given("few lines of sentences")
          implicit val lineEncoder = product[Line]
          implicit val wordEncoder = product[WordCount]
          val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
          val linesDs: Dataset[Line] = spark.createDataset(lines)

          When("extract and count words")
          val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)

          Then("filtered words should be counted")
          val expectedDs: Dataset[WordCount] = spark.createDataset(Seq(WordCount("Bolek", 1), WordCount("Ala", 2)))
          assertDatasetEquals(expected = expectedDs, result = wordsCountDs)
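      spark-testing-base is added the same way; the group id com.holdenkarau appears in the failure message on the next slide, while the artifact version usually encodes both the Spark and library versions (sketch, check the project docs for exact coordinates):

      // build.sbt (sketch) – the version typically looks like "<sparkVersion>_<libVersion>"
      libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "<version>" % "test"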
  34. Spark Testing Base – not so nice failure messages
      • Different length
        1 did not equal 2 Length not Equal
        ScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at ...
      • Different order of elements
        Tuple2;((0,(WordCount(Ala,2),WordCount(Bolek,1))), (1,(WordCount(Bolek,1),WordCount(Ala,2)))) was not empty
      • Different values
        Tuple2;((0,(WordCount(Bole,1),WordCount(Bolek,1)))) was not empty
  35. Streaming – spark-testing-base
      class S06_02_StreamingTest_SparkTestingBase extends FunSuite with StreamingSuiteBase {

        test("count words") {
          val input = List(List("a b"))
          val expected = List(List(("a", 1), ("b", 1)))
          testOperation[String, (String, Int)](input, count _, expected, ordered = false)
        }

        // This is the sample operation we are testing
        def count(lines: DStream[String]): DStream[(String, Int)] = {
          lines.flatMap(_.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
        }
      }
  36. How to design easily testable Spark code?
      • Extract functions so they are reusable and testable
      • A single transformation should do one thing
      • Compose transformations using the "transform" function
      • Prefer Column-based functions over UDFs
        • Column-based functions
        • Dataset operators
        • UDF functions
  37. Column based function
      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions._

      object HelloWorld {
        def withGreeting()(df: DataFrame): DataFrame = {
          df.withColumn("greeting", lit("Hello!!"))
        }
      }
      // def lit(literal: Any): Column
  38. it("appends a greeting column to a Dataframe") { Given("Source dataframe")

    val sourceDF = Seq( ("Quality Excites") ).toDF("name") When("adding greeting column") val actualDF = sourceDF .transform(HelloWorld.withGreeting()) Then("new data frame contains column greeting") val expectedSchema = List(StructField("name", StringType, true),StructField("greeting", StringType, false)) val expectedData = Seq(Row("Quality Excites", ”Hello!!")) val expectedDF = ss.createDataFrame(ss.sparkContext.parallelize(expectedData),StructType(expectedSchema)) assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false) }
  39. it("appends a greeting column to a Dataframe") { Given("Source dataframe")

    val sourceDF = Seq( ("Quality Excites") ).toDF("name") When("adding greeting column") val actualDF = sourceDF .transform(HelloWorld.withGreeting()) .transform(HelloWorld.withGreetingUdf())
  40. object HelloWorld {
        def withGreeting()(df: DataFrame): DataFrame = {
          df.withColumn("greeting", lit("Hello!!"))
        }

        val litFunction: () => String = () => "Hello!!"
        val udfLit = udf(litFunction)

        def withGreetingUdf()(df: DataFrame): DataFrame = {
          df.withColumn("greetingUdf", udfLit())
        }
      }
  41. Pitfalls you should look out for
      • You cannot refer to one RDD inside another RDD
      • Process a batch of data, not a single message or domain entity
      • Case classes defined in a test class body throw a SerializationException (see the sketch below)
      • Spark reads JSON based on the http://jsonlines.org/ specification
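      A simple way around the SerializationException pitfall is to define the case classes at the top level of the test file (or in an object) instead of inside the suite. A minimal sketch with illustrative names, reusing SparkSessionBase from slide 23:

      import org.apache.spark.sql.Encoders

      // Top-level case class: no hidden reference to the enclosing test suite,
      // so Spark can serialize it.
      case class Person(name: String, age: Int)

      class S07_SerializationTest extends SparkSessionBase {
        it("works with case classes defined outside the suite body") {
          implicit val personEncoder = Encoders.product[Person]
          val people = ss.createDataset(Seq(Person("Ala", 30)))
          people.count() should be (1)
        }
      }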
  42. Q&A

  43. References
      • https://databricks.com/session/mastering-spark-unit-testing
      • https://medium.com/@mrpowers/designing-easily-testable-spark-code-df0755ef00a4
      • https://medium.com/@mrpowers/testing-spark-applications-8c590d3215fa
      • http://shop.oreilly.com/product/0636920046967.do
      • https://spark.apache.org/docs/latest/streaming-programming-guide.html
      • https://spark.apache.org/docs/latest/sql-programming-guide.html
      • https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html