
Testing batch and streaming Spark applications

Apache Spark is a general engine for processing data on a large scale. Employing this tool in a distributed environment to process large data sets is undeniably beneficial.

But what about a fast feedback loop while developing such an application with Apache Spark? Testing it on a cluster is essential, but it is not what most developers accustomed to a TDD workflow would like to do.

In this talk, Łukasz will share some tips on how to write unit and integration tests, and how Docker can be applied to test a Spark application on a local machine.

Examples will be presented using the ScalaTest framework and should be easy to grasp for people who know Scala or other JVM languages.

Łukasz Gawron

June 23, 2018

Transcript

  1. Overview
     • Why run an application outside of a cluster?
     • Spark in a nutshell
     • Unit and integration tests
     • Tools
     • Spark Streaming integration tests
     • Best practices and pitfalls
  2. Why do we want to test?
     • safety / regression
     • fast feedback
     • communication
  3. Why do we want to test?
     • safety / regression
     • fast feedback
     • communication
     • best possible design
  4. Example – word count
     WordCount maps (extracts) words from an input source and reduces (summarizes) the results, returning a count of each word.
  5. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
  6. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
         val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
         val wordsRDD: RDD[String] = sc.parallelize(words)
  7. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
         val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
         val wordsRDD: RDD[String] = sc.parallelize(words)
         wordsRDD
           .flatMap((line: String) => line.split(" "))
           .map((word: String) => (word, 1))
           .reduceByKey((occurence1: Int, occurence2: Int) => {
             occurence1 + occurence2
           })
  8. object App {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setMaster("local[4]")
           .setAppName("Quality Excites")
         val sc = new SparkContext(conf)
         val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
         val wordsRDD: RDD[String] = sc.parallelize(words)
         wordsRDD
           .flatMap((line: String) => line.split(" "))
           .map((word: String) => (word, 1))
           .reduceByKey((occurence1: Int, occurence2: Int) => {
             occurence1 + occurence2
           })
           .saveAsTextFile("/tmp/output")
  10. object App {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setMaster("local[4]")
            .setAppName("Quality Excites")
          val sc = new SparkContext(conf)
          val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
          val wordsRDD: RDD[String] = sc.parallelize(words)
          wordsRDD
            .flatMap(WordsCount.extractWords)
            .map((word: String) => (word, 1))
            .reduceByKey((occurence1: Int, occurence2: Int) => {
              occurence1 + occurence2
            })
            .saveAsTextFile("/tmp/output")
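      The extracted helper itself is not shown on any slide (the deck refers to it both as WordsCount and WordCount). A minimal sketch of what such an object might look like, with the name and signature inferred from the calls above:

      object WordCount {
        // Pure function extracted from the Spark job: splits a line of text into words.
        // It can be unit tested with plain ScalaTest, no SparkContext required.
        def extractWords(line: String): Array[String] = line.split(" ")
      }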
  11. Example unit test
      class S00_UnitTest extends FunSpec with Matchers {
        it("should split a sentence into words") {
          val line = "Ala ma kota"
          val words: Array[String] = WordCount.extractWords(line = line)
          val expected = Array("Ala", "ma", "kota")
          words should be (expected)
        }
      }
  13. Example unit test
      class S00_UnitTest extends BasicScalaTest {
        it("should split a sentence into words") {
          val line = "Ala ma kota"
          val words: Array[String] = WordCount.extractWords(line = line)
          val expected = Array("Ala", "ma", "kota")
          words should be (expected)
        }
      }
  14. Things to note
      • Extract anonymous functions so they become testable
      • What can be unit tested?
        • Executor and driver code not related to Spark
        • UDF functions (see the sketch below)
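      One way to keep a UDF unit-testable is to hold the logic in a plain Scala function and wrap it in udf() separately, so the function can be exercised without Spark. A minimal sketch; the names here are illustrative, not from the deck:

      import org.apache.spark.sql.functions.udf

      object Normalizer {
        // Plain function carrying the logic; this is what the unit test exercises.
        val toLowerCase: String => String = (word: String) => word.toLowerCase
        // Thin UDF wrapper used from DataFrame code.
        val toLowerCaseUdf = udf(toLowerCase)
      }

      class NormalizerTest extends FunSpec with Matchers {
        it("lower-cases a word without starting Spark") {
          Normalizer.toLowerCase("ALA") should be ("ala")
        }
      }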
  15. Production code vs test code
      Production code:
      • distributed mode
      • RDD from storage
      Test code:
      • local mode
      • RDD from resources/memory
  16. Production code vs test code
      Production code:
      • distributed mode
      • RDD from storage
      • Evaluate transformations on RDD or DStream API
      Test code:
      • local mode
      • RDD from resources/memory
      • Evaluate transformations on RDD or DStream API
  17. Production code vs test code
      Production code:
      • distributed mode
      • RDD from storage
      • Evaluate transformations on RDD or DStream API
      • Store outcomes
      Test code:
      • local mode
      • RDD from resources/memory
      • Evaluate transformations on RDD or DStream API
      • Assert outcomes
  18. What to test in integration tests?
      val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
      val wordsRDD: RDD[String] = sc.parallelize(words)
      wordsRDD
        .flatMap((line: String) => line.split(" "))
        .map((word: String) => (word, 1))
        .reduceByKey((occurence1: Int, occurence2: Int) => {
          occurence1 + occurence2
        })
        .saveAsTextFile("/tmp/output")
  20. Integration test
      def extractAndCountWords(wordsRDD: RDD[String]): RDD[(String, Int)] = {
        wordsRDD
          .flatMap(WordCount.extractWords)
          .map((word: String) => (word, 1))
          .reduceByKey((occurence1: Int, occurence2: Int) => {
            occurence1 + occurence2
          })
      }
  21. class S01_IntegrationTest extends SparkSessionBase {
        it("should count words occurence in all lines") {
          Given("RDD of sentences")
          val linesRdd: RDD[String] = ss.sparkContext.parallelize(
            List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))

          When("extract and count words")
          val wordsCountRdd: RDD[(String, Int)] = WordsCount.extractAndCountWords(linesRdd)
          val actual: Map[String, Int] = wordsCountRdd.collectAsMap()

          Then("words should be counted")
          val expected = Map(
            "Ala" -> 2,
            "ma" -> 2,
            "kota" -> 1,
            ................
          )
          actual should be(expected)
  23. class SparkSessionBase extends FunSpec
        with BeforeAndAfterAll with Matchers with GivenWhenThen {

        var ss: SparkSession = _

        override def beforeAll() {
          val conf = new SparkConf()
            .setMaster("local[4]")
          ss = SparkSession.builder()
            .appName("TestApp" + System.currentTimeMillis())
            .config(conf)
            .getOrCreate()
        }

        override def afterAll() {
          ss.stop()
          ss = null
        }
  24. class S01_IntegrationTest extends SparkSessionBase {
        it("should count words occurence in all lines") {
          Given("RDD of sentences")
          val linesRdd: RDD[String] = ss.sparkContext.parallelize(
            List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))

          When("extract and count words")
          val wordsCountRdd: RDD[(String, Int)] = WordsCount.extractAndCountWords(linesRdd)
          val actual: Map[String, Int] = wordsCountRdd.collectAsMap()

          Then("words should be counted")
          val expected = Map(
            "Ala" -> 2,
            "ma" -> 2,
            "kota" -> 1,
            ................
          )
          actual should equal(expected)
  25. Integration test – DataFrame
      def extractFilterAndCountWords(wordsDf: DataFrame): DataFrame = {
        val words: Column = explode(split(col("line"), " ")).as("word")
        wordsDf
          .select(words)
          .where(col("word").equalTo("Ala").or(col("word").equalTo("Bolek")))
          .groupBy("word")
          .count()
      }
  26. it("should count words occurence in all lines") { Given("few lines

    of sentences") val schema = StructType(List( StructField("line", StringType, true) )) val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json")) When("extract and count words") val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf) val wordCount: Array[Row] = wordsCountDf.collect() Then("filtered words should be counted") val actualWordCount = wordCount .map((row: Row) =>Tuple2(row.getAs[String]("word"), row.getAs[Long]("count"))) .toMap val expectedWordCount = Map("Ala" -> 2,"Bolek" -> 1) actualWordCount should be(expectedWordCount) }
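      getResourcePath is not shown in the deck; presumably it resolves a file under src/test/resources to a path Spark can read. A minimal sketch of such a helper (implementation assumed):

      // Assumed helper, e.g. defined in SparkSessionBase: resolves a test resource to a path.
      def getResourcePath(resource: String): String =
        getClass.getResource(resource).getPath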
  27. it("should count words occurence in all lines") { Given("few lines

    of sentences") val schema = StructType(List( StructField("line", StringType, true) )) val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json")) When("extract and count words") val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf) val wordCount: Array[Row] = wordsCountDf.collect() Then("filtered words should be counted") val actualWordCount = wordCount .map((row: Row) =>Tuple2(row.getAs[String]("word"), row.getAs[Long]("count"))) .toMap val expectedWordCount = Map("Ala" -> 2,"Bolek" -> 1) actualWordCount should be(expectedWordCount) }
  28. Integration test – Dataset
      def extractFilterAndCountWordsDataset(wordsDs: Dataset[Line]): Dataset[WordCount] = {
        import wordsDs.sparkSession.implicits._
        wordsDs
          .flatMap((line: Line) => line.text.split(" "))
          .filter((word: String) => word == "Ala" || word == "Bolek")
          .groupBy(col("word"))
          .agg(count("word").as("count"))
          .as[WordCount]
      }
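      The Line and WordCount case classes used by the Dataset examples are not shown in the deck; the shapes below are inferred from how they are used on the surrounding slides:

      // Inferred from usage: a line of input text and an aggregated word count.
      // Defined at top level (not inside a test class) so Spark can serialize them.
      case class Line(text: String)
      case class WordCount(word: String, count: Long)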
  29. it("should return total count of Ala and Bolek words in

    all lines of text") { Given("few sentences") implicit val lineEncoder = product[Line] val lines = List( Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa")) val linesDs: Dataset[Line] = ss.createDataset(lines) When("extract and count words") val wordsCountDs: Dataset[WordCount] = WordsCount .extractFilterAndCountWordsDataset(linesDs) val actualWordCount: Array[WordCount] = wordsCountDs.collect() Then("filtered words should be counted") val expectedWordCount = Array(WordCount("Ala", 2),WordCount("Bolek", 1)) actualWordCount should contain theSameElementsAs expectedWordCount }
  30. it("should return total count of Ala and Bolek words in

    all lines of text") { import spark.implicits._ Given("few sentences") implicit val lineEncoder = product[Line] val linesDs: Dataset[Lines] = List( Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa")).toDS() When("extract and count words") val wordsCountDs: Dataset[WordCount] = WordsCount .extractFilterAndCountWordsDataset(linesDs) val actualWordCount: Array[WordCount] = wordsCountDs.collect() Then("filtered words should be counted") val expectedWordCount = Array(WordCount("Ala", 2),WordCount("Bolek", 1)) actualWordCount should contain theSameElementsAs expectedWordCount }
  31. Things to note
      • What can be tested in integration tests?
        • Single transformation on Spark abstractions
        • Chain of transformations
        • Integration with external services, e.g. Kafka, HDFS, YARN
          • Embedded instances
          • Docker environment
      • Prefer Datasets over RDDs or DataFrames
  32. spark-fast-tests
      class S04_IntegrationDatasetFastTest extends SparkSessionBase with DatasetComparer {
        it("should return total count of Ala and Bolek words in all lines of text") {
          Given("few lines of sentences")
          implicit val lineEncoder = product[Line]
          implicit val wordEncoder = product[WordCount]
          val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
          val linesDs: Dataset[Line] = ss.createDataset(lines)

          When("extract and count words")
          val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)

          Then("filtered words should be counted")
          val expectedDs = ss.createDataset(Array(WordCount("Ala", 2), WordCount("Bolek", 1)))
          assertSmallDatasetEquality(wordsCountDs, expectedDs, orderedComparison = false)
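      spark-fast-tests is pulled in as a test-scoped dependency; in sbt that looks roughly as below (group id and version are from memory and may differ, check the project's README):

      // build.sbt (sketch) – verify the exact coordinates against the spark-fast-tests README
      libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "<version>" % "test"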
  33. Spark Testing Base
      class S06_01_IntegrationDatasetSparkTestingBaseTest extends FunSpec
        with DatasetSuiteBase with GivenWhenThen {

        it("counting word occurences on few lines of text should return count Ala and Bolek words in this text") {
          Given("few lines of sentences")
          implicit val lineEncoder = product[Line]
          implicit val wordEncoder = product[WordCount]
          val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
          val linesDs: Dataset[Line] = spark.createDataset(lines)

          When("extract and count words")
          val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)

          Then("filtered words should be counted")
          val expectedDs: Dataset[WordCount] = spark.createDataset(Seq(WordCount("Bolek", 1), WordCount("Ala", 2)))
          assertDatasetEquals(expected = expectedDs, result = wordsCountDs)
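      spark-testing-base is added the same way; the group id com.holdenkarau appears in the failure message on the next slide, while the artifact version usually encodes both the Spark and library versions (sketch, check the project docs for exact coordinates):

      // build.sbt (sketch) – the version typically looks like "<sparkVersion>_<libVersion>"
      libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "<version>" % "test"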
  34. Spark Testing Base – not so nice failure messages
      • Different length
        1 did not equal 2 Length not Equal
        ScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at ...
      • Different order of elements
        Tuple2;((0,(WordCount(Ala,2),WordCount(Bolek,1))), (1,(WordCount(Bolek,1),WordCount(Ala,2)))) was not empty
      • Different values
        Tuple2;((0,(WordCount(Bole,1),WordCount(Bolek,1)))) was not empty
  35. Streaming – spark-testing-base
      class S06_02_StreamingTest_SparkTestingBase extends FunSuite with StreamingSuiteBase {

        test("count words") {
          val input = List(List("a b"))
          val expected = List(List(("a", 1), ("b", 1)))
          testOperation[String, (String, Int)](input, count _, expected, ordered = false)
        }

        // This is the sample operation we are testing
        def count(lines: DStream[String]): DStream[(String, Int)] = {
          lines.flatMap(_.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
        }
      }
  36. How to design easily testable Spark code?
      • Extract functions so they are reusable and testable
      • A single transformation should do one thing
      • Compose transformations using the "transform" function
      • Prefer Column-based functions over UDFs
        • Column-based functions
        • Dataset operators
        • UDF functions
  37. Column based function
      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions._

      object HelloWorld {
        def withGreeting()(df: DataFrame): DataFrame = {
          df.withColumn("greeting", lit("Hello!!"))
        }
      }
      // def lit(literal: Any): Column
  38. it("appends a greeting column to a Dataframe") { Given("Source dataframe")

    val sourceDF = Seq( ("Quality Excites") ).toDF("name") When("adding greeting column") val actualDF = sourceDF .transform(HelloWorld.withGreeting()) Then("new data frame contains column greeting") val expectedSchema = List(StructField("name", StringType, true),StructField("greeting", StringType, false)) val expectedData = Seq(Row("Quality Excites", ”Hello!!")) val expectedDF = ss.createDataFrame(ss.sparkContext.parallelize(expectedData),StructType(expectedSchema)) assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false) }
  39. it("appends a greeting column to a Dataframe") { Given("Source dataframe")

    val sourceDF = Seq( ("Quality Excites") ).toDF("name") When("adding greeting column") val actualDF = sourceDF .transform(HelloWorld.withGreeting()) .transform(HelloWorld.withGreetingUdf())
  40. object HelloWorld {
        def withGreeting()(df: DataFrame): DataFrame = {
          df.withColumn("greeting", lit("Hello!!"))
        }

        val litFunction: () => String = () => "Hello!!"
        val udfLit = udf(litFunction)

        def withGreetingUdf()(df: DataFrame): DataFrame = {
          df.withColumn("greetingUdf", udfLit())
        }
      }
  41. Pitfalls you should look out for
      • You cannot refer to one RDD inside another RDD
      • Process a batch of data, not a single message or domain entity
      • Case classes defined in a test class body throw a SerializationException (see the sketch below)
      • Spark reads JSON based on the http://jsonlines.org/ specification
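      A simple way around the SerializationException pitfall is to define the case classes at the top level of the test file (or in an object) instead of inside the suite. A minimal sketch with illustrative names, reusing SparkSessionBase from slide 23:

      import org.apache.spark.sql.Encoders

      // Top-level case class: no hidden reference to the enclosing test suite,
      // so Spark can serialize it.
      case class Person(name: String, age: Int)

      class S07_SerializationTest extends SparkSessionBase {
        it("works with case classes defined outside the suite body") {
          implicit val personEncoder = Encoders.product[Person]
          val people = ss.createDataset(Seq(Person("Ala", 30)))
          people.count() should be (1)
        }
      }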
  42. Q&A

  43. References
      • https://databricks.com/session/mastering-spark-unit-testing
      • https://medium.com/@mrpowers/designing-easily-testable-spark-code-df0755ef00a4
      • https://medium.com/@mrpowers/testing-spark-applications-8c590d3215fa
      • http://shop.oreilly.com/product/0636920046967.do
      • https://spark.apache.org/docs/latest/streaming-programming-guide.html
      • https://spark.apache.org/docs/latest/sql-programming-guide.html
      • https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html