Spark Streaming HadoopCon 2016

Erica Li
September 12, 2016

  1. About Me
     • Mark Yang 楊擇中
     • Co-founder, Taiwan Spark User Group (Taipei)
     • Scala / Akka / Spark
     • Technical Manager, Boundless Cloud 大千雲端
  2. Window Transformations
     • window(windowLength, slideInterval)
     • countByWindow(windowLength, slideInterval)
     • reduceByWindow(func, windowLength, slideInterval)
     • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
     • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
     • countByValueAndWindow(windowLength, slideInterval, [numTasks])
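
The two reduceByKeyAndWindow variants differ only in how each window is computed: the first recomputes it from scratch, the second updates the previous result with func for batches entering the window and invFunc for batches leaving it. The sketch below models this in plain Scala, without Spark. WindowModel and its method names are illustrative assumptions, and windowLength and slideInterval are counted in batches rather than given as durations.

```scala
// Plain-Scala model of DStream windowing; no Spark dependency.
// Each inner Seq stands for the records of one batch interval.
object WindowModel {
  type Batch = Seq[String]

  // Batches covered by each window. Both arguments are counted in batches,
  // mirroring the rule that window length and slide interval must be
  // multiples of the batch interval.
  def windows(batches: Seq[Batch], windowLength: Int, slideInterval: Int): Seq[Seq[Batch]] =
    (slideInterval to batches.length by slideInterval).map { end =>
      batches.slice(math.max(0, end - windowLength), end)
    }

  def countWords(bs: Seq[Batch]): Map[String, Int] =
    bs.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Like reduceByKeyAndWindow(func, ...): every window is recomputed from scratch.
  def countsFull(batches: Seq[Batch], windowLength: Int, slideInterval: Int): Seq[Map[String, Int]] =
    windows(batches, windowLength, slideInterval).map(countWords)

  // Like reduceByKeyAndWindow(func, invFunc, ...): start from the previous window,
  // add the entering batches (func = +) and subtract the leaving ones (invFunc = -).
  def countsIncremental(batches: Seq[Batch], windowLength: Int, slideInterval: Int): Seq[Map[String, Int]] = {
    val ends = slideInterval to batches.length by slideInterval
    ends.scanLeft(Map.empty[String, Int]) { (prev, end) =>
      val prevEnd   = end - slideInterval
      val prevStart = math.max(0, prevEnd - windowLength)
      val newStart  = math.max(0, end - windowLength)
      val entering  = countWords(batches.slice(math.max(newStart, prevEnd), end))
      val leaving   = countWords(batches.slice(prevStart, math.min(newStart, prevEnd)))
      val added = entering.foldLeft(prev) { case (acc, (w, n)) =>
        acc.updated(w, acc.getOrElse(w, 0) + n)
      }
      val removed = leaving.foldLeft(added) { case (acc, (w, n)) =>
        acc.updated(w, acc(w) - n)
      }
      // Drop words whose count fell to zero so the result matches a full recompute.
      removed.filter(_._2 > 0)
    }.tail
  }
}
```

With a window of 3 batches sliding by 2, each update touches only the 2 entering batches and the 2 leaving ones, instead of re-reading all 3 batches in the window. In Spark itself the invFunc variant requires checkpointing to be enabled.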
  3. Output
     • print()
     • saveAsTextFiles(prefix, [suffix])
     • saveAsObjectFiles(prefix, [suffix])
     • saveAsHadoopFiles(prefix, [suffix])
     • foreachRDD(func)
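
The saveAs* operations write one output per batch. According to the Spark Streaming programming guide, each name is built as prefix-TIME_IN_MS[.suffix]. The helper below only models that naming rule. OutputNames and batchOutputName are hypothetical names used for illustration, not part of the Spark API.

```scala
// Sketch of the per-batch output naming, assuming the documented
// "prefix-TIME_IN_MS[.suffix]" rule. Not a Spark API.
object OutputNames {
  // batchTimeMs is the batch time in milliseconds since the epoch.
  def batchOutputName(prefix: String, suffix: String, batchTimeMs: Long): String = {
    val withPrefix =
      if (prefix.nonEmpty) s"$prefix-$batchTimeMs" else batchTimeMs.toString
    // The suffix is optional; it is appended after a dot when present.
    if (suffix.nonEmpty) s"$withPrefix.$suffix" else withPrefix
  }
}
```

For example, OutputNames.batchOutputName("out/words", "txt", 5000L) yields "out/words-5000.txt". With saveAsTextFiles, each such name is a directory of part files, one directory per batch.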
  4. foreachRDD Design Pattern
     dstream.foreachRDD { rdd =>
       rdd.foreachPartition { partitionOfRecords =>
         val connection = createNewConnection()
         partitionOfRecords.foreach(record => connection.send(record))
         connection.close()
       }
     }
  5. foreachRDD Design Pattern
     dstream.foreachRDD { rdd =>
       rdd.foreachPartition { partitionOfRecords =>
         // ConnectionPool is a static, lazily initialized pool of connections
         val connection = ConnectionPool.getConnection()
         partitionOfRecords.foreach(record => connection.send(record))
         ConnectionPool.returnConnection(connection) // return to the pool for future reuse
       }
     }
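
The slide uses ConnectionPool without showing it. One possible shape is sketched below, assuming a Scala object backed by a concurrent queue. The Connection class, its counters, and PoolUsage are hypothetical stand-ins for a real sink client, and a production pool would also need size limits and health checks.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection type; a real sink would wrap a socket, JDBC or HTTP client.
final class Connection(val id: Int) {
  private val sentCount = new AtomicInteger(0)
  // Placeholder for a real write: it only counts the records it was given.
  def send(record: String): Unit = { sentCount.incrementAndGet(); () }
  def sent: Int = sentCount.get()
}

// A Scala object is initialized lazily, once per JVM, so each executor
// builds its own pool the first time a task touches it.
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[Connection]()
  private val created = new AtomicInteger(0)

  // Reuse an idle connection if there is one, otherwise open a new one.
  def getConnection(): Connection =
    Option(idle.poll()).getOrElse(new Connection(created.incrementAndGet()))

  // Hand the connection back so a later partition or batch can reuse it.
  def returnConnection(connection: Connection): Unit = { idle.offer(connection); () }

  def createdCount: Int = created.get()
}

object PoolUsage {
  // Same shape as the function passed to rdd.foreachPartition on the slide,
  // with try/finally so the connection is returned even if a send fails.
  def sendPartition(partitionOfRecords: Iterator[String]): Unit = {
    val connection = ConnectionPool.getConnection()
    try partitionOfRecords.foreach(record => connection.send(record))
    finally ConnectionPool.returnConnection(connection)
  }
}
```

Calling PoolUsage.sendPartition for several partitions in sequence on the same JVM opens only one connection, which is the saving over creating and closing a connection per partition.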
  6. Example: DStream + DataFrame
     words.foreachRDD { rdd =>
       val sparkSession =
         // SQLContext.getOrCreate(rdd.sparkContext) // Spark 1.x
         SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate() // Spark 2.0
       import sparkSession.implicits._
       val wordsDataFrame = rdd.toDF("word")
       wordsDataFrame
         // .registerTempTable("words") // Spark 1.x
         .createOrReplaceTempView("words") // Spark 2.0
       val wordCountsDataFrame =
         sparkSession.sql("select word, count(*) as total from words group by word")
       wordCountsDataFrame.show()
     }
  7. Getting Started with Kafka
     • Create a topic:
       bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
     • List topics:
       bin/kafka-topics.sh --list --zookeeper localhost:2181