Spark streaming HadoopCon 2016

Erica Li
September 12, 2016

Presented by Mark Yang

Transcript

  1. About Me • Mark Yang (楊擇中) • Co-founder of the Taiwan Spark User Group (Taipei) • Scala / Akka / Spark • Technical Manager at Boundless Cloud (大千雲端)
  2. Window Transformations
     • window(windowLength, slideInterval)
     • countByWindow(windowLength, slideInterval)
     • reduceByWindow(func, windowLength, slideInterval)
     • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
     • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
     • countByValueAndWindow(windowLength, slideInterval, [numTasks])
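A minimal runnable sketch of the windowed transformations above, assuming a socket text source on localhost:9999 and a local checkpoint directory (the invFunc variant of reduceByKeyAndWindow requires checkpointing):

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}

     object WindowExample {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
         // Batch interval of 10 seconds; window and slide durations must be multiples of it
         val ssc = new StreamingContext(conf, Seconds(10))
         ssc.checkpoint("/tmp/checkpoint") // required by the invFunc variant below

         val pairs = ssc.socketTextStream("localhost", 9999)
           .flatMap(_.split(" "))
           .map(word => (word, 1))

         // Count words over the last 30 seconds, recomputed every 10 seconds;
         // the inverse function (_ - _) subtracts the batch that slides out of
         // the window instead of recomputing the whole window from scratch
         val windowedCounts =
           pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))
         windowedCounts.print()

         ssc.start()
         ssc.awaitTermination()
       }
     }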
  3. Output
     • print()
     • saveAsTextFiles(prefix, [suffix])
     • saveAsObjectFiles(prefix, [suffix])
     • saveAsHadoopFiles(prefix, [suffix])
     • foreachRDD(func)
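For illustration, continuing the windowedCounts stream from the sketch above (the HDFS output path is an assumption):

     // Print the first 10 elements of every batch on the driver
     windowedCounts.print()

     // Write each window's result as text files named
     // hdfs:///tmp/wordcounts-<timeInMs>.txt
     windowedCounts.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt")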
  4. foreachRDD-Design Pattern

     dstream.foreachRDD { rdd =>
       rdd.foreachPartition { partitionOfRecords =>
         // One connection per partition, created on the worker
         // rather than serialized from the driver
         val connection = createNewConnection()
         partitionOfRecords.foreach(record => connection.send(record))
         connection.close()
       }
     }
  5. foreachRDD-Design Pattern

     dstream.foreachRDD { rdd =>
       rdd.foreachPartition { partitionOfRecords =>
         // ConnectionPool is a static, lazily initialized pool of connections
         val connection = ConnectionPool.getConnection()
         partitionOfRecords.foreach(record => connection.send(record))
         ConnectionPool.returnConnection(connection) // return to the pool for future reuse
       }
     }
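The slides leave ConnectionPool abstract. One way to realize it is a per-executor singleton over a thread-safe queue; the Connection class, host, and port here are illustration assumptions, not part of the original deck:

     import java.io.PrintWriter
     import java.net.Socket
     import java.util.concurrent.ConcurrentLinkedQueue

     // Hypothetical wrapper matching the slides' connection.send(record)
     class Connection(host: String, port: Int) {
       private val socket = new Socket(host, port)
       private val out = new PrintWriter(socket.getOutputStream, true)
       def send(record: String): Unit = out.println(record)
       def isOpen: Boolean = !socket.isClosed
     }

     object ConnectionPool {
       // A static pool (one per executor JVM), populated lazily on first use
       private val pool = new ConcurrentLinkedQueue[Connection]()

       def getConnection(): Connection = {
         val conn = pool.poll()
         if (conn == null || !conn.isOpen) new Connection("localhost", 9000) else conn
       }

       def returnConnection(conn: Connection): Unit = pool.offer(conn)
     }

Because the object is initialized on each executor, connections are created where they are used and reused across batches instead of being rebuilt for every partition.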
  6. Example DStream+DataFrame

     words.foreachRDD { rdd =>
       // Spark 1.x: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
       val sparkSession =
         SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate() // Spark 2.0
       import sparkSession.implicits._
       val wordsDataFrame = rdd.toDF("word")
       // Spark 1.x: wordsDataFrame.registerTempTable("words")
       wordsDataFrame.createOrReplaceTempView("words") // Spark 2.0
       val wordCountsDataFrame =
         sparkSession.sql("select word, count(*) as total from words group by word")
       wordCountsDataFrame.show()
     }
  7. Getting Started with Kafka
     • Create a topic:
       bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
     • List topics:
       bin/kafka-topics.sh --list --zookeeper localhost:2181
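To wire this topic into Spark Streaming, a minimal sketch using the direct (receiver-less) API from spark-streaming-kafka-0-8, which matches the Spark 1.x/2.0 era of this talk; the broker address localhost:9092 is an assumption for a default local install:

     import kafka.serializer.StringDecoder
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.kafka.KafkaUtils

     object KafkaWordCount {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
         val ssc = new StreamingContext(conf, Seconds(10))

         // Read the "test" topic created above; records arrive as (key, value) pairs
         val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
         val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
           ssc, kafkaParams, Set("test"))

         messages.map(_._2)        // keep only the message value
           .flatMap(_.split(" "))
           .map((_, 1))
           .reduceByKey(_ + _)
           .print()

         ssc.start()
         ssc.awaitTermination()
       }
     }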