

Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017

Apache Spark is a great solution for building Big Data applications. It provides really fast SQL-like processing, a machine learning library, and a streaming module for near real-time processing of data streams. Unfortunately, during application development and production deployments we often encounter difficulties when mixing various data sources or bulk loading computed data into SQL or NoSQL databases.

https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions

Big Data Spain 2017
16th - 17th November Kinépolis Madrid

Big Data Spain

November 23, 2017


Transcript

  1. #BigDataSpain 2017 About Arkadiusz
     • Senior Data Scientist at AGORA SA
       - user profiling & content personalization
       - recommendation system
     • PhD Student at Poznan University of Technology
       - multi-class & multi-label classification
       - multi-output prediction
       - recommendation algorithms
  2. #BigDataSpain 2017 Agora's BigData Team
     my boss Luiza :) it's me! we are all here at #BDS! I invite you to the talk by these guys :)
     Arek, Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel
  3. #BigDataSpain 2017 Spark in Agora's BigData Platform
     (Platform diagram: DATA COLLECTING AND INTEGRATION, USER PROFILING SYSTEM, DATA ANALYTICS,
     RECOMMENDATION SYSTEM, DATA ENRICHMENT AND CONTENT STRUCTURISATION, all on a HADOOP CLUSTER;
     own Spark build, v2.2; Structured Streaming, Spark SQL, MLlib, Spark Streaming; over 3 years of experience)
  4. #BigDataSpain 2017 Problems discussed today
     1. Processing parts of data and loading from Spark to a relational database in parallel
     2. Bulk loading to an HBase database
     3. From a relational database to a Spark DataFrame (with user-defined functions)
     4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
     5. Spark Streaming with Kafka: how to implement your own offset manager
  5. #BigDataSpain 2017 I will show some code…
     • I will show real technical problems we have encountered during Spark deployments
     • We have been using Spark at Agora for over 3 years, so we have solid hands-on experience
     • I will present practical solutions, showing some code in Scala
     • Scala is a natural fit for Spark
  6. #BigDataSpain 2017 Problem 1: Processing and writing parts of data in parallel
     Problem description:
     • We have a huge, already processed DataFrame of computed recommendations for users
     • There are 4 defined types of recommendations
     • For each type we want to take the top-K recommendations for each user
     • Recommendations of each type should be loaded into a different PostgreSQL table

     Example input data:
     User       | Recommendation type | Article   | Score
     Grzegorz   | TYPE_3              | Article F | 1.0
     Bożena     | TYPE_4              | Article B | 0.2
     Grażyna    | TYPE_2              | Article B | 0.2
     Grzegorz   | TYPE_3              | Article D | 0.9
     Krzysztof  | TYPE_3              | Article D | 0.4
     Grażyna    | TYPE_2              | Article C | 0.9
     Grażyna    | TYPE_1              | Article D | 0.3
     Bożena     | TYPE_2              | Article E | 0.9
     Grzegorz   | TYPE_1              | Article E | 1.0
     Grzegorz   | TYPE_1              | Article A | 0.7
  7. #BigDataSpain 2017 Code intro: input & output
     TYPE 1 (5 recos. per user, save to table_1):
       Grzegorz, Article A, 1.0; Grzegorz, Article F, 0.9; Grzegorz, Article C, 0.9;
       Grzegorz, Article D, 0.8; Grzegorz, Article B, 0.75; Bożena, ...
     TYPE 2 (4 recos. per user, save to table_2):
       Krzysztof, Article F, 1.0; Krzysztof, Article D, 1.0; Krzysztof, Article C, 0.8;
       Krzysztof, Article B, 0.85; Grażyna, Article C, 1.0; Grażyna, ...
     TYPE 3 (3 recos. per user, save to table_3):
       Grzegorz, Article E, 1.0; Grzegorz, Article B, 0.75; Grzegorz, Article A, 0.8;
       Bożena, Article E, 0.9; Bożena, Article A, 0.75; Bożena, Article C, 0.75
     TYPE 4 (2 recos. per user, save to table_4):
       Grażyna, Article A, 1.0; Grażyna, Article F, 0.9; Bożena, Article B, 0.9;
       Bożena, Article D, 0.9; Grzegorz, Article B, 1.0; Grzegorz, Article E, 0.95
  8. #BigDataSpain 2017 Standard approach

     recoTypes.foreach(recoType => {
       val topNrecommendations = processedData.where($"type" === recoType.code)
         .withColumn("row_number",
           row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
         .where(col("row_number") <= recoType.recoNum)
         .drop("row_number")
       RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
     })

     Result: no parallelism across the types; within each job there is parallelism, but most of the tasks are skipped.
  9. #BigDataSpain 2017 Maybe we can add .par?

     recoTypes.par.foreach(recoType => {
       val topNrecommendations = processedData.where($"type" === recoType.code)
         .withColumn("row_number",
           row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
         .where(col("row_number") <= recoType.recoNum)
         .drop("row_number")
       RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
     })

     Result: parallelism, but too many tasks :(
  10. #BigDataSpain 2017 Our trick

      parallelizeProcessing(recoTypes, (recoType: RecoType) => {
        val topNrecommendations = processedData.where($"type" === recoType.code)
          .withColumn("row_number",
            row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
          .where(col("row_number") <= recoType.recoNum)
          .drop("row_number")
        RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
      })

      def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
        f(recoTypes.head)                                              // execute the Spark action for the first type…
        if (recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f(_))  // …then parallelize the rest
      }
  11. #BigDataSpain 2017 Problem 2: Fast bulk loading to HBase

      Problems with the standard HBase client (inserts with the Put class):
      • Difficult integration with Spark
      • Complicated parallelization
      • For non pre-split tables, problems with *Region*Exceptions
      • Slow for millions of rows

      Naive pattern: Spark DataFrame / RDD -> .foreachPartition -> hTable.put(…) for each record (see the sketch below).
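      For reference, a minimal sketch of that per-partition Put pattern (not from the slides; the
      RDD element shape (rowKey, column, value) and the table and column-family names are assumptions):

      import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
      import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.rdd.RDD

      def naivePuts(rows: RDD[(String, String, String)]): Unit = {
        rows.foreachPartition { partition =>
          // one connection and table handle per partition, created on the executor side
          val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = connection.getTable(TableName.valueOf("hbase_table_name"))
          try {
            partition.foreach { case (rowKey, column, value) =>
              val put = new Put(Bytes.toBytes(rowKey))
              put.addColumn(Bytes.toBytes("col-fam"), Bytes.toBytes(column), Bytes.toBytes(value))
              table.put(put)   // effectively one RPC per row unless client-side buffering is configured
            }
          } finally {
            table.close()
            connection.close()
          }
        }
      }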
  12. #BigDataSpain 2017 Idea

      Our approach is based on: https://github.com/zeyuanxy/spark-hbase-bulk-loading

      Input RDD (a pair RDD):
      data: RDD[(
        Array[Byte],          // HBase row key
        Map[
          String,             // column family
          Array[(
            String,           // column name
            (String, Long)    // cell value, timestamp
          )]
        ]
      )]

      General idea: save our RDD data as HFiles (the files in which HBase stores its data) and load them into a given pre-existing table.

      General steps:
      1. Implement a Spark Partitioner that defines how the data in a key-value pair RDD should be partitioned by HBase row key (see the sketch after this slide)
      2. Repartition and sort the RDD within column families and the starting row keys of every HBase region
      3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method
      4. Load the files into the table with LoadIncrementalHFiles (HBase API)
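      Step 1 mentions a custom Spark Partitioner that the slides do not show. A simplified sketch
      (it assumes the composite key shape used on the next slide and ignores the region-splitting
      fraction parameter of the original HFilePartitioner):

      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.Partitioner

      // One partition per HBase region: a row key goes to the last region whose start key
      // is <= that row key. startKeys comes from RegionLocator.getStartKeys.
      class SimpleHFilePartitioner(startKeys: Array[Array[Byte]]) extends Partitioner {

        override def numPartitions: Int = startKeys.length

        override def getPartition(key: Any): Int = key match {
          case ((row: Array[Byte], _), _) =>   // composite key: ((rowKey, columnName), timestamp)
            math.max(startKeys.lastIndexWhere(start => Bytes.compareTo(start, row) <= 0), 0)
        }
      }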
  13. #BigDataSpain 2017 Implementation

      // Prepare the HBase connection, table and region locator
      // (hConnection, tableName, hTable prepared earlier ...)
      val regionLocator = hConnection.getRegionLocator(tableName)
      val columnFamilies = hTable.getTableDescriptor
        .getFamiliesKeys.map(Bytes.toString(_))

      // Prepare the Spark partitioner for the HBase regions
      val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

      // Repartition and sort the data within partitions by the partitioner
      val rdds = for {
        family <- columnFamilies
        rdd = data
          .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
          .flatMap { case (key, familyDataMap) =>
            familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
              (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
            }
          }
      } yield getPartitionedRdd(rdd, family, partitioner)   // see the sketch after this slide

      val rddToSave = rdds.reduce(_ ++ _)

      // Prepare the map-reduce job for the bulk load
      HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

      // Prepare the output path for the HFiles
      val fs = FileSystem.get(hbaseConfig)
      val hFilePath = new Path(...)
      try {
        // Save the HFiles to HDFS by saveAsNewAPIHadoopFile
        rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
          classOf[ImmutableBytesWritable], classOf[KeyValue],
          classOf[HFileOutputFormat2], job.getConfiguration)

        // Prepare the HFiles for incremental load by setting
        // folder permissions to read/write/exec for all...
        setRecursivePermission(hFilePath)

        // Load the HFiles into the HBase table
        val loader = new LoadIncrementalHFiles(hbaseConfig)
        loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
      } // finally close resources, ...
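      The getPartitionedRdd helper used above is not shown on the slide. A simplified sketch of what
      it can look like, assuming the key and value shapes from the code above (the ordering mirrors
      the HFile sort order: row key and column ascending, timestamp descending):

      import org.apache.hadoop.hbase.KeyValue
      import org.apache.hadoop.hbase.io.ImmutableBytesWritable
      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.Partitioner
      import org.apache.spark.rdd.RDD

      // Sort order inside each partition: row key asc, column asc, timestamp desc.
      implicit val hfileKeyOrdering: Ordering[((Array[Byte], Array[Byte]), Long)] =
        new Ordering[((Array[Byte], Array[Byte]), Long)] {
          def compare(a: ((Array[Byte], Array[Byte]), Long),
                      b: ((Array[Byte], Array[Byte]), Long)): Int = {
            val rowCmp = Bytes.compareTo(a._1._1, b._1._1)
            if (rowCmp != 0) rowCmp
            else {
              val colCmp = Bytes.compareTo(a._1._2, b._1._2)
              if (colCmp != 0) colCmp else java.lang.Long.compare(b._2, a._2)
            }
          }
        }

      def getPartitionedRdd(
          rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
          family: String,
          partitioner: Partitioner): RDD[(ImmutableBytesWritable, KeyValue)] = {
        rdd
          .repartitionAndSortWithinPartitions(partitioner)   // step 2: repartition and sort
          .map { case (((row, qualifier), ts), value) =>
            (new ImmutableBytesWritable(row),
             new KeyValue(row, Bytes.toBytes(family), qualifier, ts, value))
          }
      }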
  14. #BigDataSpain 2017 Keep in mind

      • Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32) to an optimal value (see the sketch after this slide)
        - for large data, too small a value of this parameter may cause
          IllegalArgumentException: Size exceeds Integer.MAX_VALUE
      • Create HBase tables with splits adapted to the expected row keys
        - example: for row keys of HEX IDs, create the table with splits like:
          create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
        - for further single puts this minimizes *Region*Exceptions
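      A minimal sketch of tuning that parameter, assuming the hbaseConfig passed to LoadIncrementalHFiles
      in the implementation above (the value 128 is only an example):

      import org.apache.hadoop.hbase.HBaseConfiguration

      val hbaseConfig = HBaseConfiguration.create()
      hbaseConfig.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 128)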
  15. #BigDataSpain 2017 Problem 3: Loading data from Postgres to Spark

      This is possible for data from Hive:

      val toUpperCase: String => String = _.toUpperCase
      sparkSession.udf.register("toUpperCaseUdf", toUpperCase)  // register the UDF so it can be used by name in SQL
      val data: DataFrame = sparkSession.sql(
        "SELECT id, toUpperCaseUdf(code) FROM types")

      But this is not possible for data from JDBC (for example PostgreSQL):

      val toUpperCase: String => String = _.toUpperCase
      val toUpperCaseUdf = udf(toUpperCase)
      val jdbcUrl = s"jdbc:postgresql://host:port/database"
      val data: DataFrame = sparkSession.read
        .jdbc(jdbcUrl, "(SELECT toUpperCaseUdf(code) " +
          "FROM codes) as codesData", connectionConf)

      The subquery is executed by Postgres (not by Spark), so the Spark UDF is unknown there; with .jdbc you can only specify a Postgres table name or subquery. And how do we parallelize the data loading?
  16. #BigDataSpain 2017 Our solution

      Try to load the 'raw' data without UDFs and then use .withColumn with the UDF as an expression:

      val toUpperCase: String => String = _.toUpperCase
      sparkSession.udf.register("toUpperCaseUdf", toUpperCase)  // needed so expr(...) can resolve the UDF by name
      val jdbcUrl = s"jdbc:postgresql://host:port/database"
      val data: DataFrame = sparkSession.read
        .jdbc(jdbcUrl, "(SELECT code " +
          "FROM codes) as codesData", connectionConf)
        .withColumn("upperCode", expr("toUpperCaseUdf(code)"))

      .jdbc produces a DataFrame, but it is read as one partition!

      We can split the table read across executors on a selected column:

      val jdbcUrl = s"jdbc:postgresql://host:port/database"
      val data: DataFrame = sparkSession.read
        .jdbc(
          url = jdbcUrl,
          table = "(SELECT code, type_id " +
            "FROM codes) as codesData",
          columnName = "type_id",
          lowerBound = 1L,
          upperBound = 100L,
          numPartitions = 10,
          connectionProperties = connectionConf)
  17. #BigDataSpain 2017 Is it working?

      // test data, 1 partition:
      spark.read.jdbc(
        url = "jdbc:mysql://localhost:3306/test",
        table = "users",
        properties = connectionProperties)
        .cache()

      // test data, 4 partitions:
      spark.read.jdbc(
        url = "jdbc:mysql://localhost:3306/test",
        table = "users",
        columnName = "type",
        lowerBound = 1L,
        upperBound = 100L,
        numPartitions = 4,
        connectionProperties = connectionProperties)
        .cache()
  18. #BigDataSpain 2017 Problem 4: From HBase to Spark via Hive

      There is a commonly used method for loading data from HBase into Spark through a Hive external table:

      CREATE TABLE hive_view_on_hbase (
        key int,
        value string
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = ":key, cf1:val"
      )
      TBLPROPERTIES (
        "hbase.table.name" = "xyz"
      );

      Example HBase rows (column family: cities) and the desired Hive/Spark output:

      user_id      | cities_map                                       | last_city
      72A9DBA74524 | map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3) | ?
      58383B36275A | map(Warsaw->120, Cracow->60, Gdansk->5)          | ?
      009D22419988 | map(Poznan->75, Warsaw->1)                       | ?

      The Hive-HBase-Handler gives us the cell values, but how do we get the last (most recent) ones? Where are the timestamps?
  19. #BigDataSpain 2017 Our case

      • We use the HDP distribution of the Hadoop cluster, with HBase 1.1.x
      • It is possible to add the latest row-modification timestamp to the Hive view on an HBase table
        (a sketch of reading this view from Spark follows after this slide):

      CREATE TABLE hive_view_on_hbase (
        key int,
        value string,
        ts timestamp
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
      )
      TBLPROPERTIES (
        'hbase.table.name' = 'xyz'
      );

      • But how do we extract the timestamp of each cell?
      • Answer: rewrite the Hive-HBase-Handler, which is responsible for creating the Hive views on HBase tables :) … but first …
      • Do not download the Hive source code from the main Hive GitHub repository - check your Hadoop distribution! (for example, HDP has its own code branch)
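      A minimal sketch of consuming such a view from Spark (not from the slides; it assumes a
      SparkSession built with Hive support and the hive_view_on_hbase table above). Note that the
      :timestamp mapping only exposes the latest modification timestamp of the whole row, not
      per-cell timestamps:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("hbase-via-hive")   // hypothetical application name
        .enableHiveSupport()
        .getOrCreate()

      // The Hive external table is queried like any other table registered in the metastore.
      val df = spark.sql("SELECT key, value, ts FROM hive_view_on_hbase")
      df.show()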
  20. #BigDataSpain 2017 There is a patch in the Hive repository… …but it is still not reviewed and merged :(
  21. #BigDataSpain 2017 There is a lot of code… …but we have some tips on how to change the Hive-HBase-Handler:
      • The parsing of the hbase.columns.mapping columns is located in HBaseSerDe.java, which returns a ColumnMappings object
      • The LazyHBaseRow class stores the data of an HBase row
      • Timestamps of processed HBase cells can be read from the rows loaded (by the scanner) in the LazyHBaseCellMap class
      • The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
  22. #BigDataSpain 2017 Problem 5: Spark + Kafka: our own offset manager

      Problem description:
      • Spark output operations are at-least-once
      • For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
      • Options:
        1. Checkpoints (see the sketch after this slide)
           + easy to enable with Spark checkpointing
           - the output operation must be idempotent
           - you cannot recover from a checkpoint if the application code has changed
        2. Own data store
           + works regardless of changes to your application code
           + you can use data stores that support transactions
           + exactly-once semantics

      (Single Spark batch: process and save the data, then save the offsets.
       Image source: Spark Streaming documentation, https://spark.apache.org/docs/latest/streaming-programming-guide.html)
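      A minimal sketch of option 1 (not from the slides; the checkpoint directory, application name
      and batch interval are assumptions):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // hypothetical path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("kafka-streaming-with-checkpoints")
        val ssc = new StreamingContext(conf, Seconds(30))
        ssc.checkpoint(checkpointDir)
        // ... define the Kafka stream and the output operations here ...
        ssc
      }

      // Rebuilds the StreamingContext from the checkpoint after a restart;
      // this only works as long as the application code has not changed.
      val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
      ssc.start()
      ssc.awaitTermination()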
  23. #BigDataSpain 2017 Some code with Spark Streaming

      val ssc: StreamingContext = new StreamingContext(…)
      val stream: DStream[ConsumerRecord[String, String]] = ...

      stream.foreachRDD(rdd => {
        val toSave: Seq[String] = rdd.collect().map(_.value())
        saveData(toSave)                     // process and save the data of a single batch…
        offsetsStore.saveOffsets(rdd, ...)   // …then save the offsets
      })
  24. #BigDataSpain 2017 Some code with Spark Streaming

      val ssc: StreamingContext = new StreamingContext(...)
      val stream: DStream[ConsumerRecord[String, String]] =
        kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)

      stream.foreachRDD(rdd => {
        val toSave: Seq[String] = rdd.collect().map(_.value())
        saveData(toSave)
        offsetsStore.saveOffsets(rdd, zkPath)
      })

      def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext,
                      offsetsStore: MyOffsetsStore,
                      kafkaParams: Map[String, Object]): DStream[ConsumerRecord[String, String]] = {
        offsetsStore.readOffsets(topic, zkPath) match {
          case Some(offsetsMap) =>
            KafkaUtils.createDirectStream[String, String](ssc,
              LocationStrategies.PreferConsistent,
              ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
          case None =>
            KafkaUtils.createDirectStream[String, String](ssc,
              LocationStrategies.PreferConsistent,
              ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
        }
      }
  25. #BigDataSpain 2017 Code of the offset store

      class MyOffsetsStore(zkHosts: String) {

        val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

        def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
          val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          offsetsRanges.groupBy(_.topic).foreach { case (topic, offsetsRangesPerTopic) =>
            val offsetsRangesStr = offsetsRangesPerTopic
              .map(offRang => s"${offRang.partition}:${offRang.untilOffset}")
              .mkString(",")
            zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
          }
        }

        def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
          val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
          offsetsRangesStrOpt match {
            case Some(offsetsRangesStr) =>
              Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
                case Array(partitionStr, offsetStr) =>
                  new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
              }.toMap)
            case None => None
          }
        }
      }