
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017

Apache Spark is a great solution for building Big Data applications. It provides really fast SQL-like processing, a machine learning library, and a streaming module for near real-time processing of data streams. Unfortunately, during application development and production deployments we often encounter difficulties in mixing various data sources or bulk-loading computed data into SQL or NoSQL databases.

https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions

Big Data Spain 2017
16th - 17th November, Kinépolis Madrid

Big Data Spain

November 23, 2017

Transcript

  1. #BigDataSpain 2017 About Arkadiusz
     • Senior Data Scientist at AGORA SA: user profiling & content personalization, recommendation system
     • PhD Student at Poznan University of Technology: multi-class & multi-label classification, multi-output prediction, recommendation algorithms
  2. #BigDataSpain 2017 Agora's BigData Team
     my boss Luiza :) • it's me! • we are all here at #BDS! I invite you to the talk by these guys :)
     Arek, Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel
  3. #BigDataSpain 2017 Spark in Agora's BigData Platform
     Platform components: data collecting and integration • user profiling system • data analytics • recommendation system • data enrichment and content structurisation
     Hadoop cluster (own Spark build, v2.2) • structured streaming • Spark SQL, MLlib • Spark Streaming • over 3 years of experience
  4. #BigDataSpain 2017 Problems discussed today
     1. Processing parts of data and loading from Spark to a relational database in parallel
     2. Bulk loading to HBase
     3. From a relational database to a Spark DataFrame (with user-defined functions)
     4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
     5. Spark Streaming with Kafka - how to implement your own offset manager
  5. #BigDataSpain 2017 I will show some code…
     • I will show real technical problems we have encountered during Spark deployments
     • We have used Spark at Agora for over 3 years, so we have solid experience
     • I will present practical solutions, showing some code in Scala
     • Scala is natural for Spark
  6. #BigDataSpain 2017 1. Processing and writing parts of data in parallel
     Problem description:
     • We have a huge, processed DataFrame of computed recommendations for users
     • There are 4 defined types of recommendations
     • For each type we want to take the top-K recommendations for each user
     • Recommendations of each type should be loaded into a different PostgreSQL table
     Example input (User, Recommendation type, Article, Score):
       Grzegorz, TYPE_3, Article F, 1.0
       Bożena, TYPE_4, Article B, 0.2
       Grażyna, TYPE_2, Article B, 0.2
       Grzegorz, TYPE_3, Article D, 0.9
       Krzysztof, TYPE_3, Article D, 0.4
       Grażyna, TYPE_2, Article C, 0.9
       Grażyna, TYPE_1, Article D, 0.3
       Bożena, TYPE_2, Article E, 0.9
       Grzegorz, TYPE_1, Article E, 1.0
       Grzegorz, TYPE_1, Article A, 0.7
  7. #BigDataSpain 2017 Code intro: input & output
     TYPE 1 (5 recos per user, save to table_1):
       Grzegorz, Article A, 1.0 • Grzegorz, Article F, 0.9 • Grzegorz, Article C, 0.9 • Grzegorz, Article D, 0.8 • Grzegorz, Article B, 0.75 • Bożena, ... ...
     TYPE 2 (4 recos per user, save to table_2):
       Krzysztof, Article F, 1.0 • Krzysztof, Article D, 1.0 • Krzysztof, Article C, 0.8 • Krzysztof, Article B, 0.85 • Grażyna, Article C, 1.0 • Grażyna, ... ...
     TYPE 3 (3 recos per user, save to table_3):
       Grzegorz, Article E, 1.0 • Grzegorz, Article B, 0.75 • Grzegorz, Article A, 0.8 • Bożena, Article E, 0.9 • Bożena, Article A, 0.75 • Bożena, Article C, 0.75
     TYPE 4 (2 recos per user, save to table_4):
       Grażyna, Article A, 1.0 • Grażyna, Article F, 0.9 • Bożena, Article B, 0.9 • Bożena, Article D, 0.9 • Grzegorz, Article B, 1.0 • Grzegorz, Article E, 0.95
  8. #BigDataSpain 2017 Standard approach
       recoTypes.foreach(recoType => {
         val topNrecommendations = processedData.where($"type" === recoType.code)
           .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
           .where(col("row_number") <= recoType.recoNum).drop("row_number")
         RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
       })
     Slide annotations: no parallelism; parallelism, but most of the tasks skipped
  9. #BigDataSpain 2017 maybe we can add .par ?
       recoTypes.par.foreach(recoType => {
         val topNrecommendations = processedData.where($"type" === recoType.code)
           .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
           .where(col("row_number") <= recoType.recoNum).drop("row_number")
         RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
       })
     Slide annotation: parallelism, but too many tasks :(
  10. #BigDataSpain 2017 Our trick
       parallelizeProcessing(recoTypes, (recoType: RecoType) => {
         val topNrecommendations = processedData.where($"type" === recoType.code)
           .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
           .where(col("row_number") <= recoType.recoNum).drop("row_number")
         RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
       })

       def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
         f(recoTypes.head)
         if (recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f(_))
       }
     Slide annotations: execute the Spark action for the first type first… then parallelize the rest
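     A minimal sketch of the context this snippet assumes; RecoType and the cached input DataFrame are not shown in the deck, so the definitions below are illustrative:

       // Hypothetical definitions matching the slide's snippet (not from the deck)
       case class RecoType(code: String, recoNum: Int, tableName: String)

       val recoTypes = Seq(
         RecoType("TYPE_1", 5, "table_1"),
         RecoType("TYPE_2", 4, "table_2"),
         RecoType("TYPE_3", 3, "table_3"),
         RecoType("TYPE_4", 2, "table_4")
       )

       // Caching matters here: the first, sequential call materializes processedData,
       // so the remaining calls running in parallel reuse it instead of recomputing it.
       val processedData = sparkSession.table("recommendations").cache()

     Running the first type sequentially forces the shared DataFrame to be computed once; only the remaining, now cheap, types are then processed in parallel.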
  11. #BigDataSpain 2017 2. Fast bulk-loading to HBase
     Problems with the standard HBase client (inserts with the Put class):
     • Difficult integration with Spark
     • Complicated parallelization
     • For non pre-split tables, problems with *Region*Exceptions
     • Slow for millions of rows
     Diagram: Spark DataFrame / RDD -> .foreachPartition -> hTable.put(…) in each partition
  12. #BigDataSpain 2017 Idea
     Our approach is based on: https://github.com/zeyuanxy/spark-hbase-bulk-loading
     Input RDD:
       data: RDD[(            // pair RDD
         Array[Byte],         // HBase row key
         Map[                 // data:
           String,            // column family
           Array[(
             String,          // column name
             (String, Long)   // cell value, timestamp
           )]
         ]
       )]
     General idea: we have to save our RDD data as HFiles (HBase stores its data in such files) and load them into a given pre-existing table.
     General steps:
     1. Implement a Spark Partitioner that defines how our key-value pair RDD should be partitioned by HBase row key
     2. Repartition and sort the RDD within column families and starting row keys for every HBase region
     3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method
     4. Load the files into the table with LoadIncrementalHFiles (HBase API)
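     For illustration, a minimal sketch of shaping rows into that pair-RDD format; the input table, column names, and the single "cities" column family are assumptions, not from the deck:

       import org.apache.hadoop.hbase.util.Bytes

       // Hypothetical input rows: (userId, city, count), written under column family
       // "cities" with one column per city and the current time as the cell timestamp.
       val now = System.currentTimeMillis()
       val data = sparkSession.table("city_counts").rdd
         .map(row => (row.getString(0), (row.getString(1), row.getLong(2))))
         .groupByKey()
         .map { case (userId, cells) =>
           (Bytes.toBytes(userId),
            Map("cities" -> cells.map { case (city, count) => (city, (count.toString, now)) }.toArray))
         }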
  13. #BigDataSpain 2017 Implementation
       // Prepare hConnection, tableName, hTable ...
       val regionLocator = hConnection.getRegionLocator(tableName)
       val columnFamilies = hTable.getTableDescriptor
         .getFamiliesKeys.map(Bytes.toString(_))
       val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

       // prepare partitioned RDD
       val rdds = for {
         family <- columnFamilies
         rdd = data
           .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
           .flatMap { case (key, familyDataMap) =>
             familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
               (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
             }
           }
       } yield getPartitionedRdd(rdd, family, partitioner)

       val rddToSave = rdds.reduce(_ ++ _)

       // prepare map-reduce job for bulk-load
       HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

       // prepare path for HFiles output
       val fs = FileSystem.get(hbaseConfig)
       val hFilePath = new Path(...)

       try {
         rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
           classOf[ImmutableBytesWritable], classOf[KeyValue],
           classOf[HFileOutputFormat2], job.getConfiguration)

         // prepare HFiles for incremental load by setting
         // folder permissions to read/write/exec for all...
         setRecursivePermission(hFilePath)

         val loader = new LoadIncrementalHFiles(hbaseConfig)
         loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
       } // finally close resources, ...
     Slide annotations: prepare the HBase connection, table and region locator; prepare a Spark partitioner for the HBase regions; repartition and sort data within partitions by the partitioner; save HFiles to HDFS by saveAsNewAPIHadoopFile; load the HFiles into the HBase table
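     The HFilePartitioner and getPartitionedRdd helpers used above are not shown in the deck; below is a minimal sketch loosely following the linked spark-hbase-bulk-loading project, with names and details that are assumptions rather than the deck's exact code:

       import org.apache.hadoop.hbase.KeyValue
       import org.apache.hadoop.hbase.io.ImmutableBytesWritable
       import org.apache.hadoop.hbase.util.Bytes
       import org.apache.spark.Partitioner
       import org.apache.spark.rdd.RDD

       // Routes each ((rowKey, column), timestamp) key to the HBase region whose start key
       // covers the row key; fraction > 1 splits a region's data into several HFiles.
       class HFilePartitioner(startKeys: Array[Array[Byte]], fraction: Int) extends Partitioner {
         override def numPartitions: Int = startKeys.length * fraction
         override def getPartition(key: Any): Int = {
           val rowKey = key.asInstanceOf[((Array[Byte], Array[Byte]), Long)]._1._1
           val region = math.max(0, startKeys.lastIndexWhere(sk => Bytes.compareTo(rowKey, sk) >= 0))
           region * fraction + (Bytes.hashCode(rowKey) & Int.MaxValue) % fraction
         }
       }

       // Repartitions by region, sorts cells by (row key, qualifier, newest timestamp first),
       // and maps them to the (ImmutableBytesWritable, KeyValue) pairs HFileOutputFormat2 expects.
       // The string-based ordering below is a simplification; arbitrary binary row keys
       // would need a strict byte-order comparison.
       def getPartitionedRdd(rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
                             family: String,
                             partitioner: HFilePartitioner): RDD[(ImmutableBytesWritable, KeyValue)] = {
         implicit val ordering: Ordering[((Array[Byte], Array[Byte]), Long)] =
           Ordering.by { case ((rk, col), ts) => (Bytes.toString(rk), Bytes.toString(col), -ts) }
         rdd
           .repartitionAndSortWithinPartitions(partitioner)
           .map { case (((rowKey, column), ts), value) =>
             (new ImmutableBytesWritable(rowKey),
              new KeyValue(rowKey, Bytes.toBytes(family), column, ts, value))
           }
       }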
  14. #BigDataSpain 2017 Keep in mind
     • Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily optimally (default 32)
       - for large data, a value that is too small may cause IllegalArgumentException: Size exceeds Integer.MAX_VALUE
     • Create HBase tables with splits adapted to the expected row keys
       - example: for row keys of HEX IDs, create the table with splits like:
         create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
       - for further single puts it minimizes *Region*Exceptions
  15. #BigDataSpain 2017 3. Loading data from Postgres to Spark
     This is possible for data from Hive:
       val toUpperCase: String => String = _.toUpperCase
       val toUpperCaseUdf = udf(toUpperCase)
       val data: DataFrame = sparkSession.sql(
         "SELECT id, toUpperCaseUdf(code) FROM types")
     But this is not possible for data from JDBC (for example PostgreSQL):
       val toUpperCase: String => String = _.toUpperCase
       val toUpperCaseUdf = udf(toUpperCase)
       val jdbcUrl = s"jdbc:postgresql://host:port/database"
       val data: DataFrame = sparkSession.read
         .jdbc(jdbcUrl, "(SELECT toUpperCaseUdf(code) " +
           "FROM codes) as codesData", connectionConf)
     Slide annotations: this query is executed by Postgres (not Spark); here you can specify just a Postgres table name; and how to parallelize data loading?
  16. #BigDataSpain 2017 Our solution
     Try to load 'raw' data without UDFs and then use .withColumn with the UDF as an expression:
       val toUpperCase: String => String = _.toUpperCase
       val toUpperCaseUdf = udf(toUpperCase)
       // the UDF must also be registered by name so it can be referenced inside expr()
       sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
       val jdbcUrl = s"jdbc:postgresql://host:port/database"
       val data: DataFrame = sparkSession.read
         .jdbc(jdbcUrl, "(SELECT code " +
           "FROM codes) as codesData", connectionConf)
         .withColumn("upperCode", expr("toUpperCaseUdf(code)"))
     .jdbc produces a DataFrame, but it's one partition!
     We will split the table read across executors on the selected column:
       val jdbcUrl = s"jdbc:postgresql://host:port/database"
       val data: DataFrame = sparkSession.read
         .jdbc(
           url = jdbcUrl,
           table = "(SELECT code, type_id " +
             "FROM codes) as codesData",
           columnName = "type_id",
           lowerBound = 1L,
           upperBound = 100L,
           numPartitions = 10,
           connectionProperties = connectionConf)
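     A small companion sketch for picking lowerBound and upperBound dynamically instead of hard-coding 1 and 100; the aggregate subquery below is an assumption, not from the deck:

       // Query the source table once for the min/max of the partitioning column,
       // then use those values as the JDBC partitioning bounds.
       val bounds = sparkSession.read
         .jdbc(jdbcUrl, "(SELECT min(type_id) AS lo, max(type_id) AS hi FROM codes) AS b", connectionConf)
         .first()
       val (lowerBound, upperBound) =
         (bounds.getAs[Number]("lo").longValue(), bounds.getAs[Number]("hi").longValue())

     Note that the bounds only control how the column range is split into partitions; rows outside the bounds still end up in the first and last partitions.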
  17. #BigDataSpain 2017 Is it working? (test data)
       spark.read.jdbc(
         url = "jdbc:mysql://localhost:3306/test",
         table = "users",
         properties = connectionProperties)
       .cache()
       // 1 partition

       spark.read.jdbc(
         url = "jdbc:mysql://localhost:3306/test",
         table = "users",
         columnName = "type",
         lowerBound = 1L,
         upperBound = 100L,
         numPartitions = 4,
         connectionProperties = connectionProperties)
       .cache()
       // 4 partitions
  18. #BigDataSpain 2017 4. From HBase to Spark by Hive
     There is a commonly used method for loading data from HBase into Spark via a Hive external table:
       CREATE TABLE hive_view_on_hbase (
         key int,
         value string
       )
       STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
       WITH SERDEPROPERTIES (
         "hbase.columns.mapping" = ":key, cf1:val"
       )
       TBLPROPERTIES (
         "hbase.table.name" = "xyz"
       );
     Example HBase table (column family "cities", one column per city):
       72A9DBA74524: Poznan=40, Warsaw=5, Cracow=1, Gdansk=3
       58383B36275A: Warsaw=120, Cracow=60, Gdansk=5
       009D22419988: Poznan=75, Warsaw=1
     Resulting Hive view via the Hive HBase handler (user_id, cities_map, last_city):
       72A9DBA74524: map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3), last_city = ?
       58383B36275A: map(Warsaw->120, Cracow->60, Gdansk->5), last_city = ?
       009D22419988: map(Poznan->75, Warsaw->1), last_city = ?
     But how to get the last (most recent) values? Where are the timestamps?
  19. #BigDataSpain 2017 Our case
     • We use the HDP distribution of Hadoop with HBase 1.1.x
     • It is possible to add the latest row-modification timestamp to the Hive view on an HBase table:
       CREATE TABLE hive_view_on_hbase (
         key int,
         value string,
         ts timestamp
       )
       STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
       WITH SERDEPROPERTIES (
         'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
       )
       TBLPROPERTIES (
         'hbase.table.name' = 'xyz'
       );
     • But how to extract the timestamp of each cell?
     • Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …
     • Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution! (for example, HDP has its own code branch)
  20. #BigDataSpain 2017 There is a patch in the Hive repo… …but it is still not reviewed and merged :(
  21. #BigDataSpain 2017 There is a lot of code… …but we have some tips on how to change the Hive-HBase-Handler:
     • The parsing of hbase.columns.mapping columns is located in HBaseSerDe.java, which returns a ColumnMappings object
     • The LazyHBaseRow class stores the data of an HBase row
     • Timestamps of processed HBase cells can be read from the rows loaded by the scanner in the LazyHBaseCellMap class
     • The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
  22. #BigDataSpain 2017 5. Spark + Kafka: own offset manager
     Problem description:
     • Spark output operations are at-least-once
     • For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
     • Options:
       1. Checkpoints (minimal sketch after this slide)
          + easy to enable by Spark checkpointing
          - the output operation must be idempotent
          - cannot recover from a checkpoint if the application code has changed
       2. Own data store
          + works regardless of changes to your application code
          + you can use data stores that support transactions
          + exactly-once semantics
     Diagram: single Spark batch = process and save data, then save offsets (image source: Spark Streaming documentation, https://spark.apache.org/docs/latest/streaming-programming-guide.html)
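     The deck goes on to implement option 2; for comparison, a minimal sketch of option 1, Spark's built-in checkpointing, where the checkpoint directory, app name, and batch interval are illustrative and the Kafka stream itself is elided:

       import org.apache.spark.SparkConf
       import org.apache.spark.streaming.{Seconds, StreamingContext}

       val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // illustrative path

       def createContext(): StreamingContext = {
         val conf = new SparkConf().setAppName("kafka-streaming-app")
         val ssc = new StreamingContext(conf, Seconds(30))
         ssc.checkpoint(checkpointDir)
         // build the Kafka stream and register idempotent output operations here
         ssc
       }

       // On a clean start this calls createContext(); after a failure it rebuilds the context
       // (including Kafka offsets) from the checkpoint, as long as the application code is unchanged.
       val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
       ssc.start()
       ssc.awaitTermination()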
  23. #BigDataSpain 2017 Some code with Spark Streaming
       val ssc: StreamingContext = new StreamingContext(…)
       val stream: DStream[ConsumerRecord[String, String]] = ...
       stream.foreachRDD(rdd => {
         val toSave: Seq[String] = rdd.collect().map(_.value())
         saveData(toSave)
         offsetsStore.saveOffsets(rdd, ...)
       })
     Slide annotations: single Spark batch = process and save data, then save offsets
  24. #BigDataSpain 2017 Some code with Spark Streaming
       val ssc: StreamingContext = new StreamingContext(...)
       val stream: DStream[ConsumerRecord[String, String]] =
         kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)
       stream.foreachRDD(rdd => {
         val toSave: Seq[String] = rdd.collect().map(_.value())
         saveData(toSave)
         offsetsStore.saveOffsets(rdd, zkPath)
       })

       def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext,
                       offsetsStore: MyOffsetsStore, kafkaParams: Map[String, Object])
           : DStream[ConsumerRecord[String, String]] = {
         offsetsStore.readOffsets(topic, zkPath) match {
           case Some(offsetsMap) =>
             KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
               ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
           case None =>
             KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
               ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
         }
       }
  25. #BigDataSpain 2017 Code of the offset store
       class MyOffsetsStore(zkHosts: String) {

         val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

         def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
           val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
           offsetsRanges.groupBy(_.topic).foreach { case (topic, offsetsRangesPerTopic) =>
             val offsetsRangesStr = offsetsRangesPerTopic
               .map(offRang => s"${offRang.partition}:${offRang.untilOffset}").mkString(",")
             zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
           }
         }

         def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
           val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
           offsetsRangesStrOpt match {
             case Some(offsetsRangesStr) =>
               Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
                 case Array(partitionStr, offsetStr) =>
                   new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
               }.toMap)
             case None => None
           }
         }
       }