
No more struggles with Apache Spark (PySpark) workloads in production by Chetan Khatri

Pycon ZA
October 11, 2019


Spark is a good tool for processing large amounts of data, but there are many pitfalls to avoid when building large-scale systems in production. This talk will take you through the fundamental concepts of Apache Spark for Python developers. We'll examine some of the data serialization and interoperability issues, specifically with Python libraries like NumPy and Pandas, that heavily impact PySpark performance. We will address these issues with Apache Arrow (the PyArrow API), a cross-language development platform for in-memory data. This talk will show the challenges you may face while productionizing Spark for TBs of data, and their possible solutions.
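As a hedged illustration of the Arrow-based approach mentioned above (a minimal sketch assuming Spark 2.3/2.4 with PyArrow installed; later Spark versions rename the flag to spark.sql.execution.arrow.pyspark.enabled, and the DataFrame below is illustrative):

    # Enable Arrow-based columnar data transfer between the JVM and Python.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # With Arrow on, toPandas() ships columnar batches instead of pickling rows
    # one by one, which is where most of the Spark <-> Pandas overhead lives.
    pdf = spark.range(0, 1_000_000).toPandas()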


Transcript

  1. No more struggles with Apache Spark (PySpark) workloads in production

    Chetan Khatri, Solution Architect - Data Science. Accionlabs India. PyconZA 2019, The Wanderers Club in Illovo. Johannesburg, South Africa 11th Oct, 2019 Twitter: @khatri_chetan, Email: chetan.khatri@live.com chetan.khatri@accionlabs.com LinkedIn: https://www.linkedin.com/in/chetkhatri Github: chetkhatri
  2. Who am I? Solution Architect - Data Science @ Accion

    labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang. Co-Authored University Curriculum @ University of Kachchh, India. Ex - Data Engineering @: Nazara Games, Eccella Corporation. Masters - Computer Science from University of Kachchh, India. Daily Activity? Functional Programming, Distributed Computing, Python, Scala, Haskell, Data Science, Product Development
  3. Helping organizations create innovative products and solutions using emerging

    technologies. An Innovation Focused Technology Services Firm. Employees: 2300+ | Clients: 75+ | Accelerators: 20+ | Global Offices: 12+ | Development Centers: 7
  4. Accion Labs - Introduction • A Global Technology Services firm

    focused on Emerging Technologies ◦ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients • Profitable, venture-backed company ◦ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership • Flexible Outcome-based Engagement Models ◦ Projects, Extended teams, Shared IP, Co-development, Professional Services • Framework Based Approach to Accelerate Digital Transformation ◦ A collection of tools and frameworks, Breeze Digital Blueprint, helps gain 25-30% efficiency • Action-oriented Leadership Team ◦ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018 4
  5. Accion’s Emerging Tech Capabilities Adaptive UI, UX Engineering NLP, Voice

    Interface & Chat Bots Artificial Intelligence and Machine Learning Data Lake & Big Data Analytics Blockchain, Payment Technologies Cloud Strategy and Transformation Mobile Development MicroServices and Serverless Computing QA Engineering, RPA and DevOps Automation SFDC, ServiceNow, IBM Solutions, Azure 5
  6. Agenda • Apache Spark • Primary data structures (RDD, DataSet,

    DataFrame) • Pragmatic explanation - executors, cores, containers, stages, jobs and tasks in Spark. • Parallel read from JDBC: challenges and best practices. • Bulk Load API vs JDBC write • An optimization strategy for joins: SortMergeJoin vs BroadcastHashJoin • Avoid unnecessary shuffle • Optimize the Spark stage generation plan • Predicate pushdown with partitioning and bucketing • Airflow DAG scheduling for Apache Spark workflows - design, architecture, demo.
  7. What is Apache Spark? • Apache Spark is a fast

    and general-purpose cluster computing system / unified engine for massive data processing. • It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs. Structured Data / SQL - Spark SQL; Graph Processing - GraphX; Machine Learning - MLlib; Streaming - Spark Streaming, Structured Streaming
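    A minimal PySpark sketch of the unified engine described above (the file path and column name are illustrative, not from the deck):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pyconza-demo")
             .getOrCreate())

    df = spark.read.json("events.json")     # structured data via Spark SQL
    df.groupBy("country").count().show()    # DataFrame API on the same engine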
  8. What are RDDs ?

  9. 1. Distributed Data Abstraction RDD RDD RDD RDD Logical Model

    Across Distributed Storage on Cluster HDFS, S3
  10. 2. Resilient & Immutable RDD RDD RDD T T RDD

    -> T -> RDD -> T -> RDD T = Transformation
  11. 3. Compile-time Type Safe / Strongly type inference Integer RDD

    String or Text RDD Double or Binary RDD
  12. 4. Lazy evaluation RDD RDD RDD T T RDD RDD

    RDD T A RDD - T - RDD - T - RDD - T - RDD - A - RDD T = Transformation A = Action
  13. Apache Spark Operations Operations Transformation Action

  14. Essential Spark Operations TRANSFORMATIONS ACTIONS General Math / Statistical Set

    Theory / Relational Data Structure / I/O map filter flatMap mapPartitions mapPartitionsWithIndex groupBy sortBy sample randomSplit union intersection subtract distinct cartesian zip keyBy zipWithIndex zipWithUniqueID zipPartitions coalesce repartition repartitionAndSortWithinPartitions pipe reduce collect aggregate fold first take forEach top treeAggregate treeReduce forEachPartition collectAsMap count takeSample max min sum histogram mean variance stdev sampleVariance countApprox countApproxDistinct takeOrdered saveAsTextFile saveAsSequenceFile saveAsObjectFile saveAsHadoopDataset saveAsHadoopFile saveAsNewAPIHadoopDataset saveAsNewAPIHadoopFile
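    A small sketch of the transformation/action split and lazy evaluation from the preceding slides (the numbers are illustrative, and a SparkSession named spark is assumed):

    rdd = spark.sparkContext.parallelize(range(10))

    doubled = rdd.map(lambda x: x * 2)            # transformation: nothing runs yet
    evens = doubled.filter(lambda x: x % 4 == 0)  # still lazy

    print(evens.collect())   # action: the whole chain executes now
    print(evens.count())     # another action: the chain is recomputed (unless cached)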
  15. When to use RDDs ? You care about control of

    your dataset and know what the data looks like; you care about the low-level API. You don't mind writing lots of lambda functions instead of a DSL. You don't care about the schema or structure of the data. You don't care about optimization, performance & inefficiencies! Very slow for non-JVM languages like Python and R. You don't care about inadvertent inefficiencies.
  16. Inadvertent inefficiencies in RDDs

  17. Structured in Spark DataFrames Datasets

  18. Structured APIs in Apache Spark

                      SQL      DataFrames    Datasets
    Syntax Errors     Runtime  Compile Time  Compile Time
    Analysis Errors   Runtime  Runtime       Compile Time

    Analysis errors are caught before a job runs on the cluster.
  19. DataFrame API Code

    from pyspark.sql import functions as F

    # convert RDD -> DataFrame with column names
    parsedDF = parsedRDD.toDF("project", "sprint", "numStories")

    # filter, groupBy, sum aggregation
    (parsedDF
      .filter(parsedDF["project"] == "finance")
      .groupBy("sprint")
      .agg(F.sum("numStories").alias("count"))
      .limit(100)
      .show(100))

    project  sprint  numStories
    finance  3       20
    finance  4       22
  20. DataFrame -> SQL View -> SQL Query

    parsedDF.createOrReplaceTempView("audits")
    results = spark.sql("""
        SELECT sprint, sum(numStories) AS count
        FROM audits
        WHERE project = 'finance'
        GROUP BY sprint
        LIMIT 100""")
    results.show(100)

    project  sprint  numStories
    finance  3       20
    finance  4       22
  21. Catalyst in Spark SQL AST DataFrame Datasets Unresolved Logical Plan

    Logical Plan Optimized Logical Plan Physical Plans Cost Model Selected Physical Plan RDD
  22. Example: DataFrame Optimization employees.join(events, employees("id") === events("eid")) .filter(events("date") > "2015-01-01")

    events file employees table join filter Logical Plan scan (employees) filter Scan (events) join Physical Plan Optimized scan (events) Optimized scan (employees) join Physical Plan With Predicate Pushdown and Column Pruning
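    A PySpark sketch of the same query, assuming employees and events DataFrames with the columns used above; explain(True) prints the logical, optimized and physical plans, where the pushed-down filter and pruned columns become visible:

    joined = (employees
              .join(events, employees["id"] == events["eid"])
              .filter(events["date"] > "2015-01-01"))

    joined.explain(True)   # prints parsed, analyzed, optimized and physical plans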
  23. DataFrames are Faster than RDDs Source: Databricks

  24. Pragmatic Approach Executors Cores Containers Stage Job Task

  25. Spark Internals terminology

    Job - each action in a Spark application triggers a separate job (transformations are lazy and only extend the execution plan).
    Stage - a set of tasks in each job that can run in parallel across executor threads.
    Task - the lowest-level unit of concurrent and parallel execution. Each stage is split into #number-of-partitions tasks, i.e. total tasks ≈ number of stages * number of partitions per stage.
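    A tiny sketch relating partitions to tasks (the partition count is illustrative):

    rdd = spark.sparkContext.parallelize(range(1_000_000), 8)

    print(rdd.getNumPartitions())   # 8 -> the stage for the next action runs 8 tasks

    rdd.count()   # count() is an action: it triggers one job whose single stage has 8 tasks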
  26. Spark Internals: Jobs

  27. Spark Internals: Stage

  28. Spark Internals: Stage

  29. Spark Internals: Tasks

  30. Spark on Yarn Internals terminology

    yarn.scheduler.minimum-allocation-vcores = 1
    yarn.scheduler.maximum-allocation-vcores = 6
    yarn.scheduler.minimum-allocation-mb = 4096
    yarn.scheduler.maximum-allocation-mb = 28832
    yarn.nodemanager.resource.memory-mb = 54000

    Max containers you can run per node = yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb = 54000 / 4096 ≈ 13
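    A back-of-the-envelope sketch of the container math above (the values are the example settings from the slide, not recommendations):

    nodemanager_memory_mb = 54000   # yarn.nodemanager.resource.memory-mb
    min_allocation_mb = 4096        # yarn.scheduler.minimum-allocation-mb

    max_containers_per_node = nodemanager_memory_mb // min_allocation_mb
    print(max_containers_per_node)  # 13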
  31. Spark on Yarn Internals terminology

  32. Resource Manager (Yarn) Tuning

  33. Resource Manager (Yarn) Tuning

  34. Resource Manager (Yarn) Tuning

  35. Resource Manager (Yarn) Tuning

  36. Spark Scheduler FIFO to FAIR
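    A minimal sketch of switching the in-application scheduler from FIFO to FAIR (the pool name and allocation-file path are illustrative assumptions):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fair-scheduling-demo")
             .config("spark.scheduler.mode", "FAIR")
             .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
             .getOrCreate())

    # Jobs submitted from this thread go into the "reports" pool defined in the XML file.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reports")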

  37. Parallel read from JDBC: Challenges and best practices.

  38. Spark JDBC Read What happens when you run this code?

    What would be the impact on the database engine side?
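    The slide's code isn't in the transcript, but a naive JDBC read along these lines illustrates the problem (connection details are placeholders): with no partitioning options, Spark opens a single connection and pulls the whole table through one task.

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
          .option("dbtable", "dbo.transactions")
          .option("user", "spark_reader")
          .option("password", "...")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())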
  39. Spark JDBC Read: Impact on Database engine e.g MSSQL Server

  40. Spark JDBC Read: Impact on Database engine e.g MSSQL Server

  41. Spark Parallel JDBC Read

  42. Spark Parallel JDBC Read

  43. Spark Parallel JDBC Read
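    A sketch of a partitioned JDBC read using the standard JDBC data source options (bounds, column and table names are illustrative): Spark issues numPartitions range queries over partitionColumn, so the read runs in parallel.

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
          .option("dbtable", "dbo.transactions")
          .option("user", "spark_reader")
          .option("password", "...")
          .option("partitionColumn", "transaction_id")   # numeric, roughly uniformly distributed
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")                  # 8 concurrent connections / tasks
          .load())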

  44. Impact on Database after Spark Parallel Read

  45. Bulk Load API vs JDBC write

  46. Bulk Load API vs JDBC write

  47. Bulk Load API vs JDBC write
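    A hedged sketch of the two write paths (connection details are placeholders; the bulk path assumes the Microsoft Spark connector for SQL Server is on the classpath, whereas the deck's own examples use the older azure-sqldb-spark bulkCopyToSqlDB API):

    # Plain JDBC write: batched INSERTs, one connection per partition.
    (employeeDF.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
     .option("dbtable", "dbo.EMPLOYEE_CLIENT")
     .option("user", "spark_writer")
     .option("password", "...")
     .option("batchsize", "2500")
     .mode("append")
     .save())

    # Bulk Load API: goes through SQL Server's bulk-copy path, typically much faster.
    (employeeDF.write
     .format("com.microsoft.sqlserver.jdbc.spark")
     .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
     .option("dbtable", "dbo.EMPLOYEE_CLIENT")
     .option("user", "spark_writer")
     .option("password", "...")
     .mode("append")
     .save())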

  48. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin JoinSelection execution

    planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) to control the maximum size of a dataset that will be broadcast to all worker nodes when performing a join.

    # check broadcast join threshold (in MB)
    >>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) / 1024 / 1024
    10.0

    # numbered logical plan tree (Scala API: sampleDF.queryExecution.logical.numberedTreeString)
    sampleDF._jdf.queryExecution().logical().numberedTreeString()

    # Query plan
    sampleDF.explain()
  49. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin Repartition: Boost

    the parallelism by increasing the number of partitions; partition on the join key to make same-key joins faster.

    // coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster
    employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))

    For example, in the bulk JDBC write above, with the parameter "bulkCopyBatchSize" -> "2500" the DataFrame has 10 partitions and each partition writes 2500 records per batch in parallel.

    Reduce the impact on network communication, file I/O, network I/O, bandwidth, etc.
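    A small sketch of the coalesce vs repartition distinction noted above (sizes are illustrative):

    df = spark.range(0, 1_000_000, numPartitions=200)

    narrowed = df.coalesce(10)       # merges partitions, avoids a full shuffle
    rebalanced = df.repartition(10)  # full shuffle, evenly redistributes the data

    print(narrowed.rdd.getNumPartitions())    # 10
    print(rebalanced.rdd.getNumPartitions())  # 10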
  50. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin 1. //

    disable autoBroadcastJoin: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) 2. Order doesn't matter: table1.join(table2, on, "left") or table2.join(table1, on, "left") 3. Force a broadcast with a hint, even when a DataFrame is above the auto-broadcast threshold, provided it genuinely fits in memory (see the sketch below). 4. Minimize shuffling & boost parallelism: partitioning, bucketing, coalesce, repartition, HashPartitioner
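    A minimal sketch of forcing a broadcast join with a hint (DataFrame names are illustrative):

    from pyspark.sql.functions import broadcast

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   # turn off size-based auto-broadcast

    # Broadcast the small dimension table explicitly; explain() should now show
    # BroadcastHashJoin instead of SortMergeJoin.
    joined = fact_df.join(broadcast(dim_df), on="id", how="left")
    joined.explain()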
  51. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin

  52. None
  53. None
  54. None
  55. None
  56. None
  57. Spark Submit Hyper-parameters and Dynamic Allocation

    ./bin/spark-submit \
      --name PyConLT19 \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 18g \
      --executor-memory 24g \
      --num-executors 4 \
      --executor-cores 6 \
      --conf spark.yarn.maxAppAttempts=1 \
      --conf spark.speculation=false \
      --conf spark.broadcast.compress=true \
      --conf spark.sql.broadcastTimeout=36000 \
      --conf spark.network.timeout=2500s \
      --conf spark.executor.heartbeatInterval=30s \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.executorAllocationRatio=1 \
      --conf spark.dynamicAllocation.executorIdleTimeout=60s \
      --conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
      --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.initialExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=6 \
      examples/src/main/python/pi.py
  58. Case Study: High Level Architecture OLTP Shadow Data Source Apache

    Spark Spark SQL Sqoop HDFS Parquet Yarn Cluster manager Customer Specific Reporting DB Bulk Load Parallelism Orchestration: Airflow
  59. Spark Streaming Code! Ref. https://github.com/chetkhatri/getting-started-airflow-for-spark/blob/master/spark_streaming_kafka.py

  60. Key role of Apache Airflow for Scheduling Data Pipelines Codebase:

    https://github.com/chetkhatri/getting-started-airflow-for-spark
  61. Trigger the Airflow DAG from the API

    curl -d '{"conf":"{\"retail_id\":\"29\", \"env_type\":\"dev\", \"size_is\":\"medium\"}", "run_id": "retailer_1111"}' \
      -H "Content-Type: application/json" \
      -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs

    Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin

    SparkSubmitOperator (inherits from BaseOperator):
    https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
  62. Airflow - config.txt

  63. Airflow spark_config.txt

  64. Airflow - spark_hyperparameters.json

  65. Airflow - nextgen_data_platform DAG

  66. Airflow - nextgen_data_platform DAG

  67. Airflow - nextgen_data_platform DAG

  68. Airflow - nextgen_data_platform DAG

  69. Airflow - nextgen_data_platform DAG

  70. Airflow - nextgen_data_platform DAG

  71. Airflow - nextgen_data_platform DAG

  72. Airflow - nextgen_data_platform DAG

  73. Airflow - nextgen_data_master_tables_subdag

  74. Airflow - nextgen_data_master_tables_subdag

  75. Airflow - common_util

  76. Airflow - common_util

  77. Airflow - common_util

  78. Airflow - common_util

  79. Airflow - common_util

  80. Airflow - common_util

  81. References

    [1] How to Setup Airflow Multi-Node Cluster with Celery & RabbitMQ.
        https://medium.com/@khatri_chetan/how-to-setup-airflow-multi-node-cluster-with-celery-rabbitmq-cfde7756bb6a
    [2] Setup and Configure Multi Node Airflow Cluster with HDP Ambari and Celery for Data Pipelines.
        https://medium.com/@khatri_chetan/setup-and-configure-multi-node-airflow-cluster-with-hdp-ambari-and-celery-for-data-pipelines-dc1e96f3d773
    [3] Challenges and Struggle while Setting up Multi-Node Airflow Cluster.
        https://medium.com/@khatri_chetan/challenges-and-struggle-while-setting-up-multi-node-airflow-cluster-7f19e998ebb
    [4] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks.
        https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
  82. Questions ?

  83. Thank you! PyCon ZA Organizers and South Africa Python Community.