Pycon ZA
October 11, 2019

No more struggles with Apache Spark (PySpark) workloads in production by Chetan Khatri

Spark is a good tool for processing large amounts of data, but there are many pitfalls to avoid when building large-scale systems in production. This talk will take you through the fundamental concepts of Apache Spark for Python developers. We'll examine some of the data serialization and interoperability issues, particularly with Python libraries like NumPy and Pandas, that have a big impact on PySpark performance, and address them with Apache Arrow (the PyArrow API), a cross-language development platform for in-memory data. The talk will show the challenges you may face while productionizing Spark for terabytes of data, and their possible solutions.


Transcript

  1. No more struggles with Apache Spark (PySpark) workloads in production

    Chetan Khatri, Solution Architect - Data Science. Accionlabs India. PyconZA 2019, The Wanderers Club in Illovo. Johannesburg, South Africa 11th Oct, 2019 Twitter: @khatri_chetan, Email: [email protected] [email protected] LinkedIn: https://www.linkedin.com/in/chetkhatri Github: chetkhatri
  2. Who am I? Solution Architect - Data Science @ Accion

    labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang. Co-Authored University Curriculum @ University of Kachchh, India. Ex - Data Engineering @: Nazara Games, Eccella Corporation. Masters - Computer Science from University of Kachchh, India. Daily Activity? Functional Programming, Distributed Computing, Python, Scala, Haskell, Data Science, Product Development
  3. Helping organizations create innovative products and solutions using emerging

     technologies. An innovation-focused technology services firm: 2300+ employees, 75+ clients, 20+ accelerators, 12+ global offices, 7 development centers.
  4. Accion Labs - Introduction • A Global Technology Services firm

    focused on Emerging Technologies ◦ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients • Profitable, venture-backed company ◦ 3 rounds of funding, 8 acquisitions to bolster emerging-tech capability and leadership • Flexible Outcome-based Engagement Models ◦ Projects, Extended teams, Shared IP, Co-development, Professional Services • Framework-Based Approach to Accelerate Digital Transformation ◦ A collection of tools and frameworks, Breeze Digital Blueprint, helps gain 25-30% efficiency • Action-oriented Leadership Team ◦ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
  5. Accion’s Emerging Tech Capabilities Adaptive UI, UX Engineering NLP, Voice

    Interface & Chat Bots Artificial Intelligence and Machine Learning Data Lake & Big Data Analytics Blockchain, Payment Technologies Cloud Strategy and Transformation Mobile Development MicroServices and Serverless Computing QA Engineering, RPA and DevOps Automation SFDC, ServiceNow, IBM Solutions, Azure
  6. Agenda • Apache Spark • Primary data structures (RDD, DataSet,

    Dataframe) • Pragmatic explanation - executors, cores, containers, stage, job and task in Spark • Parallel read from JDBC: challenges and best practices • Bulk Load API vs JDBC write • An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin • Avoid unnecessary shuffle • Optimize Spark stage generation plan • Predicate pushdown with partitioning and bucketing • Airflow DAG scheduling for Apache Spark workflows - Design, Architecture, Demo.
  7. What is Apache Spark? • Apache Spark is a fast

    and general-purpose cluster computing system / unified engine for massive data processing. • It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs. Built-in libraries: Structured Data / SQL - Spark SQL; Graph Processing - GraphX; Machine Learning - MLlib; Streaming - Spark Streaming, Structured Streaming
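
    For orientation, a minimal PySpark entry point looks roughly like this (a sketch, not from the slides; the app name and local master are illustrative):

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("pyconza-2019-demo")   # illustrative name
               .master("local[*]")             # use "yarn" on a real cluster
               .getOrCreate())

      df = spark.range(1000000)                # a trivial DataFrame of ids
      print(df.count())
      spark.stop()
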
  8. 1. Distributed Data Abstraction

     An RDD is a logical model over data distributed across cluster storage (HDFS, S3). [Diagram: several RDD partitions spread over the nodes of the cluster.]
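
    A hedged sketch of what this looks like from PySpark (the path is a placeholder; assumes the spark session from above):

      # Nothing is read yet; Spark only records where the data lives and how to split it.
      rdd = spark.sparkContext.textFile("hdfs:///data/events/2019/10/*.log")
      print(rdd.getNumPartitions())   # the partitions stay distributed across the cluster
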
  9. 2. Resilient & Immutable

     RDD -> T -> RDD -> T -> RDD (T = Transformation). Each transformation produces a new, immutable RDD, and the recorded lineage allows lost partitions to be recomputed - which is what makes RDDs resilient.
  10. 3. Compile-time Type Safe / Strong Type Inference

     Integer RDD, String or Text RDD, Double or Binary RDD
  11. 4. Lazy evaluation

     RDD - T - RDD - T - RDD - T - RDD - A (T = Transformation, A = Action). Transformations are only recorded; nothing executes until an action is called.
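
    A quick illustration of laziness in PySpark (a sketch; the data and lambdas are made up):

      rdd = spark.sparkContext.parallelize(range(1000000))
      doubled = rdd.map(lambda x: x * 2)             # transformation: recorded, not run
      evens = doubled.filter(lambda x: x % 4 == 0)   # transformation: still nothing runs
      print(evens.count())                           # action: this triggers the actual job
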
  12. Essential Spark Operations

     TRANSFORMATIONS (grouped on the slide as General, Math/Statistical, Set Theory/Relational, Data Structure/I/O): map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy, sample, randomSplit, union, intersection, subtract, distinct, cartesian, zip, keyBy, zipWithIndex, zipWithUniqueId, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe. ACTIONS: reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce, foreachPartition, collectAsMap, count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct, takeOrdered, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile.
  13. When to use RDDs?

     You want low-level control of the dataset and know what the data looks like, and you care about the low-level API. You would rather write lots of lambda functions than use a DSL. You don't care about the schema or structure of the data. You don't care about optimization, performance and inefficiencies - RDDs are very slow for non-JVM languages like Python and R - and you don't mind inadvertent inefficiencies.
  14. Structured APIs in Apache Spark

     Syntax errors caught: SQL - at runtime; DataFrames - at compile time; Datasets - at compile time. Analysis errors caught: SQL - at runtime; DataFrames - at runtime; Datasets - at compile time. Analysis errors are caught before a job runs on the cluster.
  15. DataFrame API Code

     # convert RDD -> DF with column names
     parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
     # filter, groupBy, sum and agg()
     from pyspark.sql.functions import sum as sum_
     (parsedDF.filter(parsedDF["project"] == "finance")
              .groupBy("sprint")
              .agg(sum_("numStories").alias("count"))
              .limit(100)
              .show(100))

     Sample data: (finance, 3, 20), (finance, 4, 22)
  16. DataFrame -> SQL View -> SQL Query

     parsedDF.createOrReplaceTempView("audits")
     results = spark.sql(
         """SELECT sprint, sum(numStories) AS count
            FROM audits
            WHERE project = 'finance'
            GROUP BY sprint
            LIMIT 100""")
     results.show(100)

     Sample data: (finance, 3, 20), (finance, 4, 22)
  17. Catalyst in Spark SQL

     SQL AST / DataFrame / Dataset -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
  18. Example: DataFrame Optimization

     employees.join(events, employees("id") === events("eid"))
              .filter(events("date") > "2015-01-01")

     Logical plan: scan (employees) and scan (events) are joined, then filtered. Physical plan with predicate pushdown and column pruning: the filter is pushed into an optimized scan of events, which is then joined with an optimized scan of employees.
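
    The same example in PySpark, with explain() used to check the pushdown (a sketch; the employees/events DataFrames and their columns are assumed from the slide):

      joined = (employees
                .join(events, employees["id"] == events["eid"])
                .filter(events["date"] > "2015-01-01"))
      joined.explain(True)   # look for PushedFilters / pruned columns in the physical plan
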
  19. Spark Internals terminology

     Job - each action on an RDD/DataFrame creates a separate job. Stage - a set of tasks within a job that can run in parallel (scheduled onto executor threads). Task - the lowest-level unit of concurrent and parallel execution; each stage is split into one task per partition, i.e. number of tasks in a stage = number of partitions in that stage.
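
    A tiny way to see the partition-to-task mapping from PySpark (a sketch; the numbers are illustrative):

      df = spark.range(0, 10000000, numPartitions=8)
      print(df.rdd.getNumPartitions())   # 8
      df.count()                         # the scan stage of this job runs 8 tasks (see the Spark UI)
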
  20. Spark on YARN Internals terminology

     yarn.scheduler.minimum-allocation-vcores = 1
     yarn.scheduler.maximum-allocation-vcores = 6
     yarn.scheduler.minimum-allocation-mb = 4096
     yarn.scheduler.maximum-allocation-mb = 28832
     yarn.nodemanager.resource.memory-mb = 54000
     Max number of containers you can run = floor(yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb) = floor(54000 / 4096) = 13
  21. Spark JDBC Read What happens when you run this code?

    What would be the impact on the database engine side?
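
    For context, a typical naive JDBC read versus a parallel, partitioned read looks roughly like this (a sketch; the URL, table, column names and bounds are placeholders, not from the talk):

      # Naive read: one task pulls the whole table through a single JDBC connection,
      # and the database sees one long-running full-table query.
      df_single = (spark.read.format("jdbc")
                   .option("url", "jdbc:postgresql://dbhost:5432/sales")   # placeholder
                   .option("dbtable", "public.orders")                     # placeholder
                   .option("user", "spark").option("password", "***")
                   .load())

      # Parallel read: Spark opens numPartitions connections, each scanning a slice
      # of partitionColumn between lowerBound and upperBound - which also means
      # numPartitions concurrent queries hitting the database.
      df_parallel = (spark.read.format("jdbc")
                     .option("url", "jdbc:postgresql://dbhost:5432/sales")
                     .option("dbtable", "public.orders")
                     .option("user", "spark").option("password", "***")
                     .option("partitionColumn", "order_id")   # numeric/date column, placeholder
                     .option("lowerBound", "1")
                     .option("upperBound", "10000000")
                     .option("numPartitions", "8")
                     .load())
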
  22. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin

     The JoinSelection execution planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) to control the maximum size of a dataset that will be broadcast to all worker nodes when performing a join.

     # check the broadcast join threshold
     >>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) // 1024 // 1024
     10

     # numbered logical plan tree (Scala API)
     sampleDF.queryExecution.logical.numberedTreeString
     # query plan
     sampleDF.explain
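
    In PySpark a broadcast can also be requested explicitly with a hint, independent of the threshold (a sketch; large_df/small_df and the join key are placeholders):

      from pyspark.sql.functions import broadcast

      # The small side is shipped to every executor, so the large side is not shuffled.
      joined = large_df.join(broadcast(small_df), on="customer_id", how="left")
      joined.explain()   # expect BroadcastHashJoin in the physical plan
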
  23. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin

     Repartition: boost parallelism by increasing the number of partitions; partition on the join key so that rows with the same key are joined faster. coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster:

     employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))

     For example, in this bulk JDBC write the DataFrame has 10 partitions and the parameter "bulkCopyBatchSize" -> "2500" means each partition writes batches of 2,500 records in parallel. This reduces the impact on network communication, file I/O and bandwidth. A comparable pattern with the plain PySpark JDBC writer is sketched below.
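
    A sketch using the built-in JDBC writer, assuming the same employeeDF; the connection details are placeholders and the batchsize option plays a role similar to bulkCopyBatchSize:

      # 10 partitions -> 10 parallel JDBC connections; each writes batches of 2,500 rows.
      (employeeDF.coalesce(10)
                 .write.format("jdbc")
                 .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=reporting")  # placeholder
                 .option("dbtable", "dbo.EMPLOYEE_CLIENT")                              # placeholder
                 .option("user", "spark").option("password", "***")
                 .option("batchsize", 2500)
                 .mode("append")
                 .save())
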
  24. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin

     1. # disable auto broadcast joins
        spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
     2. Order doesn't matter: table1.leftjoin(table2) or table2.leftjoin(table1)
     3. Force a broadcast only if one DataFrame is small!
     4. Minimize shuffling and boost parallelism: partitioning, bucketing, coalesce, repartition, HashPartitioner
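
    Putting items 1 and 4 together in PySpark (a sketch; df_a/df_b and the join key are placeholders):

      # With broadcasting disabled, an equi-join of two large inputs falls back to SortMergeJoin.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
      df_a.join(df_b, on="id").explain()   # expect SortMergeJoin with Exchange + Sort steps

      # Restore the 10 MB default afterwards.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
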
  25. Spark Submit Hyper-parameters and Dynamic Allocation

     ./bin/spark-submit \
       --name PyConLT19 \
       --master yarn \
       --deploy-mode cluster \
       --driver-memory 18g \
       --executor-memory 24g \
       --num-executors 4 \
       --executor-cores 6 \
       --conf spark.yarn.maxAppAttempts=1 \
       --conf spark.speculation=false \
       --conf spark.broadcast.compress=true \
       --conf spark.sql.broadcastTimeout=36000 \
       --conf spark.network.timeout=2500s \
       --conf spark.executor.heartbeatInterval=30s \
       --conf spark.shuffle.service.enabled=true \
       --conf spark.dynamicAllocation.enabled=true \
       --conf spark.dynamicAllocation.executorAllocationRatio=1 \
       --conf spark.dynamicAllocation.executorIdleTimeout=60s \
       --conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
       --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
       --conf spark.dynamicAllocation.minExecutors=2 \
       --conf spark.dynamicAllocation.initialExecutors=2 \
       --conf spark.dynamicAllocation.maxExecutors=6 \
       examples/src/main/python/pi.py
  26. Case Study: High Level Architecture

     Components: OLTP shadow data source, Sqoop, HDFS (Parquet), Apache Spark / Spark SQL on a YARN cluster manager, customer-specific reporting DB (bulk load, parallelism). Orchestration: Airflow.
  27. Key role of Apache Airflow for Scheduling Data Pipelines Codebase:

    https://github.com/chetkhatri/getting-started-airflow-for-spark
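
    A minimal shape of such a DAG (a sketch only; the connection id, application path and schedule are illustrative and not taken from the linked repo):

      # Airflow 1.10-era imports, matching the 2019 timeframe of the talk.
      from datetime import datetime
      from airflow import DAG
      from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

      dag = DAG(
          dag_id="nextgen_data_platforms",   # the dag id triggered on the next slide
          start_date=datetime(2019, 10, 1),
          schedule_interval=None,            # triggered externally via the REST API
      )

      bulk_load = SparkSubmitOperator(
          task_id="spark_bulk_load",
          application="/opt/jobs/bulk_load.py",   # placeholder path to the PySpark job
          conn_id="spark_default",
          executor_memory="24g",
          num_executors=4,
          dag=dag,
      )
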
  28. Trigger the Airflow DAG from API

     curl -d '{"conf": "{\"retail_id\": \"29\", \"env_type\": \"dev\", \"size_is\": \"medium\"}", "run_id": "retailer_1111"}' \
       -H "Content-Type: application/json" \
       -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs

     Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin
     SparkSubmitOperator (inherits from BaseOperator): https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
  29. References

     [1] How to Setup Airflow Multi-Node Cluster with Celery & RabbitMQ. https://medium.com/@khatri_chetan/how-to-setup-airflow-multi-node-cluster-with-celery-rabbitmq-cfde7756bb6a
     [2] Setup and Configure Multi Node Airflow Cluster with HDP Ambari and Celery for Data Pipelines. https://medium.com/@khatri_chetan/setup-and-configure-multi-node-airflow-cluster-with-hdp-ambari-and-celery-for-data-pipelines-dc1e96f3d773
     [3] Challenges and Struggle while Setting up Multi-Node Airflow Cluster. https://medium.com/@khatri_chetan/challenges-and-struggle-while-setting-up-multi-node-airflow-cluster-7f19e998ebb
     [4] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks. https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/