
No more struggles with Apache Spark (PySpark) workloads in production by Chetan Khatri

Pycon ZA
October 11, 2019

Spark is a good tool for processing large amounts of data, but there are many pitfalls to avoid when building large-scale systems in production. This talk takes you through the fundamental concepts of Apache Spark for Python developers. We'll examine data serialization and interoperability issues, specifically with Python libraries like NumPy and Pandas, which heavily impact PySpark performance, and address them with Apache Arrow (the PyArrow API), a cross-language development platform for in-memory data. The talk shows the challenges you may face while productionizing Spark for terabytes of data, and their possible solutions.

Transcript

  1. No more struggles with Apache Spark (PySpark) workloads in production

    Chetan Khatri, Solution Architect - Data Science. Accionlabs India. PyconZA 2019, The Wanderers Club in Illovo. Johannesburg, South Africa 11th Oct, 2019 Twitter: @khatri_chetan, Email: [email protected] [email protected] LinkedIn: https://www.linkedin.com/in/chetkhatri Github: chetkhatri
  2. Who am I? Solution Architect - Data Science @ Accion

    labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang. Co-Authored University Curriculum @ University of Kachchh, India. Ex - Data Engineering @: Nazara Games, Eccella Corporation. Masters - Computer Science from University of Kachchh, India. Daily Activity? Functional Programming, Distributed Computing, Python, Scala, Haskell, Data Science, Product Development
  3. Helping organizations create innovative products and solutions using emerging

    technologies. An Innovation-Focused Technology Services Firm: 2300+ employees, 75+ active clients, 20+ accelerators, 12+ global offices, 7 development centers.
  4. Accion Labs - Introduction • A Global Technology Services firm

    focused on Emerging Technologies ◦ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients • Profitable, venture-backed company ◦ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership • Flexible Outcome-based Engagement Models ◦ Projects, Extended teams, Shared IP, Co-development, Professional Services • Framework Based Approach to Accelerate Digital Transformation ◦ A collection of tools and frameworks, Breeze Digital Blueprint, helps gain 25-30% efficiency • Action-oriented Leadership Team ◦ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
  5. Accion’s Emerging Tech Capabilities Adaptive UI, UX Engineering NLP, Voice

    Interface & Chat Bots Artificial Intelligence and Machine Learning Data Lake & Big Data Analytics Blockchain, Payment Technologies Cloud Strategy and Transformation Mobile Development MicroServices and Serverless Computing QA Engineering, RPA and DevOps Automation SFDC, ServiceNow, IBM Solutions, Azure
  6. Agenda • Apache Spark • Primary data structures (RDD, Dataset,

    DataFrame) • Pragmatic explanation of executors, cores, containers, stages, jobs, and tasks in Spark • Parallel read from JDBC: challenges and best practices • Bulk Load API vs JDBC write • An optimization strategy for joins: SortMergeJoin vs BroadcastHashJoin • Avoid unnecessary shuffle • Optimize the Spark stage generation plan • Predicate pushdown with partitioning and bucketing • Airflow DAG scheduling for Apache Spark workflows: design, architecture, demo.
  7. What is Apache Spark? • Apache Spark is a fast

    and general-purpose cluster computing system / unified engine for large-scale data processing. • It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs. Built-in libraries: structured data / SQL - Spark SQL; graph processing - GraphX; machine learning - MLlib; streaming - Spark Streaming, Structured Streaming.
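    For illustration, a minimal PySpark sketch (not from the original slides) of the unified entry point and the DataFrame API; the app name and sample data are illustrative:

      from pyspark.sql import SparkSession

      # single entry point for SQL, DataFrames, streaming and MLlib
      spark = SparkSession.builder.appName("pyconza-demo").getOrCreate()

      df = spark.createDataFrame(
          [("finance", 3, 20), ("finance", 4, 22)],
          ["project", "sprint", "numStories"],
      )
      df.groupBy("project").sum("numStories").show()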
  8. 1. Distributed Data Abstraction

    RDDs form a logical model across distributed storage on the cluster (HDFS, S3).
  9. 2. Resilient & Immutable

    RDD -> T -> RDD -> T -> RDD (T = Transformation); each transformation produces a new, immutable RDD.
  10. 3. Compile-time type safety / strong type inference

    Integer RDD, String or Text RDD, Double or Binary RDD.
  11. 4. Lazy evaluation

    RDD - T - RDD - T - RDD - T - RDD - A (T = Transformation, A = Action)
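    A small PySpark illustration of lazy evaluation (hypothetical example, not from the slides): the transformations only build up the lineage; nothing runs on the cluster until the action is called.

      rdd = spark.sparkContext.parallelize(range(10))
      doubled = rdd.map(lambda x: x * 2)              # transformation: nothing executed yet
      evens = doubled.filter(lambda x: x % 4 == 0)    # still nothing executed
      print(evens.count())                            # action: triggers the whole lineage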
  12. Essential Spark Operations (grouped on the slide as General, Math/Statistical, Set Theory/Relational, and Data Structure/I-O)

    TRANSFORMATIONS: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy, sample, randomSplit, union, intersection, subtract, distinct, cartesian, zip, keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe. ACTIONS: reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce, foreachPartition, collectAsMap, count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct, takeOrdered, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile.
  13. When to use RDDs? You care about control of

    your dataset and know what the data looks like; you care about the low-level API. You prefer lots of lambda functions over a DSL. You don't care about the schema or structure of the data. You don't care about optimization, performance and inefficiencies. Caveat: RDDs are very slow for non-JVM languages like Python and R, and make inadvertent inefficiencies easy.
  14. Structured APIs in Apache Spark

    Syntax errors:   SQL - runtime; DataFrames - compile time; Datasets - compile time.
    Analysis errors: SQL - runtime; DataFrames - runtime; Datasets - compile time.
    Analysis errors are caught before a job runs on the cluster.
  15. DataFrame API Code

      from pyspark.sql.functions import sum as sum_

      # convert RDD -> DataFrame with column names
      parsedDF = parsedRDD.toDF(["project", "sprint", "numStories"])

      # filter, groupBy, then agg(sum)
      (parsedDF.filter(parsedDF["project"] == "finance")
               .groupBy("sprint")
               .agg(sum_("numStories").alias("count"))
               .limit(100)
               .show(100))

    Sample data:
      project sprint numStories
      finance 3      20
      finance 4      22
  16. DataFrame -> SQL View -> SQL Query

      parsedDF.createOrReplaceTempView("audits")
      results = spark.sql(
          """SELECT sprint, sum(numStories) AS count
             FROM audits
             WHERE project = 'finance'
             GROUP BY sprint
             LIMIT 100""")
      results.show(100)

    Sample data:
      project sprint numStories
      finance 3      20
      finance 4      22
  17. Catalyst in Spark SQL

    AST / DataFrame / Dataset -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> (Cost Model) -> Selected Physical Plan -> RDDs
  18. Example: DataFrame Optimization (Scala)

      employees.join(events, employees("id") === events("eid"))
               .filter(events("date") > "2015-01-01")

    Logical plan: scan employees table, scan events file, join, then filter.
    Physical plan with predicate pushdown and column pruning: optimized scan of employees, optimized scan of events (the filter is pushed into the scan), then join.
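    A rough PySpark equivalent of the slide's Scala snippet (assuming employees and events are DataFrames with those column names); explain() prints the parsed, analyzed, optimized and physical plans, where the pushed-down filter shows up in the scan of events:

      from pyspark.sql.functions import col

      joined = (employees.join(events, employees["id"] == events["eid"])
                         .filter(col("date") > "2015-01-01"))
      joined.explain(True)   # extended=True: show all plan stages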
  19. Spark Internals terminology

    Job - each action in Spark triggers a separate job. Stage - a set of tasks within a job that can run in parallel; stages are separated by shuffle boundaries. Task - the lowest-level unit of concurrent and parallel execution. Each stage is split into one task per partition, i.e. number of tasks in a stage = number of partitions in that stage.
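    For example (a hypothetical snippet), an action on an 8-partition RDD triggers one job whose stage runs 8 tasks, one per partition:

      rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
      print(rdd.getNumPartitions())            # 8
      rdd.map(lambda x: x * 2).count()         # action -> one job; its stage runs 8 tasks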
  20. Spark on YARN internals terminology

    yarn.scheduler.minimum-allocation-vcores = 1
    yarn.scheduler.maximum-allocation-vcores = 6
    yarn.scheduler.minimum-allocation-mb = 4096
    yarn.scheduler.maximum-allocation-mb = 28832
    yarn.nodemanager.resource.memory-mb = 54000
    Max number of containers you can run per node = yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb = 54000 / 4096 = 13
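    The same container math as plain Python (a sketch; in practice YARN also rounds each container's request, i.e. executor memory plus memory overhead, up to a multiple of yarn.scheduler.minimum-allocation-mb):

      node_memory_mb = 54000     # yarn.nodemanager.resource.memory-mb
      min_container_mb = 4096    # yarn.scheduler.minimum-allocation-mb
      max_containers_per_node = node_memory_mb // min_container_mb
      print(max_containers_per_node)   # 13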
  21. Spark JDBC Read What happens when you run this code?

    What would be the impact on the database engine side?
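    The slide's code isn't captured in the transcript; below is a minimal sketch of a parallel JDBC read, with illustrative connection details, assuming a numeric, roughly uniformly distributed order_id column. Without partitionColumn/numPartitions, Spark reads the whole table through a single task and a single connection, which also pushes one full-table scan onto the database engine:

      df = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://dbhost:5432/sales")   # illustrative URL
            .option("dbtable", "public.orders")
            .option("user", "etl_user")
            .option("password", "****")
            .option("partitionColumn", "order_id")   # must be numeric/date/timestamp
            .option("lowerBound", "1")
            .option("upperBound", "10000000")
            .option("numPartitions", "8")            # 8 parallel connections/tasks
            .load())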
  22. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin JoinSelection execution

    planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) to control the size of a dataset before broadcasting it to all worker nodes when performing a join.

      # check broadcast join threshold (PySpark shell)
      >>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) / 1024 / 1024
      10
      # query plan
      >>> sampleDF.explain()

      // Scala: logical plan with tree numbering
      sampleDF.queryExecution.logical.numberedTreeString
  23. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin Repartition: Boost

    the parallelism by increasing the number of partitions; partition on the join key to make same-key joins faster. Coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster.

      // bulk-copy connector API (Scala)
      employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))

    For example, in a bulk JDBC write with "bulkCopyBatchSize" -> "2500", the DataFrame has 10 partitions and each partition writes batches of 2,500 records in parallel. Reducing partitions cuts the impact on network communication, file I/O, network I/O, bandwidth, etc.
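    For comparison with the bulk load API, a plain JDBC write sketch (connection details and table names are illustrative): the number of partitions controls how many connections write in parallel, and batchsize controls records per round trip, analogous to bulkCopyBatchSize above.

      (employeeDF.coalesce(10)                      # 10 partitions -> 10 parallel writers
          .write.format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=reporting")  # illustrative
          .option("dbtable", "dbo.EMPLOYEE_CLIENT")
          .option("user", "etl_user")
          .option("password", "****")
          .option("batchsize", "2500")              # records per JDBC batch
          .mode("append")
          .save())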
  24. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin 1. //

    disable autoBroadcastJoin: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) 2. Order doesn't matter: table1.join(table2, key, "left") or table2.join(table1, key, "left") 3. Force a broadcast only when one side really is small; don't force it if one DataFrame is not small! 4. Minimize shuffling and boost parallelism: partitioning, bucketing, coalesce, repartition, HashPartitioner
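    A PySpark sketch of both knobs (large_df and small_df are illustrative names): disabling the automatic threshold, then forcing a BroadcastHashJoin with an explicit hint on a side you know fits in memory.

      from pyspark.sql.functions import broadcast

      # disable automatic broadcast joins
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

      # force a BroadcastHashJoin by hinting the small side explicitly
      joined = large_df.join(broadcast(small_df), "customer_id")
      joined.explain()   # physical plan should show BroadcastHashJoin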
  25. Spark Submit Hyper-parameters and Dynamic Allocation

      ./bin/spark-submit \
        --name PyConLT19 \
        --master yarn \
        --deploy-mode cluster \
        --driver-memory 18g \
        --executor-memory 24g \
        --num-executors 4 \
        --executor-cores 6 \
        --conf spark.yarn.maxAppAttempts=1 \
        --conf spark.speculation=false \
        --conf spark.broadcast.compress=true \
        --conf spark.sql.broadcastTimeout=36000 \
        --conf spark.network.timeout=2500s \
        --conf spark.executor.heartbeatInterval=30s \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.dynamicAllocation.executorAllocationRatio=1 \
        --conf spark.dynamicAllocation.executorIdleTimeout=60s \
        --conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
        --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
        --conf spark.dynamicAllocation.minExecutors=2 \
        --conf spark.dynamicAllocation.initialExecutors=2 \
        --conf spark.dynamicAllocation.maxExecutors=6 \
        examples/src/main/python/pi.py
  26. Case Study: High-Level Architecture

    OLTP shadow data source -> Sqoop -> HDFS (Parquet) -> Apache Spark / Spark SQL on a YARN cluster manager -> parallel bulk load into a customer-specific reporting DB. Orchestration: Airflow.
  27. Key role of Apache Airflow for Scheduling Data Pipelines Codebase:

    https://github.com/chetkhatri/getting-started-airflow-for-spark
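    A minimal DAG sketch in the spirit of that repo (not copied from it; the application path, connection id, dates and resources are placeholders), using the contrib SparkSubmitOperator referenced on the next slide:

      from datetime import datetime
      from airflow import DAG
      from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

      dag = DAG(
          dag_id="nextgen_data_platforms",
          start_date=datetime(2019, 10, 1),
          schedule_interval=None,               # triggered externally via the REST API
      )

      submit_job = SparkSubmitOperator(
          task_id="spark_etl",
          application="/opt/jobs/etl_job.py",   # placeholder path to the PySpark job
          conn_id="spark_default",
          executor_memory="24g",
          num_executors=4,
          dag=dag,
      )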
  28. Trigger the Airflow DAG from the API

      curl -d '{"conf":"{\"retail_id\":\"29\", \"env_type\":\"dev\", \"size_is\":\"medium\"}", "run_id": "retailer_1111"}' \
        -H "Content-Type: application/json" \
        -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs

    Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin
    SparkSubmitOperator (inherits from BaseOperator): https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
  29. References

    [1] How to Setup Airflow Multi-Node Cluster with Celery & RabbitMQ. https://medium.com/@khatri_chetan/how-to-setup-airflow-multi-node-cluster-with-celery-rabbitmq-cfde7756bb6a
    [2] Setup and Configure Multi Node Airflow Cluster with HDP Ambari and Celery for Data Pipelines. https://medium.com/@khatri_chetan/setup-and-configure-multi-node-airflow-cluster-with-hdp-ambari-and-celery-for-data-pipelines-dc1e96f3d773
    [3] Challenges and Struggle while Setting up Multi-Node Airflow Cluster. https://medium.com/@khatri_chetan/challenges-and-struggle-while-setting-up-multi-node-airflow-cluster-7f19e998ebb
    [4] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks. https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/