Slide 1

PySpark Best Practices
Juliet Hougland, Cloudera
@j_houg
Sept 2015

Slide 2

Slide 3

Spark

• Core written in Scala, runs on the JVM
• Also has Python and Java APIs
• Hadoop Friendly
  • Input from HDFS, HBase, Kafka
  • Management via YARN
• Interactive REPL
• ML library == MLLib

Slide 4

Spark MLLib

• Model building and evaluation
• Fast
• Basics covered:
  • LR, SVM, decision trees
  • PCA, SVD
  • K-means
  • ALS
• Algorithms expect RDDs of consistent types (e.g. LabeledPoints)
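For example, MLlib's supervised learners consume RDDs of LabeledPoint. A minimal sketch, assuming an existing SparkContext named sc; the feature values are made up:

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Toy training set: a label followed by a small feature vector.
points = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.2]),
    LabeledPoint(1.0, [5.1, 0.3]),
])

# Every record must be a LabeledPoint of the same shape.
model = LogisticRegressionWithSGD.train(points, iterations=10)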

Slide 5

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: a file in HDFS split into four partitions. Thanks: Kostas Sakellis]

Slide 6

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: the four HDFS partitions loaded into a four-partition RDD. Thanks: Kostas Sakellis]

Slide 7

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: map(to_series) produces a second four-partition RDD from the first. Thanks: Kostas Sakellis]

Slide 8

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: filter(has_outlier) produces a third four-partition RDD. Thanks: Kostas Sakellis]

Slide 9

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: count() aggregates all four partitions into a single result. Thanks: Kostas Sakellis]
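Put together as a runnable script. to_series and has_outlier are hypothetical helpers (the deck never defines them), and the HDFS path is a placeholder:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rdd-example")

def to_series(line):
    # Hypothetical parser: comma-separated floats -> numpy array.
    return np.array([float(x) for x in line.split(",")])

def has_outlier(series):
    # Hypothetical rule: any value more than 3 standard deviations out.
    return bool(np.any(np.abs(series - series.mean()) > 3 * series.std()))

count = (sc.textFile("hdfs:///placeholder/path", 4)  # 4 partitions
         .map(to_series)
         .filter(has_outlier)
         .count())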

Slide 10

Spark Execution Model

Slide 11

PySpark Execution Model

Slide 12

PySpark Driver Program

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

Function closures need to be executed on worker nodes by a Python process.

Slide 13

How do we ship around Python functions?

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

Slide 14

Pickle!
https://flic.kr/p/c8N4sE

Slide 15

Pickle!

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()
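A quick illustration of why pickling matters, using only the standard library. Plain pickle handles module-level functions (serialized as a name reference) but chokes on lambdas and closures, which is why PySpark ships functions with its bundled cloudpickle instead:

import pickle

def to_series(line):
    return [float(x) for x in line.split(",")]

# A module-level function pickles fine: it is stored as a reference
# to its module and name.
restored = pickle.loads(pickle.dumps(to_series))
assert restored("1,2") == [1.0, 2.0]

# A closure does not: plain pickle cannot serialize the captured state.
def make_threshold_filter(threshold):
    return lambda x: x > threshold

try:
    pickle.dumps(make_threshold_filter(3.0))
except Exception as exc:
    print("plain pickle fails:", exc)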

Slide 16

Best Practices for Writing PySpark

Slide 17

REPLs and Notebooks
https://flic.kr/p/5hnPZp

Slide 18

Share your code
https://flic.kr/p/sw2cnL

Slide 19

Standard Python Project

my_pyspark_proj/
    awesome/
        __init__.py
    bin/
    docs/
    setup.py
    tests/
        __init__.py
        awesome_tests.py
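A minimal setup.py for that layout might look like this (all metadata hypothetical):

from setuptools import find_packages, setup

setup(
    name="my_pyspark_proj",    # hypothetical project metadata
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
)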

Slide 20

What is the shape of a PySpark job?
https://flic.kr/p/4vWP6U

Slide 21

PySpark Structure?

• Parse CLI args & configure Spark app
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

https://flic.kr/p/ZW54
Shout out to my colleagues in the UK

Slide 22

PySpark Structure?

my_pyspark_proj/
    awesome/
        __init__.py
        DataIO.py
        Featurize.py
        Model.py
    bin/
    docs/
    setup.py
    tests/
        __init__.py
        awesome_tests.py
        resources/
            data_source_sample.csv

• Parse CLI args & configure Spark app
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

Slide 23

Simple Main Method
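The deck does not show the method itself; here is a minimal sketch of the shape described on the previous slides, with a hypothetical parse_record featurizer:

import argparse

from pyspark import SparkConf, SparkContext

def parse_record(line):
    # Hypothetical featurizer: CSV line -> list of floats.
    return [float(field) for field in line.split(",")]

def main():
    # Parse CLI args & configure the Spark app.
    parser = argparse.ArgumentParser(description="Example PySpark job")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    sc = SparkContext(conf=SparkConf().setAppName("awesome"))
    try:
        raw = sc.textFile(args.input)         # read in data
        features = raw.map(parse_record)      # raw data into features
        # Fancy maths with Spark would go here (e.g. an MLlib model).
        features.saveAsTextFile(args.output)  # write out data
    finally:
        sc.stop()

if __name__ == "__main__":
    main()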

Slide 24

Write Testable Code

• Write a function for anything inside a transformation
• Make it static
• Separate feature generation or data standardization from your modeling

Featurize.py:

class Featurize(object):

    @staticmethod
    def label(single_record):
        …
        return label_as_a_double

    @staticmethod
    def descriptive_name_of_feature1():
        ...
        return a_double

    @staticmethod
    def create_labeled_point(data_usage_rdd, sms_usage_rdd):
        ...
        return LabeledPoint(label, [feature1])

Slide 25

Write Serializable Code

• Functions and the contexts they need to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossiblish:
  • DB connections => use mapPartitions instead (sketched below)

https://flic.kr/p/za5cy
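A sketch of the mapPartitions workaround for the DB-connection case, given some rdd of records and a hypothetical db_connect factory. The point is that the connection is created on the worker, once per partition, instead of being pickled into a closure:

def save_partition(records):
    conn = db_connect()   # hypothetical: opened on the worker, never pickled
    try:
        for record in records:
            conn.insert(record)
    finally:
        conn.close()
    return iter([])       # mapPartitions must return an iterable

rdd.mapPartitions(save_partition).count()   # count() forces evaluation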

Slide 26

Testing with SparkTestingBase

• Provides a SparkContext and configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
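A sketch of a test case. The import path is my recollection of the Python half of spark-testing-base and should be checked against the project's README; the base class supplies a local SparkContext as self.sc:

from sparktestingbase.testcase import SparkTestingBaseTestCase

class AwesomeTests(SparkTestingBaseTestCase):

    def test_filter_drops_small_values(self):
        rdd = self.sc.parallelize([1.0, 2.0, 1000.0])
        kept = rdd.filter(lambda x: x < 100.0).collect()
        self.assertEqual(kept, [1.0, 2.0])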

Slide 27

Testing Suggestions

• Unit test as much as possible
• Integration test the whole flow
• Test for:
  • Deviations of data from the expected format
  • RDDs with empty partitions (reproduced in the snippet below)
  • Correctness of results

https://flic.kr/p/tucHHL
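The empty-partition case is easy to reproduce: ask parallelize for more slices than there are elements. A minimal sketch, assuming a SparkContext named sc:

# 8 slices for 2 elements guarantees some empty partitions.
rdd = sc.parallelize([1.0, 2.0], 8)
assert rdd.filter(lambda x: x < 100.0).count() == 2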

Slide 28

Best Practices for Running PySpark

Slide 29

Writing distributed code is the easy part… Running it is hard.

Slide 30

Get Serious About Logs

• Get the YARN application id from the web UI or console
• yarn logs -applicationId <app_id>
• Quiet down Py4J (see the snippet below)
• Log records that have trouble getting processed
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
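Py4J logs through Python's standard logging module, so one way to quiet it down, as suggested above, is to raise its logger level:

import logging

# Silence everything below ERROR from the py4j logger hierarchy.
logging.getLogger("py4j").setLevel(logging.ERROR)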

Slide 31

Know your environment

• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
  • Anaconda or virtualenv works well for this
• Spark versions before 1.4.0 require the same version of Python on the driver and workers

Slide 32

Complex Dependencies

Slide 33

Many Python Environments

The path to the Python binary to use on the cluster can be set with the PYSPARK_PYTHON environment variable.

It can be set in spark-env.sh:

if [ -n "${PYSPARK_PYTHON}" ]; then
  export PYSPARK_PYTHON=   # value elided on the original slide
fi

Slide 34

Thank You!

Questions?

@j_houg