Slide 1

PySpark Best Practices
Juliet Hougland, Cloudera
@j_houg
Sept 2015

Slide 2

Slide 3

Spark

• Core written in Scala, runs on the JVM
• Also has Python and Java APIs
• Hadoop Friendly
  • Input from HDFS, HBase, Kafka
  • Management via YARN
• Interactive REPL
• ML library == MLLib

Slide 4

Spark MLLib

• Model building and evaluation
• Fast
• Basics covered:
  • LR, SVM, decision trees
  • PCA, SVD
  • K-means
  • ALS
• Algorithms expect RDDs of consistent types (e.g. LabeledPoints)
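For example, MLlib's supervised learners consume RDDs of LabeledPoint. A minimal sketch, assuming an existing SparkContext named sc; the feature values are made up:

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Toy training set: a label followed by a small feature vector.
points = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.2]),
    LabeledPoint(1.0, [5.1, 0.3]),
])

# Every record must be a LabeledPoint of the same shape.
model = LogisticRegressionWithSGD.train(points, iterations=10)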

Slide 5

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: a file in HDFS split into four partitions. Thanks: Kostas Sakellis]

Slide 6

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: the four HDFS partitions loaded into a four-partition RDD. Thanks: Kostas Sakellis]

Slide 7

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: map(to_series) produces a second four-partition RDD from the first. Thanks: Kostas Sakellis]

Slide 8

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: filter(has_outlier) produces a third four-partition RDD. Thanks: Kostas Sakellis]

Slide 9

RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: count() aggregates all four partitions into a single result. Thanks: Kostas Sakellis]
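Put together as a runnable script. to_series and has_outlier are hypothetical helpers (the deck never defines them), and the HDFS path is a placeholder:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rdd-example")

def to_series(line):
    # Hypothetical parser: comma-separated floats -> numpy array.
    return np.array([float(x) for x in line.split(",")])

def has_outlier(series):
    # Hypothetical rule: any value more than 3 standard deviations out.
    return bool(np.any(np.abs(series - series.mean()) > 3 * series.std()))

count = (sc.textFile("hdfs:///placeholder/path", 4)  # 4 partitions
         .map(to_series)
         .filter(has_outlier)
         .count())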

Slide 10

Spark Execution Model

Slide 11

PySpark Execution Model

Slide 12

PySpark Driver Program

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

Function closures need to be executed on worker nodes by a Python process.

Slide 13

How do we ship around Python functions?

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

Slide 14

Pickle!
https://flic.kr/p/c8N4sE

Slide 15

Pickle!

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()
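A quick illustration of why pickling matters, using only the standard library. Plain pickle handles module-level functions (serialized as a name reference) but chokes on lambdas and closures, which is why PySpark ships functions with its bundled cloudpickle instead:

import pickle

def to_series(line):
    return [float(x) for x in line.split(",")]

# A module-level function pickles fine: it is stored as a reference
# to its module and name.
restored = pickle.loads(pickle.dumps(to_series))
assert restored("1,2") == [1.0, 2.0]

# A closure does not: plain pickle cannot serialize the captured state.
def make_threshold_filter(threshold):
    return lambda x: x > threshold

try:
    pickle.dumps(make_threshold_filter(3.0))
except Exception as exc:
    print("plain pickle fails:", exc)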

Slide 16

Best Practices for Writing PySpark

Slide 17

REPLs and Notebooks
https://flic.kr/p/5hnPZp

Slide 18

Share your code
https://flic.kr/p/sw2cnL

Slide 19

Standard Python Project

my_pyspark_proj/
    awesome/
        __init__.py
    bin/
    docs/
    setup.py
    tests/
        __init__.py
        awesome_tests.py
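A minimal setup.py for that layout might look like this (all metadata hypothetical):

from setuptools import find_packages, setup

setup(
    name="my_pyspark_proj",    # hypothetical project metadata
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
)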

Slide 20

What is the shape of a PySpark job?
https://flic.kr/p/4vWP6U

Slide 21

PySpark Structure?

• Parse CLI args & configure Spark app
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

https://flic.kr/p/ZW54
Shout out to my colleagues in the UK

Slide 22

PySpark Structure?

my_pyspark_proj/
    awesome/
        __init__.py
        DataIO.py
        Featurize.py
        Model.py
    bin/
    docs/
    setup.py
    tests/
        __init__.py
        awesome_tests.py
        resources/
            data_source_sample.csv

• Parse CLI args & configure Spark app
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

Slide 23

Simple Main Method
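The deck does not show the method itself; here is a minimal sketch of the shape described on the previous slides, with a hypothetical parse_record featurizer:

import argparse

from pyspark import SparkConf, SparkContext

def parse_record(line):
    # Hypothetical featurizer: CSV line -> list of floats.
    return [float(field) for field in line.split(",")]

def main():
    # Parse CLI args & configure the Spark app.
    parser = argparse.ArgumentParser(description="Example PySpark job")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    sc = SparkContext(conf=SparkConf().setAppName("awesome"))
    try:
        raw = sc.textFile(args.input)         # read in data
        features = raw.map(parse_record)      # raw data into features
        # Fancy maths with Spark would go here (e.g. an MLlib model).
        features.saveAsTextFile(args.output)  # write out data
    finally:
        sc.stop()

if __name__ == "__main__":
    main()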

Slide 24

Write Testable Code

• Write a function for anything inside a transformation
• Make it static
• Separate feature generation or data standardization from your modeling

Featurize.py:

class Featurize(object):

    @staticmethod
    def label(single_record):
        …
        return label_as_a_double

    @staticmethod
    def descriptive_name_of_feature1():
        ...
        return a_double

    @staticmethod
    def create_labeled_point(data_usage_rdd, sms_usage_rdd):
        ...
        return LabeledPoint(label, [feature1])

Slide 25

Write Serializable Code

• Functions and the contexts they need to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossiblish:
  • DB connections => use mapPartitions instead (sketched below)

https://flic.kr/p/za5cy
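A sketch of the mapPartitions workaround for the DB-connection case, given some rdd of records and a hypothetical db_connect factory. The point is that the connection is created on the worker, once per partition, instead of being pickled into a closure:

def save_partition(records):
    conn = db_connect()   # hypothetical: opened on the worker, never pickled
    try:
        for record in records:
            conn.insert(record)
    finally:
        conn.close()
    return iter([])       # mapPartitions must return an iterable

rdd.mapPartitions(save_partition).count()   # count() forces evaluation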

Slide 26

Testing with SparkTestingBase

• Provides a SparkContext and configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
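A sketch of a test case. The import path is my recollection of the Python half of spark-testing-base and should be checked against the project's README; the base class supplies a local SparkContext as self.sc:

from sparktestingbase.testcase import SparkTestingBaseTestCase

class AwesomeTests(SparkTestingBaseTestCase):

    def test_filter_drops_small_values(self):
        rdd = self.sc.parallelize([1.0, 2.0, 1000.0])
        kept = rdd.filter(lambda x: x < 100.0).collect()
        self.assertEqual(kept, [1.0, 2.0])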

Slide 27

Testing Suggestions

• Unit test as much as possible
• Integration test the whole flow
• Test for:
  • Deviations of data from the expected format
  • RDDs with empty partitions (reproduced in the snippet below)
  • Correctness of results

https://flic.kr/p/tucHHL
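The empty-partition case is easy to reproduce: ask parallelize for more slices than there are elements. A minimal sketch, assuming a SparkContext named sc:

# 8 slices for 2 elements guarantees some empty partitions.
rdd = sc.parallelize([1.0, 2.0], 8)
assert rdd.filter(lambda x: x < 100.0).count() == 2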

Slide 28

Best Practices for Running PySpark

Slide 29

Writing distributed code is the easy part… Running it is hard.

Slide 30

Get Serious About Logs

• Get the YARN application id from the web UI or console
• yarn logs -applicationId <app_id>
• Quiet down Py4J (see the snippet below)
• Log records that have trouble getting processed
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
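Py4J logs through Python's standard logging module, so one way to quiet it down, as suggested above, is to raise its logger level:

import logging

# Silence everything below ERROR from the py4j logger hierarchy.
logging.getLogger("py4j").setLevel(logging.ERROR)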

Slide 31

Know your environment

• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
  • Anaconda or virtualenv works well for this
• Spark versions before 1.4.0 require the same version of Python on the driver and workers

Slide 32

Complex Dependencies

Slide 33

Many Python Environments

The path to the Python binary to use on the cluster can be set with the PYSPARK_PYTHON environment variable.

It can be set in spark-env.sh:

if [ -n "${PYSPARK_PYTHON}" ]; then
  export PYSPARK_PYTHON=   # value elided on the original slide
fi

Slide 34

Thank You!

Questions?

@j_houg