Slide 1
PySpark Best Practices
Juliet Hougland, Sept 2015, @j_houg
Slide 2
Slide 3
Spark
• Core written in Scala, operates on the JVM
• Also has Python and Java APIs
• Hadoop friendly
  • Input from HDFS, HBase, Kafka
  • Management via YARN
• Interactive REPL
• ML library: MLlib
Slide 4
Spark MLlib
• Model building and evaluation
• Fast
• Basics covered:
  • LR, SVM, decision trees
  • PCA, SVD
  • k-means
  • ALS
• Algorithms expect RDDs of a consistent type (e.g. LabeledPoint)
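As a concrete illustration of that last point, here is a minimal sketch of feeding MLlib an RDD of LabeledPoints. The CSV layout, file path, and the choice of logistic regression are assumptions for illustration, not from the slide:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="labeled-points-sketch")

    def to_labeled_point(line):
        # Assumed layout: label in the first column, features after it.
        fields = [float(x) for x in line.split(",")]
        return LabeledPoint(fields[0], fields[1:])

    # Every element of this RDD is a LabeledPoint, the consistent type
    # MLlib's training functions expect.
    points = sc.textFile("hdfs://path/to/training.csv").map(to_labeled_point)
    model = LogisticRegressionWithSGD.train(points, iterations=10)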
Slides 5–9
RDDs

(Diagram, built up one stage per slide: a file in HDFS is read into an RDD of 4 partitions; each transformation produces a new 4-partition RDD; count() produces a result. Thanks: Kostas Sakellis)

    sc.textFile("hdfs://…", 4)
      .map(to_series)
      .filter(has_outlier)
      .count()
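The helpers in that snippet are never defined on the slides. A runnable stand-in might look like the following, where to_series and has_outlier are hypothetical implementations assuming one comma-separated series of numbers per line:

    # Hypothetical helpers for the slide's pipeline (not from the deck).
    def to_series(line):
        # Parse a line of comma-separated numbers into a list of floats.
        return [float(x) for x in line.split(",")]

    def has_outlier(series):
        # Crude outlier test: any value more than 3 population standard
        # deviations from the mean of its series.
        if not series:
            return False
        mean = sum(series) / len(series)
        std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5
        return any(abs(x - mean) > 3 * std for x in series)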
Slide 10
Spark Execution Model
Slide 11
PySpark Execution Model
Slide 12
PySpark Driver Program

    sc.textFile("hdfs://…", 4)
      .map(to_series)
      .filter(has_outlier)
      .count()

Function closures need to be executed on worker nodes by a Python process.
Slide 13
How do we ship around Python functions?

    sc.textFile("hdfs://…", 4)
      .map(to_series)
      .filter(has_outlier)
      .count()
Slide 14
Pickle! https://flic.kr/p/c8N4sE
Slide 15
Pickle!

    sc.textFile("hdfs://…", 4)
      .map(to_series)
      .filter(has_outlier)
      .count()
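PySpark ships function closures to workers by pickling them. Plain pickle cannot serialize lambdas or interactively defined functions, which is why PySpark bundles a CloudPickle-based serializer. A small sketch of the difference, using the standalone cloudpickle package as a stand-in (an assumption; PySpark uses its own bundled copy):

    import pickle
    import cloudpickle  # pip install cloudpickle

    threshold = 3.0
    f = lambda x: x > threshold      # a closure over `threshold`

    try:
        pickle.dumps(f)              # plain pickle refuses lambdas
    except Exception as e:
        print("pickle failed:", e)

    payload = cloudpickle.dumps(f)   # cloudpickle serializes the closure
    g = pickle.loads(payload)        # standard pickle can load it back
    print(g(5.0))                    # True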
Slide 16
Best Practices for Writing PySpark
Slide 17
REPLs and Notebooks https://flic.kr/p/5hnPZp
Slide 18
Share your code https://flic.kr/p/sw2cnL
Slide 19
Standard Python Project

    my_pyspark_proj/
      awesome/
        __init__.py
      bin/
      docs/
      setup.py
      tests/
        __init__.py
        awesome_tests.py
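A minimal setup.py to go with that layout might look like this; the package metadata is hypothetical:

    # setup.py -- minimal packaging for the layout above
    from setuptools import setup, find_packages

    setup(
        name="my_pyspark_proj",
        version="0.1.0",
        packages=find_packages(exclude=["tests"]),
    )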
Slide 20
What is the shape of a PySpark job? https://flic.kr/p/4vWP6U
Slide 21
PySpark Structure?
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

https://flic.kr/p/ZW54
Shout out to my colleagues in the UK
Slide 22
PySpark Structure?

    my_pyspark_proj/
      awesome/
        __init__.py
        DataIO.py
        Featurize.py
        Model.py
      bin/
      docs/
      setup.py
      tests/
        __init__.py
        awesome_tests.py
        resources/
          data_source_sample.csv

• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
Slide 23
Simple Main Method
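The transcript carries no code for this slide. A sketch of the kind of simple main method the deck is describing, with hypothetical paths and a trivial featurize step standing in for the real DataIO/Featurize/Model modules from slide 22:

    # main.py -- hypothetical driver following the structure above
    import sys
    from pyspark import SparkConf, SparkContext

    def main(input_path, output_path):
        # Parse CLI args & configure the Spark app
        conf = SparkConf().setAppName("awesome")
        sc = SparkContext(conf=conf)
        raw = sc.textFile(input_path)                      # read in data
        features = raw.map(lambda line: line.split(","))   # raw data into features
        features.saveAsTextFile(output_path)               # write out data
        sc.stop()

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])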
Slide 24
Write Testable Code
• Write a function for anything inside a transformation
• Make it static
• Separate feature generation or data standardization from your modeling

Featurize.py:

    @staticmethod
    def label(single_record):
        ...
        return label_as_a_double

    @staticmethod
    def descriptive_name_of_feature1():
        ...
        return a_double

    @staticmethod
    def create_labeled_point(data_usage_rdd, sms_usage_rdd):
        ...
        return LabeledPoint(label, [feature1])
Slide 25
Write Serializable Code
• Functions and the contexts they need to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossible-ish:
  • DB connections => use mapPartitions instead

https://flic.kr/p/za5cy
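For example, a connection object cannot be pickled, so you open one per partition inside mapPartitions rather than closing over it. The db_connect factory and table name below are hypothetical:

    # Hypothetical sketch: one DB connection per partition instead of a
    # (non-serializable) connection captured in a closure.
    def save_partition(records):
        conn = db_connect()   # hypothetical factory; runs on the worker,
                              # so the connection is never pickled
        for r in records:
            conn.insert("usage_table", r)
        conn.close()
        return []             # mapPartitions must return an iterable

    rdd.mapPartitions(save_partition).count()   # an action forces execution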
Slide 26
Testing with SparkTestingBase
• Provides a SparkContext, configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
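Without leaning on the library's exact class names, the plumbing it provides amounts to roughly this unittest base class (a sketch of the pattern, not spark-testing-base's actual API):

    import unittest
    from pyspark import SparkConf, SparkContext

    class SparkTestCase(unittest.TestCase):
        """Roughly the plumbing spark-testing-base provides."""

        @classmethod
        def setUpClass(cls):
            # Local master so tests need no cluster.
            conf = SparkConf().setMaster("local[2]").setAppName("tests")
            cls.sc = SparkContext(conf=conf)
            cls.sc.setLogLevel("ERROR")   # quiet the Py4J/log4j chatter

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()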
Slide 27
Testing Suggestions
• Unit test as much as possible
• Integration test the whole flow
• Test for:
  • Deviations of data from the expected format
  • RDDs with empty partitions
  • Correctness of results

https://flic.kr/p/tucHHL
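Building on the harness sketched above, a test for the empty-partition case might look like this (to_series and has_outlier are the hypothetical helpers defined earlier):

    class OutlierPipelineTest(SparkTestCase):
        def test_handles_empty_partitions(self):
            spiky = ",".join(["1"] * 12 + ["100"])   # one clear outlier
            flat = "1,2,3"                           # no outlier
            # 4 partitions but only 2 records: some partitions are empty,
            # and the pipeline must not choke on them.
            rdd = self.sc.parallelize([spiky, flat], 4)
            count = rdd.map(to_series).filter(has_outlier).count()
            self.assertEqual(count, 1)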
Slide 28
Best Practices for Running PySpark
Slide 29
Writing distributed code is the easy part… Running it is hard.
Slide 30
Get Serious About Logs
• Get the YARN application id from the web UI or console
• yarn logs
• Quiet down Py4J
• Log records that have trouble getting processed
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
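Once you have the application id, the aggregated container logs come back with the yarn CLI; the id below is a made-up example:

    # Fetch aggregated container logs for a finished YARN application.
    yarn logs -applicationId application_1441058906411_0012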
Slide 31
Know your environment
• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
  • Anaconda or virtualenv is good for this
• Spark versions < 1.4.0 require the same version of Python on driver and workers
Slide 32
Complex Dependencies
Slide 33
Many Python Environments

The path to the Python binary to use on the cluster can be set with PYSPARK_PYTHON. It can be set in spark-env.sh:

    if [ -z "${PYSPARK_PYTHON}" ]; then
      export PYSPARK_PYTHON=   # path elided on the slide
    fi
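Equivalently, a single job can be pointed at a specific interpreter at submit time; the Anaconda path and script name here are just examples:

    # Use a specific Python for one job's driver and workers.
    PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit my_job.py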
Slide 34
Thank You! Questions?
@j_houg