
PySpark Best Practices

Juliet Hougland
September 18, 2015

PySpark (the component of Spark that allows users to write their code in Python) has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don't need to learn a new language, you still have access to the modules (pandas, nltk, statsmodels, etc.) you are already familiar with, and you can run complex computations quickly and at scale using the power of Spark.

In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data, using it to motivate the issues described above and to provide a concrete example of best practices for real-world PySpark applications. We will cover:

• Python package management on a cluster using Anaconda or virtualenv.

• Testing PySpark applications.

• Spark's computational model and its relationship to how you structure your code.


Transcript

  1. Spark
     • Core written in Scala, operates on the JVM
     • Also has Python and Java APIs
     • Hadoop friendly
       • Input from HDFS, HBase, Kafka
       • Management via YARN
     • Interactive REPL
     • ML library == MLlib
  2. Spark MLlib
     • Model building and eval
     • Fast
     • Basics covered
       • LR, SVM, decision tree
       • PCA, SVD
       • K-means
       • ALS
     • Algorithms expect RDDs of consistent types (i.e. LabeledPoints)
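
     The slide mentions LabeledPoint without showing one; here is a minimal
     illustration of the class it refers to (the label and feature values are
     made up):

         from pyspark.mllib.regression import LabeledPoint

         # A double label plus a feature vector; MLlib algorithms expect an RDD
         # of these with a consistent number of features.
         point = LabeledPoint(1.0, [0.5, 2.3, 0.0])
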
  3. RDDs
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
     [Diagram: a file in HDFS split into four partitions]
     Thanks: Kostas Sakellis
  4. RDDs
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
     [Diagram: textFile turns the four HDFS partitions into a four-partition RDD]
     Thanks: Kostas Sakellis
  5. RDDs
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
     [Diagram: map produces a second four-partition RDD from the first]
     Thanks: Kostas Sakellis
  6. RDDs
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
     [Diagram: filter adds a third four-partition RDD to the lineage]
     Thanks: Kostas Sakellis
  7. RDDs
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
     [Diagram: count triggers evaluation of the whole lineage and returns a single result]
     Thanks: Kostas Sakellis
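
     The deck does not show to_series or has_outlier; the following is a
     minimal sketch of what such a pipeline could look like, with hypothetical
     implementations of both helpers and a placeholder HDFS path:

         import numpy as np
         from pyspark import SparkContext

         def to_series(line):
             # Hypothetical parser: a CSV line of floats becomes a NumPy array.
             return np.array([float(x) for x in line.split(",")])

         def has_outlier(series):
             # Hypothetical rule: any point more than 3 standard deviations from the mean.
             return bool(np.abs(series - series.mean()).max() > 3 * series.std())

         sc = SparkContext(appName="rdd-lineage-example")
         count = (sc.textFile("hdfs:///path/to/timeseries.csv", 4)  # placeholder path
                    .map(to_series)
                    .filter(has_outlier)
                    .count())
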
  8. PySpark Driver Program
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
     Function closures need to be executed on worker nodes by a Python process.
  9. How do we ship around Python functions?
     sc.textFile("hdfs://…", 4).map(to_series).filter(has_outlier).count()
  10. Standard Python Project
      my_pyspark_proj/
        awesome/
          __init__.py
        bin/
        docs/
        setup.py
        tests/
          __init__.py
          awesome_tests.py
  11. What is the shape of a PySpark job?
      https://flic.kr/p/4vWP6U
  12. PySpark Structure?
      • Parse CLI args & configure Spark App
      • Read in data
      • Raw data into features
      • Fancy Maths with Spark
      • Write out data
      https://flic.kr/p/ZW54 (shout out to my colleagues in the UK)
  13. PySpark Structure?
      my_pyspark_proj/
        awesome/
          __init__.py
          DataIO.py
          Featurize.py
          Model.py
        bin/
        docs/
        setup.py
        tests/
          __init__.py
          awesome_tests.py
          resources/
            data_source_sample.csv
      • Parse CLI args & configure Spark App
      • Read in data
      • Raw data into features
      • Fancy Maths with Spark
      • Write out data
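
      The deck does not include the driver script itself; below is a minimal
      sketch of what a bin/ entry point might look like under the layout above.
      The function names inside DataIO and Model (read_raw, write_results,
      train) are assumptions; only Featurize.create_labeled_point appears in
      the deck.

          import argparse
          from pyspark import SparkConf, SparkContext

          from awesome import DataIO, Featurize, Model

          def main():
              # Parse CLI args & configure the Spark app
              parser = argparse.ArgumentParser(description="Run the awesome PySpark job")
              parser.add_argument("--input", required=True)
              parser.add_argument("--output", required=True)
              args = parser.parse_args()

              sc = SparkContext(conf=SparkConf().setAppName("awesome-job"))

              # Read in data, featurize, do the fancy maths, write out results
              data_usage_rdd, sms_usage_rdd = DataIO.read_raw(sc, args.input)
              labeled_points = Featurize.create_labeled_point(data_usage_rdd, sms_usage_rdd)
              model = Model.train(labeled_points)
              DataIO.write_results(model, args.output)

          if __name__ == "__main__":
              main()
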
  14. Write Testable Code
      • Write a function for anything inside a transformation
      • Make it static
      • Separate feature generation or data standardization from your modeling
      Featurize.py:
        @staticmethod
        def label(single_record):
            …
            return label_as_a_double

        @staticmethod
        def descriptive_name_of_feature1():
            ...
            return a_double

        @staticmethod
        def create_labeled_point(data_usage_rdd, sms_usage_rdd):
            ...
            return LabeledPoint(label, [feature1])
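
      The slide elides the function bodies; a filled-in sketch of Featurize.py
      might look like the following. The specific label and feature definitions
      (data-plan overage, mean daily usage) are invented for illustration.

          from pyspark.mllib.regression import LabeledPoint

          class Featurize(object):

              @staticmethod
              def label(single_record):
                  # Hypothetical label: 1.0 if the user went over their data plan, else 0.0.
                  return 1.0 if single_record["bytes_used"] > single_record["plan_bytes"] else 0.0

              @staticmethod
              def mean_daily_usage(single_record):
                  # Hypothetical feature: average of the record's daily usage values.
                  daily = single_record["daily_usage"]
                  return float(sum(daily)) / len(daily)

              @staticmethod
              def to_labeled_point(joined_record):
                  data_usage, sms_count = joined_record
                  return LabeledPoint(Featurize.label(data_usage),
                                      [Featurize.mean_daily_usage(data_usage), float(sms_count)])

              @staticmethod
              def create_labeled_point(data_usage_rdd, sms_usage_rdd):
                  # Both RDDs are assumed to be keyed by user id.
                  return data_usage_rdd.join(sms_usage_rdd).values().map(Featurize.to_labeled_point)

      Because each piece is a small static method with no hidden state, label
      and mean_daily_usage can be unit tested without a SparkContext at all.
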
  15. Write Serializable Code
      • Functions and the contexts they need to execute (closures) must be serializable
      • Keep functions simple. I suggest static methods.
      • Some things are impossiblish
        • DB connections => use mapPartitions instead
      https://flic.kr/p/za5cy
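
      The deck does not show code for the mapPartitions suggestion; one common
      sketch, using a hypothetical (non-serializable) database client, is to
      open a single connection per partition on the worker rather than trying
      to capture a connection in a driver-side closure:

          def save_partition(records):
              import some_db_client                          # assumption: your DB driver of choice
              conn = some_db_client.connect("db-host:5432")  # placeholder connection string
              try:
                  for record in records:
                      conn.insert(record)
              finally:
                  conn.close()
              return iter([])                                # mapPartitions must return an iterable

          # results_rdd stands in for whatever RDD you want to persist
          # results_rdd.mapPartitions(save_partition).count()  # an action forces the writes

      For pure side effects like this, foreachPartition works the same way and
      avoids the dummy return value and trailing action.
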
  16. Testing with SparkTestingBase
      • Provides a SparkContext and configures the Spark master
      • Quiets Py4J
      • https://github.com/holdenk/spark-testing-base
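
      spark-testing-base provides these pieces out of the box; if you would
      rather not add the dependency, a hand-rolled unittest fixture that does
      roughly the same job (a sketch, not the library's API) looks like this:

          import unittest
          from pyspark import SparkConf, SparkContext

          class AwesomeTests(unittest.TestCase):

              @classmethod
              def setUpClass(cls):
                  # One small local context shared across the class keeps test runs fast.
                  conf = SparkConf().setMaster("local[2]").setAppName("awesome-tests")
                  cls.sc = SparkContext(conf=conf)

              @classmethod
              def tearDownClass(cls):
                  cls.sc.stop()

              def test_map_over_small_rdd(self):
                  rdd = self.sc.parallelize(["spark", "pyspark"])
                  self.assertEqual(rdd.map(len).collect(), [5, 7])

          if __name__ == "__main__":
              unittest.main()
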
  17. Testing Suggestions
      • Unit test as much as possible
      • Integration test the whole flow
      • Test for:
        • Deviations of data from the expected format
        • RDDs with empty partitions
        • Correctness of results
      https://flic.kr/p/tucHHL
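
      Empty partitions are easy to reproduce in a test: parallelize fewer
      elements than partitions and check that your per-partition logic still
      behaves. A minimal, self-contained check (summarize_partition is a
      hypothetical helper):

          from pyspark import SparkContext

          def summarize_partition(records):
              # Must cope with an empty iterator.
              records = list(records)
              return iter([]) if not records else iter([(len(records), sum(records))])

          sc = SparkContext(master="local[2]", appName="empty-partition-check")
          # Two elements spread over four partitions leaves two partitions empty
          # under Spark's default slicing.
          rdd = sc.parallelize([1.0, 2.0], numSlices=4)
          assert rdd.mapPartitions(summarize_partition).collect() == [(1, 1.0), (1, 2.0)]
          sc.stop()
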
  18. Get Serious About Logs
      • Get the YARN application id from the web UI or console
      • yarn logs -applicationId <app-id>
      • Quiet down Py4J
      • Log records that have trouble getting processed
      • Earlier exceptions are more relevant than later ones
      • Look at both the Python and Java stack traces
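
      Quieting Py4J usually amounts to raising the level of the "py4j" logger
      on the Python side; a minimal sketch (sc.setLogLevel requires Spark 1.4+):

          import logging
          from pyspark import SparkContext

          # Py4J logs through the standard "py4j" logger; raising its level keeps
          # gateway chatter out of the driver's output.
          logging.getLogger("py4j").setLevel(logging.ERROR)

          sc = SparkContext(appName="quiet-logs")
          # Optionally quiet the JVM side as well (available since Spark 1.4).
          sc.setLogLevel("WARN")
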
  19. Know your environment
      • You may want to use Python packages on your cluster
      • Actively manage dependencies on your cluster
        • Anaconda or virtualenv is good for this
      • Spark versions < 1.4.0 require the same version of Python on the driver and workers
  20. Many Python Environments
      The path to the Python binary to use on the cluster can be set with
      PYSPARK_PYTHON. It can also be set in spark-env.sh:

        if [ -n "${PYSPARK_PYTHON}" ]; then
          export PYSPARK_PYTHON=<path>
        fi