
Scaling and Unifying SciKit Learn and Spark Pipelines using Ray (Raghu Ganti, IBM)


Pipelines have become ubiquitous as the need to string multiple functions together to compose applications has grown. Common pipeline abstractions such as "fit" and "transform" are even shared across divergent platforms such as Python's scikit-learn and Apache Spark.

Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as scikit-learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.

Attendees will learn how pipelined workflows can be mapped to Ray's compute model and how they can both unify and accelerate their pipelines with Ray.


Anyscale

July 23, 2021

Transcript

  1. IBM Research. Scaling and Unifying Scikit Learn and Spark Pipelines using Ray. Raghu Ganti (rganti@us.ibm.com), Principal Research Staff Member, IBM T.J. Watson Research Center. Team (IBM & Red Hat): Michael Behrendt, Linsong Chu, Carlos Costa, Erik Erlandson, Mudhakar Srivatsa. raghu-ganti-26096810
  2. So many pipelines… And many more…

  3. Can we do pipelines on Ray? Can we scale popular AI/ML pipelines on Ray? Can we unify scikit-learn and Spark pipelines? Ray.IO
  4. § Focus on scikit-learn and Spark pipelines § Scikit-learn is missing scaling; Spark focuses on data-parallel scaling. [Diagram, "Current Pipeline API": Fit takes X and y and produces a fitted model; Transform takes X and produces X'.]
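The shared "fit"/"transform" contract on this slide can be sketched in plain Python. This is an illustrative toy (the class and attribute names are invented for this sketch), not the actual scikit-learn or Spark API:

```python
# Minimal sketch of the fit/transform transformer contract: fit(X, y)
# learns state and returns the fitted object; transform(X) produces X'.
class SimpleScaler:
    """Centers each feature column, mimicking the transformer contract."""

    def fit(self, X, y=None):
        # Learn state from the data (here: per-column means).
        n = len(X)
        self.means_ = [sum(row[i] for row in X) / n for i in range(len(X[0]))]
        return self  # returning self lets calls be chained

    def transform(self, X):
        # Apply the learned state: X -> X'
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


X = [[1.0, 10.0], [3.0, 30.0]]
X_prime = SimpleScaler().fit(X).transform(X)
print(X_prime)  # [[-1.0, -10.0], [1.0, 10.0]]
```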
  5. Scaling Pipelines: I/O as List of Object References. [Diagram: Fit takes [X1, X2, … XN] and [y1, y2, … yN] and produces [FM1, FM2, … FMN]; Transform takes [X1, X2, … XN] and produces [X1', X2', …, XN'].]
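The idea on this slide, steps consuming and producing lists of references to data chunks rather than one monolithic X, can be illustrated with stdlib futures. In Ray, `fit_chunk` would be a `@ray.remote` task and the futures would be ObjectRefs; this sketch only shows the pattern, not the talk's actual implementation:

```python
# Each chunk (Xi, yi) is fitted independently, yielding a list of
# "references" to fitted models [FM1, ..., FMN] that the next pipeline
# step can consume without gathering everything on one worker.
from concurrent.futures import ThreadPoolExecutor

def fit_chunk(X, y):
    # Placeholder "model": the mean label of the chunk.
    return sum(y) / len(y)

X_chunks = [[1, 2], [3, 4], [5, 6]]
y_chunks = [[0, 1], [1, 1], [0, 0]]

with ThreadPoolExecutor() as pool:
    # Submitting returns a list of futures, analogous to a list of
    # Ray ObjectRefs in the talk's setting.
    model_refs = [pool.submit(fit_chunk, X, y)
                  for X, y in zip(X_chunks, y_chunks)]
    fitted_models = [ref.result() for ref in model_refs]

print(fitted_models)  # [0.5, 1.0, 0.0]
```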
  6. Scaling Pipelines: AND/OR Graphs. [Diagram: an OR node maps X1, X2, … XN to X1', X2', … XN'; an AND node fans X out to Step 1, Step 2, … Step N, each yielding X'.]
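One way to read the AND/OR enrichment on this slide: an OR node applies its function to each upstream value independently (fan-out), while an AND node waits for all of its inputs and combines them with an arbitrary lambda. The tiny interpreter below is an invented sketch of that semantics, not the talk's API:

```python
# Naive AND/OR graph executor. Each node is (kind, fn, deps); "or"
# applies fn to its single upstream value, "and" applies fn to the
# collected list of all upstream values.
def execute(graph, inputs):
    results = dict(inputs)
    pending = [n for n in graph if n not in results]
    while pending:
        for node in list(pending):
            kind, fn, deps = graph[node]
            if all(d in results for d in deps):  # dependencies ready?
                args = [results[d] for d in deps]
                results[node] = fn(args[0]) if kind == "or" else fn(args)
                pending.remove(node)
    return results

graph = {
    "double":  ("or",  lambda x: 2 * x, ["X"]),
    "square":  ("or",  lambda x: x * x, ["X"]),
    "combine": ("and", sum, ["double", "square"]),  # waits for both
}
out = execute(graph, {"X": 3})
print(out["combine"])  # 2*3 + 3*3 = 15
```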
  7. Function as a unit of compute; list of objects as I/O; cross-environment; AND/OR graphs. § Object references as I/O for the unit of compute § Sharing of objects using the Plasma store § Enables zero-copy object sharing § Python function as the unit of compute § Intuitive for data scientists § Follows transformer APIs § MPI-style scaling § Scikit-learn is typically in Python § Ray.IO with RayDP enables efficient data exchange § Enriched DAGs from plain pipelines § OR nodes for fan-out expressions § AND nodes for arbitrary lambdas
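The zero-copy sharing that the slide attributes to Ray's Plasma store has a rough stdlib analogue in `multiprocessing.shared_memory`: two handles can view the same bytes without copying them. This only illustrates the idea; it is not Plasma and not the talk's mechanism:

```python
# A "producer" puts bytes into a named shared block, and a "consumer"
# attaches to the same block by name; both handles view the same
# underlying memory, so no copy is made.
from multiprocessing import shared_memory

src = shared_memory.SharedMemory(create=True, size=5)
src.buf[:5] = b"hello"

dst = shared_memory.SharedMemory(name=src.name)  # attach, zero-copy
data = bytes(dst.buf[:5])
print(data)  # b'hello'

dst.close()
src.close()
src.unlink()  # free the shared block
```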
  8. Illustrative Example. [Diagram, "Sample Pipeline": Preprocess feeding Decision Tree, Random Forest, and Gradient Boost.] Our pipeline implementation, 2x faster.

    Scikit-learn version:
    c_a = ScaleTestEstimator(50, DecisionTreeClassifier())
    c_b = ScaleTestEstimator(50, RandomForestClassifier())
    c_c = ScaleTestEstimator(50, GradientBoostingClassifier())
    classifiers = [c_a, c_b, c_c]
    classifier_results = []
    for classifier in classifiers:
        pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', classifier)])
        pipe.fit(X_train, y_train)
        pipe.predict(X_train)

    Our pipeline implementation:
    pipeline = dm.Pipeline()
    node_a = dm.OrNode('preprocess', preprocessor)
    node_b = dm.OrNode('c_a', c_a)
    node_c = dm.OrNode('c_b', c_b)
    node_d = dm.OrNode('c_c', c_c)
    pipeline.add_edge(node_a, node_b)
    pipeline.add_edge(node_a, node_c)
    pipeline.add_edge(node_a, node_d)
    in_args = {node_a: [Xy_ref_ptr]}
    out_args = rt.execute_pipeline(pipeline, ExecutionType.TRAIN, in_args)
  9. Pipelines Galore… [Comparison table; the transcript names only two of the five columns, "Spark Pipeline" and "Our pipeline implementation" (last column); the other column headers are not recoverable.]
    Task parallelism: ✓ ✓ ✗ ✓ ✓
    Data parallelism: ✗ ✗ ✗ ✓ ✓
    AND/OR Graphs: ✓ ✓ ✗ ✗ ✓
    Computational unit: Container, Container, Python function, Python/Java function, Python/Java function
    Mutability of DAG: ✗ ✗ ✓ ✓ ✓
  10. What to expect? § Execution strategies based on graph traversals § Early stopping criteria § Mutability of execution pipelines § Current status: proposal discussion with the Ray and OSS community. Related Summit talks: Powering Open Data Hub with Ray (Erik Erlandson); Serverless Earth Science Data Labeling using Unsupervised Deep Learning with Ray (Linsong Chu)
  11. Q&A. Thank you! Contacts: Raghu Ganti (rganti@us.ibm.com), Michael Behrendt (michaelbehrendt@de.ibm.com), Linsong Chu (lchu@us.ibm.com), Carlos Costa (chcost@us.ibm.com), Erik Erlandson (eerlands@redhat.com), Mudhakar Srivatsa (msrivats@us.ibm.com)