
Scaling and Unifying SciKit Learn and Spark Pipelines using Ray (Raghu Ganti, IBM)


Pipelines have become ubiquitous as the need to string multiple functions together to compose applications has grown. Common pipeline abstractions such as "fit" and "transform" are even shared across divergent platforms such as Python's scikit-learn and Apache Spark.

Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as scikit-learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.

Attendees will learn how pipelined workflows can be mapped to Ray's compute model and how they can both unify and accelerate their pipelines with Ray.


Anyscale

July 23, 2021

Transcript

  1. IBM Research. Scaling and Unifying Scikit Learn and Spark Pipelines using Ray. Raghu Ganti (rganti@us.ibm.com), Principal Research Staff Member, IBM T.J. Watson Research Center. Team (IBM & Red Hat): Michael Behrendt, Linsong Chu, Carlos Costa, Erik Erlandson, Mudhakar Srivatsa. raghu-ganti-26096810
  2. So many pipelines… And many more…

  3. Can we do pipelines on Ray? Can we scale popular AI/ML pipelines on Ray? Can we unify scikit-learn and Spark pipelines? Ray.IO
  4. § Focus on scikit-learn and Spark pipelines § Scikit-learn is missing scaling; Spark focuses on data-parallel scaling. [Diagram, "Current Pipeline API": Fit takes X and y and produces a fitted model; Transform takes X and produces X'.]
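The shared "fit"/"transform" contract on this slide can be sketched in plain Python. This is an illustrative toy (the class and attribute names are invented for this sketch), not the actual scikit-learn or Spark API:

```python
# Minimal sketch of the fit/transform transformer contract: fit(X, y)
# learns state and returns the fitted object; transform(X) produces X'.
class SimpleScaler:
    """Centers each feature column, mimicking the transformer contract."""

    def fit(self, X, y=None):
        # Learn state from the data (here: per-column means).
        n = len(X)
        self.means_ = [sum(row[i] for row in X) / n for i in range(len(X[0]))]
        return self  # returning self lets calls be chained

    def transform(self, X):
        # Apply the learned state: X -> X'
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


X = [[1.0, 10.0], [3.0, 30.0]]
X_prime = SimpleScaler().fit(X).transform(X)
print(X_prime)  # [[-1.0, -10.0], [1.0, 10.0]]
```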
  5. Scaling Pipelines: I/O as List of Object References. [Diagram: Fit takes [X1, X2, … XN] and [y1, y2, … yN] and produces [FM1, FM2, … FMN]; Transform takes [X1, X2, … XN] and produces [X1', X2', …, XN'].]
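The idea on this slide, steps consuming and producing lists of references to data chunks rather than one monolithic X, can be illustrated with stdlib futures. In Ray, `fit_chunk` would be a `@ray.remote` task and the futures would be ObjectRefs; this sketch only shows the pattern, not the talk's actual implementation:

```python
# Each chunk (Xi, yi) is fitted independently, yielding a list of
# "references" to fitted models [FM1, ..., FMN] that the next pipeline
# step can consume without gathering everything on one worker.
from concurrent.futures import ThreadPoolExecutor

def fit_chunk(X, y):
    # Placeholder "model": the mean label of the chunk.
    return sum(y) / len(y)

X_chunks = [[1, 2], [3, 4], [5, 6]]
y_chunks = [[0, 1], [1, 1], [0, 0]]

with ThreadPoolExecutor() as pool:
    # Submitting returns a list of futures, analogous to a list of
    # Ray ObjectRefs in the talk's setting.
    model_refs = [pool.submit(fit_chunk, X, y)
                  for X, y in zip(X_chunks, y_chunks)]
    fitted_models = [ref.result() for ref in model_refs]

print(fitted_models)  # [0.5, 1.0, 0.0]
```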
  6. Scaling Pipelines: AND/OR Graphs. [Diagram: an OR node maps X1, X2, … XN to X1', X2', … XN'; an AND node fans X out to Step 1, Step 2, … Step N, each yielding X'.]
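One way to read the AND/OR enrichment on this slide: an OR node applies its function to each upstream value independently (fan-out), while an AND node waits for all of its inputs and combines them with an arbitrary lambda. The tiny interpreter below is an invented sketch of that semantics, not the talk's API:

```python
# Naive AND/OR graph executor. Each node is (kind, fn, deps); "or"
# applies fn to its single upstream value, "and" applies fn to the
# collected list of all upstream values.
def execute(graph, inputs):
    results = dict(inputs)
    pending = [n for n in graph if n not in results]
    while pending:
        for node in list(pending):
            kind, fn, deps = graph[node]
            if all(d in results for d in deps):  # dependencies ready?
                args = [results[d] for d in deps]
                results[node] = fn(args[0]) if kind == "or" else fn(args)
                pending.remove(node)
    return results

graph = {
    "double":  ("or",  lambda x: 2 * x, ["X"]),
    "square":  ("or",  lambda x: x * x, ["X"]),
    "combine": ("and", sum, ["double", "square"]),  # waits for both
}
out = execute(graph, {"X": 3})
print(out["combine"])  # 2*3 + 3*3 = 15
```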
  7. Function as a unit of compute; list of objects as I/O; cross-environment; AND/OR graphs. § Object references as I/O for the unit of compute § Sharing of objects using the Plasma store § Enables zero-copy object sharing § Python function as the unit of compute § Intuitive for data scientists § Follows transformer APIs § MPI-style scaling § Scikit-learn is typically in Python § Ray.IO with RayDP enables efficient data exchange § Enriched DAGs from plain pipelines § OR nodes for fan-out expressions § AND nodes for arbitrary lambdas
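The zero-copy sharing that the slide attributes to Ray's Plasma store has a rough stdlib analogue in `multiprocessing.shared_memory`: two handles can view the same bytes without copying them. This only illustrates the idea; it is not Plasma and not the talk's mechanism:

```python
# A "producer" puts bytes into a named shared block, and a "consumer"
# attaches to the same block by name; both handles view the same
# underlying memory, so no copy is made.
from multiprocessing import shared_memory

src = shared_memory.SharedMemory(create=True, size=5)
src.buf[:5] = b"hello"

dst = shared_memory.SharedMemory(name=src.name)  # attach, zero-copy
data = bytes(dst.buf[:5])
print(data)  # b'hello'

dst.close()
src.close()
src.unlink()  # free the shared block
```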
  8. Illustrative Example. [Diagram, "Sample Pipeline": Preprocess feeding Decision Tree, Random Forest, and Gradient Boost.] Our pipeline implementation, 2x faster.

    Scikit-learn version:
    c_a = ScaleTestEstimator(50, DecisionTreeClassifier())
    c_b = ScaleTestEstimator(50, RandomForestClassifier())
    c_c = ScaleTestEstimator(50, GradientBoostingClassifier())
    classifiers = [c_a, c_b, c_c]
    classifier_results = []
    for classifier in classifiers:
        pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', classifier)])
        pipe.fit(X_train, y_train)
        pipe.predict(X_train)

    Our pipeline implementation:
    pipeline = dm.Pipeline()
    node_a = dm.OrNode('preprocess', preprocessor)
    node_b = dm.OrNode('c_a', c_a)
    node_c = dm.OrNode('c_b', c_b)
    node_d = dm.OrNode('c_c', c_c)
    pipeline.add_edge(node_a, node_b)
    pipeline.add_edge(node_a, node_c)
    pipeline.add_edge(node_a, node_d)
    in_args = {node_a: [Xy_ref_ptr]}
    out_args = rt.execute_pipeline(pipeline, ExecutionType.TRAIN, in_args)
  9. Pipelines Galore… [Comparison table; the transcript names only two of the five columns, "Spark Pipeline" and "Our pipeline implementation" (last column); the other column headers are not recoverable.]
    Task parallelism: ✓ ✓ ✗ ✓ ✓
    Data parallelism: ✗ ✗ ✗ ✓ ✓
    AND/OR Graphs: ✓ ✓ ✗ ✗ ✓
    Computational unit: Container, Container, Python function, Python/Java function, Python/Java function
    Mutability of DAG: ✗ ✗ ✓ ✓ ✓
  10. What to expect? § Execution strategies based on graph traversals § Early stopping criteria § Mutability of execution pipelines § Current status: proposal discussion with the Ray and OSS community. Related Summit talks: Powering Open Data Hub with Ray (Erik Erlandson); Serverless Earth Science Data Labeling using Unsupervised Deep Learning with Ray (Linsong Chu)
  11. Q&A. Thank you! Contacts: Raghu Ganti (rganti@us.ibm.com), Michael Behrendt (michaelbehrendt@de.ibm.com), Linsong Chu (lchu@us.ibm.com), Carlos Costa (chcost@us.ibm.com), Erik Erlandson (eerlands@redhat.com), Mudhakar Srivatsa (msrivats@us.ibm.com)