Slide 1

Slide 1 text

IBM Research
Scaling and Unifying Scikit Learn and Spark Pipelines using Ray
Raghu Ganti
Principal Research Staff Member, IBM T.J. Watson Research Center
Team (IBM & Red Hat): Michael Behrendt, Linsong Chu, Carlos Costa, Erik Erlandson, Mudhakar Srivatsa
raghu-ganti-26096810

Slide 2

Slide 2 text

So many pipelines…
And many more…

Slide 3

Slide 3 text

Can we do pipelines on Ray?
§ Can we scale popular AI/ML pipelines on Ray?
§ Can we unify scikit learn and Spark pipelines?
Ray.IO

Slide 4

Slide 4 text

Current Pipeline API
§ Focus on scikit learn and Spark pipelines
§ Scikit learn is missing scaling; Spark focuses on data-parallel scaling
Transform: X → X’
Fit: X, y → Fitted model
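The Transform and Fit contracts on this slide can be sketched without any library. The two classes below are hypothetical toy stand-ins (not scikit-learn classes) that only mimic the fit/transform/predict convention, assuming a single numeric feature:

```python
from statistics import mean


class MeanCenterer:
    """Toy transformer: learns the mean in fit(), subtracts it in transform()."""

    def fit(self, X, y=None):
        self.mean_ = mean(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ThresholdClassifier:
    """Toy estimator: predicts 1 when a value exceeds the learned threshold."""

    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        # Threshold halfway between the two class means.
        self.threshold_ = (mean(pos) + mean(neg)) / 2
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


X = [1.0, 2.0, 5.0, 8.0]
y = [0, 0, 1, 1]

X_prime = MeanCenterer().fit(X).transform(X)          # Transform: X -> X'
fitted_model = ThresholdClassifier().fit(X_prime, y)  # Fit: (X, y) -> fitted model
predictions = fitted_model.predict(X_prime)
```

Both steps take one X (and optionally one y) and return one result, which is the single-input shape the following slides generalize.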

Slide 5

Slide 5 text

Scaling Pipelines: I/O as List of Object References
Transform: [X1, X2, …, XN] → [X1’, X2’, …, XN’]
Fit: [X1, X2, …, XN], [y1, y2, …, yN] → [FM1, FM2, …, FMN]
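A minimal sketch of the list-in, list-out idea, assuming standard-library futures as a stand-in for Ray object references (the talk's actual implementation is not shown in the slides). The `transform` and `fit` functions are hypothetical toy steps:

```python
from concurrent.futures import ThreadPoolExecutor


def transform(X):
    """Toy transform step: scale every value in one input object."""
    return [2 * x for x in X]


def fit(X, y):
    """Toy fit step: the 'fitted model' is the ratio of the sums of y and X."""
    return {"slope": sum(y) / sum(X)}


with ThreadPoolExecutor(max_workers=4) as pool:
    # Futures stand in for object references to X1..XN and y1..yN.
    X_refs = [pool.submit(list, part) for part in ([1, 2], [3, 4], [5, 6])]
    y_refs = [pool.submit(list, part) for part in ([2, 4], [6, 8], [10, 12])]

    # Transform: [X1, ..., XN] -> [X1', ..., XN'], one independent task per input.
    X_prime_refs = [pool.submit(transform, ref.result()) for ref in X_refs]

    # Fit: [X1..XN], [y1..yN] -> [FM1, ..., FMN], again one task per input.
    model_refs = [
        pool.submit(fit, x_ref.result(), y_ref.result())
        for x_ref, y_ref in zip(X_refs, y_refs)
    ]

    X_primes = [ref.result() for ref in X_prime_refs]
    models = [ref.result() for ref in model_refs]
```

Because each list element is handled by its own task, the N inputs can be processed in parallel while the step functions keep their single-input signatures.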

Slide 6

Slide 6 text

Scaling Pipelines: AND/OR graphs
[Diagram labels: OR node; AND node; inputs X1, X2, …, XN; outputs X1’, X2’, …, XN’; X; Step 1, Step 2, …, Step N; X’]
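One way to read the two node types, as a sketch under assumptions: an OR node applies its function to each incoming object independently, while an AND node waits for all incoming objects and applies its function to them together. The `OrNode` and `AndNode` classes here are hypothetical illustrations, not the API of the implementation presented in the talk:

```python
class OrNode:
    """Applies fn to every incoming object on its own (one output per input)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, inputs):
        return [self.fn(x) for x in inputs]


class AndNode:
    """Applies fn once to all incoming objects together (one combined output)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, inputs):
        return [self.fn(*inputs)]


square = OrNode("square", lambda x: x * x)
total = AndNode("total", lambda *xs: sum(xs))

squared = square.run([1, 2, 3])  # each input handled independently
combined = total.run(squared)    # all inputs consumed at once
```

Both node types take and return lists, so they compose with the list-of-references I/O convention from the previous slide.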

Slide 7

Slide 7 text

Function as a unit of compute | List of objects as I/O | Cross environment | AND/OR Graphs
§ Object references as I/O for unit of compute
§ Sharing of objects using Plasma store
§ Enables zero-copy object sharing
§ Python function as unit of compute
§ Intuitive for data scientist
§ Follows transformer APIs
§ MPI-style scaling
§ Scikit learn typically in Python
§ Ray.IO with RayDP enables efficient data exchange
§ Enriched DAGs from plain pipelines
§ OR nodes for fan-out expressions
§ AND nodes for arbitrary lambdas

Slide 8

Slide 8 text

Illustrative Example

Sample Pipeline: Preprocess, followed by Decision Tree, Random Forest, and Gradient Boost

    c_a = ScaleTestEstimator(50, DecisionTreeClassifier())
    c_b = ScaleTestEstimator(50, RandomForestClassifier())
    c_c = ScaleTestEstimator(50, GradientBoostingClassifier())
    classifiers = [c_a, c_b, c_c]
    classifier_results = []
    for classifier in classifiers:
        pipe = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])
        pipe.fit(X_train, y_train)
        pipe.predict(X_train)

Our pipeline implementation

    pipeline = dm.Pipeline()
    node_a = dm.OrNode('preprocess', preprocessor)
    node_b = dm.OrNode('c_a', c_a)
    node_c = dm.OrNode('c_b', c_b)
    node_d = dm.OrNode('c_c', c_c)
    pipeline.add_edge(node_a, node_b)
    pipeline.add_edge(node_a, node_c)
    pipeline.add_edge(node_a, node_d)
    in_args = {node_a: [Xy_ref_ptr]}
    out_args = rt.execute_pipeline(pipeline, ExecutionType.TRAIN, in_args)

2x faster
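The slide does not say where the speedup comes from, so the following is only a sketch of the structural difference between the two versions: in the scikit-learn loop each `pipe.fit` call fits the preprocessor again, whereas the graph has a single preprocess node with three outgoing edges. All names below (`preprocess`, `make_classifier`, `edges`, `steps`) are hypothetical toy stand-ins, and a thread pool stands in for the runtime:

```python
from concurrent.futures import ThreadPoolExecutor

calls = {"preprocess": 0}


def preprocess(X):
    """Toy preprocessing step: scale values by the maximum; counts its calls."""
    calls["preprocess"] += 1
    top = max(X)
    return [x / top for x in X]


def make_classifier(threshold):
    """Toy 'classifier': labels a value 1 when it exceeds the threshold."""
    def classify(X):
        return [1 if x > threshold else 0 for x in X]
    return classify


# Hypothetical graph mirroring the slide: one preprocess node, three classifiers.
edges = {"preprocess": ["c_a", "c_b", "c_c"]}
steps = {
    "c_a": make_classifier(0.25),
    "c_b": make_classifier(0.5),
    "c_c": make_classifier(0.75),
}

X = [1, 2, 3, 4]
shared = preprocess(X)  # the shared upstream node runs once

with ThreadPoolExecutor() as pool:
    # Fan out: every downstream node receives the same preprocessed output.
    futures = {name: pool.submit(steps[name], shared) for name in edges["preprocess"]}
    results = {name: future.result() for name, future in futures.items()}
```

In this sketch the three downstream steps are independent of one another, so they can be scheduled concurrently once the shared output is available.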

Slide 9

Slide 9 text

Pipelines Galore…

                      (1)        (2)        (3)              Spark Pipeline        Our pipeline implementation
Task parallelism      ✓          ✓          ✗                ✓                     ✓
Data parallelism      ✗          ✗          ✗                ✓                     ✓
AND/OR Graphs         ✓          ✓          ✗                ✗                     ✓
Computational unit    Container  Container  Python function  Python/Java function  Python/Java function
Mutability of DAG     ✗          ✗          ✓                ✓                     ✓

Slide 10

Slide 10 text

What to expect?
§ Execution strategies based on graph traversals
§ Early stopping criteria
§ Mutability of execution pipelines
§ Related Summit Talks
§ Current status: Proposal discussion with Ray and OSS community

Related talks
Powering Open Data Hub with Ray | Erik Erlandson
Serverless Earth Science Data Labeling using Unsupervised Deep Learning with Ray | Linsong Chu

Slide 11

Slide 11 text

Q&A
Thank you!
Contacts: Raghu Ganti, Michael Behrendt, Linsong Chu, Carlos Costa, Erik Erlandson, Mudhakar Srivatsa