RayDP: Build Large-scale End-to-end Data Analytics and AI Pipelines Using Spark and Ray (Carson Wang, Intel Data Analytics Software Group)

A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A primitive setup uses two separate clusters and glues multiple jobs together. Other solutions include running deep learning frameworks inside a Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All of these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start a Spark job on Ray from your Python program and use Ray's in-memory object store to efficiently exchange data between Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.

Anyscale

July 16, 2021
Transcript

  1. Agenda
     • Big Data & AI Background
     • RayDP Overview
     • RayDP Deep Dive
     • RayDP Examples
  2. Big Data & AI
     • Massive data is critical for better AI.
     • Distributed training will become the norm.
     • Many community efforts integrate big data with AI: HorovodOnSpark, TensorflowOnSpark, BigDL, Analytics-Zoo, CaffeOnSpark, Petastorm, XGBoostOnSpark, spark-tensorflow-distributor, spark-tensorflow-connector.
  3. Separate Spark and AI Clusters
     Challenges:
     • Data movement between clusters.
     • Overhead of managing two clusters.
     • Segmented applications and glue code.
     [Diagram: a Spark cluster for data preprocessing and a separate ML/DL cluster for model training, exchanging data through storage.]
  4. Running ML/DL Frameworks on Spark
     Challenges:
     • Specific to Spark, and requires each ML/DL framework to be supported on Spark.
     • Data exchange between frameworks relies on distributed filesystems like HDFS or S3.
     [Diagram: a single Spark cluster handling both data preprocessing and model training.]
  5. Running on Kubernetes
     Challenges:
     • The pipeline must be written as multiple programs and configuration files (vs. a single Python program).
     • Data exchange between frameworks relies on distributed filesystems like HDFS or S3.
     [Diagram: a Kubernetes cluster running separate data preprocessing and model training jobs.]
  6. What is RayDP?
     RayDP provides simple APIs for running Spark on Ray and integrating Spark with distributed ML/DL frameworks.
     [Diagram: RayDP components — the PyTorch/Tensorflow Estimator and the Ray MLDataset Converter layered on Spark on Ray, alongside other Ray libraries.]
  7. Build an End-to-End Pipeline Using RayDP and Ray
     [Diagram: data preprocessing, model training/tuning, and model serving stages connected through Ray's object store within one end-to-end, integrated Python program.]
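     A minimal sketch of such an integrated program (the stage contents are placeholders; only the raydp calls come from the slides in this deck):

      import ray
      import raydp

      ray.init(address='auto')

      # Stage 1: data preprocessing with Spark on Ray.
      spark = raydp.init_spark(app_name='pipeline', num_executors=2,
                               executor_cores=2, executor_memory='1G')
      df = spark.range(100)   # stand-in for real Spark preprocessing

      # Stage 2: model training/tuning consumes the Spark data through
      # Ray's object store (via the estimator and MLDataset APIs shown
      # on the following slides).

      # Stage 3: model serving (e.g., with Ray Serve) can follow in the
      # same program; its code is not shown in this deck.

      raydp.stop_spark()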
  8. Scale From Laptop to Cloud/Kubernetes Seamlessly
     • Write your Python program with the Ray, RayDP, PySpark, Tensorflow, PyTorch, etc. APIs.
     • Develop locally, then scale to the cloud or Kubernetes with the Ray Cluster Launcher.
  9. Why RayDP?
     Increased Productivity
     ▪ Simplifies building and managing end-to-end pipelines: write Spark, XGBoost, Tensorflow, PyTorch, and Horovod code in a single Python program.
     Better Performance
     ▪ In-memory data exchange.
     ▪ Built-in Spark optimizations.
     Increased Resource Utilization
     ▪ Auto scaling at both the cluster level and the application level.
  10. Spark on Ray API
      import ray
      import raydp

      ray.init(address='auto')
      spark = raydp.init_spark(app_name='Spark on Ray',
                               num_executors=2,
                               executor_cores=2,
                               executor_memory='1G',
                               configs=None)
      df = spark.read.parquet(...)
      raydp.stop_spark()

      • Use raydp.init_spark to start a Spark job on a Ray cluster.
      • Use raydp.stop_spark to stop the job and release the resources.
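      The returned object is a regular PySpark SparkSession, and configs accepts ordinary Spark properties. A minimal runnable sketch (the property value and the toy dataframe are illustrative, not from the slides):

      import ray
      import raydp

      ray.init(address='auto')

      # Pass ordinary Spark configuration properties through `configs`;
      # the value below is illustrative.
      spark = raydp.init_spark(app_name='Spark on Ray',
                               num_executors=2,
                               executor_cores=2,
                               executor_memory='1G',
                               configs={'spark.sql.shuffle.partitions': '8'})

      # `spark` is a standard SparkSession, so the usual PySpark APIs apply.
      df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
      print(df.count())

      raydp.stop_spark()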
  11. Spark on Ray Architecture
      • One Ray actor serves as the Spark AppMaster and starts/stops the Spark executors.
      • All Spark executors run as Ray Java actors.
      • The object store is leveraged for data exchange between Spark and other Ray libraries.
      [Diagram: the Spark driver and the AppMaster Java actor on the driver node; Spark executor Java actors on worker nodes, each with a Raylet and object store; GCS, web UI, debugging tools, and profiling tools alongside.]
  12. PyTorch/Tensorflow Estimator
      estimator = TorchEstimator(num_workers=2,
                                 model=your_model,
                                 optimizer=optimizer,
                                 loss=criterion,
                                 feature_columns=features,
                                 label_column="fare_amount",
                                 batch_size=64,
                                 num_epochs=30)
      estimator.fit_on_spark(train_df, test_df)

      • Create an Estimator with parameters like the model, optimizer, loss function, etc.
      • Fit the estimator with Spark dataframes directly.
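      For context, a fuller sketch of the same estimator flow. The TorchEstimator import path, the model, the file path, and the column names are assumptions for illustration, not part of the slide:

      import ray
      import raydp
      import torch
      import torch.nn as nn
      from raydp.torch import TorchEstimator   # import path assumed

      ray.init(address='auto')
      spark = raydp.init_spark(app_name='estimator-demo', num_executors=2,
                               executor_cores=2, executor_memory='1G')

      # Hypothetical taxi-fare data with two feature columns and a label.
      df = spark.read.parquet('/path/to/taxi.parquet')   # hypothetical path
      train_df, test_df = df.randomSplit([0.9, 0.1])

      model = nn.Linear(2, 1)   # placeholder model sized to the two features
      estimator = TorchEstimator(num_workers=2,
                                 model=model,
                                 optimizer=torch.optim.Adam(model.parameters()),
                                 loss=nn.MSELoss(),
                                 feature_columns=['trip_distance', 'passenger_count'],
                                 label_column='fare_amount',
                                 batch_size=64,
                                 num_epochs=30)
      estimator.fit_on_spark(train_df, test_df)

      raydp.stop_spark()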
  13. Ray MLDataset Converter
      from raydp.spark import RayMLDataset

      spark_df = ...
      torch_ds = RayMLDataset.from_spark(spark_df, ...) \
                             .transform(func) \
                             .to_torch(...)
      torch_dataloader = DataLoader(torch_ds.get_shard(shard_index), ...)

      • Create from a Spark dataframe, in-memory objects, etc.
      • Transform using user-defined functions.
      • Convert to a PyTorch/Tensorflow Dataset.
      Transformations are lazy and executed in a pipeline.
      [Diagram: Phase 1 builds the operation graph (Spark Dataframe -> from_spark -> transform -> to_torch -> ML Dataset); Phase 2 plans execution, with Spark actors writing objects to the object store, MLDataset shards consuming them via the Ray scheduler, and a PyTorch actor reading its shard as a PyTorch dataset.]
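      A fuller sketch of the conversion path; the keyword names (num_shards, feature_columns, label_column) and the identity transform are assumptions for illustration:

      import ray
      import raydp
      from raydp.spark import RayMLDataset
      from torch.utils.data import DataLoader

      ray.init(address='auto')
      spark = raydp.init_spark(app_name='mldataset-demo', num_executors=2,
                               executor_cores=2, executor_memory='1G')

      spark_df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['x', 'y'])

      # Phase 1: lazily build the operation graph
      # (from_spark -> transform -> to_torch).
      ds = RayMLDataset.from_spark(spark_df, num_shards=2)   # shard count assumed
      ds = ds.transform(lambda batch: batch)   # identity user-defined transform
      torch_ds = ds.to_torch(feature_columns=['x'], label_column='y')

      # Phase 2: each training worker reads only its own shard.
      shard_index = 0
      torch_dataloader = DataLoader(torch_ds.get_shard(shard_index),
                                    batch_size=2)

      raydp.stop_spark()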
  14. Spark + XGBoost on Ray
      import ray
      import raydp

      ray.init(address='auto')
      spark = raydp.init_spark('Spark + XGBoost',
                               num_executors=2,
                               executor_cores=4,
                               executor_memory='8G')

      # Data preprocessing with Spark on Ray
      df = spark.read.csv(...)
      ...
      train_df, test_df = random_split(df, [0.9, 0.1])
      train_dataset = RayMLDataset.from_spark(train_df, ...)
      test_dataset = RayMLDataset.from_spark(test_df, ...)

      # Model training with XGBoost on Ray
      from xgboost_ray import RayDMatrix, train, RayParams
      dtrain = RayDMatrix(train_dataset, label='fare_amount')
      dtest = RayDMatrix(test_dataset, label='fare_amount')
      ...
      bst = train(config,
                  dtrain,
                  evals=[(dtest, "eval")],
                  evals_result=evals_result,
                  ray_params=RayParams(...),
                  num_boost_round=10)

      [Diagram: data preprocessing and model training stages of one end-to-end, integrated Python program, connected by RayDP.]
  15. Spark + Horovod on Ray
      import ray
      import raydp

      ray.init(address='auto')
      spark = raydp.init_spark('Spark + Horovod',
                               num_executors=2,
                               executor_cores=4,
                               executor_memory='8G')

      # Data preprocessing with Spark on Ray
      df = spark.read.csv(...)
      ...
      torch_ds = RayMLDataset.from_spark(df, ...).to_torch(...)

      # PyTorch model
      class My_Model(nn.Module):
          ...

      # Horovod on Ray
      def train_fn(dataset, num_features):
          hvd.init()
          rank = hvd.rank()
          train_data = dataset.get_shard(rank)
          ...

      from horovod.ray import RayExecutor
      executor = RayExecutor(settings, num_hosts=1, num_slots=1,
                             cpus_per_slot=1)
      executor.start()
      executor.run(train_fn, args=[torch_ds, num_features])

      [Diagram: data preprocessing and model training stages of one end-to-end, integrated Python program, connected by RayDP.]
  16. Spark + Horovod + Ray Tune on Ray
      import ray
      import raydp

      ray.init(address='auto')
      spark = raydp.init_spark('Spark + Horovod',
                               num_executors=2,
                               executor_cores=4,
                               executor_memory='8G')

      # Data preprocessing with Spark on Ray
      df = spark.read.csv(...)
      ...
      torch_ds = RayMLDataset.from_spark(df, ...).to_torch(...)

      # PyTorch model
      class My_Model(nn.Module):
          ...

      # Horovod on Ray + Ray Tune
      def train_fn(config: Dict):
          ...

      trainable = DistributedTrainableCreator(train_fn,
                                              num_slots=2,
                                              use_gpu=use_gpu)
      analysis = tune.run(trainable,
                          num_samples=2,
                          config={
                              "epochs": tune.grid_search([1, 2, 3]),
                              "lr": tune.grid_search([0.1, 0.2, 0.3]),
                          })
      print(analysis.best_config)

      [Diagram: data preprocessing and model training/tuning stages of one end-to-end, integrated Python program, connected by RayDP.]
  17. Summary
      • Ray is a general-purpose framework that can be used as a single substrate for end-to-end data analytics and AI pipelines.
      • RayDP provides simple APIs for running Spark on Ray and integrating Spark with distributed ML/DL frameworks.
      • For more information, please visit https://github.com/oap-project/raydp