Mars-on-Ray: Accelerating Large-scale Tensor and DataFrame Workloads (Chris Qin, Alibaba Cloud)

Mars is a tensor-based unified framework for large-scale data computation that scales numpy, pandas, scikit-learn, and plain Python functions. Ray provides a general-purpose task and actor API that is highly expressive and delivers high performance for distributed Python applications. Mars-on-Ray combines the strengths of both: Mars contributes familiar data science APIs and highly optimized, locality-aware, fine-grained task scheduling, while Ray lets the same Mars workloads, scheduled with the same strategy, scale out across Ray clusters.

Anyscale

July 21, 2021

Transcript

  1. About Me
     • Senior software engineer at Alibaba Cloud.
     • Focusing on Mars as a core developer and architect.
     • Developed PyODPS, a framework that provides pandas-like syntax and compiles DataFrame operations into SQL to run on the big data platform MaxCompute and on databases including PostgreSQL and MySQL.

  2. Tensor Workloads
     Knowledge Graph
     • Sparse tensors
     • Large-scale tensor operations, e.g. matrix multiplication (sketched below)
     Vector Search
     • Widely used in picture search and recommendation systems.
     • Entities are described as vectors.
     • Computation is extremely heavy.
     Data Preprocessing and Post-processing for Machine Learning
     • Preprocessing includes normalization, splitting train and test data, etc.
     • Post-processing includes metrics, visualization, etc.
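     As a rough illustration of the kind of tensor workload Mars targets, a large matrix multiplication can be written with the numpy-style API and split into chunks that Mars schedules across workers. A minimal sketch; the shape and chunk_size values here are illustrative, not from the talk:

        import mars.tensor as mt

        # A 20000 x 20000 tensor split into 5000 x 5000 chunks,
        # so Mars can distribute the work across workers.
        a = mt.random.rand(20000, 20000, chunk_size=5000)

        # Build the matrix-multiplication graph lazily, then run it.
        (a.dot(a.T)).execute()
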
  3. DataFrame Workloads
     Data Cleaning (see the sketch after this list)
     • Remove corrupted data
     • Fill missing values
     • …
     Data Analysis
     • Aggregations
     • Group by
     • …
     Visualization
     • Matplotlib
     • Seaborn
     • Plotly
     • …
     Data Exploration
     • Automatic data exploration
     • Speed is crucial when data is large
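     A minimal sketch of a cleaning-plus-analysis pipeline in Mars DataFrame; the file name and column names are illustrative:

        import mars.dataframe as md

        df = md.read_csv('ratings.csv')      # illustrative file
        df = df.dropna()                     # drop rows with missing values
        stat = df.groupby('id').agg('mean')  # distributed group-by aggregation
        stat.execute()
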
  4. Mars: a tensor-based unified framework for large-scale data computation
     Data Science Tools
     • Used by millions of people
     • Fit for not-so-large data
     • Easy to install via pip/conda
     • Acceptable performance thanks to underlying C/Cython
     Mars Tensor / Mars DataFrame / Mars Learn
     • Born for large-scale data
     • Familiar APIs that enable a smooth transition
     • Thousands of workers, up to 12,000 cores per task
     • Relatively easy to set up
  5. Platforms
     [Architecture diagram: machine learning and integrated libraries sit on top of Mars Tensor (distributed numpy), Mars DataFrame (distributed pandas), Mars Learn (distributed scikit-learn), and Mars Remote (distributed Python functions), each with its supported data formats; below them are the services (Meta, Storage, Scheduling, Task, Subtask, Lifecycle, Session), which run on Oscar, a lightweight actor framework (sketched below), and the fundamental components.]
     Mars Actors
     • Leverage Ray’s facilities as much as possible
     • Based on the Ray actor API
     • Use Ray’s serialization, RPC, and object store, which enables object spilling
     • Communication via unix or TCP sockets
     • Mars’ own serialization, based on the pickle5 protocol
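     For a feel of the Oscar layer, here is a minimal actor sketch. The Counter actor is hypothetical, and the bootstrap calls (mo.create_actor_pool, mo.create_actor, pool.external_address) follow the Mars repository of this era; treat their exact signatures as assumptions:

        import asyncio
        import mars.oscar as mo


        class Counter(mo.Actor):
            # A stateful actor: Oscar serializes calls to each actor,
            # so the counter needs no explicit locking.
            def __init__(self):
                self._value = 0

            def inc(self):
                self._value += 1
                return self._value


        async def main():
            # Assumed bootstrap: start a local actor pool, spawn the
            # actor inside it, then call it through its ref.
            pool = await mo.create_actor_pool('127.0.0.1', n_process=2)
            async with pool:
                ref = await mo.create_actor(
                    Counter, address=pool.external_address, uid='counter')
                print(await ref.inc())  # -> 1


        asyncio.run(main())
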
  6. Mars Tensor — import replacement, lazy evaluation

     numpy version:

        import numpy as np
        from scipy.special import erf


        def black_scholes(P, S, T, rate, vol):
            a = np.log(P / S)
            b = T * -rate
            z = T * (vol * vol * 2)
            c = 0.25 * z
            y = 1.0 / np.sqrt(z)

            w1 = (a - b + c) * y
            w2 = (a - b - c) * y

            d1 = 0.5 + 0.5 * erf(w1)
            d2 = 0.5 + 0.5 * erf(w2)

            Se = np.exp(b) * S

            call = P * d1 - Se * d2
            put = call - P + Se

            return call, put


        N = 50000000
        price = np.random.uniform(10.0, 50.0, N)
        strike = np.random.uniform(10.0, 50.0, N)
        t = np.random.uniform(1.0, 2.0, N)
        print(black_scholes(price, strike, t, 0.1, 0.2))

     Mars version — only the imports change; the function body is identical with mt in place of np:

        import mars
        import mars.tensor as mt
        from mars.tensor.special import erf

        # ... same black_scholes definition, using mt.log, mt.sqrt, mt.exp ...

        N = 50000000
        price = mt.random.uniform(10.0, 50.0, N)
        strike = mt.random.uniform(10.0, 50.0, N)
        t = mt.random.uniform(1.0, 2.0, N)
        # Lazy evaluation: the calls above only build a graph;
        # mars.execute() triggers the actual computation.
        print(mars.execute(black_scholes(price, strike, t, 0.1, 0.2)))
  7. Mars DataFrame — import replacement

     pandas version:

        import pandas as pd

        df = pd.read_csv('file.csv')
        stat_df = df.groupby('id').agg('mean')
        sort_df = stat_df.sort_values('rating')
        sort_df.plot(backend='pandas_bokeh')

     Mars version:

        import mars.dataframe as md

        df = md.read_csv('file.csv')
        stat_df = df.groupby('id').agg('mean')
        sort_df = stat_df.sort_values('rating')
        # sort_df.execute() is called automatically inside the plot function.
        sort_df.plot(backend='pandas_bokeh')
  8. Mars Learn — import replacement

     scikit-learn version:

        from sklearn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='brute', metric='cosine')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

     Mars version:

        from mars.learn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='brute', metric='cosine')
        # Execution is triggered automatically inside learn APIs,
        # including fit, predict, kneighbors, etc.
        nn.fit(X)
        dist, ind = nn.kneighbors(y)
  9. Superset vs. subset
     • Mars tensor/DataFrame/learn implement a subset of the numpy/pandas/scikit-learn APIs.
     • In the long term, a superset is our goal.
     • Some ops are born for distributed execution, e.g. map_chunk, cartesian_chunk, map_reduce (a map_chunk sketch follows this slide):

        def complicated_process(df1_chunk, df2_chunk):
            # some complicated cartesian logic between two chunks
            ...

        df1.cartesian_chunk(df2, complicated_process)

     • Some ops may run differently according to the number of workers, data size, data pattern, etc., so there may be several algorithms under the hood for the same op. For example, df.groupby('field').agg(['sum']) uses a tree-based aggregation when the group keys are few and a shuffle-based aggregation when they are many.
     • Different algorithms for machine learning, e.g. a proxima backend for nearest neighbors:

        from mars.learn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='proxima')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

       • Highly optimized for k-nearest neighbors.
       • 400,000,000 vectors × 400,000,000 vectors in ~3.5 h, including IO time.
       • 4x-5x faster than the sklearn-based implementation, even for the brute-force algorithm.
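     As a companion to the cartesian_chunk snippet above, a minimal map_chunk sketch; the 'rating' column and the doubling logic are illustrative:

        import mars.dataframe as md

        df = md.read_csv('file.csv')
        # map_chunk runs a plain pandas function on every chunk in
        # parallel; each chunk arrives as an ordinary pandas DataFrame.
        doubled = df.map_chunk(lambda chunk: chunk.assign(rating2=chunk['rating'] * 2))
        doubled.execute()
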
  10. Lazy evaluation vs. eager vs. ??

      Lazy (the default):

        In [1]: import mars.tensor as mt
        In [2]: a = mt.arange(10)
        In [3]: a
        Out[3]: Tensor <op=TensorArange, shape=(10,), key=db07958a8fbe7c4f61eee10cd5dcfa05>
        In [4]: s = a.sum()
        In [5]: s
        Out[5]: Tensor <op=TensorSum, shape=(), key=a396e1590e3d05e9b2aa965feec54bae>
        In [6]: s.execute()
        Out[6]: 45

      • Better performance
      • Fuses intermediate nodes, saving graph size
      • More optimizations can be applied
      • Saves more memory

      Eager:

        In [1]: from mars.config import options
        In [2]: options.eager_mode = True
        In [3]: import mars.tensor as mt
        In [4]: a = mt.arange(10)
        In [5]: a
        Out[5]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        In [6]: s = a.sum()
        In [7]: s
        Out[7]: 45

      • Worse performance
      • Each Mars object generates data that lives in the cluster until it is garbage collected
      • Hard to optimize, especially intermediate objects that users do not care about

      Any better solution? Defer mode: combine them together. Stay tuned!
  11. How Mars tensor/DataFrame scales

        In [1]: import mars.tensor as mt
        In [2]: import mars.dataframe as md
        In [3]: a = mt.ones((10, 10), chunk_size=5)
        In [4]: a[5, 5] = 8
        In [5]: df = md.DataFrame(a)
        In [6]: s = df.sum()
        In [7]: s.execute()
        Out[7]:
        0    10.0
        1    10.0
        2    10.0
        3    10.0
        4    10.0
        5    17.0
        6    10.0
        7    10.0
        8    10.0
        9    10.0
        dtype: float64

      [Diagram: on the client, each user-facing object is a Tileable — Tensor(a), DataFrame(df), Series(s) — wrapping immutable TileableData (TensorData, DataFrameData, SeriesData); each operation creates an Operand (Ones, IndexSetValue, FromTensor, Sum) linking the data nodes into a graph.]
  12. Execution pipeline

      [Diagram: the client submits a tileable graph to the supervisor, which optimizes it (column pruning, etc.), tiles it into a chunk graph, optimizes the chunk graph (fusion, etc.), fuses operand chains — e.g. Ones → IndexSetValue → FromTensor → Sum, plus Concat/Sum combiners — into subtasks, and schedules the subtasks onto workers depth-first and locality-aware; data lives in the Ray object store.]

      Typical operand implementation — tile() tells the supervisor how to tile a tileable into chunks, and execute() tells a worker how to compute one chunk:

        class MyOperand(Operand):
            @classmethod
            def tile(cls, op: Operand):
                # Create one output chunk per input chunk.
                chunks = []
                for chunk in op.inputs[0].chunks:
                    chunk_op = ...
                    chunks.append(chunk_op.new_chunk([chunk], **kwargs))
                return op.copy().new_dataframes(op.inputs, chunks=chunks)

            @classmethod
            def execute(cls, ctx: Dict, op: Operand):
                # Look up input data in the execution context by key,
                # compute, then store the result under the output key.
                inputs = [ctx[inp.key] for inp in op.inputs]
                # calculation
                ctx[op.outputs[0].key] = ...
  13. Tour of a Mars-on-Ray job

        import mars
        import mars.tensor as mt
        from mars.deploy.ray import new_cluster

        cluster = new_cluster('my_cluster', worker_num=10)
        mars.new_session(cluster.address, default=True)

        a = mt.random.rand(10_000)
        a.sum().execute()

      Steps
      • Create a placement group according to the number of workers and CPUs.
      • Create Mars supervisors and workers. Each consists of several Mars actor pools, and each Mars actor pool is wrapped in one Ray actor.
      • Create Mars services, e.g. meta and task. Each service creates a few Mars actors, and each actor is allocated to one Mars actor pool.
      • Create a Mars session that connects to the cluster, and mark it as the default session.
      • Mars tasks are submitted to the supervisor; when computation finishes, data is fetched to the client for display.
  14. Ray simplifies distributed applications

        cluster = new_cluster('my_cluster', worker_num=2, worker_cpu=2)

      internally creates a placement group:

        ray.util.placement_group(name=pg_name,
                                 bundles=[{'CPU': 1}, {'CPU': 2}, {'CPU': 2}],
                                 strategy="SPREAD")

      [Diagram: one supervisor and two workers, each composed of Mars actor pools wrapped in Ray actors and spread across Ray worker nodes; the task, subtask, and storage services create their actors inside those pools, and all nodes share the Ray object store.]

      • Data is stored in the object store, which enables object spilling.
      • Data transfer is handed over to Ray itself: an actor call becomes a Ray RPC call.
  15. More work in the near future
      • Failover leveraging Ray’s capabilities.
      • Auto scaling.
      • Generating subtasks that fuse more ops and can use Ray’s locality-aware scheduling.
      • Letting Mars tasks interact with normal Ray tasks and actors.
  16. Conclusion
      • Mars on Ray is still under rapid development.
      • With the power Ray brings, Mars on Ray can be 1 + 1 > 2.
      • Mars on Ray: https://docs.ray.io/en/master/mars-on-ray.html
      • Track Mars on Ray: https://github.com/mars-project/mars/projects/18
      • Mars
        • Try it out: pip install pymars
        • GitHub: https://github.com/mars-project/mars
        • Documentation: https://docs.pymars.org/en/latest/