Mars-on-Ray: Accelerating Large-scale Tensor and DataFrame Workloads (Chris Qin, Alibaba Cloud)

Mars is a tensor-based unified framework for large-scale data computation that scales numpy, pandas, scikit-learn, and plain Python functions. Ray provides a general-purpose task and actor API that is highly expressive and delivers high performance for distributed Python applications. Mars-on-Ray combines the strengths of both: Mars contributes familiar data science APIs and highly optimized, locality-aware, fine-grained task scheduling, while Ray lets the same Mars workloads, scheduled with the same strategy, scale out across Ray clusters.

Anyscale

July 21, 2021

Transcript

  1. About Me
     • Senior software engineer at Alibaba Cloud.
     • Focusing on Mars as a core developer and architect.
     • Developed PyODPS, a framework that provides pandas-like syntax and compiles DataFrame operations into SQL to run on the big data platform MaxCompute and on databases including PostgreSQL and MySQL.

  2. Tensor Workloads
     Knowledge Graph
     • Sparse tensors
     • Large-scale tensor operations, e.g. matrix multiplication (sketched below)
     Vector Search
     • Widely used in picture search and recommendation systems.
     • Entities are described as vectors.
     • Computation is extremely heavy.
     Data Preprocessing and Post-processing for Machine Learning
     • Preprocessing includes normalization, splitting train and test data, etc.
     • Post-processing includes metrics, visualization, etc.
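     As a rough illustration of the kind of tensor workload Mars targets, a large matrix multiplication can be written with the numpy-style API and split into chunks that Mars schedules across workers. A minimal sketch; the shape and chunk_size values here are illustrative, not from the talk:

        import mars.tensor as mt

        # A 20000 x 20000 tensor split into 5000 x 5000 chunks,
        # so Mars can distribute the work across workers.
        a = mt.random.rand(20000, 20000, chunk_size=5000)

        # Build the matrix-multiplication graph lazily, then run it.
        (a.dot(a.T)).execute()
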
  3. DataFrame Workloads
     Data Cleaning (see the sketch after this list)
     • Remove corrupted data
     • Fill missing values
     • …
     Data Analysis
     • Aggregations
     • Group by
     • …
     Visualization
     • Matplotlib
     • Seaborn
     • Plotly
     • …
     Data Exploration
     • Automatic data exploration
     • Speed is crucial when data is large
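     A minimal sketch of a cleaning-plus-analysis pipeline in Mars DataFrame; the file name and column names are illustrative:

        import mars.dataframe as md

        df = md.read_csv('ratings.csv')      # illustrative file
        df = df.dropna()                     # drop rows with missing values
        stat = df.groupby('id').agg('mean')  # distributed group-by aggregation
        stat.execute()
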
  4. Mars: a tensor-based unified framework for large-scale data computation
     Data Science Tools
     • Used by millions of people
     • Fit for not-so-large data
     • Easy to install via pip/conda
     • Acceptable performance thanks to underlying C/Cython
     Mars Tensor / Mars DataFrame / Mars Learn
     • Born for large-scale data
     • Familiar APIs that enable a smooth transition
     • Thousands of workers, up to 12,000 cores per task
     • Relatively easy to set up
  5. Platforms
     [Architecture diagram: machine learning and integrated libraries sit on top of Mars Tensor (distributed numpy), Mars DataFrame (distributed pandas), Mars Learn (distributed scikit-learn), and Mars Remote (distributed Python functions), each with its supported data formats; below them are the services (Meta, Storage, Scheduling, Task, Subtask, Lifecycle, Session), which run on Oscar, a lightweight actor framework (sketched below), and the fundamental components.]
     Mars Actors
     • Leverage Ray’s facilities as much as possible
     • Based on the Ray actor API
     • Use Ray’s serialization, RPC, and object store, which enables object spilling
     • Communication via unix or TCP sockets
     • Mars’ own serialization, based on the pickle5 protocol
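     For a feel of the Oscar layer, here is a minimal actor sketch. The Counter actor is hypothetical, and the bootstrap calls (mo.create_actor_pool, mo.create_actor, pool.external_address) follow the Mars repository of this era; treat their exact signatures as assumptions:

        import asyncio
        import mars.oscar as mo


        class Counter(mo.Actor):
            # A stateful actor: Oscar serializes calls to each actor,
            # so the counter needs no explicit locking.
            def __init__(self):
                self._value = 0

            def inc(self):
                self._value += 1
                return self._value


        async def main():
            # Assumed bootstrap: start a local actor pool, spawn the
            # actor inside it, then call it through its ref.
            pool = await mo.create_actor_pool('127.0.0.1', n_process=2)
            async with pool:
                ref = await mo.create_actor(
                    Counter, address=pool.external_address, uid='counter')
                print(await ref.inc())  # -> 1


        asyncio.run(main())
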
  6. Mars Tensor — import replacement, lazy evaluation

     numpy version:

        import numpy as np
        from scipy.special import erf


        def black_scholes(P, S, T, rate, vol):
            a = np.log(P / S)
            b = T * -rate
            z = T * (vol * vol * 2)
            c = 0.25 * z
            y = 1.0 / np.sqrt(z)

            w1 = (a - b + c) * y
            w2 = (a - b - c) * y

            d1 = 0.5 + 0.5 * erf(w1)
            d2 = 0.5 + 0.5 * erf(w2)

            Se = np.exp(b) * S

            call = P * d1 - Se * d2
            put = call - P + Se

            return call, put


        N = 50000000
        price = np.random.uniform(10.0, 50.0, N)
        strike = np.random.uniform(10.0, 50.0, N)
        t = np.random.uniform(1.0, 2.0, N)
        print(black_scholes(price, strike, t, 0.1, 0.2))

     Mars version — only the imports change; the function body is identical with mt in place of np:

        import mars
        import mars.tensor as mt
        from mars.tensor.special import erf

        # ... same black_scholes definition, using mt.log, mt.sqrt, mt.exp ...

        N = 50000000
        price = mt.random.uniform(10.0, 50.0, N)
        strike = mt.random.uniform(10.0, 50.0, N)
        t = mt.random.uniform(1.0, 2.0, N)
        # Lazy evaluation: the calls above only build a graph;
        # mars.execute() triggers the actual computation.
        print(mars.execute(black_scholes(price, strike, t, 0.1, 0.2)))
  7. Mars DataFrame — import replacement

     pandas version:

        import pandas as pd

        df = pd.read_csv('file.csv')
        stat_df = df.groupby('id').agg('mean')
        sort_df = stat_df.sort_values('rating')
        sort_df.plot(backend='pandas_bokeh')

     Mars version:

        import mars.dataframe as md

        df = md.read_csv('file.csv')
        stat_df = df.groupby('id').agg('mean')
        sort_df = stat_df.sort_values('rating')
        # sort_df.execute() is called automatically inside the plot function.
        sort_df.plot(backend='pandas_bokeh')
  8. Mars Learn — import replacement

     scikit-learn version:

        from sklearn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='brute', metric='cosine')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

     Mars version:

        from mars.learn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='brute', metric='cosine')
        # Execution is triggered automatically inside learn APIs,
        # including fit, predict, kneighbors, etc.
        nn.fit(X)
        dist, ind = nn.kneighbors(y)
  9. Superset vs. subset
     • Mars tensor/DataFrame/learn implement a subset of the numpy/pandas/scikit-learn APIs.
     • In the long term, a superset is our goal.
     • Some ops are born for distributed execution, e.g. map_chunk, cartesian_chunk, map_reduce (a map_chunk sketch follows this slide):

        def complicated_process(df1_chunk, df2_chunk):
            # some complicated cartesian logic between two chunks
            ...

        df1.cartesian_chunk(df2, complicated_process)

     • Some ops may run differently according to the number of workers, data size, data pattern, etc., so there may be several algorithms under the hood for the same op. For example, df.groupby('field').agg(['sum']) uses a tree-based aggregation when the group keys are few and a shuffle-based aggregation when they are many.
     • Different algorithms for machine learning, e.g. a proxima backend for nearest neighbors:

        from mars.learn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='proxima')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

       • Highly optimized for k-nearest neighbors.
       • 400,000,000 vectors × 400,000,000 vectors in ~3.5 h, including IO time.
       • 4x-5x faster than the sklearn-based implementation, even for the brute-force algorithm.
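     As a companion to the cartesian_chunk snippet above, a minimal map_chunk sketch; the 'rating' column and the doubling logic are illustrative:

        import mars.dataframe as md

        df = md.read_csv('file.csv')
        # map_chunk runs a plain pandas function on every chunk in
        # parallel; each chunk arrives as an ordinary pandas DataFrame.
        doubled = df.map_chunk(lambda chunk: chunk.assign(rating2=chunk['rating'] * 2))
        doubled.execute()
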
  10. Lazy evaluation vs. eager vs. ??

      Lazy (the default):

        In [1]: import mars.tensor as mt
        In [2]: a = mt.arange(10)
        In [3]: a
        Out[3]: Tensor <op=TensorArange, shape=(10,), key=db07958a8fbe7c4f61eee10cd5dcfa05>
        In [4]: s = a.sum()
        In [5]: s
        Out[5]: Tensor <op=TensorSum, shape=(), key=a396e1590e3d05e9b2aa965feec54bae>
        In [6]: s.execute()
        Out[6]: 45

      • Better performance
      • Fuses intermediate nodes, saving graph size
      • More optimizations can be applied
      • Saves more memory

      Eager:

        In [1]: from mars.config import options
        In [2]: options.eager_mode = True
        In [3]: import mars.tensor as mt
        In [4]: a = mt.arange(10)
        In [5]: a
        Out[5]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        In [6]: s = a.sum()
        In [7]: s
        Out[7]: 45

      • Worse performance
      • Each Mars object generates data that lives in the cluster until it is garbage collected
      • Hard to optimize, especially intermediate objects that users do not care about

      Any better solution? Defer mode: combine them together. Stay tuned!
  11. How Mars tensor/DataFrame scales

        In [1]: import mars.tensor as mt
        In [2]: import mars.dataframe as md
        In [3]: a = mt.ones((10, 10), chunk_size=5)
        In [4]: a[5, 5] = 8
        In [5]: df = md.DataFrame(a)
        In [6]: s = df.sum()
        In [7]: s.execute()
        Out[7]:
        0    10.0
        1    10.0
        2    10.0
        3    10.0
        4    10.0
        5    17.0
        6    10.0
        7    10.0
        8    10.0
        9    10.0
        dtype: float64

      [Diagram: on the client, each user-facing object is a Tileable — Tensor(a), DataFrame(df), Series(s) — wrapping immutable TileableData (TensorData, DataFrameData, SeriesData); each operation creates an Operand (Ones, IndexSetValue, FromTensor, Sum) linking the data nodes into a graph.]
  12. Execution pipeline

      [Diagram: the client submits a tileable graph to the supervisor, which optimizes it (column pruning, etc.), tiles it into a chunk graph, optimizes the chunk graph (fusion, etc.), fuses operand chains — e.g. Ones → IndexSetValue → FromTensor → Sum, plus Concat/Sum combiners — into subtasks, and schedules the subtasks onto workers depth-first and locality-aware; data lives in the Ray object store.]

      Typical operand implementation — tile() tells the supervisor how to tile a tileable into chunks, and execute() tells a worker how to compute one chunk:

        class MyOperand(Operand):
            @classmethod
            def tile(cls, op: Operand):
                # Create one output chunk per input chunk.
                chunks = []
                for chunk in op.inputs[0].chunks:
                    chunk_op = ...
                    chunks.append(chunk_op.new_chunk([chunk], **kwargs))
                return op.copy().new_dataframes(op.inputs, chunks=chunks)

            @classmethod
            def execute(cls, ctx: Dict, op: Operand):
                # Look up input data in the execution context by key,
                # compute, then store the result under the output key.
                inputs = [ctx[inp.key] for inp in op.inputs]
                # calculation
                ctx[op.outputs[0].key] = ...
  13. Tour of a Mars-on-Ray job

        import mars
        import mars.tensor as mt
        from mars.deploy.ray import new_cluster

        cluster = new_cluster('my_cluster', worker_num=10)
        mars.new_session(cluster.address, default=True)

        a = mt.random.rand(10_000)
        a.sum().execute()

      Steps
      • Create a placement group according to the number of workers and CPUs.
      • Create Mars supervisors and workers. Each consists of several Mars actor pools, and each Mars actor pool is wrapped in one Ray actor.
      • Create Mars services, e.g. meta and task. Each service creates a few Mars actors, and each actor is allocated to one Mars actor pool.
      • Create a Mars session that connects to the cluster, and mark it as the default session.
      • Mars tasks are submitted to the supervisor; when computation finishes, data is fetched to the client for display.
  14. Ray simplifies distributed applications

        cluster = new_cluster('my_cluster', worker_num=2, worker_cpu=2)

      internally creates a placement group:

        ray.util.placement_group(name=pg_name,
                                 bundles=[{'CPU': 1}, {'CPU': 2}, {'CPU': 2}],
                                 strategy="SPREAD")

      [Diagram: one supervisor and two workers, each composed of Mars actor pools wrapped in Ray actors and spread across Ray worker nodes; the task, subtask, and storage services create their actors inside those pools, and all nodes share the Ray object store.]

      • Data is stored in the object store, which enables object spilling.
      • Data transfer is handed over to Ray itself: an actor call becomes a Ray RPC call.
  15. More work in the near future
      • Failover leveraging Ray’s capabilities.
      • Auto scaling.
      • Generating subtasks that fuse more ops and can use Ray’s locality-aware scheduling.
      • Letting Mars tasks interact with normal Ray tasks and actors.
  16. Conclusion
      • Mars on Ray is still under rapid development.
      • With the power Ray brings, Mars on Ray can be 1 + 1 > 2.
      • Mars on Ray: https://docs.ray.io/en/master/mars-on-ray.html
      • Track Mars on Ray: https://github.com/mars-project/mars/projects/18
      • Mars
        • Try it out: pip install pymars
        • GitHub: https://github.com/mars-project/mars
        • Documentation: https://docs.pymars.org/en/latest/