Mars-on-Ray: Accelerating Large-scale Tensor and DataFrame Workloads (Chris Qin, Alibaba Cloud)

Mars is a tensor-based unified framework for large-scale data computation that scales numpy, pandas, scikit-learn, and Python functions. Ray provides a general-purpose task and actor API that is extremely expressive and delivers high performance for distributed Python applications. Mars-on-Ray takes advantage of both: Mars contributes familiar data science APIs and highly optimized, locality-aware, fine-grained task scheduling, while Ray lets Mars workloads run with the same scheduling strategy but scale out across Ray clusters.

Anyscale

July 21, 2021

Transcript

  1. Mars-on-Ray: Accelerating Large-scale
    Tensor and DataFrame Workloads
    Chris Qin

  2. About Me
    • Senior software engineer at Alibaba Cloud.

    • Focusing on Mars as a core developer
    and architect.

    • Developed a framework named PyODPS
    which supports pandas-like syntax and
    compiles DataFrame operations into SQL
    to run on the big data platform
    MaxCompute and on databases including
    PostgreSQL, MySQL, etc.

  3. Tensor Workloads

    Knowledge Graph
    • Sparse tensors
    • Large-scale tensor operations, e.g. matrix multiplication

    Vector Search
    • Widely used in picture search and recommendation systems
    • Entities are described as vectors
    • Computation is extremely heavy

    Data Preprocessing and Post-processing for Machine Learning
    • Preprocessing includes normalization, splitting train and test data, etc. (see the sketch below)
    • Post-processing includes metrics, visualization, etc.
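
    A minimal preprocessing sketch, assuming the standard mars.tensor API
    (which mirrors numpy); the array shape is illustrative:

        import mars.tensor as mt

        # Hypothetical feature matrix; mt mirrors the numpy API.
        X = mt.random.rand(1_000_000, 32)

        # Column-wise normalization; nothing runs until .execute().
        X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
        X_norm.execute()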

  4. DataFrame Workloads

    Data Cleaning (see the sketch below)
    • Remove corrupted data
    • Fill missing values
    • …

    Data Analysis
    • Aggregations
    • Group by
    • …

    Visualization
    • Matplotlib
    • Seaborn
    • Plotly
    • …

    Data Exploration
    • Auto data exploration
    • Speed is crucial when data is large
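
    A minimal cleaning-and-analysis sketch, assuming the standard
    mars.dataframe API (which mirrors pandas); the file and column names
    are hypothetical:

        import mars.dataframe as md

        df = md.read_csv('ratings.csv')        # hypothetical input file
        df = df.dropna()                       # remove rows with missing values
        means = df.groupby('id').agg('mean')   # distributed group-by aggregation
        means.execute()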

  5. Mars: a tensor-based unified framework
    for large-scale data computation

    Data Science Tools
    • Used by millions of people
    • Fit for not-so-large data
    • Easy to install via pip/conda
    • Acceptable performance due to underlying C/Cython

    Mars Tensor, Mars DataFrame, Mars Learn
    • Born for large-scale data
    • Familiar APIs that enable a smooth transition
    • Thousands of workers, up to 12,000 cores per task
    • Relatively easy to set up

  6. Platforms

    Machine Learning & Integrated Libraries
    • Mars Tensor: distributed numpy, plus supported data formats for tensors
    • Mars DataFrame: distributed pandas, plus supported data formats for DataFrames
    • Mars Learn: distributed scikit-learn
    • Mars Remote: distributed Python functions

    Services: Meta, Storage, Scheduling, Task, Subtask, Lifecycle, Session

    Fundamental Components: Oscar, a lightweight actor framework

    Mars Actors
    • Communication via unix socket or socket
    • Mars' own serialization based on the pickle5 protocol

    Ray backend
    • Leverages Ray's facilities as much as possible
    • Based on the Ray actor API
    • Uses Ray's serialization, RPC, and object store, which enables object spilling
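
    Mars-on-Ray builds on Ray's actor API. A minimal sketch of that API
    itself (plain Ray, not Mars internals):

        import ray

        ray.init()

        @ray.remote
        class Counter:
            # Mars wraps each of its actor pools in a Ray actor like this.
            def __init__(self):
                self.value = 0

            def increment(self):
                self.value += 1
                return self.value

        counter = Counter.remote()
        # An actor method call becomes a Ray RPC call returning an object ref.
        print(ray.get(counter.increment.remote()))  # prints 1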

  7. Mars Tensor

    NumPy version:

        import numpy as np
        from scipy.special import erf

        def black_scholes(P, S, T, rate, vol):
            a = np.log(P / S)
            b = T * -rate
            z = T * (vol * vol * 2)
            c = 0.25 * z
            y = 1.0 / np.sqrt(z)

            w1 = (a - b + c) * y
            w2 = (a - b - c) * y

            d1 = 0.5 + 0.5 * erf(w1)
            d2 = 0.5 + 0.5 * erf(w2)

            Se = np.exp(b) * S

            call = P * d1 - Se * d2
            put = call - P + Se

            return call, put

        N = 50000000
        price = np.random.uniform(10.0, 50.0, N)
        strike = np.random.uniform(10.0, 50.0, N)
        t = np.random.uniform(1.0, 2.0, N)
        print(black_scholes(price, strike, t, 0.1, 0.2))

    Mars Tensor version — import replacement; lazy evaluation:

        import mars
        import mars.tensor as mt
        from mars.tensor.special import erf

        def black_scholes(P, S, T, rate, vol):
            a = mt.log(P / S)
            b = T * -rate
            z = T * (vol * vol * 2)
            c = 0.25 * z
            y = 1.0 / mt.sqrt(z)

            w1 = (a - b + c) * y
            w2 = (a - b - c) * y

            d1 = 0.5 + 0.5 * erf(w1)
            d2 = 0.5 + 0.5 * erf(w2)

            Se = mt.exp(b) * S

            call = P * d1 - Se * d2
            put = call - P + Se

            return call, put

        N = 50000000
        price = mt.random.uniform(10.0, 50.0, N)
        strike = mt.random.uniform(10.0, 50.0, N)
        t = mt.random.uniform(1.0, 2.0, N)
        print(mars.execute(black_scholes(price, strike, t, 0.1, 0.2)))

  8. Mars DataFrame

    pandas version:

        import pandas as pd

        df = pd.read_csv('file.csv')
        stat_df = df.groupby('id').agg('mean')
        sort_df = stat_df.sort_values('rating')
        sort_df.plot(backend='pandas_bokeh')

    Mars DataFrame version — import replacement; sort_df.execute() is
    automatically called inside the plot function:

        import mars.dataframe as md

        df = md.read_csv('file.csv')
        stat_df = df.groupby('id').agg('mean')
        sort_df = stat_df.sort_values('rating')
        sort_df.plot(backend='pandas_bokeh')

  9. Mars Learn

    scikit-learn version:

        from sklearn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='brute',
                              metric='cosine')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

    Mars Learn version — import replacement; execute is automatically
    called inside learn APIs including fit, predict, kneighbors, etc.:

        from mars.learn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='brute',
                              metric='cosine')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

  10. Superset vs. subset

    • Mars tensor/DataFrame/learn implements a subset of the
    numpy/pandas/scikit-learn APIs.

    • In the long term, a superset is our goal.

    • Some ops are born for distributed execution, e.g. map_chunk,
    cartesian_chunk, map_reduce (see the map_chunk sketch after this
    slide).

    • Some ops may run differently according to the number of workers,
    data size, data pattern, etc., so there may be several algorithms
    under the hood for the same op.

    • Different algorithms for machine learning.

        def complicated_process(df1_chunk, df2_chunk):
            # some complicated cartesian logic between two chunks
            ...

        df1.cartesian_chunk(df2, complicated_process)

    [Diagram: cartesian_chunk pairs each chunk of df1 with each chunk of
    df2, with the chunks spread across Worker1, Worker2, and Worker3.]

        df.groupby('field').agg(['sum'])

    [Diagram: the DataFrameGroupByAgg op uses tree-based aggregation when
    group keys are small, and shuffle-based aggregation (an extra shuffle
    op) when group keys are large.]

        from mars.learn.neighbors import NearestNeighbors

        nn = NearestNeighbors(algorithm='proxima')
        nn.fit(X)
        dist, ind = nn.kneighbors(y)

    • Highly optimized for k-nearest neighbors.

    • 400,000,000 vectors x 400,000,000 vectors in ~3.5h, including IO
    time.

    • 4x-5x faster than the sklearn-based implementation, even for the
    brute-force algorithm.
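
    For comparison, a minimal map_chunk sketch, assuming map_chunk applies
    the given function to each pandas chunk independently; the file and
    column names are hypothetical:

        import mars.dataframe as md

        df = md.read_csv('file.csv')

        def keep_positive(chunk):
            # called once per pandas chunk, independently on each worker
            return chunk[chunk['rating'] > 0]

        df.map_chunk(keep_positive).execute()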

  11. Lazy evaluation vs. eager vs. ??

    Lazy evaluation:

        In [1]: import mars.tensor as mt
        In [2]: a = mt.arange(10)
        In [3]: a
        Out[3]: Tensor <... key=db07958a8fbe7c4f61eee10cd5dcfa05>
        In [4]: s = a.sum()
        In [5]: s
        Out[5]: Tensor <... key=a396e1590e3d05e9b2aa965feec54bae>
        In [6]: s.execute()
        Out[6]: 45

    • Better performance
    • Fuses intermediate nodes, saving graph size
    • More optimizations can be applied
    • Saves more memory

    Eager mode:

        In [1]: from mars.config import options
        In [2]: options.eager_mode = True
        In [3]: import mars.tensor as mt
        In [4]: a = mt.arange(10)
        In [5]: a
        Out[5]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        In [6]: s = a.sum()
        In [7]: s
        Out[7]: 45

    • Worse performance
    • Each Mars object generates data that lives in the cluster unless it
    is garbage collected
    • Hard to optimize, especially the intermediate objects that users do
    not care about

    Any better solution? Defer mode: combine them together. Stay tuned!!

  12. How Mars tensor/DataFrame scales

    Client:

        In [1]: import mars.tensor as mt
        In [2]: import mars.dataframe as md
        In [3]: a = mt.ones((10, 10), chunk_size=5)
        In [4]: a[5, 5] = 8
        In [5]: df = md.DataFrame(a)
        In [6]: s = df.sum()
        In [7]: s.execute()
        Out[7]:
        0    10.0
        1    10.0
        2    10.0
        3    10.0
        4    10.0
        5    17.0
        6    10.0
        7    10.0
        8    10.0
        9    10.0
        dtype: float64

    [Diagram: the client builds a tileable graph of operands — Ones →
    IndexSetValue → FromTensor → Sum — producing TensorData, TensorData,
    DataFrameData, and SeriesData. The tileables Tensor(a), DataFrame(df),
    and Series(s) each wrap their immutable data.]

  13. Submit process

    The client submits the tileable graph (Ones → IndexSetValue →
    FromTensor → Sum) to the supervisor. The supervisor optimizes the
    tileable graph (column pruning, etc.), tiles it into a chunk graph,
    optimizes the chunk graph (fusion, etc.), fuses chunks into subtasks,
    and schedules the subtasks onto workers. Workers exchange data
    through the Ray object store.

    Typical operand implementation — tile tells the supervisor how to
    tile a tileable into chunks, and execute runs on a worker:

        class MyOperand(Operand):
            @classmethod
            def tile(cls, op: Operand):
                # split the work: one output chunk per input chunk
                chunks = []
                for chunk in op.inputs[0].chunks:
                    chunk_op = ...
                    chunks.append(chunk_op.new_chunk([chunk], **kwargs))
                return op.copy().new_dataframes(op.inputs, chunks=chunks)

            @classmethod
            def execute(cls, ctx: Dict, op: Operand):
                # read input chunks from the context, write the result back
                inputs = [ctx[inp.key] for inp in op.inputs]
                # calculation
                ctx[op.outputs[0].key] = ...

    [Diagram: the chunk graph — per-chunk Ones, IndexSetValue, FromTensor,
    Sum, and Concat ops — is fused into subtasks, each a chain such as
    Ones → IndexSetValue → FromTensor → Sum.]

    Schedule:
    • Depth-first
    • Locality-aware

  14. Tour of a Mars on Ray job

        import mars
        import mars.tensor as mt
        from mars.deploy.ray import new_cluster

        cluster = new_cluster('my_cluster', worker_num=10)
        mars.new_session(cluster.address, default=True)

        a = mt.random.rand(10_000)
        a.sum().execute()

    Steps of new_cluster:

    • Create a placement group according to the number of workers and
    CPUs.

    • Create Mars supervisors and workers, each of which consists of
    several Mars actor pools. Each Mars actor pool is wrapped in one Ray
    actor.

    • Create Mars services, e.g. meta, task. One service creates a few
    Mars actors, and each actor is allocated to one Mars actor pool.

    mars.new_session creates a Mars session that connects to the cluster
    and marks it as the default session.

    The Mars task is submitted to the supervisor. When the computation
    finishes, data is fetched to the client for display.

  15. Ray simplifies distributed applications

        cluster = new_cluster('my_cluster', worker_num=2, worker_cpu=2)

    Internally creates a placement group:

        ray.util.placement_group(name=pg_name,
                                 bundles=[{'CPU': 1}, {'CPU': 2}, {'CPU': 2}],
                                 strategy="SPREAD")

    [Diagram: one Mars supervisor and two Mars workers spread across Ray
    worker nodes; each Mars actor pool is a Ray actor holding actors
    created by the Task, SubTask, and Storage services, and all nodes
    share the Ray object store.]

    • Data is stored in the object store, which enables object spilling.

    • Data transfer is handed over to Ray itself.

    • An actor call becomes a Ray RPC call.

  16. More work in the near future

    • Failover that leverages the ability of Ray.

    • Auto scaling.

    • Generating subtasks that fuse more ops and can use Ray's
    locality-aware scheduling.

    • Mars tasks that can interact with normal Ray tasks and actors.

  17. Demo

  18. Conclusion

    • Mars on Ray is still under rapid development.

    • With the power that Ray enables, Mars on Ray can be 1 + 1 > 2.

    • Mars on Ray: https://docs.ray.io/en/master/mars-on-ray.html

    • Track Mars on Ray: https://github.com/mars-project/mars/projects/18

    • Mars

      • Try it out: pip install pymars

      • GitHub: https://github.com/mars-project/mars

      • Documentation: https://docs.pymars.org/en/latest/

  19. THANK YOU
