
NumS: Scalable Array Programming for the Cloud (Melih Elibol, UC Berkeley)


Runtime improvements to multi-dimensional array operations increasingly rely on parallelism, and Python's scientific computing community has a growing need to train on larger datasets. Existing SPMD distributed memory solutions present Python users with an uncommon programming model, and block-partitioned array abstractions that rely on task graph scheduling heuristics are optimized for general workloads instead of array operations. In this work, we present a novel approach to optimizing NumPy programs at runtime while maintaining an array abstraction that is faithful to the NumPy API, providing a Ray library that is both performant and easy to use. We explicitly formulate scheduling as an optimization problem and empirically show that our runtime optimizer achieves near optimal performance. Our library, called NumS, is able to provide a 10-20x speedup on basic linear algebra operations over Dask, and a 3-6x speedup on logistic regression compared to Dask and Spark on terabyte-scale data.


Anyscale

July 20, 2021

Transcript

  2. Melih Elibol Samyu Yagati Lianmin Zheng Vinamra Benara Devin Petersohn

    Suresh Saggar Alvin Cheung Michael I. Jordan Ion Stoica (University of California, Berkeley). NumS is an open-source project publicly available under the Apache 2.0 license. github.com/nums-project
  3. MOTIVATION Improve the runtime of multi-dimensional array programs in Python,

    and enable Python’s scientific computing community to analyze and model larger datasets.
  4. MOTIVATION An ideal Python solution seamlessly parallelizes and scales NumPy-like

    array operations, allowing scientists and statisticians to leverage their existing programming knowledge.
  5. PROBLEM 1 To effectively parallelize NumPy-like code, we must determine

    dependencies between operations, and then concurrently execute any independent operations.
  6. PROBLEM 2 We need to avoid high overheads from parallelization.

    For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads.
  7. PROBLEM 3 To scale array operations on distributed memory, we

    must both avoid high network overheads and load-balance the data among the different nodes.
  8. SOLUTIONS Related Solutions • NumPy is primarily serial, with shared-memory

    parallelism for basic linear algebra via the system's BLAS implementation. • Existing single-program, multiple-data (SPMD) distributed memory solutions present Python users with an unfamiliar programming model. • Block-partitioned Python array libraries rely on task graph scheduling heuristics optimized for general workloads instead of array operations.
  9. NumS OUR SOLUTION Scalable Numerical Array Programming for the Cloud.

  10. DESIGN NumS Data Flow Diagram

    (Diagram components: the user application calls the NumPy API; array application state consists of BlockArrays, GraphArrays, and Blocks; I/O, the distributed system interface, the compute manager, the scheduler, and cluster state connect the application to the distributed system's nodes and persistent storage.)
  11. SOLUTION NumS exposes a NumPy-compatible array abstraction defined in terms

    of futures, allowing the scheduler to see the computation graph in advance and parallelize execution. To effectively parallelize NumPy-like code, we must determine dependencies between operations, and then concurrently execute any independent operations. PROBLEM 1
  12. SOLUTION 1 Futures and Promises

  13. SOLUTION 1 Array Access Dependency Resolution

    x = A[:, i].T @ B[:, i]
    y = A[:, j].T @ B[:, i]
    z = x * y

    (Diagram contrasts serial execution on proc0 with futures-based concurrency, where x and y run concurrently on proc0 and proc1; edges mark resource dependencies and data dependencies.)
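The dependency structure above can be sketched with Python's standard concurrent.futures in place of Ray's futures (an illustrative stand-in, not the NumS implementation): x and y share no data dependency, so they may execute concurrently, while z joins on both results.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))
B = rng.standard_normal((100, 4))
i, j = 0, 1

with ThreadPoolExecutor(max_workers=2) as pool:
    # x and y are independent, so both tasks may run concurrently.
    fx = pool.submit(lambda: A[:, i].T @ B[:, i])
    fy = pool.submit(lambda: A[:, j].T @ B[:, i])
    # z has a data dependency on both futures, so execution joins here.
    z = fx.result() * fy.result()
```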
  14. SOLUTION We coarsen operations by partitioning arrays into a grid

    of blocks and perform array operations block-wise, rather than element-wise. We need to avoid high overheads from parallelization. For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads. PROBLEM 2
  15. SOLUTION 2 Block Partitioned Arrays: a BlockArray with shape = (4, 6),

    block_shape = (2, 2), and grid_shape = (2, 3). (Diagram highlights one array entry in the 4 x 6 array.)
  16. SOLUTION 2 Block Partitioned Arrays: the same BlockArray with shape = (4, 6),

    block_shape = (2, 2), and grid_shape = (2, 3). (Diagram highlights one 2 x 2 block.)
  17. SOLUTION 2 Block Partitioned Arrays: the same BlockArray with shape = (4, 6),

    block_shape = (2, 2), and grid_shape = (2, 3). (Diagram highlights one grid entry in the 2 x 3 grid.)
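The relationship between shape, block_shape, and grid_shape above can be sketched directly (a minimal illustration of block partitioning, not the NumS internals):

```python
import numpy as np

def partition(a, block_shape):
    """Split a 2-D array into a grid of blocks of shape block_shape."""
    rows, cols = block_shape
    # grid_shape = shape / block_shape, element-wise.
    grid_shape = (a.shape[0] // rows, a.shape[1] // cols)
    blocks = {(i, j): a[i*rows:(i+1)*rows, j*cols:(j+1)*cols]
              for i in range(grid_shape[0])
              for j in range(grid_shape[1])}
    return blocks, grid_shape

a = np.arange(24).reshape(4, 6)           # shape = (4, 6)
blocks, grid_shape = partition(a, (2, 2)) # grid_shape = (2, 3)
```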
  18. SOLUTION 2 Execution on Ray: RPC

    import nums
    x: BlockArray = nums.read("data/x")

    read returns immediately, executing the tasks required to construct x asynchronously. (Diagram: the driver on node0 dispatches read(part0) and read(part1) RPCs to workers on node1 and node2, which load the partitions from storage.)
  19. SOLUTION 2 Execution on Ray: Return

    import nums
    x: BlockArray = nums.read("data/x")

    Return values of RPCs are put in the local object store. (Diagram: obj0 and obj1 now reside in the object stores on node1 and node2.)
  20. SOLUTION 2 Execution on Ray: References

    import nums
    x: BlockArray = nums.read("data/x")

    Objects are held in the store so long as a reference to the object exists in the application. (Diagram: the driver on node0 holds ref0 and ref1 to the stored objects.)
  21. SOLUTION 1 API Example

    • Load X, y and initialize beta concurrently as block-partitioned arrays.

    import nums
    import nums.numpy as nps

    X: BlockArray = nps.read("data/X")
    y: BlockArray = nps.read("data/y")
    beta: BlockArray = nps.zeros(X.shape[1])
    for i in range(max_iter):
        mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
        g: BlockArray = X.T @ (mu - y)
        h: BlockArray = (X.T * mu * (1 - mu)) @ X
        beta -= nps.inv(h) @ g
        if g.T @ g <= tol:
            break
  22. SOLUTION 1 API Example

    import nums
    import nums.numpy as nps

    X: BlockArray = nps.read("data/X")
    y: BlockArray = nps.read("data/y")
    beta: BlockArray = nps.zeros(X.shape[1])
    for i in range(max_iter):
        mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
        g: BlockArray = X.T @ (mu - y)
        h: BlockArray = (X.T * mu * (1 - mu)) @ X
        beta -= nps.inv(h) @ g
        if g.T @ g <= tol:
            break

    • Execute all operations in the loop body concurrently.
    • All operations are executed block-wise.
  23. SOLUTION 1 API Example

    import nums
    import nums.numpy as nps

    X: BlockArray = nps.read("data/X")
    y: BlockArray = nps.read("data/y")
    beta: BlockArray = nps.zeros(X.shape[1])
    for i in range(max_iter):
        mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
        g: BlockArray = X.T @ (mu - y)
        h: BlockArray = (X.T * mu * (1 - mu)) @ X
        beta -= nps.inv(h) @ g
        if g.T @ g <= tol:
            break

    • Evaluate the termination condition concurrently.
    • Block until complete, to perform the branching operation on the driver process.
  24. SOLUTION We designed a novel scheduler, Cyclic Random Tree Search

    (CRTS), which combines a traditional block-cyclic data layout with an objective-based operation scheduler. To scale on distributed memory, we must both avoid high network overheads and load-balance the computation among the different nodes. PROBLEM 3
  25. STRUCTURES Block-Cyclic Data Layout

    • NumS decomposes n-dimensional arrays into blocks: a BlockArray with shape = (12, 4), block_shape = (3, 2), and grid_shape = (4, 2).
    • A cluster of nodes is represented as an n-dimensional grid; here the cluster grid_shape = (2, 2).
    • Persistent arrays are dispersed over the n-dimensional grid of nodes.
    • This balances data load and locality for the optimizer.
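A block-cyclic placement can be sketched as mapping each block's grid coordinate to a node coordinate by taking each axis modulo the cluster grid (an illustrative sketch of the layout idea, not the NumS layout code):

```python
from collections import Counter

def cyclic_placement(array_grid_shape, cluster_grid_shape):
    """Map each block's grid coordinate to a node coordinate, cyclically."""
    return {(i, j): (i % cluster_grid_shape[0], j % cluster_grid_shape[1])
            for i in range(array_grid_shape[0])
            for j in range(array_grid_shape[1])}

# A 4x2 grid of blocks laid out over a 2x2 grid of nodes, as in the slide.
placement = cyclic_placement((4, 2), (2, 2))
blocks_per_node = Counter(placement.values())  # every node holds 2 blocks
```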
  26. OPTIMIZATION Optimizer

    • Cluster State: estimate memory and network load on each node using array size.
    • Objective: place operations so that the maximum memory and network load over all nodes is minimized.
    • Computation State: an array-of-trees data structure on which we perform computations.
    • Cyclic Random Tree Search (CRTS): an iterative algorithm that places a single operation per iteration according to the objective, and updates both the cluster state and the computation state.
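The CRTS iteration described above can be sketched as: sample a schedulable (frontier) operation, score each candidate device with the objective against the simulated cluster state, place the operation on the argmin device, and update the state. The state representation and the toy objective below are assumptions for illustration only, not the NumS implementation.

```python
import random

def crts_step(frontier, devices, load, objective):
    """Place one randomly sampled frontier op on the device that
    minimizes the objective, then update the simulated state."""
    op = random.choice(frontier)                      # random frontier sample
    best = min(devices, key=lambda d: objective(load, op, d))
    load[best] += op["mem"]                           # simulated cluster state update
    frontier.remove(op)                               # op is now scheduled
    return op, best

# Toy state: two devices, one schedulable op; the objective here
# (current load plus the op's memory footprint) is a stand-in.
load = {0: 0, 1: 10}
frontier = [{"mem": 5}]
op, best = crts_step(frontier, [0, 1], load,
                     lambda ld, o, d: ld[d] + o["mem"])
```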
  27. OPTIMIZATION Execution of Element-wise Addition: the input BlockArrays (blocks a1 to a4 and b1 to b4) are converted to GraphArrays.
  28. OPTIMIZATION Execution of Element-wise Addition: Add combines the two GraphArrays into a GraphArray of block-wise add nodes (a1 + b1 through a4 + b4).
  29. OPTIMIZATION Execution of Element-wise Addition: the Optimizer schedules the add GraphArray, producing placed result nodes c1 to c4.
  30. OPTIMIZATION Execution of Element-wise Addition: the scheduled result is materialized as an output BlockArray with blocks c1 to c4.
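The BlockArray to GraphArray to BlockArray flow above reduces element-wise addition to one coarse task per block pair, sketched here with plain NumPy dictionaries in place of remote tasks and graph nodes:

```python
import numpy as np

def blockwise_add(a_blocks, b_blocks):
    """c_g = a_g + b_g for each grid entry g: one coarse task per block."""
    return {g: a_blocks[g] + b_blocks[g] for g in a_blocks}

a = np.arange(16).reshape(4, 4)
b = np.ones((4, 4))
# Split each 4x4 array into a 2x2 grid of 2x2 blocks.
split = lambda m: {(i, j): m[i*2:(i+1)*2, j*2:(j+1)*2]
                   for i in range(2) for j in range(2)}
c_blocks = blockwise_add(split(a), split(b))
# Reassembling the result blocks recovers the element-wise sum a + b.
c = np.block([[c_blocks[(0, 0)], c_blocks[(0, 1)]],
              [c_blocks[(1, 0)], c_blocks[(1, 1)]]])
```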
  31. OPTIMIZATION Representations of Tensor Dot

    • Syntactic representation: A @ B, where A is 4 by 8 and B is 8 by 4.
    • BlockArray representation: A is split into blocks a1, a2 and B into blocks b1, b2, each 4 by 4.
    • GraphArray representation: (a1 @ b1) + (a2 @ b2), where the two matmul nodes are frontier nodes.
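With A split column-wise into a1, a2 and B split row-wise into b1, b2 (each 4 by 4), the slide's graph computes A @ B = a1 @ b1 + a2 @ b2; the two matmuls are independent frontier nodes and the add joins them. A NumPy check of that block identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 4))
a1, a2 = A[:, :4], A[:, 4:]      # 4x4 column blocks of A
b1, b2 = B[:4, :], B[4:, :]      # 4x4 row blocks of B
# The two products are independent (frontier nodes); '+' joins them.
C = a1 @ b1 + a2 @ b2
```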
  32. OPTIMIZATION CRTS: Optimization of Tensor Dot

    • Start from the GraphArray representation (a1 @ b1) + (a2 @ b2), whose matmul nodes are frontier nodes.
    • Randomly sample a frontier node, e.g. a1 @ b1.
    • Schedule the sampled operation on device1 or device2 based on a cluster simulation.
  33. OPTIMIZATION CRTS: Objective

    Schedule op on device1:        memory  net_in  net_out
                        device1        48       0        0
                        device2        32       0        0

    Schedule op on device2:        memory  net_in  net_out
                        device1        32       0       32
                        device2        48      32        0

    Capture the desired scheduling behavior in a simple objective:
    • s_i corresponds to scheduling option i from state s. We minimize this objective over i.
    • M is the vector of memory load of each node.
    • I is the vector of input load of each node.
    • O is the vector of output load of each node.
    • The infinity norm computes the max value of a vector.
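Reading the bullets above, the objective combines the infinity norms (max over nodes) of the memory, input, and output load vectors; the summed form below is an inference from the slide, not a quoted formula. Evaluating it on the slide's two tables shows why device1 wins:

```python
def objective(mem, net_in, net_out):
    """Sum of infinity norms (max over nodes) of each load vector.
    The exact combination is an assumption inferred from the slide."""
    return max(mem) + max(net_in) + max(net_out)

# Loads per node [device1, device2] for each scheduling option.
on_device1 = objective([48, 32], [0, 0], [0, 0])    # all inputs local
on_device2 = objective([32, 48], [0, 32], [32, 0])  # one input crosses the network
# CRTS picks the option with the smaller objective: device1.
```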
  34. RESULTS Does CRTS improve performance?

  35. RESULTS Does CRTS improve performance? *End-to-end logistic regression (Newton) on

    128GB data. 16 nodes. 64 workers and 512GB RAM / node.
  37. RESULTS Does NumS scale? (Weak Scaling)

  38. RESULTS Does NumS scale? (Weak Scaling) Setup - 512GB RAM/node

    - 2.5GB/s network - 32 workers/node - 2GB data / worker
  40. How does NumS compare? RESULTS

  41. RESULTS How does NumS compare?

    Setup: 512GB RAM/node, 2.5GB/s network, 32 workers/node. Measures sample, execution, and evaluation time. Datasets are partitioned into 2GB blocks.
  42. RESULTS How does NumS compare?

    Setup: 512GB RAM/node, 2.5GB/s network, 32 workers/node. Measures sample, execution, and evaluation time. Dataset partitioning is tuned for best performance for each library.
  43. RESULTS Can NumS Run on GPUs?

  44. RESULTS Can NumS Run on GPUs?

    Setup: 4 NVIDIA Tesla V100 GPUs with 200GB/s NVLink, 1.25GB/s network bandwidth; 2 nodes used at 20 and 40 GB.
  45. RESULTS Can NumS Run on GPUs? 4 NVIDIA Tesla V100 GPUs connected via NVLink at 200GB/s.
  46. RESULTS Can NumS solve real data science problems?

    Tool Stack                   Load (s)   Train (s)   Predict (s)   Total (s)
    Pandas/NumPy/Scikit-Learn       65.55       44.75         0.43      110.8
    NumS (1 node)                   11.79        3.21         0.20       15.9
    NumS (2 nodes)                   7.88        2.92         0.18       11.53

    NumS on a single node yields a 7x speedup over the Pandas/NumPy/Scikit-Learn stack.
    *Logistic Regression trained on the HIGGS dataset (7.5GB) on a single 64-core, 512GB node.
  47. OPEN SOURCE RELEASE NumS 0.2.1 pip install nums • Tested

    on Python 3.7, 3.8, and 3.9. • Runs on the latest version of Ray (1.3 as of this talk). • Runs on Windows.
  48. OPEN SOURCE RELEASE NumS Features • Full support for array

    assignment, broadcasting, and basic operations. • I/O support for distributed file systems, S3, and CSV files. • Prepackaged support for GLMs. • Experimental integration with Modin (DataFrames) and XGBoost (Tree-based models). • Expanding NumPy API coverage through contributions from Berkeley undergrads! ◦ Mohamed Elgharbawy ◦ Balaji Veeramani ◦ Brian Park ◦ Daniel Zou ◦ Priyans Desai ◦ Sehoon Kim
  49. FUTURE WORK Future Work • Integrate GPU support. • Add

    support for sparse arrays. • Continue to improve memory and runtime performance. • Continue expanding API coverage. • Continue adding support for Linear Algebra and Machine Learning. ◦ LU and Cholesky decomposition, matrix inversion, and multi-class logistic regression are in the works.
  50. Thank you

  51. Project Members: Melih Elibol Samyu Yagati Lianmin Zheng Vinamra Benara

    Devin Petersohn Alvin Cheung Michael I. Jordan Ion Stoica (U.C. Berkeley); Suresh Saggar (Amazon Core AI); Inderjit Dhillon (Amazon Search). Graphic Design: Mike Matthews, VERT (mike@vert.io). Project info: github.com/nums-project. Scalable Numerical Array Programming for the Cloud. “NumS” and the “N-Dimensional Logo” are protected under a Creative Commons 4.0 Attribution License. All other trademarks are copyright their respective owners.