
NumS: Scalable Array Programming for the Cloud (Melih Elibol, UC Berkeley)


Runtime improvements to multi-dimensional array operations increasingly rely on parallelism, and Python's scientific computing community has a growing need to train on larger datasets. Existing SPMD distributed memory solutions present Python users with an uncommon programming model, and block-partitioned array abstractions that rely on task graph scheduling heuristics are optimized for general workloads instead of array operations. In this work, we present a novel approach to optimizing NumPy programs at runtime while maintaining an array abstraction that is faithful to the NumPy API, providing a Ray library that is both performant and easy to use. We explicitly formulate scheduling as an optimization problem and empirically show that our runtime optimizer achieves near optimal performance. Our library, called NumS, is able to provide a 10-20x speedup on basic linear algebra operations over Dask, and a 3-6x speedup on logistic regression compared to Dask and Spark on terabyte-scale data.


Anyscale

July 20, 2021

Transcript

  2. Melih Elibol Samyu Yagati Lianmin Zheng Vinamra Benara Devin Petersohn

    Suresh Saggar Alvin Cheung Michael I. Jordan Ion Stoica (University of California, Berkeley). NumS is an open-source project publicly available under the Apache 2.0 license. github.com/nums-project
  3. MOTIVATION Improve the runtime of multi-dimensional array programs in Python,

    and enable Python’s scientific computing community to analyze and model larger datasets.
  4. MOTIVATION An ideal Python solution seamlessly parallelizes and scales NumPy-like

    array operations, allowing scientists and statisticians to leverage their existing programming knowledge.
  5. PROBLEM 1 To effectively parallelize NumPy-like code, we must determine

    dependencies between operations, and then concurrently execute any independent operations.
  6. PROBLEM 2 We need to avoid high overheads from parallelization.

    For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads.
  7. PROBLEM 3 To scale array operations on distributed memory, we

    must both avoid high network overheads and load-balance the data among the different nodes.
  8. SOLUTIONS Related Solutions • NumPy is primarily serial, with shared-memory

    parallelism for basic linear algebra via the system's BLAS implementation. • Existing single-program, multiple-data (SPMD) distributed memory solutions present Python users with an unfamiliar programming model. • Block-partitioned Python array libraries rely on task graph scheduling heuristics optimized for general workloads instead of array operations.
  9. NumS OUR SOLUTION Scalable Numerical Array Programming for the Cloud.

  10. DESIGN NumS Data Flow Diagram

    (Diagram components: the user application calls the NumPy API; array application state consists of BlockArrays, GraphArrays, and Blocks; I/O, the distributed system interface, the compute manager, the scheduler, and cluster state connect the application to the distributed system's nodes and persistent storage.)
  11. SOLUTION NumS exposes a NumPy-compatible array abstraction defined in terms

    of futures, allowing the scheduler to see the computation graph in advance and parallelize execution. To effectively parallelize NumPy-like code, we must determine dependencies between operations, and then concurrently execute any independent operations. PROBLEM 1
  12. SOLUTION 1 Futures and Promises

  13. SOLUTION 1 Array Access Dependency Resolution

    x = A[:, i].T @ B[:, i]
    y = A[:, j].T @ B[:, i]
    z = x * y

    (Diagram contrasts serial execution on proc0 with futures-based concurrency, where x and y run concurrently on proc0 and proc1; edges mark resource dependencies and data dependencies.)
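The dependency structure above can be sketched with Python's standard concurrent.futures in place of Ray's futures (an illustrative stand-in, not the NumS implementation): x and y share no data dependency, so they may execute concurrently, while z joins on both results.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))
B = rng.standard_normal((100, 4))
i, j = 0, 1

with ThreadPoolExecutor(max_workers=2) as pool:
    # x and y are independent, so both tasks may run concurrently.
    fx = pool.submit(lambda: A[:, i].T @ B[:, i])
    fy = pool.submit(lambda: A[:, j].T @ B[:, i])
    # z has a data dependency on both futures, so execution joins here.
    z = fx.result() * fy.result()
```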
  14. SOLUTION We coarsen operations by partitioning arrays into a grid

    of blocks and perform array operations block-wise, rather than element-wise. We need to avoid high overheads from parallelization. For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads. PROBLEM 2
  15. SOLUTION 2 Block Partitioned Arrays: a BlockArray with shape = (4, 6),

    block_shape = (2, 2), and grid_shape = (2, 3). (Diagram highlights one array entry in the 4 x 6 array.)
  16. SOLUTION 2 Block Partitioned Arrays: the same BlockArray with shape = (4, 6),

    block_shape = (2, 2), and grid_shape = (2, 3). (Diagram highlights one 2 x 2 block.)
  17. SOLUTION 2 Block Partitioned Arrays: the same BlockArray with shape = (4, 6),

    block_shape = (2, 2), and grid_shape = (2, 3). (Diagram highlights one grid entry in the 2 x 3 grid.)
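The relationship between shape, block_shape, and grid_shape above can be sketched directly (a minimal illustration of block partitioning, not the NumS internals):

```python
import numpy as np

def partition(a, block_shape):
    """Split a 2-D array into a grid of blocks of shape block_shape."""
    rows, cols = block_shape
    # grid_shape = shape / block_shape, element-wise.
    grid_shape = (a.shape[0] // rows, a.shape[1] // cols)
    blocks = {(i, j): a[i*rows:(i+1)*rows, j*cols:(j+1)*cols]
              for i in range(grid_shape[0])
              for j in range(grid_shape[1])}
    return blocks, grid_shape

a = np.arange(24).reshape(4, 6)           # shape = (4, 6)
blocks, grid_shape = partition(a, (2, 2)) # grid_shape = (2, 3)
```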
  18. SOLUTION 2 Execution on Ray: RPC

    import nums
    x: BlockArray = nums.read("data/x")

    read returns immediately, executing the tasks required to construct x asynchronously. (Diagram: the driver on node0 dispatches read(part0) and read(part1) RPCs to workers on node1 and node2, which load the partitions from storage.)
  19. SOLUTION 2 Execution on Ray: Return

    import nums
    x: BlockArray = nums.read("data/x")

    Return values of RPCs are put in the local object store. (Diagram: obj0 and obj1 now reside in the object stores on node1 and node2.)
  20. SOLUTION 2 Execution on Ray: References

    import nums
    x: BlockArray = nums.read("data/x")

    Objects are held in the store so long as a reference to the object exists in the application. (Diagram: the driver on node0 holds ref0 and ref1 to the stored objects.)
  21. SOLUTION 1 API Example

    • Load X, y and initialize beta concurrently as block-partitioned arrays.

    import nums
    import nums.numpy as nps

    X: BlockArray = nps.read("data/X")
    y: BlockArray = nps.read("data/y")
    beta: BlockArray = nps.zeros(X.shape[1])
    for i in range(max_iter):
        mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
        g: BlockArray = X.T @ (mu - y)
        h: BlockArray = (X.T * mu * (1 - mu)) @ X
        beta -= nps.inv(h) @ g
        if g.T @ g <= tol:
            break
  22. SOLUTION 1 API Example

    import nums
    import nums.numpy as nps

    X: BlockArray = nps.read("data/X")
    y: BlockArray = nps.read("data/y")
    beta: BlockArray = nps.zeros(X.shape[1])
    for i in range(max_iter):
        mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
        g: BlockArray = X.T @ (mu - y)
        h: BlockArray = (X.T * mu * (1 - mu)) @ X
        beta -= nps.inv(h) @ g
        if g.T @ g <= tol:
            break

    • Execute all operations in the loop body concurrently.
    • All operations are executed block-wise.
  23. SOLUTION 1 API Example

    import nums
    import nums.numpy as nps

    X: BlockArray = nps.read("data/X")
    y: BlockArray = nps.read("data/y")
    beta: BlockArray = nps.zeros(X.shape[1])
    for i in range(max_iter):
        mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
        g: BlockArray = X.T @ (mu - y)
        h: BlockArray = (X.T * mu * (1 - mu)) @ X
        beta -= nps.inv(h) @ g
        if g.T @ g <= tol:
            break

    • Evaluate the termination condition concurrently.
    • Block until complete, to perform the branching operation on the driver process.
  24. SOLUTION We designed a novel scheduler, Cyclic Random Tree Search

    (CRTS), which combines a traditional block-cyclic data layout with an objective-based operation scheduler. To scale on distributed memory, we must both avoid high network overheads and load-balance the computation among the different nodes. PROBLEM 3
  25. STRUCTURES Block-Cyclic Data Layout

    • NumS decomposes n-dimensional arrays into blocks: a BlockArray with shape = (12, 4), block_shape = (3, 2), and grid_shape = (4, 2).
    • A cluster of nodes is represented as an n-dimensional grid; here the cluster grid_shape = (2, 2).
    • Persistent arrays are dispersed over the n-dimensional grid of nodes.
    • This balances data load and locality for the optimizer.
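A block-cyclic placement can be sketched as mapping each block's grid coordinate to a node coordinate by taking each axis modulo the cluster grid (an illustrative sketch of the layout idea, not the NumS layout code):

```python
from collections import Counter

def cyclic_placement(array_grid_shape, cluster_grid_shape):
    """Map each block's grid coordinate to a node coordinate, cyclically."""
    return {(i, j): (i % cluster_grid_shape[0], j % cluster_grid_shape[1])
            for i in range(array_grid_shape[0])
            for j in range(array_grid_shape[1])}

# A 4x2 grid of blocks laid out over a 2x2 grid of nodes, as in the slide.
placement = cyclic_placement((4, 2), (2, 2))
blocks_per_node = Counter(placement.values())  # every node holds 2 blocks
```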
  26. OPTIMIZATION Optimizer

    • Cluster State: estimate memory and network load on each node using array size.
    • Objective: place operations so that the maximum memory and network load over all nodes is minimized.
    • Computation State: an array-of-trees data structure on which we perform computations.
    • Cyclic Random Tree Search (CRTS): an iterative algorithm that places a single operation per iteration according to the objective, and updates both the cluster state and the computation state.
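The CRTS iteration described above can be sketched as: sample a schedulable (frontier) operation, score each candidate device with the objective against the simulated cluster state, place the operation on the argmin device, and update the state. The state representation and the toy objective below are assumptions for illustration only, not the NumS implementation.

```python
import random

def crts_step(frontier, devices, load, objective):
    """Place one randomly sampled frontier op on the device that
    minimizes the objective, then update the simulated state."""
    op = random.choice(frontier)                      # random frontier sample
    best = min(devices, key=lambda d: objective(load, op, d))
    load[best] += op["mem"]                           # simulated cluster state update
    frontier.remove(op)                               # op is now scheduled
    return op, best

# Toy state: two devices, one schedulable op; the objective here
# (current load plus the op's memory footprint) is a stand-in.
load = {0: 0, 1: 10}
frontier = [{"mem": 5}]
op, best = crts_step(frontier, [0, 1], load,
                     lambda ld, o, d: ld[d] + o["mem"])
```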
  27. OPTIMIZATION Execution of Element-wise Addition: the input BlockArrays (blocks a1 to a4 and b1 to b4) are converted to GraphArrays.
  28. OPTIMIZATION Execution of Element-wise Addition: Add combines the two GraphArrays into a GraphArray of block-wise add nodes (a1 + b1 through a4 + b4).
  29. OPTIMIZATION Execution of Element-wise Addition: the Optimizer schedules the add GraphArray, producing placed result nodes c1 to c4.
  30. OPTIMIZATION Execution of Element-wise Addition: the scheduled result is materialized as an output BlockArray with blocks c1 to c4.
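The BlockArray to GraphArray to BlockArray flow above reduces element-wise addition to one coarse task per block pair, sketched here with plain NumPy dictionaries in place of remote tasks and graph nodes:

```python
import numpy as np

def blockwise_add(a_blocks, b_blocks):
    """c_g = a_g + b_g for each grid entry g: one coarse task per block."""
    return {g: a_blocks[g] + b_blocks[g] for g in a_blocks}

a = np.arange(16).reshape(4, 4)
b = np.ones((4, 4))
# Split each 4x4 array into a 2x2 grid of 2x2 blocks.
split = lambda m: {(i, j): m[i*2:(i+1)*2, j*2:(j+1)*2]
                   for i in range(2) for j in range(2)}
c_blocks = blockwise_add(split(a), split(b))
# Reassembling the result blocks recovers the element-wise sum a + b.
c = np.block([[c_blocks[(0, 0)], c_blocks[(0, 1)]],
              [c_blocks[(1, 0)], c_blocks[(1, 1)]]])
```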
  31. OPTIMIZATION Representations of Tensor Dot

    • Syntactic representation: A @ B, where A is 4 by 8 and B is 8 by 4.
    • BlockArray representation: A is split into blocks a1, a2 and B into blocks b1, b2, each 4 by 4.
    • GraphArray representation: (a1 @ b1) + (a2 @ b2), where the two matmul nodes are frontier nodes.
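With A split column-wise into a1, a2 and B split row-wise into b1, b2 (each 4 by 4), the slide's graph computes A @ B = a1 @ b1 + a2 @ b2; the two matmuls are independent frontier nodes and the add joins them. A NumPy check of that block identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 4))
a1, a2 = A[:, :4], A[:, 4:]      # 4x4 column blocks of A
b1, b2 = B[:4, :], B[4:, :]      # 4x4 row blocks of B
# The two products are independent (frontier nodes); '+' joins them.
C = a1 @ b1 + a2 @ b2
```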
  32. OPTIMIZATION CRTS: Optimization of Tensor Dot

    • Start from the GraphArray representation (a1 @ b1) + (a2 @ b2), whose matmul nodes are frontier nodes.
    • Randomly sample a frontier node, e.g. a1 @ b1.
    • Schedule the sampled operation on device1 or device2 based on a cluster simulation.
  33. OPTIMIZATION CRTS: Objective

    Schedule op on device1:        memory  net_in  net_out
                        device1        48       0        0
                        device2        32       0        0

    Schedule op on device2:        memory  net_in  net_out
                        device1        32       0       32
                        device2        48      32        0

    Capture the desired scheduling behavior in a simple objective:
    • s_i corresponds to scheduling option i from state s. We minimize this objective over i.
    • M is the vector of memory load of each node.
    • I is the vector of input load of each node.
    • O is the vector of output load of each node.
    • The infinity norm computes the max value of a vector.
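Reading the bullets above, the objective combines the infinity norms (max over nodes) of the memory, input, and output load vectors; the summed form below is an inference from the slide, not a quoted formula. Evaluating it on the slide's two tables shows why device1 wins:

```python
def objective(mem, net_in, net_out):
    """Sum of infinity norms (max over nodes) of each load vector.
    The exact combination is an assumption inferred from the slide."""
    return max(mem) + max(net_in) + max(net_out)

# Loads per node [device1, device2] for each scheduling option.
on_device1 = objective([48, 32], [0, 0], [0, 0])    # all inputs local
on_device2 = objective([32, 48], [0, 32], [32, 0])  # one input crosses the network
# CRTS picks the option with the smaller objective: device1.
```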
  34. RESULTS Does CRTS improve performance?

  35. RESULTS Does CRTS improve performance? *End-to-end logistic regression (Newton) on

    128GB data. 16 nodes. 64 workers and 512GB RAM / node.
  37. RESULTS Does NumS scale? (Weak Scaling)

  38. RESULTS Does NumS scale? (Weak Scaling) Setup - 512GB RAM/node

    - 2.5GB/s network - 32 workers/node - 2GB data / worker
  40. How does NumS compare? RESULTS

  41. RESULTS How does NumS compare?

    Setup: 512GB RAM/node, 2.5GB/s network, 32 workers/node. Measures sample, execution, and evaluation time. Datasets are partitioned into 2GB blocks.
  42. RESULTS How does NumS compare?

    Setup: 512GB RAM/node, 2.5GB/s network, 32 workers/node. Measures sample, execution, and evaluation time. Dataset partitioning is tuned for best performance for each library.
  43. RESULTS Can NumS Run on GPUs?

  44. RESULTS Can NumS Run on GPUs?

    Setup: 4 NVIDIA Tesla V100 GPUs with 200GB/s NVLink, 1.25GB/s network bandwidth; 2 nodes used at 20 and 40 GB.
  45. RESULTS Can NumS Run on GPUs? 4 NVIDIA Tesla V100 GPUs connected via NVLink at 200GB/s.
  46. RESULTS Can NumS solve real data science problems?

    Tool Stack                   Load (s)   Train (s)   Predict (s)   Total (s)
    Pandas/NumPy/Scikit-Learn       65.55       44.75         0.43      110.8
    NumS (1 node)                   11.79        3.21         0.20       15.9
    NumS (2 nodes)                   7.88        2.92         0.18       11.53

    NumS on a single node yields a 7x speedup over the Pandas/NumPy/Scikit-Learn stack.
    *Logistic Regression trained on the HIGGS dataset (7.5GB) on a single 64-core, 512GB node.
  47. OPEN SOURCE RELEASE NumS 0.2.1 pip install nums • Tested

    on Python 3.7, 3.8, and 3.9. • Runs on the latest version of Ray (1.3 as of this talk). • Runs on Windows.
  48. OPEN SOURCE RELEASE NumS Features • Full support for array

    assignment, broadcasting, and basic operations. • I/O support for distributed file systems, S3, and CSV files. • Prepackaged support for GLMs. • Experimental integration with Modin (DataFrames) and XGBoost (Tree-based models). • Expanding NumPy API coverage through contributions from Berkeley undergrads! ◦ Mohamed Elgharbawy ◦ Balaji Veeramani ◦ Brian Park ◦ Daniel Zou ◦ Priyans Desai ◦ Sehoon Kim
  49. FUTURE WORK Future Work • Integrate GPU support. • Add

    support for sparse arrays. • Continue to improve memory and runtime performance. • Continue expanding API coverage. • Continue adding support for Linear Algebra and Machine Learning. ◦ LU and Cholesky decomposition, matrix inversion, and multi-class logistic regression are in the works.
  50. Thank you

  51. Project Members: Melih Elibol Samyu Yagati Lianmin Zheng Vinamra Benara

    Devin Petersohn Alvin Cheung Michael I. Jordan Ion Stoica (U.C. Berkeley); Suresh Saggar (Amazon Core AI); Inderjit Dhillon (Amazon Search). Graphic Design: Mike Matthews, VERT (mike@vert.io). Project info: github.com/nums-project. Scalable Numerical Array Programming for the Cloud. “NumS” and the “N-Dimensional Logo” are protected under a Creative Commons 4.0 Attribution License. All other trademarks are copyright their respective owners.