Melih Elibol
Samyu Yagati
Lianmin Zheng
Vinamra Benara
Devin Petersohn
Suresh Saggar
Alvin Cheung
Michael I. Jordan
Ion Stoica
University of California,
Berkeley
NumS
NumS is an open-source project publicly available under the Apache 2.0 license.
github.com/nums-project
MOTIVATION
Improve the runtime of multi-dimensional
array programs in Python, and enable
Python’s scientific computing community to
analyze and model larger datasets.
MOTIVATION
An ideal Python solution seamlessly
parallelizes and scales NumPy-like
array operations, allowing scientists
and statisticians to leverage their
existing programming knowledge.
PROBLEM 1
To effectively parallelize
NumPy-like code, we must
determine dependencies
between operations, and then
concurrently execute any
independent operations.
PROBLEM 2
We need to avoid high
overheads from parallelization.
For example, the naive approach
of sending one RPC per
element-wise array operation
poses untenable overheads.
PROBLEM 3
To scale array operations on
distributed memory, we must
both avoid high network
overheads and load-balance the
data among the different nodes.
SOLUTIONS
Related Solutions
• NumPy is primarily serial, with shared-memory parallelism for basic linear algebra via the system's BLAS implementation.
• Existing single-program, multiple-data (SPMD) distributed-memory solutions present Python users with an unfamiliar programming model.
• Block-partitioned Python array libraries rely on task-graph scheduling heuristics optimized for general workloads instead of array operations.
NumS
OUR SOLUTION
Scalable Numerical Array Programming
for the Cloud.
DESIGN
NumS Data Flow Diagram
[Diagram: a User Application calls the NumPy API over the Array Application layer (State, I/O, and Arrays: BlockArray, GraphArray, and their Blocks). A Distributed System Interface (Compute Manager, Cluster State) connects the application to the Distributed System (Scheduler and Nodes) and to Persistent Storage.]
PROBLEM 1
To effectively parallelize NumPy-like code, we must determine dependencies between operations, and then concurrently execute any independent operations.
SOLUTION
NumS exposes a NumPy-compatible array abstraction defined in terms of futures, allowing the scheduler to see the computation graph in advance and parallelize execution.
SOLUTION 1
Futures and Promises
SOLUTION 1
Array Access Dependency Resolution
x = A[:, i].T @ B[:, i]
y = A[:, j].T @ B[:, i]
z = x * y
Serial Execution
[Diagram: a single process (proc0) computes x, then y, then z in sequence.]
Futures with Concurrency
[Diagram: x and y execute concurrently on proc0 and proc1; z waits on both. Edges mark data dependencies and resource dependencies.]
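The dependency resolution above can be sketched with plain Python futures. This is a hypothetical illustration of the execution model, not NumS code: x and y share no data dependency, so their futures may run concurrently, while z blocks on both.

```python
# Hypothetical sketch (not the NumS implementation): two independent
# inner products run concurrently; the dependent product waits on both.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

A = np.arange(12.0).reshape(4, 3)
B = np.arange(12.0).reshape(4, 3)
i, j = 0, 1

with ThreadPoolExecutor(max_workers=2) as pool:
    # x and y have no dependency on each other, so they may run in parallel.
    fx = pool.submit(lambda: A[:, i].T @ B[:, i])
    fy = pool.submit(lambda: A[:, j].T @ B[:, i])
    # z depends on both x and y; block until both futures resolve.
    z = fx.result() * fy.result()

print(z)  # → 18144.0
```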
PROBLEM 2
We need to avoid high overheads from parallelization. For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads.
SOLUTION
We coarsen operations by partitioning arrays into a grid of blocks and perform array operations block-wise, rather than element-wise.
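A minimal sketch of the coarsening idea, assuming a simple 2-D partitioning helper (to_blocks is hypothetical, not a NumS API): an elementwise add issues one task per block pair instead of one per element.

```python
# Hypothetical sketch (not NumS internals): coarsening an elementwise add
# by partitioning arrays into a grid of blocks and issuing one task per
# block pair, instead of one task per element.
import numpy as np

def to_blocks(arr, block_shape):
    """Split a 2-D array into a grid of equally shaped blocks."""
    rows, cols = arr.shape
    br, bc = block_shape
    return [[arr[r:r + br, c:c + bc]
             for c in range(0, cols, bc)]
            for r in range(0, rows, br)]

x = np.ones((4, 4))
y = np.ones((4, 4))
xb, yb = to_blocks(x, (2, 2)), to_blocks(y, (2, 2))

# One "task" (here, a plain function call) per block pair: 4 tasks, not 16.
zb = [[xb[r][c] + yb[r][c] for c in range(2)] for r in range(2)]
z = np.block(zb)  # reassemble the block grid into a full array
print(z.sum())  # → 32.0
```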
SOLUTION 2
Execution on Ray: RPC
import nums
x: BlockArray = nums.read("data/x")
read returns immediately, executing tasks required to construct x asynchronously.
[Diagram: the driver on node0 issues read(part0) and read(part1) RPCs, which are dispatched to workers on node1 and node2; each worker reads its partition from Storage.]
SOLUTION 2
Execution on Ray: Return
import nums
x: BlockArray = nums.read("data/x")
Return values of RPCs are put in the local object store.
[Diagram: the workers on node1 and node2 place their results, obj0 and obj1, in their nodes' local object stores; the driver on node0 holds the pending read(part0) and read(part1) calls.]
SOLUTION 2
Execution on Ray: References
import nums
x: BlockArray = nums.read("data/x")
Objects are held in the store so long as
a reference to the object exists in the
application.
[Diagram: the driver on node0 holds references ref0 and ref1 to obj0 and obj1, which remain pinned in the object stores on node1 and node2.]
SOLUTION 1
API Example
● Load X, y and initialize beta concurrently as block-partitioned arrays.
import nums
import nums.numpy as nps
X: BlockArray = nps.read("data/X")
y: BlockArray = nps.read("data/y")
beta: BlockArray = nps.zeros(X.shape[1])
for i in range(max_iter):
    mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
    g: BlockArray = X.T @ (mu - y)
    h: BlockArray = (X.T * mu * (1 - mu)) @ X
    beta -= nps.inv(h) @ g
    if g.T @ g <= tol:
        break
SOLUTION 1
API Example
import nums
import nums.numpy as nps
X: BlockArray = nps.read("data/X")
y: BlockArray = nps.read("data/y")
beta: BlockArray = nps.zeros(X.shape[1])
for i in range(max_iter):
    mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
    g: BlockArray = X.T @ (mu - y)
    h: BlockArray = (X.T * mu * (1 - mu)) @ X
    beta -= nps.inv(h) @ g
    if g.T @ g <= tol:
        break
● Execute all operations in loop body concurrently.
● All operations are executed block-wise.
SOLUTION 1
API Example
import nums
import nums.numpy as nps
X: BlockArray = nps.read("data/X")
y: BlockArray = nps.read("data/y")
beta: BlockArray = nps.zeros(X.shape[1])
for i in range(max_iter):
    mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
    g: BlockArray = X.T @ (mu - y)
    h: BlockArray = (X.T * mu * (1 - mu)) @ X
    beta -= nps.inv(h) @ g
    if g.T @ g <= tol:
        break
● Evaluate termination condition concurrently.
● Block until complete, to perform branching operation on driver process.
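The blocking behavior of the termination check can be sketched with a plain Python future (an illustration of the execution model, not NumS internals): Python's if statement needs a concrete boolean, so the driver must wait for the residual to resolve before branching.

```python
# Hypothetical sketch: control flow (if/break) on the driver needs a
# concrete value, so the future holding g.T @ g must be evaluated
# (blocking) before the branch is taken.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

g = np.array([1e-4, 2e-4])
tol = 1e-6
with ThreadPoolExecutor() as pool:
    residual = pool.submit(lambda: g.T @ g)  # scheduled asynchronously
    converged = residual.result() <= tol     # block: `if` needs a bool
print(bool(converged))  # → True
```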
PROBLEM 3
To scale on distributed memory, we must both avoid high network overheads and load-balance the computation among the different nodes.
SOLUTION
We designed a novel scheduler, Cyclic Random Tree Search (CRTS), which combines a traditional block-cyclic data layout with an objective-based operation scheduler.
STRUCTURES
Block-Cyclic Data Layout
- NumS decomposes n-dimensional arrays into blocks: a BlockArray with shape = (12, 4), block_shape = (3, 2), and grid_shape = (4, 2) holds 8 blocks (numbered 1-8).
- A cluster of nodes is represented as an n-dimensional grid, e.g. grid_shape = (2, 2).
- Persistent arrays are dispersed cyclically over the n-dim grid of nodes: blocks 1 and 5 map to one node, 2 and 6 to the next, then 3 and 7, and 4 and 8.
- Balances data load and locality for the optimizer.
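A minimal sketch of a block-cyclic mapping under the shapes above (hypothetical helper code, not NumS internals): block (r, c) of the array grid is assigned to node (r mod 2, c mod 2) of the cluster grid, cycling blocks over nodes so data load stays balanced.

```python
# Hypothetical sketch of a block-cyclic layout: block (r, c) of the
# BlockArray grid maps to node (r % nodes_r, c % nodes_c) of the
# cluster grid.
from collections import Counter

grid_shape = (4, 2)      # blocks along each axis of the BlockArray
cluster_shape = (2, 2)   # nodes arranged as a 2-D grid

placement = {
    (r, c): (r % cluster_shape[0], c % cluster_shape[1])
    for r in range(grid_shape[0])
    for c in range(grid_shape[1])
}

# Each of the 4 nodes receives the same number of blocks.
load = Counter(placement.values())
print(load)  # → each of the 4 nodes holds 2 blocks
```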
OPTIMIZATION
Optimizer
- Cluster State: Estimate memory and network load on each node using array size.
- Objective: Place operations so that maximum memory and network load over all
nodes is minimized.
- Computation State: An array-of-trees data structure on which we perform
computations.
- Cyclic Random Tree Search (CRTS): An iterative algorithm that places a single
operation per iteration according to the objective, and updates both the cluster state
and computation state.
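A toy sketch of a CRTS-style loop under assumed block sizes and a memory-only objective (the real scheduler also models network load, and every name and number here is hypothetical): each iteration samples a random ready operation and places it on the node minimizing the maximum simulated memory load, updating cluster state as it goes.

```python
# Hypothetical CRTS-style loop (not NumS's implementation): randomly
# sample a frontier operation, place it greedily against a simulated
# cluster state, update the state, repeat.
import random

random.seed(0)
ops = {"op1": 16, "op2": 16, "op3": 32}  # op -> output size (assumed units)
frontier = ["op1", "op2", "op3"]
mem = {"node1": 0, "node2": 0}           # simulated memory load per node

placement = {}
while frontier:
    op = frontier.pop(random.randrange(len(frontier)))  # random frontier sample
    # Objective: choose the node whose placement minimizes the max memory load.
    node = min(mem, key=lambda n: max(mem[m] + (ops[op] if m == n else 0)
                                      for m in mem))
    mem[node] += ops[op]
    placement[op] = node

print(placement, mem)
```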
OPTIMIZATION
Representations of Tensor Dot
- Syntactic Representation: A @ B, where A is 4 by 8 and B is 8 by 4.
- BlockArray Representation: A is partitioned into blocks a1, a2 and B into blocks b1, b2, where each a_i is 4 by 4 and each b_i is 4 by 4.
- GraphArray Representation: the expression tree (a1 @ b1) + (a2 @ b2), whose @ operations are the frontier nodes.
OPTIMIZATION
CRTS: Optimization of Tensor Dot
- Start from the GraphArray representation (a1 @ b1) + (a2 @ b2), with the @ operations as frontier nodes.
- Randomly sample a frontier node, e.g. a1 @ b1.
- Schedule the operation based on cluster simulation: place a1 @ b1 on device1 or device2, given where the blocks a1, b1, a2, b2 reside.
OPTIMIZATION
CRTS: Objective
Schedule op on device1:
            memory  net_in  net_out
  device1   48      0       0
  device2   32      0       0

Schedule op on device2:
            memory  net_in  net_out
  device1   32      0       32
  device2   48      32      0
Capture desired scheduling behavior in a simple
objective:
● s_i corresponds to scheduling option i from
state s. We minimize this objective over i.
● M is the vector of memory load of each node.
● I is the vector of input load of each node.
● O is the vector of output load of each node.
● The infinity norm computes the max value of
a vector.
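One plausible reading of this objective, sketched in Python (the exact combination of norms is an assumption on my part; the slide lists the vectors M, I, O and the infinity norm): each option's cost is the sum of the infinity norms of the memory, input, and output load vectors, and the option with the smallest cost is chosen, which agrees with the tables above.

```python
# Hypothetical sketch of the scheduling objective: for each placement
# option, take per-node memory, input, and output load vectors and sum
# their infinity norms (max entries); pick the cheapest option.
def objective(mem, net_in, net_out):
    return max(mem) + max(net_in) + max(net_out)

# Loads per node (device1, device2) for the two options from the slide.
opt1 = objective(mem=[48, 32], net_in=[0, 0], net_out=[0, 0])    # on device1
opt2 = objective(mem=[32, 48], net_in=[0, 32], net_out=[32, 0])  # on device2

best = min([(opt1, "device1"), (opt2, "device2")])
print(best)  # → (48, 'device1')
```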
RESULTS
Does CRTS improve performance?
*End-to-end logistic regression (Newton) on 128GB data. 16 nodes. 64 workers and 512GB RAM / node.
RESULTS
How does NumS compare?
Setup
- 512GB RAM/node, 2.5GB/s network, 32 workers/node.
- Measures sample, execution, and evaluation time.
- Datasets are partitioned into 2GB blocks.
RESULTS
How does NumS compare?
Setup
- 512GB RAM/node, 2.5GB/s network, 32 workers/node.
- Measures sample, execution, and evaluation time.
- Dataset partitioning is tuned for the best performance of each library.
RESULTS
Can NumS Run on GPUs?
RESULTS
Can NumS Run on GPUs?
Setup
- 4 NVIDIA Tesla V100 w/ 200GB/s NVLink.
- 1.25GB/s network bandwidth.
- 2 nodes used at 20 and 40 GB.
RESULTS
Can NumS Run on GPUs?
4 NVIDIA Tesla V100 GPUs connected via NVLink at 200GB/s.
RESULTS
Can NumS solve real data science problems?
Tool Stack                       Load (s)        Train (s)              Predict (s)  Total (s)
                                 (7.5GB HIGGS)   (Logistic Regression)
Pandas + NumPy + Scikit-Learn    65.55           44.75                  0.43         110.8
NumS (1 node)                    11.79           3.21                   0.20         15.9
NumS (2 nodes)                   7.88            2.92                   0.18         11.53

7x Speedup
*Logistic Regression trained on the HIGGS dataset (7.5GB) on a single 64 core 512GB node.
OPEN SOURCE RELEASE
NumS 0.2.1
pip install nums
• Tested on Python 3.7, 3.8, and 3.9.
• Runs on latest version of Ray (1.3 as of this talk).
• Runs on Windows.
OPEN SOURCE RELEASE
NumS Features
• Full support for array assignment, broadcasting, and basic operations.
• I/O support for distributed file systems, S3, and CSV files.
• Prepackaged support for GLMs.
• Experimental integration with Modin (DataFrames) and XGBoost (Tree-based models).
• Expanding NumPy API coverage through contributions from Berkeley undergrads!
○ Mohamed Elgharbawy
○ Balaji Veeramani
○ Brian Park
○ Daniel Zou
○ Priyans Desai
○ Sehoon Kim
FUTURE WORK
Future Work
• Integrate GPU support.
• Add support for sparse arrays.
• Continue to improve memory and runtime performance.
• Continue expanding API coverage.
• Continue adding support for Linear Algebra and Machine Learning.
○ LU and Cholesky decomposition, matrix inversion, and multi-class logistic
regression are in the works.
Thank you
Project Members:
Melih Elibol
Samyu Yagati
Lianmin Zheng
Vinamra Benara
Devin Petersohn
Alvin Cheung
Michael I. Jordan
Ion Stoica, U.C. Berkeley
—
Suresh Saggar, Amazon Core AI
Inderjit Dhillon, Amazon Search
Graphic Design:
Mike Matthews, VERT
[email protected]
Project info:
github.com/nums-project
Scalable Numerical Array
Programming for the Cloud.
“NumS” and the “N-Dimensional Logo” are protected under a Creative Commons 4.0 Attribution License. All other trademarks are the property of their respective owners.