Slide 1


Slide 2

Melih Elibol, Samyu Yagati, Lianmin Zheng, Vinamra Benara, Devin Petersohn, Suresh Saggar, Alvin Cheung, Michael I. Jordan, Ion Stoica. University of California, Berkeley.

NumS is an open-source project publicly available under the Apache 2.0 license. github.com/nums-project

Slide 3


MOTIVATION Improve the runtime of multi-dimensional array programs in Python, and enable Python’s scientific computing community to analyze and model larger datasets.

Slide 4


MOTIVATION An ideal Python solution seamlessly parallelizes and scales NumPy-like array operations, allowing scientists and statisticians to leverage their existing programming knowledge.

Slide 5


PROBLEM 1 To effectively parallelize NumPy-like code, we must determine dependencies between operations, and then concurrently execute any independent operations.

Slide 6

PROBLEM 2 We need to avoid high overheads from parallelization. For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads.

Slide 7


PROBLEM 3 To scale array operations on distributed memory, we must both avoid high network overheads and load-balance the data among the different nodes.

Slide 8

SOLUTIONS Related Solutions
• NumPy is primarily serial, with shared-memory parallelism for basic linear algebra via the system's BLAS implementation.
• Existing single-program, multiple-data (SPMD) distributed-memory solutions present Python users with an unfamiliar programming model.
• Block-partitioned Python array libraries rely on task-graph scheduling heuristics optimized for general workloads instead of array operations.

Slide 9


NumS OUR SOLUTION Scalable Numerical Array Programming for the Cloud.

Slide 10

DESIGN NumS Data Flow Diagram
[Diagram components: User Application; NumPy API; Array Application State (BlockArray, GraphArray, Blocks); I/O; Compute Manager; Scheduler; Cluster State; Distributed System Interface; Distributed System Nodes; Persistent Storage.]

Slide 11

PROBLEM 1 To effectively parallelize NumPy-like code, we must determine dependencies between operations, and then concurrently execute any independent operations.

SOLUTION NumS exposes a NumPy-compatible array abstraction defined in terms of futures, allowing the scheduler to see the computation graph in advance and parallelize execution.

Slide 12


SOLUTION 1 Futures and Promises

Slide 13

SOLUTION 1 Array Access Dependency Resolution

x = A[:, i].T @ B[:, i]
y = A[:, j].T @ B[:, i]
z = x * y

[Diagram: serial execution runs x, y, z in sequence on proc0; with futures and concurrency, the independent x and y run on proc0 and proc1 in parallel, and z waits on both. Edges distinguish resource dependencies from data dependencies.]
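The dependency structure above can be sketched in a few lines of plain Python: a minimal illustration using NumPy and concurrent.futures as a stand-in for NumS; the independent x and y are submitted together, while z consumes both results. The array contents and indices here are made up.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

A = np.arange(12.0).reshape(4, 3)
B = np.arange(12.0, 24.0).reshape(4, 3)
i, j = 0, 1

with ThreadPoolExecutor() as pool:
    # x and y share no data dependency, so both run concurrently.
    fx = pool.submit(lambda: A[:, i].T @ B[:, i])
    fy = pool.submit(lambda: A[:, j].T @ B[:, i])
    # z has a data dependency on both futures, so it blocks on them.
    z = fx.result() * fy.result()
```

Blocking happens only at the point where a value is actually needed, which is exactly what lets a futures-based scheduler overlap independent work.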

Slide 14

PROBLEM 2 We need to avoid high overheads from parallelization. For example, the naive approach of sending one RPC per element-wise array operation poses untenable overheads.

SOLUTION We coarsen operations by partitioning arrays into a grid of blocks and perform array operations block-wise, rather than element-wise.

Slide 15

SOLUTION 2 Block Partitioned Arrays
BlockArray: shape = (4, 6), block_shape = (2, 2), grid_shape = (2, 3).
[Diagram: the 4 x 6 array with each entry labeled by its index; one array entry is highlighted.]

Slide 16

SOLUTION 2 Block Partitioned Arrays
BlockArray: shape = (4, 6), block_shape = (2, 2), grid_shape = (2, 3).
[Diagram: the same 4 x 6 array; one 2 x 2 block is highlighted.]

Slide 17

SOLUTION 2 Block Partitioned Arrays
BlockArray: shape = (4, 6), block_shape = (2, 2), grid_shape = (2, 3).
[Diagram: the 2 x 3 grid of blocks, with grid entries (0,0) through (1,2); one grid entry is highlighted.]
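The relationship between shape, block_shape, and grid_shape above is ceiling division along each dimension; a minimal sketch (the helper name is mine, not part of the NumS API).

```python
import math

def grid_shape(shape, block_shape):
    # One grid entry per block; a partial block at an edge still
    # occupies a grid cell, hence ceiling division.
    return tuple(math.ceil(s / b) for s, b in zip(shape, block_shape))

# The example from the slide: a (4, 6) array in (2, 2) blocks.
gs = grid_shape((4, 6), (2, 2))
```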

Slide 18

SOLUTION 2 Execution on Ray: RPC

import nums
x: BlockArray = nums.read("data/x")

read returns immediately, executing the tasks required to construct x asynchronously. [Diagram: the driver on node0 dispatches read(part0) and read(part1) RPCs to workers on node1 and node2, which load the partitions from storage.]

Slide 19

SOLUTION 2 Execution on Ray: Return

import nums
x: BlockArray = nums.read("data/x")

Return values of the RPCs are put in the local object store. [Diagram: the workers on node1 and node2 place obj0 and obj1 in their local object stores as read(part0) and read(part1) complete.]

Slide 20

SOLUTION 2 Execution on Ray: References

import nums
x: BlockArray = nums.read("data/x")

Objects are held in the store as long as a reference to the object exists in the application. [Diagram: the driver on node0 holds ref0 and ref1, which point to obj0 and obj1 in the object stores on node1 and node2.]
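The asynchronous-read-plus-references pattern above can be mimicked with ordinary Python futures; a sketch using concurrent.futures as a stand-in for Ray's tasks and object store. load_block and the partition contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def load_block(part_id):
    # Stand-in for read(partN): pretend each partition holds two values.
    return [part_id * 10, part_id * 10 + 1]

with ThreadPoolExecutor(max_workers=2) as pool:
    # "read" returns immediately: the driver holds only references
    # (futures) while the blocks load asynchronously.
    refs = [pool.submit(load_block, p) for p in (0, 1)]
    # Blocking on a reference materializes the object, the way a
    # distributed object store hands back a stored value.
    x = [r.result() for r in refs]
```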

Slide 21

SOLUTION 1 API Example
● Load X, y and initialize beta concurrently as block-partitioned arrays.

import nums
import nums.numpy as nps

X: BlockArray = nps.read("data/X")
y: BlockArray = nps.read("data/y")
beta: BlockArray = nps.zeros(X.shape[1])
for i in range(max_iter):
    mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
    g: BlockArray = X.T @ (mu - y)
    h: BlockArray = (X.T * mu * (1 - mu)) @ X
    beta -= nps.inv(h) @ g
    if g.T @ g <= tol:
        break

Slide 22

SOLUTION 1 API Example
● Execute all operations in the loop body concurrently.
● All operations are executed block-wise.

import nums
import nums.numpy as nps

X: BlockArray = nps.read("data/X")
y: BlockArray = nps.read("data/y")
beta: BlockArray = nps.zeros(X.shape[1])
for i in range(max_iter):
    mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
    g: BlockArray = X.T @ (mu - y)
    h: BlockArray = (X.T * mu * (1 - mu)) @ X
    beta -= nps.inv(h) @ g
    if g.T @ g <= tol:
        break

Slide 23

SOLUTION 1 API Example
● Evaluate the termination condition concurrently.
● Block until complete, to perform the branching operation on the driver process.

import nums
import nums.numpy as nps

X: BlockArray = nps.read("data/X")
y: BlockArray = nps.read("data/y")
beta: BlockArray = nps.zeros(X.shape[1])
for i in range(max_iter):
    mu: BlockArray = 1 / (1 + nps.exp(-X @ beta))
    g: BlockArray = X.T @ (mu - y)
    h: BlockArray = (X.T * mu * (1 - mu)) @ X
    beta -= nps.inv(h) @ g
    if g.T @ g <= tol:
        break

Slide 24

PROBLEM 3 To scale on distributed memory, we must both avoid high network overheads and load-balance the computation among the different nodes.

SOLUTION We designed a novel scheduler, Cyclic Random Tree Search (CRTS), which combines a traditional block-cyclic data layout with an objective-based operation scheduler.

Slide 25

STRUCTURES Block-Cyclic Data Layout
- NumS decomposes n-dimensional arrays into blocks: BlockArray shape = (12, 4), block_shape = (3, 2), grid_shape = (4, 2).
- A cluster of nodes is represented as an n-dimensional grid (cluster grid_shape = (2, 2)).
- Persistent arrays are dispersed over the n-dimensional grid of nodes.
- This balances data load and preserves locality for the optimizer.
[Diagram: blocks 1 through 8 of the array grid cycle over the 2 x 2 node grid, so the four nodes hold blocks {1, 5}, {2, 6}, {3, 7}, and {4, 8}.]
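The block-cyclic layout above is a modular mapping from array-grid entries to node-grid entries; a minimal sketch assuming row-major block numbering, with the (4, 2) array grid and (2, 2) cluster grid from the slide.

```python
def cyclic_node(grid_entry, cluster_shape):
    # Grid entry (i, j) lands on node (i mod R, j mod C) of an
    # R x C cluster grid, so consecutive blocks cycle over nodes.
    return tuple(g % c for g, c in zip(grid_entry, cluster_shape))

# A (4, 2) array grid dispersed over a (2, 2) cluster grid.
layout = {(i, j): cyclic_node((i, j), (2, 2))
          for i in range(4) for j in range(2)}
```

With eight blocks over four nodes, each node receives exactly two blocks, which is the balanced-load property the slide points at.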

Slide 26

OPTIMIZATION Optimizer
- Cluster State: Estimates the memory and network load on each node using array sizes.
- Objective: Place operations so that the maximum memory and network load over all nodes is minimized.
- Computation State: An array-of-trees data structure on which we perform computations.
- Cyclic Random Tree Search (CRTS): An iterative algorithm that places a single operation per iteration according to the objective, then updates both the cluster state and the computation state.

Slide 27

OPTIMIZATION Execution of Element-wise Addition
[Diagram: two BlockArrays with blocks a1 through a4 and b1 through b4 are converted into GraphArrays of leaf nodes.]

Slide 28

OPTIMIZATION Execution of Element-wise Addition
[Diagram: the Add operation combines the two GraphArrays into a GraphArray of element-wise add nodes, pairing each a_i with b_i.]

Slide 29

OPTIMIZATION Execution of Element-wise Addition
[Diagram: the optimizer consumes the add GraphArray and produces a GraphArray of scheduled result blocks c1 through c4.]

Slide 30

OPTIMIZATION Execution of Element-wise Addition
[Diagram: the result GraphArray materializes as a BlockArray with blocks c1 through c4.]
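The pipeline above reduces an element-wise add to one independent task per block pair; a minimal NumPy sketch (the block values are made up).

```python
import numpy as np

def block_add(a_blocks, b_blocks):
    # Each pair of corresponding blocks is an independent task; the
    # results are the blocks c1..c4 of the output BlockArray.
    return [a + b for a, b in zip(a_blocks, b_blocks)]

a = [np.full((2, 2), k) for k in (1, 2, 3, 4)]        # a1..a4
b = [np.full((2, 2), 10 * k) for k in (1, 2, 3, 4)]   # b1..b4
c = block_add(a, b)
```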

Slide 31

OPTIMIZATION Representations of Tensor Dot
- Syntactic representation: A @ B, where A is 4 by 8 and B is 8 by 4.
- BlockArray representation: A is split into blocks a1, a2 and B into blocks b1, b2, where each a_i and b_i is 4 by 4.
- GraphArray representation: the tree (a1 @ b1) + (a2 @ b2), whose ready operations form the frontier nodes.
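The block-level decomposition above can be checked in a few lines of NumPy; a sketch assuming the 4 x 8 and 8 x 4 shapes from the slide, with random data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 4))

# Split A column-wise and B row-wise into 4 x 4 blocks.
a1, a2 = A[:, :4], A[:, 4:]
b1, b2 = B[:4, :], B[4:, :]

# Two independent matmuls plus one add: (a1 @ b1) + (a2 @ b2).
C = a1 @ b1 + a2 @ b2
```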

Slide 32

OPTIMIZATION CRTS: Optimization of Tensor Dot
[Diagram: from the GraphArray (a1 @ b1) + (a2 @ b2), CRTS randomly samples a frontier node, e.g. a1 @ b1, then schedules that operation on device1 or device2 based on a cluster simulation.]

Slide 33

OPTIMIZATION CRTS: Objective
Capture the desired scheduling behavior in a simple objective:
● s_i corresponds to scheduling option i from state s. We minimize this objective over i.
● M is the vector of memory load of each node.
● I is the vector of input load of each node.
● O is the vector of output load of each node.
● The infinity norm computes the max value of a vector.

Schedule op on device1:   memory  net_in  net_out
    device1                 48      0       0
    device2                 32      0       0

Schedule op on device2:   memory  net_in  net_out
    device1                 32      0       32
    device2                 48      32      0
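A small sketch of this objective, under the assumption that it sums the infinity norms of M, I, and O (the exact combination is not reproduced on this slide); the numbers follow the two tables above.

```python
def objective(M, I, O):
    # max() over nodes is the infinity norm of each load vector.
    return max(M) + max(I) + max(O)

# Option 1: schedule the op on device1 (both inputs are local).
cost1 = objective(M=[48, 32], I=[0, 0], O=[0, 0])
# Option 2: schedule on device2 (inputs must cross the network).
cost2 = objective(M=[32, 48], I=[0, 32], O=[32, 0])
best = 1 if cost1 <= cost2 else 2
```

Under this reading, device1 wins because it adds memory load without any network transfer.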

Slide 34


RESULTS Does CRTS improve performance?

Slide 35


RESULTS Does CRTS improve performance? *End-to-end logistic regression (Newton) on 128GB data. 16 nodes. 64 workers and 512GB RAM / node.

Slide 36


RESULTS Does CRTS improve performance? *End-to-end logistic regression (Newton) on 128GB data. 16 nodes. 64 workers and 512GB RAM / node.

Slide 37


RESULTS Does NumS scale? (Weak Scaling)

Slide 38

RESULTS Does NumS scale? (Weak Scaling)
Setup
- 512GB RAM/node
- 2.5GB/s network
- 32 workers/node
- 2GB data/worker

Slide 39

RESULTS Does NumS scale? (Weak Scaling)
Setup
- 512GB RAM/node
- 2.5GB/s network
- 32 workers/node
- 2GB data/worker

Slide 40


How does NumS compare? RESULTS

Slide 41

RESULTS How does NumS compare?
Setup
- 512GB RAM/node, 2.5GB/s network, 32 workers/node.
- Measures sample, execution, and evaluation time.
- Datasets are partitioned into 2GB blocks.

Slide 42

RESULTS How does NumS compare?
Setup
- 512GB RAM/node, 2.5GB/s network, 32 workers/node.
- Measures sample, execution, and evaluation time.
- Dataset partitioning is tuned for best performance for each library.

Slide 43


RESULTS Can NumS Run on GPUs?

Slide 44

RESULTS Can NumS Run on GPUs?
Setup
- 4 NVIDIA Tesla V100 GPUs with 200GBps NVLink
- 1.25GB/s network bandwidth
- 2 nodes used, at 20GB and 40GB.

Slide 45


RESULTS 4 NVIDIA Tesla V100 GPUs connected via NVLink at 200GBps Can NumS Run on GPUs?

Slide 46

RESULTS Can NumS solve real data science problems?

Tool Stack                   | Load (s) | Train (s) | Predict (s) | Total (s)
Pandas, NumPy, Scikit-Learn  |  65.55   |  44.75    |  0.43       |  110.8
NumS (1 node)                |  11.79   |   3.21    |  0.20       |   15.9
NumS (2 nodes)               |   7.88   |   2.92    |  0.18       |   11.53

Load reads the 7.5GB HIGGS dataset; Train fits logistic regression. NumS on a single node is a 7x speedup over the Pandas/NumPy/Scikit-Learn stack.
*Logistic regression trained on the HIGGS dataset (7.5GB) on a single 64-core 512GB node.

Slide 47

OPEN SOURCE RELEASE NumS 0.2.1

pip install nums

• Tested on Python 3.7, 3.8, and 3.9.
• Runs on the latest version of Ray (1.3 as of this talk).
• Runs on Windows.

Slide 48

OPEN SOURCE RELEASE NumS Features
• Full support for array assignment, broadcasting, and basic operations.
• I/O support for distributed file systems, S3, and CSV files.
• Prepackaged support for GLMs.
• Experimental integration with Modin (DataFrames) and XGBoost (tree-based models).
• Expanding NumPy API coverage through contributions from Berkeley undergrads!
  ○ Mohamed Elgharbawy
  ○ Balaji Veeramani
  ○ Brian Park
  ○ Daniel Zou
  ○ Priyans Desai
  ○ Sehoon Kim

Slide 49

FUTURE WORK Future Work
• Integrate GPU support.
• Add support for sparse arrays.
• Continue to improve memory and runtime performance.
• Continue expanding API coverage.
• Continue adding support for linear algebra and machine learning.
  ○ LU and Cholesky decomposition, matrix inversion, and multi-class logistic regression are in the works.

Slide 50

Thank you

Slide 51

Project Members: Melih Elibol, Samyu Yagati, Lianmin Zheng, Vinamra Benara, Devin Petersohn, Alvin Cheung, Michael I. Jordan, Ion Stoica (U.C. Berkeley); Suresh Saggar (Amazon Core AI); Inderjit Dhillon (Amazon Search)
Graphic Design: Mike Matthews, VERT [email protected]
Project info: github.com/nums-project
Scalable Numerical Array Programming for the Cloud.
"NumS" and the "N-Dimensional Logo" are protected under a Creative Commons 4.0 Attribution License. All other trademarks are the property of their respective owners.