Slide 1

Keichi Takahashi ([email protected])
Software Design and Analysis Lab, Nara Institute of Science and Technology
Accelerating Empirical Dynamic Modeling Using High Performance Computing
University of Melbourne QMNET Talk, 8/20/2021

Slide 2

Agenda
• Empirical Dynamic Modeling
• mpEDM: Massively Parallel Empirical Dynamic Modeling
• kEDM: Fully GPU-offloaded Empirical Dynamic Modeling

Slide 3

Empirical Dynamic Modeling

Slide 4

Empirical Dynamic Modeling (EDM)
• A non-parametric method for modeling non-linear time series data.
• Reconstructs the underlying attractor manifolds from time series data.
• EDM methods:
  • Simplex projection (short-term forecasting)
  • S-Map (quantifying non-linearity)
  • Convergent Cross Mapping (identifying causal relationships)
  • …

C.W. Chang et al., "Empirical Dynamic Modeling for Beginners," Ecological Research, vol. 32, pp. 785-796, 2017.

Slide 5

State Space Reconstruction (SSR)

[Figure: a time series of observations is embedded using time lags into a reconstructed manifold; the reconstruction is related to the original shadow manifold by a diffeomorphism, so topological features are preserved.]

Ethan R. Deyle, George Sugihara, "Generalized Theorems for Nonlinear State Space Reconstruction," PLoS ONE, vol. 6, no. 3, e18295, 2011.
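To make the lag construction concrete, a small worked example (the values of $E$ and $\tau$ are chosen only for illustration): with $E = 3$ and $\tau = 2$, each scalar observation is mapped to

$$x(t) = \bigl(X(t),\, X(t-\tau),\, \ldots,\, X(t-(E-1)\tau)\bigr) = \bigl(X(t),\, X(t-2),\, X(t-4)\bigr) \in \mathbb{R}^{3},$$

and the cloud of these points forms the reconstructed manifold.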

Slide 6

Simplex Projection (Short-term Forecast)

1. Create the time-delayed embedding of $X(t)$: $x(t) = (X(t), X(t-\tau), \ldots, X(t-(E-1)\tau))$

2. Find the $k = E+1$ nearest neighbors of $x(t)$ in the state space: $x(t_1), x(t_2), \ldots, x(t_{E+1})$

3. Compute the exponentially weighted average of the neighbors, projected $T_p$ steps forward:

$$\hat{x}(t+T_p) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, x(t_i + T_p), \qquad w_i = \exp\left\{-\frac{\lVert x(t) - x(t_i)\rVert}{\min_j \lVert x(t) - x(t_j)\rVert}\right\}$$
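A minimal, single-threaded C++ sketch of the three steps above (this is not the cppEDM/kEDM implementation; the embedding-window handling and the degenerate-distance guard are simplified):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Squared Euclidean distance between the E-dimensional time-delayed
    // embeddings of time steps i and j (embedding formed on the fly);
    // both i and j must be >= (E - 1) * tau.
    float dist2(const std::vector<float>& X, int i, int j, int E, int tau) {
        float d = 0.0f;
        for (int k = 0; k < E; k++) {
            float diff = X[i - k * tau] - X[j - k * tau];
            d += diff * diff;
        }
        return d;
    }

    // Predict X(t + Tp) from the E + 1 nearest neighbors of x(t).
    float simplex(const std::vector<float>& X, int t, int E, int tau, int Tp) {
        int start = (E - 1) * tau;            // first fully embeddable step
        int end = (int)X.size() - Tp;         // neighbors need a future value

        // Collect (distance, index) pairs for all candidate library points
        std::vector<std::pair<float, int>> cand;
        for (int j = start; j < end; j++)
            if (j != t) cand.push_back({dist2(X, t, j, E, tau), j});

        // Brute-force top-(E + 1) by partial sort
        int k = E + 1;
        std::partial_sort(cand.begin(), cand.begin() + k, cand.end());

        // Exponential weights relative to the nearest neighbor
        float dmin = std::max(std::sqrt(cand[0].first), 1e-30f);
        float pred = 0.0f, wsum = 0.0f;
        for (int i = 0; i < k; i++) {
            float w = std::exp(-std::sqrt(cand[i].first) / dmin);
            pred += w * X[cand[i].second + Tp];
            wsum += w;
        }
        return pred / wsum;
    }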

Slide 7

Convergent Cross Mapping (Causal Inference)

1. Create time-delayed embeddings of both $X(t)$ and $Y(t)$.

2. Find the $E+1$ nearest neighbors of $x(t)$ in the state space: $x(t_1), x(t_2), \ldots, x(t_{E+1})$

3. Compute the exponentially weighted average of the neighbors' counterparts in $Y$:

$$\hat{y}(t+T_p) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y(t_i + T_p), \qquad w_i = \exp\left\{-\frac{\lVert x(t) - x(t_i)\rVert}{\min_j \lVert x(t) - x(t_j)\rVert}\right\}$$

4. If $X(t)$ predicts $Y(t)$ with high accuracy, $Y(t)$ "CCM-causes" $X(t)$.
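The slide leaves "high accuracy" implicit; in the CCM literature the cross-map skill is conventionally the Pearson correlation ρ between the observed and cross-mapped values. A minimal helper (illustrative, not the cppEDM API):

    #include <cmath>
    #include <vector>

    // Pearson correlation between observed y and cross-mapped y_hat.
    float pearson(const std::vector<float>& y, const std::vector<float>& yhat) {
        int n = (int)y.size();
        float my = 0, myh = 0;
        for (int i = 0; i < n; i++) { my += y[i]; myh += yhat[i]; }
        my /= n; myh /= n;
        float cov = 0, vy = 0, vyh = 0;
        for (int i = 0; i < n; i++) {
            cov += (y[i] - my) * (yhat[i] - myh);
            vy  += (y[i] - my) * (y[i] - my);
            vyh += (yhat[i] - myh) * (yhat[i] - myh);
        }
        return cov / std::sqrt(vy * vyh);
    }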

Slide 8

Our Use Case
• We employ CCM to investigate the causal relationships between every pair of neurons in a larval zebrafish brain.
• This requires pairwise CCM between 100K time series, each with 10K time steps: 100K × 100K = 10 billion cross mappings!

[Figure: larval zebrafish → light sheet microscopy → neuronal activity → causal relationships]

Slide 9

Sugihara Lab EDM implementations

[Diagram: rEDM and pyEDM are language bindings on top of the cppEDM core library, which depends on the C++ STL and LAPACK.]

Slide 10

Prototype using cppEDM
• cppEDM on 510 ABCI compute nodes took 8.5 hours to analyze 53K time series, each with 1.5K time steps.
• Assuming ideal scaling, our target dataset (100K time series w/ 10K time steps) would take ~60 days to analyze on 510 compute nodes.

Joseph Park, Gerald Pao, Cameron Smith and George Sugihara, "Massively Parallel Empirical Dynamic Cross Mapping," 37th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 37), Sep. 2019.

Slide 11

Observations from profiling cppEDM
• k-nearest neighbor search is the primary bottleneck, so optimization effort should focus there.
• Memory allocation also takes significant time, because buffers are dynamically allocated during computation.
• There is a large load imbalance between compute nodes, caused by the static decomposition of work among nodes.

[Bar chart: runtime breakdown in %, dominated by k-NN search, followed by memory allocation and others.]

Wassapon Watanakeesuntorn, Kohei Ichikawa, Keichi Takahashi, Jason Haga, Gerald Pao, "Calculation of Complete Zebrafish Brain Neural Activities on ABCI," 36th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 36), Apr. 2019.

Slide 12

mpEDM: Massively Parallel EDM

Slide 13

Key Idea: Pre-computed Lookup Tables of Neighbors

When computing one-to-all CCM, the distances and indices of the k-nearest neighbors only need to be computed once; the same weights and indices are then reused for every target series:

$$\hat{y}_n(t) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y_n(t_i) \qquad (n = 1, 2, 3, \ldots)$$

Slide 14

Improved Pairwise CCM Algorithm

For every time series (effect):
1. Embed it into state space using time lags.
2. Run all k-NN searches in state space, producing a lookup table of neighbor indices and weights.
3. Batch predict every other time series (cause) using the table.

[Figure: a table of neighbor indices per time step and a table of the corresponding normalized weights.]
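A minimal sketch of this loop structure in plain C++ (illustrative, not mpEDM's actual code): knn_table() performs the k-NN search once per effect series, and lookup() then reduces each cross mapping to cheap table lookups. Tp = 0 and edge handling are simplified.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Neighbors {                 // per time step: E + 1 indices and weights
        std::vector<int>   idx;
        std::vector<float> w;
    };

    // One k-NN search per time step of the embedded effect series; done once.
    std::vector<Neighbors> knn_table(const std::vector<float>& X, int E, int tau) {
        int start = (E - 1) * tau, n = (int)X.size(), k = E + 1;
        std::vector<Neighbors> table;
        for (int t = start; t < n; t++) {
            std::vector<std::pair<float, int>> cand;
            for (int j = start; j < n; j++) {
                if (j == t) continue;
                float d = 0.0f;
                for (int l = 0; l < E; l++) {
                    float diff = X[t - l * tau] - X[j - l * tau];
                    d += diff * diff;
                }
                cand.push_back({std::sqrt(d), j});
            }
            std::partial_sort(cand.begin(), cand.begin() + k, cand.end());

            Neighbors nb;
            float d0 = std::max(cand[0].first, 1e-30f), wsum = 0.0f;
            for (int i = 0; i < k; i++) {
                nb.idx.push_back(cand[i].second);
                nb.w.push_back(std::exp(-cand[i].first / d0));
                wsum += nb.w.back();
            }
            for (auto& w : nb.w) w /= wsum;  // normalize once, reuse for all causes
            table.push_back(nb);
        }
        return table;
    }

    // Cross map any cause series through the precomputed table: pure lookups.
    std::vector<float> lookup(const std::vector<Neighbors>& table,
                              const std::vector<float>& Y) {
        std::vector<float> pred;
        for (const auto& nb : table) {
            float y = 0.0f;
            for (size_t i = 0; i < nb.idx.size(); i++) y += nb.w[i] * Y[nb.idx[i]];
            pred.push_back(y);
        }
        return pred;
    }

Pairwise CCM then nests these two: for each effect series, knn_table() runs once and lookup() is applied to all (e.g. 100K) cause series, which is what makes the amortization pay off.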

Slide 15

k-Nearest Neighbor Search Algorithms
• Exact methods: full search (Faiss)
• Approximate methods:
  • Proximity graphs (NMSLIB, NGT)
  • Locality-sensitive hashing (FALCONN)
  • Space partitioning (Annoy, FLANN, ScaNN)
  • Product quantization (Faiss, ScaNN)

Yusuke Matsui, "Billion-scale Approximate Nearest Neighbor Search," CVPR 2020 Tutorial.

Slide 16

Choosing the k-NN search algorithm
• Data dimension (i.e. embedding dimension) < 20: relatively low dimensionality.
• Number of points (i.e. length of time series) < 10,000: the cost of constructing index structures doesn't pay off.
• The distance matrix can fit in CPU/GPU memory (4 B × 10,000² = 400 MB).
• We need an exact solution: it is unknown how approximation affects the CCM result.

⇒ We take the brute-force search approach.

Slide 17

CUDA Ecosystem
• Programming models: CUDA C/C++/Fortran, OpenMP, OpenACC, SYCL, HIP, Kokkos
• Math libraries and parallel primitives: Thrust, CUB, cuBLAS, cuDNN, cuFFT, cuTENSOR, cuSPARSE, cuSOLVER
• High-level libraries: ArrayFire, PyTorch, TensorFlow, CuPy

Slide 18

Hierarchical Parallelism
• Whole cluster: 537 PFLOPS (432 racks/cluster)
• Rack: 1.3 PFLOPS (384 nodes/rack)
• Node: 3.38 TFLOPS (48 cores/CPU)
• Core: 70 GFLOPS

Slide 19

AI Bridging Cloud Infrastructure (ABCI)

Slide 20

ABCI Compute Node (Type V)

Slide 21

Inter-node Design
• A master process distributes work; workers asynchronously fetch and execute work (small reads/writes between master and workers).
• Bulk reads/writes go to an on-demand burst buffer (BeeOND) backed by the parallel file system (Lustre).
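The slide does not spell out the protocol; below is a minimal sketch of the asynchronous work-fetching pattern using plain MPI. The tags, payloads, and the task granularity (one effect time series per request) are illustrative assumptions, not mpEDM's actual message layout.

    #include <mpi.h>

    const int TAG_REQUEST = 0, TAG_WORK = 1, DONE = -1;

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int num_tasks = 1000;   // e.g. one task = one effect time series

        if (rank == 0) {              // master: hand out tasks on demand
            int next = 0, active = size - 1;
            while (active > 0) {
                int dummy;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &st);
                int task = (next < num_tasks) ? next++ : DONE;
                if (task == DONE) active--;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            }
        } else {                      // worker: fetch work until told to stop
            for (;;) {
                int dummy = 0, task;
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (task == DONE) break;
                // ... run one-to-all CCM for time series `task` ...
            }
        }
        MPI_Finalize();
    }

Because workers pull tasks only when idle, slow and fast nodes balance out automatically, which addresses the load imbalance observed with cppEDM's static decomposition.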

Slide 22

Intra-node Design
• Step 1 (CPU): embed each time series into state space.
• Step 2 (GPU): run all k-NN searches, producing the neighbor index and weight tables.
• Step 3 (CPU): batch prediction using the lookup tables.

[Figure: embeddings flow from the CPU to the GPU; the GPU returns the neighbor tables, which the CPU consumes for batch prediction.]

Slide 23

ArrayFire
• A general-purpose library for GPU computing.
• Provides a NumPy-like high-level interface.
• The backend invokes vendor-optimized libraries (cuBLAS, cuSPARSE, cuSOLVER, cuDNN, etc.).

https://arrayfire.com/
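A small taste of the interface (a sketch; the calls are from ArrayFire's documented array API, but version details may differ): computing an all-pairs distance matrix with a single GEMM, the same pattern a brute-force k-NN stage relies on.

    #include <arrayfire.h>

    int main() {
        const int E = 20, n = 1000;              // illustrative sizes
        af::array pts = af::randu(E, n);         // n random points in R^E

        // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, with the cross terms
        // computed by one matrix multiplication
        af::array sq   = af::sum(pts * pts, 0);        // 1 x n squared norms
        af::array dots = af::matmul(pts.T(), pts);     // n x n inner products
        af::array dist = af::tile(sq.T(), 1, n) + af::tile(sq, n, 1) - 2 * dots;

        // Sort each column: row 0 is the self-distance (0), the following
        // rows are each point's nearest neighbors
        af::array sorted, idx;
        af::sort(sorted, idx, dist, 0);

        af_print(sorted(af::seq(5), af::span));  // 5 smallest distances per point
        return 0;
    }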

Slide 24

mpEDM
• Massively parallel EDM designed for GPU-centric supercomputers (i.e. ABCI).
• Currently supports Simplex projection and CCM.
• GitHub: https://github.com/keichi/mpEDM

Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga, Gerald M. Pao, "Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution," 26th International Conference on Parallel and Distributed Systems (ICPADS 2020), Dec. 2020.

Slide 25

# of time series
Evaluation: CCM Runtime

Dataset      # of time series  # of time steps  cppEDM (512 nodes)  mpEDM (1 node)  mpEDM (512 nodes)
Fish1_Normo  53,053            1,450            8.5 h               1,973 s         20 s
Subject6     92,538            3,780            176 h*              13,953 s        101 s
Subject11    101,729           8,528            1,081 h*            39,572 s        199 s

1,530x faster and 7,941x cheaper (8,704 USD → 1.1 USD) than the cppEDM prototype. ABCI usage fee is 220 JPY (2 USD) per node-hour: https://abci.ai/en/how_to_use/tariffs.html
* Extrapolated from the Fish1_Normo runtime, not measured.

Slide 26

Evaluation: Inter-node scalability

[Two plots: strong-scaling absolute runtime (minutes) from a single node up to 511 workers, for CPU and GPU; and strong-scaling relative speedup vs. number of workers (1 to 512), for CPU and GPU.]

Slide 27

Evaluation: Intra-node scalability

[Four plots: runtime for 1K-100K time series (10K time steps each) and for 1K-40K time steps (1K time series), each broken into Simplex projection, CCM, and others; plus runtime breakdowns (% of total) between k-NN search and lookup over the same ranges.]

Slide 28

Challenges remaining in mpEDM
• Using ArrayFire's k-NN search misses optimization opportunities:
  • k-NN search could be fused with the embedding to improve efficiency.
  • k is usually small (<20) in EDM, but ArrayFire's k-NN is tuned for the general case.
• Batch prediction (lookup) is executed on the CPU.
  • We tried to implement it on the GPU using ArrayFire (GFOR, sparse arrays, etc.), but performance was poor.
• Performance portability: a CPU version had to be maintained separately for CPU-only platforms.

Slide 29

kEDM: Fully GPU-offloaded EDM

Slide 30

NVIDIA GPU Hardware Architecture

[Diagram: each SM contains CUDA cores, a scheduler, registers, and an L1 cache; the SMs share an L2 cache and global (HBM) memory. Approximate bandwidths: L1 14 TB/s, L2 3 TB/s, HBM 800 GB/s.]

Slide 31

CUDA Programming: Grids of Blocks of Threads

GPU code ("kernel"):

    __global__ void saxpy(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

CPU code (launch):

    saxpy<<<4096, 256>>>(N, 2.0, d_x, d_y);

[Diagram: a launch creates Grid 0, consisting of thread blocks (Block 0, Block 1, Block 2, …), each consisting of threads (Thread 0, Thread 1, Thread 2, …).]

Slide 32

Hardware to Software Mapping
• A CUDA core executes threads.
• An SM executes thread blocks.
• The GPU executes grids.

Slide 33

k-Nearest Neighbor Search (Distance Calculation)

$$D_{ij} = \sum_{k=0}^{E-1} \bigl(X(i - k\tau) - X(j - k\tau)\bigr)^2$$

• Thread block i computes row i of the distance matrix D; the embedding of point i, i.e. the values $X(i - k\tau)$, is cached in shared memory.
• Thread j in block i computes one element; the time-delayed embedding is performed on the fly.

Slide 34

k-Nearest Neighbor Search (Top-k Search)
• Thread block i searches for the top-k elements in row i of the distance matrix D.
• Each thread scans a segment of the row and maintains the top-k elements it has seen.
• A leader thread merges the per-thread local top-k results.
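A simplified sketch of this per-row top-k pass (the real kernel is more refined; the fixed K bound, the selection-style merge by a single leader thread, and the shared-memory layout are simplifying assumptions):

    #include <cfloat>

    #define K 32                               // compile-time bound, assumes k <= 32

    __global__ void topk_rows(const float* D, float* out_d, int* out_i, int n, int k) {
        extern __shared__ float smem[];        // blockDim.x*k distances, then indices
        float* sd = smem;
        int*   si = (int*)(smem + blockDim.x * k);

        const float* row = D + (size_t)blockIdx.x * n;
        float best_d[K];
        int   best_i[K];
        for (int t = 0; t < k; t++) { best_d[t] = FLT_MAX; best_i[t] = -1; }

        // Each thread scans a strided segment and keeps a sorted local top-k
        for (int j = threadIdx.x; j < n; j += blockDim.x) {
            float d = row[j];
            if (d >= best_d[k - 1]) continue;
            int p = k - 1;
            while (p > 0 && best_d[p - 1] > d) {
                best_d[p] = best_d[p - 1]; best_i[p] = best_i[p - 1]; p--;
            }
            best_d[p] = d; best_i[p] = j;
        }
        for (int t = 0; t < k; t++) {          // publish local results
            sd[threadIdx.x * k + t] = best_d[t];
            si[threadIdx.x * k + t] = best_i[t];
        }
        __syncthreads();

        if (threadIdx.x == 0) {                // leader merges all candidates
            for (int t = 0; t < k; t++) {
                int arg = 0;
                for (int c = 1; c < blockDim.x * k; c++)
                    if (sd[c] < sd[arg]) arg = c;
                out_d[(size_t)blockIdx.x * k + t] = sd[arg];
                out_i[(size_t)blockIdx.x * k + t] = si[arg];
                sd[arg] = FLT_MAX;             // remove and pick the next
            }
        }
    }
    // Launch (illustrative): one block per row, with shared memory sized
    // blockDim.x * k * (sizeof(float) + sizeof(int)) bytes.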

Slide 35

Batch prediction

$$\hat{Y}_i(j) = \sum_{k=1}^{E+1} w(j, k) \cdot Y_i(t(j, k))$$

• Thread block i makes the predictions for time series i (time series i is cached in shared memory).
• Thread j predicts time step j of time series i.

[Figure: the neighbor index table and the weight table from the k-NN stage drive the lookups.]
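A sketch of the lookup kernel (illustrative; the shared-memory staging of time series i mentioned on the slide is omitted here for brevity):

    __global__ void batch_lookup(const float* __restrict__ Y,    // [num_series][len]
                                 const int*   __restrict__ idx,  // [len][k] neighbor indices
                                 const float* __restrict__ w,    // [len][k] normalized weights
                                 float* Yhat, int len, int k) {
        int i = blockIdx.x;                    // which time series to predict
        const float* Yi = Y + (size_t)i * len;
        for (int j = threadIdx.x; j < len; j += blockDim.x) {
            float pred = 0.0f;
            for (int t = 0; t < k; t++)
                pred += w[j * k + t] * Yi[idx[j * k + t]];
            Yhat[(size_t)i * len + j] = pred;
        }
    }

Note that the same idx/w tables are shared by every block, which is exactly the reuse that the pre-computed lookup tables enable.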

Slide 36

Kokkos
• A performance portability framework:
  • The same source code runs on many different architectures.
  • Supports OpenMP, CUDA, HIP, SYCL and HPX as backends.
  • Implemented as a C++ template library (no special compiler required).
• Rich ecosystem:
  • kokkos-tools: profilers, tracers, debuggers, and adapters to vendor tools.
  • kokkos-kernels: dense and sparse linear algebra kernels.
• In development since 2012, with many production applications.

Slide 37

Kokkos Core Concepts
• Data-parallel patterns:
  • Parallel pattern: for, reduce, scan
  • Execution policy: scheduling, thread groups
  • Body: functor/lambda

    parallel_for(RangePolicy(N), KOKKOS_LAMBDA(int i) {
        y(i) = a * x(i) + y(i);
    });

• Views:
  • Multi-dimensional arrays
  • Automatically deallocated (reference counted)
  • Layout is architecture-dependent: row-major on CPU, column-major on GPU
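Putting the concepts together, a complete minimal Kokkos program (the saxpy example again, mirroring the earlier CUDA version; a sketch, not kEDM code). The same source builds against the OpenMP or CUDA backend:

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int N = 1 << 20;
            const float a = 2.0f;

            // Views: multi-dimensional arrays; the layout follows the backend
            Kokkos::View<float*> x("x", N), y("y", N);

            // Pattern (parallel_for) + execution policy (0..N) + body (lambda)
            Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
                x(i) = 1.0f;
                y(i) = 2.0f;
            });
            Kokkos::parallel_for("saxpy", N, KOKKOS_LAMBDA(const int i) {
                y(i) = a * x(i) + y(i);
            });
            Kokkos::fence();                   // wait for the kernels to finish
        }
        Kokkos::finalize();
        return 0;
    }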

Slide 38

kEDM
• EDM based on Kokkos.
• Currently supports Simplex projection, CCM, and S-Map.
• Documentation: https://kedm.readthedocs.io/en/latest/
• The CPU version can be installed from PyPI ($ pip install kedm); the GPU version currently needs to be installed from source.

Keichi Takahashi, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara, Gerald M. Pao, "kEDM: A Performance-portable Implementation of Empirical Dynamical Modeling," Practice & Experience in Advanced Research Computing (PEARC 2021), Jul. 2021.

Slide 39

Evaluation Environment
• Ika@Salk: Intel Xeon Gold 6148 (two sockets), NVIDIA V100, and 384 GiB RAM
• Aori@Salk: AMD EPYC 7742 (two sockets) and 1 TiB RAM

Slide 40

Evaluation: CCM Runtime for Real-world Datasets 40 # of time series # of time steps kEDM mpEDM Speedup kEDM mpEDM Speedup Fish1_Normo_ Trim 154 1,600 3s 11s 3.67x 3s 4s 1.33x Fly80XY 82 10,608 11s 50s 4.55x 22s 30s 1.36x Genes_MEF 45,318 96 344s 334s 0.97x 96s 139s 1.45x Subject6 92,538 3,780 5,391s 29,579s 5.49x 12,145s 11,571s 0.95x Subject11 101,729 8,528 20,517s 85,217s 4.15x 43,812s 38,542s 0.88x F1 8,520 29,484 11,372s 20,128s 1.77x 23,001s 19,950s 0.87x V100 and Xeon Gold 6148 EPYC 7742

Slide 41

Evaluation: Micro Benchmark for k-Nearest Neighbor Search
• Measured using an artificial time series with 10K time steps.
• EPYC: kEDM slightly (~5%) outperforms mpEDM.
• V100: kEDM is 6.4x faster than mpEDM when E = 1, but kEDM's sorting performance rapidly degrades as E increases.

[Two plots: k-NN search runtime vs. E on the AMD EPYC 7742 and on the NVIDIA V100.]

Slide 42

Evaluation: Micro Benchmark for Lookups
• Measured using 100K artificial time series, each having 10K time steps.
• EPYC: kEDM is slightly faster than mpEDM.
• V100: kEDM on the V100 outperforms the EPYC version by a factor of 3 to 4.

[Two plots: lookup runtime on the AMD EPYC 7742 and on the NVIDIA V100.]

Slide 43

Evaluation: Roofline Analysis of the Pairwise Distance Kernel
• Larger E increases the arithmetic intensity (more data reuse).
• EPYC: initially bounded by the L3 cache; with larger E it hits the L1 and L2 rooflines.
• V100: initially bounded by HBM; larger E saturates the load/store units.
• V100: vectorized loads cannot be used, since the memory alignment depends on the user-supplied parameters E and tau.

[Roofline plots for the AMD EPYC 7742 and the NVIDIA V100.]

Slide 44

Evaluation: Roofline Analysis of the Lookup Kernel
• Again, larger E increases the arithmetic intensity.
• Bounded by the L1 and L2 cache bandwidth on the EPYC and the V100, respectively.
• The k-NN lookup table fits in the EPYC's L1 cache (4 MiB) and the V100's L2 cache (6 MiB).

[Roofline plots for the AMD EPYC 7742 and the NVIDIA V100.]

Slide 45

References
1. Keichi Takahashi et al., "kEDM: A Performance-portable Implementation of Empirical Dynamical Modeling," Practice & Experience in Advanced Research Computing (PEARC 2021), Jul. 2021.
2. Wassapon Watanakeesuntorn et al., "Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution," 26th International Conference on Parallel and Distributed Systems (ICPADS 2020), Dec. 2020.
3. Joseph Park, Gerald Pao, Cameron Smith and George Sugihara, "Massively Parallel Empirical Dynamic Cross Mapping," 37th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 37), Sep. 2019.

Acknowledgments
Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, Gerald Pao, George Sugihara

Software
kEDM GitHub: https://github.com/keichi/kEDM
kEDM Documentation: https://kedm.readthedocs.io/en/latest/
mpEDM GitHub: https://github.com/keichi/mpEDM