Elizabeth Ramirez
November 05, 2020

Revamped presentation from ScaleConf 2019 for Code Mesh V 2020


## Transcript

2. ### ABOUT ME BSc Electrical Engineering, Signal Processing. MSc Applied Mathematics, Computational Science and Engineering. Applied Scientist, Maritime Intelligence. I spend a lot of my time worrying about how to process remote sensing data efficiently.
3. ### DEEP LEARNING PRIMITIVES Depend heavily on Linear Algebra. We need algorithms that make computations especially fast.
4. ### DEEP LEARNING PRIMITIVES Weights, inputs, and outputs are stored in tensors. Matrix Multiplication, Convolution, Inner Product, Transposition, Rectified Linear Unit (ReLU).
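
These primitives map directly onto array operations; here is a minimal NumPy sketch (shapes and names are illustrative, not from the slides):

```python
import numpy as np

x = np.random.randn(64, 128)      # batch of inputs
W = np.random.randn(128, 256)     # weight matrix

y = x @ W                         # matrix multiplication
dot = np.inner(x[0], x[1])        # inner product of two input vectors
Wt = W.T                          # transposition
relu = np.maximum(y, 0.0)         # Rectified Linear Unit (ReLU)

# A 2D convolution can also be lowered to a matrix multiplication
# via im2col, as discussed in the GEMM slides later in the deck.
```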

7. ### SINGULAR VALUE DECOMPOSITION (SVD) Ubiquitous in Deep Learning and Signal Processing. - Decomposes a matrix into low-rank matrices. - Reduces the model size. - Takes advantage of sparsity, given the number of small singular values.
8. ### SINGULAR VALUE DECOMPOSITION Let $A \in \mathbb{R}^{m \times n}$. Then it can be factorized as $A = U \Sigma V^T$, where the columns of $U$ are eigenvectors of $A A^T$, the columns of $V$ are eigenvectors of $A^T A$, and the diagonal entries of $\Sigma$ are the singular values of $A$.
9. ### SINGULAR VALUE DECOMPOSITION For some $k < \operatorname{rank}(A)$, we can get a low-rank approximation $A_k = U_k \Sigma_k V_k^T$ such that $\lVert A - A_k \rVert$ is minimized over all rank-$k$ matrices. SVD accelerates matrix multiplication: multiplying $B \in \mathbb{R}^{n \times p}$ through the rank-$k$ factors costs $O(k(m + n)p)$ instead of $O(mnp)$.
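
A minimal NumPy sketch of that truncation (the sizes and variable names are assumptions for illustration):

```python
import numpy as np

m, n, k = 500, 300, 15
A = np.random.randn(m, n)

# Full SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation A_k = U_k Sigma_k V_k^T; by Eckart-Young it
# minimizes ||A - A_k|| over all rank-k matrices.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Multiplying through the factors costs O(k(m + n)p) instead of O(mnp).
B = np.random.randn(n, 10)
approx_product = U[:, :k] @ (np.diag(s[:k]) @ (Vt[:k, :] @ B))
```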

13. ### RANDOMIZED SVD: Halko-Martinsson-Tropp Desired rank is specified in advance: the fixed-rank approximation problem. Given $A \in \mathbb{R}^{m \times n}$, a target rank $k$, and an oversampling parameter $p$, we seek to construct a matrix $Q$ with $k + p$ orthonormal columns such that $\lVert A - Q Q^T A \rVert$ is small.
14. ### RANDOMIZED SVD: Halko-Martinsson-Tropp What is the optimal value of $p$ to minimize the loss of information? If we approximate $A$ to rank $2k$ (i.e. $p = k$), then the expected error $\mathbb{E}\lVert A - Q Q^T A \rVert$ stays within a modest factor of $\sigma_{k+1}$, the best possible rank-$k$ error (Halko et al., 2009).
15. ### RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 1: 1. Generate a Gaussian test matrix $\Omega \in \mathbb{R}^{n \times (k + p)}$. 2. Form the matrix product $Y = A \Omega$. 3. Construct a matrix $Q$ whose columns form an orthonormal basis for the range of $Y$.
16. ### RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 2: 1. Form $B = Q^T A$, which yields the low-rank factorization $A \approx Q B$. 2. Compute the SVD of the small matrix $B = \tilde{U} \Sigma V^T$. 3. Set $U = Q \tilde{U}$, so that $A \approx U \Sigma V^T$.
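
A compact NumPy sketch of both stages, following the standard Halko-Martinsson-Tropp recipe; the oversampling value and matrix sizes are illustrative:

```python
import numpy as np

def randomized_svd(A, k, p=10):
    m, n = A.shape
    # Stage 1: range finding.
    Omega = np.random.randn(n, k + p)   # Gaussian test matrix
    Y = A @ Omega                       # sample the range of A
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for range(Y)

    # Stage 2: SVD of the small projected matrix.
    B = Q.T @ A                         # (k+p) x n, so A ~= Q @ B
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                     # map back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

A = np.random.randn(2000, 500)
U, s, Vt = randomized_svd(A, k=15)
```
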
17. ### SVD NAÏVE BENCHMARK r=15. np.linalg.svd: 100 loops, best of 3: 468 ms per loop. tf.linalg.svd: 100 loops, best of 3: 13.8 ms per loop. sklearn.utils.extmath.randomized_svd: 100 loops, best of 3: 10.1 ms per loop.
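
A hedged sketch of how such a benchmark could be reproduced with `timeit` (the matrix used on the slide is not shown, so the sizes here are assumptions):

```python
import timeit
import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.randn(2000, 2000)
r = 15

t_full = timeit.timeit(lambda: np.linalg.svd(A, full_matrices=False), number=10)
t_rand = timeit.timeit(lambda: randomized_svd(A, n_components=r), number=10)
print(f"full SVD:       {t_full / 10:.3f} s per loop")
print(f"randomized SVD: {t_rand / 10:.3f} s per loop")
```
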
18. ### GENERAL MATRIX-TO-MATRIX MULTIPLICATION (GEMM) - Matrix multiplications stress even the fastest computers available. - Fundamental building block for fully connected, recurrent, and convolutional layers. - The most common LA optimization trick in DNNs you never hear of.
19. ### GEMM: COMPUTATION TIME Computation time distribution of individual layers. Yangqing Jia, University of California, Berkeley.
20. ### GEMM: ARITHMETIC INTENSITY Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$. For matrix multiplication, a total of $2mnp$ FLOPs is required. Arithmetic intensity is defined as the number of FLOPs / the number of bytes accessed.
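
A small worked example of that ratio, under a rough model that reads and writes each matrix exactly once (the sizes and FP32 element width are assumptions):

```python
# Arithmetic intensity of C = A @ B with A: m x n, B: n x p.
m, n, p = 1024, 1024, 1024
bytes_per_element = 4                      # assuming FP32

flops = 2 * m * n * p                      # one multiply + one add per term
bytes_accessed = (m * n + n * p + m * p) * bytes_per_element

intensity = flops / bytes_accessed
print(f"{intensity:.1f} FLOPs per byte")   # ~170 for these sizes
```
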
21. ### GEMM: CONVOLUTIONAL LAYERS 2D images with depth: a number of channels for each pixel. E.g. Remote Sensing.
22. ### GEMM: CONVOLUTIONAL LAYERS Turn the input tensor into a 2D array that we can treat like a matrix, and do the same for the kernel weights. In TensorFlow, Intel MKL, and cuBLAS this is done using a data format: a representation for nD tensors stored in a linear (1D) memory address space. NCHW: default in Theano. NHWC: default in TensorFlow.
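
A small illustration of the two layouts and the transpose that converts between them (shapes are illustrative):

```python
import numpy as np

# NHWC: batch, height, width, channels (TensorFlow default)
x_nhwc = np.zeros((8, 224, 224, 3))

# NCHW: batch, channels, height, width (Theano default)
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))   # NHWC -> NCHW
print(x_nchw.shape)                            # (8, 3, 224, 224)
```
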
23. ### GEMM: CONVOLUTIONAL LAYERS With $D$ channels, an $H \times W$ image, $K \times K$ kernels, and $N$ filters, the flattened kernel matrix is $DK^2 \times N$ and the im2col patch matrix is $(H - K + 1)(W - K + 1) \times DK^2$.
24. ### GEMM: TensorFlow im2col

```
// The im2col buffer has # of patches rows, and # of filters cols.
// It's laid out like this, in row major order in memory:
//        < filter value count >
//   ^   +---------------------+
// patch |                     |
// count |                     |
//   v   +---------------------+
// Each patch row contains a filter_width x filter_height patch of the
// input, with the depth channel as the most contiguous in memory, followed
// by the width, then the height. This is the standard memory order in the
// image world if it helps to visualize it.
//
// Now we've assembled a set of image patches into a matrix, apply a
// GEMM matrix multiply of the patches as rows, times the filter
// weights in columns, to get partial results in the output matrix.
```
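
A naive NumPy sketch of the im2col lowering for a single image, with no padding or stride; the helper name and shapes are mine, purely for illustration:

```python
import numpy as np

def im2col(x, k):
    """x: (H, W, D) input, k: square kernel size. Returns (#patches, k*k*D)."""
    H, W, D = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k * D))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k, :].ravel()
    return cols

# Convolution as GEMM: patches (rows) times flattened filters (columns).
x = np.random.randn(32, 32, 3)
filters = np.random.randn(5 * 5 * 3, 8)   # 8 filters of size 5x5x3
out = im2col(x, 5) @ filters              # ((H-K+1)(W-K+1), 8)
```
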
25. ### GEMM: PARTITIONING Multiplying two partitioned matrices is exactly like multiplying two matrices with scalar elements, but with the individual elements replaced by submatrices. Let $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ and $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$. Then $C = AB$ has blocks $C_{ij} = A_{i1} B_{1j} + A_{i2} B_{2j}$.
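
A minimal sketch of that rule as blocked matrix multiplication in NumPy (the block size is arbitrary):

```python
import numpy as np

def blocked_matmul(A, B, bs):
    """Multiply A (m x n) by B (n x p) in bs x bs blocks."""
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(0, m, bs):
        for j in range(0, p, bs):
            for k in range(0, n, bs):
                # Same rule as scalar multiplication, with submatrix blocks.
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A, B = np.random.randn(256, 128), np.random.randn(128, 64)
assert np.allclose(blocked_matmul(A, B, 32), A @ B)
```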

27. ### REFERENCES Halko, Martinsson, and Tropp. Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions, 2009. Kazushige Goto and Robert A. van de Geijn. Anatomy of High-Performance Matrix Multiplication, 2008. Using Intel® Math Kernel Library for Matrix Multiplication. NVIDIA® Deep Learning Performance Documentation.