Slide 1

Slide 1 text

THE LINEAR ALGEBRA OF DEEP LEARNING. Elizabeth Ramirez, November 5, 2020.

Slide 2

Slide 2 text

ABOUT ME BSc in Electrical Engineering, Signal Processing. MSc in Applied Mathematics, Computational Science and Engineering. Applied Scientist, Maritime Intelligence. I spend a lot of my time worrying about how to process remote sensing data efficiently.

Slide 3

Slide 3 text

DEEP LEARNING PRIMITIVES Depend heavily on linear algebra. We need algorithms that make these computations especially fast.

Slide 4

Slide 4 text

DEEP LEARNING PRIMITIVES Weights, inputs, and outputs are stored in tensors. Core operations: Matrix Multiplication, Convolution, Inner Product, Transposition, Rectified Linear Unit (ReLU).
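As a hedged illustration (my addition, not on the slide), a minimal NumPy sketch of how these primitives combine in a single fully connected layer; the shapes and variable names are assumptions.

import numpy as np

def relu(x):
    # Rectified Linear Unit: elementwise max(0, x).
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # inputs stored in a tensor (batch of 4, 8 features)
W = rng.standard_normal((8, 3))   # weights stored in a tensor (8 in, 3 out)
b = rng.standard_normal(3)        # bias

# Forward pass: matrix multiplication (rows of X inner-product columns of W),
# then the ReLU nonlinearity.
Y = relu(X @ W + b)
print(Y.shape)  # (4, 3)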

Slide 5

Slide 5 text

MATRIX MULTIPLICATION FUNDAMENTAL TASK Naive: O(n^3). Strassen: O(n^(log2 7)) ≈ O(n^2.807).
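As an illustration of the Strassen recursion (my addition, not from the slide), a minimal NumPy sketch; it assumes square matrices whose size is a power of two and falls back to the ordinary product below a cutoff, as practical implementations do.

import numpy as np

def strassen(A, B, cutoff=64):
    # Recursive Strassen multiplication for n x n matrices, n a power of two.
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                       # naive/BLAS kernel for small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)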

Slide 6

Slide 6 text

TOP 10 ALGORITHMS OF THE 20TH CENTURY.

Slide 7

Slide 7 text

SINGULAR VALUE DECOMPOSITION (SVD) Ubiquitous in Deep Learning and Signal Processing. - Decomposes a matrix into low-rank matrices. - Reduces the model size. - Takes advantage of sparsity, given the number of small singular values.

Slide 8

Slide 8 text

SINGULAR VALUE DECOMPOSITION Let A ∈ R^(m × n). Then it can be factorized as A = U Σ V^T, where the columns of U are eigenvectors of A A^T, the columns of V are eigenvectors of A^T A, and the diagonal entries of Σ are the singular values of A.
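A quick numerical check of the factorization (my addition; the matrix is random and purely illustrative):

import numpy as np

A = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A is recovered exactly from its factors.
assert np.allclose(A, U @ np.diag(s) @ Vt)

# The squared singular values are the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eigvals, s**2)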

Slide 9

Slide 9 text

SINGULAR VALUE DECOMPOSITION For some k < rank(A), we can get a low-rank approximation A_k = Σ_{i=1..k} σ_i u_i v_i^T such that ||A − A_k|| is minimized over all rank-k matrices. Using the rank-k factors accelerates multiplication by A from O(mn) to O(k(m + n)) per vector.
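A sketch of the truncation and of the faster product through the factors (my addition; the sizes m, n, k and the synthetic low-rank matrix are assumptions):

import numpy as np

m, n, k = 1000, 800, 15
A = np.random.rand(m, k) @ np.random.rand(k, n)    # (nearly) rank-k matrix
A += 1e-6 * np.random.rand(m, n)                   # small perturbation

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]           # keep the k largest singular values
Ak = Uk @ np.diag(sk) @ Vtk                        # best rank-k approximation
print(np.linalg.norm(A - Ak) / np.linalg.norm(A))  # small relative error

# Multiplying through the factors costs O(k(m + n)) per vector
# instead of O(mn) for the dense product.
x = np.random.rand(n)
y_fast = Uk @ (sk * (Vtk @ x))
assert np.allclose(y_fast, Ak @ x)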

Slide 10

Slide 10 text

SINGULAR VALUE DECOMPOSITION GOES-16 scene, blue band (reflectance).

Slide 11

Slide 11 text

SINGULAR VALUE DECOMPOSITION

Slide 12

Slide 12 text

SINGULAR VALUE DECOMPOSITION

Slide 13

Slide 13 text

RANDOMIZED SVD: Halko-Martinsson-Tropp The desired rank is specified in advance: the fixed-rank approximation problem. Given A ∈ R^(m × n), a target rank k, and an oversampling parameter p, we seek to construct a matrix Q with k + p orthonormal columns such that ||A − Q Q^T A|| is close to the minimum of ||A − X|| over all rank-k matrices X.

Slide 14

Slide 14 text

RANDOMIZED SVD: Halko-Martinsson-Tropp What is the optimal value of p to minimize the loss of information? If we approximate A to rank 2k (i.e., p = k), then the expected error E ||A − Q Q^T A|| exceeds the theoretically optimal error σ_{k+1} by at most a small polynomial factor in min(m, n).

Slide 15

Slide 15 text

RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 1 1. Generate an n × (k + p) Gaussian test matrix Ω. 2. Form the matrix product Y = A Ω. 3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.

Slide 16

Slide 16 text

RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 2 1. Form B = Q^T A, which yields the low-rank factorization A ≈ Q B. 2. Compute the SVD of the small matrix B = Ũ Σ V^T. 3. Set U = Q Ũ, so that A ≈ U Σ V^T.
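Combining the two stages, a compact sketch in the spirit of Halko et al. (my addition; the function name and the default oversampling p = 10 are assumptions):

import numpy as np

def randomized_svd_sketch(A, k, p=10):
    # Stage 1: randomized range finding.
    m, n = A.shape
    Omega = np.random.standard_normal((n, k + p))  # Gaussian test matrix
    Y = A @ Omega                                  # sample the range of A
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for range(Y)

    # Stage 2: deterministic SVD on the small projected matrix.
    B = Q.T @ A                                    # (k + p) x n, so A ≈ Q B
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                                # lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

A = np.random.rand(2000, 500)
U, s, Vt = randomized_svd_sketch(A, k=15)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))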

Slide 17

Slide 17 text

SVD NAÏVE BENCHMARK (r = 15)
np.linalg.svd: 100 loops, best of 3: 468 ms per loop
tf.linalg.svd: 100 loops, best of 3: 13.8 ms per loop
sklearn.utils.extmath.randomized_svd: 100 loops, best of 3: 10.1 ms per loop
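A rough way such a comparison might be reproduced (my addition; the 2000 x 2000 matrix size is an assumption, since the slide only states r = 15, and absolute timings depend on hardware):

import timeit
import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.rand(2000, 2000)

t_full = timeit.timeit(lambda: np.linalg.svd(A, full_matrices=False), number=10)
t_rand = timeit.timeit(lambda: randomized_svd(A, n_components=15), number=10)
print(f"np.linalg.svd:  {t_full / 10 * 1e3:.1f} ms per loop")
print(f"randomized_svd: {t_rand / 10 * 1e3:.1f} ms per loop")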

Slide 18

Slide 18 text

GENERAL MATRIX-MATRIX MULTIPLICATION (GEMM) - Matrix multiplications stress even the fastest computers available. - Fundamental building block for fully connected, recurrent, and convolutional layers. - The most common linear algebra optimization trick in DNNs that you never hear about.

Slide 19

Slide 19 text

GEMM: COMPUTATION TIME Computation time distribution of individual layers. Yangqing Jia, University of California, Berkeley.

Slide 20

Slide 20 text

GEMM: ARITHMETIC INTENSITY Let A ∈ R^(M × K) and B ∈ R^(K × N). For the matrix multiplication C = A B, a total of 2 · M · N · K FLOPs is required. Arithmetic intensity is defined as the number of FLOPs divided by the number of bytes accessed.
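A small worked example of the definition (my addition; the operand sizes and FP32 precision are assumptions):

# Arithmetic intensity of C = A @ B with A: M x K, B: K x N, in FP32 (4 bytes).
M, N, K = 1024, 1024, 1024
flops = 2 * M * N * K                          # one multiply and one add per term
bytes_accessed = 4 * (M * K + K * N + M * N)   # read A and B once, write C once
print(flops / bytes_accessed)                  # about 171 FLOPs per byte here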

Slide 21

Slide 21 text

GEMM: CONVOLUTIONAL LAYERS 2D images with depth: a number of channels for each pixel, e.g., remote sensing imagery.

Slide 22

Slide 22 text

GEMM: CONVOLUTIONAL LAYERS Turn the input tensor into a 2D array that we can treat like a matrix; do the same for the kernel weights. TensorFlow, Intel MKL, and cuBLAS use a data format: the representation of nD tensors stored in a linear (1D) memory address space. NCHW: default in Theano. NHWC: default in TensorFlow.
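A small sketch of the two layouts (my addition; the batch and image sizes are assumptions):

import numpy as np

# A batch of 8 RGB images, 32 x 32, in NHWC order (TensorFlow default).
x_nhwc = np.random.rand(8, 32, 32, 3)

# The same data viewed in NCHW order (Theano default):
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))
print(x_nchw.shape)  # (8, 3, 32, 32)

# Both describe the same linear (1D) block of memory; only the stride
# pattern used to index it changes.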

Slide 23

Slide 23 text

GEMM: CONVOLUTIONAL LAYERS With im2col, the convolution becomes a single GEMM: a patch matrix of shape (H − K + 1)(W − K + 1) × DK^2 (one row per output position, K · K · D values per row) multiplied by a kernel matrix of shape DK^2 × N, giving an output of shape (H − K + 1)(W − K + 1) × N.

Slide 24

Slide 24 text

GEMM: TensorFlow im2col
// The im2col buffer has # of patches rows, and # of filters cols.
// It's laid out like this, in row major order in memory:
//         < filter value count >
//    ^   +---------------------+
//  patch |                     |
//  count |                     |
//    v   +---------------------+
// Each patch row contains a filter_width x filter_height patch of the
// input, with the depth channel as the most contiguous in memory, followed
// by the width, then the height. This is the standard memory order in the
// image world if it helps to visualize it.
// Now we've assembled a set of image patches into a matrix, apply a
// GEMM matrix multiply of the patches as rows, times the filter
// weights in columns, to get partial results in the output matrix.
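A minimal NumPy sketch of the im2col-plus-GEMM idea described in the comment above (my addition, not TensorFlow's implementation): single image, stride 1, no padding, HWC layout.

import numpy as np

def conv2d_im2col(image, filters):
    # image: (H, W, D); filters: (K, K, D, N).
    H, W, D = image.shape
    K, _, _, N = filters.shape
    out_h, out_w = H - K + 1, W - K + 1

    # im2col buffer: one row per patch, K * K * D filter values per row,
    # with depth the most contiguous, then width, then height.
    patches = np.empty((out_h * out_w, K * K * D))
    for i in range(out_h):
        for j in range(out_w):
            patches[i * out_w + j] = image[i:i + K, j:j + K, :].ravel()

    # GEMM: patches as rows times filter weights as columns.
    out = patches @ filters.reshape(K * K * D, N)
    return out.reshape(out_h, out_w, N)

image = np.random.rand(16, 16, 3)
filters = np.random.rand(5, 5, 3, 8)
print(conv2d_im2col(image, filters).shape)  # (12, 12, 8)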

Slide 25

Slide 25 text

GEMM: PARTITIONING Multiplying two partitioned matrices is exactly like multiplying two matrices with scalar elements, but with the individual elements replaced by submatrices. Let A = [A11 A12; A21 A22] and B = [B11 B12; B21 B22] be conformally partitioned. Then: A B = [A11 B11 + A12 B21, A11 B12 + A12 B22; A21 B11 + A22 B21, A21 B12 + A22 B22].
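A sketch of the 2 x 2 block product above (my addition; it assumes even dimensions so the blocks align):

import numpy as np

def blocked_matmul(A, B):
    m, k = A.shape
    _, n = B.shape
    hm, hk, hn = m // 2, k // 2, n // 2
    A11, A12, A21, A22 = A[:hm, :hk], A[:hm, hk:], A[hm:, :hk], A[hm:, hk:]
    B11, B12, B21, B22 = B[:hk, :hn], B[:hk, hn:], B[hk:, :hn], B[hk:, hn:]
    # Each output block is a sum of products of submatrices.
    return np.block([
        [A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
        [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22],
    ])

A, B = np.random.rand(64, 96), np.random.rand(96, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)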

Slide 26

Slide 26 text

GEMM: PARTITIONING NVIDIA Deep Learning Performance Documentation

Slide 27

Slide 27 text

REFERENCES
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. Halko, Martinsson, and Tropp, 2009.
Anatomy of High-Performance Matrix Multiplication. Kazushige Goto and Robert A. van de Geijn, 2008.
Using Intel® Math Kernel Library for Matrix Multiplication.
NVIDIA® Deep Learning Performance Documentation.

Slide 28

Slide 28 text

THANK YOU!