LADL-Code Mesh V

Revamped presentation from ScaleConf 2019 for Code Mesh V 2020

Elizabeth Ramirez

November 05, 2020

Transcript

  1. ABOUT ME BSc Electrical Engineering, Signal Processing. MSc Applied Mathematics,

    Computational Science and Engineering. Applied Scientist, Maritime Intelligence. I spend a lot of my time worrying about how to process remote sensing data efficiently.
  2. DEEP LEARNING PRIMITIVES Depend heavily on Linear Algebra. We need

    algorithms that make computations especially fast.
  3. DEEP LEARNING PRIMITIVES Weights, inputs, and outputs are stored in tensors. Matrix

    multiplication, convolution, inner product, transposition, Rectified Linear Unit (ReLU).
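
    A minimal NumPy sketch of these primitives (the shapes and values below are illustrative, not taken from the deck):

    import numpy as np

    x = np.random.randn(4, 3)      # input batch: 4 samples, 3 features
    W = np.random.randn(3, 2)      # weight matrix

    y = x @ W                      # matrix multiplication (fully connected layer)
    dot = np.inner(x[0], x[1])     # inner product of two input rows
    Wt = W.T                       # transposition
    act = np.maximum(y, 0.0)       # Rectified Linear Unit (ReLU)

    # 1D convolution of a signal with a small smoothing kernel
    signal = np.random.randn(16)
    kernel = np.array([0.25, 0.5, 0.25])
    conv = np.convolve(signal, kernel, mode="same")
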
  4. SINGULAR VALUE DECOMPOSITION (SVD) Ubiquitous in Deep Learning and Signal

    Processing. - Decomposes a matrix into low-rank matrices. - Reduces the model size. - Takes advantage of sparsity, given the number of small singular values.
  5. SINGULAR VALUE DECOMPOSITION Let $A \in \mathbb{R}^{m \times n}$. Then it can be factorized

    as $A = U \Sigma V^T$, where the columns of $U$ are the eigenvectors of $AA^T$, the columns of $V$ are the eigenvectors of $A^T A$, and the diagonal entries of $\Sigma$ are the singular values of $A$.
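
    A quick NumPy check of this factorization (the matrix here is a random placeholder, not from the deck):

    import numpy as np

    A = np.random.randn(6, 4)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Columns of U: eigenvectors of A A^T; rows of Vt: eigenvectors of A^T A;
    # s: singular values of A in decreasing order.
    assert np.allclose(A, U @ np.diag(s) @ Vt)
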
  6. SINGULAR VALUE DECOMPOSITION For some $k < \min(m, n)$, we can get a

    low-rank approximation $A_k = U_k \Sigma_k V_k^T$ such that $\|A - A_k\|$ is minimized. SVD accelerates matrix multiplication from $O(mn)$ to $O(k(m + n))$ operations per matrix-vector product.
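
    A sketch of that speedup, using a rank-k truncation of the full SVD (sizes are illustrative):

    import numpy as np

    m, n, k = 1000, 800, 20
    A = np.random.randn(m, n)
    x = np.random.randn(n)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # rank-k truncation

    # The factored product costs about k*(m + n) multiply-adds instead of m*n.
    y_full    = A @ x
    y_lowrank = Uk @ (sk * (Vtk @ x))
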
  7. RANDOMIZED SVD: Halko-Martinsson-Tropp Desired rank is specified in advance: fixed-rank

    approximation problem. Given $A \in \mathbb{R}^{m \times n}$, a target rank $k$, and an oversampling parameter $p$, we seek to construct a matrix $Q$ with $k + p$ orthonormal columns such that $\|A - QQ^T A\| \approx \min_{\mathrm{rank}(X) \le k} \|A - X\|$.
  8. RANDOMIZED SVD: Halko-Martinsson-Tropp What is the optimal value for $p$ to

    minimize loss of information? If we approximate $A$ to rank $2k$ (i.e. $p = k$), then the expected error $\mathbb{E}\,\|A - QQ^T A\|$ exceeds the optimal error $\sigma_{k+1}$ by at most a factor on the order of $\sqrt{\min(m, n)}$ (Halko et al., Theorem 1.1).
  9. RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 1 1. Generate a Gaussian test

    matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$. 2. Form the matrix product $Y = A\Omega$. 3. Construct a matrix $Q$ whose columns form an orthonormal basis for the range of $Y$.
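
    A minimal NumPy sketch of Stage 1 (the oversampling default, seed, and function name are my own choices, not from the paper):

    import numpy as np

    def rsvd_stage1(A, k, p=10):
        """Construct Q whose columns approximately span the range of A."""
        rng = np.random.default_rng(0)
        n = A.shape[1]
        Omega = rng.standard_normal((n, k + p))   # Gaussian test matrix
        Y = A @ Omega                             # sample the range of A
        Q, _ = np.linalg.qr(Y)                    # orthonormal basis for range(Y)
        return Q
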
  10. RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 2 1. Form $B = Q^T A$, which yields the

    low-rank factorization $A \approx QB$. 2. Compute the SVD factorization of the small matrix $B = \tilde{U} \Sigma V^T$. 3. Set $U = Q\tilde{U}$, so that $A \approx U \Sigma V^T$.
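
    And a matching sketch of Stage 2, reusing the Q from Stage 1 (again, names are illustrative):

    def rsvd_stage2(A, Q, k):
        """Use Q to compute an approximate truncated SVD of A."""
        B = Q.T @ A                               # small (k+p) x n matrix, A ~= Q B
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        U = Q @ Ub                                # map back to the original column space
        return U[:, :k], s[:k], Vt[:k, :]

    # Usage: U, s, Vt = rsvd_stage2(A, rsvd_stage1(A, k=15), k=15)
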
  11. SVD NAÏVE BENCHMARK r=15

    np.linalg.svd: 100 loops, best of 3: 468 ms per loop
    tf.linalg.svd: 100 loops, best of 3: 13.8 ms per loop
    sklearn.utils.extmath.randomized_svd: 100 loops, best of 3: 10.1 ms per loop
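
    A rough way to reproduce this kind of comparison (the matrix shape below is an assumption; the deck does not state the size used):

    import timeit
    import numpy as np
    from sklearn.utils.extmath import randomized_svd

    A = np.random.randn(2000, 1000)
    r = 15

    t_full = timeit.timeit(lambda: np.linalg.svd(A, full_matrices=False), number=10)
    t_rand = timeit.timeit(lambda: randomized_svd(A, n_components=r), number=10)
    print(f"np.linalg.svd:  {t_full / 10 * 1e3:.1f} ms per loop")
    print(f"randomized_svd: {t_rand / 10 * 1e3:.1f} ms per loop")
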
  12. GENERAL MATRIX-MATRIX MULTIPLICATION (GEMM) - Matrix multiplications stress even the

    fastest computers available. - Fundamental building block for fully connected, recurrent, and convolutional layers. - The most common linear algebra optimization trick in DNNs that you never hear about.
  13. GEMM: ARITHMETIC INTENSITY Let $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$. For matrix multiplication, a

    total of $2MNK$ FLOPs is required. Arithmetic intensity is defined as the number of FLOPs / the number of bytes accessed.
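
    A worked example (the FP16 element size, the layer shape, and the "read A and B, write C once" memory model are assumptions, in the spirit of the NVIDIA performance docs):

    # Arithmetic intensity of a GEMM: FLOPs / bytes moved.
    M, N, K = 512, 4096, 4096
    bytes_per_element = 2                                        # FP16

    flops = 2 * M * N * K                                        # one multiply + one add per (m, n, k)
    bytes_moved = bytes_per_element * (M * K + K * N + M * N)    # read A and B, write C once

    print(f"Arithmetic intensity: {flops / bytes_moved:.1f} FLOPs/byte")
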
  14. GEMM: CONVOLUTIONAL LAYERS 2D images with depth: a number of

    channels for each pixel, e.g. remote sensing imagery.
  15. GEMM: CONVOLUTIONAL LAYERS Turn the input tensor into a 2D

    array that we can treat like a matrix; the same for the kernel weights. TensorFlow, Intel MKL, and cuBLAS do this using a data format: the representation of nD tensors stored in a linear (1D) memory address space. NCHW: default in Theano. NHWC: default in TensorFlow.
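
    A small NumPy illustration of the two layouts (batch and image sizes are arbitrary):

    import numpy as np

    # A batch of 8 RGB images, 32x32 pixels, in NHWC layout (TensorFlow default).
    nhwc = np.zeros((8, 32, 32, 3), dtype=np.float32)

    # The same data rearranged to NCHW (channels-first) layout.
    nchw = np.transpose(nhwc, (0, 3, 1, 2))   # (N, H, W, C) -> (N, C, H, W)
    print(nhwc.shape, nchw.shape)             # (8, 32, 32, 3) (8, 3, 32, 32)
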
  16. GEMM: TensorFlow im2col

    // The im2col buffer has # of patches rows, and # of filters cols.
    // It's laid out like this, in row major order in memory:
    //         < filter value count >
    //    ^    +---------------------+
    //  patch  |                     |
    //  count  |                     |
    //    v    +---------------------+
    // Each patch row contains a filter_width x filter_height patch of the
    // input, with the depth channel as the most contiguous in memory, followed
    // by the width, then the height. This is the standard memory order in the
    // image world if it helps to visualize it.
    // Now we've assembled a set of image patches into a matrix, apply a
    // GEMM matrix multiply of the patches as rows, times the filter
    // weights in columns, to get partial results in the output matrix.
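
    A toy NumPy version of the same idea, not TensorFlow's implementation (NHWC input, VALID padding, stride 1; names and shapes are illustrative):

    import numpy as np

    def im2col_gemm_conv(inputs, filters):
        """Convolution via im2col + GEMM.
        inputs:  (N, H, W, C) image batch, NHWC
        filters: (KH, KW, C, F) kernels
        returns: (N, H-KH+1, W-KW+1, F)
        """
        N, H, W, C = inputs.shape
        KH, KW, _, F = filters.shape
        OH, OW = H - KH + 1, W - KW + 1

        # im2col buffer: one row per output patch, one column per filter value,
        # with the depth channel most contiguous in memory.
        patches = np.empty((N, OH, OW, KH * KW * C), dtype=inputs.dtype)
        for i in range(OH):
            for j in range(OW):
                patches[:, i, j, :] = inputs[:, i:i+KH, j:j+KW, :].reshape(N, -1)

        # GEMM: (N*OH*OW, KH*KW*C) x (KH*KW*C, F) -> output matrix.
        out = patches.reshape(-1, KH * KW * C) @ filters.reshape(-1, F)
        return out.reshape(N, OH, OW, F)
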
  17. GEMM: PARTITIONING Multiplying two partitioned matrices is exactly like multiplying

    two matrices with scalar elements, but with the individual elements replaced by submatrices. Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ and $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$. Then: $AB = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}$.
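
    A quick NumPy check of block multiplication (the 2x2 partition and block sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))
    B = rng.standard_normal((6, 2))

    # Partition A into 2x2 blocks and B into two row blocks.
    A11, A12 = A[:2, :3], A[:2, 3:]
    A21, A22 = A[2:, :3], A[2:, 3:]
    B1, B2 = B[:3, :], B[3:, :]

    # Multiply block-wise, exactly as if the blocks were scalar entries.
    C = np.block([[A11 @ B1 + A12 @ B2],
                  [A21 @ B1 + A22 @ B2]])
    assert np.allclose(C, A @ B)
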
  18. REFERENCES

    - Halko, Martinsson, and Tropp. Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions. 2009.
    - Kazushige Goto and Robert A. van de Geijn. Anatomy of High-Performance Matrix Multiplication. 2008.
    - Using Intel® Math Kernel Library for Matrix Multiplication. Intel documentation.
    - NVIDIA® Deep Learning Performance Documentation.