
The Linear Algebra of Deep Learning


(and a little bit of calculus too)

Elizabeth Ramirez

May 17, 2019

Transcript

  1. About me Electronic Engineer, Signal Processing. Applied Mathematician, Computational

    Science and Engineering. Applied Scientist. I spend a lot of my time worrying about how to process remote sensing data in a computationally efficient way.
  2. Deep Learning depends heavily on Linear Algebra Primitives. We need

    algorithms that make computations especially fast.
  3. Singular Value Decomposition Let A ∈ ℝ^(m×n). Then it can be factorized

    as A = UΣVᵀ, where the columns of U are eigenvectors of AAᵀ, the columns of V are eigenvectors of AᵀA, and Σ is diagonal with the singular values of A. Low-rank approximation: the truncated SVD gives the rank-k matrix A_k = U_k Σ_k V_kᵀ such that ‖A − A_k‖ is minimized over all rank-k matrices.
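    A minimal NumPy sketch (not from the deck; shapes and rank are made up) of the truncated SVD as a rank-k approximation:

    import numpy as np

    A = np.random.randn(100, 60)                    # any real matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 15                                          # target rank
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation

    # Eckart-Young: the 2-norm error equals the first discarded singular value.
    print(np.linalg.norm(A - A_k, 2), s[k])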
  4. Singular Value Decomposition Stage 1 Find an orthonormal Q such that A ≈ QQᵀA. Stage 2

    Compute U, Σ and V: 1. Form B = QᵀA, which yields the low-rank factorization A ≈ QB. 2. Compute the SVD of the small B = ŨΣVᵀ. 3. Set U = QŨ.
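    A sketch of Stage 2 in NumPy, assuming an orthonormal Q with A ≈ QQᵀA is already available (for example from the randomized range finder on the next slides); function name is my own:

    import numpy as np

    def svd_from_range(A, Q):
        """Stage 2: recover an approximate SVD of A from an orthonormal basis Q of its range."""
        B = Q.T @ A                                            # small matrix, A ≈ Q B
        U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False) # SVD of the small B
        U = Q @ U_tilde                                        # lift the left singular vectors back
        return U, s, Vt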
  5. Randomized SVD: Halko-Martinsson-Tropp Desired rank is specified in advance: fixed-rank

    approximation problem. Given A ∈ ℝ^(m×n), a target rank k, and an oversampling parameter p, we seek to construct a matrix Q with k + p orthonormal columns that captures the range of A.
  6. Randomized SVD: Halko-Martinsson-Tropp Stage 1 1. Generate a Gaussian test

    matrix Ω ∈ ℝ^(n×(k+p)). 2. Form the matrix product Y = AΩ. 3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.
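    A sketch of Stage 1, the randomized range finder, with target rank k and oversampling p as in Halko et al. (the function name is my own):

    import numpy as np

    def randomized_range_finder(A, k, p=10):
        """Stage 1: orthonormal basis Q for the approximate range of A."""
        m, n = A.shape
        Omega = np.random.randn(n, k + p)   # 1. Gaussian test matrix
        Y = A @ Omega                       # 2. sample the range of A
        Q, _ = np.linalg.qr(Y)              # 3. orthonormal basis for range(Y)
        return Q

    # Combined with the Stage 2 sketch above:
    # U, s, Vt = svd_from_range(A, randomized_range_finder(A, k=15))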
  7. SVD Naïve Benchmark (r = 15) np.linalg.svd: 10 loops, best of 3:

    468 ms per loop. tf.linalg.svd: 100 loops, best of 3: 13.8 ms per loop. sklearn.utils.extmath.randomized_svd: 100 loops, best of 3: 10.1 ms per loop.
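    A hedged sketch of how timings of this kind might be reproduced; the matrix size and loop count below are assumptions, not the deck's setup, so the numbers will differ:

    import timeit
    import numpy as np
    import tensorflow as tf
    from sklearn.utils.extmath import randomized_svd

    A = np.random.randn(2000, 1000)      # assumed size, not stated on the slide
    A_tf = tf.constant(A)
    r = 15

    print(timeit.timeit(lambda: np.linalg.svd(A, full_matrices=False), number=10))
    print(timeit.timeit(lambda: tf.linalg.svd(A_tf), number=10))
    print(timeit.timeit(lambda: randomized_svd(A, n_components=r), number=10))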
  8. GEneral Matrix to Matrix Multiplication (GEMM) Almost all of the

    CPU/GPU time is spent on convolutional and fully-connected layers. Matrix Partitioning: Like multiplying two matrices with scalar elements, but with the individual elements replaced by submatrices.
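    A toy NumPy sketch of matrix partitioning (not a real GEMM kernel): the blocks are multiplied and accumulated exactly as scalar elements would be; production BLAS libraries choose the block size to fit the cache hierarchy.

    import numpy as np

    def blocked_gemm(A, B, block=64):
        """C = A @ B computed block by block, treating submatrices like scalar elements."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        C = np.zeros((m, n))
        for i in range(0, m, block):
            for j in range(0, n, block):
                for l in range(0, k, block):
                    C[i:i+block, j:j+block] += A[i:i+block, l:l+block] @ B[l:l+block, j:j+block]
        return C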
  9. GEMM: Convolutional Layers 2D images with depth, a number of

    channels for each pixel, e.g. remote sensing imagery.
  10. GEMM: TensorFlow im2col

    // The im2col buffer has # of patches rows, and # of filters cols.
    // It's laid out like this, in row major order in memory:
    //          < filter value count >
    //        ^ +---------------------+
    //  patch | |                     |
    //  count | |                     |
    //        v +---------------------+
    // Each patch row contains a filter_width x filter_height patch of the
    // input, with the depth channel as the most contiguous in memory, followed
    // by the width, then the height. This is the standard memory order in the
    // image world if it helps to visualize it.
    // Now we've assembled a set of image patches into a matrix, apply a
    // GEMM matrix multiply of the patches as rows, times the filter
    // weights in columns, to get partial results in the output matrix.
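    Below is a minimal NumPy im2col sketch (stride 1, no padding, HWC patch order; an illustration, not TensorFlow's implementation) showing how the convolution then becomes a single GEMM of patch rows times filter-weight columns. The image and filter shapes are made up for the example.

    import numpy as np

    def im2col(x, fh, fw):
        """x: (H, W, C) image -> (num_patches, fh*fw*C) matrix of flattened patches."""
        H, W, C = x.shape
        out_h, out_w = H - fh + 1, W - fw + 1
        cols = np.empty((out_h * out_w, fh * fw * C))
        for i in range(out_h):
            for j in range(out_w):
                cols[i * out_w + j] = x[i:i + fh, j:j + fw, :].ravel()
        return cols

    # Convolution as GEMM: patch rows times filter-weight columns.
    x = np.random.randn(32, 32, 3)            # hypothetical image: height, width, channels
    w = np.random.randn(3, 3, 3, 8)           # filter_height, filter_width, in_channels, out_filters
    out = im2col(x, 3, 3) @ w.reshape(-1, 8)  # (30*30 patches, 8 filters) of partial results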
  11. GEMM: Convolutional Layers Turn the input tensor into a 2D

    array that we can treat like a matrix; do the same for the kernel weights. In TensorFlow, Intel MKL, cuBLAS: GEMM, using a data format, i.e. a representation of nD tensors stored in a linear (1D) memory address space. NCHW: default in Theano. NHWC: default in TensorFlow.
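    A small sketch of converting between the two layouts; both describe the same linear memory, only the index order differs (shapes are made up):

    import numpy as np

    x_nhwc = np.random.randn(8, 32, 32, 3)        # batch, height, width, channels (TensorFlow default)
    x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))   # batch, channels, height, width (Theano default)
    print(x_nchw.shape)                            # (8, 3, 32, 32)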
  12. Automatic Differentiation (autodiff) Machine Learning algorithms rely on the computation

    of gradients and Hessians of an objective function. 1. Calculate analytical derivatives and code them: time consuming, requires closed-form solutions. 2. Numerical differentiation using finite differences: slow with many partial derivatives. 3. Symbolic differentiation: expression swell. 4. Automatic Differentiation.
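    A small sketch contrasting option 2 (finite differences) with the hand-derived analytical derivative, using the same f(x, y) that appears later in the deck:

    import numpy as np

    def f(x, y):
        return np.log(x) + x * y - np.sin(y)

    x, y, h = 2.0, 5.0, 1e-6
    numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)   # central finite difference, one call pair per parameter
    analytic = 1 / x + y                               # hand-derived df/dx
    print(numeric, analytic)                           # both approximately 5.5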
  13. Automatic Differentiation (autodiff) Numerically evaluate the derivative of a function,

    propagating the chain rule of differential calculus through the program. Autodiff generates numerical derivative evaluations rather than derivative expressions, working over Computational Graphs.
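    One way to make "numerical evaluations propagated through the chain rule" concrete is a toy forward-mode pass with dual numbers, where each node carries a value and a derivative; this is an illustrative sketch, not how TensorFlow implements it:

    import math

    class Dual:
        """Value together with its derivative with respect to one chosen input."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, o): return Dual(self.val + o.val, self.dot + o.dot)
        def __sub__(self, o): return Dual(self.val - o.val, self.dot - o.dot)
        def __mul__(self, o): return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    def log(d): return Dual(math.log(d.val), d.dot / d.val)
    def sin(d): return Dual(math.sin(d.val), d.dot * math.cos(d.val))

    x = Dual(2.0, 1.0)   # seed dx/dx = 1
    y = Dual(5.0, 0.0)
    out = log(x) + (x * y - sin(y))
    print(out.val, out.dot)   # function value, and df/dx = 1/x + y = 5.5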
  14. autodiff: backpropagation Sensitivity of the objective value at the output,

    using the chain rule to compute partial derivatives of the objective with respect to each weight. It is a special case of reverse-mode autodiff: the two have a ❤ history. Theano, Torch, and TensorFlow are bringing general-purpose autodiff to the mainstream.
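    A hand-rolled reverse-mode pass for the same f(x, y): run the graph forward, then walk it backwards accumulating adjoints with the chain rule (a sketch, not the deck's code):

    import math

    x, y = 2.0, 5.0

    # Forward pass through the graph.
    v1 = math.log(x)       # log node
    v2 = x * y             # multiply node
    v3 = math.sin(y)       # sin node
    out = v1 + v2 - v3

    # Reverse pass: the adjoint of the output is 1; propagate back through each node.
    d_out = 1.0
    d_v1, d_v2, d_v3 = d_out, d_out, -d_out
    d_x = d_v1 * (1 / x) + d_v2 * y        # df/dx = 1/x + y = 5.5
    d_y = d_v2 * x + d_v3 * math.cos(y)    # df/dy = x - cos(y)
    print(d_x, d_y)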
  15. autodiff: TensorFlow Provides an implementation of reverse-mode autodiff

    in the tf.GradientTape API.

    import tensorflow as tf

    def f(x, y):
        output = tf.math.log(x) + tf.subtract(tf.multiply(x, y), tf.math.sin(y))
        return output

    x = tf.constant(2.0)
    y = tf.constant(5.0)

    with tf.GradientTape() as g:
        g.watch(x)          # constants are not watched automatically
        out = f(x, y)

    g.gradient(out, x)      # d/dx [log(x) + x*y - sin(y)] = 1/x + y
    # <tf.Tensor: id=315, shape=(), dtype=float32, numpy=5.5>
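    As a small follow-up not on the slide: tf.Variable sources are watched automatically, and g.gradient accepts a list of sources, so gradients with respect to several inputs come out of one tape.

    import tensorflow as tf

    x = tf.Variable(2.0)
    y = tf.Variable(5.0)

    with tf.GradientTape() as g:
        out = tf.math.log(x) + x * y - tf.math.sin(y)

    dx, dy = g.gradient(out, [x, y])
    print(dx.numpy(), dy.numpy())   # 5.5, and x - cos(y) ≈ 1.72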
  16. References Finding structure with randomness: Stochastic algorithms for constructing approximate

    matrix decompositions. Halko et al., 2009. https://arxiv.org/abs/0909.4061 Linear Algebra and Learning from Data. Gilbert Strang, 2019. https://www.tensorflow.org/api_docs/python/tf/GradientTape