
The Linear Algebra of Deep Learning


(and a little bit of calculus too)

Elizabeth Ramirez

May 17, 2019

Transcript

  1. About me Electronic Engineer, Signal Processing. Applied Mathematician, Computational

    Science and Engineering. Applied Scientist. I spend a lot of my time worrying about how to process remote sensing data in a computationally efficient way.
  2. Deep Learning depends heavily on Linear Algebra Primitives. We need

    algorithms that make computations especially fast.
  3. Singular Value Decomposition Let A ∈ ℝ^(m×n). Then it can be factorized

    as A = UΣVᵀ, where the columns of U are eigenvectors of AAᵀ, the columns of V are eigenvectors of AᵀA, and Σ is diagonal with the singular values of A. Low-rank approximation: the truncated SVD gives the rank-k matrix A_k = U_k Σ_k V_kᵀ such that ‖A − A_k‖ is minimized over all rank-k matrices.
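    A minimal NumPy sketch (not from the deck; shapes and rank are made up) of the truncated SVD as a rank-k approximation:

    import numpy as np

    A = np.random.randn(100, 60)                    # any real matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 15                                          # target rank
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation

    # Eckart-Young: the 2-norm error equals the first discarded singular value.
    print(np.linalg.norm(A - A_k, 2), s[k])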
  4. Singular Value Decomposition Stage 1 Find an orthonormal Q such that A ≈ QQᵀA. Stage 2

    Compute U, Σ and V: 1. Form B = QᵀA, which yields the low-rank factorization A ≈ QB. 2. Compute the SVD of the small B = ŨΣVᵀ. 3. Set U = QŨ.
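    A sketch of Stage 2 in NumPy, assuming an orthonormal Q with A ≈ QQᵀA is already available (for example from the randomized range finder on the next slides); function name is my own:

    import numpy as np

    def svd_from_range(A, Q):
        """Stage 2: recover an approximate SVD of A from an orthonormal basis Q of its range."""
        B = Q.T @ A                                            # small matrix, A ≈ Q B
        U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False) # SVD of the small B
        U = Q @ U_tilde                                        # lift the left singular vectors back
        return U, s, Vt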
  5. Randomized SVD: Halko-Martinsson-Tropp Desired rank is specified in advance: fixed-rank

    approximation problem. Given A ∈ ℝ^(m×n), a target rank k, and an oversampling parameter p, we seek to construct a matrix Q with k + p orthonormal columns that captures the range of A.
  6. Randomized SVD: Halko-Martinsson-Tropp Stage 1 1. Generate a Gaussian test

    matrix Ω ∈ ℝ^(n×(k+p)). 2. Form the matrix product Y = AΩ. 3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.
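    A sketch of Stage 1, the randomized range finder, with target rank k and oversampling p as in Halko et al. (the function name is my own):

    import numpy as np

    def randomized_range_finder(A, k, p=10):
        """Stage 1: orthonormal basis Q for the approximate range of A."""
        m, n = A.shape
        Omega = np.random.randn(n, k + p)   # 1. Gaussian test matrix
        Y = A @ Omega                       # 2. sample the range of A
        Q, _ = np.linalg.qr(Y)              # 3. orthonormal basis for range(Y)
        return Q

    # Combined with the Stage 2 sketch above:
    # U, s, Vt = svd_from_range(A, randomized_range_finder(A, k=15))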
  7. SVD Naïve Benchmark (r = 15) np.linalg.svd: 10 loops, best of 3:

    468 ms per loop. tf.linalg.svd: 100 loops, best of 3: 13.8 ms per loop. sklearn.utils.extmath.randomized_svd: 100 loops, best of 3: 10.1 ms per loop.
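    A hedged sketch of how timings of this kind might be reproduced; the matrix size and loop count below are assumptions, not the deck's setup, so the numbers will differ:

    import timeit
    import numpy as np
    import tensorflow as tf
    from sklearn.utils.extmath import randomized_svd

    A = np.random.randn(2000, 1000)      # assumed size, not stated on the slide
    A_tf = tf.constant(A)
    r = 15

    print(timeit.timeit(lambda: np.linalg.svd(A, full_matrices=False), number=10))
    print(timeit.timeit(lambda: tf.linalg.svd(A_tf), number=10))
    print(timeit.timeit(lambda: randomized_svd(A, n_components=r), number=10))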
  8. GEneral Matrix to Matrix Multiplication (GEMM) Almost all of the

    CPU/GPU time is spent on convolutional and fully-connected layers. Matrix Partitioning: Like multiplying two matrices with scalar elements, but with the individual elements replaced by submatrices.
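    A toy NumPy sketch of matrix partitioning (not a real GEMM kernel): the blocks are multiplied and accumulated exactly as scalar elements would be; production BLAS libraries choose the block size to fit the cache hierarchy.

    import numpy as np

    def blocked_gemm(A, B, block=64):
        """C = A @ B computed block by block, treating submatrices like scalar elements."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        C = np.zeros((m, n))
        for i in range(0, m, block):
            for j in range(0, n, block):
                for l in range(0, k, block):
                    C[i:i+block, j:j+block] += A[i:i+block, l:l+block] @ B[l:l+block, j:j+block]
        return C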
  9. GEMM: Convolutional Layers 2D images with depth, a number of

    channels for each pixel, e.g. remote sensing imagery.
  10. GEMM: TensorFlow im2col

    // The im2col buffer has # of patches rows, and # of filters cols.
    // It's laid out like this, in row major order in memory:
    //          < filter value count >
    //        ^ +---------------------+
    //  patch | |                     |
    //  count | |                     |
    //        v +---------------------+
    // Each patch row contains a filter_width x filter_height patch of the
    // input, with the depth channel as the most contiguous in memory, followed
    // by the width, then the height. This is the standard memory order in the
    // image world if it helps to visualize it.
    // Now we've assembled a set of image patches into a matrix, apply a
    // GEMM matrix multiply of the patches as rows, times the filter
    // weights in columns, to get partial results in the output matrix.
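    Below is a minimal NumPy im2col sketch (stride 1, no padding, HWC patch order; an illustration, not TensorFlow's implementation) showing how the convolution then becomes a single GEMM of patch rows times filter-weight columns. The image and filter shapes are made up for the example.

    import numpy as np

    def im2col(x, fh, fw):
        """x: (H, W, C) image -> (num_patches, fh*fw*C) matrix of flattened patches."""
        H, W, C = x.shape
        out_h, out_w = H - fh + 1, W - fw + 1
        cols = np.empty((out_h * out_w, fh * fw * C))
        for i in range(out_h):
            for j in range(out_w):
                cols[i * out_w + j] = x[i:i + fh, j:j + fw, :].ravel()
        return cols

    # Convolution as GEMM: patch rows times filter-weight columns.
    x = np.random.randn(32, 32, 3)            # hypothetical image: height, width, channels
    w = np.random.randn(3, 3, 3, 8)           # filter_height, filter_width, in_channels, out_filters
    out = im2col(x, 3, 3) @ w.reshape(-1, 8)  # (30*30 patches, 8 filters) of partial results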
  11. GEMM: Convolutional Layers Turn the input tensor into a 2D

    array that we can treat like a matrix; do the same for the kernel weights. In TensorFlow, Intel MKL, cuBLAS: GEMM, using a data format, i.e. a representation of nD tensors stored in a linear (1D) memory address space. NCHW: default in Theano. NHWC: default in TensorFlow.
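    A small sketch of converting between the two layouts; both describe the same linear memory, only the index order differs (shapes are made up):

    import numpy as np

    x_nhwc = np.random.randn(8, 32, 32, 3)        # batch, height, width, channels (TensorFlow default)
    x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))   # batch, channels, height, width (Theano default)
    print(x_nchw.shape)                            # (8, 3, 32, 32)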
  12. Automatic Differentiation (autodiff) Machine Learning algorithms rely on the computation

    of gradients and Hessians of an objective function. 1. Calculate analytical derivatives and code them: time consuming, requires closed-form solutions. 2. Numerical differentiation using finite differences: slow with many partial derivatives. 3. Symbolic differentiation: expression swell. 4. Automatic Differentiation.
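    A small sketch contrasting option 2 (finite differences) with the hand-derived analytical derivative, using the same f(x, y) that appears later in the deck:

    import numpy as np

    def f(x, y):
        return np.log(x) + x * y - np.sin(y)

    x, y, h = 2.0, 5.0, 1e-6
    numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)   # central finite difference, one call pair per parameter
    analytic = 1 / x + y                               # hand-derived df/dx
    print(numeric, analytic)                           # both approximately 5.5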
  13. Automatic Differentiation (autodiff) Numerically evaluate the derivative of a function,

    propagating the chain rule of differential calculus through the program. Autodiff generates numerical derivative evaluations rather than derivative expressions, working over Computational Graphs.
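    One way to make "numerical evaluations propagated through the chain rule" concrete is a toy forward-mode pass with dual numbers, where each node carries a value and a derivative; this is an illustrative sketch, not how TensorFlow implements it:

    import math

    class Dual:
        """Value together with its derivative with respect to one chosen input."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, o): return Dual(self.val + o.val, self.dot + o.dot)
        def __sub__(self, o): return Dual(self.val - o.val, self.dot - o.dot)
        def __mul__(self, o): return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    def log(d): return Dual(math.log(d.val), d.dot / d.val)
    def sin(d): return Dual(math.sin(d.val), d.dot * math.cos(d.val))

    x = Dual(2.0, 1.0)   # seed dx/dx = 1
    y = Dual(5.0, 0.0)
    out = log(x) + (x * y - sin(y))
    print(out.val, out.dot)   # function value, and df/dx = 1/x + y = 5.5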
  14. autodiff: backpropagation Sensitivity of the objective value at the output,

    using the chain rule to compute partial derivatives of the objective with respect to each weight. It is a special case of reverse-mode autodiff: the two have a ❤ history. Theano, Torch, and TensorFlow are bringing general-purpose autodiff to the mainstream.
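    A hand-rolled reverse-mode pass for the same f(x, y): run the graph forward, then walk it backwards accumulating adjoints with the chain rule (a sketch, not the deck's code):

    import math

    x, y = 2.0, 5.0

    # Forward pass through the graph.
    v1 = math.log(x)       # log node
    v2 = x * y             # multiply node
    v3 = math.sin(y)       # sin node
    out = v1 + v2 - v3

    # Reverse pass: the adjoint of the output is 1; propagate back through each node.
    d_out = 1.0
    d_v1, d_v2, d_v3 = d_out, d_out, -d_out
    d_x = d_v1 * (1 / x) + d_v2 * y        # df/dx = 1/x + y = 5.5
    d_y = d_v2 * x + d_v3 * math.cos(y)    # df/dy = x - cos(y)
    print(d_x, d_y)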
  15. autodiff: TensorFlow Provides an implementation of reverse-mode autodiff

    in the tf.GradientTape API.

    import tensorflow as tf

    def f(x, y):
        output = tf.math.log(x) + tf.subtract(tf.multiply(x, y), tf.math.sin(y))
        return output

    x = tf.constant(2.0)
    y = tf.constant(5.0)

    with tf.GradientTape() as g:
        g.watch(x)          # constants are not watched automatically
        out = f(x, y)

    g.gradient(out, x)      # d/dx [log(x) + x*y - sin(y)] = 1/x + y
    # <tf.Tensor: id=315, shape=(), dtype=float32, numpy=5.5>
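    As a small follow-up not on the slide: tf.Variable sources are watched automatically, and g.gradient accepts a list of sources, so gradients with respect to several inputs come out of one tape.

    import tensorflow as tf

    x = tf.Variable(2.0)
    y = tf.Variable(5.0)

    with tf.GradientTape() as g:
        out = tf.math.log(x) + x * y - tf.math.sin(y)

    dx, dy = g.gradient(out, [x, y])
    print(dx.numpy(), dy.numpy())   # 5.5, and x - cos(y) ≈ 1.72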
  16. References Finding structure with randomness: Stochastic algorithms for constructing approximate

    matrix decompositions. Halko et al., 2009. https://arxiv.org/abs/0909.4061 Linear Algebra and Learning from Data. Gilbert Strang, 2019. https://www.tensorflow.org/api_docs/python/tf/GradientTape