LADL-Code Mesh V

Revamped presentation from ScaleConf 2019 for Code Mesh V 2020

Elizabeth Ramirez

November 05, 2020

Transcript

  1. ABOUT ME BSc Electrical Engineering, Signal Processing. MSc Applied Mathematics,

    Computational Science and Engineering. Applied Scientist, Maritime Intelligence. I spend a lot of my time worrying about how to process remote sensing data efficiently.
  2. DEEP LEARNING PRIMITIVES Depend heavily on Linear Algebra. We need

    algorithms that make computations especially fast.
  3. DEEP LEARNING PRIMITIVES Weights, inputs, and outputs are stored in tensors. Matrix

    multiplication, convolution, inner product, transposition, Rectified Linear Unit (ReLU).
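
    A minimal NumPy sketch of these primitives (the shapes and values below are illustrative, not taken from the deck):

    import numpy as np

    x = np.random.randn(4, 3)      # input batch: 4 samples, 3 features
    W = np.random.randn(3, 2)      # weight matrix

    y = x @ W                      # matrix multiplication (fully connected layer)
    dot = np.inner(x[0], x[1])     # inner product of two input rows
    Wt = W.T                       # transposition
    act = np.maximum(y, 0.0)       # Rectified Linear Unit (ReLU)

    # 1D convolution of a signal with a small smoothing kernel
    signal = np.random.randn(16)
    kernel = np.array([0.25, 0.5, 0.25])
    conv = np.convolve(signal, kernel, mode="same")
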
  4. SINGULAR VALUE DECOMPOSITION (SVD) Ubiquitous in Deep Learning and Signal

    Processing. - Decomposes a matrix into low-rank matrices. - Reduces the model size. - Takes advantage of sparsity, given the number of small singular values.
  5. SINGULAR VALUE DECOMPOSITION Let $A \in \mathbb{R}^{m \times n}$. Then it can be factorized

    as $A = U \Sigma V^T$, where the columns of $U$ are the eigenvectors of $AA^T$, the columns of $V$ are the eigenvectors of $A^T A$, and the diagonal entries of $\Sigma$ are the singular values of $A$.
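
    A quick NumPy check of this factorization (the matrix here is a random placeholder, not from the deck):

    import numpy as np

    A = np.random.randn(6, 4)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Columns of U: eigenvectors of A A^T; rows of Vt: eigenvectors of A^T A;
    # s: singular values of A in decreasing order.
    assert np.allclose(A, U @ np.diag(s) @ Vt)
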
  6. SINGULAR VALUE DECOMPOSITION For some $k < \min(m, n)$, we can get a

    low-rank approximation $A_k = U_k \Sigma_k V_k^T$ such that $\|A - A_k\|$ is minimized. SVD accelerates matrix multiplication from $O(mn)$ to $O(k(m + n))$ operations per matrix-vector product.
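
    A sketch of that speedup, using a rank-k truncation of the full SVD (sizes are illustrative):

    import numpy as np

    m, n, k = 1000, 800, 20
    A = np.random.randn(m, n)
    x = np.random.randn(n)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # rank-k truncation

    # The factored product costs about k*(m + n) multiply-adds instead of m*n.
    y_full    = A @ x
    y_lowrank = Uk @ (sk * (Vtk @ x))
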
  7. RANDOMIZED SVD: Halko-Martinsson-Tropp Desired rank is specified in advance: fixed-rank

    approximation problem. Given $A \in \mathbb{R}^{m \times n}$, a target rank $k$, and an oversampling parameter $p$, we seek to construct a matrix $Q$ with $k + p$ orthonormal columns such that $\|A - QQ^T A\| \approx \min_{\mathrm{rank}(X) \le k} \|A - X\|$.
  8. RANDOMIZED SVD: Halko-Martinsson-Tropp What is the optimal value for $p$ to

    minimize loss of information? If we approximate $A$ to rank $2k$ (i.e. $p = k$), then the expected error $\mathbb{E}\,\|A - QQ^T A\|$ exceeds the optimal error $\sigma_{k+1}$ by at most a factor on the order of $\sqrt{\min(m, n)}$ (Halko et al., Theorem 1.1).
  9. RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 1 1. Generate a Gaussian test

    matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$. 2. Form the matrix product $Y = A\Omega$. 3. Construct a matrix $Q$ whose columns form an orthonormal basis for the range of $Y$.
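
    A minimal NumPy sketch of Stage 1 (the oversampling default, seed, and function name are my own choices, not from the paper):

    import numpy as np

    def rsvd_stage1(A, k, p=10):
        """Construct Q whose columns approximately span the range of A."""
        rng = np.random.default_rng(0)
        n = A.shape[1]
        Omega = rng.standard_normal((n, k + p))   # Gaussian test matrix
        Y = A @ Omega                             # sample the range of A
        Q, _ = np.linalg.qr(Y)                    # orthonormal basis for range(Y)
        return Q
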
  10. RANDOMIZED SVD: Halko-Martinsson-Tropp Stage 2 1. Form $B = Q^T A$, which yields the

    low-rank factorization $A \approx QB$. 2. Compute the SVD factorization of the small matrix $B = \tilde{U} \Sigma V^T$. 3. Set $U = Q\tilde{U}$, so that $A \approx U \Sigma V^T$.
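
    And a matching sketch of Stage 2, reusing the Q from Stage 1 (again, names are illustrative):

    def rsvd_stage2(A, Q, k):
        """Use Q to compute an approximate truncated SVD of A."""
        B = Q.T @ A                               # small (k+p) x n matrix, A ~= Q B
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        U = Q @ Ub                                # map back to the original column space
        return U[:, :k], s[:k], Vt[:k, :]

    # Usage: U, s, Vt = rsvd_stage2(A, rsvd_stage1(A, k=15), k=15)
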
  11. SVD NAÏVE BENCHMARK r=15

    np.linalg.svd: 100 loops, best of 3: 468 ms per loop
    tf.linalg.svd: 100 loops, best of 3: 13.8 ms per loop
    sklearn.utils.extmath.randomized_svd: 100 loops, best of 3: 10.1 ms per loop
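
    A rough way to reproduce this kind of comparison (the matrix shape below is an assumption; the deck does not state the size used):

    import timeit
    import numpy as np
    from sklearn.utils.extmath import randomized_svd

    A = np.random.randn(2000, 1000)
    r = 15

    t_full = timeit.timeit(lambda: np.linalg.svd(A, full_matrices=False), number=10)
    t_rand = timeit.timeit(lambda: randomized_svd(A, n_components=r), number=10)
    print(f"np.linalg.svd:  {t_full / 10 * 1e3:.1f} ms per loop")
    print(f"randomized_svd: {t_rand / 10 * 1e3:.1f} ms per loop")
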
  12. GENERAL MATRIX-MATRIX MULTIPLICATION (GEMM) - Matrix multiplications stress even the

    fastest computers available. - Fundamental building block for fully connected, recurrent, and convolutional layers. - The most common linear algebra optimization trick in DNNs that you never hear about.
  13. GEMM: ARITHMETIC INTENSITY Let $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$. For matrix multiplication, a

    total of $2MNK$ FLOPs is required. Arithmetic intensity is defined as the number of FLOPs / the number of bytes accessed.
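
    A worked example (the FP16 element size, the layer shape, and the "read A and B, write C once" memory model are assumptions, in the spirit of the NVIDIA performance docs):

    # Arithmetic intensity of a GEMM: FLOPs / bytes moved.
    M, N, K = 512, 4096, 4096
    bytes_per_element = 2                                        # FP16

    flops = 2 * M * N * K                                        # one multiply + one add per (m, n, k)
    bytes_moved = bytes_per_element * (M * K + K * N + M * N)    # read A and B, write C once

    print(f"Arithmetic intensity: {flops / bytes_moved:.1f} FLOPs/byte")
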
  14. GEMM: CONVOLUTIONAL LAYERS 2D images with depth: a number of

    channels for each pixel, e.g. remote sensing imagery.
  15. GEMM: CONVOLUTIONAL LAYERS Turn the input tensor into a 2D

    array that we can treat like a matrix; the same for the kernel weights. TensorFlow, Intel MKL, and cuBLAS do this using a data format: the representation of nD tensors stored in a linear (1D) memory address space. NCHW: default in Theano. NHWC: default in TensorFlow.
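
    A small NumPy illustration of the two layouts (batch and image sizes are arbitrary):

    import numpy as np

    # A batch of 8 RGB images, 32x32 pixels, in NHWC layout (TensorFlow default).
    nhwc = np.zeros((8, 32, 32, 3), dtype=np.float32)

    # The same data rearranged to NCHW (channels-first) layout.
    nchw = np.transpose(nhwc, (0, 3, 1, 2))   # (N, H, W, C) -> (N, C, H, W)
    print(nhwc.shape, nchw.shape)             # (8, 32, 32, 3) (8, 3, 32, 32)
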
  16. GEMM: TensorFlow im2col

    // The im2col buffer has # of patches rows, and # of filters cols.
    // It's laid out like this, in row major order in memory:
    //         < filter value count >
    //    ^    +---------------------+
    //  patch  |                     |
    //  count  |                     |
    //    v    +---------------------+
    // Each patch row contains a filter_width x filter_height patch of the
    // input, with the depth channel as the most contiguous in memory, followed
    // by the width, then the height. This is the standard memory order in the
    // image world if it helps to visualize it.
    // Now we've assembled a set of image patches into a matrix, apply a
    // GEMM matrix multiply of the patches as rows, times the filter
    // weights in columns, to get partial results in the output matrix.
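
    A toy NumPy version of the same idea, not TensorFlow's implementation (NHWC input, VALID padding, stride 1; names and shapes are illustrative):

    import numpy as np

    def im2col_gemm_conv(inputs, filters):
        """Convolution via im2col + GEMM.
        inputs:  (N, H, W, C) image batch, NHWC
        filters: (KH, KW, C, F) kernels
        returns: (N, H-KH+1, W-KW+1, F)
        """
        N, H, W, C = inputs.shape
        KH, KW, _, F = filters.shape
        OH, OW = H - KH + 1, W - KW + 1

        # im2col buffer: one row per output patch, one column per filter value,
        # with the depth channel most contiguous in memory.
        patches = np.empty((N, OH, OW, KH * KW * C), dtype=inputs.dtype)
        for i in range(OH):
            for j in range(OW):
                patches[:, i, j, :] = inputs[:, i:i+KH, j:j+KW, :].reshape(N, -1)

        # GEMM: (N*OH*OW, KH*KW*C) x (KH*KW*C, F) -> output matrix.
        out = patches.reshape(-1, KH * KW * C) @ filters.reshape(-1, F)
        return out.reshape(N, OH, OW, F)
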
  17. GEMM: PARTITIONING Multiplying two partitioned matrices is exactly like multiplying

    two matrices with scalar elements, but with the individual elements replaced by submatrices. Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ and $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$. Then: $AB = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}$.
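
    A quick NumPy check of block multiplication (the 2x2 partition and block sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))
    B = rng.standard_normal((6, 2))

    # Partition A into 2x2 blocks and B into two row blocks.
    A11, A12 = A[:2, :3], A[:2, 3:]
    A21, A22 = A[2:, :3], A[2:, 3:]
    B1, B2 = B[:3, :], B[3:, :]

    # Multiply block-wise, exactly as if the blocks were scalar entries.
    C = np.block([[A11 @ B1 + A12 @ B2],
                  [A21 @ B1 + A22 @ B2]])
    assert np.allclose(C, A @ B)
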
  18. REFERENCES

    - Halko, Martinsson, and Tropp. Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions. 2009.
    - Kazushige Goto and Robert A. van de Geijn. Anatomy of High-Performance Matrix Multiplication. 2008.
    - Using Intel® Math Kernel Library for Matrix Multiplication. Intel documentation.
    - NVIDIA® Deep Learning Performance Documentation.