[CS Foundation] AIML - 5 - Deep Learning

x-village

August 16, 2018

Transcript

  1. Who am I?

    • Name: Jie-Han Chen • Research interests: ◦ Artificial General Intelligence ◦ Reinforcement Learning ◦ Neural Network Architecture Design • Currently a master's student in CSIE, National Cheng Kung University, Taiwan • LinkedIn: link
  2. The revolution

    In recent years, deep learning has produced impressive results in many domains: • Computer Vision • Speech and Text Recognition • Decision-making policy and Control • Modification and Generation
  4. You don’t need to absorb everything in this lecture.

    Just keep the key points in mind when you train a neural network.
  5. Outline

    • The inspiration of artificial neural networks • Perceptron & multi-layer perceptrons • Neural network • Optimization and learning algorithm • Tips for training neural networks • Reference • Resources
  6. “At least two ingredients are necessary for the advancement of a technology: • Concept • Implementation”

    -- (quoted from Neural Network Design)
  7. The inspiration of Artificial Neural Network: Biological Neurons

    From neuroscience we know that a neuron consists of the following elements: • Cell Body • Dendrites • Axon • Synapse (Quoted from Neural Network Design, 2nd edition.)
  8. Single Neuron Perceptron

    • Builds on the artificial neuron model proposed by Warren McCulloch and Walter Pitts (1943) • Can compute arithmetic or logical functions [Figure: a biological neuron annotated with the axon from another neuron, the synapse, the dendrite, the cell body, and the output axon.]
  9. Single Neuron Perceptron

    [Figure: the inputs $x_1, x_2, x_3$ (axons from other neurons) are scaled by the weights $w_1, w_2, w_3$ (synapses) and reach the cell body through the dendrites. A bias, corresponding to the intercept term, is added, and a transfer function (activation function) produces the output on the output axon.]
  10. Single Neuron Perceptron

    The weighted inputs $w_1 x_1$, $w_2 x_2$, $w_3 x_3$ and the bias $b$ are summed and passed through the Hard Limit Transfer Function (hardlim() or sgn()), which outputs either +1 or 0.
  11. Single Neuron Perceptron

    If $w^T x$ is greater than or equal to $-b$, the output is 1; otherwise the output is 0. Thus each neuron divides the input space into two regions.
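The rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `hardlim` and `perceptron` and the weight values are assumptions chosen for the example.

```python
def hardlim(n):
    """Hard limit transfer function: 1 if n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def perceptron(x, w, b):
    """Single neuron: weighted sum of the inputs plus bias, then hardlim."""
    n = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return hardlim(n)


# Hypothetical weights and bias, chosen only for illustration.
w = [1.0, -2.0, 0.5]
b = -0.25

print(perceptron([1.0, 0.0, 0.0], w, b))  # w^T x = 1.0, which is >= -b = 0.25
print(perceptron([0.0, 1.0, 0.0], w, b))  # w^T x = -2.0, which is < -b = 0.25
```

Inputs with $w^T x = -b$ lie exactly on the decision boundary and are assigned to the output-1 side.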
  12. Single Neuron Perceptron

    [Figure: in the $(x_1, x_2)$ plane, the decision boundary separates the points labeled +1 from the points labeled 0; the weight vector points to the positive side.]
  13. Single Neuron Perceptron • AND operation

    Truth table ($x_1$, $x_2$ → output): (0, 0) → 0, (0, 1) → 0, (1, 0) → 0, (1, 1) → 1. Parameters: $w_1 = 1$, $w_2 = 1$, $b = -1.5$.
  14. Single Neuron Perceptron • OR operation

    Truth table ($x_1$, $x_2$ → output): (0, 0) → 0, (0, 1) → 1, (1, 0) → 1, (1, 1) → 1. Parameters: $w_1 = 1$, $w_2 = 1$, $b = -0.5$.
  15. Single Neuron Perceptron • NOT operation

    Truth table ($x_1$ → output): 0 → 1, 1 → 0. Parameters: $w_1 = -0.6$, $b = 0.5$.
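The weights and biases listed for AND, OR and NOT can be checked with a short script. This is a sketch that assumes the hardlim transfer function from the earlier slides; the names `GATES`, `neuron` and `truth_table` are illustrative and do not come from the lecture.

```python
def hardlim(n):
    """Hard limit transfer function: 1 if n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def neuron(x, w, b):
    """Single neuron perceptron with a hardlim transfer function."""
    return hardlim(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


# (weights, bias) pairs as listed on the AND, OR and NOT slides.
GATES = {
    "AND": ([1.0, 1.0], -1.5),
    "OR": ([1.0, 1.0], -0.5),
    "NOT": ([-0.6], 0.5),
}


def truth_table(name):
    """Evaluate the gate on every binary input, in counting order."""
    w, b = GATES[name]
    n_inputs = len(w)
    rows = []
    for i in range(2 ** n_inputs):
        x = [(i >> (n_inputs - 1 - k)) & 1 for k in range(n_inputs)]
        rows.append((tuple(x), neuron(x, w, b)))
    return rows


for gate in GATES:
    print(gate, truth_table(gate))
```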
  16. Another kind of Single Neuron Perceptron

    The same neuron (inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias $b$) with the Symmetrical Hard Limit Transfer Function (hardlims()), which outputs +1 or -1 instead of +1 or 0.
  17. Linear Regression

    When we change the transfer function to the linear function $f(x) = x$, the neuron becomes a kind of linear regression. P.S. The same linear activation function is also used in ADALINE networks.
  18. Learning algorithm of Perceptron

    We won’t cover the Perceptron Learning Algorithm (PLA) here. If you are interested in it, note that the learning algorithm differs between the two transfer functions, hardlim and hardlims. • hardlim: see Chapter 4 of “Neural Network Design” (formulas 4.34, 4.35) • hardlims: see Chapter 1 of “Learning from Data” (formula 1.3) • You can also write down the loss function and take its derivative to derive the learning algorithm.
  19. The disadvantage of Perceptron

    The decision boundary of a perceptron is a line, so it cannot handle problems that are not linearly separable, e.g., XOR logic. [Figure: the XOR points in the $(x_1, x_2)$ plane with a candidate decision boundary.]
  20. Linearly Separable V.S. Not Linearly Separable

    We can divide data into two types: linearly separable and not linearly separable. • Linearly separable: the data can be split by a straight line. • Not linearly separable: the data cannot be split by a straight line.
  21. Define the notation of neural network

    Now we replace the weight vector with a matrix. $W_{1,i}$ is the weight from the $i$-th source to the 1st output neuron, so the weights $w_1, w_2, w_3$ become $W_{1,1}, W_{1,2}, W_{1,3}$.
  22. Define the notation of neural network

    With $W = [W_{1,1}\ W_{1,2}\ W_{1,3}]$ and $x = [x_1\ x_2\ x_3]^T$, the weighted sum plus bias is $Wx + b$.
  23. Define the notation of neural network

    Applying the transfer function $f$ gives the output $f(Wx + b)$.
  24. Define the notation of neural network

    The output of a single neuron is a scalar.
  25. A layer of Neurons - Single Neuron Perceptron

    Each output denotes a single model (decision boundary). [Figure: $N$-dimensional input data $x_1, x_2, x_3, \dots, x_N$ connected to the outputs $a_1, a_2, \dots, a_S$; the first neuron has weights $W_{1,1}, W_{1,2}, W_{1,3}, \dots, W_{1,N}$.]
  26. A layer of Neurons - Single Neuron Perceptron

    Notice: we add the bias as an extra input here! The constant input 1 has weight $W_{1,0} = b_1$.
  27. A layer of Neurons

    With weights $W_{1,1}, \dots, W_{S,N}$ and biases $b_1, b_2, \dots, b_S$, the outputs for the $N$-dimensional input $x$ are $a_1 = f(W_1 x + b_1)$, $a_2 = f(W_2 x + b_2)$, ..., $a_S = f(W_S x + b_S)$, where $W_s$ is the $s$-th row of $W$.
  28. A layer of Neurons

    A layer of neurons can be expressed by a function: $a = f(Wx + b)$. The output of this function is a vector with one entry per neuron, and we will combine the results from the different neurons soon.
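As a rough illustration of a layer as a single function, the sketch below computes $f(Wx + b)$ for $S = 2$ neurons and $N = 3$ inputs in plain Python. The sigmoid transfer function and all numeric values are assumptions made for the example; the slides do not fix them.

```python
import math


def sigmoid(n):
    """Logistic sigmoid, used here as an example transfer function f."""
    return 1.0 / (1.0 + math.exp(-n))


def layer(x, W, b, f):
    """Compute a = f(Wx + b); W has S rows (neurons) and N columns (inputs)."""
    return [
        f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Hypothetical parameters: S = 2 neurons, N = 3 inputs.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -3.0]
x = [1.0, 0.5, 1.5]

a = layer(x, W, b, sigmoid)
print(a)  # one output per neuron
```

Each row of `W` together with one entry of `b` plays the role of a single-neuron model from the earlier slides.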
  29. Two layers of Neurons

    We can combine different models into a more powerful single model by adding another layer of neurons.
  30. Two layers of Neurons

    [Figure: the first layer has weights $W^1_{1,1}, \dots, W^1_{S,N}$, biases $b^1_1, b^1_2, \dots, b^1_S$ and transfer function $f^1$; the second layer has weights $W^2_{1,1}, W^2_{1,2}, W^2_{1,3}$, bias $b^2_1$ and transfer function $f^2$.]
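  31. XOR logic

    Use the AND operation to combine two models! [Figure: two first-layer neurons take $x_1$ and $x_2$, one with weights (1, 1) and $b = -0.5$, the other with weights (-1, -1) and $b = +1.5$; a second-layer neuron with weights (1, 1) and $b = -1.5$ combines their outputs.]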
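The two-layer construction for XOR can be sketched as follows. The pairing of weights and biases with neurons is read off the flattened slide text and is therefore an assumption: one first-layer neuron acts as OR, the other as NAND, and the second layer ANDs them.

```python
def hardlim(n):
    """Hard limit transfer function: 1 if n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def neuron(x, w, b):
    """Single neuron perceptron with a hardlim transfer function."""
    return hardlim(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


def xor(x1, x2):
    # First layer: two linear decision boundaries (assumed pairing of weights).
    h1 = neuron([x1, x2], [1.0, 1.0], -0.5)    # behaves like OR
    h2 = neuron([x1, x2], [-1.0, -1.0], 1.5)   # behaves like NAND
    # Second layer: AND of the two first-layer models.
    return neuron([h1, h2], [1.0, 1.0], -1.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))
```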
  32. Multi-Layer Perceptron (MLP)

    A multi-layer perceptron is a kind of neural network called a feedforward neural network: the data is fed forward through the layers.
  33. Feedforward Neural Network

    A feedforward neural network is a series of transformations of the input data x.
  34. Why do we need non-linear activation function?

    • Reason 1: A neural network is a series of transformations of the input data. If every transformation is linear, adding more hidden layers is useless, because a composition of linear transformations is itself a single linear transformation.
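To make Reason 1 concrete, consider two layers with no non-linear activation. The symbols $W^1, b^1, W^2, b^2$ follow the layer notation used earlier; $h$ and $y$ are names introduced here only for this derivation.

```latex
\begin{aligned}
h &= W^1 x + b^1 \\
y &= W^2 h + b^2
   = W^2 (W^1 x + b^1) + b^2
   = (W^2 W^1)\, x + (W^2 b^1 + b^2)
\end{aligned}
```

The result is again of the form $Wx + b$ with $W = W^2 W^1$ and $b = W^2 b^1 + b^2$, so the two-layer linear network can represent nothing that a single linear layer cannot.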
  35. Why do we need non-linear activation function?

    • Reason 2: In machine learning, the neural network is a function approximator for the ideal target function $f_{target}$, which may be complicated (non-linear).
  36. Why do we need non-linear activation function?

    • Reason 2 (continued): [Figure quoted from Hsuan-Tien Lin’s Machine Learning Foundations course.]
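  37. From Linear to Non-Linear

    Use different kinds of activation functions to transform the linear function into a non-linear function. [Figure: the hard limit function (outputs +1 or 0) and the symmetrical hard limit function (outputs +1 or -1), each applied to the condition given by the linear function.]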
  38. The decision boundary of Neural Network

    [Figure quoted from “Introductory Overview Lecture: The Deep Learning Revolution”, JSM 2018.] You can play with ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
  39. Model Capacity

    [Figure: model capacity from low to high: Linear Regression & Logistic Regression, then Multi-Layer Perceptron, then Deep Learning & Neural Network.]
  40. There are many kinds of neural network

    • Feedforward neural network • Recurrent neural network • Radial basis function network • Memory neural network • ... Nowadays these are all called Deep Learning, because they use deep hidden layers.
  41. Representation Learning

    • Rule-based systems and classic machine learning rely on human domain knowledge to design features. • Representation learning needs no handcrafted features: it usually takes raw inputs and extracts representations automatically through a series of transformations. • Thus, representation learning is useful for processing speech data and image data. The image is quoted from Goodfellow et al., Deep Learning.
  42. Learning in Neural Network

    Deep learning is a subfield of machine learning, and we can learn the weights with an optimization algorithm. Learning algorithms: • Gradient-based methods (we focus on these!) • Evolutionary methods • ...
  44. Gradient descent in Neural Network

    In the previous lecture, we learned that gradient descent can be used to optimize a learning model: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, where $J(\theta)$ is the objective function, $\theta$ denotes the parameters (weights) of the neural network, and $\eta$ is the learning rate.
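A minimal sketch of gradient descent on a one-dimensional objective follows. The objective $J(\theta) = (\theta - 2)^2$, the learning rate and the step count are assumptions picked for illustration; a neural network would use the same update with a gradient computed by backpropagation.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = (theta - 2)^2."""
    return 2.0 * (theta - 2.0)


def gradient_descent(theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - lr * dJ/dtheta."""
    for _ in range(steps):
        theta = theta - learning_rate * grad_J(theta)
    return theta


theta_star = gradient_descent(-5.0)
print(theta_star)  # approaches the minimizer theta = 2
```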
  45. Gradient descent in Neural Network

    There are two questions about doing gradient descent in a neural network. • How do we compute the gradients for the weights of each layer? • The landscape of the loss function is probably not convex.
  47. The operation in Neural Network

    Consider a simplified example with inputs $x_1, x_2$, weights $w_1, w_2$ and bias $b$. Three kinds of computation are involved in a neural network: • Addition • Multiplication • Activation
  48. A simplified example of backpropagation

    Consider a simple function $f(x, y) = x + y$. It can be expressed by a computational graph: the inputs $x$ and $y$ feed a + node whose output is $z = f(x, y) = x + y$.
  49. A simplified example of backpropagation

    Consider a simple function $f(x, y) = xy$. It can be expressed by a computational graph too: the inputs $x$ and $y$ feed a * node whose output is $z = f(x, y) = xy$.
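The two graph nodes can be written as small forward and backward functions. This is an illustrative sketch; the function names and the `upstream` argument (the gradient arriving from the output side) are naming choices made here, not taken from the slides.

```python
def add_forward(x, y):
    return x + y


def add_backward(upstream):
    """For z = x + y: dz/dx = 1 and dz/dy = 1, so the gradient passes through."""
    return upstream * 1.0, upstream * 1.0


def mul_forward(x, y):
    return x * y


def mul_backward(x, y, upstream):
    """For z = x * y: dz/dx = y and dz/dy = x, so the inputs swap roles."""
    return upstream * y, upstream * x


print(add_backward(1.0))             # gradients w.r.t. x and y for z = x + y
print(mul_backward(3.0, -4.0, 1.0))  # gradients w.r.t. x and y for z = x * y
```

Note that the multiplication node needs the forward inputs `x` and `y` to compute its gradients, while the addition node does not.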
  50. The derivative of activation function

    [Figure: the sigmoid function and the tanh function.] Image Credits: http://ronny.rest/media/blog/2017/2017_08_16_tanh/tanh_v_sigmoid.jpg
  51. A simplified example of backpropagation

    The following is the computational graph of a perceptron. Our objective is to find the partial derivatives of the cost function $C(y)$ with respect to the weight vector $w$. [Figure: $x_1$ and $w_1$ feed a * node giving $a_1$; $x_2$ and $w_2$ feed a * node giving $a_2$; a + node gives $a_3$; $f$ gives $y$.]
  52. Chain Rule and Backpropagation

    The forward computations are $a_1 = w_1 x_1$, $a_2 = w_2 x_2$, $a_3 = a_1 + a_2$, $y = f(a_3)$.
  53. Chain Rule and Backpropagation

    The gradient at the output, $\partial C / \partial y$, is decided by your cost function.
  54. Chain Rule and Backpropagation

    We can compute the gradients by the chain rule, e.g. $\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial y} \frac{\partial y}{\partial a_3} \frac{\partial a_3}{\partial a_1} \frac{\partial a_1}{\partial w_1}$. The purple line in the figure denotes the backward path of backpropagation.
  56. Chain Rule and Backpropagation

    Let’s take a closer look at a single computation unit.
  57. Chain Rule and Backpropagation

    Consider the unit $a_1 = w_1 x_1$. To compute the gradients of the current computation unit, we need two things: • The gradient of the current output with respect to the current weights. • The cumulative gradient from the output side.
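The whole perceptron graph can be differentiated by multiplying local gradients with the cumulative gradient from the output side. The sketch below assumes a sigmoid for $f$ and a squared-error cost $C(y) = \frac{1}{2}(y - t)^2$ with a target $t$; neither is specified in the slides. A finite-difference check is included as a sanity test.

```python
import math


def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))


def forward(w1, w2, x1, x2):
    """Forward pass through the graph: two products, a sum, an activation."""
    a1 = w1 * x1
    a2 = w2 * x2
    a3 = a1 + a2
    return sigmoid(a3)


def cost(y, t):
    """Assumed cost function: squared error against a target t."""
    return 0.5 * (y - t) ** 2


def backward(w1, w2, x1, x2, t):
    """Backpropagation: local gradient times cumulative upstream gradient."""
    y = forward(w1, w2, x1, x2)
    dC_dy = y - t                # decided by the cost function
    dC_da3 = dC_dy * y * (1.0 - y)  # sigmoid'(a3) = y * (1 - y)
    dC_da1 = dC_da3 * 1.0        # addition node passes the gradient through
    dC_da2 = dC_da3 * 1.0
    return dC_da1 * x1, dC_da2 * x2  # da1/dw1 = x1, da2/dw2 = x2


def numerical_grad(w1, w2, x1, x2, t, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    def c(u, v):
        return cost(forward(u, v, x1, x2), t)
    g1 = (c(w1 + eps, w2) - c(w1 - eps, w2)) / (2 * eps)
    g2 = (c(w1, w2 + eps) - c(w1, w2 - eps)) / (2 * eps)
    return g1, g2


print(backward(0.3, -0.2, 1.5, 2.0, 1.0))
print(numerical_grad(0.3, -0.2, 1.5, 2.0, 1.0))
```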
  58. What about neural network?

    We can consider each neuron in a neural network as a computation unit. [Figure: a network with inputs $x_1, x_2$ and cost $C(\theta)$.]
  59. What about neural network?

    Find the gradients of the cost $C(\theta)$ with respect to the weights $w_1, w_2, w_3, w_4$.
  60. What about neural network?

    $z = w^T x + b$, $a = \sigma(z)$, $z' = w_3 a + b_3$, $z'' = w_4 a + b_4$.
  61. What about neural network? Case 1: Output Layer

    $y_1 = \sigma(z')$, $y_2 = \sigma(z'')$. The gradients with respect to $y_1$ and $y_2$ are decided by the cost function, so we are done.
  62. What about neural network? Case 2: Not Output Layer

    Compute the gradients recursively until we reach the output layer.
  63. Gradient descent in Neural Network

    There are two questions about doing gradient descent in a neural network. • How do we compute the gradients for the weights of each layer? • The landscape of the loss function is probably not convex.
  64. Convex V.S. Non-Convex

    [Figure: two plots of $J(\theta)$ against $\theta$, one convex and one non-convex.] You can see the mathematical definition on Wikipedia: https://en.wikipedia.org/wiki/Convex_function
  65. The Visualization of loss landscape in NN

    A Complicated Loss Landscape. Image Credits: https://www.cs.umd.edu/~tomg/projects/landscapes/
  66. No Absolute Answer!

    • The loss landscape is decided by the training data, and with just a few samples it may not be close to the actual loss. (Learning theory) • Fortunately, a local optimum is usually good enough to solve most problems.
  67. The issues about gradient descent

    • Memory Usage • Vanishing gradients • Dying ReLUs • Exploding gradients in RNNs
  68. Memory Usage

    In order to compute gradients efficiently, we often cache the forward results (e.g., $a_3$, $x_2$ in the perceptron computational graph), which may consume a large amount of memory.
  69. Vanishing gradients

    Let’s look at the sigmoid function: • The maximum derivative value of the sigmoid function is 0.25. • The gradients passed back through backpropagation therefore get smaller and smaller in the early layers (×0.25 ×0.25 ×0.25 …). • If the initialized weights are too big, the pre-activations $Wx + b$ are large in magnitude, the sigmoid saturates, and the gradients are close to zero. Reference: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
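The numbers behind these bullets can be reproduced directly. The sketch below only tracks the activation-derivative factor contributed by each sigmoid layer and ignores the weight factors, which is a simplification; the function names are illustrative.

```python
import math


def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))


def sigmoid_grad(n):
    """Derivative of the sigmoid: sigma(n) * (1 - sigma(n))."""
    s = sigmoid(n)
    return s * (1.0 - s)


def activation_factor_bound(num_layers):
    """Upper bound on the product of sigmoid derivatives across layers."""
    return sigmoid_grad(0.0) ** num_layers


print(sigmoid_grad(0.0))             # the maximum, reached at n = 0
print(activation_factor_bound(10))   # shrinks geometrically with depth
print(sigmoid_grad(10.0))            # saturated unit: gradient near zero
```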
  71. Vanishing gradients in tanh

    If we initialize the neural network with large weights, tanh also suffers from vanishing gradients.
  72. Dying ReLUs

    ReLU (Rectified Linear Unit): • If a neuron gets clamped to zero in the forward pass, then its weights will get zero gradients and stop updating. • Both initialization with large weights and huge gradient updates (an aggressive learning rate) during training can cause the dead ReLU issue.
  73. Exploding gradients in RNNs

    Sometimes the gradients may explode, especially in a vanilla (original) recurrent neural network. Pascanu et al. addressed this problem and proposed a solution called gradient clipping to relieve exploding gradients. A special recurrent unit, LSTM, can also relieve this issue! Reference: Pascanu et al., “On the difficulty of training Recurrent Neural Networks”
  74. Exploding gradients in RNNs

    [Figure: a chain of four * nodes; $a$ is multiplied by $b$ at every node, giving $a_1, a_2, a_3, a_4$ with $a_4 = a \cdot b \cdot b \cdot b \cdot b$.]
  75. Exploding gradients in RNNs

    If $|b| < 1$, the cumulative gradients go to zero. If $|b| > 1$, the cumulative gradients go to infinity.
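A small sketch of the repeated multiplication, together with one common form of gradient clipping (rescaling a gradient vector when its norm exceeds a threshold). The function names, the threshold and the numeric values are assumptions for illustration only.

```python
import math


def chain_product(a, b, steps):
    """Multiply a by b repeatedly, as in the chain a * b * b * ... * b."""
    value = a
    for _ in range(steps):
        value *= b
    return value


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


print(chain_product(1.0, 0.5, 50))   # |b| < 1: shrinks towards zero
print(chain_product(1.0, 2.0, 50))   # |b| > 1: grows without bound
print(clip_by_norm([3.0, 4.0], 1.0)) # norm 5 rescaled to norm 1
```

Clipping bounds the size of each update but does nothing for the vanishing case, where the gradient is already too small.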
  76. The methods to relieve previous issues

    • Xavier initialization - relieves vanishing gradients • Kaiming initialization - relieves dead ReLUs • Use normalization layers ◦ Batch Normalization ◦ Layer Normalization ◦ … • Use LSTM, Gradient Clipping - relieves exploding gradients. An article about initialization and batch normalization (Chinese): https://zhuanlan.zhihu.com/p/25110150 You can also find similar content in Chapter 6 of “Deep Learning from Scratch”, O’Reilly: https://github.com/oreilly-japan/deep-learning-from-scratch
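For reference, the two initialization schemes are usually stated in terms of the layer's fan-in and fan-out. The sketch below uses the normal-distribution variants (standard deviation $\sqrt{2 / (\text{fan\_in} + \text{fan\_out})}$ for Xavier and $\sqrt{2 / \text{fan\_in}}$ for Kaiming); the helper names and the layer sizes are assumptions for illustration.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def kaiming_std(fan_in):
    """Standard deviation for Kaiming (He) normal initialization, for ReLU."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the example is repeatable
W_sigmoid_layer = init_weights(100, 100, xavier_std(100, 100), rng)
W_relu_layer = init_weights(200, 50, kaiming_std(200), rng)
print(xavier_std(100, 100), kaiming_std(200))
```

Both schemes shrink the weights as the layer gets wider, which keeps the pre-activations $Wx + b$ from growing with the number of inputs.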
  77. The resources about backpropagation

    • Lecture notes of CS231n, Stanford ◦ http://cs231n.github.io/optimization-1/ • Hung-Yi Lee’s course video at NTU, Taiwan ◦ https://www.youtube.com/watch?v=ibJpTrp5mcE&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=12 • An article from Andrej Karpathy ◦ Yes you should understand backprop: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
  78. Tips for training deep neural network

    • Choose Optimizer • Data augmentation • Regularization • Early Stopping • Normalization
  79. Choose Optimizer

    There are many optimizers in deep learning packages; the most basic one is SGD (Stochastic Gradient Descent). However, in order to optimize the objective function well, researchers have designed many useful optimizers for deep learning: • Stochastic Gradient Descent (SGD) • Adagrad • Adadelta • RMSProp • Adam. Related course: “Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization”. Recommended article: http://ruder.io/optimizing-gradient-descent/index.html
  80. Choose Optimizer

    Nowadays, a better optimizer has the following two attributes: • Adaptive Learning Rate • Moment (Momentum) e.g., Adam: Adaptive moment estimation
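To show how the two attributes combine, here is a scalar sketch of the Adam update: the first moment `m` acts as momentum and the second moment `v` scales the step per parameter. The hyperparameter defaults follow the commonly quoted values except for the learning rate, and the toy objective $(\theta - 3)^2$ is an assumption for illustration.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Minimize a scalar function given its gradient, using Adam updates."""
    m = 0.0  # first moment: exponential average of gradients (momentum)
    v = 0.0  # second moment: exponential average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero init
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step size
    return theta


# Toy objective J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
result = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(result)
```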
  81. Choose Optimizer

    [Figure: the results are quoted from Kingma et al., “Adam: A Method for Stochastic Optimization”.]
  82. Data augmentation

    Increase the generalizability of the model (avoid overfitting) by adding human-modified data, including noise. (For images: rotation, scaling, panning, flipping, viewing from different angles.)
  83. Data augmentation

    Be careful!! DO NOT apply transformations that would change the correct class. [Image from MNIST.]
  84. Regularization

    Many strategies used in machine learning are explicitly designed to reduce the testing error, possibly at the expense of increased training error. These strategies are known collectively as regularization (e.g., L1-regularization, L2-regularization). The most popular one in deep learning is Dropout.
  85. Regularization - Dropout

    [Figures quoted from Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (JMLR 2014).]
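A minimal sketch of dropout on a list of activations. It uses the "inverted dropout" variant, which rescales the kept activations during training so that nothing needs to change at test time; the original paper instead rescales the weights at test time, so treat this as one common implementation choice rather than the paper's exact procedure. The function name and arguments are illustrative.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: zero each unit with probability drop_prob in training."""
    if not training or drop_prob == 0.0:
        return list(activations)  # test time: the layer is the identity
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)  # fixed seed so the example is repeatable
train_out = dropout([1.0] * 1000, drop_prob=0.5, training=True, rng=rng)
test_out = dropout([1.0, 2.0], drop_prob=0.5, training=False, rng=rng)
print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(test_out)
```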
  87. Normalization

    • Normalization by using neural network computation units ◦ Batch Normalization ◦ Layer Normalization. [Figure: the computation graph of Batch Normalization.] Image Credits: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
  88. Normalization

    • Normalization by using neural network computation units ◦ Batch Normalization ◦ Layer Normalization. Notice!! The behavior of a normalization layer may be different in training and testing.
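The train/test difference can be seen in a stripped-down batch normalization for a single scalar feature. This sketch omits the learnable scale and shift parameters and uses an exponential moving average for the running statistics; the class name, momentum and epsilon values are assumptions for illustration.

```python
import math


class BatchNorm1D:
    """Batch normalization of one scalar feature, without scale and shift."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: normalize with the statistics of the current batch
            # and update the running averages used later at test time.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self.running_mean = (self.momentum * self.running_mean
                                 + (1.0 - self.momentum) * mean)
            self.running_var = (self.momentum * self.running_var
                                + (1.0 - self.momentum) * var)
        else:
            # Testing: use the stored running statistics, not the batch.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = BatchNorm1D()
train_out = bn([1.0, 2.0, 3.0], training=True)
test_out = bn([2.0], training=False)
print(train_out, test_out)
```

In test mode a single example can be normalized meaningfully, whereas in training mode a batch of one would always be mapped to zero.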
  89. Resources about CNN

    CNN stands for Convolutional Neural Network. It is a popular neural network architecture in deep learning, especially in computer vision tasks. Here are some nice resources: • CS231n, Stanford: http://cs231n.stanford.edu/ • ConvNetJS: https://cs.stanford.edu/people/karpathy/convnetjs/index.html • Article: An Intuitive Explanation of Convolutional Neural Networks: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ • Intuitively Understanding Convolutions for Deep Learning: https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1 • Intuitively Understanding Convolutions for Deep Learning (Chinese): http://bangqu.com/nNMB58.html#utm_source=Facebook_PicSee&utm_medium=Social
  90. Reference

    Books: • Deep Learning • Neural Network Design • Learning from Data. Courses: • CS229, Stanford • CS231n, Stanford • Hsuan-Tien Lin’s Machine Learning Foundations, National Taiwan University • Hung-yi Lee’s Machine Learning, National Taiwan University. And some papers and articles.
  91. Resources

    • Deep Learning (Nature): ◦ https://www.evl.uic.edu/creativecoding/courses/cs523/slides/week3/Deep Learning_LeCun.pdf • PyTorch examples: ◦ https://github.com/jcjohnson/pytorch-examples • CS230 code example: ◦ https://github.com/cs230-stanford/cs230-code-examples • Introductory Overview Lecture: The Deep Learning Revolution: ◦ http://www.cs.cmu.edu/~rsalakhu/jsm2018.html • Why is softmax named “soft”max? ◦ http://neuralnetworksanddeeplearning.com/chap3.html#softmax