Slide 1

Feedforward Neural Network (II): Multi-class Classification
Naoaki Okazaki
School of Computing, Tokyo Institute of Technology
[email protected]
PowerPoint template designed by https://ppt.design4u.jp/template/

Slide 2

Highlights of this lecture
• We extend binary classification to multi-class classification
  • Assign a weight vector to every category
  • Extend the Perceptron algorithm to multi-class classification
  • Extend the sigmoid function to the softmax function
• Again, automatic differentiation is useful for SGD training
• ReLU is a popular activation function for internal layers
• Dropout realizes model ensembling and averaging in a simple way

Slide 3

MNIST: Handwritten Digit Recognition

Slide 4

MNIST database (LeCun+ 1998)
We want to classify an input image into 10 categories (digits).
(Figure: sample handwritten digit images labeled 4, 1, 0, 5, 6, 2, 8, 5.)
Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.

Slide 5

Representing an image on a computer
• An image (28 × 28 pixels, grayscale) is represented by a 28 × 28 matrix.
• The original dataset represents brightness as an 8-bit integer ([0, 255]).
• In this lecture, brightness is normalized to the range [0, 1].

Slide 6

Multi-Class Classification and Perceptron Algorithm

Slide 7

Representing an image with a vector
• We convert an image into a vector where each element represents the brightness of a pixel, flattening the 2D matrix into a 1D vector
  • Pixel at (i, j) → feature ID (row major): 28(i − 1) + j
• A 28 × 28 matrix is converted into a vector of 784 (= 28 × 28) dimensions
• A more sophisticated method (e.g., Convolutional Neural Network) will be explained later
• Even this simple treatment works surprisingly well

Slide 8

Linear multi-class classification
• A model has a weight vector w_y for every category y
• Given the pixels of an image as a vector x, compute the score (inner product) for each category:
  w_0⋅x = −1.24, w_1⋅x = −4.30, w_2⋅x = −0.68, w_3⋅x = +3.62, w_4⋅x = −5.61, w_5⋅x = −1.94, w_6⋅x = −5.56, w_7⋅x = −6.86, w_8⋅x = −0.08, w_9⋅x = −3.69
• Choose the category with the maximum score: ŷ = 3

Slide 9

General form: linear multi-class classification
  ŷ = argmax_{y ∈ Y} w_y ⋅ x
• Input: x ∈ ℝ^d (d: number of dimensions)
• Output: ŷ ∈ Y (Y: set of possible categories for the input)
• Parameter: weight w_y ∈ ℝ^d (prepared for every category y)
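
As a concrete illustration of this formula, here is a minimal NumPy sketch; the weight values and the input are random placeholders, not a trained model.

```python
import numpy as np

D, K = 784, 10                  # input dimension, number of categories
rng = np.random.default_rng(0)

W = rng.normal(size=(K, D))     # one weight vector w_y per category (placeholder values)
x = rng.random(D)               # a flattened 28x28 image with brightness in [0, 1]

scores = W @ x                  # inner product w_y . x for every category y
y_hat = int(np.argmax(scores))  # choose the category with the maximum score
print(scores, y_hat)
```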

Slide 10

Supervised learning (training) for a multi-class classifier
• We have supervision data D (x_n: input, y_n: output)
  • D = {(x_1, y_1), …, (x_N, y_N)} (N instances; x_n: the n-th instance in the training data, y_n: the category for the n-th instance)
• Find the weight vectors such that they predict the training instances as correctly as possible
  • ∀n ∈ {1, …, N}: ŷ_n = argmax_{y ∈ Y} w_y ⋅ x_n = y_n
• We assume generalization
  • If the parameters reproduce the training instances well, they will also work for unseen instances

Slide 11

Perceptron algorithm (Collins, 2002)
1. w_y = 0 for all y ∈ Y
2. Repeat:
3.   (x, y) ⟵ an instance chosen from D at random
4.   ŷ ⟵ argmax_{y′ ∈ Y} w_{y′} ⋅ x
5.   if ŷ ≠ y then: (incorrect prediction)
6.     w_y ⟵ w_y + x   (w_y ⋅ x will be larger)
7.     w_ŷ ⟵ w_ŷ − x   (w_ŷ ⋅ x will be smaller)
8. Until no instance updates the weights
Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proc. of EMNLP, 1-8.
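
A minimal NumPy sketch of this algorithm; it loops over shuffled epochs instead of the literal "repeat until no updates", and uses random toy data as a stand-in for MNIST.

```python
import numpy as np

def train_perceptron(X, Y, num_classes, epochs=10, seed=0):
    """Multi-class perceptron (Collins, 2002). X: (N, D) inputs, Y: (N,) integer labels."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((num_classes, D))           # w_y = 0 for all y
    for _ in range(epochs):
        for n in rng.permutation(N):         # visit instances in random order
            x, y = X[n], Y[n]
            y_hat = int(np.argmax(W @ x))    # argmax_y w_y . x
            if y_hat != y:                   # incorrect prediction
                W[y] += x                    # w_y . x will be larger
                W[y_hat] -= x                # w_y_hat . x will be smaller
    return W

# Toy usage with random data standing in for the MNIST vectors.
rng = np.random.default_rng(1)
X = rng.random((100, 784))
Y = rng.integers(0, 10, size=100)
W = train_perceptron(X, Y, num_classes=10)
```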

Slide 12

Intuitive example of Perceptron updates
• Compute the score (inner product) for each category of the image (pixels x, true category y = 3):
  w_0⋅x = −1.24, w_1⋅x = −4.30, w_2⋅x = −0.68, w_3⋅x = +3.62, w_4⋅x = −5.61, w_5⋅x = −1.94, w_6⋅x = −5.56, w_7⋅x = −6.86, w_8⋅x = +3.87, w_9⋅x = −3.69
• The model predicts ŷ = 8, but y = 3 in fact: an update is necessary!
  • w_3 ⟵ w_3 + x   (w_3 ⋅ x will be larger after the update)
  • w_8 ⟵ w_8 − x   (w_8 ⋅ x will be smaller after the update)

Slide 13

Multi-class Perceptron implemented in numpy
https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb

Slide 14

Summary
• A linear multi-class classifier has a weight vector w_y for every category y ∈ Y
• Given an input x, a linear multi-class classifier computes a score for every category as an inner product w_y ⋅ x
• It predicts the category ŷ that yields the highest score among the possible categories
• Weight vectors can be trained by an extension of the Perceptron algorithm to multi-class classification (structured perceptron)
  • Again, we cannot use it for multi-layer neural networks
  • Let us consider SGD for training multi-class classifiers

Slide 15

Multi-class Classification with Softmax Function

Slide 16

Training multi-class classifiers with SGD
• In order to train binary classifiers using SGD, we had to change the activation function from the step function to the sigmoid function
• What is the activation function for multi-class classification corresponding to the sigmoid function?
• Answer: the softmax function

Slide 17

Softmax function: Definition
• Given a vector a ∈ ℝ^K, the softmax function σ: ℝ^K → ℝ^K yields
  σ_k(a) = exp(a_k) / Σ_{k′=1}^{K} exp(a_{k′})
• Here σ_k(a) denotes the k-th element of the value of σ(a)
• We use the same notation σ as for the sigmoid function (do not confuse them)
• The result of the softmax function satisfies:
  ∀k: σ_k(a) > 0,   Σ_{k=1}^{K} σ_k(a) = 1
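
A direct NumPy translation of the definition; subtracting max(a) before exponentiation is a standard stability trick (not on the slide) that does not change the result.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis: exp(a_k) / sum_k' exp(a_k')."""
    z = np.exp(a - np.max(a, axis=-1, keepdims=True))  # shift for numerical stability
    return z / np.sum(z, axis=-1, keepdims=True)

# The scores from the earlier example slide.
a = np.array([-1.24, -4.30, -0.68, 3.62, -5.61, -1.94, -5.56, -6.86, -0.08, -3.69])
p = softmax(a)
print(p)          # all elements are positive
print(p.sum())    # and they sum to 1
```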

Slide 18

Softmax function: Intuitive explanation
• The softmax function converts scores for categories a ∈ ℝ^K into a probability distribution
• This is similar to binary classification, where the sigmoid function converts a score into a probability

Slide 19

Single-layer NNs for multi-class classification
• Given an input x ∈ ℝ^d, a single-layer NN for multi-class classification yields a probability distribution over categories ŷ ∈ ℝ^K:
  ŷ = σ(a),   a = Wx
• Here, W ∈ ℝ^{K×d} is a weight matrix
• The NN can be seen as a mapping: ℝ^d → ℝ^K
• Let w_k denote the k-th row vector of the matrix W; the score for category k is a_k = w_k ⋅ x
  • This is the same as linear multi-class classification

Slide 20

An example with softmax function
• Scores (inner products) for an image (pixels x) and the resulting probability distribution ŷ = σ(a):
  w_0⋅x = −1.24 (0.74%), w_1⋅x = −4.30 (0.03%), w_2⋅x = −0.68 (1.29%), w_3⋅x = +3.62 (95.1%), w_4⋅x = −5.61 (0.01%), w_5⋅x = −1.94 (0.37%), w_6⋅x = −5.56 (0.01%), w_7⋅x = −6.86 (0.00%), w_8⋅x = −0.08 (2.35%), w_9⋅x = −3.69 (0.06%)

Slide 21

Supervision data for multi-class (with a notational change)
• We have supervision data D = {(x_1, t_1), …, (x_N, t_N)} (N instances)
• Input: x_n = (x_{n,1}, x_{n,2}, …, x_{n,d})⊺ ∈ ℝ^d
• Output (changed from the previous notation): t_n = (t_{n,1}, t_{n,2}, …, t_{n,K})⊺ ∈ ℝ^K (one-hot vector)
  • Example: y_n = 3 is represented as t_n = (0,0,0,1,0,0,0,0,0,0)⊺
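
A quick NumPy way to produce such one-hot vectors from integer labels; indexing into np.eye is just one convenient idiom.

```python
import numpy as np

def to_one_hot(y, num_classes):
    """Convert integer labels y (shape (N,)) into one-hot vectors t (shape (N, K))."""
    return np.eye(num_classes)[y]

y = np.array([3, 0, 9])
t = to_one_hot(y, 10)
print(t[0])   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```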

Slide 22

Instance-wise likelihood
• We introduce the instance-wise likelihood to measure how well the parameters reproduce (x_n, t_n):
  p_n = ∏_{k=1}^{K} { ŷ_{n,k} (if t_{n,k} = 1);  1 (if t_{n,k} = 0) } = ∏_{k=1}^{K} (ŷ_{n,k})^{t_{n,k}}
• This is the probability of the true label estimated by the model
• Example (previous slide): with t_n = (0,0,0,1,0,0,0,0,0,0)⊺ and ŷ_{n,3} = 95.1%, the instance-wise likelihood is p_n = 0.951

Slide 23

Likelihood on the training data
• We assume that all instances in the training data are i.i.d. (independent and identically distributed)
• We define the likelihood as a joint probability on the data:
  L_D(W) = ∏_{n=1}^{N} p_n
• When the training data D = {(x_1, t_1), …, (x_N, t_N)} is fixed, the likelihood L_D(W) is a function of the parameters W
• Let us maximize L_D(W) by changing W
  • This is called Maximum Likelihood Estimation (MLE)
  • The maximizer W* reproduces the training data well

Slide 24

Training as a minimization problem
• Products of values in (0,1) often cause underflow
• Instead, use the log-likelihood, the logarithm of the likelihood:
  log L_D(W) = log ∏_{n=1}^{N} p_n = Σ_{n=1}^{N} log p_n
• In mathematical optimization, we usually consider a minimization problem instead of a maximization problem
• We define the objective function E(W) as the negative of the log-likelihood:
  E(W) = −log L_D(W) = −Σ_{n=1}^{N} log p_n
• E(W) is called a loss function or error function

Slide 25

Training as a minimization problem
• Given the training data D = {(x_1, t_1), …, (x_N, t_N)}, find W* by solving the minimization problem:
  W* = argmin_W E(W) = argmin_W Σ_{n=1}^{N} ℓ_n(W)
  ℓ_n(W) = −log p_n = −log ∏_{k=1}^{K} (ŷ_{n,k})^{t_{n,k}} = −Σ_{k=1}^{K} t_{n,k} log ŷ_{n,k}

Slide 26

Stochastic Gradient Descent (SGD)
• The objective function is the sum of the losses of the instances:
  E(W) = Σ_{n=1}^{N} ℓ_n(W)
• We can use Stochastic Gradient Descent (SGD) and its variants (e.g., Adam) for minimizing E(W)
• SGD algorithm (T is the number of updates):
  1. Initialize W with random values
  2. for t ⟵ 1 to T:
  3.   η_t ⟵ 1/t   # Learning rate at t
  4.   (x_n, t_n) ⟵ an instance chosen from D at random
  5.   W ⟵ W − η_t ∂ℓ_n(W)/∂W

Slide 27

Exercise: compute the gradient
Prove (we omit the instance index n for simplicity):
  ∂ℓ/∂w_k = (∂ℓ/∂a_k)(∂a_k/∂w_k) = (ŷ_k − t_k) x
by computing the gradients ∂ℓ/∂a_k and ∂a_k/∂w_k. Here:
  ℓ = −Σ_{k=1}^{K} t_k log ŷ_k,   ŷ_k = σ_k(a) = exp(a_k) / Σ_{k′=1}^{K} exp(a_{k′}),   a_k = w_k ⋅ x

Slide 28

Answer: the gradient
Because it is easy to find ∂a_k/∂w_k = x, we concentrate on ∂ℓ/∂a_k (writing Z = Σ_{k′=1}^{K} exp(a_{k′})):
  ∂ℓ/∂a_k = ∂/∂a_k (−Σ_{j=1}^{K} t_j log ŷ_j) = −Σ_{j=1}^{K} (t_j/ŷ_j) ∂ŷ_j/∂a_k = −(t_k/ŷ_k) ∂ŷ_k/∂a_k − Σ_{j≠k} (t_j/ŷ_j) ∂ŷ_j/∂a_k
The first term is:
  −(t_k/ŷ_k) ∂ŷ_k/∂a_k = −(t_k/ŷ_k) ∂/∂a_k (exp(a_k)/Z) = −(t_k/ŷ_k) (exp(a_k) Z − exp(a_k) exp(a_k))/Z² = −(t_k/ŷ_k) (exp(a_k)/Z)((Z − exp(a_k))/Z) = −(t_k/ŷ_k) ŷ_k (1 − ŷ_k) = −t_k (1 − ŷ_k) = −t_k + t_k ŷ_k
The second term is:
  −Σ_{j≠k} (t_j/ŷ_j) ∂ŷ_j/∂a_k = −Σ_{j≠k} (t_j/ŷ_j) ∂/∂a_k (exp(a_j)/Z) = −Σ_{j≠k} (t_j/ŷ_j) (0 − exp(a_j) exp(a_k))/Z² = Σ_{j≠k} (t_j/ŷ_j) ŷ_j ŷ_k = Σ_{j≠k} t_j ŷ_k
Therefore:
  ∂ℓ/∂a_k = −t_k + t_k ŷ_k + Σ_{j≠k} t_j ŷ_k = −t_k + ŷ_k Σ_{j=1}^{K} t_j = −t_k + ŷ_k = ŷ_k − t_k
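
The result ∂ℓ/∂a_k = ŷ_k − t_k is easy to sanity-check numerically. The finite-difference sketch below is my own check, not part of the lecture materials.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - a.max())
    return z / z.sum()

def loss(a, t):
    return -np.sum(t * np.log(softmax(a)))   # cross-entropy with a one-hot target t

rng = np.random.default_rng(0)
a = rng.normal(size=10)
t = np.eye(10)[3]                            # the true category is 3

analytic = softmax(a) - t                    # the derived gradient: y_hat - t
numeric = np.zeros_like(a)
eps = 1e-6
for k in range(len(a)):
    e = np.zeros_like(a); e[k] = eps
    numeric[k] = (loss(a + e, t) - loss(a - e, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # should be tiny (around 1e-9 or smaller)
```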

Slide 29

SGD elaborated for training single-layer NNs
1. For every k, initialize w_k with random values
2. for t ⟵ 1 to T:
3.   η_t ⟵ 1/t
4.   (x_n, t_n) ⟵ an instance chosen from D at random
5.   ŷ_n ⟵ σ(W x_n)
6.   ∀k: w_k ⟵ w_k − η_t ∂ℓ_n/∂w_k = w_k + η_t (t_{n,k} − ŷ_{n,k}) x_n
• The algorithm is the same as that for binary classification
• For each category k, it updates the weight w_k by the amount of the error (t_{n,k} − ŷ_{n,k}) between the true probability t_{n,k} and the estimated probability ŷ_{n,k}
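
A minimal NumPy sketch of this loop, with random toy data in place of MNIST; the notebook linked earlier contains the full implementation.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - a.max())
    return z / z.sum()

def train_sgd(X, T, num_updates=1000, seed=0):
    """Single-layer softmax classifier trained with SGD. X: (N, D), T: (N, K) one-hot."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = T.shape[1]
    W = rng.normal(scale=0.01, size=(K, D))      # random initialization
    for t in range(1, num_updates + 1):
        eta = 1.0 / t                            # learning rate at update t
        n = rng.integers(N)                      # pick an instance at random
        x, target = X[n], T[n]
        y_hat = softmax(W @ x)                   # predicted distribution
        W += eta * np.outer(target - y_hat, x)   # w_k += eta * (t_k - y_hat_k) * x
    return W

rng = np.random.default_rng(1)
X = rng.random((200, 784))
T = np.eye(10)[rng.integers(0, 10, size=200)]
W = train_sgd(X, T)
```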

Slide 30

Intuitive example of SGD updates (y_n = 3, ŷ = 3; η_t = 1)
• Scores and predicted probabilities:
  w_0⋅x = −1.24 (0.74%), w_1⋅x = −4.30 (0.03%), w_2⋅x = −0.68 (1.29%), w_3⋅x = +3.62 (95.1%), w_4⋅x = −5.61 (0.01%), w_5⋅x = −1.94 (0.37%), w_6⋅x = −5.56 (0.01%), w_7⋅x = −6.86 (0.00%), w_8⋅x = −0.08 (2.35%), w_9⋅x = −3.69 (0.06%)
• Updates w_k ⟵ w_k + η_t (t_k − ŷ_k) x:
  w_0 −= 0.0074 x, w_1 −= 0.0003 x, w_2 −= 0.0129 x, w_3 += 0.0490 x, w_4 −= 0.0001 x, w_5 −= 0.0037 x, w_6 −= 0.0001 x, w_7 −= 0.0000 x, w_8 −= 0.0235 x, w_9 −= 0.0006 x

Slide 31

SGD implemented in numpy
https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb

Slide 32

Computing the loss with mini-batch
• Single instance: ŷ = σ(Wx) (a distribution over K categories); the loss is ℓ = −t ⋅ log ŷ
• Mini-batch of M instances (parallelizable on CPU/GPU): stack the inputs into a matrix X (M × d), compute Ŷ = σ(XW⊺) (M × K), and average the losses:
  ℓ = −(1/M) Σ_{m=1}^{M} t_m ⋅ log ŷ_m
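
A vectorized NumPy sketch of the mini-batch loss; the small epsilon inside the log is my own guard against log(0), not something the slide specifies.

```python
import numpy as np

def softmax(A):
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def minibatch_loss(W, X, T):
    """X: (M, D) inputs, T: (M, K) one-hot targets, W: (K, D) weights."""
    Y_hat = softmax(X @ W.T)                                     # (M, K) predicted distributions
    return -np.mean(np.sum(T * np.log(Y_hat + 1e-12), axis=1))   # average loss over the batch

rng = np.random.default_rng(0)
M, D, K = 256, 784, 10
X = rng.random((M, D))
T = np.eye(K)[rng.integers(0, K, size=M)]
W = rng.normal(scale=0.01, size=(K, D))
print(minibatch_loss(W, X, T))
```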

Slide 33

Mini-batch training
• Most DL frameworks implement mini-batch training by increasing the order of the tensors
  • For example, x ∈ ℝ^d → X ∈ ℝ^{M × d}
• Increasing the batch size M may:
  • Speed up the time required for an epoch through parallelization
  • Decrease the number of parameter updates (1/M)
• Goyal+ (2017) recommend: when the mini-batch size is multiplied by k, multiply the learning rate by k
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.

Slide 34

Mini-batch SGD implemented in pytorch
• xt: torch.tensor (N × 784), yt: torch.tensor (N)
• Define a NN model as a sequence of modules
• Sample instances with a batch size of 256
https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb
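
The notebook linked above contains the actual implementation; the sketch below is my own minimal PyTorch version of the same idea, with random tensors standing in for xt and yt.

```python
import torch
import torch.nn as nn

# Random stand-ins for the MNIST tensors used in the notebook.
N, D, K = 10000, 784, 10
xt = torch.rand(N, D)                      # (N x 784) inputs
yt = torch.randint(0, K, (N,))             # (N,) integer labels

# A single-layer NN defined as a sequence of modules. CrossEntropyLoss applies
# log-softmax internally, so the model outputs raw scores.
model = nn.Sequential(nn.Linear(D, K))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size = 256
for epoch in range(2):
    perm = torch.randperm(N)
    for i in range(0, N, batch_size):      # sample mini-batches of 256 instances
        idx = perm[i:i + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(xt[idx]), yt[idx])
        loss.backward()
        optimizer.step()
```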

Slide 35

Regularization
• MLE often causes over-fitting
  • When the training data is linearly separable, the weights diverge (‖W‖ → ∞) as the loss Σ_{n=1}^{N} ℓ_n(W) → 0
  • The model is easily affected by noise in the training data
• We use regularization (MAP estimation)
  • We introduce a penalty term that grows when ‖W‖ becomes large
  • The loss function with an L2 regularization term:
    E(W) = −Σ_{n=1}^{N} log p_n + λ‖W‖₂²
  • λ is a hyperparameter that controls the trade-off between over- and under-fitting
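
In PyTorch this penalty is usually realized through the optimizer's weight_decay argument; the sketch below is a minimal illustration, and the value 1e-4 is just a placeholder, not a recommendation from the lecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 10))

# weight_decay adds an L2 penalty on the weights: the gradient of each parameter
# gets an extra weight_decay * w term, which corresponds to the lambda * ||W||^2
# style penalty above (up to a constant factor).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```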

Slide 36

Summary and notes
• K-class classification is realized by an output layer with dimension K
  • Softmax yields a probability distribution ŷ ∈ ℝ^K
• The loss function compares the model output ŷ with the true category
  • The true category is represented as a one-hot vector t, and ŷ as a K-dimensional probability vector
• Again, automatic differentiation is useful for training multi-class NNs
• A single-layer NN with the softmax activation function is also known as multi-class logistic regression or maximum entropy modeling

Slide 37

Loss functions

Slide 38

Generic notation for multi-layer NNs
• Configurations:
  • The number of layers
  • The numbers of dimensions of the hidden layers
  • An activation function for each layer
  • A loss function
• Example (a 2 → 3 → 2 → 2 network, input x = h⁽⁰⁾, output ŷ = h⁽³⁾):
  • First layer (ℝ² → ℝ³): h⁽¹⁾ = g⁽¹⁾(a⁽¹⁾), a⁽¹⁾ = W⁽¹⁾h⁽⁰⁾, W⁽¹⁾ ∈ ℝ^{3×2}, a⁽¹⁾, h⁽¹⁾ ∈ ℝ³
  • Second layer (ℝ³ → ℝ²): h⁽²⁾ = g⁽²⁾(a⁽²⁾), a⁽²⁾ = W⁽²⁾h⁽¹⁾, W⁽²⁾ ∈ ℝ^{2×3}, a⁽²⁾, h⁽²⁾ ∈ ℝ²
  • Final layer (ℝ² → ℝ²): h⁽³⁾ = g⁽³⁾(a⁽³⁾), a⁽³⁾ = W⁽³⁾h⁽²⁾, W⁽³⁾ ∈ ℝ^{2×2}, a⁽³⁾, h⁽³⁾ ∈ ℝ²
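
A PyTorch sketch of the 2 → 3 → 2 → 2 example above. The slide only fixes the layer sizes; the sigmoid activations for the hidden layers, the softmax output, and the absence of bias terms are my assumptions for illustration.

```python
import torch
import torch.nn as nn

# Layer sizes follow the slide; activations and bias-free layers are assumptions.
model = nn.Sequential(
    nn.Linear(2, 3, bias=False), nn.Sigmoid(),        # first layer:  R^2 -> R^3
    nn.Linear(3, 2, bias=False), nn.Sigmoid(),        # second layer: R^3 -> R^2
    nn.Linear(2, 2, bias=False), nn.Softmax(dim=-1),  # final layer:  R^2 -> R^2
)

x = torch.tensor([[0.5, -1.0]])   # h^(0)
y_hat = model(x)                  # h^(3)
print(y_hat)
```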

Slide 39

Cross entropy loss (binary)
• Cross entropy: H(t, ŷ) = −Σ_y t_y log ŷ_y
  (t: true probability distribution, 1 for the true category and 0 otherwise; ŷ: predicted probability distribution)
• Binary case:
  ℓ(t, ŷ) = −{ log ŷ (if t = 1);  log(1 − ŷ) (if t = 0) } = −t log ŷ − (1 − t) log(1 − ŷ)

Slide 40

Cross entropy loss (multi)
• Cross entropy: H(t, ŷ) = −Σ_k t_k log ŷ_k
  (t: true probability distribution, 1 for the true category and 0 otherwise; ŷ: predicted probability distribution)
• With softmax outputs and a one-hot t whose true category is y:
  ℓ(t, ŷ) = −log (exp(a_y) / Σ_k exp(a_k)) = −a_y + log Σ_k exp(a_k)
• This is the negative log of the probability of the true label estimated by the model (ℓ = −log p_n)
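
The −a_y + log Σ_k exp(a_k) form is convenient to implement with the log-sum-exp trick; a small NumPy sketch, using the scores from the earlier example.

```python
import numpy as np

def cross_entropy(a, y):
    """Cross-entropy loss from raw scores a (shape (K,)) and the true category index y."""
    m = np.max(a)                                    # shift for numerical stability
    logsumexp = m + np.log(np.sum(np.exp(a - m)))    # log sum_k exp(a_k)
    return -a[y] + logsumexp                         # -a_y + log sum_k exp(a_k)

a = np.array([-1.24, -4.30, -0.68, 3.62, -5.61, -1.94, -5.56, -6.86, -0.08, -3.69])
print(cross_entropy(a, 3))   # roughly -log(0.951), i.e. about 0.05
```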

Slide 41

Mean Squared Error (MSE) loss
• Used for regression:
  ℓ(t, ŷ) = (1/2) ‖t − ŷ‖₂²

Slide 42

Activation functions

Slide 43

Step
Step function: ℝ → {0,1}
  g(a) = 1 (if a > 0), 0 (otherwise)
• Pros
  • Yields a binary output
• Cons (never use this)
  • Zero gradients: SGD cannot update parameters because ∂g(a)/∂a = 0

Slide 44

Sigmoid
Sigmoid: ℝ → (0,1)
  σ(a) = 1 / (1 + e^{−a})
• Pros
  • Yields an output within (0,1)
• Cons
  • Not zero-centered
  • Zero (vanishing) gradients when |a| is large

Slide 45

Hyperbolic tangent (tanh)
tanh: ℝ → (−1,1)
  tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) = 2σ(2a) − 1
• Pros
  • Yields an output within (−1,1)
  • Zero-centered
• Cons
  • Zero (vanishing) gradients when |a| is large

Slide 46

Rectified Linear Unit (ReLU)
ReLU: ℝ → ℝ≥0
  ReLU(a) = max(0, a)
• Pros
  • Gradients do not vanish when a > 0
  • Light-weight computation (no exp)
  • Faster convergence (e.g., 6× faster on CIFAR-10)
• Cons
  • Not zero-centered
  • Dead neurons when a ≤ 0
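
For reference, NumPy one-liners for the four activation functions discussed above.

```python
import numpy as np

def step(a):    return (a > 0).astype(float)     # R -> {0, 1}
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))  # R -> (0, 1)
def tanh(a):    return np.tanh(a)                # R -> (-1, 1)
def relu(a):    return np.maximum(0.0, a)        # R -> [0, inf)

a = np.linspace(-3, 3, 7)
for f in (step, sigmoid, tanh, relu):
    print(f.__name__, f(a))
```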

Slide 47

Dropout

Slide 48

Dropout (Srivastava+ 2014)
• A simple method for preventing overfitting: randomly drop units from a NN during training
  • Virtually samples an exponential number of different 'thinned' NNs during training
  • Prevents units from co-adapting too much
• At inference (test) time, approximate the effect of averaging the thinned NNs
  • Simply use the entire NN with smaller weights
• Improves the performance on test data
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929-1958.

Slide 49

Dropout at training phase
• For each training instance, choose units at random and drop them
• This virtually samples an exponential number of different 'thinned' NNs
• Train the thinned NNs with the same algorithm as for standard NNs
(Srivastava+ 2014)

Slide 50

Dropout at inference (test) phase
• At training time, keep each unit with probability p
• At inference (test) time, multiply the trained weights by p
• This approximates the effect of averaging the predictions of exponentially many thinned models
(Srivastava+ 2014)

Slide 51

Dropout at training phase
  r ~ Bernoulli(p),   h̃ = r ⊙ h
(Srivastava+ 2014)
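
A NumPy sketch of this formulation, where p is the retention probability as in Srivastava+ (2014); the test-time scaling by p follows the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5):
    """Training time: r ~ Bernoulli(p), h_tilde = r * h (elementwise)."""
    r = rng.binomial(1, p, size=h.shape)
    return r * h

def dropout_test(h, p=0.5):
    """Test time: keep all units but scale by p to approximate model averaging."""
    return p * h

h = rng.random(5)
print(dropout_train(h))   # some units dropped at random
print(dropout_test(h))    # all units kept, scaled by p
```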

Slide 52

Dropout in pytorch
• Simply add a Dropout module (we can control the dropout rate by specifying it as an argument; p = 0.5 by default)
• Do not forget to switch between training and test modes:
  • model.train(): dropout is active, so units are dropped at random
  • model.eval(): dropout is disabled, so the full network is used (approximating the averaging of thinned models)
https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb
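
A minimal PyTorch sketch of these points; note that nn.Dropout takes the drop probability and rescales the kept units at training time, so no extra scaling is needed at test time.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),            # drop rate given as an argument (0.5 by default)
    nn.Linear(256, 10),
)

x = torch.rand(1, 784)
model.train()                     # training mode: units are dropped at random
print(model(x))
model.eval()                      # evaluation mode: dropout is disabled
print(model(x))
```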

Slide 53

How was 'Dropout' conceived?
https://www.reddit.com/r/MachineLearning/comments/4w6tsv/ama_we_are_the_google_brain_team_wed_love_to/ (Aug 12 2016)

Slide 54

Summary
• We extend binary classification to multi-class classification
  • Assign a weight vector to every category
  • Extend the Perceptron algorithm to multi-class classification
  • Extend the sigmoid function to the softmax function
• Again, automatic differentiation is useful for SGD training
• ReLU is a popular activation function for internal layers
• Dropout realizes model ensembling and averaging in a simple way