Naoaki Okazaki
July 28, 2020
10k

# Feedforward Neural Network (II): Multi-class Classification

multi-class classification, linear multi-class classifier, softmax function, Stochastic Gradient Descent (SGD), mini-batch training, loss functions, activation functions, dropout

July 28, 2020

## Transcript

1. ### Feedforward Neural Network (II): Multi-class Classification Naoaki Okazaki School of

Computing, Tokyo Institute of Technology [email protected] PowerPoint template designed by https://ppt.design4u.jp/template/
2. ### Highlights of this lecture  We extend binary classification to

multi-class classification  Assign a weight vector for every category  Extend Perceptron algorithm to multi-class classification  Extend sigmoid function to softmax function  Again, automatic differentiation is useful for SGD training  ReLU is a popular activation function for internal layers  Dropout realizes model ensembling and averaging in a simple way 1

4. ### MNIST database (LeCun+ 1998) 3 Yann LeCun, Leon Bottou, Yoshua

Bengio, Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of IEEE, 86(11):2278-2324. 4 1 0 5 6 2 8 5 We want to classify an input image into 10 categories (digits)
5. ### Representing an image on a computer 4  An image

(28 x 28 pixels, grayscale) is represented by a 28 x 28 matrix.  The original dataset represents a brightness in an 8-bit integer ([0, 255]).  In this lecture, a brightness is normalized within the range of [0, 1].

7. ### Representing an image with a vector 6 Pixel at (,

) Feature ID (row major): 28 − 1 +  We convert an image into a vector where each element presents the brightness of a pixel, flattening a 2D matrix into a 1D vector  A 28 × 28 matrix is converted into a vector of 784 (= 28 × 28) dimension  A more sophisticated method (e.g., Convolutional Neural Network) will be explained later  Even this simple treatment surprisingly works well
8. ### Linear multi-class classification 7 ⋅ 0 = −1.24 ⋅ 1

= −4.30 ⋅ 2 = −0.68 ⋅ 3 = +3.62 ⋅ 4 = −5.61 ⋅ 5 = −1.94 ⋅ 6 = −5.56 ⋅ 7 = −6.86 ⋅ 8 = −0.08 ⋅ 9 = −3.69 0 1 2 3 4 5 6 7 8 9 Image Compute the score (inner product) for each category Choose the category with the maximum score � = 3 A model has a weight vector for every category Pixels
9. ### General form: linear multi-class classification 8 � = argmax ∈

⋅ Input: ∈ ℝ Output: � ∈ Parameter: weight ∈ ℝ (prepared for every category) Set of possible categories for the input (: number of dimension)
10. ### Supervised learning (training) for multi-class classifier 9  We have

a supervision data (: input, : output)  = { 1 , 1 , … , , } ( instances)  Find the weight vectors such that they can predict training instances as correctly as possible  ∀ ∈ {1, … , }: � = argmax ∈ ⋅ =  We assume generalization  If the parameters reproduce training instances well, they will work for unseen instances : the -th instance in the training data : the category for the -th instance
11. ### Perceptron algorithm (Collins, 2002) 10 1. = 0 for all

∈ 2. Repeat: 3. (, ) ⟵ an instance chosen from at random 4. � ⟵ argmax ∈ ⋅ 5. if � ≠ then: (incorrect prediction) 6. ⟵ + ( ⋅ will be larger) 7. � ⟵ � − (� ⋅ will be smaller) 8. Until no instance updates Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proc. of EMNLP, 1-8.
12. ### Intuitive example of Perceptron updates 11 ⋅ 0 = −1.24

⋅ 1 = −4.30 ⋅ 2 = −0.68 ⋅ 3 = +3.62 ⋅ 4 = −5.61 ⋅ 5 = −1.94 ⋅ 6 = −5.56 ⋅ 7 = −6.86 ⋅ 8 = +3.87 ⋅ 9 = −3.69 0 1 2 3 4 5 6 7 8 9 Image (3) = 3 Compute the score (inner product) for each category Update necessary! The model predicts � = 8 but = 3 in fact Pixels 3 ⟵ 3 + 8 ⟵ 8 − ⋅ 3 will be larger after the update ⋅ 8 will be smaller after the update

14. ### Summary  A linear multi-class classifier has a weight vector

for every category ∈  Given an input , a linear multi-class classifier computes a score for every category as an inner product ⋅  It predicts a category � for the input yielding the highest score among the possible categories  Weight vectors can be trained by an extension of Perceptron algorithm to multi-class (structured perceptron)  Again, we cannot use it for multi-layer neural networks  Let’s consider SGD for training multi-class classifiers 13

16. ### Training multi-class classifiers with SGD 15  In order to

train binary classifiers using SGD, we had to change the activation function from step to sigmoid  What is the activation function for multi-class classification corresponding to sigmoid function?  Answer: Softmax function
17. ### Softmax function: Definition 16  Given a vector ∈ ℝ,

softmax : ℝ → ℝ yields, = exp( ) ∑=1 exp( )  Here denotes the -th element of the value of  We use the same notation (do not confuse with sigmoid)  A result of softmax function satisfies, ∀: > 0, � =1 = 1
18. ### Softmax function: Intuitive explanation 17  Softmax function converts scores

for caetgories ∈ ℝ into a probability distribution  Similarly to binary classification where sigmoid function converts a score to a probability Softmax
19. ### Single-layer NNs for multi-class classification 18  Given an input

∈ ℝ, a single-layer NN for multi-class classification yields a probability distribution over categories � ∈ ℝ, � = , =  Here, ∈ ℝ× is a weight matrix  can be seen as a mapping: ℝ → ℝ  Let denote the -th row vector of the matrix  The score for the category is = ⋅  The same to the linear multi-class classification
20. ### An example with softmax function 19 ⋅ = −1.24 ⋅

= −4.30 ⋅ = −0.68 ⋅ = +3.62 ⋅ = −5.61 ⋅ = −1.94 ⋅ = −5.56 ⋅ = −6.86 ⋅ = −0.08 ⋅ = −3.69 0 1 2 3 4 5 6 7 8 9 Image Pixels (0.74%) (0.03%) (1.29%) (95.1%) (0.01%) (0.37%) (0.01%) (0.00%) (2.35%) (0.06%) � =
21. ### Supervision data for multi-class (with a notational change) 20 

We have a supervision data: = { 1 , 1 , … , , }( instances)  Input: = 1 , 2 , … , 𝑛𝑛 ⊺ ∈ ℝ  Output (changed from the previous notation): = 𝑛 , 𝑛 , … , ⊺ ∈ ℝ (one-hot vector) = 0,0,0,1,0,0,0,0,0,0 ⊺ 0 3 9 … … = 3
22. ### Instance-wise likelihood 21  We introduce instance-wise likelihood to measure

how well the parameters reproduce ( , ) = � =1 � (if 𝑛𝑛 = 1) 1 (if 𝑛𝑛 = 0) = � =1 �  The probability of the true label estimated by the model 20 ⋅ = −1.24 ⋅ = −4.30 ⋅ = −0.68 ⋅ = +3.62 ⋅ = −5.61 ⋅ = −1.94 ⋅ = −5.56 ⋅ = −6.86 ⋅ = −0.08 ⋅ = −3.69 0 1 2 3 4 5 6 7 8 9 (0.74%) (0.03%) (1.29%) (95.1%) (0.01%) (0.37%) (0.01%) (0.00%) (2.35%) (0.06%) = 0.951 = 0,0,0,1,0,0,0,0,0,0 ⊺
23. ### Likelihood on the training data 22  We assume that

all instances in the training data are i.i.d. (independent and identically distributed)  We define likelihood as a joint probability on data, = � =1  When the training data = { 1 , 1 , … , , } is fixed, likelihood is a function of the parameters  Let us maximize by changing  This is called Maximum Likelihood Estimation (MLE)  The maximizer ∗ reproduces the training data well
24. ### Training as a minimization problem 23  Products of (0,1)

values often cause underflow  Instead, use log-likelihood, the logarithm of the likelihood, = log = log � =1 = � =1 log  In mathematical optimization, we usually consider a minimization problem instead of maximization  We define an objective function () by using the negative of the log-likelihood = − = − � =1 log  is called a loss function or error function
25. ### Training as a minimization problem 24  Given the training

data = { 1 , 1 , … , , }, find ∗ as the minimization problem, ∗ = argmin = argmin � =1 − , = − log = − log � =1 � = − � =1 𝑛𝑛 log � 𝑛𝑛 ∗
26. ### Stochastic Gradient Descent (SGD) 25  The objective function is

the sum of losses of instances, = � =1 −  We can use Stochastic Gradient Descent (SGD) and its variants (e.g., Adam) for minimizing  SGD Algorithm ( is the number of updates) 1. Initialize with random values 2. for ⟵ 1 to : 3. ⟵ 1/ # Learning rate at 4. ( , ) ⟵ an instance chosen from at random 5. ⟵ − − = +
27. ### Exercise: compute the gradient 26 Prove (we omit the instance

index for simplicity): = = � − by computing the gradients and Here: = − � =1 log � , � = = exp( ) ∑=1 exp( ) , = ⋅
28. ### Answer: the gradient 27 Because it is easy to find

= , we concentrate on , = − � =1 log � = − � =1 1 � � = − 1 � � − � ≠ 1 � � The first term is, − 1 � � = − 1 � exp ∑=1 exp = − 1 � exp Σ − exp exp Σ2 = − 1 � exp Σ Σ − exp Σ = − 1 � � 1 − � = − 1 − � = − + � The second term is, − � ≠ 1 � � = − � ≠ 1 � exp ∑ ′=1 exp ′ = − � ≠ 1 � 0 − exp exp Σ2 = � ≠ 1 � � � = � ≠ � Therefore, = − + � + � ≠ � = − + � � =1 = − + �
29. ### SGD elaborated for training single-layer NNs 28 1. For every

, initialize with random values 2. for ⟵ 1 to : 3. ⟵ 1/ 4. ( , ) ⟵ an instance chosen from at random 5. � ⟵ ( ⋅ ) 6. ∀: ⟵ − 𝑘𝑘 = + 𝑛𝑛 − � 𝑛𝑛  The algorithm is the same as that for binary classification  For each category , it updates a weight by the amount of the error (𝑛𝑛 − � 𝑛𝑛 ) between the true probability 𝑛𝑛 and the estimated probability � 𝑛𝑛
30. ### Intuitive example of SGD updates ( = 3, � =

3; = 1) 29 ⋅ = −1.24 ⋅ = −4.30 ⋅ = −0.68 ⋅ = +3.62 ⋅ = −5.61 ⋅ = −1.94 ⋅ = −5.56 ⋅ = −6.86 ⋅ = −0.08 ⋅ = −3.69 0 1 2 3 4 5 6 7 8 9 (0.74%) (0.03%) (1.29%) (95.1%) (0.01%) (0.37%) (0.01%) (0.00%) (2.35%) (0.06%) −= 0.0074 −= 0.0003 −= 0.0129 += 0.0490 −= 0.0001 −= 0.0037 −= 0.0001 −= 0.0000 −= 0.0235 −= 0.0006

32. ### Computing the loss with mini-batch 31  Single-batch  Mini-batch

(parallelizable in CPU/GPU) � × ( ) = 1 1 1 = − ⋅ log � � × ( ) = = − 1 � =1 ⋅ log �
33. ### Mini-batch training 32  Most DL frameworks implement mini-batch training

by increasing the order of tensors:  For example, → (m × )  Increasing the batch size () may:  Speed up time required for an epoch with parallelization  Decrease the number of parameter updates (1/)  This paper (Goyal+ 2017) recommends:  When the minibatch size is multiplied by , multiply the learning rate by Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
34. ### Mini-batch SGD implemented in pytorch 33 xt: torch.tensor ( ×

784) yt: torch.tensor () Define a NN model as a sequence of modules Sample instances with the batch size of 256 https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb
35. ### Regularization 34  MLE often causes over-fitting  When the

training data is linearly separable → ∞ as � =1 → 0  Subject to be affected by noises in the training data  We use regularization (MAP estimation)  We introduce a penalty term when becomes large  The loss function with an L2 regularization term: = − � =1 + 2  is the hyper parameter to control the trade-off between over/under fitting
36. ### Summary and notes  -class classification is realized by an

output layer with dimension  Softmax yields a probability distribution � ∈ ℝ  The loss function compares a model output � with a true category  and � are represented as one-hot vectors  Again, automatic differentiation is also useful for training multi-class NNs  A single-layer NN with softmax activation function is also known as multi-class logistic regression and maximum entropy modeling 35

38. ### Generic notation for multi-layer NNs 37  Configurations  The

number of layers  The numbers of dimensions of hidden layers  An activation function for each layer  A loss function Σ (1) Σ (1) Σ (1) Σ (2) Σ (2) Σ (3) First layer: ℝ2 → ℝ3 (1) = (1) 1 (1) = (1)(0) (1) ∈ ℝ3×2, 1 , 1 ∈ ℝ3 Second layer: ℝ3 → ℝ2 (2) = (2) 2 (2) = (2)(1) (2) ∈ ℝ2×3, 2 , 2 ∈ ℝ2 Final layer: ℝ2 → ℝ2 (3) = (3) 3 (3) = (3)(2) (3) ∈ ℝ2×2, 3 , 3 ∈ ℝ2 1 → ℎ1 (0) 2 → ℎ2 (0) ℎ1 (1) ℎ2 (1) ℎ3 (1) 1 (1) 2 (1) 3 (1) 1 (2) 2 (2) ℎ1 (2) ℎ2 (2) ℎ1 (3) ← 1 1 (3) Σ (3) ℎ2 (3) ← 2 1 (3)
39. ### Cross entropy loss (binary) 38 , = − � log

= 1 log 1 − = 0 = − log () − (1 − ) log 1 −  Cross entropy , = − � log True probability distribution (1 for true category; 0 otherwise) Predicted probability distribution
40. ### Cross entropy loss (multi) 39 , = − log exp

∑ exp = − + log � exp  Cross entropy , = − � log True probability distribution (1 for true category; 0 otherwise) Predicted probability distribution The probability of the true label estimated by the model ( = − log )
41. ### Mean Squared Error (MSE) loss 40  Used for regression

, = 1 2 − 2 2

43. ### Step 42  Pros  Yields a binary output 

Cons (never use this)  Zero gradients  SGD cannot update parameters because = 0 Step function: ℝ → {0,1} () = � 1 (if > 0) 0 (otherwise)
44. ### Sigmoid 43  Pros  Yields an output within (0,1)

 Cons  Not zero-centered  Zero (vanishing) gradients when || is large Sigmoid: ℝ → (0,1) () = 1 1 + −
45. ### Hyperbolic tangent (tanh) 44  Pros  Yields an output

within (−1,1)  Zero-centered  Cons  Zero (vanishing) gradients when || is large tanh: ℝ → (−1,1) tanh = − − + − = 2 2 − 1
46. ### Rectified Linear Unit (ReLU) 45  Pros  Gradients do

not vanish when > 0  Light-weight (no ) computation  Faster convergence (e.g., 6x faster on CIFAR-10)  Cons  Not zero centered  Dead neurons when ≤ 0 ReLU: ℝ → ℝ≥0 ReLU = max(0, )

48. ### Dropout (Srivastava+ 2014) 47  A simple method for preventing

overfitting  Randomly drops units from a NN during training  Virtually samples an exponential number of different `thinned’ NNs during training  Prevents units from co-adapting too much  In inference (test) time, approximate the effect of averaging the thinned NNs  Simply by using the entire NN with smaller weights  Improves the performance on test data Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929−1958.
49. ### Dropout at training phrase 48  For each training instance,

choose units at random and drop them  Virtually samples an exponential number of different `thinned’ NNs  Train the thinned NNs by the same algorithm to standard NNs Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929−1958. (Srivastava+ 2014)
50. ### Dropout at training phrase 49  At training time, choose

units that probability  At inference (test) time, multiply to the trained weights  This approximates the effect of averaging the predictions from exponentially many thinned models Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929−1958. (Srivastava+ 2014)
51. ### Dropout at training phrase 50 Nitish Srivastava, Geoffrey Hinton, Alex

Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929−1958. � = ⊙ , = Bernoulli() (Srivastava+ 2014)
52. ### Dropout in pytorch 51  Simply add a Dropout module

 Do not forget to switch training and test modes  model.train() Units alive with probability  model.eval() Weights multiplied by Dropout module (we can control the dropout rate by specifying one in an argument; = 0.5 by default) https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb

54. ### Summary  We extend binary classification to multi-class classification 

Assign a weight vector for every category  Extend Perceptron algorithm to multi-class classification  Extend sigmoid function to softmax function  Again, automatic differentiation is useful for SGD training  ReLU is a popular activation function for internal layers  Dropout realizes model ensembling and averaging in a simple way 53