multi-class classification
- Assign a weight vector to every category.
- Extend the Perceptron algorithm to multi-class classification.
- Extend the sigmoid function to the softmax function.
- Again, automatic differentiation is useful for SGD training.
- ReLU is a popular activation function for internal layers.
- Dropout realizes model ensembling and averaging in a simple way.

Bengio, Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.
[Figure: sample handwritten digit images labeled 4, 1, 0, 5, 6, 2, 8, 5]
We want to classify an input image into 10 categories (digits).

(28 × 28 pixels, grayscale) is represented by a 28 × 28 matrix. The original dataset stores the brightness of each pixel as an 8-bit integer in [0, 255]. In this lecture, the brightness is normalized to the range [0, 1].

Feature ID (row major): $28(i - 1) + j$ for the pixel at row $i$ and column $j$.
We convert an image into a vector where each element represents the brightness of a pixel, flattening a 2D matrix into a 1D vector.
A 28 × 28 matrix is converted into a vector of 784 (= 28 × 28) dimensions.
A more sophisticated method (e.g., Convolutional Neural Network) will be explained later.
Even this simple treatment works surprisingly well.
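As a minimal sketch of the flattening step, the following Python code maps a 28 × 28 image to a 784-dimensional vector in row-major order and normalizes 8-bit brightness values to [0, 1]. The function names and the 0-based indexing are assumptions made for illustration (the feature ID on the slide appears to be 1-based).

```python
ROWS, COLS = 28, 28

def feature_id(i, j):
    """0-based row-major index of the pixel at row i, column j (both 0-based)."""
    return COLS * i + j

def flatten(image):
    """Flatten a 28 x 28 list of 8-bit brightness values into 784 floats in [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]

# Hypothetical image: all black except one white pixel at row 2, column 3.
image = [[0] * COLS for _ in range(ROWS)]
image[2][3] = 255
x = flatten(image)
```

Here `x[feature_id(2, 3)]` holds the normalized brightness of the white pixel, and every other element is 0.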

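Supervision data ($\boldsymbol{x}$: input, $y$: output): $D = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_N, y_N)\}$ ($N$ instances)
Find the weight vectors such that they can predict training instances as correctly as possible:
$\forall n \in \{1, \dots, N\}: \hat{y}_n = \operatorname{argmax}_{y \in \mathcal{Y}} \boldsymbol{w}_y \cdot \boldsymbol{x}_n = y_n$
We assume generalization: if the parameters reproduce training instances well, they will work for unseen instances.
$\boldsymbol{x}_n$: the $n$-th instance in the training data
$y_n$: the category for the $n$-th instance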

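1. Initialize the weight vector $\boldsymbol{w}_y$ for every category $y \in \mathcal{Y}$
2. Repeat:
3.     $(\boldsymbol{x}_n, y_n) \leftarrow$ an instance chosen from $D$ at random
4.     $\hat{y} \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \boldsymbol{w}_y \cdot \boldsymbol{x}_n$
5.     if $\hat{y} \neq y_n$ then: (incorrect prediction)
6.         $\boldsymbol{w}_{y_n} \leftarrow \boldsymbol{w}_{y_n} + \boldsymbol{x}_n$ ($\boldsymbol{w}_{y_n} \cdot \boldsymbol{x}_n$ will be larger)
7.         $\boldsymbol{w}_{\hat{y}} \leftarrow \boldsymbol{w}_{\hat{y}} - \boldsymbol{x}_n$ ($\boldsymbol{w}_{\hat{y}} \cdot \boldsymbol{x}_n$ will be smaller)
8. Until no instance updates the weights
Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proc. of EMNLP, 1-8.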
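The multi-class Perceptron update above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the lecture's implementation: the weights start at zero, instances are visited in shuffled passes rather than drawn one at a time, the loop is capped at `max_epochs`, and the toy data set is hypothetical (and linearly separable, so the loop terminates).

```python
import random

def predict(weights, x):
    """Return the category whose weight vector yields the highest score w_y . x."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return max(range(len(weights)), key=lambda y: scores[y])

def train_perceptron(data, num_categories, dim, max_epochs=100, seed=0):
    rng = random.Random(seed)
    # Assumption: zero initialization (the slide's initial values are not shown).
    weights = [[0.0] * dim for _ in range(num_categories)]
    for _ in range(max_epochs):
        updated = False
        order = list(range(len(data)))
        rng.shuffle(order)
        for n in order:
            x, y = data[n]
            y_hat = predict(weights, x)
            if y_hat != y:  # incorrect prediction
                for i in range(dim):
                    weights[y][i] += x[i]      # score of the true category grows
                    weights[y_hat][i] -= x[i]  # score of the wrong category shrinks
                updated = True
        if not updated:  # no instance updated the weights
            break
    return weights

# Hypothetical, linearly separable toy data: (input vector, category).
data = [
    ([1.0, 0.0, 0.0], 0), ([0.9, 0.1, 0.0], 0),
    ([0.0, 1.0, 0.0], 1), ([0.1, 0.9, 0.0], 1),
    ([0.0, 0.0, 1.0], 2), ([0.0, 0.1, 0.9], 2),
]
weights = train_perceptron(data, num_categories=3, dim=3)
```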

A weight vector $\boldsymbol{w}_y$ is assigned for every category $y \in \mathcal{Y}$.
Given an input $\boldsymbol{x}$, a linear multi-class classifier computes a score for every category $y$ as an inner product $\boldsymbol{w}_y \cdot \boldsymbol{x}$.
It predicts the category $\hat{y}$ that yields the highest score among the possible categories for the input $\boldsymbol{x}$.
Weight vectors can be trained by an extension of the Perceptron algorithm to multi-class classification (structured perceptron).
Again, we cannot use it for multi-layer neural networks.
Let's consider SGD for training multi-class classifiers.

To train binary classifiers using SGD, we had to change the activation function from the step function to the sigmoid function.
Which activation function for multi-class classification corresponds to the sigmoid function?
Answer: the softmax function

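The softmax function $\sigma: \mathbb{R}^K \to \mathbb{R}^K$ yields
$\sigma(\boldsymbol{a})_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}$
Here $\sigma(\boldsymbol{a})_k$ denotes the $k$-th element of the value of $\sigma(\boldsymbol{a})$.
We use the same notation $\sigma$ (do not confuse it with sigmoid).
A result of the softmax function satisfies
$\forall k: \sigma(\boldsymbol{a})_k > 0, \quad \sum_{k=1}^{K} \sigma(\boldsymbol{a})_k = 1$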
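A minimal Python sketch of the softmax function follows. Subtracting the maximum score before exponentiation is an addition not shown on the slide; it leaves the result unchanged mathematically but avoids overflow for large scores. The example scores are hypothetical.

```python
import math

def softmax(a):
    """Map a list of K scores to a probability distribution over K categories."""
    m = max(a)  # shift by the maximum for numerical stability
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Every element of `probs` is positive and the elements sum to 1, and a larger score always receives a larger probability.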

The softmax function converts scores for categories $\boldsymbol{a} \in \mathbb{R}^K$ into a probability distribution.
This is similar to binary classification, where the sigmoid function converts a score to a probability.

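Given an input $\boldsymbol{x} \in \mathbb{R}^d$, a single-layer NN for multi-class classification yields a probability distribution over categories $\hat{\boldsymbol{y}} \in \mathbb{R}^K$:
$\hat{\boldsymbol{y}} = \sigma(\boldsymbol{a}), \quad \boldsymbol{a} = W\boldsymbol{x}$
Here, $W \in \mathbb{R}^{K \times d}$ is a weight matrix.
$W$ can be seen as a mapping $\mathbb{R}^d \to \mathbb{R}^K$.
Let $\boldsymbol{w}_k$ denote the $k$-th row vector of the matrix $W$.
The score for the category $k$ is $a_k = \boldsymbol{w}_k \cdot \boldsymbol{x}$, the same as in linear multi-class classification.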
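The forward computation of such a single-layer network can be sketched as below: one inner product per row of the weight matrix, followed by softmax. The weight values, the input, and the function names are hypothetical and chosen only to make the example concrete.

```python
import math

def softmax(a):
    m = max(a)  # shift by the maximum for numerical stability
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, x):
    """Compute scores a_k = w_k . x for every row w_k of W, then apply softmax."""
    a = [sum(w_ki * x_i for w_ki, x_i in zip(w_k, x)) for w_k in W]
    return softmax(a)

# Hypothetical weight matrix with K = 3 categories and d = 2 input dimensions.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]
x = [2.0, 1.0]
y_hat = forward(W, x)  # scores are [2.0, 1.0, -3.0] before softmax
```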

Assume that all instances in the training data are i.i.d. (independent and identically distributed).
We define the likelihood $L_D(W)$ as the joint probability of the data:
$L_D(W) = \prod_{n=1}^{N} p(y_n \mid \boldsymbol{x}_n; W)$
When the training data $D = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_N, y_N)\}$ is fixed, the likelihood is a function of the parameters $W$.
Let us maximize $L_D(W)$ by changing $W$.
This is called Maximum Likelihood Estimation (MLE).
The maximizer $W^*$ reproduces the training data well.

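Products of many probability values often cause underflow.
Instead, use the log-likelihood, the logarithm of the likelihood:
$LL_D(W) = \log L_D(W) = \log \prod_{n=1}^{N} p(y_n \mid \boldsymbol{x}_n; W) = \sum_{n=1}^{N} \log p(y_n \mid \boldsymbol{x}_n; W)$
In mathematical optimization, we usually consider a minimization problem instead of maximization.
We define an objective function $E_D(W)$ by using the negative of the log-likelihood:
$E_D(W) = -LL_D(W) = -\sum_{n=1}^{N} \log p(y_n \mid \boldsymbol{x}_n; W)$
$E_D(W)$ is called a loss function or error function.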
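The underflow problem is easy to reproduce. In the sketch below, the 2000 per-instance probabilities of 0.5 are hypothetical; their product is far below the smallest positive double-precision number, whereas the sum of logarithms remains representable.

```python
import math

probs = [0.5] * 2000  # hypothetical per-instance probabilities

likelihood = 1.0
for p in probs:
    likelihood *= p  # the running product eventually underflows to 0.0

log_likelihood = sum(math.log(p) for p in probs)  # stays finite
loss = -log_likelihood  # negative log-likelihood, the quantity to minimize
```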

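The objective function is the sum of losses of instances:
$E_D(W) = \sum_{n=1}^{N} -\log p(y_n \mid \boldsymbol{x}_n; W)$
We can use Stochastic Gradient Descent (SGD) and its variants (e.g., Adam) for minimizing $E_D(W)$.
SGD algorithm ($T$ is the number of updates)
1. Initialize $W$ with random values
2. for $t \leftarrow 1$ to $T$:
3.     $\eta_t \leftarrow 1/t$   # Learning rate at $t$
4.     $(\boldsymbol{x}_n, y_n) \leftarrow$ an instance chosen from $D$ at random
5.     $W \leftarrow W - \eta_t \frac{\partial (-\log p(y_n \mid \boldsymbol{x}_n; W))}{\partial W} = W + \eta_t \frac{\partial \log p(y_n \mid \boldsymbol{x}_n; W)}{\partial W}$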

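1. For every category $k$, initialize $\boldsymbol{w}_k$ with random values
2. for $t \leftarrow 1$ to $T$:
3.     $\eta_t \leftarrow 1/t$
4.     $(\boldsymbol{x}_n, y_n) \leftarrow$ an instance chosen from $D$ at random
5.     $\hat{\boldsymbol{y}}_n \leftarrow \sigma(W \cdot \boldsymbol{x}_n)$
6.     $\forall k: \boldsymbol{w}_k \leftarrow \boldsymbol{w}_k - \eta_t \frac{\partial (-\log \hat{y}_{n,y_n})}{\partial \boldsymbol{w}_k} = \boldsymbol{w}_k + \eta_t (y_{n,k} - \hat{y}_{n,k}) \boldsymbol{x}_n$
The algorithm is the same as that for binary classification.
For each category $k$, it updates the weight vector $\boldsymbol{w}_k$ by the amount of the error $(y_{n,k} - \hat{y}_{n,k})$ between the true probability $y_{n,k}$ (1 if $k$ is the true category $y_n$, 0 otherwise) and the estimated probability $\hat{y}_{n,k}$.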
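The SGD update for softmax regression can be sketched in plain Python as follows. Several choices are assumptions made for a reproducible illustration: the weights start at zero rather than at random values, the random number generator is seeded, and the toy data set (two instances per category) is hypothetical.

```python
import math
import random

def softmax(a):
    m = max(a)  # shift by the maximum for numerical stability
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, x):
    """Probability distribution over categories for the input x."""
    return softmax([sum(w * v for w, v in zip(w_k, x)) for w_k in W])

def total_loss(W, data):
    """Sum of negative log-probabilities of the true categories."""
    return sum(-math.log(predict_proba(W, x)[y]) for x, y in data)

def train_sgd(data, num_categories, dim, num_updates=200, seed=0):
    rng = random.Random(seed)
    # Assumption: zeros instead of random initial values, for reproducibility.
    W = [[0.0] * dim for _ in range(num_categories)]
    for t in range(1, num_updates + 1):
        eta = 1.0 / t                 # learning rate at step t
        x, y = rng.choice(data)       # an instance chosen at random
        y_hat = predict_proba(W, x)
        for k in range(num_categories):
            error = (1.0 if k == y else 0.0) - y_hat[k]  # true minus estimated
            for i in range(dim):
                W[k][i] += eta * error * x[i]
    return W

# Hypothetical toy data: (input vector, category).
data = [
    ([1.0, 0.5, 0.0, 0.0, 0.0, 0.0], 0), ([0.5, 1.0, 0.0, 0.0, 0.0, 0.0], 0),
    ([0.0, 0.0, 1.0, 0.5, 0.0, 0.0], 1), ([0.0, 0.0, 0.5, 1.0, 0.0, 0.0], 1),
    ([0.0, 0.0, 0.0, 0.0, 1.0, 0.5], 2), ([0.0, 0.0, 0.0, 0.0, 0.5, 1.0], 2),
]
W = train_sgd(data, num_categories=3, dim=6)
```

With all-zero weights every category receives probability 1/3, so the initial loss is `len(data) * math.log(3)`; `total_loss(W, data)` after training can be compared against that baseline.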

Mini-batches are handled by increasing the order of tensors: for example, an input vector $\boldsymbol{x}$ (size $d$) becomes a matrix $X$ (size $m \times d$).
Increasing the batch size ($m$) may:
- Reduce the time required for an epoch through parallelization
- Decrease the number of parameter updates per epoch (to $1/m$)
Goyal et al. (2017) recommend: when the minibatch size is multiplied by $k$, multiply the learning rate by $k$.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
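The two effects of a larger batch size can be illustrated with a small calculation. The base learning rate, the batch sizes, and the number of training instances below are hypothetical values chosen for the example, and the helper names are not from the lecture.

```python
import math

def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling: multiply the learning rate by the same factor k as the batch size."""
    k = batch_size / base_batch_size
    return base_lr * k

def updates_per_epoch(num_instances, batch_size):
    """Number of parameter updates needed to see every instance once."""
    return math.ceil(num_instances / batch_size)

# Hypothetical settings: the batch size grows from 256 to 1024 (k = 4).
lr = scaled_learning_rate(base_lr=0.1, base_batch_size=256, batch_size=1024)
n_small = updates_per_epoch(60000, 256)
n_large = updates_per_epoch(60000, 1024)
```

Under these assumptions the learning rate grows by a factor of 4 while the number of updates per epoch shrinks by roughly the same factor.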

784)
yt: torch.tensor ()
Define a NN model as a sequence of modules.
Sample instances with a batch size of 256.
https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb

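When the training data is linearly separable, $\lVert W \rVert \to \infty$ as $\sum_{n=1}^{N} -\log p(y_n \mid \boldsymbol{x}_n; W) \to 0$.
The model then becomes susceptible to noise in the training data.
We use regularization (MAP estimation): we introduce a penalty term that grows when $\lVert W \rVert$ becomes large.
The loss function with an L2 regularization term:
$E_D(W) = -\sum_{n=1}^{N} \log p(y_n \mid \boldsymbol{x}_n; W) + \alpha \lVert W \rVert^2$
$\alpha$ is the hyperparameter that controls the trade-off between overfitting and underfitting.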
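A minimal sketch of an L2-regularized loss and its gradient follows. The coefficient name `alpha`, the use of the sum of squared weights as the penalty, and the example numbers are assumptions for illustration; the penalty contributes `2 * alpha * W` to the gradient.

```python
import math

def regularized_loss(log_probs, W, alpha):
    """Negative log-likelihood plus alpha times the sum of squared weights."""
    nll = -sum(log_probs)
    penalty = alpha * sum(w * w for row in W for w in row)
    return nll + penalty

def regularized_gradient(grad_nll, W, alpha):
    """Add the penalty gradient 2 * alpha * W to the gradient of the negative log-likelihood."""
    return [[g + 2.0 * alpha * w for g, w in zip(g_row, w_row)]
            for g_row, w_row in zip(grad_nll, W)]

# Hypothetical values: two instances with probabilities 0.5 and 0.25.
log_probs = [math.log(0.5), math.log(0.25)]
W = [[1.0, 2.0], [3.0, 4.0]]
loss = regularized_loss(log_probs, W, alpha=0.01)
grad = regularized_gradient([[0.0, 0.0], [0.0, 0.0]], W, alpha=0.01)
```

Because the penalty gradient points along `W`, each gradient step shrinks large weights, which counteracts the unbounded growth described above.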

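Multi-class classification with $K$ categories uses an output layer with dimension $K$.
Softmax yields a probability distribution $\hat{\boldsymbol{y}} \in \mathbb{R}^K$.
The loss function compares a model output $\hat{\boldsymbol{y}}$ with a true category $\boldsymbol{y}$. Both are $K$-dimensional vectors: $\boldsymbol{y}$ is a one-hot vector, and $\hat{\boldsymbol{y}}$ is a probability distribution.
Again, automatic differentiation is useful for training multi-class NNs.
A single-layer NN with the softmax activation function is also known as multi-class logistic regression and maximum entropy modeling.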

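$-\log \hat{y}_y = -\log \frac{\exp(a_y)}{\sum_{k=1}^{K} \exp(a_k)} = -a_y + \log \sum_{k=1}^{K} \exp(a_k)$, where $y$ is the index of the true category.
Cross entropy: $H(\boldsymbol{y}, \hat{\boldsymbol{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$
$\boldsymbol{y}$: true probability distribution (1 for the true category; 0 otherwise)
$\hat{\boldsymbol{y}}$: predicted probability distribution
$\hat{y}_y$: the probability of the true label estimated by the model ($H(\boldsymbol{y}, \hat{\boldsymbol{y}}) = -\log \hat{y}_y$)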
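The equivalence between the cross entropy with a one-hot target and the score-based form (negative true-category score plus log-sum-exp) can be checked numerically. The scores and function names below are hypothetical; the max-shift inside `log_sum_exp` is a common stability trick added here as an assumption.

```python
import math

def log_sum_exp(a):
    """Compute log(sum(exp(a_k))) stably by shifting by the maximum score."""
    m = max(a)
    return m + math.log(sum(math.exp(v - m) for v in a))

def loss_from_scores(a, true_category):
    """Negative log-probability of the true category, computed from raw scores."""
    return -a[true_category] + log_sum_exp(a)

def cross_entropy(y, y_hat):
    """Cross entropy between a true distribution y and a predicted distribution y_hat."""
    return -sum(y_k * math.log(q_k) for y_k, q_k in zip(y, y_hat) if y_k > 0)

a = [2.0, 1.0, 0.1]           # hypothetical scores
true_category = 0
total = sum(math.exp(v) for v in a)
y_hat = [math.exp(v) / total for v in a]  # softmax output
y = [1.0, 0.0, 0.0]           # one-hot vector for the true category
```

Both `cross_entropy(y, y_hat)` and `loss_from_scores(a, true_category)` evaluate to the same value, the negative log of `y_hat[true_category]`.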

Pros
- The gradient does not vanish when $x > 0$
- Light-weight computation (no $\exp$)
- Faster convergence (e.g., 6x faster on CIFAR-10)
Cons
- Not zero-centered
- Dead neurons when $x \le 0$
ReLU: $\mathbb{R} \to \mathbb{R}_{\ge 0}$
$\mathrm{ReLU}(x) = \max(0, x)$
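A minimal sketch of ReLU and its derivative shows both properties: the gradient is 1 for positive inputs (it does not vanish) and 0 otherwise (the source of dead neurons). Setting the derivative at exactly 0 to 0 is a convention assumed here.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x = 0 is set to 0 by convention here."""
    return 1.0 if x > 0 else 0.0

inputs = [-2.0, 0.0, 3.0]
outputs = [relu(v) for v in inputs]
grads = [relu_grad(v) for v in inputs]
```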

Dropout is a simple way to prevent overfitting.
- Randomly drops units from a NN during training
- Virtually samples an exponential number of different 'thinned' NNs during training
- Prevents units from co-adapting too much
- At inference (test) time, approximates the effect of averaging the thinned NNs simply by using the entire NN with smaller weights
- Improves the performance on test data
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929-1958.

During training, choose units at random and drop them.
This virtually samples an exponential number of different 'thinned' NNs.
Train the thinned NNs with the same algorithm as standard NNs.
(Srivastava+ 2014)

During training, keep units with probability $p$.
At inference (test) time, multiply the trained weights by $p$.
This approximates the effect of averaging the predictions from exponentially many thinned models.
(Srivastava+ 2014)

$\tilde{\boldsymbol{y}} = \boldsymbol{r} \odot \boldsymbol{y}, \quad r_j \sim \mathrm{Bernoulli}(p)$
(Srivastava+ 2014)
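The masking operation and the inference-time scaling can be sketched in plain Python. Here `p` is treated as the probability of keeping a unit, the activations are hypothetical, and scaling the activations by `p` at inference stands in for scaling the outgoing weights (the two are equivalent for the linear layer that follows).

```python
import random

def dropout_train(y, p, rng):
    """Keep each unit with probability p (mask r_j ~ Bernoulli(p)); dropped units output 0."""
    r = [1.0 if rng.random() < p else 0.0 for _ in y]
    return [r_j * y_j for r_j, y_j in zip(r, y)]

def dropout_inference(y, p):
    """Use all units, scaled by p, to approximate averaging over thinned networks."""
    return [p * y_j for y_j in y]

rng = random.Random(0)
y = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations
p = 0.5
thinned = dropout_train(y, p, rng)
expected = dropout_inference(y, p)

# Averaging many thinned outputs approaches the scaled output.
num_samples = 10000
average = [0.0] * len(y)
for _ in range(num_samples):
    sample = dropout_train(y, p, rng)
    average = [a + s / num_samples for a, s in zip(average, sample)]
```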

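Do not forget to switch between training and test modes.
model.train(): units are kept alive with probability $p$.
model.eval(): weights are multiplied by $p$.
Dropout module (we can control the dropout rate by specifying it as an argument; the rate is 0.5 by default).
In PyTorch, the argument of torch.nn.Dropout is the probability of dropping a unit, and the module implements the equivalent inverted form: kept activations are scaled by $1/(1 - p)$ during training, so model.eval() applies no further scaling.
https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb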

- Assign a weight vector to every category.
- Extend the Perceptron algorithm to multi-class classification.
- Extend the sigmoid function to the softmax function.
- Again, automatic differentiation is useful for SGD training.
- ReLU is a popular activation function for internal layers.
- Dropout realizes model ensembling and averaging in a simple way.