Slide 1

Slide 1 text

Feedforward Neural Network (I): Binary Classification Naoaki Okazaki School of Computing, Tokyo Institute of Technology [email protected] PowerPoint template designed by https://ppt.design4u.jp/template/

Slide 2

Slide 2 text

Highlights of this lecture
- Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
- Multi-layer neural networks can realize any logical function, including XOR
- We can train single/multi-layer NNs by using gradient-based methods
  - By implementing the graph structures of NNs in a programming language
  - With automatic differentiation in deep learning frameworks
- Neural networks with a single hidden layer can approximate any smooth function

Slide 3

Slide 3 text

Threshold Logic Unit (TLU) 2

Slide 4

Slide 4 text

Recap: Logical connectives (logical operations) https://vanya.jp.net/dc/
- AND: x ∧ y
- OR: x ∨ y
- NOT: ¬x
- NAND (NOT of AND): ¬(x ∧ y)
- NOR (NOT of OR): ¬(x ∨ y)
- XOR (exclusive OR): x ⊕ y

Slide 5

Slide 5 text

Logical circuits used in daily life
- Signs in the Shinkansen [1]: activated by the OR of the pressed states of all buttons
- Logic IC (TC74HC00AP) [2]
- Intel® Core™ i7 Processor [3]
[1] https://response.jp/article/img/2017/10/02/300517/1228814.html
[2] https://toshiba.semicon-storage.com/info/docget.jsp?did=67518&prodName=TC74HC00AP
[3] http://download.intel.com/pressroom/kits/corei7/images/Core_i7_300.jpg

Slide 6

Slide 6 text

Recap: Functionally complete set {AND, OR, NOT}
Any truth table (Boolean function with n inputs: {0,1}^n ↦ {0,1}) can be expressed by a combination of the logical connectives AND, OR, and NOT.

Example (XOR):
  x  y | Out
  0  0 |  0
  0  1 |  1
  1  0 |  1
  1  1 |  0
Step 1: ¬x ∧ y and x ∧ ¬y (each logical formula yields 1 only when the corresponding input is given)
Step 2: (¬x ∧ y) ∨ (x ∧ ¬y) (the overall output is 1 when any of the logical formulas is satisfied)

Rough explanation: Disjunctive Normal Form (DNF)
We can convert a truth table into a logical formula in a systematic way with {AND, OR, NOT}:
1. For each row in the table where the output is true, take the AND of all the inputs:
   - When an input (column) of the row is true, use the input variable as it is
   - Otherwise (an input of the row is false), prepend a negation to the input variable
2. Take the OR of all the formulas obtained in step 1

Slide 7

Slide 7 text

Can we build a logical connective only from input/output pairs?
- In ML terms, train an unknown mapping from supervision data
  - No knowledge about the internal mechanism associating inputs and outputs is required
- We start with an example where all inputs/outputs can be described, but this assumption is impossible and impractical in the real world
  - Imagine: inputs are natural language questions and outputs are answers
- We expect a learned mapping to predict outputs for unseen inputs

Input → Output: z = f(x, y) = ?
  (x, y) = (0, 0) → z = 0
  (x, y) = (0, 1) → z = 1
  (x, y) = (1, 0) → z = 1
  (x, y) = (1, 1) → z = 1

Slide 8

Slide 8 text

Realize OR as a mathematical function
- Find a function f that satisfies (x, y, f(x, y) ∈ {0,1}):
  f(0, 0) = 0,  f(0, 1) = 1,  f(1, 0) = 1,  f(1, 1) = 1
- We can manually craft a function like this:
  f(x, y) = g(x + y),
  g(a) = 1 (if a > 0), 0 (otherwise)   (step function)

Slide 9

Slide 9 text

Realize OR with a single-layer neural network
- Finding a function from scratch is hard in general
  - We assume a model with parameters
- Assume a single-layer neural network (single-layer NN):
  z = g(w_1 x + w_2 y + b)
  Input: x, y ∈ {0,1};  Output: z ∈ {0,1};  Parameters: w_1, w_2, b ∈ ℝ
- Train the model: find the parameters that can reproduce the input/output of the supervision data (OR)

Slide 10

Slide 10 text

Interactive visualization of single-layer neural networks 9 https://chokkan.github.io/deeplearning/demo-slp.html

Slide 11

Slide 11 text

Parameters realizing logical OR: z = x ∨ y
Truth table (x, y → z): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 1
For example: w_1 = w_2 = 1, b = −0.5, i.e., z = g(x + y − 0.5)
- (x, y) = (0, 0)⊺: a = (1 1)(0 0)⊺ − 0.5 = −0.5,  z = g(a) = 0
- (x, y) = (0, 1)⊺: a = (1 1)(0 1)⊺ − 0.5 = 0.5,  z = g(a) = 1
- (x, y) = (1, 0)⊺: a = (1 1)(1 0)⊺ − 0.5 = 0.5,  z = g(a) = 1
- (x, y) = (1, 1)⊺: a = (1 1)(1 1)⊺ − 0.5 = 1.5,  z = g(a) = 1

Slide 12

Slide 12 text

Parameters realizing logical AND: z = x ∧ y
Truth table (x, y → z): (0,0) → 0, (0,1) → 0, (1,0) → 0, (1,1) → 1
For example: w_1 = w_2 = 1, b = −1.5, i.e., z = g(x + y − 1.5)
- (x, y) = (0, 0)⊺: a = (1 1)(0 0)⊺ − 1.5 = −1.5,  z = g(a) = 0
- (x, y) = (0, 1)⊺: a = (1 1)(0 1)⊺ − 1.5 = −0.5,  z = g(a) = 0
- (x, y) = (1, 0)⊺: a = (1 1)(1 0)⊺ − 1.5 = −0.5,  z = g(a) = 0
- (x, y) = (1, 1)⊺: a = (1 1)(1 1)⊺ − 1.5 = 0.5,  z = g(a) = 1

Slide 13

Slide 13 text

Parameters realizing logical NOT: z = ¬x
Truth table (x → z): 0 → 1, 1 → 0
For example: w_1 = −1, w_2 = 0, b = 0.5 (we ignore y because a logical NOT has one input), i.e., z = g(−x + 0.5)
- x = 0: a = −1 × 0 + 0.5 = 0.5,  z = g(a) = 1
- x = 1: a = −1 × 1 + 0.5 = −0.5,  z = g(a) = 0

Slide 14

Slide 14 text

Parameters realizing logical NAND: z = ¬(x ∧ y)
Truth table (x, y → z): (0,0) → 1, (0,1) → 1, (1,0) → 1, (1,1) → 0
For example: w_1 = w_2 = −1, b = 1.5, i.e., z = g(−x − y + 1.5)
- (x, y) = (0, 0)⊺: a = (−1 −1)(0 0)⊺ + 1.5 = 1.5,  z = g(a) = 1
- (x, y) = (0, 1)⊺: a = (−1 −1)(0 1)⊺ + 1.5 = 0.5,  z = g(a) = 1
- (x, y) = (1, 0)⊺: a = (−1 −1)(1 0)⊺ + 1.5 = 0.5,  z = g(a) = 1
- (x, y) = (1, 1)⊺: a = (−1 −1)(1 1)⊺ + 1.5 = −0.5,  z = g(a) = 0

Slide 15

Slide 15 text

Can we find parameters that realize logical XOR?
Truth table (x, y → z): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0
Can we find parameter values w_1, w_2, b such that z = g(w_1 x + w_2 y + b) reproduces the logical XOR?

Slide 16

Slide 16 text

Single-layer NNs cannot realize XOR (Minsky and Papert, 1969)
- The decision rule for outputting z = 1:
  w_1 x + w_2 y + b > 0 ⟺ y > −(w_1/w_2) x − b/w_2
- This draws a line with slope −w_1/w_2 and y-intercept −b/w_2
- However, it is impossible to draw a single line that separates the true/false outputs of the XOR logic
- We say that XOR inputs are not linearly separable (linearly inseparable)
Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

Slide 17

Slide 17 text

How can we realize logical XOR?
- Combine logical connectives: z = (x ∨ y) ∧ ¬(x ∧ y)
- Alternatively, draw multiple lines (instead of a single line):
  1. Draw a line for OR
  2. Draw a line for NAND (NOT of AND)
  3. Take the AND of these areas
Truth table:
  x  y | x ∨ y  ¬(x ∧ y) | z = (x ∨ y) ∧ ¬(x ∧ y)
  0  0 |   0       1     |  0
  0  1 |   1       1     |  1
  1  0 |   1       1     |  1
  1  1 |   1       0     |  0
XOR = (OR) AND (NOT AND) = (OR) AND (NAND)

Slide 18

Slide 18 text

XOR realized as a combination of single-layer neural networks
https://chokkan.github.io/deeplearning/demo-mlp.html
- XOR is the AND of OR and NAND:
  z = (x ∨ y) ∧ ¬(x ∧ y) = h_1 ∧ h_2 = g(h_1 + h_2 − 1.5),
  where:
  h_1 = x ∨ y = g(x + y − 0.5)
  h_2 = ¬(x ∧ y) = g(−x − y + 1.5)
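
A minimal numpy sketch (not part of the original slides or the linked demo) that checks the construction above: h_1 computes OR, h_2 computes NAND, and z takes their AND, which yields XOR.

```python
import numpy as np

def g(a):
    """Step function: 1 if a > 0 else 0."""
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all inputs (x, y)

for x, y in X:
    h1 = g(x + y - 0.5)      # OR
    h2 = g(-x - y + 1.5)     # NAND
    z = g(h1 + h2 - 1.5)     # AND of h1 and h2
    print(f"x={x}, y={y} -> z={z}")  # prints 0, 1, 1, 0 (XOR)
```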

Slide 19

Slide 19 text

Multi-layer neural networks
- By combining (stacking) single-layer NNs, multi-layer neural networks (multi-layer NNs) can realize any truth table
  - Applicable to linearly inseparable input/output
  - They have expressive power equivalent to any logic circuit, such as an Arithmetic Logic Unit (ALU) and memory
- Note: the step function is the key to the non-linearity
  - If we did not use a step function in the first layer:
    z = v_1 h_1 + v_2 h_2 + c
      = v_1 (w_11 x_1 + w_12 x_2 + b_1) + v_2 (w_21 x_1 + w_22 x_2 + b_2) + c
      = (v_1 w_11 + v_2 w_21) x_1 + (v_1 w_12 + v_2 w_22) x_2 + (v_1 b_1 + v_2 b_2 + c)
    Reduced to a single-layer NN

Slide 20

Slide 20 text

Generic form: Feed Forward Neural Network (FFNN)
Three-layer neural network: ŷ = g_3(g_2(g_1(x)))
- First layer g_1: ℝ² → ℝ³,  h = g_1(a_h) = g_1(W_h x + b_h),  W_h ∈ ℝ^{3×2}, b_h ∈ ℝ³
- Second layer g_2: ℝ³ → ℝ²,  z = g_2(a_z) = g_2(W_z h + b_z),  W_z ∈ ℝ^{2×3}, b_z ∈ ℝ²
- Final layer g_3: ℝ² → ℝ¹,  ŷ = g_3(a_y) = g_3(W_y z + b_y),  W_y ∈ ℝ^{1×2}, b_y ∈ ℝ
- h and z are called hidden units, states, or layers
- The depth of the neural network is three
- g_1, g_2, g_3 are called activation functions
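
A minimal numpy sketch (not from the slides) of the forward pass ŷ = g_3(W_y g_2(W_z g_1(W_h x + b_h) + b_z) + b_y), using the shapes on this slide and, as an assumption for illustration only, the sigmoid for every activation function.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(3, 2)), np.zeros(3)   # first layer:  R^2 -> R^3
W_z, b_z = rng.normal(size=(2, 3)), np.zeros(2)   # second layer: R^3 -> R^2
W_y, b_y = rng.normal(size=(1, 2)), np.zeros(1)   # final layer:  R^2 -> R^1

x = np.array([1.0, 0.0])
h = sigmoid(W_h @ x + b_h)   # hidden state of the first layer
z = sigmoid(W_z @ h + b_z)   # hidden state of the second layer
y = sigmoid(W_y @ z + b_y)   # prediction in (0, 1)
print(y)
```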

Slide 21

Slide 21 text

Summary  The logical units explained here are called Threshold Logic Units (TLU), the first artificial neuron (McCulloch and Pitts, 1943)  Single-layer NN can provide a functionally complete set {AND, OR, NOT}  Single-layer NNs cannot model linearly inseparable data (e.g., XOR)  Multi-layer NNs (stacking single-layer NNs) can be seen as logical compounds  Multi-layer NNs can realize any binary functions: 0,1 ↦ {0,1}  Multi-layer NNs can model linearly inseparable data  We will see multi-layer NNs approximately express any smooth functions  We showed the generic form of feed-forward neural networks 20 W. McCulloch and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133.

Slide 22

Slide 22 text

Training Single-Layer Neural Networks 21

Slide 23

Slide 23 text

How to determine parameters of single-layer NNs
- We saw single-layer NNs realize logical connectives
  - By crafting parameters (weights and biases) carefully to realize the desired connectives
- However, crafting parameters is difficult
  - We are sometimes unsure of the internal mechanism associating input and output variables
- We want to find parameters automatically from data
  - We are interested in determining parameters only from supervision data, pairs of inputs and outputs

Slide 24

Slide 24 text

Supervised learning (training)
- Supervision data (input: x ∈ ℝ^d, output: y ∈ {0,1})
  - D = {(x_1, y_1), …, (x_N, y_N)}  (N instances)
- Find parameters such that they can reproduce the training instances as correctly as possible
- We assume generalization
  - If the parameters reproduce the training instances well, we expect that they will work for unseen instances

Slide 25

Slide 25 text

Supervised learning for single-layer NNs (with new notations)
- For simplicity, we include the bias term b ∈ ℝ in w ∈ ℝ^{d+1}
  - x(new) = x ⊕ 1 = (x_1, x_2, …, x_d, 1)⊺,  w(new) = w ⊕ b = (w_1, w_2, …, w_d, b)⊺
  - w(new) ⋅ x(new) = w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d + b  (← original form)
- We introduce a new notation to distinguish a computed output ŷ_n from the gold output y_n in the supervision data
  - D = {(x_1, y_1), …, (x_N, y_N)}  (N instances)
  - We distinguish two kinds of outputs hereafter:
    - ŷ_n: the output computed (predicted) by the model for the input x_n
    - y_n: the true (gold) output for the input x_n in the supervision data
- Training: find w such that, ∀n ∈ {1, …, N}: ŷ_n = g(w ⋅ x_n) = y_n  (g is the step function)

Slide 26

Slide 26 text

Perceptron algorithm (Rosenblatt, 1958)
1. w ⟵ 0
2. η ⟵ 1 (for simplicity)
3. Repeat:
4.   (x_n, y_n) ⟵ an instance chosen from D at random
5.   ŷ_n ⟵ g(w ⋅ x_n)
6.   if ŷ_n ≠ y_n then:
7.     if y_n = 1 then:
8.       w ⟵ w + η x_n
9.     else:
10.      w ⟵ w − η x_n
11. Until no instance updates w
Frank Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6):386-408.

Slide 27

Slide 27 text

Exercise: train a single-layer NN to realize OR
- Convert the truth table into training data
- Initialize the weight vector w = 0
- Apply the perceptron algorithm (previous page) to find w
- Fix η = 1 in this exercise

Truth table (x_1, x_2 → y): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 1
D = { ((0 0 1)⊺, 0), ((0 1 1)⊺, 1), ((1 0 1)⊺, 1), ((1 1 1)⊺, 1) }

Slide 28

Slide 28 text

Updating weights for OR
- D = { ((0 0 1)⊺, 0), ((0 1 1)⊺, 1), ((1 0 1)⊺, 1), ((1 1 1)⊺, 1) }
- Initialization: w = (0 0 0)⊺
- Iteration #1: choose (x_4, y_4) = ((1 1 1)⊺, 1)
  - Classification: ŷ = g(w ⋅ x_4) = g(0) = 0 ≠ y_4
  - Update: w ⟵ w + x_4 = (1 1 1)⊺
- Iteration #2: choose (x_1, y_1) = ((0 0 1)⊺, 0)
  - Classification: ŷ = g(w ⋅ x_1) = g(1) = 1 ≠ y_1
  - Update: w ⟵ w − x_1 = (1 1 0)⊺
- Terminate (the weight classifies all instances correctly):
  - x = (0 0 1)⊺: ŷ = g((1 1 0)(0 0 1)⊺) = g(0) = 0
  - x = (0 1 1)⊺: ŷ = g((1 1 0)(0 1 1)⊺) = g(1) = 1
  - x = (1 0 1)⊺: ŷ = g((1 1 0)(1 0 1)⊺) = g(1) = 1
  - x = (1 1 1)⊺: ŷ = g((1 1 0)(1 1 1)⊺) = g(2) = 1
(We chose the instances in the order that minimizes the required number of updates.)

Slide 29

Slide 29 text

Perceptron algorithm implemented in numpy 28 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
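
The notebook linked above contains the course implementation; the following is only a minimal sketch of the same algorithm on the OR data from the exercise, with the bias folded into the last element of w.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0., 1., 1., 1.])

def g(a):
    return (a > 0).astype(float)  # step function

rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(100):                       # upper bound on the number of updates
    n = rng.integers(len(X))               # pick a random instance
    y_hat = g(w @ X[n])
    if y_hat != y[n]:
        w += X[n] if y[n] == 1 else -X[n]  # eta = 1
    if np.all(g(X @ w) == y):              # stop when everything is classified correctly
        break
print(w, g(X @ w))
```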

Slide 30

Slide 30 text

Perceptron algorithm implemented in numpy (matrix version)
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
X = ((0 0 1), (0 1 1), (1 0 1), (1 1 1)),  w = (w_1 w_2 w_3)⊺
ŷ = g(Xw) = g((w_3, w_2 + w_3, w_1 + w_3, w_1 + w_2 + w_3)⊺)   (applying the step function to each element)
y − ŷ = (y_1 − ŷ_1, y_2 − ŷ_2, y_3 − ŷ_3, y_4 − ŷ_4)⊺
Multiplying each row of X by the corresponding error (y_n − ŷ_n) gives the per-instance update vectors:
  (0, 0, y_1 − ŷ_1)
  (0, y_2 − ŷ_2, y_2 − ŷ_2)
  (y_3 − ŷ_3, 0, y_3 − ŷ_3)
  (y_4 − ŷ_4, y_4 − ŷ_4, y_4 − ŷ_4)
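
A minimal vectorized sketch (not the linked notebook) of the quantities above: predictions for all instances at once and the per-instance update vectors (y_n − ŷ_n) x_n.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0., 1., 1., 1.])
w = np.array([1., 1., 0.])               # e.g., the weights found in the exercise

y_hat = (X @ w > 0).astype(float)        # step function applied element-wise
updates = (y - y_hat)[:, None] * X       # row n is (y_n - y_hat_n) * x_n
print(y_hat, updates, sep="\n")
```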

Slide 31

Slide 31 text

Why the Perceptron algorithm works
- Suppose the parameter w misclassifies (x_n, y_n)
- If y_n = 1:
  - Update the weight vector: w′ ⟵ w + x_n
  - If we classify x_n again with the updated weights w′:
    w′ ⋅ x_n = (w + x_n) ⋅ x_n = w ⋅ x_n + x_n ⋅ x_n ≥ w ⋅ x_n
  - The dot product was increased (more likely to be classified as 1)
- Otherwise (if y_n = 0):
  - Update the weight vector: w′ ⟵ w − x_n
  - If we classify x_n again with the updated weights w′:
    w′ ⋅ x_n = (w − x_n) ⋅ x_n = w ⋅ x_n − x_n ⋅ x_n ≤ w ⋅ x_n
  - The dot product was decreased (more likely to be classified as 0)
- The algorithm updates the parameter in the direction where it will classify (x_n, y_n) more correctly

Slide 32

Slide 32 text

Summary
- The perceptron algorithm:
  - Can find parameters of single-layer NNs for linearly separable data
  - Cannot terminate on linearly inseparable data
    - Single-layer NNs cannot classify linearly inseparable data
    - We must force the algorithm to terminate with incomplete parameters
  - Extending the algorithm to multiple layers is non-trivial
    - We have no training data for hidden states
    - The famous argument of Minsky and Papert (1969)
- In the next section, we consider the gradient-based method, an alternative but standard strategy for training NNs
  - Important concepts: sigmoid function and backpropagation
Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

Slide 33

Slide 33 text

Single-Layer NN with Sigmoid Function 32

Slide 34

Slide 34 text

Activation function: from step to sigmoid
Step function g: ℝ → {0,1},  g(a) = 1 (if a > 0), 0 (otherwise)
- Yields binary outputs
- Non-differentiable at zero
- Has zero gradients everywhere else
Sigmoid function σ: ℝ → (0,1),  σ(a) = 1 / (1 + e^{−a})
- Yields continuous scores
- Differentiable at all points
- Has mostly non-zero gradients
- Useful for gradient descent

Slide 35

Slide 35 text

General form with the sigmoid function
- Single-layer NN with the sigmoid function:
  ŷ = σ(w ⋅ x) = 1 / (1 + e^{−w⋅x})
  Given an input x ∈ ℝ^{d+1}, it computes an output ŷ ∈ (0,1) by using the parameter w ∈ ℝ^{d+1}
- This is also known as logistic regression
  - We can interpret ŷ as the conditional probability P(y = 1 | x) that the input x is classified as 1 (positive category)
- Rule to classify an input as 1:
  ŷ > 0.5 ⟺ 1 / (1 + e^{−w⋅x}) > 1/2 ⟺ w ⋅ x > 0
  The classification rule is the same as the one when we use the step function as the activation function
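
A minimal numpy sketch (not from the slides), assuming the AND parameters of the next slide with the bias folded in (w = (1, 1, −1.5)): the sigmoid outputs and the equivalent decision rule w ⋅ x > 0.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([1.0, 1.0, -1.5])
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)

y_hat = sigmoid(X @ w)            # probabilities: 0.182, 0.378, 0.378, 0.622
labels = (X @ w > 0).astype(int)  # same labels as with the step activation
print(y_hat, labels)
```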

Slide 36

Slide 36 text

Example: logical AND
- The same parameters as in the previous example: ŷ = σ(a),  a = x_1 + x_2 − 1.5
  x_1  x_2  y = x_1 ∧ x_2     a     ŷ = σ(a)
   0    0        0          −1.5     0.182
   0    1        0          −0.5     0.378
   1    0        0          −0.5     0.378
   1    1        1           0.5     0.622
- The outputs are acceptable, but:
  - P(x_1 ∧ x_2 = 1 | x_1 = 1, x_2 = 1) is not so high (62.2%)
  - There is room for improving w so that it yields ŷ → 1 (100%) for positives (true) and ŷ → 0 (0%) for negatives (false)

Slide 37

Slide 37 text

Instance-wise likelihood
- We introduce the instance-wise likelihood ℓ_n(w) to measure how well the parameters w reproduce (x_n, y_n):
  ℓ_n(w) = ŷ_n (if y_n = 1),  1 − ŷ_n (otherwise)
- Likelihood is a probability representing the 'fitness' of the parameters to the training data
- We want to increase the likelihood by changing w
Parameters of AND: ŷ = σ(a),  a = x_1 + x_2 − 1.5
  x_1  x_2  y = x_1 ∧ x_2     a     ŷ = σ(a)    ℓ_n(w)
   0    0        0          −1.5     0.182     1 − ŷ = 0.818
   0    1        0          −0.5     0.378     1 − ŷ = 0.622
   1    0        0          −0.5     0.378     1 − ŷ = 0.622
   1    1        1           0.5     0.622     ŷ = 0.622

Slide 38

Slide 38 text

Likelihood on the training data
- We assume that all instances in the training data are i.i.d. (independent and identically distributed)
- We define the likelihood as a joint probability on the data:
  L(w) = ∏_{n=1}^{N} ℓ_n(w)
- When the training data D = {(x_1, y_1), …, (x_N, y_N)} is fixed, the likelihood L(w) is a function of the parameters w
- Let us maximize L(w) by changing w
  - This is called Maximum Likelihood Estimation (MLE)
  - The maximizer w* reproduces the training data well

Slide 39

Slide 39 text

Training as a minimization problem
- Products of (0,1) values often cause underflow
  - Instead, use the log-likelihood, the logarithm of the likelihood:
    LL(w) = log L(w) = log ∏_{n=1}^{N} ℓ_n(w) = Σ_{n=1}^{N} log ℓ_n(w)
- In mathematical optimization, we usually consider a minimization problem instead of maximization
  - We define an objective function E(w) by using the negative of the log-likelihood:
    E(w) = −LL(w) = −Σ_{n=1}^{N} log ℓ_n(w)
  - E(w) is called a loss function or error function

Slide 40

Slide 40 text

Training as a minimization problem
- Given the training data D = {(x_1, y_1), …, (x_N, y_N)}, find w* by solving the minimization problem:
  w* = argmin_w E(w) = argmin_w Σ_{n=1}^{N} (−log ℓ_n(w)),
  log ℓ_n(w) = log ŷ_n (if y_n = 1), log(1 − ŷ_n) (otherwise)
             = y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)

Slide 41

Slide 41 text

Stochastic Gradient Descent (SGD)
- The objective function E(w) is the sum of the losses of the instances:
  E(w) = Σ_{n=1}^{N} (−log ℓ_n(w))
- We can use Stochastic Gradient Descent (SGD) and its variants (e.g., Adam) for minimizing E(w)
- SGD algorithm (T is the number of updates):
  1. Initialize w with random values
  2. for t ⟵ 1 to T:
  3.   η_t ⟵ 1/t   # Learning rate at t
  4.   (x_n, y_n) ⟵ an instance chosen from D at random
  5.   w ⟵ w − η_t ∇_w(−log ℓ_n(w)) = w + η_t ∇_w log ℓ_n(w)

Slide 42

Slide 42 text

Exercise: compute the gradient
Prove:
  ∂ log ℓ_n(w)/∂w = (∂ log ℓ_n/∂ŷ_n)(∂ŷ_n/∂a_n)(∂a_n/∂w) = (y_n − ŷ_n) x_n
by computing the gradients ∂ log ℓ_n/∂ŷ_n, ∂ŷ_n/∂a_n, ∂a_n/∂w. Here:
- log ℓ_n(w) = y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)
- ŷ_n = σ(a_n) = 1 / (1 + e^{−a_n})
- a_n = w ⋅ x_n

Slide 43

Slide 43 text

Answer: the gradient
- log ℓ_n(w) = y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n),
  ∂ log ℓ_n/∂ŷ_n = y_n/ŷ_n + (1 − y_n)/(1 − ŷ_n) ⋅ (−1) = (y_n(1 − ŷ_n) − ŷ_n(1 − y_n)) / (ŷ_n(1 − ŷ_n)) = (y_n − ŷ_n) / (ŷ_n(1 − ŷ_n))
- ŷ_n = σ(a_n) = 1 / (1 + e^{−a_n}),
  ∂ŷ_n/∂a_n = (−1) ⋅ (1 / (1 + e^{−a_n}))² ⋅ e^{−a_n} ⋅ (−1) = (1 / (1 + e^{−a_n})) ⋅ (e^{−a_n} / (1 + e^{−a_n})) = ŷ_n (1 − ŷ_n)
- a_n = w ⋅ x_n,
  ∂a_n/∂w = x_n
Therefore,
  ∂ log ℓ_n/∂w = (∂ log ℓ_n/∂ŷ_n)(∂ŷ_n/∂a_n)(∂a_n/∂w) = (y_n − ŷ_n) / (ŷ_n(1 − ŷ_n)) ⋅ ŷ_n(1 − ŷ_n) ⋅ x_n = (y_n − ŷ_n) x_n
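
A minimal numerical check (not from the slides) that the analytic gradient (y_n − ŷ_n) x_n matches a central finite-difference approximation of ∂ log ℓ_n/∂w.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, x, y):
    y_hat = sigmoid(w @ x)
    return y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x, y = np.array([1.0, 1.0, 1.0]), 1.0

analytic = (y - sigmoid(w @ x)) * x                  # (y_n - y_hat_n) x_n
numeric = np.zeros_like(w)
eps = 1e-6
for i in range(len(w)):
    e = np.zeros_like(w); e[i] = eps
    numeric[i] = (log_likelihood(w + e, x, y) - log_likelihood(w - e, x, y)) / (2 * eps)
print(np.allclose(analytic, numeric))                # True
```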

Slide 44

Slide 44 text

SGD elaborated for training single-layer NNs
1. Initialize w with random values
2. for t ⟵ 1 to T:
3.   η_t ⟵ 1/t
4.   (x_n, y_n) ⟵ an instance chosen from D at random
5.   ŷ_n ⟵ σ(w ⋅ x_n)
6.   w ⟵ w + η_t ∇_w log ℓ_n(w) = w + η_t (y_n − ŷ_n) x_n
     # If y_n = ŷ_n, no need for updating
     # If y_n = 1 and ŷ_n < 1, add x_n scaled by (1 − ŷ_n) to w
     # If y_n = 0 and 0 < ŷ_n, subtract x_n scaled by ŷ_n from w
The algorithm is the same as the perceptron except for using the error (y_n − ŷ_n) for weighting the amount of an update.

Slide 45

Slide 45 text

SGD implemented in numpy
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
X = ((0 0 1), (0 1 1), (1 0 1), (1 1 1)),  w = (w_1 w_2 w_3)⊺
ŷ = σ(Xw) = σ((w_3, w_2 + w_3, w_1 + w_3, w_1 + w_2 + w_3)⊺)   (applying the sigmoid function to each element)
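
The linked notebook contains the course implementation; the following is only a minimal sketch of SGD for the sigmoid single-layer NN on the OR data, using the update w ⟵ w + η_t (y_n − ŷ_n) x_n.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0., 1., 1., 1.])

rng = np.random.default_rng(0)
w = rng.normal(size=3)
for t in range(1, 1001):
    eta = 1.0 / t                      # decaying learning rate
    n = rng.integers(len(X))           # pick a random instance
    y_hat = sigmoid(w @ X[n])
    w += eta * (y[n] - y_hat) * X[n]   # gradient ascent on the log-likelihood
print(sigmoid(X @ w))                  # probabilities should move toward (0, 1, 1, 1)
```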

Slide 46

Slide 46 text

Note: Why is SGD called 'stochastic'?
- The objective function is the sum of the losses of the instances:
  E(w) = Σ_{n=1}^{N} (−log ℓ_n(w))
- Gradient descent:
  w ⟵ w − η_t ∇_w E(w) = w − η_t Σ_{n=1}^{N} ∇_w(−log ℓ_n(w))
  - Update w after computing the loss values and gradients for all training instances
- Stochastic gradient descent: use a random sample from the data:
  w ⟵ w − η_t ∇_w E_n(w),  E_n(w) = −log ℓ_n(w) for a randomly chosen n
  - Approximate the gradient: from all instances → from a randomly selected instance
  - Update w after computing the loss value and gradient for each training instance
  - Faster to reach the minimizer w* of the objective function

Slide 47

Slide 47 text

Note: What is a learning rate?
- A learning rate η_t determines the step size when moving toward the steepest direction
  - A large step size may reach the minimum faster, but may jump over the minimum
  - A small step size may take too long to converge and may get stuck in a local minimum
- We should decay the learning rates for a strongly convex function such that:
  Σ_{t=1}^{∞} η_t = ∞,  Σ_{t=1}^{∞} η_t² < ∞
- Various scheduling strategies for the learning rate: η_t = η_0 / t, η_t = η_0 / √t, AdaGrad, RMSProp
- Strategies used in practice: stepwise decay schedule, polynomial schedule, warming up
https://beta.mxnet.io/guide/modules/lr_scheduler.html

Slide 48

Slide 48 text

Regularization
- MLE often causes over-fitting
  - When the training data is linearly separable, ‖w‖ → ∞ as the loss Σ_{n=1}^{N} (−log ℓ_n(w)) → 0
  - Subject to being affected by noise in the training data
- We use regularization (MAP estimation)
  - We introduce a penalty term when ‖w‖ becomes too large
  - The loss function with an L2 regularization term:
    E(w) = −Σ_{n=1}^{N} log ℓ_n(w) + λ‖w‖²
  - λ is the hyperparameter to control the trade-off between over- and under-fitting

Slide 49

Slide 49 text

Summary
- We used the sigmoid as an activation function
  - The model is also known as logistic regression
- We defined the instance-wise likelihood to assess how well the current model reproduces a training instance
- Training a model: minimize the loss function by changing the weights
  - Loss function: −Σ_{n=1}^{N} (y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n))
  - Minimizing the loss function is equivalent to maximizing the product of the instance-wise likelihoods (equivalently, the sum of the instance-wise log-likelihoods) over all instances
- We showed an algorithm for minimizing the loss function by using Stochastic Gradient Descent (SGD)
  - The same as the perceptron except for using the error (y_n − ŷ_n) for weighting the amount of an update

Slide 50

Slide 50 text

Training Multi-Layer Neural Networks with Back Propagation 49

Slide 51

Slide 51 text

Generic notation for multi-layer NNs
- The l-th layer (l ∈ {1, …, L}) consists of:
  - Input: h^(l−1) ∈ ℝ^{d_{l−1}}  (h^(0) = x)
  - Output: h^(l) ∈ ℝ^{d_l}  (h^(L) = ŷ)
  - Weight: W^(l) ∈ ℝ^{d_l × d_{l−1}}
  - Activation function: g^(l)
  - Activation: a^(l) ∈ ℝ^{d_l}
  h^(l) = g^(l)(a^(l)) = g^(l)(W^(l) h^(l−1))
  w_ij^(l): weight from the j-th neuron to the i-th neuron of the l-th layer
Example with L = 3 (x_1 = h_1^(0), x_2 = h_2^(0), ŷ = h_1^(3)):
- First layer: ℝ² → ℝ³,  h^(1) = g^(1)(a^(1)),  a^(1) = W^(1) h^(0),  W^(1) ∈ ℝ^{3×2},  a^(1), h^(1) ∈ ℝ³
- Second layer: ℝ³ → ℝ²,  h^(2) = g^(2)(a^(2)),  a^(2) = W^(2) h^(1),  W^(2) ∈ ℝ^{2×3},  a^(2), h^(2) ∈ ℝ²
- Final layer: ℝ² → ℝ,  h^(3) = g^(3)(a^(3)),  a^(3) = W^(3) h^(2),  W^(3) ∈ ℝ^{1×2},  a^(3), h^(3) ∈ ℝ
(Please accept the notational conflict between an instance-wise loss ℓ_n and a layer number l.)

Slide 52

Slide 52 text

How to train weights in multi-layer NNs
- We have no explicit supervision signals for the internal (hidden) inputs/outputs h^(1), …, h^(L−1)
- Having said that, SGD only needs the value of the gradient ∂E_n/∂w_ij^(l) for every weight w_ij^(l) in MLPs
- Can we compute the value of ∂E_n/∂w_ij^(l) for every weight w_ij^(l)?
  - Yes! Backpropagation can do that!!

Slide 53

Slide 53 text

Backpropagation
- Commonly used in deep neural networks
- Formulas for backpropagation look complicated
- However:
  - We can understand backpropagation easily if we know the concept of a computation graph
  - Most deep learning frameworks implement backpropagation by using automatic differentiation
- Let's see computation graphs and automatic differentiation first

Slide 54

Slide 54 text

Computation graph: f(x, y, z) = (x + y)z
Example from: http://cs231n.github.io/optimization-2/
Forward pass (the value of a variable is written above its arrow):
  x = −2, y = 5, z = −4
  q = x + y = 3
  f = qz = −12

Slide 55

Slide 55 text

Automatic Differentiation (AD): f(x, y, z) = (x + y)z
Example from: http://cs231n.github.io/optimization-2/
Forward pass (the value of a variable above the arrow):
  x = −2, y = 5, z = −4;  q = x + y = 3;  f = qz = −12
Backward pass (reverse-mode AD; the gradient of the output with respect to the variable below the arrow):
  ∂f/∂f = 1
  ∂f/∂q = z × 1 = −4
  ∂f/∂z = q × 1 = 3
  ∂f/∂x = ∂f/∂q ⋅ ∂q/∂x = −4 × 1 = −4
  ∂f/∂y = ∂f/∂q ⋅ ∂q/∂y = −4 × 1 = −4
Compare with the analytic gradients: ∂f/∂x = z = −4,  ∂f/∂y = z = −4,  ∂f/∂z = x + y = 3
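
A minimal sketch (not from the slides) of the same reverse-mode computation written out by hand in Python, mirroring the forward and backward passes above.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule, from the output back to the inputs)
df_df = 1.0
df_dq = df_df * z  # multiply node: d(q*z)/dq = z -> -4
df_dz = df_df * q  # multiply node: d(q*z)/dz = q -> 3
df_dx = df_dq * 1  # add node: dq/dx = 1 -> -4
df_dy = df_dq * 1  # add node: dq/dy = 1 -> -4
print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```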

Slide 56

Slide 56 text

Automatic Differentiation (Baydin+ 2018)
- AD computes derivatives by using the chain rule
  - Function values are computed in the forward pass
  - Derivatives are computed with respect to:
    - Every variable (in reverse-mode accumulation)
    - A specific variable (in forward-mode accumulation)
- Do not confuse AD with these:
  - Numerical differentiation: e.g., f′(x) ≈ (f(x + h) − f(x)) / h
  - Symbolic differentiation: e.g., Mathematica, sympy
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.

Slide 57

Slide 57 text

Rules for reverse-mode Automatic Differentiation
- Add: z = x + y; the incoming gradient ∂f/∂z is passed unchanged to both x and y
- Multiply: z = xy; x receives (∂f/∂z) ⋅ y and y receives (∂f/∂z) ⋅ x
- Function application: z = g(x); x receives (∂f/∂z) ⋅ g′(x)
- Branch (a variable used in two places): the gradients from the two uses are summed, ∂f/∂x = ∂f/∂x_1 + ∂f/∂x_2

Slide 58

Slide 58 text

Exercise: AD on a computation graph
- Write a computation graph for
  ℓ(x, w) = −log σ(w ⋅ x) = −log (1 / (1 + e^{−w⋅x}))
- Consider x = (1, 1, 1)⊺ and w = (1, 1, −1.5)⊺
- Compute the value of ℓ
- Compute the gradients

Slide 59

Slide 59 text

Computing ℓ(x, w) using AD
Forward pass (x = (1, 1, 1)⊺, w = (1, 1, −1.5)⊺):
  a = w ⋅ x = 0.5;  −a = −0.5;  e^{−a} = 0.6065;  1 + e^{−a} = 1.6065;
  σ(a) = 1/1.6065 = 0.6225;  log σ(a) = −0.4740;  ℓ = −log σ(a) = 0.4740
Backward pass (reverse-mode AD, applying the rules of the previous slide):
  ∂ℓ/∂(log σ) = −1
  ∂ℓ/∂σ = −1 × (1/0.6225) = −1.6065
  ∂ℓ/∂(1 + e^{−a}) = −(1/1.6065)² × (−1.6065) = 0.6224
  ∂ℓ/∂(e^{−a}) = 0.6224
  ∂ℓ/∂(−a) = 0.6224 × e^{−0.5} = 0.3775
  ∂ℓ/∂a = −0.3775
  ∂ℓ/∂w_i = −0.3775 × x_i = −0.3775 (for i = 1, 2, 3),  ∂ℓ/∂x = −0.3775 × w = (−0.3775, −0.3775, 0.5663)⊺

Slide 60

Slide 60 text

Computing gradients with autograd 59 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
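
If the slide refers to the autograd package (https://github.com/HIPS/autograd), a sketch of the same computation (assumptions: function and variable names are mine, not the notebook's) might look like this:

```python
import autograd.numpy as np
from autograd import grad

def loss(w, x):
    return -np.log(1.0 / (1.0 + np.exp(-np.dot(w, x))))  # -log sigmoid(w.x)

x = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 1.0, -1.5])
print(loss(w, x))            # approximately 0.4740
print(grad(loss)(w, x))      # approximately (-0.3775, -0.3775, -0.3775)
```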

Slide 61

Slide 61 text

Computing gradients with pytorch 60 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
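
A minimal pytorch sketch (not copied from the linked notebook) computing the same loss and gradients with torch's automatic differentiation.

```python
import torch

x = torch.tensor([1.0, 1.0, 1.0])
w = torch.tensor([1.0, 1.0, -1.5], requires_grad=True)

loss = -torch.log(torch.sigmoid(w @ x))  # -log sigmoid(w.x) = 0.4740
loss.backward()                          # reverse-mode AD
print(loss.item(), w.grad)               # grad is approximately (-0.3775, -0.3775, -0.3775)
```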

Slide 62

Slide 62 text

Training an SLP using SGD with pytorch
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
- Data: X = ((0 0 1), (0 1 1), (1 0 1), (1 1 1)) with the corresponding gold outputs y; initial weights w = (0 0 0)⊺
- x.mm(w): matrix-vector multiplication, shape (4 × 1)
- sigmoid(): element-wise sigmoid function, shape (4 × 1)
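
A minimal sketch (not the linked notebook) of the low-level training loop: forward pass with x.mm(w) and sigmoid, backward pass with autograd, and a manual gradient-descent update. The OR outputs are assumed for y.

```python
import torch

X = torch.tensor([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [1.]])   # assumption: OR outputs
w = torch.zeros(3, 1, requires_grad=True)

eta = 1.0
for t in range(100):
    y_hat = torch.sigmoid(X.mm(w))           # (4 x 1) predictions
    loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()
    loss.backward()                          # compute w.grad
    with torch.no_grad():
        w -= eta * w.grad                    # gradient-descent update (full batch for brevity)
        w.grad.zero_()
print(torch.sigmoid(X.mm(w)))
```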

Slide 63

Slide 63 text

Training an MLP using SGD with pytorch
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
Changes from the SLP version: weights are added for the second layer, the forward pass is changed to the two-layer computation, and the update step covers the new parameters.

Slide 64

Slide 64 text

Training an SLP with high-level NN modules
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
- The model definition specifies the shape of the network and the loss function (bias=True includes weights for bias terms)
- The training loop can be implemented in a generic manner, i.e., independently of the model
- We no longer append 1 (bias) to every instance because torch.nn.Linear automatically includes a bias weight
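
A minimal sketch (not the linked notebook) with high-level modules: nn.Linear (bias=True) followed by nn.Sigmoid as the model and BCELoss as the loss; the parameter update is still written by hand here, since optimizers appear on a later slide. The OR outputs are assumed for y.

```python
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [1.]])   # assumption: OR outputs

model = nn.Sequential(nn.Linear(2, 1, bias=True), nn.Sigmoid())
loss_fn = nn.BCELoss(reduction='sum')

eta = 1.0
for t in range(100):
    loss = loss_fn(model(X), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad      # manual SGD step (no optimizer yet)
print(model(X))
```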

Slide 65

Slide 65 text

Training an MLP with high-level NN modules
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
- The model definition is the essence of the change from the SLP to the MLP
- We do not have to modify the generic training loop to implement the MLP
- (The number of iterations was changed from 100 to 1000 because we have more parameters to train)

Slide 66

Slide 66 text

SLP with high-level NN modules and optimizers 65 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
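
A minimal sketch (not the linked notebook) replacing the manual parameter update with torch.optim.SGD; the OR outputs are again an assumption.

```python
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [1.]])   # assumption: OR outputs

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
loss_fn = nn.BCELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for t in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print(model(X))
```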

Slide 67

Slide 67 text

MLP with high-level NN modules and optimizers 66 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

Slide 68

Slide 68 text

SLP with a customizable NN class 67 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
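
A minimal sketch (not the linked notebook) of the same single-layer model written as a customizable torch.nn.Module subclass; the class name and the OR outputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SingleLayerNN(nn.Module):          # hypothetical class name
    def __init__(self, d_in):
        super().__init__()
        self.fc = nn.Linear(d_in, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [1.]])   # assumption: OR outputs

model = SingleLayerNN(2)
loss_fn = nn.BCELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
for t in range(100):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()
print(model(X))
```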

Slide 69

Slide 69 text

MLP with a customizable NN class 68 https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

Slide 70

Slide 70 text

Manual derivation: gradients for the final layer
- The same as for single-layer NNs:
  ∂E_n/∂w_{1j}^{(L)} = −(y − ŷ) h_j^{(L−1)}
- Here, we omit the index n for the instance for simplicity
- We write the instance-wise loss as E_n (instead of ℓ_n) to avoid the notational conflict with the layer number
Final layer: ŷ = h^{(L)} = g^{(L)}(a^{(L)}),  a^{(L)} = W^{(L)} h^{(L−1)}

Slide 71

Slide 71 text

Manual derivation: gradients for the internal layers (1/2)
Deriving the recursive formula of δ_j^{(l)} = ∂E_n/∂a_j^{(l)}:
- Forward relations:
  a_1^{(l+1)} = w_11^{(l+1)} h_1^{(l)} + w_12^{(l+1)} h_2^{(l)},  a_2^{(l+1)} = w_21^{(l+1)} h_1^{(l)} + w_22^{(l+1)} h_2^{(l)}
- Backward relations (chain rule through both units of layer l+1):
  δ_1^{(l)} = g′(a_1^{(l)}) (w_11^{(l+1)} δ_1^{(l+1)} + w_21^{(l+1)} δ_2^{(l+1)})
  δ_2^{(l)} = g′(a_2^{(l)}) (w_12^{(l+1)} δ_1^{(l+1)} + w_22^{(l+1)} δ_2^{(l+1)})

Slide 72

Slide 72 text

Manual derivation: gradients for the internal layers (2/2)
- General form of the recursive formula of δ:
  δ_j^{(l)} = ∂E_n/∂a_j^{(l)} = g′(a_j^{(l)}) Σ_k w_kj^{(l+1)} δ_k^{(l+1)}
- Gradient for an internal-layer weight:
  ∂E_n/∂w_ij^{(l)} = (∂E_n/∂a_i^{(l)}) ⋅ (∂a_i^{(l)}/∂w_ij^{(l)}) = δ_i^{(l)} h_j^{(l−1)}
The l-th layer: h^{(l)} = g^{(l)}(a^{(l)}),  a^{(l)} = g^{(l)}(W^{(l)} h^{(l−1)})
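
A minimal numpy sketch (not from the slides) of manual backpropagation for a two-layer sigmoid network, using the recursion δ^(l) = g′(a^(l)) ⊙ (W^(l+1)⊺ δ^(l+1)) and the gradient ∂E/∂W^(l) = δ^(l) h^(l−1)⊺; bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
x, y = np.array([1.0, 0.0]), 1.0

# Forward pass
a1 = W1 @ x;  h1 = sigmoid(a1)
a2 = W2 @ h1; y_hat = sigmoid(a2)
E = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # instance-wise loss

# Backward pass
delta2 = y_hat - y                         # dE/da2 for sigmoid output + log loss
dW2 = np.outer(delta2, h1)                 # dE/dW2 = delta2 h1^T
delta1 = h1 * (1 - h1) * (W2.T @ delta2)   # recursion: g'(a1) * (W2^T delta2)
dW1 = np.outer(delta1, x)                  # dE/dW1 = delta1 x^T
print(dW1, dW2, sep="\n")
```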

Slide 73

Slide 73 text

Summary
- We can use SGD as long as we can compute the gradients of all parameters
  - Even if we have no explicit supervision signals for the internal layers
- Automatic Differentiation (AD) can compute gradients systematically
  - AD computes derivatives on a computation graph by using the chain rule
  - AD realizes backpropagation without manual derivation of gradients
- AD is employed in most deep learning frameworks
  - We only need to implement an algorithm for the forward pass, i.e., how a model computes an output given an input
  - We can concentrate on designing the structure of the neural network
  - This boosted the speed of research and development
    - Manual derivation of gradients is tedious and error-prone

Slide 74

Slide 74 text

An Intuitive Explanation of Universal Approximation Theorem for Multi-Layer NN 73

Slide 75

Slide 75 text

Universal approximation theorem (Cybenko, 1989)
- Let I_d denote the d-dimensional unit cube [0,1]^d and C(I_d) denote the space of continuous functions on I_d
- Given any ε > 0 and any function f ∈ C(I_d), there exist an integer N, real constants v_n, b_n ∈ ℝ, and real vectors w_n ∈ ℝ^d that define a function
  F(x) = Σ_{n=1}^{N} v_n σ(w_n ⋅ x + b_n),
  such that the function F approximates the function f: |F(x) − f(x)| < ε for all x ∈ I_d
- This still holds when replacing I_d with any compact subset of ℝ^d and σ(.) with some other activation functions
George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.

Slide 76

Slide 76 text

What does the theorem state?
Neural networks with a single hidden layer can approximate any smooth function closely.
Video: The Universal Approximation Theorem for neural networks. https://www.youtube.com/watch?v=Ijqkc7OLenI (6:24)

Slide 77

Slide 77 text

Essence: a smooth function approximated by spikes
(Figure: a smooth curve is approximated by a sum of spike/bump shapes.)
These spike shapes can be realized by choosing appropriate values w, b for σ(w(x + b))
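
A minimal numpy sketch (not from the slides) of this idea: a pair of steep sigmoids with opposite shifts forms a spike (bump), and a weighted sum of such bumps, which is exactly a single-hidden-layer network of the form Σ v_n σ(w_n x + b_n), approximates a smooth target function on [0, 1]. The target sin(2πx), the bump width, and the steepness are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bump(x, center, width, steepness=500.0):
    """Approximately 1 on [center - width/2, center + width/2] and 0 elsewhere."""
    left, right = center - width / 2, center + width / 2
    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))

x = np.linspace(0.0, 1.0, 1001)
f = np.sin(2 * np.pi * x)                      # a smooth target function

width = 0.05
centers = np.arange(width / 2, 1.0, width)     # 20 bumps covering [0, 1]
approx = sum(np.sin(2 * np.pi * c) * bump(x, c, width) for c in centers)
print(np.max(np.abs(f - approx)))              # approximation error; shrinks as the bumps get narrower
```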

Slide 78

Slide 78 text

Summary of this lecture
- Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
- Multi-layer neural networks can realize any logical function, including XOR
- We can train single/multi-layer NNs by using gradient-based methods
  - By implementing the graph structures of NNs in a programming language
  - With automatic differentiation in deep learning frameworks
- Neural networks with a single hidden layer can approximate any smooth function

Slide 79

Slide 79 text

References
- Michael Nielsen. 2017. Neural Networks and Deep Learning. http://neuralnetworksanddeeplearning.com/ (Japanese translation: https://nnadl-ja.github.io/nnadl_site_ja/)
- Raul Rojas. 1996. Neural Networks - A Systematic Introduction. Springer-Verlag. (Available at https://page.mi.fu-berlin.de/rojas/neural/)
- 斎藤康毅 (Koki Saitoh). 2016. ゼロから作るDeep Learning (Deep Learning from Scratch). O'Reilly Japan.
- Learning PyTorch with Examples. https://pytorch.org/tutorials/beginner/pytorch_with_examples.html