Naoaki Okazaki
July 28, 2020

# Feedforward Neural Network (I): Binary Classification

Binary classification, Threshold Logic Units (TLUs), Single-layer Perceptron (SLP), Perceptron algorithm, sigmoid function, Stochastic Gradient Descent (SGD), Multi-layer Neural Networks, Backpropagation, Computation Graph, Automatic Differentiation, Universal Approximation Theorem

## Transcript

1. ### Feedforward Neural Network (I): Binary Classification

Naoaki Okazaki, School of Computing, Tokyo Institute of Technology
2. ### Highlights of this lecture

- Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
- Multi-layer neural networks can realize any logical function, including XOR
- We can train single/multi-layer NNs by using gradient-based methods
  - By implementing graph structures of NNs in a programming language
  - With automatic differentiation in deep learning frameworks
- Neural networks with a single hidden layer can approximate any smooth function

4. ### Recap: Logical connectives

- AND: $x \wedge y$
- OR: $x \vee y$
- NOT: $\neg x$
- NAND (NOT of AND): $\neg(x \wedge y)$
- NOR (NOT of OR): $\neg(x \vee y)$
- XOR (exclusive OR): $x \oplus y$

Image source: https://vanya.jp.net/dc/
5. ### Logical circuits used in daily life

- Signs in Shinkansen [1]: activated by the OR of the pressed states of all buttons in the Shinkansen
- Logic IC (TC74HC00AP) [2]
- Intel® Core™ i7 Processor [3]

[1] https://response.jp/article/img/2017/10/02/300517/1228814.html
[2] https://toshiba.semicon-storage.com/info/docget.jsp?did=67518&prodName=TC74HC00AP
[3] http://download.intel.com/pressroom/kits/corei7/images/Core_i7_300.jpg
6. ### Recap: Functionally complete set of AND, OR, NOT

Any truth table (a Boolean function with $n$ inputs, $f: \{0,1\}^n \mapsto \{0,1\}$) can be expressed by a combination of the logical connectives AND, OR, and NOT.

| $x$ | $y$ | Out | Formula from step 1 |
|---|---|---|---|
| 0 | 0 | 0 | |
| 0 | 1 | 1 | $\neg x \wedge y$ |
| 1 | 0 | 1 | $x \wedge \neg y$ |
| 1 | 1 | 0 | |

Result of step 2: $(\neg x \wedge y) \vee (x \wedge \neg y)$

Rough explanation: Disjunctive Normal Form (DNF). We can convert a truth table into a logical formula in a systematic way with {AND, OR, NOT}:

- Step 1: for each row in the table where the output is true, take the AND of all the inputs. Each resulting formula yields 1 only when the corresponding input is given.
  - When an input (column) of the row is true, use the input variable as it is
  - Otherwise (the input of the row is false), prepend a negation to the input variable
- Step 2: take the OR of all the formulas obtained in step 1. The overall output is 1 when any of the formulas is satisfied.
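
The two-step DNF construction can be sketched in a few lines of Python. This is only an illustration: the function names `truth_table_to_dnf` and `eval_dnf` and the dictionary representation of a truth table are assumptions made here, not part of the lecture.

```python
def truth_table_to_dnf(table, names):
    """Build a DNF formula string from a truth table.

    `table` maps each tuple of 0/1 inputs to a 0/1 output
    (a hypothetical representation chosen for this sketch).
    """
    terms = []
    for inputs, out in sorted(table.items()):
        if out != 1:
            continue  # Step 1 only uses rows whose output is true.
        # Use the variable as is when the input is 1, negate it otherwise.
        literals = [name if value else "¬" + name
                    for name, value in zip(names, inputs)]
        terms.append("(" + " ∧ ".join(literals) + ")")
    # Step 2: OR of all conjunctions (an all-false table is the constant 0).
    return " ∨ ".join(terms) if terms else "0"


def eval_dnf(table, inputs):
    """Evaluate the DNF implied by `table` using only AND, OR, and NOT."""
    return int(any(
        all((v == 1) if r == 1 else (not v) for v, r in zip(inputs, row))
        for row, out in table.items() if out == 1
    ))


xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(truth_table_to_dnf(xor_table, ["x", "y"]))
```

For the XOR table, the printed formula matches the one in the slide, and `eval_dnf` reproduces every row of the table.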
7. ### Can we build a logical connective only from input/output pairs?

- In ML terms, we learn an unknown mapping from supervision data
- No knowledge about the internal mechanism associating inputs with outputs is required
- We start with an example where all input/output pairs can be listed, but listing them all is impossible or impractical in the real world
  - Imagine: inputs are natural language questions and outputs are answers
- We expect that a learned mapping can predict outputs for unseen inputs

| Input $(x, y)$ | Output $z$ |
|---|---|
| $(0, 0)$ | 0 |
| $(0, 1)$ | 1 |
| $(1, 0)$ | 1 |
| $(1, 1)$ | 1 |

The mapping from $(x, y)$ to $z$ is the unknown ("?") to be found.
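8. ### Realize OR as a mathematical function

- Find a function $f$ with $z = f(x, y)$ that satisfies ($x, y, z \in \{0,1\}$):

$$f(0, 0) = 0, \quad f(0, 1) = 1, \quad f(1, 0) = 1, \quad f(1, 1) = 1$$

- We can manually craft a function like this:

$$f(x, y) = g(x + y), \qquad g(a) = \begin{cases} 1 & (\text{if } a > 0) \\ 0 & (\text{otherwise}) \end{cases} \quad \text{(step function)}$$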
9. ### Realize OR with a single-layer neural network

- Finding a function from scratch is hard in general
  - We assume a model with parameters
- Assume a single-layer neural network (single-layer NN): $z = g(w_x x + w_y y + b)$
  - Output: $z \in \{0,1\}$
  - Input: $x, y \in \{0,1\}$
  - Parameters: $w_x, w_y, b \in \mathbb{R}$
- Train a model: find the parameters that can reproduce the input/output of the supervision data (OR)

11. ### Parameters realizing logical OR: $z = x \vee y$

For example: $w_x = w_y = 1$, $b = -0.5$, that is, $z = g(x + y - 0.5)$

| $x$ | $y$ | $x \vee y$ | $z = g(w_x x + w_y y + b)$ |
|---|---|---|---|
| 0 | 0 | 0 | $g(0 - 0.5) = 0$ |
| 0 | 1 | 1 | $g(1 - 0.5) = 1$ |
| 1 | 0 | 1 | $g(1 - 0.5) = 1$ |
| 1 | 1 | 1 | $g(2 - 0.5) = 1$ |

12. ### Parameters realizing logical AND: $z = x \wedge y$

For example: $w_x = w_y = 1$, $b = -1.5$, that is, $z = g(x + y - 1.5)$

| $x$ | $y$ | $x \wedge y$ | $z = g(w_x x + w_y y + b)$ |
|---|---|---|---|
| 0 | 0 | 0 | $g(0 - 1.5) = 0$ |
| 0 | 1 | 0 | $g(1 - 1.5) = 0$ |
| 1 | 0 | 0 | $g(1 - 1.5) = 0$ |
| 1 | 1 | 1 | $g(2 - 1.5) = 1$ |

13. ### Parameters realizing logical NOT: $z = \neg x$

For example: $w_x = -1$, $w_y = 0$, $b = 0.5$, that is, $z = g(-x + 0.5)$ (we ignore $y$ because a logical NOT has one input)

| $x$ | $\neg x$ | $z = g(-x + 0.5)$ |
|---|---|---|
| 0 | 1 | $g(-1 \times 0 + 0.5) = g(0.5) = 1$ |
| 1 | 0 | $g(-1 \times 1 + 0.5) = g(-1 + 0.5) = 0$ |

14. ### Parameters realizing logical NAND: $z = \neg(x \wedge y)$

For example: $w_x = w_y = -1$, $b = 1.5$, that is, $z = g(-x - y + 1.5)$

| $x$ | $y$ | $\neg(x \wedge y)$ | $z = g(w_x x + w_y y + b)$ |
|---|---|---|---|
| 0 | 0 | 1 | $g(0 + 1.5) = 1$ |
| 0 | 1 | 1 | $g(-1 + 1.5) = 1$ |
| 1 | 0 | 1 | $g(-1 + 1.5) = 1$ |
| 1 | 1 | 0 | $g(-2 + 1.5) = 0$ |

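
The hand-crafted parameters for OR, AND, NOT, and NAND can be checked mechanically. The sketch below is a minimal threshold logic unit; the names `step`, `tlu`, `wx`, `wy`, and `b` are labels chosen here for the two weights and the bias, and are assumptions rather than the lecture's own notation.

```python
def step(a):
    """Step function: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0


def tlu(x, y, wx, wy, b):
    """Single-layer NN (threshold logic unit): step(wx*x + wy*y + b)."""
    return step(wx * x + wy * y + b)


# (wx, wy, b) taken from the examples in the slides above.
GATES = {
    "OR": (1, 1, -0.5),
    "AND": (1, 1, -1.5),
    "NAND": (-1, -1, 1.5),
}

# Inputs are enumerated in the order (0,0), (0,1), (1,0), (1,1).
for name, params in GATES.items():
    print(name, [tlu(x, y, *params) for x in (0, 1) for y in (0, 1)])

# NOT has a single input, so the second weight is 0 and y is ignored.
print("NOT", [tlu(x, 0, -1, 0, 0.5) for x in (0, 1)])
```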
15. ### Can we find parameters that realize logical XOR?

| $x$ | $y$ | $x \oplus y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Can we find parameter values $w_x, w_y, b$ such that they reproduce the logical XOR?
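
One way to answer this question algebraically is to write down the four constraints that XOR would impose on a single-layer NN. The derivation below is a sketch added for illustration; it assumes a unit with two weights and a bias, written here as $w_x, w_y, b$ (names chosen for this sketch), and a step function that outputs 1 only for strictly positive inputs.

```latex
\begin{align*}
(x, y) = (0, 0),\ z = 0 &: \quad b \le 0 \\
(x, y) = (0, 1),\ z = 1 &: \quad w_y + b > 0 \\
(x, y) = (1, 0),\ z = 1 &: \quad w_x + b > 0 \\
(x, y) = (1, 1),\ z = 0 &: \quad w_x + w_y + b \le 0
\end{align*}
% Adding the second and third constraints gives
%   w_x + w_y + 2b > 0, hence w_x + w_y + b > -b \ge 0,
% which contradicts the fourth constraint.
```

Because the constraints are contradictory, no choice of the three parameters reproduces XOR.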
16. ### Single-layer NNs cannot realize XOR (Minsky and Papert, 1969)

- The decision rule for outputting $z = 1$ (assuming $w_y > 0$): $w_x x + w_y y + b > 0 \iff y > -\frac{w_x}{w_y} x - \frac{b}{w_y}$
- This draws a line with the slope $-\frac{w_x}{w_y}$ and y-intercept $-\frac{b}{w_y}$
- However, it is impossible to draw a single line that separates the true/false outputs of the XOR logic
- We say that XOR inputs are not linearly separable (linearly inseparable)

Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.
17. ### How can we realize logical XOR?

- Combine logical connectives: $z = (x \vee y) \wedge \neg(x \wedge y)$
- Alternatively, draw multiple lines (instead of a single line)
  1. Draw a line for OR
  2. Draw a line for NAND (NOT of AND)
  3. Take the AND of these areas

| $x$ | $y$ | $x \vee y$ (OR) | $\neg(x \wedge y)$ (NAND) | AND of OR and NAND |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 |

18. ### XOR realized as a combination of single-layer neural networks

- XOR is the AND of OR and NAND: $z = (x \vee y) \wedge \neg(x \wedge y) = h_1 \wedge h_2 = g(h_1 + h_2 - 1.5)$, where:
  - $h_1 = x \vee y = g(x + y - 0.5)$
  - $h_2 = \neg(x \wedge y) = g(-x - y + 1.5)$
- Interactive demo: https://chokkan.github.io/deeplearning/demo-mlp.html
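
The composition of OR, NAND, and AND units can be written directly as a two-layer network. This is a minimal sketch; the function name `xor_net` is an assumption made for illustration.

```python
def step(a):
    """Step function: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0


def xor_net(x, y):
    """XOR as a two-layer network of threshold units."""
    h1 = step(x + y - 0.5)      # hidden unit 1: OR
    h2 = step(-x - y + 1.5)     # hidden unit 2: NAND
    return step(h1 + h2 - 1.5)  # output unit: AND of h1 and h2


for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_net(x, y))
```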
19. ### Multi-layer neural networks

- By combining (stacking) single-layer NNs, multi-layer neural networks (multi-layer NNs) can realize any truth table
  - Applicable to linearly inseparable input/output
  - Has expressive power equivalent to any logic circuit, such as an Arithmetic Logic Unit (ALU) and memory
- Note: the step function is the key to the non-linearity
  - If we did not use a step function in the first layer, the weighted sum $a$ fed to the output unit would be linear in the inputs, and the network would reduce to a single-layer NN:

$$
\begin{aligned}
a &= w_1 h_1 + w_2 h_2 + b \\
  &= w_1 (w_{11} x_1 + w_{12} x_2 + b_1) + w_2 (w_{21} x_1 + w_{22} x_2 + b_2) + b \\
  &= (w_1 w_{11} + w_2 w_{21}) x_1 + (w_1 w_{12} + w_2 w_{22}) x_2 + (w_1 b_1 + w_2 b_2 + b)
\end{aligned}
$$
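20. ### Generic form: Feed Forward Neural Network (FFNN)

- First layer $f_1: \mathbb{R}^2 \to \mathbb{R}^3$: $\boldsymbol{h} = f_1(\boldsymbol{x}) = g_1(W_{xh} \boldsymbol{x} + \boldsymbol{b}_h)$, with $W_{xh} \in \mathbb{R}^{3 \times 2}$, $\boldsymbol{b}_h \in \mathbb{R}^3$
- Second layer $f_2: \mathbb{R}^3 \to \mathbb{R}^2$: $\boldsymbol{z} = f_2(\boldsymbol{h}) = g_2(W_{hz} \boldsymbol{h} + \boldsymbol{b}_z)$, with $W_{hz} \in \mathbb{R}^{2 \times 3}$, $\boldsymbol{b}_z \in \mathbb{R}^2$
- Final layer $f_3: \mathbb{R}^2 \to \mathbb{R}^1$: $y = f_3(\boldsymbol{z}) = g_3(W_{zy} \boldsymbol{z} + b_y)$, with $W_{zy} \in \mathbb{R}^{1 \times 2}$, $b_y \in \mathbb{R}$
- Three-layer neural network: $y = f_3(f_2(f_1(\boldsymbol{x})))$
  - $\boldsymbol{h}$ and $\boldsymbol{z}$ are called hidden units, states, or layers
  - The depth of the neural network is three
  - $g_1, g_2, g_3$ are called activation functions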
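
The layer shapes of this three-layer network can be checked with a short numpy sketch. The weights are random placeholders, the step function is used for all three activations, and the variable names (`W_xh`, `b_h`, and so on) are assumptions chosen for this illustration.

```python
import numpy as np


def step(a):
    """Element-wise step function: 1.0 where a > 0, otherwise 0.0."""
    return (a > 0).astype(float)


# Random placeholder parameters with the shapes of a 2 -> 3 -> 2 -> 1 network.
rng = np.random.default_rng(0)
W_xh, b_h = rng.normal(size=(3, 2)), rng.normal(size=3)
W_hz, b_z = rng.normal(size=(2, 3)), rng.normal(size=2)
W_zy, b_y = rng.normal(size=(1, 2)), rng.normal(size=1)


def forward(x):
    """Compute y = f3(f2(f1(x))) and return all intermediate states."""
    h = step(W_xh @ x + b_h)  # first layer:  R^2 -> R^3
    z = step(W_hz @ h + b_z)  # second layer: R^3 -> R^2
    y = step(W_zy @ z + b_y)  # final layer:  R^2 -> R^1
    return h, z, y


h, z, y = forward(np.array([1.0, 0.0]))
print(h.shape, z.shape, y.shape)
```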
21. ### Summary

- The logical units explained here are called Threshold Logic Units (TLUs), the first artificial neuron (McCulloch and Pitts, 1943)
- Single-layer NNs can provide a functionally complete set {AND, OR, NOT}
- Single-layer NNs cannot model linearly inseparable data (e.g., XOR)
- Multi-layer NNs (stacking single-layer NNs) can be seen as logical compounds
  - Multi-layer NNs can realize any binary function $\{0,1\}^n \mapsto \{0,1\}$
  - Multi-layer NNs can model linearly inseparable data
  - We will see that multi-layer NNs can approximately express any smooth function
- We showed the generic form of feed-forward neural networks

W. McCulloch and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133.

23. ### How to determine parameters of single-layer NNs

- We saw that single-layer NNs realize logical connectives
  - By crafting parameters (weights and biases) carefully to realize the desired connectives
- However, crafting parameters is difficult
  - We are sometimes unsure of the internal mechanism associating input and output variables
- We want to find parameters automatically from data
  - We are interested in determining parameters only from supervision data, i.e., pairs of inputs and outputs
24. ### Supervised learning (training)

- Supervision data (input: $\boldsymbol{x} \in \mathbb{R}^d$, output: $y \in \{0,1\}$)
  - $D = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_N, y_N)\}$ ($N$ instances)
- Find parameters such that they can reproduce training instances as correctly as possible
- We assume generalization
  - If the parameters reproduce training instances well, we expect that they will work for unseen instances
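25. ### Supervised learning for single-layer NNs (with new notations)

- For simplicity, we include the bias term $b \in \mathbb{R}$ in $\boldsymbol{w} \in \mathbb{R}^d$
  - $\boldsymbol{x}^{(\text{new})} = \boldsymbol{x} \oplus 1 = (x_1, x_2, \dots, x_d, 1)^\top$, $\boldsymbol{w}^{(\text{new})} = \boldsymbol{w} \oplus b = (w_1, w_2, \dots, w_d, b)^\top$
  - $\boldsymbol{w}^{(\text{new})} \cdot \boldsymbol{x}^{(\text{new})} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b$ (← original form)
- We introduce a new notation to distinguish a computed output $\hat{y}$ from the gold output $y$ in the supervision data
  - $D = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_N, y_N)\}$ ($N$ instances)
  - We distinguish two kinds of outputs hereafter
    - $\hat{y}_i$: the output computed (predicted) by the model for the input $\boldsymbol{x}_i$
    - $y_i$: the true (gold) output for the input $\boldsymbol{x}_i$ in the supervision data
- Training: find $\boldsymbol{w}$ such that $\forall i \in \{1, \dots, N\}: \hat{y}_i = g(\boldsymbol{w} \cdot \boldsymbol{x}_i) = y_i$ ($g$ is the step function)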
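26. ### Perceptron algorithm (Rosenblatt, 1958)

- Initialize $\boldsymbol{w} = \boldsymbol{0}$ and set $\eta = 1$ (for simplicity)
- Repeat until no instance triggers an update:
  - $(\boldsymbol{x}_i, y_i) \leftarrow$ an instance chosen from $D$ at random
  - $\hat{y} \leftarrow g(\boldsymbol{w} \cdot \boldsymbol{x}_i)$
  - If $\hat{y} \neq y_i$:
    - If $y_i = 1$: $\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta \boldsymbol{x}_i$
    - Otherwise: $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \boldsymbol{x}_i$

Frank Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6):386-408.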
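
The perceptron algorithm can be sketched in plain Python as follows. This is an illustrative variant, not the lecture's notebook code: it visits the instances in a shuffled order per epoch instead of sampling one at a time, stops after the first epoch without updates, and the names `train_perceptron`, `max_epochs`, and `seed` are assumptions introduced here.

```python
import random


def step(a):
    """Step function: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(data, eta=1.0, max_epochs=100, seed=0):
    """Return weights that classify `data` correctly, if training converges.

    `data` is a list of (x, y) pairs where x already includes a trailing 1
    for the bias term and y is 0 or 1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        order = list(data)
        rng.shuffle(order)
        updated = False
        for x, y in order:
            if step(dot(w, x)) != y:
                # Add x for a missed positive, subtract x for a missed negative.
                sign = 1.0 if y == 1 else -1.0
                w = [wi + sign * eta * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:
            return w  # no instance triggered an update
    raise RuntimeError("no convergence; the data may be linearly inseparable")


OR_DATA = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
w = train_perceptron(OR_DATA)
print(w, [step(dot(w, x)) for x, _ in OR_DATA])
```

The `max_epochs` guard matters because the loop never reaches an update-free epoch on linearly inseparable data such as XOR.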
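27. ### Exercise: train a single-layer NN to realize OR

- Convert the truth table into training data
- Initialize the weight vector $\boldsymbol{w} = \boldsymbol{0}$
- Apply the perceptron algorithm (previous page) to find $\boldsymbol{w}$
  - Fix $\eta = 1$ in this exercise

| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

$$D = \{((0, 0, 1)^\top, 0),\ ((0, 1, 1)^\top, 1),\ ((1, 0, 1)^\top, 1),\ ((1, 1, 1)^\top, 1)\}$$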
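28. ### Updating weights for OR

- $D = \{((0, 0, 1)^\top, 0),\ ((0, 1, 1)^\top, 1),\ ((1, 0, 1)^\top, 1),\ ((1, 1, 1)^\top, 1)\}$
- Initialization: $\boldsymbol{w} = (0, 0, 0)^\top$
- Iteration #1: choose $(\boldsymbol{x}_4, y_4) = ((1, 1, 1)^\top, 1)$
  - Classification: $\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x}_4) = g(0) = 0 \neq y_4$
  - Update: $\boldsymbol{w} \leftarrow \boldsymbol{w} + \boldsymbol{x}_4 = (1, 1, 1)^\top$
- Iteration #2: choose $(\boldsymbol{x}_1, y_1) = ((0, 0, 1)^\top, 0)$
  - Classification: $\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x}_1) = g(1) = 1 \neq y_1$
  - Update: $\boldsymbol{w} \leftarrow \boldsymbol{w} - \boldsymbol{x}_1 = (1, 1, 0)^\top$
- Terminate (the weight vector classifies all instances correctly)
  - $\boldsymbol{x} = (0, 0, 1)^\top$: $\hat{y} = g((1, 1, 0)(0, 0, 1)^\top) = g(0) = 0$
  - $\boldsymbol{x} = (0, 1, 1)^\top$: $\hat{y} = g((1, 1, 0)(0, 1, 1)^\top) = g(1) = 1$
  - $\boldsymbol{x} = (1, 0, 1)^\top$: $\hat{y} = g((1, 1, 0)(1, 0, 1)^\top) = g(1) = 1$
  - $\boldsymbol{x} = (1, 1, 1)^\top$: $\hat{y} = g((1, 1, 0)(1, 1, 1)^\top) = g(2) = 1$

We chose the instances in the order that minimizes the required number of updates.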

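30. ### Perceptron algorithm implemented in numpy (matrix version)

Notebook: https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

$$X = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \quad \boldsymbol{w} = \begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix}$$

$$\hat{\boldsymbol{y}} = g(X \boldsymbol{w}) = g \begin{pmatrix} w_3 \\ w_2 + w_3 \\ w_1 + w_3 \\ w_1 + w_2 + w_3 \end{pmatrix} \quad \text{(applying the step function to each element)}$$

$$\boldsymbol{y} - \hat{\boldsymbol{y}} = \begin{pmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ y_3 - \hat{y}_3 \\ y_4 - \hat{y}_4 \end{pmatrix}$$

$(\boldsymbol{y} - \hat{\boldsymbol{y}}) \cdot X$, with each row of $X$ scaled by the corresponding error $y_i - \hat{y}_i$:

$$\begin{pmatrix} 0 & 0 & y_1 - \hat{y}_1 \\ 0 & y_2 - \hat{y}_2 & y_2 - \hat{y}_2 \\ y_3 - \hat{y}_3 & 0 & y_3 - \hat{y}_3 \\ y_4 - \hat{y}_4 & y_4 - \hat{y}_4 & y_4 - \hat{y}_4 \end{pmatrix}$$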
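
A matrix-style perceptron update can be sketched with numpy as below. This is an illustrative batch variant written for this transcript and is not claimed to match the linked notebook: it computes the predictions for all four instances at once and applies the summed update in a single step.

```python
import numpy as np

# OR training data; the last column of X is the constant 1 for the bias.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = np.zeros(3)
eta = 1.0

for _ in range(100):
    y_hat = (X @ w > 0).astype(float)  # step function applied element-wise
    if np.array_equal(y_hat, y):
        break                          # every instance is classified correctly
    # (y - y_hat) @ X sums the rows of X weighted by their errors.
    w += eta * (y - y_hat) @ X

print(w)
```

With these settings the loop stops at the weight vector (2, 2, 0), which classifies all four OR instances correctly.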
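31. ### Why Perceptron algorithm works

- Suppose the parameter $\boldsymbol{w}$ misclassifies $(\boldsymbol{x}_i, y_i)$
- If $y_i = 1$:
  - Update the weight vector: $\boldsymbol{w}' \leftarrow \boldsymbol{w} + \boldsymbol{x}_i$
  - If we classify $\boldsymbol{x}_i$ again with the updated weights $\boldsymbol{w}'$: $\boldsymbol{w}' \cdot \boldsymbol{x}_i = (\boldsymbol{w} + \boldsymbol{x}_i) \cdot \boldsymbol{x}_i = \boldsymbol{w} \cdot \boldsymbol{x}_i + \boldsymbol{x}_i \cdot \boldsymbol{x}_i \geq \boldsymbol{w} \cdot \boldsymbol{x}_i$
  - The dot product was increased (more likely to be classified as 1)
- Otherwise (if $y_i = 0$):
  - Update the weight vector: $\boldsymbol{w}' \leftarrow \boldsymbol{w} - \boldsymbol{x}_i$
  - If we classify $\boldsymbol{x}_i$ again with the updated weights $\boldsymbol{w}'$: $\boldsymbol{w}' \cdot \boldsymbol{x}_i = (\boldsymbol{w} - \boldsymbol{x}_i) \cdot \boldsymbol{x}_i = \boldsymbol{w} \cdot \boldsymbol{x}_i - \boldsymbol{x}_i \cdot \boldsymbol{x}_i \leq \boldsymbol{w} \cdot \boldsymbol{x}_i$
  - The dot product was decreased (more likely to be classified as 0)
- The algorithm updates the parameter in the direction where it will classify $(\boldsymbol{x}_i, y_i)$ more correctly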
32. ### Summary

- The perceptron algorithm:
  - Can find parameters of single-layer NNs for linearly separable data
  - Does not terminate on linearly inseparable data
    - Single-layer NNs cannot classify linearly inseparable data
    - We must forcibly terminate the algorithm with incomplete parameters
- Extending the algorithm to multi-layer NNs is non-trivial
  - We have no training data for hidden states
  - The famous argument of Minsky and Papert (1969)
- In the next section, we consider the gradient-based method, an alternative but standard strategy for training NNs
  - Important concepts: sigmoid function and backpropagation

Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

34. ### Activation function: from step to sigmoid

Step function $g: \mathbb{R} \to \{0,1\}$:

$$g(a) = \begin{cases} 1 & (\text{if } a > 0) \\ 0 & (\text{otherwise}) \end{cases}$$

- Yields binary outputs
- Not differentiable at zero
- With zero gradients

Sigmoid function $\sigma: \mathbb{R} \to (0,1)$:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

- Yields continuous scores
- Differentiable at all points
- With mostly non-zero gradients
- Useful for gradient descent
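35. ### General form with sigmoid function

- Single-layer NN with sigmoid function: $\hat{y} = \sigma(\boldsymbol{w} \cdot \boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$
  - Given an input $\boldsymbol{x} \in \mathbb{R}^d$, it computes an output $\hat{y} \in (0,1)$ by using the parameter $\boldsymbol{w} \in \mathbb{R}^d$
- This is also known as logistic regression
  - We can interpret $\hat{y}$ as the conditional probability $P(y = 1 \mid \boldsymbol{x})$ that an input $\boldsymbol{x}$ is classified to 1 (positive category)
- Rule to classify an input to 1: $\hat{y} > 0.5 \iff \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}} > \frac{1}{2} \iff \boldsymbol{w} \cdot \boldsymbol{x} > 0$
  - The classification rule is the same as the one when we use the step function as an activation function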
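36. ### Example: logical AND

- The same parameters as in the previous example: $\hat{y} = \sigma(a)$, $a = x_1 + x_2 - 1.5$
- The outputs are acceptable, but
  - $P(x_1 \wedge x_2 = 1 \mid x_1 = 1, x_2 = 1)$ is not so high (62.2%)
  - There is room for improvement so that the model yields $\hat{y} \to 1$ (100%) for positives (true) and $\hat{y} \to 0$ (0%) for negatives (false)

| $x_1$ | $x_2$ | $y = x_1 \wedge x_2$ | $a$ | $\hat{y} = \sigma(a)$ |
|---|---|---|---|---|
| 0 | 0 | 0 | -1.5 | 0.182 |
| 0 | 1 | 0 | -0.5 | 0.378 |
| 1 | 0 | 0 | -0.5 | 0.378 |
| 1 | 1 | 1 | 0.5 | 0.622 |
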
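
The sigmoid outputs in the AND table can be reproduced with a few lines of Python. This is a small check written for illustration; `sigmoid` is defined locally rather than taken from any library.

```python
import math


def sigmoid(a):
    """Sigmoid function: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


# AND parameters from the slide: a = x1 + x2 - 1.5
for x1 in (0, 1):
    for x2 in (0, 1):
        a = x1 + x2 - 1.5
        print(x1, x2, x1 & x2, a, round(sigmoid(a), 3))
```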
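37. ### Instance-wise likelihood

- We introduce the instance-wise likelihood $\hat{l}_i(\boldsymbol{w})$ to measure how well the parameters $\boldsymbol{w}$ reproduce $(\boldsymbol{x}_i, y_i)$:

$$\hat{l}_i(\boldsymbol{w}) = \begin{cases} \hat{y}_i & (\text{if } y_i = 1) \\ 1 - \hat{y}_i & (\text{otherwise}) \end{cases}$$

- Likelihood is a probability representing the "fitness" of the parameters to the training data
- We want to increase the likelihood by changing $\boldsymbol{w}$

Parameters of AND: $\hat{y} = \sigma(a)$, $a = x_1 + x_2 - 1.5$

| $x_1$ | $x_2$ | $y = x_1 \wedge x_2$ | $a$ | $\hat{y} = \sigma(a)$ | Instance-wise likelihood |
|---|---|---|---|---|---|
| 0 | 0 | 0 | -1.5 | 0.182 | $1 - \hat{y} = 0.818$ |
| 0 | 1 | 0 | -0.5 | 0.378 | $1 - \hat{y} = 0.622$ |
| 1 | 0 | 0 | -0.5 | 0.378 | $1 - \hat{y} = 0.622$ |
| 1 | 1 | 1 | 0.5 | 0.622 | $\hat{y} = 0.622$ |
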
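38. ### Likelihood on the training data

- We assume that all instances in the training data are i.i.d. (independent and identically distributed)
- We define the likelihood as a joint probability on the data, where $\hat{l}_i(\boldsymbol{w})$ is the instance-wise likelihood of $(\boldsymbol{x}_i, y_i)$:

$$\hat{L}_D(\boldsymbol{w}) = \prod_{i=1}^{N} \hat{l}_i(\boldsymbol{w})$$

- When the training data $D = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_N, y_N)\}$ is fixed, the likelihood is a function of the parameters $\boldsymbol{w}$
- Let us maximize $\hat{L}_D(\boldsymbol{w})$ by changing $\boldsymbol{w}$
  - This is called Maximum Likelihood Estimation (MLE)
  - The maximizer $\boldsymbol{w}^*$ reproduces the training data well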
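39. ### Training as a minimization problem

- Products of $(0,1)$ values often cause underflow
- Instead, use the log-likelihood, the logarithm of the likelihood, where $\hat{l}_i(\boldsymbol{w})$ is the instance-wise likelihood:

$$L_D(\boldsymbol{w}) = \log \hat{L}_D(\boldsymbol{w}) = \log \prod_{i=1}^{N} \hat{l}_i(\boldsymbol{w}) = \sum_{i=1}^{N} \log \hat{l}_i(\boldsymbol{w})$$

- In mathematical optimization, we usually consider a minimization problem instead of a maximization problem
- We define an objective function $E_D(\boldsymbol{w})$ by using the negative of the log-likelihood:

$$E_D(\boldsymbol{w}) = -L_D(\boldsymbol{w}) = -\sum_{i=1}^{N} \log \hat{l}_i(\boldsymbol{w})$$

- $E_D(\boldsymbol{w})$ is called a loss function or error function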
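40. ### Training as a minimization problem

Given the training data $D = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_N, y_N)\}$, find $\boldsymbol{w}^*$ as the solution of the minimization problem:

$$\boldsymbol{w}^* = \operatorname*{argmin}_{\boldsymbol{w}} E_D(\boldsymbol{w}) = \operatorname*{argmin}_{\boldsymbol{w}} \sum_{i=1}^{N} -l_i(\boldsymbol{w})$$

$$l_i(\boldsymbol{w}) = \log \hat{l}_i(\boldsymbol{w}) = \begin{cases} \log \hat{y}_i & (\text{if } y_i = 1) \\ \log (1 - \hat{y}_i) & (\text{otherwise}) \end{cases} = y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)$$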
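
The loss can be evaluated directly for a given weight vector. The sketch below computes the negative log-likelihood on the AND data; the function name `loss` and the comparison with a scaled-up weight vector are illustrative choices made here, not part of the lecture.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def loss(w, data):
    """Negative log-likelihood: -sum(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    total = 0.0
    for x, y in data:
        y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return total


# AND training data; the last element of each input is 1 for the bias.
AND_DATA = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]

hand_crafted = (1.0, 1.0, -1.5)
scaled = (4.0, 4.0, -6.0)  # same decision boundary, larger weights
print(loss(hand_crafted, AND_DATA), loss(scaled, AND_DATA))
```

Scaling the weights keeps the decision boundary but pushes the outputs towards 0 and 1, so the loss decreases; this is the room for improvement that training exploits.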
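41. ### Stochastic Gradient Descent (SGD)

- The objective function is the sum of the losses of instances, where $l_i(\boldsymbol{w})$ is the instance-wise log-likelihood: $E_D(\boldsymbol{w}) = \sum_{i=1}^{N} -l_i(\boldsymbol{w})$
- We can use Stochastic Gradient Descent (SGD) and its variants (e.g., Adam) for minimizing $E_D(\boldsymbol{w})$
- SGD algorithm ($T$ is the number of updates):
  - Initialize $\boldsymbol{w}$ with random values
  - For $t \leftarrow 1$ to $T$:
    - $\eta_t \leftarrow 1/t$ (learning rate at $t$)
    - $(\boldsymbol{x}_i, y_i) \leftarrow$ an instance chosen from $D$ at random
    - $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta_t \frac{\partial (-l_i(\boldsymbol{w}))}{\partial \boldsymbol{w}} = \boldsymbol{w} + \eta_t \frac{\partial l_i(\boldsymbol{w})}{\partial \boldsymbol{w}}$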
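42. ### Exercise: compute the gradient

Prove:

$$\frac{\partial l}{\partial \boldsymbol{w}} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial \boldsymbol{w}} = (y - \hat{y}) \boldsymbol{x}$$

by computing the gradients $\frac{\partial l}{\partial \hat{y}}$, $\frac{\partial \hat{y}}{\partial a}$, $\frac{\partial a}{\partial \boldsymbol{w}}$. Here:

- $l = y \log \hat{y} + (1 - y) \log (1 - \hat{y})$,
- $\hat{y} = \sigma(a) = \frac{1}{1 + e^{-a}}$,
- $a = \boldsymbol{w} \cdot \boldsymbol{x}$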
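43. ### Answer: the gradient

- $l = y \log \hat{y} + (1 - y) \log (1 - \hat{y})$:

$$\frac{\partial l}{\partial \hat{y}} = \frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \cdot (-1) = \frac{y (1 - \hat{y}) - \hat{y} (1 - y)}{\hat{y} (1 - \hat{y})} = \frac{y - \hat{y}}{\hat{y} (1 - \hat{y})}$$

- $\hat{y} = \sigma(a) = \frac{1}{1 + e^{-a}}$:

$$\frac{\partial \hat{y}}{\partial a} = -1 \cdot \left( \frac{1}{1 + e^{-a}} \right)^2 \cdot e^{-a} \cdot (-1) = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \hat{y} (1 - \hat{y})$$

- $a = \boldsymbol{w} \cdot \boldsymbol{x}$:

$$\frac{\partial a}{\partial \boldsymbol{w}} = \boldsymbol{x}$$

Therefore,

$$\frac{\partial l}{\partial \boldsymbol{w}} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial \boldsymbol{w}} = \frac{y - \hat{y}}{\hat{y} (1 - \hat{y})} \cdot \hat{y} (1 - \hat{y}) \cdot \boldsymbol{x} = (y - \hat{y}) \boldsymbol{x}$$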
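
The closed-form gradient can be sanity-checked against a numerical approximation. The sketch below compares the analytic expression (error times input) with central finite differences of the log-likelihood; the function names and the step size `eps` are assumptions chosen for this check.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_likelihood(w, x, y):
    """Instance-wise log-likelihood: y*log(y_hat) + (1-y)*log(1-y_hat)."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)


def analytic_grad(w, x, y):
    """Closed-form gradient: (y - y_hat) * x."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(y - y_hat) * xi for xi in x]


def numeric_grad(w, x, y, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        diff = log_likelihood(w_plus, x, y) - log_likelihood(w_minus, x, y)
        grads.append(diff / (2 * eps))
    return grads


w, x, y = [1.0, 1.0, -1.5], [1.0, 1.0, 1.0], 1
print(analytic_grad(w, x, y))
print(numeric_grad(w, x, y))
```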
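44. ### SGD elaborated for training single-layer NNs

- Initialize $\boldsymbol{w}$ with random values
- For $t \leftarrow 1$ to $T$:
  - $\eta_t \leftarrow 1/t$
  - $(\boldsymbol{x}_i, y_i) \leftarrow$ an instance chosen from $D$ at random
  - $\hat{y} \leftarrow \sigma(\boldsymbol{w} \cdot \boldsymbol{x}_i)$
  - $\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta_t \frac{\partial l_i(\boldsymbol{w})}{\partial \boldsymbol{w}} = \boldsymbol{w} + \eta_t (y_i - \hat{y}) \boldsymbol{x}_i$
    - If $y_i = \hat{y}$, there is no need for updating $\boldsymbol{w}$
    - If $y_i = 1$ and $\hat{y} < 1$, add $\boldsymbol{x}_i$ scaled by $(1 - \hat{y})$ to $\boldsymbol{w}$
    - If $y_i = 0$ and $0 < \hat{y}$, subtract $\boldsymbol{x}_i$ scaled by $\hat{y}$ from $\boldsymbol{w}$

The algorithm is the same as the perceptron algorithm except that it uses the error $(y_i - \hat{y})$ to weight the amount of an update.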
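
The SGD procedure for a single-layer NN with a sigmoid can be sketched in plain Python as below. This is an illustration under stated assumptions, not the notebook code: the weights start at zero instead of random values so that the run is reproducible, and `sgd_step`, `train_sgd`, `num_updates`, and `seed` are names introduced here.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def sgd_step(w, x, y, eta):
    """One SGD update: w <- w + eta * (y - y_hat) * x."""
    error = y - predict(w, x)
    return [wi + eta * error * xi for wi, xi in zip(w, x)]


def train_sgd(data, num_updates=10000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])  # assumption: zero init instead of random
    for t in range(1, num_updates + 1):
        eta = 1.0 / t            # learning rate schedule from the slide
        x, y = rng.choice(data)  # one randomly chosen instance per update
        w = sgd_step(w, x, y, eta)
    return w


AND_DATA = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
w = train_sgd(AND_DATA)
for x, y in AND_DATA:
    print(x[:2], y, round(predict(w, x), 3))
```

The printed probabilities depend on the schedule and the random draws, so they are best treated as a qualitative check rather than reference values.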
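45. ### SGD implemented in numpy

Notebook: https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

$$X = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \quad \boldsymbol{w} = \begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix}$$

$$\hat{\boldsymbol{y}} = \sigma(X \boldsymbol{w}) = \sigma \begin{pmatrix} w_3 \\ w_2 + w_3 \\ w_1 + w_3 \\ w_1 + w_2 + w_3 \end{pmatrix} \quad \text{(applying the sigmoid function to each element)}$$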
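46. ### Note: Why is SGD called "stochastic"?

- The objective function is the sum of the losses of instances, where $l_i(\boldsymbol{w})$ is the instance-wise log-likelihood: $E_D(\boldsymbol{w}) = \sum_{i=1}^{N} -l_i(\boldsymbol{w})$
- Gradient descent: $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta_t \frac{\partial E_D(\boldsymbol{w})}{\partial \boldsymbol{w}} = \boldsymbol{w} - \eta_t \sum_{i=1}^{N} \frac{\partial (-l_i(\boldsymbol{w}))}{\partial \boldsymbol{w}}$
  - Update after computing the loss values and gradients for all training instances
- Stochastic gradient descent: use random samples from the data, $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta_t \frac{\partial (-l_i(\boldsymbol{w}))}{\partial \boldsymbol{w}}$ with $(\boldsymbol{x}_i, y_i) \sim D$
  - Approximate the gradients: from all instances → from a randomly selected instance
  - Update after computing the loss value and gradients for each training instance
  - Faster to reach the minimizer $\boldsymbol{w}^*$ of the objective function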
47. ### Note: What is learning rate?

- A learning rate $\eta_t$ determines the step size for moving in the steepest direction
  - A large step size may reach the minimum faster, but may jump over the minimum
  - A small step size may take too long to converge and may get stuck in a local minimum
- We should decay learning rates for a strongly convex function such that: $\sum_{t=1}^{\infty} \eta_t = \infty, \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty$
- Various scheduling strategies for the learning rate: $\eta_t = \eta_0 / t$, $\eta_t = \eta_0 / \sqrt{t}$, Adap, RMSProp
- Strategies used in practice: Stepwise Decay Schedule, Polynomial Schedule, Warming Up (https://beta.mxnet.io/guide/modules/lr_scheduler.html)
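48. ### Regularization

- MLE often causes over-fitting
  - When the training data is linearly separable, $\|\boldsymbol{w}\| \to \infty$ as $\sum_{i=1}^{N} -l_i(\boldsymbol{w}) \to 0$ ($l_i$ is the instance-wise log-likelihood)
  - Susceptible to noise in the training data
- We use regularization (MAP estimation)
  - We introduce a penalty term that grows when $\|\boldsymbol{w}\|$ becomes too large
  - The loss function with an L2 regularization term: $E_D(\boldsymbol{w}) = -\sum_{i=1}^{N} l_i(\boldsymbol{w}) + \lambda \|\boldsymbol{w}\|^2$
  - $\lambda$ is the hyperparameter that controls the trade-off between over-fitting and under-fitting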
49. ### Summary

- We used sigmoid as an activation function
  - The model is also known as logistic regression
- We defined the instance-wise likelihood to assess how well the current model reproduces a training instance
- Training a model: minimizing the loss function by changing the weights
  - Loss function: $-\sum_{i=1}^{N} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)$
  - Minimizing the loss function is equivalent to maximizing the product of the instance-wise likelihoods of all instances
- We showed an algorithm for minimizing the loss function by using Stochastic Gradient Descent (SGD)
  - The same as the perceptron algorithm except for using the error $(y - \hat{y})$ to weight the amount of an update

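51. ### Generic notation for multi-layer NNs

Example with inputs $x_1 = h_1^{(0)}$, $x_2 = h_2^{(0)}$ and output $h_1^{(3)} = \hat{y}$:

- First layer $\mathbb{R}^2 \to \mathbb{R}^3$: $\boldsymbol{h}^{(1)} = \sigma^{(1)}(\boldsymbol{a}^{(1)})$, $\boldsymbol{a}^{(1)} = W^{(1)} \boldsymbol{h}^{(0)}$, with $W^{(1)} \in \mathbb{R}^{3 \times 2}$ and $\boldsymbol{a}^{(1)}, \boldsymbol{h}^{(1)} \in \mathbb{R}^3$
- Second layer $\mathbb{R}^3 \to \mathbb{R}^2$: $\boldsymbol{h}^{(2)} = \sigma^{(2)}(\boldsymbol{a}^{(2)})$, $\boldsymbol{a}^{(2)} = W^{(2)} \boldsymbol{h}^{(1)}$, with $W^{(2)} \in \mathbb{R}^{2 \times 3}$ and $\boldsymbol{a}^{(2)}, \boldsymbol{h}^{(2)} \in \mathbb{R}^2$
- Final layer $\mathbb{R}^2 \to \mathbb{R}$: $h^{(3)} = \sigma^{(3)}(a^{(3)})$, $a^{(3)} = W^{(3)} \boldsymbol{h}^{(2)}$, with $W^{(3)} \in \mathbb{R}^{1 \times 2}$ and $a^{(3)}, h^{(3)} \in \mathbb{R}$

The $l$-th layer ($l \in \{1, \dots, L\}$) consists of:

- Input: $\boldsymbol{h}^{(l-1)} \in \mathbb{R}^{d_{l-1}}$ ($\boldsymbol{h}^{(0)} = \boldsymbol{x}$)
- Output: $\boldsymbol{h}^{(l)} \in \mathbb{R}^{d_l}$ ($\boldsymbol{h}^{(L)} = \hat{y}$)
- Weight: $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$
  - $w_{jk}^{(l)}$: weight from the $k$-th neuron to the $j$-th neuron of the $l$-th layer
- Activation function: $\sigma^{(l)}$
- Activation: $\boldsymbol{a}^{(l)} \in \mathbb{R}^{d_l}$, $\boldsymbol{a}^{(l)} = W^{(l)} \boldsymbol{h}^{(l-1)}$

$$\boldsymbol{h}^{(l)} = \sigma^{(l)}(W^{(l)} \boldsymbol{h}^{(l-1)})$$

Please accept the notational conflict between an instance-wise loss and a layer number (both are written $l$).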
52. ### How to train weights in multi-layer NNs

- We have no explicit supervision signals for the internal (hidden) inputs/outputs $\boldsymbol{h}^{(1)}, \dots, \boldsymbol{h}^{(L-1)}$
- Having said that, SGD only needs the value of the gradient of the instance-wise loss with respect to every weight $w_{jk}^{(l)}$ in MLPs
- Can we compute the value of this gradient for every weight $w_{jk}^{(l)}$?
  - Yes! Backpropagation can do that!
53. ### Backpropagation

- Commonly used in deep neural networks
- Formulas for backpropagation look complicated
- However:
  - We can understand backpropagation easily if we know the concept of a computation graph
  - Most deep learning frameworks implement backpropagation by using automatic differentiation
- Let's look at computation graphs and automatic differentiation first
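54. ### Computation graph: $f(x, y, z) = (x + y) z$

Example from: http://cs231n.github.io/optimization-2/

Forward pass (in the figure, the value of a variable is written above an arrow):

- Inputs: $x = -2$, $y = 5$, $z = -4$
- Addition node: $q = x + y = 3$
- Multiplication node: $f = qz = -12$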
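55. ### Automatic Differentiation (AD): $f(x, y, z) = (x + y) z$

Example from: http://cs231n.github.io/optimization-2/

Forward pass (the value of a variable, above an arrow): $x = -2$, $y = 5$, $z = -4$, $q = x + y = 3$, $f = qz = -12$

Backward pass (reverse-mode AD; the gradient of the output with respect to the variable, below an arrow):

- $\frac{\partial f}{\partial f} = 1$
- $\frac{\partial f}{\partial q} = z \times 1 = -4$
- $\frac{\partial f}{\partial z} = q \times 1 = 3$
- $\frac{\partial f}{\partial x} = \frac{\partial q}{\partial x} \frac{\partial f}{\partial q} = 1 \times (-4) = -4$
- $\frac{\partial f}{\partial y} = \frac{\partial q}{\partial y} \frac{\partial f}{\partial q} = 1 \times (-4) = -4$

Compare with the derivatives computed by hand: $\frac{\partial f}{\partial x} = z = -4$, $\frac{\partial f}{\partial y} = z = -4$, $\frac{\partial f}{\partial z} = x + y = 3$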
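
The forward and backward passes of this example can be reproduced with a very small reverse-mode AD sketch. The `Var` class below is a toy written for illustration (it supports only addition and multiplication of scalars) and is not the API of any deep learning framework.

```python
class Var:
    """A scalar node in a computation graph (toy reverse-mode AD)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        """Propagate gradients from this node back to all of its inputs."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)  # parents are appended before children

        visit(self)
        self.grad = 1.0
        # Reverse topological order: a node's gradient is complete
        # before it is passed on to its parents (chain rule).
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


x, y, z = Var(-2.0), Var(5.0), Var(-4.0)
q = x + y
f = q * z
f.backward()
print(f.value, x.grad, y.grad, z.grad)
```

The accumulation with `+=` is what implements the rule for branches: a value used in several places receives the sum of the incoming gradients.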
56. ### Automatic Differentiation (Baydin+ 2018)

- AD computes derivatives by using the chain rule
  - Function values are computed in the forward pass
  - Derivatives are computed with respect to:
    - Every variable (in reverse-mode accumulation)
    - A specific variable (in forward-mode accumulation)
- Do not confuse AD with these:
  - Numerical differentiation: e.g., $f'(x) \approx \frac{f(x + h) - f(x)}{h}$ for a small $h$
  - Symbolic differentiation: e.g., Mathematica, sympy

Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.
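57. ### Rules for reverse-mode Automatic Differentiation

Let $\delta$ denote the gradient arriving from the output side of a node.

- Add ($z = x + y$): both inputs receive $\delta$ unchanged
- Multiply ($z = xy$): the input $x$ receives $y \cdot \delta$, and the input $y$ receives $x \cdot \delta$
- Function application ($y = f(x)$): the input receives $f'(x) \cdot \delta$
- Branch (a value used in two places, receiving $\delta_1$ and $\delta_2$): the value receives $\delta_1 + \delta_2$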
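58. ### Exercise: AD on computation graph

- Write a computation graph for $E(\boldsymbol{w}, \boldsymbol{x}) = -\log \sigma(\boldsymbol{w} \cdot \boldsymbol{x}) = -\log \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$
- Consider $\boldsymbol{x} = (1, 1, 1)^\top$ and $\boldsymbol{w} = (1, 1, -1.5)^\top$
- Compute the value of $E$
- Compute the gradients $\frac{\partial E}{\partial \boldsymbol{w}}$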
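59. ### Computing using AD

Computation graph for $E = -\log \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$ with $\boldsymbol{x} = (1, 1, 1)^\top$ and $\boldsymbol{w} = (1, 1, -1.5)^\top$. Each row lists the value produced by a node in the forward pass and the gradient of $E$ with respect to that value obtained in the backward pass.

| Node | Forward value | Gradient of $E$ w.r.t. the value |
|---|---|---|
| $w_1 x_1 + w_2 x_2 + w_3 x_3$ | 0.5 | -0.3775 |
| $\times (-1)$ | -0.5 | 0.3775 |
| $\exp$ | 0.6065 | 0.6224 |
| $+ 1$ | 1.6065 | 0.6224 |
| $1 / \cdot$ | 0.6225 | -1.6065 |
| $\log$ | -0.4740 | -1 |
| $\times (-1)$ | 0.4740 | 1 |

- Backward through $\log$: $-1 \times \frac{1}{0.6225} = -1.6065$
- Backward through $1 / \cdot$: $-\left( \frac{1}{1.6065} \right)^2 \times (-1.6065) = 0.6224$
- Gradients for the weights: $\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial w_3} = -0.3775$ (each input $x_i$ is 1); the gradient for the input $x_3$ is $-0.3775 \times (-1.5) = 0.5663$
- Update: $w_1 \leftarrow w_1 - \eta \frac{\partial E}{\partial w_1} = 1 - \eta \times (-0.3775)$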
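
The node-by-node computation can be replayed in plain Python. The sketch below is written for illustration, with intermediate variable names (`s`, `u`, `v`, `r`, `p`, `m`) chosen here because the names used in the original figure did not survive extraction.

```python
import math

x = [1.0, 1.0, 1.0]
w = [1.0, 1.0, -1.5]

# Forward pass: one line per node of the computation graph.
s = sum(wi * xi for wi, xi in zip(w, x))  # w . x
u = -s                                    # multiply by -1
v = math.exp(u)                           # exp
r = v + 1.0                               # add 1
p = 1.0 / r                               # reciprocal (sigmoid output)
m = math.log(p)                           # log
loss = -m                                 # multiply by -1

# Backward pass: local derivative times the gradient from the output side.
d_loss = 1.0
d_m = -1.0 * d_loss
d_p = (1.0 / p) * d_m
d_r = -(1.0 / r) ** 2 * d_p
d_v = 1.0 * d_r
d_u = v * d_v
d_s = -1.0 * d_u
grad_w = [xi * d_s for xi in x]  # gradient for each weight
grad_x = [wi * d_s for wi in w]  # gradient for each input

print(round(loss, 4), [round(g, 4) for g in grad_w])
```

The result agrees with the closed form: for a positive instance, the gradient of the negative log-likelihood with respect to the weights is the input scaled by minus one plus the sigmoid output.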

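62. ### Training SLP using SGD with pytorch

$$X = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \quad \boldsymbol{y} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \end{pmatrix}, \quad \boldsymbol{w} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

- `x.mm(w)`: matrix-vector multiplication ($X \boldsymbol{w}$): $(4 \times 1)$
- `sigmoid()`: element-wise sigmoid function: $(4 \times 1)$

Notebook: https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb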
63. ### Training MLP using SGD with pytorch

Annotations on the code:

- Added weights for the second layer
- Changed for a two-layer perceptron
- Updates for the new parameters

Notebook: https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
64. ### Training SLP with high-level NN modules

Annotations on the code:

- The definition of the shape of the network and the loss function (`bias=True` for including weights for bias terms)
- We can implement this part in a generic manner, i.e., independently of the model
- We no longer append 1 (bias) to every instance because `torch.nn.Linear` automatically includes a bias weight

Notebook: https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
65. ### Training MLP with high-level NN modules

Annotations on the code:

- The essence of the change from SLP to MLP
- We don't have to modify this part to implement MLP (the number of iterations was changed from 100 to 1000 because we have more parameters to train)

Notebook: https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

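70. ### Manual derivation: Gradients for the final layer

- The same as single-layer NNs: $\frac{\partial \ell}{\partial w_{1k}^{(L)}} = (y - \hat{y}) h_k^{(L-1)}$
  - Here, we omit the index for an instance for simplicity
  - We write the instance-wise log-likelihood as $\ell$ instead of $l$ to avoid the notational conflict with the layer number
- Final layer: $\hat{y} = h^{(L)} = \sigma^{(L)}(a^{(L)})$, $a^{(L)} = W^{(L)} \boldsymbol{h}^{(L-1)}$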
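71. ### Manual derivation: Gradients for the internal layers (1/2)

Deriving the recursive formula of $\delta$, where $\delta_j^{(l)}$ denotes the gradient of $\ell$ with respect to the activation $a_j^{(l)}$, for a layer with two units followed by a layer with two units:

- Hidden outputs: $h_1^{(l)} = \sigma(a_1^{(l)})$, $h_2^{(l)} = \sigma(a_2^{(l)})$
- Activations of the next layer:
  - $a_1^{(l+1)} = w_{11}^{(l+1)} h_1^{(l)} + w_{12}^{(l+1)} h_2^{(l)}$
  - $a_2^{(l+1)} = w_{21}^{(l+1)} h_1^{(l)} + w_{22}^{(l+1)} h_2^{(l)}$
- Gradients arriving at $h_1^{(l)}$: $w_{11}^{(l+1)} \delta_1^{(l+1)}$ and $w_{21}^{(l+1)} \delta_2^{(l+1)}$; at $h_2^{(l)}$: $w_{12}^{(l+1)} \delta_1^{(l+1)}$ and $w_{22}^{(l+1)} \delta_2^{(l+1)}$
- Therefore:
  - $\delta_1^{(l)} = \sigma'(a_1^{(l)}) \left( w_{11}^{(l+1)} \delta_1^{(l+1)} + w_{21}^{(l+1)} \delta_2^{(l+1)} \right)$
  - $\delta_2^{(l)} = \sigma'(a_2^{(l)}) \left( w_{12}^{(l+1)} \delta_1^{(l+1)} + w_{22}^{(l+1)} \delta_2^{(l+1)} \right)$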
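72. ### Manual derivation: Gradients for the internal layers (2/2)

- General form of the recursive formula of $\delta$:

$$\delta_j^{(l)} = \frac{\partial \ell}{\partial a_j^{(l)}} = \sigma'(a_j^{(l)}) \sum_k w_{kj}^{(l+1)} \delta_k^{(l+1)}$$

- Gradient for the internal layer:

$$\frac{\partial \ell}{\partial w_{jk}^{(l)}} = \frac{\partial \ell}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} h_k^{(l-1)}$$

- Layer definition: $\boldsymbol{h}^{(l)} = \sigma^{(l)}(\boldsymbol{a}^{(l)})$, $\boldsymbol{a}^{(l)} = W^{(l)} \boldsymbol{h}^{(l-1)}$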
73. ### Summary

- We can use SGD only if we can compute the gradients of all parameters
  - Even if we have no explicit supervision signals for internal layers
- Automatic Differentiation (AD) can compute gradients systematically
  - AD computes derivatives on a computation graph by using the chain rule
  - AD realizes backpropagation without manual derivation of gradients
- AD is employed in most deep learning frameworks
  - We only need to implement an algorithm for a forward pass, i.e., how a model computes an output given an input
  - We can concentrate on designing the structure of a neural network
  - This boosted the speed of research and development
  - Manual derivation of gradients is tedious and error-prone

75. ### Universal approximation theorem (Cybenko, 1989)

- Let $I_d$ denote the $d$-dimensional unit cube $[0,1]^d$ and $C(I_d)$ denote the space of continuous functions on $I_d$
- Given any $\epsilon > 0$ and any function $f \in C(I_d)$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$, and real vectors $\boldsymbol{w}_i \in \mathbb{R}^d$ that define a function

$$F(\boldsymbol{x}) = \sum_{i=1}^{N} v_i \sigma(\boldsymbol{w}_i \cdot \boldsymbol{x} + b_i)$$

  such that $F$ approximates $f$: $|F(\boldsymbol{x}) - f(\boldsymbol{x})| < \epsilon$ for all $\boldsymbol{x} \in I_d$
- This still holds when replacing $I_d$ with any compact subset of $\mathbb{R}^d$ and $\sigma(\cdot)$ with some other activation functions

George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.
76. ### What does the theorem state?

Neural networks with a single hidden layer can closely approximate any smooth function.

Video: The Universal Approximation Theorem for neural networks. https://www.youtube.com/watch?v=Ijqkc7OLenI (6:24)
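77. ### Essence: Smooth function approximated by spikes

(Figure: a single-hidden-layer network whose hidden units are combined with output weights 0.4, -0.4, 0.3, and -0.3 to form spikes of height 0.4 and 0.3.)

These shapes can be realized by choosing appropriate values $w, b$ for $\sigma(wx + b)$.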
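
The idea of building spikes from sigmoids can be sketched as follows: the difference of two steep sigmoids is close to a rectangular bump, and a sum of bumps is still a single-hidden-layer network. The positions, heights, and the steepness value below are assumptions chosen for illustration, not values read from the slide.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def bump(x, left, right, height, steepness=50.0):
    """Approximate a rectangle of `height` on [left, right].

    It is the sum of two hidden units:
      +height * sigmoid(steepness * x - steepness * left)
      -height * sigmoid(steepness * x - steepness * right)
    """
    rise = sigmoid(steepness * (x - left))
    fall = sigmoid(steepness * (x - right))
    return height * (rise - fall)


def approx(x):
    """Two spikes: height 0.4 on [0.2, 0.4] and height 0.3 on [0.6, 0.8]."""
    return bump(x, 0.2, 0.4, 0.4) + bump(x, 0.6, 0.8, 0.3)


for x in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(x, round(approx(x), 3))
```

A larger steepness makes the edges sharper, and using more and narrower bumps gives a finer piecewise approximation of a target function.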
78. ### Summary of this lecture

- Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
- Multi-layer neural networks can realize any logical function, including XOR
- We can train single/multi-layer NNs by using gradient-based methods
  - By implementing graph structures of NNs in a programming language
  - With automatic differentiation in deep learning frameworks
- Neural networks with a single hidden layer can approximate any smooth function
79. ### References

- Michael Nielsen. 2017. Neural networks and deep learning. http://neuralnetworksanddeeplearning.com/ (Japanese translation: https://nnadl-ja.github.io/nnadl_site_ja/)
- Raul Rojas. 1996. Neural Networks - A Systematic Introduction. Springer-Verlag. (Available at https://page.mi.fu-berlin.de/rojas/neural/)
- Koki Saito (斎藤 康毅). 2016. ゼロから作るDeep Learning (Deep Learning from Scratch). O'Reilly Japan.
- Learning PyTorch with Examples. https://pytorch.org/tutorials/beginner/pytorch_with_examples.html