
Feedforward Neural Network (I): Binary Classification


Binary classification, Threshold Logic Units (TLUs), Single-layer Perceptron (SLP), Perceptron algorithm, sigmoid function, Stochastic Gradient Descent (SGD), Multi-layer Neural Networks, Backpropagation, Computation Graph, Automatic Differentiation, Universal Approximation Theorem

Naoaki Okazaki

July 28, 2020


Transcript

  1. Feedforward Neural Network (I):
    Binary Classification
    Naoaki Okazaki
    School of Computing,
    Tokyo Institute of Technology
    [email protected]
    PowerPoint template designed by https://ppt.design4u.jp/template/


  2. Highlights of this lecture
     Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
     Multi-layer neural networks can realize any logical function, including XOR
     We can train single/multi-layer NNs by using gradient-based methods
      By implementing graph structures of NNs in a programming language
      With automatic differentiation in deep learning frameworks
     Neural networks with a single hidden layer can approximate any smooth function
    1

  3. Threshold Logic Unit (TLU)
    2


  4. 3
    Recap: Logical connectives
    https://vanya.jp.net/dc/
    AND: x1 ∧ x2
    OR: x1 ∨ x2
    NOT: ¬x1
    NAND (NOT of AND): ¬(x1 ∧ x2)
    NOR (NOT of OR): ¬(x1 ∨ x2)
    XOR (exclusive OR): x1 ⊕ x2

  5. Logical circuits used in daily life
    4
    [1] https://response.jp/article/img/2017/10/02/300517/1228814.html
    [2] https://toshiba.semicon-storage.com/info/docget.jsp?did=67518&prodName=TC74HC00AP
    [3] http://download.intel.com/pressroom/kits/corei7/images/Core_i7_300.jpg
    Activated by the OR of pressed states of
    all buttons in the Shinkansen
    Signs in Shinkansen [1]
    Intel® Core™ i7 Processor [3]
    Logic IC (TC74HC00AP) [2]


  6. Recap: Functionally complete set {AND, OR, NOT}
    5
    Any truth table (a Boolean function with d inputs: {0,1}^d ↦ {0,1}) can be
    expressed by a combination of the logical connectives AND, OR, and NOT
    Example (XOR):
    x1  x2  Out  Step 1
    0   0   0
    0   1   1    ¬x1 ∧ x2
    1   0   1    x1 ∧ ¬x2
    1   1   0
    Step 2: (¬x1 ∧ x2) ∨ (x1 ∧ ¬x2)
    Each logical formula yields 1 only when the corresponding input is given;
    the overall output is 1 when any of the logical formulas is satisfied
    Rough explanation: Disjunctive Normal Form (DNF)
    We can convert a truth table into a logical formula in a systematic way with {AND, OR, NOT}:
    1. For each row in the table where the output is true, take the AND of all the inputs:
       When an input (column) of the row is true, use the input variable as it is
       Otherwise (an input of the row is false), prepend a negation to the input variable
    2. Take the OR of all the formulas obtained in Step 1 (see the sketch below)
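    As a concrete sketch of the two-step procedure above (not part of the original slides), the helper below builds a DNF formula string from any Boolean function; truth_table_to_dnf is a hypothetical name introduced only for illustration.

```python
# A minimal sketch of the DNF construction above: Step 1 builds an AND term per
# true row of the truth table, Step 2 joins the terms with OR.
from itertools import product

def truth_table_to_dnf(f, num_inputs):
    """Return a DNF formula (as a string) reproducing the Boolean function f."""
    terms = []
    for assignment in product([0, 1], repeat=num_inputs):
        if f(*assignment) == 1:
            # Step 1: AND of all inputs, negating the ones that are 0 in this row
            literals = [f"x{i+1}" if v == 1 else f"~x{i+1}"
                        for i, v in enumerate(assignment)]
            terms.append("(" + " & ".join(literals) + ")")
    # Step 2: OR of all the AND terms
    return " | ".join(terms) if terms else "0"

xor = lambda x1, x2: x1 ^ x2
print(truth_table_to_dnf(xor, 2))   # (~x1 & x2) | (x1 & ~x2)
```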

  7. Can we build a logical connective only from input/output pairs?
    6
     In ML terms: train an unknown mapping f from supervision data
      No knowledge about the internal mechanism associating inputs and outputs is required
     We start with an example where all inputs/outputs can be described,
    but this assumption is impossible or impractical in the real world
      Imagine: inputs are natural language questions and outputs are answers
     We expect a learned mapping to predict outputs for unseen inputs
    Input → Output:
    (x1, x2) = (0, 0)  →  y = 0
    (x1, x2) = (0, 1)  →  y = 1
    (x1, x2) = (1, 0)  →  y = 1
    (x1, x2) = (1, 1)  →  y = 1
    y = f(x1, x2), where f is unknown (?)

  8. Realize OR as a mathematical function
    7
     Find a function f that satisfies (x1, x2, y ∈ {0,1}):
      f(0, 0) = 0,  f(0, 1) = 1,  f(1, 0) = 1,  f(1, 1) = 1,  y = f(x1, x2)
     We can manually craft a function like this:
      f(x1, x2) = g(x1 + x2),
      g(a) = 1 (if a > 0), 0 (otherwise)   (step function)

  9. Realize OR with a single-layer neural network
    8
     Finding a function from scratch is hard in general
      We assume a model with parameters
     Assume a single-layer neural network (single-layer NN):
      y = g(w1 x1 + w2 x2 + b)
      Input: x1, x2 ∈ {0,1};  Output: y ∈ {0,1};  Parameters: w1, w2, b ∈ ℝ
     Train the model: find the parameters that can reproduce the
    input/output of the supervision data (OR)
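    As a concrete sketch (not the lecture's notebook), the model above can be written in a few lines of numpy; the parameter values used here are the OR parameters shown two slides later (w1 = w2 = 1, b = −0.5).

```python
# A minimal numpy sketch of the single-layer model y = g(w1*x1 + w2*x2 + b)
# with a step activation.
import numpy as np

def step(a):
    return int(a > 0)                    # g(a) = 1 if a > 0 else 0

def single_layer_nn(x, w, b):
    return step(np.dot(w, x) + b)        # y = g(w . x + b)

w, b = np.array([1.0, 1.0]), -0.5        # OR parameters from the next slide
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, single_layer_nn(np.array(x), w, b))   # reproduces OR: 0, 1, 1, 1
```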

  10. Interactive visualization of single-layer neural networks
    9
    https://chokkan.github.io/deeplearning/demo-slp.html


  11. Parameters realizing logical OR: y = x1 ∨ x2
    10
    x1  x2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   1
    For example: w1 = w2 = 1, b = −0.5, i.e.,
    y = g(x1 + x2 − 0.5)
    (x1, x2) = (0, 0)⊺: y = g((1, 1)(0, 0)⊺ − 0.5) = g(0 − 0.5) = 0
    (x1, x2) = (0, 1)⊺: y = g((1, 1)(0, 1)⊺ − 0.5) = g(1 − 0.5) = 1
    (x1, x2) = (1, 0)⊺: y = g((1, 1)(1, 0)⊺ − 0.5) = g(1 − 0.5) = 1
    (x1, x2) = (1, 1)⊺: y = g((1, 1)(1, 1)⊺ − 0.5) = g(2 − 0.5) = 1

  12. Parameters realizing logical AND: y = x1 ∧ x2
    11
    x1  x2  y
    0   0   0
    0   1   0
    1   0   0
    1   1   1
    For example: w1 = w2 = 1, b = −1.5, i.e.,
    y = g(x1 + x2 − 1.5)
    (x1, x2) = (0, 0)⊺: y = g((1, 1)(0, 0)⊺ − 1.5) = g(0 − 1.5) = 0
    (x1, x2) = (0, 1)⊺: y = g((1, 1)(0, 1)⊺ − 1.5) = g(1 − 1.5) = 0
    (x1, x2) = (1, 0)⊺: y = g((1, 1)(1, 0)⊺ − 1.5) = g(1 − 1.5) = 0
    (x1, x2) = (1, 1)⊺: y = g((1, 1)(1, 1)⊺ − 1.5) = g(2 − 1.5) = 1

  13. Parameters realizing logical NOT: y = ¬x1
    12
    x1  y
    0   1
    1   0
    For example: w1 = −1, w2 = 0, b = 0.5
    (we ignore x2 because a logical NOT has one input), i.e.,
    y = g(−x1 + 0.5)
    x1 = 0: y = g(−1 × 0 + 0.5) = g(0.5) = 1
    x1 = 1: y = g(−1 × 1 + 0.5) = g(−0.5) = 0

  14. Parameters realizing logical NAND: y = ¬(x1 ∧ x2)
    13
    x1  x2  y
    0   0   1
    0   1   1
    1   0   1
    1   1   0
    For example: w1 = w2 = −1, b = 1.5, i.e.,
    y = g(−x1 − x2 + 1.5)
    (x1, x2) = (0, 0)⊺: y = g((−1, −1)(0, 0)⊺ + 1.5) = g(0 + 1.5) = 1
    (x1, x2) = (0, 1)⊺: y = g((−1, −1)(0, 1)⊺ + 1.5) = g(−1 + 1.5) = 1
    (x1, x2) = (1, 0)⊺: y = g((−1, −1)(1, 0)⊺ + 1.5) = g(−1 + 1.5) = 1
    (x1, x2) = (1, 1)⊺: y = g((−1, −1)(1, 1)⊺ + 1.5) = g(−2 + 1.5) = 0

  15. Can we find parameters that realize logical XOR?
    14
    x1  x2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   0
    Can we find parameter values w1, w2, b such that they
    reproduce the logical XOR?

  16. Single-layer NNs cannot realize XOR (Minsky and Papert, 1969)
    15
     The decision rule for outputting y = 1:
      w1 x1 + w2 x2 + b > 0  ⟺  x2 > −(w1/w2) x1 − b/w2
     This draws a line with the slope −w1/w2 and y-intercept −b/w2
     However, it is impossible to draw a
    single line that separates the true/false
    outputs of the XOR logic
     We say that the XOR inputs are not linearly
    separable (linearly inseparable)
    (Plot: the TRUE and FALSE points of XOR cannot be separated by a single line)
    Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

  17. How can we realize logical XOR?
    16
     Combine logical connectives: y = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
     Alternatively, draw multiple lines (instead of a single line):
    1. Draw a line for OR
    2. Draw a line for NAND (NOT of AND)
    3. Take the AND of these areas
    x1  x2  x1 ∨ x2  ¬(x1 ∧ x2)  y
    0   0   0        1           0
    0   1   1        1           1
    1   0   1        1           1
    1   1   1        0           0
    (XOR is the AND of OR and NAND)

  18. XOR realized as a combination of single-layer neural networks
    17
    https://chokkan.github.io/deeplearning/demo-mlp.html
     XOR is the AND of OR and NAND:
      y = (x1 ∨ x2) ∧ ¬(x1 ∧ x2) = h1 ∧ h2 = g(h1 + h2 − 1.5)
    where:
      h1 = x1 ∨ x2 = g(x1 + x2 − 0.5)
      h2 = ¬(x1 ∧ x2) = g(−x1 − x2 + 1.5)
    (Diagram: two hidden units h1 and h2 with weights (1, 1) and bias −0.5, and weights (−1, −1) and bias 1.5,
    feeding an output unit with weights (1, 1) and bias −1.5; a sketch follows below)
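    A minimal sketch (not the linked demo) of this construction in plain Python: h1 realizes OR, h2 realizes NAND, and the output unit takes their AND.

```python
# Two stacked single-layer units realizing XOR, using the parameters above.
def step(a):
    return int(a > 0)

def xor_two_layer(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # h1 = x1 OR x2
    h2 = step(-x1 - x2 + 1.5)       # h2 = NOT (x1 AND x2)
    return step(h1 + h2 - 1.5)      # y = h1 AND h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_two_layer(x1, x2))   # 0, 1, 1, 0
```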

  19. Multi-layer neural networks
    18
     By combining (stacking) single-layer NNs, multi-layer neural networks
    (multi-layer NNs) can realize any truth table
      Applicable to linearly inseparable input/output
      Expressive power equivalent to any logic circuit, such as an
    Arithmetic Logic Unit (ALU) or memory
     Note: the step function is the key to the non-linearity
      If we did not use a step function in the first layer:
      y = g(v1 h1 + v2 h2 + c)
        = g(v1 (w11 x1 + w12 x2 + b1) + v2 (w21 x1 + w22 x2 + b2) + c)
        = g((v1 w11 + v2 w21) x1 + (v1 w12 + v2 w22) x2 + (v1 b1 + v2 b2 + c))
      Reduced to a single-layer NN (a numerical check follows below)
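    A quick numerical check of this reduction (a sketch with random weights, not from the slides): with no step function between the layers, the stacked computation agrees exactly with a single composed linear layer.

```python
# Without a nonlinear activation, two stacked linear layers collapse into one.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)   # first layer
v, c = rng.normal(size=2), rng.normal()                # second layer
x = rng.normal(size=2)

two_layer = v @ (W1 @ x + b1) + c                      # no step function in between
one_layer = (v @ W1) @ x + (v @ b1 + c)                # composed single layer
print(np.allclose(two_layer, one_layer))               # True
```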

  20. Generic form: Feedforward Neural Network (FFNN)
    19
    First layer a1: ℝ2 → ℝ3
      h = a1(x) = g1(W_hx x + b_h),  W_hx ∈ ℝ^{3×2}, b_h ∈ ℝ^3
    Second layer a2: ℝ3 → ℝ2
      z = a2(h) = g2(W_zh h + b_z),  W_zh ∈ ℝ^{2×3}, b_z ∈ ℝ^2
    Final layer a3: ℝ2 → ℝ1
      y = a3(z) = g3(W_yz z + b_y),  W_yz ∈ ℝ^{1×2}, b_y ∈ ℝ
    Three-layer neural network: y = a3(a2(a1(x)))
     h and z are called hidden units, states, or layers
     The depth of the neural network is three
     g1, g2, g3 are called activation functions
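    For concreteness, a minimal numpy sketch of the forward pass y = a3(a2(a1(x))) with the shapes above; the random weights and the choice of sigmoid for g1, g2, g3 are illustrative assumptions, not part of the slide.

```python
# Forward pass of the three-layer FFNN (W_hx: 3x2, W_zh: 2x3, W_yz: 1x2).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W_hx, b_h = rng.normal(size=(3, 2)), rng.normal(size=3)
W_zh, b_z = rng.normal(size=(2, 3)), rng.normal(size=2)
W_yz, b_y = rng.normal(size=(1, 2)), rng.normal(size=1)

x = np.array([0.0, 1.0])
h = sigmoid(W_hx @ x + b_h)     # first layer:  R^2 -> R^3
z = sigmoid(W_zh @ h + b_z)     # second layer: R^3 -> R^2
y = sigmoid(W_yz @ z + b_y)     # final layer:  R^2 -> R^1
print(y)
```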

  21. Summary
     The logical units explained here are called Threshold Logic Units (TLUs), the first
    artificial neuron (McCulloch and Pitts, 1943)
     Single-layer NNs can provide a functionally complete set {AND, OR, NOT}
     Single-layer NNs cannot model linearly inseparable data (e.g., XOR)
     Multi-layer NNs (stacking single-layer NNs) can be seen as logical compounds
      Multi-layer NNs can realize any Boolean function {0,1}^d ↦ {0,1}
      Multi-layer NNs can model linearly inseparable data
      We will see that multi-layer NNs can approximately express any smooth function
     We showed the generic form of feedforward neural networks
    20
    W. McCulloch and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133.

  22. Training Single-Layer Neural Networks
    21


  23. How to determine parameters of single-layer NNs
    22
     We saw single-layer NNs realize logical connectives
     By crafting parameters (weights and biases) carefully to realize
    desired connectives
     However, crafting parameters is difficult
     We are sometimes unsure of the internal mechanism associating
    input and output variables
     We want to find parameters automatically from data
     We are interested in determining parameters only from supervision
    data, pairs of inputs and outputs


  24. Supervised learning (training)
    23
     Supervision data (input: x ∈ ℝ^d, output: y ∈ {0,1})
      D = {(x1, y1), …, (xN, yN)}  (N instances)
     Find parameters such that they can reproduce the training
    instances as correctly as possible
     We assume generalization:
      If the parameters reproduce the training instances well, we expect that
    they will work for unseen instances

  25. Supervised learning for single-layer NNs (with new notations)
    24
     For simplicity, we include the bias term b ∈ ℝ in w:
      x(new) = x ⊕ (1) = (x1, x2, …, xd, 1)⊺,  w(new) = w ⊕ (b) = (w1, w2, …, wd, b)⊺
      w(new) ⋅ x(new) = w1 x1 + w2 x2 + ⋯ + wd xd + b  (← original form)
     We introduce a new notation to distinguish a computed output ŷ from
    the gold output y in the supervision data
      D = {(x1, y1), …, (xN, yN)}  (N instances)
      We distinguish two kinds of outputs hereafter:
       ŷn: the output computed (predicted) by the model for the input xn
       yn: the true (gold) output for the input xn in the supervision data
     Training: find w such that,
      ∀n ∈ {1, …, N}: ŷn = g(w ⋅ xn) = yn  (g is the step function)

  26. Perceptron algorithm (Rosenblatt, 1958)
    25
    1. w = 0
    2. η = 1 (for simplicity)
    3. Repeat:
    4.   (xn, yn) ⟵ an instance chosen from D at random
    5.   ŷn ⟵ g(w ⋅ xn)
    6.   if ŷn ≠ yn then:
    7.     if yn = 1 then:
    8.       w ⟵ w + η xn
    9.     else:
    10.      w ⟵ w − η xn
    11. Until no instance updates w
    Frank Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6):386-408.

  27. Exercise: train a single-layer NN to realize OR
    26
     Convert the truth table into training data
     Initialize the weight vector: w = 0
     Apply the perceptron algorithm (previous page) to find w
     Fix η = 1 in this exercise
    x1  x2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   1
    D = { ((0, 0, 1)⊺, 0), ((0, 1, 1)⊺, 1), ((1, 0, 1)⊺, 1), ((1, 1, 1)⊺, 1) }

  28. Updating weights for OR
    27
     D = { ((0, 0, 1)⊺, 0), ((0, 1, 1)⊺, 1), ((1, 0, 1)⊺, 1), ((1, 1, 1)⊺, 1) }
     Initialization: w = (0, 0, 0)⊺
     Iteration #1: choose (x4, y4) = ((1, 1, 1)⊺, 1)
      Classification: ŷ4 = g(w ⋅ x4) = g(0) = 0 ≠ y4
      Update: w ← w + x4 = (1, 1, 1)⊺
     Iteration #2: choose (x1, y1) = ((0, 0, 1)⊺, 0)
      Classification: ŷ1 = g(w ⋅ x1) = g(1) = 1 ≠ y1
      Update: w ← w − x1 = (1, 1, 0)⊺
     Terminate (the weight classifies all instances correctly):
      x = (0, 0, 1)⊺: y = g((1, 1, 0)(0, 0, 1)⊺) = g(0) = 0
      x = (0, 1, 1)⊺: y = g((1, 1, 0)(0, 1, 1)⊺) = g(1) = 1
      x = (1, 0, 1)⊺: y = g((1, 1, 0)(1, 0, 1)⊺) = g(1) = 1
      x = (1, 1, 1)⊺: y = g((1, 1, 0)(1, 1, 1)⊺) = g(2) = 1
    (We chose the instances in the order that minimizes the required number of updates)

  29. Perceptron algorithm implemented in numpy
    28
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

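    The linked notebook is the authoritative implementation; the sketch below is an illustrative rewrite of the perceptron algorithm on the OR data from the exercise, with hypothetical variable names.

```python
# Perceptron algorithm on the OR training data (eta = 1, step activation).
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])   # inputs with a bias feature
y = np.array([0, 1, 1, 1])                                   # OR outputs
w = np.zeros(3)

rng = np.random.default_rng(0)
while not np.array_equal((X @ w > 0).astype(int), y):        # until all instances are correct
    n = rng.integers(len(X))                                 # pick an instance at random
    y_hat = int(X[n] @ w > 0)
    if y_hat != y[n]:
        w += X[n] if y[n] == 1 else -X[n]                    # w <- w +/- x_n
print(w)   # e.g., [1. 1. 0.] as in the worked example (the result depends on the sampling order)
```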

  30. Perceptron algorithm implemented in numpy (matrix version)
    29
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
    X = ( (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1) ),  w = (w1, w2, w3)⊺
    ŷ = g(Xw) = g( (w3, w2 + w3, w1 + w3, w1 + w2 + w3)⊺ )
    (applying the step function to each element)
    y − ŷ = (y1 − ŷ1, y2 − ŷ2, y3 − ŷ3, y4 − ŷ4)⊺
    Scaling each row of X by the corresponding error (yn − ŷn) gives the per-instance updates:
    (y − ŷ) ⊙ X =
      ( 0,        0,        y1 − ŷ1 )
      ( 0,        y2 − ŷ2,  y2 − ŷ2 )
      ( y3 − ŷ3,  0,        y3 − ŷ3 )
      ( y4 − ŷ4,  y4 − ŷ4,  y4 − ŷ4 )

  31. Why the Perceptron algorithm works
    30
     Suppose the parameter w misclassifies (xn, yn)
     If yn = 1:
      Update the weight vector: w′ ⟵ w + xn
      If we classify xn again with the updated weights w′:
       w′ ⋅ xn = (w + xn) ⋅ xn = w ⋅ xn + ‖xn‖² ≥ w ⋅ xn
      The dot product was increased (more likely to be classified as 1)
     Otherwise (if yn = 0):
      Update the weight vector: w′ ⟵ w − xn
      If we classify xn again with the updated weights w′:
       w′ ⋅ xn = (w − xn) ⋅ xn = w ⋅ xn − ‖xn‖² ≤ w ⋅ xn
      The dot product was decreased (more likely to be classified as 0)
     The algorithm updates the parameter in the direction
    where it will classify (xn, yn) more correctly

  32. Summary
    31
     The perceptron algorithm:
      Can find parameters of single-layer NNs for linearly separable data
      Cannot terminate on linearly inseparable data
       Single-layer NNs cannot classify linearly inseparable data
       We must force the algorithm to terminate with incomplete parameters
     Extending the algorithm to multi-layer NNs is non-trivial
      We have no training data for the hidden states
      The famous argument of Minsky and Papert (1969)
     In the next section, we consider gradient-based methods, an
    alternative but standard strategy for training NNs
      Important concepts: sigmoid function and backpropagation
    Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

  33. Single-Layer NN with Sigmoid Function
    32


  34. Activation function: from step to sigmoid
    33
    Step function: ℝ → {0,1}
      g(a) = 1 (if a > 0), 0 (otherwise)
       Yields binary outputs
       Not differentiable at zero
       Zero gradients everywhere else
    Sigmoid function: ℝ → (0,1)
      σ(a) = 1 / (1 + e^{−a})
       Yields continuous scores
       Differentiable at all points
       Mostly non-zero gradients
       Useful for gradient descent

  35. General form with sigmoid function
    34
     Single-layer NN with sigmoid function:
      ŷ = σ(w ⋅ x) = 1 / (1 + e^{−w⋅x})
    Given an input x, it computes an output ŷ ∈ (0,1) by using the parameter w
     This is also known as logistic regression
      We can interpret ŷ as the conditional probability P(y = 1 | x) that an
    input x is classified as 1 (positive category)
     Rule to classify an input as 1:
      ŷ > 0.5 ⟺ 1 / (1 + e^{−w⋅x}) > 1/2 ⟺ w ⋅ x > 0
      The classification rule is the same as the one when we use the
    step function as the activation function

  36. Example: logical AND
    35
     The same parameters as in the previous example:
      ŷ = σ(a),  a = x1 + x2 − 1.5
     The outputs are acceptable, but
      P(x1 ∧ x2 = 1 | x1 = 1, x2 = 1) is not so high (62.2%)
      There is room for improvement so that the model yields ŷ → 1 (100%) for
    positives (true) and ŷ → 0 (0%) for negatives (false)
    x1  x2  y = x1 ∧ x2    a     ŷ = σ(a)
    0   0   0             −1.5   0.182
    0   1   0             −0.5   0.378
    1   0   0             −0.5   0.378
    1   1   1              0.5   0.622

  37. Instance-wise likelihood
    36
     We introduce the instance-wise likelihood pn to measure how
    well the parameters reproduce (xn, yn):
      pn = ŷn (if yn = 1),  1 − ŷn (otherwise)
     Likelihood is a probability representing the ‘fitness’ of the
    parameters to the training data
     We want to increase the likelihood by changing w
    Parameters of AND: ŷ = σ(a),  a = x1 + x2 − 1.5
    x1  x2  y = x1 ∧ x2    a     ŷ = σ(a)   likelihood
    0   0   0             −1.5   0.182      1 − ŷ = 0.818
    0   1   0             −0.5   0.378      1 − ŷ = 0.622
    1   0   0             −0.5   0.378      1 − ŷ = 0.622
    1   1   1              0.5   0.622      ŷ = 0.622

  38. Likelihood on the training data
    37
     We assume that all instances in the training data are i.i.d.
    (independent and identically distributed)
     We define the likelihood as a joint probability on the data:
      L_D(w) = ∏_{n=1}^{N} pn
     When the training data D = {(x1, y1), …, (xN, yN)} is fixed, the
    likelihood is a function of the parameters w
     Let us maximize L_D(w) by changing w
      This is called Maximum Likelihood Estimation (MLE)
      The maximizer w* reproduces the training data well

  39. Training as a minimization problem
    38
     Products of (0,1) values often cause underflow
     Instead, use the log-likelihood, the logarithm of the likelihood:
      LL_D(w) = log L_D(w) = log ∏_{n=1}^{N} pn = Σ_{n=1}^{N} log pn
     In mathematical optimization, we usually consider a
    minimization problem instead of maximization
     We define an objective function E_D(w) by using the
    negative of the log-likelihood:
      E_D(w) = −LL_D(w) = −Σ_{n=1}^{N} log pn
     E_D(w) is called a loss function or error function

  40. Training as a minimization problem
    39
     Given the training data D = {(x1, y1), …, (xN, yN)}, find w*
    by solving the minimization problem:
      w* = argmin_w E_D(w) = argmin_w Σ_{n=1}^{N} (−l_n(w))
    where l_n(w) is the instance-wise log-likelihood:
      l_n(w) = log pn = log ŷn (if yn = 1),  log(1 − ŷn) (otherwise)
             = yn log ŷn + (1 − yn) log(1 − ŷn)

  41. Stochastic Gradient Descent (SGD)
    40
     The objective function is the sum of the losses of the instances:
      E_D(w) = Σ_{n=1}^{N} (−l_n(w))
     We can use Stochastic Gradient Descent (SGD) and its
    variants (e.g., Adam) for minimizing E_D(w)
     SGD algorithm (T is the number of updates):
    1. Initialize w with random values
    2. for t ⟵ 1 to T:
    3.   η_t ⟵ 1/t  # Learning rate at t
    4.   (xn, yn) ⟵ an instance chosen from D at random
    5.   w ⟵ w − η_t ∂(−l_n(w))/∂w = w + η_t ∂l_n(w)/∂w

  42. Exercise: compute the gradient
    41
    Prove:
      ∂l_n(w)/∂w = (∂l_n/∂ŷn)(∂ŷn/∂an)(∂an/∂w) = (yn − ŷn) xn
    by computing the gradients ∂l_n/∂ŷn, ∂ŷn/∂an, and ∂an/∂w
    Here:
      l_n(w) = yn log ŷn + (1 − yn) log(1 − ŷn),
      ŷn = σ(an) = 1 / (1 + e^{−an}),
      an = w ⋅ xn

  43. Answer: the gradient
    42
     l_n(w) = yn log ŷn + (1 − yn) log(1 − ŷn), so
      ∂l_n/∂ŷn = yn/ŷn + (1 − yn)/(1 − ŷn) ⋅ (−1) = (yn(1 − ŷn) − ŷn(1 − yn)) / (ŷn(1 − ŷn)) = (yn − ŷn) / (ŷn(1 − ŷn))
     ŷn = σ(an) = 1 / (1 + e^{−an}), so
      ∂ŷn/∂an = (−1) ⋅ (1 / (1 + e^{−an}))² ⋅ e^{−an} ⋅ (−1) = (1 / (1 + e^{−an})) ⋅ (e^{−an} / (1 + e^{−an})) = ŷn (1 − ŷn)
     an = w ⋅ xn, so
      ∂an/∂w = xn
    Therefore,
      ∂l_n/∂w = (∂l_n/∂ŷn)(∂ŷn/∂an)(∂an/∂w)
              = ((yn − ŷn) / (ŷn(1 − ŷn))) ⋅ ŷn(1 − ŷn) ⋅ xn
              = (yn − ŷn) xn
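    Since the result ∂l_n/∂w = (yn − ŷn)xn is easy to get wrong by a sign, a quick numerical sanity check can help. The sketch below (not from the slides) compares the analytic gradient with a central finite-difference approximation at an arbitrary w.

```python
# The analytic gradient (y - y_hat) * x should match a finite-difference
# approximation of the instance-wise log-likelihood l_n(w).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, x, y):
    y_hat = sigmoid(w @ x)
    return y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)

x, y = np.array([1.0, 1.0, 1.0]), 1
w = np.array([0.2, -0.3, 0.1])
analytic = (y - sigmoid(w @ x)) * x
numeric = np.array([
    (log_likelihood(w + eps, x, y) - log_likelihood(w - eps, x, y)) / 2e-6
    for eps in 1e-6 * np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```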

  44. SGD elaborated for training single-layer NNs
    43
    1. Initialize w with random values
    2. for t ⟵ 1 to T:
    3.   η_t ⟵ 1/t
    4.   (xn, yn) ⟵ an instance chosen from D at random
    5.   ŷn ⟵ σ(w ⋅ xn)
    6.   w ⟵ w + η_t ∂l_n/∂w = w + η_t (yn − ŷn) xn
         # If yn = ŷn, no need for updating
         # If yn = 1 and ŷn < 1, add xn scaled by (1 − ŷn) to w
         # If yn = 0 and 0 < ŷn, subtract xn scaled by ŷn from w
    The algorithm is the same as the perceptron except for using
    the error (yn − ŷn) to weight the amount of an update

  45. SGD implemented in numpy
    44
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
    X = ( (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1) ),  w = (w1, w2, w3)⊺
    ŷ = σ(Xw) = σ( (w3, w2 + w3, w1 + w3, w1 + w2 + w3)⊺ )
    (applying the sigmoid function to each element)
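    The linked notebook is the authoritative implementation; the sketch below is an illustrative rewrite of SGD for the sigmoid single-layer NN on the OR data, using the update rule w ← w + η_t (yn − ŷn) xn from the previous slide.

```python
# SGD for the single-layer NN with sigmoid activation on the OR data.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])                # OR outputs
w = np.zeros(3)

rng = np.random.default_rng(0)
for t in range(1, 1001):
    eta = 1.0 / t                                 # decaying learning rate
    n = rng.integers(len(X))                      # pick an instance at random
    y_hat = sigmoid(X[n] @ w)
    w += eta * (y[n] - y_hat) * X[n]              # w <- w + eta * (y_n - y_hat_n) x_n
print(w, sigmoid(X @ w))
```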

  46. Note: Why is SGD called ‘stochastic’?
    45
     The objective function is the sum of the losses of the instances:
      E_D(w) = Σ_{n=1}^{N} (−l_n(w))
     Gradient descent:
      w ⟵ w − η_t ∂E_D(w)/∂w = w − η_t Σ_{n=1}^{N} ∂(−l_n(w))/∂w
      Update after computing the loss values and gradients for all training instances
     Stochastic gradient descent: use random samples from the data
      w ⟵ w − η_t ∂E_D(w)/∂w ≈ w − η_t ∂(−l_n(w))/∂w  (for a randomly chosen (xn, yn))
      Approximates the gradient: from all instances → from a randomly selected instance
      Update after computing the loss value and gradients for each training instance
      Faster to reach the minimizer w* of the objective function

  47. Note: What is a learning rate?
    46
     A learning rate determines the step size taken towards the steepest direction
      A large step size may reach the minimum faster, but may jump over the minimum
      A small step size may take too long to converge and may get stuck in a local minimum
     We should decay the learning rates; for a strongly convex function, choose η_t such that:
      Σ_{t=1}^{∞} η_t = ∞,  Σ_{t=1}^{∞} η_t² < ∞
     Various scheduling strategies for the learning rate:
      η_t = η_0 / t,  η_t = η_0 / √t,  AdaGrad, RMSProp
     Strategies used in practice: Stepwise Decay Schedule, Polynomial Schedule, Warming Up
    https://beta.mxnet.io/guide/modules/lr_scheduler.html

  48. Regularization
    47
     MLE often causes over-fitting
      When the training data is linearly separable, ‖w‖ → ∞ as Σ_{n=1}^{N} (−l_n(w)) → 0
      The model is also easily affected by noise in the training data
     We use regularization (MAP estimation)
      We introduce a penalty term when ‖w‖ becomes too large
      The loss function with an L2 regularization term:
       E_D(w) = −Σ_{n=1}^{N} l_n(w) + λ‖w‖²
      λ is a hyperparameter to control the trade-off between
    over-fitting and under-fitting
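    One common way to fold the L2 penalty into SGD is to add the gradient of λ‖w‖², i.e., 2λw, to every update (often called weight decay). The sketch below is not from the slides; it assumes the numpy setup of the earlier SGD sketch, and regularized_sgd_step and its arguments are hypothetical names.

```python
# One SGD step for the L2-regularized objective E = -sum_n l_n(w) + lam * ||w||^2.
import numpy as np

def regularized_sgd_step(w, x_n, y_n, eta, lam=0.01):
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x_n)))
    # instance-wise gradient -(y_n - y_hat) x_n, plus the penalty gradient 2 * lam * w
    return w - eta * (-(y_n - y_hat) * x_n + 2 * lam * w)
```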

  49. Summary
    48
     We used the sigmoid as an activation function
      The model is also known as logistic regression
     We defined the instance-wise likelihood to assess how well the current
    model reproduces a training instance
     Training a model: minimize the loss function by changing the weights
      Loss function: −Σ_{n=1}^{N} ( yn log ŷn + (1 − yn) log(1 − ŷn) )
      Minimizing the loss function is equivalent to maximizing the sum of the
    instance-wise log-likelihoods (i.e., the product of the likelihoods) of all instances
     We showed an algorithm for minimizing the loss function by using
    Stochastic Gradient Descent (SGD)
      The same as the perceptron except for using the error (yn − ŷn) to weight
    the amount of an update

  50. Training Multi-Layer Neural Networks
    with Back Propagation
    49


  51. Generic notation for multi-layer NNs
    50
    First layer: ℝ2 → ℝ3
      h(1) = g(1)(a(1)),  a(1) = W(1) h(0),  W(1) ∈ ℝ^{3×2},  a(1), h(1) ∈ ℝ^3
    Second layer: ℝ3 → ℝ2
      h(2) = g(2)(a(2)),  a(2) = W(2) h(1),  W(2) ∈ ℝ^{2×3},  a(2), h(2) ∈ ℝ^2
    Final layer: ℝ2 → ℝ
      h(3) = g(3)(a(3)),  a(3) = W(3) h(2),  W(3) ∈ ℝ^{1×2},  a(3), h(3) ∈ ℝ
    Inputs: x1 = h1(0), x2 = h2(0);  output: h1(3) = ŷ
     The l-th layer (l ∈ {1, …, L}) consists of:
      Input: h(l−1) ∈ ℝ^{d_{l−1}}  (h(0) = x)
      Output: h(l) ∈ ℝ^{d_l}  (h(L) = ŷ)
      Weight: W(l) ∈ ℝ^{d_l × d_{l−1}}
      Activation function: g(l)
      Activation: a(l) ∈ ℝ^{d_l}
      h(l) = g(l)(W(l) h(l−1))
      w_{ij}^{(l)}: weight from the j-th neuron to the i-th neuron of the l-th layer
    Please accept the notational conflict between an instance-wise loss l_n and a layer number l

  52. How to train weights in multi-layer NNs
    51
     We have no explicit supervision signals for the internal
    (hidden) inputs/outputs h(1), …, h(L−1)
     Having said that, SGD only needs the value of the gradient
    ∂l_n/∂w_{ij}^{(l)} for every weight w_{ij}^{(l)} in MLPs
     Can we compute the value of ∂l_n/∂w_{ij}^{(l)} for every weight w_{ij}^{(l)}?
      Yes! Backpropagation can do that!!

  53. Backpropagation
    52
     Commonly used in deep neural networks
     Formulas for backpropagation look complicated
     However:
     We can understand backpropagation easily if we know
    the concept of computation graph
     Most deep learning frameworks implement
    backpropagation by using automatic differentiation
     Let’s see computation graph and automatic
    differentiation first


  54. Computation graph: f(x, y, z) = (x + y) z
    53
    Example from: http://cs231n.github.io/optimization-2/
    Forward pass (the value of each variable is written above its arrow):
      x = −2,  y = 5,  z = −4
      q = x + y = 3
      f = q z = −12

  55. Automatic Differentiation (AD): f(x, y, z) = (x + y) z
    54
    Example from: http://cs231n.github.io/optimization-2/
    Forward pass (the value of each variable is written above its arrow):
      x = −2,  y = 5,  z = −4,  q = x + y = 3,  f = q z = −12
    Backward pass (reverse-mode AD; the gradient of the output with respect to each variable is written below its arrow):
      ∂f/∂f = 1
      ∂f/∂q = z × 1 = −4
      ∂f/∂z = q × 1 = 3
      ∂f/∂x = ∂q/∂x × (−4) = 1 × (−4) = −4
      ∂f/∂y = ∂q/∂y × (−4) = 1 × (−4) = −4
    Compare with the analytic gradients:
      ∂f/∂x = z = −4,  ∂f/∂y = z = −4,  ∂f/∂z = x + y = 3

  56. Automatic Differentiation (Baydin+ 2018)
    55
     AD computes derivatives by using the chain rule
      Function values are computed in the forward pass
      Derivatives are computed with respect to:
       Every variable (in reverse-mode accumulation)
       A specific variable (in forward-mode accumulation)
     Do not confuse AD with these:
      Numerical differentiation: e.g., df(x)/dx ≈ (f(x + h) − f(x)) / h
      Symbolic differentiation: e.g., Mathematica, sympy
    Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.

  57. Rules for reverse-mode Automatic Differentiation
    56
    Add: z = x + y
      ∂f/∂x = ∂f/∂z,  ∂f/∂y = ∂f/∂z  (the incoming gradient is passed to both inputs)
    Multiply: z = x y
      ∂f/∂x = y ∂f/∂z,  ∂f/∂y = x ∂f/∂z
    Function application: z = g(x)
      ∂f/∂x = g′(x) ∂f/∂z
    Branch: x feeds into both z1 and z2
      ∂f/∂x = ∂f/∂z1 + ∂f/∂z2  (gradients from the branches are summed)

  58. Exercise: AD on a computation graph
    57
     Write a computation graph for
      l(x, w) = −log σ(w ⋅ x) = −log (1 / (1 + e^{−w⋅x}))
     Consider x = (1, 1, 1)⊺ and w = (1, 1, −1.5)⊺
     Compute the value of l
     Compute the gradients ∂l/∂w

  59. Computing ∂l/∂w using AD
    58
    Forward pass (with x = (1, 1, 1)⊺, w = (1, 1, −1.5)⊺):
      a = w1 x1 + w2 x2 + w3 x3 = 1 + 1 − 1.5 = 0.5
      u = −a = −0.5
      v = e^u = 0.6065
      s = v + 1 = 1.6065
      r = 1/s = 0.6225   (= σ(w ⋅ x))
      t = log r = −0.4740
      l = −t = 0.4740
    Backward pass (reverse-mode AD, applying the rules on the previous slide):
      ∂l/∂t = −1
      ∂l/∂r = (1/r) × (−1) = −1.6065
      ∂l/∂s = (−1/s²) × (−1.6065) = 0.6224
      ∂l/∂v = 1 × 0.6224 = 0.6224
      ∂l/∂u = e^u × 0.6224 = 0.3775
      ∂l/∂a = (−1) × 0.3775 = −0.3775
      ∂l/∂w1 = x1 × (−0.3775) = −0.3775
      ∂l/∂w2 = x2 × (−0.3775) = −0.3775
      ∂l/∂w3 = x3 × (−0.3775) = −0.3775
      ∂l/∂x1 = w1 × (−0.3775) = −0.3775
      ∂l/∂x2 = w2 × (−0.3775) = −0.3775
      ∂l/∂x3 = w3 × (−0.3775) = 0.5663
    This agrees with the analytic result ∂l/∂w = (ŷ − 1) x, with ŷ = σ(w ⋅ x) = 0.6225

  60. Computing gradients with autograd
    59
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

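    The notebook linked above contains the actual code; the sketch below assumes the HIPS autograd package and recomputes the gradient of the loss from the exercise on slide 58.

```python
# Gradient of l(x, w) = -log(sigmoid(w . x)) via the autograd package (a sketch).
import autograd.numpy as np
from autograd import grad

def loss(w, x):
    return -np.log(1.0 / (1.0 + np.exp(-np.dot(w, x))))

x = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 1.0, -1.5])
grad_loss = grad(loss)            # gradient with respect to the first argument (w)
print(loss(w, x))                 # ~0.4740
print(grad_loss(w, x))            # ~[-0.3775, -0.3775, -0.3775]
```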

  61. Computing gradients with pytorch
    60
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

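    Again, the linked notebook is authoritative; the sketch below shows the same gradient computed with PyTorch's autograd by marking w with requires_grad and calling backward().

```python
# Same loss as the exercise, differentiated by PyTorch's automatic differentiation.
import torch

x = torch.tensor([1.0, 1.0, 1.0])
w = torch.tensor([1.0, 1.0, -1.5], requires_grad=True)
loss = -torch.log(torch.sigmoid(w @ x))
loss.backward()
print(loss.item())                # ~0.4740
print(w.grad)                     # ~tensor([-0.3775, -0.3775, -0.3775])
```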

  62. Training SLP using SGD with pytorch
    61
    x = ( (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1) ),  y = (1, 1, 1, 0)⊺,  w = (0, 0, 0)⊺
    x.mm(w): matrix-vector multiplication (a): (4 × 1)
    sigmoid(a): element-wise sigmoid function (ŷ): (4 × 1)
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
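    A sketch of what such a training loop can look like (an illustrative rewrite, not the notebook itself); note that with only four instances the loop below uses full-batch gradient descent rather than sampling one instance per step, and the learning rate and iteration count are arbitrary choices.

```python
# Training a single-layer NN with manual gradient updates in PyTorch.
import torch

x = torch.tensor([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = torch.tensor([[1.], [1.], [1.], [0.]])        # targets as shown on the slide above
w = torch.zeros(3, 1, requires_grad=True)

eta = 0.5
for t in range(1000):
    y_hat = torch.sigmoid(x.mm(w))                # (4 x 1) predictions
    loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad                         # gradient descent step
        w.grad.zero_()
print(torch.sigmoid(x.mm(w)))
```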

  63. Training MLP using SGD with pytorch
    62
    Added weights for the second layer
    Changed for two-layer perceptron
    Updates for the new parameters
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  64. Training SLP with high-level NN modules
    63
    The definition of the shape of the network and the loss function
    (bias=True for including weights for the bias terms)
    We can implement this part in a generic manner, i.e.,
    independently of the model
    We no longer append 1 (bias) to every instance because torch.nn.Linear
    automatically includes a bias weight
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

  65. Training MLP with high-level NN modules
    64
    The essence of the
    change from SLP to MLP
    We don’t have to modify
    this part to implement MLP
    (the number of iterations was
    changed from 100 to 1000 because
    we have more parameters to train)
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  66. SLP with high-level NN modules and optimizers
    65
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  67. MLP with high-level NN modules and optimizers
    66
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  68. SLP with a customizable NN class
    67
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  69. MLP with a customizable NN class
    68
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

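    A sketch in the spirit of slides 68-69 (not the notebook itself): the MLP is a torch.nn.Module subclass, and the training loop uses nn.BCELoss and torch.optim.SGD; the layer sizes, learning rate, iteration count, and XOR targets are illustrative choices.

```python
# A customizable MLP as a torch.nn.Module subclass, trained with a high-level
# loss function and optimizer.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in=2, d_hidden=2, d_out=1):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden, bias=True)
        self.layer2 = nn.Linear(d_hidden, d_out, bias=True)

    def forward(self, x):
        h = torch.sigmoid(self.layer1(x))
        return torch.sigmoid(self.layer2(h))

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])        # XOR targets (linearly inseparable)
model = MLP()
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(10000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(model(x))
```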

  70. Manual derivation: Gradients for the final layer
    69
     The same as for single-layer NNs:
      ∂l/∂w_{1j}^{(L)} = (y − ŷ) h_j^{(L−1)}
     Here, we omit the index n for instances for simplicity,
    and write the instance-wise loss as l to avoid the notation conflict noted earlier
    (Diagram: ŷ = h_1^{(L)} = σ(a^{(L)}),  a^{(L)} = W^{(L)} h^{(L−1)},
    with inputs h_1^{(L−1)}, h_2^{(L−1)})

  71. Manual derivation: Gradients for the internal layers (1/2)
    70
    Deriving the recursive formula for δ with a two-unit example.
    Define δ_i^{(l)} = ∂l/∂a_i^{(l)}. The forward computations are:
      h_j^{(l)} = g(a_j^{(l)})
      a_1^{(l+1)} = w_{11}^{(l+1)} h_1^{(l)} + w_{12}^{(l+1)} h_2^{(l)}
      a_2^{(l+1)} = w_{21}^{(l+1)} h_1^{(l)} + w_{22}^{(l+1)} h_2^{(l)}
    Backpropagating through them with the chain rule:
      δ_1^{(l)} = g′(a_1^{(l)}) ( w_{11}^{(l+1)} δ_1^{(l+1)} + w_{21}^{(l+1)} δ_2^{(l+1)} )
      δ_2^{(l)} = g′(a_2^{(l)}) ( w_{12}^{(l+1)} δ_1^{(l+1)} + w_{22}^{(l+1)} δ_2^{(l+1)} )

  72. Manual derivation: Gradients for the internal layers (2/2)
    71
     General form of the recursive formula for δ:
      δ_j^{(l)} = ∂l/∂a_j^{(l)} = g′(a_j^{(l)}) Σ_k w_{kj}^{(l+1)} δ_k^{(l+1)}
     Gradient for an internal layer:
      ∂l/∂w_{ji}^{(l)} = (∂l/∂a_j^{(l)}) (∂a_j^{(l)}/∂w_{ji}^{(l)}) = δ_j^{(l)} h_i^{(l−1)}
    (Diagram: a^{(l)} = W^{(l)} h^{(l−1)},  h^{(l)} = g^{(l)}(a^{(l)}),
    with inputs h_1^{(l−1)}, h_2^{(l−1)}, h_3^{(l−1)} and outputs h_1^{(l)}, h_2^{(l)})

  73. Summary
    72
     We can use SGD as long as we can compute the gradients of all parameters
      Even if we have no explicit supervision signals for the internal layers
     Automatic Differentiation (AD) can compute gradients systematically
      AD computes derivatives on a computation graph by using the chain rule
      AD realizes backpropagation without manual derivation of gradients
     AD is employed in most deep learning frameworks
      We only need to implement the algorithm for the forward pass, i.e., how a
    model computes an output given an input
      We can concentrate on designing the structure of a neural network
      This boosted the speed of research and development
      Manual derivation of gradients is tedious and error-prone

  74. An Intuitive Explanation of Universal
    Approximation Theorem for Multi-Layer NN
    73


  75. Universal approximation theorem (Cybenko, 1989)
    74
     Let I_d denote the d-dimensional unit cube [0,1]^d and C(I_d)
    denote the space of continuous functions on I_d
     Given any ε > 0 and any function f ∈ C(I_d), there exist
    an integer N, real constants v_i, b_i ∈ ℝ, and real vectors
    w_i ∈ ℝ^d that define a function F,
      F(x) = Σ_{i=1}^{N} v_i σ(w_i ⋅ x + b_i),
    such that the function F approximates the function f:
      |F(x) − f(x)| < ε  for all x ∈ I_d
     This still holds when replacing I_d with any compact
    subset of ℝ^d and σ(⋅) with some other activation functions
    George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.

  76. What does the theorem state?
    75
    Neural networks with a single hidden layer can approximate
    any smooth function closely
    The Universal Approximation Theorem for neural networks. https://www.youtube.com/watch?v=Ijqkc7OLenI (6:24)

  77. Essence: Smooth function approximated by spikes
    76
    (Figure: a smooth curve approximated by a sum of spike-shaped bumps with
    heights such as 0.4, −0.4, 0.3, −0.3)
    These shapes can be realized by choosing appropriate values w_i, b_i for σ(w_i x + b_i)

  78. Summary of this lecture
     Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
     Multi-layer neural networks can realize any logical function, including XOR
     We can train single/multi-layer NNs by using gradient-based methods
      By implementing graph structures of NNs in a programming language
      With automatic differentiation in deep learning frameworks
     Neural networks with a single hidden layer can approximate any smooth function
    77

  79. References
    78
     Michael Nielsen. 2017. Neural networks and deep learning.
    http://neuralnetworksanddeeplearning.com/ (Japanese translation:
    https://nnadl-ja.github.io/nnadl_site_ja/)
     Raul Rojas. 1996. Neural Networks - A Systematic Introduction.
    Springer-Verlag. (Available at https://page.mi.fu-berlin.de/rojas/neural/)
     Koki Saitoh. 2016. Deep Learning from Scratch (ゼロから作るDeep Learning). O'Reilly Japan.
     Learning PyTorch with Examples.
    https://pytorch.org/tutorials/beginner/pytorch_with_examples.html