Naoaki Okazaki
July 28, 2020

# Feedforward Neural Network (I): Binary Classification

Binary classification, Threshold Logic Units (TLUs), Single-layer Perceptron (SLP), Perceptron algorithm, sigmoid function, Stochastic Gradient Descent (SGD), Multi-layer Neural Networks, Backpropagation, Computation Graph, Automatic Differentiation, Universal Approximation Theorem


## Transcript

1. Feedforward Neural Network (I):
Binary Classification
Naoaki Okazaki
School of Computing,
Tokyo Institute of Technology
PowerPoint template designed by https://ppt.design4u.jp/template/

2. Highlights of this lecture
• Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
• Multi-layer neural networks can realize any logical function, including XOR
• We can train single/multi-layer NNs by using gradient-based methods
  • By implementing graph structures of NNs in a programming language
  • With automatic differentiation in deep learning frameworks
• Neural networks with a single hidden layer can approximate any smooth function

3. Threshold Logic Unit (TLU)
2

4. Recap: Logical connectives (logical operations)
https://vanya.jp.net/dc/
AND: x_1 ∧ x_2
OR: x_1 ∨ x_2
NOT: ¬x_1
NAND (NOT of AND): ¬(x_1 ∧ x_2)
NOR (NOT of OR): ¬(x_1 ∨ x_2)
XOR (exclusive OR): x_1 ⊕ x_2

5. Logical circuits used in daily life
• Signs in Shinkansen [1]: activated by the OR of pressed states of all buttons in the Shinkansen
• Logic IC (TC74HC00AP) [2]
• Intel® Core™ i7 Processor [3]
[1] https://response.jp/article/img/2017/10/02/300517/1228814.html
[2] https://toshiba.semicon-storage.com/info/docget.jsp?did=67518&prodName=TC74HC00AP

6. Recap: Functionally complete set of AND, OR, NOT
Any truth table (a Boolean function with n inputs, f: {0,1}^n ↦ {0,1}) can be
expressed by a combination of the logical connectives AND, OR, and NOT

x_1  x_2  Out  Step 1
0    0    0
0    1    1    ¬x_1 ∧ x_2
1    0    1    x_1 ∧ ¬x_2
1    1    0
Step 2: (¬x_1 ∧ x_2) ∨ (x_1 ∧ ¬x_2)

Step 1: each logical formula yields 1 only when the corresponding input is given
Step 2: the overall output is 1 when any of the logical formulas is satisfied

Rough explanation: Disjunctive Normal Form (DNF; sum-of-products canonical form)
We can convert a truth table into a logical formula in a systematic way with {AND, OR, NOT}:
1. For each row in the table where the output is true, take the AND of all the inputs:
   • When an input (column) of the row is true, use the input variable as it is
   • Otherwise (an input of the row is false), prepend a negation to the input variable
2. Take the OR of all the formulas obtained in step 1
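
The two-step DNF construction above can be sketched in a few lines of Python. This is only an illustration: the function name `truth_table_to_dnf` and the variable names `x1`, `x2` are assumptions made for this example, not part of the lecture.

```python
from itertools import product

def truth_table_to_dnf(table, names):
    """Return (formula string, evaluator) for a truth table.

    `table` maps tuples of 0/1 inputs to a 0/1 output (hypothetical format).
    """
    # Step 1: one AND-term for every row whose output is 1.
    true_rows = [row for row, out in sorted(table.items()) if out == 1]
    terms = []
    for row in true_rows:
        literals = [n if v == 1 else "¬" + n for n, v in zip(names, row)]
        terms.append("(" + " ∧ ".join(literals) + ")")
    # Step 2: OR of all the terms (an empty OR is constantly false).
    formula = " ∨ ".join(terms) if terms else "0"

    def evaluate(*args):
        # An AND-term is satisfied when every input equals the row's value;
        # the DNF is satisfied when any AND-term is satisfied.
        return int(any(all(a == v for a, v in zip(args, row)) for row in true_rows))

    return formula, evaluate

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
formula, xor = truth_table_to_dnf(xor_table, ["x1", "x2"])
print(formula)
print([xor(a, b) for a, b in product([0, 1], repeat=2)])
```

For the XOR table, the printed formula has one AND-term per row with output 1, joined by OR.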

7. Can we build a logical connective only from input/output pairs?
• In ML terms, train an unknown mapping from supervision data
  • No knowledge about the internal mechanism associating inputs and outputs is required
• We start with an example where all inputs/outputs can be enumerated, but this assumption is impractical in the real world
  • Imagine: inputs are natural language questions and outputs are answers
  • We expect that a learned mapping can predict outputs for unseen inputs

Input (x_1, x_2)   Output y
(0, 0)             0
(0, 1)             1
(1, 0)             1
(1, 1)             1
y = ?(x_1, x_2)

8. Realize OR as a mathematical function
• Find a function f that satisfies y = f(x_1, x_2) (x_1, x_2, y ∈ {0,1}):
  (x_1, x_2) = (0, 0): y = 0
  (x_1, x_2) = (0, 1): y = 1
  (x_1, x_2) = (1, 0): y = 1
  (x_1, x_2) = (1, 1): y = 1
• We can manually craft a function like this:
  f(x_1, x_2) = g(x_1 + x_2)
  g(a) = 1 (if a > 0), 0 (otherwise)   (step function)

9. Realize OR with a single-layer neural network
• Finding a function from scratch is hard in general
  • We assume a model with parameters
• Assume a single-layer neural network (single-layer NN):
  y = g(w_1 x_1 + w_2 x_2 + b)
  Output: y ∈ {0,1}; Input: x_1, x_2 ∈ {0,1}; Parameters: w_1, w_2, b ∈ ℝ
• Train a model: find the parameters that can reproduce the input/output of the supervision data (OR)

10. Interactive visualization of single-layer neural networks
9
https://chokkan.github.io/deeplearning/demo-slp.html

11. Parameters realizing logical OR: y = x_1 ∨ x_2
For example: w_1 = w_2 = 1, b = −0.5, i.e., y = g(x_1 + x_2 − 0.5)

x_1  x_2  y   Computation
0    0    0   y = g((1 1)(0 0)⊺ − 0.5) = g(0 − 0.5) = 0
0    1    1   y = g((1 1)(0 1)⊺ − 0.5) = g(1 − 0.5) = 1
1    0    1   y = g((1 1)(1 0)⊺ − 0.5) = g(1 − 0.5) = 1
1    1    1   y = g((1 1)(1 1)⊺ − 0.5) = g(2 − 0.5) = 1

12. Parameters realizing logical AND: y = x_1 ∧ x_2
For example: w_1 = w_2 = 1, b = −1.5, i.e., y = g(x_1 + x_2 − 1.5)

x_1  x_2  y   Computation
0    0    0   y = g((1 1)(0 0)⊺ − 1.5) = g(0 − 1.5) = 0
0    1    0   y = g((1 1)(0 1)⊺ − 1.5) = g(1 − 1.5) = 0
1    0    0   y = g((1 1)(1 0)⊺ − 1.5) = g(1 − 1.5) = 0
1    1    1   y = g((1 1)(1 1)⊺ − 1.5) = g(2 − 1.5) = 1

13. Parameters realizing logical NOT: y = ¬x_1
For example: w_1 = −1, w_2 = 0, b = 0.5, i.e., y = g(−x_1 + 0.5)
(We ignore x_2 as a logical NOT has one input)

x_1  y   Computation
0    1   y = g(−1 × 0 + 0.5) = g(0.5) = 1
1    0   y = g(−1 × 1 + 0.5) = g(−1 + 0.5) = 0

14. Parameters realizing logical NAND: y = ¬(x_1 ∧ x_2)
For example: w_1 = w_2 = −1, b = 1.5, i.e., y = g(−x_1 − x_2 + 1.5)

x_1  x_2  y   Computation
0    0    1   y = g((−1 −1)(0 0)⊺ + 1.5) = g(0 + 1.5) = 1
0    1    1   y = g((−1 −1)(0 1)⊺ + 1.5) = g(−1 + 1.5) = 1
1    0    1   y = g((−1 −1)(1 0)⊺ + 1.5) = g(−1 + 1.5) = 1
1    1    0   y = g((−1 −1)(1 1)⊺ + 1.5) = g(−2 + 1.5) = 0
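
The hand-crafted parameters for OR, AND, NOT, and NAND can be checked mechanically. The sketch below is a minimal illustration; the helper names `step` and `tlu` are assumptions introduced here, while the weights and biases are the ones given on the slides.

```python
def step(a):
    """Step function: 1 if a > 0, otherwise 0."""
    return 1 if a > 0 else 0

def tlu(weights, bias):
    """Return a threshold logic unit computing g(w . x + b)."""
    def unit(*x):
        return step(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return unit

OR = tlu([1, 1], -0.5)
AND = tlu([1, 1], -1.5)
NAND = tlu([-1, -1], 1.5)
NOT = tlu([-1], 0.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for name, gate in [("OR", OR), ("AND", AND), ("NAND", NAND)]:
    print(name, [gate(*x) for x in inputs])
print("NOT", [NOT(0), NOT(1)])
```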

15. Can we find parameters that realize logical XOR?

x_1  x_2  y
0    0    0
0    1    1
1    0    1
1    1    0
Can we find parameter values w_1, w_2, b such that they reproduce the logical XOR?

16. Single-layer NNs cannot realize XOR (Minsky and Papert, 1969)
• The decision rule for outputting y = 1 (assuming w_2 > 0):
  w_1 x_1 + w_2 x_2 + b > 0 ⟺ x_2 > −(w_1 / w_2) x_1 − b / w_2
• This draws a line with the slope −w_1 / w_2 and y-intercept −b / w_2
• However, it is impossible to draw a single line that separates true/false outputs of the XOR logic
• We say that XOR inputs are not linearly separable (linearly inseparable)
Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

17. How can we realize logical XOR?
• Combine logical connectives: y = (x_1 ∨ x_2) ∧ ¬(x_1 ∧ x_2)
• Alternatively, draw multiple lines (instead of a single line)
  1. Draw a line for OR
  2. Draw a line for NAND (NOT of AND)
  3. Take the AND of these areas

x_1  x_2  x_1 ∨ x_2  ¬(x_1 ∧ x_2)  y
0    0    0          1             0
0    1    1          1             1
1    0    1          1             1
1    1    1          0             0

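18. XOR realized as a combination of single-layer neural networks
https://chokkan.github.io/deeplearning/demo-mlp.html
• XOR is the AND of OR and NAND
  y = (x_1 ∨ x_2) ∧ ¬(x_1 ∧ x_2) = h_1 ∧ h_2 = g(h_1 + h_2 − 1.5)
  where:
  h_1 = x_1 ∨ x_2 = g(x_1 + x_2 − 0.5)
  h_2 = ¬(x_1 ∧ x_2) = g(−x_1 − x_2 + 1.5)
• Parameters shown in the network diagram:
  h_1 (OR): weights (1, 1), bias −0.5
  h_2 (NAND): weights (−1, −1), bias 1.5
  y (AND of h_1 and h_2): weights (1, 1), bias −1.5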
19. Multi-layer neural networks
• By combining (stacking) single-layer NNs, multi-layer neural networks (multi-layer NNs) can realize any truth table
  • Applicable to linearly inseparable input/output
  • Has the expressive power equivalent to any logic circuit, such as an Arithmetic Logic Unit (ALU) and memory
• Note: the step function is the key to the non-linearity
  • If we did not use a step function in the first layer…
    y = w_1 h_1 + w_2 h_2 + b
      = w_1 (w_11 x_1 + w_12 x_2 + b_1) + w_2 (w_21 x_1 + w_22 x_2 + b_2) + b
      = (w_1 w_11 + w_2 w_21) x_1 + (w_1 w_12 + w_2 w_22) x_2 + w_1 b_1 + w_2 b_2 + b
    Reduced to a single-layer NN
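
The collapse of stacked linear layers can also be confirmed numerically. The sketch below draws arbitrary (hypothetical) parameter values and compares the two-layer computation without activation functions against the single-layer form with the combined coefficients.

```python
import random

rng = random.Random(0)
# Hypothetical parameters of a two-layer network WITHOUT activation functions.
w11, w12, b1 = (rng.uniform(-1, 1) for _ in range(3))
w21, w22, b2 = (rng.uniform(-1, 1) for _ in range(3))
w1, w2, b = (rng.uniform(-1, 1) for _ in range(3))

def two_layer_linear(x1, x2):
    h1 = w11 * x1 + w12 * x2 + b1
    h2 = w21 * x1 + w22 * x2 + b2
    return w1 * h1 + w2 * h2 + b

# Coefficients of the equivalent single-layer form.
c1 = w1 * w11 + w2 * w21
c2 = w1 * w12 + w2 * w22
c0 = w1 * b1 + w2 * b2 + b

def single_layer(x1, x2):
    return c1 * x1 + c2 * x2 + c0

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.3, -2.0)]
max_gap = max(abs(two_layer_linear(*p) - single_layer(*p)) for p in points)
print(max_gap)  # differs from zero only by floating-point rounding
```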

20. Generic form: Feed Forward Neural Network (FFNN)
First layer f_1: ℝ^2 → ℝ^3
  h = f_1(x) = g_1(W_xh x + b_h), W_xh ∈ ℝ^{3×2}, b_h ∈ ℝ^3
Second layer f_2: ℝ^3 → ℝ^2
  z = f_2(h) = g_2(W_hz h + b_z), W_hz ∈ ℝ^{2×3}, b_z ∈ ℝ^2
Final layer f_3: ℝ^2 → ℝ^1
  y = f_3(z) = g_3(W_zy z + b_y), W_zy ∈ ℝ^{1×2}, b_y ∈ ℝ
Three-layer neural network: y = f_3(f_2(f_1(x)))
• h and z are called hidden units, states, or layers
• The depth of the neural network is three
• g_1, g_2, g_3 are called activation functions
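
A forward pass through a network with the layer sizes above (2 → 3 → 2 → 1) can be sketched with plain Python lists. The random weights, the zero biases, and the choice of the sigmoid for every activation function are assumptions made for illustration only.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer(W, b, g):
    """Return the function v -> g(W v + b), with g applied element-wise."""
    def f(v):
        return [g(sum(wij * vj for wij, vj in zip(row, v)) + bi)
                for row, bi in zip(W, b)]
    return f

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
f1 = layer(random_matrix(3, 2, rng), [0.0] * 3, sigmoid)  # R^2 -> R^3
f2 = layer(random_matrix(2, 3, rng), [0.0] * 2, sigmoid)  # R^3 -> R^2
f3 = layer(random_matrix(1, 2, rng), [0.0] * 1, sigmoid)  # R^2 -> R^1

x = [1.0, 0.0]
h = f1(x)
z = f2(h)
y = f3(z)
print(len(h), len(z), len(y))
```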

21. Summary
• The logical units explained here are called Threshold Logic Units (TLUs), the first artificial neurons (McCulloch and Pitts, 1943)
  • A single-layer NN can provide a functionally complete set {AND, OR, NOT}
  • Single-layer NNs cannot model linearly inseparable data (e.g., XOR)
• Multi-layer NNs (stacking single-layer NNs) can be seen as logical compounds
  • Multi-layer NNs can realize any binary function {0,1}^n ↦ {0,1}
  • Multi-layer NNs can model linearly inseparable data
  • We will see that multi-layer NNs can approximately express any smooth function
• We showed the generic form of feed-forward neural networks
W. McCulloch and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133.

22. Training Single-Layer Neural Networks
21

23. How to determine parameters of single-layer NNs
22
 We saw single-layer NNs realize logical connectives
 By crafting parameters (weights and biases) carefully to realize
desired connectives
 However, crafting parameters is difficult
 We are sometimes unsure of the internal mechanism associating
input and output variables
 We want to find parameters automatically from data
 We are interested in determining parameters only from supervision
data, pairs of inputs and outputs

24. Supervised learning (training)
• Supervision data (input: 𝒙 ∈ ℝ^d, output: y ∈ {0,1})
  • D = {(𝒙_1, y_1), …, (𝒙_N, y_N)} (N instances)
• Find parameters such that they can reproduce training instances as correctly as possible
• We assume generalization
  • If the parameters reproduce training instances well, we expect that they will work for unseen instances

25. Supervised learning for single-layer NNs (with new notations)
• For simplicity, we include a bias term b ∈ ℝ in 𝒘 ∈ ℝ^d
  • 𝒙^(new) = 𝒙 ⊕ 1 = (x_1, x_2, …, x_d, 1)⊺, 𝒘^(new) = 𝒘 ⊕ b = (w_1, w_2, …, w_d, b)⊺
  • 𝒘^(new) ⋅ 𝒙^(new) = w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d + b (← original form)
• We introduce a new notation ŷ to distinguish a computed output from the gold output y in the supervision data
  • D = {(𝒙_1, y_1), …, (𝒙_N, y_N)} (N instances)
• We distinguish two kinds of outputs hereafter
  • ŷ_i: the output computed (predicted) by the model for the input 𝒙_i
  • y_i: the true (gold) output for the input 𝒙_i in the supervision data
• Training: find 𝒘 such that
  ∀i ∈ {1, …, N}: ŷ_i = g(𝒘 ⋅ 𝒙_i) = y_i   (g is the step function)

26. Perceptron algorithm (Rosenblatt, 1958)
1. 𝒘 = 𝟎
2. η = 1 (for simplicity)
3. Repeat:
4.   (𝒙_i, y_i) ⟵ an instance chosen from D at random
5.   ŷ ⟵ g(𝒘 ⋅ 𝒙_i)
6.   if ŷ ≠ y_i then:
7.     if y_i = 1 then:
8.       𝒘 ⟵ 𝒘 + η𝒙_i
9.     else:
10.      𝒘 ⟵ 𝒘 − η𝒙_i
Frank Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6):386-408.
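
The pseudocode above can be sketched in plain Python as follows. This is not the notebook implementation referenced later in the slides; the function name `perceptron`, the stopping test, and the cap `max_updates` are assumptions added so that the loop terminates.

```python
import random

def step(a):
    return 1 if a > 0 else 0

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron(data, eta=1.0, max_updates=1000, seed=0):
    """Train a single-layer NN with the perceptron algorithm.

    `data` is a list of (x, y) pairs where x already ends with the constant 1
    for the bias term.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(max_updates):
        # Assumed stopping rule: quit once every instance is classified correctly.
        if all(step(dot(w, x)) == y for x, y in data):
            break
        x, y = rng.choice(data)
        y_hat = step(dot(w, x))
        if y_hat != y:
            sign = 1 if y == 1 else -1
            w = [wi + sign * eta * xi for wi, xi in zip(w, x)]
    return w

D_or = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
w = perceptron(D_or)
print(w, [step(dot(w, x)) for x, _ in D_or])
```

The learned weights depend on the random order of the instances, so they need not match the hand-crafted ones; only the reproduced outputs matter.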

27. Exercise: train a single-layer NN to realize OR
• Convert the truth table into training data
• Initialize the weight vector 𝒘 = 𝟎
• Apply the perceptron algorithm (previous page) to find 𝒘
  • Fix η = 1 in this exercise

x_1  x_2  y
0    0    0
0    1    1
1    0    1
1    1    1

D = { ((0 0 1)⊺, 0), ((0 1 1)⊺, 1), ((1 0 1)⊺, 1), ((1 1 1)⊺, 1) }

28. Updating weights for OR
• D = { ((0 0 1)⊺, 0), ((0 1 1)⊺, 1), ((1 0 1)⊺, 1), ((1 1 1)⊺, 1) }
• Initialization: 𝒘 = (0 0 0)⊺
• Iteration #1: choose (𝒙_4, y_4) = ((1 1 1)⊺, 1)
  • Classification: ŷ = g(𝒘 ⋅ 𝒙_4) = g(0) = 0 ≠ y_4
  • Update: 𝒘 ← 𝒘 + 𝒙_4 = (1 1 1)⊺
• Iteration #2: choose (𝒙_1, y_1) = ((0 0 1)⊺, 0)
  • Classification: ŷ = g(𝒘 ⋅ 𝒙_1) = g(1) = 1 ≠ y_1
  • Update: 𝒘 ← 𝒘 − 𝒙_1 = (1 1 0)⊺
• Terminate (the weight classifies all instances correctly)
  • 𝒙 = (0 0 1)⊺: ŷ = g((1 1 0)(0 0 1)⊺) = g(0) = 0
  • 𝒙 = (0 1 1)⊺: ŷ = g((1 1 0)(0 1 1)⊺) = g(1) = 1
  • 𝒙 = (1 0 1)⊺: ŷ = g((1 1 0)(1 0 1)⊺) = g(1) = 1
  • 𝒙 = (1 1 1)⊺: ŷ = g((1 1 0)(1 1 1)⊺) = g(2) = 1
We chose the instances in the order that minimizes the required number of updates

29. Perceptron algorithm implemented in numpy
28
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

30. Perceptron algorithm implemented in numpy (matrix version)
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], 𝒘 = (w_1, w_2, w_3)⊺
ŷ = g(X𝒘) = g((w_3, w_2 + w_3, w_1 + w_3, w_1 + w_2 + w_3)⊺)   (applying the step function to each element)
𝒚 − ŷ = (y_1 − ŷ_1, y_2 − ŷ_2, y_3 − ŷ_3, y_4 − ŷ_4)⊺
(𝒚 − ŷ) ⋅ X scales each row of X by its error:
  [0,          0,          y_1 − ŷ_1]
  [0,          y_2 − ŷ_2,  y_2 − ŷ_2]
  [y_3 − ŷ_3,  0,          y_3 − ŷ_3]
  [y_4 − ŷ_4,  y_4 − ŷ_4,  y_4 − ŷ_4]

31. Why the perceptron algorithm works
• Suppose the parameter 𝒘 misclassifies (𝒙_i, y_i)
• If y_i = 1:
  • Update the weight vector 𝒘′ ⟵ 𝒘 + 𝒙_i
  • If we classify 𝒙_i again with the updated weights 𝒘′:
    𝒘′ ⋅ 𝒙_i = (𝒘 + 𝒙_i) ⋅ 𝒙_i = 𝒘 ⋅ 𝒙_i + 𝒙_i ⋅ 𝒙_i ≥ 𝒘 ⋅ 𝒙_i
  • The dot product was increased (more likely to be classified as 1)
• Otherwise (if y_i = 0):
  • Update the weight vector 𝒘′ ⟵ 𝒘 − 𝒙_i
  • If we classify 𝒙_i again with the updated weights 𝒘′:
    𝒘′ ⋅ 𝒙_i = (𝒘 − 𝒙_i) ⋅ 𝒙_i = 𝒘 ⋅ 𝒙_i − 𝒙_i ⋅ 𝒙_i ≤ 𝒘 ⋅ 𝒙_i
  • The dot product was decreased (more likely to be classified as 0)
• The algorithm updates the parameter 𝒘 in the direction where it will classify (𝒙_i, y_i) more correctly

32. Summary
• The perceptron algorithm:
  • Can find parameters of single-layer NNs for linearly separable data
  • Cannot terminate with linearly inseparable data
    • Single-layer NNs cannot classify linearly inseparable data
    • We must force the algorithm to terminate with incomplete parameters
• Extending the algorithm to multiple layers is non-trivial
  • We have no training data for hidden states
  • The famous argument of Minsky and Papert (1969)
• In the next section, we consider the gradient-based method, an alternative but standard strategy for training NNs
  • Important concepts: sigmoid function and backpropagation
Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

33. Single-Layer NN with Sigmoid Function
32

34. Activation function: from step to sigmoid
Step function g: ℝ → {0,1}
  g(a) = 1 (if a > 0), 0 (otherwise)
  • Yields binary outputs
  • Not differentiable at zero
Sigmoid function σ: ℝ → (0,1)
  σ(a) = 1 / (1 + e^{−a})
  • Yields continuous scores
  • Differentiable at all points

35. General form with sigmoid function
• Single-layer NN with sigmoid function
  ŷ = σ(𝒘 ⋅ 𝒙) = 1 / (1 + e^{−𝒘⋅𝒙})
  Given an input 𝒙 ∈ ℝ^d, it computes an output ŷ ∈ (0,1) by using the parameter 𝒘 ∈ ℝ^d
  • This is also known as logistic regression
  • We can interpret ŷ as the conditional probability P(y = 1 | 𝒙) that an input 𝒙 is classified to 1 (positive category)
• Rule to classify an input to 1:
  ŷ > 0.5 ⟺ 1 / (1 + e^{−𝒘⋅𝒙}) > 1/2 ⟺ 𝒘 ⋅ 𝒙 > 0
  The classification rule is the same as the one when we use the step function as an activation function

36. Example: logical AND
• The same parameter as in the previous example:
  ŷ = σ(a), a = x_1 + x_2 − 1.5
• The outputs are acceptable, but
  • P(x_1 ∧ x_2 = 1 | x_1 = 1, x_2 = 1) is not so high (62.2%)
  • Room for improving 𝒘 so that it yields ŷ → 1 (100%) for positives (true) and ŷ → 0 (0%) for negatives (false)

x_1  x_2  y = x_1 ∧ x_2  a     ŷ = σ(a)
0    0    0              −1.5  0.182
0    1    0              −0.5  0.378
1    0    0              −0.5  0.378
1    1    1              0.5   0.622

37. Instance-wise likelihood
• We introduce the instance-wise likelihood l̂_i(𝒘) to measure how well the parameters 𝒘 reproduce (𝒙_i, y_i)
  l̂_i(𝒘) = ŷ_i (if y_i = 1), 1 − ŷ_i (otherwise)
• Likelihood is a probability representing the ‘fitness’ of the parameters to the training data
• We want to increase the likelihood by changing 𝒘

x_1  x_2  y = x_1 ∧ x_2  a     ŷ = σ(a)  l̂
0    0    0              −1.5  0.182     1 − ŷ = 0.818
0    1    0              −0.5  0.378     1 − ŷ = 0.622
1    0    0              −0.5  0.378     1 − ŷ = 0.622
1    1    1              0.5   0.622     ŷ = 0.622
Parameters of AND: ŷ = σ(a), a = x_1 + x_2 − 1.5
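
The table of sigmoid outputs and instance-wise likelihoods for the AND parameters can be reproduced with a short script. The variable names are assumptions for this sketch; the product and negative log-likelihood computed at the end anticipate the quantities defined on the following slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
likelihoods = []
for (x1, x2), y in data:
    a = x1 + x2 - 1.5                              # parameters of AND from the slide
    y_hat = sigmoid(a)
    l_hat = y_hat if y == 1 else 1.0 - y_hat       # instance-wise likelihood
    likelihoods.append(l_hat)
    print(x1, x2, y, a, round(y_hat, 3), round(l_hat, 3))

likelihood = math.prod(likelihoods)                # joint likelihood (i.i.d. assumption)
nll = -sum(math.log(l) for l in likelihoods)       # negative log-likelihood
print(likelihood, nll)
```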

38. Likelihood on the training data
• We assume that all instances in the training data are i.i.d. (independent and identically distributed)
• We define likelihood as a joint probability on data,
  L̂_D(𝒘) = ∏_{i=1}^N l̂_i(𝒘)
• When the training data D = {(𝒙_1, y_1), …, (𝒙_N, y_N)} is fixed, the likelihood is a function of the parameters 𝒘
• Let us maximize L̂_D(𝒘) by changing 𝒘
  • This is called Maximum Likelihood Estimation (MLE)
  • The maximizer 𝒘* reproduces the training data well

39. Training as a minimization problem
• Products of (0,1) values often cause underflow
• Instead, use log-likelihood, the logarithm of the likelihood,
  LL_D(𝒘) = log L̂_D(𝒘) = log ∏_{i=1}^N l̂_i(𝒘) = Σ_{i=1}^N log l̂_i(𝒘)
• In mathematical optimization, we usually consider a minimization problem
• We define an objective function L_D(𝒘) by using the negative of the log-likelihood
  L_D(𝒘) = −LL_D(𝒘) = −Σ_{i=1}^N log l̂_i(𝒘)
  L_D(𝒘) is called a loss function or error function

40. Training as a minimization problem
• Given the training data D = {(𝒙_1, y_1), …, (𝒙_N, y_N)}, find 𝒘* as the minimization problem,
  𝒘* = argmin_𝒘 L_D(𝒘) = argmin_𝒘 Σ_{i=1}^N l_i(𝒘),
  l_i(𝒘) = −log l̂_i(𝒘)
         = −log ŷ_i (if y_i = 1), −log(1 − ŷ_i) (otherwise)
         = −{ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) }

• The objective function is the sum of losses of instances,
  L_D(𝒘) = Σ_{i=1}^N l_i(𝒘)
• We can use Stochastic Gradient Descent (SGD) and its variants
• SGD Algorithm (T is the number of updates)
  1. Initialize 𝒘 with random values
  2. for t ⟵ 1 to T:
  3.   η_t ⟵ 1/t   # Learning rate at t
  4.   (𝒙_i, y_i) ⟵ an instance chosen from D at random
  5.   𝒘 ⟵ 𝒘 − η_t ∂l_i(𝒘)/∂𝒘 = 𝒘 + η_t (y_i − ŷ_i) 𝒙_i

Prove:
  ∂l_i(𝒘)/∂𝒘 = (∂l_i/∂ŷ_i) (∂ŷ_i/∂a_i) (∂a_i/∂𝒘) = −(y_i − ŷ_i) 𝒙_i
Here:
  l_i(𝒘) = −{ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) },
  ŷ_i = σ(a_i) = 1 / (1 + e^{−a_i}),
  a_i = 𝒘 ⋅ 𝒙_i

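• l_i(𝒘) = −{ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) },
  ∂l_i/∂ŷ_i = −{ y_i / ŷ_i + (1 − y_i) / (1 − ŷ_i) ⋅ (−1) } = −{ y_i (1 − ŷ_i) − ŷ_i (1 − y_i) } / { ŷ_i (1 − ŷ_i) } = −(y_i − ŷ_i) / { ŷ_i (1 − ŷ_i) }
• ŷ_i = σ(a_i) = 1 / (1 + e^{−a_i}),
  ∂ŷ_i/∂a_i = (−1) ⋅ 1 / (1 + e^{−a_i})^2 ⋅ e^{−a_i} ⋅ (−1) = 1 / (1 + e^{−a_i}) ⋅ e^{−a_i} / (1 + e^{−a_i}) = ŷ_i (1 − ŷ_i)
• a_i = 𝒘 ⋅ 𝒙_i,
  ∂a_i/∂𝒘 = 𝒙_i
Therefore,
  ∂l_i(𝒘)/∂𝒘 = (∂l_i/∂ŷ_i) (∂ŷ_i/∂a_i) (∂a_i/∂𝒘) = −(y_i − ŷ_i) / { ŷ_i (1 − ŷ_i) } ⋅ ŷ_i (1 − ŷ_i) ⋅ 𝒙_i = −(y_i − ŷ_i) 𝒙_i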
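
The result of the derivation, ∂l/∂𝒘 = −(y − ŷ)𝒙, can be sanity-checked against a finite-difference approximation. The sketch below uses arbitrary (hypothetical) values for 𝒘 and 𝒙, and the helper names are assumptions for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(w, x, y):
    """Instance-wise loss: -(y log y_hat + (1 - y) log(1 - y_hat))."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def analytic_grad(w, x, y):
    """Gradient from the derivation: -(y - y_hat) x."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [-(y - y_hat) * xi for xi in x]

def numeric_grad(w, x, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(len(w)):
        up = w[:k] + [w[k] + h] + w[k + 1:]
        down = w[:k] + [w[k] - h] + w[k + 1:]
        grad.append((loss(up, x, y) - loss(down, x, y)) / (2 * h))
    return grad

w = [0.5, -1.0, 0.2]
x = [1.0, 0.0, 1.0]
max_diff = max(
    abs(a - n)
    for y in (0, 1)
    for a, n in zip(analytic_grad(w, x, y), numeric_grad(w, x, y))
)
print(max_diff)
```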

44. SGD elaborated for training single-layer NNs
1. Initialize 𝒘 with random values
2. for t ⟵ 1 to T:
3.   η_t ⟵ 1/t
4.   (𝒙_i, y_i) ⟵ an instance chosen from D at random
5.   ŷ_i ⟵ σ(𝒘 ⋅ 𝒙_i)
6.   𝒘 ⟵ 𝒘 − η_t ∂l_i(𝒘)/∂𝒘 = 𝒘 + η_t (y_i − ŷ_i) 𝒙_i
# If y_i = ŷ_i, no need for updating 𝒘
# If y_i = 1 and ŷ_i < 1, add 𝒙_i scaled by (1 − ŷ_i) to 𝒘
# If y_i = 0 and 0 < ŷ_i, subtract 𝒙_i scaled by ŷ_i from 𝒘
The algorithm is the same as perceptron except for using the error (y_i − ŷ_i) for weighting the amount of an update
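
A minimal sketch of this SGD loop in plain Python follows. It is not the notebook implementation referenced in the slides; for simplicity it assumes a constant learning rate (`eta=0.5`) instead of the 1/t schedule, and the function names and the number of updates are also assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, data):
    """Negative log-likelihood summed over the data."""
    total = 0.0
    for x, y in data:
        y_hat = sigmoid(dot(w, x))
        total -= math.log(y_hat if y == 1 else 1.0 - y_hat)
    return total

def sgd(data, T=5000, eta=0.5, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in data[0][0]]
    for _ in range(T):
        x, y = rng.choice(data)
        y_hat = sigmoid(dot(w, x))
        # w <- w - eta * dl/dw = w + eta * (y - y_hat) * x
        w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w

D_or = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
w = sgd(D_or)
predictions = [1 if sigmoid(dot(w, x)) > 0.5 else 0 for x, _ in D_or]
print(predictions, loss(w, D_or))
```

Unlike the perceptron, every sampled instance changes the weights a little, with the size of the change proportional to the error y − ŷ.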

45. SGD implemented in numpy
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], 𝒘 = (w_1, w_2, w_3)⊺
ŷ = σ(X𝒘) = σ((w_3, w_2 + w_3, w_1 + w_3, w_1 + w_2 + w_3)⊺)   (applying the sigmoid function to each element)

46. Note: Why is SGD called ‘stochastic’?
• The objective function is the sum of losses of instances,
  L_D(𝒘) = Σ_{i=1}^N l_i(𝒘)
• Gradient descent: use all instances in the data
  𝒘 ⟵ 𝒘 − η_t ∂L_D(𝒘)/∂𝒘 = 𝒘 − η_t Σ_{i=1}^N ∂l_i(𝒘)/∂𝒘
  • Update after computing loss values and gradients for all training instances
• Stochastic gradient descent: use random samples from the data
  𝒘 ⟵ 𝒘 − η_t ∂L_D(𝒘)/∂𝒘 ≈ 𝒘 − η_t ∂l_i(𝒘)/∂𝒘
  • Approximate the gradients: from all instances → from a randomly selected instance
  • Update after computing the loss value and gradients for each training instance
  • Reaches the minimizer 𝒘* of the objective function faster

47. Note: What is learning rate?
• A learning rate η determines the step size of a move in the steepest-descent direction
  • A large step size may reach the minimum faster, but may jump over the minimum
  • A small step size may take too long to converge and may get stuck in a local minimum
• We should decay learning rates for a strongly convex function such that:
  Σ_{t=1}^∞ η_t = ∞,  Σ_{t=1}^∞ η_t^2 < ∞
• Various scheduling strategies for learning rate, e.g., η_t = η_0 / t
 Strategies used in practice
Stepwise Decay Schedule Polynomial Schedule Warming Up
https://beta.mxnet.io/guide/modules/lr_scheduler.html

48. Regularization
• MLE often causes over-fitting
  • When the training data is linearly separable, |𝒘| → ∞ as Σ_{i=1}^N l_i(𝒘) → 0
  • Susceptible to noise in the training data
• We use regularization (MAP estimation)
  • We introduce a penalty term when 𝒘 becomes too large
  • The loss function with an L2 regularization term:
    L_D(𝒘) = −Σ_{i=1}^N log l̂_i(𝒘) + α‖𝒘‖^2
  • α is the hyperparameter to control the trade-off between over- and under-fitting

49. Summary
• We used sigmoid as an activation function
  • The model is also known as logistic regression
• We defined instance-wise likelihood to assess how well the current model reproduces a training instance
• Training a model: minimizing the loss function by changing weights
  • Loss function: −Σ_{i=1}^N { y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) }
  • Minimizing the loss function is equivalent to maximizing the product of instance-wise likelihoods of all instances
• We showed an algorithm for minimizing the loss function by using SGD
  • The same as perceptron except for using the error (y_i − ŷ_i) for weighting the amount of an update

50. Training Multi-Layer Neural Networks
with Back Propagation
49

51. Generic notation for multi-layer NNs
First layer: ℝ^2 → ℝ^3
  𝒉^(1) = g^(1)(𝒂^(1)), 𝒂^(1) = W^(1) 𝒉^(0), W^(1) ∈ ℝ^{3×2}; 𝒉^(1), 𝒂^(1) ∈ ℝ^3
Second layer: ℝ^3 → ℝ^2
  𝒉^(2) = g^(2)(𝒂^(2)), 𝒂^(2) = W^(2) 𝒉^(1), W^(2) ∈ ℝ^{2×3}; 𝒉^(2), 𝒂^(2) ∈ ℝ^2
Final layer: ℝ^2 → ℝ
  𝒉^(3) = g^(3)(𝒂^(3)), 𝒂^(3) = W^(3) 𝒉^(2), W^(3) ∈ ℝ^{1×2}; 𝒉^(3), 𝒂^(3) ∈ ℝ
Inputs and output in the diagram: x_1 = h_1^(0), x_2 = h_2^(0), h_1^(3) = ŷ
• The l-th layer (l ∈ {1, …, L}) consists of:
  • Input: 𝒉^(l−1) ∈ ℝ^{d_{l−1}} (𝒉^(0) = 𝒙)
  • Output: 𝒉^(l) ∈ ℝ^{d_l} (𝒉^(L) = ŷ)
  • Weight: W^(l) ∈ ℝ^{d_l × d_{l−1}}
  • Activation function: g^(l)
  • Activation: 𝒂^(l) ∈ ℝ^{d_l}
  𝒂^(l) = W^(l) 𝒉^(l−1), 𝒉^(l) = g^(l)(𝒂^(l)) = g^(l)(W^(l) 𝒉^(l−1))
• w_ij^(l): weight from the j-th neuron of the (l−1)-th layer to the i-th neuron of the l-th layer
• Note: the layer number l conflicts with the symbol l used for the instance-wise loss

52. How to train weights in multi-layer NNs
• We have no explicit supervision signals for the internal (hidden) inputs/outputs 𝒉^(1), …, 𝒉^(L−1)
• Having said that, SGD only needs the value of the gradient of the loss with respect to every weight w_ij^(l) in MLPs
• Can we compute the value of this gradient for every weight w_ij^(l)?
  • Yes! Backpropagation can do that!!

53. Backpropagation
52
 Commonly used in deep neural networks
 Formulas for backpropagation look complicated
 However:
 We can understand backpropagation easily if we know
the concept of computation graph
 Most deep learning frameworks implement
backpropagation by using automatic differentiation
 Let’s see computation graph and automatic
differentiation first

54. Computation graph: f(x, y, z) = (x + y)z
Example from: http://cs231n.github.io/optimization-2/
Forward pass (the value of a variable is written above an arrow):
  x = −2, y = 5, z = −4
  q = x + y = 3
  f = qz = −12

55. Automatic Differentiation (AD): f(x, y, z) = (x + y)z
Example from: http://cs231n.github.io/optimization-2/
Forward pass (the value of a variable is written above an arrow):
  x = −2, y = 5, z = −4
  q = x + y = 3
  f = qz = −12
Backward pass (the gradient of the output with respect to the variable is written below an arrow):
  ∂f/∂f = 1
  ∂f/∂q = 1 × z = −4
  ∂f/∂z = 1 × q = 3
  ∂f/∂x = ∂f/∂q × ∂q/∂x = −4 × 1 = −4
  ∂f/∂y = ∂f/∂q × ∂q/∂y = −4 × 1 = −4
Compare with:
  ∂f/∂x = z = −4, ∂f/∂y = z = −4, ∂f/∂z = x + y = 3
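
To make the forward and backward passes concrete, here is a toy reverse-mode AD sketch that supports only addition and multiplication. The class name `Var` and its design are assumptions for this illustration; deep learning frameworks implement the same idea with far more machinery.

```python
class Var:
    """A node of a computation graph holding a value and a gradient."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        # d(x + y)/dx = 1, d(x + y)/dy = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(xy)/dx = y, d(xy)/dy = x
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the nodes so that each node is processed after all its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # chain rule, summed over branches

x, y, z = Var(-2.0), Var(5.0), Var(-4.0)
q = x + y
f = q * z
f.backward()
print(f.value, x.grad, y.grad, z.grad)
```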

56. Automatic Differentiation (Baydin+ 2018)
• AD computes derivatives by using the chain rule
  • Function values are computed in the forward pass
  • Derivatives are computed with respect to:
    • Every variable (in reverse-mode accumulation)
    • A specific variable (in forward-mode accumulation)
• Do not confuse AD with these:
  • Numerical differentiation: e.g., df(x)/dx ≈ (f(x + h) − f(x)) / h
  • Symbolic differentiation: e.g., Mathematica, sympy
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.

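57. Rules for reverse-mode Automatic Differentiation
• Add (z = x + y): ∂z/∂x = 1 and ∂z/∂y = 1, so the incoming gradient is passed to both inputs unchanged
• Multiply (z = xy): ∂z/∂x = y and ∂z/∂y = x, so the incoming gradient is multiplied by the value of the other input
• Function application (z = f(x)): the incoming gradient is multiplied by f′(x)
• Branch (a variable used in two places): the incoming gradients are summed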

58. Exercise: AD on computation graph
• Write a computation graph for
  l(𝒘) = −log σ(𝒘 ⋅ 𝒙) = −log { 1 / (1 + e^{−𝒘⋅𝒙}) }
• Consider 𝒙 = (1, 1, 1)⊺ and 𝒘 = (1, 1, −1.5)⊺
• Compute the value of ∂l/∂𝒘

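59. Computing ∂l/∂𝒘
Forward pass (𝒙 = (1, 1, 1)⊺, 𝒘 = (1, 1, −1.5)⊺):
  w_1 x_1 = 1, w_2 x_2 = 1, w_3 x_3 = −1.5
  a = w_1 x_1 + w_2 x_2 + w_3 x_3 = 0.5
  −a = −0.5
  exp(−a) = 0.6065
  1 + exp(−a) = 1.6065
  1 / (1 + exp(−a)) = 0.6225
  log 0.6225 = −0.4740
  l = −log 0.6225 = 0.4740
Local derivatives:
  z = xy: ∂z/∂x = y; z = x + y: ∂z/∂x = 1; z = −x: ∂z/∂x = −1; z = exp(x): ∂z/∂x = exp(x);
  z = x + 1: ∂z/∂x = 1; z = 1/x: ∂z/∂x = −(1/x)^2; z = log x: ∂z/∂x = 1/x
Backward pass (gradient of l with respect to each intermediate value):
  l: 1
  log output: 1 × (−1) = −1
  1 / (1 + exp(−a)): −1 × (1 / 0.6225) = −1.6065
  1 + exp(−a): −1.6065 × (−(1 / 1.6065)^2) = 0.6224
  exp(−a): 0.6224 × 1 = 0.6224
  −a: 0.6224 × exp(−0.5) = 0.3775
  a: 0.3775 × (−1) = −0.3775
  each product w_k x_k: −0.3775 × 1 = −0.3775
  ∂l/∂w_1 = −0.3775 × x_1 = −0.3775 (the same for w_2 and w_3 because x_2 = x_3 = 1)
  ∂l/∂x_3 = −0.3775 × w_3 = 0.5663
Update: w_1 ⟵ w_1 − η ∂l/∂w_1 = w_1 + 0.3775 η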
59
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

60
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

62. Training SLP using SGD with pytorch
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], 𝒚 = (1, 1, 1, 0)⊺, 𝒘 = (0, 0, 0)⊺
x.mm(w): matrix-vector multiplication (X𝒘): (4 × 1)
sigmoid(): element-wise sigmoid function: (4 × 1)
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

63. Training MLP using SGD with pytorch
62
Added weights for the second layer
Changed for two-layer perceptron
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

64. Training SLP with high-level NN modules
63
The definition of the shape
of the network and the loss
function (bias=True for
including weights for bias
terms)
We can implement this part
in a generic manner, i.e.,
independently of the model
We no longer append 1 (bias) to every
instance because torch.nn.Linear
automatically includes a bias weight
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

65. Training MLP with high-level NN modules
64
The essence of the
change from SLP to MLP
We don’t have to modify
this part to implement MLP
(the number of iterations was
changed from 100 to 1000 because
we have more parameters to train)
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

66. SLP with high-level NN modules and optimizers
65
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

67. MLP with high-level NN modules and optimizers
66
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

68. SLP with a customizable NN class
67
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

69. MLP with a customizable NN class
68
https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

70. Manual derivation: Gradients for the final layer
• The same as single-layer NNs
  ∂ℓ/∂w_1j^(L) = −(y − ŷ) h_j^(L−1)
  • Here, we omit an index for instance for simplicity
  • We write the instance-wise loss as ℓ to avoid the notation conflict with the layer number l
Final layer: ŷ = h^(L) = σ(a^(L)), a^(L) = W^(L) 𝒉^(L−1)

71. Manual derivation: Gradients for the internal layers (1/2)
Forward pass from layer l to layer l + 1 (two neurons each):
  h_1^(l) = g(a_1^(l)), h_2^(l) = g(a_2^(l))
  a_1^(l+1) = w_11^(l+1) h_1^(l) + w_12^(l+1) h_2^(l)
  a_2^(l+1) = w_21^(l+1) h_1^(l) + w_22^(l+1) h_2^(l)
Backward pass, with δ_j^(l) = ∂ℓ/∂a_j^(l):
  δ_1^(l) = g′(a_1^(l)) (w_11^(l+1) δ_1^(l+1) + w_21^(l+1) δ_2^(l+1))
  δ_2^(l) = g′(a_2^(l)) (w_12^(l+1) δ_1^(l+1) + w_22^(l+1) δ_2^(l+1))
Deriving the recursive formula of δ

72. Manual derivation: Gradients for the internal layers (2/2)
• General form of the recursive formula of δ,
  δ_j^(l) = ∂ℓ/∂a_j^(l) = g′(a_j^(l)) Σ_i w_ij^(l+1) δ_i^(l+1)
• Gradient for the internal layer,
  ∂ℓ/∂w_ij^(l) = (∂ℓ/∂a_i^(l)) (∂a_i^(l)/∂w_ij^(l)) = δ_i^(l) h_j^(l−1)
Here: 𝒉^(l) = g^(l)(𝒂^(l)), 𝒂^(l) = W^(l) 𝒉^(l−1)
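
The recursion can be tried on a small network. The sketch below assumes a 3 → 2 → 1 network with sigmoid activations, where the input's last component is the constant 1 for the bias and the output layer has no bias. It computes the gradients with the backward recursion and compares them with finite differences. All names and parameter values are assumptions for this example.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W1, W2, x):
    a1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h1 = [sigmoid(a) for a in a1]
    a2 = sum(w * h for w, h in zip(W2, h1))
    return h1, sigmoid(a2)

def loss(W1, W2, x, y):
    _, y_hat = forward(W1, W2, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backprop(W1, W2, x, y):
    h1, y_hat = forward(W1, W2, x)
    delta2 = -(y - y_hat)                        # final layer
    grad_W2 = [delta2 * h for h in h1]           # delta times the layer input
    # Recursion: delta_j = g'(a_j) * w_1j * delta2, with g'(a) = h (1 - h) for sigmoid.
    delta1 = [h * (1 - h) * w * delta2 for h, w in zip(h1, W2)]
    grad_W1 = [[d * xi for xi in x] for d in delta1]
    return grad_W1, grad_W2

def numeric(W1, W2, x, y, eps=1e-6):
    """Central finite differences for every weight."""
    g1 = [[0.0] * len(row) for row in W1]
    for i in range(len(W1)):
        for j in range(len(W1[i])):
            old = W1[i][j]
            W1[i][j] = old + eps
            up = loss(W1, W2, x, y)
            W1[i][j] = old - eps
            down = loss(W1, W2, x, y)
            W1[i][j] = old
            g1[i][j] = (up - down) / (2 * eps)
    g2 = []
    for j in range(len(W2)):
        old = W2[j]
        W2[j] = old + eps
        up = loss(W1, W2, x, y)
        W2[j] = old - eps
        down = loss(W1, W2, x, y)
        W2[j] = old
        g2.append((up - down) / (2 * eps))
    return g1, g2

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [rng.uniform(-1, 1) for _ in range(2)]
x, y = [1.0, 0.0, 1.0], 1

(b1, b2), (n1, n2) = backprop(W1, W2, x, y), numeric(W1, W2, x, y)
max_diff = max(
    [abs(p - q) for rp, rq in zip(b1, n1) for p, q in zip(rp, rq)]
    + [abs(p - q) for p, q in zip(b2, n2)]
)
print(max_diff)
```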

73. Summary
• We can use SGD only if we can compute gradients of all parameters
  • Even if we have no explicit supervision signals for internal layers
• AD computes derivatives on a computation graph by using the chain rule
• AD is employed in most deep learning frameworks
  • We only need to implement an algorithm for a forward pass, i.e., how a model computes an output given an input
  • We can concentrate on designing a structure of neural network
  • This boosted the speed of research and development
• Manual derivation of gradients is tedious and error-prone

74. An Intuitive Explanation of Universal
Approximation Theorem for Multi-Layer NN
73

75. Universal approximation theorem (Cybenko, 1989)
• Let I_m denote the m-dimensional unit cube [0,1]^m and C(I_m) denote the space of continuous functions on I_m
• Given any ε > 0 and any function f ∈ C(I_m), there exist an integer N, real constants v_i, b_i ∈ ℝ, and real vectors 𝒘_i ∈ ℝ^m that define a function F,
  F(𝒙) = Σ_{i=1}^N v_i σ(𝒘_i ⋅ 𝒙 + b_i),
  such that the function F approximates the function f:
  |F(𝒙) − f(𝒙)| < ε for all 𝒙 ∈ I_m
• This still holds when replacing I_m with any compact subset of ℝ^m and σ(.) with some other activation functions
George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.

76. What does the theorem state?
Neural networks with a single hidden layer can closely approximate any smooth function
The Universal Approximation Theorem for neural networks. https://www.youtube.com/watch?v=Ijqkc7OLenI (6:24)

77. Essence: Smooth function approximated by spikes
[Figure: a smooth function approximated by narrow spikes, each built from sigmoid units]
These shapes can be realized by choosing appropriate values w_i, b_i for σ(w_i x + b_i)
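
The spike idea can be sketched numerically: the difference of two steep sigmoids is close to 1 inside an interval and close to 0 outside, and a weighted sum of such spikes follows a smooth function. The steepness value, the number of bins, and the target function f(x) = x² are assumptions chosen for illustration.

```python
import math

def sigmoid(a):
    """Numerically stable sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def spike(x, left, right, height, steepness=1000.0):
    """Two sigmoid units with output weights +height and -height."""
    return height * (sigmoid(steepness * (x - left))
                     - sigmoid(steepness * (x - right)))

def approximate(f, n_bins):
    """Sum of spikes on [0, 1], one spike per bin, scaled by f at the bin centre."""
    edges = [k / n_bins for k in range(n_bins + 1)]

    def G(x):
        return sum(spike(x, lo, hi, f((lo + hi) / 2))
                   for lo, hi in zip(edges, edges[1:]))

    return G

def f(x):
    return x * x

G = approximate(f, 20)
mids = [(k + 0.5) / 20 for k in range(20)]
max_err = max(abs(G(m) - f(m)) for m in mids)
print(max_err, abs(G(0.33) - f(0.33)))
```

Increasing the number of bins narrows the spikes and reduces the error between the bin centres.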

78. Summary of this lecture
• Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
• Multi-layer neural networks can realize any logical function, including XOR
• We can train single/multi-layer NNs by using gradient-based methods
  • By implementing graph structures of NNs in a programming language
  • With automatic differentiation in deep learning frameworks
• Neural networks with a single hidden layer can approximate any smooth function

79. References
78
 Michael Nielsen. 2017. Neural networks and deep learning.
http://neuralnetworksanddeeplearning.com/ (Japanese translation: