
Feedforward Neural Network (I): Binary Classification


Binary classification, Threshold Logic Units (TLUs), Single-layer Perceptron (SLP), Perceptron algorithm, sigmoid function, Stochastic Gradient Descent (SGD), Multi-layer Neural Networks, Backpropagation, Computation Graph, Automatic Differentiation, Universal Approximation Theorem

Naoaki Okazaki

July 28, 2020


Transcript

  1. Feedforward Neural Network (I):
    Binary Classification
    Naoaki Okazaki
    School of Computing,
    Tokyo Institute of Technology
    [email protected]
    PowerPoint template designed by https://ppt.design4u.jp/template/


  2. Highlights of this lecture
     Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
     Multi-layer neural networks can realize any logical function, including XOR
     We can train single/multi-layer NNs by using gradient-based methods
      By implementing graph structures of NNs in a programming language
      With automatic differentiation in deep learning frameworks
     Neural networks with a single hidden layer can approximate any smooth function
    1

  3. Threshold Logic Unit (TLU)
    2


  4. 3
    Recap: Logical connectives
    https://vanya.jp.net/dc/
    AND: x1 ∧ x2
    OR: x1 ∨ x2
    NOT: ¬x1
    NAND (NOT of AND): ¬(x1 ∧ x2)
    NOR (NOT of OR): ¬(x1 ∨ x2)
    XOR (exclusive OR): x1 ⊕ x2

  5. Logical circuits used in daily life
    4
    [1] https://response.jp/article/img/2017/10/02/300517/1228814.html
    [2] https://toshiba.semicon-storage.com/info/docget.jsp?did=67518&prodName=TC74HC00AP
    [3] http://download.intel.com/pressroom/kits/corei7/images/Core_i7_300.jpg
    Activated by the OR of pressed states of
    all buttons in the Shinkansen
    Signs in Shinkansen [1]
    Intel® Core™ i7 Processor [3]
    Logic IC (TC74HC00AP) [2]


  6. Recap: Functionally complete set {AND, OR, NOT}
    5
    Any truth table (a Boolean function with d inputs: {0,1}^d ↦ {0,1}) can be
    expressed by a combination of the logical connectives AND, OR, and NOT
    Example (XOR):
    x1  x2  Out  Step 1
    0   0   0
    0   1   1    ¬x1 ∧ x2
    1   0   1    x1 ∧ ¬x2
    1   1   0
    Step 2: (¬x1 ∧ x2) ∨ (x1 ∧ ¬x2)
    Each logical formula yields 1 only when the corresponding input is given;
    the overall output is 1 when any of the logical formulas is satisfied
    Rough explanation: Disjunctive Normal Form (DNF)
    We can convert a truth table into a logical formula in a systematic way with {AND, OR, NOT}:
    1. For each row in the table where the output is true, take the AND of all the inputs:
       When an input (column) of the row is true, use the input variable as it is
       Otherwise (an input of the row is false), prepend a negation to the input variable
    2. Take the OR of all the formulas obtained in Step 1 (see the sketch below)
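    As a concrete sketch of the two-step procedure above (not part of the original slides), the helper below builds a DNF formula string from any Boolean function; truth_table_to_dnf is a hypothetical name introduced only for illustration.

```python
# A minimal sketch of the DNF construction above: Step 1 builds an AND term per
# true row of the truth table, Step 2 joins the terms with OR.
from itertools import product

def truth_table_to_dnf(f, num_inputs):
    """Return a DNF formula (as a string) reproducing the Boolean function f."""
    terms = []
    for assignment in product([0, 1], repeat=num_inputs):
        if f(*assignment) == 1:
            # Step 1: AND of all inputs, negating the ones that are 0 in this row
            literals = [f"x{i+1}" if v == 1 else f"~x{i+1}"
                        for i, v in enumerate(assignment)]
            terms.append("(" + " & ".join(literals) + ")")
    # Step 2: OR of all the AND terms
    return " | ".join(terms) if terms else "0"

xor = lambda x1, x2: x1 ^ x2
print(truth_table_to_dnf(xor, 2))   # (~x1 & x2) | (x1 & ~x2)
```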

  7. Can we build a logical connective only from input/output pairs?
    6
     In ML terms: train an unknown mapping f from supervision data
      No knowledge about the internal mechanism associating inputs and outputs is required
     We start with an example where all inputs/outputs can be described,
    but this assumption is impossible or impractical in the real world
      Imagine: inputs are natural language questions and outputs are answers
     We expect a learned mapping to predict outputs for unseen inputs
    Input → Output:
    (x1, x2) = (0, 0)  →  y = 0
    (x1, x2) = (0, 1)  →  y = 1
    (x1, x2) = (1, 0)  →  y = 1
    (x1, x2) = (1, 1)  →  y = 1
    y = f(x1, x2), where f is unknown (?)

  8. Realize OR as a mathematical function
    7
     Find a function f that satisfies (x1, x2, y ∈ {0,1}):
      f(0, 0) = 0,  f(0, 1) = 1,  f(1, 0) = 1,  f(1, 1) = 1,  y = f(x1, x2)
     We can manually craft a function like this:
      f(x1, x2) = g(x1 + x2),
      g(a) = 1 (if a > 0), 0 (otherwise)   (step function)

  9. Realize OR with a single-layer neural network
    8
     Finding a function from scratch is hard in general
      We assume a model with parameters
     Assume a single-layer neural network (single-layer NN):
      y = g(w1 x1 + w2 x2 + b)
      Input: x1, x2 ∈ {0,1};  Output: y ∈ {0,1};  Parameters: w1, w2, b ∈ ℝ
     Train the model: find the parameters that can reproduce the
    input/output of the supervision data (OR)
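    As a concrete sketch (not the lecture's notebook), the model above can be written in a few lines of numpy; the parameter values used here are the OR parameters shown two slides later (w1 = w2 = 1, b = −0.5).

```python
# A minimal numpy sketch of the single-layer model y = g(w1*x1 + w2*x2 + b)
# with a step activation.
import numpy as np

def step(a):
    return int(a > 0)                    # g(a) = 1 if a > 0 else 0

def single_layer_nn(x, w, b):
    return step(np.dot(w, x) + b)        # y = g(w . x + b)

w, b = np.array([1.0, 1.0]), -0.5        # OR parameters from the next slide
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, single_layer_nn(np.array(x), w, b))   # reproduces OR: 0, 1, 1, 1
```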

  10. Interactive visualization of single-layer neural networks
    9
    https://chokkan.github.io/deeplearning/demo-slp.html


  11. Parameters realizing logical OR: y = x1 ∨ x2
    10
    x1  x2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   1
    For example: w1 = w2 = 1, b = −0.5, i.e.,
    y = g(x1 + x2 − 0.5)
    (x1, x2) = (0, 0)⊺: y = g((1, 1)(0, 0)⊺ − 0.5) = g(0 − 0.5) = 0
    (x1, x2) = (0, 1)⊺: y = g((1, 1)(0, 1)⊺ − 0.5) = g(1 − 0.5) = 1
    (x1, x2) = (1, 0)⊺: y = g((1, 1)(1, 0)⊺ − 0.5) = g(1 − 0.5) = 1
    (x1, x2) = (1, 1)⊺: y = g((1, 1)(1, 1)⊺ − 0.5) = g(2 − 0.5) = 1

  12. Parameters realizing logical AND: y = x1 ∧ x2
    11
    x1  x2  y
    0   0   0
    0   1   0
    1   0   0
    1   1   1
    For example: w1 = w2 = 1, b = −1.5, i.e.,
    y = g(x1 + x2 − 1.5)
    (x1, x2) = (0, 0)⊺: y = g((1, 1)(0, 0)⊺ − 1.5) = g(0 − 1.5) = 0
    (x1, x2) = (0, 1)⊺: y = g((1, 1)(0, 1)⊺ − 1.5) = g(1 − 1.5) = 0
    (x1, x2) = (1, 0)⊺: y = g((1, 1)(1, 0)⊺ − 1.5) = g(1 − 1.5) = 0
    (x1, x2) = (1, 1)⊺: y = g((1, 1)(1, 1)⊺ − 1.5) = g(2 − 1.5) = 1

  13. Parameters realizing logical NOT: y = ¬x1
    12
    x1  y
    0   1
    1   0
    For example: w1 = −1, w2 = 0, b = 0.5
    (we ignore x2 because a logical NOT has one input), i.e.,
    y = g(−x1 + 0.5)
    x1 = 0: y = g(−1 × 0 + 0.5) = g(0.5) = 1
    x1 = 1: y = g(−1 × 1 + 0.5) = g(−0.5) = 0

  14. Parameters realizing logical NAND: y = ¬(x1 ∧ x2)
    13
    x1  x2  y
    0   0   1
    0   1   1
    1   0   1
    1   1   0
    For example: w1 = w2 = −1, b = 1.5, i.e.,
    y = g(−x1 − x2 + 1.5)
    (x1, x2) = (0, 0)⊺: y = g((−1, −1)(0, 0)⊺ + 1.5) = g(0 + 1.5) = 1
    (x1, x2) = (0, 1)⊺: y = g((−1, −1)(0, 1)⊺ + 1.5) = g(−1 + 1.5) = 1
    (x1, x2) = (1, 0)⊺: y = g((−1, −1)(1, 0)⊺ + 1.5) = g(−1 + 1.5) = 1
    (x1, x2) = (1, 1)⊺: y = g((−1, −1)(1, 1)⊺ + 1.5) = g(−2 + 1.5) = 0

  15. Can we find parameters that realize logical XOR?
    14
    x1  x2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   0
    Can we find parameter values w1, w2, b such that they
    reproduce the logical XOR?

  16. Single-layer NNs cannot realize XOR (Minsky and Papert, 1969)
    15
     The decision rule for outputting y = 1:
      w1 x1 + w2 x2 + b > 0  ⟺  x2 > −(w1/w2) x1 − b/w2
     This draws a line with the slope −w1/w2 and y-intercept −b/w2
     However, it is impossible to draw a
    single line that separates the true/false
    outputs of the XOR logic
     We say that the XOR inputs are not linearly
    separable (linearly inseparable)
    (Plot: the TRUE and FALSE points of XOR cannot be separated by a single line)
    Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

  17. How can we realize logical XOR?
    16
     Combine logical connectives: y = (x1 ∨ x2) ∧ ¬(x1 ∧ x2)
     Alternatively, draw multiple lines (instead of a single line):
    1. Draw a line for OR
    2. Draw a line for NAND (NOT of AND)
    3. Take the AND of these areas
    x1  x2  x1 ∨ x2  ¬(x1 ∧ x2)  y
    0   0   0        1           0
    0   1   1        1           1
    1   0   1        1           1
    1   1   1        0           0
    (XOR is the AND of OR and NAND)

  18. XOR realized as a combination of single-layer neural networks
    17
    https://chokkan.github.io/deeplearning/demo-mlp.html
     XOR is the AND of OR and NAND:
      y = (x1 ∨ x2) ∧ ¬(x1 ∧ x2) = h1 ∧ h2 = g(h1 + h2 − 1.5)
    where:
      h1 = x1 ∨ x2 = g(x1 + x2 − 0.5)
      h2 = ¬(x1 ∧ x2) = g(−x1 − x2 + 1.5)
    (Diagram: two hidden units h1 and h2 with weights (1, 1) and bias −0.5, and weights (−1, −1) and bias 1.5,
    feeding an output unit with weights (1, 1) and bias −1.5; a sketch follows below)
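    A minimal sketch (not the linked demo) of this construction in plain Python: h1 realizes OR, h2 realizes NAND, and the output unit takes their AND.

```python
# Two stacked single-layer units realizing XOR, using the parameters above.
def step(a):
    return int(a > 0)

def xor_two_layer(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # h1 = x1 OR x2
    h2 = step(-x1 - x2 + 1.5)       # h2 = NOT (x1 AND x2)
    return step(h1 + h2 - 1.5)      # y = h1 AND h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_two_layer(x1, x2))   # 0, 1, 1, 0
```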

  19. Multi-layer neural networks
    18
     By combining (stacking) single-layer NNs, multi-layer neural networks
    (multi-layer NNs) can realize any truth table
      Applicable to linearly inseparable input/output
      Expressive power equivalent to any logic circuit, such as an
    Arithmetic Logic Unit (ALU) or memory
     Note: the step function is the key to the non-linearity
      If we did not use a step function in the first layer:
      y = g(v1 h1 + v2 h2 + c)
        = g(v1 (w11 x1 + w12 x2 + b1) + v2 (w21 x1 + w22 x2 + b2) + c)
        = g((v1 w11 + v2 w21) x1 + (v1 w12 + v2 w22) x2 + (v1 b1 + v2 b2 + c))
      Reduced to a single-layer NN (a numerical check follows below)
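    A quick numerical check of this reduction (a sketch with random weights, not from the slides): with no step function between the layers, the stacked computation agrees exactly with a single composed linear layer.

```python
# Without a nonlinear activation, two stacked linear layers collapse into one.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)   # first layer
v, c = rng.normal(size=2), rng.normal()                # second layer
x = rng.normal(size=2)

two_layer = v @ (W1 @ x + b1) + c                      # no step function in between
one_layer = (v @ W1) @ x + (v @ b1 + c)                # composed single layer
print(np.allclose(two_layer, one_layer))               # True
```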

  20. Generic form: Feedforward Neural Network (FFNN)
    19
    First layer a1: ℝ2 → ℝ3
      h = a1(x) = g1(W_hx x + b_h),  W_hx ∈ ℝ^{3×2}, b_h ∈ ℝ^3
    Second layer a2: ℝ3 → ℝ2
      z = a2(h) = g2(W_zh h + b_z),  W_zh ∈ ℝ^{2×3}, b_z ∈ ℝ^2
    Final layer a3: ℝ2 → ℝ1
      y = a3(z) = g3(W_yz z + b_y),  W_yz ∈ ℝ^{1×2}, b_y ∈ ℝ
    Three-layer neural network: y = a3(a2(a1(x)))
     h and z are called hidden units, states, or layers
     The depth of the neural network is three
     g1, g2, g3 are called activation functions
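    For concreteness, a minimal numpy sketch of the forward pass y = a3(a2(a1(x))) with the shapes above; the random weights and the choice of sigmoid for g1, g2, g3 are illustrative assumptions, not part of the slide.

```python
# Forward pass of the three-layer FFNN (W_hx: 3x2, W_zh: 2x3, W_yz: 1x2).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W_hx, b_h = rng.normal(size=(3, 2)), rng.normal(size=3)
W_zh, b_z = rng.normal(size=(2, 3)), rng.normal(size=2)
W_yz, b_y = rng.normal(size=(1, 2)), rng.normal(size=1)

x = np.array([0.0, 1.0])
h = sigmoid(W_hx @ x + b_h)     # first layer:  R^2 -> R^3
z = sigmoid(W_zh @ h + b_z)     # second layer: R^3 -> R^2
y = sigmoid(W_yz @ z + b_y)     # final layer:  R^2 -> R^1
print(y)
```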

  21. Summary
     The logical units explained here are called Threshold Logic Units (TLUs), the first
    artificial neuron (McCulloch and Pitts, 1943)
     Single-layer NNs can provide a functionally complete set {AND, OR, NOT}
     Single-layer NNs cannot model linearly inseparable data (e.g., XOR)
     Multi-layer NNs (stacking single-layer NNs) can be seen as logical compounds
      Multi-layer NNs can realize any Boolean function {0,1}^d ↦ {0,1}
      Multi-layer NNs can model linearly inseparable data
      We will see that multi-layer NNs can approximately express any smooth function
     We showed the generic form of feedforward neural networks
    20
    W. McCulloch and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133.

  22. Training Single-Layer Neural Networks
    21


  23. How to determine parameters of single-layer NNs
    22
     We saw single-layer NNs realize logical connectives
     By crafting parameters (weights and biases) carefully to realize
    desired connectives
     However, crafting parameters is difficult
     We are sometimes unsure of the internal mechanism associating
    input and output variables
     We want to find parameters automatically from data
     We are interested in determining parameters only from supervision
    data, pairs of inputs and outputs


  24. Supervised learning (training)
    23
     Supervision data (input: x ∈ ℝ^d, output: y ∈ {0,1})
      D = {(x1, y1), …, (xN, yN)}  (N instances)
     Find parameters such that they can reproduce the training
    instances as correctly as possible
     We assume generalization:
      If the parameters reproduce the training instances well, we expect that
    they will work for unseen instances

  25. Supervised learning for single-layer NNs (with new notations)
    24
     For simplicity, we include the bias term b ∈ ℝ in w:
      x(new) = x ⊕ (1) = (x1, x2, …, xd, 1)⊺,  w(new) = w ⊕ (b) = (w1, w2, …, wd, b)⊺
      w(new) ⋅ x(new) = w1 x1 + w2 x2 + ⋯ + wd xd + b  (← original form)
     We introduce a new notation to distinguish a computed output ŷ from
    the gold output y in the supervision data
      D = {(x1, y1), …, (xN, yN)}  (N instances)
      We distinguish two kinds of outputs hereafter:
       ŷn: the output computed (predicted) by the model for the input xn
       yn: the true (gold) output for the input xn in the supervision data
     Training: find w such that,
      ∀n ∈ {1, …, N}: ŷn = g(w ⋅ xn) = yn  (g is the step function)

  26. Perceptron algorithm (Rosenblatt, 1958)
    25
    1. w = 0
    2. η = 1 (for simplicity)
    3. Repeat:
    4.   (xn, yn) ⟵ an instance chosen from D at random
    5.   ŷn ⟵ g(w ⋅ xn)
    6.   if ŷn ≠ yn then:
    7.     if yn = 1 then:
    8.       w ⟵ w + η xn
    9.     else:
    10.      w ⟵ w − η xn
    11. Until no instance updates w
    Frank Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6):386-408.

  27. Exercise: train a single-layer NN to realize OR
    26
     Convert the truth table into training data
     Initialize the weight vector: w = 0
     Apply the perceptron algorithm (previous page) to find w
     Fix η = 1 in this exercise
    x1  x2  y
    0   0   0
    0   1   1
    1   0   1
    1   1   1
    D = { ((0, 0, 1)⊺, 0), ((0, 1, 1)⊺, 1), ((1, 0, 1)⊺, 1), ((1, 1, 1)⊺, 1) }

  28. Updating weights for OR
    27
     D = { ((0, 0, 1)⊺, 0), ((0, 1, 1)⊺, 1), ((1, 0, 1)⊺, 1), ((1, 1, 1)⊺, 1) }
     Initialization: w = (0, 0, 0)⊺
     Iteration #1: choose (x4, y4) = ((1, 1, 1)⊺, 1)
      Classification: ŷ4 = g(w ⋅ x4) = g(0) = 0 ≠ y4
      Update: w ← w + x4 = (1, 1, 1)⊺
     Iteration #2: choose (x1, y1) = ((0, 0, 1)⊺, 0)
      Classification: ŷ1 = g(w ⋅ x1) = g(1) = 1 ≠ y1
      Update: w ← w − x1 = (1, 1, 0)⊺
     Terminate (the weight classifies all instances correctly):
      x = (0, 0, 1)⊺: y = g((1, 1, 0)(0, 0, 1)⊺) = g(0) = 0
      x = (0, 1, 1)⊺: y = g((1, 1, 0)(0, 1, 1)⊺) = g(1) = 1
      x = (1, 0, 1)⊺: y = g((1, 1, 0)(1, 0, 1)⊺) = g(1) = 1
      x = (1, 1, 1)⊺: y = g((1, 1, 0)(1, 1, 1)⊺) = g(2) = 1
    (We chose the instances in the order that minimizes the required number of updates)

  29. Perceptron algorithm implemented in numpy
    28
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

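    The linked notebook is the authoritative implementation; the sketch below is an illustrative rewrite of the perceptron algorithm on the OR data from the exercise, with hypothetical variable names.

```python
# Perceptron algorithm on the OR training data (eta = 1, step activation).
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])   # inputs with a bias feature
y = np.array([0, 1, 1, 1])                                   # OR outputs
w = np.zeros(3)

rng = np.random.default_rng(0)
while not np.array_equal((X @ w > 0).astype(int), y):        # until all instances are correct
    n = rng.integers(len(X))                                 # pick an instance at random
    y_hat = int(X[n] @ w > 0)
    if y_hat != y[n]:
        w += X[n] if y[n] == 1 else -X[n]                    # w <- w +/- x_n
print(w)   # e.g., [1. 1. 0.] as in the worked example (the result depends on the sampling order)
```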

  30. Perceptron algorithm implemented in numpy (matrix version)
    29
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
    X = ( (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1) ),  w = (w1, w2, w3)⊺
    ŷ = g(Xw) = g( (w3, w2 + w3, w1 + w3, w1 + w2 + w3)⊺ )
    (applying the step function to each element)
    y − ŷ = (y1 − ŷ1, y2 − ŷ2, y3 − ŷ3, y4 − ŷ4)⊺
    Scaling each row of X by the corresponding error (yn − ŷn) gives the per-instance updates:
    (y − ŷ) ⊙ X =
      ( 0,        0,        y1 − ŷ1 )
      ( 0,        y2 − ŷ2,  y2 − ŷ2 )
      ( y3 − ŷ3,  0,        y3 − ŷ3 )
      ( y4 − ŷ4,  y4 − ŷ4,  y4 − ŷ4 )

  31. Why the Perceptron algorithm works
    30
     Suppose the parameter w misclassifies (xn, yn)
     If yn = 1:
      Update the weight vector: w′ ⟵ w + xn
      If we classify xn again with the updated weights w′:
       w′ ⋅ xn = (w + xn) ⋅ xn = w ⋅ xn + ‖xn‖² ≥ w ⋅ xn
      The dot product was increased (more likely to be classified as 1)
     Otherwise (if yn = 0):
      Update the weight vector: w′ ⟵ w − xn
      If we classify xn again with the updated weights w′:
       w′ ⋅ xn = (w − xn) ⋅ xn = w ⋅ xn − ‖xn‖² ≤ w ⋅ xn
      The dot product was decreased (more likely to be classified as 0)
     The algorithm updates the parameter in the direction
    where it will classify (xn, yn) more correctly

  32. Summary
    31
     The perceptron algorithm:
      Can find parameters of single-layer NNs for linearly separable data
      Cannot terminate on linearly inseparable data
       Single-layer NNs cannot classify linearly inseparable data
       We must force the algorithm to terminate with incomplete parameters
     Extending the algorithm to multi-layer NNs is non-trivial
      We have no training data for the hidden states
      The famous argument of Minsky and Papert (1969)
     In the next section, we consider gradient-based methods, an
    alternative but standard strategy for training NNs
      Important concepts: sigmoid function and backpropagation
    Marvin Minsky and Seymour Papert. 1969. Perceptrons: an introduction to computational geometry. The MIT Press, Cambridge MA.

  33. Single-Layer NN with Sigmoid Function
    32


  34. Activation function: from step to sigmoid
    33
    Step function: ℝ → {0,1}
      g(a) = 1 (if a > 0), 0 (otherwise)
       Yields binary outputs
       Not differentiable at zero
       Zero gradients everywhere else
    Sigmoid function: ℝ → (0,1)
      σ(a) = 1 / (1 + e^{−a})
       Yields continuous scores
       Differentiable at all points
       Mostly non-zero gradients
       Useful for gradient descent

  35. General form with sigmoid function
    34
     Single-layer NN with sigmoid function:
      ŷ = σ(w ⋅ x) = 1 / (1 + e^{−w⋅x})
    Given an input x, it computes an output ŷ ∈ (0,1) by using the parameter w
     This is also known as logistic regression
      We can interpret ŷ as the conditional probability P(y = 1 | x) that an
    input x is classified as 1 (positive category)
     Rule to classify an input as 1:
      ŷ > 0.5 ⟺ 1 / (1 + e^{−w⋅x}) > 1/2 ⟺ w ⋅ x > 0
      The classification rule is the same as the one when we use the
    step function as the activation function

  36. Example: logical AND
    35
     The same parameters as in the previous example:
      ŷ = σ(a),  a = x1 + x2 − 1.5
     The outputs are acceptable, but
      P(x1 ∧ x2 = 1 | x1 = 1, x2 = 1) is not so high (62.2%)
      There is room for improvement so that the model yields ŷ → 1 (100%) for
    positives (true) and ŷ → 0 (0%) for negatives (false)
    x1  x2  y = x1 ∧ x2    a     ŷ = σ(a)
    0   0   0             −1.5   0.182
    0   1   0             −0.5   0.378
    1   0   0             −0.5   0.378
    1   1   1              0.5   0.622

  37. Instance-wise likelihood
    36
     We introduce the instance-wise likelihood pn to measure how
    well the parameters reproduce (xn, yn):
      pn = ŷn (if yn = 1),  1 − ŷn (otherwise)
     Likelihood is a probability representing the ‘fitness’ of the
    parameters to the training data
     We want to increase the likelihood by changing w
    Parameters of AND: ŷ = σ(a),  a = x1 + x2 − 1.5
    x1  x2  y = x1 ∧ x2    a     ŷ = σ(a)   likelihood
    0   0   0             −1.5   0.182      1 − ŷ = 0.818
    0   1   0             −0.5   0.378      1 − ŷ = 0.622
    1   0   0             −0.5   0.378      1 − ŷ = 0.622
    1   1   1              0.5   0.622      ŷ = 0.622

  38. Likelihood on the training data
    37
     We assume that all instances in the training data are i.i.d.
    (independent and identically distributed)
     We define the likelihood as a joint probability on the data:
      L_D(w) = ∏_{n=1}^{N} pn
     When the training data D = {(x1, y1), …, (xN, yN)} is fixed, the
    likelihood is a function of the parameters w
     Let us maximize L_D(w) by changing w
      This is called Maximum Likelihood Estimation (MLE)
      The maximizer w* reproduces the training data well

  39. Training as a minimization problem
    38
     Products of (0,1) values often cause underflow
     Instead, use the log-likelihood, the logarithm of the likelihood:
      LL_D(w) = log L_D(w) = log ∏_{n=1}^{N} pn = Σ_{n=1}^{N} log pn
     In mathematical optimization, we usually consider a
    minimization problem instead of maximization
     We define an objective function E_D(w) by using the
    negative of the log-likelihood:
      E_D(w) = −LL_D(w) = −Σ_{n=1}^{N} log pn
     E_D(w) is called a loss function or error function

  40. Training as a minimization problem
    39
     Given the training data D = {(x1, y1), …, (xN, yN)}, find w*
    by solving the minimization problem:
      w* = argmin_w E_D(w) = argmin_w Σ_{n=1}^{N} (−l_n(w))
    where l_n(w) is the instance-wise log-likelihood:
      l_n(w) = log pn = log ŷn (if yn = 1),  log(1 − ŷn) (otherwise)
             = yn log ŷn + (1 − yn) log(1 − ŷn)

  41. Stochastic Gradient Descent (SGD)
    40
     The objective function is the sum of the losses of the instances:
      E_D(w) = Σ_{n=1}^{N} (−l_n(w))
     We can use Stochastic Gradient Descent (SGD) and its
    variants (e.g., Adam) for minimizing E_D(w)
     SGD algorithm (T is the number of updates):
    1. Initialize w with random values
    2. for t ⟵ 1 to T:
    3.   η_t ⟵ 1/t  # Learning rate at t
    4.   (xn, yn) ⟵ an instance chosen from D at random
    5.   w ⟵ w − η_t ∂(−l_n(w))/∂w = w + η_t ∂l_n(w)/∂w

  42. Exercise: compute the gradient
    41
    Prove:
      ∂l_n(w)/∂w = (∂l_n/∂ŷn)(∂ŷn/∂an)(∂an/∂w) = (yn − ŷn) xn
    by computing the gradients ∂l_n/∂ŷn, ∂ŷn/∂an, and ∂an/∂w
    Here:
      l_n(w) = yn log ŷn + (1 − yn) log(1 − ŷn),
      ŷn = σ(an) = 1 / (1 + e^{−an}),
      an = w ⋅ xn

  43. Answer: the gradient
    42
     l_n(w) = yn log ŷn + (1 − yn) log(1 − ŷn), so
      ∂l_n/∂ŷn = yn/ŷn + (1 − yn)/(1 − ŷn) ⋅ (−1) = (yn(1 − ŷn) − ŷn(1 − yn)) / (ŷn(1 − ŷn)) = (yn − ŷn) / (ŷn(1 − ŷn))
     ŷn = σ(an) = 1 / (1 + e^{−an}), so
      ∂ŷn/∂an = (−1) ⋅ (1 / (1 + e^{−an}))² ⋅ e^{−an} ⋅ (−1) = (1 / (1 + e^{−an})) ⋅ (e^{−an} / (1 + e^{−an})) = ŷn (1 − ŷn)
     an = w ⋅ xn, so
      ∂an/∂w = xn
    Therefore,
      ∂l_n/∂w = (∂l_n/∂ŷn)(∂ŷn/∂an)(∂an/∂w)
              = ((yn − ŷn) / (ŷn(1 − ŷn))) ⋅ ŷn(1 − ŷn) ⋅ xn
              = (yn − ŷn) xn
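    Since the result ∂l_n/∂w = (yn − ŷn)xn is easy to get wrong by a sign, a quick numerical sanity check can help. The sketch below (not from the slides) compares the analytic gradient with a central finite-difference approximation at an arbitrary w.

```python
# The analytic gradient (y - y_hat) * x should match a finite-difference
# approximation of the instance-wise log-likelihood l_n(w).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, x, y):
    y_hat = sigmoid(w @ x)
    return y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)

x, y = np.array([1.0, 1.0, 1.0]), 1
w = np.array([0.2, -0.3, 0.1])
analytic = (y - sigmoid(w @ x)) * x
numeric = np.array([
    (log_likelihood(w + eps, x, y) - log_likelihood(w - eps, x, y)) / 2e-6
    for eps in 1e-6 * np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```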

  44. SGD elaborated for training single-layer NNs
    43
    1. Initialize w with random values
    2. for t ⟵ 1 to T:
    3.   η_t ⟵ 1/t
    4.   (xn, yn) ⟵ an instance chosen from D at random
    5.   ŷn ⟵ σ(w ⋅ xn)
    6.   w ⟵ w + η_t ∂l_n/∂w = w + η_t (yn − ŷn) xn
         # If yn = ŷn, no need for updating
         # If yn = 1 and ŷn < 1, add xn scaled by (1 − ŷn) to w
         # If yn = 0 and 0 < ŷn, subtract xn scaled by ŷn from w
    The algorithm is the same as the perceptron except for using
    the error (yn − ŷn) to weight the amount of an update

  45. SGD implemented in numpy
    44
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
    X = ( (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1) ),  w = (w1, w2, w3)⊺
    ŷ = σ(Xw) = σ( (w3, w2 + w3, w1 + w3, w1 + w2 + w3)⊺ )
    (applying the sigmoid function to each element)
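    The linked notebook is the authoritative implementation; the sketch below is an illustrative rewrite of SGD for the sigmoid single-layer NN on the OR data, using the update rule w ← w + η_t (yn − ŷn) xn from the previous slide.

```python
# SGD for the single-layer NN with sigmoid activation on the OR data.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])                # OR outputs
w = np.zeros(3)

rng = np.random.default_rng(0)
for t in range(1, 1001):
    eta = 1.0 / t                                 # decaying learning rate
    n = rng.integers(len(X))                      # pick an instance at random
    y_hat = sigmoid(X[n] @ w)
    w += eta * (y[n] - y_hat) * X[n]              # w <- w + eta * (y_n - y_hat_n) x_n
print(w, sigmoid(X @ w))
```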

  46. Note: Why is SGD called ‘stochastic’?
    45
     The objective function is the sum of the losses of the instances:
      E_D(w) = Σ_{n=1}^{N} (−l_n(w))
     Gradient descent:
      w ⟵ w − η_t ∂E_D(w)/∂w = w − η_t Σ_{n=1}^{N} ∂(−l_n(w))/∂w
      Update after computing the loss values and gradients for all training instances
     Stochastic gradient descent: use random samples from the data
      w ⟵ w − η_t ∂E_D(w)/∂w ≈ w − η_t ∂(−l_n(w))/∂w  (for a randomly chosen (xn, yn))
      Approximates the gradient: from all instances → from a randomly selected instance
      Update after computing the loss value and gradients for each training instance
      Faster to reach the minimizer w* of the objective function

  47. Note: What is a learning rate?
    46
     A learning rate determines the step size taken towards the steepest direction
      A large step size may reach the minimum faster, but may jump over the minimum
      A small step size may take too long to converge and may get stuck in a local minimum
     We should decay the learning rates; for a strongly convex function, choose η_t such that:
      Σ_{t=1}^{∞} η_t = ∞,  Σ_{t=1}^{∞} η_t² < ∞
     Various scheduling strategies for the learning rate:
      η_t = η_0 / t,  η_t = η_0 / √t,  AdaGrad, RMSProp
     Strategies used in practice: Stepwise Decay Schedule, Polynomial Schedule, Warming Up
    https://beta.mxnet.io/guide/modules/lr_scheduler.html

  48. Regularization
    47
     MLE often causes over-fitting
      When the training data is linearly separable, ‖w‖ → ∞ as Σ_{n=1}^{N} (−l_n(w)) → 0
      The model is also easily affected by noise in the training data
     We use regularization (MAP estimation)
      We introduce a penalty term when ‖w‖ becomes too large
      The loss function with an L2 regularization term:
       E_D(w) = −Σ_{n=1}^{N} l_n(w) + λ‖w‖²
      λ is a hyperparameter to control the trade-off between
    over-fitting and under-fitting
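    One common way to fold the L2 penalty into SGD is to add the gradient of λ‖w‖², i.e., 2λw, to every update (often called weight decay). The sketch below is not from the slides; it assumes the numpy setup of the earlier SGD sketch, and regularized_sgd_step and its arguments are hypothetical names.

```python
# One SGD step for the L2-regularized objective E = -sum_n l_n(w) + lam * ||w||^2.
import numpy as np

def regularized_sgd_step(w, x_n, y_n, eta, lam=0.01):
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x_n)))
    # instance-wise gradient -(y_n - y_hat) x_n, plus the penalty gradient 2 * lam * w
    return w - eta * (-(y_n - y_hat) * x_n + 2 * lam * w)
```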

  49. Summary
    48
     We used the sigmoid as an activation function
      The model is also known as logistic regression
     We defined the instance-wise likelihood to assess how well the current
    model reproduces a training instance
     Training a model: minimize the loss function by changing the weights
      Loss function: −Σ_{n=1}^{N} ( yn log ŷn + (1 − yn) log(1 − ŷn) )
      Minimizing the loss function is equivalent to maximizing the sum of the
    instance-wise log-likelihoods (i.e., the product of the likelihoods) of all instances
     We showed an algorithm for minimizing the loss function by using
    Stochastic Gradient Descent (SGD)
      The same as the perceptron except for using the error (yn − ŷn) to weight
    the amount of an update

  50. Training Multi-Layer Neural Networks
    with Back Propagation
    49


  51. Generic notation for multi-layer NNs
    50
    First layer: ℝ2 → ℝ3
      h(1) = g(1)(a(1)),  a(1) = W(1) h(0),  W(1) ∈ ℝ^{3×2},  a(1), h(1) ∈ ℝ^3
    Second layer: ℝ3 → ℝ2
      h(2) = g(2)(a(2)),  a(2) = W(2) h(1),  W(2) ∈ ℝ^{2×3},  a(2), h(2) ∈ ℝ^2
    Final layer: ℝ2 → ℝ
      h(3) = g(3)(a(3)),  a(3) = W(3) h(2),  W(3) ∈ ℝ^{1×2},  a(3), h(3) ∈ ℝ
    Inputs: x1 = h1(0), x2 = h2(0);  output: h1(3) = ŷ
     The l-th layer (l ∈ {1, …, L}) consists of:
      Input: h(l−1) ∈ ℝ^{d_{l−1}}  (h(0) = x)
      Output: h(l) ∈ ℝ^{d_l}  (h(L) = ŷ)
      Weight: W(l) ∈ ℝ^{d_l × d_{l−1}}
      Activation function: g(l)
      Activation: a(l) ∈ ℝ^{d_l}
      h(l) = g(l)(W(l) h(l−1))
      w_{ij}^{(l)}: weight from the j-th neuron to the i-th neuron of the l-th layer
    Please accept the notational conflict between an instance-wise loss l_n and a layer number l

  52. How to train weights in multi-layer NNs
    51
     We have no explicit supervision signals for the internal
    (hidden) inputs/outputs h(1), …, h(L−1)
     Having said that, SGD only needs the value of the gradient
    ∂l_n/∂w_{ij}^{(l)} for every weight w_{ij}^{(l)} in MLPs
     Can we compute the value of ∂l_n/∂w_{ij}^{(l)} for every weight w_{ij}^{(l)}?
      Yes! Backpropagation can do that!!

  53. Backpropagation
    52
     Commonly used in deep neural networks
     Formulas for backpropagation look complicated
     However:
     We can understand backpropagation easily if we know
    the concept of computation graph
     Most deep learning frameworks implement
    backpropagation by using automatic differentiation
     Let’s see computation graph and automatic
    differentiation first


  54. Computation graph: f(x, y, z) = (x + y) z
    53
    Example from: http://cs231n.github.io/optimization-2/
    Forward pass (the value of each variable is written above its arrow):
      x = −2,  y = 5,  z = −4
      q = x + y = 3
      f = q z = −12

  55. Automatic Differentiation (AD): f(x, y, z) = (x + y) z
    54
    Example from: http://cs231n.github.io/optimization-2/
    Forward pass (the value of each variable is written above its arrow):
      x = −2,  y = 5,  z = −4,  q = x + y = 3,  f = q z = −12
    Backward pass (reverse-mode AD; the gradient of the output with respect to each variable is written below its arrow):
      ∂f/∂f = 1
      ∂f/∂q = z × 1 = −4
      ∂f/∂z = q × 1 = 3
      ∂f/∂x = ∂q/∂x × (−4) = 1 × (−4) = −4
      ∂f/∂y = ∂q/∂y × (−4) = 1 × (−4) = −4
    Compare with the analytic gradients:
      ∂f/∂x = z = −4,  ∂f/∂y = z = −4,  ∂f/∂z = x + y = 3

  56. Automatic Differentiation (Baydin+ 2018)
    55
     AD computes derivatives by using the chain rule
      Function values are computed in the forward pass
      Derivatives are computed with respect to:
       Every variable (in reverse-mode accumulation)
       A specific variable (in forward-mode accumulation)
     Do not confuse AD with these:
      Numerical differentiation: e.g., df(x)/dx ≈ (f(x + h) − f(x)) / h
      Symbolic differentiation: e.g., Mathematica, sympy
    Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.

  57. Rules for reverse-mode Automatic Differentiation
    56
    Add: z = x + y
      ∂f/∂x = ∂f/∂z,  ∂f/∂y = ∂f/∂z  (the incoming gradient is passed to both inputs)
    Multiply: z = x y
      ∂f/∂x = y ∂f/∂z,  ∂f/∂y = x ∂f/∂z
    Function application: z = g(x)
      ∂f/∂x = g′(x) ∂f/∂z
    Branch: x feeds into both z1 and z2
      ∂f/∂x = ∂f/∂z1 + ∂f/∂z2  (gradients from the branches are summed)

  58. Exercise: AD on a computation graph
    57
     Write a computation graph for
      l(x, w) = −log σ(w ⋅ x) = −log (1 / (1 + e^{−w⋅x}))
     Consider x = (1, 1, 1)⊺ and w = (1, 1, −1.5)⊺
     Compute the value of l
     Compute the gradients ∂l/∂w

  59. Computing ∂l/∂w using AD
    58
    Forward pass (with x = (1, 1, 1)⊺, w = (1, 1, −1.5)⊺):
      a = w1 x1 + w2 x2 + w3 x3 = 1 + 1 − 1.5 = 0.5
      u = −a = −0.5
      v = e^u = 0.6065
      s = v + 1 = 1.6065
      r = 1/s = 0.6225   (= σ(w ⋅ x))
      t = log r = −0.4740
      l = −t = 0.4740
    Backward pass (reverse-mode AD, applying the rules on the previous slide):
      ∂l/∂t = −1
      ∂l/∂r = (1/r) × (−1) = −1.6065
      ∂l/∂s = (−1/s²) × (−1.6065) = 0.6224
      ∂l/∂v = 1 × 0.6224 = 0.6224
      ∂l/∂u = e^u × 0.6224 = 0.3775
      ∂l/∂a = (−1) × 0.3775 = −0.3775
      ∂l/∂w1 = x1 × (−0.3775) = −0.3775
      ∂l/∂w2 = x2 × (−0.3775) = −0.3775
      ∂l/∂w3 = x3 × (−0.3775) = −0.3775
      ∂l/∂x1 = w1 × (−0.3775) = −0.3775
      ∂l/∂x2 = w2 × (−0.3775) = −0.3775
      ∂l/∂x3 = w3 × (−0.3775) = 0.5663
    This agrees with the analytic result ∂l/∂w = (ŷ − 1) x, with ŷ = σ(w ⋅ x) = 0.6225

  60. Computing gradients with autograd
    59
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

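    The notebook linked above contains the actual code; the sketch below assumes the HIPS autograd package and recomputes the gradient of the loss from the exercise on slide 58.

```python
# Gradient of l(x, w) = -log(sigmoid(w . x)) via the autograd package (a sketch).
import autograd.numpy as np
from autograd import grad

def loss(w, x):
    return -np.log(1.0 / (1.0 + np.exp(-np.dot(w, x))))

x = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 1.0, -1.5])
grad_loss = grad(loss)            # gradient with respect to the first argument (w)
print(loss(w, x))                 # ~0.4740
print(grad_loss(w, x))            # ~[-0.3775, -0.3775, -0.3775]
```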

  61. Computing gradients with pytorch
    60
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

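    Again, the linked notebook is authoritative; the sketch below shows the same gradient computed with PyTorch's autograd by marking w with requires_grad and calling backward().

```python
# Same loss as the exercise, differentiated by PyTorch's automatic differentiation.
import torch

x = torch.tensor([1.0, 1.0, 1.0])
w = torch.tensor([1.0, 1.0, -1.5], requires_grad=True)
loss = -torch.log(torch.sigmoid(w @ x))
loss.backward()
print(loss.item())                # ~0.4740
print(w.grad)                     # ~tensor([-0.3775, -0.3775, -0.3775])
```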

  62. Training SLP using SGD with pytorch
    61
    x = ( (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1) ),  y = (1, 1, 1, 0)⊺,  w = (0, 0, 0)⊺
    x.mm(w): matrix-vector multiplication (a): (4 × 1)
    sigmoid(a): element-wise sigmoid function (ŷ): (4 × 1)
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb
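    A sketch of what such a training loop can look like (an illustrative rewrite, not the notebook itself); note that with only four instances the loop below uses full-batch gradient descent rather than sampling one instance per step, and the learning rate and iteration count are arbitrary choices.

```python
# Training a single-layer NN with manual gradient updates in PyTorch.
import torch

x = torch.tensor([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = torch.tensor([[1.], [1.], [1.], [0.]])        # targets as shown on the slide above
w = torch.zeros(3, 1, requires_grad=True)

eta = 0.5
for t in range(1000):
    y_hat = torch.sigmoid(x.mm(w))                # (4 x 1) predictions
    loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad                         # gradient descent step
        w.grad.zero_()
print(torch.sigmoid(x.mm(w)))
```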

  63. Training MLP using SGD with pytorch
    62
    Added weights for the second layer
    Changed for two-layer perceptron
    Updates for the new parameters
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  64. Training SLP with high-level NN modules
    63
    The definition of the shape of the network and the loss function
    (bias=True for including weights for the bias terms)
    We can implement this part in a generic manner, i.e.,
    independently of the model
    We no longer append 1 (bias) to every instance because torch.nn.Linear
    automatically includes a bias weight
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

  65. Training MLP with high-level NN modules
    64
    The essence of the
    change from SLP to MLP
    We don’t have to modify
    this part to implement MLP
    (the number of iterations was
    changed from 100 to 1000 because
    we have more parameters to train)
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  66. SLP with high-level NN modules and optimizers
    65
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  67. MLP with high-level NN modules and optimizers
    66
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  68. SLP with a customizable NN class
    67
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb


  69. MLP with a customizable NN class
    68
    https://github.com/chokkan/deeplearning/blob/master/notebook/binary.ipynb

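    A sketch in the spirit of slides 68-69 (not the notebook itself): the MLP is a torch.nn.Module subclass, and the training loop uses nn.BCELoss and torch.optim.SGD; the layer sizes, learning rate, iteration count, and XOR targets are illustrative choices.

```python
# A customizable MLP as a torch.nn.Module subclass, trained with a high-level
# loss function and optimizer.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in=2, d_hidden=2, d_out=1):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden, bias=True)
        self.layer2 = nn.Linear(d_hidden, d_out, bias=True)

    def forward(self, x):
        h = torch.sigmoid(self.layer1(x))
        return torch.sigmoid(self.layer2(h))

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])        # XOR targets (linearly inseparable)
model = MLP()
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(10000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(model(x))
```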

  70. Manual derivation: Gradients for the final layer
    69
     The same as for single-layer NNs:
      ∂l/∂w_{1j}^{(L)} = (y − ŷ) h_j^{(L−1)}
     Here, we omit the index n for instances for simplicity,
    and write the instance-wise loss as l to avoid the notation conflict noted earlier
    (Diagram: ŷ = h_1^{(L)} = σ(a^{(L)}),  a^{(L)} = W^{(L)} h^{(L−1)},
    with inputs h_1^{(L−1)}, h_2^{(L−1)})

  71. Manual derivation: Gradients for the internal layers (1/2)
    70
    Deriving the recursive formula for δ with a two-unit example.
    Define δ_i^{(l)} = ∂l/∂a_i^{(l)}. The forward computations are:
      h_j^{(l)} = g(a_j^{(l)})
      a_1^{(l+1)} = w_{11}^{(l+1)} h_1^{(l)} + w_{12}^{(l+1)} h_2^{(l)}
      a_2^{(l+1)} = w_{21}^{(l+1)} h_1^{(l)} + w_{22}^{(l+1)} h_2^{(l)}
    Backpropagating through them with the chain rule:
      δ_1^{(l)} = g′(a_1^{(l)}) ( w_{11}^{(l+1)} δ_1^{(l+1)} + w_{21}^{(l+1)} δ_2^{(l+1)} )
      δ_2^{(l)} = g′(a_2^{(l)}) ( w_{12}^{(l+1)} δ_1^{(l+1)} + w_{22}^{(l+1)} δ_2^{(l+1)} )

  72. Manual derivation: Gradients for the internal layers (2/2)
    71
     General form of the recursive formula for δ:
      δ_j^{(l)} = ∂l/∂a_j^{(l)} = g′(a_j^{(l)}) Σ_k w_{kj}^{(l+1)} δ_k^{(l+1)}
     Gradient for an internal layer:
      ∂l/∂w_{ji}^{(l)} = (∂l/∂a_j^{(l)}) (∂a_j^{(l)}/∂w_{ji}^{(l)}) = δ_j^{(l)} h_i^{(l−1)}
    (Diagram: a^{(l)} = W^{(l)} h^{(l−1)},  h^{(l)} = g^{(l)}(a^{(l)}),
    with inputs h_1^{(l−1)}, h_2^{(l−1)}, h_3^{(l−1)} and outputs h_1^{(l)}, h_2^{(l)})

  73. Summary
    72
     We can use SGD as long as we can compute the gradients of all parameters
      Even if we have no explicit supervision signals for the internal layers
     Automatic Differentiation (AD) can compute gradients systematically
      AD computes derivatives on a computation graph by using the chain rule
      AD realizes backpropagation without manual derivation of gradients
     AD is employed in most deep learning frameworks
      We only need to implement the algorithm for the forward pass, i.e., how a
    model computes an output given an input
      We can concentrate on designing the structure of a neural network
      This boosted the speed of research and development
      Manual derivation of gradients is tedious and error-prone

  74. An Intuitive Explanation of Universal
    Approximation Theorem for Multi-Layer NN
    73


  75. Universal approximation theorem (Cybenko, 1989)
    74
     Let I_d denote the d-dimensional unit cube [0,1]^d and C(I_d)
    denote the space of continuous functions on I_d
     Given any ε > 0 and any function f ∈ C(I_d), there exist
    an integer N, real constants v_i, b_i ∈ ℝ, and real vectors
    w_i ∈ ℝ^d that define a function F,
      F(x) = Σ_{i=1}^{N} v_i σ(w_i ⋅ x + b_i),
    such that the function F approximates the function f:
      |F(x) − f(x)| < ε  for all x ∈ I_d
     This still holds when replacing I_d with any compact
    subset of ℝ^d and σ(⋅) with some other activation functions
    George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.

  76. What does the theorem state?
    75
    Neural networks with a single hidden layer can approximate
    any smooth function closely
    The Universal Approximation Theorem for neural networks. https://www.youtube.com/watch?v=Ijqkc7OLenI (6:24)

  77. Essence: Smooth function approximated by spikes
    76
    (Figure: a smooth curve approximated by a sum of spike-shaped bumps with
    heights such as 0.4, −0.4, 0.3, −0.3)
    These shapes can be realized by choosing appropriate values w_i, b_i for σ(w_i x + b_i)

  78. Summary of this lecture
     Single-layer neural networks can realize logical AND, OR, and NOT, but cannot realize XOR
     Multi-layer neural networks can realize any logical function, including XOR
     We can train single/multi-layer NNs by using gradient-based methods
      By implementing graph structures of NNs in a programming language
      With automatic differentiation in deep learning frameworks
     Neural networks with a single hidden layer can approximate any smooth function
    77

  79. References
    78
     Michael Nielsen. 2017. Neural networks and deep learning.
    http://neuralnetworksanddeeplearning.com/ (Japanese translation:
    https://nnadl-ja.github.io/nnadl_site_ja/)
     Raul Rojas. 1996. Neural Networks - A Systematic Introduction.
    Springer-Verlag. (Available at https://page.mi.fu-berlin.de/rojas/neural/)
     Koki Saitoh. 2016. Deep Learning from Scratch (ゼロから作るDeep Learning). O'Reilly Japan.
     Learning PyTorch with Examples.
    https://pytorch.org/tutorials/beginner/pytorch_with_examples.html