Slide 1

Chapter: Before we begin: the mathematical building blocks of neural networks
Tomoki Tanimura, Keio University

Slide 2

This chapter covers
• A first example of a neural network
• Tensors and tensor operations
• How neural networks learn via backpropagation and gradient descent
Our goal in this chapter will be to build your intuition about these notions without getting overly technical.

Slide 3

This chapter covers
• A first example of a neural network
• Tensors and tensor operations
• How neural networks learn via backpropagation and gradient descent
Our goal in this chapter will be to build your intuition about these notions without getting overly technical.

Slide 4

MNIST Dataset
• MNIST (Mixed National Institute of Standards and Technology)
• Classify grayscale images of handwritten digits (28 x 28 pixels) into 10 categories (0 through 9)
• 60,000 training images + 10,000 test images
[Figure: sample digit images (28 x 28) with their data and labels]

Slide 5

Load MNIST Dataset
[Code: import Keras (Python library) and load the dataset]
• The model is trained on the training dataset and then evaluated on the test dataset
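
A minimal sketch of this load step using the standard Keras API (the slide's exact code may differ slightly):

```python
# Load the MNIST dataset that ships with Keras.
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print(train_images.shape)  # (60000, 28, 28)
print(test_images.shape)   # (10000, 28, 28)
```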

Slide 6

The Details of the Dataset
• Training set
• Test set
[Figure: the training set as a stack of 60,000 images of 28 x 28 pixels, with labels 0, 2, 3, …]

Slide 7

More Detail of the Image

Slide 8

Calculate an Image

Slide 9

The Flow of Machine Learning
• Training: 1. Input → 2. Calculate → 3. Output → 4. Feedback & Revision (the model outputs 6 for an image labeled 5 and is corrected)
• Test: Input → Calculate → Output (the model outputs 5 for an image labeled 5: correct!)

Slide 10

Example of a Model
• Define the neural network model
• A model consists of many layers that transform the data
• Each layer extracts representations from the data
[Code annotations: import models and layers from Keras (Python library); define the model; add the layers that perform the calculations]
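
A minimal sketch of such a model definition, using the layer sizes of the classic introductory MNIST network (512 relu units, then 10 softmax units); the sizes are an assumption, not read off the slide:

```python
# Define the network as a stack of Dense layers.
from keras import models, layers

network = models.Sequential()
# Each layer extracts a representation of the data fed into it.
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))  # one score per digit class
```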

Slide 11

Configuration for Training the Model
• A loss function
  • A measure of how far the model's predictions are from the correct answers
  • The model is trained to minimize the loss function
• Optimizer
  • An algorithm that optimizes the model: how the feedback is applied to the model
• Metrics
  • e.g. accuracy
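
A sketch of this configuration step, reusing `network` from the sketch above; `rmsprop` and `categorical_crossentropy` are assumed here as the usual choices for this example:

```python
# Compile: choose the optimizer, the loss function, and the metrics to track.
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```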

Slide 12

The Training Flow
[Diagram: Model, Label = 8: 1. Input → 2. Calculate → 3. Output (6) → 4. Feedback & Revision]

Slide 13

The Training Flow
[Diagram: the model outputs a score per digit (0.01, 0.02, 0.01, 0.03, 0.02, 0.03, 0.5, 0.02, 0.3, 0.08); the loss function compares the prediction (6) with the label (8), and the optimizer feeds the result back into the weighted computations inside the layers (e.g. 0.01*3+2, 0.2*0.01+0.2)]

Slide 14

Keras Training Setup
• Network compile
• Data preprocessing
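
The compile step appeared above; a sketch of the preprocessing step, reusing the MNIST arrays loaded earlier:

```python
# Flatten the images and scale pixel values from [0, 255] to [0, 1].
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

# One-hot encode the labels, e.g. 5 -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
```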

Slide 15

Training and Test
• Training
• Test
• Overfitting: train accuracy >> test accuracy
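
A sketch of the fit and evaluate calls, continuing the running example (the batch size and epoch count are the values this deck uses on its final slide):

```python
# Training: fit the network to the training data.
network.fit(train_images, train_labels, epochs=5, batch_size=128)

# Test: evaluate on data the network has never seen.
test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)  # far below train accuracy => overfitting
```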

Slide 16

This chapter covers
• A first example of a neural network
• Tensors and tensor operations
• How neural networks learn via backpropagation and gradient descent
Our goal in this chapter will be to build your intuition about these notions without getting overly technical.

Slide 17

Data Representations for NN
• All current machine-learning systems use tensors as their basic data structure
• A tensor is a container for numbers
• A tensor is a generalization of matrices to an arbitrary number of dimensions
• In the context of tensors, a dimension is often called an axis
0D tensors, 1D tensors, 2D tensors, 3D tensors, 4D tensors, …

Slide 18

Let's Try NumPy Data Operations
• Go to https://paiza.io/en
• How to use Python
  • Python is a general-purpose programming language; NumPy is its library for tensor operations
  • print( … ): writes … to the standard output
• How to use NumPy
  • First, "import numpy as np"
  • x = np.array([…]): creates x as an array holding …
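
A minimal example of the two operations named above:

```python
import numpy as np

x = np.array([1, 2, 3])  # create x as an array holding 1, 2, 3
print(x)                 # writes [1 2 3] to the standard output
```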

Slide 19

Scalars (0D) and Vectors (1D)
• Scalar: 0D tensor
• Vector: 1D tensor
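
A small illustration (ndim reports the number of axes; the element values are arbitrary):

```python
import numpy as np

x = np.array(12)             # a scalar: 0D tensor
print(x.ndim)                # 0
v = np.array([12, 3, 6, 1])  # a vector: 1D tensor
print(v.ndim)                # 1
```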

Slide 20

Let's Try NumPy Data Operations
• Matrix: 2D tensor
• Higher-dimensional tensor: 3D tensor
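
Continuing the same pattern one and two axes up (again with arbitrary values):

```python
import numpy as np

m = np.array([[5, 78, 2, 34],
              [6, 79, 3, 35]])    # a matrix: 2D tensor
print(m.ndim)                     # 2

t = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]])  # a 3D tensor: an array of matrices
print(t.ndim)                     # 3
```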

Slide 21

Key Attributes
• Number of axes (rank)
• Shape: >>> x.shape → (3, 5)
• dtype: >>> x.dtype → int64
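
A self-contained illustration of the three attributes, using a made-up 3 x 5 array:

```python
import numpy as np

x = np.zeros((3, 5), dtype='int64')
print(x.ndim)   # number of axes (rank): 2
print(x.shape)  # (3, 5)
print(x.dtype)  # int64
```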

Slide 22

Look Back at the MNIST Dataset
• The training-image tensor has shape (60000, 28, 28)
• Its dtype is uint8: integers between 0 and 255

Slide 23

Tensor Operations Using NumPy
• Go to https://paiza.io/en
• Create a temporary stand-in for MNIST's train_images

Slide 24

Slicing
• ":" is the slice operator for extracting a specified range of data
[Figure: selecting images 10~100 from the (60000, 28, 28) tensor]
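
A sketch of this slice on the real MNIST tensor:

```python
from keras.datasets import mnist
(train_images, _), _ = mnist.load_data()

# 10:100 selects images #10 through #99, keeping the full 28 x 28 size.
my_slice = train_images[10:100]
print(my_slice.shape)  # (90, 28, 28)
```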

Slide 25

Slicing
[Figure: slicing the (60000, 28, 28) tensor to pixel ranges 14~28 and 7~14 along both image axes]
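
Sketches of the two slices shown on this slide, reusing train_images from above:

```python
# Pixels 14~28 along both image axes of every image:
patch = train_images[:, 14:28, 14:28]
print(patch.shape)  # (60000, 14, 14)

# Pixels 7~14 along both image axes:
patch = train_images[:, 7:14, 7:14]
print(patch.shape)  # (60000, 7, 7)
```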

Slide 26

Batch Sampling
• A batch is a set of samples that is input to the model and processed together
[Diagram: a batch of size 4 ([6, 2, 3, 4], labels [8, 3, 6, 4]) flows through 1. Input → 2. Calculate → 3. Output → 4. Feedback & Revision]

Slide 27

Create the Batch
• Create batch samples using the slicing operation
• All batches are the same size
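
A sketch, assuming a batch size of 128 as in the training example and reusing train_images:

```python
# Batches of size 128, built with slicing; every batch has the same size.
batch = train_images[:128]     # first batch
batch = train_images[128:256]  # second batch
n = 3
batch = train_images[128 * n:128 * (n + 1)]  # n-th batch
```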

Slide 28

Real-World Examples of Tensor Data
• Vector data
  • 2D tensors of shape (samples, features)
• Timeseries or sequence data
  • 3D tensors of shape (samples, timesteps, features)
• Images
  • 4D tensors of shape (samples, height, width, channels)
• Videos
  • 5D tensors of shape (samples, frames, height, width, channels)
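
Illustrative arrays, one per case in the list above; all the sizes are made up for the example:

```python
import numpy as np

vectors = np.zeros((1000, 16))           # (samples, features)
timeseries = np.zeros((250, 390, 3))     # (samples, timesteps, features)
images = np.zeros((64, 224, 224, 3))     # (samples, height, width, channels)
videos = np.zeros((4, 60, 144, 256, 3))  # (samples, frames, height, width, channels)
```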

Slide 29

Vector Data
• Table data: samples x features
• Example: the Titanic dataset

Slide 30

Timeseries Data or Sequence Data
• Example: a dataset of stock prices, one table per date; the axes are (dates, times, features)

Time:  0:00  0:01  0:02  0:03  0:04  0:05  0:06  …  23:59
Max:      2     9     7    56     8     6     8  …      4
Min:      0     1     5    98     6     4     3  …      9
Now:      6     7     9     6     4    67    98  …      1

(the same kind of table is repeated for each date in the dataset)

Slide 31

Images
• Samples x Height x Width x Channels

Slide 32

Videos
• Samples x Frames x Height x Width x Channels
[Diagram: each sample is a stack of frames]

Slide 33

This chapter covers
• A first example of a neural network
• Tensors and tensor operations
• How neural networks learn via backpropagation and gradient descent
Our goal in this chapter will be to build your intuition about these notions without getting overly technical.

Slide 34

Tensor Operations
[Diagram: the training-flow figure again; the weighted computations inside the layers (e.g. 0.01*3+2, 0.2*0.01+0.2) are tensor operations]

Slide 35

Layers and Tensor Operations
• A layer specifies the calculation that is applied to the data

Slide 36

The Details of the Dense Layer
• A Dense layer is defined by the following calculation:
• output = max(dot(input, W) + b, 0), i.e. the larger of 0 and input * W + b
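
A naive sketch of this computation in NumPy (the feature and unit counts 4 and 3 are arbitrary):

```python
import numpy as np

def dense(inputs, W, b):
    # relu(dot(inputs, W) + b): the larger of 0 and inputs * W + b.
    return np.maximum(np.dot(inputs, W) + b, 0.)

x = np.random.random((1, 4))  # one sample with 4 features
W = np.random.random((4, 3))  # weights: 4 inputs -> 3 units
b = np.random.random((3,))    # one bias per unit
print(dense(x, W, b).shape)   # (1, 3)
```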

Slide 37

Element-wise Operations
• The ReLU operation and addition are element-wise operations
• They can be implemented in native Python, or with NumPy
• In NumPy these operations are delegated to BLAS routines written in Fortran or C, which are much faster than native Python
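
A sketch of both implementations; naive_relu follows the loop-based style the slide refers to:

```python
import numpy as np

def naive_relu(x):
    # Element-wise ReLU written as plain Python loops over a 2D tensor.
    assert len(x.shape) == 2
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0.)
    return x

x = np.array([[-1., 2.], [3., -4.]])
y = np.array([[5., 6.], [7., 8.]])

print(naive_relu(x))      # native-Python version
print(np.maximum(x, 0.))  # NumPy version: runs in optimized C/Fortran (BLAS)
print(x + y)              # element-wise addition, also vectorized
```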

Slide 38

Broadcast Operations
• Broadcasting makes it possible to add two tensors whose shapes differ
• The smaller tensor is (virtually) repeated alongside new axes to match the full shape of the larger tensor
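
A small broadcasting example with assumed shapes (32, 10) and (10,):

```python
import numpy as np

x = np.random.random((32, 10))  # matrix of shape (32, 10)
y = np.random.random((10,))     # vector of shape (10,)

# y is (virtually) repeated 32 times along a new first axis so that
# its shape matches x before the element-wise addition.
z = x + y
print(z.shape)  # (32, 10)
```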

Slide 39

Dot Operation (Tensor Product)
• The most common and useful tensor operation
[Table: the vector dot product and the matrix dot product, each shown as a NumPy operation and as the corresponding mathematical operation]
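
Small examples of both products:

```python
import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])
print(np.dot(a, b))  # vector dot product: 1*4 + 2*5 + 3*6 = 32.0

m = np.random.random((3, 2))
n = np.random.random((2, 4))
print(np.dot(m, n).shape)  # matrix dot product: (3, 2) . (2, 4) -> (3, 4)
```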

Slide 40

Matrix Dot Product: Box Diagram

Slide 41

Tensor Reshaping
• Converts the shape of a tensor
• A special case of reshaping that is often used is "transposition"

Slide 42

Transpose
• Exchanges two axes of a tensor

Slide 43

Let's Reshape and Transpose
• Go to paiza!
[Code: reshape and transpose examples]
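
A sketch of both operations on a small made-up array:

```python
import numpy as np

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])         # shape (3, 2)

print(x.reshape((6, 1)).shape)   # (6, 1): same elements, new shape
print(x.reshape((2, 3)).shape)   # (2, 3)
print(np.transpose(x).shape)     # (2, 3): the two axes exchanged
```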

Slide 44

Geometric Interpretation of Tensor Operations

Slide 45

Geometric Interpretation of Deep Learning
• A neural network is a chain of simple tensor operations, each of which is just a geometric transformation of the input data
• Picture two classes of data as two sheets crumpled together into a paper ball: what a neural network (or any other machine-learning model) is meant to do is figure out a transformation of the paper ball that would uncrumple it, so as to make the two classes cleanly separable again
• Uncrumpling paper balls is what machine learning is about: finding neat representations for complex, highly folded data manifolds

Slide 46

This chapter covers
• A first example of a neural network
• Tensors and tensor operations
• How neural networks learn via backpropagation and gradient descent
Our goal in this chapter will be to build your intuition about these notions without getting overly technical.

Slide 47

Optimization
[Diagram: the training-flow figure again; the loss function compares the prediction (6) with the label (8), and the optimizer feeds the result back into the weights]

Slide 48

Update the Weights
• Weight matrices (the trainable parameters) are initially filled with small random values
• There's no reason to expect that relu(dot(W, input) + b) will yield any useful representations when W and b are random
• Starting from these random values, the weights are adjusted gradually until the model predicts accurately
• This process is called "training" (the "training loop")

Slide 49

The Flow of Training
• Draw a batch of training samples x and corresponding targets y
• Run the network on x to obtain predictions y_pred
• Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
• Update all weights of the network in a way that slightly reduces the loss on this batch
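
The loop above, sketched end to end on a deliberately tiny stand-in model (a one-feature linear model y = w*x + b instead of a real network, so every step fits in a few lines):

```python
import numpy as np

# Toy model trained by the four steps above: y_pred = w * x + b.
w, b = 0.0, 0.0
learning_rate = 0.1
x_train = np.array([0., 1., 2., 3.])
y_train = np.array([1., 3., 5., 7.])  # true relation: y = 2x + 1

for step in range(100):
    # 1. Draw a batch (here: the whole training set).
    x, y = x_train, y_train
    # 2. Run the model on x to obtain predictions.
    y_pred = w * x + b
    # 3. Compute the loss (mean squared error).
    loss = np.mean((y_pred - y) ** 2)
    # 4. Update the weights in the direction that reduces the loss.
    grad_w = np.mean(2 * (y_pred - y) * x)
    grad_b = np.mean(2 * (y_pred - y))
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # approaches 2.0 and 1.0
```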

Slide 50

Overview of the Optimization
[Diagram: the training-flow figure: prediction, loss function, feedback by the optimizer]

Slide 51

Overview of the Optimization
1. Predict (calculation): the model outputs 6 for an input labeled 8
2. Calculate the loss, the measure of mismatch: |6 - 8| = 2
3. Update the weights: W: 1 => 2, b: 2 => 3

Slide 52

How to Update the Weights
• Update the weights using the "gradient" of the loss function (which is differentiable) with regard to the network's coefficients
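
A minimal gradient step on a toy differentiable loss, loss(w) = (w - 3)^2:

```python
# Repeatedly move w a small step against the gradient of the loss.
w = 0.0
learning_rate = 0.4
for step in range(20):
    grad = 2 * (w - 3)         # d(loss)/dw
    w -= learning_rate * grad  # move against the gradient
print(w)  # close to 3.0, the minimum of the loss
```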

Slide 53

Mathematical Explanation

Slide 54

Mathematical Explanation

Slide 55

Complexity of the Loss Function

Slide 56

Learning Rate
• How much the weights should be updated at each step

Slide 57

Optimizer
• Methods, or strategies, for updating the weights
• SGD, AdaGrad, RMSProp, …

Slide 58

Global and Local Minima
• The goal of training is to reach the global minimum of the loss
• But with a simple optimizer, the model's weights may converge to a local minimum

Slide 59

Momentum
• An optimization idea invented to avoid converging to a local minimum
• When updating the weights, an optimizer with momentum takes the history of past updates into account
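
A sketch contrasting the update rules on the same toy loss as above; the hyperparameter values are arbitrary:

```python
# Plain SGD moves by -learning_rate * grad each step; with momentum, a
# running "velocity" of past updates is added, which can help the weights
# roll through shallow local minima. Toy loss: (w - 3)^2.
w, velocity = 0.0, 0.0
learning_rate, momentum = 0.1, 0.9
for step in range(200):
    grad = 2 * (w - 3)
    velocity = momentum * velocity - learning_rate * grad
    w += velocity  # past updates carry the step onward
print(w)  # converges to 3.0
```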

Slide 60

Look Back at Our First Example
• Data
[Figure: the (60000, 28, 28) training-image tensor with labels 0, 2, 3, …]

Slide 61

Look Back at Our First Example
• Model (network)
[Diagram: the layers' weighted computations, e.g. 0.01*3+2, 0.2*0.01+0.2]

Slide 62

Look Back at Our First Example
• Compile (how to optimize the model)
• Loss function (differentiable)

Slide 63

Look Back at Our First Example
• Training loop (the configuration for optimizing the model)
• fit
  • the method that starts iterating over the training data in mini-batches of 128 samples, 5 times over
• epoch
  • each iteration over all the training data
• After these 5 epochs:
  • the network will have performed 2,345 gradient updates (469 per epoch; see the check below)
  • the loss of the network will be sufficiently low that the network will be capable of classifying handwritten digits with high accuracy
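
A quick check of those numbers (the last, partial batch of an epoch counts as one update):

```python
import math

updates_per_epoch = math.ceil(60000 / 128)  # 469 mini-batches per epoch
total_updates = updates_per_epoch * 5       # 2345 gradient updates in 5 epochs
print(updates_per_epoch, total_updates)     # 469 2345
```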