Slide 1

Slide 1 text

Let’s solve a simple problem with Machine Learning!

Slide 2

Slide 2 text

House prices in Cracow

Slide 3

Slide 3 text

Randomly picked 38 house prices (Area [m^2], Price [k zł]): (22, 164), (22, 380), (35, 175), (35, 320), (36, 199), (36, 198), (37, 320), (38, 210), (38, 199), (39, 295), (41, 315), (41, 299), (45, 315), (48, 359), (49, 285), (50, 279), (50, 409), (50, 286), (51, 330), (52, 380), (54, 375), (54, 396), (54, 430), (54, 299), (56, 387), (56, 465), (59, 285), (60, 295), (62, 485), (62, 314), (67, 500), (69, 390), (74, 459), (75, 498), (75, 460), (78, 550)
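To make this data easy to reuse later, here is a minimal sketch of the same pairs transcribed into Python lists (the names areas and prices are my own, not from the deck):

```python
# Area [m^2] and price [k zł] pairs transcribed from the slide above.
areas = [22, 22, 35, 35, 36, 36, 37, 38, 38, 39, 41, 41, 45, 48, 49, 50, 50, 50,
         51, 52, 54, 54, 54, 54, 56, 56, 59, 60, 62, 62, 67, 69, 74, 75, 75, 78]
prices = [164, 380, 175, 320, 199, 198, 320, 210, 199, 295, 315, 299, 315, 359, 285, 279, 409, 286,
          330, 380, 375, 396, 430, 299, 387, 465, 285, 295, 485, 314, 500, 390, 459, 498, 460, 550]
```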

Slide 4

Slide 4 text

Visualise it

Slide 5

Slide 5 text

Problem

Slide 6

Slide 6 text

I have an 83 square metre flat to sell.

Slide 7

Slide 7 text

I have an 83 square metre flat to sell. What price should I set to match the market and not lose money?

Slide 8

Slide 8 text

We want our machine to be intelligent enough to solve it. But first an engineer has to create that "intelligence" by implementing a learning algorithm in code. And the engineer has to know what code to write, so he/she needs some instructions. Luckily, math can be turned into code easily!

Slide 9

Slide 9 text

Linear Regression. To solve the problem we will use an algorithm that tries to approximate all the points with a single line.

Slide 10

Slide 10 text

Simple problem. Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

Slide 11

Slide 11 text

Simple problem. Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). We want the machine to find the equation of a linear function h(x) that fits all the points as closely as possible, ideally going through all of them.

Slide 12

Slide 12 text

Simple problem. Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Linear function: y = a*x + b. The machine will need to find the right values for the parameters a and b so that the line goes, ideally, through all the points.
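As a small illustration (the function and variable names are mine, not from the deck), the hypothesis function from this slide can be written directly in Python:

```python
def h(x, a, b):
    """Linear hypothesis y = a * x + b for a single input x."""
    return a * x + b

# With a = 1 and b = 0 the line passes exactly through all four points.
print([h(x, a=1, b=0) for x in [0, 1, 2, 3]])  # [0, 1, 2, 3]
```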

Slide 13

Slide 13 text

Simple problem. Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Humans don't need Linear Regression to solve this problem. It's obvious that the function which goes through all the points is: h(x) = x, i.e. y = a * x + b where a = 1, b = 0.

Slide 14

Slide 14 text

Simple problem. Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Humans don't need Linear Regression to solve this problem. It's obvious that the function which goes through all the points is: h(x) = x, i.e. y = a * x + b where a = 1, b = 0. Check: h(0) = 0, h(1) = 1, h(2) = 2, h(3) = 3.

Slide 15

Slide 15 text

But can you find a and b such that the function h(x) fits those points as closely as possible?

Slide 16

Slide 16 text

How to compare parameters? Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Which parameters will create an h(x) closest to all the points? Red: a = 0.5, b = 0, so h(x) = 0.5x. Green: a = 0, b = 1.5, so h(x) = 1.5.

Slide 17

Slide 17 text

Let's draw the red and green functions.

x | y | h(x) = 0.5x | h(x) = 1.5
0 | 0 | 0 | 1.5
1 | 1 | 0.5 | 1.5
2 | 2 | 1 | 1.5
3 | 3 | 1.5 | 1.5

Slide 18

Slide 18 text

Let's draw the red and green functions.

x | y | h(x) = 0.5x | h(x) = 1.5
0 | 0 | 0 | 1.5
1 | 1 | 0.5 | 1.5
2 | 2 | 1 | 1.5
3 | 3 | 1.5 | 1.5

Slide 19

Slide 19 text

Let's draw the red and green functions.

x | y | h(x) = 0.5x | h(x) = 1.5
0 | 0 | 0 | 1.5
1 | 1 | 0.5 | 1.5
2 | 2 | 1 | 1.5
3 | 3 | 1.5 | 1.5

Slide 20

Slide 20 text

Let's calculate the distance on the y axis between corresponding points: r0, r1, r2, r3 for the red line and g0, g1, g2, g3 for the green line. Distance Red = r0 + r1 + r2 + r3. Distance Green = g0 + g1 + g2 + g3.

Slide 21

Slide 21 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = 0 - 0 = 0

Slide 22

Slide 22 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = 0 - 0 = 0
r1 = 0.5 - 1 = -0.5

Slide 23

Slide 23 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = 0 - 0 = 0
r1 = 0.5 - 1 = -0.5
r2 = 1 - 2 = -1

Slide 24

Slide 24 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = 0 - 0 = 0
r1 = 0.5 - 1 = -0.5
r2 = 1 - 2 = -1
r3 = 1.5 - 3 = -1.5

Slide 25

Slide 25 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = |0 - 0| = 0
r1 = |0.5 - 1| = 0.5
r2 = |1 - 2| = 1
r3 = |1.5 - 3| = 1.5

We care about distance, so a distance value shouldn't be negative. We can take the absolute value of each difference. (In the Machine Learning world this is called the L1 distance.)

Slide 26

Slide 26 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = (0 - 0)^2 = 0
r1 = (0.5 - 1)^2 = 0.25
r2 = (1 - 2)^2 = 1
r3 = (1.5 - 3)^2 = 2.25

An additional thing we can do is penalise large distances more. To achieve that we simply square each value. (In the Machine Learning world this is called the L2 distance.)
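A minimal sketch of the two per-point distances from the last two slides, for the red line h(x) = 0.5x (variable names are mine, not from the deck):

```python
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def h_red(x):
    return 0.5 * x  # red line: a = 0.5, b = 0

# L1 distance per point: absolute difference between prediction and target.
l1 = [abs(h_red(x) - y) for x, y in zip(xs, ys)]
# L2 (squared) distance per point: squaring penalises large errors more.
l2 = [(h_red(x) - y) ** 2 for x, y in zip(xs, ys)]

print(l1)  # [0.0, 0.5, 1.0, 1.5]
print(l2)  # [0.0, 0.25, 1.0, 2.25]
```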

Slide 27

Slide 27 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = (0 - 0)^2 = 0
r1 = (0.5 - 1)^2 = 0.25
r2 = (1 - 2)^2 = 1
r3 = (1.5 - 3)^2 = 2.25

Squared-Distance-Red = 0 + 0.25 + 1 + 2.25 = 3.5

Slide 28

Slide 28 text

Distance Red

x | y | h(x) = 0.5x
0 | 0 | 0
1 | 1 | 0.5
2 | 2 | 1
3 | 3 | 1.5

r0 = (0 - 0)^2 = 0
r1 = (0.5 - 1)^2 = 0.25
r2 = (1 - 2)^2 = 1
r3 = (1.5 - 3)^2 = 2.25

Squared-Distance-Red = 0 + 0.25 + 1 + 2.25 = 3.5
Average Distance = (0 + 0.25 + 1 + 2.25) / 4 = 0.875

Slide 29

Slide 29 text

Distance Green

x | y | h(x) = 1.5
0 | 0 | 1.5
1 | 1 | 1.5
2 | 2 | 1.5
3 | 3 | 1.5

g0 = (1.5 - 0)^2 = 2.25
g1 = (1.5 - 1)^2 = 0.25
g2 = (1.5 - 2)^2 = 0.25
g3 = (1.5 - 3)^2 = 2.25

Squared-Distance-Green = 2.25 + 0.25 + 0.25 + 2.25 = 5
Average Distance = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25
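A short sketch that reproduces the red and green averages from these slides (the helper name is mine, not from the deck):

```python
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def mean_squared_error(predict, xs, ys):
    """Average of the squared differences between predictions and targets."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

mse_red = mean_squared_error(lambda x: 0.5 * x, xs, ys)  # red: h(x) = 0.5x
mse_green = mean_squared_error(lambda x: 1.5, xs, ys)    # green: h(x) = 1.5

print(mse_red, mse_green)  # 0.875 1.25
```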

Slide 30

Slide 30 text

We can compare now!
Average Distance Green = 1.25
Average Distance Red = 0.875
(Average Distance Red) < (Average Distance Green)
Function parameters a = 0.5, b = 0 are better than a = 0, b = 1.5…

Slide 31

Slide 31 text

We can compare now! Function parameters a = 0.5, b = 0 are better than a = 0, b = 1.5… because the value of the MEAN SQUARED ERROR for the function created with the red parameters is smaller than for the function created with the green parameters.
Average Distance Green = 1.25
Average Distance Red = 0.875
(Average Distance Red) < (Average Distance Green)

Slide 32

Slide 32 text

Distance Green: g0, g1, g2, g3. g3 = (1.5 - 3)^2 = 2.25

Slide 33

Slide 33 text

Distance Green: g0, g1, g2, g3. g3 = (1.5 - 3)^2 = 2.25. Distance = Error.

Slide 34

Slide 34 text

Distance Green: g0, g1, g2, g3. g3 = (1.5 - 3)^2 = 2.25. Distance = Error, so each squared distance such as g3 is a Squared Error.

Slide 35

Slide 35 text

Distance Green: g0, g1, g2, g3. g3 = (1.5 - 3)^2 = 2.25. Distance = Error, so g0, g1, g2, g3 are Squared Errors. Squared-Distance-Green = 2.25 + 0.25 + 0.25 + 2.25 = 5

Slide 36

Slide 36 text

Distance Green: g0, g1, g2, g3. g3 = (1.5 - 3)^2 = 2.25. Distance = Error, so g0, g1, g2, g3 are Squared Errors. Averaging them gives the Mean Squared Error: Average Distance = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25, with the number of points m = 4.

Slide 37

Slide 37 text

We can compare now! Function parameters a = 0.5, b = 0 are better than a = 0, b = 1.5… because the MSE value for the function created with the red parameters is smaller than for the function created with the green parameters.
MSE Green = 1.25
MSE Red = 0.875
(MSE Red) < (MSE Green)

Slide 38

Slide 38 text

Let's see how the parameter a of the function h(x) = a * x affects the value of the MSE (Mean Squared Error). We ignore the b parameter for simplicity.

Slide 39

Slide 39 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

Slide 40

Slide 40 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Let's pick a large a = 10 and calculate the MSE for the given x, y:

Slide 41

Slide 41 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Let's pick a large a = 10 and calculate the MSE for the given x, y: h(x) = a * x, m = 4 (because there are 4 y values).

Slide 42

Slide 42 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). Let's pick a large a = 10 and calculate the MSE for the given x, y: h(x) = a * x, m = 4 (because there are 4 y values).

Slide 43

Slide 43 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). h(x) = a * x, m = 4 (because there are 4 y values). Let's pick a large a = 10 and calculate the MSE for the given x, y:
h(0) = 10 * 0 = 0, expected result y = 0
h(1) = 10 * 1 = 10, expected result y = 1
h(2) = 10 * 2 = 20, expected result y = 2
h(3) = 10 * 3 = 30, expected result y = 3

Slide 44

Slide 44 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3). h(x) = a * x, m = 4 (because there are 4 y values). Let's pick a large a = 10 and calculate the MSE for the given x, y:
h(0) = 10 * 0 = 0, expected result y = 0
h(1) = 10 * 1 = 10, expected result y = 1
h(2) = 10 * 2 = 20, expected result y = 2
h(3) = 10 * 3 = 30, expected result y = 3
MSE = ((0 - 0)^2 + (10 - 1)^2 + (20 - 2)^2 + (30 - 3)^2) / (2 * 4) = 141.75
(From this slide on the sum of squared errors is divided by 2m rather than m; the extra factor of 1/2 does not change which a is best.)
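A small sketch reproducing these numbers; note the division by 2m, which matches the MSE values on the following slides (the function and variable names are mine):

```python
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
m = len(xs)

def mse_halved(a):
    # Sum of squared errors of h(x) = a * x, divided by 2m (the convention these slides use).
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for a in [10, 5, 2, 1, -2, -5, -10]:
    print(a, mse_halved(a))
# 10 -> 141.75, 5 -> 28.0, 2 -> 1.75, 1 -> 0.0, -2 -> 15.75, -5 -> 63.0, -10 -> 211.75
```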

Slide 45

Slide 45 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75

Slide 46

Slide 46 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0

Slide 47

Slide 47 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75

Slide 48

Slide 48 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75
-2 | y = -2 * x | 15.75

Slide 49

Slide 49 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75
-2 | y = -2 * x | 15.75
-5 | y = -5 * x | 63.0

Slide 50

Slide 50 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75
-2 | y = -2 * x | 15.75
-5 | y = -5 * x | 63.0
-10 | y = -10 * x | 211.75

Slide 51

Slide 51 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75
-2 | y = -2 * x | 15.75
-5 | y = -5 * x | 63.0
-10 | y = -10 * x | 211.75
1 | y = x | 0

Slide 52

Slide 52 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75
-2 | y = -2 * x | 15.75
-5 | y = -5 * x | 63.0
-10 | y = -10 * x | 211.75
1 | y = x | 0

GLOBAL MINIMUM - LOWEST POSSIBLE VALUE OF MSE (at a = 1)

Slide 53

Slide 53 text

Points (x, y): (0, 0), (1, 1), (2, 2), (3, 3).

a | h(x) | MSE
10 | y = 10 * x | 141.75
5 | y = 5 * x | 28.0
2 | y = 2 * x | 1.75
-2 | y = -2 * x | 15.75
-5 | y = -5 * x | 63.0
-10 | y = -10 * x | 211.75
1 | y = x | 0

GLOBAL MINIMUM - LOWEST POSSIBLE VALUE OF MSE (at a = 1). The best line is y = 1 * x.

Slide 54

Slide 54 text

The smaller the MSE (Mean Squared Error) value, the better the function fits the points. We want the MSE to be AS SMALL AS POSSIBLE - ideally 0!

Slide 55

Slide 55 text

It is possible to search for the best h(x) parameters by randomly testing many different parameter values and picking the ones for which the MSE is the lowest.

Slide 56

Slide 56 text

It is possible to search for the best h(x) parameters by randomly testing many different parameter values and picking the ones for which the MSE is the lowest. BUT IT'S INEFFICIENT!
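For illustration only, a random-search sketch of the idea on this slide; the parameter ranges, sample count and names are my own assumptions:

```python
import random

xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def mse(a, b):
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Try many random (a, b) pairs and keep the pair with the lowest MSE.
candidates = ((random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(10_000))
best = min(candidates, key=lambda p: mse(*p))
print(best, mse(*best))  # usually close to a = 1, b = 0, but wasteful compared to gradient descent
```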

Slide 57

Slide 57 text

It is possible to search for the best h(x) parameters by randomly testing many different parameter values and picking the ones for which the MSE is the lowest. BUT IT'S INEFFICIENT! There is a mathematical method that allows you to change the h(x) parameters over many iterations. With each iteration the MSE value gets closer to its local minimum.

Slide 58

Slide 58 text

Gradient Descent: a method for adjusting parameters in order to minimise the MSE.

Slide 59

Slide 59 text

Gradient Descent - recipe. It can be mathematically proven that it works. For now, trust it the same way you trust that a recipe in a cookbook won't hurt your stomach :)

Slide 60

Slide 60 text

Gradient Descent - recipe
Step 1: Decide on what function h(x) you want to find. h(x) = a * x + b, parameters to find: a, b.

Slide 61

Slide 61 text

Gradient Descent - recipe
Step 1: Decide on what function h(x) you want to find. h(x) = a * x + b, parameters to find: a, b.
Step 2: Decide on a function that measures the efficiency of h(x): Mean Squared Error. The lower the value, the better the a, b parameters.

Slide 62

Slide 62 text

Gradient Descent - recipe
Step 1: Decide on what function h(x) you want to find. h(x) = a * x + b, parameters to find: a, b.
Step 2: Decide on a function that measures the efficiency of h(x): Mean Squared Error. The lower the value, the better the a, b parameters.
Step 3: Take the derivatives of MSE with respect to a and b: dMSE(h(x), y)/da and dMSE(h(x), y)/db.

Slide 63

Slide 63 text

Gradient Descent - recipe
Step 1: Decide on what function h(x) you want to find. h(x) = a * x + b, parameters to find: a, b.
Step 2: Decide on a function that measures the efficiency of h(x): Mean Squared Error. The lower the value, the better the a, b parameters.
Step 3: Take the derivatives of MSE with respect to a and b: dMSE(h(x), y)/da and dMSE(h(x), y)/db.
Step 4: Update a and b:
a = a - lr * dMSE(h(x), y)/da
b = b - lr * dMSE(h(x), y)/db
lr - a small value between 0.001 and 0.1

Slide 64

Slide 64 text

Gradient Descent - recipe
Step 1: Decide on what function h(x) you want to find. h(x) = a * x + b, parameters to find: a, b.
Step 2: Decide on a function that measures the efficiency of h(x): Mean Squared Error. The lower the value, the better the a, b parameters.
Step 3: Take the derivatives of MSE with respect to a and b: dMSE(h(x), y)/da and dMSE(h(x), y)/db.
Step 4: Update a and b:
a = a - lr * dMSE(h(x), y)/da
b = b - lr * dMSE(h(x), y)/db
lr - a small value between 0.001 and 0.1
REPEAT UNTIL MSE IS SMALL
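A minimal sketch of the whole recipe in Python, assuming MSE = (1/m) * sum((h(x) - y)^2) as defined earlier in the deck, so that dMSE/da = (2/m) * sum((h(x) - y) * x) and dMSE/db = (2/m) * sum(h(x) - y); the learning rate and iteration count are illustrative choices:

```python
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
m = len(xs)

a, b = 0.0, 0.0  # Step 1: h(x) = a * x + b, parameters initialised arbitrarily
lr = 0.05        # learning rate: a small value

def mse(a, b):
    # Step 2: Mean Squared Error measures how good the a, b parameters are.
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

for step in range(1000):  # repeat until MSE is small
    # Step 3: derivatives of MSE with respect to a and b.
    d_a = (2 / m) * sum((a * x + b - y) * x for x, y in zip(xs, ys))
    d_b = (2 / m) * sum((a * x + b - y) for x, y in zip(xs, ys))
    # Step 4: update the parameters in the direction that lowers the MSE.
    a -= lr * d_a
    b -= lr * d_b

print(a, b, mse(a, b))  # a close to 1, b close to 0, MSE close to 0
```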

Slide 65

Slide 65 text

Gradient Descent - intuition. After you have initialised your a parameter, it usually generates a big error at the start.

Slide 66

Slide 66 text

Gradient Descent - intuition. After performing a single iteration of Gradient Descent and updating a, the MSE value should decrease with each iteration.

Slide 67

Slide 67 text

Gradient Descent - intuition. After performing a single iteration of Gradient Descent and updating a, the MSE value should decrease with each iteration.

Slide 68

Slide 68 text

Gradient Descent - intuition. After performing a single iteration of Gradient Descent and updating a, the MSE value should decrease with each iteration.

Slide 69

Slide 69 text

Gradient Descent - intuition. After performing a single iteration of Gradient Descent and updating a, the MSE value should decrease with each iteration.

Slide 70

Slide 70 text

Gradient Descent - intuition. After performing a single iteration of Gradient Descent and updating a, the MSE value should decrease with each iteration.

Slide 71

Slide 71 text

Gradient Descent - intuition. The size of each step depends on the learning_rate parameter.

Slide 72

Slide 72 text

Gradient Descent - intuition. If learning_rate is too large, you can overshoot the GLOBAL MINIMUM.

Slide 73

Slide 73 text

Gradient Descent - intuition. And then the error can start to increase.

Slide 74

Slide 74 text

Gradient Descent - intuition. That's why you should keep a small value of learning_rate, so you don't overshoot the GLOBAL MINIMUM and instead approach it slowly!

Slide 75

Slide 75 text

After implementing this in code we can solve our problem (visualisation part: running the coded theory in Python and showing how the graph changes).
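A self-contained sketch of what that code might look like, fitting h(area) = a * area + b to the prices from the earlier slide with gradient descent and then evaluating it for the 83 m^2 flat. The standardisation of the area values, the learning rate and the iteration count are my own choices added so that plain gradient descent converges; they are not part of the original deck:

```python
# Area [m^2] and price [k zł] pairs transcribed from the earlier slide.
areas = [22, 22, 35, 35, 36, 36, 37, 38, 38, 39, 41, 41, 45, 48, 49, 50, 50, 50,
         51, 52, 54, 54, 54, 54, 56, 56, 59, 60, 62, 62, 67, 69, 74, 75, 75, 78]
prices = [164, 380, 175, 320, 199, 198, 320, 210, 199, 295, 315, 299, 315, 359, 285, 279, 409, 286,
          330, 380, 375, 396, 430, 299, 387, 465, 285, 295, 485, 314, 500, 390, 459, 498, 460, 550]

m = len(areas)

# Standardise the areas so that gradient descent converges with a simple learning rate.
mean_area = sum(areas) / m
std_area = (sum((x - mean_area) ** 2 for x in areas) / m) ** 0.5
xs = [(x - mean_area) / std_area for x in areas]

a, b = 0.0, 0.0
lr = 0.05

for step in range(5000):
    errors = [a * x + b - y for x, y in zip(xs, prices)]
    d_a = (2 / m) * sum(e * x for e, x in zip(errors, xs))
    d_b = (2 / m) * sum(errors)
    a -= lr * d_a
    b -= lr * d_b

mse = sum((a * x + b - y) ** 2 for x, y in zip(xs, prices)) / m
predicted = a * (83 - mean_area) / std_area + b
print(f"MSE = {mse:.1f}, predicted price for 83 m^2: {predicted:.0f} k zł")
```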