
Soft Introduction to MSE based Linear Regression

F1sherKK
November 16, 2017


Talk from the Lunch&Learn event at Azimo, 16/11/2017.


Transcript

  1. Randomly picked 38 house prices — scatter plot of Price [k zł] against Area [m²], with areas ranging from about 22 m² up to 78 m² and prices from about 164k zł up to 550k zł.
  2. I have an 83 m² flat to sell. What price should I set to match the market and not lose money?
  3. We want our machine to be intelligent enough to solve it. But first an engineer has to create that "intelligence" by implementing a learning algorithm in code. And the engineer has to know what code to write - he/she needs some instruction. Luckily, math can be turned into code easily!
  4. Linear Regression. To solve the problem we will use an algorithm that tries to approximate all the points with a single line.
  5. Simple problem. Points: (x, y) = (0, 0), (1, 1), (2, 2), (3, 3). We want the machine to find the equation of a linear function h(x) that fits all the points as closely as possible - ideally passing through all of them.
  6. Simple problem. Points: (x, y) = (0, 0), (1, 1), (2, 2), (3, 3). Linear function: y = a*x + b. The machine will need to find the right values for the parameters a and b so that the line ideally goes through all the points.
  7. Simple problem. Points: (x, y) = (0, 0), (1, 1), (2, 2), (3, 3). Humans don't need Linear Regression to solve this problem. It's obvious that the function which goes through all the points is h(x) = x, i.e. y = a * x + b where a = 1, b = 0.
  8. Simple problem. Points: (x, y) = (0, 0), (1, 1), (2, 2), (3, 3). Humans don't need Linear Regression to solve this problem. It's obvious that the function which goes through all the points is h(x) = x, i.e. y = a * x + b where a = 1, b = 0, and indeed h(0) = 0, h(1) = 1, h(2) = 2, h(3) = 3.
  9. But can you find a and b so that the function h(x) fits those points as closely as possible?
  10. How to compare parameters? Points: (x, y) = (0, 0), (1, 1), (2, 2), (3, 3). Which parameters will create an h(x) closest to all the points: a = 0.5, b = 0 (h(x) = 0.5x) or a = 0, b = 1.5 (h(x) = 1.5)?
  11. Let's draw the red and green functions. For each point: x = 0: y = 0, h_red = 0, h_green = 1.5; x = 1: y = 1, h_red = 0.5, h_green = 1.5; x = 2: y = 2, h_red = 1, h_green = 1.5; x = 3: y = 3, h_red = 1.5, h_green = 1.5, where red is h(x) = 0.5x and green is h(x) = 1.5.
  12. Let's draw the red and green functions (same values as above).
  13. Let's draw the red and green functions (same values as above).
  14. Let's calculate the distance along the y axis between each point and the corresponding point on the line: Distance Red = r0 + r1 + r2 + r3, Distance Green = g0 + g1 + g2 + g3.
  15. Distance Red, for h(x) = 0.5x on the points x = (0, 1, 2, 3), y = (0, 1, 2, 3), so h(x) = (0, 0.5, 1, 1.5): r0 = 0 - 0 = 0
  16. Distance Red (continued): r1 = 0.5 - 1 = -0.5
  17. Distance Red (continued): r2 = 1 - 2 = -1
  18. Distance Red (continued): r3 = 1.5 - 3 = -1.5
  19. Distance Red: we care about distance, and a distance value shouldn't be negative, so we can take the absolute value of each number: r0 = |0 - 0| = 0, r1 = |0.5 - 1| = 0.5, r2 = |1 - 2| = 1, r3 = |1.5 - 3| = 1.5. (In the Machine Learning world this is called the L1 distance.)
  20. Distance Red: an additional thing we can do is penalise large distances more. To achieve that we simply square each value: r0 = (0 - 0)^2 = 0, r1 = (0.5 - 1)^2 = 0.25, r2 = (1 - 2)^2 = 1, r3 = (1.5 - 3)^2 = 2.25. (In the Machine Learning world this is called the L2 distance.)
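As a quick sketch (not from the deck), here are the two distance measures from slides 19-20 computed in Python on the example points with the red line h(x) = 0.5x:

```python
# L1 vs L2 distance between the red line h(x) = 0.5x and the points (0,0)...(3,3)
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def h(x, a=0.5, b=0.0):
    return a * x + b

l1_distances = [abs(h(x) - y) for x, y in zip(xs, ys)]    # |h(x) - y|
l2_distances = [(h(x) - y) ** 2 for x, y in zip(xs, ys)]  # (h(x) - y)^2

print(l1_distances)  # [0.0, 0.5, 1.0, 1.5]
print(l2_distances)  # [0.0, 0.25, 1.0, 2.25]
```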
  21. Distance Red: r0 = (0 - 0)^2 = 0, r1 = (0.5 - 1)^2 = 0.25, r2 = (1 - 2)^2 = 1, r3 = (1.5 - 3)^2 = 2.25. Squared-Distance-Red = 0 + 0.25 + 1 + 2.25 = 3.5
  22. Distance Red: Squared-Distance-Red = 0 + 0.25 + 1 + 2.25 = 3.5, Average Distance = (0 + 0.25 + 1 + 2.25) / 4 = 0.875
  23. Distance Green, for h(x) = 1.5 on the same points: g0 = (1.5 - 0)^2 = 2.25, g1 = (1.5 - 1)^2 = 0.25, g2 = (1.5 - 2)^2 = 0.25, g3 = (1.5 - 3)^2 = 2.25. Squared-Distance-Green = 2.25 + 0.25 + 0.25 + 2.25 = 5, Average Distance = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25
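A minimal sketch (mine, not from the deck) that reproduces the red and green average squared distances from slides 22-23:

```python
# Average squared distance for the red (h(x) = 0.5x) and green (h(x) = 1.5) lines
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def avg_squared_distance(predict, xs, ys):
    squared = [(predict(x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared) / len(squared)

print(avg_squared_distance(lambda x: 0.5 * x, xs, ys))  # 0.875 (red)
print(avg_squared_distance(lambda x: 1.5, xs, ys))      # 1.25  (green)
```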
  24. We can compare now! Average Distance Green = 1.25, Average Distance Red = 0.875, so (Average Distance Red) < (Average Distance Green). The function parameters a = 0.5, b = 0 are better than a = 0, b = 1.5…
  25. We can compare now! The function parameters a = 0.5, b = 0 are better than a = 0, b = 1.5… because the value of the MEAN SQUARED ERROR for the function created with the red parameters is smaller than for the function created with the green parameters: Average Distance Green = 1.25, Average Distance Red = 0.875, (Average Distance Red) < (Average Distance Green).
  26. Distance Green: a single distance such as g3 = (1.5 - 3)^2 = 2.25 is called an Error.
  27. Distance Green: since it is squared, g3 = (1.5 - 3)^2 = 2.25 is a Squared Error.
  28. Distance Green: the sum Squared-Distance-Green = 2.25 + 0.25 + 0.25 + 2.25 = 5 is the sum of the Squared Errors g0, g1, g2, g3.
  29. Distance Green: the Average Distance = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25, where the number of points is m = 4, is the Mean Squared Error.
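The definition from slide 29 as a small reusable Python function (an illustrative sketch; the name mse is mine):

```python
# Mean Squared Error: the average of the squared errors over m points
def mse(predictions, targets):
    m = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m

print(mse([1.5, 1.5, 1.5, 1.5], [0, 1, 2, 3]))  # 1.25 - the green line from above
```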
  30. We can compare now! The function parameters a = 0.5, b = 0 are better than a = 0, b = 1.5… because the value of the MSE for the function created with the red parameters is smaller than for the function created with the green parameters: MSE Green = 1.25, MSE Red = 0.875, (MSE Red) < (MSE Green).
  31. Let's see how the parameter a of the function h(x) = a * x affects the value of the MSE (Mean Squared Error). We ignore the b parameter for simplicity.
  32. Points: (x, y) = (0, 0), (1, 1), (2, 2), (3, 3). Let's pick a large a = 10 and calculate the MSE for the given x, y.
  33. h(x) = a * x, m = 4 (because there are 4 y values).
  34. h(x) = a * x, m = 4 (because there are 4 y values).
  35. h(0) = 10 * 0 = 0, expected result y = 0; h(1) = 10 * 1 = 10, expected result y = 1; h(2) = 10 * 2 = 20, expected result y = 2; h(3) = 10 * 3 = 30, expected result y = 3.
  36. MSE = ((0 - 0)^2 + (10 - 1)^2 + (20 - 2)^2 + (30 - 3)^2) / (2 * 4) = 141.75. (From here on the slides divide the sum by 2m rather than m - a common convention that simplifies the derivative later; it does not change which a is best.)
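A short sketch (assumed, not from the deck) that reproduces the 141.75 value, using the halved convention (dividing by 2m) mentioned above:

```python
# Halved MSE for h(x) = 10x on the points (0,0)...(3,3)
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
a = 10

squared_errors = [(a * x - y) ** 2 for x, y in zip(xs, ys)]
print(sum(squared_errors) / (2 * len(ys)))  # 141.75
```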
  37. Summary table (a | h(x) | MSE): a = 10, y = 10 * x, MSE = 141.75
  38. Adding a = 5: y = 5 * x, MSE = 28.0
  39. Adding a = 2: y = 2 * x, MSE = 1.75
  40. Adding a = -2: y = -2 * x, MSE = 15.75
  41. Adding a = -5: y = -5 * x, MSE = 63.0
  42. Adding a = -10: y = -10 * x, MSE = 211.75
  43. Adding a = 1: y = x, MSE = 0
  44. a = 1 gives the GLOBAL MINIMUM - the lowest possible value of MSE.
  45. The best fit is y = 1 * x.
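A small sketch (mine, not from the deck) that reproduces the whole table by sweeping the listed values of a:

```python
# Halved MSE of h(x) = a*x for each candidate value of a
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def mse_halved(a):
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(ys))

for a in [10, 5, 2, -2, -5, -10, 1]:
    print(a, mse_halved(a))
# prints 141.75, 28.0, 1.75, 15.75, 63.0, 211.75 and finally 0.0 for a = 1
```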
  46. The smaller the MSE (Mean Squared Error) value, the better the function fits the points. We want the MSE to be AS SMALL AS POSSIBLE - ideally 0!
  47. It is possible to search for the best h(x) parameters by randomly testing many different parameters and picking the ones for which the MSE is lowest.
  48. It is possible to search for the best h(x) parameters by randomly testing many different parameters and picking the ones for which the MSE is lowest. BUT IT'S INEFFICIENT!
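A naive version of that random search might look like the sketch below (illustrative only; the sampling range [-10, 10] is an assumption):

```python
# Random search: sample many (a, b) pairs and keep the one with the lowest MSE.
# It works, but most samples are wasted - this is the inefficiency the slide points out.
import random

xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]

def mse(a, b):
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

best_params, best_mse = None, float("inf")
for _ in range(10_000):
    a = random.uniform(-10, 10)
    b = random.uniform(-10, 10)
    current = mse(a, b)
    if current < best_mse:
        best_params, best_mse = (a, b), current

print(best_params, best_mse)  # a ends up close to 1, b close to 0
```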
  49. It is possible to search for the best h(x) parameters by randomly testing many different parameters and picking the ones for which the MSE is lowest. BUT IT'S INEFFICIENT! There is a mathematical method that lets you change the h(x) parameters over many iterations. With each iteration the MSE value gets closer to its local minimum.
  50. Gradient Descent - recipe. It can be mathematically proven that it works. For now, trust it the same way you trust that a recipe from a cookbook won't hurt your stomach :)
  51. Gradient Descent - recipe. Step 1: Decide on what function h(x) you want to find: h(x) = a * x + b, parameters to find: a, b.
  52. Step 2: Decide on a function that measures the efficiency of h(x): Mean Squared Error - the lower its value, the better the a, b parameters.
  53. Step 3: Take the derivatives of MSE with respect to a and b: dMSE(h(x), y)/da and dMSE(h(x), y)/db. (With the halved MSE used on the earlier slides, these work out to dMSE/da = (1/m) * Σ (h(x_i) - y_i) * x_i and dMSE/db = (1/m) * Σ (h(x_i) - y_i).)
  54. Step 4: Update a and b: a = a - lr * dMSE(h(x), y)/da, b = b - lr * dMSE(h(x), y)/db, where lr (learning rate) is a small value between 0.001 and 0.1.
  55. REPEAT Steps 3-4 UNTIL MSE IS SMALL.
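A minimal sketch of the whole recipe in Python, assuming the halved-MSE convention from the earlier slides (which makes the derivatives come out without a factor of 2); the variable names follow the slides:

```python
# Gradient Descent for h(x) = a*x + b, following Steps 1-4 of the recipe
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
m = len(ys)

a, b = 0.0, 0.0  # Step 1: the parameters we want to find
lr = 0.05        # learning rate - a small value between 0.001 and 0.1

def mse_halved(a, b):
    # Step 2: the function that measures how good a and b are
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for _ in range(1000):  # repeat until MSE is small
    # Step 3: derivatives of the halved MSE with respect to a and b
    d_a = sum((a * x + b - y) * x for x, y in zip(xs, ys)) / m
    d_b = sum((a * x + b - y) for x, y in zip(xs, ys)) / m
    # Step 4: update the parameters
    a -= lr * d_a
    b -= lr * d_b

print(a, b, mse_halved(a, b))  # a close to 1, b close to 0, MSE close to 0
```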
  56. Gradient Descent - intuition. After you have initialised your a parameter, it usually generates a big error at the start (the "init" point on the MSE curve).
  57. Gradient Descent - intuition. After performing a single iteration of Gradient Descent and updating a, the MSE value should decrease - and it keeps decreasing with each iteration (next step on the MSE curve).
  58. (Same statement; the next step on the MSE curve is shown.)
  59. (Same statement; the next step on the MSE curve is shown.)
  60. (Same statement; step 3 on the MSE curve is shown.)
  61. (Same statement; step 3 on the MSE curve is shown.)
  62. Gradient Descent - intuition. If the learning_rate is too large, you can overshoot the GLOBAL MINIMUM (the size of each step becomes too big).
  63. Gradient Descent - intuition. That's why you should keep the learning_rate small - so you don't overshoot the GLOBAL MINIMUM and instead approach it slowly!
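To make the overshooting concrete, here is a small sketch (mine, not from the deck) on the single-parameter example h(x) = a*x: with a small learning rate a moves towards 1, with a large one it jumps past the minimum and diverges:

```python
# Effect of the learning rate on gradient descent for h(x) = a*x
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
m = len(ys)

def run(lr, steps=20, a=10.0):
    for _ in range(steps):
        d_a = sum((a * x - y) * x for x, y in zip(xs, ys)) / m  # derivative of the halved MSE
        a -= lr * d_a
    return a

print(run(lr=0.05))  # approaches a = 1
print(run(lr=0.7))   # overshoots: a oscillates around 1 with growing amplitude
```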
  64. After implementing this in code we can solve our problem. (Visualisation part - running the coded theory in Python and showing how the graph changes.)