
Machine Learning: The Bare Math Behind Libraries

Machine learning is one of the hottest buzzwords in technology today, as well as one of the most innovative fields in computer science; yet people use machine learning libraries as black boxes, without basic knowledge of the field. In this session we will strip them down to the bare math, so that the next time you use a machine learning library, you'll have a deeper understanding of what lies underneath.

Medwith

June 21, 2019

Transcript

  1. Machine Learning
     "Field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel)
  2. Machine Learning
     "I consider every method that needs training as an intelligent or machine learning method." (Our Lecturer)
  3. Supervised learning
     Build a model that performs a particular task:
     – Prepare a data set consisting of examples & expected outputs
     – Present the examples to your model
     – Check how it responds (the model's output values)
     – Adjust the model's parameters by comparing its output values with the expected output values
  7. Neural Networks
     • Inspired by biological brain mechanisms
     • Many applications: computer vision, speech recognition, compression
  8. Artificial Neuron
     • Inputs (x_1, ..., x_n) are the features of a single example
     • Multiply each input by its weight, sum the products, and pass the sum s to the activation function, which produces the output y
     [Diagram: inputs x_1 ... x_n with weights w_1 ... w_n plus a bias weight w_0, a summation Σ producing s, and an activation producing y]
  9. Activation function
     • Sigmoid – maps the sum of the neuron's input signals to a value from 0 to 1
     • Continuous, nonlinear
     • If the input is positive, it gives values > 0.5
     $f(x) = \frac{1}{1 + e^{-\beta x}}$
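
A minimal sketch of the neuron from slides 8-9 in Python. The weighted sum plus bias and the sigmoid follow the slides; the function names are mine, and the example values are borrowed from the Hebb walkthrough later in the deck (slides 29-32).

```python
import math

def sigmoid(s, beta=1.0):
    """Sigmoid activation: maps any real sum to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-beta * s))

def neuron_output(inputs, weights, bias_weight):
    """Multiply each input by its weight, add the bias, apply the activation."""
    s = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s)

# A positive weighted sum gives an output above 0.5
print(neuron_output([0.200, 0.300, 0.100], [0.230, 0.010, 0.900], 0.110))  # ~0.562
```
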
  10. Linear Regression
      • A method for modelling the relationship between variables
      • Simplest form: how x relates to y
      • Examples: house size vs house price, voltage vs electric current
  11. Costume price vs number of issues
      For a given amount of money, predict in how many comic book issues you'll appear.

      Costume price (x)   Number of issues (y)
      240                 6370
      480                 8697
      ...                 ...
      26                  2200
  12. Linear regression
      Let's have a function: $f(x, \Theta) = \Theta_1 x + \Theta_0$
      $f(x, \Theta)$ – number of comic book issues
      $x$ – costume price
      $\Theta$ – parameters
  13. Objective function
      $Q(\Theta) = \frac{1}{2N} \sum_{j=0}^{N} (f(x_j, \Theta) - y_j)^2$
      $Q(\Theta)$ – objective function
      $N$ – number of data samples
      $j$ – index of a particular data sample
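
The objective function sketched in Python under the deck's definitions; the model comes from slide 12, the data from slide 11, and the function names and guessed Θ are mine.

```python
def f(x, theta):
    """Linear model from slide 12: f(x, Θ) = Θ1·x + Θ0."""
    theta0, theta1 = theta
    return theta1 * x + theta0

def objective(theta, xs, ys):
    """Q(Θ): half the squared error, averaged over the N data samples."""
    n = len(xs)
    return sum((f(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# Costume price vs number of issues, values from slide 11
xs, ys = [240, 480, 26], [6370, 8697, 2200]
print(objective((2000.0, 15.0), xs, ys))  # cost for a guessed Θ = (2000, 15)
```
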
  17. Gradient descent
      • Find the minimum of the objective function
      • Iteratively update the function parameters:
      $\Theta_0(t+1) = \Theta_0(t) - \alpha \frac{1}{N} \sum_{j=0}^{N} (f(x_j, \Theta) - y_j)$
      $\Theta_1(t+1) = \Theta_1(t) - \alpha \frac{1}{N} \sum_{j=0}^{N} (f(x_j, \Theta) - y_j) \, x_j$
      $t$ – iteration number
      $\alpha$ – learning step
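
The slide-17 updates as one batch gradient-descent step in Python. The update formulas follow the slide; the learning step α is my choice, kept small so the iteration stays stable for inputs of this magnitude.

```python
def f(x, theta):
    """Linear model from slide 12 (repeated so this sketch is self-contained)."""
    theta0, theta1 = theta
    return theta1 * x + theta0

def gradient_step(theta, xs, ys, alpha=1e-6):
    """One slide-17 iteration: move Θ against the gradient of Q(Θ)."""
    theta0, theta1 = theta
    n = len(xs)
    errors = [f(x, theta) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / n                             # update term for Θ0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / n  # update term for Θ1
    return (theta0 - alpha * grad0, theta1 - alpha * grad1)

xs, ys = [240, 480, 26], [6370, 8697, 2200]
theta = (0.0, 0.0)
for t in range(10000):
    theta = gradient_step(theta, xs, ys)
print(theta)  # Θ1 approaches the slope of the data; Θ0 converges more slowly
```
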
  20. NN – compute error
      [Diagram: a small network with inputs x_1, x_2 and constant +1 inputs, producing output y; the error is $(y - \text{expected output})^2$]
  21. NN – backpropagation step
      • Use gradient descent and the computed error
      • Update every weight of every neuron in the hidden and output layers
  22. NN – backpropagation step
      [Diagram: the same network, with the error $(y - \text{expected output})^2$ propagated back through the output and hidden layers]
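
Slides 20-22 show the computation only as a diagram, so here is a minimal backpropagation sketch. The squared error $(y - \text{expected output})^2$ is the one on slide 20; the one-hidden-layer architecture, sigmoid activations, XOR-style toy data, and learning step are my assumptions, not fixed by the deck.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Assumed toy data (XOR), a classic non-linearly-separable problem
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # input -> hidden (3 hidden neurons assumed)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # hidden -> output
alpha = 0.5                                    # assumed learning step

for _ in range(5000):
    # Forward pass: compute the network output
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: gradient of the squared error through each sigmoid
    d_out = 2 * (y - Y) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent on every weight of the hidden and output layers (slide 21)
    W2 -= alpha * h.T @ d_out
    b2 -= alpha * d_out.sum(axis=0)
    W1 -= alpha * X.T @ d_hid
    b1 -= alpha * d_hid.sum(axis=0)

print(y.round(2))  # should approach [[0], [1], [1], [0]]
```
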
  23. Real life problem
      • You said it can solve non-linear problems, so let's generate a superhero logo with it.
  24. Why would we let them?
      • A less complex mathematical apparatus than in supervised learning.
      • It is similar to discovering the world on your own.
  25. Why would we let them?
      Used mostly for sorting and grouping when:
      • The sorting key can't be easily figured out.
      • The data is very complex and finding the key is not trivial.
  26. Hebbian learning
      • Works similarly to nature
      • Great for beginners and biological simulations :)
      • Simple Hebbian learning rule:
      $\Delta w_{ij} = \eta \cdot x_{ij} \cdot y_i$
      $\Delta w_{ij}$ – change of the j-th weight of the i-th neuron
      $\eta$ – learning coefficient
      $x_{ij}$ – j-th input of the i-th neuron
      $y_i$ – output of the i-th neuron
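
The simple Hebbian rule from slide 26, sketched in Python for a single neuron's weights (function name and example numbers are mine):

```python
def hebbian_update(weights, inputs, output, eta=0.1):
    """Δw_ij = η · x_ij · y_i: inputs that fire together with the neuron get stronger weights."""
    return [w + eta * x * output for w, x in zip(weights, inputs)]

# Only the weight of the active input grows; the silent input's weight is unchanged
print(hebbian_update([0.5, -0.2], [1.0, 0.0], output=0.8))  # [0.58, -0.2]
```
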
  27. Hebbian learning
      • Works similarly to nature
      • Great for beginners and biological simulations :)
      • Generalised Hebbian learning rule:
      $\Delta w_{ij} = F(x_{ij}, y_i)$
      $\Delta w_{ij}$ – change of the j-th weight of the i-th neuron
      $x_{ij}$ – j-th input of the i-th neuron
      $y_i$ – output of the i-th neuron
  28. Hebb's neuron model
      [Diagram: inputs x_1 ... x_n plus a constant input 1, weights w_0 ... w_n, a summation Σ producing s and the output y; every weight changes by $\Delta w_{ij} = F(x_{ij}, y_i)$]
  29. Hebb's neuron model – a worked example
      Inputs (0.200, 0.300, 0.100) plus the constant input 1; weights (0.230, 0.010, 0.900) and bias weight 0.110.
  30. Multiply each input by its weight: 0.230·0.200 = 0.046, 0.010·0.300 = 0.003, 0.900·0.100 = 0.090, 0.110·1 = 0.110.
  31. Sum the products: s = 0.046 + 0.003 + 0.090 + 0.110 = 0.249.
  32. Pass the sum through the activation function: y = 0.562.
  33. The neuron outputs y = 0.562.
  34. Apply the Hebbian rule to every weight: the updates are +0.011, +0.016, +0.005 for the inputs 0.200, 0.300, 0.100 and +0.056 for the constant input.
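
The slide 29-34 walkthrough reproduced in Python. The inputs, weights, and printed values come from the slides; η = 0.1 is an inference from the printed updates, and the sigmoid activation is inferred from y = 0.562 for s = 0.249.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

inputs  = [0.200, 0.300, 0.100, 1.0]    # the last entry is the constant input
weights = [0.230, 0.010, 0.900, 0.110]  # the last entry is the bias weight
eta = 0.1                               # assumed; it matches the slides' +0.011 ... +0.056

s = sum(w * x for w, x in zip(weights, inputs))   # 0.046 + 0.003 + 0.090 + 0.110 = 0.249
y = sigmoid(s)                                    # ~0.562
deltas = [eta * x * y for x in inputs]            # Δw = η·x·y for every weight
weights = [w + d for w, d in zip(weights, deltas)]
print(round(s, 3), round(y, 3), [round(d, 3) for d in deltas])
```
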
  35. Marvel database to the rescue

                       Intelligence  Strength  Speed  Durability  Energy projection  Fighting skills
      Iron Man                    6         6      5           6                  6                4
      Spiderman                   4         4      3           3                  1                4
      Black Panther               5         3      2           3                  3                5
      Wolverine                   2         4      2           4                  1                7
      Thor                        2         7      7           6                  6                4
      Dr Strange                  4         2      7           2                  6                6
      Hulk                        2         7      3           7                  5                4
      Cpt. America                3         3      2           3                  1                6
      Mr Fantastic                6         2      2           5                  1                3
      Human Torch                 2         2      5           2                  5                3
      Invisible Woman             3         2      3           6                  5                3
      The Thing                   3         6      2           6                  1                5
      Luke Cage                   3         4      2           5                  1                4
      She Hulk                    3         7      3           6                  1                4
      Ms Marvel                   2         6      2           6                  1                4
      Daredevil                   3         3      2           2                  4                5
  36. Hebbian learning weaknesses
      • Unstable.
      • Prone to growing the weights ad infinitum.
      • Some groups can trigger no response.
      • Some groups may trigger a response from too many neurons.
  37. Learning with concurrency
      • You try to generalize the input vector in the weights vector.
      • Instead of checking the reaction to the input, you check the distance between the two vectors.
      • Ideally, each neuron specializes in generalizing one class.
      • Two main strategies: Winner Takes All (WTA) and Winner Takes Most (WTM).
  38. Idea behind it
      Example: (1.0, 2.0, 3.0)
      Neuron weights: (3.0, 2.0, 2.0)
      Distance: $d_i = w_i - x_i$ = (2.0, 0.0, -1.0)
      Euclidean distance: $\sqrt{\sum_{i=1}^{n} d_i^2} = \sqrt{5}$
  39. Idea behind it
      Learning coefficient: η = 0.100
      Learning step: $\Delta w_i = \eta \cdot d_i$ = (0.2, 0.0, -0.1)
  40. Idea behind it
      Updated weights $w'_i = w_i - \Delta w_i$:
      2.8 = 3.0 - 0.2
      2.0 = 2.0 - 0.0
      2.1 = 2.0 - (-0.1)
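
Slides 38-40 in Python: compute the neuron-to-example distance, then pull the weights toward the example. Under WTA only the winning (closest) neuron gets this update; under WTM its neighbours are updated too, with smaller steps. Function names are mine; the numbers reproduce the slides.

```python
import math

def euclidean_distance(weights, example):
    """√(Σ d_i²) with d_i = w_i − x_i, as on slide 38."""
    return math.sqrt(sum((w - x) ** 2 for w, x in zip(weights, example)))

def competitive_update(weights, example, eta=0.1):
    """w'_i = w_i − η·(w_i − x_i): move the weight vector toward the example."""
    return [w - eta * (w - x) for w, x in zip(weights, example)]

weights, example = [3.0, 2.0, 2.0], [1.0, 2.0, 3.0]
print(euclidean_distance(weights, example))  # √5 ≈ 2.236
print(competitive_update(weights, example))  # [2.8, 2.0, 2.1], as on slide 40
```
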
  41. Learning with concurrency
      • Gives more diverse groups.
      • Less prone to clustering (than Hebb's).
      • Searches a wider spectrum of answers.
      • A first step towards more complex networks.
  42. Learning with concurrency – weaknesses
      • WTA works best if the teaching examples are evenly distributed in the solution space.
      • WTM works best if the weight vectors are evenly distributed in the solution space.
      • Both can still get stuck in a local optimum.
  43. Kohonen's self-organizing map
      • The most popular self-organizing network with a concurrency algorithm.
      • It teaches groups of neurons with the WTM algorithm.
      • Special features:
        – Neurons are organised in a grid
        – Nevertheless, they are treated as a single layer
  44. Kohonen's self-organizing map
      $w_{ij}(s+1) = w_{ij}(s) + \Theta(k_{best}, i, s) \cdot \eta(s) \cdot (I_j(s) - w_{ij}(s))$
      $s$ – epoch number
      $k_{best}$ – best-matching neuron
      $w_{ij}(s)$ – j-th weight of the i-th neuron
      $\Theta(k_{best}, i, s)$ – neighbourhood function
      $\eta(s)$ – learning coefficient for epoch s
      $I_j(s)$ – j-th component of the example for epoch s
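
One SOM training step sketched in Python, following the slide-44 update. The grid layout, the Gaussian neighbourhood function, and the decay schedules are common choices I'm assuming; the deck does not pin them down.

```python
import math

def som_step(grid, example, s, eta0=0.5, sigma0=1.5):
    """One slide-44 update; grid maps (row, col) positions to weight vectors."""
    # 1. Winner: the neuron whose weights are closest to the example
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, example))
    k_best = min(grid, key=lambda k: dist2(grid[k]))
    # 2. η(s) and the neighbourhood width shrink with the epoch (assumed schedules)
    eta = eta0 / (1 + s)
    sigma = sigma0 / (1 + s)
    # 3. w_ij(s+1) = w_ij(s) + Θ(k_best, i, s)·η(s)·(I_j(s) − w_ij(s))
    for k, w in grid.items():
        grid_d2 = (k[0] - k_best[0]) ** 2 + (k[1] - k_best[1]) ** 2
        theta = math.exp(-grid_d2 / (2 * sigma ** 2))  # Gaussian neighbourhood (assumed)
        grid[k] = [wi + theta * eta * (xi - wi) for wi, xi in zip(w, example)]

# A tiny 2x2 grid of 3-dimensional neurons with toy initial weights
grid = {(r, c): [0.1 * (r + c + d) for d in range(3)] for r in range(2) for c in range(2)}
som_step(grid, [1.0, 2.0, 3.0], s=0)
print(grid[(0, 0)])
```
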
  45. SOM model
      [Image: SOM model. By Mcld, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10373592]
  46. Common weaknesses of artificial neuron systems
      • We are still dependent on the randomized initial weights.
      • All of these algorithms can get stuck in a local optimum.
  47. Bibliography
      • Presentation + code: https://bitbucket.org/medwith/public/downloads/ml-math-ndcoslo19.zip
      • https://www.coursera.org/learn/machine-learning
      • https://www.coursera.org/specializations/deep-learning
      • Math for Machine Learning - Amazon Training and Certification
      • Linear and Logistic Regression - Amazon Training and Certification
      • Grus J., Data Science from Scratch: First Principles with Python
      • Patterson J., Gibson A., Deep Learning: A Practitioner's Approach
      • Trask A., Grokking Deep Learning
      • Stroud K. A., Booth D. J., Engineering Mathematics
      • https://github.com/massie/octave-nn - neural network Octave implementation
      • https://www.desmos.com/calculator/dnzfajfpym - Nanananana ... Batman equation ;)
      • https://xkcd.com/605/ - extrapolating ;)
      • http://dilbert.com/strip/2013-02-02 - Dilbert & Machine Learning