
Class 4: Cost of Empirical Risk Minimization

David Evans
January 24, 2019

https://uvammm.github.io/class4

Markets, Mechanisms, and Machines
University of Virginia
cs4501/econ4559 Spring 2019
David Evans and Denis Nekipelov
https://uvammm.github.io/


Transcript

  1. MARKETS, MECHANISMS, MACHINES
    University of Virginia, cs4501/econ4559, Spring 2019
    Class 4: Cost of Empirical Risk Minimization
    24 January 2019
    David Evans and Denis Nekipelov
    https://uvammm.github.io
  2. Plan
    Recap: Risk Minimization
    Empirical Risk Minimization
    How hard is it to solve ERM?
      “Hard Problems”
      “Solving” Intractable Problems
      Special Cases
  3. Recap: Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y). Given a set of possible functions, ℋ,
    choose the hypothesis function h* ∈ ℋ that minimizes Risk:
      R(h) = E[L(y, h(x))] = ∫ L(y, h(x)) dP(x, y)
      h* = argmin_{h ∈ ℋ} R(h)
    Vapnik’s notation: choose α ∈ Λ that minimizes R(α) = ∫ L(y, f(x, α)) dP(x, y)
  4. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
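To make the definition on this slide concrete, here is a minimal sketch of computing the empirical risk (not from the deck; the function and variable names are illustrative, assuming NumPy arrays):

```python
import numpy as np

def empirical_risk(h, loss, X, y):
    """R_emp(h) = (1/n) * sum over i of L(y_i, h(x_i))."""
    predictions = np.array([h(x) for x in X])
    return np.mean(loss(y, predictions))

# Squared loss with a linear hypothesis h(x) = w . x:
squared_loss = lambda y, y_hat: (y_hat - y) ** 2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])
print(empirical_risk(lambda x: w @ x, squared_loss, X, y))  # 0.0
```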
  5. Empirical Risk Minimization
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
    Choosing the set ℋ:
  6. Neural Information Processing Systems, 1991

  7. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    How expensive is it to find h*?
  8. Special Case: Ordinary Least Squares
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    Squared Loss: L(y, ŷ) = (ŷ − y)²
    Set of functions: ℋ = {wᵀx + b | w ∈ ℝᵈ, b ∈ ℝ}
  9. Special Case: Ordinary Least Squares
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    Squared Loss: L(y, ŷ) = (ŷ − y)²
    Set of functions: ℋ = {wᵀx + b | w ∈ ℝᵈ, b ∈ ℝ}
  10. Model Evaluation: Simple Linear Regression
    f(x) = β₀ + β₁x
  11. Example ERM
    Set of Functions (ℋ)   Loss (L)      Name
    h(x) = wᵀx             (h(x) − y)²   Ordinary Least Squares
    where ℋ = {h | h(x) = wᵀx, x ∈ ℝᵈ, w ∈ ℝᵈ}
  12. Restricting through Regularization
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    Instead of explicitly defining ℋ, use a regularizer added to the loss to
    constrain the solution space:
      h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
  13. Popular Regularizers
    ℋ = {h | h(x) = wᵀx, x ∈ ℝᵈ, w ∈ ℝᵈ}
    h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
    ℓ₁ regularizer: Reg(h) = ‖w‖₁ = Σᵢ |wᵢ|
    ℓ₂ regularizer: Reg(h) = ‖w‖₂² = Σᵢ wᵢ² = wᵀw
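Both regularizers are one-liners in code; a small sketch for reference (illustrative names, assuming NumPy):

```python
import numpy as np

def l1_reg(w):
    # ||w||_1 = sum of |w_i|; tends to drive weights exactly to zero
    return np.sum(np.abs(w))

def l2_reg(w):
    # ||w||_2^2 = sum of w_i^2 = w^T w; penalizes large weights smoothly
    return w @ w

w = np.array([3.0, -4.0])
print(l1_reg(w), l2_reg(w))  # 7.0 25.0
```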
  14. Some Specific ERMs
    Set of Functions (ℋ)   Regularizer   Loss (L)      Name
    h(x) = wᵀx             (none)        (h(x) − y)²   Ordinary Least Squares
    h(x) = wᵀx             ‖w‖₂²         (h(x) − y)²   Ridge Regression
    where ‖w‖₂² = wᵀw
  15. Some Specific ERMs
    Set of Functions (ℋ)   Regularizer    Loss (L)                Name
    h(x) = wᵀx             (none)         (h(x) − y)²             Ordinary Least Squares
    h(x) = wᵀx             ‖w‖₂²          (h(x) − y)²             Ridge Regression
    h(x) = wᵀx + b         often ℓ₁, ℓ₂   log(1 + exp(−y·h(x)))   Logistic Regression
    …                      …              …                       …
    Science (and Art) of ERM is picking ℋ, Reg, and L
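Ridge regression also admits a closed form, w* = (XᵀX + λI)⁻¹ Xᵀy (conventions differ on whether λ is scaled by n; this sketch assumes the unscaled form), and the logistic loss in the table assumes labels y ∈ {−1, +1}. The names below are illustrative, not from the deck:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """argmin_w ||Xw - y||^2 + lam * ||w||_2^2  =>  w = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def logistic_loss(y, score):
    """log(1 + exp(-y * score)) for y in {-1, +1}; log1p keeps accuracy
    when the exponential is small."""
    return np.log1p(np.exp(-y * score))

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
print(ridge_fit(X, y, lam=0.1))  # approximately [2.55]
```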
  16. How to choose?
    Science (and Art) of ERM is picking ℋ, Reg, and L
  17. Cost of Computing

  18. Class Background
    cs111x: 34 (out of 38 total)
    cs2102: 28
    cs2150: 27
    cs3102: 11
    cs4102: 17
    Some Machine Learning Course: 6
  19. Cost Questions
    Cost of Training:    h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
    Cost of Prediction:  ŷ = h*(x)
  20. Cost of Prediction
    f(x) = β₀ + β₁x
    How expensive is it to compute prediction f(x)?
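For this model, the answer is that prediction cost is constant and, notably, independent of training-set size; a trivial sketch (for a d-dimensional linear model it becomes one dot product, O(d)):

```python
def predict(beta0, beta1, x):
    # One multiply and one add per prediction: O(1),
    # no matter how many examples were used in training.
    return beta0 + beta1 * x

print(predict(1.0, 2.0, 3.0))  # 7.0
```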
  21. How to measure cost?

  22. https://aws.amazon.com/ec2/spot/pricing/

  23. https://aws.amazon.com/ec2/spot/pricing/

  24. https://aws.amazon.com/ec2/spot/pricing/

  25. https://aws.amazon.com/ec2/spot/pricing/

  26. None

  27. None
  28. None
  29. None
  30. Bandwidth is expensive!

  31. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    How expensive is it to find h*?
  32. Alan Turing, 1912–1954. How should we model a computer?

  33. Cray-1 (1976), Pebble (2014), Apple II (1977), Palm Pre (2009),
    MacBook Air (2008), Surface (2016)
  34. Colossus (1944), Apollo Guidance Computer (1969), Honeywell Kitchen
    Computer (1969) ($10,600 “complete with two-week programming course”)
  35. What Computers was Turing modelling?

  36. None
  37. Modeling “Scratch Paper” (and Input and Output)

  38. Two-Dimensional Paper = One-Dimensional Tape

  39. Modeling Pencil and Paper
    (tape illustration: … # C S S A 7 2 3 …)
    How long should the tape be?
  40. Modeling Processing

  41. Modeling Processing

  42. Modeling Processing
    Look at the current state of the computation
    Follow simple rules about what to do next
    Scratch paper to keep track
  43. Modeling Processing (Brains)
    Follow simple rules
    Remember what you are doing
    “For the present I shall only say that the justification lies in the fact
    that the human memory is necessarily limited.” Alan Turing
  44. Modelling Processing
    M = (Q, δ ⊆ Q × Γ → Q, q₀ ∈ Q)
    Q is a finite set; Γ is a finite set of symbols that can be written in memory
    (tape illustration: … # C S S A 7 2 3 …)
  45. Turing’s Model
    TM = (Q,                        a finite set of (head) states
          δ ⊆ Q × Γ → Q × Γ × Dir,  transition function
          q₀ ∈ Q,                   start state
          Q_accept ⊆ Q,             accepting states)
    Q is a finite set; Γ is a finite set of symbols that can be written in memory
    Dir = {Left, Right, Halt}
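A minimal simulator for this model (not from the deck; encoding δ as a Python dict and the direction letters 'L', 'R', 'H' are illustrative choices):

```python
def run_tm(delta, q0, accept, tape, blank="_", max_steps=10_000):
    """Simulate TM = (Q, delta, q0, Q_accept): delta maps (state, symbol)
    to (state, symbol, direction), with direction in {'L', 'R', 'H'}."""
    cells = dict(enumerate(tape))   # sparse one-dimensional tape
    q, head = q0, 0
    for _ in range(max_steps):
        q, cells[head], move = delta[(q, cells.get(head, blank))]
        if move == "H":
            return q in accept      # accept iff halting state is accepting
        head += 1 if move == "R" else -1
    raise RuntimeError("no halt within max_steps")

# A machine that accepts iff the tape starts with '1':
delta = {("s", "1"): ("yes", "1", "H"),
         ("s", "0"): ("no", "0", "H"),
         ("s", "_"): ("no", "_", "H")}
print(run_tm(delta, "s", {"yes"}, "101"))  # True
```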
  46. None

  47. TM Computation
    Definition. A language (in most Computer Science uses) is a (possibly
    infinite) set of finite strings.
    Σ = alphabet, a finite set of symbols
    L ⊆ Σ*
  48. Cost of Prediction
    f(x) = β₀ + β₁x
    Can we define the prediction problem as a language?
  49. Addition Language
    A_ADD = { a₁a₂…aₙ + b₁b₂…bₙ = c₁c₂…cₙcₙ₊₁ | aᵢ, bᵢ, cᵢ ∈ {0, 1}, c = a + b }
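Deciding membership in this language is cheap; here is a sketch of a checker (assuming most-significant-bit-first binary with literal ' + ' and ' = ' separators; the slide's exact string encoding may differ):

```python
def in_add_language(s):
    """Check 'a + b = c' with a, b n-bit binary strings, c an (n+1)-bit
    binary string, and c = a + b as integers."""
    try:
        lhs, c = s.split(" = ")
        a, b = lhs.split(" + ")
    except ValueError:
        return False                      # wrong overall shape
    if not all(x and set(x) <= {"0", "1"} for x in (a, b, c)):
        return False                      # non-binary symbols
    if len(a) != len(b) or len(c) != len(a) + 1:
        return False                      # lengths must match the definition
    return int(a, 2) + int(b, 2) == int(c, 2)

print(in_add_language("11 + 01 = 100"))   # True: 3 + 1 = 4
```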
  50. Cost of Prediction
    f(x) = β₀ + β₁x
    Can we define the prediction problem as a language?
  51. Cost of Multiplication

  52. Most recent improvement: 2007

  53. Most recent improvement: 2007

  54. Talking about Cost
    Cost of an algorithm: (nearly) always can get a tight bound.
    Naive multiplication algorithm for two N-digit integers has running time
    cost in Θ(N²).
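The naive (grade-school) algorithm referenced here, sketched over decimal digit lists; the two nested loops make the Θ(N²) digit-operation count explicit (illustrative code, not from the deck):

```python
def naive_multiply(a, b):
    """Grade-school multiplication of digit lists (least significant digit
    first). Two nested loops over N digits each: Theta(N^2) operations."""
    result = [0] * (len(a) + len(b))
    for i, da in enumerate(a):
        carry = 0
        for j, db in enumerate(b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry
    return result

# 12 * 34 = 408; digits given least significant first:
print(naive_multiply([2, 1], [4, 3]))  # [8, 0, 4, 0]
```

Faster algorithms (Karatsuba's divide and conquer, and the Fürer 2007 result the deck cites) show that this algorithm's Θ(N²) bound is not a tight bound on the problem.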
  55. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    O(N²). Proof:
  56. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    O(N²). Proof: naive multiplication solves it and has running time in Θ(N²).
  57. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    Ω(N). Proof: changing the value of any digit can change the output, so we
    at least need to look at all input digits (in the worst case).
  58. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    O(N²). Proof: naive multiplication solves it and has running time in Θ(N²).
    Fürer 2007 (the most recent improvement to the upper bound at the time)
  59. Charge
    Concrete Costs: matter in practice
      size of dataset
      cost of computing resources (where, when, etc.)
    Asymptotic Costs: important for understanding
      based on abstract models of computing
      predicting costs as data scales
    Project 2: will be posted by tomorrow