David Evans
January 24, 2019

Class 4: Cost of Empirical Risk Minimization

https://uvammm.github.io/class4

Markets, Mechanisms, and Machines
University of Virginia
cs4501/econ4559 Spring 2019
David Evans and Denis Nekipelov
https://uvammm.github.io/


Transcript

1. MARKETS, MECHANISMS, MACHINES
   University of Virginia, Spring 2019
   Class 4: Cost of Empirical Risk Minimization
   24 January 2019
   cs4501/econ4559
   David Evans and Denis Nekipelov
   https://uvammm.github.io
2. Plan
   Recap: Risk Minimization
   Empirical Risk Minimization
   How hard is it to solve ERM?
   "Hard Problems"
   "Solving" Intractable Problems
   Special Cases
3. Recap: Risk Minimization
   Nature provides inputs x with labels y provided by supervisor, drawn randomly from distribution P(x, y).
   Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Risk:
   R(h) = E[L(y, h(x))] = ∫ L(y, h(x)) dP(x, y)
   h* = argmin_{h ∈ ℋ} R(h)
   Vapnik's notation: choose α ∈ Λ that minimizes R(α) = ∫ L(y, f(x, α)) dP(x, y)
4. Empirical Risk Minimization
   Nature provides inputs x with labels y provided by supervisor, drawn randomly from distribution P(x, y):
   Training data = (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
   Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Empirical Risk:
   R_emp(h) = (1/n) Σ_{i=1..n} L(y_i, h(x_i))
   h* = argmin_{h ∈ ℋ} R_emp(h)
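The empirical risk is just an average of per-example losses over the training data. A minimal Python sketch of the formula above (function and variable names are illustrative, not from the slides):

```python
def squared_loss(y, y_hat):
    """Squared loss L(y, y_hat) = (y_hat - y)^2."""
    return (y_hat - y) ** 2

def empirical_risk(loss, h, data):
    """R_emp(h) = (1/n) * sum of loss(y_i, h(x_i)) over the training data."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

# Toy training data and a candidate hypothesis h(x) = 2x
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
h = lambda x: 2 * x
print(empirical_risk(squared_loss, h, data))
```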
5. Empirical Risk Minimization
   Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Empirical Risk:
   R_emp(h) = (1/n) Σ_{i=1..n} L(y_i, h(x_i))
   Choosing the set ℋ:

7. Empirical Risk Minimization
   Nature provides inputs x with labels y provided by supervisor, drawn randomly from distribution P(x, y):
   Training data = (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
   Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Empirical Risk:
   R_emp(h) = (1/n) Σ_{i=1..n} L(y_i, h(x_i))
   h* = argmin_{h ∈ ℋ} R_emp(h)
   How expensive is it to find h*?
8. Special Case: Ordinary Least Squares
   Nature provides inputs x with labels y provided by supervisor, drawn randomly from distribution P(x, y):
   Training data = (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
   Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Empirical Risk:
   R_emp(h) = (1/n) Σ_{i=1..n} L(y_i, h(x_i))
   h* = argmin_{h ∈ ℋ} R_emp(h)
   Squared Loss: L(y, ŷ) = (ŷ − y)²
   Set of functions: ℋ = {h(x) = wᵀx + b | w ∈ ℝᵈ, b ∈ ℝ}
11. Example ERM
    Name: Ordinary Least Squares
    Set of Functions (ℋ): ℋ = {h | h(x) = wᵀx, w ∈ ℝᵈ, x ∈ ℝᵈ}
    Loss (L): (h(x) − y)²
12. Restricting through Regularization
    Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Empirical Risk:
    R_emp(h) = (1/n) Σ_{i=1..n} L(y_i, h(x_i))
    h* = argmin_{h ∈ ℋ} R_emp(h)
    Instead of explicitly defining ℋ, use a regularizer added to the loss to constrain the solution space:
    h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
13. Popular Regularizers
    ℋ = {h | h(x) = wᵀx, w ∈ ℝᵈ}
    ℓ₁ regularizer: Reg(h) = ‖w‖₁ = Σ_{j=1..d} |w_j|
    ℓ₂ regularizer: Reg(h) = ‖w‖₂² = Σ_{j=1..d} w_j² = wᵀw
    h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
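For squared loss with the ℓ₂ regularizer (ridge regression), the regularized ERM also has a closed-form solution, w = (XᵀX + λI)⁻¹ Xᵀy, a standard result not derived on the slide. A sketch with illustrative data:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum (w.x_i - y_i)^2 + lam * ||w||_2^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])          # exactly y = 2x
print(ridge_fit(X, y, 0.0))            # lam = 0 recovers OLS: [2.]
print(ridge_fit(X, y, 1.0))            # lam > 0 shrinks the weight toward 0
```

Larger λ pulls the weights toward zero, implicitly restricting ℋ without enumerating it.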
14. Some Specific ERMs
    Name                     Set of Functions (ℋ)   Regularizer     Loss (L)
    Ordinary Least Squares   h(x) = wᵀx             (none)          (h(x) − y)²
    Ridge Regression         h(x) = wᵀx             ‖w‖₂² = wᵀw     (h(x) − y)²
15. Some Specific ERMs
    Name                     Set of Functions (ℋ)   Regularizer     Loss (L)
    Ordinary Least Squares   h(x) = wᵀx             (none)          (h(x) − y)²
    Ridge Regression         h(x) = wᵀx             ‖w‖₂² = wᵀw     (h(x) − y)²
    Logistic Regression      h(x) = wᵀx + b         often ℓ₁, ℓ₂    log(1 + e^(−y·h(x)))
    …                        …                      …               …
    Science (and Art) of ERM is picking ℋ, Reg, and L
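The logistic loss in the table can be evaluated directly. A sketch assuming labels y ∈ {−1, +1} and a real-valued score h(x), a common convention the slide does not pin down:

```python
import math

def logistic_loss(y, score):
    """log(1 + e^{-y * score}) with labels y in {-1, +1} (assumed convention)."""
    return math.log1p(math.exp(-y * score))

# A confident correct prediction incurs near-zero loss;
# a confident wrong prediction incurs a large loss.
print(logistic_loss(+1, 5.0))
print(logistic_loss(+1, -5.0))
```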
16. How to choose?
    Science (and Art) of ERM is picking ℋ, Reg, and L

18. Class Background
    cs111x: 34 (out of 38 total)
    cs2102: 28
    cs2150: 27
    cs3102: 11
    cs4102: 17
    Some Machine Learning Course: 6
19. Cost Questions
    Cost of Training: h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
    Cost of Prediction: ŷ = h*(x)
20. Cost of Prediction
    ŷ(x) = w₀ + w₁x
    How expensive is it to compute prediction ŷ(x)?
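For a linear hypothesis h(x) = wᵀx + b, one prediction costs one multiplication and one addition per feature, i.e. O(d) for d features. A sketch (names are illustrative):

```python
def predict(w, b, x):
    """One prediction: d multiplications and d additions, so O(d)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [2.0, -1.0, 0.5], 1.0
print(predict(w, b, [1.0, 2.0, 4.0]))   # 2 - 2 + 2 + 1 = 3.0
```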


31. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn randomly from distribution P(x, y):
    Training data = (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
    Given a set of possible functions, ℋ, choose the hypothesis function h* ∈ ℋ that minimizes Empirical Risk:
    R_emp(h) = (1/n) Σ_{i=1..n} L(y_i, h(x_i))
    h* = argmin_{h ∈ ℋ} R_emp(h)
    How expensive is it to find h*?

33. Cray-1 (1976) Pebble (2014) Apple II (1977) Palm Pre (2009)

MacBook Air (2008) Surface (2016)
34. Colossus (1944) Apollo Guidance Computer (1969) Honeywell Kitchen Computer (1969)

(\$10,600 “complete with two-week programming course”)


39. Modeling Pencil and Paper
    [tape drawing: … # C S S A 7 2 3 …]
    How long should the tape be?

42. Modeling Processing
    Look at the current state of the computation
    Follow simple rules about what to do next
    Scratch paper to keep track
43. Modeling Processing (Brains)
    Follow simple rules
    Remember what you are doing
    "For the present I shall only say that the justification lies in the fact that the human memory is necessarily limited." (Alan Turing)
44. Modeling Processing
    M = (S, δ: S × Γ → S, q₀ ∈ S)
    S is a finite set; Γ is a finite set of symbols that can be written in memory
    [tape drawing: … # C S S A 7 2 3 …]
45. Turing's Model
    TM = (S,  a finite set of (head) states
          δ: S × Γ → S × Γ × Dir,  transition function
          q₀ ∈ S,  start state
          q_accept ⊆ S,  accepting states)
    S is a finite set; Γ is a finite set of symbols that can be written in memory
    Dir = {Left, Right, Halt}

47. TM Computation
    Definition. A language (in most Computer Science uses) is a (possibly infinite) set of finite strings.
    Σ = alphabet, a finite set of symbols
    L ⊆ Σ*
48. Cost of Prediction
    ŷ(x) = w₀ + w₁x
    Can we define the prediction problem as a language?
49. Addition Language
    A_ADD = { a_1 a_2 … a_n + b_1 b_2 … b_n = c_1 c_2 … c_n c_{n+1} | a_i, b_i, c_i ∈ {0, 1}, c = a + b }
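A membership test for this language is easy to sketch. The code below assumes most-significant-bit-first strings and tolerates surrounding spaces, conventions the slide does not pin down:

```python
def in_add_language(s):
    """Check membership in A_ADD: 'a + b = c' where a, b are n-bit binary
    strings, c is an (n+1)-bit binary string, and c = a + b."""
    try:
        lhs, c = s.split("=")
        a, b = lhs.split("+")
    except ValueError:
        return False                     # wrong number of '+' or '=' symbols
    a, b, c = a.strip(), b.strip(), c.strip()
    if not all(t and set(t) <= {"0", "1"} for t in (a, b, c)):
        return False                     # non-binary or empty component
    if len(a) != len(b) or len(c) != len(a) + 1:
        return False                     # length constraints from the definition
    return int(a, 2) + int(b, 2) == int(c, 2)

print(in_add_language("11 + 01 = 100"))   # 3 + 1 = 4 -> True
print(in_add_language("11 + 01 = 101"))   # 3 + 1 != 5 -> False
```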
50. Cost of Prediction
    ŷ(x) = w₀ + w₁x
    Can we define the prediction problem as a language?

54. Talking about Cost
    Cost of an algorithm: (nearly) always can get a tight bound
    Naive multiplication algorithm for two N-digit integers has running time cost in Θ(N²).
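The naive (schoolbook) algorithm's Θ(N²) cost comes from its nested loop over all pairs of digits. A sketch using least-significant-digit-first lists (representation chosen for illustration):

```python
def naive_multiply(a, b):
    """Schoolbook multiplication of two N-digit numbers given as digit lists,
    least-significant digit first. The nested loop performs N*N single-digit
    multiplications: Theta(N^2)."""
    n = len(a)
    result = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            total = result[i + j] + a[i] * b[j] + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + n] += carry
    return result

# 12 * 34 = 408, digits least-significant first
print(naive_multiply([2, 1], [4, 3]))   # [8, 0, 4, 0]
```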
55. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound
    Multiplication problem for two N-digit integers has running time cost in O(N²).
    Proof:
56. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound
    Multiplication problem for two N-digit integers has running time cost in O(N²).
    Proof: naive multiplication solves it and has running time in Θ(N²).
57. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound
    Multiplication problem for two N-digit integers has running time cost in Ω(N).
    Proof: changing the value of any digit can change the output, so at least need to look at all input digits (in worst case).
58. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound
    Multiplication problem for two N-digit integers has running time cost in O(N²).
    Proof: naive multiplication solves it and has running time in Θ(N²).
    A better upper bound is known: O(N log N · 2^(O(log* N))) (Fürer 2007).
59. Charge
    Concrete Costs: matter in practice; size of dataset; cost of computing resources (where, when, etc.)
    Asymptotic Costs: important for understanding; based on abstract models of computing; predicting costs as data scales
    Project 2: will be posted by tomorrow