
# Class 4: Cost of Empirical Risk Minimization

https://uvammm.github.io/class4

Markets, Mechanisms, and Machines
University of Virginia
cs4501/econ4559 Spring 2019
David Evans and Denis Nekipelov
https://uvammm.github.io/

January 24, 2019

## Transcript

1. ### MARKETS, MECHANISMS, MACHINES

University of Virginia, Spring 2019. Class 4: Cost of Empirical Risk Minimization, 24 January 2019. cs4501/econ4559, David Evans and Denis Nekipelov, https://uvammm.github.io
2. ### Plan

Recap: Risk Minimization. Empirical Risk Minimization. How hard is it to solve ERM? "Hard Problems". "Solving" Intractable Problems. Special Cases.
3. ### Recap: Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by supervisor, drawn randomly from distribution $P(x, y)$. Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Risk:

$$R(h) = \mathbb{E}[\ell(y, h(x))] = \int \ell(y, h(x)) \, dP(x, y) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R(h)$$

Vapnik's notation: choose $\alpha \in \Lambda$ that minimizes $R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y)$.
4. ### Empirical Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by supervisor, drawn randomly from distribution $P(x, y)$: Training data $= ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$
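The empirical risk above is just an average of per-example losses, so it is direct to compute for any candidate hypothesis. A minimal sketch (the function names are illustrative, not from the slides):

```python
# Empirical risk R_emp(h): average loss over the n labeled training examples.

def empirical_risk(h, loss, data):
    """data: list of (x, y) pairs; returns (1/n) * sum of loss(y, h(x))."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

def squared_loss(y, y_hat):
    return (y_hat - y) ** 2

# Example: evaluate the hypothesis h(x) = 2x on three points.
data = [(1, 2), (2, 4), (3, 7)]
risk = empirical_risk(lambda x: 2 * x, squared_loss, data)
# only the third point is wrong: (6 - 7)^2 = 1, so risk = 1/3
```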
5. ### Empirical Risk Minimization

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i))$$

Choosing the set $\mathcal{H}$:

7. ### Empirical Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by supervisor, drawn randomly from distribution $P(x, y)$: Training data $= ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

How expensive is it to find $h^*$?
8. ### Special Case: Ordinary Least Squares

Nature provides inputs $x$ with labels $y$ provided by supervisor, drawn randomly from distribution $P(x, y)$: Training data $= ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

Squared Loss: $\ell(y, \hat{y}) = (\hat{y} - y)^2$. Set of functions: $\mathcal{H} = \{w^T x + b \mid w \in \mathbb{R}^d, b \in \mathbb{R}\}$
9. ### Special Case: Ordinary Least Squares

Nature provides inputs $x$ with labels $y$ provided by supervisor, drawn randomly from distribution $P(x, y)$: Training data $= ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

Squared Loss: $\ell(y, \hat{y}) = (\hat{y} - y)^2$. Set of functions: $\mathcal{H} = \{w^T x + b \mid w \in \mathbb{R}^d, b \in \mathbb{R}\}$

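For squared loss and a one-dimensional linear hypothesis $h(x) = wx + b$, the OLS instance of ERM has a well-known closed form (the 1-D normal equations: $w = \mathrm{cov}(x,y)/\mathrm{var}(x)$, $b = \bar{y} - w\bar{x}$). A minimal sketch under those assumptions; the names are illustrative:

```python
# Closed-form ERM for 1-D ordinary least squares: h(x) = w*x + b.

def ols_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = cov(x, y) / var(x); b = mean_y - w * mean_x
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Data lying exactly on y = 3x + 1, so ERM recovers w = 3, b = 1.
w, b = ols_fit([0, 1, 2, 3], [1, 4, 7, 10])
```

This is why OLS is a "special case": the minimizer is computed directly rather than searched for.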
11. ### Example ERM

| Set of Functions ($\mathcal{H}$) | Loss ($\ell$) | Name |
|---|---|---|
| $h(x) = w^T x$ | $(h(x) - y)^2$ | Ordinary Least Squares |

$$\mathcal{H} = \{h \mid x \in \mathbb{R}^d,\ h(x) = w^T x,\ w \in \mathbb{R}^d\}$$
12. ### Restricting through Regularization

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

Instead of explicitly defining $\mathcal{H}$, use a regularizer added to the loss to constrain the solution space:

$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h) + \lambda \, \mathrm{Reg}(h)$$
13. ### Popular Regularizers

$$\mathcal{H} = \{h \mid h(x) = w^T x,\ w \in \mathbb{R}^d\} \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h) + \lambda \, \mathrm{Reg}(h)$$

$\ell_1$ regularizer: $\mathrm{Reg}(h) = \|w\|_1 = \sum_{i=1}^{d} |w_i|$

$\ell_2$ regularizer: $\mathrm{Reg}(h) = \|w\|_2^2 = \sum_{i=1}^{d} w_i^2 = w^T w$
14. ### Some Specific ERMs

| Set of Functions ($\mathcal{H}$) | Regularizer | Loss ($\ell$) | Name |
|---|---|---|---|
| $h(x) = w^T x$ | | $(h(x) - y)^2$ | Ordinary Least Squares |
| $h(x) = w^T x$ | $\|w\|_2^2$ | $(h(x) - y)^2$ | Ridge Regression |

$\|w\|_2^2 = w^T w$
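Ridge regression also stays closed-form. In one dimension with no intercept, setting the derivative of $\frac{1}{n}\sum_i (wx_i - y_i)^2 + \lambda w^2$ to zero gives $w = \sum_i x_i y_i \,/\, (\sum_i x_i^2 + n\lambda)$. A hand-derived sketch under those simplifying assumptions, where $\lambda = 0$ recovers ordinary least squares:

```python
# 1-D ridge regression (no intercept): minimize
#   (1/n) * sum (w*x_i - y_i)^2 + lam * w^2
# Closed form: w = sum(x*y) / (sum(x^2) + n*lam).

def ridge_fit_1d(xs, ys, lam):
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)

# Data on the line y = 2x: lam = 0 gives w = 2 exactly;
# lam > 0 shrinks the weight toward 0, as the regularizer intends.
w_ols = ridge_fit_1d([1, 2, 3], [2, 4, 6], 0.0)
w_reg = ridge_fit_1d([1, 2, 3], [2, 4, 6], 1.0)
```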
15. ### Some Specific ERMs

| Set of Functions ($\mathcal{H}$) | Regularizer | Loss ($\ell$) | Name |
|---|---|---|---|
| $h(x) = w^T x$ | | $(h(x) - y)^2$ | Ordinary Least Squares |
| $h(x) = w^T x$ | $\|w\|_2^2$ | $(h(x) - y)^2$ | Ridge Regression |
| $h(x) = w^T x + b$ | often $\ell_1$, $\ell_2$ | $\log(1 + e^{-y \, h(x)})$ | Logistic Regression |
| … | … | … | … |

Science (and Art) of ERM is picking $\mathcal{H}$, Reg, and $\ell$.
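The logistic loss in the table can be evaluated directly. A sketch assuming the common convention of labels $y \in \{-1, +1\}$ and score $h(x) = w^T x + b$ (that label convention is an assumption, not stated on the slide):

```python
import math

# Logistic loss log(1 + exp(-y * h(x))) for labels y in {-1, +1}.

def logistic_loss(y, score):
    return math.log(1 + math.exp(-y * score))

# A confident correct prediction has near-zero loss;
# a confident wrong prediction has large loss.
low = logistic_loss(+1, 5.0)     # correct, confident
high = logistic_loss(+1, -5.0)   # wrong, confident
```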
16. ### How to choose?

Science (and Art) of ERM is picking $\mathcal{H}$, Reg, and $\ell$.

18. ### Class Background

cs111x: 34 (out of 38 total), cs2102: 28, cs2150: 27, cs3102: 11, cs4102: 17, Some Machine Learning Course: 6
19. ### Cost Questions

Cost of Training: $h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h) + \lambda \, \mathrm{Reg}(h)$

Cost of Prediction: $\hat{y} = h^*(x)$
20. ### Cost of Prediction

$$\hat{y}(x) = w_0 + w_1 x$$

How expensive is it to compute prediction $\hat{y}(x)$?
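The two-weight prediction above generalizes to $d$ features as a dot product: $d$ multiplications and $d$ additions, so one prediction costs time linear in the number of features. A sketch (names illustrative):

```python
# Predicting with a learned linear model: y_hat = b + w . x.
# One pass over the d features: d multiplications and d additions.

def predict(w, b, x):
    return b + sum(wi * xi for wi, xi in zip(w, x))

y_hat = predict([2.0, -1.0], 0.5, [3.0, 4.0])   # 0.5 + 6.0 - 4.0 = 2.5
```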


31. ### Empirical Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by supervisor, drawn randomly from distribution $P(x, y)$: Training data $= ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)) \qquad h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

How expensive is it to find $h^*$?

33. ### Cray-1 (1976), Pebble (2014), Apple II (1977), Palm Pre (2009)

MacBook Air (2008), Surface (2016)
34. ### Colossus (1944), Apollo Guidance Computer (1969), Honeywell Kitchen Computer (1969)

(\$10,600 "complete with two-week programming course")


39. ### Modeling Pencil and Paper

(Sketch of an unbounded tape of symbols: ... # C S S A 7 2 3 ...)

How long should the tape be?

42. ### Modeling Processing

Look at the current state of the computation. Follow simple rules about what to do next. Scratch paper to keep track.
43. ### Modeling Processing (Brains)

Follow simple rules. Remember what you are doing. "For the present I shall only say that the justification lies in the fact that the human memory is necessarily limited." (Alan Turing)
44. ### Modeling Processing

$$M = (Q,\ \delta \subseteq Q \times \Gamma \rightarrow Q,\ q_0 \in Q)$$

$Q$ is a finite set; $\Gamma$ is a finite set of symbols that can be written in memory.

(Tape sketch: ... # C S S A 7 2 3 ...)
45. ### Turing's Model

$TM = (Q, \delta, q_0, Q_{\mathrm{Accept}})$ where:

- $Q$: a finite set of (head) states
- $\delta \subseteq Q \times \Gamma \rightarrow Q \times \Gamma \times \mathrm{Dir}$: transition function
- $q_0 \in Q$: start state
- $Q_{\mathrm{Accept}} \subseteq Q$: accepting states

$Q$ is a finite set, $\Gamma$ is a finite set of symbols that can be written in memory, and $\mathrm{Dir} = \{\text{Left}, \text{Right}, \text{Halt}\}$.
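The tuple above can be animated with a tiny simulator: repeatedly look up $\delta$(state, symbol), write, and move until a Halt. The toy machine below (my example, not one from the slides) flips every bit of its input and halts at the first blank; `'_'` as the blank symbol is an assumption of this sketch:

```python
# Minimal Turing machine simulator for delta: (state, symbol) -> (state, symbol, dir).

def run_tm(delta, q0, tape):
    cells = dict(enumerate(tape))   # sparse tape; unwritten cells read as blank '_'
    state, head = q0, 0
    while True:
        state, symbol, move = delta[(state, cells.get(head, '_'))]
        cells[head] = symbol        # write, then move (or halt)
        if move == 'Halt':
            break
        head += 1 if move == 'Right' else -1
    return ''.join(cells[i] for i in sorted(cells)).rstrip('_')

# A machine that flips 0 <-> 1, moving right, halting on blank.
flip = {
    ('s', '0'): ('s', '1', 'Right'),
    ('s', '1'): ('s', '0', 'Right'),
    ('s', '_'): ('s', '_', 'Halt'),
}
out = run_tm(flip, 's', '0110')   # → '1001'
```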

47. ### TM Computation

Definition. A language (in most Computer Science uses) is a (possibly infinite) set of finite strings. $\Sigma$ = alphabet, a finite set of symbols. $L \subseteq \Sigma^*$
48. ### Cost of Prediction

$$\hat{y}(x) = w_0 + w_1 x$$

Can we define the prediction problem as a language?
49. ### Addition Language

$$A_{\mathrm{ADD}} = \{\, a_1 a_2 \ldots a_n \,\texttt{+}\, b_1 b_2 \ldots b_n \,\texttt{=}\, c_1 c_2 \ldots c_n c_{n+1} \mid a_i, b_i, c_i \in \{0, 1\},\ c = a + b \,\}$$
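Membership in $A_{\mathrm{ADD}}$ can be decided by a short checker, which is one way to see "a problem as a language". A sketch assuming the string uses literal `+` and `=` separators as in the definition:

```python
# Decide membership in the addition language: "a+b=c" with a, b n-bit
# binary strings, c an (n+1)-bit binary string, and c = a + b.

def in_add_language(s):
    parts = s.split('=')
    if len(parts) != 2:
        return False
    left, c = parts
    terms = left.split('+')
    if len(terms) != 2:
        return False
    a, b = terms
    if not a or not all(ch in '01' for ch in a + b + c):
        return False
    if len(a) != len(b) or len(c) != len(a) + 1:
        return False
    return int(a, 2) + int(b, 2) == int(c, 2)

# 101 + 011 = 1000 in binary (5 + 3 = 8), with c one bit longer than a and b.
```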
50. ### Cost of Prediction

$$\hat{y}(x) = w_0 + w_1 x$$

Can we define the prediction problem as a language?

54. ### Talking about Cost

Cost of an algorithm: (nearly) always can get a tight bound. The naive multiplication algorithm for two $N$-digit integers has running time cost in $\Theta(N^2)$.
55. ### Talking about Cost

Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound. The multiplication problem for two $N$-digit integers has running time cost in $O(N^2)$. Proof:
56. ### Talking about Cost

Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound. The multiplication problem for two $N$-digit integers has running time cost in $O(N^2)$. Proof: naive multiplication solves it and has running time in $\Theta(N^2)$.
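The $\Theta(N^2)$ claim comes from the grade-school algorithm: each of the $N$ digits of one operand multiplies each of the $N$ digits of the other. A sketch on base-10 digit lists, least significant digit first (a representation chosen here for convenience):

```python
# Grade-school multiplication: N x N digit products, hence Theta(N^2) time.

def naive_multiply(a, b):
    """a, b: base-10 digit lists, least significant digit first."""
    out = [0] * (len(a) + len(b))
    for i, da in enumerate(a):          # N iterations ...
        for j, db in enumerate(b):      # ... each doing N digit products
            out[i + j] += da * db
    for k in range(len(out) - 1):       # propagate carries
        out[k + 1] += out[k] // 10
        out[k] %= 10
    return out

# 12 * 34 = 408, digits least-significant-first.
product = naive_multiply([2, 1], [4, 3])   # → [8, 0, 4, 0]
```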
57. ### Talking about Cost

Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound. The multiplication problem for two $N$-digit integers has running time cost in $\Omega(N)$. Proof: changing the value of any digit can change the output, so we at least need to look at all input digits (in the worst case).
58. ### Talking about Cost

Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound. The multiplication problem for two $N$-digit integers has running time cost in $O(N^2)$. Proof: naive multiplication solves it and has running time in $\Theta(N^2)$. (The $O(N^2)$ bound is not tight: Fürer 2007 gives an asymptotically faster multiplication algorithm.)
59. ### Charge

Concrete Costs: matter in practice; size of dataset; cost of computing resources (where, when, etc.)

Asymptotic Costs: important for understanding; based on abstract models of computing; predicting costs as data scales

Project 2: will be posted by tomorrow