MARKETS, MECHANISMS, MACHINES
University of Virginia, Spring 2019
Class 4: Cost of Empirical Risk Minimization
24 January 2019
cs4501/econ4559
David Evans and Denis Nekipelov

Plan
Recap: Risk Minimization
Empirical Risk Minimization
How hard is it to solve ERM?
“Hard Problems”
“Solving” Intractable Problems
Special Cases

Recap: Risk Minimization
Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$.
Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Risk:
$R(h) = \mathbb{E}[L(y, h(x))] = \int L(y, h(x)) \, dP(x, y)$
$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R(h)$
Vapnik’s notation: choose $\alpha \in \Lambda$ that minimizes $R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y)$

Empirical Risk Minimization
Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$.
Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:
$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$
$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$
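As a minimal sketch of the definition above, the empirical risk is the average loss over the training pairs, and ERM over a finite hypothesis set is a search for the smallest average. The function names (`squared_loss`, `empirical_risk`, `erm`), the toy data, and the four candidate slopes are illustrative assumptions, not part of the lecture.

```python
def squared_loss(y, y_hat):
    """Squared loss L(y, y_hat) = (y_hat - y)^2."""
    return (y_hat - y) ** 2

def empirical_risk(h, data, loss):
    """Average loss of hypothesis h over the training pairs (x_i, y_i)."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

def erm(hypotheses, data, loss):
    """Return the hypothesis in a finite set with the lowest empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data, loss))

# Hypothetical training data that follows y = 2x exactly.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# A tiny, finite hypothesis set: h(x) = w * x for a few slopes w.
slopes = [0.0, 1.0, 2.0, 3.0]
candidates = [lambda x, w=w: w * x for w in slopes]

best = erm(candidates, train, squared_loss)
```

Here the search is a brute-force scan, which only works because the hypothesis set is finite and tiny; the cost of doing this for richer sets is the subject of the rest of the class.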

Empirical Risk Minimization
Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:
$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$
Choosing the set $\mathcal{H}$:

Neural Information Processing Systems 1991

Special Case: Ordinary Least Squares
Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$.
Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:
$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$
$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$
Squared Loss: $L(y, \hat{y}) = (\hat{y} - y)^2$
Set of functions: $\mathcal{H} = \{ w^{T}x + b \mid w \in \mathbb{R}^d, \; b \in \mathbb{R} \}$

Model Evaluation: Simple Linear Regression
$\hat{y} = \beta_0 + \beta_1 x$
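For simple linear regression with squared loss, the empirical risk minimizer has a well-known closed form: the slope is the ratio of the sample covariance of x and y to the sample variance of x, and the intercept makes the line pass through the means. The sketch below assumes one-dimensional inputs with at least two distinct x values; `fit_simple_ols` and the toy data are hypothetical names and values chosen for illustration.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y ~ beta0 + beta1 * x for scalar inputs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of centered products; their ratio is the least-squares slope.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_simple_ols(xs, ys)
```

Training here takes one pass to compute the means and one pass to compute the sums, so the cost grows linearly in the number of training points under a unit-cost arithmetic assumption.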

Example ERM
Set of Functions ($\mathcal{H}$) | Loss ($L$) | Name
$h(x) = w^{T}x$ | $(h(x) - y)^2$ | Ordinary Least Squares
$\mathcal{H} = \{ h : \mathbb{R}^d \to \mathbb{R} \mid h(x) = w^{T}x, \; w \in \mathbb{R}^d \}$

Restricting through Regularization
Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:
$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$
$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$
Instead of explicitly defining $\mathcal{H}$, use a regularizer added to the loss to constrain the solution space:
$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) + \lambda \operatorname{Reg}(h)$

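Popular Regularizers
$\mathcal{H} = \{ h : \mathbb{R}^d \to \mathbb{R} \mid h(x) = w^{T}x, \; w \in \mathbb{R}^d \}$
$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) + \lambda \operatorname{Reg}(h)$
$\ell_1$ regularizer: $\operatorname{Reg}(h) = \|w\|_1 = \sum_{i=1}^{d} |w_i|$
$\ell_2$ regularizer: $\operatorname{Reg}(h) = \|w\|_2^2 = \sum_{i=1}^{d} w_i^2 = w^{T}w$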
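The two regularizers and the regularized objective can be sketched directly from their definitions. This assumes a linear hypothesis with no intercept and squared loss; the helper names (`l1_reg`, `l2_reg`, `squared_error_risk`, `regularized_objective`) and the example weights are illustrative, not from the slides.

```python
def l1_reg(w):
    """l1 regularizer: sum of absolute values of the weights."""
    return sum(abs(wi) for wi in w)

def l2_reg(w):
    """l2 regularizer: sum of squared weights, i.e. w^T w."""
    return sum(wi * wi for wi in w)

def squared_error_risk(w, data):
    """Empirical risk of h(x) = w^T x under squared loss."""
    total = 0.0
    for x, y in data:
        y_hat = sum(wi * xi for wi, xi in zip(w, x))
        total += (y_hat - y) ** 2
    return total / len(data)

def regularized_objective(w, data, lam, reg):
    """Empirical risk plus lam times the chosen regularizer."""
    return squared_error_risk(w, data) + lam * reg(w)

# Hypothetical weights and data; these weights fit the data exactly,
# so the objective value is entirely the regularization penalty.
w = [3.0, -4.0]
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], -4.0)]
penalty_l1 = regularized_objective(w, data, 0.1, l1_reg)
penalty_l2 = regularized_objective(w, data, 0.1, l2_reg)
```

With the same weights, the squared l2 penalty grows much faster than the l1 penalty as individual weights get large, which is one reason the two regularizers favor different solutions.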

Some Specific ERMs
Set of Functions ($\mathcal{H}$) | Regularizer | Loss ($L$) | Name
$h(x) = w^{T}x$ | (none) | $(h(x) - y)^2$ | Ordinary Least Squares
$h(x) = w^{T}x$ | $\lambda \|w\|_2^2$ | $(h(x) - y)^2$ | Ridge Regression
$h(x) = w^{T}x + b$ | often $\ell_1$, $\ell_2$ | $\log(1 + e^{-y h(x)})$ | Logistic Regression
… | … | … | …
$\|w\|_2^2 = w^{T}w$
Science (and Art) of ERM is picking $\mathcal{H}$, Reg, and $L$
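The logistic regression row differs from the other two mainly in its loss. A minimal sketch of that loss follows, assuming labels are encoded as -1 or +1 (an assumption; the slide does not state the label encoding) and that the hypothesis is the linear score $w^{T}x + b$. The names `logistic_loss` and `predict_score` and the example weights are hypothetical.

```python
import math

def predict_score(w, b, x):
    """Linear score h(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def logistic_loss(y, score):
    """Logistic loss log(1 + e^{-y * score}) for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * score))

# Hypothetical weights and input.
w, b = [2.0, -1.0], 0.5
x = [1.0, 1.0]
score = predict_score(w, b, x)  # 2.0 - 1.0 + 0.5

# The loss is small when the label agrees with the sign of the score
# and large when it disagrees.
loss_if_positive = logistic_loss(+1, score)
loss_if_negative = logistic_loss(-1, score)
```

Unlike squared loss, this loss never reaches zero; it only shrinks as the score moves further in the direction of the label.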

How to choose?
Science (and Art) of ERM is picking $\mathcal{H}$, Reg, and $L$

Cost of Computing

Class Background
cs111x: 34 (out of 38 total)
cs2102: 28
cs2150: 27
cs3102: 11
cs4102: 17
Some Machine Learning Course: 6

Cost Questions
Cost of Training: $h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) + \lambda \operatorname{Reg}(h)$
Cost of Prediction: $\hat{y} = h^*(x)$

Cost of Prediction
$\hat{y} = \beta_0 + \beta_1 x$
How expensive is it to compute the prediction $\hat{y}$?
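One rough first answer is to count arithmetic operations, treating each multiplication and addition as a single step. That unit-cost assumption is exactly what the following slides question, so the sketch below is only a starting point. The function name `predict_with_cost` and the general d-dimensional form are illustrative assumptions; the slide's model is the one-dimensional case.

```python
def predict_with_cost(w, b, x):
    """Return (w^T x + b, multiplications used, additions used)."""
    y_hat = b
    mults = 0
    adds = 0
    for wi, xi in zip(w, x):
        y_hat += wi * xi  # one multiplication and one addition per feature
        mults += 1
        adds += 1
    return y_hat, mults, adds

# Simple linear regression: intercept 1.0, slope 2.0, input 3.0.
simple = predict_with_cost([2.0], 1.0, [3.0])

# A three-feature linear model with hypothetical weights.
wider = predict_with_cost([1.0, 2.0, 3.0], 0.5, [4.0, 5.0, 6.0])
```

Under this count, prediction for a linear model takes d multiplications and d additions, but the count says nothing about how expensive each operation is on numbers with many digits.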

How to measure cost

Bandwidth is expensive!

Alan Turing, 1912-1954 How should we model a computer?

Cray-1 (1976) Pebble (2014) Apple II (1977) Palm Pre (2009) MacBook Air (2008) Surface (2016)

Colossus (1944) Apollo Guidance Computer (1969) Honeywell Kitchen Computer (1969) ($10,600 “complete with two-week programming course”)

What Computers was Turing modeling?

Modeling “Scratch Paper” (and Input and Output)

Two-Dimensional Paper = One-Dimensional Tape

Modeling Pencil and Paper
[Figure: a tape of cells reading # C S S A 7 2 3, continuing (...) in both directions]
How long should the tape be?

Modeling Processing

Modeling Processing
Look at the current state of the computation
Follow simple rules about what to do next
Scratch paper to keep track

Modeling Processing (Brains)
Follow simple rules
Remember what you are doing
“For the present I shall only say that the justification lies in the fact that the human memory is necessarily limited.” Alan Turing

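Modeling Processing
$M = (Q, \; \delta \subseteq Q \times \Gamma \to Q, \; q_0 \in Q)$
$Q$ is a finite set, $\Gamma$ is a finite set of symbols that can be written in memory
[Figure: a tape of cells reading # C S S A 7 2 3, continuing (...) in both directions]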

Turing’s Model
$TM = (Q, \delta, q_0, q_{Accept})$
$Q$: a finite set of (head) states
$\delta \subseteq Q \times \Gamma \to Q \times \Gamma \times Dir$: transition function
$q_0 \in Q$: start state
$q_{Accept} \subseteq Q$: accepting states
$Q$ is a finite set, $\Gamma$ is a finite set of symbols that can be written in memory
$Dir = \{\text{Left}, \text{Right}, \text{Halt}\}$
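The model above can be sketched as a small simulator: a dictionary plays the role of the transition function, and another dictionary plays the role of a tape that extends as far as needed in both directions. Several details are assumptions made for this sketch and not stated on the slide: the blank symbol `_`, rejecting when no rule applies, the step limit, and the example machine itself (which accepts binary strings containing an even number of 1s).

```python
def run_tm(delta, start, accepting, tape_input, blank="_", max_steps=10_000):
    """Simulate a Turing machine; return True if it halts in an accepting state.

    delta maps (state, symbol) to (new_state, symbol_to_write, direction),
    where direction is "Left", "Right", or "Halt".
    """
    tape = {i: s for i, s in enumerate(tape_input)}  # unbounded in both directions
    state, head = start, 0
    for _ in range(max_steps):
        symbol = tape.get(head, blank)
        if (state, symbol) not in delta:
            return False  # assumed convention: no applicable rule means reject
        state, write, direction = delta[(state, symbol)]
        tape[head] = write
        if direction == "Halt":
            return state in accepting
        head += 1 if direction == "Right" else -1
    raise RuntimeError("step limit reached without halting")

# Hypothetical example machine: track the parity of the 1s seen so far.
parity_delta = {
    ("even", "0"): ("even", "0", "Right"),
    ("even", "1"): ("odd", "1", "Right"),
    ("odd", "0"): ("odd", "0", "Right"),
    ("odd", "1"): ("even", "1", "Right"),
    ("even", "_"): ("even", "_", "Halt"),
    ("odd", "_"): ("odd", "_", "Halt"),
}

accepts_1001 = run_tm(parity_delta, "even", {"even"}, "1001")
accepts_1011 = run_tm(parity_delta, "even", {"even"}, "1011")
```

The number of loop iterations before halting is the natural running-time measure in this model; for the parity machine it is one step per input symbol plus one for the blank.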

TM Computation
Definition. A language (in most Computer Science uses) is a (possibly infinite) set of finite strings.
$\Sigma$ = alphabet, a finite set of symbols
$L \subseteq \Sigma^*$

Cost of Prediction
$\hat{y} = \beta_0 + \beta_1 x$
Can we define the prediction problem as a language?

Addition Language
$L_{ADD} = \{ a_0 a_1 \ldots a_n + b_0 b_1 \ldots b_n = c_0 c_1 \ldots c_n c_{n+1} \mid a_i, b_i, c_i \in \{0, 1\}, \; c = a + b \}$
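Deciding membership in this language amounts to checking an addition. The sketch below tests whether a string has the required shape (two equal-length binary operands, and a result exactly one bit longer) and whether the sum is correct. It assumes the leftmost bit is the most significant, which the slide does not specify; `in_l_add` is a hypothetical name.

```python
def in_l_add(s):
    """Return True if s has the form a+b=c with binary a, b, c and c = a + b."""
    if s.count("+") != 1 or s.count("=") != 1:
        return False
    left, c = s.split("=")
    if "+" not in left:
        return False  # the "+" must come before the "="
    a, b = left.split("+")
    if not a or any(ch not in "01" for ch in a + b + c):
        return False
    # Operands have the same length; the result has exactly one extra bit.
    if len(a) != len(b) or len(c) != len(a) + 1:
        return False
    # Assumption: leftmost bit is most significant.
    return int(a, 2) + int(b, 2) == int(c, 2)
```

For example, `in_l_add("01+01=010")` holds because 1 + 1 = 2, while `in_l_add("1+1=11")` does not. This checker leans on Python's built-in integer arithmetic, so it says nothing about how many steps a Turing machine would need for the same decision.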

Cost of Multiplication

Most recent improvement: 2007

Talking about Cost
Cost of an algorithm: (nearly) always can get a tight bound
Naive multiplication algorithm for two N-digit integers has running time cost in $\Theta(N^2)$.
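The quadratic cost of the naive algorithm can be seen by counting single-digit multiplications in a grade-school implementation. The sketch below works on lists of decimal digits stored least-significant first; that representation and the helper names (`to_digits`, `from_digits`, `naive_multiply`) are assumptions made for illustration.

```python
def to_digits(n):
    """Decimal digits of a non-negative integer, least significant first."""
    return [int(ch) for ch in reversed(str(n))]

def from_digits(digits):
    """Inverse of to_digits (leading zeros are ignored)."""
    return int("".join(str(d) for d in reversed(digits)))

def naive_multiply(a_digits, b_digits):
    """Grade-school multiplication.

    Returns (product digits, number of single-digit multiplications).
    """
    result = [0] * (len(a_digits) + len(b_digits))
    digit_mults = 0
    for i, a in enumerate(a_digits):
        carry = 0
        for j, b in enumerate(b_digits):
            total = result[i + j] + a * b + carry
            result[i + j] = total % 10
            carry = total // 10
            digit_mults += 1
        result[i + len(b_digits)] += carry  # final carry of this row
    return result, digit_mults

product_digits, mult_count = naive_multiply(to_digits(1234), to_digits(5678))
```

Every digit of one input is multiplied by every digit of the other, so two N-digit inputs always cost exactly N × N digit multiplications; here, two 4-digit inputs give a count of 16.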

Talking about Cost
Cost of solving a problem: cost of the least expensive algorithm; very rarely can get a tight bound
Multiplication problem for two N-digit integers has running time cost in $O(N^2)$.
Proof: naive multiplication solves it and has running time in $\Theta(N^2)$.
Multiplication problem for two N-digit integers has running time cost in $\Omega(N)$.
Proof: changing the value of any digit can change the output, so at least need to look at all input digits (in worst case).
The $O(N^2)$ upper bound is not tight: faster algorithms are known (Fürer 2007).
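Fürer's algorithm is too involved for a short example. As a simpler illustration that the quadratic upper bound can be beaten, the sketch below uses Karatsuba's algorithm, which is not from the slides and is swapped in here only because it is short. It replaces four half-size multiplications with three, which gives a running time of roughly $N^{1.585}$ digit operations. The implementation assumes non-negative integers and relies on Python's built-in arithmetic for the additions and splits.

```python
def karatsuba(x, y):
    """Multiply non-negative integers using three recursive half-size products."""
    if x < 10 or y < 10:
        return x * y  # base case: a single-digit operand
    half = max(len(str(x)), len(str(y))) // 2
    base = 10 ** half
    x_hi, x_lo = divmod(x, base)
    y_hi, y_lo = divmod(y, base)
    lo = karatsuba(x_lo, y_lo)
    hi = karatsuba(x_hi, y_hi)
    # (x_lo + x_hi)(y_lo + y_hi) - hi - lo equals x_lo*y_hi + x_hi*y_lo,
    # so the cross terms cost one multiplication instead of two.
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi) - hi - lo
    return hi * base * base + cross * base + lo

product = karatsuba(1234, 5678)
```

The existence of any algorithm faster than the naive one only lowers the upper bound on the problem; it does not change the $\Omega(N)$ lower bound, so a gap between the two remains.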

Charge
Concrete Costs: matter in practice
  size of dataset
  cost of computing resources (where, when, etc.)
Asymptotic Costs: important for understanding
  based on abstract models of computing
  predicting costs as data scales
Project 2: will be posted by tomorrow