
Class 4: Cost of Empirical Risk Minimization

David Evans
January 24, 2019

https://uvammm.github.io/class4

Markets, Mechanisms, and Machines
University of Virginia
cs4501/econ4559 Spring 2019
David Evans and Denis Nekipelov
https://uvammm.github.io/


Transcript

  1. MARKETS, MECHANISMS, MACHINES
    University of Virginia, cs4501/econ4559, Spring 2019
    Class 4: Cost of Empirical Risk Minimization
    24 January 2019
    David Evans and Denis Nekipelov
    https://uvammm.github.io
  2. Plan
    Recap: Risk Minimization
    Empirical Risk Minimization
    How hard is it to solve ERM?
      “Hard Problems”
      “Solving” Intractable Problems
      Special Cases
  3. Recap: Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y). Given a set of possible functions, ℋ,
    choose the hypothesis function h* ∈ ℋ that minimizes Risk:
      R(h) = E[L(y, h(x))] = ∫ L(y, h(x)) dP(x, y)
      h* = argmin_{h ∈ ℋ} R(h)
    Vapnik’s notation: choose α ∈ Λ that minimizes R(α) = ∫ L(y, f(x, α)) dP(x, y)
  4. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
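To make the definition on this slide concrete, here is a minimal sketch of computing the empirical risk (not from the deck; the function and variable names are illustrative, assuming NumPy arrays):

```python
import numpy as np

def empirical_risk(h, loss, X, y):
    """R_emp(h) = (1/n) * sum over i of L(y_i, h(x_i))."""
    predictions = np.array([h(x) for x in X])
    return np.mean(loss(y, predictions))

# Squared loss with a linear hypothesis h(x) = w . x:
squared_loss = lambda y, y_hat: (y_hat - y) ** 2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])
print(empirical_risk(lambda x: w @ x, squared_loss, X, y))  # 0.0
```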
  5. Empirical Risk Minimization
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
    Choosing the set ℋ:
  6. Neural Information Processing Systems, 1991

  7. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    How expensive is it to find h*?
  8. Special Case: Ordinary Least Squares
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    Squared Loss: L(y, ŷ) = (ŷ − y)²
    Set of functions: ℋ = {wᵀx + b | w ∈ ℝᵈ, b ∈ ℝ}
  9. Special Case: Ordinary Least Squares
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    Squared Loss: L(y, ŷ) = (ŷ − y)²
    Set of functions: ℋ = {wᵀx + b | w ∈ ℝᵈ, b ∈ ℝ}
  10. Model Evaluation: Simple Linear Regression
    f(x) = β₀ + β₁x
  11. Example ERM
    Set of Functions (ℋ)   Loss (L)      Name
    h(x) = wᵀx             (h(x) − y)²   Ordinary Least Squares
    where ℋ = {h | h(x) = wᵀx, x ∈ ℝᵈ, w ∈ ℝᵈ}
  12. Restricting through Regularization
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    Instead of explicitly defining ℋ, use a regularizer added to the loss to
    constrain the solution space:
      h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
  13. Popular Regularizers
    ℋ = {h | h(x) = wᵀx, x ∈ ℝᵈ, w ∈ ℝᵈ}
    h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
    ℓ₁ regularizer: Reg(h) = ‖w‖₁ = Σᵢ |wᵢ|
    ℓ₂ regularizer: Reg(h) = ‖w‖₂² = Σᵢ wᵢ² = wᵀw
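Both regularizers are one-liners in code; a small sketch for reference (illustrative names, assuming NumPy):

```python
import numpy as np

def l1_reg(w):
    # ||w||_1 = sum of |w_i|; tends to drive weights exactly to zero
    return np.sum(np.abs(w))

def l2_reg(w):
    # ||w||_2^2 = sum of w_i^2 = w^T w; penalizes large weights smoothly
    return w @ w

w = np.array([3.0, -4.0])
print(l1_reg(w), l2_reg(w))  # 7.0 25.0
```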
  14. Some Specific ERMs
    Set of Functions (ℋ)   Regularizer   Loss (L)      Name
    h(x) = wᵀx             (none)        (h(x) − y)²   Ordinary Least Squares
    h(x) = wᵀx             ‖w‖₂²         (h(x) − y)²   Ridge Regression
    where ‖w‖₂² = wᵀw
  15. Some Specific ERMs
    Set of Functions (ℋ)   Regularizer    Loss (L)                Name
    h(x) = wᵀx             (none)         (h(x) − y)²             Ordinary Least Squares
    h(x) = wᵀx             ‖w‖₂²          (h(x) − y)²             Ridge Regression
    h(x) = wᵀx + b         often ℓ₁, ℓ₂   log(1 + exp(−y·h(x)))   Logistic Regression
    …                      …              …                       …
    Science (and Art) of ERM is picking ℋ, Reg, and L
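Ridge regression also admits a closed form, w* = (XᵀX + λI)⁻¹ Xᵀy (conventions differ on whether λ is scaled by n; this sketch assumes the unscaled form), and the logistic loss in the table assumes labels y ∈ {−1, +1}. The names below are illustrative, not from the deck:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """argmin_w ||Xw - y||^2 + lam * ||w||_2^2  =>  w = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def logistic_loss(y, score):
    """log(1 + exp(-y * score)) for y in {-1, +1}; log1p keeps accuracy
    when the exponential is small."""
    return np.log1p(np.exp(-y * score))

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
print(ridge_fit(X, y, lam=0.1))  # approximately [2.55]
```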
  16. How to choose?
    Science (and Art) of ERM is picking ℋ, Reg, and L
  17. Cost of Computing

  18. Class Background
    cs111x: 34 (out of 38 total)
    cs2102: 28
    cs2150: 27
    cs3102: 11
    cs4102: 17
    Some Machine Learning Course: 6
  19. Cost Questions
    Cost of Training:    h* = argmin_{h ∈ ℋ} R_emp(h) + λ Reg(h)
    Cost of Prediction:  ŷ = h*(x)
  20. Cost of Prediction
    f(x) = β₀ + β₁x
    How expensive is it to compute prediction f(x)?
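For this model, the answer is that prediction cost is constant and, notably, independent of training-set size; a trivial sketch (for a d-dimensional linear model it becomes one dot product, O(d)):

```python
def predict(beta0, beta1, x):
    # One multiply and one add per prediction: O(1),
    # no matter how many examples were used in training.
    return beta0 + beta1 * x

print(predict(1.0, 2.0, 3.0))  # 7.0
```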
  21. How to measure cost?

  22. https://aws.amazon.com/ec2/spot/pricing/

  23. https://aws.amazon.com/ec2/spot/pricing/

  24. https://aws.amazon.com/ec2/spot/pricing/

  25. https://aws.amazon.com/ec2/spot/pricing/

  26. None

  27. None
  28. None
  29. None
  30. Bandwidth is expensive!

  31. Empirical Risk Minimization
    Nature provides inputs x with labels y provided by supervisor, drawn
    randomly from distribution P(x, y):
      Training data = (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
    Given a set of possible functions, ℋ, choose the hypothesis function
    h* ∈ ℋ that minimizes Empirical Risk:
      R_emp(h) = (1/n) Σᵢ L(yᵢ, h(xᵢ))
      h* = argmin_{h ∈ ℋ} R_emp(h)
    How expensive is it to find h*?
  32. Alan Turing, 1912–1954. How should we model a computer?

  33. Cray-1 (1976), Pebble (2014), Apple II (1977), Palm Pre (2009),
    MacBook Air (2008), Surface (2016)
  34. Colossus (1944), Apollo Guidance Computer (1969), Honeywell Kitchen
    Computer (1969) ($10,600 “complete with two-week programming course”)
  35. What Computers was Turing modelling?

  36. None
  37. Modeling “Scratch Paper” (and Input and Output)

  38. Two-Dimensional Paper = One-Dimensional Tape

  39. Modeling Pencil and Paper
    (tape illustration: … # C S S A 7 2 3 …)
    How long should the tape be?
  40. Modeling Processing

  41. Modeling Processing

  42. Modeling Processing
    Look at the current state of the computation
    Follow simple rules about what to do next
    Scratch paper to keep track
  43. Modeling Processing (Brains)
    Follow simple rules
    Remember what you are doing
    “For the present I shall only say that the justification lies in the fact
    that the human memory is necessarily limited.” Alan Turing
  44. Modelling Processing
    M = (Q, δ ⊆ Q × Γ → Q, q₀ ∈ Q)
    Q is a finite set; Γ is a finite set of symbols that can be written in memory
    (tape illustration: … # C S S A 7 2 3 …)
  45. Turing’s Model
    TM = (Q,                        a finite set of (head) states
          δ ⊆ Q × Γ → Q × Γ × Dir,  transition function
          q₀ ∈ Q,                   start state
          Q_accept ⊆ Q,             accepting states)
    Q is a finite set; Γ is a finite set of symbols that can be written in memory
    Dir = {Left, Right, Halt}
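A minimal simulator for this model (not from the deck; encoding δ as a Python dict and the direction letters 'L', 'R', 'H' are illustrative choices):

```python
def run_tm(delta, q0, accept, tape, blank="_", max_steps=10_000):
    """Simulate TM = (Q, delta, q0, Q_accept): delta maps (state, symbol)
    to (state, symbol, direction), with direction in {'L', 'R', 'H'}."""
    cells = dict(enumerate(tape))   # sparse one-dimensional tape
    q, head = q0, 0
    for _ in range(max_steps):
        q, cells[head], move = delta[(q, cells.get(head, blank))]
        if move == "H":
            return q in accept      # accept iff halting state is accepting
        head += 1 if move == "R" else -1
    raise RuntimeError("no halt within max_steps")

# A machine that accepts iff the tape starts with '1':
delta = {("s", "1"): ("yes", "1", "H"),
         ("s", "0"): ("no", "0", "H"),
         ("s", "_"): ("no", "_", "H")}
print(run_tm(delta, "s", {"yes"}, "101"))  # True
```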
  46. None

  47. TM Computation
    Definition. A language (in most Computer Science uses) is a (possibly
    infinite) set of finite strings.
    Σ = alphabet, a finite set of symbols
    L ⊆ Σ*
  48. Cost of Prediction
    f(x) = β₀ + β₁x
    Can we define the prediction problem as a language?
  49. Addition Language
    A_ADD = { a₁a₂…aₙ + b₁b₂…bₙ = c₁c₂…cₙcₙ₊₁ | aᵢ, bᵢ, cᵢ ∈ {0, 1}, c = a + b }
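Deciding membership in this language is cheap; here is a sketch of a checker (assuming most-significant-bit-first binary with literal ' + ' and ' = ' separators; the slide's exact string encoding may differ):

```python
def in_add_language(s):
    """Check 'a + b = c' with a, b n-bit binary strings, c an (n+1)-bit
    binary string, and c = a + b as integers."""
    try:
        lhs, c = s.split(" = ")
        a, b = lhs.split(" + ")
    except ValueError:
        return False                      # wrong overall shape
    if not all(x and set(x) <= {"0", "1"} for x in (a, b, c)):
        return False                      # non-binary symbols
    if len(a) != len(b) or len(c) != len(a) + 1:
        return False                      # lengths must match the definition
    return int(a, 2) + int(b, 2) == int(c, 2)

print(in_add_language("11 + 01 = 100"))   # True: 3 + 1 = 4
```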
  50. Cost of Prediction
    f(x) = β₀ + β₁x
    Can we define the prediction problem as a language?
  51. Cost of Multiplication

  52. Most recent improvement: 2007

  53. Most recent improvement: 2007

  54. Talking about Cost
    Cost of an algorithm: (nearly) always can get a tight bound.
    Naive multiplication algorithm for two N-digit integers has running time
    cost in Θ(N²).
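The naive (grade-school) algorithm referenced here, sketched over decimal digit lists; the two nested loops make the Θ(N²) digit-operation count explicit (illustrative code, not from the deck):

```python
def naive_multiply(a, b):
    """Grade-school multiplication of digit lists (least significant digit
    first). Two nested loops over N digits each: Theta(N^2) operations."""
    result = [0] * (len(a) + len(b))
    for i, da in enumerate(a):
        carry = 0
        for j, db in enumerate(b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry
    return result

# 12 * 34 = 408; digits given least significant first:
print(naive_multiply([2, 1], [4, 3]))  # [8, 0, 4, 0]
```

Faster algorithms (Karatsuba's divide and conquer, and the Fürer 2007 result the deck cites) show that this algorithm's Θ(N²) bound is not a tight bound on the problem.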
  55. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    O(N²). Proof:
  56. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    O(N²). Proof: naive multiplication solves it and has running time in Θ(N²).
  57. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    Ω(N). Proof: changing the value of any digit can change the output, so we
    at least need to look at all input digits (in the worst case).
  58. Talking about Cost
    Cost of solving a problem: cost of the least expensive algorithm;
    very rarely can get a tight bound.
    Multiplication problem for two N-digit integers has running time cost in
    O(N²). Proof: naive multiplication solves it and has running time in Θ(N²).
    Fürer 2007 (the most recent improvement to the upper bound at the time)
  59. Charge
    Concrete Costs: matter in practice
      size of dataset
      cost of computing resources (where, when, etc.)
    Asymptotic Costs: important for understanding
      based on abstract models of computing
      predicting costs as data scales
    Project 2: will be posted by tomorrow