Slide 1

Slide 1 text

Markets, Mechanisms, Machines
University of Virginia, Spring 2019 (cs4501/econ4559)
Class 4: Cost of Empirical Risk Minimization
24 January 2019
David Evans and Denis Nekipelov
https://uvammm.github.io

Slide 2

Slide 2 text

Plan
- Recap: Risk Minimization
- Empirical Risk Minimization
- How hard is it to solve ERM?
- “Hard Problems”
- “Solving” Intractable Problems
- Special Cases

Slide 3

Slide 3 text

Recap: Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$.

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Risk:

$R(h) = E[L(y, h(x))] = \int L(y, h(x)) \, dP(x, y)$

$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R(h)$

Vapnik’s notation: choose $\alpha \in \Lambda$ that minimizes $R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y)$
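To make the risk functional concrete, here is a minimal sketch (not from the slides; all names and the toy distribution are illustrative): if we could sample freely from $P(x, y)$, we could approximate the integral $R(h)$ by a Monte Carlo average of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_P(n):
    """Stand-in for nature's distribution P(x, y): here y = 2x + noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + rng.normal(scale=0.1, size=n)
    return x, y

def approx_risk(h, n=100_000):
    """Monte Carlo approximation of R(h) = E[L(y, h(x))] with squared loss."""
    x, y = sample_from_P(n)
    return np.mean((h(x) - y) ** 2)

print(approx_risk(lambda x: 2 * x))   # near the noise floor (~0.01)
print(approx_risk(lambda x: 0 * x))   # much worse (~1.34)
```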

Slide 4

Slide 4 text

Empirical Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$:

Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$

$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$
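The empirical risk formula translates almost directly into code. A hedged sketch, with squared loss standing in for $L$ and a toy training set (both illustrative choices, not from the slides):

```python
import numpy as np

def empirical_risk(h, xs, ys, loss):
    """R_emp(h) = (1/n) * sum_i L(y_i, h(x_i))."""
    return np.mean([loss(y, h(x)) for x, y in zip(xs, ys)])

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 1.9, 4.2, 5.8])
sq_loss = lambda y, yhat: (yhat - y) ** 2
print(empirical_risk(lambda x: 2 * x, xs, ys, sq_loss))   # 0.025
```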

Slide 5

Slide 5 text

Empirical Risk Minimization

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$

Choosing the set $\mathcal{H}$:

Slide 6

Slide 6 text

Neural Information Processing Systems 1991

Slide 7

Slide 7 text

Empirical Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$:

Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$

$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$

How expensive is it to find $h^*$?

Slide 8

Slide 8 text

Special Case: Ordinary Least Squares

Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$:

Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$, $h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$

Squared Loss: $L(y, \hat{y}) = (\hat{y} - y)^2$

Set of functions: $\mathcal{H} = \{\, wx + b \mid w \in \mathbb{R}^n, b \in \mathbb{R} \,\}$
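For this particular $\mathcal{H}$ and squared loss, the empirical risk minimizer has a closed form (the normal equations), so no explicit search over $\mathcal{H}$ is needed. A sketch using numpy; the helper name `ols_fit` and the toy data are illustrative:

```python
import numpy as np

def ols_fit(X, y):
    """Minimize (1/n) * sum_i (w.x_i + b - y_i)^2 in closed form.

    Append a constant column so b is absorbed into w, then solve the
    normal equations (via least squares, for numerical stability).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef[:-1], coef[-1]   # (w, b)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.9, 4.2, 5.8])
w, b = ols_fit(X, y)
print(w, b)   # close to w = [2], b = 0
```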

Slide 9

Slide 9 text

Special Case: Ordinary Least Squares

Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$:

Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$, $h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$

Squared Loss: $L(y, \hat{y}) = (\hat{y} - y)^2$

Set of functions: $\mathcal{H} = \{\, wx + b \mid w \in \mathbb{R}^n, b \in \mathbb{R} \,\}$

Slide 10

Slide 10 text

Model Evaluation: Simple Linear Regression

$f(x) = \beta_0 + \beta_1 x$

Slide 11

Slide 11 text

Example ERM

$\mathcal{H} = \{\, h \mid h(x) = w^T x,\ w \in \mathbb{R}^n \,\}$

Set of Functions ($\mathcal{H}$) | Loss ($L$) | Name
$h(x) = w^T x$ | $(h(x) - y)^2$ | Ordinary Least Squares

Slide 12

Slide 12 text

Restricting through Regularization

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$, $h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$

Instead of explicitly defining $\mathcal{H}$, use a regularizer added to the loss to constrain the solution space:

$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) + \lambda \, \text{Reg}(h)$
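A sketch of the regularized objective for a linear hypothesis $h_w(x) = w \cdot x$ (function and variable names are illustrative, not from the slides): the optimizer would minimize this over $w$ instead of searching an explicitly restricted $\mathcal{H}$.

```python
import numpy as np

def regularized_objective(w, X, y, loss, reg, lam):
    """R_emp(h_w) + lambda * Reg(h_w) for the linear hypothesis h_w(x) = w.x."""
    preds = X @ w
    return np.mean(loss(y, preds)) + lam * reg(w)

# Example: squared loss plus an l2 regularizer (the ridge objective).
sq_loss = lambda y, yhat: (yhat - y) ** 2
l2 = lambda w: w @ w
X, y = np.array([[0.0], [1.0], [2.0]]), np.array([0.1, 2.1, 3.9])
print(regularized_objective(np.array([2.0]), X, y, sq_loss, l2, lam=0.1))  # 0.41
```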

Slide 13

Slide 13 text

Popular Regularizers

$\mathcal{H} = \{\, h \mid h(x) = w^T x,\ w \in \mathbb{R}^n \,\}$, $h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) + \lambda \, \text{Reg}(h)$

$\ell_1$ regularizer: $\text{Reg}(h) = \lVert w \rVert_1 = \sum_{i=1}^{n} |w_i|$

$\ell_2$ regularizer: $\text{Reg}(h) = \lVert w \rVert_2^2 = \sum_{i=1}^{n} w_i^2 = w^T w$

Slide 14

Slide 14 text

Some Specific ERMs

Set of Functions ($\mathcal{H}$) | Regularizer | Loss ($L$) | Name
$h(x) = w^T x$ | (none) | $(h(x) - y)^2$ | Ordinary Least Squares
$h(x) = w^T x$ | $\lVert w \rVert_2^2$ | $(h(x) - y)^2$ | Ridge Regression

$\lVert w \rVert_2^2 = w^T w$
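Ridge regression, like OLS, has a closed-form minimizer: $w = (X^T X + n\lambda I)^{-1} X^T y$ (the factor $n$ matches the $1/n$ in $R_{emp}$). A sketch, assuming no intercept term (an intercept would typically be left unregularized); names and data are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """argmin_w (1/n)*||Xw - y||^2 + lam*||w||_2^2 via the closed form
    w = (X^T X + n*lam*I)^(-1) X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.9, 4.2, 5.8])
print(ridge_fit(X, y, lam=0.0))   # lam=0 gives the unregularized least-squares fit
print(ridge_fit(X, y, lam=1.0))   # larger lam shrinks w toward zero
```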

Slide 15

Slide 15 text

Some Specific ERMs

Set of Functions ($\mathcal{H}$) | Regularizer | Loss ($L$) | Name
$h(x) = w^T x$ | (none) | $(h(x) - y)^2$ | Ordinary Least Squares
$h(x) = w^T x$ | $\lVert w \rVert_2^2$ | $(h(x) - y)^2$ | Ridge Regression
$h(x) = w^T x + b$ | often $\ell_1$, $\ell_2$ | $\log(1 + e^{-y\,h(x)})$ | Logistic Regression
… | … | … | …

Science (and Art) of ERM is picking $\mathcal{H}$, Reg, and $L$
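The logistic loss in the table can be evaluated directly. A sketch assuming the label convention $y \in \{-1, +1\}$, under which $\log(1 + e^{-y\,h(x)})$ is the per-example loss:

```python
import numpy as np

def logistic_loss(y, score):
    """log(1 + exp(-y * h(x))) for labels y in {-1, +1};
    np.log1p(z) computes log(1 + z) accurately for small z."""
    return np.log1p(np.exp(-y * score))

print(logistic_loss(+1, 3.0))    # confident, correct:   ~0.049
print(logistic_loss(+1, -3.0))   # confident, wrong:     ~3.049
```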

Slide 16

Slide 16 text

How to choose?

Science (and Art) of ERM is picking $\mathcal{H}$, Reg, and $L$

Slide 17

Slide 17 text

Cost of Computing

Slide 18

Slide 18 text

Class Background
- cs111x: 34 (out of 38 total)
- cs2102: 28
- cs2150: 27
- cs3102: 11
- cs4102: 17
- Some Machine Learning Course: 6

Slide 19

Slide 19 text

Cost Questions

Cost of Training: $h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h) + \lambda \, \text{Reg}(h)$

Cost of Prediction: $\hat{y} = h^*(x)$

Slide 20

Slide 20 text

Cost of Prediction

$f(x) = \beta_0 + \beta_1 x$

How expensive is it to compute prediction $f(x)$?
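For a linear model the answer is concrete: one multiply and one add per feature, i.e. $O(d)$ time for $d$ features (constant time for the one-feature model above). A sketch; names are illustrative:

```python
def predict(beta0, beta, x):
    """Linear prediction f(x) = beta0 + sum_j beta[j] * x[j]:
    exactly d multiplications and d additions, so O(d) time."""
    total = beta0
    for bj, xj in zip(beta, x):
        total += bj * xj
    return total

print(predict(0.5, [2.0], [3.0]))   # 0.5 + 2*3 = 6.5
```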

Slide 21

Slide 21 text

How to measure cost?

Slide 22

Slide 22 text

https://aws.amazon.com/ec2/spot/pricing/

Slide 23

Slide 23 text

https://aws.amazon.com/ec2/spot/pricing/

Slide 24

Slide 24 text

https://aws.amazon.com/ec2/spot/pricing/

Slide 25

Slide 25 text

https://aws.amazon.com/ec2/spot/pricing/

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Bandwidth is expensive!

Slide 31

Slide 31 text

Empirical Risk Minimization

Nature provides inputs $x$ with labels $y$ provided by a supervisor, drawn randomly from distribution $P(x, y)$:

Training data $= (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Given a set of possible functions, $\mathcal{H}$, choose the hypothesis function $h^* \in \mathcal{H}$ that minimizes Empirical Risk:

$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$

$h^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{emp}(h)$

How expensive is it to find $h^*$?

Slide 32

Slide 32 text

Alan Turing, 1912-1954

How should we model a computer?

Slide 33

Slide 33 text

Cray-1 (1976) Pebble (2014) Apple II (1977) Palm Pre (2009) MacBook Air (2008) Surface (2016)

Slide 34

Slide 34 text

Colossus (1944) Apollo Guidance Computer (1969) Honeywell Kitchen Computer (1969) ($10,600 “complete with two-week programming course”)

Slide 35

Slide 35 text

What computers was Turing modelling?

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Modeling “Scratch Paper” (and Input and Output)

Slide 38

Slide 38 text

Two-Dimensional Paper = One-Dimensional Tape

Slide 39

Slide 39 text

Modeling Pencil and Paper

Tape: ... # C S S A 7 2 3 ...

How long should the tape be?

Slide 40

Slide 40 text

Modeling Processing

Slide 41

Slide 41 text

Modeling Processing

Slide 42

Slide 42 text

Modeling Processing
- Look at the current state of the computation
- Follow simple rules about what to do next
- Scratch paper to keep track

Slide 43

Slide 43 text

Modeling Processing (Brains)
- Follow simple rules
- Remember what you are doing

“For the present I shall only say that the justification lies in the fact that the human memory is necessarily limited.” (Alan Turing)

Slide 44

Slide 44 text

Modeling Processing

$TM = (\, Q,\ \delta \subseteq Q \times \Gamma \to Q,\ q_0 \in Q \,)$

$Q$ is a finite set; $\Gamma$ is a finite set of symbols that can be written in memory.

Tape: ... # C S S A 7 2 3 ...

Slide 45

Slide 45 text

Turing’s Model

$TM = (\ Q$, a finite set of (head) states;
$\delta \subseteq Q \times \Gamma \to Q \times \Gamma \times \mathit{Dir}$, transition function;
$q_0 \in Q$, start state;
$Q_{accept} \subseteq Q$, accepting states $)$

$Q$ is a finite set; $\Gamma$ is a finite set of symbols that can be written in memory.

$\mathit{Dir} = \{\text{Left}, \text{Right}, \text{Halt}\}$
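The tuple above translates into a very small simulator. A sketch, not Turing’s original presentation: 'L', 'R', 'H' encode {Left, Right, Halt}, and the example machine (illustrative, not from the slides) accepts binary strings with an even number of 1s.

```python
def run_tm(delta, start, accepting, tape):
    """Simulate TM = (Q, delta: Q x Gamma -> Q x Gamma x Dir, q0, Q_accept)
    on a tape given as a string of symbols; '_' is the blank symbol."""
    tape = list(tape) + ['_']
    state, head = start, 0
    while True:
        state, tape[head], move = delta[(state, tape[head])]
        if move == 'H':
            return state in accepting
        head += 1 if move == 'R' else -1
        if head == len(tape):          # extend the tape on demand
            tape.append('_')

# Example machine: scan right, flip state on each 1, halt at the blank.
delta = {
    ('even', '0'): ('even', '0', 'R'),
    ('even', '1'): ('odd',  '1', 'R'),
    ('odd',  '0'): ('odd',  '0', 'R'),
    ('odd',  '1'): ('even', '1', 'R'),
    ('even', '_'): ('even', '_', 'H'),
    ('odd',  '_'): ('odd',  '_', 'H'),
}
print(run_tm(delta, 'even', {'even'}, '1011'))  # False: three 1s
print(run_tm(delta, 'even', {'even'}, '1001'))  # True: two 1s
```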

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

TM Computation

Definition. A language (in most Computer Science uses) is a (possibly infinite) set of finite strings.

$\Sigma$ = alphabet, a finite set of symbols; $L \subseteq \Sigma^*$

Slide 48

Slide 48 text

Cost of Prediction

$f(x) = \beta_0 + \beta_1 x$

Can we define the prediction problem as a language?

Slide 49

Slide 49 text

Addition Language

$L_{ADD} = \{\, a_1 a_2 \ldots a_n + b_1 b_2 \ldots b_n = c_1 c_2 \ldots c_n c_{n+1} \mid a_i, b_i, c_i \in \{0, 1\},\ c = a + b \,\}$
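A membership test for $L_{ADD}$ makes the definition concrete. A sketch assuming the bit strings are written most-significant-bit first (the slide does not pin down the ordering); the function name is illustrative:

```python
def in_L_ADD(s):
    """Check membership in L_ADD: a and b are n-bit strings, c has n+1 bits,
    and c = a + b as binary numbers (MSB-first encoding assumed)."""
    try:
        left, c = s.split('=')
        a, b = left.split('+')
    except ValueError:           # wrong number of '+' or '=' symbols
        return False
    if len(a) != len(b) or len(c) != len(a) + 1:
        return False
    if not all(ch in '01' for ch in a + b + c):
        return False
    return int(a, 2) + int(b, 2) == int(c, 2)

print(in_L_ADD('01+01=010'))   # True:  1 + 1 = 2
print(in_L_ADD('11+01=100'))   # True:  3 + 1 = 4
print(in_L_ADD('11+01=001'))   # False
```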

Slide 50

Slide 50 text

Cost of Prediction

$f(x) = \beta_0 + \beta_1 x$

Can we define the prediction problem as a language?

Slide 51

Slide 51 text

Cost of Multiplication

Slide 52

Slide 52 text

Most recent improvement: 2007

Slide 53

Slide 53 text

Most recent improvement: 2007

Slide 54

Slide 54 text

Talking about Cost

Cost of an algorithm: (nearly) always can get a tight bound.

The naive multiplication algorithm for two N-digit integers has running time cost in $\Theta(N^2)$.
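To see where $\Theta(N^2)$ comes from: grade-school multiplication performs one single-digit multiply for each of the $N \times N$ digit pairs. A sketch on base-10 digit lists (least-significant digit first, an illustrative convention):

```python
def naive_multiply(a, b):
    """Grade-school multiplication of two numbers given as digit lists
    (least-significant digit first). The nested loop does N*N
    single-digit multiplies, hence Theta(N^2) running time."""
    result = [0] * (len(a) + len(b))
    for i, da in enumerate(a):
        carry = 0
        for j, db in enumerate(b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry
    return result

# 12 * 34 = 408; digits are least-significant first.
print(naive_multiply([2, 1], [4, 3]))   # [8, 0, 4, 0]
```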

Slide 55

Slide 55 text

Talking about Cost

Cost of solving a problem: the cost of the least expensive algorithm; very rarely can we get a tight bound.

The multiplication problem for two N-digit integers has running time cost in $O(N^2)$. Proof:

Slide 56

Slide 56 text

Talking about Cost

Cost of solving a problem: the cost of the least expensive algorithm; very rarely can we get a tight bound.

The multiplication problem for two N-digit integers has running time cost in $O(N^2)$. Proof: naive multiplication solves it and has running time in $\Theta(N^2)$.

Slide 57

Slide 57 text

Talking about Cost

Cost of solving a problem: the cost of the least expensive algorithm; very rarely can we get a tight bound.

The multiplication problem for two N-digit integers has running time cost in $\Omega(N)$. Proof: changing the value of any digit can change the output, so we at least need to look at all input digits (in the worst case).

Slide 58

Slide 58 text

Talking about Cost

Cost of solving a problem: the cost of the least expensive algorithm; very rarely can we get a tight bound.

The multiplication problem for two N-digit integers has running time cost in $O(N^2)$. Proof: naive multiplication solves it and has running time in $\Theta(N^2)$.

Fürer 2007

Slide 59

Slide 59 text

Charge

Concrete costs: matter in practice
- size of dataset
- cost of computing resources (where, when, etc.)

Asymptotic costs: important for understanding
- based on abstract models of computing
- predicting costs as data scales

Project 2: will be posted by tomorrow