Slide 1

Slide 1 text

The Elements of Statistical Learning §3.4 Regularization (Shrinkage Methods)
Lecturer: Hayato Maki, Augmented Human Communication Laboratory
Nara Institute of Science and Technology
2017/05/31

Slide 2

Slide 2 text

Agenda: How to restrict a model? • Select features • Regularization

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Ridge Regression • Penalizing by the sum of squares of the parameters: the criterion is the error between a label and an estimate plus the sum of squared parameters, weighted by the regularization parameter. • Equivalent form: minimize the error subject to a bound on the sum of squared parameters.
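For reference, the two forms described above can be written out explicitly. This is a reconstruction in the notation of ESL §3.4 (equations (3.41) and (3.42)); none of the symbols below survive on the slide as extracted.

    \hat{\beta}^{\mathrm{ridge}}
      = \arg\min_{\beta}\Big\{\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
        + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}
    % Equivalent constrained form:
    \hat{\beta}^{\mathrm{ridge}}
      = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
      \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2 \le t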

Slide 5

Slide 5 text

Effect on Correlated Data • When X1 and X2 are correlated, their parameter values can become very large and cancel each other out; regularization suppresses this (a sketch of the effect follows below).
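A minimal numerical sketch of this effect (my own illustration, not from the slides; it assumes NumPy and scikit-learn and uses made-up data):

    # Two strongly correlated inputs: OLS coefficients can blow up with opposite
    # signs and cancel each other, while ridge keeps both of them moderate.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + 1e-3 * rng.normal(size=200)        # x2 is almost a copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(scale=0.1, size=200)

    print(LinearRegression().fit(X, y).coef_)    # large, opposite-sign coefficients
    print(Ridge(alpha=1.0).fit(X, y).coef_)      # both coefficients stay small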

Slide 6

Slide 6 text

Don’t Regularize the Intercept • Notice that the intercept parameter is NOT involved in the regularization: if a constant c is added to all the target data, the fitted intercept should simply shift by c (see the note below).
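A short note on why, assuming the ridge criterion (3.41) with the penalty applied only to β_1, …, β_p (this reasoning is my reconstruction of the slide's "add c" argument):

    % If every target y_i is replaced by y_i + c, minimizing over the unpenalized
    % intercept absorbs the shift, and the penalized coefficients are unchanged:
    \hat{\beta}_0 \;\mapsto\; \hat{\beta}_0 + c,
    \qquad
    \hat{\beta}_j \;\mapsto\; \hat{\beta}_j \quad (j=1,\dots,p)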

Slide 7

Slide 7 text

Scaling and Intercept Estimation • The solution of ridge regression depends on the scaling of the inputs → standardize the inputs before solving. • If the inputs are centered (subtract their means), the intercept can be estimated separately and the remaining parameters can be estimated by ridge regression without an intercept (a sketch of this procedure follows).
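A minimal sketch of that procedure (my own code, not the authors'; it assumes NumPy, standardizes the columns, estimates the intercept as the mean of y, and solves ridge without an intercept):

    import numpy as np

    def ridge_standardized(X, y, lam):
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each input column
        yc = y - y.mean()                           # center the response
        p = Xs.shape[1]
        beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
        return y.mean(), beta                       # (intercept, coefficients)

    # Example call on made-up data:
    # b0, beta = ridge_standardized(np.random.randn(50, 3), np.random.randn(50), 1.0)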

Slide 8

Slide 8 text

Matrix Representation • Loss function: the residual sum of squares plus the regularization term. • Solution: compare it to the least-squares solution; the regularization adds a positive constant to the diagonal components, which makes the product X^T X full rank (invertible) even when it would otherwise be singular.
Recall the regularization effect on correlated data.
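The loss function and solution referred to above, reconstructed in ESL's matrix notation (equations (3.43) and (3.44); the symbols are assumptions on my part, not visible on the slide):

    L(\beta) = (\mathbf{y}-\mathbf{X}\beta)^{\top}(\mathbf{y}-\mathbf{X}\beta) + \lambda\,\beta^{\top}\beta
    \hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{y}
    % Compare with the least-squares solution:
    \hat{\beta}^{\mathrm{ls}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}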

Slide 9

Slide 9 text

Example: Prostate Cancer Data (Fig.) • Ridge coefficient profiles are plotted against the effective degrees of freedom, which are defined through the singular values of X. • Moving toward smaller degrees of freedom means stronger regularization; the unregularized end fits the training data best. • The amount of regularization actually used is chosen by cross-validation.
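The effective degrees of freedom mentioned above can be written in terms of the singular values d_j of X; this is ESL equation (3.50), reproduced here because the formula itself did not survive extraction:

    \mathrm{df}(\lambda)
      = \operatorname{tr}\big[\mathbf{X}(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\big]
      = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2+\lambda}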

Slide 10

Slide 10 text

Ridge Solution as Posterior Mode • Assume a Gaussian likelihood for the targets and an independent Gaussian prior on each parameter. • Then the log of the posterior density of β is the same as criterion (3.41), with λ set to the ratio of the two variances. • The ridge solution is therefore the mode of the posterior (in the Gaussian case, the mode equals the mean).
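The distributional assumptions and the value of λ, reconstructed from ESL §3.4.1 (the symbols σ² and τ² are the book's notation, not the slide's):

    y_i \mid x_i,\beta \sim \mathcal{N}\big(\beta_0 + x_i^{\top}\beta,\;\sigma^2\big),
    \qquad
    \beta_j \sim \mathcal{N}(0,\tau^2)\ \text{independently}
    % The negative log posterior is, up to constants, the ridge criterion (3.41) with
    \lambda = \sigma^2/\tau^2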

Slide 11

Slide 11 text

View from SVD • Singular value decomposition of the data matrix X, with singular values d_1 ≥ d_2 ≥ … ≥ d_p ≥ 0. • The least-squares and ridge fits can both be written in terms of this decomposition (see below). • The smaller the singular value, the stronger the shrinkage. What does that mean?
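The SVD expressions alluded to above, reconstructed from ESL equations (3.46) and (3.47) in the book's notation:

    \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}
    \mathbf{X}\hat{\beta}^{\mathrm{ls}} = \mathbf{U}\mathbf{U}^{\top}\mathbf{y}
    \mathbf{X}\hat{\beta}^{\mathrm{ridge}}
      = \sum_{j=1}^{p}\mathbf{u}_j\,\frac{d_j^2}{d_j^2+\lambda}\,\mathbf{u}_j^{\top}\mathbf{y}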

Slide 12

Slide 12 text

View from SVD • Sample covariance matrix: S = X^T X / N. • The eigenvector corresponding to the largest eigenvalue of X^T X is called the first eigenvector (the first principal component direction). • The linear combination of the columns of X along this direction has the largest sample variance among all normalized linear combinations of X; that variance is d_1² / N.
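In symbols (reconstructed from ESL equations (3.48) and (3.49); v_1, u_1, and d_1 follow the book's notation):

    \mathbf{S} = \mathbf{X}^{\top}\mathbf{X}/N,
    \qquad
    \mathbf{X}^{\top}\mathbf{X}\,v_1 = d_1^2\,v_1
    z_1 = \mathbf{X}v_1 = \mathbf{u}_1 d_1,
    \qquad
    \operatorname{Var}(z_1) = d_1^2/N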

Slide 13

Slide 13 text

View from SVD • The greater the singular value, the less shrinkage is applied. • Implicit assumption: the response tends to vary most in the directions of high variance of the inputs (Fig.).

Slide 14

Slide 14 text

Lasso • Penalizing by the sum of the absolute values of the parameters, subject to a bound on that sum; compare this to ridge regression. • Equivalent (Lagrangian) form, written out below.
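The constrained and Lagrangian forms referred to above, reconstructed from ESL equations (3.51) and (3.52) in the book's notation:

    \hat{\beta}^{\mathrm{lasso}}
      = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
      \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le t
    % Equivalent Lagrangian form:
    \hat{\beta}^{\mathrm{lasso}}
      = \arg\min_{\beta}\Big\{\tfrac{1}{2}\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
        + \lambda\sum_{j=1}^{p}|\beta_j|\Big\}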

Slide 15

Slide 15 text

Properties of the Lasso • The lasso has no closed-form solution; computing it is a quadratic programming problem. • Some of the parameters become exactly zero. • A smaller shrinkage parameter means stronger regularization (a sketch of the sparsity effect follows).
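An illustrative sketch of the sparsity property (not from the slides; it assumes scikit-learn and uses made-up data in which only two of eight inputs matter):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 8))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)   # only 2 relevant inputs

    print(Lasso(alpha=0.5).fit(X, y).coef_)   # several entries are exactly 0
    print(Ridge(alpha=0.5).fit(X, y).coef_)   # all entries are nonzero but shrunk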

Slide 16

Slide 16 text

Comparison of Subset Selection, Ridge, and Lasso (Fig.) • Best subset: drop all parameters smaller than the Mth largest. • Ridge: proportional shrinkage. • Lasso: subtract a constant, truncating at zero.
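Written as transformations of the least-squares estimate β̂_j (this is the orthonormal-input case summarized in ESL Table 3.4; the orthonormality assumption is not stated on the slide as extracted):

    % Best subset (size M): keep only the M largest coefficients
    \hat{\beta}_j \cdot \mathbf{1}\big[\,|\hat{\beta}_j| \ge |\hat{\beta}_{(M)}|\,\big]
    % Ridge: proportional shrinkage
    \hat{\beta}_j / (1+\lambda)
    % Lasso: soft thresholding (subtract, truncating at zero)
    \operatorname{sign}(\hat{\beta}_j)\,\big(|\hat{\beta}_j|-\lambda\big)_{+}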

Slide 17

Slide 17 text

Comparison of Subset Selection, Ridge, and Lasso (Fig.) • Lasso vs. ridge: with the lasso, coefficients are likely to be exactly zero because the constraint region has corners.

Slide 18

Slide 18 text

Generalization of Ridge and Lasso • q = 1 → lasso (L1 norm): the smallest value of q for which the constraint region is convex. • q = 2 → ridge (L2 norm). • q = 0 → subset selection (the penalty depends only on the number of nonzero parameters). • q can take other values (such as 3, 4, …), but empirically these are of little use.
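The family of penalized criteria this slide refers to, reconstructed from ESL equation (3.53) in the book's notation:

    \tilde{\beta}
      = \arg\min_{\beta}\Big\{\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
        + \lambda\sum_{j=1}^{p}|\beta_j|^{q}\Big\}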

Slide 19

Slide 19 text

Bayesian View • |β_j|^q can be viewed as the log prior density of β_j. • Each method can then be seen as maximizing the posterior under a different prior distribution. • A compromise between q = 1 and q = 2 is the elastic net. • It selects features like the lasso and shrinks the coefficients of correlated features together like ridge.
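For reference, the elastic-net penalty takes the following form (following ESL §3.4.3; the mixing parameter α is the book's notation and does not appear on the slide):

    \lambda\sum_{j=1}^{p}\big(\alpha\,\beta_j^2 + (1-\alpha)\,|\beta_j|\big)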