Regularization: The Elements of Statistical Learning

The Elements of Statistical Learning §3.4 Regularization (shrinkage methods)
Lecturer: Hayato Maki
Augmented Human Communication Laboratory, Nara Institute of Science and Technology
2017/05/31

Agenda
How to restrict a model?
• Select features
• Regularization

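Ridge Regression
• Penalizing by the sum of squares of parameters:
  $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$  (3.41)
  First term: error between a label and its estimate. Second term: sum of squared parameters. $\lambda \ge 0$: regularization parameter.
• Equivalent form:
  $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le t$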

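Effect on Correlated Data
• When X1 and X2 are correlated, their parameter values $\beta_1$ and $\beta_2$ can be very big: a large positive value on one can be cancelled out by a large negative value on the other.
• Regularization limits the size of the parameters, which prevents this.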
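A minimal numerical sketch of this effect, on synthetic data. The sample size, the noise levels, the seed, and the value `lam = 1.0` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed, illustrative only
n = 50
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)      # X2 is almost a copy of X1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.standard_normal(n)   # true parameters are (1, 1)

# Least squares: only beta1 + beta2 is well determined, so the
# individual values can become large with opposite signs.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge with lam = 1 (assumed value): solve (X^T X + lam I) beta = X^T y.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("least squares:", beta_ls)
print("ridge:        ", beta_ridge)
```

For any positive `lam`, the ridge parameter vector has a smaller norm than the least-squares one, so the two parameters cannot grow without bound to cancel each other.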

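Don't Regularize Intercept
• Notice that the intercept parameter $\beta_0$ is NOT involved in regularization.
• Adding a constant c to all labels $y_i$ should simply shift every prediction by c. This holds only when $\beta_0$ is left unpenalized.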

Scaling and Intercept Estimation
• The solution of ridge regression depends on input scaling → standardize the inputs before solving.
• If the inputs are centered (subtract the means), the intercept can be estimated separately as $\hat{\beta}_0 = \bar{y}$, and the other parameters can be estimated using ridge regression without an intercept.
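The two steps can be sketched as follows. The function name `ridge_with_intercept` and the toy data are hypothetical, and the returned slopes are on the standardized input scale. The last lines check that adding a constant to every label moves only the intercept.

```python
import numpy as np

def ridge_with_intercept(X, y, lam):
    """Ridge fit on standardized inputs; the intercept is not penalized."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std              # centered, unit-variance inputs
    beta0 = y.mean()                  # intercept estimated separately
    p = Z.shape[1]
    # Ridge regression without intercept on the centered data
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - beta0))
    return beta0, beta

# Toy data (made up for illustration); the columns have very different scales
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 50.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

b0, b = ridge_with_intercept(X, y, lam=1.0)
b0_shift, b_shift = ridge_with_intercept(X, y + 7.0, lam=1.0)
print(b0_shift - b0)   # the intercept moves by the added constant
print(b_shift - b)     # the slopes do not move
```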

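Matrix Representation
• Loss function: $\text{RSS}(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$ (the second term is the regularization term)
• Solution: $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$
• Compare to least squares: $\hat{\beta}^{\text{ls}} = (X^T X)^{-1} X^T y$
• Ridge adds a positive constant $\lambda$ to the diagonal components of $X^T X$. This makes $X^T X + \lambda I$ full rank (invertible) even when $X^T X$ is not. Recall the effect of regularization on correlated data.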
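A small sketch of the closed-form solution, assuming centered data with no intercept term. The helper name `ridge_closed_form` and the toy matrix are illustrative. The two columns of X are identical, so $X^T X$ is singular, yet the ridge system still has a unique solution.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y; no intercept term."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two identical columns: X^T X is singular, so least squares has no
# unique solution, but the ridge system is still solvable.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(np.linalg.matrix_rank(X.T @ X))   # rank 1, not 2
beta = ridge_closed_form(X, y, lam=1.0)
print(beta)                             # both entries equal 28/29
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; the result is the same solution as the formula.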

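Example: Prostate Cancer Data
[Figure: ridge coefficients plotted against the effective degrees of freedom. Fewer degrees of freedom means stronger regularization. The marked value gives the best performance on the training data, chosen by cross-validation.]
• Effective degrees of freedom: $\text{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}$, where $d_j$ is the j-th singular value of X.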
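The effective degrees of freedom can be computed directly from the singular values. The sketch below uses random synthetic inputs (not the prostate cancer data) and a hypothetical helper name `effective_df`.

```python
import numpy as np

def effective_df(X, lam):
    """df(lam) = sum_j d_j^2 / (d_j^2 + lam), d_j = singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(0)       # synthetic inputs, illustrative only
X = rng.standard_normal((20, 3))
X = X - X.mean(axis=0)               # centered inputs

# df equals the number of inputs at lam = 0 and decreases as lam grows
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, effective_df(X, lam))
```

The same quantity is the trace of the matrix $X (X^T X + \lambda I)^{-1} X^T$ that maps y to the ridge fit.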

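Ridge Solution as Posterior Mode
• Assume the probability density functions: $y_i \sim N(\beta_0 + x_i^T \beta, \sigma^2)$ and $\beta_j \sim N(0, \tau^2)$, independently.
• Then the negative log-posterior density of $\beta$ is the same as the ridge criterion (3.41), up to scaling and an additive constant, with $\lambda = \sigma^2 / \tau^2$.
• The ridge solution is the mode of the posterior (in the Gaussian case, the mode is equal to the mean).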
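The step from the Gaussian prior to the ridge criterion can be written out as below. This is a sketch assuming Gaussian labels with known variance $\sigma^2$, independent $N(0, \tau^2)$ priors on each $\beta_j$, and no prior on $\beta_0$.

```latex
\begin{aligned}
-\log p(\beta \mid y)
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
   + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const} \\
2\sigma^2 \cdot \big( -\log p(\beta \mid y) \big)
  &= \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
   + \frac{\sigma^2}{\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const}'
\end{aligned}
```

Minimizing the second line over $\beta$ is the ridge problem with $\lambda = \sigma^2 / \tau^2$, so the minimizer is the posterior mode.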

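View from SVD
• Singular value decomposition of the data matrix: $X = U D V^T$, with singular values $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$ on the diagonal of D.
• Least-squares and ridge fits are written as:
  $X \hat{\beta}^{\text{ls}} = U U^T y = \sum_{j=1}^{p} u_j u_j^T y$
  $X \hat{\beta}^{\text{ridge}} = \sum_{j=1}^{p} u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$
• The smaller the singular value, the stronger the regularization. What does it mean?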
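The SVD form of the ridge fit can be checked numerically against the closed-form solution. The data below are random and purely illustrative, and `lam = 5.0` is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
X = X - X.mean(axis=0)              # centered inputs, no intercept
y = rng.standard_normal(30)
lam = 5.0

# Thin SVD: X = U diag(d) V^T, with d sorted from largest to smallest
U, d, Vt = np.linalg.svd(X, full_matrices=False)

shrink = d**2 / (d**2 + lam)        # one shrinkage factor per direction u_j
fit_svd = U @ (shrink * (U.T @ y))  # sum_j u_j * shrink_j * (u_j^T y)

# Same fit from the closed-form ridge solution
beta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
fit_direct = X @ beta

print(shrink)                            # decreases along with d
print(np.allclose(fit_svd, fit_direct))  # True
```

Each coordinate of y along $u_j$ is multiplied by a factor below 1, and the factor is smallest for the direction with the smallest singular value.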

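View from SVD
• Sample covariance matrix (X centered): $S = X^T X / N = V D^2 V^T / N$
• The largest eigenvalue is $d_1^2 / N$; the corresponding eigenvector $v_1$ is called the 1st eigenvector (the first principal component direction).
• $z_1 = X v_1 = u_1 d_1$ has the largest sample variance among all normalized linear combinations of the columns of X. The variance is $d_1^2 / N$.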
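A short numerical check of these statements, on synthetic inputs whose per-column spreads (3, 1, 0.3) are an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
# Synthetic inputs with unequal spread per column (illustrative only)
X = rng.standard_normal((N, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)                   # center each column

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                               # first principal component direction
z1 = X @ v1                              # first principal component

var_z1 = z1 @ z1 / N                     # sample variance (z1 has mean zero)
S = X.T @ X / N                          # sample covariance matrix
largest_eig = np.linalg.eigvalsh(S)[-1]  # eigvalsh returns ascending order

print(var_z1, d[0]**2 / N, largest_eig)  # three equal values

# Any other unit-length combination has no larger variance
a = rng.standard_normal(3)
a = a / np.linalg.norm(a)
var_a = (X @ a) @ (X @ a) / N
print(var_a <= var_z1 + 1e-12)
```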

View from SVD
• The greater the singular value, the less shrinkage is applied.
• Assumption: the response tends to vary more in the directions of high variance of the inputs.
[Figure]

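Lasso
• Penalizing by the sum of absolute values of parameters:
  $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \Big\{ \frac{1}{2} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}$
• Equivalent form:
  $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$
• Compare to ridge regression: the constraint $\sum_j \beta_j^2 \le t$ is replaced by $\sum_j |\beta_j| \le t$.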

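Property of Lasso
• Lasso doesn't have a closed-form solution; computing it is a quadratic programming problem.
• Some of the parameters will be exactly zero.
• Shrinkage parameter: a tighter bound on the sum of absolute values (equivalently, a larger penalty weight) means stronger regularization.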
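To see the exact zeros in practice, here is a minimal sketch that solves the lasso by cyclic coordinate descent. Note that this is a swapped-in technique, not the quadratic programming route named on the slide. It assumes centered data with no intercept and the objective $\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_j |\beta_j|$; the function names, the synthetic data, and the values of `lam` are illustrative.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Move z toward zero by gamma, stopping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * sum|b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)          # x_j^T x_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of feature j added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(0)           # synthetic data, illustrative only
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(100)
y = y - y.mean()

# Larger lam tends to set more parameters to exactly zero
for lam in [1.0, 20.0, 100.0]:
    print(lam, lasso_cd(X, y, lam))
```

Once `lam` exceeds the largest absolute value of $X^T y$, every parameter is exactly zero.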

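Comparison of Subset Selection, Ridge and Lasso
For orthonormal inputs, each method applies a simple transformation to the least-squares estimate $\hat{\beta}_j$:
• Ridge: proportional shrinkage, $\hat{\beta}_j / (1 + \lambda)$
• Lasso: subtract $\lambda$, truncating at zero, $\text{sign}(\hat{\beta}_j)(|\hat{\beta}_j| - \lambda)_+$
• Best subset (size M): drop all parameters smaller than the M-th largest in absolute value, $\hat{\beta}_j \cdot I(|\hat{\beta}_j| \ge |\hat{\beta}_{(M)}|)$
[Figure]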
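The three transformations can be compared side by side on one vector of least-squares estimates. This sketch assumes orthonormal inputs, the penalty conventions $\text{RSS} + \lambda \sum_j \beta_j^2$ for ridge and $\frac{1}{2}\text{RSS} + \lambda \sum_j |\beta_j|$ for lasso, and made-up values for the estimates; the function names are hypothetical.

```python
import numpy as np

def best_subset(beta_ls, M):
    """Keep the M largest parameters in absolute value, zero out the rest."""
    threshold = np.sort(np.abs(beta_ls))[-M]
    return np.where(np.abs(beta_ls) >= threshold, beta_ls, 0.0)

def ridge_shrink(beta_ls, lam):
    """Proportional shrinkage of every parameter."""
    return beta_ls / (1.0 + lam)

def lasso_shrink(beta_ls, lam):
    """Subtract lam from the magnitude, truncating at zero."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([3.0, -0.5, 1.5, -2.0])   # made-up least-squares estimates

print(best_subset(beta_ls, M=2))       # keeps 3.0 and -2.0, zeros the others
print(ridge_shrink(beta_ls, lam=1.0))  # every entry halved
print(lasso_shrink(beta_ls, lam=1.0))  # magnitudes reduced by 1; -0.5 becomes zero
```

Ridge never produces an exact zero, best subset leaves the survivors untouched, and lasso both shrinks and zeroes.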

Comparison of Subset Selection, Ridge and Lasso
[Figure: constraint regions for Lasso and Ridge]
• Lasso → parameters are likely to be exactly zero because of the corners of its constraint region.

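Generalization of Ridge and Lasso
Penalty: $\lambda \sum_{j=1}^{p} |\beta_j|^q$ for $q \ge 0$
• q = 1 → Lasso (L1 norm); the smallest value of q for which the constraint region is convex.
• q = 2 → Ridge (L2 norm)
• q = 0 → subset selection (the penalty depends only on the number of nonzero parameters)
• q can take other values (like 3, 4, …), but empirically this is not worth the effort.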
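The convexity remark can be checked with a tiny example. The helper `lq_penalty` is a hypothetical name, and the points are chosen for illustration: (1, 0) and (0, 1) both have penalty 1 for every q > 0, so if the region "penalty ≤ 1" is convex, their midpoint must also have penalty at most 1.

```python
import numpy as np

def lq_penalty(beta, q):
    """sum_j |beta_j|^q; for q = 0 this counts the nonzero parameters."""
    beta = np.asarray(beta, dtype=float)
    if q == 0:
        return float(np.count_nonzero(beta))
    return float(np.sum(np.abs(beta) ** q))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = (a + b) / 2                     # midpoint (0.5, 0.5)

# Columns: q, penalty at a, penalty at b, penalty at the midpoint
for q in [0.5, 1, 2]:
    print(q, lq_penalty(a, q), lq_penalty(b, q), lq_penalty(mid, q))
```

For q = 0.5 the midpoint has penalty about 1.41, so it falls outside the region and the region is not convex. For q = 1 the midpoint sits exactly on the boundary, and for q = 2 it is inside.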

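Bayesian View
• View $|\beta_j|^q$ as the negative log prior density of $\beta_j$ (up to a constant factor).
• Each method can be seen as maximization of the posterior using a different prior distribution.
• A value of q between 1 and 2 suggests a compromise between Lasso and Ridge, but for q > 1 the penalty does not set parameters exactly to zero. The elastic net instead uses the penalty $\lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha)|\beta_j| \big)$.
• The elastic net selects features like Lasso and shrinks the parameters of correlated features together like Ridge.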
