Hayato Maki
May 31, 2017

# Regularization: The Elements of Statistical Learning

Section 3.4 of The Elements of Statistical Learning


## Transcript

1. ### The Elements of Statistical Learning §3.4 Regularization (shrinkage methods)  Lecturer:

Hayato Maki, Augmented Human Communication Laboratory, Nara Institute of Science and Technology, 2017/05/31
2. ### Agenda How to restrict a model? • Select features •

Regularization 
3. None

5. ### Effect on Correlated Data  • When X1 and X2

are correlated, their estimated parameter values can become very large with opposite signs and cancel each other out; regularization suppresses this. (Figure: parameter values with and without regularization)
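The blow-up on correlated inputs can be sketched numerically; a minimal numpy example (the data, seed, and λ = 1 are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)      # X2 nearly identical to X1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)    # true coefficients are (1, 1)

# Least squares: the near-singular X^T X lets the two coefficients drift
# far apart, with the errors canceling each other in the fitted values
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: adding lam to the diagonal of X^T X stabilizes the solution
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(beta_ls, beta_ls[0] + beta_ls[1])   # individual values unstable; their sum stays near 2
print(beta_ridge)                          # both close to 1
```

Only the sum of the two least-squares coefficients is well determined; ridge pins both down near the true values.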
6. ### Don’t Regularize the Intercept  • Notice the intercept parameter is

NOT included in the regularization term: adding a constant c to every response should simply shift the estimated intercept by c, and penalizing the intercept would break this invariance.
7. ### Scaling and Intercept Estimation  • The solution of Ridge regression depends on the input

scaling → standardize the inputs before solving. • If the inputs are centered (subtract their means), the intercept can be estimated separately (as the mean of y) and the remaining parameters can be estimated by Ridge regression without an intercept.
8. ###  Matrix Representation  • Loss function: L(β) = (y − Xβ)ᵀ(y − Xβ) + λβᵀβ. • Solution:

β̂_ridge = (XᵀX + λI)⁻¹Xᵀy. Compare to least squares, β̂_ls = (XᵀX)⁻¹Xᵀy: the regularization term adds a positive constant to the diagonal components, which makes XᵀX + λI full rank even when XᵀX is singular. Recall the regularization effect on correlated data.
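The closed form on this slide can be written as a short function; a sketch in numpy (the function name and the centering convention follow slide 7, but the code itself is mine):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via the closed form (X^T X + lam*I)^{-1} X^T y.

    Inputs are centered first, so the intercept is estimated separately
    (as the mean of y) and is not penalized.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centering removes the intercept from the penalty
    p = X.shape[1]
    # lam > 0 adds positive constants to the diagonal, guaranteeing full rank
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ beta
    return intercept, beta
```

With λ = 0 and a well-conditioned design this reduces to ordinary least squares; any λ > 0 makes the linear system solvable regardless of collinearity.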
9. ###  Example: Prostate Cancer Data  • (Fig.) Coefficient profiles on the prostate cancer data: stronger regularization shrinks

the coefficients, and the best model is chosen by cross-validation. The horizontal axis is the effective degrees of freedom, computed from the singular values of X.
10. ###  Ridge Solution as Posterior Mode • Assume the probability

density functions y_i ~ N(β0 + x_iᵀβ, σ²) and β_j ~ N(0, τ²), independently. • Then the log posterior density of β is, up to constants, the same as the ridge criterion (3.41) when we set λ = σ²/τ². • The ridge solution is the mode of the posterior (in the Gaussian case, the mode equals the mean).
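The derivation compressed onto this slide can be written out; a reconstruction of the standard §3.4 argument (symbols follow the book):

```latex
y_i \mid \beta \sim N(\beta_0 + x_i^{T}\beta,\ \sigma^2), \qquad
\beta_j \sim N(0, \tau^2) \ \text{(independently)}

-\log p(\beta \mid \mathbf{y})
  = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{T}\beta\bigr)^2
  + \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^{2} + \mathrm{const}

\text{Multiplying by } 2\sigma^2 \text{ gives the ridge criterion (3.41) with }
\lambda = \sigma^2/\tau^2 .
```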
11. ###  View from SVD • Singular value decomposition of the data

matrix: X = UDVᵀ, with singular values d_1 ≥ d_2 ≥ … ≥ d_p ≥ 0. • The least-squares and ridge fits are written as Xβ̂_ls = UUᵀy and Xβ̂_ridge = Σ_j u_j · d_j²/(d_j² + λ) · u_jᵀy. The smaller the singular value, the stronger the shrinkage. What does it mean?
12. ###  View from SVD • Sample covariance matrix: S = XᵀX/N = VD²Vᵀ/N. • The

largest eigenvalue is d_1²/N, and the corresponding eigenvector v_1 is called the first eigenvector (first principal direction). • z_1 = Xv_1 has the largest sample variance among all normalized linear combinations of the columns of X. That variance is d_1²/N.
13. ###  View from SVD • The greater the singular value, the less shrinkage

will be applied. • Assumption: the response tends to vary most in the directions of high variance of the inputs, so shrinking the low-variance directions hurts least. (Fig.)
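The SVD identity on these slides can be checked numerically; a small sketch (random data and λ = 5 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)          # centered inputs, as on slide 7
y = rng.normal(size=50)
lam = 5.0

# Route 1: the closed-form ridge fit
beta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
fit_direct = X @ beta

# Route 2: via the SVD, each direction u_j is shrunk by d_j^2 / (d_j^2 + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)    # in (0, 1); closest to 0 for small singular values
fit_svd = U @ (shrink * (U.T @ y))

assert np.allclose(fit_direct, fit_svd)   # the two routes agree
```

The `shrink` vector makes the slide's point concrete: directions with small singular values are regularized the most.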
14. ### Lasso  • Penalize by the sum of absolute values

of the parameters: minimize Σ(y_i − β0 − x_iᵀβ)² subject to Σ|β_j| ≤ t (compare to Ridge regression, which constrains Σβ_j²). • Equivalent (Lagrangian) form: minimize Σ(y_i − β0 − x_iᵀβ)² + λΣ|β_j|.
15. ### Properties of the Lasso  • The Lasso doesn’t have a closed-form

solution; computing it is a quadratic programming problem. • Some of the parameters become exactly zero. • The smaller the shrinkage parameter t (the larger λ), the stronger the regularization.
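Because there is no closed form, solvers iterate; below is a minimal cyclic coordinate-descent sketch (the soft-thresholding update is the standard one, but the function names, iteration count, and test data are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; the result is exactly zero once |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent (X assumed centered, y assumed centered).

    Minimizes 0.5 * ||y - X beta||^2 + lam * sum(|beta_j|).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j removed from the fit
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return beta
```

On a toy problem with sparse true coefficients, several entries of the returned vector come out exactly 0.0, which is the feature-selection behavior the slide describes.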
16. ### Comparison of Subset Selection, Ridge, and Lasso  • Ridge: proportional shrinkage. • Lasso: subtract λ,

truncating at zero (soft thresholding). • Best subset: drop all parameters smaller than the Mth largest (hard thresholding). (Fig.)
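In the orthonormal-input case the three methods reduce to explicit transforms of the least-squares estimate; a sketch (the example coefficients are mine):

```python
import numpy as np

def best_subset(beta, M):
    """Hard thresholding: keep only the M largest-magnitude coefficients."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-M:]
    out[keep] = beta[keep]
    return out

def ridge_shrink(beta, lam):
    """Proportional shrinkage: every coefficient scaled by the same factor."""
    return beta / (1.0 + lam)

def lasso_shrink(beta, lam):
    """Soft thresholding: subtract lam from the magnitude, truncating at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta_ls = np.array([3.0, -1.5, 0.5])
print(best_subset(beta_ls, 2))     # [ 3.  -1.5  0. ]
print(ridge_shrink(beta_ls, 1.0))  # [ 1.5  -0.75  0.25]
print(lasso_shrink(beta_ls, 1.0))  # [ 2.  -0.5  0. ]
```

Note the qualitative difference: ridge never zeros a coefficient, while both thresholding rules do.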
17. ### Comparison of Subset Selection, Ridge, and Lasso  • (Fig.) The Lasso solution is likely to

be exactly zero in some coordinates because its constraint region has corners; the Ridge region is a sphere with no corners.
18. ### Generalization of Ridge and Lasso: the penalty Σ|β_j|^q • q = 1 → Lasso (L1 norm), the minimum

value of q for which the constraint region is convex. • q = 2 → Ridge (L2 norm). • q = 0 → subset selection (the penalty depends only on the number of nonzero parameters). • q can take other values (like 3, 4, …) but they are empirically not useful.
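The convexity claim can be checked numerically; a quick sketch testing the midpoint inequality for the penalty Σ|β_j|^q (the two test points are chosen for illustration only):

```python
import numpy as np

def lq_penalty(beta, q):
    """The generalized penalty sum_j |beta_j|^q."""
    return np.sum(np.abs(beta) ** q)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = 0.5 * (a + b)

for q in [0.5, 1.0, 2.0]:
    # Convexity requires penalty(mid) <= average of the endpoint penalties
    convex_here = lq_penalty(mid, q) <= 0.5 * (lq_penalty(a, q) + lq_penalty(b, q)) + 1e-12
    print(q, convex_here)   # q = 0.5 violates the inequality; q = 1 and q = 2 satisfy it
```

For q < 1 the midpoint penalty exceeds the endpoint average, so the constraint region is non-convex, which is why q = 1 is the smallest "nice" choice.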
19. ### Bayesian View • View the penalty as the negative log prior density of β. • Each

method can be seen as maximizing the posterior under a different prior distribution. • A compromise between q = 1 and q = 2 is the elastic net penalty, λΣ(αβ_j² + (1 − α)|β_j|). • It selects features like the Lasso and shrinks the coefficients of correlated features together like Ridge.