Regularization_The Elements of Statistical Learning

Section 3.4 of The Elements of Statistical Learning

Hayato Maki

May 31, 2017

Transcript

  1. The Elements of Statistical Learning §3.4 Regularization
 (shrinkage methods) Lecturer:

    Hayato Maki Augmented Human Communication Laboratory
 Nara Institute of Science and Technology
 2017/05/31
  2. Agenda How to restrict a model? • Select features •

    Regularization 
  3. None
  4. Ridge Regression • Penalizing by the sum of squares of the parameters: the error between a label and an estimation, plus a regularization parameter times the sum of squared parameters. • Equivalent form: minimize the error subject to a bound on the sum of squared parameters (both forms reconstructed below).
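
The slide's equations do not survive in the transcript. The standard ridge criterion from ESL (3.41), which matches the annotations above, and its equivalent constrained form (3.42) are:

    \hat{\beta}^{\mathrm{ridge}}
      = \arg\min_{\beta} \Bigg\{ \underbrace{\sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{error between label and estimation}}
        \;+\; \underbrace{\lambda}_{\text{regularization parameter}} \underbrace{\sum_{j=1}^{p} \beta_j^{2}}_{\text{sum of squared parameters}} \Bigg\}

    \hat{\beta}^{\mathrm{ridge}}
      = \arg\min_{\beta} \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
      \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^{2} \le t
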

  5. Effect on Correlated Data • When X1 and X2 are correlated, their parameter values can cancel each other out and grow very large. Regularization counteracts this (see the sketch below).
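
A minimal numerical sketch of this effect (not from the slides; the variable names and data are my own), using plain NumPy: with two nearly identical columns, the least-squares coefficients can take large values of opposite sign that cancel each other, while the ridge penalty keeps them small and similar.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)     # X2 almost identical to X1 (highly correlated)
    y = x1 + x2 + 0.1 * rng.normal(size=n)  # true coefficients are (1, 1)
    X = np.column_stack([x1, x2])

    # Ordinary least squares: coefficients can blow up with opposite signs and cancel out
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Ridge: adding lam * I to X^T X keeps the coefficients small and similar
    lam = 1.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

    print(beta_ols, beta_ridge)
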
  6. Don’t Regularize the Intercept • Notice the intercept parameter is NOT involved in the regularization term. • If a constant c is added to all target values, the fitted predictions should simply shift by the same c; penalizing the intercept would break this.
  7. Scaling and Intercept Estimation • The ridge solution depends on the scaling of the inputs → standardize the inputs before solving. • If the inputs are centered (subtract their means), the intercept can be estimated separately and the other parameters can be estimated by ridge regression without an intercept (see the reconstruction below).
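
Concretely, following ESL Section 3.4.1, after centering the inputs the intercept is estimated by the mean of the targets and the remaining coefficients by a ridge regression without an intercept:

    \hat{\beta}_0 = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i,
    \qquad
    \hat{\beta}_1,\dots,\hat{\beta}_p \ \text{estimated by ridge regression on the centered inputs } x_{ij} - \bar{x}_j \ \text{(no intercept term).}
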
  8. Matrix Representation • Loss function and solution in matrix form, with the regularization term added; compare to ordinary least squares (both reconstructed below). • Adding a positive constant to the diagonal components makes X^T X + λI nonsingular (full rank) even when X^T X is not. Recall the regularization effect on correlated data.
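
The matrix-form loss and solution referred to on the slide, reconstructed from ESL (3.43)-(3.44), with the ordinary least-squares solution shown for comparison:

    \mathrm{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^{\top}(\mathbf{y} - \mathbf{X}\beta) + \lambda\,\beta^{\top}\beta

    \hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{y}
    \qquad \text{vs.} \qquad
    \hat{\beta}^{\mathrm{ls}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}

Adding λ > 0 to the diagonal of X^T X makes the inverse exist even when X^T X is singular, for example when columns of X are highly correlated.
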
  9. Example: Prostate Cancer Data • Coefficient profiles plotted against the effective degrees of freedom, which is computed from the singular values of X (Fig.). Smaller effective degrees of freedom means stronger regularization; the point giving the best cross-validated performance on the training data is marked.
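
The effective degrees of freedom mentioned on the slide is, in ESL's notation (3.50), a function of the singular values d_j of X:

    \mathrm{df}(\lambda)
      = \mathrm{tr}\big[\mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\big]
      = \sum_{j=1}^{p} \frac{d_j^{2}}{d_j^{2} + \lambda}

df(λ) = p when λ = 0 (no regularization) and df(λ) → 0 as λ → ∞, so moving toward smaller df means stronger regularization.
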
  10. Ridge Solution as Posterior Mode • Assume Gaussian probability density functions for the likelihood and for the prior on the parameters. • Then the log posterior of the parameters is, up to constants, the same as (3.41) with λ set to the ratio of the noise variance to the prior variance. • The ridge solution is the mode of the posterior (in the Gaussian case the mode equals the mean).
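
A reconstruction of the assumed densities, following ESL Section 3.4.1; up to additive constants and positive scaling, the negative log posterior is the ridge criterion, and the blank in the slide text is λ = σ²/τ²:

    y_i \sim N\big(\beta_0 + x_i^{\top}\beta,\ \sigma^{2}\big),
    \qquad
    \beta_j \sim N(0, \tau^{2}) \ \text{independently}

    -\log p(\beta \mid \mathbf{y})
      \;\propto\; \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
      + \frac{\sigma^{2}}{\tau^{2}} \sum_{j=1}^{p} \beta_j^{2},
    \qquad \lambda = \frac{\sigma^{2}}{\tau^{2}}
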
  11. View from SVD • Singular value decomposition of the data matrix X, with its singular values. • The least-squares and ridge fits can be written in terms of the SVD (reconstructed below): the smaller the singular value, the stronger the shrinkage. What does that mean?
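
The SVD expressions the slide refers to, reconstructed from ESL (3.46)-(3.47), with X = UDV^T and singular values d_1 ≥ … ≥ d_p ≥ 0:

    \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}

    \mathbf{X}\hat{\beta}^{\mathrm{ls}} = \mathbf{U}\mathbf{U}^{\top}\mathbf{y},
    \qquad
    \mathbf{X}\hat{\beta}^{\mathrm{ridge}}
      = \sum_{j=1}^{p} \mathbf{u}_j\,\frac{d_j^{2}}{d_j^{2} + \lambda}\,\mathbf{u}_j^{\top}\mathbf{y}

Each coordinate u_j^T y is shrunk by the factor d_j^2 / (d_j^2 + λ), so directions with small singular values are shrunk the most.
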
  12. View from SVD • Sample covariance matrix of the inputs. • The eigenvector corresponding to the largest eigenvalue is called the first principal (eigen)vector. • The corresponding linear combination of the columns of X has the largest sample variance among all normalized linear combinations; that variance is given below.
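
The corresponding quantities from ESL (3.48)-(3.49): the sample covariance matrix, the first principal component z_1 = Xv_1, and its sample variance:

    \mathbf{S} = \frac{\mathbf{X}^{\top}\mathbf{X}}{N} = \frac{\mathbf{V}\mathbf{D}^{2}\mathbf{V}^{\top}}{N},
    \qquad
    \mathbf{z}_1 = \mathbf{X}\mathbf{v}_1,
    \qquad
    \widehat{\mathrm{Var}}(\mathbf{z}_1) = \frac{d_1^{2}}{N}
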
  13. View from SVD • The greater the singular value, the less shrinkage is applied. • Implicit assumption: the response tends to vary most in the directions of high variance of the inputs (Fig.).
  14. Lasso • Penalizing by the sum of absolute values of the parameters, subject to a bound on that sum; compare to ridge regression. • Equivalent (Lagrangian) form with an L1 penalty (both forms reconstructed below).
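
The lasso criterion in its constrained form, ESL (3.51), and its equivalent Lagrangian form, ESL (3.52); compared with ridge, the penalty is the sum of absolute values rather than squares:

    \hat{\beta}^{\mathrm{lasso}}
      = \arg\min_{\beta} \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
      \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t

    \hat{\beta}^{\mathrm{lasso}}
      = \arg\min_{\beta} \Bigg\{ \frac{1}{2}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
        + \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}
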
  15. Properties of the Lasso • The lasso has no closed-form solution; computing it is a quadratic programming problem. • Some of the parameters will be exactly zero. • The smaller the shrinkage bound t (equivalently, the larger λ), the stronger the regularization.
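
A small sketch (not from the slides) of the sparsity property, using scikit-learn's Lasso and Ridge estimators on an assumed synthetic dataset: the lasso sets many coefficients exactly to zero, while ridge only shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])  # only 2 informative features
    y = X @ beta_true + 0.5 * rng.normal(size=100)

    lasso = Lasso(alpha=0.5).fit(X, y)   # larger alpha -> stronger regularization
    ridge = Ridge(alpha=0.5).fit(X, y)

    print(lasso.coef_)  # many coefficients are exactly 0 (feature selection)
    print(ridge.coef_)  # coefficients are shrunk toward 0 but typically not exactly 0
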
  16. Comparison of Subset Selection, Ridge, and Lasso • Best subset: drop all parameters smaller in magnitude than the Mth largest. • Ridge: proportional shrinkage. • Lasso: subtract a constant, truncating at zero (Fig.).
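
For an orthonormal input matrix X, the three update rules the slide compares are explicit transforms of the least-squares estimate β̂_j (ESL Table 3.4):

    \text{Best subset (size } M\text{):}\quad \hat{\beta}_j \cdot \mathbf{1}\big[\,|\hat{\beta}_j| \ge |\hat{\beta}_{(M)}|\,\big]
    \qquad
    \text{Ridge:}\quad \frac{\hat{\beta}_j}{1 + \lambda}
    \qquad
    \text{Lasso:}\quad \operatorname{sign}(\hat{\beta}_j)\,\big(|\hat{\beta}_j| - \lambda\big)_{+}
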
  17. Comparison of Subset Selection, Ridge, and Lasso (Fig.) • Lasso estimates are likely to be exactly zero because the L1 constraint region has corners, unlike the ridge constraint region.
  18. Generalization of Ridge and Lasso • q = 1 → lasso (L1 norm), the smallest q for which the constraint region is convex. • q = 2 → ridge (L2 norm). • q = 0 → subset selection (the penalty depends only on the number of nonzero parameters). • q can take other values (3, 4, …), but these are rarely useful in practice.
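
The family of penalties the slide generalizes over, ESL (3.53):

    \tilde{\beta}
      = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
        + \lambda \sum_{j=1}^{p} |\beta_j|^{q} \Bigg\}, \qquad q \ge 0

The constraint region Σ_j |β_j|^q ≤ t is convex only for q ≥ 1, which is why q = 1 is the smallest value giving a convex problem.
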
  19. Bayesian View • View the penalty term as the log prior density of the parameters. • Each method can be seen as maximizing the posterior under a different prior distribution. • A compromise between the lasso (q = 1) and ridge (q = 2) is the elastic net penalty (see below): it selects features like the lasso and shrinks the coefficients of correlated features together like ridge.
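
The elastic net penalty, ESL (3.54), with mixing parameter α ∈ [0, 1]:

    \lambda \sum_{j=1}^{p} \big(\alpha\,\beta_j^{2} + (1 - \alpha)\,|\beta_j|\big)

The |β_j| term gives lasso-like variable selection, while the β_j² term gives ridge-like shrinkage of correlated features toward each other.
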