Regularization: The Elements of Statistical Learning

Section 3.4 of The Elements of Statistical Learning


Hayato Maki

May 31, 2017


  1. The Elements of Statistical Learning, §3.4 Regularization
     (Shrinkage Methods)
     Lecturer: Hayato Maki
     Augmented Human Communication Laboratory,
     Nara Institute of Science and Technology
  2. (figure-only slide)
  3. Effect on Correlated Data
     • When X1 and X2 are correlated, their coefficients can cancel each
       other out: one parameter value becomes very large and positive, the
       other very large and negative. Regularization controls this.
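This cancellation can be seen in a small numpy sketch (synthetic data and penalty value chosen here for illustration, not taken from the slides): with two nearly identical inputs, the individual least-squares coefficients are poorly determined, while a ridge penalty keeps both near the true values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 nearly equal to x1: strong correlation
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)   # true coefficients are (1, 1)

# Ordinary least squares: the difference beta1 - beta2 is poorly determined,
# so the two coefficients can take large values of opposite sign.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge keeps both coefficients near (1, 1) by penalizing their size.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(beta_ols)
print(beta_ridge)
```

Note that the well-determined quantity, the sum of the two coefficients, stays close to 2 for both estimators; only ridge pins down each coefficient individually.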
  4. Don't Regularize the Intercept
     • Notice that the intercept parameter is NOT involved in the
       regularization penalty. If it were, adding a constant c to every
       response would not simply shift the predictions by c; leaving the
       intercept unpenalized preserves this invariance.
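A quick numerical check of this invariance (hypothetical data; `ridge_with_intercept` is a small helper written for this sketch, not a library function): with the intercept left out of the penalty, adding a constant c to every response changes only the intercept, and by exactly c.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=50)

def ridge_with_intercept(X, y, lam):
    # Center inputs and response so the (unpenalized) intercept separates out
    Xc = X - X.mean(axis=0)
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]),
                           Xc.T @ (y - y.mean()))
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

c = 10.0
b0, beta = ridge_with_intercept(X, y, lam=1.0)
b0_shift, beta_shift = ridge_with_intercept(X, y + c, lam=1.0)

print(b0_shift - b0)                       # shift of the intercept: exactly c
print(np.max(np.abs(beta_shift - beta)))   # the penalized coefficients: unchanged
```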
  5. Scaling and Intercept Estimation
     • The ridge solution depends on the scaling of the inputs, so
       standardize the inputs before solving.
     • If the inputs are centered (subtract the column means), the
       intercept can be estimated separately as the mean of y, and the
       remaining parameters can be estimated by ridge regression without
       an intercept.
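The scale dependence is easy to demonstrate (synthetic data; `ridge` and `standardize` are helpers written for this sketch): changing the units of one input changes the ridge fit, unless the inputs are standardized first.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=60)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

lam = 5.0
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0   # change the units of the first input

# Without standardization the fitted values differ: ridge is not scale-equivariant
fit_raw = X @ ridge(X, y, lam)
fit_rescaled = X_scaled @ ridge(X_scaled, y, lam)
print(np.max(np.abs(fit_raw - fit_rescaled)))

# After standardizing, the fit no longer depends on the original units
fit_std = standardize(X) @ ridge(standardize(X), y - y.mean(), lam) + y.mean()
fit_std2 = standardize(X_scaled) @ ridge(standardize(X_scaled), y - y.mean(), lam) + y.mean()
print(np.max(np.abs(fit_std - fit_std2)))
```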
  6. Matrix Representation
     • Loss function:
         RSS(lambda) = (y - X beta)^T (y - X beta) + lambda beta^T beta,
       where the second term is the regularization term.
     • Solution:
         beta_ridge = (X^T X + lambda I)^{-1} X^T y.
       Compare to least squares: beta_ls = (X^T X)^{-1} X^T y.
     • Ridge adds a positive constant to the diagonal components of
       X^T X, which makes the matrix full rank (hence invertible) even
       when X^T X is singular. Recall the regularization effect on
       correlated data.
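The full-rank effect can be verified numerically (a sketch with arbitrary synthetic dimensions): with more columns than rows, X^T X is singular, but adding lambda to the diagonal restores full rank and the ridge system becomes solvable.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 20))     # p = 20 > n = 10, so X.T @ X is rank-deficient
y = rng.normal(size=10)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))                     # at most 10 < 20

lam = 0.5
# Adding lam to the diagonal makes the matrix positive definite, hence full rank
beta = np.linalg.solve(XtX + lam * np.eye(20), X.T @ y)
print(np.linalg.matrix_rank(XtX + lam * np.eye(20)))  # 20
```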
  7. Example: Prostate Cancer Data
     • Stronger regularization shrinks the coefficients toward zero; the
       best-performing value of the penalty is chosen by cross validation.
     • (Fig.) The coefficient profiles are plotted against the effective
       degrees of freedom, df(lambda) = sum_j d_j^2 / (d_j^2 + lambda),
       where the d_j are the singular values of X.
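The effective-degrees-of-freedom formula can be computed directly from the singular values (synthetic X here; the prostate data itself is not reproduced): df runs from the full p when lambda = 0 down to 0 as lambda grows.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
d = np.linalg.svd(X, compute_uv=False)   # singular values of X

def df(lam):
    # Effective degrees of freedom of ridge: sum_j d_j^2 / (d_j^2 + lam)
    return np.sum(d**2 / (d**2 + lam))

print(df(0.0))    # 8.0: no regularization, full p degrees of freedom
print(df(10.0))   # somewhere between 0 and 8
print(df(1e6))    # near 0: very strong regularization
```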
  8. Ridge Solution as Posterior Mode
     • Assume y_i ~ N(beta_0 + x_i^T beta, sigma^2) with an independent
       prior beta_j ~ N(0, tau^2).
     • Then the log posterior density of beta is (up to sign and
       constants) the ridge criterion (3.41) with lambda = sigma^2 / tau^2.
     • The ridge solution is therefore the mode of the posterior (in the
       Gaussian case the mode equals the mean).
  9. View from SVD
     • Singular value decomposition of the data matrix: X = U D V^T,
       with singular values d_1 >= d_2 >= ... >= d_p >= 0.
     • The least-squares and ridge fits are written as:
         X beta_ls    = U U^T y
         X beta_ridge = sum_j u_j (d_j^2 / (d_j^2 + lambda)) u_j^T y
     • The smaller the singular value, the stronger the shrinkage of that
       direction. What does this mean?
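The SVD expression for the ridge fit can be checked against the closed-form solution (synthetic data, small dimensions chosen for illustration): each u_j-component of y is shrunk by the factor d_j^2 / (d_j^2 + lambda).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.0

# Closed-form ridge fit
beta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
fit_direct = X @ beta

# Same fit via the SVD X = U D V^T: shrink each u_j-component of y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
fit_svd = U @ (shrink * (U.T @ y))

print(np.max(np.abs(fit_direct - fit_svd)))   # the two fits agree
```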
  10. View from SVD
      • Sample covariance matrix: S = X^T X / N, with X^T X = V D^2 V^T.
      • The eigenvector v_1 corresponding to the largest eigenvalue d_1^2
        is called the first principal component direction.
      • z_1 = X v_1 has the largest sample variance among all normalized
        linear combinations of the columns of X; that variance is
        d_1^2 / N.
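A numerical sketch of this claim (correlated synthetic columns, centered before taking the SVD): the scores along the first right singular vector have sample variance d_1^2 / N, and no other unit-norm combination of the columns exceeds it.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated columns
X = X - X.mean(axis=0)                                   # center the inputs

# SVD of the centered data matrix: X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]          # first principal component direction
z1 = X @ v1         # scores along that direction

N = X.shape[0]
print(z1 @ z1 / N)  # sample variance of z1 ...
print(d[0]**2 / N)  # ... equals d_1^2 / N
```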
  11. View from SVD
      • The greater the singular value, the less shrinkage is applied.
      • Assumption: the response tends to vary most in the directions of
        high variance of the inputs. (Fig.)
  12. Lasso
      • Penalize by the sum of the absolute values of the parameters:
        minimize RSS(beta) subject to sum_j |beta_j| <= t.
        Compare to ridge regression, which constrains sum_j beta_j^2.
      • Equivalent form: minimize RSS(beta) + lambda sum_j |beta_j|.
  13. Properties of the Lasso
      • The lasso does not have a closed-form solution; computing it is a
        quadratic programming problem.
      • Some of the parameters are shrunk exactly to zero.
      • A smaller shrinkage bound t (equivalently, a larger lambda) means
        stronger regularization.
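For the special case of orthonormal inputs the lasso does have a closed form, which makes the exact-zero behavior visible: each least-squares coefficient is soft-thresholded (shrunk toward zero by lambda and cut off at zero). A sketch with made-up coefficients:

```python
import numpy as np

def soft_threshold(b, lam):
    # Soft-thresholding: shrink each coefficient toward 0 by lam,
    # and set it exactly to 0 once its magnitude falls below lam.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

beta_ls = np.array([3.0, -0.5, 0.2, -2.0])   # hypothetical LS coefficients
beta_lasso = soft_threshold(beta_ls, 1.0)
print(beta_lasso)   # the two small coefficients become exactly zero
```

Contrast with ridge, which in the same orthonormal setting shrinks every coefficient by the same proportional factor and never produces exact zeros.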
  14. Generalization of Ridge and Lasso
      • Consider the penalty sum_j |beta_j|^q.
      • q = 1: lasso (L1 norm), the smallest q for which the constraint
        region is convex.
      • q = 2: ridge (L2 norm).
      • q = 0: subset selection (the penalty just counts the number of
        nonzero parameters).
      • q can take other values (3, 4, ...), but these are empirically
        not useful.
  15. Bayesian View
      • |beta_j|^q can be viewed as the log prior density of beta_j, so
        each method can be seen as maximizing the posterior under a
        different prior distribution.
      • A compromise between q = 1 and q = 2 is the elastic net: it
        selects features like the lasso and shrinks correlated features
        together like ridge.