Regularization_The Elements of Statistical Learning

Section 3.4 of The Elements of Statistical Learning


Hayato Maki

May 31, 2017

Transcript

  1. 1.

    The Elements of Statistical Learning §3.4 Regularization (shrinkage methods)
    Lecturer: Hayato Maki, Augmented Human Communication Laboratory,
    Nara Institute of Science and Technology
    2017/05/31
  2. 3.
  3. 5.

    Effect on Correlated Data  • When X1 and X2 are correlated, their estimated
    coefficients can cancel each other out, and the parameter values can become
    very big!  • Regularization counters this (see the sketch below).
    [Figure: parameter values]
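    A quick numerical sketch of this effect (my own illustration, not from the
    slides; the data, the penalty value lam, and the seed are arbitrary): two
    almost identical inputs make the least-squares coefficients unstable, while a
    ridge penalty pulls both back toward the shared value.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 100
        x1 = rng.normal(size=n)
        x2 = x1 + 0.001 * rng.normal(size=n)       # X2 is almost identical to X1
        X = np.column_stack([x1, x2])
        y = x1 + x2 + 0.1 * rng.normal(size=n)     # true coefficients are (1, 1)

        # Ordinary least squares: the two coefficients are unstable and can
        # become large with opposite signs, nearly cancelling each other.
        beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

        # Ridge: add lam to the diagonal of X^T X before solving.
        lam = 1.0
        beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

        print("OLS  :", beta_ols)      # typically far from (1, 1)
        print("ridge:", beta_ridge)    # both close to 1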
  4. 6.

    Don’t Regularize the Intercept  • Notice the intercept parameter is NOT
    involved in the regularization penalty.  • If we add a constant c to every
    response, the fitted intercept should simply shift by c; penalizing the
    intercept would break this and make the solution depend on the origin of y.
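    For reference, the ridge criterion (equation (3.41) in the book, which a later
    slide cites) penalizes only the slope parameters and leaves the intercept out
    of the penalty; in LaTeX:

        \hat{\beta}^{\mathrm{ridge}} = \operatorname*{arg\,min}_{\beta}
            \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
            + \lambda \sum_{j=1}^{p} \beta_j^{2},
        \qquad \lambda \ge 0 .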
  5. 7.

    Scaling and Intercept Estimation  • The ridge regression solution depends on
    the scaling of the inputs → standardize the inputs before solving.  • If the
    inputs are centered (subtract their means), the intercept can be estimated
    separately as the mean of y, and the remaining parameters can be estimated by
    ridge regression without an intercept (see the sketch below).
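    A minimal sketch of that recipe (my own code, not from the slides; the function
    name and the closed-form solve are illustration only): standardize X, center y,
    solve the penalized problem without an intercept, and recover the intercept
    from the means.

        import numpy as np

        def ridge_with_intercept(X, y, lam):
            """Standardize X, center y, fit ridge without an intercept,
            then recover the intercept from the means (sketch, not library code)."""
            x_mean, x_std = X.mean(axis=0), X.std(axis=0)
            Xs = (X - x_mean) / x_std                  # standardized inputs
            y_mean = y.mean()
            yc = y - y_mean                            # centered response
            p = Xs.shape[1]
            beta_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
            beta = beta_s / x_std                      # back to the original input scale
            intercept = y_mean - x_mean @ beta
            return intercept, beta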
  6. 8.

    Matrix Representation  • Loss function: RSS(λ) = (y − Xβ)ᵀ(y − Xβ) + λβᵀβ, where
    λβᵀβ is the regularization term.  • Solution: β̂ = (XᵀX + λI)⁻¹Xᵀy. Compare with
    the least-squares solution (XᵀX)⁻¹Xᵀy: adding a positive constant to the diagonal
    of XᵀX makes it full rank (invertible) even when XᵀX itself is singular.
    Recall the regularization effect on correlated data.
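    To see the "full rank" point concretely, a small numpy sketch (my own; the
    sizes and λ are arbitrary): with more features than samples XᵀX is singular,
    but XᵀX + λI is invertible, so the closed-form ridge solution still exists.

        import numpy as np

        rng = np.random.default_rng(1)
        n, p = 20, 30                               # more features than samples,
        X = rng.normal(size=(n, p))                 # so X^T X has rank at most 20
        y = rng.normal(size=n)

        lam = 0.1
        A = X.T @ X + lam * np.eye(p)               # positive constant on the diagonal
        print(np.linalg.matrix_rank(X.T @ X))       # 20  (singular)
        print(np.linalg.matrix_rank(A))             # 30  (full rank, invertible)

        beta_ridge = np.linalg.solve(A, X.T @ y)    # closed-form ridge solution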
  7. 9.

    Example: Prostate Cancer Data  • Ridge coefficient profiles are plotted against
    the effective degrees of freedom df(λ) = Σⱼ dⱼ²/(dⱼ² + λ), where the dⱼ are the
    singular values of X; smaller df(λ) means stronger regularization.  • The value
    with the best estimated performance is chosen by cross-validation.
    [Figure: coefficient profiles vs. effective degrees of freedom]
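    The effective degrees of freedom on the x-axis of that figure can be computed
    directly from the singular values of X; a small sketch (the helper name is my
    own):

        import numpy as np

        def effective_df(X, lam):
            """df(lam) = sum_j d_j^2 / (d_j^2 + lam), d_j = singular values of X."""
            d = np.linalg.svd(X, compute_uv=False)
            return np.sum(d**2 / (d**2 + lam))

        X = np.random.default_rng(2).normal(size=(50, 8))
        print(effective_df(X, 0.0))       # 8.0: no regularization, plain least squares
        print(effective_df(X, 1000.0))    # close to 0: very strong regularization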
  8. 10.

    Ridge Solution as Posterior Mode  • Assume yᵢ ~ N(β₀ + xᵢᵀβ, σ²) with independent
    priors βⱼ ~ N(0, τ²).  • Then the log posterior density of β is the same as the
    ridge criterion (3.41) with λ = σ²/τ².  • The ridge solution is the mode of the
    posterior (and in the Gaussian case the mode equals the mean).
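    Written out (a standard derivation restated in LaTeX, not on the slide; it
    assumes a flat prior on the intercept): the negative log posterior is the ridge
    criterion up to constants, with λ = σ²/τ².

        -\log p(\beta_0, \beta \mid y)
            = \frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^{2}
              + \frac{1}{2\tau^{2}} \sum_{j=1}^{p} \beta_j^{2} + \mathrm{const},
        \qquad \lambda = \sigma^{2} / \tau^{2} .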
  9. 11.

    View from SVD  • Singular value decomposition of the data matrix: X = UDVᵀ, where
    the diagonal entries d₁ ≥ … ≥ d_p ≥ 0 of D are the singular values.  • The
    least-squares fit is Xβ̂ = UUᵀy; the ridge fit is Xβ̂ = Σⱼ uⱼ · dⱼ²/(dⱼ² + λ) · uⱼᵀy
    (see the check below).  • The smaller the singular value, the stronger the
    shrinkage applied to that coordinate. What does this mean?
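    A quick numerical check of that identity (my own sketch; the sizes and λ are
    arbitrary): the SVD form of the ridge fit matches the closed-form solution.

        import numpy as np

        rng = np.random.default_rng(3)
        X = rng.normal(size=(40, 5))
        y = rng.normal(size=40)
        lam = 2.0

        # Direct closed-form ridge fit
        beta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
        fit_direct = X @ beta

        # SVD form: each coordinate u_j^T y is shrunk by d_j^2 / (d_j^2 + lam)
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        fit_svd = U @ ((d**2 / (d**2 + lam)) * (U.T @ y))

        print(np.allclose(fit_direct, fit_svd))     # True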
  10. 12.

    View from SVD  • Sample covariance matrix (for centered inputs): S = XᵀX/N, with
    eigen decomposition XᵀX = VD²Vᵀ.  • The eigenvector v₁ corresponding to the
    largest eigenvalue d₁² is called the first eigenvector (the first principal
    component direction).  • z₁ = Xv₁ has the largest sample variance among all
    normalized linear combinations of the columns of X; that variance is d₁²/N.
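    A one-line check of that claim (my own sketch): project the centered data onto
    v₁ and compare the sample variance with d₁²/N.

        import numpy as np

        rng = np.random.default_rng(4)
        X = rng.normal(size=(200, 4))
        X = X - X.mean(axis=0)                      # centered inputs
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        z1 = X @ Vt[0]                              # projection onto v1
        print(np.var(z1), d[0]**2 / X.shape[0])     # the two numbers agree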
  11. 13.

    View from SVD  • The greater the singular value, the less shrinkage is applied.
    • Implicit assumption: the response tends to vary more in the directions of high
    variance of the inputs, so it is the low-variance directions that ridge shrinks
    the most.
    [Figure]
  12. 14.

    Lasso  • Penalize by the sum of the absolute values of the parameters: minimize
    the residual sum of squares subject to Σⱼ |βⱼ| ≤ t. Compare to ridge regression,
    which constrains Σⱼ βⱼ².  • Equivalent (Lagrangian) form: add λ Σⱼ |βⱼ| to the
    loss; both forms are written out below.
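    Both forms written out (standard formulas from the book, restated in LaTeX):

        \hat{\beta}^{\mathrm{lasso}} = \operatorname*{arg\,min}_{\beta}
            \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
        \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t,

        \hat{\beta}^{\mathrm{lasso}} = \operatorname*{arg\,min}_{\beta}\;
            \frac{1}{2} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
            + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .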
  13. 15.

    Properties of the Lasso  • The lasso does not have a closed-form solution;
    computing it is a quadratic programming problem.  • Some of the parameters will
    be exactly zero, so the lasso performs variable selection.  • The smaller the
    shrinkage parameter t (equivalently, the larger λ), the stronger the
    regularization.
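    In practice the lasso is usually solved iteratively; below is a minimal cyclic
    coordinate-descent sketch (my own illustration, not the book's algorithm; no
    intercept, fixed number of passes) showing that some coefficients land exactly
    at zero.

        import numpy as np

        def soft_threshold(z, g):
            return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

        def lasso_cd(X, y, lam, n_passes=200):
            """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
            n, p = X.shape
            beta = np.zeros(p)
            for _ in range(n_passes):
                for j in range(p):
                    r = y - X @ beta + X[:, j] * beta[j]   # residual without feature j
                    beta[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
            return beta

        rng = np.random.default_rng(5)
        X = rng.normal(size=(100, 10))
        y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)   # only 2 relevant features
        print(lasso_cd(X, y, lam=50.0))   # most entries are exactly 0.0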
  14. 18.

    Generalization of Ridge and Lasso  • Generalize the penalty to Σⱼ |βⱼ|^q (see
    the criterion below).  • q = 1 → lasso (L1 norm); the smallest value of q for
    which the constraint region is convex.  • q = 2 → ridge (L2 norm).  • q = 0 →
    subset selection (the penalty simply counts the number of nonzero parameters).
    • q can take other values (like 3, 4, …), but empirically they are of little use.
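    The family of criteria being described, written out (standard form from the
    book, restated in LaTeX):

        \tilde{\beta} = \operatorname*{arg\,min}_{\beta}
            \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
            + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert^{q} .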
  15. 19.

    Bayesian View  • View |βⱼ|^q as the (negative) log prior density of βⱼ.  • Each
    method can then be seen as maximizing the posterior under a different prior
    distribution.  • A q between 1 and 2 suggests a compromise between lasso and
    ridge, but |βⱼ|^q with q > 1 no longer sets coefficients exactly to zero; the
    elastic net is a different compromise penalty, written out below.  • It selects
    features like the lasso and shrinks the coefficients of correlated features
    together like ridge.
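    The elastic-net penalty mentioned here, written out (as given in the book; α
    controls the mix between the ridge and lasso terms):

        \lambda \sum_{j=1}^{p} \bigl( \alpha \beta_j^{2} + (1 - \alpha) \lvert \beta_j \rvert \bigr) .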