
(NIPS/ICDM 2016 Paper Reading Group) Learning Kernels with Random Features 2017/2/11

ecarlatt
February 11, 2017

Transcript

  1. Learning Kernels with Random Features (Aman Sinha, John Duchi, NIPS 2016)
     Shuhei Kano (The University of Tokyo, Information Science and Technology)
     2017/2/11 NIPS/ICDM 2016 Paper Reading Group @ RCO
     1 / 24
  2. Overview
     Random Fourier Features (RFF) [Rahimi2007]: a fast kernel method
     • Approximate the Gram matrix by inner products of explicit Euclidean feature vectors
     • Sample from the Fourier transform of a shift-invariant kernel (cf. Bochner's theorem):
       K(x − y) = ϕ(x)^⊤ ϕ(y) ≈ z(x)^⊤ z(y),   ϕ : R^d → R^{d′},  z : R^d → R^D   (d′ ≤ ∞, D = 10^2 ∼ 10^4)
     Contribution of [Sinha2016] (in NIPS)
     • Extend the random feature approach to kernel learning
     • Show consistency and generalization performance
     2 / 24
  3. The kernel method is too heavy... an example
     • Φ_ij = K(x_i, x_j),  Φ ∈ R^{n×n}
     • Computation with Φ costs O(n^2) ∼ O(n^3)
     • e.g. kernel ridge:  θ̂ = (Φ^⊤ Φ + λ I_n)^{−1} Φ^⊤ y   (1)
     For large-scale problems
     • Subsampling: [Achlioptas2001]
     • Low-rank approximation
       • Nyström method: [Williams2000; Drineas2005]
     • Stochastic approximation
       • Doubly Stochastic Gradients: [Dai2014]
       • Random Fourier Features: [Rahimi2007]
     5 / 24
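As a concrete illustration of the cost above, a minimal NumPy sketch of the kernel ridge estimator in (1): it materializes the n × n Gram matrix and performs a dense solve. The Gaussian kernel, problem sizes, and variable names are illustrative assumptions, not from the slides.

```python
# Minimal sketch of the kernel ridge estimator in (1); the Gaussian kernel
# choice and all concrete sizes are illustrative assumptions.
import numpy as np

def gaussian_gram(X):
    # Phi_ij = K(x_i, x_j) = exp(-||x_i - x_j||^2 / 2): an n x n matrix,
    # which is where the O(n^2) memory and O(n^3) solve costs come from.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / 2.0)

n, d, lam = 500, 5, 1e-2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

Phi = gaussian_gram(X)                                              # O(n^2 d)
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)   # O(n^3)
```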
  4. RFF overview
     • K(x − x′) = ϕ(x)^⊤ ϕ(x′) ≈ z(x)^⊤ z(x′)
     Proposition (implication of Bochner's theorem)
     Every such kernel (continuous, shift-invariant, properly scaled) is the Fourier transform of a probability measure: ∃ p s.t. K(Δ) = ∫_{R^d} exp(−i w^⊤ Δ) dp(w)
     • Draw w_i ~iid p(w), b_i ~iid Uni[0, 2π] for i = 1, ..., D, and set
       z̃(x) = √(2/D) [cos(w_1^⊤ x + b_1), ..., cos(w_D^⊤ x + b_D)]^⊤
     Kernel      K(Δ)                    p(w) (∝)
     Gaussian    e^{−‖Δ‖_2^2 / 2}        e^{−‖w‖_2^2 / 2}
     Laplacian   e^{−‖Δ‖_1}              ∏_d 1 / (π(1 + w_d^2))
     Cauchy      ∏_d 2 / (1 + Δ_d^2)     e^{−‖w‖_1}
     6 / 24
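A short sketch of the feature map z̃(x) above for the Gaussian row of the table (so p(w) = N(0, I)), checking numerically that z(x)^⊤ z(x′) ≈ K(x − x′); the dimensions are arbitrary.

```python
# Sketch of random Fourier features for the Gaussian kernel
# K(x - x') = exp(-||x - x'||^2 / 2); here p(w) = N(0, I).
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 2000                                  # input dim, number of features

W = rng.normal(size=(D, d))                     # w_i ~ p(w) = N(0, I)
b = rng.uniform(0.0, 2 * np.pi, size=D)         # b_i ~ Uni[0, 2*pi]

def z(x):
    # z(x) = sqrt(2 / D) * [cos(w_1^T x + b_1), ..., cos(w_D^T x + b_D)]
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, xp = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - xp) ** 2) / 2.0)    # K(x - x')
approx = z(x) @ z(xp)                           # z(x)^T z(x')
print(exact, approx)                            # close for large D
```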
  5. History of RFF
     Proposed: [Rahimi2007]
     • Generalization: [Rahimi2008]
     • Acceleration for the Gaussian kernel: [Le2013]
     • Theoretical analysis: [Sutherland2015; Sriperumbudur2015]
     • Comparison to the Nyström method: [Yang2012]
     In NIPS 2016:
     • Kernel learning: [Sinha2016]
     • More effective sampling for the Gaussian kernel: [Yu2016]
     7 / 24
  6. Supervised learning with kernel learning
     Example: the usual ℓ2-regularized problem
       min_{f∈F} ∑_{i=1}^n c(f(x_i), y_i) + (λ/2) ‖f‖_2^2   (2)
     • Optimal f (representer theorem): f(·) = ∑_i α_i K(·, x_i)
     Procedure
     1. Solve a kernel alignment problem [Cristianini2001]
        • What is a "good" kernel? One whose Gram matrix is similar to the label correlation matrix yy^⊤.
     2. Sample random features from the learned kernel, then solve the empirical risk minimization.
     9 / 24
  7. Settings
     • (x_i, y_i) ∈ R^d × {−1, 1}: data (i = 1, ..., n)
     • ϕ : R^d × W → [−1, 1]: feature map
     • Q: probability measure (distribution) on W
     • We only consider kernels that are positive semidefinite, continuous, shift-invariant, and properly scaled:
       K_Q(x, x′) = ∫ ϕ(x, w) ϕ(x′, w) dQ(w)   (3)
     10 / 24
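Definition (3) says K_Q(x, x′) is an expectation of feature products under Q. A quick Monte Carlo sketch with an assumed concrete choice w = (ω, b), ϕ(x, w) = cos(ω^⊤ x + b), and Q = N(0, I) × Uni[0, 2π], for which the closed form is ½·exp(−‖x − x′‖²/2):

```python
# Sketch of definition (3): K_Q(x, x') = ∫ phi(x, w) phi(x', w) dQ(w),
# estimated by Monte Carlo. The choices w = (omega, b),
# phi(x, w) = cos(omega^T x + b), Q = N(0, I) x Uni[0, 2*pi] are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 200_000

omega = rng.normal(size=(M, d))                 # omega ~ N(0, I)
b = rng.uniform(0.0, 2 * np.pi, size=M)         # b ~ Uni[0, 2*pi]

x, xp = rng.normal(size=d), rng.normal(size=d)
phi_x = np.cos(omega @ x + b)                   # phi(x, w) in [-1, 1]
phi_xp = np.cos(omega @ xp + b)

mc = np.mean(phi_x * phi_xp)                    # Monte Carlo estimate of K_Q(x, x')
closed = 0.5 * np.exp(-np.sum((x - xp) ** 2) / 2.0)   # closed form for this Q
print(mc, closed)                               # the two should be close
```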
  8. Kernel learning
     Overview of learning the kernel
     • Choose the "best" K_Q over all distributions in some large set P of possible distributions on random features:
       max_{Q∈P} ∑_{i,j} K_Q(x_i, x_j) y_i y_j  (= ⟨K_Q, yy^⊤⟩_F)   (4)
     • Alignment: a cosine similarity between matrices (under the Frobenius inner product)
       A(K_1, K_2) = ⟨K_1, K_2⟩_F / √(⟨K_1, K_1⟩_F ⟨K_2, K_2⟩_F)   (5)
     11 / 24
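A small sketch of the alignment score (5) under the Frobenius inner product, evaluated against the label matrix yy^⊤ as in (4); the Gaussian Gram matrix and sample size are placeholders.

```python
# Sketch of kernel-target alignment (5): cosine similarity of matrices
# under the Frobenius inner product <K1, K2>_F = sum_ij K1_ij * K2_ij.
import numpy as np

def alignment(K1, K2):
    num = np.sum(K1 * K2)                             # <K1, K2>_F
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))  # sqrt(<K1,K1>_F <K2,K2>_F)
    return num / den

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                 # placeholder Gaussian Gram matrix
print(alignment(K, np.outer(y, y)))   # alignment with the label matrix y y^T, cf. (4)
```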
  9. Empirical alignment maximization
     • Focus on sets P of probability measures defined as f-divergence balls around a user-defined base measure P_0: for ρ > 0,
       P := {Q : D_f(Q ‖ P_0) = ∫ f(dQ/dP_0) dP_0 ≤ ρ}   (6)
     • For w_i ~iid P_0, i ∈ [N_w], define the discrete approximation of P:
       P_{N_w} := {q : D_f(q ‖ 1/N_w) ≤ ρ}   (7)
     • Empirical version of the alignment maximization (4):
       max_{q ∈ P_{N_w}} ∑_{i,j} y_i y_j ∑_{m=1}^{N_w} q_m ϕ(x_i, w_m) ϕ(x_j, w_m)   (8)
       (the inner sum over m is the RFF approximation of the kernel)
     • (8): find weights q̂ that describe the underlying dataset well
     12 / 24
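Identity (9) on the next slide rewrites the objective in (8) as q^⊤((Φy) ⊙ (Φy)). A short NumPy sketch of building that vector with cosine features; the Gaussian base measure P_0, the data, and the sizes are illustrative assumptions.

```python
# Sketch of the empirical alignment objective (8): with
# Phi[m, i] = phi(x_i, w_m), the objective equals q^T ((Phi y) ⊙ (Phi y)).
import numpy as np

rng = np.random.default_rng(0)
n, d, Nw = 100, 5, 200

X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# w_m ~ P_0 (a Gaussian base measure here, an illustrative assumption)
W = rng.normal(size=(Nw, d))
b = rng.uniform(0.0, 2 * np.pi, size=Nw)
Phi = np.cos(W @ X.T + b[:, None])          # Phi in R^{Nw x n}, entries in [-1, 1]

v = (Phi @ y) ** 2                          # (Phi y) ⊙ (Phi y), in R^{Nw}
q = np.full(Nw, 1.0 / Nw)                   # uniform weights as a starting point
print(q @ v)                                # value of objective (8) at q
```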
  10. To solve (8)
     • Φ = [ϕ_1 ··· ϕ_n] ∈ R^{N_w × n},  ϕ_i = [ϕ(x_i, w_1) ··· ϕ(x_i, w_{N_w})]^⊤ ∈ R^{N_w}
     Objective and Lagrangian dual function
       ∑_{i,j} y_i y_j ∑_{m=1}^{N_w} q_m ϕ(x_i, w_m) ϕ(x_j, w_m) = q^⊤((Φy) ⊙ (Φy))   (9)
       sup_{q∈Δ} L(q, λ) = sup_{q∈Δ} { q^⊤((Φy) ⊙ (Φy)) − λ(D_f(q ‖ 1/N_w) − ρ) }   (10)
     • min_{λ∈[0,∞]} of (10) is a convex minimization → use the bisection method
     • So we alternately optimize λ and q
     13 / 24
  11. Computation for each λ
     • We only consider the χ²-divergence:
       D_f(P ‖ Q) = ∫ { (dP/dQ)^2 − 1 } dQ   (11)
     • For the χ²-divergence, D_f(q ‖ 1/N_w) = ∑_{m=1}^{N_w} (N_w q_m)^2 / N_w − 1
     • Given λ, solve
       max_{q∈Δ} { q^⊤((Φy) ⊙ (Φy)) − λ (1/N_w) ∑_{m=1}^{N_w} (N_w q_m)^2 }   (12)
     • Solution: q_m = [v_m / (λ N_w) + τ]_+ with v = (Φy) ⊙ (Φy); searching for τ costs O(N_w) time [Duchi2008]
     14 / 24
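A rough end-to-end sketch of the procedure on slides 10 and 11: for fixed λ the inner maximization (12) has the closed form q_m = [v_m / (λ N_w) + τ]_+ with τ chosen so that q sums to one, and λ is then found by bisection so that the χ²-constraint D_f(q ‖ 1/N_w) ≤ ρ is respected. The log-scale bisection bracket, the sort-based O(N_w log N_w) τ search (the slide cites an O(N_w) method from [Duchi2008]), and the synthetic v are all assumptions, not the authors' exact implementation.

```python
# Sketch of solving (8) with a chi^2-divergence ball: alternate a closed-form
# update of q for fixed lambda with a bisection search over lambda.
import numpy as np

def q_given_lambda(v, lam, Nw):
    # Inner step (12): q_m = [v_m / (lam * Nw) + tau]_+ with tau chosen so
    # that q sums to 1 (simplex-projection-style threshold; sort-based here).
    base = v / (lam * Nw)
    srt = np.sort(base)[::-1]
    csum = np.cumsum(srt)
    k = np.arange(1, Nw + 1)
    cond = srt + (1.0 - csum) / k > 0
    rho_idx = np.max(np.nonzero(cond)[0])
    tau = (1.0 - csum[rho_idx]) / (rho_idx + 1)
    return np.maximum(base + tau, 0.0)

def chi2_divergence(q, Nw):
    # D_f(q || 1/Nw) for f(t) = t^2 - 1: (1/Nw) * sum_m (Nw q_m)^2 - 1
    return np.mean((Nw * q) ** 2) - 1.0

def solve_weights(v, rho, Nw, iters=50):
    lo, hi = 1e-8, 1e8            # assumed bisection bracket for lambda
    for _ in range(iters):
        lam = np.sqrt(lo * hi)    # bisect on a log scale
        q = q_given_lambda(v, lam, Nw)
        if chi2_divergence(q, Nw) > rho:
            lo = lam              # constraint violated: more regularization
        else:
            hi = lam              # feasible: try a smaller lambda
    return q_given_lambda(v, hi, Nw)

# v = (Phi y) ⊙ (Phi y) from the previous sketch; synthetic here.
rng = np.random.default_rng(0)
Nw = 200
v = rng.exponential(size=Nw)
q_hat = solve_weights(v, rho=1.0, Nw=Nw)
print(q_hat.sum(), chi2_divergence(q_hat, Nw))   # sums to 1, divergence <= rho
```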
  12. After getting q̂
     • Solve the empirical risk minimization for a loss function c(·, y)
     • Two approaches (sketched below):
     Standard RFF (naive)
     • W_1, ..., W_D ~iid q̂,  ϕ_i^D = [ϕ(x_i, W_1) ··· ϕ(x_i, W_D)]^⊤
       arg min_θ { ∑_{i=1}^n c( (1/√D) θ^⊤ ϕ_i^D, y_i ) + r(θ) }   (13)
     Proposed
     • Efficient for D ≥ nnz(q̂): for the w_1, ..., w_{N_w} ~ P_0 used before,
       arg min_θ { ∑_{i=1}^n c( θ^⊤ diag(q̂)^{1/2} ϕ_i, y_i ) + r(θ) }   (14)
     15 / 24
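A sketch of the two options in (13) and (14), using a squared loss and ridge penalty as illustrative stand-ins for c and r (the slides do not fix these choices); the placeholder q̂ would come from the solver above.

```python
# Sketch of the two ERM options after learning q_hat: (13) resample features
# from q_hat, or (14) reweight the existing features by diag(q_hat)^{1/2}.
# Squared loss + ridge penalty are illustrative stand-ins for c and r.
import numpy as np

rng = np.random.default_rng(0)
n, d, Nw, D, reg = 200, 5, 300, 100, 1e-2

X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

W = rng.normal(size=(Nw, d))                   # w_m ~ P_0, reused from before
b = rng.uniform(0.0, 2 * np.pi, size=Nw)
Phi = np.cos(W @ X.T + b[:, None])             # Phi[m, i] = phi(x_i, w_m)

q_hat = rng.dirichlet(np.ones(Nw))             # placeholder learned weights

def ridge(Z, y, reg):
    # arg min_theta ||Z theta - y||^2 + reg * ||theta||^2
    return np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ y)

# (13) standard RFF: resample D feature indices according to q_hat.
idx = rng.choice(Nw, size=D, p=q_hat)
Z13 = Phi[idx].T / np.sqrt(D)                  # rows: (1/sqrt(D)) phi_i^D
theta13 = ridge(Z13, y, reg)

# (14) proposed: keep all Nw features, scaled by sqrt(q_hat).
Z14 = (np.sqrt(q_hat)[:, None] * Phi).T        # rows: diag(q_hat)^{1/2} phi_i
theta14 = ridge(Z14, y, reg)
```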
  13. Guarantees
     • The details are not shown in this presentation; see the paper.
     (Alignment) consistency
     • As n → ∞ and N_w → ∞ respectively, is the alignment provided by the estimated distribution q̂ nearly optimal? → Yes!!
     Generalization performance
     • Show a risk bound for the estimator in (14).
     • Tools: the consistency result + [Cortes2010]
     16 / 24
  14. Artificial data: good results even with a poor choice of P_0
     • x_i ∼ N(0, I) ∈ R^d,  y_i = sign(‖x_i‖_2 − √d)
     • P_0 corresponds to a Gaussian kernel (ill-suited for this task)
     ∗ The figures are quoted from the authors' poster
     18 / 24
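A tiny sketch of the synthetic setup on this slide; the sample size is arbitrary.

```python
# Synthetic data from the slide: labels depend only on the norm of x,
# which the slide notes a Gaussian base kernel P_0 handles poorly.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))                            # x_i ~ N(0, I)
y = np.sign(np.linalg.norm(X, axis=1) - np.sqrt(d))    # y_i = sign(||x_i||_2 - sqrt(d))
```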
  15. Benchmark datasets: vs. standard RFF
     ∗ The figures and tables are quoted from the authors' poster
     19 / 24
  16. Conclusion
     • Exploit the computational advantages of RFF to develop a fast kernel-learning optimization procedure
     • Show that the optimization procedure is consistent and that the proposed estimator generalizes well
     • Empirical results indicate that we learn new structure and attain competitive results faster than other methods
     Thank you for listening!
     21 / 24
  17. References I
     • Related to random Fourier features:
     [Le2013] Quoc V. Le et al. "Fastfood - Computing Hilbert Space Expansions in loglinear time". In: Proceedings of the 30th International Conference on Machine Learning. 2013.
     [Rahimi2007] Ali Rahimi et al. "Random Features for Large-Scale Kernel Machines". In: Advances in Neural Information Processing Systems 20. 2007.
     [Rahimi2008] Ali Rahimi et al. "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning". In: Advances in Neural Information Processing Systems 21. 2008.
     [Sinha2016] Aman Sinha et al. "Learning Kernels with Random Features". In: Advances in Neural Information Processing Systems 29. 2016.
     [Sriperumbudur2015] Bharath K. Sriperumbudur et al. "Optimal Rates for Random Fourier Features". In: Advances in Neural Information Processing Systems 28. 2015.
     [Sutherland2015] Dougal J. Sutherland et al. "On the Error of Random Fourier Features". In: Uncertainty in Artificial Intelligence. 2015.
     22 / 24
  18. References II
     [Yang2012] Tianbao Yang et al. "Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison". In: Advances in Neural Information Processing Systems 25. 2012.
     [Yu2016] Felix X. Yu et al. "Orthogonal Random Features". In: Advances in Neural Information Processing Systems 29. 2016.
     • Related to fast kernel machines:
     [Achlioptas2001] Dimitris Achlioptas et al. "Sampling Techniques for Kernel Methods". In: Advances in Neural Information Processing Systems 14. 2001.
     [Dai2014] Bo Dai et al. "Scalable Kernel Methods via Doubly Stochastic Gradients". In: Advances in Neural Information Processing Systems 27. 2014, pp. 3041–3049.
     [Drineas2005] Petros Drineas et al. "On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning". In: Journal of Machine Learning Research 6.12 (2005), pp. 2153–2175.
     [Williams2000] Christopher K. I. Williams et al. "Using the Nyström Method to Speed Up Kernel Machines". In: Advances in Neural Information Processing Systems 13. 2000.
     23 / 24
  19. References III
     • Other:
     [Cortes2010] Corinna Cortes et al. "Generalization Bounds for Learning Kernels". In: Proceedings of the 27th International Conference on Machine Learning. 2010.
     [Cristianini2001] Nello Cristianini et al. "On kernel-target alignment". In: Advances in Neural Information Processing Systems 14. 2001.
     [Duchi2008] John Duchi et al. "Efficient projections onto the l1-ball for learning in high dimensions". In: Proceedings of the 25th International Conference on Machine Learning. 2008.
     [Gönen2011] Mehmet Gönen et al. "Multiple Kernel Learning Algorithms". In: Journal of Machine Learning Research 12 (2011), pp. 2211–2268.
     24 / 24