Slide 1

Learning Kernels with Random Features
Aman Sinha, John Duchi (NIPS 2016)

Presenter: Shuhei Kano (The University of Tokyo, Information Science and Technology)
2017/2/11 NIPS & ICDM 2016 Paper Reading Group @ RCO

Slide 2

Overview

Random Fourier Features (RFF) [Rahimi2007]: a fast kernel method
• Approximate the Gram matrix by a Euclidean inner product.
• Sample from the Fourier transform of a shift-invariant kernel (cf. Bochner's theorem):
  $K(x - y) = \phi(x)^\top \phi(y) \approx z(x)^\top z(y)$,
  $\phi(x) : \mathbb{R}^d \to \mathbb{R}^{d'}$, $z(x) : \mathbb{R}^d \to \mathbb{R}^D$  ($d' \le \infty$, $D = 10^2 \sim 10^4$)

Contribution of [Sinha2016] (NIPS 2016)
• Extend the random feature approach to kernel learning
• Show consistency and generalization performance

Slide 3

1 Background
2 Proposed method
3 Experiments

Slide 4

1 Background
2 Proposed method
3 Experiments

Slide 5

Kernel methods are too heavy...

Example
• $\Phi_{ij} = K(x_i, x_j)$, $\Phi \in \mathbb{R}^{n \times n}$
• Computation with $\Phi$ costs $O(n^2) \sim O(n^3)$
• e.g. kernel ridge regression (sketch below):
  $\hat{\theta} = (\Phi^\top \Phi + \lambda I_n)^{-1} \Phi^\top y$  (1)

For large-scale problems
• Subsampling: [Achlioptas2001]
• Low-rank approximation
  • Nyström method: [Williams2000; Drineas2005]
• Stochastic approximation
  • Doubly Stochastic Gradients: [Dai2014]
  • Random Fourier Features: [Rahimi2007]
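
As a rough illustration of the cost above, here is a minimal NumPy sketch of kernel ridge regression following (1); the Gaussian kernel and the toy data are my own choices, not from the slides. Forming $\Phi$ already costs $O(n^2 d)$ time and $O(n^2)$ memory, and the solve is $O(n^3)$.

```python
import numpy as np

def gaussian_gram(X):
    """Gram matrix Phi_ij = exp(-||x_i - x_j||^2 / 2): O(n^2 d) time, O(n^2) memory."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists)

def kernel_ridge(X, y, lam):
    """Solve (1): theta = (Phi^T Phi + lam I_n)^{-1} Phi^T y; the solve alone is O(n^3)."""
    Phi = gaussian_gram(X)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(X)), Phi.T @ y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 5)), rng.standard_normal(200)
theta = kernel_ridge(X, y, lam=0.1)  # predict at x via sum_i theta_i * K(x, x_i)
```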

Slide 6

RFF overview
• $K(x - x') = \phi(x)^\top \phi(x') \approx z(x)^\top z(x')$

Proposition (implication of Bochner's theorem)
Some kernels are the Fourier transform of a probability measure:
$\exists p \ \text{s.t.} \ K(\Delta) = \int_{\mathbb{R}^d} \exp(-i w^\top \Delta)\, dp(w)$

• $w_i \overset{iid}{\sim} p(w)$, $b_i \overset{iid}{\sim} \mathrm{Uni}[0, 2\pi]$  $(i = 1, \ldots, D)$
  $\tilde{z}(x) = \sqrt{\tfrac{2}{D}} \left[\cos(w_1^\top x + b_1), \ldots, \cos(w_D^\top x + b_D)\right]^\top$

Kernel      $K(\Delta)$                          $p(w)$ ($\propto$)
Gaussian    $e^{-\|\Delta\|_2^2 / 2}$            $e^{-\|w\|_2^2 / 2}$
Laplacian   $e^{-\|\Delta\|_1}$                  $\prod_d \frac{1}{\pi(1 + w_d^2)}$
Cauchy      $\prod_d \frac{2}{1 + \Delta_d^2}$   $e^{-\|w\|_1}$
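
A minimal NumPy sketch of the Gaussian-kernel row from the table above (my own illustration, not the authors' code): sample $w_i \sim N(0, I)$ and $b_i \sim \mathrm{Uni}[0, 2\pi]$, build $\tilde{z}(x)$, and check that $z(x)^\top z(x')$ approximates the exact kernel.

```python
import numpy as np

def rff_features(X, D, rng):
    """Random Fourier features z(x) for the Gaussian kernel K(Delta) = exp(-||Delta||^2 / 2)."""
    n, d = X.shape
    W = rng.standard_normal((D, d))     # w_i ~ N(0, I): the Fourier transform of this kernel
    b = rng.uniform(0.0, 2 * np.pi, D)  # b_i ~ Uni[0, 2*pi]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)  # shape (n, D)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = rff_features(X, D=2000, rng=rng)
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(Z @ Z.T - K_exact).max())  # approximation error shrinks as D grows
```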

Slide 7

History of RFF

Proposed: [Rahimi2007]
• Generalization: [Rahimi2008]
• Acceleration for the Gaussian kernel: [Le2013]
• Theoretical analysis: [Sutherland2015; Sriperumbudur2015]
• Comparison to the Nyström method: [Yang2012]

In NIPS 2016:
• Kernel learning: [Sinha2016]
• More effective sampling for the Gaussian kernel: [Yu2016]

Slide 8

1 Background
2 Proposed method
3 Experiments

Slide 9

Supervised learning with kernel learning

Example: the usual $\ell_2$ regularization
$\min_{f \in \mathcal{F}} \sum_{i=1}^{n} c(f(x_i), y_i) + \frac{\lambda}{2} \|f\|_2^2$  (2)
• optimal f (representer theorem): $f(\cdot) = \sum_i \alpha_i K(\cdot, x_i)$

Procedure
1 Solve kernel alignment [Cristianini2001]
  • What is a "good" kernel? One whose Gram matrix is similar to the correlation matrix of the labels.
2 Sample from the learned kernel, then solve empirical risk minimization
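
To make the connection to the Gram matrix explicit (a one-line expansion of my own, not on the slide, reading $\|f\|_2$ as the RKHS norm): substituting the representer-theorem form into (2) gives a finite-dimensional problem in $\alpha$,

$$\min_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} c\big((\Phi \alpha)_i,\, y_i\big) + \frac{\lambda}{2}\, \alpha^\top \Phi \alpha, \qquad \Phi_{ij} = K(x_i, x_j),$$

which is exactly the $O(n^2) \sim O(n^3)$ computation with $\Phi$ that motivates random features.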

Slide 10

Settings
• $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$: data $(i = 1, \ldots, n)$
• $\phi : \mathbb{R}^d \times W \to [-1, 1]$: feature map
• $Q$: probability measure (distribution) on $W$
• We only consider kernels of the following form: positive semidefinite, continuous, shift-invariant, and properly scaled
  $K_Q(x, x') = \int \phi(x, w)\, \phi(x', w)\, dQ(w)$  (3)

Slide 11

Kernel Learning

Overview of learning the kernel
• Choose the "best" $K_Q$ over all distributions in some large set $P$ of possible distributions on random features:
  $\max_{Q \in P} \sum_{i,j} K_Q(x_i, x_j)\, y_i y_j \;\; (= \langle K_Q, yy^\top \rangle_F)$  (4)
• Alignment: like a cosine similarity between matrices (uses the Frobenius inner product); a small sketch of (5) follows below.
  $A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}}$  (5)
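
A small sketch of the alignment score (5) and of how objective (4) relates to the label matrix $yy^\top$; the toy Gram matrix below is made up purely for illustration.

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment A(K1, K2): cosine similarity under the Frobenius inner product."""
    num = np.sum(K1 * K2)                             # <K1, K2>_F
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=10)
K = np.outer(y, y) + 0.1 * rng.standard_normal((10, 10))  # toy Gram matrix near yy^T
K = (K + K.T) / 2                                         # symmetrize
print(alignment(K, np.outer(y, y)))                       # close to 1 for a well-aligned kernel
```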

Slide 12

Empirical alignment maximization
• Focus on sets $P$ defined as balls of probability measures: for a user-defined $P_0$ and f-divergence,
  $P := \left\{ Q : D_f(Q \| P_0) = \int f\!\left(\frac{dQ}{dP_0}\right) dP_0 \le \rho \right\}, \quad \rho > 0.$  (6)
• For $w_i \overset{iid}{\sim} P_0$, $i \in [N_w]$, define the discrete approximation of $P$:
  $P_{N_w} := \{ q : D_f(q \| \mathbf{1}/N_w) \le \rho \}$  (7)
• Empirical version of the alignment maximization (4), with the RFF feature map inside the inner sum:
  $\max_{q \in P_{N_w}} \sum_{i,j} y_i y_j \sum_{m=1}^{N_w} q_m\, \phi(x_i, w_m)\, \phi(x_j, w_m)$  (8)
• (8): find weights $\hat{q}$ that describe the underlying dataset well

Slide 13

To solve (8)
• $\Phi = [\phi_1 \cdots \phi_n] \in \mathbb{R}^{N_w \times n}$, $\phi_i = [\phi(x_i, w_1) \cdots \phi(x_i, w_{N_w})]^\top \in \mathbb{R}^{N_w}$

Objective and Lagrangian dual function
$\sum_{i,j} y_i y_j \sum_{m=1}^{N_w} q_m\, \phi(x_i, w_m)\, \phi(x_j, w_m) = q^\top \big((\Phi y) \odot (\Phi y)\big)$  (9)
$\sup_{q \in \Delta} L(q, \lambda) = \sup_{q \in \Delta} \left\{ q^\top \big((\Phi y) \odot (\Phi y)\big) - \lambda \big(D_f(q \| \mathbf{1}/N_w) - \rho\big) \right\}$  (10)

• $\min_{\lambda \in [0, \infty]}$ of (10) is a convex minimization → use the bisection method
• So we alternately optimize λ and q (a numerical check of (9) follows below)
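
A quick numerical check of identity (9) on toy data (the matrix Φ here is a made-up stand-in, not the authors' features): the alignment objective is linear in q with coefficient vector $v = (\Phi y) \odot (\Phi y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, Nw = 8, 50
Phi = np.cos(rng.standard_normal((Nw, n)))  # toy stand-in for [phi_1 ... phi_n]
y = rng.choice([-1.0, 1.0], size=n)
q = rng.dirichlet(np.ones(Nw))              # an arbitrary point of the simplex

v = (Phi @ y) ** 2                          # (Phi y) ⊙ (Phi y)
lhs = sum(y[i] * y[j] * np.sum(q * Phi[:, i] * Phi[:, j])
          for i in range(n) for j in range(n))
print(np.allclose(lhs, q @ v))              # True: both sides of (9) agree
```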

Slide 14

Computation for each λ
• We only consider the χ²-divergence:
  $D_f(P \| Q) = \int \left\{ \left(\frac{dP}{dQ}\right)^2 - 1 \right\} dQ$  (11)
• For the χ²-divergence, $D_f(q \| \mathbf{1}/N_w) \approx \sum_{m=1}^{N_w} (N_w q_m)^2 / N_w - 1$

Given λ, solve
$\max_{q \in \Delta} \left\{ q^\top \big((\Phi y) \odot (\Phi y)\big) - \lambda \frac{1}{N_w} \sum_{m=1}^{N_w} (N_w q_m)^2 \right\}$  (12)

• $q_m = [\, v_m / (\lambda N_w) + \tau \,]_+$ with $v = (\Phi y) \odot (\Phi y)$; searching for τ costs $O(N_w)$ time [Duchi2008] (sketch below)
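
A minimal sketch of the inner update (12) (my own, not the authors' code): completing the square shows that, for fixed λ, the maximizer is the Euclidean projection of $v / (2\lambda N_w)$ onto the simplex, which yields the thresholded form above up to how λ is scaled. The sort-based projection below runs in $O(N_w \log N_w)$; [Duchi2008] gives an $O(N_w)$ variant.

```python
import numpy as np

def project_to_simplex(u):
    """Euclidean projection of u onto the probability simplex (sort-based version)."""
    s = np.sort(u)[::-1]                # sorted in decreasing order
    css = np.cumsum(s)
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(s + (1.0 - css) / j > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)  # shift so that the result sums to 1
    return np.maximum(u + tau, 0.0)

def solve_q_given_lambda(v, lam, Nw):
    """Maximize q.v - (lam/Nw) * sum((Nw*q_m)^2) over the simplex, i.e. problem (12)."""
    return project_to_simplex(v / (2.0 * lam * Nw))
```

In the full procedure this q-update alternates with the bisection search over λ from the previous slide until the divergence constraint $D_f(q \| \mathbf{1}/N_w) \le \rho$ is respected.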

Slide 15

After getting $\hat{q}$
• Solve empirical risk minimization with a loss function c(x, y)
• Two approaches (see the sketch below)

Standard RFF
• naive: $W_1, \ldots, W_D \overset{iid}{\sim} \hat{q}$, $\phi_D^i = [\phi(x_i, W_1) \cdots \phi(x_i, W_D)]^\top$
  $\arg\min_\theta \left\{ \sum_{i=1}^{n} c\!\left(\tfrac{1}{\sqrt{D}}\, \theta^\top \phi_D^i,\; y_i\right) + r(\theta) \right\}$  (13)

Proposed
• efficient for $D \ge \mathrm{nnz}(\hat{q})$: for the $w_1, \ldots, w_{N_w} \sim P_0$ used before,
  $\arg\min_\theta \left\{ \sum_{i=1}^{n} c\big(\theta^\top \mathrm{diag}(\hat{q})^{1/2} \phi_i,\; y_i\big) + r(\theta) \right\}$  (14)
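
A rough sketch of the two feature constructions (assumed shapes, not the authors' code): Phi is the $(N_w \times n)$ matrix of $\phi(x_i, w_m)$ for the $w_m \sim P_0$ drawn earlier, and q_hat is the learned weight vector. Either output matrix is then fed to a standard linear learner in place of the raw inputs.

```python
import numpy as np

def features_resampled(Phi, q_hat, D, rng):
    """(13): resample D features with probabilities q_hat, then scale by 1/sqrt(D)."""
    idx = rng.choice(len(q_hat), size=D, p=q_hat)
    return Phi[idx, :].T / np.sqrt(D)                        # shape (n, D)

def features_reweighted(Phi, q_hat):
    """(14): keep all Nw features, reweighted by diag(q_hat)^{1/2};
    only the nnz(q_hat) features with nonzero weight actually survive."""
    keep = q_hat > 0
    return (np.sqrt(q_hat[keep])[:, None] * Phi[keep, :]).T  # shape (n, nnz(q_hat))
```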

Slide 16

Guarantees
• The details are not shown in this presentation; see the paper.

(Alignment) consistency
• As n → ∞ and $N_w$ → ∞ respectively, is the alignment attained by the estimated distribution $\hat{q}$ nearly optimal? → Yes!!

Generalization performance
• A risk bound for the estimator in (14) is shown.
• Tools: the consistency result + [Cortes2010]

Slide 17

1 Background
2 Proposed method
3 Experiments

Slide 18

Artificial data: good even with a poor choice of $P_0$
• $x_i \sim N(0, I) \in \mathbb{R}^d$, $y_i = \mathrm{sign}(\|x_i\|_2 - \sqrt{d})$
• $P_0$ corresponds to the Gaussian kernel (ill-suited for this task)

*The figures are quoted from the author's poster
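
The synthetic task above is easy to reproduce; a short sketch (the dimension and sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))                      # x_i ~ N(0, I)
y = np.sign(np.linalg.norm(X, axis=1) - np.sqrt(d))  # label: inside vs outside the sqrt(d)-sphere
print(np.mean(y == 1))                               # classes are roughly balanced
```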

Slide 19

Benchmark datasets: vs standard RFF

*The figures and tables are quoted from the author's poster

Slide 20

Benchmark datasets: scalability
• vs joint optimization [Gönen2011] (n = 5000)

*The tables are quoted from the author's poster

Slide 21

Conclusion
• Exploit the computational advantages of RFF to develop a fast kernel-learning optimization procedure
• Show that the optimization procedure is consistent and that the proposed estimator generalizes well
• Empirical results indicate we learn new structure, and we attain competitive results faster than other methods

Thank you for listening!

Slide 22

Reference I

• Related to random Fourier features:
[Le2013] Quoc V. Le et al. "Fastfood - Computing Hilbert Space Expansions in loglinear time". In: Proceedings of the 30th International Conference on Machine Learning. 2013.
[Rahimi2007] Ali Rahimi et al. "Random Features for Large-Scale Kernel Machines". In: Advances in Neural Information Processing Systems 20. 2007.
[Rahimi2008] Ali Rahimi et al. "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning". In: Advances in Neural Information Processing Systems 21. 2008.
[Sinha2016] Aman Sinha et al. "Learning Kernels with Random Features". In: Advances in Neural Information Processing Systems 29. 2016.
[Sriperumbudur2015] Bharath K. Sriperumbudur et al. "Optimal Rates for Random Fourier Features". In: Advances in Neural Information Processing Systems 28. 2015.
[Sutherland2015] Dougal J. Sutherland et al. "On the Error of Random Fourier Features". In: Uncertainty in Artificial Intelligence. 2015.

Slide 23

Reference II

[Yang2012] Tianbao Yang et al. "Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison". In: Advances in Neural Information Processing Systems 25. 2012.
[Yu2016] Felix X. Yu et al. "Orthogonal Random Features". In: Advances in Neural Information Processing Systems 29. 2016.

• Related to fast kernel machines:
[Achlioptas2001] Dimitris Achlioptas et al. "Sampling Techniques for Kernel Methods". In: Advances in Neural Information Processing Systems 14. 2001.
[Dai2014] Bo Dai et al. "Scalable Kernel Methods via Doubly Stochastic Gradients". In: Advances in Neural Information Processing Systems 27. 2014, pp. 3041-3049.
[Drineas2005] Petros Drineas et al. "On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning". In: Journal of Machine Learning Research 6.12 (2005), pp. 2153-2175.
[Williams2000] Christopher K. I. Williams et al. "Using the Nyström Method to Speed Up Kernel Machines". In: Advances in Neural Information Processing Systems 13. 2000.

Slide 24

Reference III

• Other:
[Cortes2010] Corinna Cortes et al. "Generalization Bounds for Learning Kernels". In: Proceedings of the 27th International Conference on Machine Learning. 2010.
[Cristianini2001] Nello Cristianini et al. "On kernel-target alignment". In: Advances in Neural Information Processing Systems 14. 2001.
[Duchi2008] John Duchi et al. "Efficient projections onto the l1-ball for learning in high dimensions". In: Proceedings of the 25th International Conference on Machine Learning. 2008.
[Gönen2011] Mehmet Gönen et al. "Multiple Kernel Learning Algorithms". In: Journal of Machine Learning Research 12 (2011), pp. 2211-2268.