Once the sample and the target variable are fixed, the problem reduces to solving $y_i = c(z_i)$ for all $i$. Choosing $a, b$ so that $x_i = \langle a, z_i \rangle$ and $b_1 < x_1 < b_2 < \cdots < b_n < x_n$ gives the form shown in the lemma. Since $A$ has full rank, the weights $w$ are obtained by solving a linear system. [End of proof]

Lemma. For interleaving real numbers $b_1 < x_1 < b_2 < \cdots < b_n < x_n$, the $n \times n$ matrix $A = [\max\{x_i - b_j, 0\}]_{ij}$ has full rank. Its smallest eigenvalue is $\min_i (x_i - b_i)$.

Proof. By its definition, the matrix $A$ is lower triangular, that is, all entries with $i < j$ vanish. A basic linear algebra fact states that a lower-triangular matrix has full rank if and only if all of the entries on the diagonal are nonzero. Since $x_i > b_i$, we have that $\max\{x_i - b_i, 0\} > 0$. Hence, $A$ is invertible. The second claim follows directly from the fact that a lower-triangular matrix has all its eigenvalues on the main diagonal. This in turn follows from the first fact, since $A - \lambda I$ can have lower rank only if $\lambda$ equals one of the diagonal values.
Proof of Theorem 1. For weight vectors $w, b \in \mathbb{R}^n$ and $a \in \mathbb{R}^d$, consider the function $c : \mathbb{R}^d \to \mathbb{R}$,

\[ c(x) = \sum_{j=1}^{n} w_j \max\{\langle a, x \rangle - b_j, 0\}. \]

It is easy to see that $c$ can be expressed by a depth 2 network with ReLU activations.

Now, fix a sample $S = \{z_1, \dots, z_n\}$ of size $n$ and a target vector $y \in \mathbb{R}^n$. To prove the theorem, we need to find weights $a, b, w$ so that $y_i = c(z_i)$ for all $i \in \{1, \dots, n\}$.

First, choose $a$ and $b$ such that with $x_i = \langle a, z_i \rangle$ we have the interleaving property $b_1 < x_1 < b_2 < \cdots < b_n < x_n$. This is possible since all $z_i$'s are distinct. Next, consider the set of $n$ equations in the $n$ unknowns $w$,

\[ y_i = c(z_i), \quad i \in \{1, \dots, n\}. \]

We have $(c(z_1), \dots, c(z_n))^T = Aw$, where $A = [\max\{x_i - b_j, 0\}]_{ij}$ is the matrix we encountered in the lemma. We chose $a$ and $b$ so that the lemma applies and hence $A$ has full rank. We can now solve the linear system $y = Aw$ to find suitable weights $w$.

While the construction in the previous proof has inevitably high width given that the depth is 2, it is possible to trade width for depth. The construction is as follows. With the notation from the previous proof and assuming w.l.o.g. that $x_1, \dots, x_n \in [0, 1]$, partition the interval $[0, 1]$ into …
5/18
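The construction in the proof of Theorem 1 can be traced on a tiny example. The following is a minimal sketch, not the authors' code: the helper names `fit_relu_net` and `c`, the sample points, and the projection direction `a` are illustrative assumptions, and `a` is assumed to give distinct projections. It picks interleaving offsets `b` and solves the lower-triangular system by forward substitution.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def fit_relu_net(z, y, a):
    """Return (b, w) so that c(z_i) = y_i for the depth-2 ReLU network c.

    Assumption: the direction `a` is supplied by the caller and gives
    distinct projections x_i = <a, z_i>.
    """
    # Sort the samples by their projection so that x_1 < x_2 < ... < x_n.
    pairs = sorted((dot(a, zi), yi) for zi, yi in zip(z, y))
    x = [p[0] for p in pairs]
    t = [p[1] for p in pairs]
    if any(x[i] == x[i + 1] for i in range(len(x) - 1)):
        raise ValueError("projections <a, z_i> must be distinct")
    # Interleaving offsets: b_1 < x_1 < b_2 < x_2 < ... < b_n < x_n.
    b = [x[0] - 1.0] + [(x[i - 1] + x[i]) / 2.0 for i in range(1, len(x))]
    # A[i][j] = max(x_i - b_j, 0) is lower triangular with a positive
    # diagonal, so A w = t is solved by forward substitution.
    w = []
    for i in range(len(x)):
        partial = sum(max(x[i] - b[j], 0.0) * w[j] for j in range(i))
        w.append((t[i] - partial) / (x[i] - b[i]))
    return b, w


def c(point, a, b, w):
    """Evaluate c(x) = sum_j w_j * max(<a, x> - b_j, 0)."""
    h = dot(a, point)
    return sum(wj * max(h - bj, 0.0) for wj, bj in zip(w, b))


# Illustrative data: four distinct points in R^2 with arbitrary targets.
z = [(0.0, 1.0), (2.0, -1.0), (1.0, 3.0), (-1.5, 0.5)]
y = [1.0, -2.0, 0.5, 3.0]
a = (1.0, 0.5)
b, w = fit_relu_net(z, y, a)
fitted = [c(zi, a, b, w) for zi in z]
```

The network uses n hidden units for n samples, which is the "inevitably high width" that the remark after the proof refers to.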
By varying the proportion of random labels, we can observe how generalization behaves.
Results for {InceptionV3, AlexNet, MLP 1×512} on the CIFAR10 dataset.

Figure 1: Fitting random labels and random pixels on CIFAR10. Panels: (a) learning curves, (b) convergence slowdown, (c) generalization error growth. (a) shows the training loss of …
6/18
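The label randomization behind these experiments can be sketched in a few lines. This is an assumption-laden illustration rather than the experimental code: the function name `corrupt_labels`, the rule "replace each label by a uniformly random class with probability p", and the seed handling are all hypothetical.

```python
import random


def corrupt_labels(labels, p, num_classes, seed=0):
    """Replace each label by a uniformly random class with probability p.

    p = 0 leaves the labels untouched; p = 1 gives fully random labels.
    A random replacement may coincide with the original label.
    """
    rng = random.Random(seed)
    corrupted = []
    for label in labels:
        if rng.random() < p:
            corrupted.append(rng.randrange(num_classes))
        else:
            corrupted.append(label)
    return corrupted


# Illustrative labels for a 10-class problem such as CIFAR10.
labels = [i % 10 for i in range(1000)]
clean = corrupt_labels(labels, 0.0, 10)
noisy = corrupt_labels(labels, 1.0, 10)
changed = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
```

With fully random labels roughly nine in ten labels change, which matches the 90% error of random guessing on a 10-class problem.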
Tuning regularization improves the generalization of deep learning (DL).
However, regularization alone is not sufficient as an explanation of DL generalization.

Table 1: The training and test accuracy (in percentage) of various models on the CIFAR10 dataset. Performance with and without data augmentation and weight decay are compared. The results of fitting random labels are also included.

model                     # params    random crop  weight decay  train accuracy  test accuracy
Inception                 1,649,402   yes          yes           100.0           89.05
                                      yes          no            100.0           89.31
                                      no           yes           100.0           86.03
                                      no           no            100.0           85.75
(fitting random labels)               no           no            100.0            9.78
Inception w/o BatchNorm   1,649,402   no           yes           100.0           83.00
                                      no           no            100.0           82.00
(fitting random labels)               no           no            100.0           10.12
Alexnet                   1,387,786   yes          yes            99.90          81.22
                                      yes          no             99.82          79.66
                                      no           yes           100.0           77.36
                                      no           no            100.0           76.07
(fitting random labels)               no           no             99.82           9.86
MLP 3x512                 1,735,178   no           yes           100.0           53.35
                                      no           no            100.0           52.39
(fitting random labels)               no           no            100.0           10.48
MLP 1x512                 1,209,866   no           yes            99.80          50.39
                                      no           no            100.0           50.51
(fitting random labels)               no           no             99.34          10.61

Annotations on the table (CIFAR10 dataset): different global solutions; a certain level of generalization is retained even without regularization.
7/18
• Rademacher complexity: when random labels can be fit, it is ≈ 1 and therefore not useful.
• Uniform stability: it measures a property of the algorithm, so applying it to generalization is not straightforward.
• Lipschitz continuity and robustness: too weak, because of the exponential dependence on the input size n.
• Sharpness: proposed recently; it captures some characteristics but is still insufficient.

… top-1 accuracy even with explicit regularizers turned on.

Partially corrupted labels. We further inspect the behavior of neural network training with a varying level of label corruptions from 0 (no corruption) to 1 (complete random labels) on the CIFAR10 dataset. The networks fit the corrupted training set perfectly for all the cases. Figure 1b shows the slowdown of the convergence time with increasing level of label noise. Figure 1c depicts the test errors after convergence. Since the training errors are always zero, the test errors are the same as generalization errors. As the noise level approaches 1, the generalization errors converge to 90%, the performance of random guessing on CIFAR10.

2.2 IMPLICATIONS

In light of our randomization experiments, we discuss how our findings pose a challenge for several traditional approaches for reasoning about generalization.

Rademacher complexity and VC-dimension. Rademacher complexity is a commonly used and flexible complexity measure of a hypothesis class. The empirical Rademacher complexity of a hypothesis class $\mathcal{H}$ on a dataset $\{x_1, \dots, x_n\}$ is defined as

\[ \hat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_{\sigma}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) \right] \tag{1} \]

where $\sigma_1, \dots, \sigma_n \in \{\pm 1\}$ are i.i.d. uniform random variables. This definition closely resembles our randomization test. Specifically, $\hat{\mathfrak{R}}_n(\mathcal{H})$ measures the ability of $\mathcal{H}$ to fit random $\pm 1$ binary label assignments. While we consider multiclass problems, it is straightforward to consider related binary classification problems for which the same experimental observations hold. Since our randomization tests suggest that many neural networks fit the training set with random labels perfectly, we expect that $\hat{\mathfrak{R}}_n(\mathcal{H}) \approx 1$ for the corresponding model class $\mathcal{H}$.
This is, of course, a trivial upper bound on the Rademacher complexity that does not lead to useful generalization bounds in realistic settings. A similar reasoning applies to VC-dimension and its continuous analog fat-shattering dimension, unless we further restrict the network. While Bartlett (1998) proves a bound on the fat-shattering dimension in terms of $\ell_1$ norm bounds on the weights of the network, this bound does not apply to the ReLU networks that we consider here. This result was generalized to other norms by Neyshabur et al. (2015), but even these do not seem to explain the generalization behavior that we observe.

Uniform stability. Stepping away from complexity measures of the hypothesis class, we can instead consider properties of the algorithm used for training. This is commonly done with some notion of stability, such as uniform stability (Bousquet & Elisseeff, 2002). Uniform stability of an algorithm $A$ measures how sensitive the algorithm is to the replacement of a single example. However, it is solely a property of the algorithm, which does not take into account specifics of the data or the distribution of the labels. It is possible to define weaker notions of stability (Mukherjee et al., 2002; Poggio et al., 2004; Shalev-Shwartz et al., 2010). The weakest stability measure is directly equivalent to bounding generalization error and does take the data into account. However, it has been difficult to utilize this weaker stability notion effectively.
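The empirical Rademacher complexity in equation (1) can be estimated by Monte Carlo for a small finite hypothesis class. The sketch below is only an illustration under that assumption: the function name `empirical_rademacher` and the two toy classes are hypothetical, and a neural network class would need a training run in place of the explicit maximum.

```python
import itertools
import random


def empirical_rademacher(predictions, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_h (1/n) sum_i sigma_i h(x_i) ].

    predictions[h][i] holds h(x_i) in {-1, +1} for each hypothesis h of a
    finite class (an assumption made so the supremum is a plain maximum).
    """
    rng = random.Random(seed)
    n = len(predictions[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(
            sum(s * p for s, p in zip(sigma, preds)) / n
            for preds in predictions
        )
    return total / num_draws


n = 6
# A class that realizes every labeling of the n points fits any sign pattern.
all_patterns = list(itertools.product((-1, 1), repeat=n))
rich = empirical_rademacher(all_patterns)
# A class with a single hypothesis cannot adapt to the random signs.
single = empirical_rademacher([all_patterns[0]])
```

The class that can fit every labeling reaches the trivial value 1, which is the situation the randomization tests suggest for the networks studied here, while the single-hypothesis class stays near 0.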
As an initial empirical investigation of the appropriateness of the different complexity measures, we compared the complexity (under each of the above measures) of models trained on true versus random labels. We would expect to see two phenomena: first, the complexity of models trained on true labels should be substantially lower than those trained on random labels, corresponding to their better generalization ability. Second, when training on random labels, we expect capacity to increase almost linearly with the number of training examples, since every extra example requires new capacity in order to fit its random label. However, when training on true labels we expect the model to capture the true functional dependence between input and output and thus fitting more training examples should only require small increases in the capacity of the network. The results are reported in Figure 1. We indeed observe a gap between the complexity of models learned on real and random labels for all four norms, with the difference in increase in capacity between true and random labels being most pronounced for the $\ell_2$ norm and $\ell_2$-path norm.

In Section 3 we present further empirical investigations of the appropriateness of these complexity measures to explaining other phenomena.

2.3 Lipschitz Continuity and Robustness

The measures/norms we discussed so far also control the Lipschitz constant of the network with respect to its input. Is the capacity control achieved through the bound on the Lipschitz constant? Is bounding the Lipschitz constant alone enough for generalization? To answer these questions, and in order to understand capacity control in terms of Lipschitz continuity more broadly, we review here the relevant guarantees.

Given an input space $X$ and metric $M$, a function $f : X \to \mathbb{R}$ on a metric space $(X, M)$ is called a Lipschitz function if there exists a constant $C_M$ such that $|f(x) - f(y)| \le C_M M(x, y)$. Luxburg and Bousquet [13] studied the capacity of functions with bounded Lipschitz constant on a metric space $(X, M)$ with a finite diameter $\mathrm{diam}_M(X) = \sup_{x, y \in X} M(x, y)$ and showed that the capacity is proportional to $\left( \frac{C_M}{\gamma_{\text{margin}}} \right)^n \mathrm{diam}_M(X)$. This capacity bound is weak as it has an exponential dependence on input size.

Another related approach is through algorithmic robustness as suggested by Xu and Mannor [28]. Given $\epsilon > 0$, the model $f_w$ found by a learning algorithm is $K$ robust if $X$ can be partitioned into $K$ disjoint sets, denoted as $\{C_i\}_{i=1}^{K}$, such that for any pair $(x, y)$ in the training set $s$,

\[ x, z \in C_i \Rightarrow |\ell(w, x) - \ell(w, z)| \le \epsilon \tag{3} \]

Xu and Mannor [28] showed the capacity of a model class whose models are $K$-robust scales as $K$. For the model class of functions with bounded Lipschitz $C_{\|\cdot\|}$, $K$ is proportional to the $\frac{C_{\|\cdot\|}}{\gamma_{\text{margin}}}$-covering number of the input domain $X$ under norm $\|\cdot\|$. However, the covering number of the input domain can be exponential in the input dimension and the capacity can still grow as $\left( \frac{C_{\|\cdot\|}}{\gamma_{\text{margin}}} \right)^n$.

Returning to our original question, the $C_{\ell_\infty}$ and $C_{\ell_2}$ Lipschitz constants of the network can be bounded by $\prod_{i=1}^{d} \|W_i\|_{1,\infty}$ (hence $\ell_1$-path norm) and $\prod_{i=1}^{d} \|W_i\|_2$, respectively [28, 25]. This will result in a very large capacity bound that scales as $\left( \frac{\prod_{i=1}^{d} \|W_i\|_2}{\gamma_{\text{margin}}} \right)^n$, which is exponential in both the input dimension and depth of the network. This shows that simply bounding the Lipschitz constant of the network is not enough to get a reasonable capacity control, and the capacity bounds of the previous Section are not merely a consequence of bounding the Lipschitz constant.

2.4 Sharpness

The notion of sharpness as a generalization measure was recently suggested by Keskar et al. [11] and corresponds to robustness to adversarial perturbations on the parameter space:

\[ \zeta_\alpha(w) = \frac{\max_{|\nu_i| \le \alpha(|w_i| + 1)} \hat{L}(f_{w+\nu}) - \hat{L}(f_w)}{1 + \hat{L}(f_w)} \simeq \max_{|\nu_i| \le \alpha(|w_i| + 1)} \hat{L}(f_{w+\nu}) - \hat{L}(f_w), \tag{4} \]

where the training error $\hat{L}(f_w)$ is generally very small in the case of neural networks in practice, so we can simply drop it from the denominator without a significant change in the sharpness value.

As we will explain below, sharpness defined this way does not capture the generalization behavior. To see this, we first examine whether sharpness can predict the generalization behavior for networks trained on true vs random labels. In the left plot of Figure 2, we plot the sharpness for networks trained on true vs random labels. While sharpness correctly predicts the generalization behavior for bigger networks, for networks of smaller size, those trained on random labels have less sharpness …
8/18
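The sharpness measure in equation (4) can be evaluated exactly on a toy problem. The sketch below relies on an assumption that does not hold for neural networks: the loss is convex, so its maximum over the box $|\nu_i| \le \alpha(|w_i| + 1)$ is attained at a corner and the corners can simply be enumerated. The function name `sharpness` and both quadratic losses are hypothetical.

```python
import itertools


def sharpness(loss, w, alpha):
    """(max over the box of L(w + nu) - L(w)) / (1 + L(w)) for a convex loss.

    The box is |nu_i| <= alpha * (|w_i| + 1). Corner enumeration is exact
    only for convex losses and costs 2**len(w) loss evaluations.
    """
    radius = [alpha * (abs(wi) + 1.0) for wi in w]
    base = loss(w)
    worst = max(
        loss([wi + s * r for wi, s, r in zip(w, signs, radius)])
        for signs in itertools.product((-1.0, 1.0), repeat=len(w))
    )
    return (worst - base) / (1.0 + base)


def flat_loss(w):
    # Wide bowl: small curvature around the minimum at the origin.
    return 0.1 * sum(wi * wi for wi in w)


def sharp_loss(w):
    # Narrow bowl: large curvature around the same minimum.
    return 10.0 * sum(wi * wi for wi in w)


w_star = [0.0, 0.0]
s_flat = sharpness(flat_loss, w_star, alpha=0.1)
s_sharp = sharpness(sharp_loss, w_star, alpha=0.1)
```

Both losses share the same minimizer and the same training loss there, yet the measure differs by a factor of 100, which is the kind of distinction sharpness is meant to draw between minima.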
Solve Empirical Risk Minimization.
When d > n, any labeling can be fit.
Let X be the n×d matrix whose i-th row is $x_i^T$. If X has rank n there are infinitely many solutions, but how do they differ?
For example, judging the quality of a solution by the curvature near the global minima gives the same answer for all of them: the Hessian does not depend on w and is degenerate at every global minimum.

Although deep neural nets remain mysterious for many reasons, we note in this section that it is not necessarily easy to understand the source of generalization for linear models either. Indeed, it is useful to appeal to the simple case of linear models to see if there are parallel insights that can help us better understand neural networks.

Suppose we collect $n$ distinct data points $\{(x_i, y_i)\}$ where $x_i$ are $d$-dimensional feature vectors and $y_i$ are labels. Letting loss denote a nonnegative loss function with $\mathrm{loss}(y, y) = 0$, consider the empirical risk minimization (ERM) problem

\[ \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(w^T x_i, y_i) \tag{2} \]

If $d \ge n$, then we can fit any labeling. But is it then possible to generalize with such a rich model class and no explicit regularization?

Let $X$ denote the $n \times d$ data matrix whose $i$-th row is $x_i^T$. If $X$ has rank $n$, then the system of equations $Xw = y$ has an infinite number of solutions regardless of the right hand side. We can find a global minimum in the ERM problem (2) by simply solving this linear system.

But do all global minima generalize equally well? Is there a way to determine when one global minimum will generalize whereas another will not? One popular way to understand quality of minima is the curvature of the loss function at the solution. But in the linear case, the curvature of all optimal solutions is the same (Choromanska et al., 2015). To see this, note that in the case when $y_i$ is a scalar,

\[ \nabla^2 \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(w^T x_i, y_i) = \frac{1}{n} X^T \mathrm{diag}(\beta) X, \qquad \left( \beta_i := \left. \frac{\partial^2 \mathrm{loss}(z, y_i)}{\partial z^2} \right|_{z = y_i}, \ \forall i \right) \]

A similar formula can be found when $y$ is vector valued. In particular, the Hessian is not a function of the choice of $w$. Moreover, the Hessian is degenerate at all global optimal solutions.

If curvature doesn't distinguish global minima, what does? A promising direction is to consider the workhorse algorithm, stochastic gradient descent (SGD), and inspect which solution SGD converges to. The SGD update takes the form $w_{t+1} = w_t - \eta_t e_t x_{i_t}$, where $\eta_t$ is the step size and $e_t$ is the prediction error loss. If $w_0 = 0$, we must have that the solution has the form $w = \sum_{i=1}^{n} \alpha_i x_i$ …
9/18
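The claim that SGD started at $w_0 = 0$ stays in the span of the data points can be checked numerically. The following is a small sketch under stated assumptions: squared loss, a constant step size, a fixed random seed, and a hand-picked 2×3 data matrix, none of which come from the original experiments.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sgd_least_squares(X, y, step=0.3, iters=2000, seed=0):
    """Plain SGD on the squared loss 0.5 * (w.x_i - y_i)**2, starting at 0."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])                      # w_0 = 0
    for _ in range(iters):
        i = rng.randrange(len(X))              # pick one sample at random
        e = dot(w, X[i]) - y[i]                # prediction error e_t
        w = [wj - step * e * xj for wj, xj in zip(w, X[i])]
    return w


# Underdetermined problem: n = 2 samples, d = 3 features.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, -1.0]
w = sgd_least_squares(X, y)

# null_dir is orthogonal to both rows of X, so it spans the null space of X.
null_dir = [1.0, 1.0, -1.0]
residual = max(abs(dot(w, xi) - yi) for xi, yi in zip(X, y))
off_span = abs(dot(w, null_dir))
```

Every update adds a multiple of some row $x_i$, so the iterate never picks up a component along `null_dir`; among the infinitely many interpolating solutions, SGD therefore selects the one that lies in the row space of X.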
This α has a unique solution and gives good results in several experiments:
• test error of 0.6% on MNIST after a Gabor transform
• test error of 17% on CIFAR10 after preprocessing with a random-weight CNN
The solution w obtained here is known to have the minimum l2-norm.
However, this l2-norm is not sufficient as an indicator of generalization: on MNIST, applying a wavelet transform improves generalization while the l2-norm of the solution becomes larger.

The SGD update takes the form $w_{t+1} = w_t - \eta_t e_t x_{i_t}$, where $\eta_t$ is the step size and $e_t$ is the prediction error loss. If $w_0 = 0$, we must have that the solution has the form $w = \sum_{i=1}^{n} \alpha_i x_i$ for some coefficients $\alpha$. Hence, if we run SGD we have that $w = X^T \alpha$ lies in the span of the data points. If we also perfectly interpolate the labels we have $Xw = y$. Enforcing both of these identities, this reduces to the single equation

\[ X X^T \alpha = y \tag{3} \]

which has a unique solution. Note that this equation only depends on the dot-products between the data points $x_i$. We have thus derived the "kernel trick" (Schölkopf et al., 2001), albeit in a roundabout fashion.

We can therefore perfectly fit any set of labels by forming the Gram matrix (aka the kernel matrix) on the data $K = X X^T$ and solving the linear system $K\alpha = y$ for $\alpha$. This is an $n \times n$ linear system that can be solved on standard workstations whenever $n$ is less than a hundred thousand, as is the case for small benchmarks like CIFAR10 and MNIST.

Quite surprisingly, fitting the training labels exactly yields excellent performance for convex models. On MNIST with no preprocessing, we are able to achieve a test error of 1.2% by simply solving (3). Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory.
Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. By first applying a Gabor wavelet transform to the data and then solving (3), the error on MNIST drops to 0.6%. Surprisingly, adding regularization …
10/18
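Equation (3), $X X^T \alpha = y$, can be worked through by hand on a small example. The sketch below is illustrative only: the 2×3 data matrix, the helper `solve` (plain Gaussian elimination standing in for the LAPACK call mentioned in the text), and the comparison solution `other` are assumptions made for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def solve(K, y):
    """Solve K alpha = y by Gaussian elimination with partial pivoting."""
    n = len(K)
    M = [row[:] + [yi] for row, yi in zip(K, y)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    alpha = [0.0] * n
    for r in reversed(range(n)):                   # back substitution
        tail = sum(M[r][k] * alpha[k] for k in range(r + 1, n))
        alpha[r] = (M[r][n] - tail) / M[r][r]
    return alpha


X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]             # n = 2 samples, d = 3
y = [1.0, -1.0]

K = [[dot(xi, xj) for xj in X] for xi in X]        # Gram matrix K = X X^T
alpha = solve(K, y)
w = [sum(alpha[i] * X[i][j] for i in range(len(X))) for j in range(len(X[0]))]

# Another interpolating solution: add a vector from the null space of X.
other = [wj + vj for wj, vj in zip(w, [1.0, 1.0, -1.0])]
```

Both `w` and `other` fit the labels exactly, but `w = X^T alpha` has the smaller Euclidean norm, in line with the minimum l2-norm property stated on the slide.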
Using this solution together with the fact that […], we obtain […], which shows that […]. This gives […]. [End of proof]

… empirical risk minimization (ERM) problem

\[ \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(w^T x_i, y_i) \tag{2} \]

If $d \ge n$, then we can fit any labeling. But is it then possible to generalize with such a rich model class and no explicit regularization?

Let $X$ denote the $n \times d$ data matrix whose $i$-th row is $x_i^T$. If $X$ has rank $n$, then the system of equations $Xw = y$ has an infinite number of solutions regardless of the right hand side. We can find a global minimum in the ERM problem (2) by simply solving this linear system.

But do all global minima generalize equally well? Is there a way to determine when one global minimum will generalize whereas another will not? One popular way to understand quality …
11/18
Norms cannot be compared directly, so we want to take the scale of the output into account.
Here we consider the margin: the margin is the smallest γ such that roughly a 0.001 to 0.1 fraction of the training data has margin below it.
Measures: • […] • […] • […] • […]
Terms that depend on the input are omitted.
The path norms take node rescaling into account (the incoming and outgoing scalings cancel).

… approaches is meaningless, as they would all go toward infinity. Instead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the "margin" of the predictions in addition to the norms of the parameters. We refer to the margin for a single data point $x$ as the difference between the score of the correct label and the maximum score of other labels, i.e.

\[ f_w(x)[y_{\text{true}}] - \max_{y \ne y_{\text{true}}} f_w(x)[y] \tag{2} \]

In order to measure scale over an entire training set, one simple approach is to consider the "hard margin", which is the minimum margin among all training points. However, this definition is very sensitive to extreme points as well as to the size of the training set. We consider instead a more robust notion that allows a small portion of data points to violate the margin. For a given training set and small value $\epsilon > 0$, we define the margin $\gamma_{\text{margin}}$ as the lowest value of $\gamma$ such that $\lceil \epsilon m \rceil$ data points have margin lower than $\gamma$, where $m$ is the size of the training set. We found empirically that the qualitative and relative nature of our empirical results is almost unaffected by reasonable choices of $\epsilon$ (e.g. between 0.001 and 0.1).

The norm-based measures we investigate in this work and their corresponding capacity bounds are as follows:
• $\ell_2$ norm with capacity proportional to $\frac{1}{\gamma_{\text{margin}}^2} \prod_{i=1}^{d} 4 \|W_i\|_F^2$ [18].

Footnote 1: Node-rescaling can be defined as a sequence of reparametrizations, each of which corresponds to multiplying incoming weights and dividing outgoing weights of a hidden unit by a positive scalar $\alpha$. The resulting network computes the same function as the network before the reparametrization.
Footnote 2: We have dropped the terms that only depend on the norm of the input.
The bounds based on `2 -path norm and spectral norm can be derived directly from the those based on `1 -path norm and `2 norm respectively. Without further conditions on weights, exponential dependence on depth is tight but the 4d dependence might be loose [18]. We will also discuss a rather loose bound on the capacity based on the spectral norm in Section 2.3. 4 size of traning set 10K 20K 30K 40K 50K 1020 1025 1030 true labels random labels size of traning set 10K 20K 30K 40K 50K 1025 1030 1035 size of traning set 10K 20K 30K 40K 50K 100 102 104 size of traning set 10K 20K 30K 40K 50K 105 1010 1015 `2 norm `1 -path norm `2 -path norm spectral norm Figure 1: Comparing different complexity measures on a VGG network trained on subsets of CIFAR10 dataset with true (blue line) or random (red line) labels. We plot norm divided by margin to avoid scal- ing issues (see Section 2), where for each complexity measure, we drop the terms that only depend on depth or number of hidden units; e.g. for `2 -path norm we plot 2 margin P j 2 Q d k=0 [ hk] Q d i =1 W2 i [ j i , j i 1] .We also set the margin over training set S to be 5th-percentile of the margins of the data points in S, i.e. Prc5 {fw( x i)[ y i] maxy 6= yi fw(x)[ y ] | ( x i , y i) 2 S}. In all experiments, the training error of the learned network is zero. The plots indicate that these measures can explain the generalization as the complexity of model learned with random labels is always higher than the one learned with true labels. Furthermore, the gap between the complexity of models learned with true and random labels increases as we increase the size of the training set. • `1 -path norm with capacity proportional to 1 2 margin ⇣P j 2 Q d k=0 [ hk] Q d i =1 2 W i[ j i , j i 1] ⌘2 [4, 18]. • `2 -path norm with capacity proportional to 1 2 margin P j 2 Q d k=0 [ hk] Q d i =1 4 h i W2 i [ j i , j i 1] . • spectral norm with capacity proportional to 1 2 margin Q d i =1 h i kW i k2 2 . 
where $\prod_{k=0}^{d}[h_k]$ is the Cartesian product over the sets $[h_k]$. The above bounds indicate that capacity can be bounded in terms of either the $\ell_2$ norm or the $\ell_1$-path norm independently of the number of parameters. The dependence of the $\ell_2$-path norm bound on the number of hidden units in each layer is unavoidable. However, it is not clear that the dependence on the number of parameters is needed for the bound based on the spectral norm.

[Slide diagram: layers $i-1$, $i$, $i+1$. Slide note: the Cartesian product takes every path through the network into account.]
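The sum over the Cartesian product of index sets in the path-norm bounds does not have to be evaluated by listing every path. The following NumPy sketch assumes a plain fully connected network given as a list `weights`, where `weights[i-1]` plays the role of $W_i$ with shape $(h_i, h_{i-1})$; the function names are hypothetical and are not taken from the paper. It computes $\sum_j \prod_i |W_i[j_i, j_{i-1}]|^p$ once by brute-force enumeration and once by pushing an all-ones vector through the element-wise powers $|W_i|^p$, and it also computes the product of squared spectral norms. The constant factors ($2$, $4 h_i$, $h_i$) and the division by the squared margin are deliberately left out.

```python
import itertools

import numpy as np


def path_norm_bruteforce(weights, p):
    """Sum over all input-to-output paths of prod_i |W_i[j_i, j_{i-1}]|**p.

    A path picks one unit index j_k per layer, i.e. it is one element of
    the Cartesian product of the index sets [h_0], ..., [h_d].
    """
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    total = 0.0
    for path in itertools.product(*(range(h) for h in sizes)):
        prod = 1.0
        for i, w in enumerate(weights):
            prod *= abs(w[path[i + 1], path[i]]) ** p
        total += prod
    return total


def path_norm_fast(weights, p):
    """Same sum, computed as 1^T |W_d|^p ... |W_1|^p 1 (element-wise powers)."""
    v = np.ones(weights[0].shape[1])
    for w in weights:
        v = (np.abs(w) ** p) @ v
    return float(v.sum())


def spectral_norm_product(weights):
    """prod_i ||W_i||_2^2, using the largest singular value of each layer."""
    return float(np.prod([np.linalg.norm(w, 2) ** 2 for w in weights]))
```

With `p=1` the result still has to be squared to match the $\ell_1$-path quantity in the list above; with `p=2` it matches the $\ell_2$-path sum directly. The brute-force version grows exponentially with depth and is only meant as a check on tiny networks.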
[Slide note: the quality of each measure is checked on the results of solving […]; the larger the gap between true and random labels the better, and the $\ell_2$-based measures behave well.]

[…] margin”, which is the minimum margin among all training points. However, this definition is very sensitive to extreme points as well as to the size of the training set. We consider instead a more robust notion that allows a small portion of data points to violate the margin. For a given training set and a small value $\epsilon > 0$, we define the margin $\gamma_{\mathrm{margin}}$ as the lowest value of $\gamma$ such that $\lceil \epsilon m \rceil$ data points have margin lower than $\gamma$, where $m$ is the size of the training set. We found empirically that the qualitative and relative nature of our empirical results is almost unaffected by reasonable choices of $\epsilon$ (e.g. between 0.001 and 0.1).

The norm-based measures we investigate in this work and their corresponding capacity bounds are as follows (see footnote 2):

• $\ell_2$ norm with capacity proportional to $\frac{1}{\gamma_{\mathrm{margin}}^2} \prod_{i=1}^{d} 4 \|W_i\|_F^2$ [18].

Footnote 1: Node-rescaling can be defined as a sequence of reparametrizations, each of which corresponds to multiplying the incoming weights and dividing the outgoing weights of a hidden unit by a positive scalar $\alpha$. The resulting network computes the same function as the network before the reparametrization.

Footnote 2: We have dropped the term that depends only on the norm of the input.

As an initial empirical investigation of the appropriateness of the different complexity measures, we compared the complexity (under each of the above measures) of models trained on true versus random labels. We would expect to see two phenomena: first, the complexity of models trained on true labels should be substantially lower than that of models trained on random labels, corresponding to their better generalization ability.
Second, when training on random labels, we expect capacity to increase almost linearly with the number of training examples, since every extra example requires new capacity in order to fit its random label. However, when training on true labels we expect the model to capture the true functional dependence between input and output, and thus fitting more training examples should require only small increases in the capacity of the network. The results are reported in Figure 1. We indeed observe a gap between the complexity of models learned on real and random labels for all four norms, with the difference in the increase in capacity between true and random labels being most pronounced for the $\ell_2$ norm and the $\ell_2$-path norm.

In Section 3 we present further empirical investigations of the appropriateness of these complexity measures for explaining other phenomena.
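All four measures are divided by the margin, so it may help to see the robust margin written out. The sketch below is one possible reading of the definition with $\lceil \epsilon m \rceil$ violating points, not the authors' code: it assumes a score matrix `scores` of shape (m, number of classes) holding $f_w(x_i)$, integer `labels`, and no ties among the margins. The function names `margins` and `robust_margin` are hypothetical.

```python
import math

import numpy as np


def margins(scores, labels):
    """Per-example margin f(x_i)[y_i] - max_{y != y_i} f(x_i)[y]."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_score = scores[idx, labels]
    others = scores.copy()
    others[idx, labels] = -np.inf  # exclude the true class from the max
    return true_score - others.max(axis=1)


def robust_margin(scores, labels, eps=0.05):
    """Observed margin that has ceil(eps * m) examples strictly below it.

    eps = 0 gives the hard margin (the minimum); eps = 0.05 corresponds
    to the 5th-percentile choice used for Figure 1.
    """
    sorted_margins = np.sort(margins(scores, labels))
    k = math.ceil(eps * len(sorted_margins))
    return float(sorted_margins[min(k, len(sorted_margins) - 1)])
```

A measure such as the $\ell_2$-path norm would then be divided by `robust_margin(...) ** 2` before plotting. If several examples share the same margin value, the count of points strictly below the returned value can differ from $\lceil \epsilon m \rceil$.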
[Slide note: […] the form for that case is obtained. Thanks to the KL term, the bound remains useful in situations where sharpness alone does not work.]

[Figure 2 plot: left panel, sharpness vs. size of training set (10K to 50K) for true and random labels; middle and right panels, expected sharpness vs. KL, with curves labelled 5K, 10K, 30K, 50K. Slide annotation on the plot: region where this works well.]

Figure 2: Sharpness and PAC-Bayes measures on a VGG network trained on subsets of the CIFAR10 dataset with true or random labels. In the left panel, we plot max sharpness, which we calculate as suggested by Keskar […]

where $K = \frac{2\left(\mathrm{KL}(w+\nu \,\|\, P) + \ln\frac{2m}{\delta}\right)}{m-1}$. When the training loss $\mathbb{E}_\nu[\hat{L}(f_{w+\nu})]$ is smaller than $K$, the last term dominates. This is often the case for neural networks with small enough perturbation. One can also get the following weaker bound:

$$\mathbb{E}_\nu[L(f_{w+\nu})] \le \mathbb{E}_\nu[\hat{L}(f_{w+\nu})] + 4\sqrt{\frac{\mathrm{KL}(w+\nu \,\|\, P) + \ln\frac{2m}{\delta}}{m}} \qquad (6)$$

The above inequality clearly holds for $K \ge 1$, and for $K < 1$ it can be derived from Equation (5) by upper bounding the loss in the second term by 1. We can rewrite the above bound as follows:

$$\mathbb{E}_\nu[L(f_{w+\nu})] \le \hat{L}(f_w) + \underbrace{\mathbb{E}_\nu[\hat{L}(f_{w+\nu})] - \hat{L}(f_w)}_{\text{expected sharpness}} + 4\sqrt{\frac{1}{m}\left(\mathrm{KL}(w+\nu \,\|\, P) + \ln\frac{2m}{\delta}\right)} \qquad (7)$$

As we can see, the PAC-Bayes bound depends on two quantities: (i) the expected sharpness and (ii) the Kullback-Leibler (KL) divergence to the “prior” $P$. The bound is valid for any distribution measure $P$, any perturbation distribution $\nu$ and any method of choosing $w$ dependent on the training set. A simple way to instantiate the bound is to set $P$ to be a zero-mean Gaussian distribution with variance $\sigma^2$. Choosing the perturbation $\nu$ to also be a zero-mean spherical Gaussian with variance $\sigma^2$ in every direction yields the following guarantee (w.p. $1-\delta$ over the training set):

$$\mathbb{E}_{\nu \sim \mathcal{N}(0,\sigma)^n}[L(f_{w+\nu})] \le \hat{L}(f_w) + \underbrace{\mathbb{E}_{\nu \sim \mathcal{N}(0,\sigma)^n}[\hat{L}(f_{w+\nu})] - \hat{L}(f_w)}_{\text{expected sharpness}} + 4\sqrt{\frac{1}{m}\Bigg(\underbrace{\frac{\|w\|_2^2}{2\sigma^2}}_{\text{KL}} + \ln\frac{2m}{\delta}\Bigg)} \qquad (8)$$

Another interesting approach is to set the variance of the perturbation to each parameter with respect to the magnitude of the parameter. For example, if $\sigma_i = \alpha |w_i| + \beta$, then the KL term in the above expression changes to $\sum_i \frac{w_i^2}{2\sigma_i^2}$. The above generalization guarantees give a clear way to think about capacity control jointly in terms of both the expected sharpness and the norm and, as we discussed earlier, indicate that sharpness by itself cannot control the capacity without considering the scaling. In the above generalization bound, norms and sharpness interact in a direct way depending on $\sigma$, as increasing the norm by decreasing $\sigma$ causes a decrease in sharpness and vice versa. It is therefore important to find the right balance between the norm and sharpness by choosing $\sigma$ appropriately in order to get a reasonable bound on the capacity.

In our experiments we observe that looking at both these measures jointly indeed makes a better predictor for the generalization error. As discussed earlier, Dziugaite and Roy [7] numerically optimize the overall PAC-Bayes generalization bound over a family of multivariate Gaussian distributions (different choices of perturbations and priors). Since the precise way the sharpness and KL-divergence […]
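The two ingredients of the Gaussian instantiation in Equation (8) are easy to estimate numerically. The sketch below is a minimal illustration under stated assumptions, not the procedure used for the experiments: `train_loss` is a hypothetical callable mapping a flat parameter vector to an empirical loss in $[0, 1]$, the expectation over $\nu$ is approximated by plain Monte Carlo sampling, and the KL term uses the closed form $\|w\|_2^2 / (2\sigma^2)$ from Equation (8).

```python
import math

import numpy as np


def kl_term(w, sigma):
    """KL term of Equation (8): ||w||_2^2 / (2 sigma^2)."""
    w = np.asarray(w, dtype=float)
    return float(w @ w) / (2.0 * sigma ** 2)


def expected_sharpness(train_loss, w, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_nu[L_hat(f_{w+nu})] - L_hat(f_w).

    nu is drawn as a zero-mean spherical Gaussian with standard
    deviation sigma in every direction.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    perturbed = [
        train_loss(w + sigma * rng.standard_normal(w.shape))
        for _ in range(n_samples)
    ]
    return float(np.mean(perturbed)) - float(train_loss(w))


def pac_bayes_rhs(train_loss, w, sigma, m, delta=0.05, n_samples=1000, seed=0):
    """Right-hand side of Equation (8) for a training set of size m."""
    complexity = 4.0 * math.sqrt((kl_term(w, sigma) + math.log(2 * m / delta)) / m)
    sharp = expected_sharpness(train_loss, w, sigma, n_samples, seed)
    return float(train_loss(w)) + sharp + complexity
```

Sweeping `sigma` over a grid makes the trade-off described above visible: a larger `sigma` shrinks `kl_term` but typically increases `expected_sharpness`, and vice versa.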
[Figure 3 plot: left panel, training and test error vs. #random labels (0 to 5K); middle panel, measures ($\ell_2$ norm, spectral norm, path-$\ell_1$ norm, path-$\ell_2$ norm, sharpness) vs. #random labels; right panel, expected sharpness vs. KL, with curves labelled by #random labels (0 to 5K).]

Figure 3: Experiments on global minima with poor generalization. For each experiment, a VGG network is trained on the union of a subset of the CIFAR10 dataset of size 10000 containing samples with true labels and another subset of the CIFAR10 dataset of varying size containing random labels. The learned networks are all global minima for the objective function on the subset with true labels. The left plot shows the training and test errors as a function of the size of the set with random labels. The middle plot shows the change in the different measures as a function of the size of the set with random labels. The right plot shows the relationship between expected sharpness and KL in PAC-Bayes for each of the experiments. Measures are calculated as explained in Figures 1 and 2.

[Plot: the same three panels with #hidden units (8 to 8K) on the x-axis; right-panel curves labelled 32, 128, 512, 2048.]

• Experiment with a fully connected feed-forward network while varying the number of hidden units.
  → The path-$\ell_2$ norm looks good, but what is plotted is […]
  → No measure fully explains the monotonic decrease of the test error.
Understanding deep learning requires rethinking generalization: https://arxiv.org/abs/1611.03530
• A PAC-Bayesian Tutorial with A Dropout Bound: https://arxiv.org/abs/1307.2118
• Rademacher and Gaussian Complexities: Risk Bounds and Structural Results: http://www.jmlr.org/papers/volume3/bartlett02a/bartlett02a.pdf
• Stability and Generalization: http://www.jmlr.org/papers/volume2/bousquet02a/bousquet02a.pdf
• Nearly-tight VC-dimension bounds for piecewise linear neural networks: https://arxiv.org/abs/1703.02930
• On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima: https://arxiv.org/abs/1609.04836

[Slides & Web pages]
• PAC-Bayes Analysis: Background and Applications: http://web.cse.ohio-state.edu/mlss09/mlss09_talks/1.june-MON/JST_pacbayes-JohnShawe.pdf