• @ PFN (2014.10 — )
  • Chainer development
  • Application of ML/DL to biological/chemical data
• Ph.D. student @ Univ. of Tokyo (2018.4 — )
  • Theoretical analysis of DL models
    • Statistical learning theory for ResNet-type CNNs
    • Expressive power of graph/invariant NNs
https://sites.google.com/view/kentaoono/
ResNet-type CNNs
Background: FNNs and CNNs are known to achieve minimax-optimal estimation error rates for several function classes (e.g., Hölder).
Problem: Known optimal FNNs/CNNs had unrealistic constraints on sparsity and channel size.
Contribution: We prove that ResNet-type CNNs can achieve the same rates as FNNs without these constraints.
Key technique: Transformation of block-sparse FNNs into ResNet-type CNNs.
We consider a non-parametric regression problem:
$$y = f^\circ(x) + \xi,$$
where $f^\circ$ is an unknown true function on $[0, 1]^D$ and $\xi$ is Gaussian noise.
Goal: Given $N$ i.i.d. samples $(x_1, y_1), \ldots, (x_N, y_N)$, we want to estimate the true function $f^\circ$.
Metric: Estimation error $\mathcal{R}(\hat f) := \mathbb{E}_x\,|\hat f(x) - f^\circ(x)|^2$.
Note: The estimation error is also known as the population risk.
We cannot directly minimize the estimation error (∵ neither the true function $f^\circ$ nor the distribution of $x$ is known).
Estimation error: $\mathcal{R}(\hat f) := \mathbb{E}_x\,|\hat f(x) - f^\circ(x)|^2$
Empirical error: $\hat{\mathcal{R}}(\hat f) := \frac{1}{N} \sum_{n=1}^{N} |\hat f(x_n) - y_n|^2$
Instead, we minimize the empirical error. We can theoretically validate that the minimizer is close to the true function (i.e., $\mathcal{R}(\hat f)$ is small); explained later.
Note: The empirical error is also known as the empirical risk.
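As a sanity check of these definitions (not part of the original slides), here is a minimal NumPy sketch; the true function `f_true`, the noise level `sigma`, and the placeholder estimator are hypothetical choices made only so that the two quantities can be computed side by side.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, sigma = 2, 500, 0.1                              # illustrative dimension, sample size, noise level

def f_true(x):                                         # hypothetical smooth true function f°
    return np.sin(2 * np.pi * x[:, 0]) * np.cos(np.pi * x[:, 1])

X = rng.uniform(0.0, 1.0, size=(N, D))                 # x_n drawn uniformly from [0, 1]^D
y = f_true(X) + sigma * rng.normal(size=N)             # y_n = f°(x_n) + xi_n

def empirical_error(f_hat):
    """hat{R}(f_hat) = (1/N) sum_n |f_hat(x_n) - y_n|^2: computable from the sample alone."""
    return np.mean((f_hat(X) - y) ** 2)

def estimation_error(f_hat, n_mc=100_000):
    """R(f_hat) = E_x |f_hat(x) - f°(x)|^2, approximated by Monte Carlo (needs f°, so simulation only)."""
    X_mc = rng.uniform(0.0, 1.0, size=(n_mc, D))
    return np.mean((f_hat(X_mc) - f_true(X_mc)) ** 2)

f_hat = lambda x: np.zeros(len(x))                     # trivial placeholder estimator
print(empirical_error(f_hat), estimation_error(f_hat))
```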
ERM estimator
We search for the estimator $\hat f$ by empirical risk minimization (ERM):
$$\hat f \in \operatorname*{arg\,inf}_{f \in \mathcal{F}} \hat{\mathcal{R}}(f).$$
In our case, $\mathcal{F}$ = set of functions realizable by FNNs/CNNs with some architecture (e.g., depth, width, kernel size, etc.).
Note: The architecture of the FNNs/CNNs can depend on the sample size $N$.
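To make the definition concrete (not from the slides), here is a toy sketch that approximately performs ERM over a small class of ReLU FNNs by gradient descent on the empirical error; the architecture, optimizer, and iteration count are arbitrary illustrative choices, and, as the next slide notes, such an optimizer only approximates the ERM estimator.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
D, N, sigma = 2, 500, 0.1                                  # illustrative values

f_true = lambda x: torch.sin(2 * math.pi * x[:, 0]) * torch.cos(math.pi * x[:, 1])
X = torch.rand(N, D)                                       # x_n ~ Uniform([0, 1]^D)
y = f_true(X) + sigma * torch.randn(N)                     # y_n = f°(x_n) + xi_n

# Hypothesis class F: ReLU FNNs with a fixed (depth, width) architecture.
model = nn.Sequential(nn.Linear(D, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))

# Gradient descent on the empirical error hat{R}; this only approximates arg inf over F.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((model(X).squeeze(-1) - y) ** 2).mean()        # empirical error
    loss.backward()
    opt.step()
print(f"final empirical error: {loss.item():.4f}")
```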
• The true function $f^\circ$ can lie outside of the hypothesis class $\mathcal{F}$.
• The ERM estimator is not the only estimator we can think of.
• In general, we cannot compute the ERM estimator. Practically, we obtain an estimator via optimization (e.g., SGD).
[Figure: the hypothesis class $\mathcal{F}$, the true function $f^\circ$, the SGD estimator $\hat f$, and the ERM estimator $\hat f$.]
We consider a non-parametric regression problem:
$$y = f^\circ(x) + \xi,$$
where $f^\circ$ is an unknown true function (e.g., in a Hölder, Barron, or Besov class) and $\xi$ is Gaussian noise.
Given $N$ i.i.d. samples, we
- specify a hypothesis class $\mathcal{F}$ by the architecture of CNNs,
- pick an ERM estimator $\hat f$ from $\mathcal{F}$, and
- evaluate the estimation error $\mathcal{R}(\hat f)$.
Theorem (minimax lower bound; see, e.g., [Tsybakov, 08])
$$\inf_{\hat f}\ \sup_{f^\circ}\ \mathbb{E}\,\mathcal{R}(\hat f) \;\geq\; C N^{-\frac{2\beta}{2\beta + D}},$$
where $\hat f$ ranges over all (measurable) estimators and $f^\circ$ over all $D$-variate $\beta$-Hölder functions.
i.e., for any estimator $\hat f$, there exists a true function $f^\circ$ such that $\mathbb{E}\,\mathcal{R}(\hat f) \geq C N^{-\frac{2\beta}{2\beta + D}}$ (if the inf and sup are attained).
There is no estimator which is uniformly good for every Hölder function.
→ Can FNNs/CNNs achieve this theoretical limit?
Theorem (informal; FNN case: [Yarotsky, 17], [Schmidt-Hieber, 17]; CNN case: [Petersen and Voigtlaender, 18])
When the true function $f^\circ$ is $\beta$-Hölder, by choosing the architecture of a ReLU FNN (resp. CNN) appropriately, the ERM estimator $\hat f$ can achieve
$$\mathbb{E}\,\mathcal{R}(\hat f) \;\leq\; C N^{-\frac{2\beta}{2\beta + D}} \cdot \mathrm{polylog}(N),$$
where $C$ is a universal constant.
→ FNNs and CNNs are minimax optimal (up to log factors) for the $\beta$-Hölder case for sufficiently large $N$.
Note: When the true function is in certain classes, we can show the superiority of DNNs: DNNs achieve minimax optimality, while other methods (e.g., shallow nets, kernel methods) cannot (cf. [Imaizumi and Fukumizu, 19], [Suzuki, 19], [Hayakawa and Suzuki, 19]).
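For intuition, here is a quick numerical reading of this rate; the values of $D$ and $\beta$ are illustrative choices, not from the talk. For $D = 2$ and $\beta = 2$,
$$N^{-\frac{2\beta}{2\beta + D}} = N^{-\frac{4}{6}} = N^{-2/3},$$
so increasing the sample size from $N = 10^3$ to $N = 10^6$ shrinks the bound by a factor of $(10^3)^{2/3} = 100$, while for fixed $\beta$ the exponent $\frac{2\beta}{2\beta + D}$ tends to $0$ as the input dimension $D$ grows (the curse of dimensionality).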
Sparsity constraint of known optimal FNNs/CNNs:
$\mathcal{F}$ = {networks in which only $O(N^{\alpha} \log N)$ weights are non-zero} + some conditions (width, depth, etc.), namely
- depth: $O(\log N)$
- width: $O(N^{\alpha})$
- non-zero weights: $O(N^{\alpha} \log N)$, where $\alpha = \frac{D}{2\beta + D}$.
Do we really search ERM estimators over such sparse FNNs/CNNs in practice? → NO!
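A back-of-the-envelope sketch of how restrictive this is (not from the slides): it treats the $O(\cdot)$ orders as exact counts and picks arbitrary values of $N$, $D$, and $\beta$, so only the order of magnitude is meaningful.

```python
import math

# Illustrative values only; constants hidden by the O(.) notation are ignored.
N, D, beta = 10**6, 2, 2.0
alpha = D / (2 * beta + D)              # alpha = D / (2*beta + D) = 1/3 here

depth   = math.log(N)                   # O(log N) layers
width   = N ** alpha                    # O(N^alpha) units per layer
nonzero = N ** alpha * math.log(N)      # O(N^alpha log N) non-zero weights allowed
total   = width ** 2 * depth            # rough count of all weights in such an FNN

print(f"allowed non-zero fraction ~ {nonzero / total:.1%}")   # ~1.0%: almost all weights must be zero
```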
Optimality of ResNet-type CNNs
Theorem (e.g., Hölder case, informal)
When the true function $f^\circ$ is $\beta$-Hölder, there exists a set of ResNet-type CNNs $\mathcal{F}$ such that:
• $\mathcal{F}$ does NOT have sparsity constraints (no combinatorial optimization!), and
• the ERM estimator $\hat f$ over $\mathcal{F}$ achieves the minimax-optimal estimation error rate (up to log factors).
Note: It is not enough to naïvely transform FNNs into CNNs to obtain the desired estimation error, because that leads to too large a model complexity. To solve this problem, we need parameter rescaling techniques. (An illustrative architecture sketch follows below.)
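For concreteness, here is a minimal PyTorch sketch of a ResNet-type 1D CNN in the spirit of the paper's hypothesis class: residual blocks of convolutions with identity connections and no sparsity constraint. The channel count, block depth, kernel size, and readout layer are arbitrary illustrative choices, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A block of 1D convolutions with an identity skip connection."""
    def __init__(self, channels, kernel_size=3, depth=2):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Conv1d(channels, channels, kernel_size, padding="same"), nn.ReLU()]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)            # identity connection; all weights are dense (no sparsity constraint)

class ResNetTypeCNN(nn.Module):
    """Stack of residual blocks followed by a linear readout."""
    def __init__(self, in_dim, channels=16, num_blocks=4):
        super().__init__()
        self.lift = nn.Conv1d(1, channels, kernel_size=1)                        # lift the input to `channels` channels
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.head = nn.Linear(channels * in_dim, 1)                              # linear readout to a scalar output

    def forward(self, x):                  # x: (batch, in_dim)
        h = self.lift(x.unsqueeze(1))      # treat the input as a 1-channel signal of length in_dim
        h = self.blocks(h)
        return self.head(h.flatten(1)).squeeze(-1)

model = ResNetTypeCNN(in_dim=16)
print(model(torch.rand(8, 16)).shape)      # torch.Size([8])
```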
Q: Hölder functions only?
A: Using the same strategy, we can prove a similar statement for the Barron class (or any other function class that can be approximated by block-sparse FNNs).
Q: Any other improvements on CNN architectures?
A: We can remove the unrealistic constraints on channel size, too.
Q: The depth of each residual block is $L = O(\log N)$. This is unrealistically deep.
A: If we allow the identity connections to have scaling schemes, ResNet-type CNNs whose residual blocks have depth $O(1)$ are minimax optimal for the Hölder class (a toy sketch of such a scaled connection follows below).
See the paper for details.
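As a purely illustrative sketch of what a scaled identity connection could look like (the scale value, block depth, and layer choices here are hypothetical and are not the paper's exact scaling scheme):

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block of constant depth whose identity connection is multiplied by a fixed scale."""
    def __init__(self, channels, scale=0.5, kernel_size=3):
        super().__init__()
        self.scale = scale                 # scaling of the identity path (illustrative value)
        self.body = nn.Sequential(         # O(1)-depth convolutional body
            nn.Conv1d(channels, channels, kernel_size, padding="same"), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding="same"),
        )

    def forward(self, x):
        return self.scale * x + self.body(x)

block = ScaledResidualBlock(channels=16)
print(block(torch.rand(8, 16, 32)).shape)  # torch.Size([8, 16, 32])
```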
• Are there settings in which the performance of CNNs can exceed that of FNNs?
• For the Hölder case, the depth of the residual blocks of the optimal ResNet-type CNNs is too large ($O(\log N)$). Can we make it $O(1)$?
• Analysis of more practical CNNs (pooling, stride > 1, etc.).
[Hayakawa and Suzuki, 19] Satoshi Hayakawa and Taiji Suzuki. On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. arXiv preprint arXiv:1905.09195, 2019.
[Imaizumi and Fukumizu, 19] Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. In Proceedings of Machine Learning Research, volume 89, pages 869–878. PMLR, 2019.
[Petersen and Voigtlaender, 18] Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
[Schmidt-Hieber, 17] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. arXiv preprint arXiv:1708.06633, 2017.
[Suzuki, 19] Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019.
[Tsybakov, 08] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519, 9780387790510.
[Yarotsky, 18] Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.