Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks (ICML 2019)

Kenta Oono

July 21, 2019

Transcript

  1. Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks (ICML

    2019) Kenta Oono1,2 Taiji Suzuki1,3 (1. The University of Tokyo 2. Preferred Networks, Inc. 3. RIKEN AIP) {kenta_oono, taiji}@mist.i.u-tokyo.ac.jp. ICML/ICLR 2019 Reading@DeNA, Shibuya, 21st July, 2019. Paper: http://proceedings.mlr.press/v97/oono19a.html Poster, Video: https://icml.cc/Conferences/2019/ScheduleMultitrack?event=4374
  2. Kenta Oono (@delta2323_) • MSc. in Mathematics (2011.3) • Engineer

    @ PFN (2014.10 – ) • Chainer development • Application of ML/DL to biological/chemical data • Ph.D. student @ Univ. of Tokyo (2018.4 – ) • Theoretical analysis of DL models • Statistical learning theory for ResNet-type CNNs • Expressive power of graph/invariant NNs. https://sites.google.com/view/kentaoono/
  3. Key Takeaway Q. Why do ResNet-type CNNs work well?

    A. A hidden sparse structure promotes good performance.
  4. Summary What is this paper about?: Statistical Learning Theory for

    ResNet-type CNNs. Background: FNNs and CNNs are known to achieve minimax-optimal estimation error rates for several function classes (e.g., Hölder). Problem: Known optimal FNNs/CNNs had unrealistic constraints on sparsity and channel size. Contribution: We prove that ResNet-type CNNs can achieve the same rates as FNNs without these constraints. Key technique: Transformation of block-sparse FNNs into ResNet-type CNNs.
  5. Agenda • Problem Setting • Prior Work • Contribution •

    Main Idea • Conclusion
  6. Problem Setting We consider a non-parametric regression problem: y = f°(x)

    + ξ, where f° is an unknown true function on [0, 1]^D and ξ is Gaussian noise. Goal: Given n i.i.d. samples (x₁, y₁), …, (x_n, y_n) ~ ℒ(x, y), we want to estimate the true function f°. Metric: Estimation error ℛ(f̂) := E_x |f̂(x) − f°(x)|². Note: Estimation error is also known as population risk.
  7. Empirical Risk Minimization We cannot directly minimize the estimation

    error (∵ neither the true function f° nor the distribution of x is known). Instead, we minimize the empirical error ℛ̂(f̂) := (1/n) Σᵢ₌₁ⁿ |f̂(xᵢ) − yᵢ|², as opposed to the estimation error ℛ(f̂) := E_x |f̂(x) − f°(x)|². We can theoretically validate that the minimizer of the empirical error is close to the true function (i.e., ℛ(f̂) is small); explained later. Note: Empirical error is also known as empirical risk.
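To make the two quantities concrete, here is a small numerical sketch (the true function, noise level, and candidate estimator are all invented for illustration): the empirical error computed on i.i.d. samples concentrates, as n grows, around the estimation error plus the irreducible noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

f_true = lambda x: np.sin(2 * np.pi * x)       # hypothetical true function f°
f_hat = lambda x: 0.9 * np.sin(2 * np.pi * x)  # a fixed candidate estimator

def estimation_error(n_mc=200_000):
    """Monte-Carlo estimate of R(f_hat) = E_x |f_hat(x) - f°(x)|^2, x ~ U[0, 1]."""
    x = rng.uniform(0, 1, n_mc)
    return np.mean((f_hat(x) - f_true(x)) ** 2)

def empirical_error(n, sigma=0.1):
    """(1/n) sum_i |f_hat(x_i) - y_i|^2 on n samples y_i = f°(x_i) + xi_i."""
    x = rng.uniform(0, 1, n)
    y = f_true(x) + rng.normal(0, sigma, n)
    return np.mean((f_hat(x) - y) ** 2)

print(estimation_error())        # ~0.005 = (0.1)^2 * E[sin^2]
print(empirical_error(10))       # noisy for small n
print(empirical_error(100_000))  # ~0.015 = estimation error + noise variance
```

The empirical error converges to the estimation error shifted by the noise variance σ² = 0.01; that shift does not depend on f̂, so it does not affect which function ERM selects.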
  8. Hypothesis Class A set of candidate functions ℱ from which

    we search for the estimator f̂. The ERM estimator is f̂ ∈ arginf_{f ∈ ℱ} ℛ̂(f). In our case, ℱ = set of functions realizable by FNNs/CNNs with some architecture (e.g., depth, width, kernel size, etc.). Note: The architecture of the FNNs/CNNs can depend on the sample size n.
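Spelled out in code, ERM over a toy hypothesis class looks as follows (a minimal sketch with an invented class ℱ of linear functions x ↦ a·x indexed by a finite grid of slopes; nothing here is from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data: y = f°(x) + xi, with a hypothetical true function f°(x) = 2x.
n = 500
x = rng.uniform(0, 1, n)
y = 2.0 * x + rng.normal(0, 0.1, n)

# Hypothesis class F: linear functions x -> a*x for slopes a on a grid.
grid = np.linspace(-5, 5, 201)

# ERM: pick the element of F minimizing the empirical error.
emp_risk = [np.mean((a * x - y) ** 2) for a in grid]
a_hat = grid[int(np.argmin(emp_risk))]
print(a_hat)  # recovers the true slope 2.0 (a grid point)
```

In the paper ℱ is instead the set of functions realizable by a fixed CNN architecture, so the infimum ranges over continuous weights and cannot be enumerated; the grid search above is only meant to make the definition of f̂ concrete.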
  9. Note on Estimators • The true function can be outside

    of the hypothesis class ℱ. • The ERM estimator is not the only estimator we can think of. • In general, we cannot compute the ERM estimator; practically, we obtain an estimator via optimization (e.g., SGD). [Figure: the true function f° lying outside the hypothesis class ℱ, which contains both the SGD estimator and the ERM estimator f̂.]
  10. Summary of Problem Setting We consider a non-parametric regression problem: y

    = f°(x) + ξ, where f° is an unknown true function (e.g., in a Hölder, Barron, or Besov class) and ξ is Gaussian noise. Given n i.i.d. samples, we - specify a hypothesis class ℱ by the architecture of CNNs, - pick an ERM estimator f̂ from ℱ, and - evaluate the estimation error ℛ(f̂).
  11. Minimax Lower Bound for Hölder Functions Theorem (Ref. e.g., Tsybakov, 08):

    inf_{f̂} sup_{f°} E ℛ(f̂) ≳ n^(−2β/(2β+D)), where f̂ ranges over all (measurable) estimators and f° over all D-variate β-Hölder functions. That is, for any estimator f̂ there exists a true function f° such that E ℛ(f̂) ≳ n^(−2β/(2β+D)) (if the inf and sup are attained). There is no estimator which is uniformly good for every Hölder function. → Can FNNs/CNNs achieve this theoretical limit?
  12. Minimax Optimality of Neural Networks ICML/ICLR 2019 Reading@DeNA, Sibuya 12

    Theorem (informal) When the true function !° is #-Hölder, by choosing the architecture of ReLU FNN (resp. CNN) appropriately, the ERM estimator $ ! can achieve FNN case: Yarotsky, 17; Schmidt-Hieber, 17 CNN case: Petersen & Voigtlaender, 18 %& ' ( $ ! ≤ *+, -. -./0 log + 4 J FNN and CNN are minimax optimal (up to log factors) for the #-Hölder case for sufficiently large +. Here, * is a universal constant. Note: When the true function is in some classes, we can show superiority of DNNs. That is, DNNs can achieve minimax optimality, while others methods cannot (e.g., shallow nets, kernel methods). (cf. [Imaizumi and Fukumizu, 19], [Suzuki, 19], [Hayakawa and Suzuki, 19]).
  13. Decomposition of Estimation Error The estimation error decomposes into an approximation

    error and a model complexity term. Here f° is the true function, f̂ is the ERM estimator, and f* ∈ arginf_{f ∈ ℱ} ‖f − f°‖ is the best approximator within ℱ. Tradeoff: a large ℱ gives a small approximation error but a large model complexity; a small ℱ gives a large approximation error but a small model complexity. [Figure: f°, f̂, and f* relative to the hypothesis class ℱ.]
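The tradeoff can be simulated with a stand-in family of hypothesis classes (polynomials of increasing degree replace the neural-network classes; the data-generating function and sample sizes are invented): enlarging ℱ drives the training error down, while the estimation error first falls (approximation error shrinks) and then rises as the complexity term takes over.

```python
import numpy as np

rng = np.random.default_rng(2)

f_true = lambda x: np.sin(2 * np.pi * x)  # hypothetical true function f°
n = 30
x_tr = rng.uniform(0, 1, n)
y_tr = f_true(x_tr) + rng.normal(0, 0.2, n)  # noisy training samples
x_te = np.linspace(0, 1, 1000)               # dense grid to estimate E_x |.|^2

results = {}
for deg in (1, 5, 9):  # nested classes F_1 ⊂ F_5 ⊂ F_9
    coef = np.polyfit(x_tr, y_tr, deg)       # least squares = ERM within F_deg
    train = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coef, x_te) - f_true(x_te)) ** 2)
    results[deg] = (train, test)
    print(deg, round(train, 4), round(test, 4))
```

Degree 1 underfits (large approximation error), degree 9 begins to fit the training noise (growing complexity term), and the intermediate class balances the two.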
  14. Prior Work The generic bound is E ℛ(f̂) ≾ inf_{f ∈ ℱ} ‖f − f°‖∞² + Õ(Nℱ/n)

    (Ref. e.g., Schmidt-Hieber, 17), where the first term is the approximation error and the second is the model complexity. Here n is the sample size, ℱ the set of functions realizable by CNNs, f° the true function (e.g., Hölder, Barron, Besov, etc.), Õ(·) the O-notation ignoring logarithmic terms, and Nℱ the number of "parameters" of ℱ.

    CNN type | Parameter size Nℱ     | Minimax optimality | Combinatorial optimization
    General  | # of all weights      | Sub-optimal        | -
    Sparse*  | # of non-zero weights | Optimal            | Needed

    * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  15. Optimal FNN is Sparse The minimax-optimal hypothesis class has the form

    ℱ = {FNNs such that only a prescribed small fraction of the weights are non-zero} + some conditions (width, depth, etc.). [Figure: an FNN of depth O(log n) in which only O(n^(D/(2β+D)) log n) weights are non-zero.] Do we really search for ERM estimators among sparse FNNs in practice? → NO!
  16. Contribution ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.

    Key Observation: Known optimal FNNs have block-sparse structures*.

    CNN type | Parameter size Nℱ     | Minimax optimality | Combinatorial optimization
    General  | # of all weights      | Sub-optimal        | -
    Sparse*  | # of non-zero weights | Optimal            | Needed
    ResNet   | # of all weights      | Optimal            | Not needed

    * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  17. Block-sparse FNNs Known best-approximating FNNs are block-sparse when the true

    function is in the Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19] class. A block-sparse FNN consists of M parallel fully-connected blocks FC_{W₁,b₁}, …, FC_{W_M,b_M} whose outputs are aggregated with weights w₁, …, w_M as FNN(x) := Σ_{m=1}^M w_m · FC_m(x) − b. [Figure: a general sparse FNN vs. a block-sparse FNN.]
  18. Block-sparse FNN to ResNet-type CNN We transform the block-sparse FNN

    FNN(x) := Σ_{m=1}^M w_m · FC_m(x) − b (minimax optimal) into the ResNet-type CNN CNN := FC ∘ (Conv_M + id) ∘ ⋯ ∘ (Conv_1 + id) (minimax optimal, too!). [Figure: each fully-connected block FC_{W_m,b_m} becomes a convolutional residual block Conv_{w_m,b_m} + id, followed by a final fully-connected layer FC_{W,b}.]
  19. Block-sparse FNN to ResNet-type CNN Theorem (informal): For any block-sparse FNN with

    M blocks, there exists a ResNet-type CNN with M residual blocks which has O(M) more parameters and which is identical (as a function) to the FNN. Consequently, we can translate ANY estimation error rate for block-sparse FNNs into one for ResNet-type CNNs.
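The construction can be sketched in a few lines (a hedged illustration: plain dense matrices stand in for the paper's size-1 convolutions, and all shapes and weights are invented). The residual network carries the state (x, acc); block m reads the x-part, computes w_m · FC_m(x), and adds it to the accumulator through its identity connection, so the serial network computes exactly the parallel sum.

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0)
D, M, width = 4, 3, 8  # input dim, number of blocks, block width (illustrative)

# Weights of M independent two-layer ReLU blocks FC_m, plus aggregation w_m, b.
blocks = [(rng.normal(size=(width, D)), rng.normal(size=width),
           rng.normal(size=(width, width)), rng.normal(size=width))
          for _ in range(M)]
w = [rng.normal(size=width) for _ in range(M)]
b = 0.5

def fc_block(x, params):
    W1, b1, W2, b2 = params
    return relu(W2 @ relu(W1 @ x + b1) + b2)

def block_sparse_fnn(x):
    """Parallel form: FNN(x) = sum_m w_m . FC_m(x) - b."""
    return sum(w[m] @ fc_block(x, blocks[m]) for m in range(M)) - b

def resnet_form(x):
    """Serial form: (FC ∘ (block_M + id) ∘ ... ∘ (block_1 + id))(x)."""
    state = np.concatenate([x, [0.0]])   # state = (x, acc), acc starts at 0
    for m in range(M):
        residual = np.zeros_like(state)  # residual-branch output, zero-padded
        residual[-1] = w[m] @ fc_block(state[:D], blocks[m])
        state = state + residual         # identity connection + residual branch
    return state[-1] - b                 # final FC layer reads the accumulator

x = rng.normal(size=D)
print(block_sparse_fnn(x), resnet_form(x))  # identical values
```

Only one extra accumulator channel and O(M) extra parameters are needed here, mirroring the theorem; the paper additionally rescales parameters so that the resulting class keeps a small model complexity.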
  20. Optimality of ResNet-type CNNs Theorem (e.g., Hölder case, informal): When the

    true function f° is β-Hölder, there exists a set of ResNet-type CNNs ℱ such that: • ℱ does NOT have sparsity constraints (→ no combinatorial optimization!), and • the ERM estimator f̂ over ℱ achieves the minimax-optimal estimation error rate (up to log factors). Note: It is not enough to naïvely transform FNNs into CNNs in order to obtain the desired estimation error, because that leads to too large a model complexity. To solve the problem, we need parameter rescaling techniques.
  21. Misc. Comments about the Theorem Q: Is your theory applicable to

    Hölder functions only? A: Using the same strategy, we can prove a similar statement for the Barron class (or any other function class which can be approximated by block-sparse FNNs). Q: Any other improvements on CNN architectures? A: We can remove the unrealistic constraints on channel size, too. Q: The depth of each residual block is L = O(log n); this is unrealistically deep. A: If we allow the identity connections to have scaling schemes, ResNet-type CNNs whose residual blocks have depth O(1) are minimax optimal for the Hölder class. See the paper for details.
  22. Conclusion ResNet-type CNNs can achieve the same rates as FNNs

    for several function classes without implausible constraints. The key technique is the transformation of block-sparse FNNs (minimax optimal) into ResNet-type CNNs (minimax optimal, too!).

    CNN type | Parameter size Nℱ     | Minimax optimality | Discrete optimization
    General  | # of all weights      | Sub-optimal        | -
    Sparse   | # of non-zero weights | Optimal            | Needed
    ResNet   | # of all weights      | Optimal            | Not needed
  23. Future Work • When can the estimation error rates of CNNs

    exceed those of FNNs? • For the Hölder case, the depth of the residual blocks of optimal ResNet-type CNNs is too large (O(log n)); can we make it O(1)? • Analysis of more practical CNNs (pooling, stride > 1, etc.).
  24. Summary What is this paper about?: Statistical Learning Theory for

    ResNet-type CNNs. Background: FNNs and CNNs are known to achieve minimax-optimal estimation error rates for several function classes (e.g., Hölder). Problem: Known optimal FNNs/CNNs had unrealistic constraints on sparsity and channel size. Contribution: We prove that ResNet-type CNNs can achieve the same rates as FNNs without these constraints. Key technique: Transformation of block-sparse FNNs into ResNet-type CNNs.
  25. References

    [Hayakawa and Suzuki, 19] Satoshi Hayakawa and Taiji Suzuki. On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. arXiv preprint arXiv:1905.09195, 2019.
    [Imaizumi and Fukumizu, 19] Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. Proceedings of Machine Learning Research, volume 89, pages 869–878. PMLR, 2019.
    [Klusowski and Barron, 18] Jason M. Klusowski and Andrew R. Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory, 2018.
    [Petersen and Voigtlaender, 18] Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
    [Schmidt-Hieber, 17] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. arXiv preprint arXiv:1708.06633, 2017.
    [Suzuki, 19] Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019.
    [Tsybakov, 08] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 1st edition, 2008. ISBN 9780387790510.
    [Yarotsky, 17] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.