Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks (ICML 2019)

Kenta Oono

July 21, 2019

Transcript

  1. Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks (ICML

    2019) Kenta Oono1,2 Taiji Suzuki1,3 (1. The University of Tokyo 2. Preferred Networks, Inc. 3. RIKEN AIP) {kenta_oono, taiji}@mist.i.u-tokyo.ac.jp. ICML/ICLR 2019 Reading@DeNA, Shibuya, 21st July, 2019. Paper: http://proceedings.mlr.press/v97/oono19a.html Poster, Video: https://icml.cc/Conferences/2019/ScheduleMultitrack?event=4374
  2. Kenta Oono (@delta2323_) • MSc. in Mathematics (2011.3) • Engineer

    @ PFN (2014.10 – ) • Chainer development • Application of ML/DL to biological/chemical data • Ph.D. student @ Univ. of Tokyo (2018.4 – ) • Theoretical analysis of DL models • Statistical learning theory for ResNet-type CNNs • Expressive power of graph/invariant NNs. https://sites.google.com/view/kentaoono/
  3. Key Takeaway Q. Why do ResNet-type CNNs work well?

    A. A hidden sparse structure promotes good performance.
  4. Summary What is this paper about?: Statistical Learning Theory for

    ResNet-type CNNs. Background: FNNs and CNNs are known to achieve minimax-optimal estimation error rates for several function classes (e.g., Hölder). Problem: Known optimal FNNs/CNNs had unrealistic constraints on sparsity and channel size. Contribution: We prove that ResNet-type CNNs can achieve the same rates as FNNs without these constraints. Key technique: Transformation of block-sparse FNNs into ResNet-type CNNs.
  5. Agenda • Problem Setting • Prior Work • Contribution •

    Main Idea • Conclusion
  6. Problem Setting We consider a non-parametric regression problem: y = f°(x)

    + ξ, where f° is an unknown true function on [0, 1]^D and ξ is Gaussian noise. Goal: Given n i.i.d. samples (x₁, y₁), …, (x_n, y_n) ~ ℒ(x, y), we want to estimate the true function f°. Metric: Estimation error ℛ(f̂) := E_x |f̂(x) − f°(x)|². Note: Estimation error is also known as population risk.
  7. Empirical Risk Minimization We cannot directly minimize the estimation

    error (∵ neither the true function f° nor the distribution of x is known). Instead, we minimize the empirical error ℛ̂(f̂) := (1/n) Σᵢ₌₁ⁿ |f̂(xᵢ) − yᵢ|², as opposed to the estimation error ℛ(f̂) := E_x |f̂(x) − f°(x)|². We can theoretically validate that the minimizer of the empirical error is close to the true function (i.e., ℛ(f̂) is small); explained later. Note: Empirical error is also known as empirical risk.
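To make the two quantities concrete, here is a small numerical sketch (the true function, noise level, and candidate estimator are all invented for illustration): the empirical error computed on i.i.d. samples concentrates, as n grows, around the estimation error plus the irreducible noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

f_true = lambda x: np.sin(2 * np.pi * x)       # hypothetical true function f°
f_hat = lambda x: 0.9 * np.sin(2 * np.pi * x)  # a fixed candidate estimator

def estimation_error(n_mc=200_000):
    """Monte-Carlo estimate of R(f_hat) = E_x |f_hat(x) - f°(x)|^2, x ~ U[0, 1]."""
    x = rng.uniform(0, 1, n_mc)
    return np.mean((f_hat(x) - f_true(x)) ** 2)

def empirical_error(n, sigma=0.1):
    """(1/n) sum_i |f_hat(x_i) - y_i|^2 on n samples y_i = f°(x_i) + xi_i."""
    x = rng.uniform(0, 1, n)
    y = f_true(x) + rng.normal(0, sigma, n)
    return np.mean((f_hat(x) - y) ** 2)

print(estimation_error())        # ~0.005 = (0.1)^2 * E[sin^2]
print(empirical_error(10))       # noisy for small n
print(empirical_error(100_000))  # ~0.015 = estimation error + noise variance
```

The empirical error converges to the estimation error shifted by the noise variance σ² = 0.01; that shift does not depend on f̂, so it does not affect which function ERM selects.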
  8. Hypothesis Class A set of candidate functions ℱ from which

    we search for the estimator f̂. The ERM estimator is f̂ ∈ arginf_{f ∈ ℱ} ℛ̂(f). In our case, ℱ = set of functions realizable by FNNs/CNNs with some architecture (e.g., depth, width, kernel size, etc.). Note: The architecture of the FNNs/CNNs can depend on the sample size n.
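Spelled out in code, ERM over a toy hypothesis class looks as follows (a minimal sketch with an invented class ℱ of linear functions x ↦ a·x indexed by a finite grid of slopes; nothing here is from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data: y = f°(x) + xi, with a hypothetical true function f°(x) = 2x.
n = 500
x = rng.uniform(0, 1, n)
y = 2.0 * x + rng.normal(0, 0.1, n)

# Hypothesis class F: linear functions x -> a*x for slopes a on a grid.
grid = np.linspace(-5, 5, 201)

# ERM: pick the element of F minimizing the empirical error.
emp_risk = [np.mean((a * x - y) ** 2) for a in grid]
a_hat = grid[int(np.argmin(emp_risk))]
print(a_hat)  # recovers the true slope 2.0 (a grid point)
```

In the paper ℱ is instead the set of functions realizable by a fixed CNN architecture, so the infimum ranges over continuous weights and cannot be enumerated; the grid search above is only meant to make the definition of f̂ concrete.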
  9. Note on Estimators • The true function can be outside

    of the hypothesis class ℱ. • The ERM estimator is not the only estimator we can think of. • In general, we cannot compute the ERM estimator; practically, we obtain an estimator via optimization (e.g., SGD). [Figure: the true function f° lying outside the hypothesis class ℱ, which contains both the SGD estimator and the ERM estimator f̂.]
  10. Summary of Problem Setting We consider a non-parametric regression problem: y

    = f°(x) + ξ, where f° is an unknown true function (e.g., in a Hölder, Barron, or Besov class) and ξ is Gaussian noise. Given n i.i.d. samples, we - specify a hypothesis class ℱ by the architecture of CNNs, - pick an ERM estimator f̂ from ℱ, and - evaluate the estimation error ℛ(f̂).
  11. Minimax Lower Bound for Hölder Functions Theorem (Ref. e.g., Tsybakov, 08):

    inf_{f̂} sup_{f°} E ℛ(f̂) ≳ n^(−2β/(2β+D)), where f̂ ranges over all (measurable) estimators and f° over all D-variate β-Hölder functions. That is, for any estimator f̂ there exists a true function f° such that E ℛ(f̂) ≳ n^(−2β/(2β+D)) (if the inf and sup are attained). There is no estimator which is uniformly good for every Hölder function. → Can FNNs/CNNs achieve this theoretical limit?
  12. Minimax Optimality of Neural Networks ICML/ICLR 2019 Reading@DeNA, Sibuya 12

    Theorem (informal) When the true function !° is #-Hölder, by choosing the architecture of ReLU FNN (resp. CNN) appropriately, the ERM estimator $ ! can achieve FNN case: Yarotsky, 17; Schmidt-Hieber, 17 CNN case: Petersen & Voigtlaender, 18 %& ' ( $ ! ≤ *+, -. -./0 log + 4 J FNN and CNN are minimax optimal (up to log factors) for the #-Hölder case for sufficiently large +. Here, * is a universal constant. Note: When the true function is in some classes, we can show superiority of DNNs. That is, DNNs can achieve minimax optimality, while others methods cannot (e.g., shallow nets, kernel methods). (cf. [Imaizumi and Fukumizu, 19], [Suzuki, 19], [Hayakawa and Suzuki, 19]).
  13. Decomposition of Estimation Error The estimation error decomposes into an approximation

    error and a model complexity term. Here f° is the true function, f̂ is the ERM estimator, and f* ∈ arginf_{f ∈ ℱ} ‖f − f°‖ is the best approximator within ℱ. Tradeoff: a large ℱ gives a small approximation error but a large model complexity; a small ℱ gives a large approximation error but a small model complexity. [Figure: f°, f̂, and f* relative to the hypothesis class ℱ.]
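The tradeoff can be simulated with a stand-in family of hypothesis classes (polynomials of increasing degree replace the neural-network classes; the data-generating function and sample sizes are invented): enlarging ℱ drives the training error down, while the estimation error first falls (approximation error shrinks) and then rises as the complexity term takes over.

```python
import numpy as np

rng = np.random.default_rng(2)

f_true = lambda x: np.sin(2 * np.pi * x)  # hypothetical true function f°
n = 30
x_tr = rng.uniform(0, 1, n)
y_tr = f_true(x_tr) + rng.normal(0, 0.2, n)  # noisy training samples
x_te = np.linspace(0, 1, 1000)               # dense grid to estimate E_x |.|^2

results = {}
for deg in (1, 5, 9):  # nested classes F_1 ⊂ F_5 ⊂ F_9
    coef = np.polyfit(x_tr, y_tr, deg)       # least squares = ERM within F_deg
    train = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coef, x_te) - f_true(x_te)) ** 2)
    results[deg] = (train, test)
    print(deg, round(train, 4), round(test, 4))
```

Degree 1 underfits (large approximation error), degree 9 begins to fit the training noise (growing complexity term), and the intermediate class balances the two.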
  14. Prior Work The generic bound is E ℛ(f̂) ≾ inf_{f ∈ ℱ} ‖f − f°‖∞² + Õ(Nℱ/n)

    (Ref. e.g., Schmidt-Hieber, 17), where the first term is the approximation error and the second is the model complexity. Here n is the sample size, ℱ the set of functions realizable by CNNs, f° the true function (e.g., Hölder, Barron, Besov, etc.), Õ(·) the O-notation ignoring logarithmic terms, and Nℱ the number of "parameters" of ℱ.

    CNN type | Parameter size Nℱ     | Minimax optimality | Combinatorial optimization
    General  | # of all weights      | Sub-optimal        | -
    Sparse*  | # of non-zero weights | Optimal            | Needed

    * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  15. Optimal FNN is Sparse The minimax-optimal hypothesis class has the form

    ℱ = {FNNs such that only a prescribed small fraction of the weights are non-zero} + some conditions (width, depth, etc.). [Figure: an FNN of depth O(log n) in which only O(n^(D/(2β+D)) log n) weights are non-zero.] Do we really search for ERM estimators among sparse FNNs in practice? → NO!
  16. Contribution ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.

    Key Observation: Known optimal FNNs have block-sparse structures*.

    CNN type | Parameter size Nℱ     | Minimax optimality | Combinatorial optimization
    General  | # of all weights      | Sub-optimal        | -
    Sparse*  | # of non-zero weights | Optimal            | Needed
    ResNet   | # of all weights      | Optimal            | Not needed

    * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  17. Block-sparse FNNs Known best-approximating FNNs are block-sparse when the true

    function is in the Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19] class. A block-sparse FNN consists of M parallel fully-connected blocks FC_{W₁,b₁}, …, FC_{W_M,b_M} whose outputs are aggregated with weights w₁, …, w_M as FNN(x) := Σ_{m=1}^M w_m · FC_m(x) − b. [Figure: a general sparse FNN vs. a block-sparse FNN.]
  18. Block-sparse FNN to ResNet-type CNN We transform the block-sparse FNN

    FNN(x) := Σ_{m=1}^M w_m · FC_m(x) − b (minimax optimal) into the ResNet-type CNN CNN := FC ∘ (Conv_M + id) ∘ ⋯ ∘ (Conv_1 + id) (minimax optimal, too!). [Figure: each fully-connected block FC_{W_m,b_m} becomes a convolutional residual block Conv_{w_m,b_m} + id, followed by a final fully-connected layer FC_{W,b}.]
  19. Block-sparse FNN to ResNet-type CNN Theorem (informal): For any block-sparse FNN with

    M blocks, there exists a ResNet-type CNN with M residual blocks which has O(M) more parameters and which is identical (as a function) to the FNN. Consequently, we can translate ANY estimation error rate for block-sparse FNNs into one for ResNet-type CNNs.
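The construction can be sketched in a few lines (a hedged illustration: plain dense matrices stand in for the paper's size-1 convolutions, and all shapes and weights are invented). The residual network carries the state (x, acc); block m reads the x-part, computes w_m · FC_m(x), and adds it to the accumulator through its identity connection, so the serial network computes exactly the parallel sum.

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0)
D, M, width = 4, 3, 8  # input dim, number of blocks, block width (illustrative)

# Weights of M independent two-layer ReLU blocks FC_m, plus aggregation w_m, b.
blocks = [(rng.normal(size=(width, D)), rng.normal(size=width),
           rng.normal(size=(width, width)), rng.normal(size=width))
          for _ in range(M)]
w = [rng.normal(size=width) for _ in range(M)]
b = 0.5

def fc_block(x, params):
    W1, b1, W2, b2 = params
    return relu(W2 @ relu(W1 @ x + b1) + b2)

def block_sparse_fnn(x):
    """Parallel form: FNN(x) = sum_m w_m . FC_m(x) - b."""
    return sum(w[m] @ fc_block(x, blocks[m]) for m in range(M)) - b

def resnet_form(x):
    """Serial form: (FC ∘ (block_M + id) ∘ ... ∘ (block_1 + id))(x)."""
    state = np.concatenate([x, [0.0]])   # state = (x, acc), acc starts at 0
    for m in range(M):
        residual = np.zeros_like(state)  # residual-branch output, zero-padded
        residual[-1] = w[m] @ fc_block(state[:D], blocks[m])
        state = state + residual         # identity connection + residual branch
    return state[-1] - b                 # final FC layer reads the accumulator

x = rng.normal(size=D)
print(block_sparse_fnn(x), resnet_form(x))  # identical values
```

Only one extra accumulator channel and O(M) extra parameters are needed here, mirroring the theorem; the paper additionally rescales parameters so that the resulting class keeps a small model complexity.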
  20. Optimality of ResNet-type CNNs Theorem (e.g., Hölder case, informal): When the

    true function f° is β-Hölder, there exists a set of ResNet-type CNNs ℱ such that: • ℱ does NOT have sparsity constraints (→ no combinatorial optimization!), and • the ERM estimator f̂ over ℱ achieves the minimax-optimal estimation error rate (up to log factors). Note: It is not enough to naïvely transform FNNs into CNNs in order to obtain the desired estimation error, because that leads to too large a model complexity. To solve the problem, we need parameter rescaling techniques.
  21. Misc. Comments about the Theorem Q: Is your theory applicable to

    Hölder functions only? A: Using the same strategy, we can prove a similar statement for the Barron class (or any other function class which can be approximated by block-sparse FNNs). Q: Any other improvements on CNN architectures? A: We can remove the unrealistic constraints on channel size, too. Q: The depth of each residual block is L = O(log n); this is unrealistically deep. A: If we allow the identity connections to have scaling schemes, ResNet-type CNNs whose residual blocks have depth O(1) are minimax optimal for the Hölder class. See the paper for details.
  22. Conclusion ResNet-type CNNs can achieve the same rates as FNNs

    for several function classes without implausible constraints. The key technique is the transformation of block-sparse FNNs (minimax optimal) into ResNet-type CNNs (minimax optimal, too!).

    CNN type | Parameter size Nℱ     | Minimax optimality | Discrete optimization
    General  | # of all weights      | Sub-optimal        | -
    Sparse   | # of non-zero weights | Optimal            | Needed
    ResNet   | # of all weights      | Optimal            | Not needed
  23. Future Work • When can the estimation error rates of CNNs

    exceed those of FNNs? • For the Hölder case, the depth of the residual blocks of optimal ResNet-type CNNs is too large (O(log n)); can we make it O(1)? • Analysis of more practical CNNs (pooling, stride > 1, etc.).
  24. Summary What is this paper about?: Statistical Learning Theory for

    ResNet-type CNNs. Background: FNNs and CNNs are known to achieve minimax-optimal estimation error rates for several function classes (e.g., Hölder). Problem: Known optimal FNNs/CNNs had unrealistic constraints on sparsity and channel size. Contribution: We prove that ResNet-type CNNs can achieve the same rates as FNNs without these constraints. Key technique: Transformation of block-sparse FNNs into ResNet-type CNNs.
  25. References

    [Hayakawa and Suzuki, 19] Satoshi Hayakawa and Taiji Suzuki. On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. arXiv preprint arXiv:1905.09195, 2019.
    [Imaizumi and Fukumizu, 19] Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions effectively. Proceedings of Machine Learning Research, volume 89, pages 869–878. PMLR, 2019.
    [Klusowski and Barron, 18] Jason M. Klusowski and Andrew R. Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory, 2018.
    [Petersen and Voigtlaender, 18] Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
    [Schmidt-Hieber, 17] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. arXiv preprint arXiv:1708.06633, 2017.
    [Suzuki, 19] Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019.
    [Tsybakov, 08] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 1st edition, 2008. ISBN 9780387790510.
    [Yarotsky, 17] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.