20170405_mlkitchen2

yoppe

April 05, 2017

Transcript

  1. How and why does deep learning work well? From some

    theoretical point of view. Yohei KIKUTA, diracdiego@gmail.com, https://www.linkedin.com/in/yohei-kikuta-983b29117/ 20170405 ML Kitchen #2
  2. Deep Learning !! - Image Recognition. Source : https://vl-lab.eecs.umich.edu/ 2/34

  3. Deep Learning !! - Image Recognition. Source : https://vl-lab.eecs.umich.edu/ , http://www.image-net.org/challenges/LSVRC/

    [Chart : ILSVRC classification error [%] by year, 2010-2016 : 28.2, 25.8, 15.3, 11.2, 7.41, 3.57, 2.99 ; annotated with "Deep Learning !!" and the human ability level.] 3/34
  4. Deep Learning !! - Natural Language Processing. Source : https://github.com/yoheikikuta/arxiv_summary_translation

    [Screenshot : original text with ja and fr translations] 4/34
  5. Deep Learning !! - Image Generation. Source : https://arxiv.org/abs/1612.03242 5/34

  6. Deep Learning !! - Style Transfer. Source : https://arxiv.org/abs/1703.07511 6/34

  7. Deep Learning !! - And so many others… - Object Detection - Super

    Resolution - Question Answering - Deep Reinforcement Learning - Recommendation - … 7/34
  8. Why good ? Common explanations : - Large amounts of data

    - High computational power - Improvements of optimization methods - Progress in modeling (new architectures, ReLU, regularizations, …) - … 8/34
  9. Why good ? Common explanations : - Large amounts of data

    - High computational power - Improvements of optimization methods - Progress in modeling (new architectures, ReLU, regularizations, …) - … But these do NOT explain why deep learning (DL) itself is good… 9/34
  10. Goals We would like to clarify how and why DL

    works : - Expressibility of Neural Networks (NN) - The meaning and interpretation behind deep architectures. Math & Phys help to answer these points ! This slide deck is largely based on https://arxiv.org/abs/1608.08225 and covers only a small part of these topics. 10/34
  11. Outline 1. Introduction 2. Mathematical set up 3. Expressibility of

    NN 4. Depth and Renormalization Group 5. Summary 6. References 11/34
  12. Approximate prob. distributions NN can approximate probability distributions. Example : x = images, y = {dog, cat, gorilla, …}

    [FIG. 1 from the paper (panels : Unsupervised learning p(x,y), Prediction p(x|y), Classification p(y|x)) : Neural networks can approximate probability distributions. Given many samples of random vectors y and x, both classification and prediction involve viewing x as a stochastic function of y and attempting to estimate the probability distributions for y given x and x given y, respectively. In contrast, unsupervised learning attempts to approximate the joint probability distribution of y and x without making any assumptions about causality.] p(x|y) : many simplifying features for approximation (symmetry, locality, …) ; p(y|x) : more complicated. Source : https://arxiv.org/abs/1608.08225 12/34
  13. Approximate prob. distributions Using Bayes’ theorem, introduce

    the Hamiltonian, or surprisal. Then it tends to have properties making it simple to evaluate ! 13/34
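The formulas elided on this slide (they were images) can be reconstructed from arXiv:1608.08225; a sketch in that paper's notation:

```latex
% Bayes' theorem for the class posterior:
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
% Introduce the Hamiltonian (surprisal) and the bias:
H_y(x) \equiv -\ln p(x \mid y), \qquad \mu_y \equiv -\ln p(y)
% Then the posterior becomes a softmax of the negative Hamiltonian:
p(y \mid x) = \frac{e^{-[H_y(x) + \mu_y]}}{\sum_{y'} e^{-[H_{y'}(x) + \mu_{y'}]}}
```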
  14. Approximate prob. distributions Vector expression (exp acts elementwise). Approximate

    it by an n-layer feed-forward NN, where the non-linearities are local functions (they do not mix components), e.g. max-pooling or the final softmax. 14/34
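As a sketch of the vector expression above (the Hamiltonian values `H` and priors `mu` below are made-up numbers, only to illustrate that the posterior is a softmax of −(H(x) + μ)):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: exp acts elementwise, then normalize."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Toy Hamiltonian values H_y(x) and biases mu_y for three classes.
H = np.array([2.0, 0.5, 4.0])    # H_y(x) = -ln p(x|y)
mu = np.array([1.0, 1.5, 0.7])   # mu_y   = -ln p(y)

p = softmax(-(H + mu))           # posterior p(y|x) over the three classes
```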
  15. Natures of Hamiltonian Arbitrary possible patterns for 1000×1000 gray scale

    images : 256^1,000,000. Boolean functions of n variables : 2^(2^n) (in this case a NN requires at least 2^n bits*, which exceeds the # of atoms in the universe for n > 260). *https://page.mi.fu-berlin.de/rojas/neural/chapter/K6.pdf 15/34
  16. Natures of Hamiltonian Arbitrary possible patterns for 1000×1000 gray scale

    images : 256^1,000,000. Boolean functions of n variables : 2^(2^n) (in this case a NN requires at least 2^n bits*, which exceeds the # of atoms in the universe for n > 260). However, phys and ML tend to favor polynomials. The number of degrees of freedom (dof) is** (n + d)!/(n! d!), where n : # of components of x, d : degree of the polynomial. *https://page.mi.fu-berlin.de/rojas/neural/chapter/K6.pdf **Pattern Recognition and Machine Learning Exercise 1.16 16/34
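The counts on this slide can be checked directly (a quick sketch; `poly_dof` is a hypothetical helper name):

```python
import math

def poly_dof(n, d):
    """Number of monomial coefficients of degree up to d in n variables:
    (n + d)! / (n! * d!)."""
    return math.comb(n + d, d)

# 2**n already exceeds the ~10**78-10**80 atoms in the universe for n > 260:
assert 2 ** 261 > 10 ** 78

# A generic lookup table for Boolean functions of n variables needs 2**n
# bits, but a low-degree polynomial Hamiltonian over a megapixel image
# needs vastly fewer parameters:
n = 1000 * 1000
quadratic_terms = poly_dof(n, 2)   # ~5 * 10**11, not 256**(10**6)
```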
  17. Outline 1. Introduction 2. Mathematical set up 3. Expressibility of

    NN 4. Depth and Renormalization Group 5. Summary 6. References 17/34
  18. Multiplication gate by NN Thm A NN with dims

    {out, hidden, in} = {1, 4, 2} can approximate a multiplication gate arbitrarily well. Pf From the paper : focus on Hamiltonians that can be expanded as a low-order power series, H_y(x) = h + Σ_i h_i x_i + Σ_ij h_ij x_i x_j + Σ_ijk h_ijk x_i x_j x_k + ··· ; if the vector x has n components (i = 1, ..., n), then there are (n + d)!/(n!d!) terms of degree up to d, so it suffices to approximate multiplication accurately. [FIG. 2 from the paper : a continuous multiplication gate built from 4 neurons with input weights ±λ and output weight λ^{−2}/(4σ″(0)), becoming arbitrarily accurate as λ → 0 (without loss of generality, σ″(0) ≠ 0).] And we can always make the error arbitrarily small. source : https://arxiv.org/abs/1608.08225 18/34
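A numerical check of this theorem (my own sketch: the paper's construction assumes an activation with σ″(0) ≠ 0, which fails for the logistic sigmoid, so I expand around a bias b = 1 where σ″(b) ≠ 0; the values of λ and b are my choices):

```python
import math

def sigma(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def sigma2(b):
    """Exact second derivative of the logistic sigmoid at b:
    sigma'' = sigma * (1 - sigma) * (1 - 2 * sigma)."""
    s = sigma(b)
    return s * (1 - s) * (1 - 2 * s)

def mult_gate(u, v, lam=1e-3, b=1.0):
    """Approximate u*v with 4 hidden neurons.  Taylor-expanding around the
    bias b gives sigma(b+e) + sigma(b-e) = 2*sigma(b) + sigma''(b)*e**2
    + O(e**4), so the combination below tends to u*v as lam -> 0."""
    s = (sigma(b + lam * (u + v)) + sigma(b - lam * (u + v))
         - sigma(b + lam * (u - v)) - sigma(b - lam * (u - v)))
    return s / (4 * sigma2(b) * lam ** 2)
```

With `lam = 1e-3`, `mult_gate(2.0, 3.0)` agrees with 6 to better than 1e-3, and shrinking `lam` further improves the accuracy, as the theorem states.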
  19. Polynomial by NN Cor For any given multivariate polynomial and tolerance

    ε > 0, there exists a NN of fixed finite size N (independent of ε) that approximates the polynomial to accuracy better than ε. Pf Compose the multiplication gates of the previous slide ; N is typically slightly larger than 4 neurons per multiplication. Universal Approximation Theorem : see http://www2.math.technion.ac.il/~pinkus/papers/acta.pdf 19/34
  20. Redundancy of parameters Some properties of Hamiltonians in phys and

    ML : - Low polynomial order : phys - d = 4 in the fundamental theory (renormalizability) ; ML - d = 2 for Gaussian prob., d = 1 for translation, rotation, convolution, … - Locality : phys - only nearest-neighbor interactions (in lattice formalism) ; ML - Markov network formalism, local receptive fields in CNN - Symmetry : phys - Lorentz symmetry, gauge symmetry, … ; ML - translation, rotation, … e.g.) (n, d = 2) polynomials → nearest neighbor → translation sym reduces the dof. 20/34
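A back-of-the-envelope count of this redundancy for the (n, d = 2) example (the nearest-neighbor and symmetric counts below are my own rough accounting for a 1D chain, not figures from the slides):

```python
import math

n = 1000 * 1000  # components of x (a megapixel image), degree d = 2

# Generic quadratic Hamiltonian: every h, h_i, h_ij is a free parameter.
generic = math.comb(n + 2, 2)   # roughly n**2 / 2, about 5 * 10**11

# Locality (1D nearest-neighbor chain): keep h, h_i, h_ii and h_{i,i+1}.
local = 1 + n + n + (n - 1)     # about 3n

# + Translation symmetry: each family of couplings shares one value.
symmetric = 4                   # h, h_i, h_ii, h_{i,i+1} -> one number each
```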
  21. Outline 1. Introduction 2. Mathematical set up 3. Expressibility of

    NN 4. Depth and Renormalization Group 5. Summary 6. References 21/34
  22. Mathematical preparations A sufficient statistic T(x) for y given x satisfies p(y|x) = p(y|T(x)). A minimal sufficient

    statistic is one that can be written as a function of any other sufficient statistic. e.g.) minimal sufficient statistics for a Gaussian distrib. : for unknown mean, the sample sum ; for unknown mean and variance, the sample sum and the sum of squares. Consider causal hierarchies generating data, using the prob. distrib. and a Markov matrix at the ith level. 22/34
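The definitions elided on this slide (they were formula images) are standard; a sketch in the paper's spirit:

```latex
% T(x) is a sufficient statistic for y given x if
p(y \mid x) = p\big(y \mid T(x)\big)
% T is minimal if, for every sufficient statistic T',
% there exists a function f with T(x) = f\big(T'(x)\big).
% e.g.) i.i.d. Gaussian samples x = (x_1, \dots, x_N):
%   unknown mean:              T(x) = \textstyle\sum_i x_i
%   unknown mean and variance: T(x) = \big(\textstyle\sum_i x_i,\ \sum_i x_i^2\big)
```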
  23. Markovian hierarchical process Thm Let T be a minimal sufficient

    statistic for y given x. Then there exist functions f s.t. T factors through the hierarchy. Pf Using Bayes’ thm, the Markov property, and the definition of a minimal sufficient statistic, the composed statistic is a sufficient statistic for y ; but T is a minimal sufficient statistic, so the claim follows. 23/34
  24. Markovian hierarchical process Cor Pf Define the composition of the functions from the previous

    theorem. Then the corollary follows by mathematical induction. Roughly speaking, this corollary states that the structure of the inference problem reflects the structure of the generative process ! Note that though minimal sufficient statistics are often difficult to find, it is frequently possible to come up with nearly sufficient statistics. 24/34
  25. Markovian hierarchical process [FIG. 3 from the paper : Causal hierarchy

    examples relevant to physics (left) and image classification (right). Physics : cosmological parameters → power spectrum (generate fluctuations) → CMB sky map (simulate sky map) → multi-frequency maps (add foregrounds) → telescope data (take linear combinations, add noise). ML : category label (cat or dog?) → SolidWorks parameters (select color, shape & posture) → ray-traced object (ray trace) → transformed object (scale & translate) → final image (select background). As information flows down the hierarchy y0 → y1 → … → yn = y, some of it is destroyed by random Markov processes. However, no further information is …] source : https://arxiv.org/abs/1608.08225 25/34
  26. Renormalization and distillation Consider the Hamiltonian for modeling natural images.

    ※ This Hamiltonian satisfies rotation & translation invariances. Coarse grain it by a factor c (pixel values are averaged), and repeat. In the appropriate limit we obtain the renormalization group (RG) transformation. After repeating it, only a few couplings remain relevant (the others flow to zero). 26/34
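The coarse-graining step can be sketched as block averaging (a minimal sketch; `coarse_grain` is a hypothetical helper, and a real RG step would also transform the couplings as described above):

```python
import numpy as np

def coarse_grain(img, c):
    """Block-average an image by a factor c: each c-by-c block of pixels
    is replaced by its mean, reducing the resolution by a factor c."""
    h, w = img.shape
    assert h % c == 0 and w % c == 0
    return img.reshape(h // c, c, w // c, c).mean(axis=(1, 3))

img = np.arange(16.0).reshape(4, 4)   # toy 4x4 "image"
small = coarse_grain(img, 2)          # 2x2 image of block means
```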
  27. Renormalization and distillation Deep architecture naturally realizes such RG transformations.

    ※ This is a supervised feature extraction* : (renormalization) = (a special case of feature extraction). *See appendix B in https://arxiv.org/abs/1608.08225 27/34
  28. No-flattening theorem Thm No NN can implement an n-input

    multiplication gate using fewer than 2^n neurons in a single hidden layer. Pf See Appendix A in https://arxiv.org/abs/1608.08225 . This is a mathematical advantage of deep architectures. e.g.) Required neurons for multiplying 32 numbers : deep : about 4n (4-neuron multiplication gates arranged in
 a binary tree with log2 n layers) ; shallow : 2^32 ≈ 4 × 10^9. 28/34
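A quick arithmetic check of these counts (using the reading that each pairwise gate costs 4 neurons, as on slide 18, so a binary tree of n − 1 gates multiplies n numbers):

```python
import math

n = 32                      # multiply 32 numbers together

# Flat network (single hidden layer): the no-flattening theorem
# requires at least 2**n neurons.
shallow = 2 ** n

# Deep network: n - 1 pairwise multiplication gates (4 neurons each)
# arranged in a binary tree of log2(n) layers.
deep = 4 * (n - 1)
layers = int(math.log2(n))
```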
  29. Outline 1. Introduction 2. Mathematical set up 3. Expressibility of

    NN 4. Depth and Renormalization Group 5. Summary 6. References 29/34
  30. Summary - Expressibility of NN : NN can approximate a multiplication gate at

    reasonable cost ; Hamiltonians in phys and ML tend to be polynomials.
 - Meaning and interpretation behind deep architectures : the Markovian hierarchical picture matches DL ; the naturally induced renormalization acts as feature extraction ; deep architectures are suitable for realizing polynomials.
 Let’s study physics to understand DL better ! 30/34
  31. Relation between phys and ML TABLE I: Physics-ML dictionary.

    Physics | Machine learning
    Hamiltonian | Surprisal −ln p
    Simple H | Cheap learning
    Quadratic H | Gaussian p
    Locality | Sparsity
    Translationally symmetric H | Convnet
    Computing p from H | Softmaxing
    Spin | Bit
    Free energy difference | KL-divergence
    Effective theory | Nearly lossless data distillation
    Irrelevant operator | Noise
    Relevant operator | Feature
    source : https://arxiv.org/abs/1608.08225 31/34
  32. Outline 1. Introduction 2. Mathematical set up 3. Expressibility of

    NN 4. Depth and Renormalization Group 5. Summary 6. References 32/34
  33. Renormalization Group in physics Renormalization Group (RG) is a very sophisticated,

    beautiful and powerful method for analyzing physical phenomena (especially so-called critical phenomena). RG is based on the invariance of the Hamiltonian under some scale transformations. e.g.) the 1 dim Ising model, via real-space RG by decimation :
    Hamiltonian : H = −J Σ_i s_i s_{i+1} , with coupling constant J > 0.
    Partition function : Z = Σ_{s_i = ±1} e^{−H/k_B T} = Σ_{s_i = ±1} Π_i e^{K s_i s_{i+1}} , where K = J/(k_B T).
    Free energy : F = −k_B T ln Z.
    Decimation : sum over the spins on the odd sites. Since only adjacent spins interact, each odd-site sum can be evaluated independently : Σ_{s_{2n+1} = ±1} e^{K(s_{2n} s_{2n+1} + s_{2n+1} s_{2n+2})} = 2 cosh K(s_{2n} + s_{2n+2}) , so Z = Σ_{s_{2i} = ±1} Π_i 2 cosh K(s_{2i} + s_{2i+2}) , exactly.
    Writing 2 cosh K(s_{2i} + s_{2i+2}) = z(K) e^{K′ s_{2i} s_{2i+2}} for every configuration of neighboring spins gives e^{2K′} = cosh 2K and z²(K) = 4 cosh 2K , i.e. the relation between the original and the coarse-grained system : K′ = (1/2) ln(cosh 2K).
    Free energy recursion : with ln Z = N f(K) (F is extensive), Z(K, N) = [z(K)]^{N/2} Z(K′, N/2) , so f(K′) = 2 f(K) − ln z(K) = 2 f(K) − ln 2(cosh 2K)^{1/2}.
    For the fixed point (K′ = K = K*) : as K → 0, cosh K → 1, so f(K*) → ln 2 and F = −N k_B T f(K*) = −N k_B T ln 2.
    (For 2 dim, we get a non-trivial result.) 33/34
  34. - Some points to notice in variational RG : https://arxiv.org/abs/1609.03541 This

    paper is a comment on “Why does deep and cheap learning work so well?” - An exact mapping btw variational RG and DL : https://arxiv.org/abs/1410.3831 Relations between the Renormalization Group and DL based on Restricted Boltzmann Machines. - The Loss Surface of Multilayer Networks : https://arxiv.org/abs/1412.0233 Some analysis using the connection btw the physics spin glass model and deep NN. - Renormalization group in Newtonian physics : http://www.gakushuin.ac.jp/~881791/pdf/ParityRG.pdf Though this is in Japanese, it is an amazing article about the nature of the Renormalization Group. - Reconstructing a NN from its output : http://www.uam.es/departamentos/ciencias/matematicas/ibero/10.3/MATEMATICAIBEROAMERICANA_1994_10_03_02.pdf For the tanh activation function, we can determine the network architecture by observing the NN output. - Deep vs. shallow : https://arxiv.org/abs/1608.03287 For some classes, deep architectures are superior to shallow ones. - Nonlinear dynamics of deep linear NN : https://arxiv.org/abs/1312.6120 Nonlinear aspects of learning in deep linear NN. - Local minima of deep linear NN : https://arxiv.org/abs/1702.08580 Without nonlinearity (deep linear NN), a proof that depth alone does not create bad local minima. 34/34