Slide 28
$$\min_{x_1,\dots,x_n} F(x_1,\dots,x_n) := \sum_i \big\| g_{(x_1,\dots,x_n)}(z_i) - y_i \big\|^2$$
Training (gradient flow):
$$\frac{\mathrm{d} x_i(t)}{\mathrm{d} t} = - \nabla_{x_i} F\big(x_1(t), \dots, x_n(t)\big)$$
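A minimal numerical sketch of this flow (not from the slide): explicit Euler discretization, i.e., plain gradient descent, on $F$ for the 2-layer perceptron $\phi_x(z) = u\,\sigma(\langle z, v\rangle)$ defined later on the slide. The synthetic data, the choice $\sigma = \tanh$, the step size, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): inputs z_i in R^d, scalar targets y_i.
d, n_data, n_neurons = 5, 50, 200
Z = rng.standard_normal((n_data, d))
y = np.sin(Z @ rng.standard_normal(d))

# Particles x_k = (u_k, v_k); phi_x(z) = u * sigma(<z, v>), with sigma = tanh assumed.
u = 0.1 * rng.standard_normal(n_neurons)
V = rng.standard_normal((n_neurons, d))

def F(u, V):
    """Empirical risk F(x_1,...,x_n) = sum_i || g(z_i) - y_i ||^2."""
    g = np.tanh(Z @ V.T) @ u / n_neurons   # g(z) = (1/n) sum_k u_k sigma(<z, v_k>)
    return np.sum((g - y) ** 2)

# Explicit Euler steps on dx_i/dt = -grad_{x_i} F, i.e., gradient descent.
tau = 0.1  # step size (assumed)
for _ in range(5000):
    S = np.tanh(Z @ V.T)                   # (n_data, n_neurons)
    r = S @ u / n_neurons - y              # residuals g(z_i) - y_i
    grad_u = 2 * S.T @ r / n_neurons                               # dF/du_k
    grad_V = 2 * ((1 - S**2) * np.outer(r, u)).T @ Z / n_neurons   # dF/dv_k, tanh' = 1 - tanh^2
    u -= tau * grad_u
    V -= tau * grad_V

print(F(u, V))  # the risk should have decreased substantially
```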
Training Dynamics of a 2-Layer MLP
$$f(\alpha) = \int k \,\mathrm{d}(\alpha \otimes \alpha) + \int h \,\mathrm{d}\alpha$$
$$k(\theta, \theta') := \sum_i \langle \phi_\theta(z_i), \phi_{\theta'}(z_i) \rangle$$
$$h(\theta) := - \sum_i \langle y_i, \phi_\theta(z_i) \rangle$$
$$\alpha = \frac{1}{n} \sum_k \delta_{x_k}$$
$$\min_\alpha f(\alpha) := \sum_i \big\| G_\alpha(z_i) - y_i \big\|^2$$
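Expanding the square shows why this objective is the quadratic form $f(\alpha)$ above (my derivation; the slide's compact notation absorbs the factor $2$ on $h$ and the constant $\sum_i \|y_i\|^2$):
$$f(\alpha) = \sum_i \|G_\alpha(z_i) - y_i\|^2 = \int k \,\mathrm{d}(\alpha \otimes \alpha) + 2 \int h \,\mathrm{d}\alpha + \sum_i \|y_i\|^2,$$
using $\|G_\alpha(z_i)\|^2 = \iint \langle \phi_\theta(z_i), \phi_{\theta'}(z_i)\rangle \,\mathrm{d}\alpha(\theta)\,\mathrm{d}\alpha(\theta')$ and $-2\langle y_i, G_\alpha(z_i)\rangle = 2\int \big({-}\langle y_i, \phi_\theta(z_i)\rangle\big)\,\mathrm{d}\alpha(\theta)$.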
$$\frac{\partial \alpha_t}{\partial t} - \operatorname{div}\!\big( \nabla_{\mathcal{W}} f(\alpha_t)\, \alpha_t \big) = 0$$
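Why the particle gradient flow above solves this PDE (a sketch I am adding, not on the slide): for the empirical measure $\alpha_t = \frac{1}{n}\sum_k \delta_{x_k(t)}$, any smooth motion of the particles satisfies the continuity equation $\partial_t \alpha_t + \operatorname{div}(v_t\,\alpha_t) = 0$ with velocity field $v_t(x_k(t)) = \dot{x}_k(t)$. Since $F(x_1,\dots,x_n) = f\big(\frac{1}{n}\sum_k \delta_{x_k}\big)$ up to the constants noted above, one checks $\nabla_{x_k} F = \frac{1}{n}\,\nabla_{\mathcal{W}} f(\alpha)(x_k)$, so the Euclidean flow $\dot{x}_k = -\nabla_{x_k} F$ transports the particles along (a $1/n$ time-rescaling of) the Wasserstein gradient field, and $\alpha_t$ solves the PDE.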
$$G_\alpha(z) := \int \phi_\theta(z) \,\mathrm{d}\alpha(\theta)$$
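A small numerical check (an illustrative sketch with assumed names and $\sigma = \tanh$, not from the slide) that the finite network is the special case of $G_\alpha$ for the empirical measure $\alpha = \frac{1}{n}\sum_k \delta_{x_k}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_neurons = 4, 7
Z = rng.standard_normal((3, d))            # a few test inputs z (assumed data)

# Particles x_k = (u_k, v_k); phi_x(z) = u * tanh(<z, v>)
u = rng.standard_normal(n_neurons)
V = rng.standard_normal((n_neurons, d))

def G(weights, u, V, Z):
    """G_alpha(z) = sum_k a_k * phi_{x_k}(z) for a discrete alpha = sum_k a_k delta_{x_k}."""
    return np.tanh(Z @ V.T) @ (weights * u)

a = np.full(n_neurons, 1.0 / n_neurons)    # empirical measure: uniform weights 1/n
g = np.tanh(Z @ V.T) @ u / n_neurons       # finite network g_{(x_1,...,x_n)}(z)
assert np.allclose(G(a, u, V, Z), g)       # the two parametrizations coincide
```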
2-layer perceptron:
$$\phi_x(z) = u\, \sigma(\langle z, v \rangle)$$
Theorem (Lénaïc Chizat and Francis Bach): for perceptrons, if the network has enough neurons, the gradient flow can only converge to a global minimum.
$\alpha_{t=0} \to \alpha_t$: "global" convergence, despite $F$ not being convex.
[Diagram: 2-layer perceptron mapping input $z$ through hidden weights $(v_k)_k$ and activation $\sigma$ to output weights $(u_k)_k$, producing $g_{(x_1,\dots,x_n)}(z)$.]
$$g_{(x_1,\dots,x_n)}(z) := \frac{1}{n} \sum_{k=1}^n \phi_{x_k}(z)$$
parameters: $x = (u, v)$