Convergent Plug & Play schemes for Inverse Problems in Imaging

Convergent Plug & Play schemes for Inverse Problems in Imaging
Samuel Hurault, Arthur Leclaire, Nicolas Papadakis

Image Enhancement and Restoration [Thi´ ebaut et al. ’16] [Gustafsson
et al. ’16] 1/51

Image Enhancement and Restoration [Thi´ ebaut et al. ’16] [Gustafsson
et al. ’16] [The Mandalorian, Disney ’20] 1/51

Outline 1. Introduction 2. From Proximal algorithms to Plug &
Play algorithms 3. Gradient Step Plug & Play 4. Proximal Gradient Step Plug & Play 5. Conclusion

1. Introduction

Image Inverse Problems Find x from observation y = Ax
+ ξ • x ∈ Rn input ”clean” image • y ∈ Rm degraded image • A ∈ Rm×n degradation matrix • ξ random noise, generally ξ ∼ N(0, σ2 Idm ) Denoising: = + 2/51

+ ξ • x ∈ Rn input ”clean” image • y ∈ Rm degraded image • A ∈ Rm×n degradation matrix • ξ random noise, generally ξ ∼ N(0, σ2 Idm ) Deblurring: = ∗ + 2/51

+ ξ • x ∈ Rn input ”clean” image • y ∈ Rm degraded image • A ∈ Rm×n degradation matrix • ξ random noise, generally ξ ∼ N(0, σ2 Idm ) Super-resolution: =         ∗         s + 2/51

+ ξ • x ∈ Rn input ”clean” image • y ∈ Rm degraded image • A ∈ Rm×n degradation matrix • ξ random noise, generally ξ ∼ N(0, σ2 Idm ) Inpainting: = ⊗ 2/51

+ ξ • x ∈ Rn input ”clean” image • y ∈ Rm degraded image • A ∈ Rm×n degradation matrix • ξ random noise, generally ξ ∼ N(0, σ2 Idm ) Compressed Sensing: e.g. Magnetic Resonance Imaging (MRI) 2/51

Image Inverse Problems Find x from observation y ∼ p(y|x)
• y ∈ Rm observation • x ∈ Rn unknown input • p(y|x) forward model Computed Tomography: 2/51

Image Inverse Problems with Plug & Play Find x from
observation y = A(x) + ξ Supervized learning Provides state-of-the art results from paired data (xi , yi ) Requires to train a diﬀerent model for each A 3/51

observation y = A(x) + ξ Supervized learning Provides state-of-the art results from paired data (xi , yi ) Requires to train a diﬀerent model for each A Plug & Play (PnP): • Learn an eﬃcient denoiser • Plug the denoiser in an iterative algorithm • Solve general inverse problems, involving any operator A 3/51

observation y = A(x) + ξ Supervized learning Provides state-of-the art results from paired data (xi , yi ) Requires to train a diﬀerent model for each A Plug & Play (PnP): • Learn an eﬃcient denoiser • Plug the denoiser in an iterative algorithm • Solve general inverse problems, involving any operator A State-of-the-art image restoration Convergence of the method 3/51

Maximum A-Posteriori Find x from observation y ∼ p(y|x) with
an a-priori p(x) on the solution arg max x∈Rn p(x|y) = arg max x∈Rn p(y|x)p(x) p(y) = arg min x∈Rn − log p(y|x) − log p(x) 4/51

Maximum A-Posteriori Find x from observation y ∼ p(y|x) with
an a-priori p(x) on the solution arg max x∈Rn p(x|y) = arg max x∈Rn p(y|x)p(x) p(y) = arg min x∈Rn − log p(y|x) − log p(x) Maximum A-Posteriori x∗ ∈ arg min x∈Rn f (x) + g(x) ⇐⇒ arg min x∈Rn data-ﬁdelity + regularization f (x) = − log p(y|x) g(x) ∝ − log p(x) 4/51

A variety of data-ﬁdelity terms f Observation y = Ax
+ ξ • Assuming Gaussian noise model ξ ∼ N(0, σ2 Id), f (x) = − log p(y|x) = 1 2σ2 ||Ax − y||2 → convex and smooth f , non-strongly convex in general 5/51

A variety of data-ﬁdelity terms f Observation y = Ax
+ ξ • Assuming Gaussian noise model ξ ∼ N(0, σ2 Id), f (x) = − log p(y|x) = 1 2σ2 ||Ax − y||2 → convex and smooth f , non-strongly convex in general • Less regular cases - Noiseless case: f (x) = ı{x | Ax=y} → non-smooth f - Laplace / Poisson noise model → non-smooth f - Phase retrieval → non-convex f • More complex non-linear modeling of real complex physical systems (e.g. X-ray computed tomography, electron-microscopy...) 5/51

A variety of explicit image priors Design an explicit regularization
on image features: • Total variation [Rudin et al. ‘92] • Fourier spectrum [Ruderman ‘94] • Wavelet sparsity [Mallat ’09] 6/51

on image features: • Total variation [Rudin et al. ‘92] • Fourier spectrum [Ruderman ‘94] • Wavelet sparsity [Mallat ’09] Learn an explicit prior on image patches: • Dictionary learning [Elad et al. ’06, Mairal et al. ’08] • Gaussian mixture models [Yu et al. ’10], [Zoran, Weiss ‘11] 6/51

on image features: • Total variation [Rudin et al. ‘92] • Fourier spectrum [Ruderman ‘94] • Wavelet sparsity [Mallat ’09] Learn an explicit prior on image patches: • Dictionary learning [Elad et al. ’06, Mairal et al. ’08] • Gaussian mixture models [Yu et al. ’10], [Zoran, Weiss ‘11] Learn an explicit deep prior on full images: (generative models) • Variational Auto-encoders [Kingma & Welling, ‘13] • Normalizing ﬂows [Dinh et al. ‘15] • Adversarial [Lunz et al. ‘18, Mukherjee et al. ‘21] • Score-based models [Song. et al ’21] 6/51

PnP motivations Find x∗ ∈ arg minx∈Rn Data-ﬁdelity(x) + Regularization(x)
• Decouple data-ﬁdelity and regularization in iterative algorithms [Combette & Pesquet ‘11, Zoran & Weiss ‘11] 7/51

• Decouple data-ﬁdelity and regularization in iterative algorithms [Combette & Pesquet ‘11, Zoran & Weiss ‘11] Regularization by Image Denoising (easier, well-understood task) [Venkatakrishnan et al. ‘13, Romano et al. ‘17] 7/51

• Decouple data-ﬁdelity and regularization in iterative algorithms [Combette & Pesquet ‘11, Zoran & Weiss ‘11] Regularization by Image Denoising (easier, well-understood task) [Venkatakrishnan et al. ‘13, Romano et al. ‘17] → State-of-the art denoisers Dσ without explicit prior Filtering methods [Dabov et al. ‘07, Lebrun et al. ‘13] Deep denoisers [Zhang et al. ‘16,‘17, ‘21, Song et al. ‘19] 7/51

• Decouple data-ﬁdelity and regularization in iterative algorithms [Combette & Pesquet ‘11, Zoran & Weiss ‘11] Regularization by Image Denoising (easier, well-understood task) [Venkatakrishnan et al. ‘13, Romano et al. ‘17] → State-of-the art denoisers Dσ without explicit prior Filtering methods [Dabov et al. ‘07, Lebrun et al. ‘13] Deep denoisers [Zhang et al. ‘16,‘17, ‘21, Song et al. ‘19] → Step towards the manifold of clean images: implicit prior 7/51

• Decouple data-ﬁdelity and regularization in iterative algorithms [Combette & Pesquet ‘11, Zoran & Weiss ‘11] Regularization by Image Denoising (easier, well-understood task) [Venkatakrishnan et al. ‘13, Romano et al. ‘17] → State-of-the art denoisers Dσ without explicit prior Filtering methods [Dabov et al. ‘07, Lebrun et al. ‘13] Deep denoisers [Zhang et al. ‘16,‘17, ‘21, Song et al. ‘19] → Step towards the manifold of clean images: implicit prior • PnP: From y degraded, iterate: 1. Perform a denoising step with Dσ 2. Enforce data-ﬁdelity 7/51

• Decouple data-ﬁdelity and regularization in iterative algorithms [Combette & Pesquet ‘11, Zoran & Weiss ‘11] Regularization by Image Denoising (easier, well-understood task) [Venkatakrishnan et al. ‘13, Romano et al. ‘17] → State-of-the art denoisers Dσ without explicit prior Filtering methods [Dabov et al. ‘07, Lebrun et al. ‘13] Deep denoisers [Zhang et al. ‘16,‘17, ‘21, Song et al. ‘19] → Step towards the manifold of clean images: implicit prior • PnP: From y degraded, iterate: 1. Perform a denoising step with Dσ 2. Enforce data-ﬁdelity How many interations? Convergence when plugging Dσ in an optimization algorithm? 7/51

2. From Proximal algorithms to Plug & Play algorithms

Image Inverse Problems Maximum A-Posteriori (MAP) x∗ MAP (y) =
arg min x∈Rn λ 2 ||Ax − y||2 − log p(x) Explicit prior p: tractable minimization objective x∗ ∈ arg min x∈Rn f (x) + g(x) 8/51

arg min x∈Rn λ 2 ||Ax − y||2 − log p(x) Explicit prior p: tractable minimization objective x∗ ∈ arg min x∈Rn f (x) + g(x) data-ﬁdelity 8/51

arg min x∈Rn λ 2 ||Ax − y||2 − log p(x) Explicit prior p: tractable minimization objective x∗ ∈ arg min x∈Rn f (x) + g(x) data-ﬁdelity prior p(x) 8/51

arg min x∈Rn λ 2 ||Ax − y||2 − log p(x) Explicit prior p: tractable minimization objective x∗ ∈ arg min x∈Rn f (x) + g(x) Example: Inpainting 8/51

arg min x∈Rn λ 2 ||Ax − y||2 − log p(x) Explicit prior p: tractable minimization objective x∗ ∈ arg min x∈Rn f (x) + g(x) Example: Inpainting Log-concave prior p Convex problem f + g 8/51

arg min x∈Rn λ 2 ||Ax − y||2 − log p(x) Explicit prior p: tractable minimization objective x∗ ∈ arg min x∈Rn f (x) + g(x) Example: Inpainting Log-concave prior p Non log-concave prior Convex problem f + g Non-convex problem f + g 8/51

Proximal optimization algorithms Minimization problem for a proper function h
: Rn → R x∗ ∈ arg min x∈Rn h(x) 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) Explicit Gradient Descent operator for diﬀerentiable functions xk+1 = (Id −τ∇h)(xk ) 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) Explicit Gradient Descent operator for diﬀerentiable functions xk+1 = (Id −τ∇h)(xk ) Fixed point: x∗ = (Id −τ∇h)(x∗) ⇔ Minimizer: ∇h(x∗) = 0 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) Explicit Gradient Descent operator for diﬀerentiable functions xk+1 = (Id −τ∇h)(xk ) Fixed point: x∗ = (Id −τ∇h)(x∗) ⇔ Minimizer: ∇h(x∗) = 0 Convergence guarantee if Id −τ∇h is non-expansive ||(Id −τ∇h)(x) − (Id −τ∇h)(y)|| ≤ ||x − y|| Suﬃcient condition: if h convex with L-Lipschitz gradient and τL < 2 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) Implicit proximal operator for (non)diﬀerentiable (convex) functions Proxτh (z) ∈ arg min x∈Rn 1 2τ ||x − z||2 + h(x) xk+1 = Proxτh (xk ) = (Id +τ∂h)−1(xk ) 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) Implicit proximal operator for (non)diﬀerentiable (convex) functions Proxτh (z) ∈ arg min x∈Rn 1 2τ ||x − z||2 + h(x) xk+1 = Proxτh (xk ) = (Id +τ∂h)−1(xk ) Fixed point: x∗ = Proxτh (x∗) ⇔ Minimizer: 0 ∈ ∂h(x∗) 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) Implicit proximal operator for (non)diﬀerentiable (convex) functions Proxτh (z) ∈ arg min x∈Rn 1 2τ ||x − z||2 + h(x) xk+1 = Proxτh (xk ) = (Id +τ∂h)−1(xk ) Fixed point: x∗ = Proxτh (x∗) ⇔ Minimizer: 0 ∈ ∂h(x∗) For convex functions, Proxτh is non-expansive for all τ > 0 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) For γ-strongly convex functions ∇h(x) − ∇h(y), x − y ≥ γ||x − y||2 non-expansiveness → contractiveness 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) For γ-strongly convex functions ∇h(x) − ∇h(y), x − y ≥ γ||x − y||2 non-expansiveness → contractiveness • Proximal operator || Proxτh (x) − Proxτh (y) ≤ 1 1 + τγ ||x − y|| • Gradient descent operator, for τ < 2 L+γ ||(Id −τ∇h)(x) − (Id −τ∇h)(y)|| ≤ (1 − τγ)||x − y|| 9/51

: Rn → R x∗ ∈ arg min x∈Rn h(x) To sum up Convexity of function h ⇒ non-expansiveness of gradient/proximal operators ⇒ Convergence of gradient iterations to minimizers of h 9/51

Gradient-based optimization algorithms Minimization problem with explicit prior x∗ ∈
arg min x∈Rn f (x) + g(x) Proximal algorithms • Proximal Gradient Descent (PGD) xk+1 = TPGD (xk ) = Proxτg ◦(Id −τ∇f )(xk ) • Douglas-Rashford-Splitting (DRS) xk+1 = TDRS (xk ) = (2 Proxτf − Id) ◦ (2 Proxτg − Id)(xk ) • Half-Quadratic-Splitting (HQS) xk+1 = THQS (xk ) = Proxτf ◦ Proxτg (xk ) 10/51

arg min x∈Rn f (x) + g(x) Proximal algorithms • Proximal Gradient Descent (PGD) xk+1 = TPGD (xk ) = Proxτg ◦(Id −τ∇f )(xk ) • Douglas-Rashford-Splitting (DRS) xk+1 = TDRS (xk ) = (2 Proxτf − Id) ◦ (2 Proxτg − Id)(xk ) • Half-Quadratic-Splitting (HQS) xk+1 = THQS (xk ) = Proxτf ◦ Proxτg (xk ) Fixed points of these non-expansive operators ⇔ Minimizers of f + g 10/51

arg min x∈Rn f (x) + g(x) Proximal algorithms • Proximal Gradient Descent (PGD) xk+1 = TPGD (xk ) = Proxτg ◦(Id −τ∇f )(xk ) • Douglas-Rashford-Splitting (DRS) xk+1 = TDRS (xk ) = (2 Proxτf − Id) ◦ (2 Proxτg − Id)(xk ) • Half-Quadratic-Splitting (HQS) xk+1 = THQS (xk ) = Proxτf ◦ Proxτg (xk ) Fixed points of these non-expansive operators ⇔ Minimizers of f + g How using (deep) implicit learned priors? 10/51

Plug & Play Image Restoration Denoising MAP solution y =
ˆ x + ξ with ξ ∼ N(0, σ2 Idn ): x∗ MAP (y) = arg min x∈Rn 1 2σ2 ||x − y||2 − log p(x) = Prox−σ2 log p (y) A denoiser is related to an implicit prior p 11/51

ˆ x + ξ with ξ ∼ N(0, σ2 Idn ): x∗ MAP (y) = arg min x∈Rn 1 2σ2 ||x − y||2 − log p(x) = Prox−σ2 log p (y) A denoiser is related to an implicit prior p Plug & Play [Venkatakrishnan et al. ’13, Romano et al. 17]: Replace the gradient operator of the regularization by an external denoiser Dσ : Rn → Rn learnt for noise level σ Dσ (y) ≈ x∗ MAP (y) = Prox−σ2 log p (y) 11/51

ˆ x + ξ with ξ ∼ N(0, σ2 Idn ): x∗ MAP (y) = arg min x∈Rn 1 2σ2 ||x − y||2 − log p(x) = Prox−σ2 log p (y) A denoiser is related to an implicit prior p Plug & Play [Venkatakrishnan et al. ’13, Romano et al. 17]: Replace the gradient operator of the regularization by an external denoiser Dσ : Rn → Rn learnt for noise level σ Dσ (y) ≈ x∗ MAP (y) = Prox−σ2 log p (y) Example Dσ : DRUNet [Zhang et al’ 21] Remark: State-of-the-art denoisers are not non-expansive 11/51

Plug & Play algorithms Proximal algorithms • PGD: xk+1 =
Proxτg ◦ (Id −τ∇f )(xk ) • DRS: xk+1 = (2 Proxτf − Id) ◦ (2Proxτg − Id)(xk ) • HQS: xk+1 = Proxτf ◦Proxτg (xk ) Plug & Play algorithms • PnP-PGD: xk+1 = Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) • PnP-HQS: xk+1 = Proxτf ◦Dσ (xk ) 12/51

Plug & Play Image Restoration Pros and cons High
image restoration performance of PnP methods with deep denoisers IRCNN [Zhang et al. ’17], DPIR [Zhang et al. ’21] 13/51

image restoration performance of PnP methods with deep denoisers IRCNN [Zhang et al. ’17], DPIR [Zhang et al. ’21] Implicit prior → no tractable minimization problem minx f (x)+g(x) 13/51

image restoration performance of PnP methods with deep denoisers IRCNN [Zhang et al. ’17], DPIR [Zhang et al. ’21] Implicit prior → no tractable minimization problem minx f (x)+g(x) Eﬃcient denoisers are not non-expansive → Convergence guarantees 13/51

image restoration performance of PnP methods with deep denoisers IRCNN [Zhang et al. ’17], DPIR [Zhang et al. ’21] Implicit prior → no tractable minimization problem minx f (x)+g(x) Eﬃcient denoisers are not non-expansive → Convergence guarantees Relevant priors are not log-concave → Non-convex problems 13/51

Convergence of PnP methods Plug & Play algorithms • PnP-PGD:
xk+1 = Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) • PnP-HQS: xk+1 = Proxτf ◦Dσ (xk ) Convergence results Pseudo-linear denoisers Dσ (x) = Kσ (x)x [Sreehari et al. ’16, Romano et al. ’17, Chan et al. ’19, Gavaskar and Chaudhury ’20, Nair et al. ’21] Deep denoisers 14/51

Related works for deep denoisers Fixed point strategies • Denoisers
assumed non-expansive [Reehorst and Schniter ’18, Liu et al. ’21], ﬁrmly non-expansive [Terris et al. ’20, Sun et al. ’21] or averaged [Sun et al. ’19, Hertrich et al. ‘21, Bohra et al. ‘21] ||Dσ (x) − Dσ (y)|| ≤ ||x − y|| non-expansiveness can degrade denoising performances 15/51

assumed non-expansive [Reehorst and Schniter ’18, Liu et al. ’21], ﬁrmly non-expansive [Terris et al. ’20, Sun et al. ’21] or averaged [Sun et al. ’19, Hertrich et al. ‘21, Bohra et al. ‘21] ||Dσ (x) − Dσ (y)|| ≤ ||x − y|| non-expansiveness can degrade denoising performances • non-expansive residual Id −Dσ , L > 1 Lipschitz denoiser [Ryu et al. ’19] Can only handle strongly convex data terms f xk+1 = Dσ ◦ (Id −τ∇f )(xk ) 15/51

assumed non-expansive [Reehorst and Schniter ’18, Liu et al. ’21], ﬁrmly non-expansive [Terris et al. ’20, Sun et al. ’21] or averaged [Sun et al. ’19, Hertrich et al. ‘21, Bohra et al. ‘21] ||Dσ (x) − Dσ (y)|| ≤ ||x − y|| non-expansiveness can degrade denoising performances • non-expansive residual Id −Dσ , L > 1 Lipschitz denoiser [Ryu et al. ’19] Can only handle strongly convex data terms f xk+1 = Dσ ◦ (Id −τ∇f )(xk ) Implicit prior: no tractable minimization problem No explicit expression of g such that x∗ ∈ arg minx f (x) + g(x) 15/51

assumed non-expansive [Reehorst and Schniter ’18, Liu et al. ’21], ﬁrmly non-expansive [Terris et al. ’20, Sun et al. ’21] or averaged [Sun et al. ’19, Hertrich et al. ‘21, Bohra et al. ‘21] ||Dσ (x) − Dσ (y)|| ≤ ||x − y|| non-expansiveness can degrade denoising performances • non-expansive residual Id −Dσ , L > 1 Lipschitz denoiser [Ryu et al. ’19] Can only handle strongly convex data terms f xk+1 = Dσ ◦ (Id −τ∇f )(xk ) Implicit prior: no tractable minimization problem No explicit expression of g such that x∗ ∈ arg minx f (x) + g(x) Our objective: → 15/51

Overview Find x from observation y = Ax + ξ
Maximum A-Posteriori arg min x∈Rn f (x) + g(x) unknown prior p 16/51

Maximum A-Posteriori arg min x∈Rn f (x) + g(x) unknown prior p Plug-and-Play (PnP) xk+1 = Proxτf ◦Dσ (xk ) implicit prior no minimization problem no convergence guarantees SOTA restoration 16/51

3. Gradient Step Plug & Play

Gradient step denoiser and PnP Idea [Hurault et al ‘21]:
• Deﬁne the denoiser as a gradient descent step over a diﬀerentiable (non-convex) scalar function gσ : Rn → R: Dσ = Id −∇gσ • Proposed PnP algorithm, for τ > 0 xk+1 = Proxτf ◦(τDσ + (1 − τ) Id)(xk ) 17/51

Gradient step denoiser and PnP Idea [Hurault et al ‘21]:
• Deﬁne the denoiser as a gradient descent step over a diﬀerentiable (non-convex) scalar function gσ : Rn → R: Dσ = Id −∇gσ • Proposed PnP algorithm, for τ > 0 xk+1 = Proxτf ◦(τDσ + (1 − τ) Id)(xk ) Plugging the Gradient step denoiser Dσ xk+1 = Proxτf ◦(Id −τ∇gσ )(xk ) ⇔ Proximal Gradient Descent on the (non-convex) problem min x f (x) + gσ (x) 17/51

Maximum A-Posteriori arg min x∈Rn f (x) + g(x) unknown prior p Plug-and-Play (PnP) xk+1 = Proxτf ◦Dσ (xk ) implicit prior no minimization problem no convergence guarantees SOTA restoration 18/51

Maximum A-Posteriori arg min x∈Rn f (x) + g(x) unknown prior p Plug-and-Play (PnP) xk+1 = Proxτf ◦Dσ (xk ) implicit prior no minimization problem no convergence guarantees SOTA restoration GS-PnP x∗ ∈ arg min x∈Rn f (x) + gσ (x) explicit prior minimization problem ? convergence guarantees ? SOTA restoration 18/51

Convergence of non-convex PGD Minimization problem for a proper function
f min x f (x) + gσ (x) PnP algorithm xk+1 = Proxτf ◦(Id −τ∇gσ )(xk ) 19/51

f min x f (x) + gσ (x) PnP algorithm xk+1 = Proxτf ◦(Id −τ∇gσ )(xk ) Function value and residual convergence [Beck and Teboulle ’09] for τL < 1 • gσ diﬀerentiable with L-Lipschitz gradient • gσ bounded from below • f bounded from below, may be non-convex 19/51

f min x f (x) + gσ (x) PnP algorithm xk+1 = Proxτf ◦(Id −τ∇gσ )(xk ) Function value and residual convergence [Beck and Teboulle ’09] for τL < 1 • gσ diﬀerentiable with L-Lipschitz gradient • gσ bounded from below • f bounded from below, may be non-convex Convergence to stationary points [Attouch, Bolte and Svaiter ’13] • f and gσ satisﬁes the Kurdyka-Lojasiewicz (KL) • The iterates xk are bounded 19/51

Design of the potential gσ Gradient Step denoiser Dσ (x)
= x − ∇gσ (x) Parameterization of gσ to satisfy the required assumptions 20/51

= x − ∇gσ (x) Parameterization of gσ to satisfy the required assumptions • Convex scalar neural network gσ : Rn → R+ [Cohen et al. ’21] Limited denoising performances 20/51

= x − ∇gσ (x) Parameterization of gσ to satisfy the required assumptions • Convex scalar neural network gσ : Rn → R+ [Cohen et al. ’21] Limited denoising performances • Proposed non-convex potential gσ (x) = 1 2 ||x − Nσ (x)||2 with a neural network Nσ : Rn → Rn Dσ (x) = x − ∇gσ (x) = Nσ (x) + JNσ (x)T (x − Nσ (x)) Any eﬃcient denoising architecture Nσ (x) can be reused 20/51

Design of the potential gσ Parameterization of gσ gσ (x)
= 1 2 ||x − Nσ (x)||2 Nσ : Rn → Rn, neural network with C2 and Lipschitz gradient activations 21/51

= 1 2 ||x − Nσ (x)||2 Nσ : Rn → Rn, neural network with C2 and Lipschitz gradient activations Check the required assumptions gσ is bounded from below, C1 and with L > 1 Lipschitz gradient ||∇2gσ (x)||S 21/51

= 1 2 ||x − Nσ (x)||2 Nσ : Rn → Rn, neural network with C2 and Lipschitz gradient activations Check the required assumptions gσ is bounded from below, C1 and with L > 1 Lipschitz gradient ||∇2gσ (x)||S Coercivity to bound iterates xk [Laumont et al. ‘21] ˆ gσ (x) = gσ (x) + 1 2 ||x − ProjC (x)||2 for a large convex compact set C (never activated in practice) 21/51

Interpretation Gradient Step denoiser Dσ (x) = x − ∇
1 2 ||x − Nσ (x)||2 = Nσ (x) + JNσ (x)T (x − Nσ (x)) • Correct a denoiser Nσ (x) to make it a conservative vector ﬁeld RED [Romano et al. ’17] • Learn the reﬁtting of the denoiser Nσ (x) CLEAR [Deledalle et al.’17] 22/51

GS-DRUNet Gradient Step denoiser Dσ (x) = Nσ (x) +
JNσ (x)T (x − Nσ (x)) Architecture for Nσ : light version of DRUNet [Zhang et al’ 21] Training with L2 loss, for σ ∈ [0, 50/255]: L(σ) = Ex∼p,ξσ∼N(0,σ2) ||Dσ (x + ξσ ) − x||2 Training set: Berkeley, Waterloo, DIV2K and Flick2K 23/51

GS-DRUNet PSNR on the CBSD68 dataset σ(./255) 5 15 25
50 FFDNet 39.95 33.53 30.84 27.54 DnCNN 39.80 33.55 30.87 27.52 DRUNet 40.31 33.97 31.32 28.08 GS-DRUNet 40.26 33.90 31.26 28.01 DRUNet light 40.19 33.89 31.25 28.00 24/51

Maximum A-Posteriori arg min x∈Rn f (x) + g(x) unknown prior p Plug-and-Play (PnP) xk+1 = Proxτf ◦Dσ (xk ) implicit prior no minimization problem no convergence guarantees SOTA restoration GS-PnP x∗ ∈ arg min x∈Rn f (x) + gσ (x) explicit prior minimization problem convergence guarantees ? SOTA restoration 25/51

General image inverse problems Objective function for any degradation operator
A and noise level ν f (x) + gσ (x) = λ 2 ||Ax − y||2 + 1 2 ||Nσ (x) − x||2 GS-PnP algorithm xk+1 = Proxτf ◦(τDσ + (1 − τ) Id)(xk ) • Deep denoiser Dσ (x) = Nσ (x) + JNσ (x)T (x − Nσ (x)) • Proximal operator Proxτf (z) = τλAT A + Id −1 τλAT y + z • Backtracking procedure for τ [Hu et al. ‘22] - Rough empirical estimation of the lipschitz constant L - Not being stuck at the ﬁrst local minima • Automatic tuning via Reinforcement Lerning [Wei et. al ‘20’] 26/51

Image inverse problems Deblurring with convolution with periodic boundary conditions:
A = H diagonal in the Fourier basis = ∗ + 27/51

Deblurring: quantitative results (a) (b) (c) (d) (e) (f) (g)
(h) (i) (j) ν Method Avg 0.01 EPLL 28.32 28.24 28.36 25.80 29.61 27.15 26.90 26.69 25.84 26.49 27.34 RED 30.47 30.01 30.29 28.09 31.22 28.92 28.90 28.67 26.66 28.45 29.17 IRCNN 32.96 32.62 32.53 32.44 33.51 33.62 32.54 32.20 28.11 29.19 31.97 MMO 32.35 32.06 32.24 31.67 31.77 33.17 32.30 31.80 27.81 29.26 31.44 DPIR 33.76 33.30 33.04 33.09 34.10 34.34 33.06 32.77 28.34 29.16 32.50 GS-PnP 33.52 33.07 32.91 32.83 34.07 34.25 32.96 32.54 28.11 29.03 32.33 0.03 EPLL 25.31 25.12 25.82 23.75 26.99 25.23 25.00 24.59 24.34 25.43 25.16 RED 25.71 25.32 25.71 24.38 26.65 25.50 25.27 24.99 23.51 25.54 25.26 IRCNN 28.96 28.65 28.90 28.38 30.03 29.87 28.92 28.52 25.92 27.64 28.58 DPIR 29.38 29.06 29.21 28.77 30.22 30.23 29.34 28.90 26.19 27.81 28.91 GS-PnP 29.22 28.89 29.20 28.60 30.32 30.21 29.32 28.92 26.38 27.89 28.90 0.05 EPLL 24.08 23.91 24.78 22.57 25.68 23.98 23.70 23.19 23.75 24.78 24.04 RED 22.78 22.54 23.13 21.92 23.78 22.97 22.89 22.67 22.01 23.78 22.84 IRCNN 27.00 26.74 27.25 26.37 28.29 28.06 27.22 26.81 24.85 26.83 26.94 DPIR 27.52 27.35 27.73 27.02 28.63 28.46 27.79 27.30 25.25 27.11 27.42 GS-PnP 27.45 27.28 27.70 26.98 28.68 28.44 27.81 27.38 25.49 27.15 27.44 28/51

Deblurring: qualitative inspection Motion kernel (b) and ν = 0.03:
Clean Observed (20.97dB) RED (25.92dB) IRCNN (28.66dB) DPIR (29.76dB) GS-PnP (29.90dB) (f + gσ)(xk ) ||xk+1 − xk || 29/51

Deblurring: convergence inspection DPIR: 8 iterations with timestep σ decreasing
uniformly in log-scale from 50 to ν and τ ∝ σ2 Asymptotic behavior of DPIR: (i) Decreasing timestep along 1000 iterations (ii) Decreasing timestep on the 8 ﬁrst iterations and constant for the next 992 ones 30/51

Image inverse problems Super-resolution: A = SH ; eﬃcient closed-form
solution for Proxτf [Zhao et al. ’16] =         ∗         s + 31/51

Super-resolution: quantitative results Kernels Method s = 2 s =
3 Avg ν = 0.01 ν = 0.03 ν = 0.05 ν = 0.01 ν = 0.03 ν = 0.05 Bicubic 24.85 23.96 22.79 23.14 22.52 21.62 23.15 RED 28.29 24.65 22.98 26.13 24.02 22.37 24.74 IRCNN 27.43 26.22 25.86 26.12 25.11 24.79 25.92 DPIR 28.62 27.30 26.47 26.88 25.96 25.22 26.74 GS-PnP 28.77 27.54 26.63 26.85 26.05 25.29 26.86 Bicubic 23.38 22.71 21.78 22.65 22.08 21.25 22.31 RED 26.33 23.91 22.45 25.38 23.40 21.91 23.90 IRCNN 25.83 24.89 24.59 25.36 24.36 23.95 24.83 DPIR 26.84 25.59 24.89 26.24 24.98 24.32 25.48 GS-PnP 26.80 25.73 25.03 26.18 25.08 24.31 25.52 32/51

Super-resolution: qualitative inspection Isotropic kernel, s = 2 and ν
= 0.03: Clean Observed (20.97dB) RED (21.75dB) IRCNN (22.82dB) DPIR (23.97dB) GS-PnP (24.81dB) (f + gσ)(xk ) ||xk+1 − xk || 33/51

Image inverse problems Inpainting (ν = 0): A diagonal, with
values in {0, 1}, f (x) = ı{x | Ax=y} Proxτf (x) = Π{x | Ax=y} (x) = Ay − Ax + x = ⊗ 34/51

Inpainting 35/51

Maximum A-Posteriori arg min x∈Rn f (x) + g(x) unknown prior p Plug-and-Play (PnP) xk+1 = Proxτf ◦Dσ (xk ) implicit prior no minimization problem no convergence guarantees SOTA restoration GS-PnP x∗ ∈ arg min x∈Rn f (x) + gσ (x) explicit prior minimization problem convergence guarantees SOTA restoration 36/51

Convergence of PnP algorithms PnP algorithms • PnP-PGD: xk+1 =
Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) • PnP-HQS: xk+1 = Proxτf ◦Dσ (xk ) 3. Gradient step denoiser: PnP-HQS 37/51

Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) • PnP-HQS: xk+1 = Proxτf ◦Dσ (xk ) 3. Gradient step denoiser: PnP-HQS Equivalent to PGD: xk+1 = Proxτf ◦(Id −τ∇gσ )(xk ) 37/51

Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) • PnP-HQS: xk+1 = Proxτf ◦Dσ (xk ) 3. Gradient step denoiser: PnP-HQS Equivalent to PGD: xk+1 = Proxτf ◦(Id −τ∇gσ )(xk ) 4. What about other PnP algorithms? 37/51

4. Proximal Gradient Step Plug & Play

Proximal Gradient Step denoiser PnP algorithms • PnP-PGD: xk+1 =
Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) PGD/DRS rely on implicit gradient steps Dσ = Proxφσ 38/51

Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) PGD/DRS rely on implicit gradient steps Dσ = Proxφσ of proximal operators of convex functions φσ Firmly non-expansive operator: limited denoising performances 38/51

Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = (2 Proxτf − Id) ◦ (2Dσ − Id)(xk ) PGD/DRS rely on implicit gradient steps Dσ = Proxφσ of proximal operators of convex functions φσ Firmly non-expansive operator: limited denoising performances Idea [Hurault et al. ‘22] • Build on top of the explicit gradient step denoiser Dσ = Id −∇gσ • Make Id −∇gσ the proximal operator of a non-convex potential φσ 38/51

Characterisation of proximal operators Gradient step denoiser Dσ = Id
−∇gσ = ∇ 1 2 ||x||2 − gσ (x) := ∇hσ Proposition [Moreau’65, Gribonval and Nikolova ’20] If hσ is convex then there exists some function φσ such that Dσ ∈ Proxφσ (x) 39/51

Characterisation of proximal operators Gradient step denoiser Dσ = Id
−∇gσ = ∇ 1 2 ||x||2 − gσ (x) := ∇hσ Theorem [Gribonval ’11, Gribonval and Nikolova ’20] If gσ is Ck+1, k ≥ 1 and if ∇gσ is contractive (L < 1 Lipschitz), then • hσ is (1 − L) strongly convex • Dσ = Proxφσ is injective with φσ (x) := gσ (Dσ −1(x))) − 1 2 ||Dσ −1(x) − x||2 + Kσ if x ∈ Im(Dσ ) • φσ ≥ gσ (x) • φσ is L/(L + 1) semi-convex (φσ (x) + L 2(L+1) ||x||2 is convex) • φσ is Ck and with L 1−L Lipschitz gradient on Im(Dσ ) 40/51

Proximal Gradient Step denoiser Proposed Proximal gradient step denoiser •
Gradient step expression Dσ = Id −∇gσ • Contractive residual Id −Dσ = ∇gσ , similar to [Ryu et al ’19] The denoiser is not non-expansive Residual condition hard to satisfy exactly 41/51

Proximal Gradient Step denoiser Proposed Proximal gradient step denoiser •
Gradient step expression Dσ = Id −∇gσ • Contractive residual Id −Dσ = ∇gσ , similar to [Ryu et al ’19] The denoiser is not non-expansive Residual condition hard to satisfy exactly If gσ is C2: Id −Dσ contractive ⇔ ||JId −Dσ ||S < 1 Penalization of the spectral norm of the Jacobian [Gulrajani et al ’17, Terris et al. ’20] 41/51

Prox-DRUNet Architecture for Nσ : light version of DRUNet [Zhang
et al’ 21] Training, for σ ∈ [0, 25], with L2 loss and spectral norm penalization: LS (σ) = Ex∼p,ξσ∼N(0,σ2) ||Dσ (x + ξσ ) − x||2+µ max(||J(Id −Dσ) (x + ξσ )||S , 1− ) PSNR on the CBSD68 dataset σ(./255) 5 15 25 FFDNet 39.95 33.53 30.84 DnCNN 39.80 33.55 30.87 DRUNet 40.31 33.97 31.32 GS-DRUNet 40.27 33.92 31.28 prox-DRUNet (µ = 10−3) 40.12 33.60 30.82 prox-DRUNet (µ = 10−2) 40.04 33.51 30.64 nonexp-DRUNet (µ = 10−3) 39.71 33.50 30.89 nonexp-DRUNet (µ = 10−2) 34.92 31.42 29.42 42/51

Prox-DRUNet Posterior approximation of the Lipschitz constant with power iterations
• GS-DRUNet: L = maxx ||JId −Dσ(x) ||S • Prox-DRUNet: L = maxx ||JId −Dσ(x) ||S • nonexp-DRUNet: L = maxx ||JNσ(x) ||S σ(./255) 0 5 10 15 20 25 GS-DRUNet (µ = 0) 0.94 1.26 2.47 1.96 2.50 3.27 Prox-DRUNet (µ = 10−3) 0.86 0.94 0.97 0.98 0.99 1.19 Prox-DRUNet (µ = 10−2) 0.87 0.92 0.95 0.99 0.96 0.96 nonexp-DRUNet (µ = 10−3) 1.06 1.07 1.07 1.10 1.13 1.20 nonexp-DRUNet (µ = 10−2) 0.97 0.98 0.98 0.98 0.98 0.98 Empirical value L on the CBSD68 dataset 43/51

PnP algorithms with proximal gradient step denoiser Proximal gradient step
denoiser Dσ = Id −∇gσ = Proxφσ with φσ (x) := gσ (Dσ −1(x))) − 1 2 ||Dσ −1(x) − x||2 if x ∈ Im(Dσ ) +∞ otherwise Minimization objective min x f (x) + φσ (x) Proximal algorithms • PGD: xk+1 = Proxτφσ ◦ (Id −τ∇f )(xk ) • DRS xk+1 = (2 Proxτf − Id) ◦ (2Proxτφσ − Id)(xk ) 44/51

PnP algorithms with proximal gradient step denoiser Proximal gradient step
denoiser Dσ = Id −∇gσ = Proxφσ with φσ (x) := gσ (Dσ −1(x))) − 1 2 ||Dσ −1(x) − x||2 if x ∈ Im(Dσ ) +∞ otherwise Minimization objective min x f (x) + φσ (x) Proximal algorithms • PGD: xk+1 = Proxτφσ ◦ (Id −τ∇f )(xk ) • DRS xk+1 = (2 Proxτf − Id) ◦ (2Proxτφσ − Id)(xk ) Timestep has to be ﬁxed to τ = 1 44/51

Non-convex PGD Minimization objective min x f (x) + φσ
(x) PnP-PGD, for diﬀerentiable f xk+1 = Proxφσ ◦(Id −∇f )(xk ) Convergence (function value and iterates) [Attouch et al.’13, Li and Lin ’15, Beck ’17] • φσ and f bounded from below • φσ is L L+1 semi convex with L < 1 • f of class C1 with a Lf < L+2 L+1 -Lipschitz gradient 45/51

Non-convex PGD Minimization objective min x f (x) + φσ
(x) PnP-PGD, for differentiable f xk+1 = Proxφσ ◦(Id −∇f )(xk ) Convergence (function value and iterates) [Attouch et al.’13, Li and Lin ’15, Beck ’17] • φσ and f bounded from below • φσ is L L+1 semi convex with L < 1 • f of class C1 with a Lf < L+2 L+1 -Lipschitz gradient What about non differentiable data fidelity terms? 45/51

Non-convex DRS Minimization objective min x f (x) + φσ
(x) PnP-DRS xk+1 = (2 Proxφσ − Id) ◦ (2 Proxf − Id)(xk ) Reformulation (also equivalent to ADMM)      yk+1 = Proxφσ (xk ) zk+1 = Proxf (2yk+1 − xk+1 ) xk+1 = xk + (zk+1 − yk+1 ) 46/51

(x) PnP-DRS xk+1 = (2 Proxφσ − Id) ◦ (2 Proxf − Id)(xk ) Reformulation (also equivalent to ADMM)      yk+1 = Proxφσ (xk ) zk+1 = Proxf (2yk+1 − xk+1 ) xk+1 = xk + (zk+1 − yk+1 ) Douglas-Rachford Envelope as Lyapunov function FDR (x) = φσ (y) + f (z) + y − x, y − z + 1 2 ||y − z||2 Convergence for proper l.s.c. functions f and φσ if one function has a Lipschitz gradient with constant < 1 [Themelis and Patrinos ’20] 46/51

(x) PnP-DRS xk+1 = (2 Proxφσ − Id) ◦ (2 Proxf − Id)(xk ) Reformulation (also equivalent to ADMM)      yk+1 = Proxφσ (xk ) zk+1 = Proxf (2yk+1 − xk+1 ) xk+1 = xk + (zk+1 − yk+1 ) Douglas-Rachford Envelope as Lyapunov function FDR (x) = φσ (y) + f (z) + y − x, y − z + 1 2 ||y − z||2 Convergence for proper l.s.c. functions f and φσ if one function has a Lipschitz gradient with constant < 1 [Themelis and Patrinos ’20] φσ has a L 1−L Lipschitz gradient: the residual Id −Dσ must be L < 0.5-Lipschitz 46/51

Discussion on assumptions Inverse problem • f (x) = λ
2 ||Ax − y||2 • Dσ = Id −∇gσ , with (Id −Dσ ) L < 1-Lipschitz 47/51

Discussion on assumptions Inverse problem • f (x) = λ
2 ||Ax − y||2 • Dσ = Id −∇gσ , with (Id −Dσ ) L < 1-Lipschitz PnP-PGD ||∇f || = λ||At A|| < L+2 L+1 Solved with a α-relaxed PGD [Tseng ‘08, Hurault et. al ‘23]      qk+1 = (1 − α)yk + αxk xk+1 = Dσ (xk − λ∇f (qk+1 )) yk+1 = (1 − α)yk + αxk+1 PnP-DRS λ free Can deal with non diﬀerentiable data ﬁdelity terms f Restricted to denoisers with L < 0.5-Lipschitz residual 47/51

Quantitative evaluation PSNR averaged on CBSD68 over 10 blur kernels
(deblurring) and 4 kernels and 2 scales (super-resolution) Deblurring Super-resolution scale s = 2 scale s = 3 Noise level ν 0.01 0.03 0.05 0.01 0.03 0.05 0.01 0.03 0.05 IRCNN 31.42 28.01 26.40 26.97 25.86 25.45 25.60 24.72 24.38 DPIR 31.93 28.30 26.82 27.79 26.58 25.83 26.05 25.27 24.66 GS-PnP-HQS 31.70 28.28 26.86 27.88 26.81 26.01 25.97 25.35 24.74 Prox-PnP-PGD 30.91 27.97 26.66 27.68 26.57 25.81 25.94 25.20 24.62 Prox-PnP-DRS 31.54 28.07 26.60 27.93 26.61 25.79 26.13 25.29 24.67 Prox-PnP-αPGD 31.55 28.03 26.66 27.92 26.61 25.80 26.03 25.26 24.61 48/51

Deblurring Clean Observed Prox-PnP-PGD Prox-PnP-DRS (29.41dB) (29.65dB) mini≤k ||xi+1 −
xi ||2 Fλ,σ (xk ) 49/51

Super-resolution Clean Observed Prox-PnP-PGD Prox-PnP-DRS (27.27dB) (27.95dB) mini≤k ||xi+1 −
xi ||2 Fλ,σ (xk ) 50/51

Conclusion

Conclusion Contributions • Gradient based denoisers Dσ not non-expansive •
Convergent PnP schemes • Tractable (non-convex) optimization problems • State-of-the-art results for various imaging tasks 51/51

Thank you

• S. Hurault, A. Leclaire, N. Papadakis - Gradient Step
Denoiser for convergent Plug-and-Play - ICLR, 2022 • S. Hurault, A. Leclaire, N. Papadakis - Proximal Denoiser for Convergent Plug-and-Play Optimization with Nonconvex Regularization - ICML, 2022 • S. Hurault, A. Chambolle, A. Leclaire, N. Papadakis, - A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser, 2023.

Denoising prior Find x from observation y = x +
ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x]

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x] = arg max x∈Rn p(y|x)p(x) p(y) = arg min x∈Rn − log p(y|x) − log p(x) = arg min x∈Rn 1 2σ2 ||x − y||2 − log p(x) = Prox−σ2 log p (y)

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x] Bayes (implicit gradient): DMAP σ = Prox−σ2 log p

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x] Bayes (implicit gradient): DMAP σ = Prox−σ2 log p Tweedie (explicit gradient): DMMSE σ = Id −σ2∇(− log pσ )

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x] Bayes (implicit gradient): DMAP σ = Prox−σ2 log p Tweedie (explicit gradient): DMMSE σ = Id −σ2∇(− log pσ ) → Real denoiser Dσ ≈ DMAP σ or Dσ ≈ DMMSE σ

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x] Bayes (implicit gradient): DMAP σ = Prox−σ2 log p Tweedie (explicit gradient): DMMSE σ = Id −σ2∇(− log pσ ) → Real denoiser Dσ ≈ DMAP σ or Dσ ≈ DMMSE σ → A denoiser provides an implicit prior

ξ • Input (unknown) distribution of clean image p(x) • Gaussian noise ξ ∼ N(0, σ2 Id) • Noisy observation y ∼ pσ (y) = p ∗ N(0, σ2 Id) → Two optimal denoisers MAP estimator DMAP σ (y) = arg max x p(x|y) MMSE estimator DMMSE σ (y) = Ex∼p(x|y) [x] Bayes (implicit gradient): DMAP σ = Prox−σ2 log p Tweedie (explicit gradient): DMMSE σ = Id −σ2∇(− log pσ ) → Real denoiser Dσ ≈ DMAP σ or Dσ ≈ DMMSE σ → A denoiser provides an implicit prior → State-of-the-art denoisers are not non-expansive

PnP and RED algorithms Proximal algorithms • GD: xk+1 =
(Id −τ(∇f + λ∇g))(xk ) • HQS: xk+1 = Proxτg ◦ Proxτf (xk ) • PGD: xk+1 = Proxτg ◦(Id −τ∇f )(xk ) • DRS: xk+1 = 1 2 Id +1 2 (2 Proxτf − Id) ◦ (2 Proxτg − Id) (xk ) MAP denoiser Dσ (y) = Prox−σ2 log p (y) MMSE denoiser Dσ (y) = (Id +σ2∇ log pσ )(y) PnP algorithms [Venkatakrishnan et al. ‘13] Proxτg ← Dσ RED algorithms [Romano et al. ‘17] ∇g ← Id −Dσ

(Id −τ(∇f + λ∇g))(xk ) • HQS: xk+1 = Proxτg ◦ Proxτf (xk ) • PGD: xk+1 = Proxτg ◦(Id −τ∇f )(xk ) • DRS: xk+1 = 1 2 Id +1 2 (2 Proxτf − Id) ◦ (2 Proxτg − Id) (xk ) MAP denoiser Dσ (y) = Prox−σ2 log p (y) MMSE denoiser Dσ (y) = (Id +σ2∇ log pσ )(y) PnP algorithms [Venkatakrishnan et al. ‘13] Proxτg ← Dσ RED algorithms [Romano et al. ‘17] ∇g ← Id −Dσ • PnP-HQS: xk+1 = Dσ ◦ Proxτf (xk ) • PnP-PGD: xk+1 = Dσ ◦ (Id −τ∇f )(xk ) • PnP-DRS: xk+1 = 1 2 Id +1 2 (2 Proxτf − Id) ◦ (2Dσ − Id)(xk )

(Id −τ(∇f + λ∇g))(xk ) • HQS: xk+1 = Proxτg ◦ Proxτf (xk ) • PGD: xk+1 = Proxτg ◦(Id −τ∇f )(xk ) • DRS: xk+1 = 1 2 Id +1 2 (2 Proxτf − Id) ◦ (2 Proxτg − Id) (xk ) MAP denoiser Dσ (y) = Prox−σ2 log p (y) MMSE denoiser Dσ (y) = (Id +σ2∇ log pσ )(y) PnP algorithms [Venkatakrishnan et al. ‘13] Proxτg ← Dσ RED algorithms [Romano et al. ‘17] ∇g ← Id −Dσ We have Id −τ∇g = τDσ + (1 − τ) Id • RED-GD: xk+1 = (τDσ + (1 − τ) Id − τ∇f )(xk ) • RED-PGD: xk+1 = Proxτf ◦(τDσ + (1 − τ) Id)(xk )

Non-convex PGD: function value and residual convergence Theorem [Beck and
Teboulle ’09] Assuming • f : Rn → R ∪ {+∞} proper, convex, lower semicontinous • g : Rn → R diﬀerentiable with L-Lipschitz gradient • F = f + λg bounded from below Then, for τ < 1 λL , the iterates xk given by the PGD algorithm xk+1 = Proxτf ◦(Id −λτ∇g)(xk ) verify (i) (F(xk )) is non-increasing and converging. (ii) The residual ||xk+1 − xk || converges to 0 (iii) Cluster points of (xk ) are stationary points of F. Remark: Can be extended to non-convex data terms [Li and Lin ’15]

Non-convex PGD: iterates convergence Theorem [Attouch, Bolte and Svaiter ’13]
Assuming • Same as before + f and g verify the Kurdyka-Lojasiewicz (KL) property Then, for τ < 1 λL , if the sequence (xk ) given by xk+1 = Proxτf ◦(Id −λτ∇g)(xk ) is bounded, it converges, with ﬁnite length, to a critical point of f + λg. Go back

Non-convex Proximal Gradient Descent Theorem (adapted from [Attouch et al.’13,
Li and Lin ’15, Beck ’17]) For F = f + φσ , assume • φσ such that Dσ = Id −∇gσ = Proxφσ with gσ : Rn → R ∪ {+∞} of class C2 with L < 1-Lipschitz gradient and bounded from below • f : Rn → R ∪ {+∞} diﬀerentiable with Lf < 1-Lipschitz gradient and bounded from below Then the iterates xk+1 = Dσ ◦ (Id −∇f )(xk ) of PnP-PGD satisfy (i) (F(xk )) is non-increasing and converges (ii) The residual converges ||xk+1 − xk || convers to 0 (iii) Cluster points of (xk ) are stationary points of F (iv) If f and gσ are respectively KL and semi-algebraic, then if (xk ) is bounded, it converges, with ﬁnite length, to a stationary point of F Go back

Non-convex DRS Theorem (Adapted from [Themelis and Patrinos ’20]) For
F = f + φσ , assume • φσ such that Dσ = Id −∇gσ = Proxφσ with gσ : Rn → R ∪ {+∞} of class C2 with L-Lipschitz gradient (such that 2L3 + L2 + 2L − 1 < 0) and bounded from below • f : Rn → R ∪ {+∞} proper l.s.c and bounded from below Take the Douglas-Rachford Envelope as Lyapunov function FDR (x) = φσ(y) + f (z) + y − x, y − z + 1 2 ||y − z||2 Then the iterates of PnP-DRS satisfy (i) (FDR (xk )) is nonincreasing and converges (ii) The residual ||yk − zk || converges to 0 (iii) For any cluster point (y∗, z∗, x∗), y∗ and z∗ are stationary points of F (iv) If f and gσ are KL and semi-algebraic, then if (yk , zk , xk ) is bounded, it converges, and yk and zk converge to the same stationary point of F Go back

Numerical implementation Denoiser: Dσ (x) = Nσ (x) + JNσ
(x)T (x − Nσ (x)) Computing JNσ (x)T (x − Nσ (x)): N = DRUNet light ( x , sigma ) t o r c h . autograd . grad (N, x , g r a d o u t p u t s=x − N, c r e a t e g r a p h=True , o n l y i n p u t s=True ) [ 0 ] Go back

Fixed points of the denoiser Nσ (x) Initialization σ =
0 σ = 0.5 σ = 1 σ = 5 σ = 10 σ = 20 σ = 30 σ = 40

Learning the Gradient-step and Proximal denoisers Our denoisers write Dσ
= Id −∇gσ with gσ (x) = 1 2 ||x − Nσ (x)||2 with a C1 or C2 neural network Nσ : Rn → Rn (DRUNet [Zhang et al. ’21] with ELU) Dσ (x) = x − ∇gσ (x) = Nσ (x) + JNσ (x)T (x − Nσ (x)) Training loss: GS-Denoiser - Prox-Denoiser arg min Param(Dσ) Ex,ξσ ||Dσ (x + ξσ ) − x||2 + µ max(||∇2gσ (x + ξσ )||S , 1 − ) σ(./255) 5 15 25 DRUNet 40.19 33.89 31.25 GS-Denoiser 40.26 33.90 31.26 Prox-Denoiser 40.12 33.60 30.82 Table 1: Denoising PSNR on the CBSD68 dataset, for various σ.

Learning the Gradient-step and Proximal denoisers Our denoisers write Dσ
= Id −∇gσ with gσ (x) = 1 2 ||x − Nσ (x)||2 with a C1 or C2 neural network Nσ : Rn → Rn (DRUNet [Zhang et al. ’21] with ELU) Dσ (x) = x − ∇gσ (x) = Nσ (x) + JNσ (x)T (x − Nσ (x)) Training loss: GS-Denoiser - Prox-Denoiser arg min Param(Dσ) Ex,ξσ ||Dσ (x + ξσ ) − x||2 + µ max(||∇2gσ (x + ξσ )||S , 1 − ) σ(./255) 5 15 25 GS-DRUNet 1.26 1.96 3.27 Prox-DRUNet 0.92 0.99 0.96 Table 1: maxx ||∇2gσ (x)||S on the CBSD68 dataset, for various σ.

Averaged operator theory [Bauschke & Combettes ‘11] Deﬁnition T :
Rn → Rn and θ ∈ (0, 1). T is θ-averaged if there is a nonexpansive operator R s.t. T = θR + (1 − θ) Id . • This is equivalent to ∀x, y ∈ Rn, T(x)−T(y) 2 + 1−θ θ (Id −T)(x)−(Id −T)(y) 2 ≤ x −y 2. • T θ-averaged =⇒ T nonexpansive. • 1/2-averaged = “ﬁrmly nonexpansive”.

Averaged operator theory [Bauschke & Combettes ‘11] Definition T :
Rn → Rn and θ ∈ (0, 1). T is θ-averaged if there is a nonexpansive operator R s.t. T = θR + (1 − θ) Id . • This is equivalent to ∀x, y ∈ Rn, T(x)−T(y) 2 + 1−θ θ (Id −T)(x)−(Id −T)(y) 2 ≤ x −y 2. • T θ-averaged =⇒ T nonexpansive. • 1/2-averaged = “firmly nonexpansive”. Theorem (Krasnosel’ski˘ ı-Mann) Let T : Rn → Rn be a θ-averaged operator such that Fix(T) = ∅. Then the sequence xk+1 = T(xk ) converges to a fixed point of T.

Averaged operator theory [Bauschke & Combettes ‘11] Proposition Let T1
θ1 -averaged and T2 θ2 -averaged, with any θ1 , θ2 ∈ (0, 1). • Then T1 ◦ T2 is θ-averaged with θ = θ1+θ2−2θ1θ2 1−θ1θ2 ∈ (0, 1). • For α ∈ [0, 1], αT1 + (1 − α) Id is αθ1 -averaged. Proposition If f : Rn → R is convex and L-smooth (i.e. diﬀerentiable with L-Lipschitz gradient) • Proxτf is τL 2(1+τL) -averaged for any τ > 0. • Id −τ∇f is τL 2 -averaged for τ < 2 L .

Averaged operator theory for PnP convergence PnP algorithms: xk+1 =
TPnP (xk ) with TPnP =      THQS = Dσ ◦ Proxτf TPGD = Dσ ◦ (Id −τ∇f ) TDRS = 1 2 Id +1 2 (2Dσ − Id) ◦ (2 Proxτf − Id) Theorem If f : Rn→R convex, L-smooth and Dσ isθ-averaged, θ ∈ (0, 1) • PnP-HQS converges towards a fixed point of THQS . • If τL < 2, PnP-PGD converges towards a fixed point of TPGD . • If θ ≤ 1/2, PnP-DRS converges towards a fixed point of TDRS .

Averaged operator theory for PnP convergence PnP algorithms: xk+1 =
TPnP (xk ) with TPnP =      THQS = Dσ ◦ Proxτf TPGD = Dσ ◦ (Id −τ∇f ) TDRS = 1 2 Id +1 2 (2Dσ − Id) ◦ (2 Proxτf − Id) Theorem If f : Rn→R convex, L-smooth and Dσ isθ-averaged, θ ∈ (0, 1) • PnP-HQS converges towards a fixed point of THQS . • If τL < 2, PnP-PGD converges towards a fixed point of TPGD . • If θ ≤ 1/2, PnP-DRS converges towards a fixed point of TDRS . Does not extend to nonconvex data-fidelity terms f . • If f is L-smooth and strongly convex, Proxτf is contractive, and for τL < 2, Id −τ∇f is contractive → assume Dσ (1 + )-Lipshitz [Ryu et. al, ’19].

How to build averaged deep denoisers ? Average operator: Dσ
= θRσ + (1 − θ) Id with Rσ non-expansive. → How to train non-expansive networks ? • Spectral normalization. [Miyato et. al ‘18, Ryu et. al ‘19] Lipschitz constant 1 for large networks. Does not allow skip connexions.

= θRσ + (1 − θ) Id with Rσ non-expansive. → How to train non-expansive networks ? • Spectral normalization. [Miyato et. al ‘18, Ryu et. al ‘19] Lipschitz constant 1 for large networks. Does not allow skip connexions. • Deep spline neural networks [Goujon, Neumayer et. al, ‘22] • Convolutional Proximal Neural Networks [Hertrich et. al, ‘20] • Soft regularization of the training loss [Terris et. al ’21]. • Dσ = Id −∇gσ with gσ Input Convex Neural Network (ICNN) [Meunier et. al ‘21].

= θRσ + (1 − θ) Id with Rσ non-expansive. → How to train non-expansive networks ? • Spectral normalization. [Miyato et. al ‘18, Ryu et. al ‘19] Lipschitz constant 1 for large networks. Does not allow skip connexions. • Deep spline neural networks [Goujon, Neumayer et. al, ‘22] • Convolutional Proximal Neural Networks [Hertrich et. al, ‘20] • Soft regularization of the training loss [Terris et. al ’21]. • Dσ = Id −∇gσ with gσ Input Convex Neural Network (ICNN) [Meunier et. al ‘21]. Non-expansiveness can harm denoising performance.

Nonexpansive convolutional neural networks [Terris et al., 2021] Idea: •
Build a nonexpansive convolutional neural network (CNN) D = TM ◦ · · · ◦ T1 with Tm (x) = Rm (Wm x + bm ) where Rm is an (averaged) activation function, Wm a convolution, and bm a bias. • We want to have D = Id +Q 2 with Q nonexpansive. • During training, the Lipschitz constant of 2D − Id is penalized.

Convolution proximal neural networks [Hertrich et al., 2022] Idea: Build
a convolutional proximal neural network (cPNN) Φu = TM ◦ · · · ◦ T1 with Tm (x) = W T m σm (Wm x + bm ) where u = (Wm , σm , bm )1≤m≤M is a collection of parameters. The linear operators Wm (or W T m ) are convolutions lying in a Stiefel manifold St(d, n) = { W ∈ Rn×d | W T W = Id }. The resulting denoiser is then D = Id −γΦu . • Ideally, Φu is a composition of M firmly non-expansive operators, thus averaged. • In practice, Wm is a convolution with limited filter length. • Condition Wm ∈ St is approximated with a term W T m Wm − Id 2 F in the learning cost. • Φu is verified in practice to be t-averaged with t close to 1 2 .

Deep Spline Neural Networks [Goujon, Neumayer et al., 2022] Idea:
Approximate the proximal operator of a convex-ridge regularizer R(x) = P p=1 i ψp (hp ∗ x(i)) hp are convolution kernels, ψp are particular C1 convex functions. Given a noisy y, ProxλR (y) = arg min x∈Rn 1 2 x − y 2 + λR(x) is approximated with t iterations of the gradient-step x → x − α((x − y) + λ∇R(x)). The output after t iterations is denoted by Tt R,λ,α (y). • Tt R,λ,α approximates the prox of a convex function • Linear spline parameterization of ψp justiﬁed by a density result

Convergent Plug & Play schemes for Inverse Prob...

Convergent Plug & Play schemes for Inverse Problems in Imaging

More Decks by npapadakis

Other Decks in Research

Featured

Transcript