Paper Walkthrough: Latent Diffusion Model

Paper Walkthrough: High-Resolution Image Synthesis with Latent Diffusion Models
Takehiro Matsuda

2 Explosion of Diffusion Model
https://www.youtube.com/watch?v=Bo3VZCjDhGI
This video was created using 36 consecutive phrases that define the visual narrative.
Additional training of Stable Diffusion on "Irasutoya" illustrations: https://tadaoyamaoka.hatenablog.com/entry/2022/09/18/134024
https://memeplex.app/ A web service released in Japan (several AI models can be selected)
More and more ...

3 Paper information
• Title: High-Resolution Image Synthesis with Latent Diffusion Models
• Paper: https://arxiv.org/abs/2112.10752
• Code: https://github.com/CompVis/latent-diffusion
• Venue: CVPR 2022
• Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
• Affiliations: Ludwig Maximilian University of Munich & IWR, Heidelberg University; Runway ML
Why I chose this paper:
• It is the predecessor of Stable Diffusion, which has been getting a lot of attention recently
• I wanted to understand the basic principles and structure of diffusion models

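4 Latent Diffusion Model Overview
The paper presents an architecture that keeps high image-generation quality while reducing computational cost, and that can be used for a variety of tasks.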

5 First, the basic diffusion model is explained, based on the paper "Denoising Diffusion Probabilistic Models" (DPM).

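6 Basic of Diffusion Model
Forward trajectory (diffusion): the diffusion process
Reverse trajectory (denoising): the data generation process
Define a Markov process $q(x_t \mid x_{t-1})$ that gradually transforms a complex distribution $x_0$ into a simple distribution $x_T$.
By learning a process $p_\theta(x_{t-1} \mid x_t)$ that inverts this transformation, we can start from a sample of the simple distribution $x_T$ and generate meaningful data close to the training dataset.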
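The forward trajectory can be sketched in a few lines. This is a minimal illustration, assuming the Gaussian transition $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ from the Denoising Diffusion Probabilistic Models paper; the function names, the constant schedule, and the 1-D toy data are assumptions made only for this example.

```python
import math
import random

def forward_step(x, beta, rng):
    """One forward diffusion step: sample x_t ~ N(sqrt(1 - beta) * x_{t-1}, beta * I)."""
    scale = math.sqrt(1.0 - beta)
    sigma = math.sqrt(beta)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x]

def diffuse(x0, betas, rng):
    """Apply the forward step once per entry in betas and return x_T."""
    x = list(x0)
    for beta in betas:
        x = forward_step(x, beta, rng)
    return x

rng = random.Random(0)
# Hypothetical "complex" data: 1000 scalar points clustered tightly around 5.0.
x0 = [5.0 + 0.1 * rng.gauss(0.0, 1.0) for _ in range(1000)]
betas = [0.05] * 200  # assumed constant schedule, chosen only for illustration

xT = diffuse(x0, betas, rng)
mean_T = sum(xT) / len(xT)
var_T = sum((v - mean_T) ** 2 for v in xT) / len(xT)
print(f"mean of x_T: {mean_T:.3f}, variance of x_T: {var_T:.3f}")
```

After enough steps the points have forgotten where they started and look like draws from a standard normal distribution, which is the "simple distribution" the reverse process starts from.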

7 Toy Diffusion Model
Following the article below, Pikachu and Eevee, represented as point clouds, are reproduced with a diffusion model.
A Toy Diffusion model you can run on your laptop
https://medium.com/mlearning-ai/a-toy-diffusion-model-you-can-run-on-your-laptop-20e9e5a83462

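8 Diffusion Model Process
A small amount of Gaussian noise is added at each step.
If $\beta_t$ is sufficiently small, the reverse process can be expressed in the same functional form as the forward process (Kolmogorov equation).
The problem then becomes estimating the mean $\mu_\theta(x_t, t)$ and variance $\Sigma_\theta(x_t, t)$ of $p_\theta(x_{t-1} \mid x_t)$.
Objective function (equation shown on the original slide)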
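For reference, the Gaussian forms behind this slide can be written out as follows. The equations follow the notation of the Denoising Diffusion Probabilistic Models paper rather than anything recoverable from the slide text; $\alpha_t$, $\bar\alpha_t$, $\epsilon_\theta$ and $L_{\mathrm{simple}}$ are that paper's symbols and are not defined on the slide itself.

```latex
\begin{align}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \\
p_\theta(x_{t-1} \mid x_t) &= \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\, I\right),
\qquad \alpha_t = 1-\beta_t,\quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s \\
L_{\mathrm{simple}}(\theta) &= \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\rVert^2\right]
\end{align}
```

The last line is the simplified training objective of that paper: the network is trained to predict the noise $\epsilon$ that was mixed into $x_0$, and the mean $\mu_\theta$ is then derived from that prediction.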

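9 Implementation of DPM
The paper describes the network structure only briefly; the GitHub implementation has the following characteristics:
• A U-Net-shaped network built from ResNet blocks
• Self-attention
• The timestep is encoded with a sinusoidal position embedding
https://github.com/hojonathanho/diffusion/blob/master/diffusion_tf/models/unet.py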
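The sinusoidal embedding of the timestep can be sketched as below. This is an illustrative version in the style of the Transformer position embedding, not a copy of the linked repository; the function name, the `max_period` default, and the exact frequency spacing are assumptions.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of an integer timestep t into `dim` values (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decreasing toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    angles = [t * f for f in freqs]
    # First half holds the sines, second half the cosines.
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

emb0 = timestep_embedding(0, 8)
emb10 = timestep_embedding(10, 8)
print(emb0)
print(emb10)
```

Because each timestep maps to a distinct vector of bounded values, the same U-Net weights can be shared across all t while still being told which noise level they are denoising.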

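10 Example of frozen for t
The larger the t at which the samples branch, the more the variation shows up in global structure; the closer the branch point is to t=0, the finer the variation.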

11 Example of Interpolation
Take two images, compute the corresponding noised versions, and mix them to obtain a generated image that blends the two.
Interpolation images with 500 timesteps of diffusion
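The mixing step can be sketched as follows. This is only an outline under stated assumptions: it uses the closed-form noising $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t) I)$ from the DPM paper, the two "images" are hypothetical four-value lists, and the trained reverse process that would turn each mixture back into an image is not implemented here.

```python
import math
import random

def noise_to_step(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x0]

def lerp(xa, xb, lam):
    """Linear interpolation (1 - lam) * xa + lam * xb, element by element."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(xa, xb)]

rng = random.Random(0)
# Hypothetical stand-ins for two flattened images.
image_a = [0.0, 0.5, 1.0, 0.5]
image_b = [1.0, 0.5, 0.0, 0.5]
alpha_bar_t = 0.5  # assumed noise level at the chosen timestep

xt_a = noise_to_step(image_a, alpha_bar_t, rng)
xt_b = noise_to_step(image_b, alpha_bar_t, rng)

# Mixed noisy samples for several interpolation weights. A trained reverse
# process (not implemented here) would denoise each one into a blended image.
mixed = [lerp(xt_a, xt_b, lam) for lam in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Choosing a larger timestep before mixing leaves less of each source image in the mixture, so the denoised results vary more freely between the two endpoints.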

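12 Diffusion Model vs. GAN
Diffusion models compared with GANs
Strengths
• Better than GANs at generating diverse data
• Training is stable
Weaknesses
• Training and generation take a long time (training takes 150-1000 V100 GPU-days; generating 50k samples takes 5 days on a single A100 [15])
• The latent variables are high-dimensional
[15] Diffusion models beat gans on image synthesis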

13 Next, the main topic: "High-Resolution Image Synthesis with Latent Diffusion Models" (LDM).

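14 Latent Diffusion Models basic idea
Processing directly in pixel space makes training and inference (generation) slow.
Applying the diffusion model in the latent space obtained from an autoencoder makes it possible to generate diverse data quickly.
In addition, introducing cross-attention that connects the model to a task-specific conditioning network enables generation based on inputs such as text (prompts).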

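15 Latent Diffusion Models concept
An autoencoder provides a lower-dimensional representation space (the latent space) that is perceptually equivalent to the data space.
The diffusion model is trained in this latent space: the input data x is passed through the encoder to become features in the latent space, and the diffusion process is applied there.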
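To make the size reduction concrete, here is a small arithmetic sketch. The LDM paper describes the encoder as downsampling an H x W x 3 image by a factor f = H/h = W/w into an h x w x c latent; the helper names and the example values (256 x 256 image, f = 8, c = 4) are assumptions chosen for illustration.

```python
def latent_shape(height, width, f, c):
    """Shape (h, w, c) of the latent for an H x W image under downsampling factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // f, width // f, c)

def compression_ratio(height, width, f, c, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, _ = latent_shape(height, width, f, c)
    return (height * width * image_channels) / (h * w * c)

shape = latent_shape(256, 256, f=8, c=4)       # assumed example values
ratio = compression_ratio(256, 256, f=8, c=4)
print(shape, ratio)
```

With these example values the diffusion model works on 48 times fewer values per sample than it would in pixel space, which is where the savings in training and generation time come from.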

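16 LDM Architecture
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) \cdot V$
Figure labels: latent-space denoising data; value passed through the conditioning encoder. Each W is a learnable parameter.
The latent-space diffusion model $\epsilon_\theta$ and the task-specific $\tau_\theta$ are trained jointly.
The roles (phases) of the network are split into three parts, which are connected as a single network:
• Encoder-Decoder
• Diffusion model
• Conditioning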
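The attention formula on this slide can be sketched in plain Python. This is a toy illustration only: the learned projection matrices W are left out, matrices are lists of rows, and the sizes (three latent positions as queries, two prompt tokens as keys and values) are assumptions for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Hypothetical toy inputs: queries come from the latent, keys/values from the conditioning.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.0]]  # identical keys, so every query attends uniformly
V = [[2.0, 0.0, 4.0], [4.0, 2.0, 0.0]]
out = attention(Q, K, V)
print(out)
```

In cross-attention the output has one row per latent position, so the conditioning can be injected into the U-Net without changing the spatial layout of its feature maps.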

17 Text-to-image training flow
[Figure: training flow from an image and its caption ("dog, flying disc, beach"); the caption is encoded by a Transformer]
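The training flow in this figure corresponds to the conditional objective given in the LDM paper. The equation is written here from the paper, not recovered from the slide: $\mathcal{E}$ is the autoencoder's encoder, $z_t$ is the noised latent of $\mathcal{E}(x)$ at timestep t, and $y$ is the conditioning input (here, the caption).

```latex
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}
\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\ t,\ \tau_\theta(y)\right)\right\rVert_2^2\right]
```

Both $\epsilon_\theta$ and $\tau_\theta$ receive gradients from this single loss, which is what training them jointly means in practice.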

18 Unconditional image synthesis sample

19 Result Score of unconditional image synthesis

20 Text-to-image
"we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]. We employ the BERT-tokenizer [14] and implement $\tau_\theta$ as a transformer [97] to infer a latent code which is mapped into the UNet via (multi-head) cross attention."

21 Layout-to-Image Synthesis

22 Super Resolution, Inpainting

23 Score of super resolution

24 Score of inpainting

25 Score of super resolution, inpainting
In the quantitative evaluation of super resolution and inpainting, LDM did not outperform prior methods, but it was rated highly in human subjective evaluation.

26 References
• Denoising Diffusion Probabilistic Models https://arxiv.org/abs/2006.11239
• Improved Precision and Recall Metric for Assessing Generative Models https://arxiv.org/abs/1904.06991
• [AI paper walkthrough] Image generation methods grounded in physics, Part 1: Diffusion Probabilistic Models https://www.youtube.com/watch?v=DDGgKt_CyRQ
• [AI paper walkthrough] Image generation methods grounded in physics, Part 2: Diffusion Probabilistic Models https://www.youtube.com/watch?v=G4tGMueM6lg
• [Deep Learning training (advanced)] Machine learning for data generation and transformation, Session 7, Part 1: "Diffusion models" https://www.youtube.com/watch?v=10ki2IS55Q4
• A Toy Diffusion model you can run on your laptop https://medium.com/mlearning-ai/a-toy-diffusion-model-you-can-run-on-your-laptop-20e9e5a83462
• Ultimate Guide to Diffusion Models | ML Coding Series | Denoising Diffusion Probabilistic Models https://www.youtube.com/watch?v=y7J6sSO1k50
• Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | ML Coding Series https://www.youtube.com/watch?v=f6PtJKdey8E

koharite
