Paper Commentary: ControlNet

Paper commentary: Adding Conditional Control to Text-to-Image Diffusion Models
Takehiro Matsuda

2 Paper information
Title: Adding Conditional Control to Text-to-Image Diffusion Models
• Paper: https://arxiv.org/abs/2302.05543
• Code: https://github.com/lllyasviel/ControlNet
• Venue: -
• Authors: Lvmin Zhang and Maneesh Agrawala
• Affiliation: Stanford University
Why I chose this paper:
• Generative DNNs are advancing rapidly. This method looked simple yet powerful as a way to produce the output a user wants.

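3 Purpose of ControlNet
DNN models such as Stable Diffusion, trained on large-scale data, have been released that generate high-definition images from a user's prompt (text). However, it is hard to specify things like the subject's pose or the image composition in text and get the result you want.
The paper proposes a method that adds input conditions with a relatively small amount of training, without breaking the original high-performance DNN model. (Image-to-image translation is easily influenced by overall tendencies such as color.)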

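4 Introduction
Dataset used for text-to-image generation such as Stable Diffusion: LAION-5B. Data size: 5 billion.
Data for the conditions a user wants, such as pose (object shape/normal, pose understanding, etc.). Data size: usually 100k or less.
The gap in data size is about 5 x 10^4, so naive fine-tuning may overfit and lose properties such as the diversity of generated images. In addition, many people do not have an environment that can train large networks on large-scale data.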

5 Stable Diffusion
A model built on the LDM architecture (introduced two sessions ago) and trained on the LAION-5B dataset. Given a prompt (text), it can generate high-definition, diverse images. Because it is open source, many derivative models have appeared.
ver 1.x: 860M UNet and CLIP ViT-L/14 text encoder
• Encoder blocks: 12
• Decoder blocks: 12
• Middle block: 1
• 8 of the blocks are down-sampling or up-sampling convolution layers.
• 17 of the blocks are main blocks, each with 4 resnet layers and 2 Vision Transformers.
A 512 x 512 image is converted into a 64 x 64 latent image.
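The block counts and the latent size quoted on this slide can be cross-checked with a little bookkeeping. This is only a sketch using the numbers from the slide; the variable names are illustrative and not taken from any Stable Diffusion code base.

```python
# Numbers quoted on the slide; names are illustrative only.
encoder_blocks, decoder_blocks, middle_blocks = 12, 12, 1
total_blocks = encoder_blocks + decoder_blocks + middle_blocks

resampling_blocks = 8   # down-sampling or up-sampling convolution layers
main_blocks = 17        # blocks with 4 resnet layers and 2 Vision Transformers

# The two ways of counting should agree.
counts_agree = total_blocks == resampling_blocks + main_blocks

# Spatial reduction from the input image to the latent image.
image_size, latent_size = 512, 64
downscale = image_size // latent_size

print(total_blocks, counts_agree, downscale)
```

The 8x spatial reduction is the reason the condition image also has to be brought down to 64 x 64 before it can be added to the UNet features.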

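6 ControlNet Network Architecture
• The parameters of the original network are locked and left unchanged.
• A trainable 1x1 convolution layer is added, initialized with weight = 0 and bias = 0.
• The trainable blocks start from parameters copied from the corresponding blocks of the pretrained network.
• The input condition is an image representing what the user wants.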

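7 ControlNet module
$y_c = F(x; \Theta) + Z(F(x + Z(c; \Theta_{z1}); \Theta_c); \Theta_{z2})$
Here $F(\cdot; \Theta)$ is the locked block, $\Theta_c$ the parameters of its trainable copy, $Z$ the zero convolution with parameters $\Theta_{z1}$ and $\Theta_{z2}$, and $c$ the condition.
In the initial state:
① $Z(c; \Theta_{z1}) = 0$
② $F(x + Z(c; \Theta_{z1}); \Theta_c) = F(x; \Theta_c) = F(x; \Theta)$
③ $Z(F(x + Z(c; \Theta_{z1}); \Theta_c); \Theta_{z2}) = Z(F(x; \Theta_c); \Theta_{z2}) = 0$
So in the initial state $y_c = y$: ControlNet does not act on the output at all, and the performance of the original network is preserved.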
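The behaviour at initialization can be illustrated with a toy version of the module. This is a minimal sketch, not the real implementation: `block` is a made-up elementwise stand-in for a UNet block, the 1x1 convolution is reduced to a single-channel affine map, and all names and values are assumptions chosen for illustration.

```python
import math

def block(x, theta):
    """Toy stand-in for a network block F(x; theta); not a real UNet block."""
    return [math.tanh(theta * v) for v in x]

def zero_conv(x, weight, bias):
    """Single-channel 1x1 convolution, i.e. an elementwise affine map."""
    return [weight * v + bias for v in x]

def controlnet_block(x, c, theta, theta_c, z1, z2):
    """y_c = F(x; theta) + Z(F(x + Z(c; z1); theta_c); z2)."""
    y = block(x, theta)                                   # locked path
    cond = zero_conv(c, *z1)                              # Z(c; z1)
    h = block([a + b for a, b in zip(x, cond)], theta_c)  # trainable copy
    out = zero_conv(h, *z2)                               # Z(...; z2)
    return [a + b for a, b in zip(y, out)]

x = [0.5, -1.0, 2.0]   # toy feature map
c = [1.0, 1.0, 0.0]    # toy condition
theta = 0.7
theta_c = theta        # the trainable copy starts from the locked parameters

y = block(x, theta)
# Both zero convolutions start with weight = 0 and bias = 0.
y_c = controlnet_block(x, c, theta, theta_c, z1=(0.0, 0.0), z2=(0.0, 0.0))
print(y_c == y)
```

With both zero convolutions at their initial values the added branch contributes exactly zero, so the output equals that of the locked block; once the zero-convolution weights move away from zero, the condition starts to influence the output.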

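8 Training with Zero convolution
Zero-initializing a DNN often prevents training from making progress, but in this case the weights W can still be updated. For y = Wx + b, the gradient with respect to W is ∂y/∂W = x, which does not depend on W and is nonzero whenever the input x is nonzero. After the first gradient step W is no longer zero, and from then on gradients also flow back to the input.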
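A scalar example makes the argument concrete. This is a minimal sketch under stated assumptions: the 1x1 convolution is reduced to a scalar y = w*x + b, the squared-error loss and the values of `x`, `target`, and `lr` are arbitrary choices for illustration, and the gradients are written out by hand.

```python
def grads(w, b, x, target):
    """Gradients of L = 0.5 * (w*x + b - target)**2 for a scalar 1x1 'convolution'."""
    dy = (w * x + b) - target           # dL/dy
    return {"w": dy * x, "b": dy, "x": dy * w}

w, b = 0.0, 0.0                         # zero initialization
x, target, lr = 2.0, 1.0, 0.1           # arbitrary illustrative values

g0 = grads(w, b, x, target)
# dL/dw = dy * x is nonzero even though w == 0; only dL/dx = dy * w vanishes.
w, b = w - lr * g0["w"], b - lr * g0["b"]

g1 = grads(w, b, x, target)             # after one step w != 0, so dL/dx is nonzero too
print(g0["w"], w, g1["x"])
```

The first step moves the weight away from zero using only the input-dependent gradient, after which the layer passes gradients back like any other convolution.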

9 Input convert in ControlNet
Stable Diffusion converts a 512 x 512 image into a 64 x 64 latent image. To bring the user input down to the 64 x 64 size of the latent space, four convolution layers with 4 x 4 kernels and 2 x 2 strides (channels: 16, 32, 64, 128) are introduced.
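To see how 4 x 4 kernels with stride 2 shrink the condition image, the standard convolution output-size formula can be applied. This is a minimal sketch: the padding of 1 is an assumption (the slide does not state it), and `conv_out_size` is an illustrative helper, not a function from the ControlNet repository. Under this assumption each such layer halves the resolution, so the 8x reduction from 512 to 64 corresponds to three halvings; a fourth halving would give 32, so exactly how the four listed layers are arranged to land on 64 x 64 is not specified here.

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    """Standard output-size formula for a strided convolution (padding is assumed)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 512
sizes = [size]
for _ in range(3):                  # three halvings take 512 down to 64
    size = conv_out_size(size)
    sizes.append(size)

print(sizes)
```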

10 Learning objective of the entire diffusion model
• diffusion network: $\epsilon_\theta$
• time step: $t$
• text prompts: $c_t$
• task-specific conditions: $c_f$
• noisy image: $z_t$
During training, the text prompt $c_t$ is randomly replaced with an empty string with 50% probability. This facilitates the model's ability to recognize the semantic content of the task-specific conditions.
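The symbols listed here fit the usual noise-prediction objective of diffusion models, $L = \mathbb{E}[\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \rVert_2^2]$, in which the network predicts the noise $\epsilon$ that was added to the latent. The 50% prompt replacement can be sketched as below. The helper name `maybe_drop_prompt`, the example prompt, and the seed are assumptions for illustration only; this is not the authors' training code.

```python
import random

def maybe_drop_prompt(prompt, rng, drop_prob=0.5):
    """Return an empty string with probability drop_prob, else the original prompt."""
    return "" if rng.random() < drop_prob else prompt

rng = random.Random(0)   # seeded only to make the sketch reproducible
prompts = [maybe_drop_prompt("a house by a lake", rng) for _ in range(10_000)]
empty_fraction = prompts.count("") / len(prompts)

print(f"empty prompts: {empty_fraction:.2f}")
```

When the prompt is empty the only remaining signal about image content is the condition image, which pushes the network to read semantics from it.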

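11 Training for ControlNet
The parameters of the original network (Stable Diffusion) are locked, and training proceeds from a small amount of user conditioning data. Training with ControlNet needs about 23% more GPU memory and 34% more time in each training iteration than training without it (as tested on a single Nvidia A100 PCIE 40G).
Improved Training
• Small-Scale Training: by default ControlNet is connected to "SD Middle Block" and "SD Decoder Block 1,2,3,4". Disconnecting "SD Decoder Block 1,2,3,4" lets training run on an RTX 3070TI laptop GPU and makes it about 1.6 times faster.
• Large-Scale Training: if 8 or more Nvidia A100 80G GPUs and 1 million or more training samples are available, the risk of overfitting is low. In that case, first train ControlNet for 50k steps or more, then unlock all Stable Diffusion weights and train the whole model jointly.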
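The large-scale schedule described above amounts to a two-phase switch on which parameters are trained. This is a minimal sketch: the function name `trainable_parts`, the group labels, and the choice of exactly 50,000 as the unlock step are assumptions (the text only says 50k steps or more).

```python
def trainable_parts(step, unlock_step=50_000):
    """Which parameter groups receive gradient updates at a given training step."""
    parts = ["controlnet"]                 # ControlNet is trained from the start
    if step >= unlock_step:
        parts.append("stable_diffusion")   # joint training once the lock is removed
    return parts

print(trainable_parts(0))
print(trainable_parts(60_000))
```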

12 Experiment setting
The base model is Stable Diffusion 1.5. Generation parameters: CFG-scale 9.0, DDIM sampler, 20 steps.
Prompt settings:
1. No prompt: an empty string.
2. Default prompt: "a professional, detailed, high-quality image". Stable Diffusion is trained together with prompts, so with no prompt it tends to generate random texture maps.
3. Automatic prompt: an automatic image captioning method such as BLIP is applied to the result obtained with the default prompt, and the image is generated again with the resulting prompt.
4. User prompt: the user supplies the prompt.
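The four prompt settings can be summarized as a small selection function. This is a minimal sketch: `select_prompt`, the mode names, and the `captioner` callback are hypothetical; `captioner` stands in for an image-captioning model such as BLIP and is not a real BLIP API.

```python
DEFAULT_PROMPT = "a professional, detailed, high-quality image"

def select_prompt(mode, user_prompt=None, captioner=None, first_pass_image=None):
    """Pick the text prompt for one of the four settings listed on the slide."""
    if mode == "none":
        return ""
    if mode == "default":
        return DEFAULT_PROMPT
    if mode == "automatic":
        # captioner is a placeholder for a captioning model such as BLIP,
        # applied to the image generated with the default prompt.
        return captioner(first_pass_image)
    if mode == "user":
        return user_prompt
    raise ValueError(f"unknown prompt mode: {mode}")

# Generation settings quoted on the slide, collected in a plain dict.
settings = {"base_model": "Stable Diffusion 1.5", "cfg_scale": 9.0,
            "sampler": "DDIM", "steps": 20}

print(select_prompt("default"))
```

The automatic mode is a two-pass procedure: generate once with the default prompt, caption that result, then generate again with the caption.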

13 Implementation - Experiment Canny Edge
Edge-image-caption pairs were generated from 3 million images collected from the internet. Trained for 600 GPU-hours on Nvidia A100 80G.

14 Implementation - Experiment Hough Line
600k edge-image-caption pairs were generated from Places2 using a learning-based deep Hough transform and BLIP. Starting from the Canny model checkpoint, trained for 150 GPU-hours on Nvidia A100 80G.

15 Implementation - Experiment HED Boundary
Edge-image-caption pairs were generated from 3 million images collected from the internet. Trained for 300 GPU-hours on Nvidia A100 80G.

16 Implementation - Experiment User Sketching
Human scribbles were synthesized using HED boundary detection and a set of strong data augmentations (random thresholds, randomly masking out a random percentage of scribbles, random morphological transformations, and random non-maximum suppression). 500k scribble-image-caption pairs were generated from images collected from the internet. Starting from the Canny model checkpoint, trained for 150 GPU-hours on Nvidia A100 80G.
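Two of the listed augmentations, the random threshold and the random masking, can be sketched on a toy boundary map. This is only an illustration under stated assumptions: the threshold and drop ranges are invented, masking is done per pixel rather than per stroke, and the morphological transformations and non-maximum suppression are left out. `synthesize_scribble` is a hypothetical helper, not the authors' pipeline.

```python
import random

def synthesize_scribble(boundary, rng):
    """Binarize a soft boundary map with a random threshold, then drop a random share of it."""
    threshold = rng.uniform(0.3, 0.7)        # assumed range
    drop_fraction = rng.uniform(0.0, 0.5)    # assumed range
    scribble = []
    for row in boundary:
        out_row = []
        for value in row:
            on = value > threshold
            if on and rng.random() < drop_fraction:
                on = False                    # randomly mask out part of the scribble
            out_row.append(1 if on else 0)
        scribble.append(out_row)
    return scribble

rng = random.Random(0)                        # seeded only for reproducibility
boundary = [[0.0, 0.9, 0.1],                  # toy stand-in for an HED boundary map
            [0.2, 0.95, 0.0],
            [0.0, 0.8, 0.6]]
scribble = synthesize_scribble(boundary, rng)
print(scribble)
```

Randomizing both steps yields many different sparse, imperfect line drawings from the same image, which is closer to what a person would actually sketch than a clean edge map.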

17 Implementation - Experiment Human Pose (Openpose)
Using a learning-based pose estimation method, 200k pose-image-caption pairs were created from images collected from the internet. Starting from the Canny model checkpoint, trained for 300 GPU-hours on Nvidia A100 80G.

18 Implementation - Experiment Semantic Segmentation (ADE20K)
The ADE20K dataset was captioned with BLIP to create 164k segmentation-image-caption pairs. Trained for 200 GPU-hours on Nvidia A100 80G. (Results shown use the default prompt.)

19 Implementation - Experiment Depth (large-scale)
Using Midas, 3 million depth-image-caption pairs were created from images collected from the internet. Trained for 500 GPU-hours on Nvidia A100 80G.

20 Masked Diffusion
By drawing guide lines in the masked region and using the canny-edge model, images are generated in line with the user's intent.

21 Compare with PITI
Comparison with PITI (Pretraining-Image-to-Image). ControlNet shows good semantic consistency for "wall", "paper", and "cup", which were difficult to handle in this task.

22 Compare with Stable Diffusion v2 Depth-to-Image
ControlNet generates accurate structure with fewer training resources.

23 Compare with Sketch-guided diffusion
Generation results for the user inputs described as the most challenging cases in Sketch-guided diffusion.

24 Compare with Taming Transformers
Generation results for the inputs described as the most challenging cases in Taming Transformers.

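25 Compare with Stable Diffusion add the channel
Comparison with the method regarded as the official one for Stable Diffusion, which adds channels to the input layer (the same as the depth-to-image structure). ControlNet produces high-definition results that do not look unnatural.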

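26 Ablation study - sudden convergence
Because zero convolutions are used, high-quality images can be generated even partway through training, thanks to the performance of the original network.
The model was observed to start following the input condition suddenly during training. The authors named this the "sudden convergence phenomenon".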

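27 Ablation study - difference with training data size
Generation results are shown for different training dataset sizes.
Generation that follows the intent is possible even with little training data, but more data appears to give higher quality.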
