Paper Introduction (2018-06-22): MUNIT | Multimodal Unsupervised Image-to-Image Translation

Paper Introduction: MUNIT | Multimodal Unsupervised Image-to-Image Translation
author: Xun Huang (Cornell University, NVIDIA)

abstract
- Proposes an unsupervised, multi-modal method for image translation.
- Its scores also come close to those of a supervised method.

Related Works - GANs: Generative Adversarial Networks
- For loss functions that are hard to design by hand, the idea is to have a neural network learn even the loss function itself.
- Many applications, such as image generation and text generation.
- generative model ≒ unsupervised
  - models P(X) (X: images, etc.)
- [Figure: Gaussian noise -> Generator -> generated image; a generated image OR a real image -> Discriminator -> True (1) / False (0). c.f. Progressive GAN]
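
For reference, the adversarial game described on this slide is commonly written as the minimax objective below. This is the standard GAN formulation rather than anything specific to this deck; the symbols G (generator), D (discriminator), z (Gaussian noise), p_data, and p_z are notation introduced here.

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D plays the role of the learned loss: G is trained only through how well it fools D.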

Related Works - Image Translation
- input: an image in the source domain
- output: an image in the target domain
- using GAN

Related Works - Image Translation
- unsupervised approach
  - No paired images are needed.
  - e.g. CycleGAN

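Related Works - Image Translation
- cycle consistency loss
  - Train so that translating an image and then translating it back yields an image close to the original input.
  - Similar techniques are also used in areas such as unsupervised machine translation.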
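
The cycle consistency idea can be sketched in a few lines. This is a minimal illustration, not CycleGAN itself: the two "translators" are hypothetical toy functions on plain lists, and the L1 distance is an assumed choice of metric.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Hypothetical toy translators between domain A and domain B.
def translate_a_to_b(x):
    return [2.0 * v + 1.0 for v in x]


def translate_b_to_a(y):
    return [(v - 1.0) / 2.0 for v in y]


def cycle_consistency_loss(x, forward, backward):
    """Distance between x and its round trip backward(forward(x))."""
    return l1_distance(backward(forward(x)), x)


x = [0.0, 0.5, 1.0]

# A backward mapping that really inverts the forward one recovers x.
loss_good = cycle_consistency_loss(x, translate_a_to_b, translate_b_to_a)

# A backward mapping that ignores its input cannot recover x.
loss_bad = cycle_consistency_loss(x, translate_a_to_b, lambda y: [0.0] * len(y))
```

The loss is small only when the round trip preserves the input, which is what lets training proceed without paired images.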

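Related Works - Problems with Image Translation
- Existing methods are not multi-modal mappings.
  - They work well when the target domain has no multi-modality, e.g. horse → zebra,
  - but for cat → dog, the output dog could just as well be a Pomeranian or a Shiba Inu.
  - When the distribution within a domain is multi-modal like this, generation does not work well.
- BicycleGAN is a multi-modal mapping, but it requires supervision.
- This work proposes image translation that is both multi-modal and unsupervised.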

Related Works - Auto-Encoder
- A method that compresses data down to only its essential information (feature extraction, dimensionality reduction).

Related Works - VAE: Variational Auto-Encoder
- Treats the latent representation as a random variable, which makes a continuous representation possible.
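
To make "latent representation as a random variable" concrete, here is a minimal sketch of the usual Gaussian VAE formulation: the encoder outputs a mean and a log-variance, a latent is sampled with the reparameterization trick, and a KL term pulls the distribution toward N(0, 1). The numeric encoder outputs are hypothetical, and none of this code comes from the slides.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.0, 1.0], [0.0, -2.0]   # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)     # one random latent code
kl = kl_to_standard_normal(mu, log_var)
```

Because nearby values of z are all plausible samples, the decoder has to behave smoothly across the latent space.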

Related Works - Disentangled Representation
- A way of splitting the latent representation, by some means, according to the meaning of the information it carries.
  - content (shape, pose, location, ...)
  - style (pattern, color, appearance, ...)

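Method - MUNIT
- The image to be translated is embedded with an Auto-Encoder, split into the following two latent representations:
  - content: information that should be preserved after translation
    - e.g. the orientation and position of the source tiger's face
  - style: information that should not carry over after translation and that should instead be controlled in a multi-modal way using information from the target domain
    - e.g. the fur color and appearance of the target cat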

Method - MUNIT
- Three losses are used.
- First, embed the image with an Auto-Encoder, split into content and style. (This alone is not enough; the GAN step that follows is also needed.)
- ① Image reconstruction loss
  - Reconstruct the image from style and content. Train so that s and c preserve the essential information of the original image.
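
One way to write loss ①, as a sketch: assume a content encoder E_1^c, a style encoder E_1^s, and a decoder G_1 for domain 1. These symbol names and the choice of the L1 norm are assumptions made here, since the slide's figure did not survive extraction.

```latex
\mathcal{L}_{\mathrm{recon}}^{x_1}
= \mathbb{E}_{x_1 \sim p(x_1)}
\Bigl[\bigl\| G_1\bigl(E_1^c(x_1),\, E_1^s(x_1)\bigr) - x_1 \bigr\|_1\Bigr]
```

The same term would be defined symmetrically for domain 2.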

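Method - MUNIT
- ② Adversarial loss
  - A discriminator judges whether the translated image is a target-domain image or not.
    - content: taken from the image before translation
    - style: Gaussian noise
  - With this alone, the discriminator can still be fooled by a generated image that looks like the target domain but has nothing to do with c and s. (e.g. producing any cat-like image would be accepted, whereas the tiger's orientation should be kept.)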
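
Loss ② can be sketched as follows for the direction 1 → 2, assuming a decoder G_2 and a discriminator D_2 for the target domain, a content code c_1 extracted from the source image, and a style code s_2 drawn from a Gaussian prior q(s_2). The symbols are assumptions made here, and the log form is only one common way to write a GAN loss.

```latex
\mathcal{L}_{\mathrm{GAN}}^{x_2}
= \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}
\bigl[\log\bigl(1 - D_2(G_2(c_1, s_2))\bigr)\bigr]
+ \mathbb{E}_{x_2 \sim p(x_2)}\bigl[\log D_2(x_2)\bigr],
\qquad q(s_2) = \mathcal{N}(0, I)
```

Nothing in this term ties the output to c_1 or s_2, which is the weakness the slide points out.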

Method - MUNIT
- ③ Latent reconstruction loss
  - Make the content and style used for the translation recoverable from the translated image.
  - The translated image therefore has to retain the content and style information that was used to produce it.
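
Loss ③ can be sketched as a pair of terms, assuming target-domain encoders E_2^c and E_2^s that are applied to the translated image G_2(c_1, s_2). As before, the symbol names and the L1 norm are assumptions made for illustration.

```latex
\mathcal{L}_{\mathrm{recon}}^{c_1}
= \mathbb{E}_{c_1, s_2}\bigl[\| E_2^c(G_2(c_1, s_2)) - c_1 \|_1\bigr],
\qquad
\mathcal{L}_{\mathrm{recon}}^{s_2}
= \mathbb{E}_{c_1, s_2}\bigl[\| E_2^s(G_2(c_1, s_2)) - s_2 \|_1\bigr]
```

The first term keeps the content (e.g. the tiger's orientation) readable in the output; the second keeps the sampled style from being ignored.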

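Method - MUNIT
[Figure: model overview with the loss terms marked: ① image reconstruction, ② adversarial, ③ latent reconstruction]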
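
How the three terms combine for one translation direction can be sketched with toy stand-ins. Everything here is hypothetical: an "image" is a two-element list, `encode` and `decode` are trivial functions shared by both domains, the discriminator is any callable returning a score, the adversarial term is written in a least-squares form, and the weights `lam_x`, `lam_c`, `lam_s` are placeholder values.

```python
import random


def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Hypothetical toy "networks": an image is [content value, style value].
def encode(x):
    """Return (content code, style code)."""
    return [x[0]], [x[1]]


def decode(content, style):
    return [content[0], style[0]]


def generator_objective(x1, discriminator2, rng,
                        lam_x=10.0, lam_c=1.0, lam_s=1.0):
    c1, s1 = encode(x1)
    loss_x = l1(decode(c1, s1), x1)                 # (1) image reconstruction
    s2 = [rng.gauss(0.0, 1.0)]                      # style sampled from N(0, 1)
    x12 = decode(c1, s2)                            # translated image
    loss_gan = (discriminator2(x12) - 1.0) ** 2     # (2) adversarial term
    c_rec, s_rec = encode(x12)
    loss_c = l1(c_rec, c1)                          # (3) content reconstruction
    loss_s = l1(s_rec, s2)                          # (3) style reconstruction
    return loss_gan + lam_x * loss_x + lam_c * loss_c + lam_s * loss_s


rng = random.Random(0)
fooled = generator_objective([0.3, 0.9], lambda img: 1.0, rng)
not_fooled = generator_objective([0.3, 0.9], lambda img: 0.0, rng)
```

With these trivial encoders the reconstruction terms are exactly zero, so only the adversarial term changes between the two calls; real networks would have to be trained to drive all terms down together.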

Method - Auto-Encoder
- Downsampling: CNN
- AdaIN: parameters in normalization layers to represent styles
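
A minimal sketch of adaptive instance normalization (AdaIN) on a single channel: the activations are normalized to zero mean and unit variance, then scaled by `gamma` and shifted by `beta`. In a style-conditioned decoder those two parameters would be predicted from the style code; here they are passed in by hand as hypothetical values.

```python
import math


def adain(features, gamma, beta, eps=1e-5):
    """Normalize one channel, then apply a style-dependent scale and shift."""
    n = len(features)
    mean = sum(features) / n
    var = sum((v - mean) ** 2 for v in features) / n
    std = math.sqrt(var + eps)
    return [gamma * (v - mean) / std + beta for v in features]


channel = [1.0, 2.0, 3.0, 4.0]               # hypothetical activations
styled = adain(channel, gamma=2.0, beta=0.5)

# After AdaIN the channel statistics are set by gamma and beta.
styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((v - styled_mean) ** 2 for v in styled) / len(styled))
```

The spatial pattern of the channel (content) is kept while its statistics (style) are overwritten, which is why normalization parameters are a natural place to inject style.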

Method - Auto-Encoder - Discriminator
- LSGAN objective
- multi-scale discriminators
  - to learn realistic details
  - to learn correct global structure
- Domain-invariant perceptual loss
  - Extends perceptual loss, which is otherwise usable only in the supervised setting, to the unsupervised setting.
  - a distance in the VGG feature space between the output and the reference image
  - Helps training at high resolution.
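
The LSGAN objective replaces the log loss with squared error. Below is a minimal sketch on plain lists of discriminator scores, with constant factors omitted; summing the per-scale losses for multi-scale discriminators is an assumption of this sketch, and the score values are hypothetical.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Push D(real) toward 1 and D(fake) toward 0 with squared error."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


def lsgan_generator_loss(d_fake):
    """Push D(fake) toward 1 from the generator's side."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def multi_scale_generator_loss(d_fake_per_scale):
    """Sum the generator loss over discriminators at several image scales."""
    return sum(lsgan_generator_loss(scores) for scores in d_fake_per_scale)


perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
unsure_d = lsgan_discriminator_loss([0.5], [0.5])
g_loss = multi_scale_generator_loss([[0.0], [1.0], [0.5]])  # three scales
```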

Evaluation - Task
- supervised image translation
- unsupervised image translation

Evaluation - supervised
- dataset: Edges <-> shoes/handbags
  - colored image
  - corresponding edge images
- eval. metric
  - quality: human preference
  - diversity: LPIPS distance
- baselines
  - UNIT
  - CycleGAN
  - CycleGAN with noise
  - BicycleGAN

Evaluation - Baselines
- UNIT: the latent representation is not disentangled; there is only a single one.
- CycleGAN
- CycleGAN with noise: adds Gaussian noise to the input image.
- BicycleGAN: capable of continuous multi-modal mapping, but requires supervision.

Results - supervised - qualitative

Evaluation - Human Preference
- to evaluate the quality
- Amazon Mechanical Turk
  - 500 questions/worker
  - 1 source image
  - 2 translated images from different methods

Evaluation - LPIPS Distances
- to evaluate diversity
- a weighted L2 distance between pairs of deep features of randomly-sampled translated images from the same input
  - deep feature extractor: ImageNet-pretrained AlexNet
- correlates well with human perceptual similarity
- 1900 pairs
  - 100 input images
  - x 19 output pairs/input
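
The sampling protocol above (100 inputs x 19 pairs = 1900 pairs) can be sketched as follows. This is not LPIPS itself: a plain squared L2 distance on hypothetical random feature vectors stands in for the weighted distance on AlexNet features, and the figure of 20 outputs per input is an assumption made only so that pairs can be drawn.

```python
import random


def squared_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def average_pairwise_distance(outputs_per_input, pairs_per_input, distance, rng):
    """Average distance over randomly sampled output pairs of each input."""
    total, count = 0.0, 0
    for outputs in outputs_per_input:
        for _ in range(pairs_per_input):
            a, b = rng.sample(outputs, 2)   # two different outputs, same input
            total += distance(a, b)
            count += 1
    return total / count, count


rng = random.Random(0)

# Hypothetical features: 100 inputs, 20 translated outputs each, 4 dims.
diverse = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(20)]
           for _ in range(100)]
# A model that always produces the same output for a given input.
collapsed = [[[0.5] * 4 for _ in range(20)] for _ in range(100)]

score_diverse, n_pairs = average_pairwise_distance(diverse, 19, squared_l2, rng)
score_collapsed, _ = average_pairwise_distance(collapsed, 19, squared_l2, rng)
```

A model that maps each input to a single output scores zero, so a higher average distance indicates more diverse translations.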

Results - supervised - quantitative
- Outperforms the other methods in every comparison, except for Quality against BicycleGAN.
- Removing any one of the three losses lowers Quality substantially, so all of the losses can be judged effective.

Evaluation - unsupervised
- dataset: Animal image translation
  - Animal images are grouped by category.
  - No pairs.
- eval. metric
  - IS = Inception Score
  - CIS = conditional Inception Score
- baselines
  - UNIT
  - CycleGAN
  - CycleGAN with noise
  - (BicycleGAN is omitted because it only supports the supervised setting)
- [Figure: example images from the categories big cats, house cats, dogs]

Results - unsupervised - qualitative
[Figure: qualitative comparison of translated images; surviving panel label: CycleGAN]

Evaluation - (C)IS = (Conditional) Inception Score
- popular for image generation
- to evaluate quality and diversity
- IS: diversity of all output images
  - The easier the images are for Inception-v3 to classify, the higher the score.
- CIS: diversity of outputs conditioned on a single input image
  - more suited for evaluating multi-modal mapping
- e.g. if a cat image is translated into a nearly perfect dog image, IS becomes high. However, if each input image is always translated into the same dog image (i.e. the mapping is not multi-modal), IS is high but CIS is low.

Evaluation - (C)IS = (Conditional) Inception Score
- x1: source image
- x2: target image
- x1→2: translated image from 1 to 2
- y: class = mode (e.g. Pomeranian, Shiba Inu, Siberian Husky if X2 is a set of dogs)
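
The difference between IS and CIS can be sketched numerically. In this sketch both scores are mean KL divergences between each output's class distribution p(y | x1→2) and a marginal: the marginal over all outputs for IS, and the marginal over the outputs of one input for CIS. No exponential is applied, which is an assumption of this sketch, and the class probabilities are hypothetical rather than real Inception-v3 outputs.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_distribution(dists):
    n = len(dists)
    return [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]


def inception_scores(probs):
    """probs[i][j] = p(y | j-th translation of input i). Returns (IS, CIS)."""
    all_outputs = [p for per_input in probs for p in per_input]
    p_y = mean_distribution(all_outputs)              # marginal over everything
    is_score = sum(kl(p, p_y) for p in all_outputs) / len(all_outputs)
    cis_terms = []
    for per_input in probs:
        p_y_given_x1 = mean_distribution(per_input)   # marginal for one input
        cis_terms.append(sum(kl(p, p_y_given_x1) for p in per_input)
                         / len(per_input))
    return is_score, sum(cis_terms) / len(cis_terms)


pomeranian, shiba = [1.0, 0.0], [0.0, 1.0]   # confident two-class predictions

# Not multi-modal: each input always maps to the same breed.
is_collapsed, cis_collapsed = inception_scores([[pomeranian, pomeranian],
                                                [shiba, shiba]])
# Multi-modal: every input can become either breed.
is_diverse, cis_diverse = inception_scores([[pomeranian, shiba],
                                            [pomeranian, shiba]])
```

Both models get the same IS, because the pooled outputs cover both breeds either way, but only the multi-modal one gets a non-zero CIS.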

Results - unsupervised - quantitative
- Far outperforms the existing unsupervised approaches. (the higher the better)

Results - Example-guided image translation
- Instead of noise, the style can be specified with a single image from the target domain, so an arbitrary style can be chosen.
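
Example-guided translation only changes where the style code comes from. A minimal sketch with hypothetical toy encoders and decoders (an "image" is a two-element list of content and style values, as an illustration only):

```python
import random


# Hypothetical toy "networks": an image is [content value, style value].
def encode(x):
    """Return (content code, style code)."""
    return [x[0]], [x[1]]


def decode(content, style):
    return [content[0], style[0]]


def translate(source, rng=None, reference=None):
    """Translate `source`, taking style from `reference` if one is given."""
    content, _ = encode(source)
    if reference is not None:
        _, style = encode(reference)      # style of the example image
    else:
        style = [rng.gauss(0.0, 1.0)]     # style sampled from N(0, 1)
    return decode(content, style)


source = [0.3, 0.9]
reference = [0.7, -0.2]                   # an image from the target domain
guided = translate(source, reference=reference)
sampled = translate(source, rng=random.Random(0))
```

In both cases the content code of the source is reused unchanged; only the style half of the latent pair is swapped.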

Conclusion
- Proposed a method for unsupervised multi-modal image translation.
- Solved the problem by making the intermediate representation of the Auto-Encoder disentangled.
- On supervised image translation, achieved scores close to those of BicycleGAN, a supervised multi-modal method.
- On unsupervised image translation, far outperformed the other methods.

hrsma2i
