NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

raven
December 01, 2021

Transcript

  1. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
    Chenfei Wu¹* Jian Liang²* Lei Ji¹ Fan Yang¹ Yuejian Fang² Daxin Jiang¹ Nan Duan¹†
    ¹Microsoft Research Asia ²Peking University
    Presenter: Kai Katsumata, Nakayama Lab.
    *Both authors contributed equally to this research. †Corresponding author.
  2. Basic information
    Title: NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
    Authors: Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan
    Affiliation: Microsoft Research Asia, Peking University
    Date: 2021/11/24 (arXiv) https://arxiv.org/abs/2111.12417
    Project URL: https://github.com/microsoft/NUWA
    1 / 26
  3. Abstract ”This paper presents a unified multimodal pre-trained model called

    NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is https://github.com/microsoft/NUWA.” (Wu et al., 2021b) 2 / 26
  4. Teaser figure
    Figure 1: NÜWA supports several typical generation and manipulation tasks: Text-To-Image (T2I), Sketch-To-Image (S2I), Image Completion (I2I), Image Manipulation (TI2I), Text-To-Video (T2V), Sketch-To-Video (S2V), Video Prediction (V2V), and Video Manipulation (TV2V). Example inputs include prompts such as "A dog with goggles staring at the camera", "A person is preparing some art", "a horse is running on the grassland", and "The car is reversing", as well as semantic sketches labeled with classes such as grass, water, house, sky, and tree. The figure is cited from (Wu et al., 2021b).
    3 / 26
  5. VQ-VAE-based Visual Auto-Regressive Models (previous work)
    Figure 2: VideoGPT pipeline: a Conv3D encoder maps the target video to discrete latents via a codebook, the flattened latent sequence is modeled autoregressively by a Transformer, and a Conv3D decoder reconstructs the video. The figure is cited from (Yan et al., 2021), CC BY.
    4 / 26
  6. VQ-VAE-based Visual Auto-Regressive Models (previous work)
    Figure 3: CogView pipeline: the input text ("The head of a lovely cat") is discretized by a text tokenizer (SentencePiece) and the input image by an image tokenizer (a discrete auto-encoder); the flattened sequence [ROI1] text tokens [BASE] [BOI1] image tokens [EOI1] is modeled by a GPT-style Transformer, and the image tokens are recovered back to pixels by the decoder. The figure is cited from (Ding et al., 2021).
    5 / 26
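As a toy illustration of the flattened token layout shown in Figure 3, the sketch below concatenates special tokens, text tokens, and image tokens into one sequence for a GPT-style model; all token ids, offsets, and vocabulary sizes are invented for illustration and are not CogView's actual values.

```python
# Illustrative sketch of a CogView-like flattened input sequence:
# [ROI1] text tokens [BASE] [BOI1] image tokens [EOI1].
# Token ids and vocabulary layout are hypothetical.
TEXT_VOCAB = 50_000           # hypothetical text vocabulary size
ROI1, BASE, BOI1, EOI1 = (TEXT_VOCAB + i for i in range(4))  # special tokens
IMG_OFFSET = TEXT_VOCAB + 4   # image codes placed after text + special tokens

def flatten_sequence(text_tokens, image_codes):
    """Return one token sequence combining text and image tokens."""
    return ([ROI1] + list(text_tokens) + [BASE, BOI1]
            + [IMG_OFFSET + c for c in image_codes] + [EOI1])

print(flatten_sequence([17, 523, 8], [0, 4095, 12]))
```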
  7. Difference between previous work (DALL-E (Ramesh et al., 2021), CogView (Ding et al., 2021), GODIVA (Wu et al., 2021a))
    DALL-E (Ramesh et al., 2021) and CogView (Ding et al., 2021) are text-to-image models; GODIVA (Wu et al., 2021a) extends DALL-E to video, and VideoGPT (Yan et al., 2021) applies the same VQ-VAE + Transformer recipe to video. NÜWA extends this line of work to a single multimodal model covering text, image, and video.
    6 / 26
  8. Overview of NÜWA
    Figure 4: Structure of NÜWA: a 1D-Encoder for input text (e.g., "A light wind blew across the country road."), a 2D-Encoder for input images and image sketches, and a 3D-Encoder for input videos and video sketches, all feeding a shared 3D-Decoder. The decoder covers visual generation (output image / output video) as well as visual completion, prediction, and manipulation (output remaining parts / output future frames from input image parts or input video frames). The figure is cited from (Wu et al., 2021b).
    7 / 26
  9. VQGAN
    $z_i = \arg\min_{j \in \{0,\dots,N-1\}} \| E(I)_i - B_j \|_2$, where $E(I)_i \in \mathbb{R}^{d_B}$,  (1)
    $\hat{I} = G(B[z])$,  (2)
    $L_V = \| I - \hat{I} \|_2^2 + \| \mathrm{sg}[E(I)] - B[z] \|_2^2 + \| E(I) - \mathrm{sg}[B[z]] \|_2^2$, where $I \in \mathbb{R}^{H \times W \times C}$ and $E(I) \in \mathbb{R}^{h \times w \times d_B}$,  (3)
    $L_P = \| \mathrm{CNN}(I) - \mathrm{CNN}(\hat{I}) \|_2^2$,  (4)
    $L_G = \log D(I) + \log(1 - D(\hat{I}))$,  (5)
    where $d_B = 256$, $N = 12{,}288$, $H = W \in \{256, 336\}$, $h = w \in \{16, 21, 32\}$.
    8 / 26
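As a rough illustration of Eqs. (1)-(2), the NumPy sketch below performs the nearest-codebook lookup and gathers the quantized embeddings that would be fed to the decoder; the random codebook, random encoder output, and grid size are toy placeholders, not the VQ-GAN trained in the paper.

```python
import numpy as np

# Toy nearest-codebook quantization, following Eqs. (1)-(2).
d_B, N = 256, 12288          # codebook dimension and size (from the slide)
h = w = 16                   # latent grid for a 256x256 image at F16

rng = np.random.default_rng(0)
B = rng.normal(size=(N, d_B))            # codebook B (random stand-in)
E_I = rng.normal(size=(h * w, d_B))      # encoder output E(I), flattened

# Eq. (1): z_i = argmin_j ||E(I)_i - B_j||_2, via pairwise squared distances
dists = (E_I ** 2).sum(1, keepdims=True) - 2 * E_I @ B.T + (B ** 2).sum(1)
z = dists.argmin(axis=1)                 # discrete codes, shape (h*w,)

# Eq. (2): the decoder G would reconstruct from the quantized embeddings B[z]
B_z = B[z].reshape(h, w, d_B)            # input to G(.)
print(z.shape, B_z.shape)
```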
  10. 3D Nearby Self-Attention (3DNA)
    $Y = \mathrm{3DNA}(X, C; W)$, where $X \in \mathbb{R}^{h \times w \times s \times d_{in}}$ and $C \in \mathbb{R}^{h' \times w' \times s' \times d_{in}}$,  (6)
    $N^{(i,j,k)} = \{ C_{abc} : |a - i'| \le e^h,\ |b - j'| \le e^w,\ |c - k'| \le e^s \} \in \mathbb{R}^{e^h \times e^w \times e^s \times d_{in}}$, with $(i', j', k')$ the coordinate of $(i, j, k)$ mapped onto the grid of $C$,  (7)
    $Q^{(i,j,k)} = X W^Q \in \mathbb{R}^{h \times w \times s \times d_{out}}$,  (8)
    $K^{(i,j,k)} = N^{(i,j,k)} W^K \in \mathbb{R}^{e^h \times e^w \times e^s \times d_{out}}$,  (9)
    $V^{(i,j,k)} = N^{(i,j,k)} W^V \in \mathbb{R}^{e^h \times e^w \times e^s \times d_{out}}$,  (10)
    $y_{ijk} = \mathrm{softmax}\!\left( \dfrac{(Q^{(i,j,k)})^T (K^{(i,j,k)})^T}{\sqrt{d_{in}}} \right) V^{(i,j,k)}$,  (11)
    where $W^Q, W^K, W^V \in \mathbb{R}^{d_{in} \times d_{out}}$.
    9 / 26
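The following sketch illustrates the nearby window of Eqs. (7)-(11) for a single query position, under the simplifying assumptions that the condition tensor shares the target's grid and that the window extents are small symmetric toy values; it is not the paper's batched, causally masked implementation.

```python
import numpy as np

def nearby_attention_single(X, C, W_Q, W_K, W_V, i, j, k, e=(1, 1, 1)):
    """Attend from position (i, j, k) of X to a local window of C only."""
    eh, ew, es = e
    # N^{(i,j,k)}: the nearby window of C around (i, j, k), clamped at borders
    nb = C[max(i - eh, 0):i + eh + 1,
           max(j - ew, 0):j + ew + 1,
           max(k - es, 0):k + es + 1].reshape(-1, C.shape[-1])
    q = X[i, j, k] @ W_Q                     # query for this position
    K = nb @ W_K                             # keys over the window
    V = nb @ W_V                             # values over the window
    scores = q @ K.T / np.sqrt(X.shape[-1])  # scaled dot products
    att = np.exp(scores - scores.max())      # softmax over the window
    att /= att.sum()
    return att @ V                           # y_{ijk}

rng = np.random.default_rng(0)
d_in, d_out = 8, 8
X = rng.normal(size=(4, 4, 4, d_in))         # target tensor (toy grid)
C = rng.normal(size=(4, 4, 4, d_in))         # condition tensor (same grid here)
W_Q, W_K, W_V = (rng.normal(size=(d_in, d_out)) for _ in range(3))
print(nearby_attention_single(X, C, W_Q, W_K, W_V, 2, 2, 2).shape)  # (8,)
```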
  11. 3DNA
    Comparison of 3D sparse attentions:
    • 3D block-sparse: considers previous tokens in a fixed 3D block.
    • 3D axial-sparse (row): considers previous tokens along each 3D axis.
    • 3D nearby-sparse (ours): considers previous tokens in a 3D nearby sliding window.
    10 / 26
  12. 3D Encoder-Decoder
    $Y_{ijk} := Y_{ijk} + P^h_i + P^w_j + P^s_k$,  (12)
    $C_{ijk} := C_{ijk} + P^h_i + P^w_j + P^s_k$,  (13)
    $C^{(l)} = \mathrm{3DNA}(C^{(l-1)}, C^{(l-1)})$,  (14)
    $Y^{(l)}_{ijk} = \mathrm{3DNA}(Y^{(l-1)}_{<i,<j,<k}, Y^{(l-1)}_{<i,<j,<k}) + \mathrm{3DNA}(Y^{(l-1)}_{<i,<j,<k}, C^{(L)})$,  (15)
    where $Y \in \mathbb{R}^{h \times w \times s \times d_{out}}$, $C \in \mathbb{R}^{h' \times w' \times s' \times d_{in}}$, and $V^{(1)}_{0,0,0}$ is $\langle \mathrm{bos} \rangle$.
    11 / 26
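A minimal sketch of the separable 3D positional embedding in Eqs. (12)-(13), assuming toy tensor sizes: one table per axis, broadcast and summed onto every position of the 3D token grid.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, s, d = 4, 4, 4, 8                  # toy grid and hidden size
Y = rng.normal(size=(h, w, s, d))        # token embeddings Y
P_h = rng.normal(size=(h, d))            # per-axis positional tables
P_w = rng.normal(size=(w, d))
P_s = rng.normal(size=(s, d))

# Eq. (12): Y_ijk := Y_ijk + P^h_i + P^w_j + P^s_k, done via broadcasting
Y = Y + P_h[:, None, None, :] + P_w[None, :, None, :] + P_s[None, None, :, :]
print(Y.shape)  # (4, 4, 4, 8)
```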
  13. Training Objective
    $L = -\sum_{t=1}^{h \times w} \log p_\theta\!\left(y_t \mid y_{<t}, C^{\mathrm{text}}; \theta\right) - \sum_{t=1}^{h \times w \times s} \log p_\theta\!\left(y_t \mid y_{<t}, c; \theta\right) - \sum_{t=1}^{h \times w \times s} \log p_\theta\!\left(y_t \mid y_{<t}, C^{\mathrm{text}}; \theta\right)$  (16)
    Training on Text-to-Image (T2I), Video Prediction (V2V), and Text-to-Video (T2V) with cross-entropy loss.
    12 / 26
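To make the composition of Eq. (16) concrete, the sketch below sums three autoregressive cross-entropy terms (T2I, V2V, T2V) for one shared set of parameters; the logits, targets, vocabulary, and grid sizes are toy placeholders rather than NÜWA's actual outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 256                          # toy visual vocabulary (paper: N = 12,288)

def token_nll(logits, targets):
    """-sum_t log p(y_t | y_<t, cond), i.e. softmax cross-entropy from logits."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(targets)), targets].sum()

def fake_task(n_tokens):
    """Placeholder logits and targets standing in for one task's batch."""
    return rng.normal(size=(n_tokens, vocab)), rng.integers(0, vocab, n_tokens)

h, w, s = 16, 16, 10                 # toy latent grid and frame count
L = (token_nll(*fake_task(h * w))        # T2I term: h*w image tokens
     + token_nll(*fake_task(h * w * s))  # V2V term: h*w*s video tokens
     + token_nll(*fake_task(h * w * s))) # T2V term: h*w*s video tokens
print(float(L))
```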
  14. Experiments - quantitative results
    Table 1: T2I task on MSCOCO (256×256).
    Model                          FID-0↓  FID-1  FID-2  FID-4  FID-8  IS↑   CLIPSIM↑
    AttnGAN (Xu et al., 2018)      35.2    44.0   72.0   108.0  100.0  23.3  0.2772
    DM-GAN (Zhu et al., 2019)      26.0    39.0   73.0   119.0  112.3  32.2  0.2838
    DF-GAN (Tao et al., 2020)      26.0    33.8   55.9   91.0   97.0   18.7  0.2928
    DALL-E (Ramesh et al., 2021)   27.5    28.0   45.5   83.5   85.0   17.9  -
    CogView (Ding et al., 2021)    27.1    19.4   13.9   19.4   23.6   18.2  0.3325
    XMC-GAN (Zhang et al., 2021)   9.3     -      -      -      -      30.5  -
    NÜWA                           12.9    13.8   15.7   19.3   24     27.2  0.3429

    Table 2: T2V task on the Kinetics dataset.
    Model                                  Acc↑  FID-img↓  FID-vid↓  CLIPSIM↑
    T2V (64×64) (Li et al., 2018)          42.6  82.13     14.65     0.2853
    SC (128×128) (Balaji et al., 2019)     74.7  33.51     7.34      0.2915
    TFGAN (128×128) (Balaji et al., 2019)  76.2  31.76     7.19      0.2961
    NÜWA (128×128)                         77.9  28.46     7.05      0.3012

    Table 3: V2V task on BAIR (64×64).
    Model                                             Cond.  FVD↓
    MoCoGAN (Tulyakov et al., 2018)                   4      503
    SVG-FP (Denton and Fergus, 2018)                  2      315
    CDNA (Finn et al., 2016)                          2      297
    SV2P (Babaeizadeh et al., 2017)                   2      263
    SRVP (Franceschi et al., 2020)                    2      181
    VideoFlow (Kumar et al., 2019)                    3      131
    LVT (Rakhimov et al., 2020)                       1      126±3
    SAVP (Lee et al., 2018)                           2      116
    DVD-GAN-FP (Clark et al., 2019)                   1      110
    Video Transformer (S) (Weissenborn et al., 2020)  1      106±3
    TriVD-GAN-FP (Luc et al., 2020)                   1      103
    CCVS (Moing et al., 2021)                         1      99±2
    Video Transformer (L) (Weissenborn et al., 2020)  1      94±2
    NÜWA                                              1      86.9
    13 / 26
  15. Experiments - qualitative results
    Figure 6: T2I task on MSCOCO, comparing XMC-GAN (256×256), DALL-E (256×256), and NÜWA (ours, 256×256) on captions such as "A very cute cat laying by a big bike.", "A couple of people are sitting on a wood bench.", "A green train is coming down the tracks.", "A kitchen with a fridge, stove and sink.", and "A child eating a birthday cake near some balloons." The figure is cited from (Wu et al., 2021b).
    14 / 26
  16. Experiments - qualitative results
    Figure 7: T2V task on the Kinetics dataset, comparing T2V (64×64), TFGAN (128×128), GODIVA (128×128), and NÜWA (ours, 336×336) on input texts such as "playing golf at swimming pool", "running on the sea", and "playing golf on grass". The figure is cited from (Wu et al., 2021b).
    15 / 26
  17. Experiments - qualitative results
    Figure 8: S2I task on the MSCOCO stuff dataset, comparing Taming (256×256), SPADE (256×256), and NÜWA (ours, 256×256) against the input and the ground truth. The figure is cited from (Wu et al., 2021b).
    Figure 9: I2I in a zero-shot manner, comparing Taming (256×256) and NÜWA (ours, 256×256). The figure is cited from (Wu et al., 2021b).
    16 / 26
  18. Experiments - qualitative results
    Figure 10: TI2I in a zero-shot manner, comparing Paint By Word and NÜWA (ours) on manipulations of a raw image such as "A photo of a camping tent", "A photo of a bouquet of flowers", and "A photo of a blue firetruck". The figure is cited from (Wu et al., 2021b).
    Figure 11: Reconstruction samples of VQ-GAN (raw vs. reconstructed image) and VQ-GAN-Seg (raw vs. reconstructed sketch). The figure is cited from (Wu et al., 2021b).
    17 / 26
  19. Experiments - qualitative results
    Figure 12: Samples of different manipulations on the same raw video: "The diver is swimming to the surface.", "The diver is swimming to the bottom.", and "The diver is flying to the sky." The figure is cited from (Wu et al., 2021b).
    18 / 26
  20. Experiments - quantitative results
    Table 4: Effectiveness of different VQ-VAE (VQ-GAN) settings.
    Model       Dataset     R → D        Rate  SSIM    FID
    VQ-VAE      ImageNet    256² → 16²   F16   0.7026  13.3
    VQ-GAN      ImageNet    256² → 16²   F16   0.7105  6.04
    VQ-GAN      ImageNet    256² → 32²   F8    0.8285  2.03
    VQ-GAN      ImageNet    336² → 21²   F16   0.7213  4.79
    VQ-GAN      OpenImages  336² → 21²   F16   0.7527  4.31

    Model       Dataset  R → D       Rate  PA     FWIoU
    VQ-GAN-Seg  MSCOCO   336² → 21²  F16   96.82  93.91
    VQ-GAN-Seg  VSPW     336² → 21²  F16   95.36  91.82

    Table 5: Effectiveness of multi-task pre-training for T2V task on MSRVTT.
    Model       Pre-trained Tasks  FID-vid↓  CLIPSIM↑
    NÜWA-TV     T2V                52.98     0.2314
    NÜWA-TV-TI  T2V+T2I            53.92     0.2379
    NÜWA-TV-VV  T2V+V2V            51.81     0.2335
    NÜWA        T2V+T2I+V2V        47.68     0.2439

    Table 6: Effectiveness of 3D nearby attention for S2V task on VSPW.
    Model    Encoder  Decoder  FID-vid↓  Detected PA↑
    NÜWA-FF  Full     Full     35.21     0.5220
    NÜWA-NF  Nearby   Full     33.63     0.5357
    NÜWA-FN  Full     Nearby   32.06     0.5438
    NÜWA-AA  Axis     Axis     29.18     0.5957
    NÜWA     Nearby   Nearby   27.79     0.6085
    19 / 26
  21. Experiments - qualitative results
    Figure 13: Reconstruction results for different R → D compression settings, showing the raw image alongside VQ-VAE (ImageNet) and VQ-GAN trained on ImageNet and OpenImages. The figure is cited from (Wu et al., 2021b) but does not appear in the submitted paper.
    20 / 26
  22. Questions & weaknesses
    • An ablation study is needed.
    • Previous works use different decoders (dVAE, VQ-VAE, VQ-GAN).
    • More comparison with CogView is needed.
    • More pre-training tasks?
    21 / 26
  23. Summary
    • NÜWA is a novel multimodal pre-trained model for image and video generation with strong zero-shot capabilities.
    • 3DNA enables a unified representation of text, image, and video.
    • Important: trained on 64 A100 GPUs for two weeks.
    22 / 26
  24. References i
    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
    Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional GAN with discriminative filter generation for text-to-video synthesis. In IJCAI, pages 1995–2001, 2019.
    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
    Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1174–1183. PMLR, 2018.
    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290, 2021.
    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems, 29:64–72, 2016.
    Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. Stochastic latent residual video prediction. In International Conference on Machine Learning, pages 3233–3246. PMLR, 2020.
    23 / 26
  25. References ii
    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A conditional flow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434, 2019.
    Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
    Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
    Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020.
    Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. CCVS: Context-aware controllable video synthesis. arXiv preprint arXiv:2107.08037, 2021.
    Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020.
    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
    24 / 26
  26. References iii
    Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020.
    Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions. arXiv preprint arXiv:2104.14806, 2021a.
    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion. arXiv preprint arXiv:2111.12417, 2021b.
    25 / 26
  27. References iv
    Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021.
    Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021.
    Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.
    26 / 26