
Style Transfer Survey by VIVEN Inc

VIVEN, Inc.

March 17, 2023

Transcript

  1.  Style Transfer Survey
    Tapas Dutta, Deep Learning Engineer

  2.  Neural Artistic Style Transfer with Conditional Adversarial Networks

    • Summary
    The work proposes a unidirectional GAN model for neural artistic style transfer
    that also enforces cyclic consistency.

    • Related Works
    Gatys (2016) first introduced style transfer for image synthesis. Karras (2019)
    improved upon it to make it faster and more efficient. Yang (2018) used a GAN-based
    algorithm to control the style ingrained in continuous flows of color so as to
    incorporate brush-stroke patterns. Zhang (2020) used ImageNet-pretrained models to
    improve performance. Zhu (2017) introduced image translation between two domains
    with unpaired initial distributions.

    • Proposed Methodology
    The work proposes two approaches for style transfer. The first approach uses a
    GAN whose generator has an encoder-decoder architecture, together with two
    discriminators, one for style and one for content. The encoder extracts object
    features from the content image and local-global fused features from the style
    images, and these features are encoded into a latent vector space within the
    bottleneck layers; the decoder reconstructs the transferred image at the same size
    as the input. PatchGAN is used as the content discriminator since it penalizes
    structure at the patch scale, allowing generation of a semantically accurate image
    (a minimal sketch of such a patch discriminator is given at the end of this entry);
    by treating the image as a Markov random field with one plate per patch, it also
    helps preserve the original colors in the generated image. The style discriminator
    is a wavelet-CNN discriminator, consisting of a wavelet transformation layer
    followed by CNN layers, so it can fuse global and local features to extract style
    from images. The second approach consists of a content encoder, a style encoder
    and a decoder. The content encoder is similar to StyleGAN in that skip connections
    are used between intermediate content-encoding modules and intermediate upscaling
    CNN modules; this module is trained using a marginal loss function. A DenseNet
    model is used as the style encoder, since it concatenates features from previous
    modules and its weighted transition layers create complex feature maps. The decoder
    sub-module of the generator starts from the style-encoded features and is fed
    low-level content-encoded features through skip connections.

    • Next must-read paper: "Generative image inpainting with salient prior and
    relative total variation"
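
    The patch-level content discriminator described above can be sketched with a
    generic PatchGAN-style network. This is a minimal illustration in PyTorch, not the
    authors' exact architecture; the layer count, channel widths and use of instance
    normalization are assumptions.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Minimal PatchGAN-style discriminator: instead of one real/fake scalar,
    it outputs a grid of scores, one per receptive-field patch."""
    def __init__(self, in_channels=3, base_channels=64):
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1),
                nn.InstanceNorm2d(c_out),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            block(base_channels, base_channels * 2, stride=2),
            block(base_channels * 2, base_channels * 4, stride=2),
            block(base_channels * 4, base_channels * 8, stride=1),
            # one score per patch location (no sigmoid: pair with a logit-based GAN loss)
            nn.Conv2d(base_channels * 8, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (B, 1, H', W'): a score map over image patches

# a 256x256 input yields roughly a 30x30 grid of patch scores
print(PatchDiscriminator()(torch.randn(1, 3, 256, 256)).shape)
```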

  3.  StyTr²: Image Style Transfer with Transformers

    • Summary
    Recent style transfer works struggle with retaining global information, so this
    work uses a transformer-based architecture to overcome the issue. Two separate
    transformers are used for style and content encoding, and the decoder uses a
    transformer to stylize the content.

    • Related Works
    Gatys (2016) used a CNN to extract features from the content and stylized images
    and optimized the algorithm to generate stylized images. Huang (2017) proposed
    adaptive instance normalization, which stylizes an image by replacing the mean and
    variance of the content image with those of a stylized exemplar. Chen (2021)
    proposed the Internal-External Style transfer algorithm, optimized using two types
    of contrastive loss to improve result quality.

    • Proposed Methodology
    Given a content image $I_c$ and a style image $I_s$, both are split into patches and
    a linear layer is used to obtain embeddings $\varepsilon$ of size $L \times C$, where
    $L = \frac{H \times W}{m \times m}$ for patch size $m = 8$ and embedding dimension $C$.
    The attention score between the $i$-th and $j$-th patches is calculated as
    $A_{i,j} = \big((\varepsilon_i + P_i) W_q\big)^T \big((\varepsilon_j + P_j) W_k\big)$,
    where $P$ is the positional encoding and $W_q$ and $W_k$ are the query and key
    weights, respectively. The (content-aware) positional embedding is formulated as
    $P_{CA}(x, y) = \sum_{k=0}^{s} \sum_{l=0}^{s} a_{kl}\, P_L(x_k, y_l)$, where
    $P_L = \mathcal{F}_{pos}\big(\mathrm{AvgPool}_{n \times n}(\varepsilon)\big)$,
    $\mathcal{F}_{pos}$ is a $1 \times 1$ convolution and $s$ is the number of
    neighboring patches. The query, key and value are calculated as $Q = Z_c W_q$,
    $K = Z_c W_k$ and $V = Z_c W_v$, where for the content image
    $Z_c = \{\varepsilon_{c1} + P_{CA1}, \varepsilon_{c2} + P_{CA2}, \ldots, \varepsilon_{cL} + P_{CAL}\}$.
    The multi-head attention is calculated as
    $F_{MSA}(Q, K, V) = \mathrm{Concat}\big(\mathrm{Attention}_1(Q, K, V), \ldots, \mathrm{Attention}_n(Q, K, V)\big) W_O$,
    with layer normalization applied after each block, followed by a skip connection.
    A similar encoding is used for the style image, except without the positional
    encoding. Thus, the content encoder can be represented as
    $Y'_c = \mathcal{F}_{MSA}(Q, K, V) + Q$ and $Y_c = \mathcal{F}_{FFN}(Y'_c) + Y'_c$.
    The decoder consists of two multi-head attention layers and one residual layer.
    For the content input $Y''_c = \{Y_{c1} + P_{CA1}, Y_{c2} + P_{CA2}, \ldots, Y_{cL} + P_{CAL}\}$
    and the style sequence $Y_s = \{Y_{s1}, Y_{s2}, \ldots, Y_{sL}\}$, the query, key and
    value are calculated as $Q = Y''_c W_q$, $K = Y_s W_k$, $V = Y_s W_v$, and the decoder
    can be formulated as $X'' = \mathcal{F}_{MSA}(Q, K, V) + Q$,
    $X' = \mathcal{F}_{MSA}(X'' + \ldots$ (the remainder is cut off on the slide).
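
    A minimal sketch of the patch embedding and content-to-style cross-attention
    described above, built from off-the-shelf PyTorch modules. The embedding size,
    number of heads, the m = 8 patch size and the single decoder layer are illustrative
    assumptions, not the StyTr² implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionStylizer(nn.Module):
    """Sketch: queries come from content tokens, keys/values from style tokens,
    mirroring Q = Y''_c W_q, K = Y_s W_k, V = Y_s W_v in the decoder above."""
    def __init__(self, embed_dim=512, num_heads=8, patch=8):
        super().__init__()
        # linear patch embedding: split the image into m x m patches via a strided conv
        self.to_tokens = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
                                 nn.Linear(4 * embed_dim, embed_dim))

    def tokenize(self, img):
        t = self.to_tokens(img)              # (B, C, H/m, W/m)
        return t.flatten(2).transpose(1, 2)  # (B, L, C) with L = H*W / m^2

    def forward(self, content, style):
        q = self.tokenize(content)                  # content tokens (queries)
        kv = self.tokenize(style)                   # style tokens (keys / values)
        attn_out, _ = self.cross_attn(q, kv, kv)    # F_MSA(Q, K, V)
        x = self.norm1(attn_out + q)                # residual + layer norm
        return self.norm2(self.ffn(x) + x)          # FFN + residual, as Y = FFN(Y') + Y'

tokens = CrossAttentionStylizer()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(tokens.shape)  # (1, 1024, 512): one stylized token per 8x8 content patch
```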

  4.  Arbitrary Style Transfer with Style-Attentional Networks

    • Summary
    Recent style transfer works struggle to balance content structure and style
    patterns, as well as local and global style patterns, due to their patch-based
    mechanisms. This work proposes a style-attentional network to overcome these
    issues.

    • Related Works
    Huang (2017) adjusted the mean and variance of the content image to those of the
    style image by transferring global feature statistics. Sheng (2018) used a
    patch-based decorator to transfer content features to the semantically nearest
    style features by minimizing the difference between their holistic feature
    distributions.

    • Proposed Methodology
    The network consists of an encoder-decoder mechanism and a style-attentional
    module trained using a novel identity loss, taking a content image $I_c$ and a style
    image $I_s$ to generate the output $I_{cs}$. A VGG19 network is used as the encoder
    and is trained jointly with a symmetric decoder and two SANets. The SANets receive
    input from multiple VGG19 layers (ReLU_4_1, ReLU_5_1), which are then combined.
    With $F_s$ and $F_c$ denoting the features extracted from the style and content
    images by the VGG19 network, the combined output is $F_{cs} = \mathrm{SANet}(F_c, F_s)$
    (a code sketch of this attention block is given at the end of this entry).
    Within the SANet the inputs are mean-variance channel-wise normalized, giving $F'$.
    The attention between $F'_s$ and $F'_c$ is calculated as
    $F^i_{cs} = \frac{1}{C(F)} \sum_{\forall j} \exp\big((W_f F'^i_c)^T (W_g F'^j_s)\big)\, W_h F^j_s$,
    where $C(F) = \sum_{\forall j} \exp\big((W_f F'^i_c)^T (W_g F'^j_s)\big)$. After a
    $1 \times 1$ convolution the result is added to $F_c$ to obtain $F_{csc}$. With
    $F^{r4\_1}_{csc}$ and $F^{r5\_1}_{csc}$ denoting the outputs of the two SANets, they
    are combined as
    $F^m_{csc} = \mathrm{conv}_{3 \times 3}\big(F^{r4\_1}_{csc} + \mathrm{upsampling}(F^{r5\_1}_{csc})\big)$,
    and the resulting feature map is fed to the decoder. The module is optimized with
    $L = \lambda_c L_c + \lambda_s L_s + L_{identity}$, where
    $L_c = \|E(I_{cs})'^{r4\_1} - F'^{r4\_1}_c\|_2 + \|E(I_{cs})'^{r5\_1} - F'^{r5\_1}_c\|_2$,
    $L_s = \sum_{i=1}^{L} \|\phi_i(I_{cs}) - \phi_i(I_s)\|_2$, and
    $L_{identity} = \lambda_{identity1}\big(\|I_{cc} - I_c\|_2 + \|I_{ss} - I_s\|_2\big)
    + \lambda_{identity2}\big(\sum_{i=1}^{L} \|\phi_i(I_{cc}) - \phi_i(I_c)\|_2
    + \sum_{i=1}^{L} \|\phi_i(I_{ss}) - \phi_i(I_s)\|_2\big)$, where $I_{cc}$ and
    $I_{ss}$ denote the outputs produced from two identical content and two identical
    style images, and $\lambda_{identity1}$ and $\lambda_{identity2}$ are loss weights.

    • Result
    MS-COCO and WikiArt are used as the content and style datasets, respectively.

    • Next must-read paper: "Avatar-Net: Multi-scale zero-shot style transfer by
    feature decoration"
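
    The style-attentional block above can be sketched as a small PyTorch module. It
    follows the published SANet formulation with 1×1 convolutions for W_f, W_g, W_h,
    but the normalization helper and channel size are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_variance_norm(feat, eps=1e-5):
    """Channel-wise mean-variance normalization (the F' above)."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, -1)
    mean = flat.mean(dim=2, keepdim=True)
    std = flat.std(dim=2, keepdim=True) + eps
    return ((flat - mean) / std).view(b, c, h, w)

class SANet(nn.Module):
    """F_cs^i = (1/C(F)) * sum_j exp((W_f F'_c^i)^T (W_g F'_s^j)) * W_h F_s^j."""
    def __init__(self, channels=512):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 1)   # W_f (query, from content)
        self.g = nn.Conv2d(channels, channels, 1)   # W_g (key, from style)
        self.h = nn.Conv2d(channels, channels, 1)   # W_h (value, from style)
        self.out = nn.Conv2d(channels, channels, 1) # final 1x1 conv before the skip

    def forward(self, Fc, Fs):
        b, c, hc, wc = Fc.shape
        q = self.f(mean_variance_norm(Fc)).view(b, c, -1).permute(0, 2, 1)  # (B, Nc, C)
        k = self.g(mean_variance_norm(Fs)).view(b, c, -1)                   # (B, C, Ns)
        v = self.h(Fs).view(b, c, -1).permute(0, 2, 1)                      # (B, Ns, C)
        attn = F.softmax(torch.bmm(q, k), dim=-1)       # exp(...) normalized by C(F)
        Fcs = torch.bmm(attn, v).permute(0, 2, 1).view(b, c, hc, wc)
        return Fc + self.out(Fcs)                       # F_csc = F_c + conv1x1(F_cs)

Fcsc = SANet()(torch.randn(1, 512, 32, 32), torch.randn(1, 512, 32, 32))
print(Fcsc.shape)
```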

  5.  Learning to Warp for Style Transfer

    • Summary
    The work proposes a high-speed algorithm with non-parametric warping that supports
    two exemplars, one for texture and another for geometry. The proposed network
    learns a mapping from a 4D array of inter-feature distances to a non-parametric
    2D warp field.

    • Related Works
    Gatys (2015, 2016) used the outputs of later layers to represent content and the
    feature correlations between different layers to represent style. Ulyanov (2017)
    replaced batch normalization with instance normalization to improve quality.
    Yaniv (2019) used features from the geometric styles of individual artists.
    Kim (2020) matched points between the content image and the style image, filtered
    out results with low match quality and trained using a warping loss. Liu (2020)
    proposed a mapping from a 4D function of distance measures to a 2D parametric warp.

    • Proposed Methodology
    For a content image $I_c$, a geometric exemplar $I_g$ and a texture exemplar $I_t$,
    the output $I_o$ is computed using a geometric warping module $D$ and a texture
    rendering module $R$ as $I_o = R(D(I_g, I_c), I_t)$. The geometric warping module
    consists of feature extraction (VGG), feature matching and a warp network. The
    matching between $F_c$ (content features) and $F_g$ (geometric features) is computed
    as the normalized correlation
    $M_{cg} = \langle F_c \mid F_g \rangle \big/ \big(\sum_{p=1}^{W} \sum_{q=1}^{H} \langle F_c \mid F_g \rangle^2\big)^{0.5}$,
    where $\langle F_c \mid F_g \rangle$ denotes the inner product. $M$ is converted from
    4D to 2D as $f: \mathbb{R}^{W \times H \times W \times H} \rightarrow \mathbb{R}^{W_1 \times H_1 \times 2}$,
    which is addressed as an optimization problem. The module is optimized as
    $\mathcal{L}(F_c, F_g, w) = -\sum_{m \in I_c} \sum_{n \in N_m} \log\big(p(w(F^m_c), F^n_g)\big)$,
    where $N_m$ is a search window centered around $m$ and
    $p(w(F^m_c), F^n_g) = \exp\big(M(w(F^m_c), F^n_g)\big) \big/ \sum_{t \in N_m} \exp\big(M(w(F^m_c), F^t_g)\big)$.
    With the Gram matrix defined as $G(F^l(I)) = [F^l(I)]^T F^l(I)$, where $F^l(I)$ is
    the VGG feature map of the $l$-th layer, the style distance is calculated as
    $\Delta_S(I_o, I_t) = \sum_{l \in l_t} \|G(F^l(I_o)) - G(F^l(I_t))\|^2$ and the
    content distance as $\Delta_C(I_o, I_w) = \|F^{l_c}(I_o) - F^{l_c}(I_w)\|^2$, for
    layers $l_t$ and $l_c$ selected for texture and content, respectively (these
    distances are sketched in code below). The network is jointly optimized as
    $I_o = \mathrm{argmin}_I\, [\alpha \Delta_S(I, I_t) + \beta \Delta_C(I, I_w)]$.

    • Result
    MS-COCO and PF-PASCAL are used to train the algorithm. For evaluation, participants
    were asked to choose their preferred output images among multiple algorithms, and
    the proposed algorithm obtained the highest user preference.

    • Next must-read paper: "Deformable style transfer"
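
    The Gram-matrix style distance and the feature-space content distance used in the
    joint objective above are easy to express directly. A minimal PyTorch sketch; the
    dummy tensors stand in for VGG feature maps at the selected layers, and the Gram
    normalization factor is an assumption of this illustration.

```python
import torch

def gram(feat):
    """G(F) = F^T F over spatial positions, per batch element (normalized)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_distance(feats_out, feats_tex):
    """Delta_S: sum of squared Gram differences over the selected texture layers."""
    return sum(torch.sum((gram(fo) - gram(ft)) ** 2)
               for fo, ft in zip(feats_out, feats_tex))

def content_distance(feat_out, feat_warp):
    """Delta_C: squared feature difference at the selected content layer."""
    return torch.sum((feat_out - feat_warp) ** 2)

# usage sketch with dummy VGG-like feature maps
feats_out = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
feats_tex = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
loss = style_distance(feats_out, feats_tex) \
     + content_distance(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
print(loss.item())
```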

  6.  Texture Reformer: Towards Fast and Universal Interactive Texture Transfer

    • Summary
    The work handles the diversity of the interactive texture transfer task efficiently
    in three stages, i.e., structure alignment, texture refinement and holistic effect
    enhancement, applied in a coarse-to-fine manner.

    • Related Works
    Conventional texture transfer methods rely on hand-crafted algorithms. Men (2018)
    proposed a framework for interactive texture transfer using an improved PatchMatch
    and custom channels. Li (2016) combined Markov Random Fields (MRFs) and DCNNs for
    texture transfer. Chen (2016) proposed the "style-swap" operation for fast
    patch-based stylization. Goodfellow (2014) proposed the GAN framework, which has
    also been applied to texture transfer.

    • Proposed Methodology
    For a source image $S_{sty}$ with semantic map $S_{sem}$, the algorithm generates a
    stylized target image $T_{sty}$ given a target semantic map $T_{sem}$. This is done
    in three stages, namely global view structure alignment, local view texture
    refinement and holistic effect enhancement. The first part of an autoencoder, up to
    ReLU_X_1 ($X \in \{1,2,3,4,5\}$), together with 5 decoders used for image
    reconstruction, is trained with
    $L_{recon} = \|I_r - I_i\|^2_2 + \lambda \|\phi(I_r) - \phi(I_i)\|^2_2$, where $\phi$
    denotes a VGG encoder and $I_i$, $I_r$ are the input and the reconstructed output.
    With $F^{S_{sty}}$ and $F^{T^t_{sty}}$ denoting the VGG features of the stylized
    source image and of the target image $T^t_{sty}$, they are standardized (giving
    $\bar{F}^{S_{sty}}$ and $\bar{F}^{T^t_{sty}}$) and the feature maps are fused with
    the semantic map features $F^{S_{sem}}$ and $F^{T_{sem}}$ as
    $F^S = \bar{F}^{S_{sty}} \oplus \omega F^{S_{sem}}$ and
    $F^T = \bar{F}^{T^t_{sty}} \oplus \omega F^{T_{sem}}$, where $\oplus$ denotes
    concatenation. To enhance the holistic effect of the stylized image on the target,
    $SE(F^{S_{sty}}, F^{T^t_{sty}}) = \sigma(F^{S_{sty}})\, \frac{F^{T^t_{sty}} - \mu(F^{T^t_{sty}})}{\sigma(F^{T^t_{sty}})} + \mu(F^{S_{sty}})$,
    where $\mu$ and $\sigma$ denote the mean and standard deviation, respectively (this
    statistics-based enhancement is sketched in code below). The patch size is
    determined as $p = \min\big(H(F^S), W(F^S), H(F^T), W(F^T)\big) - 1$, where $H$ and
    $W$ denote the height and width of the feature map. Since the global alignment is
    done at the deepest layer, ReLU5_1, the computational cost is minimized, and this
    layer provides the highest-level structure features and the largest receptive
    field. To rectify local structure and texture, a patch size of 3 and shallower
    features from ReLU4_1 are used on the output of Stage I. Statistics-based
    enhancements on the low-level features $X \in \{1,2,3\}$ are performed to preserve
    low-level effects.

    • Result
    MS-COCO is used for validating the proposed algorithm. The evaluation metrics are
    the Structural Similarity Index (SSIM) and the Learned Perceptual Image Patch
    Similarity (LPIPS), and results are compared with state-of-the-art algorithms.

    • Next must-read paper: "A common framework for interactive texture transfer"
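
    The holistic effect enhancement SE(·,·) above is an AdaIN-style re-normalization
    that moves the target features toward the source style statistics. A minimal
    sketch, assuming (B, C, H, W) tensors and a small epsilon for numerical stability:

```python
import torch

def holistic_enhance(f_src, f_tgt, eps=1e-5):
    """SE(F_src, F_tgt) = sigma(F_src) * (F_tgt - mu(F_tgt)) / sigma(F_tgt) + mu(F_src),
    computed channel-wise over spatial positions."""
    def stats(f):
        flat = f.flatten(2)                       # (B, C, H*W)
        mu = flat.mean(-1, keepdim=True).unsqueeze(-1)
        sigma = (flat.var(-1, keepdim=True) + eps).sqrt().unsqueeze(-1)
        return mu, sigma
    mu_s, sigma_s = stats(f_src)
    mu_t, sigma_t = stats(f_tgt)
    return sigma_s * (f_tgt - mu_t) / sigma_t + mu_s

enhanced = holistic_enhance(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
print(enhanced.shape)  # (1, 256, 32, 32)
```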

  7.  Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration

    • Summary
    The work proposes Avatar-Net, an algorithm for visually plausible multi-scale
    transfer of arbitrary styles, achieved by preserving not only the feature
    distribution but also the detailed style patterns of the style image.

    • Related Works
    Gatys (2017) formulated the task as an optimization problem that balances content
    and style. Chen (2017) used filtering kernels to strengthen the representative
    power for multiple styles. Huang (2017) used adaptive instance normalization to
    adjust the channel-wise statistics of content features. Chen (2016) swapped content
    patches with the closest style features from an intermediate layer of an
    auto-encoder. Li (2017) recursively applied whitening and coloring transformations
    at multiple levels of an autoencoder to transfer style patterns.

    • Proposed Methodology
    Let bottleneck features $z \in \mathbb{R}^{H \times W \times C}$ be extracted by the
    encoder of an encoder-decoder network for an image $x$, with $z_c$ and $z_s$
    denoting the features of the content and style images, respectively. The features
    $z$ are projected to $z'$ as $z' = W \otimes (z - \mu(z))$, where $\mu(z)$ is the
    mean of $z$, yielding $z'_c$ and $z'_s$. The elements of $z'_c$ are aligned to the
    nearest elements of $z'_s$ as
    $z'_{cs} = \Phi(z'_s)^T \otimes \beta\big(\Phi'(z'_s) \otimes z'_c\big)$, where
    $\Phi(z'_s) \in \mathbb{R}^{P \times P \times C \times (H \times W)}$ is the style
    kernel with patch size $P$ and $\Phi'(z'_s)$ is the normalized style kernel (a
    patch-matching sketch of this step is given below). A coloring transformation is
    then applied as $z_{cs} = C_s \otimes z'_{cs} + \mu(z_s)$, for a coloring kernel
    $C_s$ derived from the covariance matrix of $z_s$. The encoder is a concatenation
    of multiple encoder blocks that extract features as $e^l = E^l_{\theta_{enc}}(e^{l-1})$,
    $l \in \{1, \ldots, L\}$, and the decoder generates intermediate features as
    $d^l = D^{l+1}_{\theta_{dec}}(e^{l+1})$. The style fusion module is implemented as
    $\mathcal{F}_{SF}(d^l_{cs}; e^l_s) = \sigma(e^l_s) \circ \frac{d^l_{cs} - \mu(d^l_{cs})}{\sigma(d^l_{cs})} + \mu(e^l_s)$,
    where $\circ$ denotes channel-wise multiplication and $\sigma(\cdot)$ the
    channel-wise standard deviation.

    • Result
    MS-COCO is used for training and evaluating the proposed model.

    • Next must-read paper: "Fast patch-based style transfer of arbitrary style"
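
    The feature alignment step above (matching each whitened content patch to its
    nearest normalized style patch and reassembling) can be sketched in the spirit of
    the style-swap / style-decorator idea. The patch size, stride and the
    convolution / transposed-convolution implementation are assumptions of this
    illustration, not Avatar-Net's exact code.

```python
import torch
import torch.nn.functional as F

def patch_swap(zc, zs, patch=3, stride=1):
    """Replace each content feature patch with its most similar style patch.
    Similarity is normalized cross-correlation, computed by using the style
    patches as convolution kernels (the 'style kernel' Phi(z'_s) above)."""
    # extract style patches as kernels: (num_patches, C, p, p)
    kernels = F.unfold(zs, patch, stride=stride)                  # (1, C*p*p, N)
    n = kernels.shape[-1]
    kernels = kernels.transpose(1, 2).reshape(n, zs.shape[1], patch, patch)
    normed = kernels / (kernels.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)
    # correlation of every content location with every style patch
    scores = F.conv2d(zc, normed, stride=stride)                  # (1, N, H', W')
    one_hot = F.one_hot(scores.argmax(dim=1), n).permute(0, 3, 1, 2).float()
    # reassemble: place the winning (un-normalized) style patch at each location
    swapped = F.conv_transpose2d(one_hot, kernels, stride=stride)
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(kernels), stride=stride)
    return swapped / overlap.clamp(min=1e-8)                      # average overlaps

zcs = patch_swap(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(zcs.shape)  # (1, 64, 32, 32)
```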

  8.  TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images

    • Summary
    The work proposes TuiGAN, a generative model for coarse-to-fine image-to-image
    translation learned in a one-shot, unsupervised setting.

    • Related Works
    Isola (2017) proposed pix2pix, a generative model for supervised image-to-image
    tasks. Liu (2019) proposed FUNIT for few-shot unsupervised tasks. Gatys (2016)
    proposed style transfer by minimizing a Gram-matrix loss over deep features.
    Shocher (2018) proposed InGAN, an image-specific GAN that learns the internal patch
    distribution. Shaham (2019) proposed SinGAN, an unconditional pyramidal generative
    model that learns patch-based distributions at different scales.

    • Proposed Methodology
    For two images $I_A$ and $I_B$ from domains A and B, respectively, two mapping
    functions are learned: $I_{AB} = G_{AB}(I_A)$ and $I_{BA} = G_{BA}(I_B)$. The
    generators $G_{AB}$ and $G_{BA}$ are implemented as series of scale-wise generators
    $\{G^n_{AB}\}_{n=0}^{N}$ and $\{G^n_{BA}\}_{n=0}^{N}$, which are verified by
    discriminators $D^n_A$ and $D^n_B$. At scale $n$ the generator takes the original
    image at that scale, $I^n_A$, and the upsampled output of the previous generator,
    $I^{n+1}_{AB}\uparrow$: the input is first processed as $I^n_{AB,\phi} = \phi(I^n_A)$,
    an attention map is calculated as
    $A^n = \psi(I^n_{AB,\phi}, I^n_A, I^{n+1}_{AB}\uparrow)$, and $I^n_{AB,\phi}$ and
    $I^{n+1}_{AB}\uparrow$ are then linearly combined as
    $I'^n_{AB} = A^n \otimes I^n_{AB,\phi} + (1 - A^n) \otimes I^{n+1}_{AB}\uparrow$.
    The loss function for the $n$-th scale is
    $L^n_{ALL} = L^n_{ADV} + \lambda_{CYC} L^n_{CYC} + \lambda_{IDT} L^n_{IDT} + \lambda_{TV} L^n_{TV}$,
    a weighted sum of adversarial, cycle-consistency, identity and total variation
    losses (the cycle-consistency and total variation terms are sketched in code
    below). The adversarial loss is calculated as
    $L^n_{ADV} = D^n_B(I^n_B) - D^n_B\big(G^n_{AB}(I^n_A)\big) + D^n_A(I^n_A) - D^n_A\big(G^n_{BA}(I^n_B)\big)
    - \lambda_{PEN}\big(\|\nabla_{I'^n_B} D^n_B(I'^n_B)\|_2 - 1\big)^2
    - \lambda_{PEN}\big(\|\nabla_{I'^n_A} D^n_A(I'^n_A)\|_2 - 1\big)^2$,
    the cycle-consistency loss as
    $L^n_{CYC} = \|I^n_A - I^n_{ABA}\|_1 + \|I^n_B - I^n_{BAB}\|_1$, and the identity
    loss as $L^n_{IDT} = \|I^n_A - I^n_{AA}\|_1 + \|I^n_B - I^n_{BB}\|_1$. To avoid
    noisy and overly pixelated output, a total variation loss is used:
    $L_{TV}(x) = \sum_{i,j} \big((x[i, j+1] - x[i, j])^2 + (x[i+1, j] - x[i, j])^2\big)^{0.5}$.

    • Result
    To evaluate the effectiveness of the proposed algorithm, the metrics used are the
    Single Image Fréchet Inception Distance (SIFID), i.e. the Fréchet Inception
    Distance (FID) between the deep features of two images, the Perceptual Distance
    (PD) between images, and user preference.

    • Next must-read paper: "Unpaired image-to-image translation using cycle-consistent
    adversarial networks"
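
    The total variation and cycle-consistency terms above are straightforward to write
    down. A minimal PyTorch sketch; the mean-reduced L1 distances and the small epsilon
    inside the square root are implementation assumptions.

```python
import torch

def tv_loss(x):
    """L_TV(x) = sum_{i,j} ((x[i, j+1] - x[i, j])^2 + (x[i+1, j] - x[i, j])^2)^0.5
    for an image tensor of shape (B, C, H, W)."""
    dh = x[:, :, :, 1:] - x[:, :, :, :-1]   # horizontal differences
    dv = x[:, :, 1:, :] - x[:, :, :-1, :]   # vertical differences
    return torch.sqrt(dh[:, :, :-1, :] ** 2 + dv[:, :, :, :-1] ** 2 + 1e-8).sum()

def cycle_loss(I_A, I_ABA, I_B, I_BAB):
    """L_CYC: mean-reduced L1 distances ||I_A - I_ABA|| + ||I_B - I_BAB||."""
    return (I_A - I_ABA).abs().mean() + (I_B - I_BAB).abs().mean()

x = torch.rand(1, 3, 64, 64)
print(tv_loss(x).item(), cycle_loss(x, x, x, x).item())
```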

  9.  Style Transfer by Relaxed Optimal Transport and Self-Similarity

    • Summary
    The work proposes Style Transfer by Relaxed Optimal Transport and Self-Similarity
    (STROTSS), which allows user-specified region-to-region or point-to-point control
    over visual similarity as well as unconstrained style transfer.

    • Related Works
    Gatys (2016) used a model trained for classification to extract features
    representing the content and style images; the algorithm is optimized using the
    Frobenius norm between the output and the style and between the output and the
    content, for the style and content losses, respectively. Li (2016) improved the
    style loss by computing a Markov Random Field over the extracted features, so that
    content patches are matched to their closest target patches.

    • Proposed Methodology
    For an image $X$, a VGG16 pretrained on ImageNet is used for feature extraction,
    with $\phi(X)_i$ denoting the features extracted from the $i$-th layer of the
    network. Bilinear upsampling is used to match the original image's spatial
    dimensions, followed by concatenation, and the result is used as the feature
    representation. The algorithm is optimized as
    $L(X, I_C, I_S) = \frac{\alpha \ell_C + \ell_m + \ell_r + \frac{1}{\alpha} \ell_p}{2 + \alpha + \frac{1}{\alpha}}$,
    where $\ell_m + \ell_r + \frac{1}{\alpha} \ell_p$ constitutes the style loss and
    $\alpha \ell_C$ the content loss. The style loss is calculated using the relaxed
    earth mover's distance (REMD), sketched in code below, as
    $\ell_r = \mathrm{REMD}(A, B) = \max\Big(\frac{1}{n} \sum_i \min_j C_{ij},\; \frac{1}{m} \sum_j \min_i C_{ij}\Big)$,
    with cost matrix $C_{ij} = 1 - \frac{A_i \cdot B_j}{\|A_i\|\,\|B_j\|}$. To preserve
    the magnitude of the feature vectors, a moment matching loss
    $\ell_m = \frac{1}{d} \|\mu_A - \mu_B\|_1 + \frac{1}{d^2} \|\Sigma_A - \Sigma_B\|_1$
    is used, where $\mu$ and $\Sigma$ denote the mean and covariance. REMD with a
    Euclidean metric is also used to preserve pixel color ($\ell_p$). The content loss
    is calculated as
    $L_{content}(X, C) = \frac{1}{n^2} \sum_{i,j} \Big| \frac{D^X_{ij}}{\sum_i D^X_{ij}} - \frac{D^{I_C}_{ij}}{\sum_i D^{I_C}_{ij}} \Big|$,
    where $D^X$ and $D^{I_C}$ are the pairwise cosine distances of the features
    extracted from the output and content images, respectively.

    • Result
    Evaluation is done under three regimes, namely 'Paired', when the content and style
    represent the same object; 'Unpaired', when the content and style do not represent
    the same object; and 'Texture', when the content represents an object and the style
    is a texture.

    • Next must-read paper: "Controlling perceptual factors in neural style transfer"
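
    A minimal sketch of the relaxed EMD term with the cosine cost matrix above; A and B
    are assumed to be row-wise feature matrices (n×d and m×d) sampled from the output
    and style images.

```python
import torch

def remd(A, B, eps=1e-8):
    """Relaxed earth mover's distance:
    max( mean_i min_j C_ij , mean_j min_i C_ij ) with C_ij = 1 - cos(A_i, B_j)."""
    A_n = A / (A.norm(dim=1, keepdim=True) + eps)   # (n, d), unit rows
    B_n = B / (B.norm(dim=1, keepdim=True) + eps)   # (m, d), unit rows
    C = 1.0 - A_n @ B_n.t()                         # cosine cost matrix, (n, m)
    return torch.max(C.min(dim=1).values.mean(),    # each A_i to its closest B_j
                     C.min(dim=0).values.mean())    # each B_j to its closest A_i

loss = remd(torch.randn(512, 64), torch.randn(480, 64))
print(loss.item())
```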

  10.  Artistic Style Transfer with Internal-external Learning and Contrastive Learning

    • Summary
    The work attempts to overcome disharmonious colors and repetitive patterns by
    learning human-aware style information and by further considering style-to-style
    relations, which are overlooked in current studies.

    • Related Works
    Gatys (2016) used the Gram matrix of a pretrained DCNN for neural style transfer.
    Soh (2020) proposed a fast, flexible, and lightweight self-supervised
    super-resolution algorithm. Park (2020) enhanced the resolution of restored images
    using a super-resolution method. Wang (2021) proposed an algorithm capable of
    learning internal statistics for inpainting. Kang (2020) proposed a conditional
    contrastive loss to learn data-to-data and data-to-class relations. Liu (2021)
    proposed a latent-augmented contrastive loss for diverse image synthesis.

    • Proposed Methodology
    Given a content image C and a style image S, the task is to create an artistic
    image $I_{sc}$. To this end, a VGG19 is used as the encoder E to extract features,
    together with a style attentional network T and a generative network D. The model
    is optimized using the style loss
    $\mathcal{L}_s = \sum_{i=1}^{L} \|\mu(\phi_i(I_{sc})) - \mu(\phi_i(I_s))\|_2 + \|\sigma(\phi_i(I_{sc})) - \sigma(\phi_i(I_s))\|_2$
    (sketched in code below), where $\phi_i$ denotes the output of the $i$-th layer of
    the VGG19 model. The adversarial loss used is
    $\mathcal{L}_{adv} = \mathbb{E}_{I_s \sim S}[\log \mathcal{D}(I_s)] + \mathbb{E}_{I_c \sim C, I_s \sim S}\big[\log\big(1 - \mathcal{D}(D(T(E(I_c), E(I_s))))\big)\big]$.
    The content loss is
    $\mathcal{L}_c = \|\phi_{conv4\_2}(I_{sc}) - \phi_{conv4\_2}(I_c)\|_2$, and the
    identity loss is
    $\mathcal{L}_{identity} = \lambda_{identity1}\big(\|I_{cc} - I_c\|_2 + \|I_{ss} - I_s\|_2\big) + \ldots$
    (the remainder is cut off on the slide; cf. the identity loss of the SANet paper on
    slide 4).
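
    The style loss above (matching channel-wise means and standard deviations of VGG
    features across layers) can be sketched as follows; the feature lists stand in for
    the outputs of selected VGG19 layers, which is an assumption of this illustration.

```python
import torch

def mean_std(feat, eps=1e-5):
    """Channel-wise mean and std over spatial positions, for (B, C, H, W) features."""
    flat = feat.flatten(2)
    return flat.mean(-1), (flat.var(-1) + eps).sqrt()

def style_loss(feats_sc, feats_s):
    """L_s = sum_i ||mu(phi_i(I_sc)) - mu(phi_i(I_s))||_2 + ||sigma(.) - sigma(.)||_2."""
    loss = 0.0
    for f_sc, f_s in zip(feats_sc, feats_s):
        mu_sc, sigma_sc = mean_std(f_sc)
        mu_s, sigma_s = mean_std(f_s)
        loss = loss + torch.norm(mu_sc - mu_s, p=2) + torch.norm(sigma_sc - sigma_s, p=2)
    return loss

feats_sc = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
feats_s = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
print(style_loss(feats_sc, feats_s).item())
```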

  11.  AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer

    • Summary
    Existing works overlook shallow features and do not consider local feature
    statistics, producing unnatural outputs with distortions. This work calculates
    spatial attention scores from both shallow and deep features of the content and
    style images, followed by normalization. A novel loss function is proposed to
    improve quality.

    • Related Works
    Huang (2017) applied the mean and standard deviation of the style features to the
    content features. Jing (2020) used dynamic instance normalization, in which the
    weights of intermediate CNN blocks are generated by another network that takes the
    style image as input. Chen (2016) relied on the similarity between image and style
    patches for style transfer. Park (2019) proposed the Style-Attentional Network to
    match content and style features.

    • Proposed Methodology
    Given a style image $I_s$ and a content image $I_c$, the output is the stylized
    image $I_{cs}$. The encoder uses VGG19 to obtain multi-scale feature maps, and the
    decoder is symmetric to VGG19. To utilize shallow levels, features from ReLU3_1,
    ReLU4_1 and ReLU5_1 are used. The features extracted up to the current layer are
    resized and concatenated as $F^{1:x}_s$; the same strategy is applied to the content
    image, and the results are fed to the AdaAttN module as
    $F^x_{cs} = \mathrm{AdaAttN}(F^x_c, F^x_s, F^{1:x}_c, F^{1:x}_s)$ (sketched in code
    below). The resulting features are fed to the decoder as
    $I_{cs} = \mathrm{Dec}(F^3_{cs}, F^4_{cs}, F^5_{cs})$. AdaAttN works in three stages:
    calculating the attention map, calculating the attention-weighted mean and standard
    deviation of the style features, and normalizing the content features. For the
    attention map, $F^{1:x}_c$, $F^{1:x}_s$ and $F^x_s$ are normalized, passed through
    $1 \times 1$ convolutions and treated as query, key and value, so that the attention
    map is $A = \mathrm{softmax}(Q^T \otimes K)$ and the attention-weighted features are
    $M = V \otimes A^T$. The standard deviation is calculated as
    $S = \big((V \odot V) \otimes A^T - M \odot M\big)^{0.5}$ and the output as
    $F^x_{cs} = S \odot \mathrm{Norm}(F^x_c) + M$. The network is optimized using
    weighted global style and local feature losses, $L = \lambda_g L_{gs} + \lambda_l L_{lf}$,
    where $L_{gs} = \sum_{x=2}^{5} \big(\|\mu(E_x(I_{cs})) - \ldots$ (the remainder is
    cut off on the slide).
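
    A sketch of the AdaAttN computation above: an attention map from normalized
    multi-layer content/style features, attention-weighted mean and standard deviation
    of the style values, then per-point normalization of the content feature. The 1×1
    projections, channel sizes and instance-norm choice are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaAttN(nn.Module):
    """Sketch of the AdaAttN step: A = softmax(Q K^T), M = A V (weighted mean),
    S = sqrt(A (V*V) - M*M) (weighted std), output = S * Norm(F_c) + M."""
    def __init__(self, channels, key_channels):
        super().__init__()
        self.q = nn.Conv2d(key_channels, key_channels, 1)  # from normalized F_c^{1:x}
        self.k = nn.Conv2d(key_channels, key_channels, 1)  # from normalized F_s^{1:x}
        self.v = nn.Conv2d(channels, channels, 1)          # from F_s^x
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, Fc, Fs, Fc_1x, Fs_1x):
        b, c, h, w = Fc.shape
        q = self.q(F.instance_norm(Fc_1x)).flatten(2).transpose(1, 2)  # (B, Nc, Ck)
        k = self.k(F.instance_norm(Fs_1x)).flatten(2)                  # (B, Ck, Ns)
        v = self.v(Fs).flatten(2).transpose(1, 2)                      # (B, Ns, C)
        A = torch.softmax(torch.bmm(q, k), dim=-1)                     # attention map
        M = torch.bmm(A, v)                                            # weighted mean
        var = (torch.bmm(A, v * v) - M * M).clamp_min(0.0)             # weighted variance
        S = torch.sqrt(var + 1e-6)                                     # weighted std
        M = M.transpose(1, 2).view(b, c, h, w)
        S = S.transpose(1, 2).view(b, c, h, w)
        return S * self.norm(Fc) + M                                   # F_cs^x

# dummy shapes: F^x has 256 channels, the concatenated F^{1:x} has 384
out = AdaAttN(channels=256, key_channels=384)(
    torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32),
    torch.randn(1, 384, 32, 32), torch.randn(1, 384, 32, 32))
print(out.shape)  # (1, 256, 32, 32)
```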

  12.  Company Overview (会社概要)

    Company name: 株式会社 微分 (VIVEN, Inc.)
    Representative: Shintaro Yoshida (吉田 慎太郎)
    Location: JustCo Shinjuku, 18F JR Shinjuku Miraina Tower, 4-1-16 Shinjuku,
    Shinjuku-ku, Tokyo
    Established: October 2020
    Capital: ¥7,000,000 (as of October 2022)
    Employees: 20 (including all employment types)
    Business: development of the "School DX" software for educational institutions;
    web application development; R&D in image recognition and natural language
    processing
