
Style Transfer Survey by VIVEN Inc

VIVEN, Inc.

March 17, 2023

Transcript

  1.  Style Transfer Survey
    Tapas Dutta, Deep Learning Engineer

  2.  Neural Artistic Style Transfer with Conditional Adversarial Networks

    • Summary
    The work proposes a unidirectional GAN model for neural artistic style transfer
    that also enforces cyclic consistency.

    • Related Works
    Gatys (2016) first introduced style transfer for image synthesis. Karras (2019)
    improved upon it to make it faster and more efficient. Yang (2018) used a GAN-based
    algorithm to control the style ingrained in continuous flows of color so as to
    incorporate brush-stroke patterns. Zhang (2020) used ImageNet-pretrained models to
    improve performance. Zhu (2017) introduced image translation between two domains
    with unpaired initial distributions.

    • Proposed Methodology
    The work proposes two approaches for style transfer. The first approach uses a
    GAN whose generator has an encoder-decoder architecture, together with two
    discriminators, one for style and one for content. The encoder extracts object
    features from the content image and local-global fused features from the style
    images, and these features are encoded into a latent vector space within the
    bottleneck layers; the decoder reconstructs the transferred image at the same size
    as the input. PatchGAN is used as the content discriminator since it penalizes
    structure at the patch scale, allowing generation of a semantically accurate image
    (a minimal sketch of such a patch discriminator is given at the end of this entry);
    by treating the image as a Markov random field with one plate per patch, it also
    helps preserve the original colors in the generated image. The style discriminator
    is a wavelet-CNN discriminator, consisting of a wavelet transformation layer
    followed by CNN layers, so it can fuse global and local features to extract style
    from images. The second approach consists of a content encoder, a style encoder
    and a decoder. The content encoder is similar to StyleGAN in that skip connections
    are used between intermediate content-encoding modules and intermediate upscaling
    CNN modules; this module is trained using a marginal loss function. A DenseNet
    model is used as the style encoder, since it concatenates features from previous
    modules and its weighted transition layers create complex feature maps. The decoder
    sub-module of the generator starts from the style-encoded features and is fed
    low-level content-encoded features through skip connections.

    • Next must-read paper: "Generative image inpainting with salient prior and
    relative total variation"
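
    The patch-level content discriminator described above can be sketched with a
    generic PatchGAN-style network. This is a minimal illustration in PyTorch, not the
    authors' exact architecture; the layer count, channel widths and use of instance
    normalization are assumptions.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Minimal PatchGAN-style discriminator: instead of one real/fake scalar,
    it outputs a grid of scores, one per receptive-field patch."""
    def __init__(self, in_channels=3, base_channels=64):
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1),
                nn.InstanceNorm2d(c_out),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            block(base_channels, base_channels * 2, stride=2),
            block(base_channels * 2, base_channels * 4, stride=2),
            block(base_channels * 4, base_channels * 8, stride=1),
            # one score per patch location (no sigmoid: pair with a logit-based GAN loss)
            nn.Conv2d(base_channels * 8, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (B, 1, H', W'): a score map over image patches

# a 256x256 input yields roughly a 30x30 grid of patch scores
print(PatchDiscriminator()(torch.randn(1, 3, 256, 256)).shape)
```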

  3.  StyTr²: Image Style Transfer with Transformers

    • Summary
    Recent style transfer works struggle with retaining global information, so this
    work uses a transformer-based architecture to overcome the issue. Two separate
    transformers are used for style and content encoding, and the decoder uses a
    transformer to stylize the content.

    • Related Works
    Gatys (2016) used a CNN to extract features from the content and stylized images
    and optimized the algorithm to generate stylized images. Huang (2017) proposed
    adaptive instance normalization, which stylizes an image by replacing the mean and
    variance of the content image with those of a stylized exemplar. Chen (2021)
    proposed the Internal-External Style transfer algorithm, optimized using two types
    of contrastive loss to improve result quality.

    • Proposed Methodology
    Given a content image $I_c$ and a style image $I_s$, both are split into patches and
    a linear layer is used to obtain embeddings $\varepsilon$ of size $L \times C$, where
    $L = \frac{H \times W}{m \times m}$ for patch size $m = 8$ and embedding dimension $C$.
    The attention score between the $i$-th and $j$-th patches is calculated as
    $A_{i,j} = \big((\varepsilon_i + P_i) W_q\big)^T \big((\varepsilon_j + P_j) W_k\big)$,
    where $P$ is the positional encoding and $W_q$ and $W_k$ are the query and key
    weights, respectively. The (content-aware) positional embedding is formulated as
    $P_{CA}(x, y) = \sum_{k=0}^{s} \sum_{l=0}^{s} a_{kl}\, P_L(x_k, y_l)$, where
    $P_L = \mathcal{F}_{pos}\big(\mathrm{AvgPool}_{n \times n}(\varepsilon)\big)$,
    $\mathcal{F}_{pos}$ is a $1 \times 1$ convolution and $s$ is the number of
    neighboring patches. The query, key and value are calculated as $Q = Z_c W_q$,
    $K = Z_c W_k$ and $V = Z_c W_v$, where for the content image
    $Z_c = \{\varepsilon_{c1} + P_{CA1}, \varepsilon_{c2} + P_{CA2}, \ldots, \varepsilon_{cL} + P_{CAL}\}$.
    The multi-head attention is calculated as
    $F_{MSA}(Q, K, V) = \mathrm{Concat}\big(\mathrm{Attention}_1(Q, K, V), \ldots, \mathrm{Attention}_n(Q, K, V)\big) W_O$,
    with layer normalization applied after each block, followed by a skip connection.
    A similar encoding is used for the style image, except without the positional
    encoding. Thus, the content encoder can be represented as
    $Y'_c = \mathcal{F}_{MSA}(Q, K, V) + Q$ and $Y_c = \mathcal{F}_{FFN}(Y'_c) + Y'_c$.
    The decoder consists of two multi-head attention layers and one residual layer.
    For the content input $Y''_c = \{Y_{c1} + P_{CA1}, Y_{c2} + P_{CA2}, \ldots, Y_{cL} + P_{CAL}\}$
    and the style sequence $Y_s = \{Y_{s1}, Y_{s2}, \ldots, Y_{sL}\}$, the query, key and
    value are calculated as $Q = Y''_c W_q$, $K = Y_s W_k$, $V = Y_s W_v$, and the decoder
    can be formulated as $X'' = \mathcal{F}_{MSA}(Q, K, V) + Q$,
    $X' = \mathcal{F}_{MSA}(X'' + \ldots$ (the remainder is cut off on the slide).
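
    A minimal sketch of the patch embedding and content-to-style cross-attention
    described above, built from off-the-shelf PyTorch modules. The embedding size,
    number of heads, the m = 8 patch size and the single decoder layer are illustrative
    assumptions, not the StyTr² implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionStylizer(nn.Module):
    """Sketch: queries come from content tokens, keys/values from style tokens,
    mirroring Q = Y''_c W_q, K = Y_s W_k, V = Y_s W_v in the decoder above."""
    def __init__(self, embed_dim=512, num_heads=8, patch=8):
        super().__init__()
        # linear patch embedding: split the image into m x m patches via a strided conv
        self.to_tokens = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
                                 nn.Linear(4 * embed_dim, embed_dim))

    def tokenize(self, img):
        t = self.to_tokens(img)              # (B, C, H/m, W/m)
        return t.flatten(2).transpose(1, 2)  # (B, L, C) with L = H*W / m^2

    def forward(self, content, style):
        q = self.tokenize(content)                  # content tokens (queries)
        kv = self.tokenize(style)                   # style tokens (keys / values)
        attn_out, _ = self.cross_attn(q, kv, kv)    # F_MSA(Q, K, V)
        x = self.norm1(attn_out + q)                # residual + layer norm
        return self.norm2(self.ffn(x) + x)          # FFN + residual, as Y = FFN(Y') + Y'

tokens = CrossAttentionStylizer()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(tokens.shape)  # (1, 1024, 512): one stylized token per 8x8 content patch
```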

  4.  Arbitrary Style Transfer with Style-Attentional Networks

    • Summary
    Recent style transfer works struggle to balance content structure and style
    patterns, as well as local and global style patterns, due to their patch-based
    mechanisms. This work proposes a style-attentional network to overcome these
    issues.

    • Related Works
    Huang (2017) adjusted the mean and variance of the content image to those of the
    style image by transferring global feature statistics. Sheng (2018) used a
    patch-based decorator to transfer content features to the semantically nearest
    style features by minimizing the difference between their holistic feature
    distributions.

    • Proposed Methodology
    The network consists of an encoder-decoder mechanism and a style-attentional
    module trained using a novel identity loss, taking a content image $I_c$ and a style
    image $I_s$ to generate the output $I_{cs}$. A VGG19 network is used as the encoder
    and is trained jointly with a symmetric decoder and two SANets. The SANets receive
    input from multiple VGG19 layers (ReLU_4_1, ReLU_5_1), which are then combined.
    With $F_s$ and $F_c$ denoting the features extracted from the style and content
    images by the VGG19 network, the combined output is $F_{cs} = \mathrm{SANet}(F_c, F_s)$
    (a code sketch of this attention block is given at the end of this entry).
    Within the SANet the inputs are mean-variance channel-wise normalized, giving $F'$.
    The attention between $F'_s$ and $F'_c$ is calculated as
    $F^i_{cs} = \frac{1}{C(F)} \sum_{\forall j} \exp\big((W_f F'^i_c)^T (W_g F'^j_s)\big)\, W_h F^j_s$,
    where $C(F) = \sum_{\forall j} \exp\big((W_f F'^i_c)^T (W_g F'^j_s)\big)$. After a
    $1 \times 1$ convolution the result is added to $F_c$ to obtain $F_{csc}$. With
    $F^{r4\_1}_{csc}$ and $F^{r5\_1}_{csc}$ denoting the outputs of the two SANets, they
    are combined as
    $F^m_{csc} = \mathrm{conv}_{3 \times 3}\big(F^{r4\_1}_{csc} + \mathrm{upsampling}(F^{r5\_1}_{csc})\big)$,
    and the resulting feature map is fed to the decoder. The module is optimized with
    $L = \lambda_c L_c + \lambda_s L_s + L_{identity}$, where
    $L_c = \|E(I_{cs})'^{r4\_1} - F'^{r4\_1}_c\|_2 + \|E(I_{cs})'^{r5\_1} - F'^{r5\_1}_c\|_2$,
    $L_s = \sum_{i=1}^{L} \|\phi_i(I_{cs}) - \phi_i(I_s)\|_2$, and
    $L_{identity} = \lambda_{identity1}\big(\|I_{cc} - I_c\|_2 + \|I_{ss} - I_s\|_2\big)
    + \lambda_{identity2}\big(\sum_{i=1}^{L} \|\phi_i(I_{cc}) - \phi_i(I_c)\|_2
    + \sum_{i=1}^{L} \|\phi_i(I_{ss}) - \phi_i(I_s)\|_2\big)$, where $I_{cc}$ and
    $I_{ss}$ denote the outputs produced from two identical content and two identical
    style images, and $\lambda_{identity1}$ and $\lambda_{identity2}$ are loss weights.

    • Result
    MS-COCO and WikiArt are used as the content and style datasets, respectively.

    • Next must-read paper: "Avatar-Net: Multi-scale zero-shot style transfer by
    feature decoration"
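
    The style-attentional block above can be sketched as a small PyTorch module. It
    follows the published SANet formulation with 1×1 convolutions for W_f, W_g, W_h,
    but the normalization helper and channel size are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_variance_norm(feat, eps=1e-5):
    """Channel-wise mean-variance normalization (the F' above)."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, -1)
    mean = flat.mean(dim=2, keepdim=True)
    std = flat.std(dim=2, keepdim=True) + eps
    return ((flat - mean) / std).view(b, c, h, w)

class SANet(nn.Module):
    """F_cs^i = (1/C(F)) * sum_j exp((W_f F'_c^i)^T (W_g F'_s^j)) * W_h F_s^j."""
    def __init__(self, channels=512):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 1)   # W_f (query, from content)
        self.g = nn.Conv2d(channels, channels, 1)   # W_g (key, from style)
        self.h = nn.Conv2d(channels, channels, 1)   # W_h (value, from style)
        self.out = nn.Conv2d(channels, channels, 1) # final 1x1 conv before the skip

    def forward(self, Fc, Fs):
        b, c, hc, wc = Fc.shape
        q = self.f(mean_variance_norm(Fc)).view(b, c, -1).permute(0, 2, 1)  # (B, Nc, C)
        k = self.g(mean_variance_norm(Fs)).view(b, c, -1)                   # (B, C, Ns)
        v = self.h(Fs).view(b, c, -1).permute(0, 2, 1)                      # (B, Ns, C)
        attn = F.softmax(torch.bmm(q, k), dim=-1)       # exp(...) normalized by C(F)
        Fcs = torch.bmm(attn, v).permute(0, 2, 1).view(b, c, hc, wc)
        return Fc + self.out(Fcs)                       # F_csc = F_c + conv1x1(F_cs)

Fcsc = SANet()(torch.randn(1, 512, 32, 32), torch.randn(1, 512, 32, 32))
print(Fcsc.shape)
```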

  5.  Learning to Warp for Style Transfer

    • Summary
    The work proposes a high-speed algorithm with non-parametric warping that supports
    two exemplars, one for texture and another for geometry. The proposed network
    learns a mapping from a 4D array of inter-feature distances to a non-parametric
    2D warp field.

    • Related Works
    Gatys (2015, 2016) used the outputs of later layers to represent content and the
    feature correlations between different layers to represent style. Ulyanov (2017)
    replaced batch normalization with instance normalization to improve quality.
    Yaniv (2019) used features from the geometric styles of individual artists.
    Kim (2020) matched points between the content image and the style image, filtered
    out results with low match quality and trained using a warping loss. Liu (2020)
    proposed a mapping from a 4D function of distance measures to a 2D parametric warp.

    • Proposed Methodology
    For a content image $I_c$, a geometric exemplar $I_g$ and a texture exemplar $I_t$,
    the output $I_o$ is computed using a geometric warping module $D$ and a texture
    rendering module $R$ as $I_o = R(D(I_g, I_c), I_t)$. The geometric warping module
    consists of feature extraction (VGG), feature matching and a warp network. The
    matching between $F_c$ (content features) and $F_g$ (geometric features) is computed
    as the normalized correlation
    $M_{cg} = \langle F_c \mid F_g \rangle \big/ \big(\sum_{p=1}^{W} \sum_{q=1}^{H} \langle F_c \mid F_g \rangle^2\big)^{0.5}$,
    where $\langle F_c \mid F_g \rangle$ denotes the inner product. $M$ is converted from
    4D to 2D as $f: \mathbb{R}^{W \times H \times W \times H} \rightarrow \mathbb{R}^{W_1 \times H_1 \times 2}$,
    which is addressed as an optimization problem. The module is optimized as
    $\mathcal{L}(F_c, F_g, w) = -\sum_{m \in I_c} \sum_{n \in N_m} \log\big(p(w(F^m_c), F^n_g)\big)$,
    where $N_m$ is a search window centered around $m$ and
    $p(w(F^m_c), F^n_g) = \exp\big(M(w(F^m_c), F^n_g)\big) \big/ \sum_{t \in N_m} \exp\big(M(w(F^m_c), F^t_g)\big)$.
    With the Gram matrix defined as $G(F^l(I)) = [F^l(I)]^T F^l(I)$, where $F^l(I)$ is
    the VGG feature map of the $l$-th layer, the style distance is calculated as
    $\Delta_S(I_o, I_t) = \sum_{l \in l_t} \|G(F^l(I_o)) - G(F^l(I_t))\|^2$ and the
    content distance as $\Delta_C(I_o, I_w) = \|F^{l_c}(I_o) - F^{l_c}(I_w)\|^2$, for
    layers $l_t$ and $l_c$ selected for texture and content, respectively (these
    distances are sketched in code below). The network is jointly optimized as
    $I_o = \mathrm{argmin}_I\, [\alpha \Delta_S(I, I_t) + \beta \Delta_C(I, I_w)]$.

    • Result
    MS-COCO and PF-PASCAL are used to train the algorithm. For evaluation, participants
    were asked to choose their preferred output images among multiple algorithms, and
    the proposed algorithm obtained the highest user preference.

    • Next must-read paper: "Deformable style transfer"
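
    The Gram-matrix style distance and the feature-space content distance used in the
    joint objective above are easy to express directly. A minimal PyTorch sketch; the
    dummy tensors stand in for VGG feature maps at the selected layers, and the Gram
    normalization factor is an assumption of this illustration.

```python
import torch

def gram(feat):
    """G(F) = F^T F over spatial positions, per batch element (normalized)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_distance(feats_out, feats_tex):
    """Delta_S: sum of squared Gram differences over the selected texture layers."""
    return sum(torch.sum((gram(fo) - gram(ft)) ** 2)
               for fo, ft in zip(feats_out, feats_tex))

def content_distance(feat_out, feat_warp):
    """Delta_C: squared feature difference at the selected content layer."""
    return torch.sum((feat_out - feat_warp) ** 2)

# usage sketch with dummy VGG-like feature maps
feats_out = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
feats_tex = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
loss = style_distance(feats_out, feats_tex) \
     + content_distance(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
print(loss.item())
```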

  6.  Texture Reformer: Towards Fast and Universal Interactive Texture Transfer

    • Summary
    The work handles the diversity of the interactive texture transfer task efficiently
    in three stages, i.e., structure alignment, texture refinement and holistic effect
    enhancement, applied in a coarse-to-fine manner.

    • Related Works
    Conventional texture transfer methods rely on hand-crafted algorithms. Men (2018)
    proposed a framework for interactive texture transfer using an improved PatchMatch
    and custom channels. Li (2016) combined Markov Random Fields (MRFs) and DCNNs for
    texture transfer. Chen (2016) proposed the "style-swap" operation for fast
    patch-based stylization. Goodfellow (2014) proposed the GAN framework, which has
    also been applied to texture transfer.

    • Proposed Methodology
    For a source image $S_{sty}$ with semantic map $S_{sem}$, the algorithm generates a
    stylized target image $T_{sty}$ given a target semantic map $T_{sem}$. This is done
    in three stages, namely global view structure alignment, local view texture
    refinement and holistic effect enhancement. The first part of an autoencoder, up to
    ReLU_X_1 ($X \in \{1,2,3,4,5\}$), together with 5 decoders used for image
    reconstruction, is trained with
    $L_{recon} = \|I_r - I_i\|^2_2 + \lambda \|\phi(I_r) - \phi(I_i)\|^2_2$, where $\phi$
    denotes a VGG encoder and $I_i$, $I_r$ are the input and the reconstructed output.
    With $F^{S_{sty}}$ and $F^{T^t_{sty}}$ denoting the VGG features of the stylized
    source image and of the target image $T^t_{sty}$, they are standardized (giving
    $\bar{F}^{S_{sty}}$ and $\bar{F}^{T^t_{sty}}$) and the feature maps are fused with
    the semantic map features $F^{S_{sem}}$ and $F^{T_{sem}}$ as
    $F^S = \bar{F}^{S_{sty}} \oplus \omega F^{S_{sem}}$ and
    $F^T = \bar{F}^{T^t_{sty}} \oplus \omega F^{T_{sem}}$, where $\oplus$ denotes
    concatenation. To enhance the holistic effect of the stylized image on the target,
    $SE(F^{S_{sty}}, F^{T^t_{sty}}) = \sigma(F^{S_{sty}})\, \frac{F^{T^t_{sty}} - \mu(F^{T^t_{sty}})}{\sigma(F^{T^t_{sty}})} + \mu(F^{S_{sty}})$,
    where $\mu$ and $\sigma$ denote the mean and standard deviation, respectively (this
    statistics-based enhancement is sketched in code below). The patch size is
    determined as $p = \min\big(H(F^S), W(F^S), H(F^T), W(F^T)\big) - 1$, where $H$ and
    $W$ denote the height and width of the feature map. Since the global alignment is
    done at the deepest layer, ReLU5_1, the computational cost is minimized, and this
    layer provides the highest-level structure features and the largest receptive
    field. To rectify local structure and texture, a patch size of 3 and shallower
    features from ReLU4_1 are used on the output of Stage I. Statistics-based
    enhancements on the low-level features $X \in \{1,2,3\}$ are performed to preserve
    low-level effects.

    • Result
    MS-COCO is used for validating the proposed algorithm. The evaluation metrics are
    the Structural Similarity Index (SSIM) and the Learned Perceptual Image Patch
    Similarity (LPIPS), and results are compared with state-of-the-art algorithms.

    • Next must-read paper: "A common framework for interactive texture transfer"
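
    The holistic effect enhancement SE(·,·) above is an AdaIN-style re-normalization
    that moves the target features toward the source style statistics. A minimal
    sketch, assuming (B, C, H, W) tensors and a small epsilon for numerical stability:

```python
import torch

def holistic_enhance(f_src, f_tgt, eps=1e-5):
    """SE(F_src, F_tgt) = sigma(F_src) * (F_tgt - mu(F_tgt)) / sigma(F_tgt) + mu(F_src),
    computed channel-wise over spatial positions."""
    def stats(f):
        flat = f.flatten(2)                       # (B, C, H*W)
        mu = flat.mean(-1, keepdim=True).unsqueeze(-1)
        sigma = (flat.var(-1, keepdim=True) + eps).sqrt().unsqueeze(-1)
        return mu, sigma
    mu_s, sigma_s = stats(f_src)
    mu_t, sigma_t = stats(f_tgt)
    return sigma_s * (f_tgt - mu_t) / sigma_t + mu_s

enhanced = holistic_enhance(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
print(enhanced.shape)  # (1, 256, 32, 32)
```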

  7.  Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration

    • Summary
    The work proposes Avatar-Net, an algorithm for visually plausible multi-scale
    transfer of arbitrary styles, achieved by preserving not only the feature
    distribution but also the detailed style patterns of the style image.

    • Related Works
    Gatys (2017) formulated the task as an optimization problem that balances content
    and style. Chen (2017) used filtering kernels to strengthen the representative
    power for multiple styles. Huang (2017) used adaptive instance normalization to
    adjust the channel-wise statistics of content features. Chen (2016) swapped content
    patches with the closest style features from an intermediate layer of an
    auto-encoder. Li (2017) recursively applied whitening and coloring transformations
    at multiple levels of an autoencoder to transfer style patterns.

    • Proposed Methodology
    Let bottleneck features $z \in \mathbb{R}^{H \times W \times C}$ be extracted by the
    encoder of an encoder-decoder network for an image $x$, with $z_c$ and $z_s$
    denoting the features of the content and style images, respectively. The features
    $z$ are projected to $z'$ as $z' = W \otimes (z - \mu(z))$, where $\mu(z)$ is the
    mean of $z$, yielding $z'_c$ and $z'_s$. The elements of $z'_c$ are aligned to the
    nearest elements of $z'_s$ as
    $z'_{cs} = \Phi(z'_s)^T \otimes \beta\big(\Phi'(z'_s) \otimes z'_c\big)$, where
    $\Phi(z'_s) \in \mathbb{R}^{P \times P \times C \times (H \times W)}$ is the style
    kernel with patch size $P$ and $\Phi'(z'_s)$ is the normalized style kernel (a
    patch-matching sketch of this step is given below). A coloring transformation is
    then applied as $z_{cs} = C_s \otimes z'_{cs} + \mu(z_s)$, for a coloring kernel
    $C_s$ derived from the covariance matrix of $z_s$. The encoder is a concatenation
    of multiple encoder blocks that extract features as $e^l = E^l_{\theta_{enc}}(e^{l-1})$,
    $l \in \{1, \ldots, L\}$, and the decoder generates intermediate features as
    $d^l = D^{l+1}_{\theta_{dec}}(e^{l+1})$. The style fusion module is implemented as
    $\mathcal{F}_{SF}(d^l_{cs}; e^l_s) = \sigma(e^l_s) \circ \frac{d^l_{cs} - \mu(d^l_{cs})}{\sigma(d^l_{cs})} + \mu(e^l_s)$,
    where $\circ$ denotes channel-wise multiplication and $\sigma(\cdot)$ the
    channel-wise standard deviation.

    • Result
    MS-COCO is used for training and evaluating the proposed model.

    • Next must-read paper: "Fast patch-based style transfer of arbitrary style"
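
    The feature alignment step above (matching each whitened content patch to its
    nearest normalized style patch and reassembling) can be sketched in the spirit of
    the style-swap / style-decorator idea. The patch size, stride and the
    convolution / transposed-convolution implementation are assumptions of this
    illustration, not Avatar-Net's exact code.

```python
import torch
import torch.nn.functional as F

def patch_swap(zc, zs, patch=3, stride=1):
    """Replace each content feature patch with its most similar style patch.
    Similarity is normalized cross-correlation, computed by using the style
    patches as convolution kernels (the 'style kernel' Phi(z'_s) above)."""
    # extract style patches as kernels: (num_patches, C, p, p)
    kernels = F.unfold(zs, patch, stride=stride)                  # (1, C*p*p, N)
    n = kernels.shape[-1]
    kernels = kernels.transpose(1, 2).reshape(n, zs.shape[1], patch, patch)
    normed = kernels / (kernels.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)
    # correlation of every content location with every style patch
    scores = F.conv2d(zc, normed, stride=stride)                  # (1, N, H', W')
    one_hot = F.one_hot(scores.argmax(dim=1), n).permute(0, 3, 1, 2).float()
    # reassemble: place the winning (un-normalized) style patch at each location
    swapped = F.conv_transpose2d(one_hot, kernels, stride=stride)
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(kernels), stride=stride)
    return swapped / overlap.clamp(min=1e-8)                      # average overlaps

zcs = patch_swap(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(zcs.shape)  # (1, 64, 32, 32)
```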

  8.  TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images

    • Summary
    The work proposes TuiGAN, a generative model for coarse-to-fine image-to-image
    translation learned in a one-shot, unsupervised setting.

    • Related Works
    Isola (2017) proposed pix2pix, a generative model for supervised image-to-image
    tasks. Liu (2019) proposed FUNIT for few-shot unsupervised tasks. Gatys (2016)
    proposed style transfer by minimizing a Gram-matrix loss over deep features.
    Shocher (2018) proposed InGAN, an image-specific GAN that learns the internal patch
    distribution. Shaham (2019) proposed SinGAN, an unconditional pyramidal generative
    model that learns patch-based distributions at different scales.

    • Proposed Methodology
    For two images $I_A$ and $I_B$ from domains A and B, respectively, two mapping
    functions are learned: $I_{AB} = G_{AB}(I_A)$ and $I_{BA} = G_{BA}(I_B)$. The
    generators $G_{AB}$ and $G_{BA}$ are implemented as series of scale-wise generators
    $\{G^n_{AB}\}_{n=0}^{N}$ and $\{G^n_{BA}\}_{n=0}^{N}$, which are verified by
    discriminators $D^n_A$ and $D^n_B$. At scale $n$ the generator takes the original
    image at that scale, $I^n_A$, and the upsampled output of the previous generator,
    $I^{n+1}_{AB}\uparrow$: the input is first processed as $I^n_{AB,\phi} = \phi(I^n_A)$,
    an attention map is calculated as
    $A^n = \psi(I^n_{AB,\phi}, I^n_A, I^{n+1}_{AB}\uparrow)$, and $I^n_{AB,\phi}$ and
    $I^{n+1}_{AB}\uparrow$ are then linearly combined as
    $I'^n_{AB} = A^n \otimes I^n_{AB,\phi} + (1 - A^n) \otimes I^{n+1}_{AB}\uparrow$.
    The loss function for the $n$-th scale is
    $L^n_{ALL} = L^n_{ADV} + \lambda_{CYC} L^n_{CYC} + \lambda_{IDT} L^n_{IDT} + \lambda_{TV} L^n_{TV}$,
    a weighted sum of adversarial, cycle-consistency, identity and total variation
    losses (the cycle-consistency and total variation terms are sketched in code
    below). The adversarial loss is calculated as
    $L^n_{ADV} = D^n_B(I^n_B) - D^n_B\big(G^n_{AB}(I^n_A)\big) + D^n_A(I^n_A) - D^n_A\big(G^n_{BA}(I^n_B)\big)
    - \lambda_{PEN}\big(\|\nabla_{I'^n_B} D^n_B(I'^n_B)\|_2 - 1\big)^2
    - \lambda_{PEN}\big(\|\nabla_{I'^n_A} D^n_A(I'^n_A)\|_2 - 1\big)^2$,
    the cycle-consistency loss as
    $L^n_{CYC} = \|I^n_A - I^n_{ABA}\|_1 + \|I^n_B - I^n_{BAB}\|_1$, and the identity
    loss as $L^n_{IDT} = \|I^n_A - I^n_{AA}\|_1 + \|I^n_B - I^n_{BB}\|_1$. To avoid
    noisy and overly pixelated output, a total variation loss is used:
    $L_{TV}(x) = \sum_{i,j} \big((x[i, j+1] - x[i, j])^2 + (x[i+1, j] - x[i, j])^2\big)^{0.5}$.

    • Result
    To evaluate the effectiveness of the proposed algorithm, the metrics used are the
    Single Image Fréchet Inception Distance (SIFID), i.e. the Fréchet Inception
    Distance (FID) between the deep features of two images, the Perceptual Distance
    (PD) between images, and user preference.

    • Next must-read paper: "Unpaired image-to-image translation using cycle-consistent
    adversarial networks"
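
    The total variation and cycle-consistency terms above are straightforward to write
    down. A minimal PyTorch sketch; the mean-reduced L1 distances and the small epsilon
    inside the square root are implementation assumptions.

```python
import torch

def tv_loss(x):
    """L_TV(x) = sum_{i,j} ((x[i, j+1] - x[i, j])^2 + (x[i+1, j] - x[i, j])^2)^0.5
    for an image tensor of shape (B, C, H, W)."""
    dh = x[:, :, :, 1:] - x[:, :, :, :-1]   # horizontal differences
    dv = x[:, :, 1:, :] - x[:, :, :-1, :]   # vertical differences
    return torch.sqrt(dh[:, :, :-1, :] ** 2 + dv[:, :, :, :-1] ** 2 + 1e-8).sum()

def cycle_loss(I_A, I_ABA, I_B, I_BAB):
    """L_CYC: mean-reduced L1 distances ||I_A - I_ABA|| + ||I_B - I_BAB||."""
    return (I_A - I_ABA).abs().mean() + (I_B - I_BAB).abs().mean()

x = torch.rand(1, 3, 64, 64)
print(tv_loss(x).item(), cycle_loss(x, x, x, x).item())
```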

  9.  Style Transfer by Relaxed Optimal Transport and Self-Similarity

    • Summary
    The work proposes Style Transfer by Relaxed Optimal Transport and Self-Similarity
    (STROTSS), which allows user-specified region-to-region or point-to-point control
    over visual similarity as well as unconstrained style transfer.

    • Related Works
    Gatys (2016) used a model trained for classification to extract features
    representing the content and style images; the algorithm is optimized using the
    Frobenius norm between the output and the style and between the output and the
    content, for the style and content losses, respectively. Li (2016) improved the
    style loss by computing a Markov Random Field over the extracted features, so that
    content patches are matched to their closest target patches.

    • Proposed Methodology
    For an image $X$, a VGG16 pretrained on ImageNet is used for feature extraction,
    with $\phi(X)_i$ denoting the features extracted from the $i$-th layer of the
    network. Bilinear upsampling is used to match the original image's spatial
    dimensions, followed by concatenation, and the result is used as the feature
    representation. The algorithm is optimized as
    $L(X, I_C, I_S) = \frac{\alpha \ell_C + \ell_m + \ell_r + \frac{1}{\alpha} \ell_p}{2 + \alpha + \frac{1}{\alpha}}$,
    where $\ell_m + \ell_r + \frac{1}{\alpha} \ell_p$ constitutes the style loss and
    $\alpha \ell_C$ the content loss. The style loss is calculated using the relaxed
    earth mover's distance (REMD), sketched in code below, as
    $\ell_r = \mathrm{REMD}(A, B) = \max\Big(\frac{1}{n} \sum_i \min_j C_{ij},\; \frac{1}{m} \sum_j \min_i C_{ij}\Big)$,
    with cost matrix $C_{ij} = 1 - \frac{A_i \cdot B_j}{\|A_i\|\,\|B_j\|}$. To preserve
    the magnitude of the feature vectors, a moment matching loss
    $\ell_m = \frac{1}{d} \|\mu_A - \mu_B\|_1 + \frac{1}{d^2} \|\Sigma_A - \Sigma_B\|_1$
    is used, where $\mu$ and $\Sigma$ denote the mean and covariance. REMD with a
    Euclidean metric is also used to preserve pixel color ($\ell_p$). The content loss
    is calculated as
    $L_{content}(X, C) = \frac{1}{n^2} \sum_{i,j} \Big| \frac{D^X_{ij}}{\sum_i D^X_{ij}} - \frac{D^{I_C}_{ij}}{\sum_i D^{I_C}_{ij}} \Big|$,
    where $D^X$ and $D^{I_C}$ are the pairwise cosine distances of the features
    extracted from the output and content images, respectively.

    • Result
    Evaluation is done under three regimes, namely 'Paired', when the content and style
    represent the same object; 'Unpaired', when the content and style do not represent
    the same object; and 'Texture', when the content represents an object and the style
    is a texture.

    • Next must-read paper: "Controlling perceptual factors in neural style transfer"
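
    A minimal sketch of the relaxed EMD term with the cosine cost matrix above; A and B
    are assumed to be row-wise feature matrices (n×d and m×d) sampled from the output
    and style images.

```python
import torch

def remd(A, B, eps=1e-8):
    """Relaxed earth mover's distance:
    max( mean_i min_j C_ij , mean_j min_i C_ij ) with C_ij = 1 - cos(A_i, B_j)."""
    A_n = A / (A.norm(dim=1, keepdim=True) + eps)   # (n, d), unit rows
    B_n = B / (B.norm(dim=1, keepdim=True) + eps)   # (m, d), unit rows
    C = 1.0 - A_n @ B_n.t()                         # cosine cost matrix, (n, m)
    return torch.max(C.min(dim=1).values.mean(),    # each A_i to its closest B_j
                     C.min(dim=0).values.mean())    # each B_j to its closest A_i

loss = remd(torch.randn(512, 64), torch.randn(480, 64))
print(loss.item())
```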

  10.  Artistic Style Transfer with Internal-external Learning and Contrastive Learning

    • Summary
    The work attempts to overcome disharmonious colors and repetitive patterns by
    learning human-aware style information and by further considering style-to-style
    relations, which are overlooked in current studies.

    • Related Works
    Gatys (2016) used the Gram matrix of a pretrained DCNN for neural style transfer.
    Soh (2020) proposed a fast, flexible, and lightweight self-supervised
    super-resolution algorithm. Park (2020) enhanced the resolution of restored images
    using a super-resolution method. Wang (2021) proposed an algorithm capable of
    learning internal statistics for inpainting. Kang (2020) proposed a conditional
    contrastive loss to learn data-to-data and data-to-class relations. Liu (2021)
    proposed a latent-augmented contrastive loss for diverse image synthesis.

    • Proposed Methodology
    Given a content image C and a style image S, the task is to create an artistic
    image $I_{sc}$. To this end, a VGG19 is used as the encoder E to extract features,
    together with a style attentional network T and a generative network D. The model
    is optimized using the style loss
    $\mathcal{L}_s = \sum_{i=1}^{L} \|\mu(\phi_i(I_{sc})) - \mu(\phi_i(I_s))\|_2 + \|\sigma(\phi_i(I_{sc})) - \sigma(\phi_i(I_s))\|_2$
    (sketched in code below), where $\phi_i$ denotes the output of the $i$-th layer of
    the VGG19 model. The adversarial loss used is
    $\mathcal{L}_{adv} = \mathbb{E}_{I_s \sim S}[\log \mathcal{D}(I_s)] + \mathbb{E}_{I_c \sim C, I_s \sim S}\big[\log\big(1 - \mathcal{D}(D(T(E(I_c), E(I_s))))\big)\big]$.
    The content loss is
    $\mathcal{L}_c = \|\phi_{conv4\_2}(I_{sc}) - \phi_{conv4\_2}(I_c)\|_2$, and the
    identity loss is
    $\mathcal{L}_{identity} = \lambda_{identity1}\big(\|I_{cc} - I_c\|_2 + \|I_{ss} - I_s\|_2\big) + \ldots$
    (the remainder is cut off on the slide; cf. the identity loss of the SANet paper on
    slide 4).
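
    The style loss above (matching channel-wise means and standard deviations of VGG
    features across layers) can be sketched as follows; the feature lists stand in for
    the outputs of selected VGG19 layers, which is an assumption of this illustration.

```python
import torch

def mean_std(feat, eps=1e-5):
    """Channel-wise mean and std over spatial positions, for (B, C, H, W) features."""
    flat = feat.flatten(2)
    return flat.mean(-1), (flat.var(-1) + eps).sqrt()

def style_loss(feats_sc, feats_s):
    """L_s = sum_i ||mu(phi_i(I_sc)) - mu(phi_i(I_s))||_2 + ||sigma(.) - sigma(.)||_2."""
    loss = 0.0
    for f_sc, f_s in zip(feats_sc, feats_s):
        mu_sc, sigma_sc = mean_std(f_sc)
        mu_s, sigma_s = mean_std(f_s)
        loss = loss + torch.norm(mu_sc - mu_s, p=2) + torch.norm(sigma_sc - sigma_s, p=2)
    return loss

feats_sc = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
feats_s = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)]
print(style_loss(feats_sc, feats_s).item())
```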

  11.  AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer

    • Summary
    Existing works overlook shallow features and do not consider local feature
    statistics, producing unnatural outputs with distortions. This work calculates
    spatial attention scores from both shallow and deep features of the content and
    style images, followed by normalization. A novel loss function is proposed to
    improve quality.

    • Related Works
    Huang (2017) applied the mean and standard deviation of the style features to the
    content features. Jing (2020) used dynamic instance normalization, in which the
    weights of intermediate CNN blocks are generated by another network that takes the
    style image as input. Chen (2016) relied on the similarity between image and style
    patches for style transfer. Park (2019) proposed the Style-Attentional Network to
    match content and style features.

    • Proposed Methodology
    Given a style image $I_s$ and a content image $I_c$, the output is the stylized
    image $I_{cs}$. The encoder uses VGG19 to obtain multi-scale feature maps, and the
    decoder is symmetric to VGG19. To utilize shallow levels, features from ReLU3_1,
    ReLU4_1 and ReLU5_1 are used. The features extracted up to the current layer are
    resized and concatenated as $F^{1:x}_s$; the same strategy is applied to the content
    image, and the results are fed to the AdaAttN module as
    $F^x_{cs} = \mathrm{AdaAttN}(F^x_c, F^x_s, F^{1:x}_c, F^{1:x}_s)$ (sketched in code
    below). The resulting features are fed to the decoder as
    $I_{cs} = \mathrm{Dec}(F^3_{cs}, F^4_{cs}, F^5_{cs})$. AdaAttN works in three stages:
    calculating the attention map, calculating the attention-weighted mean and standard
    deviation of the style features, and normalizing the content features. For the
    attention map, $F^{1:x}_c$, $F^{1:x}_s$ and $F^x_s$ are normalized, passed through
    $1 \times 1$ convolutions and treated as query, key and value, so that the attention
    map is $A = \mathrm{softmax}(Q^T \otimes K)$ and the attention-weighted features are
    $M = V \otimes A^T$. The standard deviation is calculated as
    $S = \big((V \odot V) \otimes A^T - M \odot M\big)^{0.5}$ and the output as
    $F^x_{cs} = S \odot \mathrm{Norm}(F^x_c) + M$. The network is optimized using
    weighted global style and local feature losses, $L = \lambda_g L_{gs} + \lambda_l L_{lf}$,
    where $L_{gs} = \sum_{x=2}^{5} \big(\|\mu(E_x(I_{cs})) - \ldots$ (the remainder is
    cut off on the slide).
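
    A sketch of the AdaAttN computation above: an attention map from normalized
    multi-layer content/style features, attention-weighted mean and standard deviation
    of the style values, then per-point normalization of the content feature. The 1×1
    projections, channel sizes and instance-norm choice are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaAttN(nn.Module):
    """Sketch of the AdaAttN step: A = softmax(Q K^T), M = A V (weighted mean),
    S = sqrt(A (V*V) - M*M) (weighted std), output = S * Norm(F_c) + M."""
    def __init__(self, channels, key_channels):
        super().__init__()
        self.q = nn.Conv2d(key_channels, key_channels, 1)  # from normalized F_c^{1:x}
        self.k = nn.Conv2d(key_channels, key_channels, 1)  # from normalized F_s^{1:x}
        self.v = nn.Conv2d(channels, channels, 1)          # from F_s^x
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, Fc, Fs, Fc_1x, Fs_1x):
        b, c, h, w = Fc.shape
        q = self.q(F.instance_norm(Fc_1x)).flatten(2).transpose(1, 2)  # (B, Nc, Ck)
        k = self.k(F.instance_norm(Fs_1x)).flatten(2)                  # (B, Ck, Ns)
        v = self.v(Fs).flatten(2).transpose(1, 2)                      # (B, Ns, C)
        A = torch.softmax(torch.bmm(q, k), dim=-1)                     # attention map
        M = torch.bmm(A, v)                                            # weighted mean
        var = (torch.bmm(A, v * v) - M * M).clamp_min(0.0)             # weighted variance
        S = torch.sqrt(var + 1e-6)                                     # weighted std
        M = M.transpose(1, 2).view(b, c, h, w)
        S = S.transpose(1, 2).view(b, c, h, w)
        return S * self.norm(Fc) + M                                   # F_cs^x

# dummy shapes: F^x has 256 channels, the concatenated F^{1:x} has 384
out = AdaAttN(channels=256, key_channels=384)(
    torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32),
    torch.randn(1, 384, 32, 32), torch.randn(1, 384, 32, 32))
print(out.shape)  # (1, 256, 32, 32)
```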

  12.  Company Overview (会社概要)

    Company name: 株式会社 微分 (VIVEN, Inc.)
    Representative: Shintaro Yoshida (吉田 慎太郎)
    Location: JustCo Shinjuku, 18F JR Shinjuku Miraina Tower, 4-1-16 Shinjuku,
    Shinjuku-ku, Tokyo
    Established: October 2020
    Capital: ¥7,000,000 (as of October 2022)
    Employees: 20 (including all employment types)
    Business: development of the "School DX" software for educational institutions;
    web application development; R&D in image recognition and natural language
    processing
