
Virtual-try-on Survey by VIVEN Inc

VIVEN, Inc.

March 08, 2023

Transcript

  1. Virtual-try-on Survey
    Tapas Dutta, Deep Learning Engineer


  2. C-VTON: Context-Driven Image-Based Virtual Try-On Network
    ◼ Summary
    The work attempts to preserve garment details and to transfer garments
    under occlusion and difficult poses. This is achieved using a geometric
    matching procedure complemented by a powerful image generator.
    ◼ Related Works
    Xintong (2018) proposed VITON, a two-stage approach of coarse-to-fine image
    generation followed by a Thin-Plate Spline (TPS) transformation that aligns
    the clothing to the body. Wang (2018) used a Geometric Matching Module (GMM)
    to learn the TPS transformation end to end. Rahman (2020) improved the human
    mask sent to the GMM. Yu (2019) improved the human representation fed to the
    GMM. Choi (2021), Dong (2019) and Yang (2020) used a secondary network to
    generate a clothing segmentation that is used as additional input. Ge (2021)
    and Issenhuth (2020) employed a teacher-student distillation module to remove
    the need for error-prone intermediate steps.
    ◼ Proposed Methodology
    The algorithm consists of two modules, a Body Part Geometric Matcher (BPGM)
    and a Context-Aware Generator (CAG). The BPGM estimates the parameters of the
    TPS transformation. It has two encoders that produce feature maps for the
    clothing (φ_1) and the body segmentation (φ_2). Because computing the loss
    requires the desired target appearance, the target clothing C used to optimize
    the model is the garment that matches the one worn in the input image. The
    feature maps are normalized along the channel dimension, spatially flattened
    (w × h × c → wh × c) and combined into a correlation matrix C = φ_1^T φ_2.
    This correlation matrix is fed to a regressor that predicts the TPS parameters
    θ. The module is optimized with three losses: a target shape loss that makes
    the warped clothing match the shape implied by the pose,
    L_shp = || T_θ(M_t) − M_c ||_1,
    where T_θ is the TPS transformation parameterized by θ, M_t is the binary mask
    of the target clothing and M_c is the binary mask of the clothing in the input
    image; an appearance loss that preserves garment details,
    L_app = || C_w ⊙ M_b − I ⊙ M_b ||_1,
    for input image I, warped garment C_w and binary mask of the target area M_b;
    and a perceptual loss that preserves semantic information in the target area,
    L_vgg = Σ_i^n λ_i || φ_i(C_w ⊙ M_b) − φ_i(I ⊙ M_b) ||_1,
    where φ_i is the feature map produced before the i-th max-pooling layer of a
    VGG19 model. The entire module is trained on a weighted average of these three
    losses.
    The body segmentation, the input image with the clothing area masked out, and
    the target and warped clothing images are concatenated channel-wise and fed to
    the CAG as the Image Context (IC). The CAG architecture consists of ResNet
    blocks, upsampling layers and context-aware normalization (CAN), computed as
    X_CAN = X_BN ⊙ λ + β for learnable parameters λ and β. Each residual block
    receives the output of the previous block together with the IC. The CAG
    outputs generated with the target garment and with the garment from the input
    image are denoted I_C and I′_C, respectively, and a perceptual loss is computed
    between the CAG output I′_C and the input image.
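    A minimal PyTorch sketch of the BPGM correlation step described above follows;
    the feature-map size, the regressor head and the number of TPS parameters are
    illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationTPSRegressor(nn.Module):
    def __init__(self, feat_hw=(16, 12), n_tps_params=2 * 25):
        super().__init__()
        h, w = feat_hw
        # Regressor maps the (wh x wh) correlation matrix to the TPS parameters theta.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(h * w * h * w, 512), nn.ReLU(inplace=True),
            nn.Linear(512, n_tps_params),
        )

    def forward(self, phi1, phi2):
        # phi1: clothing features, phi2: body-segmentation features, both (B, C, H, W).
        b, c, h, w = phi1.shape
        # Normalize along the channel dimension, then flatten spatially (w*h*c -> wh x c).
        f1 = F.normalize(phi1, dim=1).view(b, c, h * w)
        f2 = F.normalize(phi2, dim=1).view(b, c, h * w)
        corr = torch.bmm(f1.transpose(1, 2), f2)  # (B, HW, HW), i.e. phi1^T phi2
        theta = self.regressor(corr)              # predicted TPS parameters
        return theta
```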


  3. For input I and CAG output I_C, the discriminator D_seg outputs d+1 channels:
    d channels for the body parts and one channel that encodes whether a pixel
    comes from real or generated data. D_seg is therefore optimized by minimizing
    a segmentation loss over the first d feature maps for real images, while the
    last feature map is optimized as a real/fake classification task. Training
    follows a generative adversarial network (GAN) approach. A matching
    discriminator loss is used to push the feature representation of the output
    closer to the target garment than to the input. Based on the segmentation map,
    patches are extracted from the neck and the upper and lower arms using
    convolutions, and a discriminator is used to differentiate between real and
    fake samples.
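    A minimal sketch of the (d+1)-channel discriminator objective described above;
    the channel split and the unweighted cross-entropy terms are assumptions for
    illustration.

```python
import torch
import torch.nn.functional as F

def d_seg_loss(d_out_real, d_out_fake, seg_labels, d):
    # d_out_*: discriminator outputs of shape (B, d+1, H, W).
    # seg_labels: ground-truth body-part indices of shape (B, H, W), values in [0, d).
    # First d channels: body-part segmentation, supervised on real images only.
    seg_loss = F.cross_entropy(d_out_real[:, :d], seg_labels)
    # Last channel: real vs. generated classification on both real and fake images.
    real_logit, fake_logit = d_out_real[:, d], d_out_fake[:, d]
    adv_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
                + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    return seg_loss + adv_loss
```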
    ◼ Results
    Comparisons with state-of-the-art architectures are conducted on the VITON
    and MPV datasets.
    ◼ Next must-read paper: “Parser-Free Virtual Try-on via Distilling
    Appearance Flows ”


  4. Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing
    ◼ Summary
    The work proposes a novel recurrent generation pipeline that puts on garments
    sequentially. The shape and texture of each garment are encoded separately,
    enabling them to be edited independently. Details are preserved by jointly
    training on pose transfer and inpainting.
    ◼ Related Works
    Raj (2018) proposed SwapNet, which transfers a garment from one person to
    another by generating a segmentation mask of the desired clothing in the
    desired pose. Neuberger (2020) first generates a segmentation mask for all
    garments and then injects the garment encodings into the associated regions.
    Men (2020) encodes the garment as a 1D vector that is fed to StyleGAN;
    conditioning on the 2D pose allows both pose transfer and try-on.
    Esser (2018), Ma (2017), Men (2020), Tang (2020), Ulyanov (2017) and Zhu (2019)
    perform pose transfer using 2D keypoints, but their limited ability to capture
    garment details leads to blurry results.
    ◼ Proposed Methodology
    A person is represented as a tuple of pose, body and garments. The pose is
    represented as 18 keypoint heatmaps obtained with OpenPose. Given a source
    garment g_k, the masked garment segment s_{g_k} and its pose are estimated;
    since this pose differs from the desired pose P, a flow field f_{g_k} is
    estimated with the Global Flow Field Estimator (GFLA) to align the garment
    segment s_{g_k} with P. The garment segment s_{g_k} is fed to a texture
    encoder E_tex (a 3-layer VGG encoder, as in ADGAN), and the output is warped
    by the flow field f_{g_k} to give the texture feature map T_{g_k}. A soft
    shape mask M_{g_k} is computed from T_{g_k} with three convolution layers
    E_seg. M_{g_k} and T_{g_k} are stacked and passed through two convolution
    layers E_map to obtain T′_{g_k}. A human segmenter provides the background
    S_bg and the skin mask S_skin, which are encoded with E_tex and E_seg to
    obtain (T_bg, M_bg) and (T_skin, M_skin). With M_fg denoting the mask of the
    whole foreground, the body representation is computed as
    T′_body = M_fg ⊙ E_map(M_fg ⊗ b, M_fg) + (1 − M_fg) ⊙ T′_bg.
    To generate the output, the pose P is encoded with E_pose to give Z_pose,
    which together with T′_body is passed through G_body (the two style blocks of
    ADGAN) to obtain Z_body. Garments are then generated recurrently: Z_body is
    used as Z_0, and G_gar produces the next state Z_k from Z_{k−1}, T′_{g_k} and
    M_{g_k} as
    Z_k = f(Z_{k−1}, T′_{g_k} ⊙ M_{g_k}) + Z_{k−1} ⊙ (1 − M_{g_k}),
    where f has an architecture like G_body. The output image is produced as
    G_dec(Z_K), with G_dec the decoder of ADGAN.
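    A minimal sketch of the recurrent garment-encoding update above; the
    convolutional body of f is an assumed stand-in for the style-block generator.

```python
import torch
import torch.nn as nn

class GarmentStep(nn.Module):
    """One recurrent step Z_k = f(Z_{k-1}, T'_{g_k} ⊙ M_{g_k}) + Z_{k-1} ⊙ (1 − M_{g_k})."""
    def __init__(self, channels=256):
        super().__init__()
        # Stand-in for f, whose real architecture mirrors the style blocks of G_body.
        self.f = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, z_prev, t_gk, m_gk):
        # z_prev: Z_{k-1}; t_gk: mapped texture T'_{g_k}; m_gk: soft shape mask M_{g_k}.
        z_new = self.f(torch.cat([z_prev, t_gk * m_gk], dim=1))
        # Outside the garment region the previous state is kept unchanged.
        return z_new + z_prev * (1.0 - m_gk)

# Garments are applied in order: Z_0 = Z_body, then one GarmentStep per garment g_k.
```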
    ◼ Results
    DeepFashion dataset is used for evaluation of the model’s performance.
    ◼ Next must-read paper: “Controllable person image synthesis with
    attribute-decomposed gan.”


  5. Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content
    ◼ Summary
    The work proposes a novel algorithm that overcomes occlusion and human-pose
    constraints to generate photo-realistic images. It first predicts the layout
    and then determines whether content needs to be generated or preserved.
    ◼ Related Works
    Dong (2019) proposed a multi-pose guided image-based virtual try-on network.
    Han (2018) used the Thin-Plate Spline (TPS) method for warping the target
    garment to the input pose. Wang (2018) used a neural network to learn the TPS
    warping, achieving more accurate results. To preserve posture and the details
    of the previous garment, Yu (2019) makes use of high-level features extracted
    from the input.
    ◼ Proposed Methodology
    The proposed Adaptive Content Generating and Preserving Network has three
    modules: a Semantic Generation Module (SGM) that generates masks of the body
    parts and of the warped clothes, a Clothes Warping Module (CWM) that warps the
    target clothing according to the clothing mask using TPS, and a Content Fusion
    Module (CFM) that combines the information from the previous two modules to
    determine whether each part needs to be generated or preserved. The arm and
    torso segmentation masks are fused and, together with the target clothing
    image T_c and the pose map M_p, used as input to the SGM. The SGM outputs the
    mask M_W^S of the body parts (head, arms, bottom clothes). Using M_W^S, M_p
    and T_c, the estimated clothing region M_C^S is predicted. The module is
    trained in an adversarial manner along with a pixel-wise cross-entropy loss.
    A Spatial Transformation Network (STN) is used to learn the mapping between
    T_c and M_C^S to obtain the warped clothing image T_C^W, which is constrained
    by a second-order difference loss
    L_3 = Σ_{p∈P} λ_r (| ||p p_0||_2 − ||p p_1||_2 | + | ||p p_2||_2 − ||p p_3||_2 |)
          + λ_s (|S(p, p_0) − S(p, p_1)| + |S(p, p_2) − S(p, p_3)|),
    where λ_r and λ_s are trade-off parameters, p_0 to p_3 are the top, bottom,
    left and right control points around p, and S(p, p_i) is the slope between p
    and p_i. The module is optimized using L_w = L_3 + L_4, where
    L_4 = || T_C^W − I_C ||_1. A refinement network adds further details,
    composing T_C^R = (1 − α) ⊙ T_C^W + α ⊙ T_C^R for a learnable parameter α and
    the refinement-module output T_C^R. The composited body mask is calculated as
    M_W^C = (M_a^G + M_w) ⊙ (1 − M_C^S), with generated body mask
    M_a^G = M_W^S ⊙ M_c, synthesized clothing mask M_C^S and original body-part
    mask M_w; non-targeted details are preserved as I_w = I_w′ ⊙ (1 − M_C^S),
    where I_w′ = I − M_W^C. Finally, I_w, T_C^R, M_W^C and M_C^S are concatenated
    and used as input to the inpainting-based fusion GAN G_3.
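    A minimal sketch of the second-order difference constraint L_3 on the TPS
    control-point grid described above; the grid layout and the slope epsilon are
    assumptions for illustration.

```python
import torch

def second_order_constraint(points, lambda_r=1.0, lambda_s=1.0, eps=1e-6):
    # points: TPS control-point grid of shape (B, H, W, 2) holding (x, y) coordinates.
    p_top, p_bottom = points[:, :-2, 1:-1], points[:, 2:, 1:-1]   # p0, p1
    p_left, p_right = points[:, 1:-1, :-2], points[:, 1:-1, 2:]   # p2, p3
    center = points[:, 1:-1, 1:-1]                                # p

    def dist(a, b):                    # ||p p_i||_2
        return (a - b).pow(2).sum(dim=-1).sqrt()

    def slope(a, b):                   # S(p, p_i), guarded against vertical neighbours
        d = a - b
        return d[..., 1] / (d[..., 0].abs() + eps)

    l_r = (dist(center, p_top) - dist(center, p_bottom)).abs() \
        + (dist(center, p_left) - dist(center, p_right)).abs()
    l_s = (slope(center, p_top) - slope(center, p_bottom)).abs() \
        + (slope(center, p_left) - slope(center, p_right)).abs()
    return (lambda_r * l_r + lambda_s * l_s).mean()
```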
    ◼ Results
    Comparisons with state-of-the-art architectures are conducted on the VITON
    dataset.
    ◼ Next must-read paper: “An image-based virtual try-on network with
    body and clothing feature preservation”


  6. GarmentGAN: Photo-realistic Adversarial Fashion Transfer
    ◼ Summary
    The work divides virtual try-on into two simpler tasks: first separate the
    person's body from their clothing, then generate new images of the wearer
    dressed in arbitrary garments.
    ◼ Related Works
    Yan (2017) synthesized human motion from a single human image and a human
    skeleton sequence. Ma (2018) generated an image of a human in a desired pose
    from human and pose information. Han (2018) transferred a garment to a human
    using a context matching module. Wang (2018) used a composition network to
    integrate the garment onto the human pose. Jetchev (2018) used Cycle-GAN for
    virtual try-on without the need for pose or shape information.
    ◼ Proposed Methodology
    The proposed algorithm contains two subnetworks: a shape transfer network that
    generates a semantic map of the person and garment, and an appearance transfer
    network that preserves fine details. The shape transfer network requires
    semantic information, which is obtained using a semantic parser (Gong, 2018)
    trained on the LIP dataset. The segmentation is transformed into 10-channel
    binary maps, followed by masking the garment regions in those maps. The pose
    estimation (17 keypoints) is obtained using an off-the-shelf pose estimator.
    A binary mask representing the arms, torso and top clothes is used as the body
    representation. The keypoints and the binary mask are concatenated, fed to an
    encoder-decoder architecture and trained in an adversarial manner with a
    PatchGAN (Isola, 2017) discriminator. For downsampling, a convolution layer
    with stride 2, 3×3 kernels, LeakyReLU and layer normalization is applied five
    times. Four residual blocks are used as the bottleneck, and residual CNN
    layers with upsampling are used as the decoder. To make the architecture focus
    on the regions that need to be replaced, regions outside the masked region are
    replaced with the input.
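    A minimal sketch of the downsampling path described above (five stride-2, 3×3
    convolutions with layer normalization and LeakyReLU, then a residual
    bottleneck); the channel widths and the input channel count are illustrative
    assumptions.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # Stride-2, 3x3 convolution + layer normalization + LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.GroupNorm(1, out_ch),   # GroupNorm with one group == layer norm over (C, H, W)
        nn.LeakyReLU(0.2, inplace=True),
    )

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ShapeEncoder(nn.Module):
    def __init__(self, in_ch=10 + 17):  # assumed: 10-channel masked segmentation + 17 keypoint maps
        super().__init__()
        chs = [in_ch, 64, 128, 256, 512, 512]   # channel widths are illustrative
        self.down = nn.Sequential(*[down_block(chs[i], chs[i + 1]) for i in range(5)])
        self.bottleneck = nn.Sequential(*[ResBlock(512) for _ in range(4)])

    def forward(self, x):
        return self.bottleneck(self.down(x))
```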
    The generator loss is L_G = λ_1 L_parsing + λ_2 L_per-pixel − E[D(I′_seg)],
    where L_parsing is the parsing loss and
    L_per-pixel = || I_seg − I′_seg ||_1 / N, with N the number of pixels. The
    discriminator is trained with the hinge loss
    L_D = E[max(0, 1 − D(I_seg))] + E[max(0, 1 + D(I′_seg))] + λ_3 L_GP,
    where L_GP = E[(|| ∇_x D(x) ||_2 − 1)^2] is a gradient penalty. The appearance
    network receives the generated segmentation maps, the target clothing and the
    body-shape information as inputs, which are fed to an encoder-decoder network
    trained adversarially with a multi-scale SN-PatchGAN discriminator. Features
    extracted from the person representation and the target clothing are used to
    estimate the parameters of a Thin-Plate Spline transformation that warps the
    garment into the desired pose. The features of the masked person and of the
    warped clothing are concatenated and fed to SPADE normalization layers. The
    generator loss is
    L_G = α_1 L_TPS + α_2 L_per-pixel + α_3 L_percept + α_4 L_feat − E[D(I′_person)].


  7. Here I′_person is the generator's output, and L_percept and L_feat are the
    perceptual loss (Johnson, 2016) and feature-matching loss (Wang, 2018),
    respectively. L_TPS = E[|| I_warped − I_worn ||], where I_warped and I_worn
    denote the prediction obtained with the TPS warp and the reference image of
    the person wearing the garment, respectively. The discriminator loss is
    L_D = E[max(0, 1 − D(I_person))] + E[max(0, 1 + D(I′_person))] + β L_GP.
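    A minimal sketch of the hinge adversarial terms and gradient penalty used by
    both networks above; function signatures are assumptions for illustration.

```python
import torch

def d_hinge_loss(d_real, d_fake):
    # L_D = E[max(0, 1 - D(real))] + E[max(0, 1 + D(fake))]
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def g_adversarial_term(d_fake):
    # Adversarial part of the generator losses above: -E[D(fake)]
    return -d_fake.mean()

def gradient_penalty(discriminator, x):
    # L_GP = E[(||grad_x D(x)||_2 - 1)^2]
    x = x.detach().requires_grad_(True)
    out = discriminator(x)
    grads, = torch.autograd.grad(out.sum(), x, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```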
    ◼ Results
    Data collected by Han (2018) is used for evaluation, with the Inception
    Score (Salimans, 2016) and the Fréchet Inception Distance (Heusel, 2017) as
    metrics.
    ◼ Next must-read paper: “Toward Characteristic-Preserving Image-based
    Virtual Try-On Network”


  8. Image Based Virtual Try-on Network from Unpaired Data
    ◼ Summary
    The work proposes Outfit-VITON, with an inexpensive training pipeline and the
    ability to synthesize outputs from multiple garments.
    ◼ Related Works
    Han (2018) used shape information to warp the garment to fit the pose, using a
    compositional stage and geometric warping. Wang (2018) used a convolutional
    geometric matcher for geometric warping. Issenhuth (2019) trained the network
    in an adversarial manner to preserve details. Wu (2018) used a GAN to warp the
    garment onto the target person. Sangwoo (2018) uses segmentation maps to
    overcome over-generated garments. Yildirim (2019) can generate images of a
    person wearing multiple garments. Raj (2018) can swap the entire outfit
    between two query images using a GAN.
    ◼ Proposed Methodology
    The person's image x_0 together with multiple garment images (x_1, x_2, …, x_M)
    is used as input. A segmentation network and DensePose are used to obtain the
    body parsing and pose information b for each of the M images, of size
    H × W × D_c and H × W × D_b respectively, where D_c and D_b are the numbers of
    classes. A selected garment x_m is passed through a shape autoencoder E_shape
    and a pooling layer to obtain a feature map e^s_{m,c} of dimension 8 × 4 × D_s.
    This is done for each mask of the person's image to obtain a feature map e′^s
    of size 8 × 4 × (D_s · D_c). When the user wants to use garment c from
    reference image m, the corresponding feature map in e′^s is replaced by
    e^s_{m,c}, followed by upsampling to H × W × (D_s · D_c), denoted e^s. This,
    together with the pose information, is fed to the shape generator network to
    obtain the transformed segmentation map s_y = G_shape(b, e^s). The appearance
    generation module takes a reference image together with the segmentation map
    of the desired garment within that image; this is fed to an appearance
    autoencoder and, after region-wise pooling according to the mask, yields
    e^t_{m,c} ∈ R^{1×D_t}. Similarly, for the query image a feature map of
    dimension D_c × D_t is obtained; its c-th row is replaced by e^t_{m,c}, and
    after performing the replacement for each selected garment the appearance
    representation e^t_m is obtained, which after region-wise broadcasting gives
    the feature map e^t. The appearance generator receives s_y and e^t to generate
    the virtual try-on output. Online optimization fine-tunes the appearance
    generator using a reference loss and a GAN loss.
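    A minimal sketch of the region-wise pooling and c-th feature replacement
    described above; the pooling operation (masked average) is an assumption for
    illustration.

```python
import torch

def regionwise_pool(features, seg_mask, num_classes):
    # features: (B, D_t, H, W) appearance features; seg_mask: (B, H, W) garment-class indices.
    B, Dt, H, W = features.shape
    pooled = features.new_zeros(B, num_classes, Dt)
    for c in range(num_classes):
        region = (seg_mask == c).unsqueeze(1).float()      # (B, 1, H, W)
        area = region.sum(dim=(2, 3)).clamp(min=1.0)       # avoid division by zero
        pooled[:, c] = (features * region).sum(dim=(2, 3)) / area
    return pooled                                          # (B, D_c, D_t)

def swap_garment_feature(query_pooled, ref_pooled, c):
    # Replace the c-th garment's appearance vector in the query with the reference's.
    out = query_pooled.clone()
    out[:, c] = ref_pooled[:, c]
    return out
```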
    ◼ Results
    Data scraped from Amazon, covering both male and female models and various
    garments, is used to validate the performance of the model.
    ◼ Next must-read paper: “Toward characteristic-preserving image-based
    virtual try-on network”


  9. Toward Characteristic-Preserving Image-based Virtual Try-On Network
    ◼ Summary
    The proposed algorithm preserves fine details in the garment. This is achieved
    with a geometric matching module that aligns the garment to the pose, followed
    by a Try-On Module that seamlessly integrates the warped garment onto the
    person.
    ◼ Related Works
    Jetchev (2017) transferred a person's clothes using the target clothes as a
    condition. Han (2018) generated the image from a garment image and a person
    representation. Wang (2018) refined the garment details by adding the warped
    product image using a Geometric Matching Module (GMM). Guler (2018) used a
    pose transfer network based on a dense-pose condition.
    ◼ Proposed Methodology
    The person is represented using an 18-channel heatmap with each keypoint drawn
    as an 11 × 11 white rectangle, a 1-channel binary mask of the different parts
    of the human body, and an RGB image of the reserved parts of the person. A
    geometric matching module is used to align the target clothing to the person
    representation. During training the target garment (c) is matched to the
    garment worn by the person (c_t) using four parts: two networks for feature
    extraction from the person representation (p) and the garment (c), a
    correlation layer that combines them into a single tensor, a regression
    network that predicts the parameters of a Thin-Plate Spline (TPS)
    transformation, and the TPS transformation itself, which produces the warped
    garment (c′). The module is optimized as
    L_GMM = || c′ − c_t ||_1 = || TPS(c) − c_t ||_1.
    The try-on module, comprising a U-Net, is fed the concatenation of the person
    representation (p) and the warped garment c′ to predict a composition mask M
    and a rendered person I_r. The result is then
    I_0 = M ⊙ c′ + (1 − M) ⊙ I_r.
    The module is trained using
    L_TOM = λ_L1 || I_0 − I_t ||_1 + λ_vgg Σ_{i=1}^{5} || φ_i(I_0) − φ_i(I_t) ||_1
            + λ_mask || 1 − M ||_1,
    where I_t is the image of the person wearing the garment and φ_i is the output
    of the i-th convolution layer of a pretrained VGG19 network.
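    A minimal sketch of the Try-On Module composition and loss L_TOM described
    above; the VGG feature extractor and the loss weights are illustrative
    assumptions.

```python
import torch
import torch.nn.functional as F

def compose(mask, warped_cloth, rendered_person):
    # I_0 = M ⊙ c' + (1 − M) ⊙ I_r, with M in [0, 1].
    return mask * warped_cloth + (1.0 - mask) * rendered_person

def tom_loss(i0, it, mask, vgg_feats, lam_l1=1.0, lam_vgg=1.0, lam_mask=1.0):
    # vgg_feats(x) is assumed to return a list of feature maps from a frozen VGG19.
    l1 = F.l1_loss(i0, it)
    vgg = sum(F.l1_loss(f0, ft) for f0, ft in zip(vgg_feats(i0), vgg_feats(it)))
    mask_reg = (1.0 - mask).abs().mean()   # encourages relying on the warped garment
    return lam_l1 * l1 + lam_vgg * vgg + lam_mask * mask_reg
```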
    ◼ Results
    Experiments are conducted on the data collected by Han (2017), which contains
    more than 16K cleaned pairs, with 14K used for training and 2K for validation.
    ◼ Next must-read paper: “Viton: An image-based virtual try-on network”


  10. M2E-Try On Net: Fashion from Model to Everyone
    ◼ Summary
    The proposed algorithm transfers garments from an image of a person wearing
    them to a new person using three networks: a position alignment network, a
    texture alignment network and a fitting network.
    ◼ Related Works
    Ma (2017) used a GAN with a pose representation for person generation.
    Zhu (2017) generated fashion images from textual inputs.
    ◼ Proposed Methodology
    To transfer model M's garments to person P, dense poses are extracted as M_D
    and P_D respectively. Barycentric coordinate interpolation is used to warp M
    to M′_W using the UV coordinates of M_D and P_D. M, M_D, P_D and M′_W are
    concatenated and fed to an encoder (3 convolution and residual layers)-decoder
    (2 deconvolution layers and 1 convolution layer) network, trained in a
    self-supervised manner, to produce the pose-aligned model image M′_A. A binary
    mask R of the same size as M′_W is created such that a pixel has value 0 if it
    belongs to the background of M′_W and 1 otherwise. The merged image
    M′ = M′_W ⊙ R + M′_A ⊙ (1 − R) has sharp edges, so an encoder-decoder
    architecture is used to smooth them. A region-of-interest mask R is created by
    training a network, initialized from an LIP_SSL-pretrained model, to predict
    the garment region, with DensePose predicting the upper-torso mask and the
    union of the two taken as ground truth. With M′′_R = M′ ⊙ R and
    P′ = P ⊙ (1 − R), M′′_R and P′ are concatenated and fed to an encoder-decoder
    network for smoothing.
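    A minimal sketch of the merge step M′ = M′_W ⊙ R + M′_A ⊙ (1 − R) described
    above; deriving R from all-zero (background) pixels of the warped image is an
    assumption for illustration.

```python
import torch

def merge_warped_and_aligned(m_warped, m_aligned):
    # m_warped: M'_W from the barycentric UV warp; m_aligned: M'_A from the encoder-decoder.
    # R = 1 where the warped image has content, 0 on the background (all-zero pixels).
    r = (m_warped.abs().sum(dim=1, keepdim=True) > 0).float()
    return m_warped * r + m_aligned * (1.0 - r)
```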
    ◼ Next must-read paper: “Viton: An image-based virtual try-on network ”


  11. SwapNet: Image Based Garment Transfer
    ◼ Summary
    The work proposes a novel weakly supervised algorithm to transfer garments
    across people with arbitrary garments and poses.
    ◼ Related Works
    Lassner (2017) generated people in arbitrary garments conditioned on pose.
    Zhu (2017).
    ◼ Proposed Methodology
    The algorithm consists of a warping module and a texturing module. For A
    representing the garment source and B representing the desired body pose and
    shape, the 18-channel clothing segmentation A_cs of A and the body
    segmentation B_bs of B are fed to a dual-branch conditional GAN. To condition
    the output strongly on the body segmentation and only weakly on the garment
    segmentation, the clothing segmentation is encoded into a narrow
    2 × 2 × 1024 representation before being upsampled to 512 × 512 × 1024 and
    concatenated with the encoded body-segmentation feature maps. The module is
    trained using B_cs and B_bs from the same image; to prevent it from simply
    copying positional information from the clothing input, random affine
    transforms, crops and horizontal flips are applied to it. The module can thus
    be represented as z_cs = f1(A_cs, B_bs), optimized with
    L_warp = L_CE + λ_adv L_adv, where
    L_CE = − Σ_{i,j} Σ_{c=1}^{18} 1[A_cs(i, j) = c] log z_cs(i, j) and
    L_adv = E_{x∼p(A_cs)}[D(x)] + E_{z∼p(f1_enc(A_cs, B_bs))}[1 − D(f1_dec(z))].
    The texturing module receives z_cs together with an ROI pooling of the desired
    garment regions of A; after upsampling, this is concatenated with z_cs and fed
    to a U-Net architecture. The module f2 is trained using B_cs and the embedding
    of the garment in an augmented B, and is optimized with
    L_L1 = || f2(z′_cs, A) − A ||_1,
    L_feat = Σ_l λ_l || φ_l(f2(z′_cs, A)) − φ_l(A) ||_2 and
    L_adv = E_{x∼p(A)}[D(x)] + E_{z∼p(f2_enc)}[1 − D(f2_dec(z))],
    where φ_l denotes layer l of a pretrained VGG19 network. The discriminator is
    trained as
    L_adv_d = E_{x∼p(A)}[D(x)] + E_{z∼p(f2_enc)}[1 − D(f2_dec(z))]
              + λ_gp E_{z∼p(z)}[|| ∇_z D(z) ||_2].
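    A minimal sketch of the dual-branch conditioning of the warping module
    described above; the layer counts, channel widths and the body-branch encoder
    are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpConditioning(nn.Module):
    def __init__(self, body_ch=3):
        super().__init__()
        # Clothing branch: 18-channel clothing segmentation squeezed to a 2x2x1024 code.
        self.cloth_enc = nn.Sequential(
            nn.Conv2d(18, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(2),
            nn.Conv2d(256, 1024, 1),
        )
        # Body branch: stand-in encoder for the body segmentation (channel count assumed).
        self.body_enc = nn.Conv2d(body_ch, 1024, 3, padding=1)

    def forward(self, cloth_seg, body_seg):
        z_cloth = self.cloth_enc(cloth_seg)                          # (B, 1024, 2, 2)
        z_body = self.body_enc(body_seg)                             # (B, 1024, H, W)
        z_cloth_up = F.interpolate(z_cloth, size=z_body.shape[-2:], mode='nearest')
        # The fused tensor conditions the decoder strongly on the body, weakly on the clothing.
        return torch.cat([z_body, z_cloth_up], dim=1)
```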
    ◼ Results
    The test split of VITON is used for comparison with CGAN and PG2.
    ◼ Next must-read paper: “Viton: An image-based virtual try-on network ”


  12. VITON: An Image-based Virtual Try-on Network
    ◼ Summary
    The work proposes virtual try-on without the need for 3D information, using a
    coarse-to-fine strategy. The algorithm first produces an image of the target
    garment overlaid on the person according to the pose, followed by a refinement
    stage.
    ◼ Related Works
    Guan (2012) proposed the DRAPE algorithm to simulate 2D garment designs on 3D
    bodies. Eisert (2009) retextured the garment dynamically for real-time virtual
    try-on. Sekine (2014) adjusted 2D garments to users using body shape and depth
    images. Moll (2017) used 3D information to warp and extract features from
    garments. Yang (2017) extracted 3D information from 2D images, which is then
    re-targeted to other humans.
    ◼ Proposed Methodology
    For a clothed person I and a target garment c, the goal is to synthesize the
    person wearing garment c as I′′. The person is represented using an 18-channel
    pose heatmap, a human segmentation (excluding face and hair) and the RGB
    channels of the face and hair. The person representation p and the garment c
    are concatenated and fed to an encoder-decoder architecture (G_c), which
    outputs the synthesized image (I′) and the segmentation mask (M) of the
    garment in I′. The module is optimized as
    L_Gc = Σ_{i=0}^{5} λ_i || φ_i(I′) − φ_i(I) ||_1 + || M − M_0 ||_1,
    with φ_i the i-th feature map of VGG19 and M_0 the output of a human parser on
    I. The refinement stage uses a Thin-Plate Spline transformation to warp the
    foreground of garment c to the predicted segmentation mask M, giving the
    warped garment c′. The warped garment (c′) and the synthesized image (I′) are
    concatenated and fed to a refinement module (G_r), which outputs a 1-channel
    composition mask α ∈ (0,1)^{m×n}. The output is calculated as
    I′′ = α ⊙ c′ + (1 − α) ⊙ I′ and optimized using
    L_Gr = Σ_{i=3}^{5} λ_i || φ_i(I′′) − φ_i(I) ||_1 + λ_warp || α ||_1
           + λ_TV || ∇α ||_1.
    Here λ_warp and λ_TV denote the weights of the L1 norm and the TV norm, and
    || ∇α ||_1 penalizes the gradient of the composition mask.
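    A minimal sketch of the refinement-stage composition and the total-variation
    term on the composition mask described above; the shapes are assumed for
    illustration.

```python
import torch

def refine_compose(alpha, warped_cloth, coarse_img):
    # I'' = α ⊙ c' + (1 − α) ⊙ I'
    return alpha * warped_cloth + (1.0 - alpha) * coarse_img

def tv_l1(alpha):
    # ||∇α||_1 penalizes the spatial gradient of the 1-channel composition mask (B, 1, H, W).
    dh = (alpha[:, :, 1:, :] - alpha[:, :, :-1, :]).abs().mean()
    dw = (alpha[:, :, :, 1:] - alpha[:, :, :, :-1]).abs().mean()
    return dh + dw
```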
    ◼ Results
    The Zalando dataset is used to evaluate the effectiveness of the proposed
    algorithm and to compare it with other state-of-the-art algorithms.
    ◼ Next must-read paper: “Photographic image synthesis with cascaded
    refinement networks ”


  13. The Conditional Analogy GAN: Swapping Fashion Articles on People Images
    ◼ Summary
    The proposed algorithm can swap garments without requiring paired training
    data or segmentation results.
    ◼ Related Works
    Goodfellow (2014) trained a generator (G) on the data distribution and a
    discriminator (D) to distinguish real from generated data; after optimization
    G can produce images indistinguishable from the training examples.
    Mirza (2014) proposed the conditional GAN, which generates images conditioned
    on additional information.
    ◼ Proposed Methodology
    The proposed algorithm uses images of a person wearing a garment (x) and
    images of the garment alone (y) for training. The model (generator and
    discriminator) is trained adversarially as
    min_G max_D L_cGAN(G, D) + γ_i L_id(G) + γ_c L_cyc(G), where
    L_cGAN(G, D) = E_{x_i, y_i ∼ p_data} Σ_{λ,μ} [log D_{λ,μ}(x_i, y_i)]
                 + E_{x_i, y_i, y_j ∼ p_data} Σ_{λ,μ} [log(1 − D_{λ,μ}(G(x_i, y_i, y_j), y_j))]
                 + E_{x_i, y_{j≠i} ∼ p_data} Σ_{λ,μ} [log(1 − D_{λ,μ}(x_i, y_j))]
    for the current garment y_i and the target garment y_j, with λ, μ indexing the
    discriminator's spatial outputs. A regularization loss L_id(G) is used to
    avoid painting irrelevant regions:
    L_id(G) = E_{x_i, y_i, y_j ∼ p_data} [|| α_i^j ||_1],
    where α_i^j is the α mask produced by the generator and || · ||_1 is the L1
    norm. To enforce consistency a cycle loss L_cyc(G) is used:
    L_cyc(G) = E_{x_i, y_i, y_j ∼ p_data} [|| x_i − G(G(x_i, y_i, y_j), y_j, y_i) ||_1].
    Thus, if x_i^j = G(x_i, y_i, y_j) modifies irrelevant regions, the reverse swap
    G(x_i^j, y_j, y_i) will produce an image that differs from x_i and thereby
    penalizes the model.
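    A minimal sketch of the identity and cycle-consistency terms described above;
    the generator signature G(x, y_current, y_target) → (output image, α mask) is
    an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def id_and_cycle_losses(G, x_i, y_i, y_j):
    # Forward swap: dress person x_i (currently wearing y_i) in the target garment y_j.
    x_ij, alpha = G(x_i, y_i, y_j)
    # L_id: keep the alpha matte sparse so irrelevant regions are not repainted.
    l_id = alpha.abs().mean()
    # L_cyc: swapping back to y_i should reconstruct the original image x_i.
    x_rec, _ = G(x_ij, y_j, y_i)
    l_cyc = F.l1_loss(x_rec, x_i)
    return l_id, l_cyc
```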
    ◼ Results
    The Zalando dataset is used to evaluate the effectiveness of the proposed
    algorithm.
    ◼ Next must-read paper: “A generative model of people in clothing ”
    ◼ Conclusion
    Performance could be further increased if foreground-background segmentation
    were available, and texture descriptors could further improve the performance
    of the conditional GAN.


  14. Company Overview
    Company name: 株式会社 微分 (VIVEN, Inc.)
    Representative: 吉田 慎太郎
    Location: JustCo Shinjuku, JR Shinjuku Miraina Tower 18F, 4-1-16 Shinjuku,
    Shinjuku-ku, Tokyo
    Established: October 2020
    Capital: ¥7,000,000 (as of October 2022)
    Employees: 20 (all employment types included)
    Business: development of the "School DX" software for educational
    institutions; web application development; research and development in image
    recognition and natural language processing

