
Virtual-try-on Survey by VIVEN Inc

VIVEN, Inc.

March 08, 2023

Transcript

  1. Virtual-try-on Survey
    Tapas Dutta, Deep Learning Engineer


  2. C-VTON: Context-Driven Image-Based Virtual Try-On Network
    ◼ Summary
    The work attempts to preserve garment details and to transfer garments
    under occlusion and difficult poses. This is achieved using a geometric
    matching procedure complemented by a powerful image generator.
    ◼ Related Works
    Xintong (2018) proposed VITON, a two-stage approach of coarse-to-fine image
    generation followed by a Thin-Plate Spline (TPS) transformation that aligns
    the clothing to the body. Wang (2018) used a Geometric Matching Module (GMM)
    to learn the TPS transformation end to end. Rahman (2020) improved the human
    mask sent to the GMM. Yu (2019) improved the human representation fed to the
    GMM. Choi (2021), Dong (2019) and Yang (2020) used a secondary network to
    generate a clothing segmentation that is used as additional input. Ge (2021)
    and Issenhuth (2020) employed a teacher-student distillation module to remove
    the need for error-prone intermediate steps.
    ◼ Proposed Methodology
    The algorithm consists of two modules, a Body Part Geometric Matcher (BPGM)
    and a Context-Aware Generator (CAG). The BPGM estimates the parameters of the
    TPS transformation. It has two encoders that produce feature maps for the
    clothing (φ_1) and the body segmentation (φ_2). Because computing the loss
    requires the desired target appearance, the target clothing C used to optimize
    the model is the garment that matches the one worn in the input image. The
    feature maps are normalized along the channel dimension, spatially flattened
    (w × h × c → wh × c) and combined into a correlation matrix C = φ_1^T φ_2.
    This correlation matrix is fed to a regressor that predicts the TPS parameters
    θ. The module is optimized with three losses: a target shape loss that makes
    the warped clothing match the shape implied by the pose,
    L_shp = || T_θ(M_t) − M_c ||_1,
    where T_θ is the TPS transformation parameterized by θ, M_t is the binary mask
    of the target clothing and M_c is the binary mask of the clothing in the input
    image; an appearance loss that preserves garment details,
    L_app = || C_w ⊙ M_b − I ⊙ M_b ||_1,
    for input image I, warped garment C_w and binary mask of the target area M_b;
    and a perceptual loss that preserves semantic information in the target area,
    L_vgg = Σ_i^n λ_i || φ_i(C_w ⊙ M_b) − φ_i(I ⊙ M_b) ||_1,
    where φ_i is the feature map produced before the i-th max-pooling layer of a
    VGG19 model. The entire module is trained on a weighted average of these three
    losses.
    The body segmentation, the input image with the clothing area masked out, and
    the target and warped clothing images are concatenated channel-wise and fed to
    the CAG as the Image Context (IC). The CAG architecture consists of ResNet
    blocks, upsampling layers and context-aware normalization (CAN), computed as
    X_CAN = X_BN ⊙ λ + β for learnable parameters λ and β. Each residual block
    receives the output of the previous block together with the IC. The CAG
    outputs generated with the target garment and with the garment from the input
    image are denoted I_C and I′_C, respectively, and a perceptual loss is computed
    between the CAG output I′_C and the input image.
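    A minimal PyTorch sketch of the BPGM correlation step described above follows;
    the feature-map size, the regressor head and the number of TPS parameters are
    illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationTPSRegressor(nn.Module):
    def __init__(self, feat_hw=(16, 12), n_tps_params=2 * 25):
        super().__init__()
        h, w = feat_hw
        # Regressor maps the (wh x wh) correlation matrix to the TPS parameters theta.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(h * w * h * w, 512), nn.ReLU(inplace=True),
            nn.Linear(512, n_tps_params),
        )

    def forward(self, phi1, phi2):
        # phi1: clothing features, phi2: body-segmentation features, both (B, C, H, W).
        b, c, h, w = phi1.shape
        # Normalize along the channel dimension, then flatten spatially (w*h*c -> wh x c).
        f1 = F.normalize(phi1, dim=1).view(b, c, h * w)
        f2 = F.normalize(phi2, dim=1).view(b, c, h * w)
        corr = torch.bmm(f1.transpose(1, 2), f2)  # (B, HW, HW), i.e. phi1^T phi2
        theta = self.regressor(corr)              # predicted TPS parameters
        return theta
```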


  3. For input I and CAG output I_C, the discriminator D_seg outputs d+1 channels:
    d channels for the body parts and one channel that encodes whether a pixel
    comes from real or generated data. D_seg is therefore optimized by minimizing
    a segmentation loss over the first d feature maps for real images, while the
    last feature map is optimized as a real/fake classification task. Training
    follows a generative adversarial network (GAN) approach. A matching
    discriminator loss is used to push the feature representation of the output
    closer to the target garment than to the input. Based on the segmentation map,
    patches are extracted from the neck and the upper and lower arms using
    convolutions, and a discriminator is used to differentiate between real and
    fake samples.
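    A minimal sketch of the (d+1)-channel discriminator objective described above;
    the channel split and the unweighted cross-entropy terms are assumptions for
    illustration.

```python
import torch
import torch.nn.functional as F

def d_seg_loss(d_out_real, d_out_fake, seg_labels, d):
    # d_out_*: discriminator outputs of shape (B, d+1, H, W).
    # seg_labels: ground-truth body-part indices of shape (B, H, W), values in [0, d).
    # First d channels: body-part segmentation, supervised on real images only.
    seg_loss = F.cross_entropy(d_out_real[:, :d], seg_labels)
    # Last channel: real vs. generated classification on both real and fake images.
    real_logit, fake_logit = d_out_real[:, d], d_out_fake[:, d]
    adv_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
                + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    return seg_loss + adv_loss
```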
    ◼ Results
    Comparisons with state-of-the-art architectures are conducted on the VITON
    and MPV datasets.
    ◼ Next must-read paper: “Parser-Free Virtual Try-on via Distilling
    Appearance Flows ”


  4. Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing
    ◼ Summary
    The work proposes a novel recurrent generation pipeline that puts on garments
    sequentially. The shape and texture of each garment are encoded separately,
    enabling them to be edited independently. Details are preserved by jointly
    training on pose transfer and inpainting.
    ◼ Related Works
    Raj (2018) proposed SwapNet, which transfers a garment from one person to
    another by generating a segmentation mask of the desired clothing in the
    desired pose. Neuberger (2020) first generates a segmentation mask for all
    garments and then injects the garment encodings into the associated regions.
    Men (2020) encodes the garment as a 1D vector that is fed to StyleGAN;
    conditioning on the 2D pose allows both pose transfer and try-on.
    Esser (2018), Ma (2017), Men (2020), Tang (2020), Ulyanov (2017) and Zhu (2019)
    perform pose transfer using 2D keypoints, but their limited ability to capture
    garment details leads to blurry results.
    ◼ Proposed Methodology
    A person is represented as a tuple of pose, body and garments. The pose is
    represented as 18 keypoint heatmaps obtained with OpenPose. Given a source
    garment g_k, the masked garment segment s_{g_k} and its pose are estimated;
    since this pose differs from the desired pose P, a flow field f_{g_k} is
    estimated with the Global Flow Field Estimator (GFLA) to align the garment
    segment s_{g_k} with P. The garment segment s_{g_k} is fed to a texture
    encoder E_tex (a 3-layer VGG encoder, as in ADGAN), and the output is warped
    by the flow field f_{g_k} to give the texture feature map T_{g_k}. A soft
    shape mask M_{g_k} is computed from T_{g_k} with three convolution layers
    E_seg. M_{g_k} and T_{g_k} are stacked and passed through two convolution
    layers E_map to obtain T′_{g_k}. A human segmenter provides the background
    S_bg and the skin mask S_skin, which are encoded with E_tex and E_seg to
    obtain (T_bg, M_bg) and (T_skin, M_skin). With M_fg denoting the mask of the
    whole foreground, the body representation is computed as
    T′_body = M_fg ⊙ E_map(M_fg ⊗ b, M_fg) + (1 − M_fg) ⊙ T′_bg.
    To generate the output, the pose P is encoded with E_pose to give Z_pose,
    which together with T′_body is passed through G_body (the two style blocks of
    ADGAN) to obtain Z_body. Garments are then generated recurrently: Z_body is
    used as Z_0, and G_gar produces the next state Z_k from Z_{k−1}, T′_{g_k} and
    M_{g_k} as
    Z_k = f(Z_{k−1}, T′_{g_k} ⊙ M_{g_k}) + Z_{k−1} ⊙ (1 − M_{g_k}),
    where f has an architecture like G_body. The output image is produced as
    G_dec(Z_K), with G_dec the decoder of ADGAN.
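    A minimal sketch of the recurrent garment-encoding update above; the
    convolutional body of f is an assumed stand-in for the style-block generator.

```python
import torch
import torch.nn as nn

class GarmentStep(nn.Module):
    """One recurrent step Z_k = f(Z_{k-1}, T'_{g_k} ⊙ M_{g_k}) + Z_{k-1} ⊙ (1 − M_{g_k})."""
    def __init__(self, channels=256):
        super().__init__()
        # Stand-in for f, whose real architecture mirrors the style blocks of G_body.
        self.f = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, z_prev, t_gk, m_gk):
        # z_prev: Z_{k-1}; t_gk: mapped texture T'_{g_k}; m_gk: soft shape mask M_{g_k}.
        z_new = self.f(torch.cat([z_prev, t_gk * m_gk], dim=1))
        # Outside the garment region the previous state is kept unchanged.
        return z_new + z_prev * (1.0 - m_gk)

# Garments are applied in order: Z_0 = Z_body, then one GarmentStep per garment g_k.
```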
    ◼ Results
    DeepFashion dataset is used for evaluation of the model’s performance.
    ◼ Next must-read paper: “Controllable person image synthesis with
    attribute-decomposed gan.”


  5. Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content
    ◼ Summary
    The work proposes a novel algorithm that overcomes occlusion and human-pose
    constraints to generate photo-realistic images. It first predicts the layout
    and then determines whether content needs to be generated or preserved.
    ◼ Related Works
    Dong (2019) proposed a multi-pose guided image-based virtual try-on network.
    Han (2018) used the Thin-Plate Spline (TPS) method for warping the target
    garment to the input pose. Wang (2018) used a neural network to learn the TPS
    warping, achieving more accurate results. To preserve posture and the details
    of the previous garment, Yu (2019) makes use of high-level features extracted
    from the input.
    ◼ Proposed Methodology
    The proposed Adaptive Content Generating and Preserving Network has three
    modules: a Semantic Generation Module (SGM) that generates masks of the body
    parts and of the warped clothes, a Clothes Warping Module (CWM) that warps the
    target clothing according to the clothing mask using TPS, and a Content Fusion
    Module (CFM) that combines the information from the previous two modules to
    determine whether each part needs to be generated or preserved. The arm and
    torso segmentation masks are fused and, together with the target clothing
    image T_c and the pose map M_p, used as input to the SGM. The SGM outputs the
    mask M_W^S of the body parts (head, arms, bottom clothes). Using M_W^S, M_p
    and T_c, the estimated clothing region M_C^S is predicted. The module is
    trained in an adversarial manner along with a pixel-wise cross-entropy loss.
    A Spatial Transformation Network (STN) is used to learn the mapping between
    T_c and M_C^S to obtain the warped clothing image T_C^W, which is constrained
    by a second-order difference loss
    L_3 = Σ_{p∈P} λ_r (| ||p p_0||_2 − ||p p_1||_2 | + | ||p p_2||_2 − ||p p_3||_2 |)
          + λ_s (|S(p, p_0) − S(p, p_1)| + |S(p, p_2) − S(p, p_3)|),
    where λ_r and λ_s are trade-off parameters, p_0 to p_3 are the top, bottom,
    left and right control points around p, and S(p, p_i) is the slope between p
    and p_i. The module is optimized using L_w = L_3 + L_4, where
    L_4 = || T_C^W − I_C ||_1. A refinement network adds further details,
    composing T_C^R = (1 − α) ⊙ T_C^W + α ⊙ T_C^R for a learnable parameter α and
    the refinement-module output T_C^R. The composited body mask is calculated as
    M_W^C = (M_a^G + M_w) ⊙ (1 − M_C^S), with generated body mask
    M_a^G = M_W^S ⊙ M_c, synthesized clothing mask M_C^S and original body-part
    mask M_w; non-targeted details are preserved as I_w = I_w′ ⊙ (1 − M_C^S),
    where I_w′ = I − M_W^C. Finally, I_w, T_C^R, M_W^C and M_C^S are concatenated
    and used as input to the inpainting-based fusion GAN G_3.
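    A minimal sketch of the second-order difference constraint L_3 on the TPS
    control-point grid described above; the grid layout and the slope epsilon are
    assumptions for illustration.

```python
import torch

def second_order_constraint(points, lambda_r=1.0, lambda_s=1.0, eps=1e-6):
    # points: TPS control-point grid of shape (B, H, W, 2) holding (x, y) coordinates.
    p_top, p_bottom = points[:, :-2, 1:-1], points[:, 2:, 1:-1]   # p0, p1
    p_left, p_right = points[:, 1:-1, :-2], points[:, 1:-1, 2:]   # p2, p3
    center = points[:, 1:-1, 1:-1]                                # p

    def dist(a, b):                    # ||p p_i||_2
        return (a - b).pow(2).sum(dim=-1).sqrt()

    def slope(a, b):                   # S(p, p_i), guarded against vertical neighbours
        d = a - b
        return d[..., 1] / (d[..., 0].abs() + eps)

    l_r = (dist(center, p_top) - dist(center, p_bottom)).abs() \
        + (dist(center, p_left) - dist(center, p_right)).abs()
    l_s = (slope(center, p_top) - slope(center, p_bottom)).abs() \
        + (slope(center, p_left) - slope(center, p_right)).abs()
    return (lambda_r * l_r + lambda_s * l_s).mean()
```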
    ◼ Results
    Comparisons with state-of-the-art architectures are conducted on the VITON
    dataset.
    ◼ Next must-read paper: “An image-based virtual try-on network with
    body and clothing feature preservation”


  6. GarmentGAN: Photo-realistic Adversarial Fashion Transfer
    ◼ Summary
    The work divides virtual try-on into two simpler tasks: first separate the
    person's body from their clothing, then generate new images of the wearer
    dressed in arbitrary garments.
    ◼ Related Works
    Yan (2017) synthesized human motion from a single human image and a human
    skeleton sequence. Ma (2018) generated an image of a human in a desired pose
    from human and pose information. Han (2018) transferred a garment to a human
    using a context matching module. Wang (2018) used a composition network to
    integrate the garment onto the human pose. Jetchev (2018) used Cycle-GAN for
    virtual try-on without the need for pose or shape information.
    ◼ Proposed Methodology
    The proposed algorithm contains two subnetworks: a shape transfer network that
    generates a semantic map of the person and garment, and an appearance transfer
    network that preserves fine details. The shape transfer network requires
    semantic information, which is obtained using a semantic parser (Gong, 2018)
    trained on the LIP dataset. The segmentation is transformed into 10-channel
    binary maps, followed by masking the garment regions in those maps. The pose
    estimation (17 keypoints) is obtained using an off-the-shelf pose estimator.
    A binary mask representing the arms, torso and top clothes is used as the body
    representation. The keypoints and the binary mask are concatenated, fed to an
    encoder-decoder architecture and trained in an adversarial manner with a
    PatchGAN (Isola, 2017) discriminator. For downsampling, a convolution layer
    with stride 2, 3×3 kernels, LeakyReLU and layer normalization is applied five
    times. Four residual blocks are used as the bottleneck, and residual CNN
    layers with upsampling are used as the decoder. To make the architecture focus
    on the regions that need to be replaced, regions outside the masked region are
    replaced with the input.
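    A minimal sketch of the downsampling path described above (five stride-2, 3×3
    convolutions with layer normalization and LeakyReLU, then a residual
    bottleneck); the channel widths and the input channel count are illustrative
    assumptions.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # Stride-2, 3x3 convolution + layer normalization + LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.GroupNorm(1, out_ch),   # GroupNorm with one group == layer norm over (C, H, W)
        nn.LeakyReLU(0.2, inplace=True),
    )

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ShapeEncoder(nn.Module):
    def __init__(self, in_ch=10 + 17):  # assumed: 10-channel masked segmentation + 17 keypoint maps
        super().__init__()
        chs = [in_ch, 64, 128, 256, 512, 512]   # channel widths are illustrative
        self.down = nn.Sequential(*[down_block(chs[i], chs[i + 1]) for i in range(5)])
        self.bottleneck = nn.Sequential(*[ResBlock(512) for _ in range(4)])

    def forward(self, x):
        return self.bottleneck(self.down(x))
```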
    The generator loss is L_G = λ_1 L_parsing + λ_2 L_per-pixel − E[D(I′_seg)],
    where L_parsing is the parsing loss and
    L_per-pixel = || I_seg − I′_seg ||_1 / N, with N the number of pixels. The
    discriminator is trained with the hinge loss
    L_D = E[max(0, 1 − D(I_seg))] + E[max(0, 1 + D(I′_seg))] + λ_3 L_GP,
    where L_GP = E[(|| ∇_x D(x) ||_2 − 1)^2] is a gradient penalty. The appearance
    network receives the generated segmentation maps, the target clothing and the
    body-shape information as inputs, which are fed to an encoder-decoder network
    trained adversarially with a multi-scale SN-PatchGAN discriminator. Features
    extracted from the person representation and the target clothing are used to
    estimate the parameters of a Thin-Plate Spline transformation that warps the
    garment into the desired pose. The features of the masked person and of the
    warped clothing are concatenated and fed to SPADE normalization layers. The
    generator loss is
    L_G = α_1 L_TPS + α_2 L_per-pixel + α_3 L_percept + α_4 L_feat − E[D(I′_person)].


  7. Here I′_person is the generator's output, and L_percept and L_feat are the
    perceptual loss (Johnson, 2016) and feature-matching loss (Wang, 2018),
    respectively. L_TPS = E[|| I_warped − I_worn ||], where I_warped and I_worn
    denote the prediction obtained with the TPS warp and the reference image of
    the person wearing the garment, respectively. The discriminator loss is
    L_D = E[max(0, 1 − D(I_person))] + E[max(0, 1 + D(I′_person))] + β L_GP.
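    A minimal sketch of the hinge adversarial terms and gradient penalty used by
    both networks above; function signatures are assumptions for illustration.

```python
import torch

def d_hinge_loss(d_real, d_fake):
    # L_D = E[max(0, 1 - D(real))] + E[max(0, 1 + D(fake))]
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def g_adversarial_term(d_fake):
    # Adversarial part of the generator losses above: -E[D(fake)]
    return -d_fake.mean()

def gradient_penalty(discriminator, x):
    # L_GP = E[(||grad_x D(x)||_2 - 1)^2]
    x = x.detach().requires_grad_(True)
    out = discriminator(x)
    grads, = torch.autograd.grad(out.sum(), x, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```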
    ◼ Results
    Data collected by Han (2018) is used for evaluation, with the Inception
    Score (Salimans, 2016) and the Fréchet Inception Distance (Heusel, 2017) as
    metrics.
    ◼ Next must-read paper: “Toward Characteristic-Preserving Image-based
    Virtual Try-On Network”


  8. Image Based Virtual Try-on Network from Unpaired Data
    ◼ Summary
    The work proposes Outfit-VITON, with an inexpensive training pipeline and the
    ability to synthesize outputs from multiple garments.
    ◼ Related Works
    Han (2018) used shape information to warp the garment to fit the pose, using a
    compositional stage and geometric warping. Wang (2018) used a convolutional
    geometric matcher for geometric warping. Issenhuth (2019) trained the network
    in an adversarial manner to preserve details. Wu (2018) used a GAN to warp the
    garment onto the target person. Sangwoo (2018) uses segmentation maps to
    overcome over-generated garments. Yildirim (2019) can generate images of a
    person wearing multiple garments. Raj (2018) can swap the entire outfit
    between two query images using a GAN.
    ◼ Proposed Methodology
    The person's image x_0 together with multiple garment images (x_1, x_2, …, x_M)
    is used as input. A segmentation network and DensePose are used to obtain the
    body parsing and pose information b for each of the M images, of size
    H × W × D_c and H × W × D_b respectively, where D_c and D_b are the numbers of
    classes. A selected garment x_m is passed through a shape autoencoder E_shape
    and a pooling layer to obtain a feature map e^s_{m,c} of dimension 8 × 4 × D_s.
    This is done for each mask of the person's image to obtain a feature map e′^s
    of size 8 × 4 × (D_s · D_c). When the user wants to use garment c from
    reference image m, the corresponding feature map in e′^s is replaced by
    e^s_{m,c}, followed by upsampling to H × W × (D_s · D_c), denoted e^s. This,
    together with the pose information, is fed to the shape generator network to
    obtain the transformed segmentation map s_y = G_shape(b, e^s). The appearance
    generation module takes a reference image together with the segmentation map
    of the desired garment within that image; this is fed to an appearance
    autoencoder and, after region-wise pooling according to the mask, yields
    e^t_{m,c} ∈ R^{1×D_t}. Similarly, for the query image a feature map of
    dimension D_c × D_t is obtained; its c-th row is replaced by e^t_{m,c}, and
    after performing the replacement for each selected garment the appearance
    representation e^t_m is obtained, which after region-wise broadcasting gives
    the feature map e^t. The appearance generator receives s_y and e^t to generate
    the virtual try-on output. Online optimization fine-tunes the appearance
    generator using a reference loss and a GAN loss.
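    A minimal sketch of the region-wise pooling and c-th feature replacement
    described above; the pooling operation (masked average) is an assumption for
    illustration.

```python
import torch

def regionwise_pool(features, seg_mask, num_classes):
    # features: (B, D_t, H, W) appearance features; seg_mask: (B, H, W) garment-class indices.
    B, Dt, H, W = features.shape
    pooled = features.new_zeros(B, num_classes, Dt)
    for c in range(num_classes):
        region = (seg_mask == c).unsqueeze(1).float()      # (B, 1, H, W)
        area = region.sum(dim=(2, 3)).clamp(min=1.0)       # avoid division by zero
        pooled[:, c] = (features * region).sum(dim=(2, 3)) / area
    return pooled                                          # (B, D_c, D_t)

def swap_garment_feature(query_pooled, ref_pooled, c):
    # Replace the c-th garment's appearance vector in the query with the reference's.
    out = query_pooled.clone()
    out[:, c] = ref_pooled[:, c]
    return out
```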
    ◼ Results
    Data scraped from Amazon, covering both male and female models and various
    garments, is used to validate the performance of the model.
    ◼ Next must-read paper: “Toward characteristic-preserving image-based
    virtual try-on network”


  9. Toward Characteristic-Preserving Image-based Virtual Try-On Network
    ◼ Summary
    The proposed algorithm preserves fine details in the garment. This is achieved
    with a geometric matching module that aligns the garment to the pose, followed
    by a Try-On Module that seamlessly integrates the warped garment onto the
    person.
    ◼ Related Works
    Jetchev (2017) transferred a person's clothes using the target clothes as a
    condition. Han (2018) generated the image from a garment image and a person
    representation. Wang (2018) refined the garment details by adding the warped
    product image using a Geometric Matching Module (GMM). Guler (2018) used a
    pose transfer network based on a dense-pose condition.
    ◼ Proposed Methodology
    The person is represented using an 18-channel heatmap with each keypoint drawn
    as an 11 × 11 white rectangle, a 1-channel binary mask of the different parts
    of the human body, and an RGB image of the reserved parts of the person. A
    geometric matching module is used to align the target clothing to the person
    representation. During training the target garment (c) is matched to the
    garment worn by the person (c_t) using four parts: two networks for feature
    extraction from the person representation (p) and the garment (c), a
    correlation layer that combines them into a single tensor, a regression
    network that predicts the parameters of a Thin-Plate Spline (TPS)
    transformation, and the TPS transformation itself, which produces the warped
    garment (c′). The module is optimized as
    L_GMM = || c′ − c_t ||_1 = || TPS(c) − c_t ||_1.
    The try-on module, comprising a U-Net, is fed the concatenation of the person
    representation (p) and the warped garment c′ to predict a composition mask M
    and a rendered person I_r. The result is then
    I_0 = M ⊙ c′ + (1 − M) ⊙ I_r.
    The module is trained using
    L_TOM = λ_L1 || I_0 − I_t ||_1 + λ_vgg Σ_{i=1}^{5} || φ_i(I_0) − φ_i(I_t) ||_1
            + λ_mask || 1 − M ||_1,
    where I_t is the image of the person wearing the garment and φ_i is the output
    of the i-th convolution layer of a pretrained VGG19 network.
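    A minimal sketch of the Try-On Module composition and loss L_TOM described
    above; the VGG feature extractor and the loss weights are illustrative
    assumptions.

```python
import torch
import torch.nn.functional as F

def compose(mask, warped_cloth, rendered_person):
    # I_0 = M ⊙ c' + (1 − M) ⊙ I_r, with M in [0, 1].
    return mask * warped_cloth + (1.0 - mask) * rendered_person

def tom_loss(i0, it, mask, vgg_feats, lam_l1=1.0, lam_vgg=1.0, lam_mask=1.0):
    # vgg_feats(x) is assumed to return a list of feature maps from a frozen VGG19.
    l1 = F.l1_loss(i0, it)
    vgg = sum(F.l1_loss(f0, ft) for f0, ft in zip(vgg_feats(i0), vgg_feats(it)))
    mask_reg = (1.0 - mask).abs().mean()   # encourages relying on the warped garment
    return lam_l1 * l1 + lam_vgg * vgg + lam_mask * mask_reg
```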
    ◼ Results
    Experiments are conducted on the data collected by Han (2017), which contains
    more than 16K cleaned pairs, with 14K used for training and 2K for validation.
    ◼ Next must-read paper: “Viton: An image-based virtual try-on network”


  10. M2E-Try On Net: Fashion from Model to Everyone
    ◼ Summary
    The proposed algorithm transfers garments from an image of a person wearing
    them to a new person using three networks: a position alignment network, a
    texture alignment network and a fitting network.
    ◼ Related Works
    Ma (2017) used a GAN with a pose representation for person generation.
    Zhu (2017) generated fashion images from textual inputs.
    ◼ Proposed Methodology
    To transfer model M's garments to person P, dense poses are extracted as M_D
    and P_D respectively. Barycentric coordinate interpolation is used to warp M
    to M′_W using the UV coordinates of M_D and P_D. M, M_D, P_D and M′_W are
    concatenated and fed to an encoder (3 convolution and residual layers)-decoder
    (2 deconvolution layers and 1 convolution layer) network, trained in a
    self-supervised manner, to produce the pose-aligned model image M′_A. A binary
    mask R of the same size as M′_W is created such that a pixel has value 0 if it
    belongs to the background of M′_W and 1 otherwise. The merged image
    M′ = M′_W ⊙ R + M′_A ⊙ (1 − R) has sharp edges, so an encoder-decoder
    architecture is used to smooth them. A region-of-interest mask R is created by
    training a network, initialized from an LIP_SSL-pretrained model, to predict
    the garment region, with DensePose predicting the upper-torso mask and the
    union of the two taken as ground truth. With M′′_R = M′ ⊙ R and
    P′ = P ⊙ (1 − R), M′′_R and P′ are concatenated and fed to an encoder-decoder
    network for smoothing.
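    A minimal sketch of the merge step M′ = M′_W ⊙ R + M′_A ⊙ (1 − R) described
    above; deriving R from all-zero (background) pixels of the warped image is an
    assumption for illustration.

```python
import torch

def merge_warped_and_aligned(m_warped, m_aligned):
    # m_warped: M'_W from the barycentric UV warp; m_aligned: M'_A from the encoder-decoder.
    # R = 1 where the warped image has content, 0 on the background (all-zero pixels).
    r = (m_warped.abs().sum(dim=1, keepdim=True) > 0).float()
    return m_warped * r + m_aligned * (1.0 - r)
```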
    ◼ Next must-read paper: “Viton: An image-based virtual try-on network ”


  11. SwapNet: Image Based Garment Transfer
    ◼ Summary
    The work proposes a novel weakly supervised algorithm to transfer garments
    across people with arbitrary garments and poses.
    ◼ Related Works
    Lassner (2017) generated people in arbitrary garments conditioned on pose.
    Zhu (2017).
    ◼ Proposed Methodology
    The algorithm consists of a warping module and a texturing module. For A
    representing the garment source and B representing the desired body pose and
    shape, the 18-channel clothing segmentation A_cs of A and the body
    segmentation B_bs of B are fed to a dual-branch conditional GAN. To condition
    the output strongly on the body segmentation and only weakly on the garment
    segmentation, the clothing segmentation is encoded into a narrow
    2 × 2 × 1024 representation before being upsampled to 512 × 512 × 1024 and
    concatenated with the encoded body-segmentation feature maps. The module is
    trained using B_cs and B_bs from the same image; to prevent it from simply
    copying positional information from the clothing input, random affine
    transforms, crops and horizontal flips are applied to it. The module can thus
    be represented as z_cs = f1(A_cs, B_bs), optimized with
    L_warp = L_CE + λ_adv L_adv, where
    L_CE = − Σ_{i,j} Σ_{c=1}^{18} 1[A_cs(i, j) = c] log z_cs(i, j) and
    L_adv = E_{x∼p(A_cs)}[D(x)] + E_{z∼p(f1_enc(A_cs, B_bs))}[1 − D(f1_dec(z))].
    The texturing module receives z_cs together with an ROI pooling of the desired
    garment regions of A; after upsampling, this is concatenated with z_cs and fed
    to a U-Net architecture. The module f2 is trained using B_cs and the embedding
    of the garment in an augmented B, and is optimized with
    L_L1 = || f2(z′_cs, A) − A ||_1,
    L_feat = Σ_l λ_l || φ_l(f2(z′_cs, A)) − φ_l(A) ||_2 and
    L_adv = E_{x∼p(A)}[D(x)] + E_{z∼p(f2_enc)}[1 − D(f2_dec(z))],
    where φ_l denotes layer l of a pretrained VGG19 network. The discriminator is
    trained as
    L_adv_d = E_{x∼p(A)}[D(x)] + E_{z∼p(f2_enc)}[1 − D(f2_dec(z))]
              + λ_gp E_{z∼p(z)}[|| ∇_z D(z) ||_2].
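    A minimal sketch of the dual-branch conditioning of the warping module
    described above; the layer counts, channel widths and the body-branch encoder
    are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpConditioning(nn.Module):
    def __init__(self, body_ch=3):
        super().__init__()
        # Clothing branch: 18-channel clothing segmentation squeezed to a 2x2x1024 code.
        self.cloth_enc = nn.Sequential(
            nn.Conv2d(18, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(2),
            nn.Conv2d(256, 1024, 1),
        )
        # Body branch: stand-in encoder for the body segmentation (channel count assumed).
        self.body_enc = nn.Conv2d(body_ch, 1024, 3, padding=1)

    def forward(self, cloth_seg, body_seg):
        z_cloth = self.cloth_enc(cloth_seg)                          # (B, 1024, 2, 2)
        z_body = self.body_enc(body_seg)                             # (B, 1024, H, W)
        z_cloth_up = F.interpolate(z_cloth, size=z_body.shape[-2:], mode='nearest')
        # The fused tensor conditions the decoder strongly on the body, weakly on the clothing.
        return torch.cat([z_body, z_cloth_up], dim=1)
```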
    ◼ Results
    The test split of VITON is used for comparison with CGAN and PG2.
    ◼ Next must-read paper: “Viton: An image-based virtual try-on network ”


  12. VITON: An Image-based Virtual Try-on Network
    ◼ Summary
    The work proposes virtual try-on without the need for 3D information, using a
    coarse-to-fine strategy. The algorithm first produces an image of the target
    garment overlaid on the person according to the pose, followed by a refinement
    stage.
    ◼ Related Works
    Guan (2012) proposed the DRAPE algorithm to simulate 2D garment designs on 3D
    bodies. Eisert (2009) retextured the garment dynamically for real-time virtual
    try-on. Sekine (2014) adjusted 2D garments to users using body shape and depth
    images. Moll (2017) used 3D information to warp and extract features from
    garments. Yang (2017) extracted 3D information from 2D images, which is then
    re-targeted to other humans.
    ◼ Proposed Methodology
    For a clothed person I and a target garment c, the goal is to synthesize the
    person wearing garment c as I′′. The person is represented using an 18-channel
    pose heatmap, a human segmentation (excluding face and hair) and the RGB
    channels of the face and hair. The person representation p and the garment c
    are concatenated and fed to an encoder-decoder architecture (G_c), which
    outputs the synthesized image (I′) and the segmentation mask (M) of the
    garment in I′. The module is optimized as
    L_Gc = Σ_{i=0}^{5} λ_i || φ_i(I′) − φ_i(I) ||_1 + || M − M_0 ||_1,
    with φ_i the i-th feature map of VGG19 and M_0 the output of a human parser on
    I. The refinement stage uses a Thin-Plate Spline transformation to warp the
    foreground of garment c to the predicted segmentation mask M, giving the
    warped garment c′. The warped garment (c′) and the synthesized image (I′) are
    concatenated and fed to a refinement module (G_r), which outputs a 1-channel
    composition mask α ∈ (0,1)^{m×n}. The output is calculated as
    I′′ = α ⊙ c′ + (1 − α) ⊙ I′ and optimized using
    L_Gr = Σ_{i=3}^{5} λ_i || φ_i(I′′) − φ_i(I) ||_1 + λ_warp || α ||_1
           + λ_TV || ∇α ||_1.
    Here λ_warp and λ_TV denote the weights of the L1 norm and the TV norm, and
    || ∇α ||_1 penalizes the gradient of the composition mask.
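    A minimal sketch of the refinement-stage composition and the total-variation
    term on the composition mask described above; the shapes are assumed for
    illustration.

```python
import torch

def refine_compose(alpha, warped_cloth, coarse_img):
    # I'' = α ⊙ c' + (1 − α) ⊙ I'
    return alpha * warped_cloth + (1.0 - alpha) * coarse_img

def tv_l1(alpha):
    # ||∇α||_1 penalizes the spatial gradient of the 1-channel composition mask (B, 1, H, W).
    dh = (alpha[:, :, 1:, :] - alpha[:, :, :-1, :]).abs().mean()
    dw = (alpha[:, :, :, 1:] - alpha[:, :, :, :-1]).abs().mean()
    return dh + dw
```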
    ◼ Results
    The Zalando dataset is used to evaluate the effectiveness of the proposed
    algorithm and to compare it with other state-of-the-art algorithms.
    ◼ Next must-read paper: “Photographic image synthesis with cascaded
    refinement networks ”


  13. The Conditional Analogy GAN: Swapping Fashion Articles on People Images
    ◼ Summary
    The proposed algorithm can swap garments without requiring paired training
    data or segmentation results.
    ◼ Related Works
    Goodfellow (2014) trained a generator (G) on the data distribution and a
    discriminator (D) to distinguish real from generated data; after optimization
    G can produce images indistinguishable from the training examples.
    Mirza (2014) proposed the conditional GAN, which generates images conditioned
    on additional information.
    ◼ Proposed Methodology
    The proposed algorithm uses images of a person wearing a garment (x) and
    images of the garment alone (y) for training. The model (generator and
    discriminator) is trained adversarially as
    min_G max_D L_cGAN(G, D) + γ_i L_id(G) + γ_c L_cyc(G), where
    L_cGAN(G, D) = E_{x_i, y_i ∼ p_data} Σ_{λ,μ} [log D_{λ,μ}(x_i, y_i)]
                 + E_{x_i, y_i, y_j ∼ p_data} Σ_{λ,μ} [log(1 − D_{λ,μ}(G(x_i, y_i, y_j), y_j))]
                 + E_{x_i, y_{j≠i} ∼ p_data} Σ_{λ,μ} [log(1 − D_{λ,μ}(x_i, y_j))]
    for the current garment y_i and the target garment y_j, with λ, μ indexing the
    discriminator's spatial outputs. A regularization loss L_id(G) is used to
    avoid painting irrelevant regions:
    L_id(G) = E_{x_i, y_i, y_j ∼ p_data} [|| α_i^j ||_1],
    where α_i^j is the α mask produced by the generator and || · ||_1 is the L1
    norm. To enforce consistency a cycle loss L_cyc(G) is used:
    L_cyc(G) = E_{x_i, y_i, y_j ∼ p_data} [|| x_i − G(G(x_i, y_i, y_j), y_j, y_i) ||_1].
    Thus, if x_i^j = G(x_i, y_i, y_j) modifies irrelevant regions, the reverse swap
    G(x_i^j, y_j, y_i) will produce an image that differs from x_i and thereby
    penalizes the model.
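    A minimal sketch of the identity and cycle-consistency terms described above;
    the generator signature G(x, y_current, y_target) → (output image, α mask) is
    an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def id_and_cycle_losses(G, x_i, y_i, y_j):
    # Forward swap: dress person x_i (currently wearing y_i) in the target garment y_j.
    x_ij, alpha = G(x_i, y_i, y_j)
    # L_id: keep the alpha matte sparse so irrelevant regions are not repainted.
    l_id = alpha.abs().mean()
    # L_cyc: swapping back to y_i should reconstruct the original image x_i.
    x_rec, _ = G(x_ij, y_j, y_i)
    l_cyc = F.l1_loss(x_rec, x_i)
    return l_id, l_cyc
```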
    ◼ Results
    The Zalando dataset is used to evaluate the effectiveness of the proposed
    algorithm.
    ◼ Next must-read paper: “A generative model of people in clothing ”
    ◼ Conclusion
    Performance could be further increased if foreground-background segmentation
    were available, and texture descriptors could further improve the performance
    of the conditional GAN.


  14. Company Overview
    Company name: 株式会社 微分 (VIVEN, Inc.)
    Representative: 吉田 慎太郎
    Location: JustCo Shinjuku, JR Shinjuku Miraina Tower 18F, 4-1-16 Shinjuku,
    Shinjuku-ku, Tokyo
    Established: October 2020
    Capital: ¥7,000,000 (as of October 2022)
    Employees: 20 (all employment types included)
    Business: development of the "School DX" software for educational
    institutions; web application development; research and development in image
    recognition and natural language processing

