
Nagoya CV/PRML Study Group (名古屋CV・PRML勉強会): "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks — From GAN to CycleGAN"

Hiroshi Fukui

June 17, 2017

Transcript

  1. Paper covered today: "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", Jun-Yan Zhu*, Taesung Park*, Phillip Isola, Alexei A. Efros, Berkeley AI Research (BAIR) laboratory, UC Berkeley. arXiv:1703.10593v1 [cs.CV], 30 Mar 2017.

     Figure 1 (from the paper): Given any two unordered image collections X and Y, the algorithm learns to automatically "translate" an image from one into the other and vice versa: (left) 1074 Monet paintings and 6753 landscape photos from Flickr; (center) 1177 zebras and 939 horses from ImageNet; (right) 1273 summer and 854 winter Yosemite photos from Flickr. Example application (bottom): using a collection of paintings of a famous artist, learn to render a user's photograph into their style.
  2. How GANs are trained (GANの学習方法) — excerpt from the original GAN paper (Goodfellow et al., 2014), shown as a screenshot:

     "In other words, D and G play the following two-player minimax game with value function V(G, D):

         \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (1)

     In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. [...] In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 - D(G(z))) saturates. Rather than training G to minimize log(1 - D(G(z))) we can train G to maximize log D(G(z))."

     Slide annotations: Generator = counterfeiter (贋作師), Discriminator = appraiser (鑑定師). Figure credit: Yoshitaka Ushiku (牛久祥孝), "Deep Learningによる視覚×言語融合の最前線" (slides on SlideShare, www.slideshare.net/YoshitakaUshiku).
  3.–11. How GANs are trained (GANの学習方法), continued: each of these slides repeats the same excerpt and Eq. (1) while the counterfeiter/appraiser figure is built up step by step — the Generator (counterfeiter) produces fakes (偽物), the Discriminator (appraiser) is shown real samples (本物) and fakes and must label each correctly. Figure credit as on slide 2.
  12. How GANs are trained (GANの学習方法):

      \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

      Annotation: the Generator is trained so that it can produce images that cannot be told apart from real ones.
  13. How GANs are trained (GANの学習方法): same objective as above. Annotation: the Discriminator is trained so that it can tell real images (本物) apart from fakes (偽物).
  14. Training method (学習方法): the objective in Eq. (1) is read as two errors — one measures whether the Discriminator can recognize fakes as fakes (the Max side of the objective, taken by D), the other measures whether the images the Generator produces look as if they came from the training data (the Min side, taken by G).
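     The alternating optimization described in the excerpt — a few D steps, then one G step, with the non-saturating trick of maximizing log D(G(z)) — can be sketched as follows. This is a minimal illustration only; the tiny MLP networks, batch size, and learning rates are assumptions for the sketch, not settings from any of the papers discussed here.

```python
import torch
import torch.nn as nn

# Placeholder networks (not the architectures from any paper): a tiny MLP
# generator mapping 100-d noise to a 784-d "image", and an MLP discriminator
# outputting the probability that its input is real.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

real_data = torch.rand(64, 784)          # stand-in for a batch drawn from p_data(x)

for step in range(3):                    # a few illustrative iterations
    z = torch.randn(64, 100)             # z ~ p_z(z)

    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))]
    fake = G(z).detach()                 # block gradients into G for this step
    loss_D = -(torch.log(D(real_data) + 1e-8).mean()
               + torch.log(1 - D(fake) + 1e-8).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: non-saturating objective, ascend E[log D(G(z))]
    loss_G = -torch.log(D(G(z)) + 1e-8).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```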
  15. Generated image quality (画像の生成結果): sample images from a plain GAN and from DCGAN are compared, with the quoted figures GAN: 6.28% / 5.65% vs. DCGAN (ours): 2.98% / 1.48%.
  16. Conditional image generation: conditional GAN (条件付き画像生成). A model in which both the Generator and the Discriminator receive an extra condition during training; the condition can be class information, a sentence, etc. In the generator, the prior input noise p_z(z) and the condition y are combined in a joint hidden representation; in the discriminator, x and y are presented together as inputs to the discriminative function (an MLP in the original formulation). The objective of the two-player minimax game becomes

      \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]   (2)

      Examples shown: condition = class information (CC-LAPGAN samples for Automobile, Bird, Cat), and condition = text (StackGAN, Zhang et al.: Stage-I sketches the primitive shape and basic colors of the object from the text description, yielding low-resolution images; Stage-II takes the Stage-I result and the text and generates high-resolution images with photo-realistic details).
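     The simplest way to realize D(x | y) and G(z | y) from Eq. (2) is to concatenate an encoding of the condition y to the generator's noise input and to the discriminator's image input. A hedged sketch, with placeholder dimensions and MLP networks that are assumptions of this illustration:

```python
import torch
import torch.nn as nn

num_classes, noise_dim, img_dim = 10, 100, 784

# Illustrative conditional networks: a one-hot class vector y is concatenated
# to G's noise input and to D's image input.
G = nn.Sequential(nn.Linear(noise_dim + num_classes, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + num_classes, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(8, noise_dim)
y = nn.functional.one_hot(torch.randint(0, num_classes, (8,)), num_classes).float()

fake = G(torch.cat([z, y], dim=1))         # G(z | y)
score = D(torch.cat([fake, y], dim=1))     # D(G(z | y) | y)
```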
  17. Application to image translation (画像変換への応用): pix2pix — a conditional GAN that applies some transformation to an input image (segmentation → driving-scene image, aerial photo → map, etc.). Differences from an ordinary cGAN: it does not generate the image from a noise vector, and it uses the paired training image as the condition. Figure 2 (pix2pix paper): training a conditional GAN to predict aerial photos from maps — G tries to synthesize fake images that fool D, while D tries to identify the fakes from "real or fake pair?" examples.
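     Because the condition is itself an image, the discriminator in this setup scores the (condition, output) pair rather than a single image. A minimal sketch of such a pair discriminator; the layer sizes and the two-layer convolutional network are assumptions of this illustration, not the architecture from the pix2pix paper:

```python
import torch
import torch.nn as nn

# Illustrative pair discriminator: the condition image and the candidate output
# are concatenated along the channel axis, so D scores the *pair* as real or fake.
D_pair = nn.Sequential(
    nn.Conv2d(3 + 3, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),     # one score per spatial patch
)

cond = torch.rand(1, 3, 256, 256)   # e.g. a segmentation map rendered as RGB
out = torch.rand(1, 3, 256, 256)    # generated or ground-truth photo
scores = D_pair(torch.cat([cond, out], dim=1))
```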
  18. Generator: U-Net. A network used for semantic segmentation; its structure introduces up-convolutions and skip connections. By concatenating the maps obtained from the convolution path with those from the up-convolution path and convolving again, it captures both local and global features. It was used for tasks such as cell segmentation and is effective when the number of training samples is small. (Figure: U-Net architecture — an encoder-decoder with copy-and-crop skip connections; example for 32×32 pixels at the lowest resolution.)
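     The key idea — up-convolve, concatenate with the corresponding encoder map, then convolve again — can be shown in a toy two-level module. This is only an illustration of the skip connection; the channel counts and depth are assumptions and not the U-Net or pix2pix generator configuration:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy two-level U-Net-style generator illustrating a skip connection."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 16, 3, padding=1)                      # full-resolution features
        self.enc2 = nn.Conv2d(16, 32, 4, stride=2, padding=1)           # downsample
        self.up = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)    # up-convolution
        self.out = nn.Conv2d(16 + 16, 3, 3, padding=1)                  # convolve after the skip concat

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))      # (B, 16, H, W)   local features
        e2 = torch.relu(self.enc2(e1))     # (B, 32, H/2, W/2) global features
        u = torch.relu(self.up(e2))        # (B, 16, H, W)
        u = torch.cat([u, e1], dim=1)      # skip connection merges local + global maps
        return torch.tanh(self.out(u))

y = TinyUNet()(torch.rand(1, 3, 64, 64))   # -> shape (1, 3, 64, 64)
```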
  19. Difference in training setups (学習方法の違い): in a plain GAN the Generator maps a noise vector to an image and the Discriminator labels that single image as fake or real; in pix2pix the Generator maps an input image to an output image and the Discriminator labels the (input, output) pair as fake or real. (Figure 3 of the pix2pix paper: two choices for the generator architecture — a plain encoder-decoder, and the "U-Net", an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.)
  20. pix2pix translation examples: segmentation → photo (セグメンテーション→実画像). Input / ground truth / output triplets (Figure 11 of the pix2pix paper: example results on Cityscapes; contrast adjusted for clarity).
  21. pix2pix translation examples: edges → photo, day → night (エッジ→実画像、昼→夜). Input / ground truth / output triplets (Figure 13 of the pix2pix paper: example results on day→night, compared to ground truth).
  22. Drawbacks of pix2pix (pix2pixのデメリット): the two images of a training pair must correspond pixel by pixel, so when such pairs are hard to construct — e.g. painting-style conversion — training is essentially impossible. (Slide reuses the input / ground truth / output examples from the previous slides.)
  23. CycleGAN: a pix2pix that does not require image pairs. The two translation targets are defined as domain X and domain Y, and two Generators and two Discriminators are prepared so that the domains are translated into each other and trained jointly:

      - G: Generator that translates X into Y
      - F: Generator that translates Y into X
      - D_Y: Discriminator that tries to detect the Y-domain images produced by G
      - D_X: Discriminator that tries to detect the X-domain images produced by F

      Figure 2 (from the paper): paired training data (left) consists of examples {x_i, y_i} where the y_i corresponding to each x_i is given; unpaired training data (right) consists of a source set {x_i} in X and a target set in domain Y with no pairing. The paper notes that an adversarially trained mapping G : X → Y alone does not guarantee that individual inputs and outputs are paired up in a meaningful way — infinitely many mappings G induce the same output distribution — and that optimizing the adversarial objective in isolation often leads to mode collapse, where all input images map to the same output image.
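     The four-network setup and the two cycles can be written down directly. The tiny convolutional networks below are placeholders (the paper uses a ResNet-based generator and a PatchGAN discriminator); only the wiring G : X → Y, F : Y → X, D_X, D_Y is the point of the sketch:

```python
import torch
import torch.nn as nn

def small_translator():
    # Placeholder image-to-image network, not the generator from the paper.
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())

def small_discriminator():
    return nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                         nn.Conv2d(16, 1, 4, stride=2, padding=1))

G = small_translator()        # G : X -> Y
F = small_translator()        # F : Y -> X
D_Y = small_discriminator()   # distinguishes real y from G(x)
D_X = small_discriminator()   # distinguishes real x from F(y)

x = torch.rand(1, 3, 64, 64)  # sample from domain X
y = torch.rand(1, 3, 64, 64)  # sample from domain Y

fake_y = G(x)                 # X -> Y
rec_x = F(fake_y)             # X -> Y -> X, should come back close to x
fake_x = F(y)                 # Y -> X
rec_y = G(fake_x)             # Y -> X -> Y, should come back close to y
```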
  24. How CycleGAN is trained (CycleGANの学習方法): forward cycle — an image from domain X is translated by G into domain Y, judged as fake or real by D_Y against real Y-domain images, and then translated back by F into domain X.
  25. How CycleGAN is trained (CycleGANの学習方法): backward cycle — an image from domain Y is translated by F into domain X, judged as fake or real by D_X, and then translated back by G into domain Y.
  26. Design of the loss function (誤差関数の設計): on top of the Generator/Discriminator losses, a cycle consistency loss is added — an adversarial loss for matching the distribution of generated images to the data distribution in the target domain, and a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.

      Adversarial loss — the Generator/Discriminator loss, one per translation direction. For the mapping G : X → Y and its discriminator D_Y:

          \mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))]   (1)

      where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish translated samples G(x) from real samples y; an analogous loss \mathcal{L}_{GAN}(F, D_X, Y, X) is used for F : Y → X.

      Cycle consistency loss — a reconstruction error (an L1 norm in the paper) between each input image and the image obtained by mapping it through the full cycle: the learned mappings should satisfy x → G(x) → F(G(x)) ≈ x (forward cycle consistency) and y → F(y) → G(F(y)) ≈ y (backward cycle consistency):

          \mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\, \| F(G(x)) - x \|_1 \,] + \mathbb{E}_{y \sim p_{data}(y)}[\, \| G(F(y)) - y \|_1 \,]   (2)
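     As a sketch under the same notation, the two loss terms can be written as small functions. The sigmoid on the discriminator output and the eps stabilizer are assumptions of this illustration; note also that the released CycleGAN implementation replaces the log-likelihood adversarial loss with a least-squares variant in practice:

```python
import torch

def adversarial_loss(D_Y, real_y, fake_y, eps=1e-8):
    """L_GAN(G, D_Y, X, Y) from Eq. (1), split into the discriminator's view
    (to be minimized by D_Y) and the generator's view (to be minimized by G)."""
    p_real = torch.sigmoid(D_Y(real_y))
    p_fake = torch.sigmoid(D_Y(fake_y.detach()))          # no gradient into G here
    loss_D = -(torch.log(p_real + eps).mean() + torch.log(1 - p_fake + eps).mean())
    loss_G = torch.log(1 - torch.sigmoid(D_Y(fake_y)) + eps).mean()
    return loss_D, loss_G

def cycle_consistency_loss(rec_x, x, rec_y, y):
    """L_cyc(G, F) from Eq. (2): L1 distance between each input and its
    reconstruction after a full cycle."""
    return (rec_x - x).abs().mean() + (rec_y - y).abs().mean()
```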
  27. Design of the loss function (誤差関数の設計), continued — the full objective combines the two adversarial losses with the cycle consistency loss:

          \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F)   (3)

      Here the first term is the X → Y Generator/Discriminator loss, the second is the Y → X Generator/Discriminator loss, the third is the cycle consistency loss, and λ controls the relative importance of the two kinds of objectives. Training solves

          G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y)   (4)
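     Continuing the illustrative sketches from slides 23 and 26 (the networks G, F, D_X, D_Y and the two loss functions defined there), the full objective of Eq. (3)/(4) is just a weighted sum; the weight λ = 10 used below is the value reported in the paper, while everything else remains an assumption of the sketch:

```python
# Reuses G, F, D_X, D_Y, x, y from the slide-23 sketch and adversarial_loss /
# cycle_consistency_loss from the slide-26 sketch.
lambda_cyc = 10.0

fake_y, fake_x = G(x), F(y)
rec_x, rec_y = F(fake_y), G(fake_x)

loss_D_Y, loss_G_adv = adversarial_loss(D_Y, y, fake_y)   # L_GAN(G, D_Y, X, Y)
loss_D_X, loss_F_adv = adversarial_loss(D_X, x, fake_x)   # L_GAN(F, D_X, Y, X)
loss_cyc = cycle_consistency_loss(rec_x, x, rec_y, y)      # L_cyc(G, F)

# Generators minimize their adversarial terms plus the weighted cycle loss;
# discriminators minimize their own (negated-max) adversarial terms.
loss_generators = loss_G_adv + loss_F_adv + lambda_cyc * loss_cyc
loss_discriminators = loss_D_Y + loss_D_X
```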
  28. Evaluation methods (評価方法):
      - Evaluation with Amazon Mechanical Turk (AMT): workers are shown a translated image and a real image and pick the one they believe is real; the rate is computed over repeated trials (a few practice trials followed by the scored trials).
      - Evaluation with a Fully Convolutional Network (FCN): translated images are fed to an FCN and scored on whether their content can still be recognized.
      - Evaluation via semantic segmentation: recognition accuracy is measured with the standard Cityscapes Dataset metrics.
  29. AMT evaluation, maps ↔ aerial photos (AMTによる評価): compared with the other GAN-based methods, CycleGAN fools evaluators far more often, while pix2pix still produces translations closer to the ground truth.

      Table 1: AMT "real vs fake" test on maps↔aerial photos (% of Turkers who labeled the result real).

      | Loss                  | Map → Photo  | Photo → Map  |
      |-----------------------|--------------|--------------|
      | CoGAN [27]            | 0.6% ± 0.5%  | 0.9% ± 0.5%  |
      | BiGAN [6, 5]          | 2.1% ± 1.0%  | 1.9% ± 0.9%  |
      | Pixel loss + GAN [41] | 0.7% ± 0.5%  | 2.6% ± 1.1%  |
      | Feature loss + GAN    | 1.2% ± 0.6%  | 0.3% ± 0.2%  |
      | CycleGAN (ours)       | 26.8% ± 2.8% | 23.2% ± 3.4% |

      (Qualitative comparisons also shown: Figure 5 — labels↔photos for the different methods; Figure 6 — aerial photos↔maps on Google Maps, from left to right: input, BiGAN [5, 6], CoupledGAN [27], CycleGAN, pix2pix, ground truth.)
  30. Translation results on the Cityscapes dataset (変換結果): CycleGAN generates higher-quality images than the other GAN-based methods. Photo → label: recognition accuracy is still underwhelming; label → photo: the quality is roughly at the level where roads and objects are faintly recognizable. (Figure 7 of the paper: different variants of the method for mapping labels↔photos trained on Cityscapes — input, cycle-consistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss (F(G(x)) ≈ x), GAN + backward cycle-consistency loss, CycleGAN, CycleGAN + identity loss, ground truth.)
  31. Results when G(X) and F(Y) are used for training / evaluation: G(X) (translated images used as the recognition output) and F(Y) (translated samples used as training data) are evaluated with an FCN. Using G(X) as the recognition result gives better accuracy than training on F(Y), but both remain well below an FCN trained directly on the ordinary Cityscapes dataset — the translated images are not yet good enough to serve as training samples, and pix2pix still performs better.

      Table 2: FCN-scores for different methods, evaluated on Cityscapes labels→photos (samples translated by F fed into the FCN).

      | Loss                  | Per-pixel acc. | Per-class acc. | Class IOU |
      |-----------------------|----------------|----------------|-----------|
      | CoGAN [27]            | 0.40           | 0.10           | 0.06      |
      | BiGAN [6, 5]          | 0.19           | 0.06           | 0.02      |
      | Pixel loss + GAN [41] | 0.20           | 0.10           | 0.0       |
      | Feature loss + GAN    | 0.07           | 0.04           | 0.01      |
      | CycleGAN (ours)       | 0.52           | 0.17           | 0.11      |
      | pix2pix [18]          | 0.71           | 0.25           | 0.18      |

      Table 3: Classification performance of photo→labels for different methods on Cityscapes (G(X) used as the recognition result).

      | Loss                  | Per-pixel acc. | Per-class acc. | Class IOU |
      |-----------------------|----------------|----------------|-----------|
      | CoGAN [27]            | 0.45           | 0.11           | 0.08      |
      | BiGAN [6, 5]          | 0.41           | 0.13           | 0.07      |
      | Pixel loss + GAN [41] | 0.47           | 0.11           | 0.07      |
      | Feature loss + GAN    | 0.50           | 0.10           | 0.06      |
      | CycleGAN (ours)       | 0.58           | 0.22           | 0.16      |
      | pix2pix [18]          | 0.85           | 0.40           | 0.32      |
  32. Results when the cycle reconstructions F(G(X)) and G(F(Y)) are used for training / evaluation — an ablation verifying the effect of the cycle consistency loss:
      - forward: only the loss for the domain X → domain Y direction
      - backward: only the loss for the domain Y → domain X direction
      - ours: losses for both domain X → domain Y and domain Y → domain X

      Table 4: Ablation study — FCN-scores for different variants of the method, evaluated on Cityscapes labels→photos.

      | Loss                 | Per-pixel acc. | Per-class acc. | Class IOU |
      |----------------------|----------------|----------------|-----------|
      | Cycle alone          | 0.22           | 0.07           | 0.02      |
      | GAN alone            | 0.52           | 0.11           | 0.08      |
      | GAN + forward cycle  | 0.55           | 0.18           | 0.13      |
      | GAN + backward cycle | 0.41           | 0.14           | 0.06      |
      | CycleGAN (ours)      | 0.52           | 0.17           | 0.11      |

      Table 5: Ablation study — classification performance of photos→labels for different losses, evaluated on Cityscapes.

      | Loss                 | Per-pixel acc. | Per-class acc. | Class IOU |
      |----------------------|----------------|----------------|-----------|
      | Cycle alone          | 0.10           | 0.05           | 0.02      |
      | GAN alone            | 0.53           | 0.11           | 0.07      |
      | GAN + forward cycle  | 0.49           | 0.11           | 0.07      |
      | GAN + backward cycle | 0.01           | 0.06           | 0.01      |
      | CycleGAN (ours)      | 0.58           | 0.22           | 0.16      |
  33. Summer → Winter, Winter → Summer (夏→冬、冬→夏): example results for zebra → horse, summer Yosemite → winter Yosemite, and winter Yosemite → summer Yosemite.
  34. Apple → Orange, Orange → Apple (林檎→オレンジ、オレンジ→林檎): example results for summer Yosemite → winter Yosemite, apple → orange, and orange → apple (Figure 13 of the paper: the method applied to several translation problems; these images are selected as relatively successful results).
  35. Failure cases (失敗例): input/output examples for apple → orange, zebra → horse, dog → cat, cat → dog, winter → summer, Monet → photo, photo → Ukiyo-e, photo → Van Gogh, and iPhone photo → DSLR photo. (Figure 16 of the paper is also shown: comparison with neural style transfer [10] on apple→orange, horse→zebra, and Monet→photo — left to right: input image, results from [10] using two different style images, results from [10] using all images from the target domain, and CycleGAN.)
  36. Drawbacks of CycleGAN (CycleGANのデメリット):
      - Performance is below that of supervised methods such as pix2pix; being able to train "unsupervised", without paired data, is CycleGAN's advantage.
      - Translations that change shape are difficult; translations that only change texture are roughly the limit.
      - It seems to apply the new texture indiscriminately to every object in the image — the model cannot understand which objects are present. For example, in horse → zebra, is the background turning darker because of the zebra's stripe pattern?