
Optimal Transport and Deep Generative Models

Talk at "Rencontre Maths-Industrie"

Gabriel Peyré

June 20, 2017
Transcript

  1. Optimal Transport and Deep Generative Models
    Gabriel Peyré
    www.numerical-tours.com
    Joint works with Aude Genevay and Marco Cuturi
    École Normale Supérieure

  2. Optimal Transport: Theory to Applications
    Monge, Kantorovich, Dantzig, Brenier, Otto, McCann, Villani
    [Figure: transport framework and application to color transfer —
    source image (X), style image (Y), sliced Wasserstein projection of X
    onto the color statistics of the style image Y, source image after
    color transfer. Credit: J. Rabin, Wasserstein Regularization]

  3. Overview
    Discriminative vs. Generative Models
    Density Fitting vs. Auto-Encoders
    [Diagram: generator g_θ : Z → X, model measure μ_θ = g_θ♯ζ, encoder d_ξ]

  4. Discriminative vs. Generative Models
    [Diagram: latent space Z (low dimension), data space X (high dimension);
    generator g_θ : Z → X (generative), encoder d_ξ : X → Z (discriminative)]
    Supervised: learn d_ξ from labeled data (x_i, z_i)_i
    (e.g. classification, where z = class probability).
    Unsupervised: learn (g_θ, d_ξ) from data (x_i)_i.
    Compression: z = d_ξ(x) is a representation.
    Generation: x = g_θ(z) is a synthesis.
    Density fitting: g_θ({z_i}_i) ≈ {x_i}_i.
    Auto-encoders: g_θ(d_ξ(x_i)) ≈ x_i, with optimal transport map d_ξ.
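
To make the two unsupervised criteria concrete, here is a minimal numpy sketch (a hypothetical toy setup with linear maps standing in for g_θ and d_ξ) contrasting the density-fitting residual g_θ({z_i}) ≈ {x_i} with the auto-encoder reconstruction g_θ(d_ξ(x_i)) ≈ x_i; the crude moment-matching surrogate below is only for illustration, not the OT metric of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generator g_theta : R^2 -> R^10 and encoder d_xi : R^10 -> R^2
# (hypothetical stand-ins for the deep networks introduced later).
theta = rng.standard_normal((10, 2))
xi = rng.standard_normal((2, 10))

def g(z):                      # generator: latent -> data
    return z @ theta.T

def d(x):                      # encoder: data -> latent
    return x @ xi.T

x = rng.standard_normal((100, 10))   # observations (x_i)_i
z = rng.standard_normal((100, 2))    # latent samples (z_i)_i

# Density fitting: compare the generated cloud g_theta({z_i}) to {x_i}
# (crude first-moment surrogate for a distance between the two clouds).
density_gap = np.linalg.norm(g(z).mean(axis=0) - x.mean(axis=0))

# Auto-encoders: reconstruction error g_theta(d_xi(x_i)) vs x_i.
recon_err = np.mean(np.sum((g(d(x)) - x) ** 2, axis=1))

print(density_gap, recon_err)
```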

  5. Deep Discriminative vs. Generative Models
    Deep networks:
    g_θ(z) = ρ(θ_K(··· ρ(θ_2(ρ(θ_1(z)))) ···))
    d_ξ(x) = ρ(ξ_K(··· ρ(ξ_2(ρ(ξ_1(x)))) ···))
    [Diagram: generative network g_θ : Z → X with layers θ_1, θ_2, … and
    intermediate activations z_1, z_2; discriminative network d_ξ : X → Z
    with layers ξ_1, ξ_2, …]
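
A minimal numpy sketch of these compositions, assuming affine layers θ_k = (W_k, b_k) and ρ = ReLU (the slide leaves both unspecified):

```python
import numpy as np

def relu(u):
    """Pointwise nonlinearity rho (ReLU chosen here for illustration)."""
    return np.maximum(u, 0.0)

def make_layers(dims, rng):
    """Affine layers theta_k = (W_k, b_k) mapping dims[k] -> dims[k+1]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
            for n, m in zip(dims[:-1], dims[1:])]

def forward(layers, v):
    """rho(theta_K(... rho(theta_2(rho(theta_1(v)))) ...))"""
    for W, b in layers:
        v = relu(W @ v + b)
    return v

rng = np.random.default_rng(0)
g_layers = make_layers([2, 32, 32, 10], rng)   # g_theta : Z = R^2 -> X = R^10
d_layers = make_layers([10, 32, 32, 2], rng)   # d_xi    : X = R^10 -> Z = R^2

z = rng.standard_normal(2)
x = forward(g_layers, z)         # synthesis x = g_theta(z)
z_hat = forward(d_layers, x)     # representation z = d_xi(x)
```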

  6. Examples of Image Generation
    [Generated image samples; credit arXiv:1511.06434]
    g_θ : Z → X

  7. Overview
    Discriminative vs. Generative Models
    Density Fitting vs. Auto-Encoders
    [Diagram: generator g_θ : Z → X, model measure μ_θ = g_θ♯ζ, encoder d_ξ]

  8. Density Fitting vs. Generative Models
    Parametric model: θ ↦ μ_θ
    Observations: ν = (1/n) Σ_{i=1}^n δ_{x_i}
    Density fitting (maximum likelihood, MLE), when dμ_θ(y) = f_θ(y) dy:
    min_θ KL(μ_θ | ν) := −Σ_i log(f_θ(x_i))
    Generative model fit: μ_θ = g_θ♯ζ is singular, so KL(μ_θ | ν) = +∞
    → MLE undefined.
    → Need a weaker metric.
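
As a concrete instance of the MLE, a minimal sketch fitting a Gaussian family f_θ (my choice for illustration; any parametric density works) to the empirical measure ν:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # observations (x_i)_i

def neg_log_likelihood(mean, std, x):
    """-sum_i log f_theta(x_i) for the Gaussian density f_theta."""
    return np.sum(0.5 * ((x - mean) / std) ** 2
                  + np.log(std) + 0.5 * np.log(2 * np.pi))

# For the Gaussian family the minimizer is closed-form:
# the empirical mean and standard deviation.
mean_hat, std_hat = x.mean(), x.std()
print(mean_hat, std_hat, neg_log_likelihood(mean_hat, std_hat, x))
```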

  9. Comparing Measures and Spaces
    Probability distributions and histograms
    → images, vision, graphics and machine learning, …
    [Figure: color transfer example (source image X, style image Y) and a
    comparison of the optimal transport mean vs. the L2 mean of two
    distributions. Credit: J. Rabin, Wasserstein Regularization]
    • Optimal transport → well defined for discrete or singular
      distributions (a "weak" metric).

  10. Probability Measures and Couplings
    Marginals: P_{1♯}π(S) := π(S, X),  P_{2♯}π(S) := π(X, S)
    Couplings: Π(μ, ν) := {π ∈ M_+(X × X) ; P_{1♯}π = μ, P_{2♯}π = ν}
    [Figure: example couplings in the discrete, semi-discrete and
    continuous cases]
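
In the discrete case a coupling π ∈ Π(μ, ν) is simply a nonnegative matrix whose row and column sums reproduce the two marginals; a small numpy sketch:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])   # discrete measure mu on 3 points
nu = np.array([0.4, 0.6])        # discrete measure nu on 2 points

# The independent (product) coupling always belongs to Pi(mu, nu).
pi = np.outer(mu, nu)

# Marginal constraints: P_{1#}pi = mu (row sums), P_{2#}pi = nu (column sums).
assert np.allclose(pi.sum(axis=1), mu)
assert np.allclose(pi.sum(axis=0), nu)
```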

  11. Optimal Transport
    Optimal transport [Kantorovitch 1942]:
    W_p^p(μ, ν) := min_π { ⟨d^p, π⟩ = ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(μ, ν) }
    → W_p is a distance on M_+(X).
    → W_p(δ_x, δ_y) = d(x, y) → 0 as x → y.
    → W_p works for singular distributions.
    Minimum Kantorovitch Estimator [Bassetti et al, 06]: min_θ W_p(μ_θ, ν);
    for a generative model, min_θ W_p(g_θ♯ζ, ν).
    Algorithms: [Montavon et al 16], [Bernton et al 17], [Genevay et al 17]
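
A minimal sketch computing W_p^p between two small discrete measures by solving the Kantorovich linear program directly with scipy (fine at this scale; large problems call for dedicated OT solvers):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))    # support points of mu
y = rng.standard_normal((4, 2))    # support points of nu
mu = np.full(5, 1 / 5)             # uniform weights
nu = np.full(4, 1 / 4)

p = 2
cost = np.linalg.norm(x[:, None] - y[None, :], axis=2) ** p  # d(x_i, y_j)^p

n, m = cost.shape
# Marginal constraints on the (row-major flattened) coupling pi:
A_rows = np.kron(np.eye(n), np.ones(m))   # sum_j pi_ij = mu_i
A_cols = np.kron(np.ones(n), np.eye(m))   # sum_i pi_ij = nu_j
res = linprog(cost.ravel(),
              A_eq=np.vstack([A_rows, A_cols]),
              b_eq=np.concatenate([mu, nu]),
              bounds=(0, None))
print("W_p^p(mu, nu) =", res.fun)
```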

  12. From OT to VAE
    min_θ W_p^p(g_θ♯ζ, ν)
      = min_{θ,π} { ∫_{X×X} d(x, y)^p dπ(x, y) ; π ∈ Π(μ_θ, ν) }
      = min_{θ,π̄} { ∫_{Z×X} d(g_θ(z), y)^p dπ̄(z, y) ; π̄ ∈ Π(ζ, ν) }
      ≈ min_{θ,ξ} { Σ_i d(g_θ(d_ξ(x_i)), x_i)^p ; d_ξ♯ν ≈ ζ }
    Approximation of π by the encoder: π ≈ Σ_i δ_{(d_ξ(x_i), x_i)}
    [Bousquet et al, 17], ArXiv:1705.07642
    Variational Auto-Encoders [Kingma, Welling, 13]: g_θ ∘ d_ξ ≈ Id
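
A minimal numpy sketch of the last relaxation (illustrative choices on my part: linear g_θ and d_ξ, p = 2, and a simple moment penalty standing in for the constraint d_ξ♯ν ≈ ζ with ζ = N(0, I)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 10))            # observations (x_i)_i
theta = rng.standard_normal((10, 2)) * 0.1    # linear generator g_theta
xi = rng.standard_normal((2, 10)) * 0.1       # linear encoder d_xi

def objective(theta, xi, x, lam=1.0):
    z = x @ xi.T                         # d_xi(x_i)
    recon = z @ theta.T                  # g_theta(d_xi(x_i))
    fit = np.sum((recon - x) ** 2)       # sum_i d(g_theta(d_xi(x_i)), x_i)^2
    # Penalty standing in for d_xi#nu ~ zeta = N(0, I):
    # match the mean and covariance of the encoded points.
    mismatch = (np.sum(z.mean(axis=0) ** 2)
                + np.sum((np.cov(z, rowvar=False) - np.eye(2)) ** 2))
    return fit + lam * mismatch

print(objective(theta, xi, x))
```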

  13. Conclusion
    Unsupervised learning:
    → learning a (generator, discriminator) pair.
    Generative Adversarial Networks [Goodfellow et al, 14]:
    → also an OT-like problem! [Arjovsky et al, 17], ArXiv:1701.07875
    Open problems:
    → how far are VAE/GAN from "true" OT?
    → using OT to improve VAE/GAN training.