

Hiroshi Takahashi
June 08, 2024

Learning Optimal Priors for Task-Invariant Representations in Variational Autoencoders

- KDD 2022
- Paper is available at https://dl.acm.org/doi/10.1145/3534678.3539291


Transcript

  1. KDD 2022 Research Track: Learning Optimal Priors for Task-Invariant Representations
    in Variational Autoencoders. Hiroshi Takahashi¹, Tomoharu Iwata¹, Atsutoshi Kumagai¹, Sekitoshi Kanai¹, Masanori Yamada¹, Yuuki Yamanaka¹, Hisashi Kashima² (¹NTT, ²Kyoto University)
  2. [Introduction] Variational Autoencoder (Copyright 2022 NTT CORPORATION)
    • The variational autoencoder (VAE) is a powerful latent variable model for unsupervised representation learning, used in downstream applications such as classification, data generation, and out-of-distribution detection.
    • [Diagram: data x → encoder φ → latent variable z → decoder θ → x, with standard Gaussian prior p(z)]
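The encoder/decoder loop on this slide can be sketched numerically. Below is a minimal, hypothetical NumPy sketch of the VAE's sampling path; the linear `encode` stand-in and the 2-D latent size are illustrative assumptions (a real VAE parameterizes the encoder and decoder with neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Hypothetical stand-in for the encoder q_phi(z|x): returns the mean and
    # log-variance of a diagonal Gaussian over a 2-D latent z.
    # A real VAE computes these with a neural network.
    mu = 0.1 * x @ np.ones((x.shape[1], 2))
    log_var = np.zeros((x.shape[0], 2))
    return mu, log_var

def sample_latent(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so z is a differentiable function of the encoder outputs.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

x = rng.standard_normal((4, 3))        # a small batch standing in for data
mu, log_var = encode(x)
z = sample_latent(mu, log_var, rng)    # latent codes that feed the decoder
z_prior = rng.standard_normal((4, 2))  # draws from the standard Gaussian prior p(z)
```

Generation draws z from p(z) and decodes it; representation learning encodes x and hands z to the downstream task.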
  3. [Introduction] Multi-Task Learning
    • However, the VAE cannot perform well with insufficient data points, since it depends on neural networks.
    • To solve this, we focus on obtaining a task-invariant latent variable from multiple tasks.
    • [Diagram: multiple tasks, some with a lot of data points and some with insufficient data points, feed a shared encoder φ that yields a task-invariant latent variable z, useful for the tasks with insufficient data points]
  4. [Introduction] Conditional VAE
    • For multiple tasks, the conditional VAE (CVAE) is widely used; it tries to obtain a task-invariant latent variable.
    • [Diagram: data x and task index s → encoder φ → task-invariant latent variable z; the decoder θ reconstructs x from (z, s), with standard Gaussian prior p(z)]
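The slide does not spell out how the conditioning is implemented; a common choice (an assumption here, not the paper's stated implementation) is to one-hot encode the task index s and concatenate it to the inputs of the encoder and decoder:

```python
import numpy as np

def one_hot(s, num_tasks):
    # Encode integer task indices as one-hot rows.
    out = np.zeros((len(s), num_tasks))
    out[np.arange(len(s)), s] = 1.0
    return out

x = np.ones((3, 5))        # batch of data points
s = np.array([0, 2, 1])    # task index of each point
s_oh = one_hot(s, num_tasks=3)

# The encoder receives (x, s) and the decoder receives (z, s), so the
# task-specific information can be carried by s rather than by z.
enc_in = np.concatenate([x, s_oh], axis=1)
```

Because the decoder already knows s, the latent z is free to capture only what is shared across tasks.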
  5. [Introduction] Problem and Contribution
    • Although the CVAE can reduce the dependency of z on s to some extent, this dependency remains in many cases.
    • The contributions of this study are as follows:
      1. We investigate the cause of the task-dependency in the CVAE and reveal that the simple prior is one of the causes.
      2. We introduce the optimal prior to reduce the task-dependency.
      3. We theoretically and experimentally show that our learned representation works well on multiple tasks.
  6. [Preliminaries] Reviewing CVAE
    • The CVAE models the conditional probability of x given s as:
      p_θ(x|s) = ∫ p_θ(x|z, s) p(z) dz = E_{q_φ(z|x,s)}[ p_θ(x|z, s) p(z) / q_φ(z|x, s) ]
    • The CVAE is trained by maximizing the ELBO, a lower bound of the log-likelihood:
      ℱ_CVAE(θ, φ) = E_{p_D(x,s) q_φ(z|x,s)}[ ln p_θ(x|z, s) ] − E_{p_D(x,s)}[ D_KL(q_φ(z|x, s) ‖ p(z)) ]
      where p_D(x, s) is the data distribution, p_θ(x|z, s) the decoder, q_φ(z|x, s) the encoder, and p(z) the prior; the second (KL) term is denoted ℛ(φ).
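For the usual diagonal-Gaussian encoder with a standard Gaussian prior p(z), the KL term of this ELBO has a closed form, D_KL = ½ Σ (μ² + σ² − 1 − ln σ²). A small sketch under that assumption, with a unit-variance Gaussian decoder as a placeholder reconstruction term:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # Closed-form D_KL(q_phi(z|x,s) || p(z)) per data point, valid for a
    # diagonal-Gaussian encoder and the standard Gaussian prior p(z) = N(0, I).
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def elbo(x, x_recon, mu, log_var):
    # One-sample Monte Carlo ELBO: reconstruction log-likelihood minus KL,
    # assuming a unit-variance Gaussian decoder for the reconstruction term.
    recon = -0.5 * np.sum((x - x_recon) ** 2 + np.log(2 * np.pi), axis=-1)
    return recon - gaussian_kl(mu, log_var)

mu = np.zeros((2, 4))
log_var = np.zeros((2, 4))
kl = gaussian_kl(mu, log_var)  # zero here: q equals the prior exactly
```

The KL term is what ℛ(φ) averages over the data distribution, which is why the choice of prior enters the analysis on the next slides.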
  7. [Preliminaries] Mutual Information
    • To investigate the cause of the dependency of z on s, we introduce the mutual information I(S; Z), which measures the mutual dependence between two random variables:
      I(S; Z) becomes large if z depends on s; I(S; Z) becomes small if z does NOT depend on s.
    • [Diagram: entropies H(S) and H(Z) shown with large vs. small overlap]
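For intuition, I(S; Z) can be computed exactly when both variables are discrete: it is zero precisely when the joint distribution factorizes, and it grows as z becomes more predictable from s. A small self-contained sketch:

```python
import math

def mutual_information(joint):
    # I(S; Z) in nats for a discrete joint distribution, where
    # joint[i][j] = p(S = i, Z = j).
    ps = [sum(row) for row in joint]
    pz = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (ps[i] * pz[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # z carries no information about s
dependent = [[0.5, 0.0], [0.0, 0.5]]        # z determines s completely
```

Here `mutual_information(independent)` is 0, while `mutual_information(dependent)` equals ln 2, the full entropy of S.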
  8. [Proposed] Theorem 1
    • The CVAE tries to minimize the mutual information I(S; Z) by minimizing its upper bound ℛ(φ):
      ℛ(φ) ≡ E_{p_D(x,s)}[ D_KL(q_φ(z|x, s) ‖ p(z)) ] = I(S; Z) + D_KL(q_φ(z) ‖ p(z)) + Σ_{k=1}^{K} π_k I(X^(k); Z^(k))
      where π_k = p(s = k), q_φ(z) = ∫ q_φ(z|x, s) p_D(x, s) dx, and I(X^(k); Z^(k)) is the mutual information between x and z when s = k.
    • However, ℛ(φ) is NOT a tight upper bound of I(S; Z), since D_KL(q_φ(z) ‖ p(z)) usually gives a large value.
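The decomposition in Theorem 1 can be checked numerically on a discrete toy model (binary s, x, z with arbitrary made-up probabilities; this is a sanity check of the identity, not the paper's code):

```python
import math
from itertools import product

# Toy model: p_xs[s][x] = p_D(x, s), q[s][x][z] = q_phi(z|x, s), uniform prior p(z).
p_xs = [[0.3, 0.2], [0.1, 0.4]]
q = [[[0.9, 0.1], [0.4, 0.6]],
     [[0.7, 0.3], [0.3, 0.7]]]
p_z = [0.5, 0.5]

def kl(p, r):
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

# Left-hand side: R(phi) = E_{p_D(x,s)} [ D_KL(q(z|x,s) || p(z)) ]
lhs = sum(p_xs[s][x] * kl(q[s][x], p_z) for s, x in product(range(2), range(2)))

# Aggregated posterior q_phi(z) and the joint p(s, z) it induces
q_z = [sum(p_xs[s][x] * q[s][x][z] for s, x in product(range(2), range(2)))
       for z in range(2)]
p_sz = [[sum(p_xs[s][x] * q[s][x][z] for x in range(2)) for z in range(2)]
        for s in range(2)]
p_s = [sum(p_xs[s]) for s in range(2)]

# I(S; Z) under that joint
i_sz = sum(p_sz[s][z] * math.log(p_sz[s][z] / (p_s[s] * q_z[z]))
           for s, z in product(range(2), range(2)))

# Per-task terms: pi_k * I(X^(k); Z^(k))
per_task = 0.0
for s in range(2):
    p_x_given_s = [p_xs[s][x] / p_s[s] for x in range(2)]
    p_z_given_s = [p_sz[s][z] / p_s[s] for z in range(2)]
    for x, z in product(range(2), range(2)):
        pxz = p_x_given_s[x] * q[s][x][z]  # p(x, z | s)
        if pxz > 0:
            per_task += p_s[s] * pxz * math.log(pxz / (p_x_given_s[x] * p_z_given_s[z]))

rhs = i_sz + kl(q_z, p_z) + per_task  # Theorem 1 says lhs == rhs
```

For any choice of the probability tables the two sides agree, and the D_KL(q_φ(z) ‖ p(z)) term is visibly the slack between ℛ(φ) and the mutual-information terms.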
  9. [Proposed] Effects of Priors
    • ℛ(φ) is NOT a tight upper bound of I(S; Z), since D_KL(q_φ(z) ‖ p(z)) usually gives a large value.
    • When p(z) = q_φ(z), ℛ(φ) becomes the tightest upper bound of I(S; Z).
    • That is, the simple prior p(z) is one of the causes of the task-dependency, and q_φ(z) is the optimal prior to reduce it.
  10. [Proposed] Theorem 2
    • The ELBO with this optimal prior, ℱ_Proposed(θ, φ), is always larger than or equal to the original ELBO ℱ_CVAE(θ, φ):
      ℱ_Proposed(θ, φ) = ℱ_CVAE(θ, φ) + D_KL(q_φ(z) ‖ p(z)) ≥ ℱ_CVAE(θ, φ)
    • That is, ℱ_Proposed(θ, φ) is also a better lower bound of the log-likelihood than ℱ_CVAE(θ, φ).
    • This contributes to obtaining a better representation and improved performance on the target tasks.
  11. [Proposed] Optimizing ℱ_Proposed(θ, φ)
    • We optimize ℱ_Proposed(θ, φ) = ℱ_CVAE(θ, φ) + D_KL(q_φ(z) ‖ p(z)) by approximating the KL divergence:
      D_KL(q_φ(z) ‖ p(z)) = ∫ q_φ(z) ln [ q_φ(z) / p(z) ] dz
    • We approximate q_φ(z) / p(z) by the density ratio trick, which can estimate the density ratio between two distributions using samples from both distributions (see Section 3.3).
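The density ratio trick can be sketched with a toy example: train a logistic classifier D(z) to distinguish samples of the two distributions; its logit then estimates ln(q(z)/p(z)) when the sample sizes are equal. Below, two known 1-D Gaussians stand in for q_φ(z) and p(z); the quadratic feature map and the plain gradient-ascent loop are illustrative choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two known 1-D Gaussians stand in for the aggregated posterior q_phi(z)
# and the prior p(z); their true log density ratio is z - 0.5.
z_q = rng.normal(1.0, 1.0, size=2000)  # plays the role of samples from q_phi(z)
z_p = rng.normal(0.0, 1.0, size=2000)  # plays the role of samples from p(z)

# Density ratio trick: fit a logistic classifier D(z) that labels q-samples 1
# and p-samples 0; with equal sample sizes, logit D(z) estimates ln(q(z)/p(z)).
z = np.concatenate([z_q, z_p])
y = np.concatenate([np.ones_like(z_q), np.zeros_like(z_p)])
feats = np.stack([z, z**2, np.ones_like(z)], axis=1)  # quadratic features suffice for Gaussian ratios

w = np.zeros(3)
for _ in range(5000):  # plain gradient ascent on the logistic log-likelihood
    d = 1.0 / (1.0 + np.exp(-feats @ w))
    w += 0.2 * feats.T @ (y - d) / len(y)

def log_ratio(zs):
    # Estimated ln(q(z)/p(z)) at the query points zs.
    f = np.stack([zs, zs**2, np.ones_like(zs)], axis=1)
    return f @ w
```

The estimate should roughly track the true ratio z − 0.5, e.g. ranking z = 1 (the mean of q) well above a point in q's left tail.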
  12. [Proposed] Theoretical Contributions
    • Our theoretical contributions are summarized as follows:
      Theorem 1 shows: the simple prior is one of the causes of the task-dependency, and q_φ(z) is the optimal prior to reduce it.
      Theorem 2 shows: ℱ_Proposed(θ, φ) gives a better lower bound of the log-likelihood, which enables us to obtain a better representation than the CVAE.
    • We next evaluate our representation on various datasets.
  13. [Experiments] Datasets
    • We used two handwritten digit datasets (USPS and MNIST), two house number digit datasets (SynthDigits and SVHN), and three face datasets (Frey, Olivetti, and UMist).
  14. [Experiments] Settings
    • On the digits datasets, we conducted two-task experiments, which estimate the performance on the target task:
      The source task has a lot of training data points; the target task has only 100 training data points.
      Pairs are (USPS→MNIST), (MNIST→USPS), (SynthDigits→SVHN), and (SVHN→SynthDigits).
    • On the face datasets, we conducted a three-task experiment, which simultaneously evaluates the performance on each task using a single estimator.
  15. [Results] Density Estimation Performance
    • USPS→MNIST: VAE −163.25 ± 2.15, CVAE −152.32 ± 1.64, Proposed [illegible in transcript]
    • MNIST→USPS: VAE −235.23 ± 1.54, CVAE [illegible], Proposed [illegible]
    • Synth→SVHN: VAE 1146.04 ± 35.65, CVAE 1397.36 ± 10.89, Proposed [illegible]
    • SVHN→Synth: VAE 760.66 ± 8.85, CVAE 814.63 ± 10.09, Proposed [illegible]
    • Face Datasets: VAE 895.41 ± 2.98, CVAE 902.99 ± 3.69, Proposed [illegible]
    • Almost equal to or better performance than the other approaches.
  16. [Results] Downstream Classification
    • USPS→MNIST: VAE 0.52 ± 2.15, CVAE 0.53 ± 0.02, Proposed [illegible in transcript]
    • MNIST→USPS: VAE 0.64 ± 0.01, CVAE 0.67 ± 0.01, Proposed [illegible]
    • Synth→SVHN: VAE 0.20 ± 0.00, CVAE [illegible], Proposed 0.19 ± 0.00
    • SVHN→Synth: VAE 0.25 ± 0.01, CVAE 0.25 ± 0.00, Proposed [illegible]
    • Almost equal to or better performance than the other approaches.
  17. Conclusion
    • Our contributions for the CVAE are summarized as follows:
      Theorem 1 shows: the simple prior is one of the causes of the task-dependency, and we propose the optimal prior to reduce it.
      Theorem 2 shows: our approach gives a better lower bound of the log-likelihood, which enables us to obtain a better representation than the CVAE.
      Experiments show: our approach achieves better performance on various datasets.