The Bayesian Learning Rule
Emtiyaz Khan (RIKEN, Tokyo, Japan)
WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24
RIKEN AI Project, Tokyo. http://emtiyaz.github.io
Presentation at the Optimal Transport Workshop (Berlin), March 14, 2024.
Summary of recent research: https://emtiyaz.github.io/papers/symposium_2023.pdf
Slides available at https://emtiyaz.github.io/
• When things change, models need retraining
• Huge amounts of resources are required and only a few can afford them (costly and unsustainable) [1, 2, 3]
• Difficult to apply in "dynamic" settings (robotics, medicine, epidemiology, climate science, etc.)
• Our goal is to solve such challenges
  – Help in building safe and trustworthy AI
  – Reduce the "magic" in deep learning (DL)

1. Diethe et al. Continual learning in practice, arXiv, 2019.
2. Paleyes et al. Challenges in deploying machine learning: a survey of case studies, arXiv, 2021.
3. https://www.youtube.com/watch?v=hx7BXih7zx8&t=897s
• Bayesian/variational learning now works at scale [2-5]
  – SOTA on GPT-2 and ImageNet [5]
• Improve other aspects of DL [5-7]
  – Calibration, uncertainty, memory, etc.
  – Understand and fix model behavior
• Towards human-like quick adaptation

1. Khan and Rue, The Bayesian Learning Rule, JMLR (2023).
2. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
3. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
4. Lin et al. Handling the positive-definite constraints in the BLR, ICML (2020).
5. Shen et al. Variational Learning is Effective for Large Deep Networks, under review.
6. Daheim et al. Model merging by uncertainty-based gradient matching, ICLR (2024).
7. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS (2023).
GPT-2 results: better performance at the same cost with BLR (IVON) [3]. Trained on OpenWebText data (49.2B tokens): on the 773M model we get a gain of 0.5 in perplexity, and on the 355M model a gain of 0.4.

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
2. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
3. Shen et al. Variational Learning is Effective for Large Deep Networks, under review (2024).
Improved Variational Online Newton (IVON)

Adam-style update (for comparison):
$$ \hat g \leftarrow \hat\nabla\ell(\theta), \quad \hat h \leftarrow \hat g^2, \quad h \leftarrow (1-\rho)h + \rho\hat h, \quad \theta \leftarrow \theta - \alpha\,(\hat g + \delta\theta)/(\sqrt{h} + \delta) $$

IVON update:
$$ \begin{aligned} \hat g &\leftarrow \hat\nabla\ell(\theta), \quad \text{where } \theta \sim \mathcal N(m, \sigma^2) \\ \hat h &\leftarrow \hat g \cdot (\theta - m)/\sigma^2 \\ h &\leftarrow (1-\rho)h + \rho\hat h + \rho^2(h - \hat h)^2/(2(h + \delta)) \\ m &\leftarrow m - \alpha\,(\hat g + \delta m)/(h + \delta) \\ \sigma^2 &\leftarrow 1/(N(h + \delta)) \end{aligned} $$

Only tune the initial value of h (a scalar).

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
2. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
3. Lin et al. Handling the positive-definite constraints in the BLR, ICML (2020).
4. Shen et al. Variational Learning is Effective for Large Deep Networks, under review (2024).
Bayes' rule: posterior ∝ likelihood × prior. Multiplication of distributions = addition of (natural) parameters:
$$ e^{\lambda_{\text{post}}^\top T(\theta)} \propto e^{\lambda_{\text{lik}}^\top T(\theta)} \times e^{\lambda_{\text{prior}}^\top T(\theta)} \;\Rightarrow\; \lambda_{\text{post}} = \lambda_{\text{lik}} + \lambda_{\text{prior}}, $$
that is, log-posterior = log-lik + log-prior. This idea can be generalized through natural gradients, giving the posterior "approximation"
$$ \lambda_{\text{post}} = \nabla_\mu\, \mathbb{E}_q[\text{log-lik} + \text{log-prior}] \quad (\text{a natural gradient}). $$

1. Khan and Lin, Conjugate computation variational inference, AISTATS, 2017.
For conjugate models, the expected log-lik and log-prior are linear in μ [1]:
$$ \mathbb{E}_q[\text{log-lik}] = \lambda_{\text{lik}}^\top\, \mathbb{E}_q[T(\theta)] = \lambda_{\text{lik}}^\top \mu, $$
so the gradient with respect to μ is simply the natural parameter:
$$ \nabla_\mu\, \mathbb{E}_q[\text{log-lik}] = \lambda_{\text{lik}}. $$
So Bayes' rule $\lambda_{\text{post}} \leftarrow \lambda_{\text{lik}} + \lambda_{\text{prior}}$ can be written as (for an arbitrary q)
$$ \lambda_{\text{post}} \leftarrow \nabla_\mu\, \mathbb{E}_q[\text{log-lik} + \text{log-prior}]. $$
As an analogy, think of least-squares as one step of Newton's method.

1. Khan, Variational-Bayes Made Easy, AABI 2023.
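To make "multiplication of distributions = addition of natural parameters" concrete, here is a minimal numpy sketch for 1-D Gaussians (the specific prior and likelihood values are illustrative assumptions, not from the slides); adding the natural parameters of likelihood and prior reproduces the usual closed-form Gaussian posterior.

```python
import numpy as np

def to_natural(m, v):
    # Natural parameters of a 1-D Gaussian N(m, v): (m/v, -1/(2v)).
    return np.array([m / v, -0.5 / v])

def from_natural(lam):
    # Invert the mapping: v = -1/(2*lam[1]), m = lam[0] * v.
    v = -0.5 / lam[1]
    return lam[0] * v, v

# Prior N(0, 10) and a Gaussian likelihood term N(2, 1) in theta (toy values).
lam_prior = to_natural(0.0, 10.0)
lam_lik = to_natural(2.0, 1.0)

# Bayes' rule: multiplying densities = adding natural parameters.
m_post, v_post = from_natural(lam_lik + lam_prior)

# Closed-form check: precisions add, means are precision-weighted.
prec = 1 / 10.0 + 1 / 1.0
assert np.isclose(v_post, 1 / prec)
assert np.isclose(m_post, (0.0 / 10.0 + 2.0 / 1.0) / prec)
print(m_post, v_post)
```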
Bayesian Learning Rule [1, 2] (natural-gradient descent). Instead of $\min_\theta \ell(\theta)$, solve
$$ \min_{q\in\mathcal Q}\; \mathbb{E}_{q(\theta)}[\ell(\theta)] - \mathcal H(q) $$
by natural-gradient descent on the natural parameter λ of q, with expectation parameter μ:
$$ \lambda \leftarrow \lambda - \rho\, \nabla_\mu \big\{ \mathbb{E}_q[\ell(\theta)] - \mathcal H(q) \big\} $$
(the old belief is updated with new information in the form of natural gradients). The BLR exploits the posterior's information geometry to derive existing algorithms as special instances by approximating q and the natural gradients.

1. Khan and Rue, The Bayesian Learning Rule, JMLR, 2023.
2. Khan and Lin, Conjugate-computation variational inference, AISTATS, 2017.
This is different from the natural-gradient descent we (often) encounter in machine learning for maximum likelihood:
– In MLE, the loss is the negative log probability of the model distribution.
– Here, loss and distribution are two different entities, possibly even unrelated.
$$ \min_\theta\, -\log q(\theta) \;\Rightarrow\; \text{update direction } F(\theta)^{-1}\nabla_\theta \log q(\theta) $$
$$ \min_q\, \mathbb{E}_q[\ell(\theta)] - \mathcal H(q) \;\Rightarrow\; \text{update direction } F(\lambda)^{-1}\nabla_\lambda\, \mathbb{E}_q[\ell(\theta)] $$
Gradient descent from the BLR: use a Gaussian with fixed covariance, $q(\theta) := \mathcal N(m, 1)$. Then the natural parameter is $\lambda := m$, the expectation parameter is $\mu := \mathbb{E}_q[\theta] = m$, and the entropy $\mathcal H(q) := \log(2\pi)/2$ is a constant. The BLR
$$ \lambda \leftarrow \lambda - \rho\,\nabla_\mu\big(\mathbb{E}_q[\ell(\theta)] - \mathcal H(q)\big) $$
becomes $m \leftarrow m - \rho\,\nabla_m\,\mathbb{E}_q[\ell(\theta)]$. Going from "global" to "local" with the delta method, $\mathbb{E}_q[\ell(\theta)] \approx \ell(m)$, gives gradient descent:
$$ m \leftarrow m - \rho\,\nabla_m \ell(m), \quad\text{i.e.,}\quad \theta \leftarrow \theta - \rho\,\nabla_\theta \ell(\theta). $$
See Section 1.3.1 in Khan and Rue, 2021.
Newton's method from the BLR: use a full Gaussian, $q(\theta) = \mathcal N(m, S^{-1})$, and express the natural gradients in terms of the gradient and Hessian of the loss,
$$ \nabla_{\mathbb{E}_q(\theta)}\,\mathbb{E}_q[\ell(\theta)] = \mathbb{E}_q[\nabla_\theta \ell(\theta)] - \mathbb{E}_q[H_\theta]\,m, \qquad \nabla_{\mathbb{E}_q(\theta\theta^\top)}\,\mathbb{E}_q[\ell(\theta)] = \tfrac{1}{2}\,\mathbb{E}_q[H_\theta]. $$
The BLR updates then become
$$ S \leftarrow (1-\rho)S + \rho\,\mathbb{E}_q[H_\theta], \qquad m \leftarrow m - \rho\, S^{-1}\,\mathbb{E}_q[\nabla_\theta\ell(\theta)]. $$
Using the delta method $\mathbb{E}_q[\ell(\theta)] \approx \ell(m)$ gives
$$ S \leftarrow (1-\rho)S + \rho\,H_m, \qquad m \leftarrow m - \rho\, S^{-1}\,\nabla_m\ell(m), $$
and setting $\rho = 1$ yields Newton's method:
$$ m \leftarrow m - H_m^{-1}\,\nabla_m \ell(m), \quad\text{i.e.,}\quad \theta \leftarrow \theta - H_\theta^{-1}\,\nabla_\theta\ell(\theta). $$
See Section 1.3.2 in Khan and Rue, 2021.

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
RMSprop from the BLR with a Gaussian approximation. Starting from the BLR for a Gaussian, $S \leftarrow (1-\rho)S + \rho\,H_\theta$ and $m \leftarrow m - \alpha\,S^{-1}\nabla_\theta\ell(\theta)$, make the following choices:
• Restrict the covariance to be diagonal
• Replace the Hessian by the square of the gradients
• Add a square root to the scaling vector
This gives RMSprop:
$$ s \leftarrow (1-\rho)s + \rho\,[\hat\nabla\ell(\theta)]^2, \qquad \theta \leftarrow \theta - \alpha\,(\sqrt{s} + \delta)^{-1}\,\hat\nabla\ell(\theta). $$
For Adam, use a heavy-ball momentum term with a KL divergence (Appendix E in [1]). See Section 4.2 in Khan and Rue, 2021.

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
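A minimal numpy sketch of the resulting RMSprop update, assuming a toy quadratic loss and illustrative hyperparameter values:

```python
import numpy as np

grad = lambda theta: 2.0 * (theta - 1.0)   # gradient of a toy quadratic loss

theta = np.zeros(3)
s = np.zeros(3)                             # diagonal scaling vector
alpha, rho, delta = 0.1, 0.1, 1e-8
for _ in range(200):
    g = grad(theta)
    s = (1 - rho) * s + rho * g**2          # s <- (1 - rho) s + rho g^2
    theta = theta - alpha * g / (np.sqrt(s) + delta)
print(theta)  # approaches the minimizer at 1, up to the usual RMSprop oscillation
```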
Improved Variational Online Newton (IVON)

Adam-style update (for comparison):
$$ \hat g \leftarrow \hat\nabla\ell(\theta), \quad \hat h \leftarrow \hat g^2, \quad h \leftarrow (1-\rho)h + \rho\hat h, \quad \theta \leftarrow \theta - \alpha\,(\hat g + \delta\theta)/(\sqrt{h} + \delta) $$

IVON update:
$$ \begin{aligned} \hat g &\leftarrow \hat\nabla\ell(\theta), \quad \text{where } \theta \sim \mathcal N(m, \sigma^2) \\ \hat h &\leftarrow \hat g \cdot (\theta - m)/\sigma^2 \\ h &\leftarrow (1-\rho)h + \rho\hat h + \rho^2(h - \hat h)^2/(2(h + \delta)) \\ m &\leftarrow m - \alpha\,(\hat g + \delta m)/(h + \delta) \\ \sigma^2 &\leftarrow 1/(N(h + \delta)) \end{aligned} $$

Only tune the initial value of h (a scalar).

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
2. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
3. Lin et al. Handling the positive-definite constraints in the BLR, ICML (2020).
4. Shen et al. Variational Learning is Effective for Large Deep Networks, under review (2024).
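A minimal numpy sketch of the IVON update above, assuming a toy quadratic loss, a single weight sample per step, and illustrative hyperparameters; see Shen et al. [4] for the full algorithm and its PyTorch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # number of training examples
grad = lambda theta: 2.0 * (theta - 1.0)  # gradient of a toy quadratic loss

d = 3
m = np.zeros(d)                 # mean of q
h = np.full(d, 0.1)             # Hessian estimate; only its initial value is tuned
alpha, rho, delta = 0.05, 0.2, 1e-3
for _ in range(500):
    sigma2 = 1.0 / (N * (h + delta))
    theta = m + np.sqrt(sigma2) * rng.standard_normal(d)  # theta ~ N(m, sigma^2)
    g_hat = grad(theta)
    h_hat = g_hat * (theta - m) / sigma2                   # reparameterization-based Hessian estimate
    h = (1 - rho) * h + rho * h_hat + rho**2 * (h - h_hat)**2 / (2 * (h + delta))
    m = m - alpha * (g_hat + delta * m) / (h + delta)
print(m)  # approaches the minimizer near 1
```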
Challenge: watch Thomas Moellenhoff's talk at https://www.youtube.com/watch?v=LQInlN5EU7E.

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
2. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
3. Lin et al. Handling the positive-definite constraints in the BLR, ICML (2020).
Better performance at the same cost with BLR (IVON) [3]. Trained on OpenWebText data (49.2B tokens): on the 773M model we get a gain of 0.5 in perplexity, and on the 355M model a gain of 0.4.

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
2. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
3. Shen et al. Variational Learning is Effective for Large Deep Networks, under review (2024).
IVON can also train in low precision (it is a stable optimizer).
[Figure 12 from Shen et al.: IVON results trained on multi-GPU setups with a different random seed on each machine (left), and validation perplexity for GPT-2 trained with IVON, evaluated at the mean and with the posterior predictive on OpenWebText (right).]

1. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
2. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
3. Shen et al. Variational Learning is Effective for Large Deep Networks, under review (2024).
On ImageNet, accuracy improves by 1% over SGD, with better calibration (ECE of 0.022 vs 0.066).
[Figure from Shen et al.: (b) ResNet-50 on ImageNet; (c) calibration on ImageNet.]
without the i’th example Truth Estimated Start Iterations Current Past information with most influence on the present Estimating it without retraining: Using the BLR, we can recover all sorts of influence criteria used in literature.
How sensitive is a model to its training data? The memory-perturbation equation [2] estimates the deviation between the old model and the new model when the data is perturbed (e.g., new data is added or an example is removed):
$$ \text{Deviation } (\Delta) \approx \text{prediction error} \times \text{prediction variance}. $$

1. Cook, Detection of Influential Observation in Linear Regression, Technometrics, ASA, 1977.
2. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS, 2023.
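A minimal sketch of the "error times variance" sensitivity score, assuming a toy logistic-regression problem and a stand-in diagonal Gaussian posterior over the weights (all names and values here are illustrative; see Nickl et al. [2] for the actual memory-perturbation equation):

```python
import numpy as np

# Toy logistic regression; pretend the posterior over weights came from a
# variational or Laplace fit (e.g., IVON's mean and diagonal variance).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = (X @ w_true + 0.5 * rng.standard_normal(200) > 0).astype(float)

w_mean = w_true + 0.1 * rng.standard_normal(5)   # stand-in for a trained posterior mean
w_var = np.full(5, 0.05)                         # stand-in for diagonal posterior variances

logits = X @ w_mean
p = 1.0 / (1.0 + np.exp(-logits))                # predicted probabilities
prediction_error = np.abs(p - y)                 # per-example prediction error
logit_var = (X ** 2) @ w_var                     # variance of the logit under the posterior
prediction_variance = (p * (1 - p)) ** 2 * logit_var  # delta-method variance of p

# Memory-perturbation-style sensitivity score: error times variance.
sensitivity = prediction_error * prediction_variance
print(np.argsort(-sensitivity)[:10])             # most influential training examples
```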
[Figure (a): effect of removing an example, CIFAR-10 on ResNet-20 using IVON; SGD and Adam do not work as well. Leave-one-out estimates on training data, computed during training, track generalization on test data (NLL).]
What if we merge fine-tuned large language models? Model merging by uncertainty-based gradient matching [1]: the test-error increase of a merged model is tied to the gradient mismatch between the merged models.
[Plot: gradient mismatch vs difference in test error, Task Arithmetic vs Ours, RoBERTa on IMDB.]

1. Daheim et al. Model merging by uncertainty-based gradient matching, ICLR (2024).
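As a toy illustration of using per-parameter uncertainty when merging, here is a sketch of precision-weighted parameter averaging (the vectors and precisions are made up, and this simple averaging is only a related heuristic, not the gradient-matching method of Daheim et al.):

```python
import numpy as np

def precision_weighted_merge(thetas, hs):
    """Average parameter vectors, weighting each by a diagonal precision estimate."""
    thetas = [np.asarray(t) for t in thetas]
    hs = [np.asarray(h) for h in hs]
    total_h = sum(hs)
    return sum(h * t for h, t in zip(hs, thetas)) / total_h

# Two "fine-tuned models" (toy vectors) with per-parameter precision estimates,
# e.g. the h maintained by IVON for each model.
theta_a, h_a = np.array([1.0, 2.0, 0.0]), np.array([10.0, 1.0, 5.0])
theta_b, h_b = np.array([0.0, 2.5, 4.0]), np.array([1.0, 1.0, 5.0])
merged = precision_weighted_merge([theta_a, theta_b], [h_a, h_b])
print(merged)  # parameters with higher precision dominate the merge
```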
Bayes averages the loss over Gaussian perturbations, while SAM [1] takes the worst-case perturbation:
$$ \text{Bayes: } \mathbb{E}_{\epsilon\sim\mathcal N(0,\sigma^2)}[\ell(\theta+\epsilon)], \qquad \text{SAM: } \sup_{|\epsilon|<\rho} \ell(\theta+\epsilon). $$
Our work connects the two through the Fenchel biconjugate [2].

1. Foret et al. Sharpness-Aware Minimization for Efficiently Improving Generalization, ICLR, 2021.
2. Moellenhoff and Khan, SAM as an Optimal Relaxation of Bayes, under review, 2022.
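A minimal sketch contrasting the two objectives on a toy 1-D loss, assuming illustrative values of σ and ρ and approximating SAM's sup by a grid search rather than its usual gradient-ascent step:

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda t: np.sin(3 * t) + 0.5 * t**2   # a toy non-convex 1-D loss

theta, sigma, rho = 0.8, 0.3, 0.3

# Bayes-style objective: average loss under Gaussian perturbations of theta.
eps = sigma * rng.standard_normal(10_000)
bayes_obj = loss(theta + eps).mean()

# SAM-style objective: worst-case loss in a ball of radius rho (grid approximation).
grid = np.linspace(-rho, rho, 1001)
sam_obj = loss(theta + grid).max()

print(loss(theta), bayes_obj, sam_obj)  # compare plain, averaged, and worst-case losses
```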
• Bayesian/variational learning now works at scale [2-5]
  – SOTA on GPT-2 and ImageNet [5]
• Improve DL [5-7]
  – Calibration, uncertainty, memory, etc.
  – Understand and fix model behavior
• Towards human-like quick adaptation

1. Khan and Rue, The Bayesian Learning Rule, JMLR (2023).
2. Khan et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
3. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
4. Lin et al. Handling the positive-definite constraints in the BLR, ICML (2020).
5. Shen et al. Variational Learning is Effective for Large Deep Networks, under review.
6. Daheim et al. Model merging by uncertainty-based gradient matching, ICLR (2024).
7. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS (2023).