Slide 1

Slide 1 text

The Bayesian Learning Rule. Mohammad Emtiyaz Khan, RIKEN Center for AI Project, Tokyo. http://emtiyaz.github.io. Presentation at the Optimal Transport Workshop (Berlin), Mar 14, 2024. Summary of recent research at https://emtiyaz.github.io/papers/symposium_2023.pdf. Slides available at https://emtiyaz.github.io/

Slide 2

Slide 2 text

Human learning at the age of 6 months.

Slide 3

Slide 3 text

Converged at the age of 12 months.

Slide 4

Slide 4 text

Transfer skills at the age of 14 months.

Slide 5

Slide 5 text

Fail because they are either too slow or too quick to adapt. The video (from 2017) is at https://www.youtube.com/watch?v=TxobtWAFh8o

Slide 6

Slide 6 text

Adaptation in Machine Learning
• Even a small change may need retraining
• A huge amount of resources is required, which only a few can afford (costly & unsustainable) [1, 2, 3]
• Difficult to apply in “dynamic” settings (robotics, medicine, epidemiology, climate science, etc.)
• Our goal is to solve such challenges
  – Help in building safe and trustworthy AI
  – Reduce the “magic” in deep learning (DL)
1. Diethe et al. Continual learning in practice, arXiv, 2019.
2. Paleyes et al. Challenges in deploying machine learning: a survey of case studies, arXiv, 2021.
3. https://www.youtube.com/watch?v=hx7BXih7zx8&t=897s

Slide 7

Slide 7 text

Bayesian Learning Rule [1]
• Bridge DL & Bayesian learning [2-5]
  – SOTA on GPT-2 and ImageNet [5]
• Improve other aspects of DL [5-7]
  – Calibration, uncertainty, memory, etc.
  – Understand and fix model behavior
• Towards human-like quick adaptation
1. Khan and Rue, The Bayesian Learning Rule, JMLR (2023).
2. Khan et al., Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
3. Osawa et al., Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
4. Lin et al., Handling the positive-definite constraints in the BLR, ICML (2020).
5. Shen et al., Variational Learning is Effective for Large Deep Networks, under review.
6. Daheim et al., Model merging by uncertainty-based gradient matching, ICLR (2024).
7. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS (2023).

Slide 8

Slide 8 text

GPT-2 with the Bayesian Learning Rule: better performance & uncertainty at the same cost, using the BLR variant IVON [3]. Trained on OpenWebText data (49.2B tokens). On the 773M model we get a gain of 0.5 in perplexity; on the 355M model, a gain of 0.4.
1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
2. Osawa et al., "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
3. Shen et al., "Variational Learning is Effective for Large Deep Networks." Under review (2024).

Slide 9

Slide 9 text

BLR for large deep networks

RMSprop/Adam:
  ĝ ← ∇ℓ(θ)  (a stochastic gradient estimate)
  ĥ ← ĝ²
  h ← (1−ρ)h + ρĥ
  θ ← θ − α(ĝ + δθ)/(√h + δ)

BLR variant, Improved Variational Online Newton (IVON):
  ĝ ← ∇ℓ(θ), where θ ∼ N(m, σ²)
  ĥ ← ĝ·(θ − m)/σ²
  h ← (1−ρ)h + ρĥ + ρ²(h − ĥ)²/(2(h + δ))
  m ← m − α(ĝ + δm)/(h + δ)
  σ² ← 1/(N(h + δ))

Only the initial value of h (a scalar) needs to be tuned.
1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
2. Osawa et al., "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
3. Lin et al., "Handling the positive-definite constraints in the BLR." ICML (2020).
4. Shen et al., "Variational Learning is Effective for Large Deep Networks." Under review (2024).

Slide 10

Slide 10 text

Drop-in replacement of Adam: https://github.com/team-approx-bayes/ivon
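To make "drop-in replacement" concrete, here is a hedged sketch of a PyTorch training loop where Adam is swapped for the IVON optimizer from the repository above. The package name, constructor arguments, and the sampling context manager follow my reading of the repository's README and should be treated as assumptions; check the repository for the exact interface.

```python
import torch
import ivon  # assumed package name from the repository above (e.g. `pip install ivon-opt`)

# Hedged sketch: a tiny classifier trained with IVON instead of torch.optim.Adam.
# The IVON constructor arguments and the sampled_params() context are assumptions
# based on the repository's README; only the overall "drop-in" pattern is the point.
model = torch.nn.Linear(10, 2)
optimizer = ivon.IVON(model.parameters(), lr=0.1, ess=1000)  # instead of Adam(model.parameters(), ...)

data = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]  # toy data
train_samples = 1  # number of posterior samples per step

for x, y in data:
    for _ in range(train_samples):
        with optimizer.sampled_params(train=True):   # draw weights from q = N(m, sigma^2)
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
    optimizer.step()
```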

Slide 11

Slide 11 text

Exponential Family

q(θ) ∝ exp[λᵀ T(θ)], with natural parameters λ, sufficient statistics T(θ), and expectation parameters μ := E_q[T(θ)].

Gaussian example:
  q(θ) := N(θ | m, S⁻¹) ∝ exp[−½(θ − m)ᵀ S (θ − m)] ∝ exp[(Sm)ᵀθ − Tr(½S θθᵀ)]
  natural parameters: λ := {Sm, −S/2}
  expectation parameters: μ := {E_q[θ], E_q[θθᵀ]}

1. Wainwright and Jordan, Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends in Machine Learning, 2008.
2. Malago et al., Towards the Geometry of Estimation of Distribution Algorithms based on the Exponential Family, FOGA, 2011.
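As a quick check of the notation, the sketch below computes the natural and expectation parameters of a Gaussian N(m, S⁻¹) numerically; the 2-D numbers are an assumed example, not from the slides.

```python
import numpy as np

# Gaussian in exponential-family form: natural parameters (S m, -S/2) and
# expectation parameters (E[theta], E[theta theta^T]). The numbers are an
# arbitrary 2-D example chosen only to illustrate the two parametrizations.
m = np.array([1.0, -2.0])
S = np.array([[2.0, 0.3],
              [0.3, 1.0]])                    # precision matrix (inverse covariance)

lam = (S @ m, -0.5 * S)                       # natural parameters
mu = (m, np.linalg.inv(S) + np.outer(m, m))   # expectation parameters E[T(theta)]

print(lam[0])   # S m
print(mu[1])    # E[theta theta^T] = S^{-1} + m m^T
```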

Slide 12

Slide 12 text

Bayes and Conjugate Computations [1]

Bayes rule: posterior ∝ lik × prior.

For exponential families, multiplication of distributions = addition of natural parameters:
  exp[λ_postᵀ T(θ)] ∝ exp[λ_likᵀ T(θ)] × exp[λ_priorᵀ T(θ)]  ⟹  λ_post = λ_lik + λ_prior
(equivalently, log-posterior = log-lik + log-prior).

This idea can be generalized through natural gradients:
  λ_post = ∇_μ E_q[log-lik + log-prior],
a natural gradient with respect to the expectation parameters μ of the posterior "approximation" q.

1. Khan and Lin, Conjugate computation variational inference, AISTATS, 2017.
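A minimal numeric sketch of "multiplication of distributions = addition of natural parameters", assuming the textbook conjugate case of estimating a Gaussian mean with known noise variance; the data and prior below are made up for illustration.

```python
import numpy as np

# Conjugate Gaussian example: estimating a mean theta with known noise variance.
# Natural parameters of N(m, v), viewed as a distribution over theta: (m/v, -1/(2v)).
def to_natural(m, v):
    return np.array([m / v, -0.5 / v])

def from_natural(eta):
    v = -0.5 / eta[1]
    return eta[0] * v, v                    # (mean, variance)

rng = np.random.default_rng(0)
true_theta, noise_var = 2.0, 0.5
x = rng.normal(true_theta, np.sqrt(noise_var), size=20)

eta_prior = to_natural(0.0, 10.0)                                      # broad Gaussian prior
eta_lik = np.array([x.sum() / noise_var, -len(x) / (2 * noise_var)])   # likelihood term in theta
eta_post = eta_prior + eta_lik                                         # multiplication = addition

print(from_natural(eta_post))   # matches the textbook Gaussian posterior mean and variance
```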

Slide 13

Slide 13 text

Bayes Rule as (Natural) Gradient Descent

Expected log-lik and log-prior are linear in μ [1]:
  E_q[log-lik] = λ_likᵀ E_q[T(θ)] = λ_likᵀ μ,  so  ∇_μ E_q[log-lik] = λ_lik.

So Bayes' rule can be written as (for an arbitrary q):
  λ_post ← λ_lik + λ_prior   is the same as   λ_post ← ∇_μ E_q[log-lik + log-prior].

As an analogy, think of least-squares as one step of Newton's method.

1. Khan, Variational-Bayes Made Easy, AABI 2023.

Slide 14

Slide 14 text

Approximate Bayes

Bayes rule (posterior ∝ lik × prior) can be written as an optimization [1], a.k.a. variational inference:
  min_q  E_q[−log-lik] + KL(q ∥ prior).

Generalized approximate Bayesian learning replaces the negative log-lik and log-prior by a generic loss ℓ(θ) and restricts q to a family Q (here, an exponential family):
  min_{q∈Q}  E_q[ℓ(θ)] − H(q),
where q is the posterior approximation and H(q) is its entropy.

1. Zellner, Optimal information processing and Bayes's theorem, The American Statistician, 1988.

Slide 15

Slide 15 text

The Bayesian Learning Rule

  min_θ ℓ(θ)   vs   min_{q∈Q} E_q[ℓ(θ)] − H(q)

Bayesian learning rule [1, 2] (natural-gradient descent):
  λ ← λ − ρ ∇_μ { E_q[ℓ(θ)] − H(q) },
where λ and μ are the natural and expectation parameters of q. The old belief λ is updated with new information given by the natural gradients.

Existing algorithms are derived as special instances by exploiting the posterior's information geometry, i.e., by choosing the form of q and approximating the natural gradients.

1. Khan and Rue, The Bayesian Learning Rule, JMLR, 2023.
2. Khan and Lin, Conjugate computation variational inference, AISTATS, 2017.

Slide 16

Slide 16 text

Warning!
• This natural gradient is different from the one we (often) encounter in machine learning for maximum likelihood:
  – In MLE, the loss is the negative log of the probability distribution itself:
      min_θ −log q(θ)  ⇒  F(θ)⁻¹ ∇_θ log q(θ)
  – Here, the loss and the distribution are two different entities, possibly even unrelated:
      min_q E_q[ℓ(θ)] − H(q)  ⇒  F(λ)⁻¹ ∇_λ E_q[ℓ(θ)]

Slide 17

Slide 17 text

Special cases of the Bayesian learning rule (abbreviations: cov. = covariance, STE = Straight-Through Estimator, VI = Variational Inference, VMP = Variational Message Passing). Each entry lists the algorithm, the posterior approximation, the natural-gradient approximation, and the section of Khan and Rue (2023).

Optimization algorithms:
- Gradient Descent | Gaussian (fixed cov.) | Delta method | Sec. 1.3
- Newton's method | Gaussian | Delta method | Sec. 1.3
- Multimodal optimization (New) | Mixture of Gaussians | Delta method | Sec. 3.2

Deep-learning algorithms:
- Stochastic Gradient Descent | Gaussian (fixed cov.) | Delta method, stochastic approx. | Sec. 4.1
- RMSprop/Adam | Gaussian (diagonal cov.) | Delta method, stochastic approx., Hessian approx., square-root scaling, slow-moving scale vectors | Sec. 4.2
- Dropout | Mixture of Gaussians | Delta method, stochastic approx., responsibility approx. | Sec. 4.3
- STE | Bernoulli | Delta method, stochastic approx. | Sec. 4.5
- Online Gauss-Newton (OGN) (New) | Gaussian (diagonal cov.) | Gauss-Newton Hessian approx. in Adam & no square-root scaling | Sec. 4.4
- Variational OGN (New) | Gaussian (diagonal cov.) | Remove delta method from OGN | Sec. 4.4
- BayesBiNN (New) | Bernoulli | Remove delta method from STE | Sec. 4.5

Approximate Bayesian inference algorithms:
- Conjugate Bayes | Exp-family | Set learning rate ρ_t = 1 | Sec. 5.1
- Laplace's method | Gaussian | Delta method | Sec. 4.4
- Expectation-Maximization | Exp-family + Gaussian | Delta method for the parameters | Sec. 5.2
- Stochastic VI (SVI) | Exp-family (mean-field) | Stochastic approx., local ρ_t = 1 | Sec. 5.3
- VMP | Exp-family (mean-field) | ρ_t = 1 for all nodes | Sec. 5.3
- Non-Conjugate VMP | Exp-family (mean-field) | ρ_t = 1 for all nodes | Sec. 5.3
- Non-Conjugate VI (New) | Mixture of Exp-family | None | Sec. 5.4

(See Section 2.1, "Bayesian learning rule as natural-gradient descent", in Khan and Rue, 2023.)

Slide 18

Slide 18 text

Gradient Descent from BLR

Derived by choosing a Gaussian with fixed covariance: q(θ) := N(θ | m, 1), so
  natural parameter: λ := m
  expectation parameter: μ := E_q[θ] = m
  entropy: H(q) = ½ log(2πe), a constant (so it drops out of the update).

BLR:
  m ← m − ρ ∇_μ ( E_q[ℓ(θ)] − H(q) )   reduces to   m ← m − ρ ∇_m E_q[ℓ(θ)].

From "global" to "local" (the delta method): E_q[ℓ(θ)] ≈ ℓ(m), giving
  m ← m − ρ ∇_m ℓ(m),
which is GD: θ ← θ − ρ ∇_θ ℓ(θ).

See Section 1.3.1 in Khan and Rue, 2021.
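A toy sketch of this reduction, assuming a 1-D quadratic loss: with q = N(m, 1) the natural parameter is m, the entropy is constant, and the delta method turns the BLR step into plain gradient descent.

```python
# BLR with a fixed-covariance Gaussian + delta method == gradient descent.
# Assumed toy loss: loss(theta) = (theta - 3)^2, so grad(m) = 2 (m - 3).
grad = lambda m: 2.0 * (m - 3.0)

m, rho = 0.0, 0.1
for _ in range(100):
    m = m - rho * grad(m)   # the BLR step reduces to exactly this GD step
print(m)                    # approaches the minimizer 3.0
```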

Slide 19

Slide 19 text

Newton's Method from BLR

Derived by choosing a multivariate Gaussian: q(θ) := N(θ | m, S⁻¹), with
  natural parameters: λ := {Sm, −S/2}
  expectation parameters: μ := {E_q[θ], E_q[θθᵀ]}.

Since ∇_μ H(q) = −λ, the BLR λ ← λ − ρ ∇_μ ( E_q[ℓ(θ)] − H(q) ) becomes λ ← (1−ρ)λ − ρ ∇_μ E_q[ℓ(θ)], i.e.,
  Sm ← (1−ρ) Sm − ρ ∇_{E_q[θ]} E_q[ℓ(θ)]
  ½S ← (1−ρ) ½S + ρ ∇_{E_q[θθᵀ]} E_q[ℓ(θ)],
which will recover Newton's method, θ ← θ − H_θ⁻¹ ∇_θ ℓ(θ) (next slide).

1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
See Section 1.3.2 in Khan and Rue, 2021.

Slide 20

Slide 20 text

Newton's Method from BLR (continued)

Express the natural gradients in terms of the gradient and Hessian of the loss:
  ∇_{E_q[θ]} E_q[ℓ(θ)] = E_q[∇_θ ℓ(θ)] − 2 E_q[H_θ] m
  ∇_{E_q[θθᵀ]} E_q[ℓ(θ)] = E_q[H_θ],
so the BLR updates become
  S ← (1−ρ) S + ρ E_q[H_θ]
  m ← m − ρ S⁻¹ E_q[∇_θ ℓ(θ)].

Delta method: E_q[ℓ(θ)] ≈ ℓ(m), giving
  S ← (1−ρ) S + ρ H_m
  m ← m − ρ S⁻¹ ∇_m ℓ(m).

Set ρ = 1 to get Newton's method: m ← m − H_m⁻¹ ∇_m ℓ(m).

1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
See Section 1.3.2 in Khan and Rue, 2021.
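A toy check of the Newton reduction, assuming a 2-D quadratic loss: with a full-covariance Gaussian, the delta method, and ρ = 1, a single BLR step lands exactly on the minimizer.

```python
import numpy as np

# BLR with a full Gaussian + delta method + rho = 1 == Newton's method:
#   m <- m - H(m)^{-1} grad(m).
# Assumed toy loss: 0.5 * (theta - t)^T A (theta - t), with gradient A(theta - t)
# and constant Hessian A, so one Newton step reaches the minimizer t exactly.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
t = np.array([1.0, -2.0])
grad = lambda m: A @ (m - t)
hess = lambda m: A

m = np.zeros(2)
m = m - np.linalg.solve(hess(m), grad(m))   # one BLR/Newton step
print(m)                                    # equals t up to numerical precision
```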

Slide 21

Slide 21 text

RMSprop/Adam from BLR

BLR for a Gaussian approximation (previous slide):
  S ← (1−ρ) S + ρ H_θ
  m ← m − α S⁻¹ ∇_θ ℓ(θ)

To get RMSprop, make the following choices:
• Restrict the covariance to be diagonal
• Replace the Hessian by the square of the gradients
• Add a square root to the scaling vector

RMSprop (with stochastic gradients ĝ = ∇ℓ(θ)):
  s ← (1−ρ) s + ρ ĝ²
  θ ← θ − α (√s + δ)⁻¹ ĝ

For Adam, use a heavy-ball term with KL divergence as momentum (Appendix E in [1]).

1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
See Section 4.2 in Khan and Rue, 2021.
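The three choices above, applied literally, give the short sketch below of the RMSprop-style update; the toy loss and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

# RMSprop as derived from the BLR: a diagonal scale vector s built from squared
# gradients (in place of the Hessian), with a square root added for scaling.
def rmsprop_from_blr(theta, grad_fn, alpha=0.01, rho=0.1, delta=1e-8, steps=500):
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)                        # stochastic gradient g_hat
        s = (1 - rho) * s + rho * g**2            # scale-vector update
        theta = theta - alpha * g / (np.sqrt(s) + delta)
    return theta

# Assumed toy loss ||theta||^2 with gradient 2*theta; the iterate moves toward 0.
print(rmsprop_from_blr(np.array([5.0, -5.0]), lambda th: 2.0 * th))
```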

Slide 22

Slide 22 text

BLR for large deep networks

RMSprop/Adam:
  ĝ ← ∇ℓ(θ)  (a stochastic gradient estimate)
  ĥ ← ĝ²
  h ← (1−ρ)h + ρĥ
  θ ← θ − α(ĝ + δθ)/(√h + δ)

BLR variant, Improved Variational Online Newton (IVON):
  ĝ ← ∇ℓ(θ), where θ ∼ N(m, σ²)
  ĥ ← ĝ·(θ − m)/σ²
  h ← (1−ρ)h + ρĥ + ρ²(h − ĥ)²/(2(h + δ))
  m ← m − α(ĝ + δm)/(h + δ)
  σ² ← 1/(N(h + δ))

Only the initial value of h (a scalar) needs to be tuned.
1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
2. Osawa et al., "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
3. Lin et al., "Handling the positive-definite constraints in the BLR." ICML (2020).
4. Shen et al., "Variational Learning is Effective for Large Deep Networks." Under review (2024).
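A NumPy sketch of the IVON-style update written above (no momentum; gradients and Hessian estimates averaged over a few posterior samples). The toy loss, hyperparameter values, and sample count are assumptions chosen only to illustrate the update's structure, not a faithful reproduction of Shen et al.'s setup.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda th: 2.0 * th                      # assumed toy quadratic loss ||theta||^2

N, alpha, rho, delta = 100.0, 0.1, 0.02, 0.01   # assumed hyperparameters
m, h = np.array([1.0, -1.0]), np.full(2, 2.0)   # mean and Hessian estimate

for _ in range(200):
    sigma2 = 1.0 / (N * (h + delta))            # posterior variance
    g_bar, h_bar = 0.0, 0.0
    for _ in range(16):                         # a few Monte Carlo samples
        theta = m + np.sqrt(sigma2) * rng.standard_normal(m.shape)
        g = grad(theta)
        g_bar += g / 16
        h_bar += g * (theta - m) / sigma2 / 16  # reparametrization-trick Hessian estimate
    h = (1 - rho) * h + rho * h_bar + rho**2 * (h - h_bar)**2 / (2 * (h + delta))
    m = m - alpha * (g_bar + delta * m) / (h + delta)

print(m, h)  # m ends near 0; h stays positive and fluctuates roughly around the true Hessian (2)
```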

Slide 23

Slide 23 text

IVON [3] got the 1st prize in the NeurIPS 2021 Approximate Inference Challenge. Watch Thomas Moellenhoff's talk at https://www.youtube.com/watch?v=LQInlN5EU7E.
1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
2. Osawa et al., "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
3. Lin et al., "Handling the positive-definite constraints in the BLR." ICML (2020).

Slide 24

Slide 24 text

GPT-2 with Bayes: better performance and uncertainty at the same cost, using the BLR variant IVON [3]. Trained on OpenWebText data (49.2B tokens). On the 773M model we get a gain of 0.5 in perplexity; on the 355M model, a gain of 0.4.
1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
2. Osawa et al., "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
3. Shen et al., "Variational Learning is Effective for Large Deep Networks." Under review.

Slide 25

Slide 25 text

GPT-2 with Bayes: posterior averaging improves the results. Can also train in low precision (a stable optimizer).
[Figure from "Variational Learning is Effective for Large Deep Networks": IVON results trained on multi-GPU setups with a different random seed on each machine (left), and validation perplexity for GPT-2 trained with IVON, evaluated at the mean and with the posterior predictive, on OpenWebText (right).]
1. Khan et al., "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
2. Osawa et al., "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
3. Shen et al., "Variational Learning is Effective for Large Deep Networks." Under review.

Slide 26

Slide 26 text

ImageNet on ResNet-50 (25.6M): 2% better accuracy over AdamW and 1% over SGD. Better calibration (ECE of 0.022 vs 0.066).
[Figure from "Variational Learning is Effective for Large Deep Networks": (b) ResNet-50 on ImageNet, (c) calibration on ImageNet.]

Slide 27

Slide 27 text

ImageNet on ResNet-50 (25.6M): no severe overfitting as with AdamW, while consistently improving accuracy over SGD, and with better uncertainty.

Results from "Variational Learning is Effective for Large Deep Networks" (mean ± std; accuracy higher is better, NLL/ECE/Brier lower is better):

ImageNet-1k, ResNet-50 (25.6M params), 100 epochs:
  AdamW: Top-1 74.56±0.24, Top-5 92.05±0.17, NLL 1.018±0.012, ECE 0.043±0.001, Brier 0.352±0.003
  SGD: Top-1 76.18±0.09, Top-5 92.94±0.05, NLL 0.928±0.003, ECE 0.019±0.001, Brier 0.330±0.001
  IVON@mean: Top-1 76.14±0.11, Top-5 92.83±0.04, NLL 0.934±0.002, ECE 0.025±0.001, Brier 0.330±0.001
  IVON: Top-1 76.24±0.09, Top-5 92.90±0.04, NLL 0.925±0.002, ECE 0.015±0.001, Brier 0.330±0.001

ImageNet-1k, ResNet-50 (25.6M params), 200 epochs:
  AdamW: Top-1 75.16±0.14, Top-5 92.37±0.03, NLL 1.018±0.003, ECE 0.066±0.002, Brier 0.349±0.002
  SGD: Top-1 76.63±0.45, Top-5 93.21±0.25, NLL 0.917±0.026, ECE 0.038±0.009, Brier 0.326±0.006
  IVON@mean: Top-1 77.30±0.08, Top-5 93.58±0.05, NLL 0.884±0.002, ECE 0.035±0.002, Brier 0.316±0.001
  IVON: Top-1 77.46±0.07, Top-5 93.68±0.04, NLL 0.869±0.002, ECE 0.022±0.002, Brier 0.315±0.001

TinyImageNet, ResNet-18 (11M params, wide), 200 epochs:
  AdamW: Top-1 47.33±0.90, Top-5 71.54±0.95, NLL 6.823±0.235, ECE 0.421±0.008, Brier 0.913±0.018
  SGD: Top-1 61.39±0.18, Top-5 82.30±0.22, NLL 1.811±0.010, ECE 0.138±0.002, Brier 0.536±0.002
  IVON@mean: Top-1 62.41±0.15, Top-5 83.77±0.18, NLL 1.776±0.018, ECE 0.150±0.005, Brier 0.532±0.002
  IVON: Top-1 62.68±0.16, Top-5 84.12±0.24, NLL 1.528±0.010, ECE 0.019±0.004, Brier 0.491±0.001

TinyImageNet, PreResNet-110 (4M params, deep), 200 epochs:
  AdamW: Top-1 50.65±0.0*, Top-5 74.94±0.0*, NLL 4.487±0.0*, ECE 0.357±0.0*, Brier 0.812±0.0*
  AdaHessian: Top-1 55.03±0.53, Top-5 78.49±0.34, NLL 2.971±0.064, ECE 0.272±0.005, Brier 0.690±0.008
  SGD: Top-1 59.39±0.50, Top-5 81.34±0.30, NLL 2.040±0.040, ECE 0.176±0.006, Brier 0.577±0.007
  IVON@mean: Top-1 60.85±0.39, Top-5 83.89±0.14, NLL 1.584±0.009, ECE 0.053±0.002, Brier 0.514±0.003
  IVON: Top-1 61.25±0.48, Top-5 84.13±0.17, NLL 1.550±0.009, ECE 0.049±0.002, Brier 0.511±0.003

CIFAR-100, ResNet-18 (11M params, wide), 200 epochs:
  AdamW: Top-1 64.12±0.43, Top-5 86.85±0.51, NLL 3.357±0.071, ECE 0.278±0.005, Brier 0.615±0.008
  SGD: Top-1 74.46±0.17, Top-5 92.66±0.06, NLL 1.083±0.007, ECE 0.113±0.001, Brier 0.376±0.001
  IVON@mean: Top-1 74.51±0.24, Top-5 92.74±0.19, NLL 1.284±0.013, ECE 0.152±0.003, Brier 0.399±0.002
  IVON: Top-1 75.14±0.34, Top-5 93.30±0.19, NLL 0.912±0.009, ECE 0.021±0.003, Brier 0.344±0.003

Slide 28

Slide 28 text

Sensitivity to data is easy to compute "during" training (shown here for MNIST on an MLP). It also works at large scale (ImageNet).
1. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS, 2023.

Slide 29

Slide 29 text

Sensitivity to Training Data
[Figure: training on the full dataset vs retraining without the i'th example; the true deviation vs the estimate over iterations, from the start to the current model.]
The goal is to find the past information with the most influence on the present, and to estimate it without retraining. Using the BLR, we can recover all sorts of influence criteria used in the literature.

Slide 30

Slide 30 text

Memory Perturbation: how sensitive is a model to its training data?
  Deviation (Δ) = prediction error × prediction variance
[Figure: old model vs new model after adding new data.]
1. Cook, Detection of Influential Observations in Linear Regression, Technometrics, ASA, 1977.
2. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS, 2023.
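A minimal sketch of the deviation formula on this slide, assuming a binary classifier whose predicted probabilities are already available; the numbers are made up, and the point is only that the per-example score needs no retraining.

```python
import numpy as np

# Per-example sensitivity score: deviation ~= prediction error * prediction variance,
# computed from the model's predicted probabilities (no retraining needed).
# The probabilities and labels below are made-up inputs for illustration.
p = np.array([0.97, 0.60, 0.08, 0.51])   # predicted probability of class 1
y = np.array([1,    1,    1,    0   ])   # true labels

prediction_error = np.abs(p - y)          # how wrong the prediction is
prediction_variance = p * (1 - p)         # how uncertain the model is
sensitivity = prediction_error * prediction_variance

print(np.argsort(-sensitivity))  # indices of the examples the model is most sensitive to
```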

Slide 31

Slide 31 text

Memory Maps using the BLR
[Figure: a memory map with prediction error and prediction variance as axes; regular examples vs the most memorable ones, labelled "Unpredictable" and "Uncertain".]
The goal is to understand generic ML models and algorithms.
1. Tailor, Chang, Swaroop, Nalisnick, Solin, Khan, Memory maps to understand models (under review).

Slide 32

Slide 32 text

A Tool for Data Scientists: understand the memory of a model.

Slide 33

Slide 33 text

Predict Generalization during Training
[Figure (a): effect of removing an example over training iterations, for CIFAR-10 on ResNet-20 using IVON; generalization on test data (NLL) vs leave-one-out estimates computed on the training data during training.]
SGD or Adam do not work as well.

Slide 34

Slide 34 text

Answering “What-If” Questions: what if we removed a class from MNIST?
[Figure: test performance (NLL) obtained by brute-force retraining vs estimates computed on the training data (no retraining).]

Slide 35

Slide 35 text

Answering “What-If” Questions: what if we merge fine-tuned large language models?
[Figure: gradient mismatch vs difference in test error, comparing Task Arithmetic and our method, for RoBERTa on IMDB.]
1. Daheim et al., Model merging by uncertainty-based gradient matching, ICLR (2024).

Slide 36

Slide 36 text

SAM as an Optimal Relaxation of Bayes
  Bayes: E_{ε∼N(0,σ²)}[ℓ(θ + ε)]
  SAM: sup_{|ε|<ρ} ℓ(θ + ε)
Our work connects the two through the Fenchel biconjugate.
1. Foret et al., Sharpness-Aware Minimization for Efficiently Improving Generalization, ICLR, 2021.
2. Moellenhoff and Khan, SAM as an Optimal Relaxation of Bayes, Under review, 2022.
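A tiny 1-D illustration of the two relaxed losses on this slide, with an assumed toy loss and assumed values of σ and ρ: the Bayes objective averages the loss over Gaussian perturbations, while SAM takes the worst case over a ball.

```python
import numpy as np

# Compare the Gaussian-averaged (Bayes) loss and the worst-case-in-a-ball (SAM)
# loss at a few points. The toy loss and the sigma/rho values are assumptions.
loss = lambda th: np.sin(3 * th) + 0.1 * th**2

rng = np.random.default_rng(0)
theta = np.linspace(-3, 3, 7)
sigma, rho = 0.3, 0.3

eps_gauss = rng.normal(0, sigma, size=10000)
bayes = np.array([loss(t + eps_gauss).mean() for t in theta])   # E_eps[loss(theta + eps)]

eps_ball = np.linspace(-rho, rho, 1001)
sam = np.array([loss(t + eps_ball).max() for t in theta])       # sup over |eps| < rho

print(np.c_[theta, loss(theta), bayes, sam])  # SAM upper-bounds the raw loss; Bayes smooths it
```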

Slide 37

Slide 37 text

Bayesian Learning Rule [1]
• Bridge DL & Bayesian learning [2-5]
  – SOTA on GPT-2 and ImageNet [5]
• Improve DL [5-7]
  – Calibration, uncertainty, memory, etc.
  – Understand and fix model behavior
• Towards human-like quick adaptation
1. Khan and Rue, The Bayesian Learning Rule, JMLR (2023).
2. Khan et al., Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
3. Osawa et al., Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
4. Lin et al., Handling the positive-definite constraints in the BLR, ICML (2020).
5. Shen et al., Variational Learning is Effective for Large Deep Networks, under review.
6. Daheim et al., Model merging by uncertainty-based gradient matching, ICLR (2024).
7. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS (2023).

Slide 38

Slide 38 text

NeurIPS 2019 Tutorial

Slide 39

Slide 39 text

The webpage is available at https://bayesduality.github.io/, and the Twitter account is @BayesDuality. The project received total funding of around USD 3 million through JST's CREST-ANR (2021-2027) and Kakenhi Grants (2019-2021).

Slide 40

Slide 40 text

Team Approx-Bayes: https://team-approx-bayes.github.io/. Many thanks to our group members and collaborators (many not on this slide). We are always looking for new collaborations.