Slide 1

Recent Findings on Density-Ratio Approaches in Machine Learning
Workshop on FIMI, March 30th, 2022
Masahiro Kato
The University of Tokyo, Imaizumi Lab / CyberAgent, Inc., AILab

Slide 2

Density-Ratio Approaches in Machine Learning (ML)
β–  Consider two distributions P and Q with a common support.
β–  Let p* and q* be the density functions of P and Q, respectively.
β–  Define the density ratio (function) as r*(x) = p*(x)/q*(x).
β–  Approaches using density ratios are useful in many ML applications.
(Figure: the densities p*(x) and q*(x) and their density ratio r*(x) = p*(x)/q*(x).)
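As a concrete illustration (not from the slides), the density ratio of two univariate Gaussians can be evaluated in closed form; the means and standard deviations below are arbitrary choices for the example.

```python
import numpy as np

# Illustrative sketch: the density ratio r*(x) = p*(x) / q*(x) for two
# one-dimensional Gaussians sharing a common support (the whole real line).
# The means and standard deviations are arbitrary choices for this example.

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def density_ratio(x, mean_p=0.0, std_p=1.0, mean_q=1.0, std_q=1.0):
    """r*(x) = p*(x) / q*(x); well defined because both densities are positive."""
    return gaussian_pdf(x, mean_p, std_p) / gaussian_pdf(x, mean_q, std_q)

# r*(x) > 1 where P is more likely than Q; by symmetry here, r*(0.5) = 1.
print(density_ratio(np.array([-1.0, 0.5, 2.0])))
```

The ratio exceeds 1 in regions where P places more mass than Q, which is the property the later slides exploit for outlier detection and importance weighting.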

Slide 3

Density-Ratio Approaches in ML
β–  Many ML applications involve two or more distributions:
β€’ Classification.
β€’ Generative adversarial networks (GANs).
β€’ Divergences between probability measures.
β€’ Multi-armed bandit (MAB) problems (change of measures).
β†’ In these tasks, the density ratio appears as a key component.

Slide 4

Empirical Perspective
β–  An estimator of the density ratio r* provides a solution in many tasks:
β€’ Inlier-based outlier detection: finding outliers based on the density ratio (Hido et al. (2008)).
β€’ Causal inference: conditional moment restrictions can be approximated via the density ratio (Kato et al. (2022)).
β€’ GANs (Goodfellow et al. (2014), Uehara et al. (2016)).
β€’ Variational Bayesian (VB) methods (Tran et al. (2017)), etc.

Slide 5

Theoretical Viewpoint
β–  Using density ratios is also useful in theoretical analysis:
β€’ Likelihood ratios give tight lower bounds for decision-making problems.
  Ex. Lower bounds in MAB problems (Lai and Robbins (1985)).
β€’ Transforming an original problem via a change of measure can yield a tight theoretical result.
  Ex. Large deviation principles for martingales (Fan et al. (2013, 2014)).
(Figure: theoretical results are transported between P and Q through the ratio p*(x)/q*(x).)

Slide 6

Presentation Outline: Recent (Our) Findings on Density Ratios
1. Density-Ratio Estimation and its Applications
   Kato and Teshima (ICML2021), "Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation"
2. Causal Inference and Density Ratios
   Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022), "Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting"
3. Density Ratios and Divergences between Probability Measures
   Kato, Imaizumi, and Minami (2022), "Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation"
4. Change-of-Measure Arguments in the Best Arm Identification Problem
   Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022), "Best Arm Identification with a Fixed Budget under a Small Gap"

Slide 7

Density-Ratio Estimation and its Applications

Slide 8

Density-Ratio Estimation (DRE)
β–  Consider estimating the density ratio from observations.
β–  Two sets of observations: {X_i}_{i=1}^n ∼ p* and {Z_j}_{j=1}^m ∼ q*.
β–  Two-step method: estimate p*(x) and q*(x) separately; then construct an estimator of r*(x) from their ratio.
  Γ— Poor empirical performance. Γ— Weak theoretical guarantees.
β†’ Consider direct estimation of r*(x): LSIF, KLIEP, and PU learning.

Slide 9

Least-Squares Importance Fitting (LSIF)
β–  Let r be a model of the density ratio r*.
β–  The squared-error risk is R(r) = E_Q[(r*(X) - r(X))Β²].
β–  The minimizer r̂ of the empirical risk is an estimator of r*.
β–  Instead of R(r), which depends on the unknown r*, we minimize an empirical version of the equivalent risk
  RΜƒ(r) = -2 E_P[r(X)] + E_Q[rΒ²(X)].
β–  This method is called LSIF (Kanamori et al. (2009)).
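A minimal LSIF-style sketch (not the authors' code) with a linear-in-parameters Gaussian-kernel model: for such a model the empirical objective -2 mean_P[r(X)] + mean_Q[rΒ²(Z)] plus a small ridge term has a closed-form minimizer. All distributions, kernel widths, and regularization constants are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=2000)   # samples from p* = N(0, 1)
Z = rng.normal(0.0, 2.0, size=2000)   # samples from q* = N(0, 4)

centers = X[:100]                      # kernel centers taken from the P-sample
sigma = 1.0

def phi(x):
    """Gaussian kernel features; r(x) = theta @ phi(x) is the ratio model."""
    x = np.atleast_1d(x)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)

# Empirical LSIF objective: -2 E_P[r(X)] + E_Q[r^2(Z)] (+ ridge).
# For a linear-in-parameters model the minimizer is available in closed form:
#   H = E_Q[phi phi^T], h = E_P[phi], theta = (H + lam I)^{-1} h.
H = phi(Z).T @ phi(Z) / len(Z)
h = phi(X).mean(axis=0)
lam = 1e-3
theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)

def r_hat(x):
    return phi(x) @ theta

# The estimated ratio should be large near 0 (where p* dominates q*)
# and small far in the tails (where q* dominates p*).
print(r_hat(np.array([0.0, 4.0])))
```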

Slide 10

LSIF
β–  Derivation:
  r* = argmin_r E_Q[(r*(X) - r(X))Β²]
     = argmin_r E_Q[r*(X)Β² - 2 r*(X) r(X) + rΒ²(X)]
     = argmin_r E_Q[-2 r*(X) r(X) + rΒ²(X)]
     = argmin_r { -2 E_P[r(X)] + E_Q[rΒ²(X)] }.
β€’ Here, we used E_Q[r*(X) r(X)] = ∫ r*(x) r(x) q*(x) dx = ∫ r(x) p*(x) dx = E_P[r(X)].
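The key identity in the last bullet can be checked numerically. The sketch below uses p* = N(0, 1) and q* = N(1, 1), for which r*(x) = exp(0.5 - x) in closed form, and an arbitrary bounded test function r; all of these are assumptions made for the example only.

```python
import numpy as np

# Numerical check (illustrative) of the identity used in the LSIF derivation:
#   E_Q[r*(Z) r(Z)] = ∫ r*(x) r(x) q*(x) dx = ∫ r(x) p*(x) dx = E_P[r(X)].
rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(0.0, 1.0, size=n)  # ~ p* = N(0, 1)
Z = rng.normal(1.0, 1.0, size=n)  # ~ q* = N(1, 1)

r_star = lambda x: np.exp(0.5 - x)      # closed-form ratio for these Gaussians
r_test = lambda x: 1.0 / (1.0 + x**2)   # arbitrary bounded model r

lhs = np.mean(r_star(Z) * r_test(Z))  # Monte Carlo E_Q[r*(Z) r(Z)]
rhs = np.mean(r_test(X))              # Monte Carlo E_P[r(X)]
print(lhs, rhs)
```

The two Monte Carlo averages agree up to sampling error, which is exactly why the surrogate risk can be estimated without knowing r*.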

Slide 11

KL Importance Estimation Procedure (KLIEP)
β–  KLIEP (Sugiyama et al. (2007)) is another direct DRE method, based on the KL divergence between p*(x) and the model p(x) = r(x) q*(x):
  KL(p*(x) βˆ₯ p(x)) = ∫ p*(x) log (p*(x)/p(x)) dx
                   = ∫ p*(x) log (p*(x)/(r(x) q*(x))) dx
                   = ∫ p*(x) log (p*(x)/q*(x)) dx - ∫ p*(x) log r(x) dx.
β–  From r* = argmin_r KL(p*(x) βˆ₯ p(x)), we estimate r* as
  r̂ = argmin_r -(1/n) Ξ£_{i=1}^n log r(X_i)  s.t.  (1/m) Ξ£_{j=1}^m r(Z_j) = 1.
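A minimal KLIEP-style sketch (again not the authors' code): maximize the mean log-ratio over the P-sample with a nonnegative kernel model, renormalizing after each gradient step so that the empirical constraint (1/m) Ξ£_j r(Z_j) = 1 holds. The optimizer, step size, and kernel choices below are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=1000)   # ~ p* = N(0, 1)
Z = rng.normal(0.0, 2.0, size=1000)   # ~ q* = N(0, 4)

centers, sigma = X[:50], 1.0
K_X = np.exp(-0.5 * ((X[:, None] - centers[None, :]) / sigma) ** 2)
K_Z = np.exp(-0.5 * ((Z[:, None] - centers[None, :]) / sigma) ** 2)

alpha = np.ones(len(centers))
for _ in range(500):
    r_X = K_X @ alpha
    grad = (K_X / r_X[:, None]).mean(axis=0)      # gradient of mean_i log r(X_i)
    alpha = np.maximum(alpha + 0.01 * grad, 0.0)  # ascent step + nonnegativity
    alpha /= (K_Z @ alpha).mean()                 # enforce mean_j r(Z_j) = 1

def r_hat(x):
    x = np.atleast_1d(x)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2) @ alpha

print(r_hat(np.array([0.0, 4.0])), (K_Z @ alpha).mean())
```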

Slide 12

Inlier-based Outlier Detection
β–  Find outliers using inliers (correct samples) and the density ratio (Hido et al. (2008)):
β€’ Inliers are sampled from p*(x).
β€’ Test data (inliers + outliers) are sampled from q*(x).
β–  Detect outliers using the density ratio r*(x) = p*(x)/q*(x): outliers receive small ratio values because they are unlikely under p*.
(Figure from Sugiyama (2016): p*(x), q*(x), and r*(x) = p*(x)/q*(x). Table: mean AUC values over 20 trials for the benchmark datasets (Hido et al. (2008)).)
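A toy sketch of the scoring step. For clarity it uses the analytic ratio of known Gaussians instead of an estimated one; in practice r* would come from an LSIF/KLIEP-style estimator. The mixture weights mirror the (arbitrary) sample counts of this example.

```python
import numpy as np

# Inlier-based outlier detection sketch: inliers follow p* = N(0, 1); the test
# data mixes inliers with an anomalous component N(5, 1), so q* is a mixture.
rng = np.random.default_rng(0)
inlier_test = rng.normal(0.0, 1.0, size=200)
outlier_test = rng.normal(5.0, 1.0, size=20)
test = np.concatenate([inlier_test, outlier_test])

def log_gauss(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

def score(x):
    """r*(x) = p*(x)/q*(x), with q* the 200:20 test mixture. Small score => outlier."""
    p = np.exp(log_gauss(x, 0.0, 1.0))
    q = (200 * np.exp(log_gauss(x, 0.0, 1.0)) + 20 * np.exp(log_gauss(x, 5.0, 1.0))) / 220
    return p / q

scores = score(test)
# Inliers should receive clearly higher ratio values than outliers on average.
print(scores[:200].mean(), scores[200:].mean())
```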

Slide 13

Bregman (BR) Divergence Minimization Perspective
β–  LSIF and KLIEP can be regarded as special cases of BR divergence minimization (Sugiyama et al. (2012)).
β€’ Let g(t) be a twice continuously differentiable convex function.
β€’ Using the BR divergence, the empirical objective can be written as
  BRΜ‚_g(r) := Ê_Q[βˆ‚g(r(Z_j)) r(Z_j) - g(r(Z_j))] - Ê_P[βˆ‚g(r(X_i))].
β€’ By changing g(t), we obtain the objective functions of various direct DRE methods.
  Ex. g(t) = (t - 1)Β²: LSIF; g(t) = t log t - t: KLIEP.
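The unification can be verified symbolically on samples: one empirical BR objective reproduces the LSIF objective (up to an additive constant) and the KLIEP-type objective for the two choices of g above. The Gaussian sampling setup is an arbitrary assumption for the check.

```python
import numpy as np

# Sketch of the BR-divergence view: one empirical objective
#   BR_g(r) = mean_Z[ g'(r(Z)) r(Z) - g(r(Z)) ] - mean_X[ g'(r(X)) ]
# recovers the LSIF and KLIEP-type objectives for specific convex g.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)  # ~ p*
Z = rng.normal(1.0, 1.0, size=500)  # ~ q*
r = lambda x: np.exp(0.5 - x)       # any positive ratio model (here the true one)

def br_objective(g, dg):
    rX, rZ = r(X), r(Z)
    return np.mean(dg(rZ) * rZ - g(rZ)) - np.mean(dg(rX))

# g(t) = (t - 1)^2 gives the LSIF objective up to the constant +1:
br_lsif = br_objective(lambda t: (t - 1) ** 2, lambda t: 2 * (t - 1))
lsif = -2 * np.mean(r(X)) + np.mean(r(Z) ** 2)

# g(t) = t log t - t gives the KLIEP-type objective:
br_kl = br_objective(lambda t: t * np.log(t) - t, lambda t: np.log(t))
kliep = np.mean(r(Z)) - np.mean(np.log(r(X)))
print(br_lsif - lsif, br_kl - kliep)
```

Since objectives that differ by a constant have the same minimizer, both methods are instances of the same BR minimization template.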

Slide 14

Learning from Positive and Unlabeled Data (PU Learning)
β–  PU learning trains a classifier only from positive and unlabeled data (du Plessis et al. (2014, 2015)).
β€’ Positive label: y = +1; negative label: y = -1.
β€’ Positive data: {x_i^P}_{i=1}^{n_P} ∼ p(x | y = +1).
β€’ Unlabeled data: {x_i^U}_{i=1}^{n_U} ∼ p(x).

Slide 15

Learning from Positive and Unlabeled Data (PU Learning)
β–  With the log loss, a classifier f can be trained by minimizing
  β„›(f) := π ∫ -log f(x) p(x | y = +1) dx + [ ∫ -log(1 - f(x)) p(x) dx - π ∫ -log(1 - f(x)) p(x | y = +1) dx ],
  where π is the class prior defined as π = p(y = +1).
β–  Overfitting problem in PU learning (Kiryo et al. (2017)).
β–  The empirical PU risk is not lower bounded and can go to -∞: the bracketed term equals (1 - π) E[-log(1 - f(X)) | y = -1] β‰₯ 0 in population, but its empirical counterpart can be driven to -∞ by a flexible model.

Slide 16

Overfitting and Non-negative Correction
β–  Kiryo et al. (2017) propose a non-negative correction based on the population inequality
  ∫ -log(1 - f(x)) p(x) dx - π ∫ -log(1 - f(x)) p(x | y = +1) dx β‰₯ 0.
β–  The non-negative PU risk is given as
  β„›_nnPU(f) := π ∫ -log f(x) p(x | y = +1) dx + max{ 0, ∫ -log(1 - f(x)) p(x) dx - π ∫ -log(1 - f(x)) p(x | y = +1) dx }.
β–  In population, β„›(f) = β„›_nnPU(f).
β–  Minimize an empirical version of β„›_nnPU(f).
(Figure from Kiryo et al. (2017).)
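A tiny numerical sketch of the problem and the fix: a "memorizing" classifier drives the empirical negative-class risk estimate far below zero, while the non-negative correction clips it at zero. The data points, prior, and classifier are hypothetical choices for illustration.

```python
import numpy as np

# Overfitting in the empirical PU risk and the nnPU clip (Kiryo et al. style).
pi = 0.3                                 # class prior p(y = +1), assumed known
x_pos = np.array([0.0, 1.0, 2.0])        # positive sample
x_unl = np.array([10.0, 11.0, 12.0])     # unlabeled sample (disjoint here)
eps = 1e-6

def f(x):
    """Overfitted scores: ~1 on the memorized positive points, ~0 elsewhere."""
    pos_set = set(x_pos.tolist())
    return np.array([1 - eps if xi in pos_set else eps for xi in np.atleast_1d(x)])

pos_part = pi * np.mean(-np.log(f(x_pos)))            # pi * E_+[-log f]
neg_part = (np.mean(-np.log(1 - f(x_unl)))            # E_U[-log(1 - f)]
            - pi * np.mean(-np.log(1 - f(x_pos))))    # - pi * E_+[-log(1 - f)]

risk_pu = pos_part + neg_part             # unbounded below as f overfits
risk_nn = pos_part + max(0.0, neg_part)   # non-negative correction
print(risk_pu, risk_nn)
```

The uncorrected empirical risk goes strongly negative even though its population counterpart is nonnegative, which is exactly the pathology the max(0, Β·) term removes.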

Slide 17

Overfitting and Non-negative Correction
β–  In DRE, we face a similar overfitting problem.
β–  Kato and Teshima (2021) apply the non-negative method to DRE:
1. PU learning can also be regarded as BR divergence minimization (the optimal classifier is p(y = +1 | x) = π p(x | y = +1)/p(x)).
2. They apply the non-negative correction to DRE.
β–  In maximum likelihood nonparametric density estimation, this overfitting problem is known as the roughness problem (Good and Gaskin (1971)).
(Figure: the densities q*(x) and p*(x).)

Slide 18

Inlier-based Outlier Detection with Deep Neural Networks (DNNs)
β–  Inlier-based outlier detection with high-dimensional data (e.g., CIFAR-10).
β–  We can use DNNs for DRE by combining them with the non-negative correction.
β–  The PU-learning-based DRE shows the best performance.
(Table from Kato and Teshima (2021).)

Slide 19

Causal Inference and Density Ratios

Slide 20

Structural Equation Model
β–  Consider the following linear model between Y and X:
  Y = Xᡀβ + ε,  E[Xε] ≠ 0.
β–  E[Xε] ≠ 0 implies correlation between ε and X.
β–  This situation is called endogeneity.
β–  In this case, the OLS estimator is neither unbiased nor consistent.
β€’ Xᡀβ is not the conditional mean E[Y | X] (E[Y | X] ≠ Xᡀβ).
β–  This model is called a structural equation.

Slide 21

NPIV: Wage Equation
β–  The true wage equation:
  log(wage) = β₀ + years_of_education Γ— β₁ + ability Γ— β₂ + u,  E[u | years of education, ability] = 0.
β–  We cannot observe "ability," so we estimate the following model:
  log(wage) = β₀ + years_of_education Γ— β₁ + ε,  where ε = ability Γ— β₂ + u.
β€’ If "years of education" is correlated with "ability," then E[years_of_education Γ— ε] ≠ 0,
β†’ so we cannot consistently estimate β₁ with OLS.

Slide 22

Instrumental Variable (IV) Method
β–  By using IVs, we can estimate the parameter β.
β–  An IV is a random variable Z satisfying the following conditions:
1. Uncorrelated with the error term: E[Zε] = 0.
2. Correlated with the endogenous variable X.
β–  Angrist and Krueger (1991): using the quarter of birth as the IV.
(Diagram: Z (IV) affects X (years of education), which affects Y (wage) through β; U (ability) confounds X and Y.)
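An illustrative simulation (not from the slides): with an endogenous regressor, OLS is biased, while the IV estimator recovers the true coefficient. All coefficients and noise levels are arbitrary choices for the example.

```python
import numpy as np

# Endogeneity and the IV fix in a linear model Y = beta * X + eps.
rng = np.random.default_rng(0)
n, beta_true = 50_000, 1.5

u = rng.normal(size=n)               # unobserved confounder ("ability")
z = rng.normal(size=n)               # instrument: moves x, unrelated to eps
x = 0.8 * z + u + 0.3 * rng.normal(size=n)   # endogenous regressor
eps = 2.0 * u + rng.normal(size=n)   # error correlated with x through u
y = beta_true * x + eps

beta_ols = np.sum(x * y) / np.sum(x * x)  # biased because E[x eps] != 0
beta_iv = np.sum(z * y) / np.sum(z * x)   # IV estimator: uses E[z eps] = 0
print(beta_ols, beta_iv)
```

The OLS estimate drifts far from the true value because the confounder enters both x and the error, while the instrument-based estimate is consistent.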

Slide 23

Nonparametric Instrumental Variable (NPIV) Regression
β–  A nonparametric version of IV problems (Newey and Powell (2003)):
  Y = f*(X) + ε,  E[ε | X] ≠ 0.
β€’ We want to estimate the structural function f*.
β€’ Since E[ε | X] ≠ 0, least squares does not yield a consistent estimator.
β–  Instrumental variable Z: the condition for IVs is E[ε | Z] = 0.
β–  Algorithms: two-stage least squares with series regression (Newey and Powell (2003)), minimax optimization.

Slide 24

NPIV via Importance Weighting
β–  Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022) solve the problem with an approach similar to covariate shift adaptation (Shimodaira (2000)).
β–  From E_{Y,X}[ε | Z] = 0, if we know r*(y, x | z) = p*(y, x | z)/p(y, x), we can estimate f* by minimizing an empirical approximation of E_Z[(E_{Y,X}[ε | Z])Β²]:
  fΜ‚ = argmin_f (1/n) Ξ£_{k=1}^n { (1/n) Ξ£_{i=1}^n (Y_i - f(X_i)) r*(Y_i, X_i | Z_k) }Β².
β–  We also show theoretical results on the estimation error.

Slide 25

NPIV via Importance Weighting
β–  Estimate r*(y, x | z) = p*(y, x | z)/p(y, x) = p*(y, x, z)/(p(y, x) p(z)) by applying the idea of LSIF:
  r* = argmin_r E_Z[E_{Y,X}[(r*(Y, X | Z) - r(Y, X | Z))Β²]]
     = argmin_r E_Z[E_{Y,X}[r*(Y, X | Z)Β² - 2 r*(Y, X | Z) r(Y, X | Z) + rΒ²(Y, X | Z)]]
     = argmin_r E_Z[E_{Y,X}[-2 r*(Y, X | Z) r(Y, X | Z) + rΒ²(Y, X | Z)]]
     = argmin_r { -2 E_Z[E_{Y,X}[r(Y, X | Z)]] + E_{Y,X,Z}[rΒ²(Y, X | Z)] }.
β–  KLIEP-based estimation is proposed by Suzuki et al. (2009).

Slide 26

Density Ratios and Divergences between Probability Measures

Slide 27

Reconsidering BR Divergence Minimization from the Likelihood Approach
β–  Reconsider DRE methods from a maximum likelihood estimation perspective.
β€’ We can define several likelihoods based on different sampling schemes.
β–  Maximum likelihood estimation under the stratified sampling scheme is not included in BR divergence minimization.
β€’ The corresponding risk belongs to the integral probability metrics (IPMs).
β€’ IPMs include the Wasserstein distance and MMD as special cases.
β–  We reveal the relationships between probability divergences and density ratios,
β†’ expanding the range of applications of density ratios.

Slide 28

Likelihood of Density Ratios
β–  Let r(x) be a model of r*(x) = p*(x)/q*(x).
β–  A model of p*(x) is given as p(x) = r(x) q*(x).
β–  For observations {X_i}_{i=1}^n ∼ p*, the likelihood of the model p(x) is
  β„’(r) = ∏_{i=1}^n p(X_i) = ∏_{i=1}^n r(X_i) q*(X_i).
β–  The log likelihood is given as β„“(r) = Ξ£_{i=1}^n log r(X_i) + Ξ£_{i=1}^n log q*(X_i), where the second term does not depend on r.

Slide 29

Nonparametric Maximum Likelihood Estimation of Density Ratios
β–  We can estimate r* by solving
  max_r (1/n) Ξ£_{i=1}^n log r(X_i)  s.t.  ∫ r(z) q*(z) dz = 1.
β€’ The constraint is based on ∫ r*(x) q*(x) dx = ∫ p*(x) dx = 1.
β€’ This formulation is equivalent to KLIEP.
β–  Similarly, for observations {Z_j}_{j=1}^m ∼ q*, we can estimate 1/r* by solving
  max_r -(1/m) Ξ£_{j=1}^m log r(Z_j)  s.t.  ∫ (1/r(x)) p*(x) dx = 1.

Slide 30

KL Divergence and Likelihood of Density Ratios
β–  The KL divergence is KL(β„™ βˆ₯ β„š) := ∫ p*(x) log(p*(x)/q*(x)) dx.
β–  The KL divergence can be interpreted as a maximized log likelihood:
  KL(β„™ βˆ₯ β„š) = sup_{r ∈ β„› s.t. ∫ r(z) q*(z) dz = 1} ∫ log r(x) p*(x) dx.
β–  Derivation:
  KL(β„™ βˆ₯ β„š) = ∫ p*(x) log(p*(x)/q*(x)) dx
  = sup_{f ∈ β„±} { 1 + ∫ f(x) p*(x) dx - ∫ exp(f(x)) q*(x) dx }
  = 1 + ∫ f*(x) p*(x) dx - ∫ exp(f*(x)) q*(x) dx,  where f* = log(p*/q*), so ∫ exp(f*(x)) q*(x) dx = 1
  = ∫ f*(x) p*(x) dx
  = sup_{r ∈ β„› s.t. ∫ r(z) q*(z) dz = 1} ∫ log r(x) p*(x) dx.
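The identity can be illustrated numerically: for p* = N(0, 1) and q* = N(1, 1) the analytic KL divergence is 1/2, and the supremum is attained at r = p*/q*, so the Monte Carlo average of log r*(X) under P should match it. The Gaussian choice is an assumption for the example.

```python
import numpy as np

# KL(P || Q) as a maximized log likelihood: at the optimal ratio r* = p*/q*,
# E_P[log r*(X)] equals KL(P || Q). For N(0,1) vs N(1,1), log r*(x) = 0.5 - x.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200_000)   # ~ p*

log_r_star = 0.5 - X                     # closed form of log(p*(x)/q*(x))
kl_mc = np.mean(log_r_star)              # Monte Carlo E_P[log r*(X)]
kl_analytic = 0.5                        # KL(N(0,1) || N(1,1)) = (mu_p - mu_q)^2 / 2
print(kl_mc, kl_analytic)
```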

Slide 31

Stratified Sampling Scheme
β–  Assume that for all x ∈ π’Ÿ, both r*(x) and 1/r*(x) exist.
β–  Define the likelihood of r under a stratified sampling scheme.
β–  The likelihood uses both {X_i}_{i=1}^n ∼ p* and {Z_j}_{j=1}^m ∼ q* simultaneously:
  β„’(r) = ∏_{i=1}^n pΜƒ_r(X_i) ∏_{j=1}^m qΜƒ_r(Z_j),
  where pΜƒ_r(x) = r(x) q*(x) and qΜƒ_r(z) = (1/r(z)) p*(z) are the models of p* and q*.
β€’ This sampling scheme has been considered in causal inference (Imbens and Lancaster (1996)).

Slide 32

Stratified Sampling Scheme
β–  The objective function is given as
  max_r { Ξ£_{i=1}^n log r(X_i) - Ξ£_{j=1}^m log r(Z_j) }  s.t.  ∫ (1/r(x)) p*(x) dx = ∫ r(z) q*(z) dz = 1.
β–  We consider the following equivalent unconstrained problem:
  max_r { Ξ£_{i=1}^n log r(X_i) - Ξ£_{j=1}^m log r(Z_j) - (1/m) Ξ£_{j=1}^m r(Z_j) - (1/n) Ξ£_{i=1}^n 1/r(X_i) }.
β–  This transformation is based on the results of Silverman (1982).
β€’ We can apply this trick to KLIEP (see Nguyen et al. (2008)).
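The two normalization constraints can be sanity-checked by Monte Carlo: at the true ratio, E_Q[r*(Z)] = ∫ p*(x) dx = 1 and E_P[1/r*(X)] = ∫ q*(x) dx = 1. Gaussians are used below so that r*(x) = exp(0.5 - x) is available in closed form; this setup is an assumption for the example.

```python
import numpy as np

# Check that the true ratio satisfies both constraints of the stratified scheme.
rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(0.0, 1.0, size=n)   # ~ p* = N(0, 1)
Z = rng.normal(1.0, 1.0, size=n)   # ~ q* = N(1, 1)

r_star = lambda x: np.exp(0.5 - x)        # p*(x)/q*(x) for these Gaussians
mean_Q_r = np.mean(r_star(Z))             # estimates E_Q[r*(Z)] = 1
mean_P_inv_r = np.mean(1.0 / r_star(X))   # estimates E_P[1/r*(X)] = 1
print(mean_Q_r, mean_P_inv_r)
```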

Slide 33

Integral Probability Metrics (IPMs) and Likelihood of Density Ratios
β–  An IPM with a function class β„± defines the distance between two probability distributions P and Q as
  sup_{f ∈ β„±} { ∫ f(x) p*(x) dx - ∫ f(z) q*(z) dz }.
β€’ If β„± is the class of 1-Lipschitz continuous functions, this distance becomes the Wasserstein distance.

Slide 34

IPMs and Likelihood of Density Ratios
β–  Consider an exponential density-ratio model, r(x) = exp(f(x)).
β–  An IPM is the maximized log likelihood under the stratified sampling scheme: IPM_{C(β„±)}(β„™ βˆ₯ β„š), where
  C(β„±) = { f ∈ β„± : ∫ exp(f(z)) q*(z) dz = ∫ exp(-f(x)) p*(x) dx = 1 }.

Slide 35

Density Ratio Metrics (DRMs)
β–  The density ratio metrics (DRMs; Kato, Imaizumi, and Minami (2022)):
  DRM_β„±^λ(β„™ βˆ₯ β„š) = sup_{f ∈ C(β„±)} { λ ∫ f(x) p*(x) dx - (1 - λ) ∫ f(x) q*(x) dx },
  C(β„±) = { f ∈ β„± : ∫ exp(f(x)) q*(x) dx = ∫ exp(-f(x)) p*(x) dx = 1 }.
β€’ A distance based on the weighted average of maximum log likelihoods of the density ratio under the stratified sampling scheme (λ ∈ [0, 1]).
β€’ Bridges the IPMs and the KL divergence.
β€’ DRMs include the KL divergence and IPMs as special cases.

Slide 36

Density Ratio Metrics (DRMs)
β€’ If λ = 1/2, DRM_β„±^{1/2}(P βˆ₯ Q) = (1/2) IPM_{C(β„±)}(P βˆ₯ Q).
β€’ If λ = 1, DRM_β„±^1(P βˆ₯ Q) = KL(P βˆ₯ Q).
β€’ If λ = 0, DRM_β„±^0(P βˆ₯ Q) = KL(Q βˆ₯ P).
β–  The choice of β„± controls the smoothness of the density-ratio model.
  Ex. Non-negative correction, spectral normalization (Miyato et al. (2018)).
β–  Probability divergences can be defined without density ratios.
β†’ What are the advantages of this view? VB methods, DualGAN, causal inference, ...
(Figure from Kato et al. (2022).)

Slide 37

Change-of-Measure Arguments in the Best Arm Identification Problem

Slide 38

MAB Problem
β–  There are K arms, [K] = {1, 2, …, K}, and a fixed time horizon T.
β€’ Pull an arm A_t ∈ [K] in each round t.
β€’ Observe the reward of the chosen arm A_t: Y_t = Ξ£_{a ∈ [K]} 1[A_t = a] Y_{a,t}, where Y_{a,t} is the (potential) reward of arm a in round t.
β€’ Stop the trial at round t = T.
(Diagram: in each round t = 1, …, T, one of the arms 1, 2, …, K is pulled and Y_t = Ξ£_{a ∈ [K]} 1[A_t = a] Y_{a,t} is observed.)

Slide 39

MAB Problem
β–  The distribution of Y_{a,t} does not change across rounds.
β–  Denote the mean reward of arm a by ΞΌ_a = E[Y_{a,t}].
β–  Best arm: the arm with the highest mean reward.
β€’ Denote the best arm by a* = argmax_{a ∈ [K]} ΞΌ_a.

Slide 40

BAI with a Fixed Budget
β–  Best arm identification (BAI) with a fixed budget is an instance of the MAB problem.
β–  In the final round T, we estimate the best arm and denote the estimate by â*_T.
β–  Probability of misidentification: β„™(â*_T ≠ a*).
β–  Goal: minimize the probability of misidentification β„™(â*_T ≠ a*).
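An illustrative fixed-budget simulation (not from the paper): with uniform allocation and the empirical best arm as the final guess, the misidentification probability shrinks rapidly as the budget T grows. The two-armed Gaussian instance and trial counts are arbitrary choices for the example.

```python
import numpy as np

# Fixed-budget BAI sketch: uniform allocation over two Gaussian arms,
# returning the arm with the highest empirical mean at round T.
rng = np.random.default_rng(0)
means = np.array([0.5, 0.0])   # arm with index 0 is the best arm

def misid_rate(T, n_trials=2000):
    """Monte Carlo estimate of P(hat a*_T != a*) for budget T."""
    errors = 0
    for _ in range(n_trials):
        pulls = T // 2   # uniform allocation: each arm pulled T/2 times
        mu_hat = [rng.normal(m, 1.0, size=pulls).mean() for m in means]
        errors += int(np.argmax(mu_hat) != 0)
    return errors / n_trials

rate_small, rate_large = misid_rate(20), misid_rate(200)
print(rate_small, rate_large)
```

The sharp drop between T = 20 and T = 200 reflects the exponential decay of the misidentification probability discussed on the next slides.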

Slide 41

Theoretical Performance Evaluation
β–  How do we evaluate the performance of BAI algorithms?
β–  β„™(â*_T ≠ a*) converges to 0 at an exponential speed; that is, β„™(â*_T ≠ a*) = exp(-T(⋆)) for a constant term (⋆).
β–  We evaluate the term (⋆) via lim sup_{T→∞} -(1/T) log β„™(â*_T ≠ a*).
β–  A performance lower (upper) bound on β„™(â*_T ≠ a*) corresponds to an upper (lower) bound on lim sup_{T→∞} -(1/T) log β„™(â*_T ≠ a*).

Slide 42

Information-Theoretic Lower Bound
β–  Information-theoretic lower bound:
β€’ A lower bound based on information about the distributions.
β€’ This kind of lower bound is typically based on the likelihood ratio, Fisher information, and KL divergence.
β–  The derivation technique is called the change-of-measure argument.
β€’ This technique has been used in the MAB problem (Lai and Robbins (1985)).
β€’ In BAI, Kaufmann et al. (2016) suggest a lower bound.

Slide 43

Lower Bound: Transportation Lemma
β–  Denote the true distribution (bandit model) by v.
β–  Denote the set of alternative hypotheses by Alt(v).
β–  Consistent algorithm: returns the true best arm with probability 1 as T → ∞.
β–  Transportation Lemma (Lemma 1 of Kaufmann et al. (2016)): for any v′ ∈ Alt(v) and any consistent algorithm, if K = 2,
  lim sup_{T→∞} -(1/T) log β„™(â*_T ≠ a*) ≀ lim sup_{T→∞} (1/T) E_{v′}[ Ξ£_{a=1}^K Ξ£_{t=1}^T 1[A_t = a] log( f′_a(Y_t)/f_a(Y_t) ) ],
  where f_a and f′_a are the densities of arm a's reward under v and v′, and the summand log(f′_a(Y_t)/f_a(Y_t)) is the log-likelihood ratio.

Slide 44

Open Problem: Optimal Algorithm?
β–  Open problems:
1. Kaufmann et al.'s bound is only applicable to the two-armed bandit (K = 2).
2. There is no optimal algorithm whose upper bound achieves the lower bound.
β–  Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022) propose an optimal algorithm under a small-gap setting:
1. Consider a small-gap situation: Ξ”_a = ΞΌ_{a*} - ΞΌ_a → 0 for all a ∈ [K].
2. Prove a large deviation upper bound.
3. The upper bound then matches the lower bound in the limit of Ξ”_a.

Slide 45

Lower Bound under a Small Gap
β–  Let I_a(μ_a) be the Fisher information of the parameter μ_a of arm a.
β–  Let w_a be an arm-allocation ratio, w_a = lim sup_{T→∞} E[Σ_{t=1}^T 1[A_t = a]]/T.
β–  Lower bound (Lemma 1 of Kato et al. (2022)): writing the best arm as arm 1,
  lim sup_{T→∞} -(1/T) log β„™(â*_T ≠ a*) ≀ sup_{(w_a)} min_{a ≠ a*} Ξ”_aΒ² / ( 2 ( 1/(I_1(ΞΌ_1) w_1) + 1/(I_a(ΞΌ_a) w_a) ) ) + o(Ξ”_aΒ²).

Slide 46

Upper Bound: Large Deviation Principles (LDPs) for Martingales
β–  Let ΞΌΜ‚_{a,T} be an estimator of the mean reward ΞΌ_a.
β–  Consider returning argmax_a ΞΌΜ‚_{a,T} as the estimated best arm. Then
  β„™(â*_T ≠ a*) ≀ Ξ£_{a ≠ a*} β„™(ΞΌΜ‚_{a,T} β‰₯ ΞΌΜ‚_{a*,T}) = Ξ£_{a ≠ a*} β„™(ΞΌΜ‚_{a*,T} - ΞΌΜ‚_{a,T} - Ξ”_a ≀ -Ξ”_a).
β–  LDP: evaluation of β„™(ΞΌΜ‚_{a*,T} - ΞΌΜ‚_{a,T} - Ξ”_a ≀ C), where C is a constant.
β€’ The central limit theorem (CLT) evaluates β„™(√T (ΞΌΜ‚_{a*,T} - ΞΌΜ‚_{a,T} - Ξ”_a) ≀ C).
β€’ We cannot use the CLT to obtain the upper bound.

Slide 47

Upper Bound: LDPs for Martingales
β–  There are well-known existing results on large deviation principles.
  Ex. CramΓ©r's theorem and the GΓ€rtner-Ellis theorem.
β€’ These results cannot be applied to the BAI problem owing to the non-stationarity of the stochastic process.
β–  Fan et al. (2013, 2014): LDPs for martingales.
β€’ Key tool: change-of-measure arguments.

Slide 48

Upper Bound: Large Deviation Principles for Martingales
β–  Let β„™ be the probability measure of the original problem.
β–  Define U_T = ∏_{t=1}^T exp(λ ξ_t) / E[exp(λ ξ_t) | β„±_{t-1}], where (ξ_t) is the martingale difference sequence.
β–  Define the conjugate probability measure β„™_λ by dβ„™_λ = U_T dβ„™.
1. Derive the bound under β„™_λ.
2. Then transform it into the bound under β„™ via the density ratio dβ„™/dβ„™_λ.
(Diagram: an upper bound under β„™_λ is converted into an upper bound under β„™ by changing measures through dβ„™_λ/dβ„™ = U_T.)

Slide 49

Upper Bound: Large Deviation Principles for Martingales
β–  Kato et al. (2022) generalize the result of Fan et al. (2013, 2014).
β€’ Under an appropriately designed BAI algorithm, we show that the upper bound matches the lower bound.
β–  Upper bound (Theorem 4.1 of Kato et al. (2022)): if Σ_{t=1}^T 1[A_t = a]/T → w_a almost surely, then under some regularity conditions, writing the best arm as arm 1,
  lim sup_{T→∞} -(1/T) log β„™(â*_T ≠ a*) β‰₯ min_{a ≠ a*} Ξ”_aΒ² / ( 2 ( 1/(I_1(ΞΌ_1) w_1) + 1/(I_a(ΞΌ_a) w_a) ) ) + o(Ξ”_aΒ²).
β–  This result implies a Gaussian approximation of the LDP as Ξ”_a → 0.

Slide 50

Conclusion

Slide 51

Conclusion
β–  Density-ratio approaches:
β€’ Inlier-based outlier detection.
β€’ PU learning.
β€’ Causal inference.
β€’ Multi-armed bandit problems (change-of-measure arguments).
β†’ Useful in many ML applications.
β–  Other topics: double/debiased machine learning (Chernozhukov et al. (2018)), variational Bayesian methods (Tran et al. (2017)), etc.

Slide 52

Reference
β€’ Angrist, J. D. and Krueger, A. B. (1991), "Does Compulsory School Attendance Affect Schooling and Earnings?," The Quarterly Journal of Economics, 106, 979–1014.
β€’ Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), "Double/debiased Machine Learning for Treatment and Structural Parameters," Econometrics Journal, 21, C1–C68.
β€’ du Plessis, M. C., Niu, G., and Sugiyama, M. (2014), "Analysis of Learning from Positive and Unlabeled Data," in Conference on Neural Information Processing Systems.
β€’ du Plessis, M. C., Niu, G., and Sugiyama, M. (2015), "Convex Formulation for Learning from Positive and Unlabeled Data," in International Conference on Machine Learning.
β€’ Fan, X., Grama, I., and Liu, Q. (2013), "CramΓ©r Large Deviation Expansions for Martingales under Bernstein's Condition," Stochastic Processes and their Applications, 123, 3919–3942.
β€’ Fan, X., Grama, I., and Liu, Q. (2014), "A Generalization of CramΓ©r Large Deviations for Martingales," Comptes Rendus Mathematique, 352, 853–858.
β€’ Good, I. J. and Gaskins, R. A. (1971), "Nonparametric Roughness Penalties for Probability Densities," Biometrika, 58, 255–277.
β€’ Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014), "Generative Adversarial Nets," in Conference on Neural Information Processing Systems.
β€’ Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. (2011), "Statistical Outlier Detection Using Direct Density Ratio Estimation," Knowledge and Information Systems, 26, 309–336.
β€’ Imbens, G. W. and Lancaster, T. (1996), "Efficient Estimation and Stratified Sampling," Journal of Econometrics, 74, 289–318.
β€’ Kanamori, T., Hido, S., and Sugiyama, M. (2009), "A Least-squares Approach to Direct Importance Estimation," Journal of Machine Learning Research, 10, 1391–1445.
β€’ Kato, M. and Teshima, T. (2021), "Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation," in International Conference on Machine Learning.
β€’ Kato, M., Imaizumi, M., McAlinn, K., Yasui, S., and Kakehi, H. (2022), "Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting," in International Conference on Learning Representations.
β€’ Kato, M., Imaizumi, M., and Minami, K. (2022), "Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation: Bridging KL-Divergence and Integral Probability Metrics."
β€’ Kato, M., Ariu, K., Imaizumi, M., Uehara, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap."
β€’ Kaufmann, E., CappΓ©, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," Journal of Machine Learning Research, 17, 1–42.
β€’ Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. (2017), "Positive-Unlabeled Learning with Non-Negative Risk Estimator," in Conference on Neural Information Processing Systems.
β€’ Lai, T. L. and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics, 6, 4–22.
β€’ Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018), "Spectral Normalization for Generative Adversarial Networks," in International Conference on Learning Representations.
β€’ Newey, W. K. and Powell, J. L. (2003), "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71, 1565–1578.
β€’ Nguyen, X., Wainwright, M. J., and Jordan, M. (2008), "Estimating Divergence Functionals and the Likelihood Ratio by Penalized Convex Risk Minimization," in Conference on Neural Information Processing Systems.
β€’ Shimodaira, H. (2000), "Improving Predictive Inference under Covariate Shift by Weighting the Log-Likelihood Function," Journal of Statistical Planning and Inference, 90, 227–244.
β€’ Silverman, B. W. (1982), "On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method," The Annals of Statistics, 10, 795–810.
β€’ Sugiyama, M., Nakajima, S., Kashima, H., von BΓΌnau, P., and Kawanabe, M. (2007), "Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation," in Conference on Neural Information Processing Systems.
β€’ Sugiyama, M., Suzuki, T., and Kanamori, T. (2011), "Density Ratio Matching under the Bregman Divergence: A Unified Framework of Density Ratio Estimation," Annals of the Institute of Statistical Mathematics, 64.
β€’ Sugiyama, M., Suzuki, T., and Kanamori, T. (2012), Density Ratio Estimation in Machine Learning, New York, NY, USA: Cambridge University Press, 1st ed.
β€’ Sugiyama, M. (2016), Introduction to Statistical Machine Learning.
β€’ Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. (2008), "Approximating Mutual Information by Maximum Likelihood Density Ratio Estimation," in Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008, Proceedings of Machine Learning Research, vol. 4, 5–20.
β€’ Tran, D., Ranganath, R., and Blei, D. M. (2017), "Hierarchical Implicit Models and Likelihood-Free Variational Inference," in Conference on Neural Information Processing Systems.
β€’ Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016), "Generative Adversarial Nets from a Density Ratio Estimation Perspective."