Recent Findings on Density-Ratio Approaches in Machine Learning

MasaKat0
March 30, 2022

Recent Findings on Density-Ratio Approaches in Machine Learning. FIMI2022.

Transcript

  1. Recent Findings on Density-Ratio Approaches in Machine Learning

    Workshop on FIMI, March 30th, 2022. Masahiro Kato, The University of Tokyo, Imaizumi Lab / CyberAgent, Inc. AILab
  2. Density-Ratio Approaches in Machine Learning (ML)

    ■ Consider two distributions $P$ and $Q$ with a common support.
    ■ Let $p^*$ and $q^*$ be the density functions of $P$ and $Q$, respectively.
    ■ Define the density ratio (function) as $r^*(x) = \frac{p^*(x)}{q^*(x)}$.
    ■ Approaches using density ratios.
    → Useful in many ML applications.
    [Figure: the densities $p^*(x)$ and $q^*(x)$ and the density ratio $r^*(x) = \frac{p^*(x)}{q^*(x)}$.]
  3. Density-Ratio Approach in ML

    ■ Many ML applications include two or more distributions.
      • Classification.
      • Generative adversarial networks (GAN).
      • Divergence between probability measures.
      • Multi-armed bandit (MAB) problem (change of measures).
    → In these tasks, the density ratio appears as a key component.
  4. Empirical Perspective

    ■ An estimator of the density ratio $r^*$ provides a solution.
      • Inlier-based outlier detection: finding outliers based on the density ratio (Hido et al. (2008)).
      • Causal inference: conditional moment restrictions can be approximated by the density ratio (Kato et al. (2022)).
      • GAN (Goodfellow et al. (2014), Uehara et al. (2016)).
      • Variational Bayesian (VB) method (Tran et al. (2017)), etc.
  5. Theoretical Viewpoint

    ■ Using density ratios is also useful in theoretical analysis.
      • The likelihood ratio gives tight lower bounds on decision-making problems.
        Ex. Lower bounds in MAB problems (Lai and Robbins (1985)).
      • Transform an original problem to obtain a tight theoretical result.
        Ex. Large deviation principles for martingales (Fan et al. (2013, 2014)).
    [Figure: distributions $Q$ and $P$, linked through $p^*(x)$ and $q^*(x)$, each labeled "Some theoretical results".]
  6. Presentation Outline: Recent (Our) Findings on Density Ratios

    1. Density-Ratio Estimation and its Applications
       Kato and Teshima (ICML2021), “Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation”
    2. Causal Inference and Density Ratios
       Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022), “Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting”
    3. Density Ratios and Divergences between Probability Measures
       Kato, Imaizumi, and Minami (2022), “Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation”
    4. Change-of-Measure Arguments in Best Arm Identification Problem
       Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022), “Best Arm Identification with a Fixed Budget under a Small Gap”
  7. Density-Ratio Estimation and its Applications

    Workshop on Functional Inference and Machine Intelligence. Masahiro Kato, March 30th, 2022. The University of Tokyo / CyberAgent, Inc. AILab
  8. Density-Ratio Estimation (DRE)

    ■ Consider DRE from observations.
    ■ Two sets of observations: $\{X_i\}_{i=1}^{n} \sim p^*$, $\{Z_j\}_{j=1}^{m} \sim q^*$.
    ■ Two-step method:
      • Estimate $p^*(x)$ and $q^*(x)$; then, construct an estimator of $r^*(x)$.
      × Empirical performance.
      × Theoretical guarantee.
    → Consider direct estimation of $r^*(x)$: LSIF, KLIEP, and PU learning.
  9. Least-Squares Importance Fitting (LSIF)

    ■ Let $r$ be a model of the density ratio $r^*$.
    ■ The risk of the squared error is $R(r) = \mathbb{E}_q\left[(r^*(X) - r(X))^2\right]$.
    ■ The minimizer of the empirical risk, $\hat{r}$, is an estimator of $r^*$.
    ■ Instead of $R(r)$, we minimize an empirical version of the following risk: $\tilde{R}(r) = -2\,\mathbb{E}_p[r(X)] + \mathbb{E}_q[r^2(X)]$.
    ■ This method is called LSIF (Kanamori et al. (2009)).
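  10. LSIF

    ■ Derivation:
    $$
    \begin{aligned}
    r^* &= \arg\min_{r} \mathbb{E}_q\left[(r^*(X) - r(X))^2\right] \\
        &= \arg\min_{r} \mathbb{E}_q\left[(r^*(X))^2 - 2 r^*(X) r(X) + r^2(X)\right] \\
        &= \arg\min_{r} \mathbb{E}_q\left[-2 r^*(X) r(X) + r^2(X)\right] \\
        &= \arg\min_{r} \left\{ -2\,\mathbb{E}_p[r(X)] + \mathbb{E}_q[r^2(X)] \right\}.
    \end{aligned}
    $$
      • Here, we used $\mathbb{E}_q[r^*(X) r(X)] = \int r^*(x) r(x) q^*(x)\,dx = \int r(x) p^*(x)\,dx = \mathbb{E}_p[r(X)]$.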
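
The empirical version of the risk $-2\,\mathbb{E}_p[r(X)] + \mathbb{E}_q[r^2(X)]$ has a closed-form minimizer when the model is linear in its parameters. The following is a minimal sketch under assumptions that are not in the slides: a Gaussian-kernel model $r(x) = \theta^\top \phi(x)$, a ridge penalty (as in the unconstrained LSIF variant), and toy data with $p^* = N(0, 1)$ and $q^* = N(0, 1.5^2)$. The names `gaussian_features` and `fit_lsif_ridge` are hypothetical.

```python
import numpy as np

def gaussian_features(x, centers, width):
    """Gaussian kernel features phi(x), shape (len(x), len(centers))."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff**2 / (2.0 * width**2))

def fit_lsif_ridge(x_p, z_q, centers, width=1.0, ridge=1e-3):
    """Minimize (1/m) sum_j r(Z_j)^2 - (2/n) sum_i r(X_i) + ridge * ||theta||^2
    for the model r(x) = theta^T phi(x). The ridge term is an added assumption."""
    phi_p = gaussian_features(x_p, centers, width)  # features at samples from p*
    phi_q = gaussian_features(z_q, centers, width)  # features at samples from q*
    H = phi_q.T @ phi_q / len(z_q)                  # estimates E_q[phi phi^T]
    h = phi_p.mean(axis=0)                          # estimates E_p[phi]
    # First-order condition: (H + ridge * I) theta = h.
    return np.linalg.solve(H + ridge * np.eye(len(centers)), h)

# Toy data (assumption): p* = N(0, 1), q* = N(0, 1.5^2).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=2000)
z_q = rng.normal(0.0, 1.5, size=2000)
centers = np.linspace(-3.0, 3.0, 10)

theta = fit_lsif_ridge(x_p, z_q, centers)
grid = np.array([0.0, 2.5])
r_hat = gaussian_features(grid, centers, 1.0) @ theta
print("estimated ratio at x = 0 and x = 2.5:", r_hat)
```

For this toy pair the true ratio is larger near the origin than in the tails, so the estimate at $x = 0$ should exceed the estimate at $x = 2.5$.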
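  11. KL Importance Estimation Procedure (KLIEP)

    ■ KLIEP (Sugiyama et al. (2007)) is another DRE method that uses the KL divergence between $p^*(x)$ and a model $p(x) = r(x) q^*(x)$:
    $$
    \begin{aligned}
    \mathrm{KL}(p^*(x) \,\|\, p(x)) &= \int p^*(x) \log \frac{p^*(x)}{p(x)}\,dx \\
      &= \int p^*(x) \log \frac{p^*(x)}{r(x) q^*(x)}\,dx \\
      &= \int p^*(x) \log \frac{p^*(x)}{q^*(x)}\,dx - \int p^*(x) \log r(x)\,dx.
    \end{aligned}
    $$
    ■ From $r^* = \arg\min_{r} \mathrm{KL}(p^*(x) \,\|\, p(x))$, we estimate $r^*$ as
    $$
    \hat{r} = \arg\min_{r} -\frac{1}{n} \sum_{i=1}^{n} \log r(X_i) \quad \text{s.t.} \quad \frac{1}{m} \sum_{j=1}^{m} r(Z_j) = 1.
    $$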
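
The constrained problem above can be approached with projected gradient ascent. The sketch below is an illustration under assumptions not stated in the slides: a non-negative Gaussian-kernel model $r(x) = \theta^\top \phi(x)$ with $\theta \ge 0$, a fixed step size, and toy data with $p^* = N(0, 1)$ and $q^* = N(0, 1.5^2)$. The function names are hypothetical.

```python
import numpy as np

def gaussian_features(x, centers, width):
    """Gaussian kernel features phi(x), shape (len(x), len(centers))."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff**2 / (2.0 * width**2))

def fit_kliep(x_p, z_q, centers, width=1.0, step=0.05, n_iter=1000):
    """Maximize (1/n) sum_i log r(X_i) subject to (1/m) sum_j r(Z_j) = 1
    and theta >= 0, for r(x) = theta^T phi(x)."""
    A = gaussian_features(x_p, centers, width)               # features at p*-samples
    b = gaussian_features(z_q, centers, width).mean(axis=0)  # mean features at q*-samples
    theta = np.ones(len(centers))
    theta /= b @ theta                                       # satisfy the constraint at the start
    for _ in range(n_iter):
        # Gradient ascent step on the average log likelihood.
        theta = theta + step * (A.T @ (1.0 / (A @ theta))) / len(x_p)
        # Project back onto {theta >= 0, b^T theta = 1}.
        theta = theta + (1.0 - b @ theta) * b / (b @ b)
        theta = np.maximum(theta, 0.0)
        theta = theta / (b @ theta)
    return theta

# Toy data (assumption): p* = N(0, 1), q* = N(0, 1.5^2).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=2000)
z_q = rng.normal(0.0, 1.5, size=2000)
centers = np.linspace(-3.0, 3.0, 10)

theta = fit_kliep(x_p, z_q, centers)
constraint_value = np.mean(gaussian_features(z_q, centers, 1.0) @ theta)
avg_loglik = np.mean(np.log(gaussian_features(x_p, centers, 1.0) @ theta))
print("constraint (should be 1):", constraint_value)
print("average log likelihood:", avg_loglik)
```

The final normalization keeps the empirical constraint $\frac{1}{m}\sum_j r(Z_j) = 1$ satisfied up to floating-point error, while the average log likelihood serves as a plug-in estimate of the KL term that depends on $r$.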
  12. Inlier-based Outlier Detection

    ■ Find outliers using inliers (correct samples) and the density ratio (Hido et al. (2008)).
      • Inliers are sampled from $p^*(x)$.
      • Test data: inliers + outliers are sampled from $q^*(x)$.
    ■ Outlier detection using the density ratio $r^*(x) = \frac{p^*(x)}{q^*(x)}$.
    [Figure: the densities $p^*(x)$ and $q^*(x)$ and the ratio $r^*(x) = \frac{p^*(x)}{q^*(x)}$. From Sugiyama (2016).]
    [Table: Mean AUC values over 20 trials for the benchmark datasets (Hido et al. (2008)).]
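  13. Bregman (BR) Divergence Minimization Perspective

    ■ LSIF and KLIEP can be regarded as special cases of BR divergence minimization (Sugiyama et al. (2012)).
      • Let $g(t)$ be a twice continuously differentiable convex function.
      • Using the BR divergence, we can rewrite the objective function as follows:
        $$
        \widehat{\mathrm{BR}}_g(r) := \hat{\mathbb{E}}_q\left[\partial g(r(Z_j))\, r(Z_j) - g(r(Z_j))\right] - \hat{\mathbb{E}}_p\left[\partial g(r(X_i))\right],
        $$
        where $\hat{\mathbb{E}}_q$ and $\hat{\mathbb{E}}_p$ are sample averages over $\{Z_j\}_{j=1}^{m} \sim q^*$ and $\{X_i\}_{i=1}^{n} \sim p^*$.
      • By changing $g(t)$, we obtain objective functions for various direct DRE methods.
        Ex. $g(t) = (t - 1)^2$: LSIF, $g(t) = t \log t - t$: KLIEP.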
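
To see how the choice of $g$ recovers the two methods, the empirical BR objective can be evaluated directly. The sketch below is illustrative only; `bregman_objective` is a hypothetical helper, and the inputs are arbitrary positive values standing in for model outputs $r(Z_j)$ and $r(X_i)$.

```python
import numpy as np

def bregman_objective(r_at_q, r_at_p, g, dg):
    """Empirical BR objective: average over q*-samples of dg(r) * r - g(r),
    minus the average over p*-samples of dg(r)."""
    return np.mean(dg(r_at_q) * r_at_q - g(r_at_q)) - np.mean(dg(r_at_p))

# Arbitrary positive stand-ins for r(Z_j) and r(X_i) (assumption, not real data).
rng = np.random.default_rng(0)
r_at_q = rng.uniform(0.2, 2.0, size=500)
r_at_p = rng.uniform(0.2, 2.0, size=400)

# g(t) = (t - 1)^2 gives E_q[r^2] - 2 E_p[r] + 1, the LSIF risk plus a constant.
br_lsif = bregman_objective(r_at_q, r_at_p,
                            g=lambda t: (t - 1.0) ** 2,
                            dg=lambda t: 2.0 * (t - 1.0))
lsif_risk = np.mean(r_at_q ** 2) - 2.0 * np.mean(r_at_p)

# g(t) = t log t - t gives E_q[r] - E_p[log r], an unconstrained KLIEP objective.
br_kliep = bregman_objective(r_at_q, r_at_p,
                             g=lambda t: t * np.log(t) - t,
                             dg=lambda t: np.log(t))
kliep_risk = np.mean(r_at_q) - np.mean(np.log(r_at_p))

print(br_lsif - lsif_risk, br_kliep - kliep_risk)
```

The first difference is the constant 1 (which does not affect the minimizer) and the second is 0, up to rounding.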
  14. Learning from Positive and Unlabeled Data (PU Learning)

    ■ PU learning is a method for training a classifier from only positive and unlabeled data (du Plessis et al. (2014, 2015)).
      • Positive label: $y = +1$, negative label: $y = -1$.
      • Positive data: $\{x_i^{p}\}_{i=1}^{n_p} \sim p(x \mid y = +1)$.
      • Unlabeled data: $\{x_i^{u}\}_{i=1}^{n_u} \sim p(x)$.
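  15. Learning from Positive and Unlabeled Data (PU Learning)

    ■ With the logarithmic loss, a classifier $f$ can be trained by minimizing
    $$
    \mathcal{R}(f) := -\pi \int \log f(x)\, p(x \mid y = +1)\,dx + \pi \int \log(1 - f(x))\, p(x \mid y = +1)\,dx - \int \log(1 - f(x))\, p(x)\,dx,
    $$
    where $\pi$ is a class prior defined as $\pi = p(y = +1)$.
    ■ Overfitting problem in PU learning (Kiryo et al. (2017)).
    ■ The empirical PU risk is not lower bounded and goes to $-\infty$.
      • The empirical version of the term $\pi \int \log(1 - f(x))\, p(x \mid y = +1)\,dx$ can go to $-\infty$.
  16. Overfitting and Non-negative Correction

    ■ Kiryo et al. (2017) propose a non-negative correction based on the fact that the risk on the negative class is non-negative:
    $$
    \pi \int \log(1 - f(x))\, p(x \mid y = +1)\,dx - \int \log(1 - f(x))\, p(x)\,dx \ge 0.
    $$
    ■ The non-negative PU risk is given as
    $$
    \mathcal{R}_{\mathrm{nnPU}}(f) := -\pi \int \log f(x)\, p(x \mid y = +1)\,dx + \max\left\{0,\ \pi \int \log(1 - f(x))\, p(x \mid y = +1)\,dx - \int \log(1 - f(x))\, p(x)\,dx\right\}.
    $$
    ■ In population, $\mathcal{R}(f) = \mathcal{R}_{\mathrm{nnPU}}(f)$.
    ■ Minimize an empirical version of $\mathcal{R}_{\mathrm{nnPU}}(f)$.
    [Figure: from Kiryo et al. (2017).]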
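
The effect of the correction can be checked on a small numerical example. The sketch below assumes the logarithmic loss, i.e. loss $-\log f(x)$ for a positive label and $-\log(1 - f(x))$ for a negative label, so that each class-wise risk is non-negative in population. The helper `pu_risks` and the classifier outputs are hypothetical and chosen only to mimic an overfitted classifier.

```python
import numpy as np

def pu_risks(f_pos, f_unl, prior, eps=1e-12):
    """Return (unbiased PU risk, non-negative PU risk) under the log loss.

    f_pos: classifier outputs in (0, 1) on positive samples.
    f_unl: classifier outputs in (0, 1) on unlabeled samples.
    prior: class prior pi = p(y = +1).
    """
    f_pos = np.clip(f_pos, eps, 1.0 - eps)
    f_unl = np.clip(f_unl, eps, 1.0 - eps)
    # Risk on the positive class: -pi * E_p[log f].
    pos_term = -prior * np.mean(np.log(f_pos))
    # Risk on the negative class, rewritten with positive and unlabeled data:
    # pi * E_p[log(1 - f)] - E_u[log(1 - f)]. Non-negative in population,
    # but its empirical version can become negative.
    neg_term = prior * np.mean(np.log(1.0 - f_pos)) - np.mean(np.log(1.0 - f_unl))
    return pos_term + neg_term, pos_term + max(0.0, neg_term)

# Hypothetical outputs of an overfitted classifier: f is almost 1 on every
# positive training sample, and 0.3 on every unlabeled sample.
f_pos = np.full(100, 0.999)
f_unl = np.full(300, 0.3)
unbiased_risk, nn_risk = pu_risks(f_pos, f_unl, prior=0.5)
print("unbiased PU risk:", unbiased_risk)
print("non-negative PU risk:", nn_risk)
```

In this example the empirical negative-class term is negative, so the unbiased empirical risk drops below zero, while the corrected risk stays non-negative.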
  17. Overfitting and Non-negative Correction

    ■ In DRE, we face a similar overfitting problem.
    ■ Kato and Teshima (2021) apply the non-negative method to DRE.
      1. PU learning can also be regarded as BR divergence minimization (the optimal classifier is $p(y = +1 \mid x) = \frac{\pi\, p(x \mid y = +1)}{p(x)}$).
      2. They apply the non-negative correction to DRE.
    ■ In maximum likelihood nonparametric density estimation, this overfitting problem is known as the roughness problem (Good and Gaskins (1971)).
  18. Inlier-based Outlier Detection with Deep Neural Networks (DNNs)

    ■ Inlier-based outlier detection with high-dimensional data (e.g., CIFAR-10).
    ■ We can use DNNs when they are combined with the non-negative correction.
    ■ PU learning-based DRE shows the best performance.
    [Figure: from Kato and Teshima (2021).]
  19. Causal Inference and Density Ratios
  20. Structural Equation Model

    ■ Consider the following linear model between $Y$ and $X$: $Y = X^\top \beta + \varepsilon$, $\mathbb{E}[X^\top \varepsilon] \ne 0$.
    ■ $\mathbb{E}[X^\top \varepsilon] \ne 0$ implies correlation between $\varepsilon$ and $X$.
    ■ This situation is called endogeneity.
    ■ In this case, the OLS estimator is neither unbiased nor consistent.
      • $X^\top \beta$ is not the conditional mean $\mathbb{E}[Y \mid X]$ ($\mathbb{E}[Y \mid X] \ne X^\top \beta$).
    ■ This model is called a structural equation.
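  21. NPIV: Wage Equation

    ■ The true wage equation:
    $$
    \log(\mathit{wage}) = \beta_0 + \mathit{years\ of\ education} \times \beta_1 + \mathit{ability} \times \beta_2 + u, \quad \mathbb{E}[u \mid \mathit{years\ of\ education}, \mathit{ability}] = 0.
    $$
    ■ We cannot observe “ability” and estimate the following model:
    $$
    \log(\mathit{wage}) = \beta_0 + \mathit{years\ of\ education} \times \beta_1 + \varepsilon, \quad \varepsilon = \mathit{ability} \times \beta_2 + u.
    $$
      • If “years of education” is correlated with “ability,” $\mathbb{E}[\text{“years of education”} \times \varepsilon] \ne 0$.
    → We cannot consistently estimate $\beta_1$ with OLS.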
  22. Instrumental Variable (IV) Method

    ■ By using IVs, we can estimate the parameter $\beta$.
    ■ The IV is a random variable $Z$ satisfying the following conditions:
      1. Uncorrelated with the error term: $\mathbb{E}[Z^\top \varepsilon] = 0$.
      2. Correlated with the endogenous variable $X$.
    ■ Angrist and Krueger (1991): using the quarter of birth as the IV.
    [Figure: diagram of $Z$ (IV), $X$ (years of education), $Y$ (wage), $U$ (ability), and $\beta$.]
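
A small simulation illustrates why OLS fails under endogeneity and how an instrument repairs it. This is a toy sketch, not the wage data: the data-generating process, the coefficient values, and all variable names below are assumptions made for illustration, with a scalar regressor and the simple IV estimator $\hat\beta_{\mathrm{IV}} = \frac{\sum_i Z_i Y_i}{\sum_i Z_i X_i}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_true = 1.0

# Hypothetical data-generating process (all variables have mean zero).
ability = rng.normal(size=n)                 # unobserved confounder U
z = rng.normal(size=n)                       # instrument: independent of U and the noise
x = 0.8 * z + ability + rng.normal(size=n)   # endogenous regressor (depends on U)
y = beta_true * x + 2.0 * ability + rng.normal(size=n)

# OLS without intercept: biased because x is correlated with the error 2*U + noise.
beta_ols = (x @ y) / (x @ x)

# IV estimator: uses only the variation in x that is driven by z.
beta_iv = (z @ y) / (z @ x)

print("true beta:", beta_true)
print("OLS estimate:", beta_ols)
print("IV estimate:", beta_iv)
```

In this design the OLS estimate is pulled away from the true coefficient by the omitted confounder, while the IV estimate stays close to it.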
  23. Nonparametric Instrumental Variable (NPIV) Regression

    ■ A nonparametric version of IV problems (Newey and Powell (2003)): $Y = f^*(X) + \varepsilon$, $\mathbb{E}[\varepsilon \mid X] \ne 0$.
      • We want to estimate the structural function $f^*$.
      • $\mathbb{E}[\varepsilon \mid X] \ne 0$ → least squares does not yield a consistent estimator.
    ■ Instrumental variable $Z$: the condition for IVs is $\mathbb{E}[\varepsilon \mid Z] = 0$.
    ■ Algorithms: two-stage least squares with series regression (Newey and Powell (2003)), minimax optimization.
  24. NPIV via Importance Weighting

    ■ Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022) solve the problem with an approach similar to covariate shift adaptation (Shimodaira (2000)).
    ■ From $\mathbb{E}_{Y,X}[\varepsilon \mid Z] = 0$, if we know $r^*(y, x \mid z) = \frac{p^*(y, x \mid z)}{p(y, x)}$, we estimate $f^*$ by minimizing an empirical approximation of $\mathbb{E}_Z\left[(\mathbb{E}_{Y,X}[\varepsilon \mid Z])^2\right]$:
    $$
    \hat{f} = \arg\min_{f} \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{n} \sum_{j=1}^{n} (Y_j - f(X_j))\, r^*(Y_j, X_j \mid Z_i) \right)^2.
    $$
    ■ We show some theoretical results on the estimation error.
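  25. NPIV via Importance Weighting

    ■ Estimate $r^*(y, x \mid z) = \frac{p^*(y, x \mid z)}{p(y, x)} = \frac{p^*(y, x, z)}{p(y, x)\, p(z)}$ by applying the idea of LSIF as
    $$
    \begin{aligned}
    r^* &= \arg\min_{r} \mathbb{E}_Z\left[\mathbb{E}_{Y,X}\left[(r^*(Y, X \mid Z) - r(Y, X \mid Z))^2\right]\right] \\
        &= \arg\min_{r} \mathbb{E}_Z\left[\mathbb{E}_{Y,X}\left[(r^*(Y, X \mid Z))^2 - 2 r^*(Y, X \mid Z)\, r(Y, X \mid Z) + r^2(Y, X \mid Z)\right]\right] \\
        &= \arg\min_{r} \mathbb{E}_Z\left[\mathbb{E}_{Y,X}\left[-2 r^*(Y, X \mid Z)\, r(Y, X \mid Z) + r^2(Y, X \mid Z)\right]\right] \\
        &= \arg\min_{r} \left\{ -2\,\mathbb{E}_{Y,X,Z}\left[r(Y, X \mid Z)\right] + \mathbb{E}_Z\left[\mathbb{E}_{Y,X}\left[r^2(Y, X \mid Z)\right]\right] \right\}.
    \end{aligned}
    $$
      • Here, $\mathbb{E}_Z[\mathbb{E}_{Y,X}[\cdot]]$ is taken under the product of the marginals $p(y, x)\, p(z)$, and $\mathbb{E}_{Y,X,Z}[\cdot]$ under the joint density $p^*(y, x, z)$; the last step uses $r^*(y, x \mid z)\, p(y, x)\, p(z) = p^*(y, x, z)$.
    ■ KLIEP-based estimation is proposed by Suzuki et al. (2009).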
  26. Density Ratios and Divergences between Probability Measures
  27. Reconsidering the BR Divergence Minimization from the Likelihood Approach

    ■ Reconsider DRE methods from the perspective of maximum likelihood estimation.
      • We can define several likelihoods based on different sampling schemes.
    ■ Maximum likelihood estimation under the stratified sampling scheme is not included in BR divergence minimization.
    ➢ The risk belongs to integral probability metrics (IPMs).
      • IPMs include the Wasserstein distance and MMD as special cases.
    ■ Reveal the relationships between probability distances and density ratios.
    → Expand the range of applications of density ratios.
  28. Likelihood of Density Ratios

    ■ Let $r(x)$ be a model of $r^*(x) = \frac{p^*(x)}{q^*(x)}$.
    ■ A model of $p^*(x)$ is given as $p(x) = r(x) q^*(x)$.
    ■ For observations $\{X_i\}_{i=1}^{n} \sim p^*$, the likelihood of the model $p(x)$ is
    $$
    \mathcal{L}(r) = \prod_{i=1}^{n} p(X_i) = \prod_{i=1}^{n} r(X_i)\, q^*(X_i).
    $$
    ■ The log likelihood is given as
    $$
    \ell(r) = \sum_{i=1}^{n} \log r(X_i) + \sum_{i=1}^{n} \log q^*(X_i).
    $$
  29. Nonparametric Maximum Likelihood Estimation of Density Ratios

    ■ We can estimate $r^*$ by solving
    $$
    \max_{r} \frac{1}{n} \sum_{i=1}^{n} \log r(X_i) \quad \text{s.t.} \quad \int r(z)\, q^*(z)\,dz = 1.
    $$
      • The constraint is based on $\int r^*(x)\, q^*(x)\,dx = \int p^*(x)\,dx = 1$.
      • This formulation is equivalent to KLIEP.
    ■ Similarly, for observations $\{Z_j\}_{j=1}^{m} \sim q^*$, we can estimate $1/r^*$ by solving
    $$
    \max_{r} -\frac{1}{m} \sum_{j=1}^{m} \log r(Z_j) \quad \text{s.t.} \quad \int \frac{1}{r(x)}\, p^*(x)\,dx = 1.
    $$
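  30. KL Divergence and Likelihood of Density Ratios

    ■ The KL divergence is $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) := \int p^*(x) \log \frac{p^*(x)}{q^*(x)}\,dx$.
    ■ The KL divergence can be interpreted as the maximized log likelihood because
    $$
    \mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{r \in \mathcal{R}\ \text{s.t.}\ \int r(z) q^*(z)\,dz = 1} \int \log r(x)\, p^*(x)\,dx.
    $$
    ■ Derivation:
    $$
    \begin{aligned}
    \mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) &= \int p^*(x) \log \frac{p^*(x)}{q^*(x)}\,dx \\
      &= \sup_{f \in \mathcal{F}} \left\{ 1 + \int f(x)\, p^*(x)\,dx - \int \exp(f(x))\, q^*(x)\,dx \right\} \\
      &= 1 + \int f^*(x)\, p^*(x)\,dx - \int \exp(f^*(x))\, q^*(x)\,dx \\
      &= \int f^*(x)\, p^*(x)\,dx \\
      &= \sup_{r \in \mathcal{R}\ \text{s.t.}\ \int r(z) q^*(z)\,dz = 1} \int \log r(x)\, p^*(x)\,dx.
    \end{aligned}
    $$
      • Here, $f^* = \log r^*$ attains the supremum, so that $\int \exp(f^*(x))\, q^*(x)\,dx = \int p^*(x)\,dx = 1$.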
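
The identities in this derivation can be checked by Monte Carlo on a pair of Gaussians where everything is available in closed form. The sketch assumes $\mathbb{P} = N(0, 1)$ and $\mathbb{Q} = N(1, 1)$ (a choice made only for illustration), for which $\log r^*(x) = 0.5 - x$ and $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = 0.5$. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, size=n)  # samples from P = N(0, 1)
z = rng.normal(1.0, 1.0, size=n)  # samples from Q = N(1, 1)

def log_ratio(t):
    """f*(t) = log p*(t) - log q*(t) = 0.5 - t for N(0, 1) versus N(1, 1)."""
    return 0.5 - t

def variational_value(f):
    """Monte Carlo estimate of 1 + E_P[f(X)] - E_Q[exp(f(Z))]."""
    return 1.0 + np.mean(f(x)) - np.mean(np.exp(f(z)))

kl_closed_form = 0.5                        # (mean difference)^2 / 2 for unit variances
kl_loglik = np.mean(log_ratio(x))           # E_P[log r*(X)]
kl_variational = variational_value(log_ratio)
kl_suboptimal = variational_value(lambda t: 0.5 * log_ratio(t))

print("closed form:", kl_closed_form)
print("E_P[log r*]:", kl_loglik)
print("variational value at f*:", kl_variational)
print("variational value at 0.5 * f*:", kl_suboptimal)
```

Both estimates at $f^*$ agree with the closed-form value up to Monte Carlo error, and a suboptimal $f$ gives a strictly smaller value, as the supremum representation suggests.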
  31. Stratified Sampling Scheme

    ■ Assume that for all $x \in \mathcal{D}$, there exist $r^*(x)$ and $1/r^*(x)$.
    ■ Define the likelihood of $r$ under a stratified sampling scheme.
    ■ The likelihood uses both $\{X_i\}_{i=1}^{n} \sim p^*$ and $\{Z_j\}_{j=1}^{m} \sim q^*$ simultaneously.
    ■ The likelihood is given as
    $$
    \mathcal{L}(r) = \prod_{i=1}^{n} \tilde{p}_r(X_i) \prod_{j=1}^{m} \tilde{q}_r(Z_j).
    $$
      • This sampling scheme has been considered in causal inference (Imbens and Lancaster (1996)).
  32. Stratified Sampling Scheme

    ■ The objective function is given as
    $$
    \max_{r} \sum_{i=1}^{n} \log r(X_i) - \sum_{j=1}^{m} \log r(Z_j) \quad \text{s.t.} \quad \int \frac{1}{r(x)}\, p^*(x)\,dx = \int r(z)\, q^*(z)\,dz = 1.
    $$
    ■ We consider the following equivalent unconstrained problem:
    $$
    \max_{r} \sum_{i=1}^{n} \log r(X_i) - \sum_{j=1}^{m} \log r(Z_j) - \frac{1}{m} \sum_{j=1}^{m} r(Z_j) - \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r(X_i)}.
    $$
    ■ This transformation is based on the results of Silverman (1982).
      • We can apply this trick to KLIEP (see Nguyen et al. (2008)).
  33. Integral Probability Metrics (IPMs) and Likelihood of Density Ratios

    ■ An IPM with a function class $\mathcal{F}$ defines the distance between two probability distributions $P$ and $Q$ as
    $$
    \sup_{f \in \mathcal{F}} \left\{ \int f(x)\, p^*(x)\,dx - \int f(z)\, q^*(z)\,dz \right\}.
    $$
      • If $\mathcal{F}$ is the class of 1-Lipschitz continuous functions, this distance becomes the Wasserstein distance.
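
For one-dimensional samples, the Wasserstein-1 case of the IPM can be computed without any optimization: for two empirical distributions with the same number of points, it equals the mean absolute difference of the sorted samples. The sketch below uses this fact and compares it with the IPM objective evaluated at a few hand-picked 1-Lipschitz functions. The toy distributions and the helper names are assumptions made for illustration.

```python
import numpy as np

def wasserstein1_1d(x, z):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with equal sample sizes, via sorted samples."""
    x = np.sort(np.asarray(x, dtype=float))
    z = np.sort(np.asarray(z, dtype=float))
    if x.shape != z.shape:
        raise ValueError("this shortcut requires equal sample sizes")
    return float(np.mean(np.abs(x - z)))

def ipm_objective(f, x, z):
    """Empirical IPM objective for one fixed function f: mean f(X) - mean f(Z)."""
    return float(np.mean(f(x)) - np.mean(f(z)))

# Toy data (assumption): P = N(0, 1), Q = N(1, 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20_000)
z = rng.normal(1.0, 1.0, size=20_000)

w1 = wasserstein1_1d(x, z)
# Each of these functions is 1-Lipschitz, so each value is a lower bound on w1.
candidates = {
    "f(t) = -t": ipm_objective(lambda t: -t, x, z),
    "f(t) = sin(t)": ipm_objective(np.sin, x, z),
    "f(t) = |t|": ipm_objective(np.abs, x, z),
}
print("W1 from sorted samples:", w1)
for name, value in candidates.items():
    print(name, "->", value)
```

For a pure location shift, the linear function $f(t) = -t$ nearly attains the supremum, while the other candidates give smaller values.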
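  34. IPMs and Likelihood of Density Ratios

    ■ Consider an exponential density ratio model, $r(x) = \exp(f(x))$.
    ■ An IPM is the maximized log likelihood under the stratified sampling scheme: $\mathrm{IPM}_{\mathcal{C}(\mathcal{F})}(\mathbb{P} \,\|\, \mathbb{Q})$, where
    $$
    \mathcal{C}(\mathcal{F}) = \left\{ f \in \mathcal{F} : \int \exp(f(z))\, q^*(z)\,dz = \int \exp(-f(x))\, p^*(x)\,dx = 1 \right\}.
    $$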
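  35. Density Ratio Metrics (DRMs)

    ■ The density ratio metrics (DRMs, Kato, Imaizumi, and Minami (2022)):
    $$
    \mathrm{DRM}^{\lambda}_{\mathcal{F}}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{f \in \mathcal{C}(\mathcal{F})} \left\{ \lambda \int f(x)\, p^*(x)\,dx - (1 - \lambda) \int f(x)\, q^*(x)\,dx \right\},
    $$
    $$
    \mathcal{C}(\mathcal{F}) = \left\{ f \in \mathcal{F} : \int \exp(f(x))\, q^*(x)\,dx = \int \exp(-f(x))\, p^*(x)\,dx = 1 \right\}.
    $$
      • A distance based on the weighted average of the maximum log likelihood of the density ratio under the stratified sampling scheme ($\lambda \in [0, 1]$).
      • Bridges the IPMs and the KL divergence.
      • DRMs include the KL divergence and IPMs as special cases.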
  36. Density Ratio Metrics (DRMs)

      • If $\lambda = 1/2$, $\mathrm{DRM}^{1/2}_{\mathcal{F}}(P \,\|\, Q) = \frac{1}{2} \mathrm{IPM}_{\mathcal{C}(\mathcal{F})}(P \,\|\, Q)$.
      • If $\lambda = 1$, $\mathrm{DRM}^{1}_{\mathcal{F}}(P \,\|\, Q) = \mathrm{KL}(P \,\|\, Q)$.
      • If $\lambda = 0$, $\mathrm{DRM}^{0}_{\mathcal{F}}(P \,\|\, Q) = \mathrm{KL}(Q \,\|\, P)$.
    ■ Choice of $\mathcal{F}$ → smoothness of the model of the density ratio.
      Ex. Non-negative correction, spectral normalization (Miyato et al. (2016)).
    ■ Probability divergence can be defined without density ratios.
    → What are the advantages? VB methods, DualGAN, causal inference...
    [Figure: from Kato et al. (2022).]
  37. Change-of-Measure Arguments in Best Arm Identification Problem
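  38. MAB Problem

    ■ There are $K$ arms, $[K] = \{1, 2, \dots, K\}$, and a fixed time horizon $T$.
      • Pull an arm $A_t \in [K]$ in each round $t$.
      • Observe a reward of the chosen arm $A_t$, $Y_t = \sum_{a \in [K]} 1[A_t = a]\, Y_{a,t}$, where $Y_{a,t}$ is a (potential) reward of arm $a \in [K]$ in each round $t$.
      • Stop the trial at round $t = T$.
    [Figure: arms $1, 2, \dots, K$ with potential rewards $Y_{1,t}, Y_{2,t}, \dots, Y_{K,t}$ and the observed reward $Y_t = \sum_{a \in [K]} 1[A_t = a]\, Y_{a,t}$ over rounds $t = 1, \dots, T$.]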
  39. MAB Problem

    ■ The distributions of $Y_{a,t}$ do not change across rounds.
    ■ Denote the mean outcome of an arm $a$ by $\mu_a = \mathbb{E}[Y_{a,t}]$.
    ■ Best arm: an arm with the highest mean reward.
      • Denote the best arm by $a^* = \arg\max_{a \in [K]} \mu_a$.
  40. BAI with a Fixed Budget

    ■ BAI with a fixed budget is an instance of the MAB problems.
    ■ In the final round $T$, we estimate the best arm and denote it by $\hat{a}^*_T$.
    ■ Probability of misidentification: $\mathbb{P}(\hat{a}^*_T \ne a^*)$.
    ■ Goal: minimizing the probability of misidentification $\mathbb{P}(\hat{a}^*_T \ne a^*)$.
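
The fixed-budget setting can be simulated in a few lines. The sketch below is not the algorithm proposed in the talk: it assumes Gaussian rewards with unit variance and a simple uniform allocation that pulls every arm equally often and returns the arm with the highest sample mean. The function names and the arm means are hypothetical.

```python
import numpy as np

def uniform_allocation_bai(means, budget, rng, sigma=1.0):
    """Pull each of the K arms budget // K times and return the index of the
    arm with the highest sample mean (uniform allocation, for illustration)."""
    k = len(means)
    pulls = budget // k
    # rewards[t, a] is the reward of arm a in its t-th pull.
    rewards = rng.normal(loc=means, scale=sigma, size=(pulls, k))
    return int(np.argmax(rewards.mean(axis=0)))

def misidentification_rate(means, budget, n_runs=2000, seed=0):
    """Monte Carlo estimate of P(estimated best arm != true best arm)."""
    rng = np.random.default_rng(seed)
    best = int(np.argmax(means))
    errors = sum(uniform_allocation_bai(means, budget, rng) != best
                 for _ in range(n_runs))
    return errors / n_runs

means = np.array([0.5, 0.0, 0.0])   # hypothetical arm means; arm 0 is best
rate_small_budget = misidentification_rate(means, budget=30)
rate_large_budget = misidentification_rate(means, budget=300)
print("T = 30: ", rate_small_budget)
print("T = 300:", rate_large_budget)
```

The estimated misidentification probability drops sharply as the budget $T$ grows, which is the quantity the fixed-budget analysis tries to characterize.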
  41. Theoretical Performance Evaluation

    ■ How to evaluate the performance of BAI algorithms?
    ■ $\mathbb{P}(\hat{a}^*_T \ne a^*)$ converges to 0 with an exponential speed; that is, $\mathbb{P}(\hat{a}^*_T \ne a^*) = \exp(-T(\star))$ for a constant term $(\star)$.
    ■ Consider evaluating the term $(\star)$ by $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \ne a^*)$.
    ■ A performance lower (upper) bound of $\mathbb{P}(\hat{a}^*_T \ne a^*)$ is an upper (lower) bound of $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \ne a^*)$.
  42. Information Theoretic Lower Bound

    ■ Information theoretic lower bound.
      • A lower bound based on the distribution information.
      • This kind of lower bound is typically based on the likelihood ratio, Fisher information, and KL divergence.
    ■ The derivation technique is called change-of-measure arguments.
      • This technique has been used in the MAB problem (Lai and Robbins (1985)).
      • In BAI, Kaufmann et al. (2016) suggest a lower bound.
  43. Lower Bound: Transportation Lemma

    ■ Denote the true distribution (bandit model) by $\nu$.
    ■ Denote a set of alternative hypotheses by $\mathrm{Alt}(\nu)$.
    ■ Consistent algorithm: returns the true best arm with probability 1 as $T \to \infty$.
    ■ Transportation Lemma (Lemma 1 of Kaufmann et al. (2016)): for any $\nu' \in \mathrm{Alt}(\nu)$ and consistent algorithms, if $K = 2$,
    $$
    \limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \ne a^*) \le \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\nu'}\left[ \sum_{a=1}^{K} \sum_{t=1}^{T} 1[A_t = a] \log \frac{f'_a(Y_{a,t})}{f_a(Y_{a,t})} \right],
    $$
    where $f_a$ and $f'_a$ are the pdfs of an arm $a$’s reward under $\nu$ and $\nu'$. The term inside the expectation is a log likelihood ratio.
  44. Open Problem: Optimal Algorithm?

    ■ Open problems:
      1. The bound of Kaufmann et al. (2016) is only applicable to two-armed bandits ($K = 2$).
      2. There is no optimal algorithm whose upper bound achieves the lower bound.
    ■ Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022) propose an optimal algorithm under a small-gap setting.
      1. Consider a small-gap situation: $\Delta_a = \mu_{a^*} - \mu_a \to 0$ for all $a \in [K]$.
      2. Propose a large deviation upper bound.
      3. Then, the upper bound matches the lower bound in the limit $\Delta_a \to 0$.
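  45. Lower Bound under a Small Gap

    ■ Let $I_a(\mu_a)$ be the Fisher information of parameter $\mu_a$ of arm $a$.
    ■ Let $w_a$ be an arm allocation, $\limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\nu}\left[\sum_{t=1}^{T} 1[A_t = a]\right]$.
    ■ Lower Bound (Lemma 1 of Kato et al. (2022)):
    $$
    \limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \ne a^*) \le \sup_{(w_a)} \min_{a \ne a^*} \frac{\Delta_a^2}{2\left(\frac{1}{I_1(\mu_1) w_1} + \frac{1}{I_a(\mu_a) w_a}\right)} + o(\Delta_a^2).
    $$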
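  46. Upper Bound: Large Deviation Principles (LDPs) for Martingales

    ■ Let $\hat{\mu}_{a,T}$ be an estimator of the mean reward $\mu_a$.
    ■ Consider returning $\arg\max_{a} \hat{\mu}_{a,T}$ as an estimated best arm. Then, by the union bound,
    $$
    \mathbb{P}(\hat{a}^*_T \ne a^*) \le \sum_{a \ne a^*} \mathbb{P}\left(\hat{\mu}_{a,T} \ge \hat{\mu}_{a^*,T}\right) = \sum_{a \ne a^*} \mathbb{P}\left(\hat{\mu}_{a^*,T} - \hat{\mu}_{a,T} - \Delta_a \le -\Delta_a\right).
    $$
    ■ LDP: evaluation of $\mathbb{P}\left(\hat{\mu}_{a^*,T} - \hat{\mu}_{a,T} - \Delta_a \le C\right)$, where $C$ is a constant.
      • Central limit theorem (CLT): evaluation of $\mathbb{P}\left(\sqrt{T}\,(\hat{\mu}_{a^*,T} - \hat{\mu}_{a,T} - \Delta_a) \le C\right)$.
      • We cannot use the CLT for obtaining the upper bound.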
  47. Upper Bound: LDPs for Martingales

    ■ There are well-known existing results on large deviation principles.
      Ex. Cramér theorem and Gärtner-Ellis theorem.
      • These results cannot be applied to the BAI problem owing to the non-stationarity of the stochastic process.
    ■ Fan et al. (2013, 2014): LDP for martingales.
      • Key tool: change-of-measure arguments.
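  48. Upper Bound: Large Deviation Principles for Martingales

    ■ Let $\mathbb{P}$ be a probability measure of the original problem.
    ■ Define $U_T = \prod_{t=1}^{T} \frac{\exp(\lambda \xi_t)}{\mathbb{E}[\exp(\lambda \xi_t) \mid \mathcal{F}_{t-1}]}$.
    ■ Define the conjugate probability measure $\mathbb{P}_\lambda$ as $d\mathbb{P}_\lambda = U_T\, d\mathbb{P}$.
      1. Derive the bound on $\mathbb{P}_\lambda$.
      2. Then, transform it to the bound on $\mathbb{P}$ via the density ratio $\frac{d\mathbb{P}}{d\mathbb{P}_\lambda}$.
    [Figure: an upper bound under $\mathbb{P}_\lambda$ is transformed into an upper bound under $\mathbb{P}$ by changing measures with the density ratio between $\mathbb{P}$ and $\mathbb{P}_\lambda$.]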
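
The two-step recipe (work under $\mathbb{P}_\lambda$, then move back to $\mathbb{P}$ with the density ratio) can be illustrated numerically in the simplest case. The sketch assumes i.i.d. standard normal increments $\xi_t$ under $\mathbb{P}$, which is far simpler than the martingale setting of Fan et al.; in that case $U_T = \exp(\lambda S_T - T\lambda^2/2)$ with $S_T = \sum_t \xi_t$, and under $\mathbb{P}_\lambda$ the increments are $N(\lambda, 1)$. All numerical values are arbitrary choices for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
T, a, lam, n_sim = 25, 1.0, 1.0, 100_000

# Sample under the tilted measure P_lambda: i.i.d. N(0, 1) increments under P
# become i.i.d. N(lam, 1) increments under P_lambda.
xi = rng.normal(loc=lam, scale=1.0, size=(n_sim, T))
s = xi.sum(axis=1)

# log U_T, where U_T = prod_t exp(lam * xi_t) / E[exp(lam * xi_t)]
#                    = exp(lam * S_T - T * lam^2 / 2).
log_u = lam * s - T * lam**2 / 2.0

# Move back to P with the density ratio dP / dP_lambda = 1 / U_T:
# P(S_T >= T * a) = E_lambda[ 1{S_T >= T * a} / U_T ].
estimate = float(np.mean((s >= T * a) * np.exp(-log_u)))

# Exact value under P: S_T ~ N(0, T), so this is a standard normal tail.
exact = 0.5 * math.erfc(math.sqrt(T) * a / math.sqrt(2.0))

print("change-of-measure estimate:", estimate)
print("exact tail probability:   ", exact)
```

The event $\{S_T \ge Ta\}$ is extremely rare under $\mathbb{P}$ but typical under $\mathbb{P}_\lambda$ with $\lambda = a$, which is why the reweighted estimate is accurate with a moderate number of samples.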
  49. Upper Bound: Large Deviation Principles for Martingales

    ■ Kato et al. (2022) generalize the result of Fan et al. (2013, 2014).
    ➢ Under an appropriately designed BAI algorithm, we show that the upper bound matches the lower bound.
    ■ Upper Bound (Theorem 4.1 of Kato et al. (2022)): if $\frac{1}{T} \sum_{t=1}^{T} 1[A_t = a] \xrightarrow{\text{a.s.}} w_a$, then under some regularity conditions,
    $$
    \limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \ne a^*) \ge \min_{a \ne a^*} \frac{\Delta_a^2}{2\left(\frac{1}{I_1(\mu_1) w_1} + \frac{1}{I_a(\mu_a) w_a}\right)} + o(\Delta_a^2).
    $$
    ■ This result implies a Gaussian approximation of the LDP as $\Delta_a \to 0$.
  50. Conclusion

    ■ Density-ratio approaches.
      • Inlier-based outlier detection.
      • PU learning.
      • Causal inference.
      • Multi-armed bandit problem (change-of-measure arguments).
    → Useful in many ML applications.
    ■ Other topics: double/debiased machine learning (Chernozhukov et al. (2018)), variational Bayesian methods (Tran et al. (2017)), etc.
  51. Reference

    • Kato, M., and Teshima, T. (2021), “Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation,” in International Conference on Machine Learning.
    • Kato, M., Imaizumi, M., McAlinn, K., Yasui, S., and Kakehi, H. (2022), “Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting,” in International Conference on Learning Representations.
    • Kato, M., Imaizumi, M., and Minami, K. (2022), “Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation: Bridging KL-Divergence and Integral Probability Metrics.”
    • Kanamori, T., Hido, S., and Sugiyama, M. (2009), “A least-squares approach to direct importance estimation,” Journal of Machine Learning Research, 10(Jul.):1391–1445.
    • Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. (2017), “Positive-Unlabeled Learning with Non-Negative Risk Estimator,” in Conference on Neural Information Processing Systems.
    • Imbens, G. W. and Lancaster, T. (1996), “Efficient estimation and stratified sampling,” Journal of Econometrics, 74, 289–318.
    • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/debiased machine learning for treatment and structural parameters,” Econometrics Journal, 21, C1–C68.
    • Good, I. J. and Gaskins, R. A. (1971), “Nonparametric Roughness Penalties for Probability Densities,” Biometrika, 58, 255–277.
    • Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. (2007), “Direct importance estimation with model selection and its application to covariate shift adaptation,” in Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS'07), Curran Associates Inc., Red Hook, NY, USA, 1433–1440.
    • Sugiyama, M., Suzuki, T., and Kanamori, T. (2011), “Density Ratio Matching under the Bregman Divergence: A Unified Framework of Density Ratio Estimation,” Annals of the Institute of Statistical Mathematics, 64.
    • Sugiyama, M., Suzuki, T., and Kanamori, T. (2012), Density Ratio Estimation in Machine Learning, New York, NY, USA: Cambridge University Press, 1st ed.
    • Sugiyama, M. (2016), “Introduction to Statistical Machine Learning.”
    • Silverman, B. W. (1982), “On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method,” The Annals of Statistics, 10, 795–810.
    • Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. (2008), “Approximating mutual information by maximum likelihood density ratio estimation,” in Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008, volume 4 of Proceedings of Machine Learning Research, pp. 5–20. PMLR.
    • Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016), “Generative Adversarial Nets from a Density Ratio Estimation Perspective.”
    • Tran, D., Ranganath, R., and Blei, D. M. (2017), “Hierarchical Implicit Models and Likelihood-Free Variational Inference,” in International Conference on Neural Information, Red Hook, NY, USA, p. 5529–5539.
    • Nguyen, X., Wainwright, M. J., and Jordan, M. (2008), “Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization,” in Conference on Neural Information Processing Systems, vol. 20.
    • Newey, W. K. and Powell, J. L. (2003), “Instrumental variable estimation of nonparametric models,” Econometrica, 71(5):1565–1578.
    • Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. (2011), “Statistical outlier detection using direct density ratio estimation,” Knowledge and Information Systems, 26, 309–336.
    • Lai, T. and Robbins, H. (1985), “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics.
    • Kaufmann, E., Cappé, O., and Garivier, A. (2016), “On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models,” Journal of Machine Learning Research, 17, 1–42.
    • Fan, X., Grama, I., and Liu, Q. (2013), “Cramér large deviation expansions for martingales under Bernstein’s condition,” Stochastic Processes and their Applications, 123, 3919–3942.
    • Fan, X., Grama, I., and Liu, Q. (2014), “A generalization of Cramér large deviations for martingales,” Comptes Rendus Mathematique, 352, 853–858.
    • Shimodaira, H. (2000), “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, 90, 227–244.