
# Recent Findings on Density-Ratio Approaches in Machine Learning

Recent Findings on Density-Ratio Approaches in Machine Learning. FIMI 2022. March 30, 2022

## Transcript

1. Recent Findings on Density-Ratio Approaches in Machine Learning
Workshop on Functional Inference and Machine Intelligence (FIMI), March 30th, 2022
Masahiro Kato
The University of Tokyo, Imaizumi Lab / CyberAgent, Inc. AILab

2. Density-Ratio Approaches in Machine Learning (ML)
■ Consider two distributions 𝑃 and 𝑄 with a common support.
■ Let 𝑝∗ and 𝑞∗ be the density functions of 𝑃 and 𝑄, respectively.
■ Define the density ratio (function) as 𝑟∗(𝑥) = 𝑝∗(𝑥)/𝑞∗(𝑥).
■ Approaches using density ratios
→ Useful in many ML applications.
[Figure: densities 𝑝∗(𝑥) and 𝑞∗(𝑥), and the density ratio 𝑟∗(𝑥) = 𝑝∗(𝑥)/𝑞∗(𝑥).]
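The change-of-measure identity behind these approaches, 𝔼_𝑝[ℎ(𝑋)] = 𝔼_𝑞[𝑟∗(𝑋)ℎ(𝑋)], can be checked numerically. The following sketch (my own illustration, not from the slides) uses two Gaussians whose ratio is known in closed form:

```python
import numpy as np

# p* = N(1, 1), q* = N(0, 1); the true density ratio is
# r*(x) = p*(x)/q*(x) = exp(x - 1/2).
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, 200_000)   # samples from q*
r = np.exp(z - 0.5)                 # r*(Z)

# E_q[r*(X) h(X)] with h(x) = x recovers E_p[X] = 1.
est = np.mean(r * z)
print(est)                          # close to 1.0
```

Reweighting 𝑞∗-samples by the density ratio reproduces expectations under 𝑝∗, which is the mechanism reused throughout the talk.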

3. Density-Ratio Approach in ML
■ Many ML applications involve two or more distributions:
• Classification.
• Divergence between probability measures.
• Multi-armed bandit (MAB) problem (change of measures).
→ In these tasks, the density ratio appears as a key component.

4. Empirical Perspective
■ An estimator of the density ratio 𝑟∗ provides a solution.
• Inlier-based outlier detection: finding outliers based on the density ratio (Hido et al. (2008)).
• Causal inference: conditional moment restrictions can be approximated by the density ratio (Kato et al. (2022)).
• GANs (Goodfellow et al. (2014), Uehara et al. (2016)).
• Variational Bayesian (VB) methods (Tran et al. (2017)), etc.

5. Theoretical Viewpoint
■ Using density ratios is also useful in theoretical analysis.
• Likelihood ratios give tight lower bounds for decision-making problems.
Ex. Lower bounds in MAB problems (Lai and Robbins (1985)).
• Transform an original problem to obtain a tight theoretical result.
Ex. Large deviation principles for martingales (Fan et al. (2013, 2014)).
[Figure: theoretical results under 𝑃 (density 𝑝∗(𝑥)) carried over to 𝑄 (density 𝑞∗(𝑥)) by a change of measure.]

6. Presentation Outline:
Recent (Our) Findings on Density Ratios
1. Density-Ratio Estimation and its Applications
Kato and Teshima (ICML2022), “Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation”
2. Causal Inference and Density Ratios
Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022), “Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting”
3. Density Ratios And Divergences between Probability Measures
Kato, Imaizumi, and Minami (2022), “Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation”
4. Change-of-Measure Arguments in Best Arm Identification Problem
Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022), “Best Arm Identification with a Fixed Budget under a Small Gap”

7. Density-Ratio Estimation
and its Applications
Workshop on Functional Inference and Machine Intelligence
Masahiro Kato, March 30th, 2022
The University of Tokyo / CyberAgent, Inc. AILab

8. Density-Ratio Estimation (DRE)
■ Consider DRE from observations.
■ Two sets of observations: {𝑋ᵢ}ᵢ₌₁ⁿ ∼ 𝑝∗ and {𝑍ⱼ}ⱼ₌₁ᵐ ∼ 𝑞∗.
■ Two-step method:
• Estimate 𝑝∗(𝑥) and 𝑞∗(𝑥); then construct an estimator of 𝑟∗(𝑥).
× Poor empirical performance.
× Weak theoretical guarantees.
→ Consider direct estimation of 𝑟∗(𝑥): LSIF, KLIEP, and PU learning.

9. Least-Squares Importance Fitting (LSIF)
■ Let 𝑟 be a model of the density ratio 𝑟∗.
■ The risk under the squared error is 𝑅(𝑟) = 𝔼_𝑞[(𝑟∗(𝑋) − 𝑟(𝑋))²].
■ The minimizer of the empirical risk, 𝑟̂, is an estimator of 𝑟∗.
■ Instead of 𝑅(𝑟), we minimize an empirical version of the following risk:
𝑅̃(𝑟) = −2𝔼_𝑝[𝑟(𝑋)] + 𝔼_𝑞[𝑟²(𝑋)].
■ This method is called LSIF (Kanamori et al. (2009)).

10. LSIF
■ Derivation:
𝑟∗ = arg min_𝑟 𝔼_𝑞[(𝑟∗(𝑋) − 𝑟(𝑋))²]
   = arg min_𝑟 𝔼_𝑞[𝑟∗(𝑋)² − 2𝑟∗(𝑋)𝑟(𝑋) + 𝑟²(𝑋)]
   = arg min_𝑟 𝔼_𝑞[−2𝑟∗(𝑋)𝑟(𝑋) + 𝑟²(𝑋)]
   = arg min_𝑟 −2𝔼_𝑝[𝑟(𝑋)] + 𝔼_𝑞[𝑟²(𝑋)].
• Here, we used
𝔼_𝑞[𝑟∗(𝑋)𝑟(𝑋)] = ∫ 𝑟∗(𝑥)𝑟(𝑥)𝑞∗(𝑥) d𝑥 = ∫ 𝑟(𝑥)𝑝∗(𝑥) d𝑥 = 𝔼_𝑝[𝑟(𝑋)].
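The surrogate risk above has a closed-form minimizer when 𝑟 is linear in a fixed basis. Below is a minimal LSIF sketch (an illustrative implementation, not the authors' code) with Gaussian basis functions; the kernel width `sigma` and the regularization parameter `lam` are arbitrary choices:

```python
import numpy as np

def lsif(x_p, x_q, centers, sigma=1.0, lam=1e-3):
    """Direct density-ratio estimation by least-squares importance fitting.

    x_p: 1-D samples from p*; x_q: 1-D samples from q*.
    Returns a function x -> r_hat(x), with r_hat linear in Gaussian bases.
    """
    def phi(x):
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

    # Empirical versions of E_q[phi phi^T] and E_p[phi] from the surrogate
    # risk R~(r) = -2 E_p[r(X)] + E_q[r^2(X)].
    H = phi(x_q).T @ phi(x_q) / len(x_q)
    h = phi(x_p).mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: phi(x) @ alpha
```

With 𝑝∗ = 𝑁(0, 1) and 𝑞∗ = 𝑁(0, 2²), the fitted ratio is large near the origin (the true value is 2 at 𝑥 = 0) and decays in the tails.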

11. KL Importance Estimation Procedure (KLIEP)
■ KLIEP (Sugiyama et al. (2007)) is another DRE method that uses the KL divergence between 𝑝∗(𝑥) and a model 𝑝(𝑥) = 𝑟(𝑥)𝑞∗(𝑥):
KL(𝑝∗(𝑥) ∥ 𝑝(𝑥)) = ∫ 𝑝∗(𝑥) log(𝑝∗(𝑥)/𝑝(𝑥)) d𝑥 = ∫ 𝑝∗(𝑥) log(𝑝∗(𝑥)/(𝑟(𝑥)𝑞∗(𝑥))) d𝑥
= ∫ 𝑝∗(𝑥) log(𝑝∗(𝑥)/𝑞∗(𝑥)) d𝑥 − ∫ 𝑝∗(𝑥) log 𝑟(𝑥) d𝑥.
■ From 𝑟∗ = arg min_𝑟 KL(𝑝∗(𝑥) ∥ 𝑝(𝑥)), we estimate 𝑟∗ as
𝑟̂ = arg max_𝑟 (1/𝑛) Σᵢ₌₁ⁿ log 𝑟(𝑋ᵢ) s.t. (1/𝑚) Σⱼ₌₁ᵐ 𝑟(𝑍ⱼ) = 1.
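A small KLIEP-style sketch (my own illustration, not the reference implementation): gradient ascent on the mean log-ratio, with the constraint (1/𝑚) Σⱼ 𝑟(𝑍ⱼ) = 1 enforced by rescaling after every step; the basis, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

def kliep(x_p, x_q, centers, sigma=1.0, lr=0.1, iters=500):
    """KLIEP-style estimation of r* = p*/q* with a linear-in-basis model."""
    def phi(x):
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

    Phi_p, Phi_q = phi(x_p), phi(x_q)
    alpha = np.ones(len(centers))
    for _ in range(iters):
        # Gradient of (1/n) sum_i log r(X_i), where r(x) = phi(x) @ alpha.
        grad = Phi_p.T @ (1.0 / (Phi_p @ alpha)) / len(x_p)
        alpha = np.maximum(alpha + lr * grad, 0.0)   # keep the ratio non-negative
        alpha /= (Phi_q @ alpha).mean()              # enforce (1/m) sum_j r(Z_j) = 1
    return lambda x: phi(x) @ alpha
```

By construction, the fitted ratio averages exactly to 1 over the 𝑞∗-samples, mirroring the KLIEP constraint.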

12. Inlier-based Outlier Detection
■ Find outliers using inliers (correct samples) and the density ratio (Hido et al. (2008)).
• Inliers are sampled from 𝑝∗(𝑥).
• Test data (inliers + outliers) are sampled from 𝑞∗(𝑥).
■ Outlier detection uses the density ratio 𝑟∗(𝑥) = 𝑝∗(𝑥)/𝑞∗(𝑥): test points with small 𝑟∗(𝑥) are flagged as outliers.
[Figure: 𝑝∗(𝑥), 𝑞∗(𝑥), and 𝑟∗(𝑥) = 𝑝∗(𝑥)/𝑞∗(𝑥); from Sugiyama (2016).]
[Table: mean AUC values over 20 trials for the benchmark datasets (Hido et al. (2008)).]

13. Bregman (BR) Divergence Minimization Perspective
■ LSIF and KLIEP can be regarded as special cases of BR divergence minimization (Sugiyama et al. (2012)).
• Let 𝑔(𝑡) be a twice continuously differentiable convex function.
• Using the BR divergence, we can rewrite the objective function as follows:
BR̂_𝑔(𝑟) := 𝔼̂_𝑞[∂𝑔(𝑟(𝑍))𝑟(𝑍) − 𝑔(𝑟(𝑍))] − 𝔼̂_𝑝[∂𝑔(𝑟(𝑋))].
• By changing 𝑔(𝑡), we obtain objective functions for various direct DRE methods.
Ex. 𝑔(𝑡) = (𝑡 − 1)²: LSIF; 𝑔(𝑡) = 𝑡 log 𝑡 − 𝑡: KLIEP.
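The unified objective fits in a few lines of code. The following sketch (my own illustration, with 𝑔 and its derivative passed in) reproduces the LSIF objective and a KLIEP-type objective up to additive constants:

```python
import numpy as np

def bregman_dre_objective(r_p, r_q, g, dg):
    """Empirical BR-divergence objective for direct DRE.

    r_p: model values r(X_i) on samples from p*;
    r_q: model values r(Z_j) on samples from q*;
    g, dg: the convex function g(t) and its derivative g'(t).
    """
    return np.mean(dg(r_q) * r_q - g(r_q)) - np.mean(dg(r_p))

# g(t) = (t - 1)^2 recovers the LSIF objective (up to a constant);
# g(t) = t log t - t recovers a KLIEP-type objective.
lsif_g = (lambda t: (t - 1) ** 2, lambda t: 2 * (t - 1))
kliep_g = (lambda t: t * np.log(t) - t, lambda t: np.log(t))
```

For example, with `lsif_g` the objective evaluates to 𝔼̂_𝑞[𝑟²] − 2𝔼̂_𝑝[𝑟] + 1, and with `kliep_g` it evaluates to 𝔼̂_𝑞[𝑟] − 𝔼̂_𝑝[log 𝑟].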

14. Learning from Positive and Unlabeled Data (PU Learning)
■ PU learning trains a classifier only from positive and unlabeled data (du Plessis et al. (2014, 2015)).
• Positive label: 𝑦 = +1; negative label: 𝑦 = −1.
• Positive data: {𝑥ᵢᵖ} ∼ 𝑝(𝑥 | 𝑦 = +1).
• Unlabeled data: {𝑥ᵢᵘ} ∼ 𝑝(𝑥).

15. Learning from Positive and Unlabeled Data (PU Learning)
■ A classifier 𝑓 can be trained by minimizing the PU risk
ℛ(𝑓) := −𝜋 ∫ log 𝑓(𝑥) 𝑝(𝑥|𝑦 = +1) d𝑥 + 𝜋 ∫ log(1 − 𝑓(𝑥)) 𝑝(𝑥|𝑦 = +1) d𝑥 − ∫ log(1 − 𝑓(𝑥)) 𝑝(𝑥) d𝑥,
where 𝜋 is the class prior, 𝜋 = 𝑝(𝑦 = +1).
■ Overfitting problem in PU learning (Kiryo et al. (2017)):
■ the empirical PU risk is not lower bounded and can go to −∞.
(The second term can go to −∞.)

16. Overfitting and Non-negative Correction
■ Kiryo et al. (2017) propose a non-negative correction based on
𝜋 ∫ log(1 − 𝑓(𝑥)) 𝑝(𝑥|𝑦 = +1) d𝑥 − ∫ log(1 − 𝑓(𝑥)) 𝑝(𝑥) d𝑥 ≥ 0.
■ The non-negative PU risk is given as
ℛ_nnPU(𝑓) := −𝜋 ∫ log 𝑓(𝑥) 𝑝(𝑥|𝑦 = +1) d𝑥 + max{0, 𝜋 ∫ log(1 − 𝑓(𝑥)) 𝑝(𝑥|𝑦 = +1) d𝑥 − ∫ log(1 − 𝑓(𝑥)) 𝑝(𝑥) d𝑥}.
■ In population, ℛ(𝑓) = ℛ_nnPU(𝑓).
■ Minimize an empirical version of ℛ_nnPU(𝑓).
[Figure: from Kiryo et al. (2017).]
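As a concrete illustration (my own sketch, assuming the log loss and classifier outputs 𝑓(𝑥) ∈ (0, 1)), the non-negative PU risk can be computed as:

```python
import numpy as np

def nnpu_risk(f, x_pos, x_unl, pi):
    """Non-negative PU risk with the log loss, in the style of Kiryo et al. (2017).

    f: classifier returning values in (0, 1); pi: class prior p(y = +1).
    """
    pos = pi * np.mean(-np.log(f(x_pos)))               # pi * E_+[-log f]
    # E_u[-log(1 - f)] - pi * E_+[-log(1 - f)] is non-negative in population,
    # but its empirical version can go negative, so it is clipped at 0.
    neg = np.mean(-np.log(1 - f(x_unl))) - pi * np.mean(-np.log(1 - f(x_pos)))
    return pos + max(0.0, neg)
```

For a constant classifier 𝑓 ≡ 1/2 with 𝜋 = 1/2, both terms equal (log 2)/2, so the risk is log 2.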

17. Overfitting and Non-negative Correction
■ In DRE, we face a similar overfitting problem.
■ Kato and Teshima (2021) apply the non-negative correction to DRE.
1. PU learning can also be regarded as BR divergence minimization
(the optimal classifier is 𝑝(𝑦 = 1|𝑥) = 𝜋𝑝(𝑥|𝑦 = +1)/𝑝(𝑥)).
2. They apply the non-negative correction to DRE.
■ In maximum likelihood nonparametric density estimation, this overfitting problem is known as the roughness problem (Good and Gaskins (1971)).
[Figure: densities 𝑝∗(𝑥) and 𝑞∗(𝑥).]

18. Inlier-based Outlier Detection with Deep Neural Networks (DNNs)
■ Inlier-based outlier detection with high-dimensional data (e.g., CIFAR-10).
■ We can use DNNs when combined with the non-negative correction.
■ PU learning-based DRE shows the best performance.
[Table: results from Kato and Teshima (2021).]

19. Causal Inference
and Density Ratios

20. Structural Equation Model
■ Consider the following linear model between 𝑌 and 𝑋:
𝑌 = 𝑋ᵀ𝛽 + 𝜀, 𝔼[𝑋ᵀ𝜀] ≠ 0.
■ 𝔼[𝑋ᵀ𝜀] ≠ 0 implies correlation between 𝜀 and 𝑋.
■ This situation is called endogeneity.
■ In this case, the OLS estimator is neither unbiased nor consistent.
• 𝑋ᵀ𝛽 is not the conditional mean 𝔼[𝑌|𝑋] (𝔼[𝑌|𝑋] ≠ 𝑋ᵀ𝛽).
■ This model is called a structural equation.

21. NPIV: Wage Equation
■ The true wage equation:
log(𝑤𝑎𝑔𝑒) = 𝛽₀ + 𝑦𝑒𝑎𝑟𝑠 𝑜𝑓 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 × 𝛽₁ + 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 × 𝛽₂ + 𝑢,
𝔼[𝑢 | 𝑦𝑒𝑎𝑟𝑠 𝑜𝑓 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛, 𝑎𝑏𝑖𝑙𝑖𝑡𝑦] = 0.
■ We cannot observe "ability," so we estimate the following model:
log(𝑤𝑎𝑔𝑒) = 𝛽₀ + 𝑦𝑒𝑎𝑟𝑠 𝑜𝑓 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 × 𝛽₁ + 𝜀, 𝜀 = 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 × 𝛽₂ + 𝑢.
• If "𝑦𝑒𝑎𝑟𝑠 𝑜𝑓 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛" is correlated with "𝑎𝑏𝑖𝑙𝑖𝑡𝑦," then 𝔼[𝑦𝑒𝑎𝑟𝑠 𝑜𝑓 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 × 𝜀] ≠ 0
→ We cannot consistently estimate 𝛽₁ with OLS.

22. Instrumental Variable (IV) Method
■ By using IVs, we can estimate the parameter 𝛽.
■ An IV is a random variable 𝑍 satisfying the following conditions:
1. Uncorrelated with the error term: 𝔼[𝑍ᵀ𝜀] = 0.
2. Correlated with the endogenous variable 𝑋.
■ Angrist and Krueger (1991): using the quarter of birth as the IV.
[Diagram: 𝑍 (IV) → 𝑋 (years of education) → 𝑌 (wage) with effect 𝛽; 𝑈 (ability) affects both 𝑋 and 𝑌.]
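A small simulation (not from the slides; all constants are arbitrary) showing how an instrument removes the endogeneity bias in the linear case:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                  # instrument: correlated with x, not with eps
u = rng.normal(size=n)                  # unobserved confounder ("ability")
x = 0.8 * z + u + rng.normal(size=n)    # endogenous regressor
eps = u + rng.normal(size=n)            # error term correlated with x through u
y = 1.0 * x + eps                       # true beta = 1.0

beta_ols = (x @ y) / (x @ x)            # OLS: inconsistent under endogeneity
beta_iv = (z @ y) / (z @ x)             # IV estimator: consistent
print(beta_ols, beta_iv)
```

Here `beta_ols` concentrates around 1 + Cov(𝑥, 𝜀)/Var(𝑥) ≈ 1.38, while `beta_iv` concentrates around the true value 1.0.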

23. Nonparametric Instrumental Variable (NPIV) Regression
■ A nonparametric version of IV problems (Newey and Powell (2003)):
𝑌 = 𝑓∗(𝑋) + 𝜀, 𝔼[𝜀|𝑋] ≠ 0.
• We want to estimate the structural function 𝑓∗.
• 𝔼[𝜀|𝑋] ≠ 0 → least squares does not yield a consistent estimator.
■ Instrumental variable 𝑍: the condition for IVs is 𝔼[𝜀|𝑍] = 0.
■ Algorithms: two-stage least squares with series regression (Newey and Powell (2003)), minimax optimization.

24. NPIV via Importance Weighting
■ Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022) solve the problem with an approach similar to covariate shift adaptation (Shimodaira (2000)).
■ From 𝔼_{𝑌,𝑋}[𝜀|𝑍] = 0, if we know 𝑟∗(𝑦, 𝑥|𝑧) = 𝑝∗(𝑦, 𝑥|𝑧)/𝑝(𝑦, 𝑥), we estimate 𝑓∗ by minimizing an empirical approximation of 𝔼_𝑍[(𝔼_{𝑌,𝑋}[𝜀|𝑍])²]:
𝑓̂ = argmin_𝑓 (1/𝑛) Σⱼ₌₁ⁿ [(1/𝑛) Σᵢ₌₁ⁿ (𝑌ᵢ − 𝑓(𝑋ᵢ)) 𝑟∗(𝑌ᵢ, 𝑋ᵢ|𝑍ⱼ)]².
■ We show some theoretical results on the estimation error.

25. NPIV via Importance Weighting
■ Estimate 𝑟∗(𝑦, 𝑥|𝑧) = 𝑝∗(𝑦, 𝑥|𝑧)/𝑝(𝑦, 𝑥) = 𝑝∗(𝑦, 𝑥, 𝑧)/(𝑝(𝑦, 𝑥)𝑝(𝑧)) by applying the idea of LSIF as
𝑟∗ = arg min_𝑟 𝔼_𝑍[𝔼_{𝑌,𝑋}[(𝑟∗(𝑌, 𝑋|𝑍) − 𝑟(𝑌, 𝑋|𝑍))²]]
   = arg min_𝑟 𝔼_𝑍[𝔼_{𝑌,𝑋}[𝑟∗(𝑌, 𝑋|𝑍)² − 2𝑟∗(𝑌, 𝑋|𝑍)𝑟(𝑌, 𝑋|𝑍) + 𝑟²(𝑌, 𝑋|𝑍)]]
   = arg min_𝑟 𝔼_𝑍[𝔼_{𝑌,𝑋}[−2𝑟∗(𝑌, 𝑋|𝑍)𝑟(𝑌, 𝑋|𝑍) + 𝑟²(𝑌, 𝑋|𝑍)]]
   = arg min_𝑟 −2𝔼_{𝑌,𝑋,𝑍}[𝑟(𝑌, 𝑋|𝑍)] + 𝔼_𝑍[𝔼_{𝑌,𝑋}[𝑟²(𝑌, 𝑋|𝑍)]],
where 𝔼_{𝑌,𝑋,𝑍} is taken under the joint density 𝑝∗(𝑦, 𝑥, 𝑧) and 𝔼_𝑍𝔼_{𝑌,𝑋} under the product density 𝑝(𝑦, 𝑥)𝑝(𝑧).
■ KLIEP-based estimation is proposed by Suzuki et al. (2009).

26. Density Ratios And Divergences
between Probability Measures

27. Reconsidering BR Divergence Minimization from the Likelihood Approach
■ Reconsider DRE methods from the maximum likelihood estimation perspective.
• We can define several likelihoods based on different sampling schemes.
■ Maximum likelihood estimation under the stratified sampling scheme is not included in BR divergence minimization.
→ The risk belongs to the integral probability metrics (IPMs).
• IPMs include the Wasserstein distance and MMD as special cases.
■ Reveal the relationships between probability distances and density ratios.
→ Expand the range of applications of density ratios.

28. Likelihood of Density Ratios
■ Let 𝑟(𝑥) be a model of 𝑟∗(𝑥) = 𝑝∗(𝑥)/𝑞∗(𝑥).
■ A model of 𝑝∗(𝑥) is given as 𝑝(𝑥) = 𝑟(𝑥)𝑞∗(𝑥).
■ For observations {𝑋ᵢ}ᵢ₌₁ⁿ ∼ 𝑝∗, the likelihood of the model 𝑝(𝑥) is
ℒ(𝑟) = ∏ᵢ₌₁ⁿ 𝑝(𝑋ᵢ) = ∏ᵢ₌₁ⁿ 𝑟(𝑋ᵢ)𝑞∗(𝑋ᵢ).
■ The log likelihood is given as ℓ(𝑟) = Σᵢ₌₁ⁿ log 𝑟(𝑋ᵢ) + Σᵢ₌₁ⁿ log 𝑞∗(𝑋ᵢ).

29. Nonparametric Maximum Likelihood Estimation of Density Ratios
■ We can estimate 𝑟∗ by solving
max_𝑟 (1/𝑛) Σᵢ₌₁ⁿ log 𝑟(𝑋ᵢ) s.t. ∫ 𝑟(𝑧)𝑞∗(𝑧) d𝑧 = 1.
• The constraint is based on ∫ 𝑟∗(𝑥)𝑞∗(𝑥) d𝑥 = ∫ 𝑝∗(𝑥) d𝑥 = 1.
• This formulation is equivalent to KLIEP.
■ Similarly, for observations {𝑍ⱼ}ⱼ₌₁ᵐ ∼ 𝑞∗, we can estimate 1/𝑟∗ by solving
max_𝑟 (1/𝑚) Σⱼ₌₁ᵐ log(1/𝑟(𝑍ⱼ)) s.t. ∫ (1/𝑟(𝑥))𝑝∗(𝑥) d𝑥 = 1.

30. KL Divergence and Likelihood of Density Ratios
■ The KL divergence is KL(ℙ ∥ ℚ) := ∫ 𝑝∗(𝑥) log(𝑝∗(𝑥)/𝑞∗(𝑥)) d𝑥.
■ The KL divergence can be interpreted as the maximized log likelihood because
KL(ℙ ∥ ℚ) = sup_{𝑟 ∈ ℛ s.t. ∫ 𝑟(𝑧)𝑞∗(𝑧)d𝑧 = 1} ∫ log 𝑟(𝑥) 𝑝∗(𝑥) d𝑥.
■ Derivation:
KL(ℙ ∥ ℚ) = ∫ 𝑝∗(𝑥) log(𝑝∗(𝑥)/𝑞∗(𝑥)) d𝑥
= sup_{𝑓∈ℱ} [1 + ∫ 𝑓(𝑥)𝑝∗(𝑥) d𝑥 − ∫ exp(𝑓(𝑥))𝑞∗(𝑥) d𝑥]
= 1 + ∫ 𝑓∗(𝑥)𝑝∗(𝑥) d𝑥 − ∫ exp(𝑓∗(𝑥))𝑞∗(𝑥) d𝑥
= ∫ 𝑓∗(𝑥)𝑝∗(𝑥) d𝑥
= sup_{𝑟 ∈ ℛ s.t. ∫ 𝑟(𝑧)𝑞∗(𝑧)d𝑧 = 1} ∫ log 𝑟(𝑥) 𝑝∗(𝑥) d𝑥,
where 𝑓∗ = log 𝑟∗, so that ∫ exp(𝑓∗(𝑥))𝑞∗(𝑥) d𝑥 = 1.

31. Stratified Sampling Scheme
■ Assume that for all 𝑥 ∈ 𝒟, both 𝑟∗(𝑥) and 1/𝑟∗(𝑥) exist.
■ Define the likelihood of 𝑟 under a stratified sampling scheme.
■ The likelihood uses both {𝑋ᵢ}ᵢ₌₁ⁿ ∼ 𝑝∗ and {𝑍ⱼ}ⱼ₌₁ᵐ ∼ 𝑞∗ simultaneously.
■ The likelihood is given as ℒ(𝑟) = ∏ᵢ₌₁ⁿ 𝑝ᵣ(𝑋ᵢ) ∏ⱼ₌₁ᵐ 𝑞ᵣ(𝑍ⱼ),
where 𝑝ᵣ(𝑥) = 𝑟(𝑥)𝑞∗(𝑥) and 𝑞ᵣ(𝑥) = (1/𝑟(𝑥))𝑝∗(𝑥) are the models of 𝑝∗ and 𝑞∗.
• This sampling scheme has been considered in causal inference (Imbens and Lancaster (1996)).

32. Stratified Sampling Scheme
■ The objective function is given as
max_𝑟 Σᵢ₌₁ⁿ log 𝑟(𝑋ᵢ) − Σⱼ₌₁ᵐ log 𝑟(𝑍ⱼ)
s.t. ∫ (1/𝑟(𝑥))𝑝∗(𝑥) d𝑥 = ∫ 𝑟(𝑧)𝑞∗(𝑧) d𝑧 = 1.
■ We consider the following equivalent unconstrained problem:
max_𝑟 Σᵢ₌₁ⁿ log 𝑟(𝑋ᵢ) − Σⱼ₌₁ᵐ log 𝑟(𝑍ⱼ) − (1/𝑚) Σⱼ₌₁ᵐ 𝑟(𝑍ⱼ) − (1/𝑛) Σᵢ₌₁ⁿ 1/𝑟(𝑋ᵢ).
■ This transformation is based on the results of Silverman (1982).
• We can apply this trick to KLIEP (see Nguyen et al. (2008)).
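A sketch of the unconstrained problem with an exponential model 𝑟(𝑥) = exp(𝑓(𝑥)) (my own illustration: I normalize the log-likelihood sums by 𝑛 and 𝑚 so that all four terms are sample means, and the basis, step size, and iteration count are arbitrary):

```python
import numpy as np

def stratified_dre(x_p, x_q, centers, sigma=1.0, lr=0.05, iters=1000):
    """Maximize mean log r(X) - mean log r(Z) - mean r(Z) - mean 1/r(X)
    over r(x) = exp(f(x)), with f linear in Gaussian basis functions."""
    def phi(x):
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

    Pp, Pq = phi(x_p), phi(x_q)
    a = np.zeros(len(centers))
    for _ in range(iters):
        rp, rq = np.exp(Pp @ a), np.exp(Pq @ a)
        grad = (Pp.mean(axis=0)                      # d/da of mean log r(X)
                - Pq.mean(axis=0)                    # d/da of -mean log r(Z)
                - (Pq * rq[:, None]).mean(axis=0)    # d/da of -mean r(Z)
                + (Pp / rp[:, None]).mean(axis=0))   # d/da of -mean 1/r(X)
        a += lr * grad
    return lambda x: np.exp(phi(x) @ a)
```

The penalty terms play the role of the two constraints: at the optimum, the fitted ratio approximately integrates to 1 against 𝑞∗ and its reciprocal against 𝑝∗.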

33. Integral Probability Metrics (IPMs) and Likelihood of Density Ratios
■ An IPM with a function class ℱ defines the distance between two probability distributions 𝑃 and 𝑄 as
sup_{𝑓∈ℱ} [∫ 𝑓(𝑥)𝑝∗(𝑥) d𝑥 − ∫ 𝑓(𝑧)𝑞∗(𝑧) d𝑧].
• If ℱ is the class of 1-Lipschitz continuous functions, this distance becomes the Wasserstein distance.

34. IPMs and Likelihood of Density Ratios
■ Consider an exponential density ratio model, 𝑟(𝑥) = exp(𝑓(𝑥)).
■ The IPM is then the maximized log likelihood under the stratified sampling scheme:
IPM_{𝐶(ℱ)}(ℙ ∥ ℚ), where
𝐶(ℱ) = {𝑓 ∈ ℱ : ∫ exp(𝑓(𝑧)) 𝑞∗(𝑧) d𝑧 = ∫ exp(−𝑓(𝑥)) 𝑝∗(𝑥) d𝑥 = 1}.

35. Density Ratio Metrics (DRMs)
■ The density ratio metrics (DRMs, Kato, Imaizumi, and Minami (2022)):
DRM_ℱ^𝜆(ℙ ∥ ℚ) = sup_{𝑓∈𝐶(ℱ)} [𝜆 ∫ 𝑓(𝑥)𝑝∗(𝑥) d𝑥 − (1 − 𝜆) ∫ 𝑓(𝑥)𝑞∗(𝑥) d𝑥],
𝐶(ℱ) = {𝑓 ∈ ℱ : ∫ exp(𝑓(𝑥)) 𝑞∗(𝑥) d𝑥 = ∫ exp(−𝑓(𝑥)) 𝑝∗(𝑥) d𝑥 = 1}.
• A distance based on the weighted average of the maximum log likelihoods of the density ratio under the stratified sampling scheme (𝜆 ∈ [0, 1]).
• Bridges the IPMs and the KL divergence.
• DRMs include the KL divergence and IPMs as special cases.

36. Density Ratio Metrics (DRMs)
• If 𝜆 = 1/2, DRM_ℱ^{1/2}(𝑃 ∥ 𝑄) = (1/2) IPM_{𝐶(ℱ)}(𝑃 ∥ 𝑄).
• If 𝜆 = 1, DRM_ℱ^1(𝑃 ∥ 𝑄) = KL(𝑃 ∥ 𝑄).
• If 𝜆 = 0, DRM_ℱ^0(𝑃 ∥ 𝑄) = KL(𝑄 ∥ 𝑃).
■ Choice of ℱ → smoothness of the model of the density ratio.
Ex. Non-negative correction, spectral normalization (Miyato et al. (2018)).
■ Probability divergences can be defined without density ratios.
→ What are the advantages? VB methods, DualGAN, causal inference...
[Figure: from Kato et al. (2022).]

37. Change-of-Measure Arguments
in Best Arm Identification Problem

38. MAB Problem
■ There are 𝐾 arms, [𝐾] = {1, 2, …, 𝐾}, and a fixed time horizon 𝑇.
• Pull an arm 𝐴_𝑡 ∈ [𝐾] in each round 𝑡.
• Observe the reward of the chosen arm 𝐴_𝑡: 𝑌_𝑡 = Σ_{𝑎∈[𝐾]} 1[𝐴_𝑡 = 𝑎] 𝑌_{𝑎,𝑡},
where 𝑌_{𝑎,𝑡} is a (potential) reward of arm 𝑎 ∈ [𝐾] in round 𝑡.
• Stop the trial at round 𝑡 = 𝑇.
[Diagram: arms 1, …, 𝐾 with potential rewards 𝑌_{1,𝑡}, …, 𝑌_{𝐾,𝑡}; the observed reward is 𝑌_𝑡 = Σ_{𝑎∈[𝐾]} 1[𝐴_𝑡 = 𝑎] 𝑌_{𝑎,𝑡} over rounds 𝑡 = 1, …, 𝑇.]

39. MAB Problem
■ The distribution of 𝑌_{𝑎,𝑡} does not change across rounds.
■ Denote the mean reward of an arm 𝑎 by 𝜇_𝑎 = 𝔼[𝑌_{𝑎,𝑡}].
■ Best arm: the arm with the highest mean reward.
• Denote the best arm by 𝑎∗ = arg max_{𝑎∈[𝐾]} 𝜇_𝑎.

40. BAI with a Fixed Budget
■ Best arm identification (BAI) with a fixed budget is an instance of the MAB problem.
■ In the final round 𝑇, we estimate the best arm and denote the estimate by 𝑎̂_𝑇∗.
■ Probability of misidentification: ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗).
■ Goal: minimize the probability of misidentification ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗).

41. Theoretical Performance Evaluation
■ How do we evaluate the performance of BAI algorithms?
■ ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) converges to 0 at an exponential speed; that is,
ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) = exp(−𝑇(⋆))
for a constant term (⋆).
■ Consider evaluating the term (⋆) by lim sup_{𝑇→∞} −(1/𝑇) log ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗).
■ A lower (upper) bound on the performance ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) is
an upper (lower) bound on lim sup_{𝑇→∞} −(1/𝑇) log ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗).
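The exponential decay can be seen in a toy simulation (my own sketch, not from the slides, assuming two Gaussian arms and a uniform allocation; all constants are arbitrary):

```python
import numpy as np

def misid_prob(T, mu=(0.5, 0.4), sigma=1.0, trials=5000, seed=0):
    """Monte Carlo estimate of P(a_hat_T != a*) for K = 2 Gaussian arms,
    pulling each arm T/2 times and returning the larger sample mean."""
    rng = np.random.default_rng(seed)
    n = T // 2
    m_best = rng.normal(mu[0], sigma, (trials, n)).mean(axis=1)
    m_other = rng.normal(mu[1], sigma, (trials, n)).mean(axis=1)
    return np.mean(m_other >= m_best)

# -(1/T) log P stabilizes as T grows, reflecting P = exp(-T * (constant)).
for T in (400, 800, 1600):
    p = misid_prob(T)
    print(T, p, -np.log(p) / T)
```

The misidentification probability shrinks rapidly with the budget 𝑇, which is why the exponent, not the probability itself, is the natural object to bound.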

42. Information-Theoretic Lower Bound
■ Information-theoretic lower bound:
• a lower bound based on information about the distributions.
• This kind of lower bound is typically based on the likelihood ratio, Fisher information, and KL divergence.
■ The derivation technique is called the change-of-measure argument.
• This technique has been used in the MAB problem (Lai and Robbins (1985)).
• In BAI, Kaufmann et al. (2016) derive a lower bound.

43. Lower Bound: Transportation Lemma
■ Denote the true distribution (bandit model) by 𝑣.
■ Denote a set of alternative hypotheses by Alt(𝑣).
■ Consistent algorithm: an algorithm that returns the true best arm with probability 1 as 𝑇 → ∞.
Transportation Lemma (Lemma 1 of Kaufmann et al. (2016)): For any 𝑣′ ∈ Alt(𝑣) and any consistent algorithm, if 𝐾 = 2,
lim sup_{𝑇→∞} −(1/𝑇) log ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) ≤ lim sup_{𝑇→∞} (1/𝑇) 𝔼_{𝑣′}[Σ_{𝑎=1}^{𝐾} Σ_{𝑡=1}^{𝑇} 1[𝐴_𝑡 = 𝑎] log(𝑓′_𝑎(𝑌_𝑡)/𝑓_𝑎(𝑌_𝑡))],
where 𝑓_𝑎 and 𝑓′_𝑎 are the pdfs of arm 𝑎's reward under 𝑣 and 𝑣′, and the log term is the log likelihood ratio.

44. Open Problem: Optimal Algorithm?
■ Open problems:
1. Kaufmann et al.'s bound is only applicable to the two-armed bandit (𝐾 = 2).
2. There is no optimal algorithm whose upper bound achieves the lower bound.
■ Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022) propose an optimal algorithm under a small-gap setting:
1. Consider a small-gap situation: Δ_𝑎 = 𝜇_{𝑎∗} − 𝜇_𝑎 → 0 for all 𝑎 ∈ [𝐾].
2. Prove a large deviation upper bound.
3. The upper bound then matches the lower bound in the limit Δ_𝑎 → 0.

45. Lower Bound under a Small Gap
■ Let 𝐼_𝑎(𝜇_𝑎) be the Fisher information for the parameter 𝜇_𝑎 of arm 𝑎.
■ Let 𝑤_𝑎 be the arm-allocation ratio lim sup_{𝑇→∞} 𝔼[Σ_{𝑡=1}^{𝑇} 1[𝐴_𝑡 = 𝑎]]/𝑇.
Lower Bound (Lemma 1 of Kato et al. (2022)):
lim sup_{𝑇→∞} −(1/𝑇) log ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) ≤ sup_{(𝑤_𝑎)} min_{𝑎≠𝑎∗} Δ_𝑎² / (2(1/(𝐼₁(𝜇₁)𝑤₁) + 1/(𝐼_𝑎(𝜇_𝑎)𝑤_𝑎))) + 𝑜(Δ_𝑎²),
where arm 1 denotes the best arm 𝑎∗.

46. Upper Bound: Large Deviation Principles (LDPs) for Martingales
■ Let 𝜇̂_{𝑎,𝑇} be an estimator of the mean reward 𝜇_𝑎.
■ Consider returning arg max_𝑎 𝜇̂_{𝑎,𝑇} as the estimated best arm. Then,
ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) = Σ_{𝑎≠𝑎∗} ℙ(𝜇̂_{𝑎,𝑇} ≥ 𝜇̂_{𝑎∗,𝑇}) = Σ_{𝑎≠𝑎∗} ℙ(𝜇̂_{𝑎∗,𝑇} − 𝜇̂_{𝑎,𝑇} − Δ_𝑎 ≤ −Δ_𝑎).
■ LDP: evaluation of ℙ(𝜇̂_{𝑎∗,𝑇} − 𝜇̂_{𝑎,𝑇} − Δ_𝑎 ≤ 𝐶), where 𝐶 is a constant.
• Central limit theorem (CLT): evaluation of ℙ(√𝑇(𝜇̂_{𝑎∗,𝑇} − 𝜇̂_{𝑎,𝑇} − Δ_𝑎) ≤ 𝐶).
• We cannot use the CLT to obtain the upper bound.

47. Upper Bound: LDPs for Martingales
■ There are well-known existing results on large deviation principles.
Ex. the Cramér theorem and the Gärtner–Ellis theorem.
• These results cannot be applied to the BAI problem owing to the non-stationarity of the stochastic process.
■ Fan et al. (2013, 2014): LDPs for martingales.
• Key tool: change-of-measure arguments.

48. Upper Bound: Large Deviation Principles for Martingales
■ Let ℙ be the probability measure of the original problem.
■ Define 𝑈_𝑇 = ∏_{𝑡=1}^{𝑇} exp(𝜆𝜉_𝑡)/𝔼[exp(𝜆𝜉_𝑡)|ℱ_{𝑡−1}].
■ Define the conjugate probability measure ℙ_𝜆 by dℙ_𝜆 = 𝑈_𝑇 dℙ.
1. Derive the bound under ℙ_𝜆.
2. Then, transform it into a bound under ℙ via the density ratio dℙ/dℙ_𝜆.
[Diagram: change of measure between ℙ and ℙ_𝜆 through dℙ_𝜆/dℙ = 𝑈_𝑇, transferring an upper bound under ℙ_𝜆 to an upper bound under ℙ.]

49. Upper Bound: Large Deviation Principles for Martingales
■ Kato et al. (2022) generalize the result of Fan et al. (2013, 2014).
→ Under an appropriately designed BAI algorithm, we show that the upper bound matches the lower bound.
Upper Bound (Theorem 4.1 of Kato et al. (2022)): If Σ_{𝑡=1}^{𝑇} 1[𝐴_𝑡 = 𝑎]/𝑇 → 𝑤_𝑎 a.s., then under some regularity conditions,
lim sup_{𝑇→∞} −(1/𝑇) log ℙ(𝑎̂_𝑇∗ ≠ 𝑎∗) ≥ min_{𝑎≠𝑎∗} Δ_𝑎² / (2(1/(𝐼₁(𝜇₁)𝑤₁) + 1/(𝐼_𝑎(𝜇_𝑎)𝑤_𝑎))) + 𝑜(Δ_𝑎²).
■ This result implies a Gaussian approximation of the LDP as Δ_𝑎 → 0.

50. Conclusion

51. Conclusion
■ Density-ratio approaches:
• Inlier-based outlier detection.
• PU learning.
• Causal inference.
• Multi-armed bandit problem (change-of-measure arguments).
→ Useful in many ML applications.
■ Other topics: double/debiased machine learning (Chernozhukov et al. (2018)), variational Bayesian methods (Tran et al. (2017)), etc.

52. References
• Kato, M., and Teshima, T. (2022), "Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation," in International Conference on Machine Learning.
• Kato, M., Imaizumi, M., McAlinn, K., Yasui, S., and Kakehi, H. (2022), "Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting," in International Conference on Learning Representations.
• Kato, M., Imaizumi, M., and Minami, K. (2022), "Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation: Bridging KL-Divergence and Integral Probability Metrics."
• Kanamori, T., Hido, S., and Sugiyama, M. (2009), "A Least-Squares Approach to Direct Importance Estimation," Journal of Machine Learning Research, 10(Jul.), 1391–1445.
• Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. (2017), "Positive-Unlabeled Learning with Non-Negative Risk Estimator," in Conference on Neural Information Processing Systems.
• Imbens, G. W., and Lancaster, T. (1996), "Efficient Estimation and Stratified Sampling," Journal of Econometrics, 74, 289–318.
• Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), "Double/Debiased Machine Learning for Treatment and Structural Parameters," Econometrics Journal, 21, C1–C68.
• Good, I. J., and Gaskins, R. A. (1971), "Nonparametric Roughness Penalties for Probability Densities," Biometrika, 58, 255–277.
• Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. (2007), "Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation," in Conference on Neural Information Processing Systems, pp. 1433–1440.
• Sugiyama, M., Suzuki, T., and Kanamori, T. (2011), "Density Ratio Matching under the Bregman Divergence: A Unified Framework of Density Ratio Estimation," Annals of the Institute of Statistical Mathematics, 64.
• Sugiyama, M., Suzuki, T., and Kanamori, T. (2012), Density Ratio Estimation in Machine Learning, Cambridge University Press, 1st ed.
• Sugiyama, M. (2016), Introduction to Statistical Machine Learning.
• Silverman, B. W. (1982), "On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method," The Annals of Statistics, 10, 795–810.
• Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. (2008), "Approximating Mutual Information by Maximum Likelihood Density Ratio Estimation," in Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008, vol. 4 of Proceedings of Machine Learning Research, pp. 5–20.
• Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016), "Generative Adversarial Nets from a Density Ratio Estimation Perspective."
• Tran, D., Ranganath, R., and Blei, D. M. (2017), "Hierarchical Implicit Models and Likelihood-Free Variational Inference," in Conference on Neural Information Processing Systems, pp. 5529–5539.
• Nguyen, X., Wainwright, M. J., and Jordan, M. (2008), "Estimating Divergence Functionals and the Likelihood Ratio by Penalized Convex Risk Minimization," in Conference on Neural Information Processing Systems, vol. 20.
• Newey, W. K., and Powell, J. L. (2003), "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71(5), 1565–1578.
• Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. (2011), "Statistical Outlier Detection Using Direct Density Ratio Estimation," Knowledge and Information Systems, 26, 309–336.
• Lai, T. L., and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics.
• Kaufmann, E., Cappé, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," Journal of Machine Learning Research, 17, 1–42.
• Fan, X., Grama, I., and Liu, Q. (2013), "Cramér Large Deviation Expansions for Martingales under Bernstein's Condition," Stochastic Processes and their Applications, 123, 3919–3942.
• Fan, X., Grama, I., and Liu, Q. (2014), "A Generalization of Cramér Large Deviations for Martingales," Comptes Rendus Mathematique, 352, 853–858.
• Shimodaira, H. (2000), "Improving Predictive Inference under Covariate Shift by Weighting the Log-Likelihood Function," Journal of Statistical Planning and Inference, 90, 227–244.