Recent Findings on Density-Ratio Approaches in Machine Learning

MasaKat0
March 30, 2022


  1. Recent Findings on Density-Ratio Approaches in Machine Learning
    Workshop on FIMI (Functional Inference and Machine Intelligence), March 30th, 2022
    Masahiro Kato
    The University of Tokyo, Imaizumi Lab / CyberAgent, Inc. AILab

  2. Density-Ratio Approaches
    in Machine Learning (ML)
    • Consider two distributions $P$ and $Q$ with a common support.
    • Let $p^*$ and $q^*$ be the density functions of $P$ and $Q$, respectively.
    • Define the density ratio (function) as $r^*(x) = \frac{p^*(x)}{q^*(x)}$.
    • Approaches using density ratios are useful in many ML applications.
    [Figure: the two densities $p^*(x)$ and $q^*(x)$, and the density ratio $r^*(x) = p^*(x)/q^*(x)$.]
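For concreteness, here is a minimal sketch (my own toy example, not from the slides) that evaluates the density ratio of two Gaussian distributions and checks the identity $\int r^*(z)\, q^*(z)\,\mathrm{d}z = 1$ by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# p* = N(0, 1) and q* = N(1, 2): two densities with a common support (the real line).
p_star, q_star = norm(0.0, 1.0), norm(1.0, 2.0)

def density_ratio(x):
    """True density ratio r*(x) = p*(x) / q*(x)."""
    return p_star.pdf(x) / q_star.pdf(x)

# Sanity check: E_{q*}[r*(Z)] = ∫ r*(z) q*(z) dz = ∫ p*(z) dz = 1.
z = q_star.rvs(size=100_000, random_state=rng)
print(density_ratio(z).mean())  # ≈ 1.0
```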

  3. Density-Ratio Approach in ML
    • Many ML applications involve two or more distributions:
      • Classification.
      • Generative adversarial networks (GAN).
      • Divergences between probability measures.
      • Multi-armed bandit (MAB) problems (change of measures).
    → In these tasks, the density ratio appears as a key component.

  4. Empirical Perspective
    • An estimator of the density ratio $r^*$ provides a solution.
      • Inlier-based outlier detection: finding outliers based on the density ratio (Hido et al. (2008)).
      • Causal inference: conditional moment restrictions can be approximated via the density ratio (Kato et al. (2022)).
      • GAN (Goodfellow et al. (2014), Uehara et al. (2016)).
      • Variational Bayesian (VB) methods (Tran et al. (2017)), etc.

  5. Theoretical Viewpoint
    • Using density ratios is also useful in theoretical analysis.
      • The likelihood ratio gives tight lower bounds for decision-making problems.
        Ex. Lower bounds in MAB problems (Lai and Robbins (1985)).
      • Transform an original problem to obtain a tight theoretical result.
        Ex. Large deviation principles for martingales (Fan et al. (2013, 2014)).
    [Figure: theoretical results are transferred between the distribution $P$ (density $p^*(x)$) and the distribution $Q$ (density $q^*(x)$).]

  6. Presentation Outline:
    Recent (Our) Findings on Density Ratios
    1. Density-Ratio Estimation and its Applications
    Kato and Teshima (ICML2022), “Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation”
    2. Causal Inference and Density Ratios
    Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022), “Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting”
    3. Density Ratios And Divergences between Probability Measures
    Kato, Imaizumi, and Minami (2022), “Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation”
    4. Change-of-Measure Arguments in Best Arm Identification Problem
    Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022), “Best Arm Identification with a Fixed Budget under a Small Gap”

  7. Density-Ratio Estimation
    and its Applications
    Workshop on Functional Inference and Machine Intelligence
    Masahiro Kato, March 30th, 2022
    The University of Tokyo / CyberAgent, Inc. AILab

  8. Density-Ratio Estimation (DRE)
    • Consider DRE from observations.
    • Two sets of observations: $\{X_i\}_{i=1}^{n} \sim p^*$ and $\{Z_j\}_{j=1}^{m} \sim q^*$.
    • Two-step method:
      • Estimate $p^*(x)$ and $q^*(x)$; then construct an estimator of $r^*(x)$.
      ✗ Empirical performance suffers.
      ✗ Theoretical guarantees are weak.
    → Consider direct estimation of $r^*(x)$: LSIF, KLIEP, and PU learning.

  9. Least-Squares Importance Fitting (LSIF)
    • Let $r$ be a model of the density ratio $r^*$.
    • The squared-error risk is $R(r) = \mathbb{E}_{q^*}\big[(r^*(X) - r(X))^2\big]$.
    • The minimizer of the empirical risk, $\hat{r}$, is an estimator of $r^*$.
    • Instead of $R(r)$, we minimize an empirical version of the risk
      $\tilde{R}(r) = -2\,\mathbb{E}_{p^*}[r(X)] + \mathbb{E}_{q^*}[r^2(X)]$.
    • This method is called LSIF (Kanamori et al. (2009)).

  10. LSIF
    • Derivation:
      $r^* = \arg\min_{r} \mathbb{E}_{q^*}\big[(r^*(X) - r(X))^2\big]$
      $\;\;= \arg\min_{r} \mathbb{E}_{q^*}\big[r^*(X)^2 - 2 r^*(X) r(X) + r^2(X)\big]$
      $\;\;= \arg\min_{r} \mathbb{E}_{q^*}\big[-2 r^*(X) r(X) + r^2(X)\big]$
      $\;\;= \arg\min_{r} \;-2\,\mathbb{E}_{p^*}[r(X)] + \mathbb{E}_{q^*}\big[r^2(X)\big]$.
    • Here, we used
      $\mathbb{E}_{q^*}[r^*(X) r(X)] = \int r^*(x) r(x) q^*(x)\,\mathrm{d}x = \int r(x) p^*(x)\,\mathrm{d}x = \mathbb{E}_{p^*}[r(X)]$.
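As an illustration, the following is a minimal uLSIF-style sketch (my own simplification, not the authors' code): a linear-in-parameter model $r(x) = \sum_l \alpha_l \varphi_l(x)$ with Gaussian basis functions, whose regularized empirical LSIF risk has a closed-form minimizer. The function names and hyperparameters (`sigma`, `lam`, `n_centers`) are illustrative; in practice they are typically chosen by cross-validation, as in Kanamori et al. (2009).

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian basis functions φ_l(x) = exp(-||x - c_l||^2 / (2 σ^2))."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lsif_fit(x_p, x_q, sigma=1.0, lam=1e-3, n_centers=100):
    """LSIF sketch with a linear-in-parameter model r(x) = Σ_l α_l φ_l(x).
    Minimizes -2/n Σ_i r(X_i) + 1/m Σ_j r(Z_j)^2 + λ||α||^2, which is quadratic
    in α and therefore has a closed-form solution."""
    centers = x_p[:min(n_centers, len(x_p))]
    Phi_p = gaussian_kernel(x_p, centers, sigma)   # n x L, samples from p*
    Phi_q = gaussian_kernel(x_q, centers, sigma)   # m x L, samples from q*
    h = Phi_p.mean(axis=0)                         # (1/n) Σ_i φ(X_i)
    H = Phi_q.T @ Phi_q / len(x_q)                 # (1/m) Σ_j φ(Z_j) φ(Z_j)^T
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: gaussian_kernel(x, centers, sigma) @ alpha

# Toy check in 1-D: p* = N(0, 1), q* = N(0, 2), so r*(0) = 2.
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(500, 1))
x_q = rng.normal(0.0, 2.0, size=(500, 1))
r_hat = lsif_fit(x_p, x_q)
print(r_hat(np.zeros((1, 1))))   # should be noticeably larger than 1 near 0
```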

  11. KL Importance Estimation Procedure
    (KLIEP)
    • KLIEP (Sugiyama et al. (2007)) is another DRE method. It uses the KL divergence between $p^*(x)$ and a model $p(x) = r(x)\, q^*(x)$:
      $\mathrm{KL}(p^*(x) \,\|\, p(x)) = \int p^*(x) \log \frac{p^*(x)}{p(x)}\,\mathrm{d}x = \int p^*(x) \log \frac{p^*(x)}{r(x)\, q^*(x)}\,\mathrm{d}x$
      $\;\;= \int p^*(x) \log \frac{p^*(x)}{q^*(x)}\,\mathrm{d}x - \int p^*(x) \log r(x)\,\mathrm{d}x$.
    • From $r^* = \arg\min_{r} \mathrm{KL}(p^*(x) \,\|\, p(x))$, we estimate $r^*$ as
      $\hat{r} = \arg\max_{r} \frac{1}{n}\sum_{i=1}^{n} \log r(X_i) \quad \text{s.t.} \quad \frac{1}{m}\sum_{j=1}^{m} r(Z_j) = 1$.
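Below is a minimal KLIEP-style sketch (my own simplification, not the authors' implementation): a non-negative linear-in-parameter model fitted by projected gradient ascent on the empirical log-likelihood, renormalized after each step so that the constraint $\frac{1}{m}\sum_j r(Z_j) = 1$ holds. Hyperparameter names (`sigma`, `lr`, `n_iter`) are illustrative.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kliep_fit(x_p, x_q, sigma=1.0, n_centers=100, lr=1e-3, n_iter=2000):
    """KLIEP sketch: maximize (1/n) Σ_i log r(X_i) subject to
    (1/m) Σ_j r(Z_j) = 1, with r(x) = Σ_l α_l φ_l(x) and α ≥ 0."""
    centers = x_p[:min(n_centers, len(x_p))]
    Phi_p = gaussian_kernel(x_p, centers, sigma)   # samples from p*
    Phi_q = gaussian_kernel(x_q, centers, sigma)   # samples from q*
    alpha = np.ones(len(centers))
    alpha /= Phi_q.mean(axis=0) @ alpha            # satisfy the constraint initially
    for _ in range(n_iter):
        grad = (Phi_p / (Phi_p @ alpha)[:, None]).mean(axis=0)  # ∇ of the mean log-likelihood
        alpha = np.maximum(alpha + lr * grad, 0.0)               # ascent step, keep α ≥ 0
        alpha /= Phi_q.mean(axis=0) @ alpha                      # project back onto the constraint
    return lambda x: gaussian_kernel(x, centers, sigma) @ alpha
```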

  12. Inlier-based Outlier Detection
    • Find outliers using inliers (correct samples) and the density ratio (Hido et al. (2008)).
      • Inliers are sampled from $p^*(x)$.
      • Test data (inliers + outliers) are sampled from $q^*(x)$.
    • Outlier detection uses the density ratio $r^*(x) = \frac{p^*(x)}{q^*(x)}$ (a small ratio suggests an outlier).
    [Figure: densities $p^*(x)$, $q^*(x)$, and the ratio $r^*(x) = p^*(x)/q^*(x)$; from Sugiyama (2016).]
    [Table: mean AUC values over 20 trials on the benchmark datasets (Hido et al. (2008)).]

  13. Bregman (BR) Divergence Minimization
    Perspective
    • LSIF and KLIEP can be regarded as special cases of BR divergence minimization (Sugiyama et al. (2012)).
      • Let $g(t)$ be a twice continuously differentiable convex function.
      • Using the BR divergence, we can write the empirical objective function as
        $\widehat{\mathrm{BR}}_g(r) := \widehat{\mathbb{E}}_{q^*}\big[\partial g(r(Z_j))\, r(Z_j) - g(r(Z_j))\big] - \widehat{\mathbb{E}}_{p^*}\big[\partial g(r(X_i))\big]$.
      • By changing $g(t)$, we obtain the objective functions of various direct DRE methods.
        Ex. $g(t) = (t-1)^2$: LSIF; $g(t) = t \log t - t$: KLIEP.
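The unifying view can be made concrete with a small sketch (my own, assuming the empirical objective written above): a generic function that evaluates the empirical BR objective for any convex $g$, with the two choices of $g$ from the slide recovering LSIF-type and KLIEP-type objectives up to constants.

```python
import numpy as np

def br_empirical_risk(r_p, r_q, g, dg):
    """Empirical Bregman-divergence DRE objective for a convex g with derivative dg:
        (1/m) Σ_j [∂g(r(Z_j)) r(Z_j) - g(r(Z_j))] - (1/n) Σ_i ∂g(r(X_i)),
    where r_p = r(X_i) (samples from p*) and r_q = r(Z_j) (samples from q*)."""
    return np.mean(dg(r_q) * r_q - g(r_q)) - np.mean(dg(r_p))

def lsif_objective(r_p, r_q):
    # g(t) = (t - 1)^2 recovers the LSIF objective (up to an additive constant).
    return br_empirical_risk(r_p, r_q, g=lambda t: (t - 1.0) ** 2,
                             dg=lambda t: 2.0 * (t - 1.0))

def kliep_type_objective(r_p, r_q):
    # g(t) = t log t - t gives a KLIEP-type (unnormalized KL) objective.
    return br_empirical_risk(r_p, r_q, g=lambda t: t * np.log(t) - t, dg=np.log)
```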

  14. Learning from Positive and Unlabeled Data
    (PU Learning)
    • PU learning trains a classifier from only positive and unlabeled data (du Plessis et al. (2014, 2015)).
      • Positive label: $y = +1$; negative label: $y = -1$.
      • Positive data: $\{x^{\mathrm{p}}_i\}_{i=1}^{n_{\mathrm{p}}} \sim p(x \mid y = +1)$.
      • Unlabeled data: $\{x^{\mathrm{u}}_i\}_{i=1}^{n_{\mathrm{u}}} \sim p(x)$.

  15. Learning from Positive and Unlabeled Data
    (PU Learning)
    • A classifier $f$ can be trained by minimizing
      $\mathcal{R}(f) := -\pi \int \log f(x)\, p(x \mid y{=}{+}1)\,\mathrm{d}x + \pi \int \log (1 - f(x))\, p(x \mid y{=}{+}1)\,\mathrm{d}x - \int \log (1 - f(x))\, p(x)\,\mathrm{d}x$,
      where $\pi$ is the class prior, $\pi = p(y = +1)$.
    • Overfitting problem in PU learning (Kiryo et al. (2017)):
    • The empirical PU risk is not bounded from below and can go to $-\infty$
      (the last two terms, estimated from positive and unlabeled samples, can diverge to $-\infty$).

  16. Overfitting and Non-negative Correction
    • Kiryo et al. (2017) propose a non-negative correction based on
      $\pi \int \log (1 - f(x))\, p(x \mid y{=}{+}1)\,\mathrm{d}x - \int \log (1 - f(x))\, p(x)\,\mathrm{d}x \geq 0$.
    • The non-negative PU risk is
      $\mathcal{R}_{\mathrm{nnPU}}(f) := -\pi \int \log f(x)\, p(x \mid y{=}{+}1)\,\mathrm{d}x + \max\Big\{0,\; \pi \int \log (1 - f(x))\, p(x \mid y{=}{+}1)\,\mathrm{d}x - \int \log (1 - f(x))\, p(x)\,\mathrm{d}x\Big\}$.
    • In population, $\mathcal{R}(f) = \mathcal{R}_{\mathrm{nnPU}}(f)$.
    • Minimize an empirical version of $\mathcal{R}_{\mathrm{nnPU}}(f)$.
    [Figure: from Kiryo et al. (2017).]
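A minimal sketch of the empirical non-negative PU risk with the log loss (my own simplification; Kiryo et al. (2017) additionally use a gradient-switching heuristic during training, which is omitted here):

```python
import numpy as np

def nnpu_risk(f_pos, f_unl, pi):
    """Non-negative PU empirical risk with the log loss.
    f_pos, f_unl in (0, 1): classifier outputs on positive and unlabeled samples;
    pi: class prior p(y = +1)."""
    loss_pos = -np.log(f_pos)            # loss for label +1 on positive samples
    loss_neg_pos = -np.log1p(-f_pos)     # loss for label -1 on positive samples
    loss_neg_unl = -np.log1p(-f_unl)     # loss for label -1 on unlabeled samples
    # Unbiased "negative part" of the risk; can become negative in finite samples.
    neg_part = loss_neg_unl.mean() - pi * loss_neg_pos.mean()
    return pi * loss_pos.mean() + max(0.0, neg_part)
```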

  17. Overfitting and Non-negative Correction
    • In DRE, we face a similar overfitting problem.
    • Kato and Teshima (2021) apply the non-negative correction to DRE:
      1. PU learning can also be regarded as BR divergence minimization
         (the optimal classifier is $p(y = 1 \mid x) = \frac{\pi\, p(x \mid y = +1)}{p(x)}$).
      2. They apply the non-negative correction to DRE.
    • In nonparametric maximum likelihood density estimation, this overfitting problem is known as the roughness problem (Good and Gaskins (1971)).

  18. Inlier-based Outlier Detection
    with Deep Neural Networks (DNNs)
    • Inlier-based outlier detection with high-dimensional data (e.g., CIFAR-10).
    • DNNs can be used when combined with the non-negative correction.
    • PU learning-based DRE shows the best performance.
    [Table: from Kato and Teshima (2021).]

  19. Causal Inference
    and Density Ratios
    Workshop on Functional Inference and Machine Intelligence
    Masahiro Kato, March 30th, 2022
    The University of Tokyo / CyberAgent, Inc. AILab

  20. Structural Equation Model
    • Consider the following linear model between $Y$ and $X$:
      $Y = X^\top \beta + \varepsilon, \quad \mathbb{E}[X \varepsilon] \neq 0$.
    • $\mathbb{E}[X \varepsilon] \neq 0$ means that $\varepsilon$ and $X$ are correlated.
    • This situation is called endogeneity.
    • In this case, the OLS estimator is neither unbiased nor consistent.
      • $X^\top \beta$ is not the conditional mean $\mathbb{E}[Y \mid X]$ (i.e., $\mathbb{E}[Y \mid X] \neq X^\top \beta$).
    • This model is called a structural equation.

  21. NPIV: Wage Equation
    • The true wage equation:
      $\log(\mathit{wage}) = \beta_0 + \mathit{years\ of\ education} \times \beta_1 + \mathit{ability} \times \beta_2 + u$,
      $\mathbb{E}[u \mid \mathit{years\ of\ education}, \mathit{ability}] = 0$.
    • We cannot observe "ability," so we estimate the following model:
      $\log(\mathit{wage}) = \beta_0 + \mathit{years\ of\ education} \times \beta_1 + \varepsilon, \quad \varepsilon = \mathit{ability} \times \beta_2 + u$.
      • If "years of education" is correlated with "ability," then $\mathbb{E}[\mathit{years\ of\ education} \times \varepsilon] \neq 0$.
    → We cannot consistently estimate $\beta_1$ with OLS.

  22. Instrumental Variable (IV) Method
    • By using IVs, we can estimate the parameter $\beta$.
    • An IV is a random variable $Z$ satisfying the following conditions:
      1. Uncorrelated with the error term: $\mathbb{E}[Z \varepsilon] = 0$.
      2. Correlated with the endogenous variable $X$.
    • Angrist and Krueger (1991): using the quarter of birth as the IV.
    [Figure: diagram with $Z$ (IV), $X$ (years of education), $Y$ (wage), $U$ (ability), and the parameter $\beta$ on the edge from $X$ to $Y$.]
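As a toy illustration of the wage example (my own simulated data, not from the slides), the just-identified linear IV estimator $\hat{\beta}_{\mathrm{IV}} = (Z^\top X)^{-1} Z^\top Y$ recovers the structural coefficient while OLS does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
ability = rng.normal(size=n)                       # unobserved confounder U
z = rng.normal(size=n)                             # instrument: affects X but not Y directly
x = 0.8 * z + ability + rng.normal(size=n)         # endogenous regressor (years of education)
y = 1.5 * x + 2.0 * ability + rng.normal(size=n)   # outcome (log wage), true beta_1 = 1.5

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # biased: picks up the ability effect
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)        # consistent under E[Z eps] = 0
print(beta_ols[1], beta_iv[1])                     # OLS slope > 1.5, IV slope ≈ 1.5
```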

  23. Nonparametric Instrumental Variable
    (NPIV) Regression
    • A nonparametric version of the IV problem (Newey and Powell (2003)):
      $Y = f^*(X) + \varepsilon, \quad \mathbb{E}[\varepsilon \mid X] \neq 0$.
      • We want to estimate the structural function $f^*$.
      • $\mathbb{E}[\varepsilon \mid X] \neq 0$ → least squares does not yield a consistent estimator.
    • Instrumental variable $Z$: the condition for IVs is $\mathbb{E}[\varepsilon \mid Z] = 0$.
    • Algorithms: two-stage least squares with series regression (Newey and Powell (2003)), minimax optimization, etc.

  24. NPIV via Importance Weighting
    • Kato, Imaizumi, McAlinn, Yasui, and Kakehi (ICLR2022) solve the problem with an approach similar to covariate shift adaptation (Shimodaira (2000)).
    • From $\mathbb{E}_{Y,X}[\varepsilon \mid Z] = 0$: if we know $r^*(y, x \mid z) = \frac{p^*(y, x \mid z)}{p(y, x)}$, we estimate $f^*$ by minimizing an empirical approximation of $\mathbb{E}_{Z}\big[\big(\mathbb{E}_{Y,X}[\varepsilon \mid Z]\big)^2\big]$:
      $\hat{f} = \arg\min_{f} \frac{1}{n}\sum_{j=1}^{n}\Big(\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - f(X_i)\big)\, r^*(Y_i, X_i \mid Z_j)\Big)^2$.
    • We show theoretical results on the estimation error.
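Here is a minimal sketch of the importance-weighted objective with a linear model $f(x) = x^\top \beta$ (my own simplification, not the authors' implementation). The weight matrix `W`, an estimate of $r^*(Y_i, X_i \mid Z_j)$, is assumed to be given by a separate conditional DRE step such as the LSIF-style estimator described on the next slide.

```python
import numpy as np

def npiv_weighted_ls(Y, X, W):
    """Importance-weighted NPIV sketch with a linear model f(x) = x^T beta.
    Y: (n,), X: (n, d), W: (n, n) with W[i, j] ≈ r*(Y_i, X_i | Z_j).
    Minimizes (1/n) Σ_j [ (1/n) Σ_i (Y_i - f(X_i)) W[i, j] ]^2, which is a
    least-squares problem in beta and is solved in closed form."""
    n = len(Y)
    A = W.T @ X / n          # A[j] = (1/n) Σ_i W[i, j] X_i
    b = W.T @ Y / n          # b[j] = (1/n) Σ_i W[i, j] Y_i
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return beta
```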

  25. NPIV via Importance Weighting
    • Estimate $r^*(y, x \mid z) = \frac{p^*(y, x \mid z)}{p(y, x)} = \frac{p^*(y, x, z)}{p(y, x)\, p(z)}$ by applying the idea of LSIF:
      $r^* = \arg\min_{r} \mathbb{E}_{Z}\,\mathbb{E}_{Y,X}\big[\big(r^*(Y, X \mid Z) - r(Y, X \mid Z)\big)^2\big]$
      $\;\;= \arg\min_{r} \mathbb{E}_{Z}\,\mathbb{E}_{Y,X}\big[r^*(Y, X \mid Z)^2 - 2\, r^*(Y, X \mid Z)\, r(Y, X \mid Z) + r^2(Y, X \mid Z)\big]$
      $\;\;= \arg\min_{r} \mathbb{E}_{Z}\,\mathbb{E}_{Y,X}\big[-2\, r^*(Y, X \mid Z)\, r(Y, X \mid Z) + r^2(Y, X \mid Z)\big]$
      $\;\;= \arg\min_{r} \;-2\,\mathbb{E}_{Y,X,Z}\big[r(Y, X \mid Z)\big] + \mathbb{E}_{Z}\,\mathbb{E}_{Y,X}\big[r^2(Y, X \mid Z)\big]$,
      where $\mathbb{E}_{Z}\,\mathbb{E}_{Y,X}$ is taken over the product density $p(y, x)\, p(z)$ and $\mathbb{E}_{Y,X,Z}$ over the joint density $p^*(y, x, z)$.
    • A KLIEP-based estimator is proposed by Suzuki et al. (2009).

  26. Density Ratios And Divergences
    between Probability Measures
    Workshop on Functional Inference and Machine Intelligence
    Masahiro Kato, March 30th, 2022
    The University of Tokyo / CyberAgent, Inc. AILab

  27. Reconsidering the BR Divergence
    Minimization from the Likelihood Approach
    • Reconsider DRE methods from the perspective of maximum likelihood estimation.
      • We can define several likelihoods based on different sampling schemes.
    • Maximum likelihood estimation under the stratified sampling scheme is not included in the BR divergence framework.
      → The corresponding risk belongs to the integral probability metrics (IPMs).
      • IPMs include the Wasserstein distance and MMD as special cases.
    • Reveal the relationships between probability divergences and density ratios.
    → Expand the range of applications of density ratios.

  28. Likelihood of Density Ratios
    • Let $r(x)$ be a model of $r^*(x) = \frac{p^*(x)}{q^*(x)}$.
    • A model of $p^*(x)$ is given as $p(x) = r(x)\, q^*(x)$.
    • For observations $\{X_i\}_{i=1}^{n} \sim p^*$, the likelihood of the model $p(x)$ is
      $\mathcal{L}(r) = \prod_{i=1}^{n} p(X_i) = \prod_{i=1}^{n} r(X_i)\, q^*(X_i)$.
    • The log-likelihood is $\ell(r) = \sum_{i=1}^{n} \log r(X_i) + \sum_{i=1}^{n} \log q^*(X_i)$.

  29. Nonparametric Maximum Likelihood
    Estimation of Density Ratios
    • We can estimate $r^*$ by solving
      $\max_{r} \frac{1}{n}\sum_{i=1}^{n} \log r(X_i) \quad \text{s.t.} \quad \int r(z)\, q^*(z)\,\mathrm{d}z = 1$.
      • The constraint is based on $\int r^*(x)\, q^*(x)\,\mathrm{d}x = \int p^*(x)\,\mathrm{d}x = 1$.
      • This formulation is equivalent to KLIEP.
    • Similarly, for observations $\{Z_j\}_{j=1}^{m} \sim q^*$, we can estimate $1/r^*$ by solving
      $\max_{r} \frac{1}{m}\sum_{j=1}^{m} \log \frac{1}{r(Z_j)} \quad \text{s.t.} \quad \int \frac{1}{r(x)}\, p^*(x)\,\mathrm{d}x = 1$.

  30. KL Divergence and Likelihood of Density
    Ratios
    • The KL divergence is $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) := \int p^*(x) \log \frac{p^*(x)}{q^*(x)}\,\mathrm{d}x$.
    • The KL divergence can be interpreted as the maximized log-likelihood because
      $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{r \in \mathcal{R} \,:\, \int r(z)\, q^*(z)\,\mathrm{d}z = 1}\; \int \log r(x)\, p^*(x)\,\mathrm{d}x$.
    • Derivation: $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = \int p^*(x) \log \frac{p^*(x)}{q^*(x)}\,\mathrm{d}x = \sup_{f \in \mathcal{F}}\Big[1 + \int f(x)\, p^*(x)\,\mathrm{d}x - \int \exp(f(x))\, q^*(x)\,\mathrm{d}x\Big]$
      $\;= 1 + \int f^*(x)\, p^*(x)\,\mathrm{d}x - \int \exp(f^*(x))\, q^*(x)\,\mathrm{d}x = \int f^*(x)\, p^*(x)\,\mathrm{d}x = \sup_{r \in \mathcal{R} \,:\, \int r(z)\, q^*(z)\,\mathrm{d}z = 1}\; \int \log r(x)\, p^*(x)\,\mathrm{d}x$,
      where $f^*(x) = \log \frac{p^*(x)}{q^*(x)}$ attains the supremum.

  31. Stratified Sampling Scheme
    • Assume that $r^*(x)$ and $1/r^*(x)$ exist for all $x \in \mathcal{D}$.
    • Define the likelihood of $r$ under a stratified sampling scheme.
    • The likelihood uses both $\{X_i\}_{i=1}^{n} \sim p^*$ and $\{Z_j\}_{j=1}^{m} \sim q^*$ simultaneously.
    • The likelihood is $\mathcal{L}(r) = \prod_{i=1}^{n} \tilde{p}_r(X_i) \prod_{j=1}^{m} \tilde{q}_r(Z_j)$, where $\tilde{p}_r(x) = r(x)\, q^*(x)$ and $\tilde{q}_r(x) = p^*(x)/r(x)$ are the induced models of $p^*$ and $q^*$.
      • This sampling scheme has been considered in causal inference (Imbens and Lancaster (1996)).

  32. Stratified Sampling Scheme
    • The objective function is
      $\max_{r}\; \sum_{i=1}^{n} \log r(X_i) - \sum_{j=1}^{m} \log r(Z_j) \quad \text{s.t.} \quad \int \frac{1}{r(x)}\, p^*(x)\,\mathrm{d}x = \int r(z)\, q^*(z)\,\mathrm{d}z = 1$.
    • We consider the following equivalent unconstrained problem (evaluated in the sketch below):
      $\max_{r}\; \sum_{i=1}^{n} \log r(X_i) - \sum_{j=1}^{m} \log r(Z_j) - \frac{1}{m}\sum_{j=1}^{m} r(Z_j) - \frac{1}{n}\sum_{i=1}^{n} \frac{1}{r(X_i)}$.
    • This transformation is based on the results of Silverman (1982).
      • We can apply this trick to KLIEP (see Nguyen et al. (2008)).
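A small sketch (my own, following the slide's formula as reconstructed above; not the authors' code) that evaluates the unconstrained stratified-sampling objective given the model's values on the two samples:

```python
import numpy as np

def stratified_objective(r_p, r_q):
    """Unconstrained stratified-sampling objective:
        Σ_i log r(X_i) - Σ_j log r(Z_j) - (1/m) Σ_j r(Z_j) - (1/n) Σ_i 1 / r(X_i),
    where r_p = r(X_i) on samples from p* and r_q = r(Z_j) on samples from q*."""
    return (np.log(r_p).sum() - np.log(r_q).sum()
            - r_q.mean() - (1.0 / r_p).mean())
```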

  33. Integral Probability Metrics (IPMs) and
    Likelihood of Density Ratios
    • An IPM with a function class $\mathcal{F}$ defines the distance between two probability distributions $P$ and $Q$ as
      $\sup_{f \in \mathcal{F}} \Big[\int f(x)\, p^*(x)\,\mathrm{d}x - \int f(z)\, q^*(z)\,\mathrm{d}z\Big]$.
      • If $\mathcal{F}$ is a class of Lipschitz continuous functions, this distance becomes the Wasserstein distance.

  34. IPMs and Likelihood of Density Ratios
    • Consider an exponential density-ratio model, $r(x) = \exp(f(x))$.
    • The IPM over the constrained class $C(\mathcal{F})$, $\mathrm{IPM}_{C(\mathcal{F})}(\mathbb{P} \,\|\, \mathbb{Q})$, is the maximized log-likelihood under the stratified sampling scheme, where
      $C(\mathcal{F}) = \Big\{ f \in \mathcal{F} : \int \exp(f(z))\, q^*(z)\,\mathrm{d}z = \int \exp(-f(x))\, p^*(x)\,\mathrm{d}x = 1 \Big\}$.

  35. Density Ratio Metrics (DRMs)
    • The density ratio metrics (DRMs, Kato, Imaizumi, and Minami (2022)):
      $\mathrm{DRM}^{\lambda}_{\mathcal{F}}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{f \in C(\mathcal{F})} \Big[\lambda \int f(x)\, p^*(x)\,\mathrm{d}x - (1 - \lambda) \int f(x)\, q^*(x)\,\mathrm{d}x\Big]$,
      $C(\mathcal{F}) = \Big\{ f \in \mathcal{F} : \int \exp(f(x))\, q^*(x)\,\mathrm{d}x = \int \exp(-f(x))\, p^*(x)\,\mathrm{d}x = 1 \Big\}$.
      • A distance based on the weighted average of the maximum log-likelihoods of the density ratio under the stratified sampling scheme ($\lambda \in [0, 1]$).
      • Bridges the IPMs and the KL divergence.
      • DRMs include the KL divergence and IPMs as special cases.

  36. Density Ratio Metrics (DRMs)
    • If $\lambda = 1/2$, $\mathrm{DRM}^{1/2}_{\mathcal{F}}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{IPM}_{C(\mathcal{F})}(P \,\|\, Q)$.
    • If $\lambda = 1$, $\mathrm{DRM}^{1}_{\mathcal{F}}(P \,\|\, Q) = \mathrm{KL}(P \,\|\, Q)$.
    • If $\lambda = 0$, $\mathrm{DRM}^{0}_{\mathcal{F}}(P \,\|\, Q) = \mathrm{KL}(Q \,\|\, P)$.
    • Choice of $\mathcal{F}$ → smoothness of the model of the density ratio.
      Ex. Non-negative correction, spectral normalization (Miyato et al. (2016)).
    • Probability divergences can be defined without density ratios.
      → What are the advantages? VB methods, DualGAN, causal inference, ...
    [Figure: from Kato et al. (2022).]

  37. Change-of-Measure Arguments
    in Best Arm Identification Problem
    Workshop on Functional Inference and Machine Intelligence
    Masahiro Kato, March 30th, 2022
    The University of Tokyo / CyberAgent, Inc. AILab

  38. MAB Problem
    • There are $K$ arms, $[K] = \{1, 2, \dots, K\}$, and a fixed time horizon $T$.
      • Pull an arm $A_t \in [K]$ in each round $t$.
      • Observe the reward of the chosen arm $A_t$:
        $Y_t = \sum_{a \in [K]} \mathbb{1}[A_t = a]\, Y_{a,t}$,
        where $Y_{a,t}$ is a (potential) reward of arm $a \in [K]$ in round $t$.
      • Stop the trial at round $t = T$.
    [Figure: in each round $t = 1, \dots, T$, one of the arms $1, \dots, K$ is pulled and the reward $Y_t = \sum_{a \in [K]} \mathbb{1}[A_t = a]\, Y_{a,t}$ is observed.]

  39. MAB Problem
    • The distribution of $Y_{a,t}$ does not change across rounds.
    • Denote the mean outcome of an arm $a$ by $\mu_a = \mathbb{E}[Y_{a,t}]$.
    • Best arm: the arm with the highest mean reward.
      • Denote the best arm by $a^* = \arg\max_{a \in [K]} \mu_a$.

  40. BAI with a Fixed Budget
    • BAI with a fixed budget is an instance of the MAB problem.
    • In the final round $T$, we estimate the best arm and denote the estimate by $\hat{a}^*_T$.
    • Probability of misidentification: $\mathbb{P}(\hat{a}^*_T \neq a^*)$.
    • Goal: minimize the probability of misidentification $\mathbb{P}(\hat{a}^*_T \neq a^*)$.

  41. Theoretical Performance Evaluation
    • How do we evaluate the performance of BAI algorithms?
    • $\mathbb{P}(\hat{a}^*_T \neq a^*)$ converges to $0$ at an exponential speed; that is,
      $\mathbb{P}(\hat{a}^*_T \neq a^*) = \exp(-T \cdot (\star))$
      for a constant term $(\star)$.
    • Consider evaluating the term $(\star)$ by $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \neq a^*)$.
    • A lower (upper) bound on the performance $\mathbb{P}(\hat{a}^*_T \neq a^*)$ is an upper (lower) bound on $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \neq a^*)$.

  42. Information Theoretic Lower Bound
    • Information-theoretic lower bound:
      • A lower bound based on the information of the distributions.
      • This kind of lower bound is typically based on the likelihood ratio, Fisher information, and KL divergence.
    • The derivation technique is called a change-of-measure argument.
      • This technique has been used in the MAB problem (Lai and Robbins (1985)).
      • In BAI, Kaufmann et al. (2016) suggest a lower bound.

  43. Lower Bound: Transportation Lemma
    • Denote the true distribution (bandit model) by $v$.
    • Denote the set of alternative hypotheses by $\mathrm{Alt}(v)$.
    • Consistent algorithm: returns the true best arm with probability $1$ as $T \to \infty$.
    Transportation lemma (Lemma 1 of Kaufmann et al. (2016)):
    • For any $v' \in \mathrm{Alt}(v)$ and any consistent algorithm, if $K = 2$,
      $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \neq a^*) \leq \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{v'}\Big[\sum_{a=1}^{K} \sum_{t=1}^{T} \mathbb{1}[A_t = a] \log \frac{f'_a(Y_t)}{f_a(Y_t)}\Big]$,
      where $f_a$ and $f'_a$ are the pdfs of arm $a$'s reward under $v$ and $v'$ (the inner sum is a log-likelihood ratio).

  44. Open Problem: Optimal Algorithm?
    • Open problems:
      1. Kaufmann et al.'s bound is only applicable to two-armed bandits ($K = 2$).
      2. There is no optimal algorithm whose upper bound achieves the lower bound.
    • Kato, Ariu, Imaizumi, Uehara, Nomura, and Qin (2022) propose an optimal algorithm under a small-gap setting:
      1. Consider a small-gap situation: $\Delta_a = \mu_{a^*} - \mu_a \to 0$ for all $a \in [K]$.
      2. Prove a large-deviation upper bound.
      3. The upper bound then matches the lower bound in the limit $\Delta_a \to 0$.

  45. Lower Bound under a Small Gap
    • Let $I_a(\mu_a)$ be the Fisher information of the parameter $\mu_a$ of arm $a$.
    • Let $w_a$ be the arm-allocation ratio $\limsup_{T \to \infty} \mathbb{E}\big[\sum_{t=1}^{T} \mathbb{1}[A_t = a]\big] / T$.
    Lower bound (Lemma 1 of Kato et al. (2022)):
      $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \neq a^*) \leq \sup_{(w_a)_a} \min_{a \neq a^*} \frac{\Delta_a^2}{2\Big(\frac{1}{I_1(\mu_1)\, w_1} + \frac{1}{I_a(\mu_a)\, w_a}\Big)} + o(\Delta_a^2)$,
      where arm $1$ denotes the best arm $a^*$.

  46. Upper Bound: Large Deviation Principles
    (LDPs) for Martingales
    • Let $\hat{\mu}_{a,T}$ be an estimator of the mean reward $\mu_a$.
    • Consider returning $\arg\max_{a} \hat{\mu}_{a,T}$ as the estimated best arm. Then
      $\mathbb{P}(\hat{a}^*_T \neq a^*) = \sum_{a \neq a^*} \mathbb{P}\big(\hat{\mu}_{a,T} \geq \hat{\mu}_{a^*,T}\big) = \sum_{a \neq a^*} \mathbb{P}\big(\hat{\mu}_{a^*,T} - \hat{\mu}_{a,T} - \Delta_a \leq -\Delta_a\big)$.
    • LDP: evaluation of $\mathbb{P}\big(\hat{\mu}_{a^*,T} - \hat{\mu}_{a,T} - \Delta_a \leq C\big)$, where $C$ is a constant.
      • Central limit theorem (CLT): evaluation of $\mathbb{P}\big(\sqrt{T}(\hat{\mu}_{a^*,T} - \hat{\mu}_{a,T} - \Delta_a) \leq C\big)$.
      • We cannot use the CLT to obtain the upper bound.
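As a numerical illustration (my own toy check, not from the paper), for a two-armed Gaussian bandit with unit variances and uniform allocation, the misidentification probability of the empirical-best-arm rule can be computed exactly, and its exponent approaches the rate $\Delta^2 / \big(2(\sigma_1^2/w_1 + \sigma_2^2/w_2)\big) = \Delta^2/8$ appearing in the bounds (with $I_a(\mu_a) = 1/\sigma_a^2$):

```python
import numpy as np
from scipy.stats import norm

# Two-armed Gaussian bandit, unit variances, uniform allocation w_1 = w_2 = 1/2,
# gap Δ = μ_1 - μ_2.  With T/2 pulls per arm, μ̂_1 - μ̂_2 ~ N(Δ, 4/T), so the
# misidentification probability P(μ̂_2 ≥ μ̂_1) has a closed form, and its
# exponent can be compared with the large-deviation rate Δ²/8.
delta = 0.1
rate = delta ** 2 / 8.0
for T in [10**3, 10**4, 10**5, 10**6]:
    log_p_err = norm.logsf(delta / np.sqrt(4.0 / T))   # log P(μ̂_2 - μ̂_1 ≥ 0)
    print(T, -log_p_err / T, rate)
# -log(P)/T approaches Δ²/8 as T grows, matching the exponent in the bounds.
```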

  47. Upper Bound: LDPs for Martingales
    • There are well-known existing results on large deviation principles.
      Ex. The Cramér theorem and the Gärtner-Ellis theorem.
      • These results cannot be applied to the BAI problem owing to the non-stationarity of the stochastic process.
    • Fan et al. (2013, 2014): LDPs for martingales.
      • Key tool: change-of-measure arguments.

  48. Upper Bound: Large Deviation Principles
    for Martingales
    • Let $\mathbb{P}$ be the probability measure of the original problem.
    • Define $U_T = \prod_{t=1}^{T} \frac{\exp(\lambda \xi_t)}{\mathbb{E}[\exp(\lambda \xi_t) \mid \mathcal{F}_{t-1}]}$.
    • Define the conjugate probability measure $\mathbb{P}_{\lambda}$ by $\mathrm{d}\mathbb{P}_{\lambda} = U_T\, \mathrm{d}\mathbb{P}$.
      1. Derive the bound under $\mathbb{P}_{\lambda}$.
      2. Then, transform it into a bound under $\mathbb{P}$ via the density ratio $\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}_{\lambda}}$.
    [Figure: upper bounds are transferred between $\mathbb{P}$ and $\mathbb{P}_{\lambda}$ via the change of measure $U_T$.]

  49. Upper Bound: Large Deviation Principles
    for Martingales
    • Kato et al. (2022) generalize the result of Fan et al. (2013, 2014).
      → Under an appropriately designed BAI algorithm, we show that the upper bound matches the lower bound.
    Upper bound (Theorem 4.1 of Kato et al. (2022)):
    • If $\sum_{t=1}^{T} \mathbb{1}[A_t = a] / T \xrightarrow{\mathrm{a.s.}} w_a$, then under some regularity conditions,
      $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}(\hat{a}^*_T \neq a^*) \geq \min_{a \neq a^*} \frac{\Delta_a^2}{2\Big(\frac{1}{I_1(\mu_1)\, w_1} + \frac{1}{I_a(\mu_a)\, w_a}\Big)} + o(\Delta_a^2)$.
    • This result implies a Gaussian approximation of the LDP as $\Delta_a \to 0$.

  50. Conclusion

  51. Conclusion
    • Density-ratio approaches:
      • Inlier-based outlier detection.
      • PU learning.
      • Causal inference.
      • Multi-armed bandit problems (change-of-measure arguments).
    → Useful in many ML applications.
    • Other topics: double/debiased machine learning (Chernozhukov et al. (2018)), variational Bayesian methods (Tran et al. (2017)), etc.

  52. Reference
    • Kato, M., and Teshima, T. (2022), "Non-negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation," in International Conference on Machine Learning.
    • Kato, M., Imaizumi, M., McAlinn, K., Yasui, S., and Kakehi, H. (2022), "Learning Causal Relationships from Conditional Moment Restrictions by Importance Weighting," in International Conference on Learning Representations.
    • Kato, M., Imaizumi, M., and Minami, K. (2022), "Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation: Bridging KL-Divergence and Integral Probability Metrics."
    • Kanamori, T., Hido, S., and Sugiyama, M. (2009), "A Least-Squares Approach to Direct Importance Estimation," Journal of Machine Learning Research, 10(Jul.), 1391–1445.
    • Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. (2017), "Positive-Unlabeled Learning with Non-Negative Risk Estimator," in Conference on Neural Information Processing Systems.
    • Imbens, G. W. and Lancaster, T. (1996), "Efficient Estimation and Stratified Sampling," Journal of Econometrics, 74, 289–318.
    • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), "Double/Debiased Machine Learning for Treatment and Structural Parameters," Econometrics Journal, 21, C1–C68.
    • Good, I. J. and Gaskins, R. A. (1971), "Nonparametric Roughness Penalties for Probability Densities," Biometrika, 58, 255–277.
    • Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. (2007), "Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation," in Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS'07), Curran Associates Inc., Red Hook, NY, USA, 1433–1440.
    • Sugiyama, M., Suzuki, T., and Kanamori, T. (2011), "Density Ratio Matching under the Bregman Divergence: A Unified Framework of Density Ratio Estimation," Annals of the Institute of Statistical Mathematics, 64.
    • Sugiyama, M., Suzuki, T., and Kanamori, T. (2012), Density Ratio Estimation in Machine Learning, New York, NY, USA: Cambridge University Press, 1st ed.
    • Sugiyama, M. (2016), "Introduction to Statistical Machine Learning."
    • Silverman, B. W. (1982), "On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method," The Annals of Statistics, 10, 795–810.
    • Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. (2008), "Approximating Mutual Information by Maximum Likelihood Density Ratio Estimation," in Proceedings of the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery at ECML/PKDD 2008, Proceedings of Machine Learning Research, vol. 4, 5–20, PMLR.
    • Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016), "Generative Adversarial Nets from a Density Ratio Estimation Perspective."
    • Tran, D., Ranganath, R., and Blei, D. M. (2017), "Hierarchical Implicit Models and Likelihood-Free Variational Inference," in Conference on Neural Information Processing Systems, Red Hook, NY, USA, 5529–5539.
    • Nguyen, X., Wainwright, M. J., and Jordan, M. (2008), "Estimating Divergence Functionals and the Likelihood Ratio by Penalized Convex Risk Minimization," in Conference on Neural Information Processing Systems, vol. 20.
    • Newey, W. K. and Powell, J. L. (2003), "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71(5), 1565–1578.
    • Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. (2011), "Statistical Outlier Detection Using Direct Density Ratio Estimation," Knowledge and Information Systems, 26, 309–336.
    • Lai, T. and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics.
    • Kaufmann, E., Cappé, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," Journal of Machine Learning Research, 17, 1–42.
    • Fan, X., Grama, I., and Liu, Q. (2013), "Cramér Large Deviation Expansions for Martingales under Bernstein's Condition," Stochastic Processes and their Applications, 123, 3919–3942.
    • Fan, X., Grama, I., and Liu, Q. (2014), "A Generalization of Cramér Large Deviations for Martingales," Comptes Rendus Mathematique, 352, 853–858.
    • Shimodaira, H. (2000), "Improving Predictive Inference under Covariate Shift by Weighting the Log-Likelihood Function," Journal of Statistical Planning and Inference, 90, 227–244.