On the principle of Invariant Risk Minimization

Masanari Kimura

May 27, 2023
Transcript

  1. On the principle of Invariant Risk Minimization
     Created by: Masanari Kimura
     Institute: The Graduate University for Advanced Studies, SOKENDAI
     Dept: Department of Statistical Science, School of Multidisciplinary Sciences
     E-mail: [email protected]
     XeLaTeXed on 2023/05/26
  2. Table of contents
     Invariant Risk Minimization and its variants
       Invariant Risk Minimization
       Definitions of IRM and IRMv1
       Connection to causality
       Learning theory of IRM
       Variants of IRM
     Limitations of Invariant Risk Minimization
       The difficulties of IRM
       The optimization dilemma in IRM
     Conclusion
     output.tex 1 / 24
  3. Problem setting: Out-Of-Distribution (OOD) Generalization [6]
     ⊚ We consider datasets $D^e := \{(x_i^e, y_i^e)\}_{i=1}^{n_e}$ collected under multiple training environments $e \in \mathcal{E}_{tr}$.
     ⊚ Each dataset $D^e$ is generated from some $p^e(x, y)$ under the i.i.d. assumption.
     ⊚ Our goal is to obtain $f : \mathcal{X} \to \mathcal{Y}$ which minimizes
       $R^{OOD}(f) := \max_{e \in \mathcal{E}_{all}} R^e(f) := \max_{e \in \mathcal{E}_{all}} \mathbb{E}_{p^e(x,y)}[\ell(f(x), y)]$,  (1)
       for $\mathcal{E}_{all} \supset \mathcal{E}_{tr}$.
  4. Definition (Invariant predictor [1])
     We say that a data representation $\Phi : \mathcal{X} \to \mathcal{H}$ elicits an invariant predictor $\hat{\beta} \circ \Phi$ across environments $\mathcal{E}$ if there is a classifier $\hat{\beta} : \mathcal{H} \to \mathcal{Y}$ simultaneously optimal for all environments, that is, $\hat{\beta} \in \operatorname{argmin}_{\beta : \mathcal{H} \to \mathcal{Y}} R^e(\beta \circ \Phi)$ for all $e \in \mathcal{E}$.
     Goal: learn an invariant representation $\Phi$ such that the optimal classifier $\hat{\beta}$ is identical for all environments $e \in \mathcal{E}$:
     $\mathbb{E}_{p^e(x,y)}[y \mid \Phi(x) = h] = \mathbb{E}_{p^{e'}(x,y)}[y \mid \Phi(x) = h], \quad \forall e, e' \in \mathcal{E}$.  (2)
  5. Invariant Risk Minimization (IRM)
     Definition (IRM [1])
     $\min_{\Phi : \mathcal{X} \to \mathcal{H},\ \hat{\beta} : \mathcal{H} \to \mathcal{Y}} \sum_{e \in \mathcal{E}_{tr}} R^e(\hat{\beta} \circ \Phi)$,  (3)
     $\text{s.t. } \hat{\beta} \in \operatorname{argmin}_{\beta : \mathcal{H} \to \mathcal{Y}} R^e(\beta \circ \Phi), \quad \forall e \in \mathcal{E}_{tr}$.  (4)
     ⊚ This bilevel program is highly non-convex and difficult to solve.
     ⊚ To find an approximate solution, we can consider a Lagrangian form, whereby the sub-optimality w.r.t. the constraint is expressed as the squared norm of the gradients of each of the inner optimization problems.
  6. Definition (IRMv1 [1])
     $\min_{\Phi : \mathcal{X} \to \mathcal{Y}} \sum_{e \in \mathcal{E}_{tr}} R^e(\Phi) + \lambda \cdot \|\nabla_{\hat{\beta}} R^e(\hat{\beta} \cdot \Phi)\|_2^2$,  (5)
     where the classifier is fixed to the scalar dummy value $\hat{\beta} = 1.0$.
     ⊚ Assuming the inner optimization problem is convex, achieving feasibility is equivalent to the penalty term being equal to 0.
     ⊚ For $\lambda = \infty$, IRMv1 is equivalent to IRM.
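The IRMv1 penalty of Eq. (5) can be sketched for the simplest case: a one-dimensional featurizer output with the squared loss, where the gradient w.r.t. the dummy classifier $w = 1.0$ has a closed form. The function names and toy data are assumptions for illustration, not the authors' implementation.

```python
def env_risk(phi_out, ys, w=1.0):
    # R^e(w * Phi): mean squared error of the scaled featurizer output.
    return sum((w * g - y) ** 2 for g, y in zip(phi_out, ys)) / len(ys)

def irmv1_penalty(phi_out, ys):
    # ||d/dw R^e(w * Phi)||^2 at the dummy classifier w = 1.0,
    # using d/dw mean((w*g - y)^2) = mean(2 * (w*g - y) * g).
    n = len(ys)
    grad = sum(2.0 * (g - y) * g for g, y in zip(phi_out, ys)) / n
    return grad ** 2

def irmv1_objective(envs, lam):
    # Eq. (5): per-environment risk plus lambda times the gradient penalty.
    return sum(env_risk(g, y) + lam * irmv1_penalty(g, y) for g, y in envs)

# A featurizer whose output already matches the target in every environment
# makes w = 1.0 optimal everywhere, so the penalty vanishes.
envs = [([1.0, 2.0], [1.0, 2.0]), ([3.0, 1.0], [3.0, 1.0])]
print(irmv1_objective(envs, lam=100.0))  # 0.0
```

A featurizer that is miscalibrated in some environment (e.g. outputs `[2.0, 4.0]` for targets `[1.0, 2.0]`) gets a strictly positive penalty, pushing $\Phi$ toward representations whose optimal classifier is shared.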
  7. Connection to causality
     Definition (Structural Equation Model (SEM) [9, 5])
     A Structural Equation Model (SEM) $\mathcal{C} := (\mathcal{S}, N)$ governing the random vector $X = (X_1, \dots, X_d)$ is a set of structural equations:
     $\mathcal{S}_i : X_i \leftarrow f_i(\mathrm{Pa}(X_i), N_i)$,  (6)
     where $\mathrm{Pa}(X_i) \subseteq \{X_1, \dots, X_d\} \setminus \{X_i\}$ are called the parents of $X_i$, and the $N_i$ are independent noise random variables.
     ⊚ We say that "$X_i$ causes $X_j$" if $X_i \in \mathrm{Pa}(X_j)$.
     ⊚ The causal graph of $X$ is obtained by drawing i) one node for each $X_i$, and ii) one edge from $X_i$ to $X_j$ whenever $X_i \in \mathrm{Pa}(X_j)$.
     ⊚ We assume acyclic causal graphs.
  8. ⊚ From the SEM $\mathcal{C}$, following the topological ordering of its causal graph, we can draw samples from the observational distribution $P(X)$.
     ⊚ We can intervene on a single SEM in different ways, indexed by $e$, to obtain different but related SEMs $\mathcal{C}^e$.
     Definition
     Consider a SEM $\mathcal{C} = (\mathcal{S}, N)$. An intervention $e$ on $\mathcal{C}$ consists of replacing one or several of its structural equations to obtain an intervened SEM $\mathcal{C}^e = (\mathcal{S}^e, N^e)$, with structural equations
     $\mathcal{S}_i^e : X_i^e \leftarrow f_i^e(\mathrm{Pa}(X_i^e), N_i^e)$,  (7)
     where the variable $X_i^e$ is intervened if $\mathcal{S}_i \neq \mathcal{S}_i^e$ or $N_i \neq N_i^e$.
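A tiny two-variable linear SEM illustrates both points: sampling in topological order yields the observational distribution, and replacing one structural equation yields an intervened SEM $\mathcal{C}^e$ as in Eq. (7). The particular equations ($X_1 \leftarrow N_1$, $X_2 \leftarrow 2X_1 + N_2$) and the `sample_sem` name are assumptions for illustration.

```python
import random

def sample_sem(n, intervene_x2=None, seed=0):
    # Minimal two-variable linear SEM: X1 <- N1, X2 <- 2*X1 + N2.
    # Sampling follows the topological order X1 -> X2. Passing a value
    # for intervene_x2 replaces S_2 by a constant (a do-intervention),
    # giving the intervened SEM C^e of Eq. (7).
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        n1, n2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = n1                          # S_1: X1 <- f_1(N1)
        if intervene_x2 is None:
            x2 = 2.0 * x1 + n2           # S_2: X2 <- f_2(Pa(X2), N2)
        else:
            x2 = intervene_x2            # S_2^e: structural equation replaced
        samples.append((x1, x2))
    return samples

observational = sample_sem(5)
intervened = sample_sem(5, intervene_x2=3.0)
print(all(x2 == 3.0 for _, x2 in intervened))  # True
```

Note that the intervention changes the distribution of $X_2$ without touching the mechanism generating $X_1$, which is the sense in which $\mathcal{C}$ and $\mathcal{C}^e$ are "different but related".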
  9. Definition
     Consider a SEM $\mathcal{C}$ governing the random vector $(X_1, \dots, X_d, Y)$, and the learning goal of predicting $Y$ from $X$. Then, the set of all environments $\mathcal{E}_{all}(\mathcal{C})$ indexes all the interventional distributions $P^e(X, Y) = P(X^e, Y^e)$ obtainable by valid interventions $e$. An intervention $e \in \mathcal{E}_{all}(\mathcal{C})$ is valid as long as i) the causal graph remains acyclic; ii) $\mathbb{E}_{P^e(X,Y)}[Y \mid \mathrm{Pa}(Y)] = \mathbb{E}[Y \mid \mathrm{Pa}(Y)]$; and iii) $\mathbb{V}[Y^e \mid \mathrm{Pa}(Y)]$ remains within a finite range.
     ⊚ The previous definitions relate causality and invariance.
     ⊚ One can show that a predictor $\beta : \mathcal{X} \to \mathcal{Y}$ is invariant across $\mathcal{E}_{all}(\mathcal{C})$ iff it attains optimal $R^{OOD}$, and iff it uses only the direct causal parents of $Y$ to predict.
  10. Learning theory of IRM
      Goal: low error and invariance across $\mathcal{E}_{tr}$ lead to low error across $\mathcal{E}_{all}$.
      Intuition: Invariant Causal Prediction (ICP) [6]
      ICP recovers the target invariance as long as i) the data is Gaussian; ii) the data satisfies a linear SEM; and iii) the data is obtained by certain types of interventions.
  11. Assumption
      A set of training environments $\mathcal{E}_{tr}$ lies in linear general position of degree $r$ if $|\mathcal{E}_{tr}| > d - r + d/r$ for some $r \in \mathbb{N}$, and for all non-zero $x \in \mathbb{R}^d$,
      $\dim\left(\operatorname{span}\left(\{\mathbb{E}[X^e X^{e\top}] x - \mathbb{E}[X^e \epsilon^e]\}_{e \in \mathcal{E}_{tr}}\right)\right) > d - r$.  (8)
      Theorem
      Assume that $Y^e = Z_1^e \cdot \gamma + \epsilon^e$, $Z_1^e \perp \epsilon^e$, $\mathbb{E}[\epsilon^e] = 0$, and $X^e = S(Z_1^e, Z_2^e)$. Here, $\gamma \in \mathbb{R}^c$. Assume that the $Z_1$ component of $S$ is invertible. Let $\Phi \in \mathbb{R}^{d \times d}$ have rank $r > 0$. Then, if at least $d - r + d/r$ training environments $\mathcal{E}_{tr} \subseteq \mathcal{E}$ lie in linear general position of degree $r$, we have that
      $\Phi \mathbb{E}[X^e X^{e\top}] \Phi^\top \hat{\beta} = \Phi \mathbb{E}[X^e Y^e]$  (9)
      holds for all $e \in \mathcal{E}_{tr}$ iff $\Phi$ elicits the invariant predictor $\Phi^\top \hat{\beta}$ for all $e \in \mathcal{E}_{all}$.
  12. Variants of IRM
      ⊚ Risk Extrapolation (REx) [4];
      ⊚ Risk Variance Penalization (RVP) [10];
      ⊚ Sparse Invariant Risk Minimization (SparseIRM) [11];
      ⊚ Derivative Invariant Risk Minimization (DIRM) [2];
      ⊚ Domain Extrapolation via Regret Minimization (RGM);
      ⊚ Domain Generalization using Causal Matching (MatchDG);
      ⊚ etc. [8].
  13. Risk Extrapolation (REx) [4]
      REx: for $\gamma \in [0, \infty)$,
      $R^{V\text{-}REx}(f) := \gamma \operatorname{Var}(\{R^1(f), \dots, R^m(f)\}) + \sum_{e \in \mathcal{E}_{tr}} R^e(f)$.  (10)
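Eq. (10) is a one-liner once the per-environment risks are in hand. A minimal sketch, assuming the population variance (the slide does not specify which variance estimator is used) and precomputed risks:

```python
from statistics import pvariance

def v_rex(risks, gamma):
    # Eq. (10): gamma * Var of per-environment risks, plus their sum.
    # pvariance = population variance; the choice of estimator is an
    # assumption, as the slide leaves it unspecified.
    return gamma * pvariance(risks) + sum(risks)

risks = [0.5, 0.7, 0.9]
print(v_rex(risks, 0.0))    # plain risk pooling (ERM-like)
print(v_rex(risks, 10.0))   # additionally penalizes unequal risks
```

With $\gamma = 0$ this reduces to summing the environment risks; increasing $\gamma$ favours predictors whose risks are flat across environments, which is REx's proxy for invariance.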
  14. Risk Variance Penalization (RVP) [10]
      RVP: for $\lambda \in [0, \infty)$,
      $R^{RVP}(f) := \lambda \sqrt{\operatorname{Var}(\{R^1(f), \dots, R^m(f)\})} + \sum_{e \in \mathcal{E}_{tr}} R^e(f)$.  (11)
      By Slutsky's theorem, for $m = |\mathcal{E}|$,
      $\mathbb{P}\left(\mathbb{E}_e[R^e(f)] - R^{RVP}(f) \leq 0\right) \to \Phi(\sqrt{m}\lambda)$.  (12)
      Then, we can take $\lambda = \Phi^{-1}(1 - \gamma)/\sqrt{m}$ for some confidence level $1 - \gamma$.
  15. Sparse Invariant Risk Minimization (SparseIRM) [11]
      SparseIRM: for $K \in \mathbb{N}$,
      $\min_{\beta, \Phi, m} R(\beta, m \circ \Phi), \quad \text{s.t. } m \in \{0, 1\}^{d_\Phi}, \ \|m\|_1 \leq K$.
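The constraint set of SparseIRM is just a binary gate over the $d_\Phi$ feature dimensions with at most $K$ ones. A minimal sketch of applying such a mask (the `apply_mask` helper is an illustrative assumption, not the paper's code):

```python
def apply_mask(mask, features, K):
    # SparseIRM's feasibility: m is binary with ||m||_1 <= K; the
    # elementwise product m * Phi(x) zeroes the unselected dimensions.
    assert all(b in (0, 1) for b in mask), "mask must be binary"
    assert sum(mask) <= K, "at most K features may be selected"
    return [b * f for b, f in zip(mask, features)]

print(apply_mask([1, 0, 1], [0.3, 2.0, 5.0], K=2))  # [0.3, 0.0, 5.0]
```

Optimizing over `mask` jointly with $\beta$ and $\Phi$ is a combinatorial problem; the paper relaxes it, but the constraint itself is this simple gating.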
  16. Limitations of Invariant Risk Minimization
      ⊚ IRM fundamentally does not improve over ERM;
      ⊚ the optimization dilemma in IRM.
  17. Theorem (The Failure of IRM in the Non-Linear Regime [7])
      Suppose we observe $E$ environments $\mathcal{E} = \{e_1, \dots, e_E\}$, where $\sigma_e^2 = 1$ for all $e \in [1, E]$. Then, for any $\epsilon > 1$, there exists a featurizer $\Phi_\epsilon$ which, combined with the ERM-optimal classifier $\hat{\beta} = [\beta_c, \beta_{e;ERM}, \beta_0]^\top$, satisfies the following:
      1. The regularization term of $(\Phi_\epsilon, \hat{\beta})$ is bounded as
         $\frac{1}{E} \sum_{e \in \mathcal{E}} \|\nabla_{\hat{\beta}} R^e(\Phi_\epsilon, \hat{\beta})\|_2^2 \in O\left(p_\epsilon^2 \left(c_\epsilon d_e + \frac{1}{E} \sum_{e \in \mathcal{E}} \|\mu_e\|_2^2\right)\right)$,  (13)
         for some constants $c_\epsilon$ and $p_\epsilon := \exp\{-d_e \min(\epsilon - 1, (\epsilon - 1)^2/8)\}$.
      2. $(\Phi_\epsilon, \hat{\beta})$ is equivalent to the ERM-optimal predictor on at least a $1 - q$ fraction of the test distribution, where $q := \frac{2R}{\sqrt{\pi}\,\delta} \exp\{-\delta^2\}$.
  18. Here, we suppose that, for any test distribution, its environmental mean $\mu_{E+1}$ is sufficiently far from the training means:
      $\forall e \in \mathcal{E}, \quad \min_{y \in \{+1, -1\}} \|\mu_{E+1} - y \cdot \mu_e\|_2 \geq (\sqrt{\epsilon} + \delta)/\sqrt{d_e}$  (14)
      for some $\delta > 0$. The predictor constructed above completely fails to use invariant prediction on most environments:
      ⊚ when $\delta$ is large, IRM fails to use invariant prediction on any environment that lies even slightly outside the high-probability region of the prior;
      ⊚ when $\delta$ is small, ERM already guarantees reasonable performance at test time; thus, IRM fundamentally does not improve over ERM in this regime.
  19. The optimization dilemma in IRM
      ⊚ OOD objectives such as IRM usually require several relaxations for ease of optimization, which however introduce large gaps.
      ⊚ Gradient conflicts between the ERM and OOD objectives generally exist for different objectives at different penalty weights.
      ⊚ The typically used linear weighting scheme for combining the ERM and OOD objectives requires careful tuning of the weights to approach the solution.
  20. Pareto Invariant Risk Minimization [3]
      Given a robust OOD objective $R^{OOD}$, Pareto IRM aims to solve the following multi-objective optimization problem:
      $\min_f \{R^{ERM}(f), R^{OOD}(f)\}$.  (15)
  21. Conclusion
      ⊚ IRM aims to learn an invariant predictor to achieve OOD generalization.
      ⊚ There are many variants of IRM.
      ⊚ Several negative results for IRM have been observed.
  22. References
      [1] Martin Arjovsky et al. "Invariant risk minimization". In: arXiv preprint arXiv:1907.02893 (2019).
      [2] Alexis Bellot and Mihaela van der Schaar. "Accounting for unobserved confounding in domain generalization". In: arXiv preprint arXiv:2007.10653 (2020).
      [3] Yongqiang Chen et al. "Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization". In: The Eleventh International Conference on Learning Representations. 2023.
      [4] David Krueger et al. "Out-of-distribution generalization via risk extrapolation (REx)". In: International Conference on Machine Learning. PMLR. 2021, pp. 5815–5826.
      [5] Judea Pearl. Causality: models, reasoning, and inference. 2000.
      [6] J. Peters, Peter Bühlmann, and N. Meinshausen. "Causal inference using invariant prediction: identification and confidence intervals". In: arXiv preprint (2015).
  23. References
      [7] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. "The Risks of Invariant Risk Minimization". In: International Conference on Learning Representations. 2021. URL: https://openreview.net/forum?id=BbNIbVPJ-42.
      [8] Zheyan Shen et al. "Towards out-of-distribution generalization: A survey". In: arXiv preprint arXiv:2108.13624 (2021).
      [9] Sewall Wright. "Correlation and causation". In: (1921).
      [10] Chuanlong Xie et al. "Risk variance penalization". In: arXiv preprint arXiv:2006.07544 (2020).
      [11] Xiao Zhou et al. "Sparse invariant risk minimization". In: International Conference on Machine Learning. PMLR. 2022, pp. 27222–27244.