Learning Theory of Dataset Shift

Intro | IWERM | Learning Bounds under Dataset Shifts | Evaluating Model Stability to Dataset Shift | Distributionally Robust Optimization | Summary | References
Learning Theory of Dataset Shift
Masanari Kimura, June 7, 2021

Intro

TL;DR
What these slides cover:
▶ An overview of the dataset shift problem setting, in which the data follow different distributions at training time and at test time [Quiñonero-Candela et al., 2009]
▶ Theoretical results on dataset shift
What these slides do not cover:
▶ Detailed proofs of the theorems
▶ Implementations of specific algorithms that handle dataset shift

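Empirical Risk Minimization
Supervised learning is based on minimizing the empirical risk on a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ drawn from the data distribution $P$:
$$h^* = \arg\min_{h \in \mathcal{H}} \hat{R}^{\ell}_{D}(h) = \arg\min_{h \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i)). \tag{1}$$
When the training and test distributions are identical, the expectation of the empirical risk $\hat{R}^{\ell}_{D}$ equals the expected risk $R^{\ell}(h)$:
$$\mathbb{E}_{P}\big[\hat{R}^{\ell}_{D}(h)\big] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{P}\big[\ell(Y_i, h(X_i))\big] = \frac{1}{N} \sum_{i=1}^{N} R^{\ell}(h) = R^{\ell}(h). \tag{2}$$
When the training distribution $P$ and the test distribution $Q$ differ, ERM is no longer guaranteed to select the hypothesis that is optimal for the test data.
⇒ How can we cope with this situation, and how can we evaluate a model correctly under it?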
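
As a minimal sketch of equation (1), the following runs ERM over a small finite hypothesis class. The threshold classifiers, the toy data, and all function names are assumptions made for illustration; they do not appear in the slides.

```python
def zero_one_loss(y, y_hat):
    """0-1 loss: 1.0 for a wrong prediction, 0.0 for a correct one."""
    return 0.0 if y == y_hat else 1.0


def empirical_risk(h, data, loss=zero_one_loss):
    """Average loss of hypothesis h over the dataset, as in equation (1)."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)


def erm(hypotheses, data, loss=zero_one_loss):
    """Return the hypothesis in a finite class with the smallest empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data, loss))


def make_threshold(t):
    """Hypothetical 1-D classifier: predict 1 when x >= t."""
    return lambda x: 1 if x >= t else 0


# Toy training data (assumed): pairs (x, y).
train = [(0.1, 0), (0.4, 0), (0.9, 0), (1.2, 1), (1.7, 1), (2.3, 1)]
hypotheses = [make_threshold(t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]

best = erm(hypotheses, train)
best_risk = empirical_risk(best, train)
print(best_risk)
```

Only the threshold at 1.0 separates this toy sample without error, so ERM selects it. Nothing in the procedure looks at the test distribution, which is why a shift between $P$ and $Q$ can invalidate the choice.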

Taxonomy of Distribution Shifts
▶ Covariate Shift
▶ Target Shift
▶ Concept Shift

Covariate Shift & Target Shift
Covariate Shift [Shimodaira, 2000]
With training distribution $P$ and test distribution $Q$:
$$P(X) \neq Q(X), \quad P(Y \mid X) = Q(Y \mid X).$$
Target Shift [Zhang et al., 2013]
With training distribution $P$ and test distribution $Q$:
$$P(Y) \neq Q(Y), \quad P(X \mid Y) = Q(X \mid Y).$$

Concept Shift
Concept Shift [Tsymbal, 2004; Vorburger and Bernstein, 2006]
With training distribution $P$ and test distribution $Q$:
$$P(Y \mid X) \neq Q(Y \mid X), \quad P(X \mid Y) \neq Q(X \mid Y).$$
▶ e.g. $P(\text{blue is the fashionable color}) = 0.95 \gg Q(\text{blue is the fashionable color}) = 0.2$

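General Distribution Shift
Let $Z \subset \{X, Y\}$ be the invariant set, whose marginal distribution does not change, let $W \subset \{X, Y\} \setminus Z$ be the mutable set, and let $V = \{X, Y\} \setminus (Z \cup W)$ be the remaining dependent variables. The joint distribution of the dataset, $P(X, Y) = P(V, W, Z)$, then factorizes into a product of conditional probabilities:
$$P(X, Y) = P(V \mid W, Z)\, P(W \mid Z)\, P(Z). \tag{3}$$
A general distribution shift is expressed by replacing the conditional probability $P(W \mid Z)$ in equation (3):
$$Q(X, Y) = P(V \mid W, Z)\, Q(W \mid Z)\, P(Z). \tag{4}$$
For example, setting $Z = \emptyset$ and $W = X$ in equation (4) gives $Q(X, Y) = P(Y \mid X)\, Q(X)$, which is covariate shift.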
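
The special case $Z = \emptyset$, $W = X$ of equation (4) can be checked on a small discrete example. This is only an illustrative sketch: the probability tables and helper names below are assumed, not taken from the slides.

```python
# Toy joint training distribution P(X, Y) on {0, 1} x {0, 1} (assumed values).
p_joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.40}
# Replacement marginal Q(X): with Z empty and W = X, P(W | Z) is just P(X).
q_x = {0: 0.2, 1: 0.8}


def marginal_x(joint):
    """Marginal distribution of X obtained by summing the joint over y."""
    out = {}
    for (x, _), prob in joint.items():
        out[x] = out.get(x, 0.0) + prob
    return out


def conditional_y_given_x(joint):
    """Conditional table P(y | x), stored under the key (x, y)."""
    px = marginal_x(joint)
    return {(x, y): prob / px[x] for (x, y), prob in joint.items()}


def shift(joint, new_x_marginal):
    """Equation (4) with Z empty, W = X, V = Y: Q(X, Y) = P(Y | X) Q(X)."""
    cond = conditional_y_given_x(joint)
    return {(x, y): cond[(x, y)] * new_x_marginal[x] for (x, y) in joint}


q_joint = shift(p_joint, q_x)
p_cond = conditional_y_given_x(p_joint)
q_cond = conditional_y_given_x(q_joint)

# Covariate shift: the marginals of X differ, the conditionals of Y given X agree.
print(marginal_x(p_joint), marginal_x(q_joint))
print(all(abs(p_cond[key] - q_cond[key]) < 1e-12 for key in p_cond))
```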

IWERM

Importance Weighting for the ERM
$$h^* = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} w(x_i)\, \ell(y_i, h(x_i)),$$
where $w(x)$ is the weighting function.
Figure labels: High Importance, Low Importance.

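Importance Weighted Empirical Risk Minimization
IWERM [Shimodaira, 2000]
With the weighting function $w(x) = Q(x)/P(x)$, the risk computed by weighted ERM is a consistent estimator of the risk under the test distribution. Under covariate shift, where $P(y \mid x) = Q(y \mid x)$, its expectation under $P$ equals the risk under $Q$:
$$\mathbb{E}_{P}\big[w(X)\, \ell(Y, h(X))\big] = \iint \frac{Q(x)}{P(x)}\, \ell(y, h(x))\, P(y \mid x)\, P(x)\, dy\, dx = \mathbb{E}_{Q}\big[\ell(Y, h(X))\big] = R^{\ell}_{Q}(h). \tag{5}$$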
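
Equation (5) can be verified exactly on a discrete covariate-shift example. The sketch below assumes a three-point input space, a shared conditional $P(Y = 1 \mid x)$, and a fixed hypothesis; all of these numbers and names are hypothetical.

```python
# Covariate shift on X = {0, 1, 2}: different marginals, shared P(Y = 1 | x).
p_x = {0: 0.6, 1: 0.3, 2: 0.1}               # training marginal P(x) (assumed)
q_x = {0: 0.1, 1: 0.3, 2: 0.6}               # test marginal Q(x) (assumed)
prob_y1_given_x = {0: 0.1, 1: 0.4, 2: 0.8}   # P(Y=1|x) = Q(Y=1|x) (assumed)


def h(x):
    """Hypothetical fixed classifier under evaluation."""
    return 1 if x >= 1 else 0


def zero_one_loss(y, y_hat):
    return 0.0 if y == y_hat else 1.0


def conditional_risk(x):
    """E[loss(Y, h(x)) | X = x] under the shared conditional distribution."""
    p1 = prob_y1_given_x[x]
    return p1 * zero_one_loss(1, h(x)) + (1.0 - p1) * zero_one_loss(0, h(x))


def risk(marginal, weight=None):
    """Expected (optionally weighted) loss when X follows `marginal`."""
    total = 0.0
    for x, prob in marginal.items():
        w = 1.0 if weight is None else weight(x)
        total += prob * w * conditional_risk(x)
    return total


def importance_weight(x):
    """w(x) = Q(x) / P(x), the weighting function of IWERM."""
    return q_x[x] / p_x[x]


risk_train = risk(p_x)                        # unweighted risk under P
risk_test = risk(q_x)                         # target quantity: risk under Q
risk_weighted = risk(p_x, importance_weight)  # importance-weighted risk under P
print(risk_train, risk_test, risk_weighted)
```

The unweighted training risk differs from the test risk, while the importance-weighted risk under $P$ matches the risk under $Q$ up to floating-point rounding. In practice $w$ has to be estimated, which this sketch does not address.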

Learning Bounds under Dataset Shifts

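Power of the Inequalities
The basic strategy is to obtain an inequality of the following form:
$$R^{\ell}_{Q}(h) \le R^{\ell}_{P}(h) + \psi(P, Q) + \eta(\mathcal{H}, d, N, \epsilon, \delta). \tag{6}$$
▶ It lets us estimate the effect of dataset shift in advance
▶ The smaller the right-hand side, the better ⇒ a tighter bound
▶ $R^{\ell}_{P}(h)$, $R^{\ell}_{Q}(h)$: loss under the training and test distributions
▶ $\psi(P, Q)$: a function measuring how far apart the training and test distributions are
▶ $\eta(\mathcal{H}, d, N, \epsilon, \delta)$: a function depending on the hypothesis class, dimension, sample size, and accuracy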

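Total Variation Distance-based Generalization Bounds
Theorem [Ben-David et al., 2007]
Given distributions $P$ and $Q$ on $\mathcal{X} \times \mathcal{Y}$, the following holds:
$$R^{\ell_{01}}_{Q}(h) \le R^{\ell_{01}}_{P}(h) + d_{\mathrm{TV}}(P_X, Q_X) + \min\Big\{ \mathbb{E}_{x \sim P_X}\big[|f_P(x) - f_Q(x)|\big],\; \mathbb{E}_{x \sim Q_X}\big[|f_Q(x) - f_P(x)|\big] \Big\}. \tag{7}$$
Here $f_P(x)$ and $f_Q(x)$ are the true labeling functions for the training and test distributions, and $d_{\mathrm{TV}}(\cdot, \cdot) : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ is the total variation distance,
$$d_{\mathrm{TV}}(P, Q) = 2 \sup_{A \in \Omega} \big| P(A) - Q(A) \big|. \tag{8}$$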
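
For distributions on a small finite set, the supremum in equation (8) can be evaluated by enumerating every event. The sketch below (toy distributions and function names are assumed) also checks it against the equivalent sum of absolute differences of the probability masses.

```python
from itertools import combinations


def tv_by_supremum(p, q):
    """Equation (8): 2 * sup_A |P(A) - Q(A)|, enumerating every event A."""
    support = sorted(set(p) | set(q))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            p_mass = sum(p.get(a, 0.0) for a in event)
            q_mass = sum(q.get(a, 0.0) for a in event)
            best = max(best, abs(p_mass - q_mass))
    return 2.0 * best


def tv_by_l1(p, q):
    """Equivalent closed form on a finite set: sum_x |p(x) - q(x)|."""
    support = set(p) | set(q)
    return sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Toy distributions (assumed for illustration).
p = {0: 0.6, 1: 0.3, 2: 0.1}
q = {0: 0.1, 1: 0.3, 2: 0.6}

tv_sup = tv_by_supremum(p, q)
tv_l1 = tv_by_l1(p, q)
print(tv_sup, tv_l1)
```

The enumeration is exponential in the support size and requires the probability masses to be known, so it only illustrates the definition.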

Limitations of the Total Variation Distance
1. For arbitrary probability distributions, $d_{\mathrm{TV}}(\cdot, \cdot)$ cannot be estimated from a finite sample;
2. It is independent of the hypothesis class, so the resulting inequality is loose.

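H-Divergence Leads to a Tighter Bound
Theorem [Ben-David et al., 2010]
For $h \in \mathcal{H}$, let $I_h := \{x : h(x) = 1\}$. Then for distributions $P$ and $Q$ on $\mathcal{X} \times \mathcal{Y}$ the following holds:
$$R^{\ell_{01}}_{Q}(h) \le R^{\ell_{01}}_{P}(h) + \frac{1}{2} d_{\mathcal{H}}(P, Q) + C. \tag{9}$$
$C$ is a term that depends on the complexity of the hypothesis class, and $d_{\mathcal{H}}(\cdot, \cdot)$ is the H-divergence,
$$d_{\mathcal{H}}(P, Q) = 2 \sup_{h \in \mathcal{H}} \big| P(I_h) - Q(I_h) \big|. \tag{10}$$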
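
Because the supremum in equation (10) runs only over the sets $I_h$ induced by the hypothesis class, the H-divergence can never exceed the total variation distance. A minimal sketch with an assumed class of threshold classifiers and assumed toy distributions:

```python
def event_probability(dist, event):
    """Probability mass that `dist` assigns to the set `event`."""
    return sum(prob for x, prob in dist.items() if x in event)


def h_divergence(p, q, hypotheses):
    """Equation (10): 2 * sup_h |P(I_h) - Q(I_h)| with I_h = {x : h(x) = 1}."""
    support = set(p) | set(q)
    best = 0.0
    for h in hypotheses:
        i_h = {x for x in support if h(x) == 1}
        gap = abs(event_probability(p, i_h) - event_probability(q, i_h))
        best = max(best, gap)
    return 2.0 * best


def total_variation(p, q):
    """2 * sup_A |P(A) - Q(A)| on a finite set, computed as sum_x |p(x) - q(x)|."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def make_threshold(t):
    """Hypothetical classifier: predict 1 when x >= t."""
    return lambda x: 1 if x >= t else 0


# Toy distributions (assumed): Q moves mass from the two ends to the middle.
p = {0: 0.4, 1: 0.2, 2: 0.4}
q = {0: 0.2, 1: 0.6, 2: 0.2}
thresholds = [make_threshold(t) for t in (0, 1, 2, 3)]

d_h = h_divergence(p, q, thresholds)
d_tv = total_variation(p, q)
print(d_h, d_tv)
```

Threshold classifiers cannot isolate the middle point, so they detect only part of the difference between the two distributions and the H-divergence comes out smaller than the total variation distance.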

Limitations of the H-divergence
1. The loss function is restricted to the 0-1 loss;
2. The term depending on model complexity is based on the VC dimension, so the inequality is loose.

Discrepancy Distance
The discrepancy distance $\mathrm{disc}_{\ell} : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ between distributions $P$ and $Q$ on $\mathcal{X} \times \mathcal{Y}$ is defined as:
$$\mathrm{disc}_{\ell}(P, Q) = \sup_{(h, h') \in \mathcal{H}^2} \Big| \mathbb{E}_{x \sim P}\big[\ell(h'(x), h(x))\big] - \mathbb{E}_{x \sim Q}\big[\ell(h'(x), h(x))\big] \Big|. \tag{11}$$
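
With a finite hypothesis class and discrete distributions, the supremum in equation (11) becomes a maximum over pairs of hypotheses. The following sketch assumes threshold classifiers, the 0-1 loss, and toy distributions; none of these choices come from the slides.

```python
from itertools import product


def expectation(dist, f):
    """E_{x ~ dist}[f(x)] for a distribution on a finite set."""
    return sum(prob * f(x) for x, prob in dist.items())


def discrepancy(p, q, hypotheses, loss):
    """Equation (11): max over pairs (h, h') of |E_P[l(h', h)] - E_Q[l(h', h)]|."""
    best = 0.0
    for h, h_prime in product(hypotheses, repeat=2):
        def pair_loss(x, h=h, h_prime=h_prime):
            return loss(h_prime(x), h(x))
        gap = abs(expectation(p, pair_loss) - expectation(q, pair_loss))
        best = max(best, gap)
    return best


def zero_one_loss(a, b):
    return 0.0 if a == b else 1.0


def make_threshold(t):
    """Hypothetical classifier: predict 1 when x >= t."""
    return lambda x: 1 if x >= t else 0


# Toy distributions and hypothesis class (assumed for illustration).
p = {0: 0.4, 1: 0.2, 2: 0.4}
q = {0: 0.2, 1: 0.6, 2: 0.2}
thresholds = [make_threshold(t) for t in (0, 1, 2, 3)]

disc = discrepancy(p, q, thresholds, zero_one_loss)
tv = sum(abs(p[x] - q[x]) for x in p)  # 2 * sup_A |P(A) - Q(A)| on a finite set
print(disc, tv)
```

Two thresholds disagree exactly on an interval, so pairs of hypotheses can isolate the middle point and the discrepancy picks up the largest single-point difference. The 0-1 loss is bounded by $M = 1$, and the value stays below the total variation distance.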

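Relations between $\mathrm{disc}_{\ell}$ and Other Measures
Relation between $\mathrm{disc}_{\ell_{01}}$ and $d_{\mathcal{H}}$
Taking $\ell_{01}$ as the loss function, the following relation holds (in Mansour et al. [2009], the divergence on the right is taken over the disagreement sets $\{x : h(x) \neq h'(x)\}$ of pairs $h, h' \in \mathcal{H}$):
$$\mathrm{disc}_{\ell_{01}}(P, Q) = \frac{1}{2} d_{\mathcal{H}}(P, Q). \tag{12}$$
Relation between $\mathrm{disc}_{\ell}$ and $d_{\mathrm{TV}}$ [Mansour et al., 2009]
Assume the loss function $\ell$ is bounded above: there exists $M > 0$ such that $\ell(y, y') \le M$ for all $(y, y')$. Then the following relation holds:
$$\mathrm{disc}_{\ell}(P, Q) \le M\, d_{\mathrm{TV}}(P, Q). \tag{13}$$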

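Generalization Bounds with Discrepancy Distance
Theorem [Mansour et al., 2009]
For any loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+}$ and probability distributions $P$, $Q$ on $\mathcal{X} \times \mathcal{Y}$, the following holds:
$$R^{\ell}_{Q}(h) \le R^{\ell}_{P}(h, h^{*}_{P}) + \mathrm{disc}_{\ell}(P_X, Q_X) + \epsilon. \tag{14}$$
▶ In practice, the computational cost becomes prohibitive;
▶ We would like an inequality that applies to arbitrary loss functions at a practical computational cost.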

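Wasserstein Distance
Optimal Transport and Wasserstein Distance
For probability measures $P_X, Q_X \in \mathcal{P}(\mathcal{X})$ and a cost function $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{+}$, the goal of the optimal transport problem is to find a coupling $\gamma$, defined as a joint probability on $\mathcal{X} \times \mathcal{X}$, that attains:
$$\arg\min_{\gamma \in \Gamma(P_X, Q_X)} \int_{\mathcal{X} \times \mathcal{X}} c(x, x')^{p}\, d\gamma(x, x'). \tag{15}$$
Using this notion, the $p$-th order Wasserstein distance is defined as:
$$W_p^p(P_X, Q_X) := \inf_{\gamma \in \Gamma(P_X, Q_X)} \int_{\mathcal{X} \times \mathcal{X}} c(x, x')^{p}\, d\gamma(x, x'). \tag{16}$$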
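
For two one-dimensional empirical samples of equal size and the cost $c(x, x') = |x - x'|$, the infimum in equation (16) is attained by matching the sorted samples. The sketch below, whose function names and sample values are assumed, compares that shortcut with a brute-force search over all one-to-one matchings.

```python
from itertools import permutations


def wasserstein_sorted(xs, ys, p=1):
    """W_p between equal-size 1-D samples: match the i-th smallest points."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return cost ** (1.0 / p)


def wasserstein_brute_force(xs, ys, p=1):
    """Minimize the transport cost over every one-to-one matching (tiny n only)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    best = min(
        sum(abs(a - b) ** p for a, b in zip(xs, perm))
        for perm in permutations(ys)
    )
    return (best / len(xs)) ** (1.0 / p)


# Toy samples standing in for draws from P_X and Q_X (assumed values).
source = [0.0, 2.0, 5.0]
target = [1.0, 1.5, 7.0]

w1_fast = wasserstein_sorted(source, target)
w1_slow = wasserstein_brute_force(source, target)
print(w1_fast, w1_slow)
```

Restricting the search to permutations is enough here because both empirical measures put equal mass on the same number of points. General couplings in higher dimensions need a linear-programming or entropic solver, which is outside this sketch.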

Optimal Transport
$$\arg\min_{\gamma \in \Gamma(P_X, Q_X)} \int_{\mathcal{X} \times \mathcal{X}} c(x, x')^{p}\, d\gamma(x, x').$$
Figure: Photo from "Introduction to Optimal Transport".

Generalization Bounds with Wasserstein Distance
Theorem [Courty et al., 2016]
Suppose unlabeled samples of sizes $N_S$ and $N_T$ drawn from $P$ and $Q$ are available. Then, for some $\zeta' < \sqrt{2}$, with probability at least $1 - \delta$,
$$R^{\ell}_{Q}(h) \le R^{\ell}_{P}(h) + W_1(\hat{P}_X, \hat{Q}_X) + \sqrt{2 \log\Big(\frac{1}{\delta}\Big) \Big/ \zeta'}\, \left( \sqrt{\frac{1}{N_S}} + \sqrt{\frac{1}{N_T}} \right) + \lambda \tag{17}$$
holds, where $W_1(P_X, Q_X)$ is the first-order Wasserstein distance.

Many Other Bounds
e.g.,
▶ Maximum Mean Discrepancy and Kernel Embeddings [Redko, 2015]
▶ PAC-Bayesian generalization bounds [Germain et al., 2013, 2016; McNamara and Balcan, 2017]

Evaluating Model Stability to Dataset Shift

Quantifying Performance Under Shifts
▶ The goal is to describe the stability of a hypothesis $h \in \mathcal{H}$ to changes in $P(W \mid Z)$
▶ Basic strategy: evaluate the worst case among the dataset shifts considered plausible

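worst (1 − α)-subsample
Definition: worst $(1 - \alpha)$-subsample
For some $\alpha \in [0, 1)$, the worst $(1 - \alpha)$-subsample is the sub-dataset whose size is a fraction $(1 - \alpha)$ of the original dataset and whose expected loss is the largest.
Worst-case risk
Let $g : \mathcal{W} \times \mathcal{Z} \to [0, 1]$ be a function that identifies whether a data point belongs to the worst $(1 - \alpha)$-subsample. The expected loss $R_{\alpha,0}$ on the worst $(1 - \alpha)$-subsample is defined as:
$$R_{\alpha,0} := \sup_{g : \mathcal{W} \times \mathcal{Z} \to [0, 1]} \frac{1}{1 - \alpha}\, \mathbb{E}\big[g(W, Z)\, \mu_0(W, Z)\big] \tag{18}$$
$$\text{s.t.} \quad \mathbb{E}\big[g(W, Z) \mid Z\big] = 1 - \alpha \quad \text{a.e.}, \tag{19}$$
$$\text{where} \quad \mu_0(W, Z) = \mathbb{E}\big[\ell(Y, h(X)) \mid W, Z\big]. \tag{20}$$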

Dual Formulation for $R_{\alpha,0}$
Dual Formulation for $R_{\alpha,0}$ [Duchi and Namkoong, 2018; Duchi et al., 2020]
The dual of $R_{\alpha,0}$ in equation (18) is given by:
$$R_{\alpha,0} = \mathbb{E}\left[ \frac{1}{1 - \alpha} \big(\mu_0(W, Z) - \eta_0(Z)\big)_{+} + \eta_0(Z) \right], \tag{21}$$
$$\text{where} \quad \eta_0 = \arg\inf_{\eta : \mathcal{Z} \to \mathbb{R}} \mathbb{E}\left[ \frac{1}{1 - \alpha} \big(\mu_0(W, Z) - \eta(Z)\big)_{+} + \eta(Z) \right]. \tag{22}$$
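
A simplified empirical version makes the link between the primal definition (18) and the dual (21) concrete. The sketch below assumes $Z = \emptyset$, so that $\eta$ is a single scalar, and substitutes observed per-sample losses for $\mu_0$. Both simplifications, the data, and the function names are illustrative assumptions.

```python
def worst_subsample_risk(losses, alpha):
    """Primal view: mean loss over the worst (1 - alpha) fraction of the points."""
    n = len(losses)
    m = round((1.0 - alpha) * n)
    if m < 1 or abs(m - (1.0 - alpha) * n) > 1e-9:
        raise ValueError("this sketch assumes (1 - alpha) * n is a positive integer")
    worst = sorted(losses, reverse=True)[:m]
    return sum(worst) / m


def dual_risk(losses, alpha):
    """Dual view, as in (21): min over eta of E[(loss - eta)_+] / (1 - alpha) + eta."""
    n = len(losses)

    def objective(eta):
        excess = sum(max(loss - eta, 0.0) for loss in losses)
        return excess / (n * (1.0 - alpha)) + eta

    # The objective is convex and piecewise linear in eta, with kinks only at
    # the sample losses, so a minimizer can be found among those values.
    return min(objective(eta) for eta in losses)


# Toy per-sample losses (assumed for illustration).
losses = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4]

primal_half = worst_subsample_risk(losses, alpha=0.5)
dual_half = dual_risk(losses, alpha=0.5)
print(primal_half, dual_half)
```

On this toy sample the two views agree: the optimal threshold $\eta$ sits at the boundary of the worst subsample. Conditioning on a non-empty $Z$ turns $\eta$ into a function of $z$ that has to be estimated.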

Worst-Case Sampler
Algorithm 1: Worst-Case Sampler [Subbaswamy et al., 2021]
Require: hypothesis $h$, dataset $D = \{(w_i, z_i, v_i)\}_{i=1}^{n}$, $K$ cross-validation folds $I_k \subset \{1, \ldots, n\}$
for $k = 1, \ldots, K$ do
    Estimate $\hat{\mu}_k \approx \mu_0$ using the data in $I_k^{c}$; estimate $\hat{\eta}_k \approx \eta_0$ using $\hat{\mu}_k$ and the data in $I_k^{c}$
    for $i \in I_k$ do
        Let $\hat{\mu}_i = \hat{\mu}_k(w_i, z_i)$
        Let $\hat{\eta}_i = \hat{\eta}_k(z_i)$
        Let $\hat{h}_i = [\hat{\mu}_i > \hat{\eta}_i]$
    end for
end for
$\hat{R}_{\alpha} = \frac{1}{K} \sum_{k} \frac{1}{|I_k|} \sum_{i \in I_k} \Big[ \frac{1}{1 - \alpha} (\hat{\mu}_i - \hat{\eta}_i)_{+} + \hat{\eta}_i + \frac{1}{1 - \alpha} \hat{h}_i \big(\ell(y_i, h(x_i)) - \hat{\mu}_i\big) \Big]$
return $\hat{R}_{\alpha}$, $\{\hat{h}_i\}_{i=1}^{n}$

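Consistent Estimator for $R_{\alpha,0}$
Consistent Estimator $\hat{R}_{\alpha}$ [Subbaswamy et al., 2021]
The K-fold Worst-Case Sampler yields the following estimator:
$$\hat{R}_{\alpha} = \frac{1}{K} \sum_{k} \frac{1}{|I_k|} \sum_{i \in I_k} \left[ \frac{1}{1 - \alpha} \Big( (\hat{\mu}_i - \hat{\eta}_i)_{+} + [\hat{\mu}_i \ge \hat{\eta}_i]\, \big(\ell(y_i, h(x_i)) - \hat{\mu}_i\big) \Big) + \hat{\eta}_i \right]. \tag{23}$$
This $\hat{R}_{\alpha}$ is a consistent estimator of $R_{\alpha,0}$.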

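$\sqrt{N}$-Consistency and Central Limit Properties
Theorem [Chernozhukov et al., 2018]
Let $\{\delta_N\}_N$ be a sequence of positive numbers converging to 0 with $\delta_N \ge N^{-1/2}$ for all $N \ge 1$. Then, under suitable assumptions, $\hat{R}_{\alpha}$ concentrates in a $1/\sqrt{N}$ neighborhood of $R_{\alpha,0}$ and its distribution is approximated by a normal distribution:
$$\sqrt{N} \sigma^{-1} (\hat{R}_{\alpha} - R_{\alpha,0}) = \frac{1}{\sqrt{N}} \sigma^{-1} \sum_{i} \psi(W_i, Z_i, V_i; R_{\alpha,0}, \mu_0, \eta_0) + O_P(\delta_N) \rightsquigarrow N(0, 1), \tag{24}$$
where $\sigma^2 = \mathbb{E}\big[\psi^2(W, Z, V; R_{\alpha,0}, \mu_0, \eta_0)\big]$ and
$$\psi(\cdot\,; R_{\alpha,0}, \mu_0, \eta_0) = \frac{1}{1 - \alpha} \Big( (\mu_0 - \eta_0)_{+} + [\mu_0 \ge \eta_0]\, (\ell - \mu_0) \Big) + \eta_0 - R_{\alpha,0}.$$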

Distributionally Robust Optimization

Distributionally Robust Optimization
Distributionally Robust Optimization (DRO)
Let $\Theta \subset \mathbb{R}^d$ be the parameter space and let $P_0$ be the data-generating probability distribution on a measurable space $(\mathcal{X}, \mathcal{A})$. DRO solves:
$$\underset{\theta \in \Theta}{\text{minimize}} \;\; R_f(\theta; P_0) := \sup_{Q \ll P_0} \Big\{ \mathbb{E}_{Q}[\ell(\theta; X)] : D_f[Q \,\|\, P_0] \le \rho \Big\}. \tag{25}$$
Figure label: Worst-case distribution.
▶ Learns a model that is robust to the worst case of dataset shift
▶ [Ben-Tal et al., 2013; Duchi and Namkoong, 2018; Duchi et al., 2021]

Notations
▶ uncertainty region: $\mathcal{U}_{P} := \{Q : D_f[Q \,\|\, P] \le \rho\}$;
▶ likelihood ratio: $L(X) = dQ(X)/dP_0(X)$;
▶ worst-case risk:
$$R_f(\theta; P_0) = \sup_{Q} \Big\{ \mathbb{E}_{Q}[\ell(\theta; X)] : Q \in \mathcal{U}_{P_0} \Big\} \tag{26}$$
$$= \sup_{L \ge 0} \Big\{ \mathbb{E}_{P_0}[L(X)\, \ell(\theta; X)] : \mathbb{E}_{P_0}[f(L(X))] \le \rho,\; \mathbb{E}_{P_0}[L(X)] = 1 \Big\}. \tag{27}$$

Divergence Families for the Uncertainty Region
Rényi α-divergence [Van Erven and Harremos, 2014]
$$D_{\alpha}[P \,\|\, Q] := \frac{1}{\alpha - 1} \log \int \left( \frac{dP}{dQ} \right)^{\alpha} dQ. \tag{28}$$
It coincides with the KL divergence in the limit $\alpha \to 1$.
Cressie-Read family of f-divergences [Cressie and Read, 1984]
For $k \in (-\infty, \infty)$ with $k \notin \{0, 1\}$ and $k_* = \frac{k}{k - 1}$,
$$D_{f_k}[P \,\|\, Q] := \int f_k\left( \frac{dP}{dQ} \right) dQ, \tag{29}$$
$$f_k(t) := \frac{t^k - kt + k - 1}{k(k - 1)}, \qquad f_k^{*}(s) := \frac{1}{k} \Big[ \big((k - 1)s + 1\big)_{+}^{k_*} - 1 \Big]. \tag{30}$$
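
Both divergence families are easy to evaluate for discrete distributions with full support. The sketch below (probability vectors and function names are assumed) checks numerically that the Rényi divergence approaches the KL divergence as $\alpha \to 1$, and that the Cressie-Read divergence with $k = 2$ is half the chi-square divergence.

```python
import math


def renyi_divergence(p, q, alpha):
    """Equation (28) for discrete p, q: log(sum p^alpha q^(1-alpha)) / (alpha - 1)."""
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)


def kl_divergence(p, q):
    """KL divergence sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def f_k(t, k):
    """Cressie-Read generator from equation (30); requires k not in {0, 1}."""
    return (t ** k - k * t + k - 1.0) / (k * (k - 1.0))


def cressie_read_divergence(p, q, k):
    """Equation (29) for discrete p, q: sum_i q_i f_k(p_i / q_i)."""
    return sum(qi * f_k(pi / qi, k) for pi, qi in zip(p, q))


# Toy distributions with full support (assumed for illustration).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl = kl_divergence(p, q)
renyi_near_one = renyi_divergence(p, q, alpha=1.0001)
half_chi_square = 0.5 * sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
cr_two = cressie_read_divergence(p, q, k=2.0)
print(kl, renyi_near_one, cr_two, half_chi_square)
```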

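Worst-Case Shift and Tail-Performance
Proposition [Shapiro, 2017]
For a probability distribution $P$ on $(\mathcal{X}, \mathcal{A})$ and $\rho > 0$,
$$R_f(\theta; P) = \inf_{\lambda \ge 0,\, \eta \in \mathbb{R}} \left\{ \mathbb{E}_{P}\left[ \lambda f^{*}\left( \frac{\ell(\theta; X) - \eta}{\lambda} \right) \right] + \lambda \rho + \eta \right\}. \tag{31}$$
Lemma [Duchi and Namkoong, 2018]
For a probability distribution $P$ on $(\mathcal{X}, \mathcal{A})$, $k \in (1, \infty)$, $k_* = k/(k - 1)$, $\rho > 0$, and $c_k(\rho) := (1 + k(k - 1)\rho)^{1/k}$, let $R_k(\theta; P)$ denote the worst-case risk over $D_{f_k}[Q \,\|\, P] \le \rho$. Then
$$\underbrace{R_k(\theta; P)}_{\text{worst-case risk}} = \inf_{\eta \in \mathbb{R}} \underbrace{\left\{ c_k(\rho)\, \mathbb{E}_{P}\Big[ \big(\ell(\theta; X) - \eta\big)_{+}^{k_*} \Big]^{\frac{1}{k_*}} + \eta \right\}}_{\text{tail-performance}}. \tag{32}$$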
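
The one-dimensional dual in equation (32) can be evaluated numerically for an empirical distribution. The following sketch minimizes over $\eta$ with a ternary search, which relies on the objective being convex in $\eta$. The search bracket, the toy losses, and the function name are assumptions of this sketch, not the method of the cited papers.

```python
import math


def cressie_read_worst_case_risk(losses, rho, k=2.0, iterations=200):
    """Evaluate the dual (32) on the empirical distribution of `losses`."""
    if k <= 1.0 or rho <= 0.0:
        raise ValueError("this sketch assumes k > 1 and rho > 0")
    n = len(losses)
    k_star = k / (k - 1.0)
    c = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)
    mean = sum(losses) / n
    top = max(losses)

    def objective(eta):
        moment = sum(max(loss - eta, 0.0) ** k_star for loss in losses) / n
        return c * moment ** (1.0 / k_star) + eta

    # objective(eta) >= c * (mean - eta) + eta, which exceeds objective(top) = top
    # for every eta below `lo`, so a minimizer lies in [lo, top].
    lo, hi = (c * mean - top) / (c - 1.0), top
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return objective((lo + hi) / 2.0)


# Toy per-sample losses (assumed for illustration).
losses = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4]
mean_loss = sum(losses) / len(losses)
variance = sum((loss - mean_loss) ** 2 for loss in losses) / len(losses)

risk_small_ball = cressie_read_worst_case_risk(losses, rho=0.1)
risk_large_ball = cressie_read_worst_case_risk(losses, rho=0.5)
print(mean_loss, risk_small_ball, risk_large_ball)
```

The worst-case risk lies between the average loss and the maximum loss and grows with the radius $\rho$. For $k = 2$ and a radius small enough that the worst-case likelihood ratio stays nonnegative, as with $\rho = 0.1$ here, the value agrees with the mean plus $\sqrt{2\rho\,\mathrm{Var}(\ell)}$.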

Empirical Evaluation of the DRO
Figure: [Duchi and Namkoong, 2018]. Dashed lines show the loss of the majority group; solid lines show the loss of the minority group.

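Convergence Guarantees of the DRO
Theorem [Duchi and Namkoong, 2018]
Assume $\ell(\theta; x) \le M$ for all $\theta \in \Theta$ and $x \in \mathcal{X}$, and let $c_k(\rho) := (1 + k(k - 1)\rho)^{1/k}$. For $n \ge k \vee 3$, with probability at least $1 - e^{-t}$, the gap between the empirical loss $R(\theta; \hat{P}_n)$ and the expected loss $R(\theta; P_0)$ satisfies
$$\big| R(\theta; \hat{P}_n) - R(\theta; P_0) \big| \le 10\, n^{-\frac{1}{k_* \vee 2}}\, c_k(\rho)^2 M \cdot (\cdots), \tag{33}$$
where the remaining factor depends on $\frac{c_k(\rho)}{c_k(\rho) - 1} \vee 2$, $\frac{1}{k}$, $t$, and $2 \log n$.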

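Asymptotic Properties of the DRO
Almost Sure Convergence [Duchi and Namkoong, 2018]
If $\mathbb{E}[f^{*}(|\ell(\theta; X)|)] < \infty$, then under suitable assumptions
$$\inf_{\theta \in \Theta} R_f(\theta; \hat{P}_n) \xrightarrow{\text{a.s.}} \inf_{\theta \in \Theta} R_f(\theta; P_0). \tag{34}$$
Asymptotic Normality
Suppose $\hat{\theta}_n$ is an empirical plug-in estimator satisfying $R_f(\hat{\theta}_n; \hat{P}_n) \le \inf_{\theta} R_f(\theta; \hat{P}_n) + o_P(1/n)$. Then, with $g_{P_0}(\theta, \lambda, \eta) := \lambda\, \mathbb{E}_{P_0}\big[f^{*}\big((\ell(\theta; X) - \eta)/\lambda\big)\big] + \rho \lambda + \eta$, the following holds:
$$\sqrt{n}\, (\hat{\theta}_n - \theta^{*}) \rightsquigarrow N\left( 0,\; V\, \mathrm{Cov}\left( f^{*\prime}\left( \frac{\ell(\theta^{*}; X) - \eta^{*}}{\lambda^{*}} \right) \nabla \ell(\theta^{*}; X) \right) V \right), \tag{35}$$
where $V$ is the leading $d \times d$ block of $\big(\nabla^2 g_{P_0}(\theta^{*}, \lambda^{*}, \eta^{*})\big)^{-1} \in \mathbb{R}^{(d+2) \times (d+2)}$.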

Summary

Summary
▶ There are many different approaches to handling dataset shift
▶ All of them are important and useful:
    ▶ Importance Weighting
    ▶ Learning Bounds
    ▶ Distributionally Robust Optimization

References I
Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.
Noel Cressie and Timothy RC Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):440–464, 1984.
John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses for latent covariate mixtures. arXiv preprint arXiv:2007.13982, 2020.
John C Duchi, Peter W Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.
Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning, pages 738–746. PMLR, 2013.
Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new PAC-Bayesian perspective on domain adaptation. In International Conference on Machine Learning, pages 859–868. PMLR, 2016.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

References II
Daniel McNamara and Maria-Florina Balcan. Risk bounds for transferring representations with and without fine-tuning. In International Conference on Machine Learning, pages 2373–2381. PMLR, 2017.
Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D Lawrence, and Anton Schwaighofer. Dataset shift in machine learning. MIT Press, 2009.
I Redko. Nonnegative matrix factorization for unsupervised transfer learning. PhD thesis, Paris North University, 2015.
Alexander Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
Adarsh Subbaswamy, Roy Adams, and Suchi Saria. Evaluating model robustness and stability to dataset shift. In International Conference on Artificial Intelligence and Statistics, pages 2611–2619. PMLR, 2021.
Alexey Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2):58, 2004.
Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
Peter Vorburger and Abraham Bernstein. Entropy-based concept shift detection. In Sixth International Conference on Data Mining (ICDM'06), pages 1113–1118. IEEE, 2006.
Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827. PMLR, 2013.
