Quick introduction to CounterFactual Regression (CFR)

Masa
June 15, 2022


(blog) https://medium.com/@masa_asami/quick-introduction-to-counterfactual-regression-cfr-382521eaef21
(My github repository)
[introduction_to_CFR] https://github.com/MasaAsami/introduction_to_CFR

(Papers)
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. “Estimating individual treatment effect: generalization bounds and algorithms.” International Conference on Machine Learning. PMLR, 2017
[Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. “Learning representations for counterfactual inference.” International conference on machine learning. PMLR, 2016.
[Takeuchi et al., 2021] Takeuchi, Koh, et al. “Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference.” arXiv preprint arXiv:2102.03980, 2021.
[Ganin et al., 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. “Domain-adversarial training of neural networks.” Journal of Machine Learning Research 17(1):2096–2030, 2016.
(Python implementations used as reference)
[cfrnet] https://github.com/clinicalml/cfrnet
[SC-CFR] https://github.com/koh-t/SC-CFR

Transcript

  1. Table of Contents: 1. Motivation 2. CFR's Solution 3. Theoretical Bound 4. Let's do it with PyTorch 5. Experiments and discussion on Hyperparameters 6. Why do we need CFR?
  2. Original Paper: [Shalit et al., 2017] Shalit, Uri, Fredrik D.

    Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International conference on machine learning. PMLR, 2016. 3
  3. Counterfactual outcomes are estimated by an arbitrary machine learning model

    An arbitrary ML model h maps the features {covariates x_i, treatment t_i} to the outcome y, from which treatment effects are derived. In the following, we aim for unbiased estimation of the ITE (Individualized Treatment Effect) and the ATT (Average Treatment effect on the Treated). (Figure labels: factual outcome, counterfactual outcome, covariates, treated or not.)
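For reference, the standard potential-outcome definitions behind these acronyms (notation assumed here, consistent with [Shalit et al., 2017]):

    \mathrm{ITE}:\ \tau(x) := \mathbb{E}[Y_1 - Y_0 \mid X = x], \qquad \mathrm{ATT} := \mathbb{E}[Y_1 - Y_0 \mid T = 1],

where only the factual outcome y_i = t_i\, y_1(x_i) + (1 - t_i)\, y_0(x_i) is observed for each unit.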
  4. Covariate shift “the problem of causal inference by counterfactual prediction

    might require inference over a different distribution than the one from which samples are given. In machine learning terms, this means that the feature distribution of the test set differs from that of the train set. This is a case of covariate shift, which is a special case of domain adaptation” [Johansson et al., 2016] Problem : covariate shift
 Estimating y^CF is usually difficult, mainly because the factual (F) and counterfactual (CF) distributions differ: the empirical factual and counterfactual distributions are not equal (unless the data come from an RCT).
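In symbols (my paraphrase of the setup in [Johansson et al., 2016]), the two distributions in question are

    p^{F}(x, t) = p(x, t), \qquad p^{CF}(x, t) = p(x, 1 - t),

which coincide only when treatment assignment is independent of the covariates, e.g. in an RCT.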
  5. New Loss - Pseudo-distance between the distributions of both groups

    in the Representation layer. - By adding this term to the loss, we encourage the two groups' distributions to be close in that layer. CFR's Solution
 Fig. from [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International conference on machine learning. PMLR, 2016.
  6. Two architectures
 [Shalit et al., 2017] [Johansson et al., 2016]

    [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International conference on machine learning. PMLR, 2016. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - We adopt the [Shalit et al., 2017] framework, which splits the outcome net by treatment arm. - [Johansson et al., 2016]'s single outcome net may underestimate the treatment effect (shrinkage estimation) due to regularization bias.
  7. Objective function
 [Shalit et al., 2017] Shalit, Uri, Fredrik D.

    Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - We solve a mixed loss-minimization problem combining the outcome losses and the pseudo-distance. - α is a hyperparameter (α > 0). - If α = 0, the model reduces to the Treatment-Agnostic Representation Network (TARNet). (Figure annotations: L2 penalty; sample weighting according to the share of treated units; α.)
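Written out, the objective from [Shalit et al., 2017] has roughly the following form (my transcription; see the paper for the exact weighting and regularizer):

    \min_{h,\Phi}\ \frac{1}{n}\sum_{i=1}^{n} w_i\, L\big(h(\Phi(x_i), t_i),\, y_i\big) + \lambda\, \mathcal{R}(h) + \alpha\, \mathrm{IPM}_G\big(\{\Phi(x_i)\}_{i:t_i=0},\ \{\Phi(x_i)\}_{i:t_i=1}\big),

    w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}, \qquad u = \frac{1}{n}\sum_i t_i .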
  8. [FYI] Domain Adversarial Neural Networks
 11 [Ganin et al., 2016]

    Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(1):2096–2030. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. (Figure: Domain-Adversarial Neural Networks [Ganin et al., 2016]; DANN's objective vs. CFR's objective.)
  9. Theoretical Bound (1/3) : Motivation
 [Shalit et al., 2017] Shalit,

    Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - Define the ITE error as follows (PEHE). - However, this metric cannot be measured directly (the true ITE is almost impossible to ascertain). - Still, a theoretical bound exists!! (Figure labels: estimated ITE, true ITE.)
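Concretely, the PEHE-style ITE error is (notation mine, following the paper's definitions):

    \epsilon_{\mathrm{PEHE}}(h, \Phi) = \int_{\mathcal{X}} \big(\hat{\tau}_{h,\Phi}(x) - \tau(x)\big)^2\, p(x)\, dx, \qquad \tau(x) = \mathbb{E}[Y_1 - Y_0 \mid x].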
  10. Theoretical Bound (2/3) : Definition
 expected counterfactual loss expected factual

    treated/control losses [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - We define the expected counterfactual loss in order to state a theoretical bound. This metric also cannot be measured directly. - Note that it is different from the "expected factual treated/control losses".
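Schematically, the two quantities average the same per-unit loss \ell_{h,\Phi}(x, t) over different joint distributions (my paraphrase of the paper's definitions):

    \epsilon_F(h, \Phi) = \int \ell_{h,\Phi}(x, t)\, p^{F}(x, t)\, dx\, dt, \qquad \epsilon_{CF}(h, \Phi) = \int \ell_{h,\Phi}(x, t)\, p^{CF}(x, t)\, dx\, dt,

with the per-group losses \epsilon_F^{t=1}, \epsilon_F^{t=0} obtained by conditioning on t.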
  11. Theoretical Bound (3/3) 
 Theorem Lemma [Shalit et al., 2017]

    Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. By bounding the CF loss from above, we can show that an upper bound on the ITE loss also exists. With u := p(t=1) and a constant B_Φ > 0, the CF loss is bounded by the factual loss and the IPM (both observable quantities).
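Schematically, suppressing the exact constants (see Lemma 1 and Theorem 1 of the paper), the chain of bounds looks like:

    \epsilon_{CF} \le (1 - u)\, \epsilon_F^{t=1} + u\, \epsilon_F^{t=0} + B_\Phi\, \mathrm{IPM}_G\big(p_\Phi^{t=1}, p_\Phi^{t=0}\big) \quad\Longrightarrow\quad \epsilon_{\mathrm{PEHE}} \le 2\big(\epsilon_F + \epsilon_{CF} - 2\sigma_Y^2\big),

so minimizing the factual losses together with the IPM term controls the unobservable ITE error.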
  12. 4. Let's do it with PyTorch!
 The following two

    GitHub repositories were used as references. The former, [cfrnet], is the official implementation of the original paper, written in TensorFlow; I used its algorithm as a reference. The latter, [SC-CFR], is implemented in PyTorch, like mine; its model architecture is different, but I reused many of its class definitions. - [cfrnet] https://github.com/clinicalml/cfrnet - [SC-CFR] https://github.com/koh-t/SC-CFR My repository: https://github.com/MasaAsami/introduction_to_CFR
  13. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and

    David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. (Code slide: the repnet and outnet modules.)
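A minimal sketch of the architecture those classes implement (representation net plus outcome heads, optionally split by treatment arm). Layer sizes and class names here are illustrative, not the exact ones from my repository:

import torch
import torch.nn as nn

class RepNet(nn.Module):
    """Shared representation Phi(x)."""
    def __init__(self, in_dim, rep_dim=48, n_layers=3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, rep_dim), nn.ReLU()]
            dim = rep_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class OutNet(nn.Module):
    """Outcome head h(Phi(x), t)."""
    def __init__(self, in_dim=48, hidden_dim=32, n_layers=3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers += [nn.Linear(dim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class CFRNet(nn.Module):
    def __init__(self, in_dim, rep_dim=48, split_outnet=True):
        super().__init__()
        self.repnet = RepNet(in_dim, rep_dim)
        self.split_outnet = split_outnet
        if split_outnet:
            # [Shalit et al., 2017]: one outcome head per treatment arm
            self.outnet_t = OutNet(rep_dim)
            self.outnet_c = OutNet(rep_dim)
        else:
            # single head with the treatment indicator appended to Phi(x)
            self.outnet = OutNet(rep_dim + 1)

    def forward(self, x, t):
        # x: (batch, in_dim); t: (batch,) with values 0/1
        phi = self.repnet(x)
        if self.split_outnet:
            y1, y0 = self.outnet_t(phi), self.outnet_c(phi)
        else:
            t_col = t.float().unsqueeze(-1)
            y1 = self.outnet(torch.cat([phi, torch.ones_like(t_col)], dim=-1))
            y0 = self.outnet(torch.cat([phi, torch.zeros_like(t_col)], dim=-1))
        y_obs = torch.where(t.bool().unsqueeze(-1), y1, y0)
        return y_obs, y0, y1, phi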
  14. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and

    David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. In this experiment, linear maximum-mean discrepancy (linear MMD) was employed. For other pseudo-distances (e.g. the Wasserstein distance), please check the official GitHub repository https://github.com/clinicalml/cfrnet
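A minimal sketch of the linear-MMD term (squared distance between the mean representations of the two groups); the official cfrnet implementation also rescales the representations, which is omitted here:

import torch

def linear_mmd(phi: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Squared L2 distance between the treated and control mean representations.

    phi: (batch, rep_dim) representation layer output
    t:   (batch,) treatment indicator (0/1)
    """
    phi_treated = phi[t == 1]
    phi_control = phi[t == 0]
    return torch.sum((phi_treated.mean(dim=0) - phi_control.mean(dim=0)) ** 2)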
  15. Well-known RCT dataset: LaLonde(1986)
 Dehejia, Rajeev and Sadek Wahba. (1999).Causal

    Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76:604-620. The National Supported Work Demonstration (NSW): the question of interest is whether "vocational training" (counseling and short-term work experience) affects subsequent earnings. In the dataset, the treatment variable, vocational training, is denoted by treat, and the outcome variable, income in 1978, is denoted by re78. Data can be downloaded from https://users.nber.org/~rdehejia/data/ (Table labels: outcome re78; treatment treat = 1 or 0.)
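A quick way to pull the data into pandas (the Stata file names below are assumptions based on the files hosted on that page; adjust to the exact versions you need):

import pandas as pd

BASE = "https://users.nber.org/~rdehejia/data/"

# Dehejia-Wahba subset of the NSW experiment and the CPS control pool.
# File names are assumed; check the page above for the exact ones.
nsw = pd.read_stata(BASE + "nsw_dw.dta")
cps = pd.read_stata(BASE + "cps_controls.dta")

print(nsw.groupby("treat")["re78"].mean())  # outcome by treatment arm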
  16. Basic statistics : NSW
 Dehejia, Rajeev and Sadek Wahba. (1999).Causal

    Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76:604-620. This experiment was conducted as an RCT, but the table shows that the covariates are not completely balanced. (Table columns: treated average, control average, |(treated avg - control avg) / std|.)
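A small helper reproducing the right-most column of that table (normalizing by the overall standard deviation is my assumption about the slide's formula, and the covariate names are illustrative):

import pandas as pd

def standardized_diff(df: pd.DataFrame, covariates, treat_col: str = "treat") -> pd.Series:
    """|treated mean - control mean| / std, per covariate."""
    out = {}
    for col in covariates:
        treated_mean = df.loc[df[treat_col] == 1, col].mean()
        control_mean = df.loc[df[treat_col] == 0, col].mean()
        out[col] = abs(treated_mean - control_mean) / df[col].std()
    return pd.Series(out)

# Example: standardized_diff(nsw, ["age", "education", "re74", "re75"])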
  17. RCT causal effect := 1676.3426 
 Dehejia, Rajeev and Sadek

    Wahba. (1999). Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76:604-620. We will use a simple multiple regression model and take the "true causal effect" to be 1676.3426.
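One way to reproduce that kind of benchmark with statsmodels (a sketch; the covariate list is illustrative and may differ from the exact specification behind 1676.3426):

import statsmodels.api as sm

# Regress 1978 earnings on the treatment dummy plus covariates
# within the experimental NSW sample (illustrative covariate names).
covariates = ["age", "education", "black", "hispanic", "married", "nodegree", "re74", "re75"]
X = sm.add_constant(nsw[["treat"] + covariates])
ols = sm.OLS(nsw["re78"], X).fit()
print(ols.params["treat"])  # the estimated "RCT causal effect"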
  18. Validation dataset
 Yasui, Shota (2020). Introduction to Effect Verification: Causal Inference and the Basics of Econometrics for Correct Comparisons (in Japanese). Gijutsu-Hyohron. A validation dataset

    is created by excluding the NSW control group and instead using non-experimental data (CPS: Current Population Survey) as the control group. (Figure: NSW treated kept; NSW control replaced by CPS controls; new dataset!)
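Building that dataset from the frames loaded earlier is a one-liner (sketch; assumes the cps frame already carries treat = 0, otherwise set it explicitly):

# NSW treated units + CPS controls -> the validation dataset
job = pd.concat(
    [nsw[nsw["treat"] == 1], cps.assign(treat=0)],
    ignore_index=True,
)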

  19. Temporary hyperparameters
 (Figure annotations: 3 layers / 3 layers, 48 dim / 48 dim,

    32 dim, split_outnet==True) [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
  20. Visualization of Representation layer
 • t-SNE visualizations of the balanced

    representations of our Job dataset • Original data on the left, representation layer on the right
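A rough sketch of how such a plot can be produced (X_raw, phi, and t stand for the raw covariates, the learned representations, and the treatment indicator from earlier; the names are illustrative):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(mat, t, title):
    # Embed into 2D and color points by treatment status.
    emb = TSNE(n_components=2, random_state=0).fit_transform(mat)
    plt.scatter(emb[:, 0], emb[:, 1], c=t, s=5, cmap="coolwarm")
    plt.title(title)
    plt.show()

plot_tsne(X_raw, t, "original covariates")
plot_tsne(phi.detach().cpu().numpy(), t, "representation layer")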
  21. experiment_run.py
 Setting up the experimental environment using "hydra" and "mlflow"

    hydra: experimental parameters are read from the yaml file configs/experiments.yaml into the cfg object. mlflow: the experimental parameters are registered in mlflow. mlflow: the results of the experiment are registered in mlflow.
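A skeleton of what experiment_run.py does with those two libraries (the config keys, metric names, and the train_cfr helper are illustrative placeholders, not the exact ones in my repository):

import hydra
import mlflow
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="experiments")
def main(cfg: DictConfig) -> None:
    with mlflow.start_run():
        # register the experimental parameters
        mlflow.log_params({"alpha": cfg.alpha, "split_outnet": cfg.split_outnet})
        # train and register the results (train_cfr is a placeholder here)
        result = train_cfr(cfg)
        mlflow.log_metric("estimated_ate", result["estimated_ate"])

if __name__ == "__main__":
    main()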
  22. Results & Discussions (1/2)
 split_outnet: When split_outnet==False (i.e., the outnet is

    not split), the treatment effect is consistently estimated to be lower. This is probably due to regularization bias (shrinkage estimation): to avoid overfitting, we estimate the weights with an L2 penalty, and penalizing the weight on the treatment indicator can shrink its estimated effect. (Figure annotations: L2 penalty; w; w_t (treatment effect).)
  23. Results & Discussions (2/2)
 alpha: • On this data, there

    was little benefit from alpha. This is consistent with the results of the original paper. • In any case, tuning α is very difficult (in most cases the true treatment effect is unknown, so we have no choice but to make a rough decision based on the IPM and the outcome losses).
  24. [Shalit et al., 2017]’s results
 [Shalit et al., 2017] Shalit,

    Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. The original paper also used the LaLonde dataset (Job). As the table shows, the benefit of α is insignificant on this dataset. α appears to be more effective on the IHDP dataset, which we omitted here due to time constraints.
  25. 6. What's good about CFR ?
 It's hard to

    search for alpha, and there are other ways to estimate the ITE. So what, then, is actually good about CFR?
  26. Lots of interesting applied research!
 [Takeuchi et al., 2021] Takeuchi,

    K., Nishida, R., Kashima, H., & Onishi, M. (2021). Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference. arXiv preprint arXiv:2102.03980. • As you know, deep learning models are very expressive, so the covariates do not have to be tabular data. • For example, the paper above uses a CNN to adjust for image data as covariates.
  27. Reference
 34 (Papers) • [Shalit et al., 2017] Shalit, Uri,

    Fredrik D. Johansson, and David Sontag. “Estimating individual treatment effect: generalization bounds and algorithms.” International Conference on Machine Learning. PMLR, 2017 • [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. “Learning representations for counterfactual inference.” International conference on machine learning. PMLR, 2016. • [Takeuchi et al., 2021] Takeuchi, Koh, et al. “Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference.” arXiv preprint arXiv:2102.03980, 2021. • [Ganin et al., 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. “Domain-adversarial training of neural networks.” Journal of Machine Learning Research 17(1):2096–2030, 2016. (Python code) • [cfrnet] https://github.com/clinicalml/cfrnet • [SC-CFR] https://github.com/koh-t/SC-CFR