Quick introduction to CounterFactual Regression (CFR)

Masa
June 15, 2022


(blog) https://medium.com/@masa_asami/quick-introduction-to-counterfactual-regression-cfr-382521eaef21
(My github repository)
[introduction_to_CFR] https://github.com/MasaAsami/introduction_to_CFR

(Papers)
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. “Estimating individual treatment effect: generalization bounds and algorithms.” International Conference on Machine Learning. PMLR, 2017
[Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. “Learning representations for counterfactual inference.” International conference on machine learning. PMLR, 2016.
[Takeuchi et al., 2021] Takeuchi, Koh, et al. “Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference.” arXiv preprint arXiv:2102.03980, 2021.
[Ganin et al., 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. “Domain-adversarial training of neural networks.” Journal of Machine Learning Research 17(1):2096–2030, 2016.
(Python implementations used as reference)
[cfrnet] https://github.com/clinicalml/cfrnet
[SC-CFR] https://github.com/koh-t/SC-CFR

Transcript

  1. Table of Contents: 1. Motivation 2. CFR's Solution 3. Theoretical Bound 4. Let's do it with PyTorch 5. Experiments and discussion on Hyperparameters 6. Why do we need CFR?
  2. Original Paper: [Shalit et al., 2017] Shalit, Uri, Fredrik D.

    Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International conference on machine learning. PMLR, 2016. 3
  3. Counterfactual outcomes are estimated by an arbitrary machine learning model

    An arbitrary ML model h maps the features {covariates x_i, treatment t_i} to the outcome y, from which treatment effects are derived. In the following, we aim for unbiased estimation of the ITE (Individualized Treatment Effect) and the ATT (Average Treatment effect on the Treated). (Figure labels: factual outcome, counterfactual outcome, covariates, treated or not.)
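For reference, the standard potential-outcome definitions behind these acronyms (notation assumed here, consistent with [Shalit et al., 2017]):

    \mathrm{ITE}:\ \tau(x) := \mathbb{E}[Y_1 - Y_0 \mid X = x], \qquad \mathrm{ATT} := \mathbb{E}[Y_1 - Y_0 \mid T = 1],

where only the factual outcome y_i = t_i\, y_1(x_i) + (1 - t_i)\, y_0(x_i) is observed for each unit.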
  4. Covariate shift “the problem of causal inference by counterfactual prediction

    might require inference over a different distribution than the one from which samples are given. In machine learning terms, this means that the feature distribution of the test set differs from that of the train set. This is a case of covariate shift, which is a special case of domain adaptation” [Johansson et al., 2016] Problem : covariate shift
 Estimating y^CF is usually difficult, mainly because the factual (F) and counterfactual (CF) distributions differ: the empirical factual and counterfactual distributions are not equal (unless the data come from an RCT).
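In symbols (my paraphrase of the setup in [Johansson et al., 2016]), the two distributions in question are

    p^{F}(x, t) = p(x, t), \qquad p^{CF}(x, t) = p(x, 1 - t),

which coincide only when treatment assignment is independent of the covariates, e.g. in an RCT.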
  5. New Loss - Pseudo-distance between the distributions of both groups

    in the Representation layer. - By adding this term to the loss, we encourage the two groups' distributions to be close in that layer. CFR's Solution
 Fig. from [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International conference on machine learning. PMLR, 2016.
  6. Two architectures
 [Shalit et al., 2017] [Johansson et al., 2016]

    [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International conference on machine learning. PMLR, 2016. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - We adopt the [Shalit et al., 2017] framework, which splits the outcome net by treatment arm. - [Johansson et al., 2016]'s single outcome net may underestimate the treatment effect (shrinkage estimation) due to regularization bias.
  7. Objective function
 [Shalit et al., 2017] Shalit, Uri, Fredrik D.

    Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - We solve a mixed loss-minimization problem combining the outcome losses and the pseudo-distance. - α is a hyperparameter (α > 0). - If α = 0, the model reduces to the Treatment-Agnostic Representation Network (TARNet). (Figure annotations: L2 penalty; sample weighting according to the share of treated units; α.)
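Written out, the objective from [Shalit et al., 2017] has roughly the following form (my transcription; see the paper for the exact weighting and regularizer):

    \min_{h,\Phi}\ \frac{1}{n}\sum_{i=1}^{n} w_i\, L\big(h(\Phi(x_i), t_i),\, y_i\big) + \lambda\, \mathcal{R}(h) + \alpha\, \mathrm{IPM}_G\big(\{\Phi(x_i)\}_{i:t_i=0},\ \{\Phi(x_i)\}_{i:t_i=1}\big),

    w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}, \qquad u = \frac{1}{n}\sum_i t_i .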
  8. [FYI] Domain Adversarial Neural Networks
 11 [Ganin et al., 2016]

    Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(1):2096–2030. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. (Figure: Domain-Adversarial Neural Networks [Ganin et al., 2016]; DANN's objective vs. CFR's objective.)
  9. Theoretical Bound (1/3) : Motivation
 [Shalit et al., 2017] Shalit,

    Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - Define the ITE error as follows (PEHE). - However, this metric cannot be measured directly (the true ITE is almost impossible to ascertain). - Still, a theoretical bound exists!! (Figure labels: estimated ITE, true ITE.)
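Concretely, the PEHE-style ITE error is (notation mine, following the paper's definitions):

    \epsilon_{\mathrm{PEHE}}(h, \Phi) = \int_{\mathcal{X}} \big(\hat{\tau}_{h,\Phi}(x) - \tau(x)\big)^2\, p(x)\, dx, \qquad \tau(x) = \mathbb{E}[Y_1 - Y_0 \mid x].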
  10. Theoretical Bound (2/3) : Definition
 expected counterfactual loss expected factual

    treated/control losses [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. - We define the expected counterfactual loss in order to state a theoretical bound. This metric also cannot be measured directly. - Note that it is different from the "expected factual treated/control losses".
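Schematically, the two quantities average the same per-unit loss \ell_{h,\Phi}(x, t) over different joint distributions (my paraphrase of the paper's definitions):

    \epsilon_F(h, \Phi) = \int \ell_{h,\Phi}(x, t)\, p^{F}(x, t)\, dx\, dt, \qquad \epsilon_{CF}(h, \Phi) = \int \ell_{h,\Phi}(x, t)\, p^{CF}(x, t)\, dx\, dt,

with the per-group losses \epsilon_F^{t=1}, \epsilon_F^{t=0} obtained by conditioning on t.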
  11. Theoretical Bound (3/3) 
 Theorem Lemma [Shalit et al., 2017]

    Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. By bounding the CF loss from above, we can show that an upper bound on the ITE loss also exists. With u := p(t=1) and a constant B_Φ > 0, the CF loss is bounded by the factual loss and the IPM (both observable quantities).
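Schematically, suppressing the exact constants (see Lemma 1 and Theorem 1 of the paper), the chain of bounds looks like:

    \epsilon_{CF} \le (1 - u)\, \epsilon_F^{t=1} + u\, \epsilon_F^{t=0} + B_\Phi\, \mathrm{IPM}_G\big(p_\Phi^{t=1}, p_\Phi^{t=0}\big) \quad\Longrightarrow\quad \epsilon_{\mathrm{PEHE}} \le 2\big(\epsilon_F + \epsilon_{CF} - 2\sigma_Y^2\big),

so minimizing the factual losses together with the IPM term controls the unobservable ITE error.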
  12. 4. Let's do it with PyTorch!
 The following two

    GitHub repositories were used as references. The former, [cfrnet], is the official implementation of the original paper, written in TensorFlow; I used its algorithm as a reference. The latter, [SC-CFR], is implemented in PyTorch, like mine; its model architecture is different, but I reused many of its class definitions. - [cfrnet] https://github.com/clinicalml/cfrnet - [SC-CFR] https://github.com/koh-t/SC-CFR My repository: https://github.com/MasaAsami/introduction_to_CFR
  13. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and

    David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. (Code slide: the repnet and outnet modules.)
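A minimal sketch of the architecture those classes implement (representation net plus outcome heads, optionally split by treatment arm). Layer sizes and class names here are illustrative, not the exact ones from my repository:

import torch
import torch.nn as nn

class RepNet(nn.Module):
    """Shared representation Phi(x)."""
    def __init__(self, in_dim, rep_dim=48, n_layers=3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, rep_dim), nn.ReLU()]
            dim = rep_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class OutNet(nn.Module):
    """Outcome head h(Phi(x), t)."""
    def __init__(self, in_dim=48, hidden_dim=32, n_layers=3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers += [nn.Linear(dim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class CFRNet(nn.Module):
    def __init__(self, in_dim, rep_dim=48, split_outnet=True):
        super().__init__()
        self.repnet = RepNet(in_dim, rep_dim)
        self.split_outnet = split_outnet
        if split_outnet:
            # [Shalit et al., 2017]: one outcome head per treatment arm
            self.outnet_t = OutNet(rep_dim)
            self.outnet_c = OutNet(rep_dim)
        else:
            # single head with the treatment indicator appended to Phi(x)
            self.outnet = OutNet(rep_dim + 1)

    def forward(self, x, t):
        # x: (batch, in_dim); t: (batch,) with values 0/1
        phi = self.repnet(x)
        if self.split_outnet:
            y1, y0 = self.outnet_t(phi), self.outnet_c(phi)
        else:
            t_col = t.float().unsqueeze(-1)
            y1 = self.outnet(torch.cat([phi, torch.ones_like(t_col)], dim=-1))
            y0 = self.outnet(torch.cat([phi, torch.zeros_like(t_col)], dim=-1))
        y_obs = torch.where(t.bool().unsqueeze(-1), y1, y0)
        return y_obs, y0, y1, phi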
  14. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and

    David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. In this experiment, linear maximum-mean discrepancy (linear MMD) was employed. For other pseudo-distances (e.g. the Wasserstein distance), please check the official GitHub repository https://github.com/clinicalml/cfrnet
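A minimal sketch of the linear-MMD term (squared distance between the mean representations of the two groups); the official cfrnet implementation also rescales the representations, which is omitted here:

import torch

def linear_mmd(phi: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Squared L2 distance between the treated and control mean representations.

    phi: (batch, rep_dim) representation layer output
    t:   (batch,) treatment indicator (0/1)
    """
    phi_treated = phi[t == 1]
    phi_control = phi[t == 0]
    return torch.sum((phi_treated.mean(dim=0) - phi_control.mean(dim=0)) ** 2)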
  15. Well-known RCT dataset: LaLonde(1986)
 Dehejia, Rajeev and Sadek Wahba. (1999).Causal

    Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76:604-620. The National Supported Work Demonstration (NSW): the question of interest is whether "vocational training" (counseling and short-term work experience) affects subsequent earnings. In the dataset, the treatment variable, vocational training, is denoted by treat, and the outcome variable, income in 1978, is denoted by re78. Data can be downloaded from https://users.nber.org/~rdehejia/data/ (Table labels: outcome re78; treatment treat = 1 or 0.)
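A quick way to pull the data into pandas (the Stata file names below are assumptions based on the files hosted on that page; adjust to the exact versions you need):

import pandas as pd

BASE = "https://users.nber.org/~rdehejia/data/"

# Dehejia-Wahba subset of the NSW experiment and the CPS control pool.
# File names are assumed; check the page above for the exact ones.
nsw = pd.read_stata(BASE + "nsw_dw.dta")
cps = pd.read_stata(BASE + "cps_controls.dta")

print(nsw.groupby("treat")["re78"].mean())  # outcome by treatment arm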
  16. Basic statistics : NSW
 Dehejia, Rajeev and Sadek Wahba. (1999).Causal

    Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76:604-620. This experiment was conducted as an RCT, but the table shows that the covariates are not completely balanced. (Table columns: treated average, control average, |(treated avg - control avg) / std|.)
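A small helper reproducing the right-most column of that table (normalizing by the overall standard deviation is my assumption about the slide's formula, and the covariate names are illustrative):

import pandas as pd

def standardized_diff(df: pd.DataFrame, covariates, treat_col: str = "treat") -> pd.Series:
    """|treated mean - control mean| / std, per covariate."""
    out = {}
    for col in covariates:
        treated_mean = df.loc[df[treat_col] == 1, col].mean()
        control_mean = df.loc[df[treat_col] == 0, col].mean()
        out[col] = abs(treated_mean - control_mean) / df[col].std()
    return pd.Series(out)

# Example: standardized_diff(nsw, ["age", "education", "re74", "re75"])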
  17. RCT causal effect := 1676.3426 
 Dehejia, Rajeev and Sadek

    Wahba. (1999). Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76:604-620. We will use a simple multiple regression model and take the "true causal effect" to be 1676.3426.
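One way to reproduce that kind of benchmark with statsmodels (a sketch; the covariate list is illustrative and may differ from the exact specification behind 1676.3426):

import statsmodels.api as sm

# Regress 1978 earnings on the treatment dummy plus covariates
# within the experimental NSW sample (illustrative covariate names).
covariates = ["age", "education", "black", "hispanic", "married", "nodegree", "re74", "re75"]
X = sm.add_constant(nsw[["treat"] + covariates])
ols = sm.OLS(nsw["re78"], X).fit()
print(ols.params["treat"])  # the estimated "RCT causal effect"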
  18. Validation dataset
 Yasui, Shota (2020). Introduction to Effect Verification: Causal Inference and the Basics of Econometrics for Correct Comparisons (in Japanese). Gijutsu-Hyohron. A validation dataset

    is created by excluding the NSW control group and instead using non-experimental data (CPS: Current Population Survey) as the control group. (Figure: NSW treated kept; NSW control replaced by CPS controls; new dataset!)
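Building that dataset from the frames loaded earlier is a one-liner (sketch; assumes the cps frame already carries treat = 0, otherwise set it explicitly):

# NSW treated units + CPS controls -> the validation dataset
job = pd.concat(
    [nsw[nsw["treat"] == 1], cps.assign(treat=0)],
    ignore_index=True,
)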

  19. Temporary hyperparameters
 (Figure annotations: 3 layers / 3 layers, 48 dim / 48 dim,

    32 dim, split_outnet==True) [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
  20. Visualization of Representation layer
 • t-SNE visualizations of the balanced

    representations of our Job dataset • Original data on the left, representation layer on the right
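A rough sketch of how such a plot can be produced (X_raw, phi, and t stand for the raw covariates, the learned representations, and the treatment indicator from earlier; the names are illustrative):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(mat, t, title):
    # Embed into 2D and color points by treatment status.
    emb = TSNE(n_components=2, random_state=0).fit_transform(mat)
    plt.scatter(emb[:, 0], emb[:, 1], c=t, s=5, cmap="coolwarm")
    plt.title(title)
    plt.show()

plot_tsne(X_raw, t, "original covariates")
plot_tsne(phi.detach().cpu().numpy(), t, "representation layer")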
  21. experiment_run.py
 Setting up the experimental environment using "hydra" and "mlflow"

    hydra: experimental parameters are read from the yaml file configs/experiments.yaml into the cfg object. mlflow: the experimental parameters are registered in mlflow. mlflow: the results of the experiment are registered in mlflow.
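A skeleton of what experiment_run.py does with those two libraries (the config keys, metric names, and the train_cfr helper are illustrative placeholders, not the exact ones in my repository):

import hydra
import mlflow
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="experiments")
def main(cfg: DictConfig) -> None:
    with mlflow.start_run():
        # register the experimental parameters
        mlflow.log_params({"alpha": cfg.alpha, "split_outnet": cfg.split_outnet})
        # train and register the results (train_cfr is a placeholder here)
        result = train_cfr(cfg)
        mlflow.log_metric("estimated_ate", result["estimated_ate"])

if __name__ == "__main__":
    main()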
  22. Results & Discussions (1/2)
 split_outnet: When split_outnet==False (i.e., the outnet is

    not split), the treatment effect is consistently estimated to be lower. This is probably due to regularization bias (shrinkage estimation): to avoid overfitting, we estimate the weights with an L2 penalty, and penalizing the weight on the treatment indicator can shrink its estimated effect. (Figure annotations: L2 penalty; w; w_t (treatment effect).)
  23. Results & Discussions (2/2)
 alpha: • On this data, there

    was little benefit from alpha. This is consistent with the results of the original paper. • In any case, tuning α is very difficult (in most cases the true treatment effect is unknown, so we have no choice but to make a rough decision based on the IPM and the outcome losses).
  24. [Shalit et al., 2017]’s results
 [Shalit et al., 2017] Shalit,

    Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. The original paper also used the LaLonde dataset (Job). As the table shows, the benefit of α is insignificant on this dataset. α appears to be more effective on the IHDP dataset, which we omitted here due to time constraints.
  25. 6. What's good about CFR ?
 It's hard to

    search for alpha, and there are other ways to estimate the ITE. So what, then, is actually good about CFR?
  26. Lots of interesting applied research!
 [Takeuchi et al., 2021] Takeuchi,

    K., Nishida, R., Kashima, H., & Onishi, M. (2021). Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference. arXiv preprint arXiv:2102.03980. • As you know, deep learning models are very expressive, so the covariates do not have to be tabular data. • For example, the paper above uses a CNN to adjust for image data as covariates.
  27. Reference
 34 (Papers) • [Shalit et al., 2017] Shalit, Uri,

    Fredrik D. Johansson, and David Sontag. “Estimating individual treatment effect: generalization bounds and algorithms.” International Conference on Machine Learning. PMLR, 2017 • [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. “Learning representations for counterfactual inference.” International conference on machine learning. PMLR, 2016. • [Takeuchi et al., 2021] Takeuchi, Koh, et al. “Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference.” arXiv preprint arXiv:2102.03980, 2021. • [Ganin et al., 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. “Domain-adversarial training of neural networks.” Journal of Machine Learning Research 17(1):2096–2030, 2016. (Python code) • [cfrnet] https://github.com/clinicalml/cfrnet • [SC-CFR] https://github.com/koh-t/SC-CFR