Slide 1

Slide 1 text

Quick introduction to CFR: Counterfactual Regression
twitter: @asas_mimi

Slide 2

Slide 2 text

Table of Contents
1. Motivation
2. CFR's Solution
3. Theoretical Bound
4. Let's do it with PyTorch
5. Experiments and discussion on Hyperparameters
6. Why do we need CFR?

Slide 3

Slide 3 text

Original Papers:
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
[Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International Conference on Machine Learning. PMLR, 2016.

Slide 4

Slide 4 text

1. Motivation

Slide 5

Slide 5 text

Counterfactual outcomes are estimated by an arbitrary machine learning model h: features (covariates x_i and a treatment indicator t, 1 if treated, 0 if not) are mapped to an outcome y, from which treatment effects are derived.
In the following, we aim for unbiased estimation of the ITE (Individualized Treatment Effect) and the ATT (Average Treatment effect on the Treated), both of which compare the factual outcome with the counterfactual outcome.
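As a notation recap (my wording, following [Shalit et al., 2017], with potential outcomes Y_1 and Y_0), these quantities can be written as:

    \tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x]              % ITE at covariates x
    \mathrm{ATT} := \mathbb{E}[Y_1 - Y_0 \mid t = 1]     % average effect on the treated
    \hat{\tau}(x) = h(x, 1) - h(x, 0)                    % plug-in estimate from the model h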

Slide 6

Slide 6 text

Problem: covariate shift
"The problem of causal inference by counterfactual prediction might require inference over a different distribution than the one from which samples are given. In machine learning terms, this means that the feature distribution of the test set differs from that of the train set. This is a case of covariate shift, which is a special case of domain adaptation." [Johansson et al., 2016]
Estimating y^CF is usually difficult, mainly because the factual and counterfactual distributions differ: the empirical factual and counterfactual distributions are not equal unless the data come from an RCT.

Slide 7

Slide 7 text

2. CFR's Solution

Slide 8

Slide 8 text

CFR's Solution: a new loss
- A pseudo-distance between the distributions of the two groups in the representation layer.
- By adding this term to the loss, we train the model so that the two groups' distributions become close in that layer.
Figure from [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International Conference on Machine Learning. PMLR, 2016.

Slide 9

Slide 9 text

Two architectures
[Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International Conference on Machine Learning. PMLR, 2016. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
- We adopt the [Shalit et al., 2017] framework, which splits the network into separate outcome heads for the treated and control groups.
- The [Johansson et al., 2016] architecture may underestimate the treatment effect (shrinkage estimation) in the outcome net due to regularization bias.

Slide 10

Slide 10 text

Objective function
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
- We solve a mixed loss-minimization problem combining the outcome losses and the pseudo-distance, with an L2 penalty on the outcome net and sample weighting according to the treated fraction.
- α is a hyperparameter (α > 0).
- If α = 0, the model reduces to the Treatment-Agnostic Representation Network (TARNet).
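For concreteness, the objective of [Shalit et al., 2017] can be written (up to notation) as:

    \min_{h, \Phi} \; \frac{1}{n} \sum_{i=1}^{n} w_i \, L\big(h(\Phi(x_i), t_i), y_i\big)
        + \lambda \, \mathcal{R}(h)
        + \alpha \, \mathrm{IPM}_G\big(\{\Phi(x_i)\}_{i: t_i = 0}, \{\Phi(x_i)\}_{i: t_i = 1}\big)

    w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}, \qquad u = \frac{1}{n} \sum_i t_i

Here \mathcal{R}(h) is the L2 penalty, w_i is the sample weighting according to the treated fraction u, and setting α = 0 drops the IPM term, leaving TARNet.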

Slide 11

Slide 11 text

[FYI] Domain Adversarial Neural Networks
[Ganin et al., 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. "Domain-adversarial training of neural networks." Journal of Machine Learning Research 17(1):2096–2030, 2016. [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
CFR's objective has the same structure as the objective of Domain Adversarial Neural Networks [Ganin et al., 2016]: a prediction loss plus a term that brings the two groups' (domains') distributions closer.

Slide 12

Slide 12 text

3. Theoretical Bound

Slide 13

Slide 13 text

Theoretical Bound (1/3): Motivation
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
- Define the ITE error (PEHE) as the discrepancy between the estimated ITE and the true ITE.
- However, this is an unmeasurable metric (the true ITE is almost impossible to ascertain).
- Still, a theoretical bound exists!!
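In LaTeX, the PEHE (Precision in Estimation of Heterogeneous Effect) error is:

    \epsilon_{\mathrm{PEHE}} = \int_{\mathcal{X}} \big(\hat{\tau}(x) - \tau(x)\big)^2 \, p(x) \, dx

It depends on the true ITE \tau(x), which is why it cannot be measured directly.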

Slide 14

Slide 14 text

Theoretical Bound (2/3): Definitions
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
- Define the expected counterfactual loss, which is used to state the theoretical bound. This metric also cannot be directly ascertained.
- Note that it is different from the expected factual treated/control losses.
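Up to notation, with l_{h,\Phi}(x, t) denoting the expected loss of predicting the potential outcome y_t at x, the definitions read:

    \epsilon_F(h, \Phi) = \int l_{h,\Phi}(x, t) \, p(x, t) \, dx \, dt              % expected factual loss
    \epsilon_{CF}(h, \Phi) = \int l_{h,\Phi}(x, 1 - t) \, p(x, t) \, dx \, dt       % expected counterfactual loss
    \epsilon_F^{t=1} = \int l_{h,\Phi}(x, 1) \, p(x \mid t = 1) \, dx               % expected factual treated loss
    \epsilon_F^{t=0} = \int l_{h,\Phi}(x, 0) \, p(x \mid t = 0) \, dx               % expected factual control loss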

Slide 15

Slide 15 text

Theoretical Bound (3/3)
Theorem and Lemma from [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
- By bounding the CF loss from above, an upper bound on the ITE (PEHE) loss is also shown to exist.
- With u := p(t=1) and a constant B_Φ > 0, the CF loss is bounded by the factual loss and the IPM (both observable metrics).
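Up to notation, the lemma and theorem are:

    \text{Lemma:}\quad \epsilon_{CF}(h, \Phi) \le (1 - u)\,\epsilon_F^{t=1}(h, \Phi) + u\,\epsilon_F^{t=0}(h, \Phi) + B_\Phi \, \mathrm{IPM}_G\big(p_\Phi^{t=1}, p_\Phi^{t=0}\big)

    \text{Theorem:}\quad \epsilon_{\mathrm{PEHE}}(h, \Phi) \le 2\big(\epsilon_F^{t=0}(h, \Phi) + \epsilon_F^{t=1}(h, \Phi) + B_\Phi \, \mathrm{IPM}_G\big(p_\Phi^{t=1}, p_\Phi^{t=0}\big) - 2\sigma_Y^2\big)

where p_\Phi^{t=1} and p_\Phi^{t=0} are the distributions of the two groups in the representation layer and \sigma_Y^2 is the outcome variance.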

Slide 16

Slide 16 text

4. Let's do it with PyTorch!
The following two GitHub repositories were used as references. The former ([cfrnet]) is the official implementation of the original paper, written in TensorFlow; I used its algorithm as a reference. The latter ([SC-CFR]) is implemented in PyTorch, the same as mine; the model architecture differs, but I used many of its class definitions as references.
- [cfrnet] https://github.com/clinicalml/cfrnet
- [SC-CFR] https://github.com/koh-t/SC-CFR
My repository is as follows: https://github.com/MasaAsami/introduction_to_CFR

Slide 17

Slide 17 text

[Figure: model architecture from [Shalit et al., 2017], showing the representation net (repnet) feeding the outcome net (outnet)]
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
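A minimal PyTorch sketch of this architecture (my own simplification; layer sizes, activations, and the split_outnet flag follow the slides, not the official implementation):

    import torch
    import torch.nn as nn

    class CFRNet(nn.Module):
        """repnet (shared representation) + outnet (outcome heads), after [Shalit et al., 2017]."""

        def __init__(self, in_dim: int, rep_dim: int = 48, hyp_dim: int = 32,
                     split_outnet: bool = True):
            super().__init__()
            # repnet: learns the balanced representation Phi(x)
            self.repnet = nn.Sequential(
                nn.Linear(in_dim, rep_dim), nn.ELU(),
                nn.Linear(rep_dim, rep_dim), nn.ELU(),
                nn.Linear(rep_dim, rep_dim), nn.ELU(),
            )
            self.split_outnet = split_outnet
            if split_outnet:
                # separate heads for treated/control, the [Shalit et al., 2017] design
                self.outnet_t = self._head(rep_dim, hyp_dim)
                self.outnet_c = self._head(rep_dim, hyp_dim)
            else:
                # single head on [Phi(x), t], closer to [Johansson et al., 2016]
                self.outnet = self._head(rep_dim + 1, hyp_dim)

        @staticmethod
        def _head(in_dim: int, hyp_dim: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(in_dim, hyp_dim), nn.ELU(),
                nn.Linear(hyp_dim, hyp_dim), nn.ELU(),
                nn.Linear(hyp_dim, 1),
            )

        def forward(self, x: torch.Tensor, t: torch.Tensor):
            # x: (n, in_dim) covariates, t: (n, 1) 0/1 treatment indicator
            phi = self.repnet(x)
            if self.split_outnet:
                # route each sample to the head matching its treatment status
                y_hat = torch.where(t.bool(), self.outnet_t(phi), self.outnet_c(phi))
            else:
                y_hat = self.outnet(torch.cat([phi, t.float()], dim=1))
            return y_hat, phi  # phi is reused for the IPM penalty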

Slide 18

Slide 18 text

[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017. In this experiment, the linear maximum mean discrepancy (linear MMD) was employed as the pseudo-distance. For other pseudo-distances (e.g. the Wasserstein distance), please check the official GitHub: https://github.com/clinicalml/cfrnet
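A minimal sketch of the linear MMD penalty (a simplification: the official cfrnet version also rescales the group means by the treatment probability):

    import torch

    def linear_mmd(phi: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Squared distance between the mean representations of the two groups.
        phi_t = phi[t.squeeze() == 1]  # treated representations
        phi_c = phi[t.squeeze() == 0]  # control representations
        return torch.sum((phi_t.mean(dim=0) - phi_c.mean(dim=0)) ** 2)

The total loss is then the (weighted) factual outcome loss plus alpha * linear_mmd(phi, t).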

Slide 19

Slide 19 text

Well-known RCT dataset: LaLonde (1986)
Dehejia, Rajeev, and Sadek Wahba. (1999). "Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs." Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). "Evaluating the Econometric Evaluations of Training Programs." American Economic Review 76: 604-620.
The National Supported Work Demonstration (NSW): the question of interest is whether "vocational training" (counseling and short-term work experience) affects subsequent earnings. In the dataset, the treatment variable (vocational training) is denoted by treat (1 or 0), and the outcome variable (income in 1978) is denoted by re78. The data can be downloaded at the following website: https://users.nber.org/~rdehejia/data/

Slide 20

Slide 20 text

Basic statistics: NSW
Dehejia, Rajeev, and Sadek Wahba. (1999). "Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs." Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). "Evaluating the Econometric Evaluations of Training Programs." American Economic Review 76: 604-620.
This experiment was conducted as an RCT, but the table below shows that the covariates are not completely balanced.
[Table: per-covariate treated average, control average, and |(treated avg - control avg) / std|]

Slide 21

Slide 21 text

RCT causal effect := 1676.3426
Dehejia, Rajeev, and Sadek Wahba. (1999). "Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs." Journal of the American Statistical Association 94 (448): 1053-1062. LaLonde, Robert. (1986). "Evaluating the Econometric Evaluations of Training Programs." American Economic Review 76: 604-620.
We use a simple multiple regression model on the RCT sample and consider the "true causal effect" to be 1676.3426.
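A sketch of that regression (the filename is hypothetical, and the column names follow the Dehejia-Wahba files; check them against the actual download):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("nsw_rct.csv")  # hypothetical filename for the NSW RCT sample
    model = smf.ols(
        "re78 ~ treat + age + education + black + hispanic + married + nodegree + re74 + re75",
        data=df,
    ).fit()
    print(model.params["treat"])  # the coefficient on treat, taken as the "true" effect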

Slide 22

Slide 22 text

Validation dataset
Shota Yasui (2020). 効果検証入門 [Introduction to Effect Verification: Causal Inference for Correct Comparisons and the Basics of Econometrics]. Gijutsu-Hyoronsha.
A validation dataset is created by excluding the NSW control group and instead using non-experimental data (CPS: Current Population Survey) as the control group: the treated units come from NSW, the controls from CPS. New dataset!!
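A sketch of this construction (filenames and column names are hypothetical):

    import pandas as pd

    nsw = pd.read_csv("nsw_rct.csv")        # NSW experimental sample
    cps = pd.read_csv("cps_controls.csv")   # CPS non-experimental sample
    # Keep the NSW treated units and use the CPS observations as controls.
    validation_df = pd.concat(
        [nsw[nsw["treat"] == 1], cps.assign(treat=0)],
        ignore_index=True,
    )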


Slide 23

Slide 23 text

Temporary hyperparameters
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
[Figure: network settings - 3 layers / 3 layers, 48 dim / 48 dim / 32 dim, split_outnet==True]

Slide 24

Slide 24 text

Fitting & results
True ATT is… 1676.3426

Slide 25

Slide 25 text

Visualization of the representation layer
● t-SNE visualizations of the balanced representations of our Jobs dataset
● Original data on the left, representation layer on the right
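A sketch of how such a plot can be produced (phi is the fitted repnet output and t the treatment indicator, both assumed to be NumPy arrays):

    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    emb = TSNE(n_components=2).fit_transform(phi)  # 2-D embedding of the representations
    plt.scatter(emb[t == 0, 0], emb[t == 0, 1], alpha=0.5, label="control")
    plt.scatter(emb[t == 1, 0], emb[t == 1, 1], alpha=0.5, label="treated")
    plt.legend()
    plt.show()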

Slide 26

Slide 26 text

5. Experiments and discussion on Hyperparameters
for α & outnet splitting

Slide 27

Slide 27 text

experiment_run.py
The experimental environment is set up with "hydra" and "mlflow":
- hydra: obtains the experimental parameters from a YAML file (configs/experiments.yaml) into the cfg dict.
- mlflow: registers the experimental parameters.
- mlflow: registers the results of the experiment.
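A minimal sketch of the script's skeleton (the config keys and the train_and_estimate routine are hypothetical stand-ins for the actual training code):

    import hydra
    import mlflow
    from omegaconf import DictConfig

    def train_and_estimate(cfg: DictConfig) -> float:
        # hypothetical stand-in for the CFR training loop; returns the estimated ATT
        raise NotImplementedError

    @hydra.main(config_path="configs", config_name="experiments")
    def main(cfg: DictConfig) -> None:
        with mlflow.start_run():
            # register the experimental parameters in mlflow
            mlflow.log_params({"alpha": cfg.alpha, "split_outnet": cfg.split_outnet})
            att_hat = train_and_estimate(cfg)
            # register the results of the experiment in mlflow
            mlflow.log_metric("att_hat", att_hat)

    if __name__ == "__main__":
        main()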

Slide 28

Slide 28 text

experiment_run.py
 Run the Python file as follows and register the results with mlflow.

Slide 29

Slide 29 text

Results & Discussions (1/2)
split_outnet: when split_outnet==False (i.e., the outnet is not split), the treatment effect is consistently estimated to be lower. This is probably an effect of regularization bias (shrinkage estimation): to avoid overfitting, the weights are estimated with an L2 penalty, and penalizing the weight on the treatment term (w_t) may underestimate its effect.

Slide 30

Slide 30 text

Results & Discussions (2/2)
alpha:
● On this data, α provided little benefit. This is consistent with the results of the original paper.
● In any case, tuning α is very difficult: in most cases the true treatment effect is not known, so we have no choice but to make a fuzzy decision based on the IPM and the outcome losses.

Slide 31

Slide 31 text

[Shalit et al., 2017]'s results
[Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
The original paper also used the LaLonde dataset (Jobs). As its table shows, the benefit of α is insignificant on this dataset. α appears to be more effective on the IHDP dataset, which we omitted due to time constraints.

Slide 32

Slide 32 text

6. What's good about CFR?
Searching for α is hard, and there are other ways to estimate the ITE. So what, after all, is good about CFR?

Slide 33

Slide 33 text

Lots of interesting applied research!
[Takeuchi et al., 2021] Takeuchi, K., Nishida, R., Kashima, H., & Onishi, M. (2021). "Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference." arXiv preprint arXiv:2102.03980.
● As you know, deep learning models are very expressive, so the covariates do not have to be tabular data.
● For example, the paper above uses a CNN to adjust for image data as covariates.

Slide 34

Slide 34 text

Reference
(Papers)
● [Shalit et al., 2017] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International Conference on Machine Learning. PMLR, 2017.
● [Johansson et al., 2016] Johansson, Fredrik, Uri Shalit, and David Sontag. "Learning representations for counterfactual inference." International Conference on Machine Learning. PMLR, 2016.
● [Takeuchi et al., 2021] Takeuchi, Koh, et al. "Grab the Reins of Crowds: Estimating the Effects of Crowd Movement Guidance Using Causal Inference." arXiv preprint arXiv:2102.03980, 2021.
● [Ganin et al., 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. "Domain-adversarial training of neural networks." Journal of Machine Learning Research 17(1):2096–2030, 2016.
(Python code)
● [cfrnet] https://github.com/clinicalml/cfrnet
● [SC-CFR] https://github.com/koh-t/SC-CFR

Slide 35

Slide 35 text

Thank you