Slide 1

Slide 1 text

Comparison of estimation methods in causal inference (with RCT benchmark)
 twitter @asas_mimi

Slide 2

Slide 2 text

Well-known RCT dataset: LaLonde(1986)
 The National Supported Work Demonstration (NSW): this experiment asks whether "vocational training" (counseling and short-term work experience) affects subsequent earnings.
 ● Treatment variable: treat (vocational training, 1 or 0)
 ● Outcome variable: re78 (income in 1978)
 Data can be downloaded from: https://users.nber.org/~rdehejia/data/
 Sources:
 Dehejia, Rajeev and Sadek Wahba. (1999). Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062.
 LaLonde, Robert. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review 76: 604-620.

Slide 3

Slide 3 text

Basic statistics : NSW
 This experiment was conducted as an RCT, but the table below shows that the covariates are not completely balanced.
 [Table: covariate balance. Columns: treated average, control average, |(treated avg - control avg) / std|]
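The balance metric in the last column can be sketched as follows. This is a minimal illustration, assuming that "std" means the standard deviation of the pooled sample (the slide does not say which one is used); the function name and the toy values are hypothetical and are not the NSW data.

```python
import numpy as np

def standardized_mean_diff(x, treat):
    """|mean(treated) - mean(control)| / std, with std taken over the pooled sample (assumption)."""
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat).astype(bool)
    diff = x[treat].mean() - x[~treat].mean()
    return abs(diff) / x.std(ddof=1)

# Toy covariate (hypothetical values, not the NSW sample)
age = np.array([20.0, 30.0, 40.0, 50.0])
treat = np.array([1, 1, 0, 0])
print(standardized_mean_diff(age, treat))
```

A value close to 0 indicates that the covariate is well balanced between the two groups.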

Slide 4

Slide 4 text

RCT causal effect := 1676.3426 
 We use a simple multiple regression model and treat its estimate, 1676.3426, as the "true causal effect".

Slide 5

Slide 5 text

Validation dataset
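 A validation dataset is created by dropping the NSW control group and using non-experimental data (CPS: Current Population Survey) as the control group instead.
 [Figure: the NSW treated group combined with the CPS control group forms the new dataset]
 Source: Yasui, Shota (2020). 効果検証入門:正しい比較のための因果推論/計量経済学の基礎 [Introduction to Effect Verification: Causal Inference for Correct Comparison / Fundamentals of Econometrics]. Gijutsu-Hyoron-sha.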


Slide 6

Slide 6 text

Our Assumptions
 True ATT = 1676.3426
 ● In the treated group, we know the true effect from the RCT. However, there is no guarantee that the same effect would be obtained if the treatment were applied to the CPS data. In other words, ATT = ATU does not necessarily hold, so we do not know the true ATE.
 ● All we can learn from this validation data is the ATT (average treatment effect on the treated), which we take to be 1676.3426.
 Strongly ignorable treatment assignment
 ● Strong ignorability can be expressed in terms of conditional independence: if treatment assignment T is conditionally independent of the potential outcomes Y(1), Y(0) given the confounding covariates X, the treatment assignment is said to be strongly ignorable.
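For reference, the quantities on this slide can be written out as below. This is a sketch in standard potential-outcome notation rather than a transcription of the slide; the overlap condition on the last line is part of the usual definition of strong ignorability and is added here as an assumption.

```latex
\begin{align}
\mathrm{ATT} &= E[\,Y(1) - Y(0) \mid T = 1\,] \\
\mathrm{ATU} &= E[\,Y(1) - Y(0) \mid T = 0\,] \\
\mathrm{ATE} &= P(T = 1)\,\mathrm{ATT} + P(T = 0)\,\mathrm{ATU} \\
\bigl(Y(1), Y(0)\bigr) &\perp\!\!\!\perp T \mid X, \qquad 0 < P(T = 1 \mid X) < 1
\end{align}
```

The third line shows why the ATE stays unknown here: it needs the ATU, which the RCT benchmark does not provide for the CPS population.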

Slide 7

Slide 7 text

Our Approaches 
 Identification assumption: conditional treatment assignment
 (1) Multiple regression
 (2) Propensity score approach (IPW)
 (3) Meta Learner
 ● S-Learner
 ● T-Learner
 ● X-Learner
 ● DomainAdaptation-Learner
 (4) Double/Debiased ML
 Identification assumption: conditional parallel-trend
 (5) Doubly Robust DID
 (6) Double/Debiased DID
 Except for Doubly Robust DID, these methods were implemented and validated without relying on a convenient package such as EconML, for two reasons:
 - To calculate the standard error (using the bootstrap method)
 - For my own study
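The bootstrap standard error mentioned above can be sketched as follows. This is a minimal nonparametric bootstrap under stated assumptions: rows are resampled with replacement, and the estimator (here a hypothetical difference in means) is re-run on each resample. The function names, the number of replications, and the synthetic data are illustrative and are not taken from the author's repository.

```python
import numpy as np

def diff_in_means(data):
    """Toy estimator: column 0 is the treatment flag, column 1 the outcome."""
    treat = data[:, 0] == 1
    return data[treat, 1].mean() - data[~treat, 1].mean()

def bootstrap_se(estimator, data, n_boot=500, seed=0):
    """Standard deviation of the estimator over bootstrap resamples of the rows."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample row indices with replacement
        estimates[b] = estimator(data[idx])
    return estimates.std(ddof=1)

# Synthetic example (not the NSW/CPS data)
rng = np.random.default_rng(1)
treat = np.repeat([1.0, 0.0], 100)
outcome = 2.0 * treat + rng.normal(size=200)
print(bootstrap_se(diff_in_means, np.column_stack([treat, outcome])))
```

Any of the estimators in this deck can be passed in place of `diff_in_means`, provided it takes the resampled data and returns a single number.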

Slide 8

Slide 8 text

(1) Multiple regression
 Recently, propensity score-based methods have become popular, and some people assume that causal inference cannot be done with multiple regression models ("Multiple regression? Nonsense. lol").
 ● Remember that the identification strategy is almost the same as in the propensity score-based approach.
 ● Whether it is easier to model the outcome directly or the assignment to the treatment group depends on the case.
 Although this approach is very simple, it is a mistake to conclude that causal inference is impossible merely because the model is a multiple regression.
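As a minimal sketch of this approach, the treatment effect can be read off the coefficient of the treatment dummy in an ordinary least squares fit that also includes the covariates. The function name and the toy data are hypothetical, and the sketch assumes a linear outcome model with a constant treatment effect.

```python
import numpy as np

def ols_treatment_effect(y, treat, X):
    """Coefficient on the treatment dummy in y ~ 1 + treat + X, fitted by least squares."""
    design = np.column_stack([np.ones(len(y)), treat, X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Noise-free toy data: y = 1 + 3 * treat + 2 * x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
treat = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
y = 1.0 + 3.0 * treat + 2.0 * x
print(ols_treatment_effect(y, treat, x))  # recovers 3.0 up to floating-point error
```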

Slide 9

Slide 9 text

(2) IPW for ATT
 In IPW estimation of the ATT, the weight w is defined so that only the control group is reweighted by the propensity score.
 After weighting, covariate balance improved.
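A common form of the ATT weight is w_i = T_i + (1 - T_i) e(X_i) / (1 - e(X_i)), where e(X) is the propensity score. The sketch below assumes this standard form (the slide's own formula is not reproduced here) and takes the propensity scores as given; the function name and toy data are hypothetical.

```python
import numpy as np

def ipw_att(y, treat, ps):
    """ATT by IPW: treated units keep weight 1, controls get ps / (1 - ps)."""
    w = np.where(treat == 1, 1.0, ps / (1.0 - ps))
    treated_mean = np.average(y[treat == 1], weights=w[treat == 1])
    control_mean = np.average(y[treat == 0], weights=w[treat == 0])
    return treated_mean - control_mean

# Toy data with two strata (propensity 0.2 and 0.5) and a true effect of 5
y = np.array([5.0, 0.0, 0.0, 0.0, 0.0, 15.0, 15.0, 10.0, 10.0])
treat = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0])
ps = np.array([0.2] * 5 + [0.5] * 4)
print(ipw_att(y, treat, ps))                        # weighted estimate
print(y[treat == 1].mean() - y[treat == 0].mean())  # naive difference, for comparison
```

In this toy example the naive difference in means is biased because treated units are concentrated in the high-outcome stratum; reweighting the controls makes their covariate distribution match that of the treated.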

Slide 10

Slide 10 text

(3) Meta Learner : S & T -Learner
 We use the following process to estimate the ATT:
 (1) Calculate the CATE (conditional ATE) for each individual using these algorithms.
 (2) Calculate the ATT by averaging the estimated CATE over the T=1 records.
 ● S-Learner: the simplest method. Fit a single model that includes the treatment as a feature, then adopt the difference between its predictions at T=1 and T=0 as the CATE.
 ● T-Learner: create separate models for the treatment and control groups, then take the difference between the outputs of the two models for each record.
 Source: EconML, "EconML User Guide" (https://econml.azurewebsites.net/spec/estimation/metalearners.html)
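The two learners and the ATT averaging step can be sketched as below. This is an illustration only: ordinary least squares stands in for the base learner (the deck reports using LGBM), the S-Learner is given an explicit treatment-covariate interaction so that a linear model can express a varying CATE, and all names and data are hypothetical.

```python
import numpy as np

def fit_linear(features, y):
    """Least-squares coefficients for the given feature matrix."""
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef

def s_learner_cate(x, t, y):
    """One model with the treatment as a feature; CATE = prediction at T=1 minus prediction at T=0."""
    def feats(x, t):
        return np.column_stack([np.ones_like(x), x, t, x * t])
    coef = fit_linear(feats(x, t), y)
    return feats(x, np.ones_like(x)) @ coef - feats(x, np.zeros_like(x)) @ coef

def t_learner_cate(x, t, y):
    """Separate models per group; CATE = treated-model prediction minus control-model prediction."""
    def feats(x):
        return np.column_stack([np.ones_like(x), x])
    coef1 = fit_linear(feats(x[t == 1]), y[t == 1])
    coef0 = fit_linear(feats(x[t == 0]), y[t == 0])
    return feats(x) @ coef1 - feats(x) @ coef0

# Noise-free toy data: Y(0) = 1 + 2x, effect tau(x) = 3 + x
x = np.arange(8, dtype=float)
t = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
y = 1.0 + 2.0 * x + t * (3.0 + x)

# Step (2): average the CATE over the T=1 records to obtain the ATT
att_s = s_learner_cate(x, t, y)[t == 1].mean()
att_t = t_learner_cate(x, t, y)[t == 1].mean()
print(att_s, att_t)  # both close to 7.0, the mean of 3 + x over the treated units
```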

Slide 11

Slide 11 text

(3) Meta Learner : X-Learner
 
 As with the S- and T-Learner, the ATT is obtained by averaging the estimated CATE over the T=1 records.
 ● Step 1: Estimate the outcome functions.
 ● Step 2: Compute the imputed treatment effects.
 ● Step 3: Estimate the CATE in two ways.
 ● Step 4: Average the two estimates, weighted by g(x) (g(x): propensity score).
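The four X-Learner steps can be sketched as follows, combining the two CATE estimates as g(x) τ0(x) + (1 - g(x)) τ1(x). This is a minimal illustration: ordinary least squares is a stand-in base learner, the propensity score g is passed in as a known array, and all names and data are hypothetical.

```python
import numpy as np

def fit_predict(x_train, y_train, x_eval):
    """Hypothetical base learner: least squares on [1, x], evaluated at x_eval."""
    design = np.column_stack([np.ones_like(x_train), x_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef[0] + coef[1] * x_eval

def x_learner_cate(x, t, y, g):
    x1, y1 = x[t == 1], y[t == 1]
    x0, y0 = x[t == 0], y[t == 0]
    # Steps 1-2: fit an outcome model per group and impute individual effects
    d1 = y1 - fit_predict(x0, y0, x1)  # treated: observed minus predicted control outcome
    d0 = fit_predict(x1, y1, x0) - y0  # control: predicted treated outcome minus observed
    # Step 3: estimate the CATE in two ways, one from each set of imputed effects
    tau1 = fit_predict(x1, d1, x)
    tau0 = fit_predict(x0, d0, x)
    # Step 4: combine the two estimates using the propensity score g(x)
    return g * tau0 + (1.0 - g) * tau1

# Noise-free toy data: Y(0) = 1 + 2x, effect tau(x) = 3 + x
x = np.arange(8, dtype=float)
t = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
y = 1.0 + 2.0 * x + t * (3.0 + x)
g = np.full(8, 0.5)  # assumed known propensity score

att_x = x_learner_cate(x, t, y, g)[t == 1].mean()
print(att_x)  # close to 7.0, the mean of 3 + x over the treated units
```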

Slide 12

Slide 12 text

(3) Meta Learner : DA-Learner
 As with the other meta learners, the ATT is obtained by averaging the estimated CATE over the T=1 records.
 ● Step 1: Estimate the outcome functions using propensity score weighting.
 ● Step 2: Compute the imputed treatment effects.
 ● Step 3: Estimate the CATE.

Slide 13

Slide 13 text

(4) Double/Debiased Machine Learning for non-linear CATE
 We use the following process to estimate the ATT:
 (1) Calculate the CATE (conditional ATE) for each individual using these algorithms.
 (2) Calculate the ATT by averaging the estimated CATE over the T=1 records.
 In DML, τ is a function of X, and the aim is to compute the CATE.
 Estimating τ(X) can be viewed as weighted supervised learning, in which the objective is rewritten in terms of a supervised label and a sample weight.
 Source: EconML, "EconML User Guide" (https://econml.azurewebsites.net/spec/estimation/dml.html)
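The weighted supervised learning view can be written out as below. This is a sketch in the usual residual-on-residual notation, which is an assumption here rather than a transcription of the slide: the tildes denote the outcome and treatment after subtracting their conditional means given X (in practice, cross-fitted ML predictions).

```latex
\begin{align}
\tilde{Y} &= Y - E[Y \mid X], \qquad \tilde{T} = T - E[T \mid X] \\
\hat{\tau} &= \arg\min_{\tau} \sum_i \bigl(\tilde{Y}_i - \tau(X_i)\,\tilde{T}_i\bigr)^2
            = \arg\min_{\tau} \sum_i \tilde{T}_i^{2}
              \Bigl(\frac{\tilde{Y}_i}{\tilde{T}_i} - \tau(X_i)\Bigr)^2
\end{align}
```

In the last expression the ratio of residuals plays the role of the supervised label and the squared treatment residual plays the role of the sample weight, so any regressor that accepts sample weights can be used to fit τ(X).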

Slide 14

Slide 14 text

Assumptions for DID models
 ● Conditional parallel-trend: after conditioning on X, the trend of the potential outcomes (the counterfactual outcomes if no intervention is received) is the same in the treatment group and the control group. Trends that violate the parallel-trend assumption unconditionally may become comparable once we condition on X.
 ● Common support: the support of the propensity score (ps) of the treated group is a subset of the support for the untreated group. Where the propensity scores do not overlap, the groups are not comparable. This is the same constraint placed on ATT estimation in other propensity score methods.
 [Figures: trend plots of the treatment and control groups, unconditional and conditional on X; propensity score distributions with and without overlap]
 Source: Chang, N. C. (2020). Double/debiased machine learning for difference-in-differences models. The Econometrics Journal, 23(2), 177-191.
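The two assumptions can be stated formally as below. This is a sketch in notation chosen for this note, not copied from the slide: Y_t(0) is the untreated potential outcome in period t (0 = before, 1 = after), D is the treatment group indicator, and X the covariates.

```latex
\begin{align}
E[\,Y_{1}(0) - Y_{0}(0) \mid X, D = 1\,] &= E[\,Y_{1}(0) - Y_{0}(0) \mid X, D = 0\,] \\
P(D = 1) > 0, \qquad P(D = 1 \mid X) &< 1 \ \text{with probability one}
\end{align}
```

The second line only bounds the propensity score away from 1, which is what "the support of the treated is a subset of the support of the untreated" requires for the ATT.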

Slide 15

Slide 15 text

(5) Doubly Robust DID
 For this model only, the R package was used as is. 
 
 Please note that this is not a fair comparison since we used the default model without any modification.
 https://psantanna.com/DRDID/index.html

Slide 16

Slide 16 text

(6) Double/Debiased DID
 ● Outcome model: supervised learning with label = Diff. (the change in the outcome), trained on the control group only
 ● Propensity score model
 ● Cross-fitting separates the samples used for "fitting" and "prediction", as in Chernozhukov et al. (2018)
 Source: Chang, N. C. (2020). Double/debiased machine learning for difference-in-differences models. The Econometrics Journal, 23(2), 177-191.
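The pieces listed above can be combined into an ATT estimate of the form mean[(D - g(X)) / (p (1 - g(X))) × (ΔY - l(X))], where g is the propensity score, l is the outcome-change model fitted on controls, and p = P(D = 1). The sketch below is a simplified illustration of that score with 2-fold cross-fitting, not the author's implementation: nuisance functions are estimated by cell means of a discrete covariate (a stand-in for an ML model), folds are assigned by index parity, p is estimated on the full sample, and every name is hypothetical.

```python
import numpy as np

def cell_means(x_train, v_train, x_eval):
    """Stand-in nuisance learner for a discrete covariate: mean of v within each x cell.
    Assumes every cell in x_eval also appears in x_train."""
    out = np.empty(len(x_eval))
    for value in np.unique(x_eval):
        out[x_eval == value] = v_train[x_train == value].mean()
    return out

def dml_did_att(x, d, delta_y, n_folds=2):
    n = len(x)
    folds = np.arange(n) % n_folds  # simple deterministic fold assignment
    p_hat = d.mean()                # estimate of P(D = 1)
    scores = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Propensity score g(X) = P(D = 1 | X), fitted on the other fold(s)
        g_hat = cell_means(x[train], d[train], x[test])
        # Outcome-change model (label = Diff.), learned with the control group only
        ctrl = train & (d == 0)
        l_hat = cell_means(x[ctrl], delta_y[ctrl], x[test])
        scores[test] = (d[test] - g_hat) / (p_hat * (1.0 - g_hat)) * (delta_y[test] - l_hat)
    return scores.mean()

# Synthetic panel: the trend depends on X, and the true ATT is 2
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2000)
d = (rng.random(2000) < np.where(x == 1, 0.5, 0.2)).astype(float)
delta_y = 1.0 + 2.0 * x + 2.0 * d + rng.normal(size=2000)
print(dml_did_att(x, d, delta_y))
```

Fitting the nuisance functions on one fold and evaluating the score on the other is what keeps overfitting in the ML models from leaking into the final estimate.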

Slide 17

Slide 17 text

Estimated Results 
 ● Point estimates and absolute error relative to the RCT result
 ● The default hyperparameters of LGBM are used for DML, Meta Learner, and DMLDID. DRDID uses the R package defaults.
 If we only consider point estimates, DML is closest to the RCT result.
 In the present case (tabular data with low-dimensional features), a simple model is sufficient to adjust for the bias, without resorting to the more complex models.

Slide 18

Slide 18 text

Standard error…
 ● Considering the standard error, it is clear that DML was not suitable in this case.
 ● This may be due to underfitting of the ML model, since cross-fitting reduces the amount of data available for training.

Slide 19

Slide 19 text

Conclusions
 ● In this experiment, the identification strategy is almost the same across methods.
 ● The only difference is the "estimation method".
 ● The most important thing in practice is to agree on the identification strategy, which should be discussed with the stakeholders and should make full use of domain knowledge.
 ● The choice of estimation method itself should be decided flexibly depending on the characteristics of the data.
 ○ In the present case, a nonlinear ML-based approach proved to be overkill.
 ■ Of course, if the given data are expected to have high-dimensional features, or if treatments or outcomes depend on them nonlinearly, the ML-based approach is likely to have strengths.
 ○ It is better to try several estimation methods and, if they disagree, to dig deeper into the causes.

Slide 20

Slide 20 text

 Inappropriate advice: "Multiple regression model? That's too naive! If you're going to do causal inference, it has to be a XX model (propensity model or DML or …), right? So reject!!"
 Constructive advice: "Multiple regression model, right? I see. But your data is p>>n, so OLS may not estimate it well. Maybe you should try DML or something?"

Slide 21

Slide 21 text

Extra Analysis
 ● Since DML and Meta Learner calculate the CATE, it is useful to visualize it using shap.
 ● We can also check flexible non-linear relationships, rather than relying on interaction terms in linear models.
 Finding: younger age groups with higher 1975 annual income are more likely to benefit from the treatment.
 If you are interested, please refer to my notebook.

Slide 22

Slide 22 text

Thank you
 
 ※ The Python code for this article is stored in this repository.
 https://github.com/MasaAsami/D2ML 
 

Slide 23

Slide 23 text

References:
 [1] Yasui, S. (author), Hoxo-m Inc. (supervisor) (2020). 効果検証入門:正しい比較のための因果推論/計量経済学の基礎 [Introduction to Effect Verification: Causal Inference for Correct Comparison / Fundamentals of Econometrics]. Gijutsu-Hyoron-sha.
 [2] Microsoft Research. EconML User Guide.
 [3] Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.
 [4] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
 [5] Sant'Anna, P. H., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122.
 [6] R pkg. Doubly Robust Difference-in-Differences.
 [7] Chang, N. C. (2020). Double/debiased machine learning for difference-in-differences models. The Econometrics Journal, 23(2), 177–191.