LiNGAM approach to causal discovery
Shohei SHIMIZU, Shiga University & RIKEN
The KDD2021 Workshop on Causal Discovery (CD2021)

What is causal discovery?
• Methodology for inferring causal graphs from data
• Assumptions:
  – Functional form?
  – Distribution?
  – Hidden common causes present?
  – Acyclic? etc.
[Figure: data combined with assumptions yields a causal graph; Maeda and Shimizu (2020)]

Causal graphs are the key to statistical causal inference
• Estimate intervention effects
  – A causal graph is needed to select the variables to adjust for, e.g., using the backdoor criterion (Pearl, 1995)
• Also useful for machine learning
  – E.g., domain adaptation (Zhang et al., 2020), fairness (Kusner et al., 2017), and interpretability (Blobaum & Shimizu, 2017)
[Figure: chocolate consumption vs. number of Nobel laureates (Messerli, 2012); candidate causal graph over Chocolate, Nobel laureates, and GDP]

How do we draw a causal graph?
• Common way: use background knowledge
• Often need to use both background knowledge and data
• Causal discovery: infer the causal graph from data
[Figure: alternative candidate causal graphs over Chocolate, Nobel laureates, and GDP]

Application areas
• Epidemiology: sleep problems and depression mood (Rosenstrom et al., 2012)
• Economics (Moneta et al., 2012)
• Neuroscience (Boukrina & Graves, 2013)
• Chemistry (Campomanes et al., 2014)
• Preventive medicine (Kotoku et al., 2020)
• Climatology (Liu & Niyogi, 2020)

Causal discovery is a challenge in causal inference
• The classical non-parametric approach uses conditional independence (Pearl 2001; Spirtes 1993)
  – Makes no assumptions about functional forms or distributions
  – Its limit: the causal graph can be identified only up to a Markov equivalence class
• Additional assumptions are needed to go beyond this limit
  – Restrictions on functional forms and distributions
  – The model becomes uniquely identifiable, or the number of equivalent models becomes smaller
• LiNGAM is one example (Shimizu et al., 2006; Shimizu, 2014)
  – Non-Gaussian assumption to exploit independence
  – Growing literature on its variants (Peters et al., 2018; Shimizu & Blobaum, 2020)

Methods of causal discovery

Framework
• Structural causal model (Pearl, 2001): $x_i = f_i(\text{parents of } x_i, e_i)$, where $e_i$ is an error variable
• Make assumptions and find the causal graph(s) consistent with the data
  – Typical example 1:
    • Directed acyclic graph (DAG)
    • No hidden common causes (all observed)
  – Typical example 2:
    • DAG
    • Hidden common causes may exist
[Figure: example causal graph over x1, x2, x3 with error variables e1, e2, e3]

Non-parametric approach
To what extent can we infer the causal graph without making any assumptions about the functional form or distribution?
Spirtes, Glymour, Scheines, 2001 (2nd ed)

Non-parametric approach: Example
1. Make assumptions on the underlying causal graph
  – Directed acyclic graph
  – No hidden common causes (all have been observed)
2. Among the causal graphs that satisfy the assumptions, find the graph that best matches the data
  – If x and y are independent in the data, select (c).
  – If x and y are dependent in the data, select (a) and (b).
  – (a) and (b) are indistinguishable (not uniquely identifiable): they form a Markov equivalence class
[Figure: three candidate graphs over x and y: (a) and (b) have an edge between x and y in opposite directions; (c) has no edge]


Various extensions
• Equivalent models including unobserved common causes (Spirtes et al., 1995)
• Those for time series cases (Malinsky & Spirtes, 2018)
• Equivalence class including cyclic graphs (Richardson, 1996)
• Lower bound on intervention effects (Maathuis et al., 2009; Malinsky & Spirtes, 2017)
[Figure: example graphs over x, y, w, z with hidden variables f, f1, f2; F. Eberhardt, CRM Workshop 2016]

Semi-parametric approach: Make additional assumptions on functional forms and distributions
Which assumptions make causal graphs identifiable?

Make additional assumptions on functional forms and distributions
• More information is available than conditional independence
• E.g., linearity + non-Gaussian continuous distributions
  – Graphs (a) and (b) result in different distributions of the two observed variables
  – They do not differ in terms of conditional independence
[Figure: two candidate graphs over x and y, (a) and (b), with opposite edge directions]

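LiNGAM model is identifiable (Shimizu, Hyvarinen, Hoyer & Kerminen, 2006)
• Linear Non-Gaussian Acyclic Model:
  $x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i$, or in matrix form $\boldsymbol{x} = B\boldsymbol{x} + \boldsymbol{e}$
  – $k(i)$ ($i = 1, \dots, p$): causal (topological) order of $x_i$
  – Error variables $e_i$ are independent and non-Gaussian
• Coefficients and causal orders are identifiable
• Causal graph is identifiable
[Figure: causal graph over $x_1$, $x_2$, $x_3$ with error variables $e_1$, $e_2$, $e_3$ and edge coefficients $b_{21}$, $b_{23}$, $b_{13}$]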
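To make the model equation concrete, here is a minimal simulation sketch of $\boldsymbol{x} = B\boldsymbol{x} + \boldsymbol{e}$ in plain Python. The function name `simulate_lingam`, the example coefficients, and the choice of uniform errors are assumptions for illustration only; they are not taken from the slides or from any package.

```python
import random

def simulate_lingam(B, n_samples, seed=0):
    """Draw samples from x = Bx + e with independent uniform (non-Gaussian) errors.

    B[i][j] is the direct effect of x_j on x_i. B is assumed to be strictly
    lower triangular, i.e. the variables are already listed in a causal order.
    """
    rng = random.Random(seed)
    p = len(B)
    samples = []
    for _ in range(n_samples):
        x = [0.0] * p
        for i in range(p):
            e_i = rng.uniform(-1.0, 1.0)  # non-Gaussian error variable
            # Only variables earlier in the causal order can be parents of x_i.
            x[i] = sum(B[i][j] * x[j] for j in range(i)) + e_i
        samples.append(x)
    return samples

# Hypothetical example: x0 -> x1 (coefficient 2.0), x1 -> x2 (coefficient -1.5).
B_example = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, -1.5, 0.0],
]
X = simulate_lingam(B_example, n_samples=1000)
```

Generating each variable in causal order avoids inverting $I - B$ and mirrors the acyclicity assumption directly.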

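How do we use non-Gaussianity and independence?
• Underlying model: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ ($b_{21} \neq 0$), with $e_1$, $e_2$ non-Gaussian
• Regress effect $x_2$ on cause $x_1$:
  $r_2^{(1)} = x_2 - \frac{\mathrm{cov}(x_2, x_1)}{\mathrm{var}(x_1)} x_1 = x_2 - b_{21} x_1 = e_2$
  – $x_1 = e_1$ and the residual $r_2^{(1)}$ are independent
• Regress cause $x_1$ on effect $x_2$:
  $r_1^{(2)} = x_1 - \frac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_2)} x_2 = \left(1 - \frac{b_{21}\,\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_2)}\right) e_1 - \frac{b_{21}\,\mathrm{var}(x_1)}{\mathrm{var}(x_2)} e_2$
  – $x_2 = b_{21} e_1 + e_2$ and the residual $r_1^{(2)}$ are dependent, although they are uncorrelated
[Figure: graph $x_1 \to x_2$ with coefficient $b_{21}$ and error variables $e_1$, $e_2$]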
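The asymmetry between the two regressions can be checked numerically. The sketch below simulates the two-variable model with uniform errors and compares the regressor with the residual in both directions. As a crude dependence check it uses the correlation between squared values; that choice, the sample size, and all names are assumptions for illustration, not the measure used in the slides.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((s - ma) * (t - mb) for s, t in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

def residual(target, regressor):
    """Residual of the least-squares regression of target on regressor."""
    slope = cov(target, regressor) / cov(regressor, regressor)
    return [t - slope * r for t, r in zip(target, regressor)]

def squared(v):
    return [t * t for t in v]

rng = random.Random(0)
n = 20000
e1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
e2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
x1 = e1                                   # cause
x2 = [a + b for a, b in zip(x1, e2)]      # effect, with b21 = 1

r2_given_1 = residual(x2, x1)  # regress effect on cause
r1_given_2 = residual(x1, x2)  # regress cause on effect

# Ordinary correlation is zero in both directions by construction.
linear_causal = corr(x1, r2_given_1)
linear_anticausal = corr(x2, r1_given_2)

# Correlation of squares reveals the dependence in the wrong direction only.
nonlinear_causal = corr(squared(x1), squared(r2_given_1))
nonlinear_anticausal = corr(squared(x2), squared(r1_given_2))
```

With Gaussian errors both squared-value correlations would vanish, which is why the non-Gaussianity assumption matters here.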

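Independence measure (Hyvarinen & Smith, 2013)
• The difference between the mutual information of explanatory variable and residual in the two directions can be computed from one-dimensional entropies:
  $I(x_1, r_2^{(1)}) - I(x_2, r_1^{(2)}) = \left[ H(x_1) + H\!\left( \frac{r_2^{(1)}}{\mathrm{sd}(r_2^{(1)})} \right) \right] - \left[ H(x_2) + H\!\left( \frac{r_1^{(2)}}{\mathrm{sd}(r_1^{(2)})} \right) \right]$
• Maximum entropy approximation of the entropy $H$ (Hyvarinen, 1999):
  $H(u) \approx H(v) - k_1 \left[ E\{\log \cosh u\} - \gamma \right]^2 - k_2 \left[ E\{u \exp(-u^2/2)\} \right]^2$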
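A minimal sketch of this measure for one pair of variables follows. It assumes that $v$ is a standard Gaussian variable, so $H(v) = (1 + \log 2\pi)/2$, and that both variables are standardized first. The numerical constants are the values commonly quoted for this approximation and should be checked against the cited papers; the function names are hypothetical.

```python
import math
import random

# Constants of the maximum entropy approximation (assumed values, see lead-in).
K1, K2, GAMMA = 79.047, 7.4129, 0.37457
H_GAUSS = 0.5 * (1.0 + math.log(2.0 * math.pi))  # entropy of a standard Gaussian

def standardize(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((t - m) ** 2 for t in v) / len(v))
    return [(t - m) / s for t in v]

def entropy_approx(u):
    """Approximate entropy of a standardized sample u."""
    n = len(u)
    a = sum(math.log(math.cosh(t)) for t in u) / n - GAMMA
    b = sum(t * math.exp(-t * t / 2.0) for t in u) / n
    return H_GAUSS - K1 * a * a - K2 * b * b

def residual(target, regressor):
    """Least-squares residual; both inputs are assumed to be standardized."""
    slope = sum(t * r for t, r in zip(target, regressor)) / sum(r * r for r in regressor)
    return [t - slope * r for t, r in zip(target, regressor)]

def mi_difference(x1, x2):
    """Estimate I(x1, r2|1) - I(x2, r1|2); a negative value favors x1 -> x2."""
    a, b = standardize(x1), standardize(x2)
    r_b_given_a = standardize(residual(b, a))
    r_a_given_b = standardize(residual(a, b))
    return (entropy_approx(a) + entropy_approx(r_b_given_a)) - (
        entropy_approx(b) + entropy_approx(r_a_given_b)
    )

rng = random.Random(0)
n = 20000
x1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
x2 = [a + rng.uniform(-1.0, 1.0) for a in x1]  # true direction: x1 -> x2

score_forward = mi_difference(x1, x2)
score_backward = mi_difference(x2, x1)
```

Swapping the arguments flips the sign of the score, so a single evaluation is enough to pick a direction for the pair.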

Evaluation of estimated causal graphs

Before estimating causal graphs
• Assess the assumptions by
  – Gaussianity tests
  – Histograms: are the variables continuous?
  – Checking for overly high correlations: is there multicollinearity?
  – Background knowledge
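Two of these checks are easy to sketch in plain Python. The example below uses the Jarque-Bera statistic as one possible Gaussianity test and flags variable pairs whose absolute correlation exceeds a threshold. The choice of test, the 0.95 threshold, and all names are assumptions for illustration; the slides do not prescribe them.

```python
import math
import random

def central_moment(v, k):
    m = sum(v) / len(v)
    return sum((t - m) ** k for t in v) / len(v)

def jarque_bera(v):
    """Jarque-Bera statistic and p-value (chi-square with 2 degrees of freedom)."""
    n = len(v)
    m2 = central_moment(v, 2)
    skewness = central_moment(v, 3) / m2 ** 1.5
    excess_kurtosis = central_moment(v, 4) / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square with 2 df
    return jb, p_value

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((s - ma) * (t - mb) for s, t in zip(a, b))
    den = math.sqrt(sum((s - ma) ** 2 for s in a) * sum((t - mb) ** 2 for t in b))
    return num / den

def highly_correlated_pairs(columns, threshold=0.95):
    """Return (name_i, name_j, r) for pairs with |r| above the threshold."""
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = correlation(columns[names[i]], columns[names[j]])
            if abs(r) > threshold:
                flagged.append((names[i], names[j], r))
    return flagged

rng = random.Random(0)
n = 5000
data = {
    "uniform": [rng.uniform(-1.0, 1.0) for _ in range(n)],
    "gaussian": [rng.gauss(0.0, 1.0) for _ in range(n)],
}
# A nearly collinear copy of the first column, to trigger the correlation check.
data["near_copy"] = [2.0 * u + 0.01 * rng.gauss(0.0, 1.0) for u in data["uniform"]]

jb_uniform, p_uniform = jarque_bera(data["uniform"])
jb_gaussian, p_gaussian = jarque_bera(data["gaussian"])
flagged = highly_correlated_pairs(data)
```

A small p-value speaks against Gaussianity, which is what the LiNGAM assumptions need; a flagged pair suggests that regression coefficients involving those variables may be unstable.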

After estimating causal graphs
• Assess the assumptions by
  – Testing independence of error variables, e.g., by HSIC (Gretton et al., 2005)
  – Prediction accuracy using the Markov boundary (Biza et al., 2020)
  – Comparing with the results for other datasets in which the causal graphs are expected to be similar
  – Checking against background knowledge
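For the first check, a minimal sketch of the HSIC statistic with Gaussian kernels is given below. It computes the biased empirical estimate $\mathrm{tr}(KHLH)/n^2$; normalization conventions vary between references, and the fixed kernel width of 1.0 is an assumption. The sketch returns only the statistic. A p-value would need, for example, a permutation test, which is omitted here.

```python
import math
import random

def gaussian_gram(v, sigma=1.0):
    """Gram matrix of a Gaussian kernel on a one-dimensional sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in v] for a in v]

def center(K):
    """Double-center a symmetric Gram matrix, i.e. compute HKH."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    # K is symmetric, so column means equal row means.
    return [
        [K[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(x)
    Kc = center(gaussian_gram(x, sigma))
    L = gaussian_gram(y, sigma)
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2

rng = random.Random(0)
n = 300
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
x = [rng.uniform(-half_width, half_width) for _ in range(n)]
y_independent = [rng.uniform(-half_width, half_width) for _ in range(n)]
y_dependent = [t * t for t in x]  # uncorrelated with x, but fully dependent on it

hsic_independent = hsic(x, y_independent)
hsic_dependent = hsic(x, y_dependent)
```

The dependent pair is constructed to be uncorrelated, so a correlation test would miss it while the HSIC statistic is clearly larger than for the independent pair.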

Statistical reliability assessment
• Bootstrap probability (bp) of directed paths and edges
• Interpret causal effects whose bp is larger than a threshold, say 5%
• LiNGAM Python package:
[Figure: directed paths between x1 and x3 found across bootstrap samples, annotated with bootstrap probabilities (99%, 96%, 10%) and a total effect of 20.9]
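The idea of a bootstrap probability can be sketched for a single pair of variables without relying on any package. The example below resamples the data with replacement, re-estimates the direction each time with an entropy-based pairwise score, and reports the fraction of resamples that agree on x1 → x2. The direction estimator, its constants, the number of resamples, and all names are assumptions for illustration; this is not the API of the LiNGAM Python package.

```python
import math
import random

# Constants of the entropy approximation (assumed values; check the cited papers).
K1, K2, GAMMA = 79.047, 7.4129, 0.37457
H_GAUSS = 0.5 * (1.0 + math.log(2.0 * math.pi))

def standardize(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((t - m) ** 2 for t in v) / len(v))
    return [(t - m) / s for t in v]

def entropy_approx(u):
    n = len(u)
    a = sum(math.log(math.cosh(t)) for t in u) / n - GAMMA
    b = sum(t * math.exp(-t * t / 2.0) for t in u) / n
    return H_GAUSS - K1 * a * a - K2 * b * b

def ols_slope(target, regressor):
    mt, mr = sum(target) / len(target), sum(regressor) / len(regressor)
    num = sum((t - mt) * (r - mr) for t, r in zip(target, regressor))
    return num / sum((r - mr) ** 2 for r in regressor)

def residual(target, regressor):
    s = ols_slope(target, regressor)
    return [t - s * r for t, r in zip(target, regressor)]

def x1_causes_x2(x1, x2):
    """Pairwise direction estimate: True if the entropy score favors x1 -> x2."""
    a, b = standardize(x1), standardize(x2)
    score = (entropy_approx(a) + entropy_approx(standardize(residual(b, a)))) - (
        entropy_approx(b) + entropy_approx(standardize(residual(a, b)))
    )
    return score < 0.0

def bootstrap_probability(x1, x2, n_boot=25, seed=0):
    """Fraction of bootstrap resamples in which x1 -> x2 is estimated,
    plus the average regression slope of x2 on x1 over those resamples."""
    rng = random.Random(seed)
    n = len(x1)
    effects = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        s1 = [x1[i] for i in idx]
        s2 = [x2[i] for i in idx]
        if x1_causes_x2(s1, s2):
            effects.append(ols_slope(s2, s1))
    bp = len(effects) / n_boot
    mean_effect = sum(effects) / len(effects) if effects else float("nan")
    return bp, mean_effect

data_rng = random.Random(1)
n = 6000
x1 = [data_rng.uniform(-1.0, 1.0) for _ in range(n)]
x2 = [a + data_rng.uniform(-1.0, 1.0) for a in x1]  # true effect of x1 on x2 is 1.0

bp, mean_effect = bootstrap_probability(x1, x2)
```

For a full graph the same loop would wrap a complete LiNGAM estimator and count how often each directed edge or path appears.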

To relax the model assumptions

Other identifiable models
• Nonlinearity + “additive” noise (Hoyer+08NIPS, Zhang+09UAI, Peters+14JMLR)
  – $x_i = f_i(\mathrm{par}(x_i)) + e_i$
  – $x_i = g_i^{-1}(f_i(\mathrm{par}(x_i)) + e_i)$
• Discrete variables
  – Poisson DAG model and its extensions (Park+18JMLR)
• Mixed types of variables: LiNGAM + logistic-type model
  – Identifiability condition for two variables (Wenjuan+18IJCAI)
  – Probably also holds for multivariate cases, using the idea of Thm. 28 of Peters et al. (2014)


For better statistical reliability

For better statistical reliability
• Use background knowledge in estimation
  – Causal orders
  – Specify functional forms
  – Specify distributions
• E.g., in manufacturing, the causal order of these 3 groups is often known
  – Manufacturing conditions
  – Intermediate characteristics
  – Final characteristic(s)
[Figure: layered graph with manufacturing conditions 1 to 10, intermediate characteristics 1 to 100, and the final characteristic]

For better statistical reliability
• Simultaneously analyze different datasets to exploit their similarity (Ramsey et al., 2011; Shimizu, 2012)
  – Similarity: the causal orders are the same; distributions and coefficients may differ
  – Accuracy greatly improved on simulated fMRI data (Ramsey et al., 2011)
[Figure: Dataset 1 and Dataset 2, each a graph over x1, x2, x3 with error variables e1, e2, e3 and different edge coefficients (4, -3, 2 in Dataset 1; -0.5, 5 in Dataset 2)]

LiNGAM with hidden common causes

Estimate causal structures of variables that do not share hidden common causes
• For unconfounded pairs (no hidden common causes), estimate the causal directions
• For confounded pairs (with hidden common causes), leave the causal directions unknown
[Figure: underlying model over $x_1$, $x_2$, $x_3$, $x_4$ with hidden common causes $f_1$, $f_2$, and the corresponding output graph]

Non-Gaussianity and independence work again
• The existence of hidden common causes leads to dependence between an explanatory variable and its residual (Tashiro et al., 2014)
• Key result (Maeda & Shimizu, 2020)
  – Find a set of variables that gives independent residuals when a variable is regressed on every subset of that set
  – If this succeeds, the variables in such a set (x1 and x2) are the unconfounded ancestors of the variable (x4)
• For nonlinear additive models, the existence of hidden intermediate variables also leads to dependence (Maeda & Shimizu, 2021)
[Figure: example graphs with observed variables $x$ and hidden common causes $f$]
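The first bullet can be illustrated with a small simulation. The sketch below adds a hidden common cause f to a two-variable linear model with uniform errors and shows that the residual stays dependent on the regressor in both regression directions. The coefficients, the sample size, and the use of squared-value correlation as a crude dependence check are assumptions for illustration.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((s - ma) * (t - mb) for s, t in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

def residual(target, regressor):
    """Residual of the least-squares regression of target on regressor."""
    slope = cov(target, regressor) / cov(regressor, regressor)
    return [t - slope * r for t, r in zip(target, regressor)]

def squared(v):
    return [t * t for t in v]

rng = random.Random(0)
n = 40000
f = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # hidden common cause (unobserved)
e1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
e2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Observed variables: x1 -> x2, and f affects both.
x1 = [a + b for a, b in zip(f, e1)]
x2 = [u + a + b for u, a, b in zip(x1, f, e2)]

# Dependence between regressor and residual, in both directions.
dep_x2_on_x1 = corr(squared(x1), squared(residual(x2, x1)))
dep_x1_on_x2 = corr(squared(x2), squared(residual(x1, x2)))
```

Neither direction yields an independent residual here, which is the signal that the pair is confounded and its direction should be left undetermined.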

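Estimate causal structures of variables that share hidden common causes (Hoyer, Shimizu, Kerminen & Palviainen, 2008; Salehkaleybar et al., 2020)
• LiNGAM with unobserved common causes is ICA (Hyvarinen et al., 2001):
  $\boldsymbol{x} = B\boldsymbol{x} + \Lambda\boldsymbol{f} + \boldsymbol{e}$, so $\boldsymbol{x} = \begin{pmatrix} (I - B)^{-1} & (I - B)^{-1}\Lambda \end{pmatrix} \begin{pmatrix} \boldsymbol{e} \\ \boldsymbol{f} \end{pmatrix}$, where the entries of $\boldsymbol{e}$ and $\boldsymbol{f}$ are the independent components
• Apply ICA and look at the zero/non-zero pattern of the mixing matrix (the third column stands for the entries of $(I - B)^{-1}\Lambda$):
  – $x_1 \to x_2$: $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & \lambda_{11} \\ b_{21} & 1 & \lambda_{21} \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ f_1 \end{pmatrix}$
  – $x_2 \to x_1$: $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & b_{12} & \lambda_{11} \\ 0 & 1 & \lambda_{21} \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ f_1 \end{pmatrix}$
  – No edge: $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & \lambda_{11} \\ 0 & 1 & \lambda_{21} \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ f_1 \end{pmatrix}$
[Figure: the three corresponding graphs over $x_1$, $x_2$ with hidden common cause $f_1$ and error variables $e_1$, $e_2$]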
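As a worked check of the zero/non-zero pattern, consider the case $x_1 \to x_2$ with one hidden common cause $f_1$. The notation here is an assumption: $\lambda_{11}$ and $\lambda_{21}$ are read as the direct effects of $f_1$, i.e. the entries of $\Lambda$. Substituting the first equation into the second gives

$$
\begin{aligned}
x_1 &= \lambda_{11} f_1 + e_1, \\
x_2 &= b_{21} x_1 + \lambda_{21} f_1 + e_2 = b_{21} e_1 + e_2 + (\lambda_{21} + b_{21}\lambda_{11}) f_1, \\
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 1 & 0 & \lambda_{11} \\ b_{21} & 1 & \lambda_{21} + b_{21}\lambda_{11} \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ f_1 \end{pmatrix}.
\end{aligned}
$$

The zero in the first row says that $e_2$ does not reach $x_1$. In the reverse case $x_2 \to x_1$ the zero moves to the second row, and with no edge both off-diagonal entries of the left block are zero. Under this reading, the entry for $f_1$ in the row of the effect variable is a total effect, which is generically non-zero, so the three patterns remain distinguishable.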

LiNGAM for latent factors

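LiNGAM for latent factors (Shimizu et al., 2009)
• Model: $\boldsymbol{f} = B\boldsymbol{f} + \boldsymbol{\epsilon}$, $\boldsymbol{x} = G\boldsymbol{f} + \boldsymbol{e}$
  – Two pure measurement variables per latent factor are needed to identify the measurement model (Silva et al., 2006; Xie et al., 2020)
• Estimate the latent factors and then their causal graph
[Figure: two latent factors $f_1$, $f_2$ with unknown causal direction between them, measured by observed variables $x_1$ to $x_4$]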

Find common and unique factors across multiple datasets (Zeng et al., 2021)
• Model: $\boldsymbol{f}^{(m)} = B^{(m)} \boldsymbol{f}^{(m)} + \boldsymbol{\epsilon}^{(m)}$, $\boldsymbol{x}^{(m)} = G^{(m)} \boldsymbol{f}^{(m)} + \boldsymbol{e}^{(m)}$, $m = 1, \dots, M$
• Score function: likelihood + DAGness (Zheng et al., 2018)
• Feature extraction across multiple datasets + causal discovery of latent factors
[Figure: latent-factor graphs for several datasets, asking whether a factor in one dataset is the same as a factor in another]

Final summary

Final summary
• Statistical causal inference is a fundamental tool for science
  – Many well-developed methods are available when a causal graph can be drawn from background knowledge
  – Helping to draw causal graphs from data is the key: causal discovery
• LiNGAM-related papers:
• Next default assumptions
  – Hidden common causes / latent factors
  – Mixed data: continuous and discrete
  – (Cyclicity (Lacerda et al., 2008))

