Statistical Causal Discovery and AI

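Shohei Shimizu
Faculty of Data Science, Shiga University / RIKEN Center for Advanced Intelligence Project
November 10, 2022, Brain Pathophysiology Mathematics and Data Science Seminar (脳病態数理・データ科学セミナー)
5-year term, 2 positions: https://www.shiga-u.ac.jp/wp/wp-content/uploads/DSC_CREST_20221201.pdf
https://twitter.com/sshimizu2006/status/1577993255433564162?s=20&t=70ZDqbghQk-FN6f0ixHJxQ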

What is statistical causal discovery?

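What is statistical causal discovery?
• A methodology for inferring causal graphs from data
• The assumptions cover:
  • Functional form
  • Distribution
  • Presence or absence of unobserved common causes
  • Acyclic or cyclic, etc.
[Figure, Maeda and Shimizu (2020): assumptions and data are combined to infer the causal graph]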

The role of statistical causal discovery

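In statistical causal inference, the causal graph is the linchpin
• Estimate intervention effects from data
  • If chocolate consumption were changed, by how much would the number of Nobel laureates increase (or decrease)?
• Adjust for covariates in order to estimate the intervention effect
• Covariate selection
  • A causal graph is needed to choose which variables to adjust for (e.g., the backdoor criterion)
Messerli (2012), New England Journal of Medicine
[Figure: number of Nobel laureates versus chocolate consumption, and a causal graph over chocolate, Nobel laureates, and GDP]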
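The role of adjustment can be illustrated with a small simulation. The sketch below is not from the slides: it assumes a hypothetical linear model in which GDP is a common cause of chocolate consumption and the number of Nobel laureates, and the true effect of chocolate is zero. All coefficients and variable names are illustrative assumptions.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def simulate(n=20000, seed=0):
    """Hypothetical linear model: GDP drives both variables; chocolate has no effect."""
    rng = random.Random(seed)
    gdp, chocolate, nobel = [], [], []
    for _ in range(n):
        g = rng.gauss(0, 1)
        c = 0.8 * g + rng.gauss(0, 1)
        y = 0.0 * c + 1.5 * g + rng.gauss(0, 1)  # true effect of chocolate is 0
        gdp.append(g)
        chocolate.append(c)
        nobel.append(y)
    return gdp, chocolate, nobel

gdp, chocolate, nobel = simulate()

# Naive slope: regress nobel on chocolate only (the backdoor path via GDP stays open).
naive = cov(chocolate, nobel) / cov(chocolate, chocolate)

# Adjusted slope: coefficient of chocolate in the regression on (chocolate, gdp).
det = cov(chocolate, chocolate) * cov(gdp, gdp) - cov(chocolate, gdp) ** 2
adjusted = (cov(chocolate, nobel) * cov(gdp, gdp)
            - cov(chocolate, gdp) * cov(gdp, nobel)) / det

print(f"naive slope:    {naive:.3f}")
print(f"adjusted slope: {adjusted:.3f}")
```

Under these assumptions the naive slope is clearly positive while the adjusted slope is close to zero. Which variables belong in the adjustment set is exactly what the causal graph has to supply.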

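How to draw the causal graph
• Today: the analyst draws it from domain knowledge
• Going forward: draw it using both domain knowledge and data (with support from AI)
• Causal discovery: draw it from data
[Figure: candidate causal graphs over chocolate, Nobel laureates, and GDP, with undetermined relations marked "?"]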

Statistical causal inference and machine learning: once the causal graph can be drawn, many other things become possible

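Fairness
• Decomposition of effects
  • Intervention effect = total effect
  • Direct effect and indirect effect
• Direct effect
  • If gender is changed from male to female while aptitude is left unchanged, how much does the probability of being hired change?
  • $E(z_{x=\text{female},\, y=y_{x=\text{male}}}) - E(z_{x=\text{male}})$
  • If this is large, the decision is considered unfair with respect to gender
• Building "fair" machine learning models (Kusner et al., 2017)
[Figure: causal graph over x (gender), y (aptitude), z (hiring)]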
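As a worked illustration of the direct effect, the sketch below evaluates it by exact enumeration in a toy model with binary aptitude. The probabilities in `p_y` and `p_z` are invented for illustration and do not come from the slides. The calculation assumes there is no unobserved confounding among x, y, and z.

```python
MALE, FEMALE = 0, 1

def p_y(x):
    """Hypothetical P(y = 1 | do(x)): probability of high aptitude score."""
    return 0.6 - 0.2 * x

def p_z(x, y):
    """Hypothetical P(z = 1 | x, y): probability of being hired."""
    return 0.5 - 0.2 * x + 0.1 * y

def expected_z(x_for_z, x_for_y):
    """E[z] when gender is x_for_z in the hiring equation and
    aptitude takes the value it would have under gender x_for_y."""
    py = p_y(x_for_y)
    return py * p_z(x_for_z, 1) + (1 - py) * p_z(x_for_z, 0)

# Direct effect: change gender in the hiring decision, keep aptitude as for males.
direct = expected_z(FEMALE, MALE) - expected_z(MALE, MALE)

# Total effect: change gender everywhere, including its effect on aptitude.
total = expected_z(FEMALE, FEMALE) - expected_z(MALE, MALE)

print(f"direct effect: {direct:+.2f}")
print(f"total effect:  {total:+.2f}")
```

In this toy model the direct effect is about -0.20 and the total effect about -0.22. The gap is the part of the effect that runs through aptitude.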

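Explainability
• Probability of causation (Pearl, 1999; Kuroki, 2014)
  • Probability of necessity
    • The probability that a subject who in reality took the drug and got the disease would not have gotten the disease had they not taken the drug
  • There are also the probability of sufficiency, the probability of necessity and sufficiency, and others
• Application to AI explainability (Galhotra et al., 2021)
• Bounds (Pearl, 1999)
  • Knowing the causal graph can sometimes narrow them (Kuroki & Cai, 2011)
[Figure: factual world (takes the drug, gets the disease) versus counterfactual world (had the drug not been taken, no disease)]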
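For reference, the three probabilities named above can be written in counterfactual notation for a binary cause and a binary outcome. The symbols are assumptions not taken from the slide: $x$ means taking the drug, $x'$ not taking it, $y$ getting the disease, $y'$ not getting it, and $y_{x}$ is the outcome under the intervention that sets the cause to $x$.

```latex
\begin{align*}
\mathrm{PN}  &= P(y'_{x'} \mid x, y)
  && \text{(probability of necessity)} \\
\mathrm{PS}  &= P(y_{x} \mid x', y')
  && \text{(probability of sufficiency)} \\
\mathrm{PNS} &= P(y_{x},\, y'_{x'})
  && \text{(probability of necessity and sufficiency)}
\end{align*}
```

These quantities are generally not point-identified from data alone, which is why the slide speaks of bounds.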

Methods for statistical causal discovery

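Framework
• Structural causal model (Pearl, 2001)
  • $x_i = f_i(\text{parents of } x_i,\, e_i)$, where $e_i$ is an error variable
• Place assumptions on the causal model, then search within them for a model that is consistent with the data
• Typical example 1:
  • Directed acyclic graph
  • No unobserved common causes (all are observed)
• Typical example 2:
  • Directed acyclic graph
  • Unobserved common causes present
[Figure: causal graph over x1, x2, x3 with error variables e1, e2, e3]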

Causal discovery method 1: using conditional independence
How far can we get without assumptions on functional forms or distributions?
Spirtes, Glymour, Scheines, 2001 (2nd ed)

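An approach with no assumptions on functional forms or distributions
1. Place assumptions on the causal graph
  • Directed acyclic graph
  • No unobserved common causes (all are observed)
2. Among the structures that satisfy the assumptions, choose the graph that is (most) consistent with the data
Three candidates for two variables x and y: (a) and (b), the two graphs with an edge between x and y, one for each direction, and (c), the rightmost graph, with no edge
• If x and y are independent in the data, keep (c)
• If x and y are dependent in the data, keep (a) and (b) (not uniquely determined): an equivalence class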
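The two-variable selection rule can be sketched as follows. This is only an illustration: the slides do not prescribe an independence test, so the sketch swaps in a chi-square test on quartile bins as a simple distribution-free stand-in. The function names, the data-generating choices, and the 5% level are all assumptions.

```python
import random

def quantile_bins(values, k):
    """Assign each value to one of k equal-count bins by rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins

def chi2_statistic(x, y, k=4):
    """Chi-square statistic of the k-by-k table of quantile bins."""
    n = len(x)
    counts = [[0] * k for _ in range(k)]
    for i, j in zip(quantile_bins(x, k), quantile_bins(y, k)):
        counts[i][j] += 1
    expected = n / (k * k)  # equal-count bins make every expected cell count equal
    return sum((c - expected) ** 2 / expected for row in counts for c in row)

CRITICAL_5PCT_DF9 = 16.92  # chi-square critical value, 9 degrees of freedom, level 0.05

def surviving_candidates(x, y):
    """Return the candidate graphs that remain after the independence check."""
    if chi2_statistic(x, y) > CRITICAL_5PCT_DF9:
        return ["(a)", "(b)"]  # dependent: an edge exists, direction undetermined
    return ["(c)"]             # independent: no edge

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
y_dep = [v * v + 0.5 * rng.gauss(0, 1) for v in x]  # nonlinear dependence on x
y_ind = [rng.gauss(0, 1) for _ in range(2000)]      # generated independently of x

stat_dep = chi2_statistic(x, y_dep)
stat_ind = chi2_statistic(x, y_ind)
print(surviving_candidates(x, y_dep), round(stat_dep, 1))
print(surviving_candidates(x, y_ind), round(stat_ind, 1))
```

The dependent pair has zero linear correlation by construction, so a correlation test would miss it. That is why a test without functional-form assumptions is used here.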

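Extensions
• Equivalence classes that include unobserved common causes (Spirtes et al., 1995)
• Use of temporal information (Malinsky & Spirtes, 2018)
• Equivalence classes that include cyclic graphs (Richardson, 1996)
• "Lower bounds" on intervention effects (Maathuis et al., 2009; Malinsky & Spirtes, 2017)
[Figure from F. Eberhardt, CRM Workshop 2016: example graphs over x, y, w, z with unobserved variables f, f1, f2]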

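Political economy data example (Malinsky & Spirtes, 2018)
• 1967-1992
• 16 OECD countries
• Longitudinal data
  • Pooled after adjusting the mean for each country
• 8 variables
  • Capital tax rate (captax)
  • Economic indicators
    • GDP per capita growth rate (growthpc)
    • Unemployment rate (unemployment)
  • Pressure from globalization
    • Total foreign direct investment (fdi)
    • Share of imports from low-wage countries (lowwage)
  • Demographic factors
    • Dependency ratio (depratio)
  • Political factors
    • Share of cabinet posts held by left-wing parties and by Christian democratic parties (left, cdem)
• Tigramite: https://github.com/jakobrunge/tigramite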

Causal effects among gene expression levels
Maathuis et al. (2010) Predicting causal effects in large-scale systems from observational data. Nature Methods.
• 5361 variables, sample size 63
• Estimate the causal graph by causal discovery, then estimate (lower bounds on) the causal effects based on it (the lower bounds can be loose)
• Checked against the results of actual intervention experiments, the predictions were better than random guessing
[Figure: excerpt from Maathuis et al. (2010), panels a and b, comparing IDA, Lasso, Elastic-net, and random guessing (true positives versus false positives; pAUC versus m)]
• Extension to the case with unobserved common causes (Malinsky & Spirtes, 2017). Code: https://github.com/dmalinsk/lv-ida

Causal discovery method 2: adding assumptions on functional forms and distributions
Under what conditions is the graph uniquely identifiable?

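What happens if we also place assumptions on functional forms and distributions
• There is usable information beyond conditional independence
• For example, linearity plus non-Gaussian continuous distributions
[Figure: the two models with opposite causal directions between x1 and x2]
The distributions of the observed variables x1 and x2 differ between the two models (their conditional independence relations do not)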

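LiNGAM model (Shimizu, Hyvarinen, Hoyer & Kerminen, 2006)
• Linear Non-Gaussian Acyclic Model:
  $x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i$
  • $k(i)$: causal (partial) order of $x_i$ (topological order)
  • The error variables $e_i$ are non-Gaussian, continuous, and mutually independent
  • Acyclic
  • No unobserved common causes
• The coefficients $b_{ij}$ and the order $k(i)$ are identifiable (can be uniquely estimated) from the data $X$
[Figure: example causal graph over x1, x2, x3 with error variables e1, e2, e3 and coefficients b21, b13]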
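The model can also be written in matrix form, which makes the link to independent component analysis visible. This restatement is not on the slide, and the symbol $A$ is introduced here only for illustration. $B$ collects the coefficients $b_{ij}$, and if the variables are arranged in the causal order $k(i)$, $B$ is strictly lower triangular, so $I - B$ is invertible.

```latex
\[
\mathbf{x} = B\mathbf{x} + \mathbf{e}
\quad\Longrightarrow\quad
\mathbf{x} = (I - B)^{-1}\mathbf{e} = A\,\mathbf{e},
\qquad A := (I - B)^{-1}.
\]
```

Because the components of $\mathbf{e}$ are independent and non-Gaussian, this has the form of an independent component analysis model.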

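Concretely, how are non-Gaussianity and independence used?
Correct model: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$, with $b_{21} \neq 0$
• Regressing the effect $x_2$ on the cause $x_1$:
  $r_2^{(1)} = x_2 - \frac{\mathrm{cov}(x_2, x_1)}{\mathrm{var}(x_1)} x_1 = x_2 - b_{21} x_1 = e_2$
  • $x_1\,(= e_1)$ and the residual $r_2^{(1)}$ are independent
• Regressing the cause $x_1$ on the effect $x_2$:
  $r_1^{(2)} = x_1 - \frac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_2)} x_2 = \left\{ 1 - \frac{b_{21}\,\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_2)} \right\} e_1 - \frac{b_{21}\,\mathrm{var}(x_1)}{\mathrm{var}(x_2)} e_2$
  • $x_2\,(= b_{21} e_1 + e_2)$ and the residual $r_1^{(2)}$ are dependent, because both contain $e_2$
• Under Gaussianity, uncorrelated = independent, so this asymmetry disappears
[Figure: causal graph x1 → x2 with error variables e1, e2]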
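The asymmetry between the two regressions can be checked numerically. The sketch below makes several assumptions not found in the slides: uniform errors, $b_{21} = 1$, and a deliberately crude dependence score (the absolute correlation between the squared centered regressor and the squared residual). Proper implementations would use an independence measure such as HSIC.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

def residual(target, regressor):
    """Residual of the least-squares regression of target on regressor."""
    slope = cov(target, regressor) / cov(regressor, regressor)
    mt, mr = mean(target), mean(regressor)
    return [(t - mt) - slope * (r - mr) for t, r in zip(target, regressor)]

def dependence(regressor, resid):
    """Crude dependence score: |corr| of squared centered regressor and squared residual."""
    m = mean(regressor)
    return abs(corr([(r - m) ** 2 for r in regressor], [e * e for e in resid]))

def both_directions(draw, n=20000, b21=1.0):
    """Simulate x1 = e1, x2 = b21*x1 + e2 and score both regression directions."""
    x1 = [draw() for _ in range(n)]
    x2 = [b21 * v + draw() for v in x1]
    forward = dependence(x1, residual(x2, x1))   # effect regressed on cause
    backward = dependence(x2, residual(x1, x2))  # cause regressed on effect
    return forward, backward

rng = random.Random(0)
forward, backward = both_directions(lambda: rng.uniform(-1, 1))
gauss_forward, gauss_backward = both_directions(lambda: rng.gauss(0, 1))

print(f"uniform errors:  forward {forward:.3f}, backward {backward:.3f}")
print(f"Gaussian errors: forward {gauss_forward:.3f}, backward {gauss_backward:.3f}")
```

With uniform errors the backward score is clearly away from zero while the forward score is near zero. With Gaussian errors both are near zero, so the direction cannot be told apart this way.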

Applications of causal discovery: searching for candidate causes of a target
https://www.shimizulab.org/lingam/lingampapers/applications-and-tailor-made-methods
• Life science (Maathuis et al., 2010)
• Medicine (Kotoku et al., 2020)
• Chemistry (Campomanes et al., 2014)
• Materials (Nelson et al., 2021)
• Climatology (Liu et al., 2020)
• Economics (Moneta et al., 2013)
• Psychology (von Eye et al., 2012)
• Policy (Takayama et al., 2021)
• Network data (Jarry et al., 2021)
[Figures: causal graphs from Kotoku et al. (2020) and Moneta et al. (2013); the latter relates OpInc.gr, Empl.gr, Sales.gr, and R&D.gr at times t, t+1, and t+2]

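Causal discovery on health checkup data (Kotoku et al., 2020; Oyama et al., 2022)
• Fiscal year 2016
• Osaka Prefecture specific health checkups
• About 30,000 people
• Body measurements (height, BMI)
• Blood pressure (systolic blood pressure: sBP)
• Lipids (LDL cholesterol: LDL, HDL cholesterol: HDL, triglycerides: TG)
• Liver function (GOT, γGT, GPT)
• Blood glucose (fasting blood glucose: fBG, HbA1c)
e.g., increasing HDL lowers triglycerides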

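Estimating causal networks among brain regions (Ogawa et al., NeuroImage, 2022)
• Motor imagery and motor execution are realized by networks of neurons across brain regions
• Subjects perform a task while their brain activity is measured with fMRI
• The network among brain regions is estimated by causal discovery
  • Each "brain region" variable is a composite score
• What causal relations hold among the brain regions responsible for motor imagery and motor execution, and how do they differ across conditions?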

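Comparing structures across conditions
• In all conditions
• Within the same hemisphere
  • L-dPMC -> L-M1 (positive)
  • R-dPMC -> R-M1 (positive)
  • L-vPMC -> L-M1 (positive)
  • R-vPMC -> R-M1 (positive)
  • dPMC, vPMC, and M1 are all frontal motor regions
• Symmetry
• Across hemispheres
  • L-dPMC -> R-M1 (negative)
  • R-dPMC -> L-M1 (negative)
• Asymmetry
• Build bootstrap confidence intervals for the differences in coefficients and display those that do not contain zero
  • Whether the right or the left arm was used
  • Motor execution (ME) or motor imagery (MI)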

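Nonlinear models
• Run a nonlinear regression and check whether the explanatory variable and the residual are independent (Hoyer et al., 2009; Zhang & Hyvarinen, 2009; Peters et al., 2014)
Correct model: $x_1 = e_1$, $x_2 = f(x_1) + e_2$
• Nonlinear regression of the effect $x_2$ on the cause $x_1$: the explanatory variable $x_1\,(= e_1)$ and the residual are independent
• Nonlinear regression of the cause $x_1$ on the effect $x_2$: the explanatory variable $x_2$ and the residual are dependent
[Figure: causal graph x1 → x2 with error variables e1, e2]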
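The same check can be sketched for a nonlinear model. Everything specific here is an assumption made for illustration: $f(x) = x^3$, uniform errors, a piecewise-constant fit (bin means) standing in for a proper nonlinear regression, and the ratio of largest to smallest within-bin residual variance standing in for a proper independence test. A ratio near 1 is consistent with independence; a large ratio signals that the residual depends on the explanatory variable.

```python
import random

def binned_residual_variances(regressor, target, n_bins=50):
    """Fit target by its mean within equal-count bins of the regressor
    and return the residual variance inside each bin."""
    pairs = sorted(zip(regressor, target))
    size = len(pairs) // n_bins
    variances = []
    for b in range(n_bins):
        ys = [t for _, t in pairs[b * size:(b + 1) * size]]
        m = sum(ys) / len(ys)
        variances.append(sum((y - m) ** 2 for y in ys) / len(ys))
    return variances

def variance_ratio(regressor, target, n_bins=50):
    """Largest over smallest within-bin residual variance."""
    v = binned_residual_variances(regressor, target, n_bins)
    return max(v) / min(v)

rng = random.Random(0)
n = 20000
x1 = [rng.uniform(-1, 1) for _ in range(n)]
x2 = [v ** 3 + rng.uniform(-0.2, 0.2) for v in x1]  # x2 = f(x1) + e2 with f(x) = x^3

forward = variance_ratio(x1, x2)   # effect regressed on cause
backward = variance_ratio(x2, x1)  # cause regressed on effect

print(f"forward ratio:  {forward:.2f}")
print(f"backward ratio: {backward:.2f}")
```

In the correct direction the residual spread is roughly constant across bins, while in the reverse direction it changes strongly with the explanatory variable.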

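LiNGAM that allows unobserved common causes (Maeda & Shimizu, 2020)
• Variable pairs that are likely to have unobserved common causes
• Causal direction for variable pairs that have no unobserved common causes
• Nonlinear additive version (Maeda & Shimizu, 2021)
  • Unobserved intermediate variables
https://lingam.readthedocs.io/en/latest/tutorial/rcd.html
[Figure: true graph versus estimated output]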

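The case with unobserved common causes (Hoyer, Shimizu, Kerminen & Palviainen, 2008)
• Model: $\mathbf{x} = B\mathbf{x} + \Lambda\mathbf{f} + \mathbf{e}$
• Identifiability (Hoyer et al., 2008; Salehkaleybar et al., 2020)
  • The causal order is identifiable
  • Whether the causal effects can also be determined depends on the structure
[Figure: three candidate graphs over x1 and x2 with an unobserved common cause f1 (coefficients λ11, λ21): x1 → x2 with b21, x2 → x1 with b12, and no edge between x1 and x2]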
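To see why unobserved common causes matter, consider the two-variable case with $x_1 \to x_2$. The short derivation below is not on the slide; it assumes $f_1$, $e_1$, and $e_2$ are mutually independent. It shows that the ordinary regression coefficient no longer equals $b_{21}$.

```latex
\begin{align*}
x_1 &= \lambda_{11} f_1 + e_1, \qquad
x_2 = b_{21} x_1 + \lambda_{21} f_1 + e_2, \\
\frac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)}
  &= \frac{b_{21}\,\mathrm{var}(x_1) + \lambda_{21}\,\mathrm{cov}(x_1, f_1)}{\mathrm{var}(x_1)}
   = b_{21} + \frac{\lambda_{11}\lambda_{21}\,\mathrm{var}(f_1)}{\mathrm{var}(x_1)}.
\end{align*}
```

The second term is the confounding bias; it vanishes only if $\lambda_{11}\lambda_{21} = 0$, that is, if $f_1$ is not a common cause of both variables.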

Evaluation

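Assessing statistical reliability
• Bootstrap probabilities of directed paths and directed edges
• For example, interpret those that exceed a threshold of 0.05
[Figure: bootstrap probabilities of directed paths between x3 and x1 across resamples (e.g., 99%, 96%, 10%); total effect: 20.9]
https://lingam.readthedocs.io/en/latest/tutorial/bootstrap.html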
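The idea of a bootstrap probability can be sketched for a single pair of variables. The direction rule below is a hypothetical stand-in for a full LiNGAM estimator. It compares a crude dependence score between regressor and residual in both directions, on data simulated with uniform errors. It is not the procedure used in the linked tutorial, and the function names are illustrative.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def dependence_after_regression(regressor, target):
    """Regress target on regressor, then score dependence between the regressor
    and the residual by the |correlation| of their squares (a crude proxy)."""
    slope = cov(target, regressor) / cov(regressor, regressor)
    mr, mt = mean(regressor), mean(target)
    sq_reg = [(r - mr) ** 2 for r in regressor]
    sq_res = [((t - mt) - slope * (r - mr)) ** 2 for t, r in zip(target, regressor)]
    return abs(cov(sq_reg, sq_res)) / (cov(sq_reg, sq_reg) * cov(sq_res, sq_res)) ** 0.5

def infer_direction(x, y):
    """Pick the direction whose residual looks less dependent on the regressor."""
    return "x->y" if dependence_after_regression(x, y) < dependence_after_regression(y, x) else "y->x"

def bootstrap_probability(x, y, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which the edge x->y is selected."""
    rng = random.Random(seed)
    n = len(x)
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        if infer_direction([x[i] for i in idx], [y[i] for i in idx]) == "x->y":
            hits += 1
    return hits / n_boot

data_rng = random.Random(42)
x = [data_rng.uniform(-1, 1) for _ in range(1000)]
y = [v + data_rng.uniform(-1, 1) for v in x]  # true direction: x -> y

prob = bootstrap_probability(x, y)
THRESHOLD = 0.05  # example threshold from the slide
print(f"bootstrap probability of x->y: {prob:.2f}, interpret: {prob > THRESHOLD}")
```

An edge that is selected in most resamples is stable under sampling variation; one that appears rarely is a weak basis for interpretation.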

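Evaluating the model assumptions
• Before the analysis
  • Gaussianity test
  • Histograms
  • Are the variables continuous?
  • Multicollinearity
  • Domain knowledge
• After the analysis
  • Assess the independence of the errors (residuals)
    • For example, HSIC (Gretton et al., 2005)
  • Evaluate by how well the Markov boundary predicts (Biza et al., 2020) (not yet implemented)
  • Compare results across multiple datasets
  • Evaluate using domain knowledge
• Revise the assumptions, add variables, and so on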
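As one possible pre-analysis Gaussianity check, the sketch below implements the Jarque-Bera test, which combines sample skewness and kurtosis. The slides do not name a specific test, so this choice and the simulated data are assumptions. For a LiNGAM-type analysis, a rejection of Gaussianity is the outcome that supports the model assumption.

```python
import math
import random

def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Under Gaussianity, JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exp(-t / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value

rng = random.Random(0)
uniform_sample = [rng.uniform(-1, 1) for _ in range(5000)]
gauss_sample = [rng.gauss(0, 1) for _ in range(5000)]

jb_uniform, p_uniform = jarque_bera(uniform_sample)
jb_gauss, p_gauss = jarque_bera(gauss_sample)

print(f"uniform:  JB = {jb_uniform:.1f}, p = {p_uniform:.3g}")
print(f"Gaussian: JB = {jb_gauss:.1f}, p = {p_gauss:.3g}")
```

A histogram of each variable remains a useful complement, since a test statistic alone does not show which kind of departure from Gaussianity is present.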

Packages for statistical causal discovery

Python and R packages for statistical causal discovery
• Tetrad: https://github.com/cmu-phil/tetrad
• Tigramite: https://github.com/jakobrunge/tigramite
• Causal-learn: https://github.com/cmu-phil/causal-learn
• pcalg (R): https://cran.r-project.org/web/packages/pcalg/
• LiNGAM: https://github.com/cdt15/lingam
• Commercial software and analysis services
  • Causal analysis (NEC) https://jpn.nec.com/solution/causalanalysis/index.html
  • NTech Predict (ニュートラル) https://www.ipros.jp/product/detail/2000690909/
  • Causal discovery solution (SCREEN AS) https://www.screen.co.jp/as/solution/causal
[Slide annotations on the package list: "methods based on conditional independence" and "both"]

Summary

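Summary
• Statistical causal inference
  • Fairly mature when the causal graph can be drawn from domain knowledge
  • When it cannot, data-driven support is the key going forward: statistical causal discovery
• Extend what AI supports beyond classification and regression to causal inference as well
  • Research questions are not only about correlation-based prediction
• Related papers: https://www.shimizulab.org/lingam/lingampapers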

Supplement: mixed discrete and continuous data
• T. Handhayani, J. Cussens. Kernel-based Approach to Handle Mixed Data for Inferring Causal Graphs. https://arxiv.org/abs/1910.03055
• M. Tsagris, G. Borboudakis, V. Lagani & I. Tsamardinos. Constraint-based causal discovery with mixed data. https://link.springer.com/article/10.1007/s41060-018-0097-y https://cran.r-project.org/web/packages/MXM/index.html
• W. Wei and L. Feng. Nonlinear Causal Structure Learning for Mixed Data. https://ieeexplore.ieee.org/document/9679161/
• W. Wenjuan, F. Lu, and L. Chunchen. Mixed Causal Structure Discovery with Application to Prescriptive Pricing. https://www.ijcai.org/proceedings/2018/0711.pdf
