

S³ Seminar
November 29, 2024

Shuyu Dong (Université Paris-Saclay and INRIA Saclay)

Title — Large-scale causal structure learning: challenges and new methods

Abstract — Learning causal structures from observational data is a fundamental problem. In the context of linear structural equation models (SEMs), I will present our work on addressing the computational challenges of the causal structure learning problem. The first work [1] considers a question more basic than causal structure learning: how to learn a directed acyclic graph (DAG) that is closest to a given weighted adjacency matrix of a directed graph. We proposed a low-rank matrix model of DAGs that combines low-rank matrix factorization with a sparsification mechanism; the model is learned through a continuous optimization method. The second work [2] presents a divide-and-conquer strategy (called DCILP) for causal structure learning. The divide phase proceeds by identifying the Markov blanket MB($X_i$) of each variable $X_i$ and independently addressing the subproblems restricted to each MB. The novelty of DCILP lies in the conquer phase, which tackles the problem of aggregating the local causal graphs from the divide phase. Such an aggregation is a challenging combinatorial optimization problem, especially in large-scale applications. We show that this reconciliation can be formulated as an integer linear programming (ILP) problem that is delegated to an ILP solver, and that the resulting algorithm demonstrates significant improvements in scalability with satisfactory learning accuracy.

References
[1] Shuyu Dong and Michele Sebag. From graphs to DAGs: a low-complexity model and a scalable algorithm. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 107–122. Springer, 2022.

[2] Shuyu Dong, Michele Sebag, Kento Uemura, Akito Fujii, Shuang Chang, Yusuke Koyanagi, and Koji Maruhashi. DCDILP: a distributed learning method for large-scale causal structure learning, 2024. URL https://arxiv.org/abs/2406.10481.

Bio
My main research area is optimization for machine learning and causal learning. I have worked on Riemannian optimization methods for learning low-rank matrix and tensor decomposition models, with applications in matrix and tensor completion (on recommendation, traffic, and visual data). One of the topics I focus on is Riemannian preconditioning and the convergence properties of the underlying Riemannian gradient descent algorithms. I am currently working on topics at the intersection of causal learning and optimization, such as matrix decomposition for DAG learning and integer programming methods. The main outcome of this research is the development of scalable algorithms for learning large causal graphs from observational data.


Transcript

  1. Large-scale causal structure learning: challenges and new methods.
     Séminaire Scube, L2S, November 29, 2024. Shuyu Dong (1), joint work with Michèle Sebag (1), Kento Uemura (2), Akito Fujii (2), Shuang Chang (2), Yusuke Koyanagi (2), and Koji Maruhashi (2).
     (1) INRIA TAU team, LISN, Université Paris-Saclay; (2) Fujitsu Laboratories.
  2. Causality in a few examples.
     [Figures: gene relationships (Yuan and Bar-Joseph, 2019); image classifiers (Sauer and Geiger, 2021)]
  3. Structural Equation Models (SEMs).
     Definition: $X_i = \sum_{j=1}^{d} B_{ji} X_j + N_i$ for all $i = 1, \dots, d$.
     $B$: weighted adjacency matrix of a directed acyclic graph (DAG) $G$.
     $N_i$: noise variable, $N_i \perp\!\!\!\perp X_j$ for all $j \in \mathrm{Pa}_B(i) := \{k \in [d] : B_{ki} \neq 0\}$.
     [Figure: a 10-node example DAG (Kalainathan et al., 2022)]
     Remarks:
     ▶ Generalization: $X_i = \sum_{j \in \mathrm{Pa}_B(i)} f_j(X_j) + N_i$ (Additive Noise Model, ANM).
     ▶ No latent variables: all variables in $X = (X_1, \dots, X_d)$ are observed.
     Goal: Given $n$ observations $\mathbf{X} \in \mathbb{R}^{n \times d}$ of $X$, learn the adjacency matrix $B$ or the graph $G(B)$ of $B$.
  4. Structural Equation Models (SEMs), continued.
     Same definition, remarks, and goal as slide 3, illustrated on two variables:
     Graphical model:    $X_1 \to X_2$ | $X_1 \leftarrow X_2$ | $X_1 \perp\!\!\!\perp X_2$
     (0,1)-adjacency:    $\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ | $\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}$ | $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$
     Weighted adjacency: $\begin{pmatrix} 0 & w \\ 0 & 0 \end{pmatrix}$ | $\begin{pmatrix} 0 & 0 \\ w & 0 \end{pmatrix}$ | $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$
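
To make the generative model above concrete, here is a minimal sketch (not from the talk) of sampling $n$ observations from a linear Gaussian SEM, assuming the variables are indexed in a topological order so that $B$ is strictly upper triangular; `sample_linear_sem` and its arguments are illustrative names.

```python
import numpy as np

def sample_linear_sem(B, n, noise_scale=1.0, rng=None):
    """Draw n samples of X satisfying X_i = sum_j B_ji X_j + N_i, with Gaussian N_i."""
    rng = np.random.default_rng(rng)
    d = B.shape[0]
    X = np.zeros((n, d))
    for i in range(d):                 # topological order: parents of X_i have index < i
        noise = noise_scale * rng.standard_normal(n)
        X[:, i] = X @ B[:, i] + noise  # X_i = sum_j B_ji X_j + N_i
    return X

# Example: the chain X_1 -> X_2 -> X_3 encoded as a strictly upper-triangular B
B = np.array([[0.0, 1.5, 0.0],
              [0.0, 0.0, -0.8],
              [0.0, 0.0, 0.0]])
X = sample_linear_sem(B, n=1000, rng=0)
print(X.shape)  # (1000, 3)
```
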
  5. Causal structure learning.
     Same SEM definition as slide 3.
     PC algorithm (Spirtes et al., 2000; Kalisch and Bühlmann, 2007).
     [Figure: illustration of the PC algorithm, reproduced from Peters et al. (2017), Figure 7.1]
  6. Causal structure learning.
     GES algorithm (Chickering, 2002): greedy search to maximize the Bayesian information criterion (BIC) score $S(G; \mathbf{X}) = \log p(\mathbf{X} \mid G, \theta) - \frac{d}{2} \log(n)$.
     [Figure: a CPDAG and two Markov-equivalent DAGs $G_0$, $G'$ on variables $X, Y, Z, U, V$]
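
As a rough illustration of the kind of score being maximized, the following hedged sketch computes a BIC-style score for a candidate weighted adjacency matrix $B$ under a linear Gaussian SEM with unit noise variance. It uses the number of edges as the effective parameter count (the slide writes $\frac{d}{2}\log n$), and `bic_score` is an illustrative name, not the implementation used in the talk.

```python
import numpy as np

def bic_score(X, B, noise_var=1.0):
    """Gaussian log-likelihood of the residuals X - X B, minus a BIC penalty on the edge count."""
    n, d = X.shape
    resid = X - X @ B                                   # residuals N_i = X_i - sum_j B_ji X_j
    loglik = (-0.5 * np.sum(resid ** 2) / noise_var
              - 0.5 * n * d * np.log(2 * np.pi * noise_var))
    n_edges = np.count_nonzero(B)
    return loglik - 0.5 * n_edges * np.log(n)           # higher is better
```

Greedy score-based procedures such as GES repeatedly add, remove, or reverse a single edge and keep the move that most increases such a score.
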
  7. Causal structure learning.
     Same GES setup and figure as slide 6.
     Proposition (e.g., Peters et al. (2017)): Two DAGs are Markov equivalent if and only if (i) they have the same skeleton, and (ii) they have the same v-structures.
     Theorem (Chickering, 2002): Assume that the distribution of $X$ is faithful to the DAG $G$. Then the CPDAG resulting from GES equals the equivalence class of $G$ in the large-sample limit.
  8. Challenges in causal structure learning.
     ▶ NP-hardness (Chickering, 1996).
     ▶ Identifiability: Markov equivalence at best in most cases (Ghassami et al., 2020).
     ▶ A large search space: the size of $\mathrm{DAG}(d) := \{B \in \{0,1\}^{d \times d} : G(B) \text{ is a DAG}\}$ grows as $|\mathrm{DAG}(d)| \approx d!\, 2^{d^2/2}$.
     Notation:
     ▶ The set of weighted adjacency matrices of DAGs: $\mathrm{DAG}(d) := \{B \in \mathbb{R}^{d \times d} : G(B) \text{ is a DAG}\}$.
     ▶ The support $\mathrm{Supp}(B)$ indicates the edge set of $G(B)$. (Peters et al., 2017)
  9. A continuous optimization method.
     Theorem (Zheng et al., 2018): The graph of $B \in \mathbb{R}^{d \times d}$ is a DAG if and only if $h(B) := \mathrm{tr}(\exp(B \odot B)) - d = 0$.
     Proof (sketch): For $B \in \{0,1\}^{d \times d}$ and any $k \geq 1$, $\mathrm{tr}(B^k)$ counts the number of $k$-cycles. A weighted count of all cycles is
     $\mathrm{tr}(\exp(B)) = \mathrm{tr}\big(I + \sum_{k \geq 1} \tfrac{1}{k!} B^k\big) = d + \sum_{k \geq 1} \tfrac{1}{k!} \mathrm{tr}(B^k)$.
     Trick to generalize to a weighted adjacency matrix $B$: the Hadamard product, where $(B \odot B)_{ij} = B_{ij}^2$. □
     Zheng et al. (2018): DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, volume 31.
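
The characterization $h(B) = \mathrm{tr}(\exp(B \odot B)) - d$ is easy to evaluate numerically. A small illustrative sketch (not the NOTEARS code):

```python
import numpy as np
from scipy.linalg import expm

def notears_h(B):
    """Return tr(exp(B o B)) - d; zero (up to numerical error) iff G(B) is a DAG."""
    d = B.shape[0]
    return np.trace(expm(B * B)) - d   # B * B is the Hadamard (element-wise) square

# Acyclic example (upper triangular) vs. a 2-cycle
B_dag   = np.array([[0.0, 0.7], [0.0, 0.0]])
B_cycle = np.array([[0.0, 0.7], [0.5, 0.0]])
print(notears_h(B_dag))    # ~0.0
print(notears_h(B_cycle))  # > 0
```
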
  10. A continuous optimization method.
     Theorem (Zheng et al., 2018): The graph of $B \in \mathbb{R}^{d \times d}$ is a DAG if and only if $h(B) := \mathrm{tr}(\exp(B \odot B)) - d = 0$.
     [Figures: landscape of $h(B)$ near $C = \begin{pmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{pmatrix}$, plotted in two different 2-D subspaces of $\mathbb{R}^{2 \times 2}$ spanned by $E_{12} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ and $E_{21} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}$, one around $C$ and one containing $0_{2 \times 2}$ and $C$]
  11. A continuous optimization method.
     NOTEARS: "Non-combinatorial Opt via Trace Exponential ..."
     $\min_{B \in \mathbb{R}^{d \times d}} S(B; \mathbf{X}) + \lambda \|B\|_{\ell_1}$ s.t. $B \in \mathrm{DAG}(d)$
     $\Leftrightarrow$ $\min_{B \in \mathbb{R}^{d \times d}} S(B; \mathbf{X}) + \lambda \|B\|_{\ell_1}$ s.t. $\mathrm{tr}(\exp(B \odot B)) - d = 0$.
     ▶ Continuous and differentiable.
     ▶ Augmented Lagrangian method, gradient descent, ...
     ▶ Cost: evaluating $B \mapsto \mathrm{tr}(\exp(B \odot B))$ and its gradient ⇝ $O(d^3)$.
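
For intuition only, here is a hedged sketch of the resulting optimization loop, assuming a least-squares score $S(B;\mathbf{X}) = \frac{1}{2n}\|\mathbf{X} - \mathbf{X}B\|_F^2$ and plain gradient steps on the inner subproblem; step sizes, schedules, and the final threshold are illustrative choices, not the authors' settings.

```python
import numpy as np
from scipy.linalg import expm

def notears_sketch(X, lam=0.1, rho=1.0, alpha=0.0,
                   n_outer=10, n_inner=200, lr=1e-3, threshold=0.3):
    """Augmented-Lagrangian sketch of min_B S(B;X) + lam*||B||_1 s.t. h(B) = 0."""
    n, d = X.shape
    B = np.zeros((d, d))
    for _ in range(n_outer):
        for _ in range(n_inner):
            E = expm(B * B)                         # used by h and its gradient
            h = np.trace(E) - d
            grad_h = 2.0 * E.T * B                  # gradient of tr(exp(B o B))
            grad_f = -X.T @ (X - X @ B) / n         # gradient of the least-squares score
            grad = grad_f + (rho * h + alpha) * grad_h + lam * np.sign(B)
            B = B - lr * grad
            np.fill_diagonal(B, 0.0)                # no self-loops
        alpha += rho * (np.trace(expm(B * B)) - d)  # dual ascent on the multiplier
        rho *= 2.0                                  # gradually strengthen the penalty
    B[np.abs(B) < threshold] = 0.0                  # prune small weights
    return B
```
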
  12. LoRAM: leveraging low-rankness and sparsity.
     Definition (Dong and Sebag, 2022): let $B_\Omega(X, Y)$ be the projection onto the index set $\Omega$ of the product matrix $XY^{\mathrm{T}}$:
     $(B_\Omega(X, Y))_{ij} = (XY^{\mathrm{T}})_{ij}$ if $(i, j) \in \Omega$, and $0$ otherwise,
     i.e., $A_\Omega \in \mathbb{R}^{d \times d}$ is obtained by masking $XY^{\mathrm{T}}$ with $\Omega$, where $X \in \mathbb{R}^{d \times r}$ and $Y^{\mathrm{T}} \in \mathbb{R}^{r \times d}$.
     LoRAM for the projection problem: given a weighted adjacency matrix $A$, solve
     $\min_{B \in \mathbb{R}^{d \times d}} \|A - B\|_F^2$ s.t. $B \in \mathrm{DAG}$ ⇝ $\min_{\Omega, X, Y} \|A - B\|_F^2$ s.t. $B = B_\Omega(X, Y)$, $\mathrm{tr}(\exp(\sigma(B))) - d = 0$.
     Remark: $\mathrm{tr}(\exp(\sigma(B)))$ works for $\sigma(B) := B \odot B$ or $|B|$ (element-wise absolute values).
     Dong and Sebag (2022): From graphs to DAGs: a low-complexity model and a scalable algorithm. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, pages 107–122, Springer-Verlag.
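
A minimal sketch of the LoRAM parameterization itself (dense for readability; an actual $O(|\Omega| r)$ implementation would only compute the entries indexed by $\Omega$). The function and variable names are illustrative.

```python
import numpy as np

def loram_matrix(X, Y, omega_mask):
    """B_Omega(X, Y) = P_Omega(X Y^T), with omega_mask a boolean (d, d) array for Omega."""
    return np.where(omega_mask, X @ Y.T, 0.0)

d, r = 6, 2
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((d, r)), rng.standard_normal((d, r))
omega = rng.random((d, d)) < 0.3        # a sparse candidate edge set Omega
B = loram_matrix(X, Y, omega)
print(np.count_nonzero(B), "nonzeros out of", d * d)
```
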
  13. Optimization of the LoRAM model.
     LoRAM for the projection problem: given a weighted adjacency matrix $A$, solve $\min_{\Omega, X, Y} \|A - B\|_F^2$ subject to $B = B_\Omega(X, Y)$, $\mathrm{tr}(\exp(B \odot B)) - d = 0$.
     Lagrangian method:
     1. Primal: update $(X, Y)$ by $\min_{X, Y} \|A - B_\Omega(X, Y)\|_F^2 + \mu \, h(B_\Omega(X, Y))$.
     2. Dual ascent on $\mu$.
     3. Update $\Omega$: hard thresholding of $B_\Omega(X, Y)$.
     Algorithm 1: Accelerated Gradient Descent for LoRAM.
     1: Make a gradient step $x_1 = x_0 - s_0 \nabla F(x_0)$ for an initial stepsize $s_0 > 0$.
     2: Initialize $y_0 = x_0$, $y_1 = x_1$, $t = 1$.
     3: while $\|\nabla F(x_t)\| > \epsilon$ do
     4:   Compute $\nabla F(y_t) = \nabla f(y_t) + \alpha \nabla h(y_t)$  # see next slide
     5:   Compute the Barzilai–Borwein stepsize $s_t = \frac{\|z_{t-1}\|^2}{\langle z_{t-1}, w_{t-1} \rangle}$, where $z_{t-1} = y_t - y_{t-1}$ and $w_{t-1} = \nabla F(y_t) - \nabla F(y_{t-1})$.
     6:   Update with Nesterov's acceleration: $x_{t+1} = y_t - s_t \nabla F(y_t)$ and $y_{t+1} = x_{t+1} + \frac{t}{t+3}(x_{t+1} - x_t)$.   (1)
     7:   $t = t + 1$
     8: end while
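
The sketch below re-implements the generic pattern of Algorithm 1 (a Barzilai-Borwein stepsize combined with Nesterov extrapolation) for an arbitrary smooth objective supplied through its gradient; the LoRAM-specific objective $F = f + \alpha h$ is not reproduced here, and all names and tolerances are illustrative.

```python
import numpy as np

def accelerated_bb_descent(grad_F, x0, s0=1e-2, tol=1e-6, max_iter=500):
    """Gradient descent with Barzilai-Borwein stepsizes and Nesterov acceleration."""
    x = x0 - s0 * grad_F(x0)                 # initial plain gradient step (line 1)
    y_prev, y = x0, x                        # y_0 = x_0, y_1 = x_1 (line 2)
    g_prev = grad_F(y_prev)
    for t in range(1, max_iter + 1):
        g = grad_F(y)
        if np.linalg.norm(g) <= tol:
            break
        z, w = y - y_prev, g - g_prev        # z_{t-1}, w_{t-1} (line 5)
        denom = float(np.dot(z.ravel(), w.ravel()))
        s = float(np.dot(z.ravel(), z.ravel())) / denom if abs(denom) > 1e-12 else s0
        x_next = y - s * g                   # gradient step at the extrapolated point
        y_next = x_next + (t / (t + 3)) * (x_next - x)   # Nesterov extrapolation (line 6)
        x, y_prev, y, g_prev = x_next, y, y_next, g
    return x

# Example on a simple quadratic F(x) = 0.5 * ||A x - b||^2
A = np.diag([1.0, 1.5, 2.0]); b = np.ones(3)
x_star = accelerated_bb_descent(lambda x: A.T @ (A @ x - b), np.zeros(3))
print(np.round(x_star, 3), "vs exact", np.round(np.linalg.solve(A, b), 3))
```
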
  14. LoRAM: gradient computation.
     $B_\Omega(X, Y) := P_\Omega(XY^{\mathrm{T}})$ with $X \in \mathbb{R}^{d \times r}$, $Y^{\mathrm{T}} \in \mathbb{R}^{r \times d}$.
     Proposition (gradient of $h$): The function $h(X, Y) = \mathrm{tr}(\exp(\sigma(B_\Omega(X, Y))))$, for $\sigma(A) = |A|$, satisfies
     $\partial_X h(X, Y) = \big( (\exp(\sigma(B)))^{\mathrm{T}} \odot \mathrm{sign}(B) \big)^{\mathrm{T}} Y$ and $\partial_Y h(X, Y) = \big( (\exp(\sigma(B)))^{\mathrm{T}} \odot \mathrm{sign}(B) \big) X$, where $B := P_\Omega(XY^{\mathrm{T}})$.
     ▶ Essentially: $(A, C, \xi) \mapsto (\exp(A) \odot C)\xi$ for $A, C \in \mathbb{R}^{d \times d}$ and $\xi \in \mathbb{R}^{d \times r}$.
     ▶ Due to the Hadamard product $\exp(A) \odot C$, state-of-the-art numerical methods (Al-Mohy and Higham, 2011) for "$(A, \xi) \mapsto \exp(A)\xi$" are not applicable.
  15. LoRAM: gradient computation.
     Algorithm 2: approximation of $(A, C, \xi) \mapsto (\exp(A) \odot C)\xi$ for computing $\tilde{\nabla} h(X, Y)$.
     1: $F \leftarrow (I \odot C)\xi$
     2: for $k = 0, \dots, m_*$ do
     3:   $\xi \leftarrow \frac{1}{k+1} (A \odot C)\xi$
     4:   $F \leftarrow F + \xi$
     5:   $\xi \leftarrow F$
     6: end for
     7: Return $F$
     ▶ Choice of $m_*$: adapted from (Al-Mohy and Higham, 2011) for $(A, \xi) \mapsto \exp(A)\xi$.
     ▶ Cost: $O(|\Omega| r)$, where $|\Omega| = \rho d^2 \leq d^2$ and $r \ll d$.
     |              | Search space                                              | Memory req. | Complexity |
     | NOTEARS      | $\mathbb{R}^{d \times d}$                                 | $O(d^2)$    | $O(d^3)$   |
     | LoRAM (ours) | $\mathbb{R}^{d \times r} \times \mathbb{R}^{d \times r}$  | $O(dr)$     | $O(d^2 r)$ |
     ▶ Inexact, because $((A \odot C)^k)\xi \neq (A^k \odot C)\xi$.
  16. LoRAM: gradient computation (continued).
     Same Algorithm 2, cost analysis, and comparison table as slide 15.
     ▶ How good is this approximation? Observation: accurate enough when $\|A\|$ is bounded (while $\|C\|_\infty = \|\mathrm{sign}(B)\|_\infty = 1$).
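
A hedged reconstruction of the idea behind Algorithm 2: accumulate Taylor-like terms built from products with $A \odot C$ instead of ever forming $\exp(A)$, which is inexact precisely because $((A \odot C)^k)\xi \neq (A^k \odot C)\xi$. The fixed `m_star` below replaces the adaptive choice of Al-Mohy and Higham (2011), and the comparison against the dense computation is only for illustration.

```python
import numpy as np
from scipy.linalg import expm

def approx_exp_hadamard_action(A, C, xi, m_star=20):
    """Approximate (exp(A) o C) xi using powers of A o C (o = Hadamard product)."""
    M = A * C                                   # A o C
    F = (np.eye(A.shape[0]) * C) @ xi           # zeroth-order term (I o C) xi
    term = xi.copy()
    for k in range(1, m_star + 1):
        term = (M @ term) / k                   # term = (1/k!) (A o C)^k xi
        F = F + term
    return F

# Accuracy check against the dense computation (exp(A) o C) xi
rng = np.random.default_rng(1)
d, r = 30, 4
A = 0.01 * rng.standard_normal((d, d))
C = np.sign(rng.standard_normal((d, d)))
xi = rng.standard_normal((d, r))
exact = (expm(A) * C) @ xi
approx = approx_exp_hadamard_action(A, C, xi)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))  # error shrinks with ||A||
```
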
  17. LoRAM: gradient computation.
     We observe how $\langle \nabla h(X, Y), \tilde{\nabla} h(X, Y) \rangle$ evolves with $\|(X, Y)\|$:
     ▶ measured at $(cX, cY)$ for various $c \in (0, 1)$, at a non-stationary point $(X, Y)$;
     ▶ for various sparsity levels $\rho$ of $\Omega = \mathrm{Supp}(A)$ (the initial edge set).
     [Figures: relative error and cosine similarity between $\nabla h$ and $\tilde{\nabla} h$ as functions of the Frobenius norm of the scaled LoRAM matrix, for $\rho \in \{10^{-3}, 5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}\}$]
  18. LoRAM: gradient computation (continued).
     Same setting and figures as slide 17.
     Observation: the direction of $\tilde{\nabla} h$ relative to $\nabla h$ is preserved when $\|A\|$ is bounded by 1.
  19. Experiments.
     Noise: reverted cause-effect pairs $E = \sigma_E B^{\star\mathrm{T}}$, with $\sigma_E = 0.4\,\|B^\star\|_\infty$.
     ▶ TPR / FDR: true positive rate / false discovery rate.
     ▶ SHD (structural Hamming distance): total number of edge additions, deletions, and reversals needed.
     [Figure: runtime (seconds) vs. number of nodes $d$ for LoRAM (ours) and NOTEARS]
     Table: comparison with NOTEARS (Zheng et al., 2018) on the projection problem; each cell reads LoRAM (ours) / NOTEARS.
     | d    | (α, r)    | Runtime (sec)  | TPR         | FDR           | SHD         |
     | 100  | (5.0, 40) | 1.82 / 0.67    | 1.00 / 1.00 | 0.00e+0 / 0.0 | 0.0 / 0.0   |
     | 200  | (5.0, 40) | 2.20 / 3.64    | 0.98 / 0.95 | 2.50e-2 / 0.0 | 1.0 / 2.0   |
     | 400  | (5.0, 40) | 2.74 / 16.96   | 0.98 / 0.98 | 2.50e-2 / 0.0 | 4.0 / 4.0   |
     | 600  | (5.0, 40) | 3.40 / 42.65   | 0.98 / 0.96 | 1.67e-2 / 0.0 | 6.0 / 16.0  |
     | 800  | (5.0, 40) | 4.23 / 83.68   | 0.99 / 0.97 | 7.81e-3 / 0.0 | 5.0 / 22.0  |
     | 1000 | (2.0, 80) | 7.63 / 136.94  | 1.00 / 0.96 | 0.00e+0 / 0.0 | 0.0 / 36.0  |
     | 1500 | (2.0, 80) | 13.34 / 437.35 | 1.00 / 0.96 | 8.88e-4 / 0.0 | 2.0 / 94.0  |
     | 2000 | (2.0, 80) | 20.32 / 906.94 | 1.00 / 0.96 | 7.49e-4 / 0.0 | 3.0 / 148.0 |
  20. Remarks about LoRAM.
     ✓ Significant improvement over NOTEARS (Zheng et al., 2018) for the projection problem.
     ✗ Inherent weakness of NOTEARS for causal structure learning:
     ▶ NOTEARS works fine when the noise variances are equal.
     ▶ With non-equal noise variances, score-based optimization can induce biases leading to an erroneous causal ordering: the varsortability bias (Reisach et al., 2021).
     Perspectives:
     ▶ Application of LoRAM to causal structure learning.
     ▶ Serving as a general-purpose tool for learning DAGs from directed or mixed graphs.
  21. Divide-and-conquer in three phases.
     DCILP (Dong et al., 2024): distributed causal discovery using ILP.
     1. Phase 1: divide $X = (X_1, \dots, X_d)$ into different subsets $S_1, S_2, \dots$
     2. Phase 2: learn a subgraph from the data restricted to each $S_i$, separately.
     3. Phase 3: aggregate the subgraphs through an ILP formulation.
     [Figure: panels (a)–(d) illustrating the pipeline, from the data to the local graphs $B^{(i)}$ and the merged graph $B$]
     How it differs from related work:
     ▶ Phases 1–2: parallelized vs. sequential.
     ▶ Phase 3: ILP-based vs. simple rule-based.
     Dong et al. (2024): DCDILP: a distributed learning method for large-scale causal structure learning. arXiv preprint arXiv:2406.10481, 2024.
  22. DCILP (Phase 1): divide by Markov blankets.
     Definition (e.g., Peters et al. (2017)): The Markov blanket $\mathrm{MB}(X_i)$ of a variable $X_i$ is the smallest set $M \subset X$ such that $X_i \perp\!\!\!\perp X \setminus (M \cup \{X_i\})$ given $M$.
     Theorem (Loh and Bühlmann, 2014): Under a faithfulness assumption, the Markov blankets can be identified via the support of the precision matrix: $\mathcal{M}(B) = \mathrm{Supp}((\mathrm{cov}(X))^{-1})$.
     [Figure: a 10-node DAG $B$ and the graph $\mathcal{M}(B)$ given by the precision-matrix support]
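
A minimal sketch of this identification step (assuming Gaussian data and the stated faithfulness condition): estimate the precision matrix and read off, for each variable, the indices with non-negligible entries. The threshold and the function name are illustrative; the experiments use an empirical precision-matrix estimator (slide 28).

```python
import numpy as np

def markov_blankets(X, threshold=0.05):
    """Return a list MB[i] of indices j with |precision[i, j]| > threshold."""
    precision = np.linalg.inv(np.cov(X, rowvar=False))        # empirical precision matrix
    d = precision.shape[0]
    return [
        [j for j in range(d) if j != i and abs(precision[i, j]) > threshold]
        for i in range(d)
    ]

# Example: chain X0 -> X1 -> X2; expect MB(X0) = {X1}, MB(X1) = {X0, X2}, MB(X2) = {X1}
rng = np.random.default_rng(0)
n = 5000
x0 = rng.standard_normal(n)
x1 = 1.2 * x0 + rng.standard_normal(n)
x2 = -0.8 * x1 + rng.standard_normal(n)
print(markov_blankets(np.column_stack([x0, x1, x2])))
```
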
  23. DCILP: algorithm.
     Algorithm 2 (DCILP): distributed causal discovery using ILP.
     1: (Phase 1) Divide: estimate the Markov blanket $\mathrm{MB}(X_i)$ for $i = 1, \dots, d$.
     2: (Phase 2) for $i = 1, \dots, d$ do in parallel:
     3:   $A^{(i)} \leftarrow$ causal discovery on $S_i := \mathrm{MB}(X_i) \cup \{X_i\}$ using GES or DAGMA (Bello et al., 2022).
     4:   $B^{(i)}_{j,k} \leftarrow A^{(i)}_{j,k}$ if $j = i$ or $k = i$, and $0$ otherwise.
     5: (Phase 3) Conquer: $B \leftarrow$ reconciliation of $\{B^{(i)}, i = 1, \dots, d\}$ through the ILP.
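
A structural sketch of Phase 2 (not the released DCDILP code): each subproblem on $S_i = \mathrm{MB}(X_i) \cup \{X_i\}$ is independent, so the loop is embarrassingly parallel. `local_solver` stands for any off-the-shelf routine (GES, DAGMA, ...) returning a local adjacency matrix; a thread pool is used here for simplicity, whereas the experiments parallelize over CPU cores.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def phase2(X, MB, local_solver, max_workers=4):
    """Run local causal discovery on each S_i = MB(X_i) u {X_i}; keep edges incident to X_i."""
    d = X.shape[1]

    def solve_one(i):
        S = sorted(set(MB[i]) | {i})           # S_i = MB(X_i) u {X_i}
        A = local_solver(X[:, S])              # local causal graph on the variables in S_i
        B_i = np.zeros((d, d))
        for a, ja in enumerate(S):             # line 4 of Algorithm 2: keep only edges
            for b, jb in enumerate(S):         # whose head or tail is X_i
                if (ja == i or jb == i) and A[a, b] != 0:
                    B_i[ja, jb] = A[a, b]
        return B_i

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_one, range(d)))
```
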
  24. Phase 3: causal structure by binary variables.
     $B_{ij} = 1$ if $X_i \to X_j$.
     $V_{ijk} = V_{jik} = 1$ if there is a v-structure $X_i \to X_k \leftarrow X_j$.
     $S_{ij} = S_{ji} = 1$ if $X_i$ and $X_j$ are spouses, i.e., $\exists k$, $V_{ijk} = 1$.
     [Figure: the ground-truth DAG $B^\star$ and its Markov-blanket graph $\mathcal{M}(B^\star)$]
  25. Phase 3: ILP constraints.
     For all $i, j, k$ such that $i \neq j$, $j \neq k$, $k \neq i$ (with $S_i := \mathrm{MB}(X_i) \cup \{X_i\}$):
     $B_{ij} = 0$, $S_{ij} = S_{ji} = 0$ if $X_i \notin \mathrm{MB}(X_j)$   (2)
     $B_{ij} + B_{ji} + S_{ij} \geq 1$ if $X_i \in \mathrm{MB}(X_j)$   (3)
     $B_{ij} + B_{ji} \leq 1$ if $X_i \in \mathrm{MB}(X_j)$   (4)
     $V_{ijk} \leq B_{ik}$, $V_{ijk} \leq B_{jk}$ if $\{i, j, k\} \subset (S_i \cap S_j \cap S_k)$   (5)
     $B_{ik} + B_{jk} \leq 1 + V_{ijk}$ if $\{i, j, k\} \subset (S_i \cap S_j \cap S_k)$   (6)
     $V_{ijk} \leq S_{ij}$ if $\{i, j, k\} \subset (S_i \cap S_j \cap S_k)$   (7)
     $S_{ij} \leq \sum_k V_{ijk}$ if $\{i, j, k\} \subset (S_i \cap S_j \cap S_k)$   (8)
     [Figure: the ground-truth DAG $B^\star$ and its Markov-blanket graph $\mathcal{M}(B^\star)$]
  26. Phase 3: the ILP for merging all local solutions.
     With $S_i := \mathrm{MB}(X_i) \cup \{X_i\}$ and $\tilde{B} = \sum_{i=1}^{d} B^{(i)}$ (the naive merge) from Phase 2:
     $\max_{B, S, V} \; \langle \tilde{B}, B \rangle$   (9)
     subject to
     $B_{ij} = 0$, $S_{ij} = S_{ji} = 0$ if $X_i \notin \mathrm{MB}(X_j)$   (2)
     $B_{ij} + B_{ji} + S_{ij} \geq 1$ if $X_i \in \mathrm{MB}(X_j)$   (3)
     $B_{ij} + B_{ji} \leq 1$ if $X_i \in \mathrm{MB}(X_j)$   (4)
     $V_{ijk} \leq B_{ik}$, $V_{ijk} \leq B_{jk}$ if $\tilde{B}_{ik} \neq 0$ and $\tilde{B}_{jk} \neq 0$   (5)
     $B_{ik} + B_{jk} \leq 1 + V_{ijk}$ if $\tilde{B}_{ik} \neq 0$ and $\tilde{B}_{jk} \neq 0$   (6)
     $V_{ijk} \leq S_{ij}$ if $\tilde{B}_{ik} \neq 0$ and $\tilde{B}_{jk} \neq 0$   (7)
     $S_{ij} \leq \sum_k V_{ijk}$ if $\tilde{B}_{ik} \neq 0$ and $\tilde{B}_{jk} \neq 0$   (8)
     [Figure: the ground-truth DAG $B^\star$ and its Markov-blanket graph $\mathcal{M}(B^\star)$]
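
Slide 28 mentions a Gurobi-based implementation of this step; below is a hedged, simplified reconstruction in gurobipy that encodes only the objective (9) and constraints (2)-(4) (the v-structure variables $V_{ijk}$ and constraints (5)-(8) would be added analogously). The Markov blankets `MB` are assumed symmetric, and a Gurobi license is required to run it.

```python
import itertools
import gurobipy as gp
from gurobipy import GRB

def phase3_ilp_core(B_tilde, MB):
    """B_tilde: (d, d) naive-merge weights; MB: list of Markov-blanket index lists."""
    d = len(MB)
    m = gp.Model("dcilp_phase3")
    pairs = list(itertools.permutations(range(d), 2))
    B = m.addVars(pairs, vtype=GRB.BINARY, name="B")   # B[i, j] = 1 iff X_i -> X_j
    S = m.addVars(pairs, vtype=GRB.BINARY, name="S")   # S[i, j] = 1 iff X_i and X_j are spouses
    for i, j in itertools.combinations(range(d), 2):
        if j not in MB[i]:                             # (2): nothing outside the Markov blankets
            m.addConstr(B[i, j] == 0); m.addConstr(B[j, i] == 0)
            m.addConstr(S[i, j] == 0); m.addConstr(S[j, i] == 0)
        else:
            m.addConstr(S[i, j] == S[j, i])            # spouse indicator is symmetric
            m.addConstr(B[i, j] + B[j, i] + S[i, j] >= 1)   # (3): explain MB membership
            m.addConstr(B[i, j] + B[j, i] <= 1)             # (4): no 2-cycles
    # (9): agree as much as possible with the naive merge of the local graphs
    m.setObjective(gp.quicksum(B_tilde[i, j] * B[i, j] for i, j in pairs), GRB.MAXIMIZE)
    m.optimize()
    return {(i, j) for i, j in pairs if B[i, j].X > 0.5}
```
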
  27. Phase 3: the ILP for merging all local solutions (continued).
     Same ILP (9) with constraints (2)–(8) as slide 26.
     [Figure: the naive merge $\tilde{B}$ and reconciled graphs $B$, $B'$ on variables $X, Y, Z, U, V$]
  28. DCILP: experiments.
     Algorithm 2 (DCILP), as on slide 23.
     ▶ Phase 1: empirical precision-matrix estimator.
     ▶ Phase 2: parallelized on $\min(2d, 400)$ CPU cores, running on Ruche (Mesocentre Paris-Saclay).
     ▶ Phase 3: implemented with Gurobi tools.
  29. DCILP experiments: ILP versus the naive merge.
     [Figure: adjacency submatrices restricted to a node subset S: the ground truth, the naive merge (TPR 0.96, FDR 0.35), and the ILP solution (TPR 0.97, FDR 0.27); plus TPR, FDR, and SHD vs. the number of variables $d$ for the naive merge $\tilde{B}$ and the ILP solution (DCILP-dagma)]
     Figure caption: Comparing with the naive merge $\tilde{B}$: DCILP on SF3 data.
  30. DCILP experiments: ILP versus the naive merge (continued).
     [Figure: another example of adjacency submatrices: the naive merge (TPR 1.00, FDR 0.10) and the ILP solution (TPR 1.00, FDR 0.00); plus TPR, FDR, and SHD vs. the number of variables $d$]
     Figure caption: Comparing with the naive merge $\tilde{B}$: DCILP on SF3 data.
  31. DCILP: experiments.
     [Figure: TPR, FDR, and runtime (seconds) vs. the number of variables $d$ for DAGMA, DCILP-dagma (MB*), and DCILP-dagma]
     Figure caption: Comparison with DAGMA (Bello et al., 2022) on ER2 data.
  32. DCILP: experiments.
     [Figure: TPR, FDR, and SHD vs. the number of variables $d$ for GES (pcalg), DCILP-ges (MB*), and DCILP-ges]
     Figure caption: Comparison with GES on ER2 data.
  33. DCILP experiments: MUNIN network.
     ▶ A DAG with $d = 1041$ nodes (https://www.bnlearn.com/bnrepository/).
     ▶ Medical expert-system model based on electromyographs (EMG).
     [Figure: TPR, SHD, and runtime (seconds) for DAGMA, GES (pcalg), DCILP-dagma, and DCILP-ges]
     Figure caption: Results on the MUNIN network data.
  34. Conclusion.
     ▶ LoRAM: an efficient and scalable continuous optimization method for finding proximal DAGs.
     ▶ DCILP: a distributed approach for learning large-scale causal structures from data.
     ▶ Significant improvements in scalability for learning sparse, large causal graphs.
     Perspectives:
     ▶ Extend applicability: nonlinear models; robustness to changes of scale in the measurements/observations.
     ▶ Adapt to the learning of denser causal graphs.
     ▶ Causal modeling with latent variables.
  35. References.
     Al-Mohy, A. H. and Higham, N. J. (2011). Computing the action of the matrix exponential, with an application to exponential integrators. SIAM Journal on Scientific Computing, 33(2):488–511.
     Bello, K., Aragam, B., and Ravikumar, P. (2022). DAGMA: Learning DAGs via M-matrices and a log-determinant acyclicity characterization. Advances in Neural Information Processing Systems, 35:8226–8239.
     Chickering, D. M. (1996). Learning Bayesian networks is NP-complete. Learning from Data: Artificial Intelligence and Statistics V, pages 121–130.
     Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554.
     Dong, S. and Sebag, M. (2022). From graphs to DAGs: a low-complexity model and a scalable algorithm. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 107–122. Springer.
     Dong, S., Sebag, M., Uemura, K., Fujii, A., Chang, S., Koyanagi, Y., and Maruhashi, K. (2024). DCDILP: a distributed learning method for large-scale causal structure learning. arXiv preprint arXiv:2406.10481.
     Ghassami, A., Yang, A., Kiyavash, N., and Zhang, K. (2020). Characterizing distribution equivalence and structure learning for cyclic and acyclic directed graphs. In International Conference on Machine Learning, pages 3494–3504. PMLR.
  36. References (continued).
     Kalainathan, D., Goudet, O., Guyon, I., Lopez-Paz, D., and Sebag, M. (2022). Structural agnostic modeling: Adversarial learning of causal graphs. Journal of Machine Learning Research, 23(219):1–62.
     Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3).
     Loh, P.-L. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15(1):3065–3105.
     Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.
     Reisach, A. G., Seiler, C., and Weichwald, S. (2021). Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pages 27772–27784.
     Sauer, A. and Geiger, A. (2021). Counterfactual generative networks. In International Conference on Learning Representations (ICLR).
     Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, Prediction, and Search.
  37. References (continued).
     Yuan, Y. and Bar-Joseph, Z. (2019). Deep learning for inferring gene relationships from single-cell expression data. Proceedings of the National Academy of Sciences, 116(52):27151–27158.
     Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 31.