Slide 1

Slide 1 text

Large-scale causal structure learning: challenges and new methods
Séminaire Scube, L2S, November 29, 2024
Shuyu Dong¹
Joint work with Michèle Sebag¹, Kento Uemura², Akito Fujii², Shuang Chang², Yusuke Koyanagi², and Koji Maruhashi²
¹ INRIA TAU team, LISN, Université Paris-Saclay
² Fujitsu Laboratories

Slide 2

Slide 2 text

Outline: 1. Introduction 2. Continuous optimization approach 3. A distributed learning method 4. Conclusion and perspectives

Slide 3

Slide 3 text

Outline
Introduction
Continuous optimization approaches
A distributed approach for causal structure learning
Conclusion and perspectives

Slide 4

Slide 4 text

Causality in a few examples
Gene relationships (Yuan and Bar-Joseph, 2019)

Slide 5

Slide 5 text

Causality in a few examples
Gene relationships (Yuan and Bar-Joseph, 2019)
Image classifiers (Sauer and Geiger, 2021)

Slide 6

Slide 6 text

Structural Equation Models (SEMs)
Definition: X_i = Σ_{j=1}^d B_{ji} X_j + N_i for all i = 1, . . . , d.
B: weighted adjacency matrix of a directed acyclic graph (DAG) G
N_i: noise variable, N_i ⊥⊥ X_j for all j ∈ Pa_B(i) := {k ∈ [d] : B_{ki} ≠ 0}
[Figure: example graphs on ten nodes (Kalainathan et al., 2022)]
Remarks:
▶ Generalization: X_i = Σ_{j ∈ Pa_B(i)} f_j(X_j) + N_i (Additive Noise Model, ANM)
▶ No latent variables: all variables in X = (X_1, . . . , X_d) are observed.
Goal: Given n observations X ∈ R^{n×d} of X, learn the adjacency matrix B or the graph G(B) of B.
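
As a concrete illustration of the linear SEM above, the following sketch simulates n observations from a small hand-specified DAG; the graph, weights and noise scale are made up for the example and are not from the talk.

```python
import numpy as np

# Minimal sketch: sample n observations from the linear SEM X_i = sum_j B_ji X_j + N_i.
# The DAG below (X0 -> X1 -> X2, X0 -> X2) and its weights are illustrative only.
d, n = 3, 1000
B = np.zeros((d, d))
B[0, 1], B[1, 2], B[0, 2] = 0.8, -1.2, 0.5   # B[j, i] is the weight of edge X_j -> X_i

rng = np.random.default_rng(0)
N = rng.normal(scale=1.0, size=(n, d))        # independent noise variables N_i

# With B strictly upper triangular (nodes in topological order), X = N (I - B)^{-1}.
X = N @ np.linalg.inv(np.eye(d) - B)
```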

Slide 7

Slide 7 text

Structural Equation Models (SEMs)
Definition: X_i = Σ_{j=1}^d B_{ji} X_j + N_i for all i = 1, . . . , d.
B: weighted adjacency matrix of a directed acyclic graph (DAG) G
N_i: noise variable, N_i ⊥⊥ X_j for all j ∈ Pa_B(i) := {k ∈ [d] : B_{ki} ≠ 0}

Graphical model | (0, 1)-adjacency | Weighted adjacency
X1 → X2 | [[0, 1], [0, 0]] | [[0, w], [0, 0]]
X2 → X1 | [[0, 0], [1, 0]] | [[0, 0], [w, 0]]
X1 ⊥⊥ X2 | [[0, 0], [0, 0]] | [[0, 0], [0, 0]]

Remarks:
▶ Generalization: X_i = Σ_{j ∈ Pa_B(i)} f_j(X_j) + N_i (Additive Noise Model, ANM)
▶ No latent variables: all variables in X = (X_1, . . . , X_d) are observed.
Goal: Given n observations X ∈ R^{n×d} of X, learn the adjacency matrix B or the graph G(B) of B.

Slide 8

Slide 8 text

Causal structure learning
Definition: X_i = Σ_{j=1}^d B_{ji} X_j + N_i for all i = 1, . . . , d.
B: weighted adjacency matrix of a directed acyclic graph (DAG) G
N_i: noise variable, N_i ⊥⊥ X_j for all j ∈ Pa_B(i) := {k ∈ [d] : B_{ki} ≠ 0}
PC algorithm (Spirtes et al., 2000; Kalisch and Bühlmann, 2007):
[Figure: illustration of the PC algorithm, reproduced from Peters et al. (2017), Figure 7.1]

Slide 9

Slide 9 text

Causal structure learning
Definition: X_i = Σ_{j=1}^d B_{ji} X_j + N_i for all i = 1, . . . , d.
B: weighted adjacency matrix of a directed acyclic graph (DAG) G
N_i: noise variable, N_i ⊥⊥ X_j for all j ∈ Pa_B(i) := {k ∈ [d] : B_{ki} ≠ 0}
GES algorithm (Chickering, 2002): greedy search to maximize the Bayesian information criterion (BIC)
S(G; X) = log p(X | G, θ) − (d/2) log(n).
[Figure: a CPDAG and two DAGs G0, G′ over variables X, Y, Z, U, V]

Slide 10

Slide 10 text

Causal structure learning
Definition: X_i = Σ_{j=1}^d B_{ji} X_j + N_i for all i = 1, . . . , d.
B: weighted adjacency matrix of a directed acyclic graph (DAG) G
N_i: noise variable, N_i ⊥⊥ X_j for all j ∈ Pa_B(i) := {k ∈ [d] : B_{ki} ≠ 0}
GES algorithm (Chickering, 2002): greedy search to maximize the Bayesian information criterion (BIC)
S(G; X) = log p(X | G, θ) − (d/2) log(n).
[Figure: a CPDAG and two DAGs G0, G′ over variables X, Y, Z, U, V]
Proposition (e.g., Peters et al. (2017)): Two DAGs are Markov equivalent if and only if (i) they have the same skeleton, and (ii) they have the same v-structures.
Theorem (Chickering, 2002): Assume that the distribution of X is faithful to the DAG G. Then the CPDAG returned by GES equals the equivalence class of G in the large-sample limit.
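
The proposition above translates directly into a small check; the sketch below compares two binary adjacency matrices (assumed to be DAGs) by skeleton and v-structures. Function names are illustrative.

```python
import numpy as np

def skeleton(B):
    """Undirected skeleton of a binary adjacency matrix B (B[i, j] = 1 iff i -> j)."""
    return ((B + B.T) > 0).astype(int)

def v_structures(B):
    """Set of v-structures (i, j, k) with i -> k <- j, i < j, and i, j non-adjacent."""
    d = B.shape[0]
    skel = skeleton(B)
    return {(i, j, k) for k in range(d) for i in range(d) for j in range(i + 1, d)
            if B[i, k] and B[j, k] and not skel[i, j]}

def markov_equivalent(B1, B2):
    """Markov equivalence test via identical skeletons and identical v-structures."""
    return np.array_equal(skeleton(B1), skeleton(B2)) and v_structures(B1) == v_structures(B2)
```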

Slide 11

Slide 11 text

Challenges in causal structure learning
▶ NP-hardness (Chickering, 1996)
▶ Identifiability: Markov equivalence at best in most cases (Ghassami et al., 2020)
A large search space: the size of DAG(d) := {B ∈ {0, 1}^{d×d} : G(B) is a DAG} grows as |DAG(d)| ≈ d! · 2^{d²/2}.
Notation:
▶ The set of weighted adjacency matrices of DAGs: DAG(d) := {B ∈ R^{d×d} : G(B) is a DAG}.
▶ Support Supp(B) indicates the edge set of G(B).
(Peters et al., 2017)
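
To make the search-space growth concrete, the exact number of labeled DAGs can be computed with the standard recurrence a_n = Σ_{k=1}^n (−1)^{k+1} C(n, k) 2^{k(n−k)} a_{n−k}, a_0 = 1 (Robinson's recurrence; a standard result, not from the slides):

```python
from math import comb

def num_dags(d):
    """Number of labeled DAGs on d nodes, via Robinson's recurrence."""
    a = [1]  # a[0] = 1
    for n in range(1, d + 1):
        a.append(sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * a[n - k]
                     for k in range(1, n + 1)))
    return a[d]

print([num_dags(i) for i in range(1, 6)])   # [1, 3, 25, 543, 29281]; already > 4e18 for d = 10
```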

Slide 12

Slide 12 text

Outline
Introduction
Continuous optimization approaches
A distributed approach for causal structure learning
Conclusion and perspectives

Slide 13

Slide 13 text

A continuous optimization method
Theorem (Zheng et al., 2018): The graph of B ∈ R^{d×d} is a DAG if and only if h(B) := tr(exp(B ⊙ B)) − d = 0.
Proof (sketch): For B ∈ {0, 1}^{d×d} and any k ≥ 1, tr(B^k) = number of length-k closed walks (k-cycles).
Total count over all cycle lengths: tr(exp(B)) = tr(I + Σ_{k≥1} (1/k!) B^k) = d + Σ_{k≥1} (1/k!) tr(B^k).
Trick to generalize to a weighted adjacency matrix B: the Hadamard product, where (B ⊙ B)_{ij} = B_{ij}². □
Zheng et al. (2018): DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, volume 31.
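
A minimal sketch of the acyclicity function and its gradient, using the standard identity ∇h(B) = exp(B ⊙ B)ᵀ ⊙ 2B from Zheng et al. (2018); the test matrices are illustrative.

```python
import numpy as np
from scipy.linalg import expm

def notears_h(B):
    """h(B) = tr(exp(B ⊙ B)) - d: zero iff the graph of B is a DAG (Zheng et al., 2018)."""
    E = expm(B * B)                 # matrix exponential of the Hadamard square
    h = np.trace(E) - B.shape[0]
    grad = E.T * 2 * B              # gradient of h with respect to B
    return h, grad

# A 2-cycle gives h > 0; a strictly triangular (acyclic) matrix gives h = 0.
C = np.array([[0.0, 0.5], [0.5, 0.0]])
T = np.array([[0.0, 0.5], [0.0, 0.0]])
print(notears_h(C)[0], notears_h(T)[0])
```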

Slide 14

Slide 14 text

A continuous optimization method
Theorem (Zheng et al., 2018): The graph of B ∈ R^{d×d} is a DAG if and only if h(B) := tr(exp(B ⊙ B)) − d = 0.
Landscape of h(B) near C = [[0, 1/2], [1/2, 0]], in two different subspaces of R^{2×2}, with coordinates E12 = [[0, 1], [0, 0]] and E21 = [[0, 0], [1, 0]].
[Figure: two surface plots of h over these subspaces, one around C and one around 0_{2×2}]

Slide 15

Slide 15 text

A continuous optimization method
Theorem (Zheng et al., 2018): The graph of B ∈ R^{d×d} is a DAG if and only if h(B) := tr(exp(B ⊙ B)) − d = 0.
NOTEARS: “Non-combinatorial Opt via Trace Exponential . . . ”
min_{B ∈ R^{d×d}} S(B; X) + λ‖B‖_{ℓ1} s.t. B ∈ DAG(d)
⇔ min_{B ∈ R^{d×d}} S(B; X) + λ‖B‖_{ℓ1} s.t. tr(exp(B ⊙ B)) − d = 0
▶ Continuous and differentiable
▶ Augmented Lagrangian method, gradient descent, . . .
▶ Cost: function evaluation of B ↦ tr(exp(B ⊙ B)) and its gradient ⇝ O(d³)

Slide 16

Slide 16 text

LoRAM: leveraging low-rankness and sparsity
Definition (Dong and Sebag, 2022): let B_Ω(X, Y) be the projection onto the index set Ω of the product matrix XYᵀ:
(B_Ω(X, Y))_{ij} = (XYᵀ)_{ij} if (i, j) ∈ Ω, and 0 otherwise.
A_Ω ∈ R^{d×d} ←Ω← XYᵀ, with X ∈ R^{d×r} and Yᵀ ∈ R^{r×d}
LoRAM for the projection problem: given a weighted adjacency matrix A, solve
min_{B ∈ R^{d×d}} ‖A − B‖²_F subject to B ∈ DAG
⇝ min_{Ω, X, Y} ‖A − B‖²_F subject to B = B_Ω(X, Y), tr(exp(σ(B))) − d = 0
Remark: tr(exp(σ(B))) works for σ(B) := B ⊙ B or |B| (element-wise absolute values).
Dong and Sebag (2022): SD and Michèle Sebag. From graphs to DAGs: a low-complexity model and a scalable algorithm. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML-PKDD 2022, pages 107–122, 2022, Springer-Verlag.
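
A small sketch of the LoRAM parametrization: the sparse projection B_Ω(X, Y) = P_Ω(XYᵀ) and the acyclicity term tr(exp(σ(B))) − d with σ = |·|. Variable names and the random toy inputs are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def loram_B(X, Y, omega_mask):
    """B_Ω(X, Y): keep the entries of X @ Y.T only on the support Ω (boolean d×d mask)."""
    return np.where(omega_mask, X @ Y.T, 0.0)

def loram_h(X, Y, omega_mask):
    """Acyclicity term tr(exp(|B_Ω(X, Y)|)) - d, i.e. sigma taken as the element-wise absolute value."""
    B = loram_B(X, Y, omega_mask)
    return np.trace(expm(np.abs(B))) - B.shape[0]

# Toy usage with random low-rank factors and a random candidate edge set Ω.
d, r = 20, 3
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(d, r)), rng.normal(size=(d, r))
omega = rng.random((d, d)) < 0.05
np.fill_diagonal(omega, False)
print(loram_h(X, Y, omega))
```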

Slide 17

Slide 17 text

Optimization of the LoRAM model
LoRAM for the projection problem: given a weighted adjacency matrix A, solve
min_{Ω, X, Y} ‖A − B‖²_F subject to B = B_Ω(X, Y), tr(exp(B ⊙ B)) − d = 0.
Lagrangian method:
1. Primal: update (X, Y) by min_{X,Y} ‖A − B_Ω(X, Y)‖²_F + µ h(B_Ω(X, Y))
2. Dual ascent on µ
3. Update Ω: hard thresholding of B_Ω(X, Y)

Algorithm 1 Accelerated Gradient Descent of LoRAM
1: Take a gradient step: x_1 = x_0 − s_0 ∇F(x_0) for an initial stepsize s_0 > 0
2: Initialize: y_0 = x_0, y_1 = x_1, t = 1.
3: while ‖∇F(x_t)‖ > ϵ do
4:   Compute ∇F(y_t) = ∇f(y_t) + α ∇h(y_t)   # see next slide
5:   Compute the Barzilai–Borwein stepsize: s_t = ‖z_{t−1}‖² / ⟨z_{t−1}, w_{t−1}⟩, where z_{t−1} = y_t − y_{t−1} and w_{t−1} = ∇F(y_t) − ∇F(y_{t−1}).
6:   Update with Nesterov’s acceleration:
       x_{t+1} = y_t − s_t ∇F(y_t),                      (1)
       y_{t+1} = x_{t+1} + (t / (t + 3)) (x_{t+1} − x_t).
7:   t = t + 1
8: end while
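
A generic sketch of Algorithm 1 (Barzilai–Borwein stepsize plus Nesterov acceleration) for an arbitrary smooth objective F, assuming only access to its gradient; the quadratic test function at the end is illustrative, and the stopping test is simplified to the gradient at the extrapolated point.

```python
import numpy as np

def accelerated_bb_descent(grad_F, x0, s0=1e-2, eps=1e-6, max_iter=500):
    """Gradient descent with Barzilai-Borwein stepsizes and Nesterov acceleration (sketch of Algorithm 1)."""
    x = x0 - s0 * grad_F(x0)                # initial gradient step
    y_prev, y = x0, x
    g_prev = grad_F(y_prev)
    for t in range(1, max_iter + 1):
        g = grad_F(y)
        if np.linalg.norm(g) <= eps:
            break
        z, w = y - y_prev, g - g_prev       # BB stepsize s_t = ||z||^2 / <z, w>
        s = float(z.ravel() @ z.ravel()) / float(z.ravel() @ w.ravel())
        x_next = y - s * g                              # gradient step from the extrapolated point
        y_next = x_next + t / (t + 3) * (x_next - x)    # Nesterov extrapolation
        x, y_prev, y, g_prev = x_next, y, y_next, g
    return x

# Toy usage: minimize F(x) = 0.5 * x^T A x with a random positive definite A.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5)); A = M @ M.T + np.eye(5)
x_star = accelerated_bb_descent(lambda x: A @ x, rng.normal(size=5))
print(np.linalg.norm(x_star))              # should be close to 0, the minimizer
```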

Slide 18

Slide 18 text

LoRAM: gradient computation
B_Ω(X, Y) := P_Ω(XYᵀ), with X ∈ R^{d×r} and Y ∈ R^{d×r}.
Proposition (gradient of h): the function h(X, Y) = tr(exp(σ(B_Ω(X, Y)))), for σ(A) = |A|, satisfies
∂_X h(X, Y) = (exp(σ(B))ᵀ ⊙ sign(B)) Y,
∂_Y h(X, Y) = (exp(σ(B))ᵀ ⊙ sign(B))ᵀ X,
where B := P_Ω(XYᵀ).
▶ Essentially: (A, C, ξ) ↦ (exp(A) ⊙ C) ξ for A, C ∈ R^{d×d} and ξ ∈ R^{d×r}
▶ Due to the Hadamard product (exp(A) ⊙ C), state-of-the-art numerical methods (Al-Mohy and Higham, 2011) for “(A, ξ) ↦ exp(A) ξ” are not applicable

Slide 19

Slide 19 text

LoRAM: gradient computation
Algorithm 2 Approximation of (A, C, ξ) ↦ (exp(A) ⊙ C) ξ for computing ∇̃h(X, Y)
1: F ← (I ⊙ C) ξ
2: for k = 0, . . . , m∗ do
3:   ξ ← (1/(k + 1)) (A ⊙ C) ξ
4:   F ← F + ξ
5:   ξ ← F
6: end for
7: Return F
▶ Choice of m∗: adapted from (Al-Mohy and Higham, 2011) for (A, ξ) ↦ exp(A) ξ
▶ Cost: O(|Ω| r), where |Ω| = ρ d² ≤ d² and r ≪ d.

Method | Search space | Memory req. | Complexity
NOTEARS | R^{d×d} | O(d²) | O(d³)
LoRAM (ours) | R^{d×r} × R^{d×r} | O(dr) | O(d²r)

▶ Inexact because ((A ⊙ C)^k) ξ ≠ (A^k ⊙ C) ξ.
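
A single-stage sketch of the idea behind Algorithm 2: replace the exact terms (A^k ⊙ C) ξ / k! by ((A ⊙ C)^k) ξ / k!, which is inexact (as noted above) but only needs sparse matrix-vector products. The m∗ selection and the scaling refinements of Al-Mohy and Higham (2011) are omitted here; the function name and default m∗ are illustrative.

```python
import numpy as np

def approx_expm_hadamard_action(A, C, xi, m_star=20):
    """Approximate (exp(A) ⊙ C) @ xi by the truncated series sum_k ((A ⊙ C)^k / k!) @ xi.

    Single-stage sketch: the k = 0 term is (I ⊙ C) @ xi, and each further term reuses one
    product with the (sparse) Hadamard matrix A ⊙ C, hence a cost of O(|Ω| r) per term.
    """
    M = A * C                                 # Hadamard product A ⊙ C
    F = np.diag(C)[:, None] * xi              # k = 0 term: (I ⊙ C) @ xi
    term = xi
    for k in range(1, m_star + 1):
        term = (M @ term) / k                 # (A ⊙ C)^k @ xi / k!
        F = F + term
    return F
```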

Slide 20

Slide 20 text

LoRAM: gradient computation
Algorithm 2 Approximation of (A, C, ξ) ↦ (exp(A) ⊙ C) ξ for computing ∇̃h(X, Y)
1: F ← (I ⊙ C) ξ
2: for k = 0, . . . , m∗ do
3:   ξ ← (1/(k + 1)) (A ⊙ C) ξ
4:   F ← F + ξ
5:   ξ ← F
6: end for
7: Return F
▶ Choice of m∗: adapted from (Al-Mohy and Higham, 2011) for (A, ξ) ↦ exp(A) ξ
▶ Cost: O(|Ω| r), where |Ω| = ρ d² ≤ d² and r ≪ d.

Method | Search space | Memory req. | Complexity
NOTEARS | R^{d×d} | O(d²) | O(d³)
LoRAM (ours) | R^{d×r} × R^{d×r} | O(dr) | O(d²r)

▶ Inexact because ((A ⊙ C)^k) ξ ≠ (A^k ⊙ C) ξ.
▶ How good is this approximation? Observation: accurate enough when ‖A‖ is bounded (while ‖C‖_∞ = ‖sign(B)‖_∞ = 1)

Slide 21

Slide 21 text

LoRAM: gradient computation
We observe how ⟨∇h(X, Y), ∇̃h(X, Y)⟩ evolves with ‖(X, Y)‖:
▶ measured at (cX, cY) for various c ∈ (0, 1), at a non-stationary point (X, Y)
▶ various sparsity levels ρ of Ω = Supp(A) (initial edge set)
[Figure: relative error and average cosine similarity between ∇h and ∇̃h, for ρ ∈ {1.0e-03, 5.0e-03, 1.0e-02, 5.0e-02}]

Slide 22

Slide 22 text

LoRAM: gradient computation
We observe how ⟨∇h(X, Y), ∇̃h(X, Y)⟩ evolves with ‖(X, Y)‖:
▶ measured at (cX, cY) for various c ∈ (0, 1), at a non-stationary point (X, Y)
▶ various sparsity levels ρ of Ω = Supp(A) (initial edge set)
Observation: the direction of ∇̃h (relative to ∇h) is preserved when ‖A‖ is bounded by 1.
[Figure: relative error and average cosine similarity between ∇h and ∇̃h, for ρ ∈ {1.0e-03, 5.0e-03, 1.0e-02, 5.0e-02}]

Slide 23

Slide 23 text

Experiments
Noise: reversed cause-effect pairs, E = σ_E B⋆ᵀ, with σ_E = 0.4 ‖B⋆‖_∞.
▶ TPR, FDR: true positive rate / false discovery rate
▶ SHD (structural Hamming distance): total number of edge additions, deletions and reversals needed
[Figure: runtime (seconds) versus the number of nodes d, for LoRAM (ours) and NOTEARS]

Table: Comparison with NOTEARS (Zheng et al., 2018) on the projection problem; each cell reads LoRAM (ours) / NOTEARS.
d | (α, r) | Runtime (sec) | TPR | FDR | SHD
100 | (5.0, 40) | 1.82 / 0.67 | 1.00 / 1.00 | 0.00e+0 / 0.0 | 0.0 / 0.0
200 | (5.0, 40) | 2.20 / 3.64 | 0.98 / 0.95 | 2.50e-2 / 0.0 | 1.0 / 2.0
400 | (5.0, 40) | 2.74 / 16.96 | 0.98 / 0.98 | 2.50e-2 / 0.0 | 4.0 / 4.0
600 | (5.0, 40) | 3.40 / 42.65 | 0.98 / 0.96 | 1.67e-2 / 0.0 | 6.0 / 16.0
800 | (5.0, 40) | 4.23 / 83.68 | 0.99 / 0.97 | 7.81e-3 / 0.0 | 5.0 / 22.0
1000 | (2.0, 80) | 7.63 / 136.94 | 1.00 / 0.96 | 0.00e+0 / 0.0 | 0.0 / 36.0
1500 | (2.0, 80) | 13.34 / 437.35 | 1.00 / 0.96 | 8.88e-4 / 0.0 | 2.0 / 94.0
2000 | (2.0, 80) | 20.32 / 906.94 | 1.00 / 0.96 | 7.49e-4 / 0.0 | 3.0 / 148.0

Slide 24

Slide 24 text

Remarks about LoRAM
✓ Significant improvement over NOTEARS (Zheng et al., 2018) for the projection problem.
✗ Inherent weakness of NOTEARS for causal structure learning:
▶ NOTEARS works fine when the noise variances are equal.
▶ With non-equal noise variances, score-based optimization can induce biases leading to erroneous causal orderings: the varsortability bias (Reisach et al., 2021).
Perspectives:
▶ Application of LoRAM to causal structure learning
▶ Serving as a general-purpose tool for learning DAGs from directed or mixed graphs.

Slide 25

Slide 25 text

Outline
Introduction
Continuous optimization approaches
A distributed approach for causal structure learning
Conclusion and perspectives

Slide 26

Slide 26 text

Divide-and-conquer in three phases
DCILP (Dong et al., 2024): Distributed causal discovery using ILP
1. Phase 1: divide X = (X_1, . . . , X_d) into different subsets S_1, S_2, . . .
2. Phase 2: learn a subgraph from the data restricted to each S_i, separately
3. Phase 3: aggregate the subgraphs through an ILP formulation
[Figure: overview of the pipeline, panels (a) to (d), with the local graphs B(i) and the merged graph B]
How it differs from related work:
▶ Phase 1–Phase 2: parallelization vs sequential processing
▶ Phase 3: ILP-based vs simple rule-based merging
Dong et al. (2024): SD, Michèle Sebag, Kento Uemura, Akito Fujii, Shuang Chang, Yusuke Koyanagi, Koji Maruhashi. DCDILP: a distributed learning method for large-scale causal structure learning. arXiv preprint arXiv:2406.10481, 2024.

Slide 27

Slide 27 text

DCILP (Phase 1): divide by Markov blankets
Definition (e.g., Peters et al. (2017)): The Markov blanket MB(X_i) of a variable X_i is the smallest set M ⊂ X such that X_i ⊥⊥ X \ (M ∪ {X_i}) given M.
Theorem (Loh and Bühlmann, 2014): Under a faithfulness assumption, the Markov blankets can be identified via the support of the precision matrix: M(B) = Supp((cov(X))^{−1}).
[Figure: a DAG B and the graph M(B) given by the Markov blankets, on ten nodes]
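
A minimal sketch of Phase 1 in the spirit of the Loh and Bühlmann (2014) result: estimate the precision matrix from data and read off the Markov blankets from its support. The plain empirical estimator and the fixed threshold below are simplifications, not the estimator used in the talk.

```python
import numpy as np

def markov_blankets(X, threshold=0.05):
    """Estimate MB(X_i) for each variable from the support of the empirical precision matrix."""
    theta = np.linalg.inv(np.cov(X, rowvar=False))     # empirical precision matrix
    support = np.abs(theta) > threshold
    np.fill_diagonal(support, False)
    return [np.flatnonzero(support[i]) for i in range(X.shape[1])]

# Toy usage on data simulated from a chain X0 -> X1 -> X2: MB(X1) should be {0, 2}.
rng = np.random.default_rng(0)
n = 5000
x0 = rng.normal(size=n)
x1 = 0.9 * x0 + rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(size=n)
print(markov_blankets(np.column_stack([x0, x1, x2])))
```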

Slide 28

Slide 28 text

DCILP: algorithm
Algorithm 2 (DCILP) Distributed causal discovery using ILP
1: (Phase 1) Divide: estimate the Markov blanket MB(X_i) for i = 1, . . . , d
2: (Phase 2) for i = 1, . . . , d do in parallel
3:   A(i) ← causal discovery on S_i := MB(X_i) ∪ {X_i} using GES or DAGMA (Bello et al., 2022)
4:   B(i)_{j,k} ← A(i)_{j,k} if j = i or k = i, and 0 otherwise
5: (Phase 3) Conquer: B ← reconciliation of {B(i), i = 1, . . . , d} through the ILP

Slide 29

Slide 29 text

Phase 3: causal structure by binary variables
B_ij = 1 if X_i → X_j
V_ijk = V_jik = 1 if there is a v-structure X_i → X_k ← X_j
S_ij = S_ji = 1 if X_i and X_j are spouses, i.e., ∃ k such that V_ijk = 1.
[Figure: the ground-truth DAG B⋆ and the graph M(B⋆) of its Markov blankets, on ten nodes]

Slide 30

Slide 30 text

Phase 3: ILP constraints
For all i, j, k such that i ≠ j, j ≠ k, k ≠ i (with S_i := MB(X_i) ∪ {X_i}):
B_ij = 0, S_ij = S_ji = 0 if X_i ∉ MB(X_j)   (2)
B_ij + B_ji + S_ij ≥ 1 if X_i ∈ MB(X_j)   (3)
B_ij + B_ji ≤ 1 if X_i ∈ MB(X_j)   (4)
V_ijk ≤ B_ik, V_ijk ≤ B_jk if {i, j, k} ⊂ (S_i ∩ S_j ∩ S_k)   (5)
B_ik + B_jk ≤ 1 + V_ijk if {i, j, k} ⊂ (S_i ∩ S_j ∩ S_k)   (6)
V_ijk ≤ S_ij if {i, j, k} ⊂ (S_i ∩ S_j ∩ S_k)   (7)
S_ij ≤ Σ_k V_ijk if {i, j, k} ⊂ (S_i ∩ S_j ∩ S_k)   (8)
[Figure: the ground-truth DAG B⋆ and the graph M(B⋆) of its Markov blankets, on ten nodes]

Slide 31

Slide 31 text

Phase 3: the ILP for merging all local solutions
(With S_i := MB(X_i) ∪ {X_i} and B̄ := Σ_{i=1}^d B(i), the naive merge from Phase 2.)
max_{B, S, V} ⟨B, B̄⟩   (9)
subject to
B_ij = 0, S_ij = S_ji = 0 if X_i ∉ MB(X_j)   (2)
B_ij + B_ji + S_ij ≥ 1 if X_i ∈ MB(X_j)   (3)
B_ij + B_ji ≤ 1 if X_i ∈ MB(X_j)   (4)
V_ijk ≤ B_ik, V_ijk ≤ B_jk if B̄_ik ≠ 0 and B̄_jk ≠ 0   (5)
B_ik + B_jk ≤ 1 + V_ijk if B̄_ik ≠ 0 and B̄_jk ≠ 0   (6)
V_ijk ≤ S_ij if B̄_ik ≠ 0 and B̄_jk ≠ 0   (7)
S_ij ≤ Σ_k V_ijk if B̄_ik ≠ 0 and B̄_jk ≠ 0   (8)
[Figure: the ground-truth DAG B⋆ and the graph M(B⋆) of its Markov blankets]
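
An illustrative encoding of the Phase 3 program with the open-source PuLP modeler (the talk's implementation uses Gurobi). Only the objective (9) and constraints (2)–(4) are shown; the v-structure constraints (5)–(8), the symmetry of S, and the naive merge B̄ and Markov blankets used as inputs here are toy simplifications.

```python
import numpy as np
import pulp

d = 4
B_bar = np.array([[0, 1, 0, 0],       # toy naive merge of the local graphs B(i)
                  [1, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
MB = (B_bar + B_bar.T) > 0             # toy Markov blankets: adjacency in B_bar

prob = pulp.LpProblem("phase3_merge", pulp.LpMaximize)
idx = [(i, j) for i in range(d) for j in range(d) if i != j]
B = pulp.LpVariable.dicts("B", idx, cat="Binary")
S = pulp.LpVariable.dicts("S", idx, cat="Binary")

prob += pulp.lpSum(int(B_bar[i, j]) * B[(i, j)] for (i, j) in idx)   # objective (9): <B, B_bar>
for i, j in idx:
    if not MB[i, j]:
        prob += B[(i, j)] == 0                                        # constraint (2)
        prob += S[(i, j)] == 0
    else:
        prob += B[(i, j)] + B[(j, i)] + S[(i, j)] >= 1                # constraint (3)
        prob += B[(i, j)] + B[(j, i)] <= 1                            # constraint (4)
# Constraints (5)-(8) on the v-structure variables V_ijk are omitted from this sketch.

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([(i, j) for (i, j) in idx if B[(i, j)].value() == 1])
```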

Slide 32

Slide 32 text

Phase 3: the ILP for merging all local solutions
(With S_i := MB(X_i) ∪ {X_i} and B̄ := Σ_{i=1}^d B(i), the naive merge from Phase 2.)
max_{B, S, V} ⟨B, B̄⟩   (9)
subject to
B_ij = 0, S_ij = S_ji = 0 if X_i ∉ MB(X_j)   (2)
B_ij + B_ji + S_ij ≥ 1 if X_i ∈ MB(X_j)   (3)
B_ij + B_ji ≤ 1 if X_i ∈ MB(X_j)   (4)
V_ijk ≤ B_ik, V_ijk ≤ B_jk if B̄_ik ≠ 0 and B̄_jk ≠ 0   (5)
B_ik + B_jk ≤ 1 + V_ijk if B̄_ik ≠ 0 and B̄_jk ≠ 0   (6)
V_ijk ≤ S_ij if B̄_ik ≠ 0 and B̄_jk ≠ 0   (7)
S_ij ≤ Σ_k V_ijk if B̄_ik ≠ 0 and B̄_jk ≠ 0   (8)
[Figure: three graphs labeled B̄, B and B′ over variables X, Y, Z, U, V]

Slide 33

Slide 33 text

DCILP: experiments
Algorithm 2 (DCILP) Distributed causal discovery using ILP
1: (Phase 1) Divide: estimate the Markov blanket MB(X_i) for i = 1, . . . , d
2: (Phase 2) for i = 1, . . . , d do in parallel
3:   A(i) ← causal discovery on S_i := MB(X_i) ∪ {X_i} using GES or DAGMA (Bello et al., 2022)
4:   B(i)_{j,k} ← A(i)_{j,k} if j = i or k = i, and 0 otherwise
5: (Phase 3) Conquer: B ← reconciliation of {B(i), i = 1, . . . , d} through the ILP
▶ Phase 1: empirical precision matrix estimator
▶ Phase 2: parallelized on min(2d, 400) CPU cores, running on Ruche (Mésocentre Paris-Saclay)
▶ Phase 3: implementation with Gurobi tools

Slide 34

Slide 34 text

DCILP experiments: ILP versus the naive merge
[Figure: three adjacency submatrices B_{S,S} over a subset S of variables (a reference matrix, one with TPR 0.96 and FDR 0.35, one with TPR 0.97 and FDR 0.27), followed by TPR, FDR and SHD versus the number of variables d for the naive merge B̄ and the ILP solution (DCILP-dagma)]
Figure: Comparison with the naive merge B̄: DCILP on SF3 data.

Slide 35

Slide 35 text

DCILP experiments: ILP versus the naive merge
[Figure: three adjacency submatrices B_{S,S} over a subset S of variables (a reference matrix, one with TPR 1.00 and FDR 0.10, one with TPR 1.00 and FDR 0.00), followed by TPR, FDR and SHD versus the number of variables d for the naive merge B̄ and the ILP solution (DCILP-dagma)]
Figure: Comparison with the naive merge B̄: DCILP on SF3 data.

Slide 36

Slide 36 text

DCILP: experiments
[Figure: TPR, FDR and runtime (seconds) versus the number of variables d, for DAGMA, DCILP-dagma (MB*) and DCILP-dagma]
Figure: Comparison with DAGMA (Bello et al., 2022) on ER2 data.

Slide 37

Slide 37 text

DCILP: experiments
[Figure: TPR, FDR and SHD versus the number of variables d, for GES (pcalg), DCILP-ges (MB*) and DCILP-ges]
Figure: Comparison with GES on ER2 data.

Slide 38

Slide 38 text

DCILP experiments: MUNIN network
▶ A DAG with d = 1041 nodes (https://www.bnlearn.com/bnrepository/)
▶ Medical expert-system model based on electromyographs (EMG)
[Figure: TPR, SHD and runtime (seconds) for DAGMA, GES (pcalg), DCILP-dagma and DCILP-ges]
Figure: Results on the MUNIN network data.

Slide 39

Slide 39 text

Outline
Introduction
Continuous optimization approaches
A distributed approach for causal structure learning
Conclusion and perspectives

Slide 40

Slide 40 text

Conclusion
▶ LoRAM: an efficient and scalable continuous optimization method for finding proximal DAGs.
▶ DCILP: a distributed approach for learning large-scale causal structures from data.
▶ Significant improvement in scalability for learning sparse and large causal graphs.
Perspectives:
▶ Extend applicability:
  ▶ Nonlinear models
  ▶ Robustness to changes of scale in the measurements/observations
▶ Adapt to the learning of denser causal graphs
▶ Causal modeling with latent variables

Slide 41

Slide 41 text

Thank You !

Slide 42

Slide 42 text

References
Al-Mohy, A. H. and Higham, N. J. (2011). Computing the action of the matrix exponential, with an application to exponential integrators. SIAM Journal on Scientific Computing, 33(2):488–511.
Bello, K., Aragam, B., and Ravikumar, P. (2022). DAGMA: Learning DAGs via M-matrices and a log-determinant acyclicity characterization. Advances in Neural Information Processing Systems, 35:8226–8239.
Chickering, D. M. (1996). Learning Bayesian networks is NP-complete. Learning from Data: Artificial Intelligence and Statistics V, pages 121–130.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554.
Dong, S. and Sebag, M. (2022). From graphs to DAGs: a low-complexity model and a scalable algorithm. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 107–122. Springer.
Dong, S., Sebag, M., Uemura, K., Fujii, A., Chang, S., Koyanagi, Y., and Maruhashi, K. (2024). DCDILP: a distributed learning method for large-scale causal structure learning. arXiv preprint arXiv:2406.10481.
Ghassami, A., Yang, A., Kiyavash, N., and Zhang, K. (2020). Characterizing distribution equivalence and structure learning for cyclic and acyclic directed graphs. In International Conference on Machine Learning, pages 3494–3504. PMLR.

Slide 43

Slide 43 text

References (cont.)
Kalainathan, D., Goudet, O., Guyon, I., Lopez-Paz, D., and Sebag, M. (2022). Structural agnostic modeling: Adversarial learning of causal graphs. Journal of Machine Learning Research, 23(219):1–62.
Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3).
Loh, P.-L. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065–3105.
Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.
Reisach, A. G., Seiler, C., and Weichwald, S. (2021). Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 27772–27784.
Sauer, A. and Geiger, A. (2021). Counterfactual generative networks. In International Conference on Learning Representations (ICLR).
Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, Prediction, and Search.

Slide 44

Slide 44 text

References (cont.)
Yuan, Y. and Bar-Joseph, Z. (2019). Deep learning for inferring gene relationships from single-cell expression data. Proceedings of the National Academy of Sciences, 116(52):27151–27158.
Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 31.