Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Infer Gene Regulatory Networks from Time Series

Luigi Cerulo
December 18, 2013

Infer Gene Regulatory Networks from Time Series

Infer Gene Regulatory Networks from Time Series Data with Formal Methods

Luigi Cerulo

December 18, 2013
Tweet

Other Decks in Research

Transcript

  1. Infer Gene Regulatory Networks from Time Series Data with Formal

    Methods Michele Ceccarelli Luigi Cerulo Antonella Santone Benevento - Italy www.unisannio.it
  2. Gene Space Layer • Gene regulatory network TF TF protein

    gene A gene B Gene B Gene A dna model
  3. Gene Regulatory Network (GRN) • A gene regulatory network can

    be 
 represented as a graph G = (Vertices, Edges) • Vertices = Genes • G1, G2, G3, ... • Edges = Interactions • Activation • Inhibition G1 G2 G3 G4 G5 G6 G7 G8
  4. The aim: Inference of Gene regulatory networks • Gene experimental

    conditions (e.g. Microarray experiments) G1 G2 G3 G4 G5 G6 G7 G8
  5. The Literature of Inference Methods • Correlation methods (e.g. Mutual

    Information,...) • Boolean Networks • Bayesian Networks • Ordinary Differential Equations • ...
  6. • Formal Methods Our approach: FormalM complex system formal model

    specification language (e.g. CCS, mu-Calculus) simulation/ verification Automated proof model checking - rules - temporal logic execute observe outcome (TRUE/FALSE) inference Formal methods are a particular kind of mathematically based techniques for the specification, development and verification of complex system (e.g. software and hardware systems)
  7. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 FormalM: The overall schema FormalM
  8. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - time series discretization UP UP UP BS BS DW DW DW
  9. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - time series discretization B A C BUP ABS CUP BUP ADW CDW BUP ADW CDW BBS ADW CDW BBS ABS CDW BDW ABS CDW BDW ABS CDW BDW ABS CDW
  10. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - CCS model B A C BUP ABS CUP BUP ADW CDW BUP ADW CDW BBS ADW CDW BBS ABS CDW BDW ABS CDW BDW ABS CDW BDW ABS CDW
  11. CCS  (Milner’s Calculus of Communicating Systems) • CCS is

    a process algebra particularly useful to model concurrent systems. Every entity is modeled as a process named with a lower case letter (e.g. p) • Provides basic operators to build finite processes
 communication operators to express concurrency
 and some notion of recursion to capture infinite behavior. • Operators:
 + (the alternative operator) the process p + q is a process that non- deterministically behaves either as p or as q. 
 
 ; (the sequence operator) this operator is suitably used to express the sequentialization between two processes. The process p ; q means that p must terminate before the process q can start its execution.
 
 || (the parallel operator) the process p || q represents the parallel execution of p and q and terminates only if both processes terminate.
  12. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - CCS model B A C BUP ABS CUP BUP ADW CDW BUP ADW CDW BBS ADW CDW BBS ABS CDW BDW ABS CDW BDW ABS CDW BDW ABS CDW
  13. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - CCS model B A C BUP ABS CUP BUP ADW CDW BUP ADW CDW BBS ADW CDW BBS ABS CDW BDW ABS CDW BDW ABS CDW BDW ABS CDW A gene is modeled as a process evolving in time among three states (UP, BS, DW)
  14. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - CCS model B A C BUP ABS CUP BUP ADW CDW BUP ADW CDW BBS ADW CDW BBS ABS CDW BDW ABS CDW BDW ABS CDW BDW ABS CDW A gene is modeled as a process evolving in time among three states (UP, BS, DW) (ABS || BUP || CUP ) ; (ADW || BUP || CDW ) ; (ADW || BUP || CDW ) ; ...
  15. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - CCS model Multiple perturbation experiments are connected with the alternative operator (+) (ABS || BUP || CUP ) ; (ADW || BUP || CDW ) ; (ADW || BUP || CDW ) ; ... (ABS || BUP || CUP ) ; (ADW || BUP || CDW ) ; (ADW || BUP || CDW ) ; ... + (ABS || BUP || CUP ) ; (ADW || BUP || CDW ) ; (ADW || BUP || CDW ) ; ... + (ABS || BUP || CUP ) ; (ADW || BUP || CDW ) ; (ADW || BUP || CDW ) ; ... +
  16. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - model checking
  17. • Exists a direct activation 
 from A to B?

    • Exists a direct inhibition 
 arch from A to B? Checked properties X Y X Y express concurrency, and some notion of e infinite behaviour. The semantics of a cisely defined by means of the structural cs. The semantic definition is given by a les describing the transition relation of the nding to the behavior expression defining aper we use the following CCS operators: cess p + q is a process that non- ly behaves either as p or as q. or is suitably used to express the sequen- ween two processes. The process p;q means rminate before the process q can start its ss p￿q represents the parallel execution of rminates only if both processes terminate. on for defining properties. This need can be oral logic as, for example, the mu-calculus l logics present constructs allowing to state hat, for instance, all scenarii will respect ery step, or that some particular event will and so on. to check if a system satisfies a property. several algorithms exist. The most used ology is model checking [25]. In the model k, systems are modeled as transition sys- nts are expressed as formulae in temporal cker then accepts two inputs, a transition ral formula, and returns true if the system a and false otherwise. thm n genes and a set of k time course proach starts considering a directed graph the fact that each perturbation is independent from each other. The i-th perturbation–process pi (∀i ∈ [1..k]) is modeled as a sequence of network–state–processes of the form q1;...;qm indicating the sequence of states in which the gene network can be during the i-th perturbation. In particular, starting form q1 , each network–state–process qj (j ∈ [1..m−1]) evolves into the network–state–process qj+1 and is modeled as the parallel composition of n gene–state–processes, specifying one of the possible discrete state of each gene (Up, Basal, Down). The CCS model is then used to prune the original graph checking whether all edges are correct (i.e., explains the data or not) and eliminating the arcs that do not satisfy certain properties expressed in a temporal logic. This is performed by using model checking techniques. Two properties have been defined to model two fundamental biological functions between two genes X and Y : gene activation and gene inhibition. For each activator edge e ∶ X ￿→Y , we check on the CCS model the following property, expressed using the selective mu-calculus logic [27]: = [X Up]￿ ￿Y Up￿￿ [Y Down]￿ ff which states for: “if the activator edge e exists, then whenever the gene X is Up the gene Y must become Up and then it cannot become Down”. If the model does not satisfy , the edge e is eliminated. Similarly, for each inhibitor edge e ∶ X ￿ Y , we check on the CCS model the following property: ' = [X Up]￿ ￿Y Down￿￿ [Y Up]￿ ff which states for: “if the inhibitor edge e exists, then whenever the gene X is Up the gene Y must become Down and then it cannot become Up”. Issues related to complexity and scalability cisely defined by means of the structural cs. The semantic definition is given by a les describing the transition relation of the nding to the behavior expression defining aper we use the following CCS operators: cess p + q is a process that non- y behaves either as p or as q. or is suitably used to express the sequen- een two processes. The process p;q means minate before the process q can start its s p￿q represents the parallel execution of rminates only if both processes terminate. on for defining properties. This need can be oral logic as, for example, the mu-calculus logics present constructs allowing to state at, for instance, all scenarii will respect ery step, or that some particular event will and so on. o check if a system satisfies a property. several algorithms exist. The most used ology is model checking [25]. In the model k, systems are modeled as transition sys- nts are expressed as formulae in temporal cker then accepts two inputs, a transition ral formula, and returns true if the system and false otherwise. hm n genes and a set of k time course proach starts considering a directed graph es (Figure 1). The vertices of the graph i a sequence of network–state–processes of the form q1;...;qm indicating the sequence of states in which the gene network can be during the i-th perturbation. In particular, starting form q1 , each network–state–process qj (j ∈ [1..m−1]) evolves into the network–state–process qj+1 and is modeled as the parallel composition of n gene–state–processes, specifying one of the possible discrete state of each gene (Up, Basal, Down). The CCS model is then used to prune the original graph checking whether all edges are correct (i.e., explains the data or not) and eliminating the arcs that do not satisfy certain properties expressed in a temporal logic. This is performed by using model checking techniques. Two properties have been defined to model two fundamental biological functions between two genes X and Y : gene activation and gene inhibition. For each activator edge e ∶ X ￿→Y , we check on the CCS model the following property, expressed using the selective mu-calculus logic [27]: = [X Up]￿ ￿Y Up￿￿ [Y Down]￿ ff which states for: “if the activator edge e exists, then whenever the gene X is Up the gene Y must become Up and then it cannot become Down”. If the model does not satisfy , the edge e is eliminated. Similarly, for each inhibitor edge e ∶ X ￿ Y , we check on the CCS model the following property: ' = [X Up]￿ ￿Y Down￿￿ [Y Up]￿ ff which states for: “if the inhibitor edge e exists, then whenever the gene X is Up the gene Y must become Down and then it cannot become Up”. Issues related to complexity and scalability The complexity of the extraction of the CCS model from Biological question Time series property Exist an activator 
 edge from X to Y? Exist an inhibitor 
 edge from X to Y? if exists then whenever the gene X is UP the gene Y must become DOWN and then it cannot become UP if exists then whenever the gene X is UP the gene Y must become UP and then it cannot become DOWN Y X Y X
  18. dcuB% dcuR% frdC% narL% frdB% perturbation k dcuB% dcuR% frdC%

    narL% frdB% perturbation 2 Biological properties expressed in temporal logic narL frdC dcuR dcuB frdB narL frdC dcuR dcuB frdB Inferred network Initial network (fully connected or estimated with approaches exhibiting high recall) CCS model generation Model checking (network pruning) Discretization Multiple Time course experiments dcuB% dcuR% frdC% narL% frdB% perturbation 1 The overall schema - model checking For each arch in a fully connected network we check whether the CCS rules are satisfied by the model (i.e. they must return TRUE)
  19. Performance evaluation - baselines Reference Based on Multiple time series

    Undirected Directed Signed Aracne Margolin et al., 2006 Mutual Information X CLR Faith et al., 2007 Mutual Information X TD-Aracne Zoppoli et al., 2010 Shifted version of Mutual Information X X TSNIF Cantone et al., 2009 ODE X X X Banjo Yu et al., 2004 Bayesian Nets X X X X
  20. Performance evaluation - datasets • Experimental in vitro data •

    E.coli SOS pathway • S.Cerevisiae cell cycle • IRMA • Simulated in silico data benchmark • Generated with GeneNetWeaver 
 (http://gnw.sourceforge.net, adopted in DREAM challenges)
  21. E.coli SOS Pathway • DNA damage tolerance and repair activated

    through recA and lexA • Time course profiles of 14 time points made available by Ronen et al. Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics Michal Ronen†, Revital Rosenberg†, Boris I. Shraiman‡, and Uri Alon†§¶ Departments of †Molecular Cell Biology and §Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel; and ‡Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974 Edited by David Botstein, Stanford University School of Medicine, Stanford, CA, and approved June 5, 2002 (received for review January 28, 2002) A basic challenge in systems biology is to understand the dynam- ical behavior of gene regulation networks. Current approaches aim at determining the network structure based on genomic-scale data. However, the network connectivity alone is not sufficient to define its dynamics; one needs to also specify the kinetic parameters for the regulation reactions. Here, we ask whether effective kinetic parameters can be assigned to a transcriptional network based on expression data. We present a combined experimental and theo- retical approach based on accurate high temporal-resolution mea- surement of promoter activities from living cells by using green fluorescent protein (GFP) reporter plasmids. We present algorithms that use these data to assign effective kinetic parameters within a mathematical model of the network. To demonstrate this, we employ a well defined network, the SOS DNA repair system of Escherichia coli. We find a strikingly detailed temporal program of expression that correlates with the functional role of the SOS assembly pathway (10). Here, we extend it by presenting analysis algorithms that use accurate expression data to assign kinetic parameters that can be incorporated into a mathematical model of the dynamics. We apply this method to a well characterized transcriptional network, the SOS DNA repair system in Escherichia coli. The SOS system includes about 30 operons regulated at the tran- scriptional level (12–16). A master repressor (LexA) binds sites in the promoter regions of these operons (16, 17). One of the SOS proteins, RecA, acts as a sensor of DNA damage: by binding to single-stranded DNA it becomes activated and mediates LexA autocleavage. The drop in LexA levels causes the de-repression of the SOS genes (Fig. 2). Once damage has been repaired or bypassed, the level of activated RecA drops, LexA accumulates and represses the SOS operons, and the cells return to their original state. recA lexA uvrA umuDC polB uvrY uvrD ruvA
  22. Results - E.coli SOS pathway F-Measure 0% 20% 40% 60%

    80% 100% Aracne CLR TD-Aracne FormalM TSNIF Banjo Signed Directed Undirected recA lexA uvrA umuDC polB uvrY uvrD ruvA FormalM FormalM FormalM
  23. S.Cerevisiae cell cycle • Netwrok induced by Cln3 and Clb28,

    two cell cycle regulators • Time course profiles of 16 time points made available by Spellman et al. Molecular Biology of the Cell Vol. 9, 3273–3297, December 1998 Comprehensive Identification of Cell Cycle–regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization□ D Paul T. Spellman,*† Gavin Sherlock,*†‡ Michael Q. Zhang,‡ Vishwanath R. Iyer,§ Kirk Anders,* Michael B. Eisen,* Patrick O. Brown,§ሻ David Botstein,*¶ and Bruce Futcher‡ *Department of Genetics, Stanford University Medical Center, Stanford, California 94306-5120; ‡Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724-2209; §Department of Biochemistry, Stanford University Medical Center, Stanford, California 94306-5428; and ࿣Howard Hughes Medical Institute, Stanford, California 94305-5428 Submitted September 4, 1998; Accepted October 15, 1998 Monitoring Editor: Gerald R. Fink We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: ␣ factor arrest, elutria- CLN3 SWI4 SWI6 MBP1 CDC28 CDC6 CLB5 CLB6 CLN1 CLN2 SIC1
  24. Results - S.Cerevisiae Cell Cycle F-Measure 0% 20% 40% 60%

    80% 100% Aracne CLR TD-Aracne FormalM TSNIF Banjo Signed Directed Undirected CLN3 SWI4 SWI6 MBP1 CDC28 CDC6 CLB5 CLB6 CLN1 CLN2 SIC1 FormalM FormalM FormalM
  25. IRMA • An in vivo fully controlled experimental network built

    in the S.Cerevisiae and composed of five genes • The netwrok is perturbed by culturing cells in presence of galactose or glucose • Time course profiles of 16 + 21 time points made available by Cantone et al. ASH1 CBF1 GAL80 GAL4 SWI5 Resource A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches Irene Cantone,1 Lucia Marucci,1,2 Francesco Iorio,1 Maria Aurelia Ricci,1 Vincenzo Belcastro,1 Mukesh Bansal,1 Stefania Santini,2 Mario di Bernardo,2 Diego di Bernardo,1,2,3,* and Maria Pia Cosma1,3,* 1Telethon Institute of Genetics and Medicine (TIGEM), Naples 80131, Italy 2Department of Computer and Systems Engineering, University of Naples ‘‘Federico II,’’ Naples 80125, Italy 3These authors contributed equally to this work *Correspondence: [email protected] (D.d.B.), [email protected] (M.P.C.) DOI 10.1016/j.cell.2009.01.055 SUMMARY Systems biology approaches are extensively used to model and reverse engineer gene regulatory networks from experimental data. Conversely, synthetic biology allows ‘‘de novo’’ construction of a regulatory network to seed new functions in the cell. At present, the usefulness and predictive ability describe changes in concentration of each gene transcript and protein in a network, as a function of their regulatory interactions (gene regulatory network). The usefulness of a model lies in its ability to formalize the knowledge about the biological process at hand, to identify inconsistencies between hypotheses and observations, and to predict the behavior of the biological process in yet untested conditions. There are a variety of mathematical formalisms proposed in literature (Di Ventura et al., 2006; Szallasi et al.,
  26. Results - IRMA F-Measure 0% 20% 40% 60% 80% 100%

    Aracne CLR TD-Aracne FormalM TSNIF Banjo Signed Directed Undirected ASH1 CBF1 GAL80 GAL4 SWI5 FormalM FormalM FormalM
  27. Simulated in silico data benchmark Research questions • How do

    the performance varies with the size of the network to be predicted? • How do the performance varies with the noise level in time series data? • How do the performance varies with the size number of available perturbations?
  28. Results - Simulated data • How do the performance varies

    with the size of the network to be predicted?
 
 We generated 20 random network of size N (4,5,6,10, 15, 20, 50, 70, 100) and for each network we generated 
 2N+1 time-course experiments of 100 time points each. • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 0.00 0.25 0.50 0.75 1.00 4 5 6 10 20 30 50 70 100 No. of genes F−measure • • • • • • ARACNE CLR FormalM TD−Aracne TSNIF BANJO FormalM
  29. Results - Simulated data • How do the performance varies

    with the noise level in time series data?
 
 We applied on the previous generated time course data 5 increasing levels of gaussian noise • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 0.00 0.25 0.50 0.75 1.00 ns0 ns1 ns2 ns3 ns4 ns5 Noise level F−measure • • • • • • ARACNE CLR FormalM TD−Aracne TSNIF BANJO FormalM
  30. Results - Simulated data • How do the performance varies

    with the size number of available perturbations?
 
 We sampled (10 times) without replacement a random subset of R (1..N) time-course experiments from the total number of 2N+1 • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 0.00 0.25 0.50 0.75 1.00 1 2 3 4 5 6 7 8 9 10 No. of perturbations F−measure • • • • • • ARACNE CLR FormalM TD−Aracne TSNIF BANJO FormalM
  31. Pros and Cons Pros • The performance drops slightly with

    large networks • The performance seems not dependent from data noise • The performance drops slightly when less perturbations are available • With formal models we have the opportunity to include new biological knowledge Cons • The computational cost increase exponentially with the network size and with the number of perturbations • Performance depends strongly from time series discretization methods.