Seminar A Presentation

Large Scale Gene Network Estimation with Bayesian Networks
Matthew Barga
9/26/2013

Contents
1. Motivation
2. Bayesian Network Approach
3. Previous Works
4. Current Work

Biological Background
• Recent increase in the amount of biological data available
• Rise of quantitative applications means increased use of computational analysis
• Access to datasets from DNA microarray measurements is one example

DNA Microarray Experiments
• Map a microarray measurement to a computer dataset
• Can discretize gene expression into three levels {0, 1, 2} or use continuous analysis
Legend: gene expressed equally in control and tumor samples; gene expressed more in tumor sample; gene expressed more in control sample
Source: http://www.genome.gov/10000533

DNA Microarray Experiments
Many traditional analysis methods exist, including hierarchical clustering, classification, discovery of tumor subtypes, etc.
Source: http://www.uib.no

Bayesian Network Approach
Reasons for adopting a Bayesian network approach:
• Probabilistic method that can capture causal relationships
• Can effectively deal with noisy data
• Allows integration of prior knowledge

Bayesian Network Approach
Simply defined, a Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) (Wikipedia).
Example nodes: Sprinkler, Rain, Wet Grass
Rain: P(rain = T), P(rain = F)
Sprinkler: P(sprinkler | rain = T), P(sprinkler | rain = F), ...
Wet Grass: P(wet grass | r = T, s = F), P(wet grass | r = T, s = T), ...
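The sprinkler example can be made concrete with a small sketch. The joint distribution factorizes along the DAG as P(rain) · P(sprinkler | rain) · P(wet grass | rain, sprinkler). All probability values below are hypothetical, chosen only for illustration (the slide gives no numbers), and the names `joint` and `prob_rain_given_wet` are likewise illustrative.

```python
from itertools import product

# Hypothetical conditional probability tables (not from the slides).
P_RAIN = {True: 0.2, False: 0.8}
# P(sprinkler = s | rain = r), indexed as P_SPRINKLER[r][s]
P_SPRINKLER = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# P(wet grass = True | rain = r, sprinkler = s), indexed by (r, s)
P_WET_TRUE = {(True, True): 0.99, (True, False): 0.8,
              (False, True): 0.9, (False, False): 0.0}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as the product of the local models."""
    p_wet_true = P_WET_TRUE[(rain, sprinkler)]
    p_wet = p_wet_true if wet else 1.0 - p_wet_true
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * p_wet


def prob_rain_given_wet():
    """P(rain = T | wet grass = T), obtained by summing out the sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True)
                      for r, s in product((True, False), repeat=2))
    return numerator / denominator
```

Because each node only needs a table conditioned on its parents, the full joint over three binary variables is stored as three small tables rather than one table over all eight states.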

Bayesian Network Approach
Two parts of a network:
1. dependency structure (structured as DAG)
2. local probability models
Source: http://www.cs.huji.ac.il/labs/compbio/expression/html_ver/sld004.htm

Problem Definition
Find the dependency structure that best explains the given data.
Given a set of nodes and their expression levels, estimate a network by selecting the best set of parents for each node.

Problem Definition
Search and Score Method
Given an initial graph (which can be empty), search candidate parents for each node, score the subgraph of each {child, parent set} candidate, and choose the set that maximizes the score of the network.
The score is a measure of the probability that the given data was generated by a graph.
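As a rough illustration of a decomposable score, the sketch below computes a BIC-style local score for one {child, parent set} pair on discretized data and sums local scores into a network score. BIC is used here only as a stand-in: the slides do not say which score the talk uses, and the function names, the data layout (a list of dicts), and the default of three expression levels are assumptions.

```python
import math
from collections import Counter


def local_bic(data, child, parents, arity=3):
    """BIC-style local score of `child` given `parents` (assumed stand-in score).

    data: list of dicts mapping gene name -> discrete level in range(arity).
    """
    n = len(data)
    # Counts of (parent configuration, child value) and of parent configuration.
    joint_counts = Counter(
        (tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximized log-likelihood of the child's conditional distribution.
    loglik = sum(count * math.log(count / parent_counts[config])
                 for (config, _), count in joint_counts.items())
    # Free parameters: (arity - 1) per parent configuration.
    n_params = (arity - 1) * arity ** len(parents)
    return loglik - 0.5 * n_params * math.log(n)


def network_score(data, parent_sets, arity=3):
    """Sum of local scores; parent_sets maps each child to its parent list."""
    return sum(local_bic(data, child, parents, arity)
               for child, parents in parent_sets.items())
```

Because the network score is a sum of per-node terms, changing one node's parent set only requires recomputing that node's local score, which is what makes local search practical.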

Problem Definition
Source: Tamada, 2013

Difficulties
• Number of possible networks is extremely large
  ◦ Current parallel large scale methods can only find optimal network for ~30 genes
• Dynamic programming and caching can be used to alleviate some redundant score calculation
• Heuristic methods must be employed to evaluate large scale networks
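To give a feel for how large the search space is, the number of labeled DAGs on n nodes can be computed with Robinson's recurrence, a(n) = Σ_{k=1..n} (-1)^(k+1) · C(n, k) · 2^(k(n-k)) · a(n-k), with a(0) = 1. The sketch below is illustrative and not part of the talk; the function name `count_dags` is an assumption.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(1, 6):
    print(n, count_dags(n))  # 1, 3, 25, 543, 29281
```

Five nodes already admit 29281 structures, and the count grows super-exponentially from there, which is why exhaustive enumeration is out of reach for gene-scale networks.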

Previous Works - Heuristic Approach
• Greedy Hill-Climbing Algorithm
• Random Sampling and Repeat
• Neighbor Node Sampling and Repeat w/ Random Walk

Previous Works - Heuristic Approach
Greedy Hill-Climbing Algorithm
1. Start with an empty graph
2. Visit nodes in random order
3. Calculate local scores for all possible candidate parents
4. Choose best operation that improves score (add, delete, reverse)
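The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited work: `local_score(child, parent_set)` is a hypothetical caller-supplied scoring function, the loop repeats passes until no operation improves the score, and acyclicity is checked with a plain ancestor search.

```python
import random


def would_create_cycle(parents, child, new_parent):
    """True if adding the edge new_parent -> child would close a directed cycle."""
    stack, seen = [new_parent], set()
    while stack:  # walk the ancestors of new_parent
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def hill_climb(nodes, local_score, max_passes=50, seed=0):
    rng = random.Random(seed)
    parents = {v: set() for v in nodes}  # step 1: empty graph
    for _ in range(max_passes):
        changed = False
        order = list(nodes)
        rng.shuffle(order)  # step 2: random visiting order
        for child in order:
            base = local_score(child, frozenset(parents[child]))
            best_gain, best_op = 0, None
            for cand in nodes:  # step 3: score every candidate operation
                if cand == child:
                    continue
                if cand in parents[child]:
                    without = frozenset(parents[child] - {cand})
                    del_gain = local_score(child, without) - base
                    if del_gain > best_gain:
                        best_gain, best_op = del_gain, ("delete", cand)
                    # Reverse cand -> child into child -> cand if it stays acyclic.
                    parents[child].discard(cand)
                    acyclic = not would_create_cycle(parents, cand, child)
                    parents[child].add(cand)
                    if acyclic:
                        old = local_score(cand, frozenset(parents[cand]))
                        new = local_score(cand, frozenset(parents[cand] | {child}))
                        rev_gain = del_gain + new - old
                        if rev_gain > best_gain:
                            best_gain, best_op = rev_gain, ("reverse", cand)
                elif not would_create_cycle(parents, child, cand):
                    with_cand = frozenset(parents[child] | {cand})
                    add_gain = local_score(child, with_cand) - base
                    if add_gain > best_gain:
                        best_gain, best_op = add_gain, ("add", cand)
            if best_op is not None:  # step 4: apply the best improving operation
                op, cand = best_op
                if op == "add":
                    parents[child].add(cand)
                else:
                    parents[child].discard(cand)
                    if op == "reverse":
                        parents[cand].add(child)
                changed = True
        if not changed:
            break
    return parents
```

Only operations with a strictly positive gain are applied, so the search stops at a local optimum; different seeds can end at different networks.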

Previous Works - Heuristic Approach
Random Sampling and Repeat (RSR)
1. Estimate randomly selected subnetworks repeatedly (can be done independently)
   a. Subnetwork estimated using hill-climbing algorithm
2. Final network can be chosen based on most frequently selected edges
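A rough sketch of the sampling-and-voting loop is shown below. It assumes a hypothetical `estimate_subnetwork(subset)` callable that returns a collection of directed (parent, child) edges for the sampled genes (for example, a hill-climbing run), and it defines an edge's frequency as the fraction of iterations containing both endpoints in which the edge was selected. The threshold rule is an assumption; the slide only says that frequently selected edges are kept.

```python
import random
from collections import Counter
from itertools import permutations


def random_sampling_and_repeat(genes, estimate_subnetwork, subset_size,
                               n_iter, threshold=0.5, seed=0):
    rng = random.Random(seed)
    selected = Counter()  # times each directed edge was selected
    eligible = Counter()  # times both endpoints were sampled together
    for _ in range(n_iter):
        subset = rng.sample(genes, subset_size)
        # Each iteration is independent, so this loop could be distributed.
        for edge in permutations(subset, 2):
            eligible[edge] += 1
        selected.update(set(estimate_subnetwork(subset)))
    return {edge for edge, count in selected.items()
            if count / eligible[edge] >= threshold}
```

Edges are voted on independently here, so the combined edge set is not guaranteed to be acyclic and needs a separate check.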

Previous Works - Heuristic Approach
Neighbor Node Sampling and Repeat (NNSR)
1. Estimate initial network using RSR
2. Run NNSR and select subnetwork by choosing nodes close to each other on the initial network
   a. Randomly pick genes with probabilities proportional to the network score
   b. Perform subnetwork estimation using HC
3. Final network can be chosen based on most frequently selected edges
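One way to picture the neighbor selection in step 2 is a random walk over the initial network. The sketch below collects genes that are close to a start gene by walking the undirected skeleton of the initial network with uniform transition probabilities. This is a simplification: the score-proportional sampling mentioned in step 2a is not modeled, and the function name and parameters are assumptions.

```python
import random


def sample_neighborhood(edges, start, size, seed=0, max_steps=10_000):
    """Collect up to `size` genes near `start` with an unweighted random walk."""
    rng = random.Random(seed)
    adjacency = {}
    for u, v in edges:  # ignore edge direction when measuring closeness
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    chosen = {start}
    current = start
    for _ in range(max_steps):
        if len(chosen) >= size:
            break
        neighbors = sorted(adjacency.get(current, ()))
        if not neighbors:  # isolated gene: nothing to walk to
            break
        current = rng.choice(neighbors)
        chosen.add(current)
    return chosen
```

The returned gene set would then be passed to a subnetwork estimator such as hill climbing, and the edge voting proceeds as in RSR.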

Parallelization
Source: Tamada, 2013

Current Work
• Higher resolution parallelization
  ◦ speed up the bottleneck of HC instances running in each subprocess
    ▪ must keep track of and synchronize edge changes and dependencies
  ◦ utilize Xeon Phi, etc.
• Guaranteeing no cycles (DAG construction)
  ◦ parallel topological sort
    ▪ bottom-up parent search BFS (Beamer et al., 2011)

Current Work
• Calculate parent sets for each node in subnetwork in parallel
• Must guarantee that there are no cycles in subnetworks at this stage
[Figure: three parallel processes] Gather most frequently selected edges and check that there are no cycles
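The final cycle check on the gathered edges can be illustrated with a sequential topological sort (Kahn's algorithm). This is a single-threaded sketch for clarity, not the parallel topological sort mentioned in the slides, and the function name is an assumption.

```python
from collections import defaultdict, deque


def topological_order(nodes, edges):
    """Return a topological order of the graph, or None if it contains a cycle.

    edges: iterable of (parent, child) pairs.
    """
    indegree = {v: 0 for v in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach indegree 0, so they are missing from `order`.
    return order if len(order) == len(nodes) else None
```

If the function returns None, the gathered edge set is not a DAG and some edges must be dropped or re-estimated before it can serve as the final network.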

Future Work
• More efficient or accurate heuristic than HC?
• This method finds many locally optimal networks and combines them. How does the result compare to the globally optimal network?
• (Parallelize a method that finds the optimal Bayesian network structure)
  ◦ Utilize memory more efficiently to avoid communication

Questions?
