Seminar A Presentation

Matthew Barga
September 26, 2013

Update on recent research activity

Transcript

  1. Biological Background

    • Recent increase in the amount of biological data available
    • Rise of quantitative applications means increased use of computational analysis
    • Datasets from DNA microarray measurements are one example
  2. DNA Microarray Experiments

    • Map a microarray measurement to a computer dataset
    • Can discretize gene expression into levels {0, 1, 2} or use continuous analysis
    [Figure: microarray image with legend: gene expressed equally in control and tumor samples; gene expressed more in tumor sample; gene expressed more in control sample]
    Source: http://www.genome.gov/10000533
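As a rough illustration of the discretization step, the sketch below maps a tumor/control intensity pair to one of three levels using a log2-ratio threshold. The function name, the threshold of 1.0, and the assignment of the labels 0, 1, and 2 to the three cases are all assumptions made for this example; the slide does not specify them.

```python
import math

def discretize(tumor, control, threshold=1.0):
    """Map a pair of spot intensities to a level in {0, 1, 2}.

    Assumed (hypothetical) encoding:
      0 = expressed about equally in both samples
      1 = expressed more in the tumor sample
      2 = expressed more in the control sample
    """
    log_ratio = math.log2(tumor / control)
    if log_ratio >= threshold:
        return 1
    if log_ratio <= -threshold:
        return 2
    return 0

# Three made-up spots: equal, tumor-high, control-high.
levels = [discretize(t, c) for t, c in [(100.0, 100.0), (400.0, 100.0), (50.0, 200.0)]]
print(levels)  # [0, 1, 2]
```

A continuous analysis would keep `log_ratio` itself instead of thresholding it.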
  3. DNA Microarray Experiments

    Many traditional analysis methods, including hierarchical clustering, classification, discovery of tumor subtypes, etc.
    Source: http://www.uib.no
  4. Bayesian Network Approach

    Reasons for adopting a Bayesian network approach:
    • Probabilistic method that can capture causal relationships
    • Can effectively deal with noisy data
    • Allows integration of prior knowledge
  5. Bayesian Network Approach

    Simply defined, a Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) (Wikipedia).
    [Figure: example network with nodes Sprinkler, Rain, and Wet Grass, annotated with probability tables: P(rain = T), P(rain = F); P(sprinkler | rain = T), P(sprinkler | rain = F), ...; P(wet grass | r = T, s = F), P(wet grass | r = T, s = T), ...]
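The sprinkler example factorizes as P(rain) · P(sprinkler | rain) · P(wet grass | rain, sprinkler). The sketch below evaluates that factorization; the numeric probabilities are made up for illustration, since the slide's table values are not in the transcript.

```python
from itertools import product

# Hypothetical probability tables (values are illustrative only).
p_rain = {True: 0.2, False: 0.8}
# p_sprinkler[rain][sprinkler] = P(sprinkler | rain)
p_sprinkler = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# p_wet[(rain, sprinkler)] = P(wet grass = T | rain, sprinkler)
p_wet = {(True, True): 0.99, (True, False): 0.8,
         (False, True): 0.9, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """Joint probability from the DAG factorization."""
    pw = p_wet[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[rain][sprinkler] * (pw if wet else 1.0 - pw)

# The joint distribution sums to 1 over all eight assignments.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))

# Marginal P(wet grass = T), obtained by summing out rain and sprinkler.
p_wet_true = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
```

Each node only needs a table conditioned on its parents, which is what makes the local probability models compact.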
  6. Bayesian Network Approach

    Two parts of a network:
    1. dependency structure (structured as DAG)
    2. local probability models
    Source: http://www.cs.huji.ac.il/labs/compbio/expression/html_ver/sld004.htm
  7. Problem Definition

    • Find the dependency structure that best explains the given data
    • Given a set of nodes and their expression levels, estimate a network by selecting the best set of parents for each node
  8. Problem Definition

    Search and Score Method
    • Given an initial graph (can be empty), search candidate parents for each node, score the subgraph of each {child, parent set} candidate, and choose the set that maximizes the score of the network
    • The score measures the probability that the given data was generated by a graph
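The slide does not say which scoring function is used. As one possible stand-in, the sketch below uses a BIC-style score for discretized data, which decomposes into one local score per {child, parent set} pair. The function names, the row-of-tuples data layout, and the default of three expression levels are assumptions for this example.

```python
import math
from collections import Counter

def local_bic(data, child, parents, levels=3):
    """BIC-style local score of `child` given `parents` (column indices).

    `data` is a list of rows; each row is a tuple of discrete levels.
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximized log-likelihood of the child given each parent configuration.
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    # Free parameters: (levels - 1) per parent configuration.
    n_params = (levels - 1) * levels ** len(parents)
    return loglik - 0.5 * n_params * math.log(n)

def network_score(data, parent_sets):
    """Network score as the sum of local scores; `parent_sets` maps child -> parents."""
    return sum(local_bic(data, child, parents) for child, parents in parent_sets.items())

# Toy data: column 1 always copies column 0.
data = [(0, 0), (1, 1), (2, 2)] * 20
with_parent = local_bic(data, 1, (0,))
without_parent = local_bic(data, 1, ())
```

On this toy data `with_parent` is higher than `without_parent`, so a search would prefer the edge 0 → 1. Because the score is a sum over nodes, changing one node's parent set only requires recomputing that node's local score.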
  9. Difficulties

    • Number of possible networks is extremely large
      ◦ Current parallel large-scale methods can only find the optimal network for ~30 genes
    • Dynamic programming and caching can be used to avoid some redundant score calculations
    • Heuristic methods must be employed to evaluate large-scale networks
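To give a sense of how quickly the search space grows, the number of labeled DAGs on n nodes can be computed with Robinson's recurrence. This is a standard combinatorial result rather than something taken from the slides, and the function name is illustrative.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in (3, 4, 5):
    print(n, count_dags(n))  # 3 25, 4 543, 5 29281
```

Calling `count_dags(30)` returns an exact integer far too large to enumerate, which is why exhaustive search is limited to small gene sets.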
  10. Previous Works - Heuristic Approach

    • Greedy Hill-Climbing Algorithm
    • Random Sampling and Repeat
    • Neighbor Node Sampling and Repeat w/ Random Walk
  11. Previous Works - Heuristic Approach

    Greedy Hill-Climbing Algorithm
    1. Start with an empty graph
    2. Visit nodes in random order
    3. Calculate local scores for all possible candidate parents
    4. Choose the best operation that improves the score (add, delete, reverse)
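The four steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: `hill_climb`, `has_path`, the `local_score(child, parent_set)` callback, and the toy score used to exercise it are all hypothetical.

```python
import random

def has_path(parents, src, dst):
    """True if a directed path src -> ... -> dst exists (edges run parent -> child)."""
    stack, seen = [dst], set()
    while stack:  # walk backwards from dst through its ancestors
        node = stack.pop()
        if node == src:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(nodes, local_score, max_sweeps=50, seed=0):
    rng = random.Random(seed)
    parents = {v: frozenset() for v in nodes}          # 1. start with an empty graph
    for _ in range(max_sweeps):
        changed = False
        order = list(nodes)
        rng.shuffle(order)                             # 2. visit nodes in random order
        for child in order:
            base = local_score(child, parents[child])
            best_gain, best_update = 0.0, None
            for other in nodes:                        # 3. score candidate operations
                if other == child:
                    continue
                if other in parents[child]:
                    removed = parents[child] - {other}
                    drop_gain = local_score(child, removed) - base
                    if drop_gain > best_gain:          # delete other -> child
                        best_gain, best_update = drop_gain, {child: removed}
                    without = {**parents, child: removed}
                    if not has_path(without, other, child):   # reverse keeps a DAG
                        flipped = parents[other] | {child}
                        gain = (drop_gain + local_score(other, flipped)
                                - local_score(other, parents[other]))
                        if gain > best_gain:
                            best_gain, best_update = gain, {child: removed, other: flipped}
                elif not has_path(parents, child, other):     # add keeps a DAG
                    added = parents[child] | {other}
                    gain = local_score(child, added) - base
                    if gain > best_gain:
                        best_gain, best_update = gain, {child: added}
            if best_update:                            # 4. apply the best improving operation
                parents.update(best_update)
                changed = True
        if not changed:
            break
    return parents

# Toy score that rewards matching a known structure (for demonstration only).
target = {"a": set(), "b": {"a"}, "c": {"a", "b"}}

def toy_score(child, parent_set):
    return -len(set(parent_set) ^ target[child])

result = hill_climb(["a", "b", "c"], toy_score)
```

With a real score the search can stop at a local optimum, which is the motivation for the sampling-based methods on the following slides.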
  12. Previous Works - Heuristic Approach

    Random Sampling and Repeat
    1. Estimate randomly selected subnetworks repeatedly (can be done independently)
      a. Subnetwork estimated using hill-climbing algorithm
    2. Final network can be chosen based on most frequently selected edges
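A minimal sketch of the sample-and-vote idea, assuming an `estimate_subnetwork(subset)` callback that returns a set of `(parent, child)` edges. In the method described above that callback would be a hill-climbing run; here a toy estimator stands in for it. All names and parameter values are illustrative.

```python
import random
from collections import Counter

def random_sampling_and_repeat(nodes, estimate_subnetwork, subset_size,
                               repeats, min_count, seed=0):
    """Estimate many random subnetworks and keep the frequently selected edges."""
    rng = random.Random(seed)
    edge_counts = Counter()
    for _ in range(repeats):              # each repeat is independent of the others
        subset = rng.sample(list(nodes), subset_size)
        edge_counts.update(estimate_subnetwork(subset))
    return {edge for edge, count in edge_counts.items() if count >= min_count}

# Toy estimator: reports a "true" edge whenever both endpoints were sampled.
true_edges = {("a", "b"), ("b", "c")}

def toy_estimator(subset):
    chosen = set(subset)
    return {(p, c) for p, c in true_edges if p in chosen and c in chosen}

final_edges = random_sampling_and_repeat(
    ["a", "b", "c", "d"], toy_estimator, subset_size=3, repeats=200, min_count=50)
```

Edges voted in this way are not guaranteed to form a DAG, so a separate cycle check is still needed on the combined network.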
  13. Previous Works - Heuristic Approach

    Neighbor Node Sampling and Repeat (NNSR)
    1. Estimate initial network using RSR (Random Sampling and Repeat)
    2. Run NNSR and select subnetwork by choosing nodes close to each other on the initial network
      a. Randomly pick genes with probabilities proportional to the network score
      b. Perform subnetwork estimation using HC (hill-climbing)
    3. Final network can be chosen based on most frequently selected edges
  14. Current Work

    • Higher-resolution parallelization
      ◦ speed up the bottleneck of HC instances running in each subprocess
        ▪ must keep track of and synchronize edge changes and dependencies
      ◦ utilize Xeon Phi, etc.
    • Guaranteeing no cycles (DAG construction)
      ◦ parallel topological sort
        ▪ bottom-up parent search BFS (Beamer et al., 2011)
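For reference, a sequential topological sort (Kahn's algorithm) is enough to detect cycles in a candidate network. The sketch below is this simpler sequential version, not the parallel or bottom-up BFS variant mentioned on the slide; the function name and the child-to-parents dictionary layout are assumptions.

```python
from collections import deque

def topological_order(parents):
    """Return a topological order of the graph, or None if it contains a cycle.

    `parents` maps every node to the set of its parents.
    """
    indegree = {v: len(ps) for v, ps in parents.items()}
    children = {v: [] for v in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    queue = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:   # all parents of c have been emitted
                queue.append(c)
    # Nodes left on a cycle never reach in-degree 0.
    return order if len(order) == len(parents) else None

dag_order = topological_order({"a": set(), "b": {"a"}, "c": {"a", "b"}})
cyclic_order = topological_order({"a": {"c"}, "b": {"a"}, "c": {"b"}})
```

Here `dag_order` is `["a", "b", "c"]` and `cyclic_order` is `None`.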
  15. Current Work

    • Calculate parent sets for each node in subnetwork in parallel
    • Must guarantee that there are no cycles in subnetworks at this stage
    [Figure: Process, Process, Process → Gather most frequently selected edges and check that there are no cycles]
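The sketch below illustrates the per-node parallel search followed by a cycle check. It uses Python threads as a stand-in for the separate processes in the figure, an exhaustive search over small parent sets in place of the actual search, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) for the check. All function names, the `max_parents` limit, and the toy score are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import CycleError, TopologicalSorter
from itertools import combinations

def best_parent_set(child, candidates, local_score, max_parents=2):
    """Exhaustively pick the best-scoring parent set of limited size."""
    options = [frozenset(c) for k in range(max_parents + 1)
               for c in combinations(candidates, k)]
    return max(options, key=lambda ps: local_score(child, ps))

def parallel_parent_search(nodes, local_score, workers=3):
    """Choose each node's parent set independently, one task per node."""
    def task(child):
        others = [v for v in nodes if v != child]
        return child, best_parent_set(child, others, local_score)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, nodes))

def is_acyclic(parents):
    """Check that the gathered child -> parents mapping is a DAG."""
    try:
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

# Toy score that rewards matching a known structure (for demonstration only).
target = {"a": frozenset(), "b": frozenset({"a"}), "c": frozenset({"a", "b"})}

def toy_score(child, parent_set):
    return -len(parent_set ^ target[child])

gathered = parallel_parent_search(["a", "b", "c"], toy_score)
ok = is_acyclic(gathered)
```

Because each node's parents are chosen independently, nothing in `parallel_parent_search` prevents a cycle; the explicit check after gathering is what enforces the DAG constraint in this sketch.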
  16. Future Work

    • More efficient or accurate heuristic than HC?
    • This method finds many locally optimal networks and combines them - how does the result compare to the global network?
    • (Parallelize method that finds optimal Bayesian network structure)
      ◦ Utilize memory more efficiently to avoid communication