Better Identification Of Repeats In Metagenomic Scaffolding

ghuryejay
August 23, 2016

This is the talk I gave at the WABI meeting in Aarhus, Denmark.

Transcript

  1. Better Identification Of Repeats In Metagenomic Scaffolding
    Jay Ghurye and Mihai Pop
    Department of Computer Science, University of Maryland
    Workshop on Algorithms in Bioinformatics (WABI) 2016, 8/23/16
  2. Repeats, a big challenge!
    Case 1: Read length > repeat length: not a big problem.
    Case 2: Read length ≅ repeat length: the assembly problem is NP-hard (Nagarajan and Pop, 2009).
    Case 3: Read length < repeat length: the number of assemblies is exponential in the number of repeats (Kingsford et al., 2010).
  3. How to handle repeats?
    • Identify high-coverage regions and remove them beforehand
    • Use statistical methods to identify these regions (A-stat value)
  4. Repeats in metagenome
    • High-coverage regions -> abundant organisms, not only repeats
    • Repeats can be genome-sized because of closely related strains in the sample
  5. Betweenness Centrality
    Betweenness centrality of a node: fraction of shortest paths from all vertices to all others that pass through the node.
    Assume all edge weights = 1.
    Let sp(x,y,z) = fraction of shortest paths between x and y passing through z.
      sp(a,d,b) = 1/1 = 1
      sp(a,c,b) = 1/1 = 1
      sp(a,e,b) = 2/2 = 1
      sp(c,e,b) = 0
      sp(d,e,b) = 0
      sp(c,d,b) = 1/2 = 0.5
    Bc(b) = 1 + 1 + 1 + 0.5 = 3.5
    Usually normalized by the number of node pairs excluding b: Bc(b) = 3.5 / 6 = 0.583
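
The arithmetic on this slide can be checked with a short script. The slide's figure is not part of the transcript, so the edge list below is an assumption: it is one five-node graph that reproduces every sp(...) value listed above. The function names are illustrative only.

```python
from collections import deque
from itertools import combinations

# Assumed toy graph (the original figure is not in the transcript); it was
# chosen because it reproduces the sp(...) values listed on the slide.
EDGES = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")]


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_counts(adj, source):
    """Return (dist, sigma): hop distance and number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def sp(adj, x, y, z):
    """Fraction of shortest x-y paths that pass through z (unit edge weights)."""
    dist_x, sigma_x = bfs_counts(adj, x)
    dist_z, sigma_z = bfs_counts(adj, z)
    if y not in dist_x or z not in dist_x or y not in dist_z:
        return 0.0
    if dist_x[z] + dist_z[y] != dist_x[y]:
        return 0.0  # z is not on any shortest x-y path
    return sigma_x[z] * sigma_z[y] / sigma_x[y]


def betweenness(adj, z):
    """Return (raw, normalized) betweenness of z over unordered pairs excluding z."""
    others = sorted(n for n in adj if n != z)
    pairs = list(combinations(others, 2))
    raw = sum(sp(adj, x, y, z) for x, y in pairs)
    return raw, raw / len(pairs)


adj = build_adjacency(EDGES)
raw_b, norm_b = betweenness(adj, "b")
print(raw_b, round(norm_b, 3))  # prints: 3.5 0.583
```

This brute-force version recomputes a breadth-first search for every pair, which is fine for a five-node example but is not how centrality would be computed on a scaffold graph with tens of thousands of contigs.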
  6. Betweenness centrality
    • Exact algorithm takes O(|m||n|) time, where m is the number of nodes and n is the number of edges (Brandes)
    • Sampling algorithm for speedups
    • Parallelized algorithm by sampling shortest paths in the graph (Riondato, 2014)
    • Runs in O(|m| + |n|) for unweighted graphs and in O(|n| + |m|log(|m|)) on weighted graphs
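
The idea of estimating centrality by sampling shortest paths can be sketched as follows. This is a simplified illustration, not the parallel implementation referenced on the slide: it omits the sample-size bound, handles only unweighted graphs, and the toy adjacency map and function names are assumptions. Each sample picks a random ordered node pair, draws one of its shortest paths uniformly at random, and credits every interior node with 1/num_samples.

```python
import random
from collections import deque

# Assumed toy graph, used only for illustration.
ADJ = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "e"},
    "d": {"b", "e"},
    "e": {"c", "d"},
}


def sample_shortest_path(adj, s, t, rng):
    """Return one shortest s-t path chosen uniformly at random, or None."""
    dist = {s: 0}
    sigma = {s: 1}      # number of shortest paths from s
    preds = {s: []}     # predecessors on shortest paths
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    if t not in dist:
        return None
    # Walk back from t, picking each predecessor in proportion to its path
    # count, which makes every shortest path equally likely.
    path = [t]
    while path[-1] != s:
        candidates = preds[path[-1]]
        weights = [sigma[p] for p in candidates]
        path.append(rng.choices(candidates, weights=weights, k=1)[0])
    return path[::-1]


def approx_betweenness(adj, num_samples, seed=0):
    """Estimate betweenness normalized over ordered node pairs."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    estimate = {v: 0.0 for v in nodes}
    for _ in range(num_samples):
        s, t = rng.sample(nodes, 2)
        path = sample_shortest_path(adj, s, t, rng)
        if path is None:
            continue
        for v in path[1:-1]:  # interior nodes only
            estimate[v] += 1.0 / num_samples
    return estimate


estimates = approx_betweenness(ADJ, num_samples=20000, seed=1)
print({node: round(value, 3) for node, value in estimates.items()})
```

Note that this estimator normalizes over all ordered node pairs, including those with the node itself as an endpoint, so its values are on a different scale from a normalization over pairs that exclude the node. Only the ranking of nodes matters for outlier-based repeat detection.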
  7. Scaffolding Overview
    Mate pairs mapped to contigs (C1, C2)
    -> Link bundling (Huson et al.)
    -> Final scaffold graph
    -> Order and orient scaffold graph (NP-hard) (Kececioglu and Myers)
    -> Scaffolds / complete genomes
  8. Repeat detection with centrality
    • First find approximate betweenness centrality values for all the contigs
    • Let μ be the mean and σ be the standard deviation of the centrality values
    • Mark contigs with centrality value greater than μ + 3σ as repeats
    • Same criterion as Bambus2, but with much faster centrality computation
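
The thresholding rule on this slide is easy to express in code. The sketch below assumes the centrality values are already available as a dictionary keyed by contig name. The contig names and values are made up for illustration, and the use of the population standard deviation is an assumption, since the slide does not say which estimator is used.

```python
from statistics import mean, pstdev


def flag_repeats(centrality, num_sigmas=3.0):
    """Return the contigs whose centrality exceeds mean + num_sigmas * stdev."""
    values = list(centrality.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    threshold = mu + num_sigmas * sigma
    return {contig for contig, value in centrality.items() if value > threshold}


# Hypothetical centrality values: 30 ordinary contigs and one outlier.
toy_centrality = {f"contig_{i}": 0.01 for i in range(30)}
toy_centrality["contig_repeat"] = 0.9

print(flag_repeats(toy_centrality))
```

With very few contigs no value can lie three standard deviations above the mean, so a rule like this only makes sense on graphs with many nodes.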
  9. Expanded feature set
    Features used by the Random Forest classifier:
    • Contig length
    • Contig coverage
    • Degree
    • Betweenness centrality
    • # skewed links
    • # invalidated links during orientation
    Mapping of contigs to reference genomes
  10. Skewed Links
    • Calculate per-base coverage for all the nodes
    • For each edge, check the coverage of its two nodes for a statistically significant difference
    • Use the Kolmogorov-Smirnov test
    • Calculate the number of spurious edges for each node
    Figure labels: nodes A, B, C; non-uniform coverage profile vs. uniform coverage profile
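
A minimal sketch of the skewed-link count, under several assumptions: per-base coverage is given as a list of numbers per contig, the two-sample Kolmogorov-Smirnov statistic is compared against the standard large-sample critical value at the 5% level (coefficient 1.358), and the coverage values and helper names are hypothetical. The slide does not state the significance level or the exact procedure the authors used.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def is_skewed(cov_u, cov_v, c_alpha=1.358):
    """True if the coverage profiles differ at roughly the 5% level."""
    n, m = len(cov_u), len(cov_v)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(cov_u, cov_v) > critical


def count_skewed_links(coverage, edges):
    """Count, for each node, the incident edges whose endpoints differ in coverage."""
    counts = {node: 0 for node in coverage}
    for u, v in edges:
        if is_skewed(coverage[u], coverage[v]):
            counts[u] += 1
            counts[v] += 1
    return counts


# Hypothetical per-base coverage: A and B look alike, C is about 3x deeper.
coverage = {
    "A": [10, 11, 9, 10, 12, 10, 11, 9] * 5,
    "B": [10, 9, 11, 10, 10, 12, 9, 11] * 5,
    "C": [30, 31, 29, 30, 32, 30, 31, 29] * 5,
}
edges = [("A", "B"), ("B", "C")]
print(count_skewed_links(coverage, edges))  # prints: {'A': 0, 'B': 1, 'C': 1}
```

Per-base coverage values at neighbouring positions are correlated, so the critical value here is only a rough guide. A real pipeline would likely need a more careful treatment.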
  11. Number of invalidated links
    Figure: unoriented graph vs. oriented graph
    • Use a greedy approximation algorithm
    • Orient the graph without removing repeats
    • For each node, find how many links were removed while assigning an orientation to it
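
One way to picture this feature is with a small greedy orientation routine. The slide does not specify the greedy algorithm, so the following is an assumed stand-in rather than the authors' method: links are processed from heaviest to lightest, a union-find structure with parity bits tracks relative orientation, and any link that contradicts the orientation already fixed is counted as invalidated for both of its endpoints. The link tuples and weights are hypothetical.

```python
def orient_and_count_invalidated(nodes, links):
    """Greedily orient nodes and count contradicted links per node.

    links: iterable of (u, v, same_orientation, weight) tuples.
    Returns (orientation, invalidated), where orientation[n] is 0 or 1
    relative to the root of n's connected component.
    """
    parent = {n: n for n in nodes}
    flip = {n: 0 for n in nodes}  # parity of n relative to parent[n]

    def find(n):
        """Return (root, parity of n relative to root), compressing the path."""
        if parent[n] == n:
            return n, 0
        root, parent_parity = find(parent[n])
        parent[n] = root
        flip[n] ^= parent_parity
        return root, flip[n]

    invalidated = {n: 0 for n in nodes}
    for u, v, same, _weight in sorted(links, key=lambda link: -link[3]):
        root_u, parity_u = find(u)
        root_v, parity_v = find(v)
        required = 0 if same else 1
        if root_u != root_v:
            # Merge components so that this link is satisfied.
            parent[root_u] = root_v
            flip[root_u] = parity_u ^ parity_v ^ required
        elif (parity_u ^ parity_v) != required:
            # Link contradicts the orientation chosen so far.
            invalidated[u] += 1
            invalidated[v] += 1

    orientation = {n: find(n)[1] for n in nodes}
    return orientation, invalidated


# Hypothetical scaffold links; node R behaves like a repeat.
nodes = ["A", "B", "C", "R"]
links = [
    ("A", "B", True, 5),
    ("B", "C", True, 4),
    ("A", "R", True, 3),
    ("R", "C", False, 2),
    ("R", "B", False, 1),
]
orientation, invalidated = orient_and_count_invalidated(nodes, links)
print(invalidated)  # prints: {'A': 0, 'B': 1, 'C': 1, 'R': 2}
```

In this toy case the node with conflicting links accumulates the highest count, which is the signal the feature is meant to capture.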
  12. Dataset and assembly
    • Synthetic metagenomic dataset derived from 83 organisms with known genomes (Shakya et al.)
    • Used IDBA-UD with default parameters for assembly, resulting in 47,767 contigs
    • For training the classifier, used simulated data for a set of 40 genomes
    • Trained a random forest classifier on the features obtained from the simulated data
  13. Effect on contig orientation
    • How does repeat removal improve the scaffolding process?
    • Removed repeats from the scaffold graph and oriented it
    • Evaluated using two metrics:
      • % correct = # links with right orientation / # links in the original graph
      • % wrong = # links with wrong orientation / # links in the repeat-removed graph
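
The two metrics use different denominators, which is easy to miss. A tiny helper makes the distinction explicit. The counts in the example call are hypothetical round numbers and are not taken from the talk.

```python
def orientation_metrics(num_correct, num_wrong, links_original, links_after_removal):
    """Return (% correct, % wrong) as defined on the slide.

    % correct is relative to the ORIGINAL graph, so removing too many links
    lowers it; % wrong is relative to the REPEAT-REMOVED graph.
    """
    pct_correct = 100.0 * num_correct / links_original
    pct_wrong = 100.0 * num_wrong / links_after_removal
    return pct_correct, pct_wrong


# Hypothetical counts for illustration only.
pct_correct, pct_wrong = orientation_metrics(
    num_correct=80, num_wrong=5, links_original=200, links_after_removal=100
)
print(pct_correct, pct_wrong)  # prints: 40.0 5.0
```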
  14. Effect on contig orientation
    Method                              Correct  Wrong  % Correct  % Wrong
    Random Forest                         12255    807      39.45     3.52
    Bambus2                               12042    867      38.77     4.11
    Approximate Betweenness Centrality    12336    917      39.71     3.94
    Coverage (MIP, SOPRA)                  3840    315      17.49     4.72
    Coverage (Opera)                       2007    165       6.46     5.62
  15. Conclusion
    • Developed a novel approach using Random Forest with an extended feature set
    • Used approximate betweenness centrality for speedups
    • Achieved better accuracy and efficiency than previous methods
    • Plan to incorporate repeat detection in the MetAMOS pipeline