Better Identification Of Repeats In Metagenomic Scaffolding

Better Identification Of Repeats In Metagenomic Scaffolding
Jay Ghurye and Mihai Pop
Department of Computer Science, University of Maryland
Workshop on Algorithms in Bioinformatics 2016, 8/23/16

Repeats, a big challenge!
• Case 1: Read length > repeat length: not a big problem
• Case 2: Read length ≅ repeat length: the assembly problem is NP-hard (Nagarajan and Pop, 2009)
• Case 3: Read length < repeat length: the number of assemblies is exponential in the number of repeats (Kingsford et al., 2010)

How to handle repeats?
• Identify high-coverage regions and remove them beforehand
• Use statistical methods to identify these regions (A-stat value)
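The A-stat value is not defined on the slide. A minimal sketch, assuming the usual Poisson log-odds form (unique contig versus two-copy repeat), which works out to A = length × arrival rate − reads × ln 2. The function name, the arrival-rate parameter, and the example numbers are hypothetical, and the cutoff used to call a contig a repeat is left to the reader.

```python
import math

def a_stat(contig_len, n_reads_in_contig, arrival_rate):
    """Log-odds (natural log) that a contig is unique rather than a two-copy repeat.

    arrival_rate is the expected number of read starts per base for a
    single-copy region (total reads / genome size); in a metagenome this
    has to be estimated, which is an assumption of this sketch.
    """
    expected_reads = contig_len * arrival_rate
    return expected_reads - n_reads_in_contig * math.log(2)

# Hypothetical numbers: a 10 kbp contig, 0.01 read starts per base expected.
unique_like = a_stat(10_000, 100, 0.01)   # read count matches single-copy expectation
repeat_like = a_stat(10_000, 200, 0.01)   # twice as many reads as expected
```

Under this form, a contig with roughly the expected number of reads scores positive, while one with about twice the expected reads scores negative.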

Repeats in metagenome
• High-coverage regions may simply indicate abundant organisms
• Repeats can be genome-sized because of closely related strains in the sample

Repeats in metagenome
Observation: Repeats tend to tangle the assembly graph

Betweenness Centrality
• Betweenness centrality of a node: the fraction of shortest paths from all vertices to all others that pass through the node
• Assume all edge weights = 1
• Let sp(x,y,z) = fraction of shortest paths between x and y passing through z
  sp(a,d,b) = 1/1 = 1
  sp(a,c,b) = 1/1 = 1
  sp(a,e,b) = 2/2 = 1
  sp(c,e,b) = 0
  sp(d,e,b) = 0
  sp(c,d,b) = 1/2 = 0.5
• Bc(b) = 1 + 1 + 1 + 0.5 = 3.5
• Usually normalized by the number of node pairs excluding b: Bc(b) = 3.5 / 6 = 0.583
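The graph drawn on this slide is not part of the transcript. The sketch below assumes the edge set a-b, b-c, b-d, c-e, d-e, which is one graph consistent with every sp value listed, and computes exact betweenness with a plain BFS-based Brandes accumulation. The edge list and function names are assumptions for illustration only.

```python
from collections import deque

# Assumed edge set; inferred from the sp(...) values, not taken from the slide figure.
EDGES = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def betweenness(adj):
    """Exact betweenness for an unweighted, undirected graph (Brandes-style)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            x = queue.popleft()
            order.append(x)
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                if dist[y] == dist[x] + 1:
                    sigma[y] += sigma[x]
                    preds[y].append(x)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in adj}
        for y in reversed(order):
            for x in preds[y]:
                delta[x] += sigma[x] / sigma[y] * (1 + delta[y])
            if y != s:
                bc[y] += delta[y]
    # Every unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in bc.items()}

bc = betweenness(build_adjacency(EDGES))
n = len(bc)
pairs_excluding_node = (n - 1) * (n - 2) // 2   # 6 for five nodes
bc_b_normalized = bc["b"] / pairs_excluding_node
```

On the assumed graph this gives the raw value 3.5 for node b and 3.5 / 6 after normalization, matching the slide.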


Betweenness centrality
• Exact algorithm takes O(|m||n|) time, where m is the number of nodes and n is the number of edges (Brandes)
• Sampling algorithm for speedups
• Parallelized algorithm that samples shortest paths in the graph (Riondato, 2014)
• Each sample runs in O(|m| + |n|) on unweighted graphs and in O(|n| + |m| log(|m|)) on weighted graphs
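A minimal sketch of the path-sampling idea, not the authors' parallel implementation: draw a random node pair, pick one shortest path between them uniformly at random, and credit 1/r to each interior node. The small graph, the sample count r, and the function name are assumptions. The estimate is normalized over all ordered node pairs, so it is on a different scale from a value normalized only by pairs that exclude the node.

```python
import random
from collections import deque

def sample_betweenness(adj, r, seed=0):
    """Estimate betweenness by sampling r shortest paths (unweighted graph)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    est = {v: 0.0 for v in nodes}
    for _ in range(r):
        u, v = rng.sample(nodes, 2)
        # BFS from u: distances, shortest-path counts, predecessors.
        dist, sigma, preds = {u: 0}, {u: 1}, {u: []}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    sigma[y] = 0
                    preds[y] = []
                    queue.append(y)
                if dist[y] == dist[x] + 1:
                    sigma[y] += sigma[x]
                    preds[y].append(x)
        if v not in dist:
            continue  # u and v are disconnected
        # Walk back from v; weighting by sigma makes the path uniform
        # among all shortest u-v paths.
        cur = v
        while cur != u:
            choices = preds[cur]
            cur = rng.choices(choices, weights=[sigma[p] for p in choices])[0]
            if cur != u:
                est[cur] += 1.0 / r
    return est

# Hypothetical five-node graph used only to exercise the sketch.
ADJ = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "e"],
    "d": ["b", "e"],
    "e": ["c", "d"],
}
estimate = sample_betweenness(ADJ, r=20000, seed=1)
```

On this graph the exact value for node b under the all-ordered-pairs normalization is 7/20 = 0.35, so the estimate for b should land close to that, while the leaf node a is never interior to a path and stays at 0.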

Scaffolding Overview
• Mate pairs mapped to contigs (C1, C2)
• Link bundling (Huson et al.)
• Final scaffold graph
• Order and orient scaffold graph (NP-hard) (Kececioglu and Myers)
• Scaffolds / complete genomes
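The link bundling step can be illustrated with a simplified stand-in: group mate-pair links by contig pair and implied orientation, and keep groups with enough support. This is not the procedure of Huson et al., which also checks that the implied distances agree. The tuple layout, the `min_support` parameter, and the use of the median gap are assumptions.

```python
from collections import defaultdict
from statistics import median

def bundle_links(links, min_support=2):
    """Group mate-pair links into bundles keyed by (contig1, contig2, orientation).

    links: iterable of (contig1, contig2, orientation, gap_estimate), assumed
    to be reported with a consistent contig order for each pair.
    Returns {key: (number_of_links, median_gap)} for well-supported bundles.
    """
    groups = defaultdict(list)
    for c1, c2, orientation, gap in links:
        groups[(c1, c2, orientation)].append(gap)
    return {
        key: (len(gaps), median(gaps))
        for key, gaps in groups.items()
        if len(gaps) >= min_support
    }

# Hypothetical links: three agree on C1-C2, one lone link to C3 is dropped.
links = [
    ("C1", "C2", "same", 100),
    ("C1", "C2", "same", 120),
    ("C1", "C2", "same", 110),
    ("C1", "C3", "same", 500),
]
bundles = bundle_links(links)
```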

Repeat detection with centrality
• First find approximate betweenness centrality values for all the contigs
• Let μ be the mean and σ be the standard deviation of the centrality values
• Mark contigs with centrality value greater than μ + 3σ as repeats
• Same as Bambus2 but much faster centrality detection
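The thresholding rule above can be sketched in a few lines. The slide does not say whether σ is the population or sample standard deviation; this sketch assumes the population form, and the contig names and centrality values are made up for illustration.

```python
from statistics import mean, pstdev

def mark_repeats(centrality, k=3.0):
    """Return the contigs whose centrality exceeds mean + k * standard deviation."""
    values = list(centrality.values())
    threshold = mean(values) + k * pstdev(values)  # population std dev (assumption)
    return {contig for contig, value in centrality.items() if value > threshold}

# Hypothetical centrality values: 19 ordinary contigs and one hub.
centrality = {f"contig_{i}": 0.01 for i in range(19)}
centrality["contig_hub"] = 0.5
repeats = mark_repeats(centrality)
```

Note that with very few contigs no value can exceed μ + 3σ, so the rule only makes sense on graphs with many nodes.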

Expanded feature set
• Contig length
• Contig coverage
• Degree
• Betweenness centrality
• # skewed links
• # invalidated links during orientation
• Mapping of contigs to reference genomes
• Random Forest classifier

Skewed Links
• Calculate per-base coverage for all the nodes
• For each edge, check the coverage of its two nodes for a statistically significant difference
• Use the Kolmogorov-Smirnov test
• Calculate the number of spurious edges for each node
[Figure: nodes A, B, C; panels "Non-uniform coverage profile" and "Uniform coverage profile"]
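A minimal sketch of the skewed-link count, assuming an edge is counted as skewed for both of its endpoints when a two-sample Kolmogorov-Smirnov test rejects equality of their per-base coverage distributions. The large-sample critical value is used in place of an exact p-value, and the significance level, function names, and coverage numbers are all assumptions.

```python
import math

def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def coverage_differs(cov_u, cov_v, alpha=0.05):
    """Reject equality if D exceeds the large-sample critical value."""
    n, m = len(cov_u), len(cov_v)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(cov_u, cov_v) > critical

def count_skewed_links(coverage, edges, alpha=0.05):
    """Count, for each node, the incident edges whose endpoints differ in coverage."""
    skewed = {node: 0 for node in coverage}
    for u, v in edges:
        if coverage_differs(coverage[u], coverage[v], alpha):
            skewed[u] += 1
            skewed[v] += 1
    return skewed

# Hypothetical per-base coverage: A and B look alike, C is about 3x deeper.
coverage = {
    "A": [10, 11, 9, 10, 12] * 10,
    "B": [10, 11, 9, 10, 12] * 10,
    "C": [30, 31, 29, 30, 32] * 10,
}
skewed = count_skewed_links(coverage, [("A", "B"), ("B", "C")])
```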

Number of invalidated links
• Use greedy approximation algorithm
• Orient the graph without removing repeats
• For each node, find how many links were removed while assigning an orientation to it
[Figure: "Unoriented Graph" and "Oriented Graph"]
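The slide does not spell out the greedy algorithm, so the following is only one plausible sketch: visit contigs in breadth-first order, give each contig the orientation most of its already-oriented neighbors imply, and count the links that disagree as invalidated for that contig. The link encoding (a boolean for "same orientation"), the tie-breaking rule, and the function name are assumptions.

```python
from collections import deque

def greedy_orient(links):
    """Greedily orient contigs and count invalidated links per contig.

    links: list of (u, v, same) where same=True means the link implies
    that u and v have the same orientation, False the opposite.
    Returns (orientation, invalidated) dictionaries keyed by contig.
    """
    adj = {}
    for u, v, same in links:
        adj.setdefault(u, []).append((v, same))
        adj.setdefault(v, []).append((u, same))

    orient = {}
    invalidated = {node: 0 for node in adj}
    seen = set()
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            x = queue.popleft()
            # Orientation each already-oriented neighbor implies for x.
            implied = [orient[y] if same else not orient[y]
                       for y, same in adj[x] if y in orient]
            forward = implied.count(True)
            reverse = len(implied) - forward
            orient[x] = forward >= reverse        # ties go to forward
            invalidated[x] = min(forward, reverse)  # links that lost the vote
            for y, _ in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
    return orient, invalidated

# Hypothetical conflicting triangle: A-B and A-C agree, B-C disagrees.
links = [("A", "B", True), ("A", "C", True), ("B", "C", False)]
orientation, invalidated = greedy_orient(links)
```

In this toy input the contradictory link is charged to C, the last contig to be oriented; a repeat contig with links into several genomic contexts would be expected to accumulate many such conflicts.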

Dataset and assembly
• Synthetic metagenomic dataset derived from 83 organisms with known genomes (Shakya et al.)
• Used IDBA-UD with default parameters for assembly, resulting in 47,767 contigs
• For training the classifier, used simulated data for a set of 40 genomes
• Trained a random forest classifier on the features obtained from the simulated data

Accuracy of the extended feature set

Important parameters for repeat prediction

Effect on contig orientation
• How does repeat removal improve the scaffolding process?
• Removed repeats from the scaffold graph and oriented it
• Evaluated using two metrics:
  • % correct = # links with right orientation / # links in the original graph
  • % wrong = # links with wrong orientation / # links in the repeat-removed graph
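The two metrics use different denominators, which is easy to miss. A short sketch with made-up link counts (none of the numbers below come from the experiments) makes the arithmetic explicit; the function and parameter names are hypothetical.

```python
def orientation_metrics(n_correct, n_wrong, n_links_original, n_links_repeat_removed):
    """Return (% correct, % wrong) as defined on the slide.

    % correct is relative to the links in the original graph;
    % wrong is relative to the links left after repeat removal.
    """
    pct_correct = 100.0 * n_correct / n_links_original
    pct_wrong = 100.0 * n_wrong / n_links_repeat_removed
    return pct_correct, pct_wrong

# Hypothetical counts: 100 links originally, 50 left after removing repeats.
pct_correct, pct_wrong = orientation_metrics(40, 5, 100, 50)
```

Because % correct is measured against the original graph, removing too many contigs as repeats lowers it even if every remaining link is oriented correctly.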

Effect on contig orientation

Method                              Correct  Wrong  % Correct  % Wrong
Random Forest                       12255    807    39.45      3.52
Bambus2                             12042    867    38.77      4.11
Approximate Betweenness Centrality  12336    917    39.71      3.94
Coverage (MIP, SOPRA)               3840     315    17.49      4.72
Coverage (Opera)                    2007     165    6.46       5.62

Conclusion
• Developed a novel approach using Random Forest with an extended feature set
• Used approximate betweenness centrality for speedups
• Achieved better accuracy and efficiency than previous methods
• Plan to incorporate repeat detection into the MetAMOS pipeline

Thank you
