Vivek Krishnakumar
June 10, 2009
140

# M.S Bioinformatics Capstone Presentation - Indiana University - Jun 2009

Dissertation titled "Developing a computational method to help solve the sequencing gap closure problem", as part of research work conducted at the Center for Genomics and Bioinformatics (CGB) at Indiana University, Bloomington (April 2008 to May 2009)

Bioinformatics Capstone Presentation by Vivek Krishnakumar
http://bio.informatics.indiana.edu/capstone/

June 10, 2009

## Transcript

1. ### Advisors: Dr. Qunfeng Dong(CGB) Prof. Haixu Tang(SOI) INDIANA UNIVERSITY, BLOOMINGTON

DEVELOPING A COMPUTATIONAL METHOD TO HELP SOLVE THE SEQUENCING GAP CLOSURE PROBLEM Vivek Krishnakumar M.S. Bioinformatics CAPSTONE
2. ### A Typical Sequencing Project n Sequencing n Assembly n Finishing

n Annotation Steps in a whole genome shotgun sequencing project Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000)
3. ### Genome Assembly § Putting together a large number of short

DNA sequences called reads to create a representation of the original genome. § Relies on the assumption that reads sharing the same string of letters, originated from the same place on the genome § Ideally, at an 8-10 fold coverage, most of the genome will be assembled into a small number of contigs (around 5 for a 1Mbp genome)1 1. Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988)
4. ### Assembly Algorithms ¨ Assembly is a complex computational problem (NP-hard)

¨ Several approximation algorithms to solve this problem. ¨ Types ¤ Simple greedy algorithms ¤ Graph based algorithms n Overlap-layout-consensus (Phrap, TIGR, VCAKE, etc.) n Eulerian Path (Newbler, Euler, ALLPATHS, etc.)
5. ### Assembly Algorithms – Graph based ¨ k-mers are computed ¨

k-mers are represented by edges of a deBruijn graph ¨ reads are represented as paths through k-mers ¨ reconstructing the genome involves finding the path that makes use of all the edges Overlap-layout- consensus1 Eulerian Path2 ¨ sequences represented as nodes ¨ edges correspond to read overlap ¨ use the overlap between sequence reads to create a link between them ¨ contigs are eventually formed by reading along the links as far as possible Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009) 1. J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51. 2. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17 9748-9753
6. ### Finishing (Post-assembly) ¨ No matter the type of assembly algorithm,

in practice, the assemblers produce not one, but hundreds or thousands of unordered contiguous sequences. Reasons: ¤ no or low coverage regions ¤ repeats ¤ sequencing errors ¨ The task of gap closure consists of several steps: ¤ Orienting the contigs ¤ Ordering the contigs ¤ Gap closure
7. ### Gap-closure problem C F A E B D G Un-ordered

& un-oriented contigs A C B D F G E A C Ordered & oriented Gap-closing (by PCR or walking)
8. ### Existing methods for contig ordering (Reference genome based) n For

biologists, it is imperative to move beyond the assembly step and elucidate the order of the contigs for effective gap-closure. Existing computational methods provide the biologists with tools to identify the order of the contigs, which enables them to move that much closer to reconstructing the genome n Projector1 n Projector 22 n CAAT-Box or Contigs-Assembly and Annotation Tool-Box3 n Pheromone trail-based genetic algorithm4 n PGAAS or Prokaryotic Genome Assembly Assistant System5 n As observed, the existing methods are based on similar working principles, i.e. they assume that there are reference genomes which are closely related to the organism in consideration 1. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes. Nucleic Acids Res. 2003 Nov 15;31(22):e144. 2. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6 3. Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29. 4. Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462 5. Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics. 2002 May;18(5):661-5.
9. ### Rationale n Post-sequencing and assembly of a novel genome (one

with no known nearest phylogenetic neighbor), the biologists are faced with a set of unordered contigs n It is not feasible to run gap closing Polymerase Chain Reactions (PCR) for each possible pair of contigs n Thus, we propose a computational method, employing graph theoretical concepts, to identify non-repeat contig neighborhoods (determining possible pairs of contigs and the corresponding gap sizes)
10. ### Proposed method C F A E B D G C

A A C F A A F G D D G Exhausting all possible pairs of contigs for gap- closing PCR C REPEAT E G B F D Using the available data, employing a graph theoretical approach, we generate probable non- repeat contig neighborhoods • contigs connected directly with each other • connected through repeat contigs Set of unordered, un-oriented contigs generated after assembly
11. ### Graph Theory n Graph theory is a mathematical/computational concept of

studying graphs. It is used to model pair-wise relationships between objects from a certain collection of such objects. Technically, a graph consists of: q nodes/vertices (representing the object) and q edges (representing the connection between two objects) – they are either directed or undirected. n For our study, we explored the use of the shortest path algorithm (Dijkstra’s Algorithm) node undirected edge directed edge
12. ### Simulation studies n Before analyzing the real world data, we

conducted a simulation study to explore a method for identifying possible pairs of contigs n Escheirichia coli 536 genome used for simulation GENOME CONTIG 1 CONTIG N+1 GAP 1 GAP N Generate reads at different coverages (10x, 20x, 30x). Introduce N random length gaps into the genome, producing N+1 contigs CONTIG 2 GAP 2 CONTIG N METHOD 1 CONTIG 1 CONTIG N+1 GAP 1 GAP N Splice the genome with N copies of the same DNA sequence generate reads at different coverages CONTIG 2 GAP 2 CONTIG N METHOD 2 Repeat 1 Repeat 2 Repeat N
13. ### Classifying the reads n From this population of reads we

classify and extract two types of reads: n Reads that belong to the gap regions n Reads that partially overlap with the edges of the contigs CONTIG 1 CONTIG N+1 GAP 1 GAP N CONTIG 2 GAP 2 CONTIG N CONTIG 1 GAP 1 CONTIG 2 Map the shortest path from a suffix to a prefix through a set of junction reads and unassembled reads suffix prefix
14. ### Running the simulation Junction Reads Reads from the gap region

Contig edges (suffix|prefix) Large-scale sequence alignment using vmatch Traverse the graph and find the shortest path between a suffix and prefix Compute all possible overlaps. Generate an overlap graph where the reads form the nodes and the overlaps form the edges. suffix prefix suffix prefix
15. ### Results – METHOD 1 Number of gaps introduced: 10 Number

of contiguous sequences: 11 (C1 through C11) Simulated read lengths: 400 bp Coverage: 10x, 20x and 30x All computed paths (at 20x coverage):
16. ### Results – METHOD 2 (Repeat-induced gaps) Start Node End Node

suffix_C2 prefix_C2 suffix_C2 suffix_C1 suffix_C2 prefix_C3 suffix_C1 prefix_C2 suffix_C1 prefix_C3 prefix_C2 prefix_C3 Number of repeat-induced gaps introduced: 2 Number of contiguous sequences: 3 (C1, C2 and C3) Simulated read lengths: 400 bp Coverage: 10x, 20x and 30x All computed paths (at 30x coverage): Repeat 1 Repeat 2 C1 C2 C3
17. ### Exploring the real world data n Eventually we want to

help the biologists develop PCR primers to close the gaps between the contigs in the genome. n For this study, we obtained assembly data from Prof. Yves Bruns’ Lab in the Biology department for the organism Brevundimonas diminuta. n Genome sequencing method: 454 (without paired ends) n Sequence Assembly program: Newbler n We intend to derive information from the .ace file produced by the Newbler assembler
18. ### Assembly data statistics Brevundimonas diminuta STATISTICS Total Number Of Reads

613425 Number of contigs 476 Length of contigs 3325300 Avg. Length of contigs 6985 ± 10125 Largest contig 77275 bp Smallest contig 102 bp Number of N50 contigs 61 Length of N50 contigs 14580
19. ### .ace file format § .ace file created by the Newbler

assembler contains the following assembly information: 1. the contig sequences 2. quality values 3. read alignment information. 4. contig linking information AS 476 537972 CO contig00001 755 63 28 U TTAC**AGCT*CCC*GCC*A**GGTC*TT*G*CCGG*TCATGCCTTCATA GTTGGTCGCGTGATAGACGCCGCGCGCCACGGCGCGGGCCAGGACGTCGG BQ 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 AF EYKMWLY02IBXNT.12-1.fm55 C -58 AF EYKMWLY01AHEVV.94-111.fm55 U -92 AF EYKMWLY01E0SVI.79-107.fm55 U -77 AF EYKMWLY01BQQ0K.153-181.fm55 U -151 AF EYKMWLY02GTKZH U 377 AF EYKMWLY01AUK2G U 466 AF EYKMWLY02IS15X C 466 AF EYKMWLY02IYK7T.265-2 C 485 AF EYKMWLY02IT5PF C 511 AF EYKMWLY01B4PII.51-1 C 500 AF EYKMWLY01D8FPB C 520 AF EYKMWLY02HKKY C 600 RD EYKMWLY01AHEVV.94-111.fm55 117 0 0 GGGGCTGAAGGCGCTGAGCGAGACGCTGCCGGTCGCGCAGGCGGTGCCGG CCGGGCGCGCCGATTTAATCAACCCTCTCCCGGCGGGAGAGGGTTAC**A GCTACCC*GCC*A**GG
20. ### Inferring information from the .ace file ¨ From the .ace

file, we extract information that suggests which reads a particular contig shares with which other contig(s) in the assembly CO contig00016 10537 1778 1241 U AF EYKMWLY02JASI2.1-151.to428 U 10379 AF EYKMWLY01CDRT0.1-130 U 10400 AF EYKMWLY02IF22Q.1-105 U 10426 AF EYKMWLY02HA9IP.1-105 U 10426 AF EYKMWLY01CV3A4.1-107 U 10426 AF EYKMWLY01BNJV1.1-91.to428 U 10440 AF EYKMWLY01D4P1Q C 10445 AF EYKMWLY02FFW9C.1-81 U 10450 AF EYKMWLY01C1SKF.1-75 U 10457 AF EYKMWLY01A3C1V.1-68 U 10465 AF EYKMWLY01CCJHV.1-66.to428 U 10465 AF EYKMWLY01D7WLL.1-52.to428 U 10480 AF EYKMWLY01CPNZS.1-49.to428 U 10484 AF EYKMWLY01CBBAK.216-196.to428 C 10508 AF EYKMWLY01DWD7X.196-176.to428 C 10506 AF EYKMWLY01A27AI.239-214.to428 C 10510 AF EYKMWLY01A14U9.208-187.to428 C 10513 AF EYKMWLY01A4PVY.242-220.to428 C 10513 CO contig00428 21606 2746 2071 U AF EYKMWLY01CPNZS.73-75.fm16 U -71 AF EYKMWLY01D7WLL.76-83.fm16 U -74 AF EYKMWLY02JASI2.175-195.fm16 U -173 AF EYKMWLY01CCJHV.90-117.fm16 U -88 AF EYKMWLY01BNJV1.115-149.fm16 U -113 AF EYKMWLY02JIVGZ.41-1.fm16 C -23 AF EYKMWLY02HY7R3.48-1.fm16 C -25 AF EYKMWLY02FW7CZ.58-1.fm16 C -27 AF EYKMWLY01B9IJ6 U 1 AF EYKMWLY01D4UXN.80-1.fm16 C -34 AF EYKMWLY01DBA88.146-1 C -93 AF EYKMWLY02HKXAA.152-1 C -96 AF EYKMWLY01DWD7X.152-1.fm16 C -47 AF EYKMWLY01B9SVF.62-221 U -60 AF EYKMWLY01A14U9.163-1.fm16 C -44 AF EYKMWLY01CBBAK.172-1.fm16 C -45 AF EYKMWLY01A27AI.190-1.fm16 C -48 AF EYKMWLY01A4PVY.196-1.fm16 C -45
21. ### Classification of contigs § Based on the number of recorded

connections, we try to classify the contigs into so-called: § Repeat contigs; and § Non-repeat contigs § After classification, we go on to build non-repeat contig neighborhoods (pairs of non-repeat contigs connected through one or more repeat contigs) CONTIG X CONTIG Y REPEAT CONTIG X CONTIG Y REPEAT A REPEAT B
22. ### Results Total number of contigs In the dataset 476 After

classification: Repeat contigs 31 Non-repeat contigs 445 Number of non-repeat contig pairs (through one or more repeat contigs) 184

24. ### Output from this method ¨ On analysis of the assembly

data, biologists are provided with the output in the following format suffix_C466 prefix_C411 1269 suffix_C20 prefix_C439 833 suffix_C20 prefix_C18 833 suffix_C34 prefix_C81 442 suffix_C466 prefix_C51 4095 suffix_C39 prefix_C425 1532 suffix_C42 prefix_C17 1532 suffix_C457 prefix_C18 2087 ¨ This output can be easily plugged into a primer design program such as Primer3 for developing unique PCR primers for each possible pair
25. ### Summary § Using this method, we were able to generate

probable non- repeat contig neighborhoods that can be effectively utilized by biologists for gap-closing by PCR § Limitations § The simulation studies conducted represent simple cases of the gap closure problem. In reality, we may encounter more complicated repeat structures that cannot be fully resolved by the simple method proposed here. We will conduct more simulation experiments that resemble the real-world sequencing data. § For the contigs that do not possess any connectivity information, further directed experiments can be conducted to close the physical gaps § Future Work § The generated contig-neighborhoods can be aligned to the available optical mapping data to improve the assembly
26. ### References ¨ Claire M. Fraser, Jonathan A. Eisen and Steven

L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000) ¨ Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988) ¨ Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009) ¨ J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51. ¨ Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17 9748-9753 ¨ van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes. Nucleic Acids Res. 2003 Nov 15;31(22):e144. ¨ van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6 ¨ Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29. ¨ Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462 ¨ Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics. 2002 May;18(5):661-5. ¨ Mark J. Chaisson1 and Pavel A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Res. 2008. 18: 324-330
27. ### Acknowledgments § Dr. Qunfeng Dong § Prof. Haixu Tang §

Prof. Rajarshi Guha § Jeong-Hyeon (Justin) Choi – Genomics § Prof. Yves Bruns’ Lab § Colleagues at CGB § Faculty of the School of Informatics § CGB Computing Team – For the technical support and resources § Linda Hostetter & Rachel Lawmaster