Upgrade to Pro — share decks privately, control downloads, hide ads and more …

M.S Bioinformatics Capstone Presentation - Indiana University - Jun 2009

M.S Bioinformatics Capstone Presentation - Indiana University - Jun 2009

Dissertation titled "Developing a computational method to help solve the sequencing gap closure problem", as part of research work conducted at the Center for Genomics and Bioinformatics (CGB) at Indiana University, Bloomington (April 2008 to May 2009)

Bioinformatics Capstone Presentation by Vivek Krishnakumar
http://bio.informatics.indiana.edu/capstone/

Vivek Krishnakumar

June 10, 2009
Tweet

More Decks by Vivek Krishnakumar

Other Decks in Programming

Transcript

  1. Advisors: Dr. Qunfeng Dong(CGB)
    Prof. Haixu Tang(SOI)
    INDIANA UNIVERSITY, BLOOMINGTON
    DEVELOPING A COMPUTATIONAL
    METHOD TO HELP SOLVE THE
    SEQUENCING GAP CLOSURE PROBLEM
    Vivek Krishnakumar
    M.S. Bioinformatics
    CAPSTONE

    View Slide

  2. A Typical Sequencing Project
    n
    Sequencing
    n
    Assembly
    n
    Finishing
    n
    Annotation
    Steps in a whole genome shotgun sequencing project
    Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing.
    Nature 406, 799-803(2000)

    View Slide

  3. Genome Assembly
    § Putting together a large number of short DNA sequences
    called reads to create a representation of the original
    genome.
    § Relies on the assumption that reads sharing the same
    string of letters, originated from the same place on the
    genome
    § Ideally, at an 8-10 fold coverage, most of the genome will
    be assembled into a small number of contigs (around 5 for
    a 1Mbp genome)1
    1. Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988)

    View Slide

  4. Assembly Algorithms
    ¨ Assembly is a complex computational problem
    (NP-hard)
    ¨ Several approximation algorithms to solve this
    problem.
    ¨ Types
    ¤ Simple greedy algorithms
    ¤ Graph based algorithms
    n Overlap-layout-consensus (Phrap, TIGR, VCAKE, etc.)
    n Eulerian Path (Newbler, Euler, ALLPATHS, etc.)

    View Slide

  5. Assembly Algorithms – Graph based
    ¨ k-mers are computed
    ¨ k-mers are represented
    by edges of a deBruijn
    graph
    ¨ reads are represented
    as paths through k-mers
    ¨ reconstructing the
    genome involves finding
    the path that makes use
    of all the edges
    Overlap-layout-
    consensus1
    Eulerian Path2
    ¨ sequences
    represented as nodes
    ¨ edges correspond to
    read overlap
    ¨ use the overlap
    between sequence
    reads to create a link
    between them
    ¨ contigs are eventually
    formed by reading
    along the links as far
    as possible
    Daniel MacLean, Jonathan D. G. Jones & David J. Studholme.
    Application of 'next-generation' sequencing technologies to microbial
    genetics. Nature Reviews Microbiology 7, 287-296 (2009)
    1. J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51.
    2. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17
    9748-9753

    View Slide

  6. Finishing (Post-assembly)
    ¨ No matter the type of assembly algorithm, in
    practice, the assemblers produce not one, but
    hundreds or thousands of unordered contiguous
    sequences. Reasons:
    ¤ no or low coverage regions
    ¤ repeats
    ¤ sequencing errors
    ¨ The task of gap closure consists of several steps:
    ¤ Orienting the contigs
    ¤ Ordering the contigs
    ¤ Gap closure

    View Slide

  7. Gap-closure problem
    C
    F
    A
    E
    B
    D G
    Un-ordered &
    un-oriented
    contigs
    A C B D F
    G E
    A C
    Ordered &
    oriented
    Gap-closing
    (by PCR or
    walking)

    View Slide

  8. Existing methods for contig ordering
    (Reference genome based)
    n
    For biologists, it is imperative to move beyond the assembly step and elucidate
    the order of the contigs for effective gap-closure. Existing computational
    methods provide the biologists with tools to identify the order of the contigs,
    which enables them to move that much closer to reconstructing the genome
    n
    Projector1
    n
    Projector 22
    n
    CAAT-Box or Contigs-Assembly and Annotation Tool-Box3
    n
    Pheromone trail-based genetic algorithm4
    n
    PGAAS or Prokaryotic Genome Assembly Assistant System5
    n
    As observed, the existing methods are based on similar working principles, i.e.
    they assume that there are reference genomes which are closely related to the
    organism in consideration
    1. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes. Nucleic Acids Res. 2003 Nov 15;31(22):e144.
    2. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6
    3. Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29.
    4. Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462
    5. Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics. 2002 May;18(5):661-5.

    View Slide

  9. Rationale
    n
    Post-sequencing and assembly of a novel genome (one with no
    known nearest phylogenetic neighbor), the biologists are faced
    with a set of unordered contigs
    n
    It is not feasible to run gap closing Polymerase Chain Reactions
    (PCR) for each possible pair of contigs
    n
    Thus, we propose a computational method, employing graph
    theoretical concepts, to identify non-repeat contig neighborhoods
    (determining possible pairs of contigs and the corresponding gap
    sizes)

    View Slide

  10. Proposed method
    C
    F
    A
    E
    B
    D G
    C
    A
    A
    C
    F
    A
    A
    F
    G
    D
    D
    G
    Exhausting all
    possible pairs of
    contigs for gap-
    closing PCR
    C REPEAT
    E
    G
    B F
    D
    Using the available data,
    employing a graph
    theoretical approach, we
    generate probable non-
    repeat contig
    neighborhoods
    • contigs connected
    directly with each other
    • connected through
    repeat contigs
    Set of
    unordered,
    un-oriented
    contigs
    generated
    after
    assembly

    View Slide

  11. Graph Theory
    n
    Graph theory is a mathematical/computational concept of studying graphs. It is
    used to model pair-wise relationships between objects from a certain collection
    of such objects. Technically, a graph consists of:
    q
    nodes/vertices (representing the object) and
    q
    edges (representing the connection between two objects) – they are
    either directed or undirected.
    n
    For our study, we explored the use of the shortest path algorithm (Dijkstra’s
    Algorithm)
    node
    undirected
    edge directed
    edge

    View Slide

  12. Simulation studies
    n
    Before analyzing the real world data, we conducted a simulation study to
    explore a method for identifying possible pairs of contigs
    n
    Escheirichia coli 536 genome used for simulation
    GENOME
    CONTIG 1 CONTIG N+1
    GAP 1 GAP N
    Generate reads at
    different coverages
    (10x, 20x, 30x).
    Introduce N random
    length gaps into the
    genome, producing
    N+1 contigs
    CONTIG 2 GAP 2 CONTIG N
    METHOD 1
    CONTIG 1 CONTIG N+1
    GAP 1 GAP N
    Splice the genome
    with N copies of the
    same DNA
    sequence generate
    reads at different
    coverages
    CONTIG 2 GAP 2 CONTIG N
    METHOD 2
    Repeat 1 Repeat 2 Repeat N

    View Slide

  13. Classifying the reads
    n
    From this population of reads we classify and extract two types of reads:
    n
    Reads that belong to the gap regions
    n
    Reads that partially overlap with the edges of the contigs
    CONTIG 1 CONTIG N+1
    GAP 1 GAP N
    CONTIG 2 GAP 2 CONTIG N
    CONTIG 1 GAP 1 CONTIG 2
    Map the shortest path from a suffix to a prefix through a set of junction reads and
    unassembled reads
    suffix prefix

    View Slide

  14. Running the simulation
    Junction
    Reads
    Reads from
    the gap
    region
    Contig edges
    (suffix|prefix)
    Large-scale sequence
    alignment using vmatch
    Traverse the graph
    and find the shortest
    path between a
    suffix and prefix
    Compute all possible overlaps.
    Generate an overlap graph where the reads form the
    nodes and the overlaps form the edges.
    suffix prefix
    suffix prefix

    View Slide

  15. Results – METHOD 1
    Number of gaps introduced: 10
    Number of contiguous sequences: 11 (C1 through C11)
    Simulated read lengths: 400 bp Coverage: 10x, 20x and 30x
    All computed paths (at 20x coverage):

    View Slide

  16. Results – METHOD 2 (Repeat-induced gaps)
    Start Node End Node
    suffix_C2 prefix_C2
    suffix_C2 suffix_C1
    suffix_C2 prefix_C3
    suffix_C1 prefix_C2
    suffix_C1 prefix_C3
    prefix_C2 prefix_C3
    Number of repeat-induced gaps introduced: 2
    Number of contiguous sequences: 3 (C1, C2 and C3)
    Simulated read lengths: 400 bp Coverage: 10x, 20x and 30x
    All computed paths (at 30x coverage):
    Repeat 1 Repeat 2
    C1 C2 C3

    View Slide

  17. Exploring the real world data
    n
    Eventually we want to help the biologists develop PCR primers to
    close the gaps between the contigs in the genome.
    n
    For this study, we obtained assembly data from Prof. Yves Bruns’ Lab
    in the Biology department for the organism Brevundimonas diminuta.
    n
    Genome sequencing method: 454 (without paired ends)
    n
    Sequence Assembly program: Newbler
    n
    We intend to derive information from the .ace file produced by the
    Newbler assembler

    View Slide

  18. Assembly data statistics
    Brevundimonas diminuta
    STATISTICS
    Total Number Of Reads 613425
    Number of contigs 476
    Length of contigs 3325300
    Avg. Length of contigs 6985 ± 10125
    Largest contig 77275 bp
    Smallest contig 102 bp
    Number of N50 contigs 61
    Length of N50 contigs 14580

    View Slide

  19. .ace file format
    § .ace file created by the
    Newbler assembler
    contains the following
    assembly information:
    1. the contig sequences
    2. quality values
    3. read alignment
    information.
    4. contig linking
    information
    AS 476 537972
    CO contig00001 755 63 28 U
    TTAC**AGCT*CCC*GCC*A**GGTC*TT*G*CCGG*TCATGCCTTCATA
    GTTGGTCGCGTGATAGACGCCGCGCGCCACGGCGCGGGCCAGGACGTCGG
    BQ
    64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
    64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
    64 64 64 64 64 64
    64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
    64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
    64 64 64 64 64 64
    AF EYKMWLY02IBXNT.12-1.fm55 C -58
    AF EYKMWLY01AHEVV.94-111.fm55 U -92
    AF EYKMWLY01E0SVI.79-107.fm55 U -77
    AF EYKMWLY01BQQ0K.153-181.fm55 U -151
    AF EYKMWLY02GTKZH U 377
    AF EYKMWLY01AUK2G U 466
    AF EYKMWLY02IS15X C 466
    AF EYKMWLY02IYK7T.265-2 C 485
    AF EYKMWLY02IT5PF C 511
    AF EYKMWLY01B4PII.51-1 C 500
    AF EYKMWLY01D8FPB C 520
    AF EYKMWLY02HKKY C 600
    RD EYKMWLY01AHEVV.94-111.fm55 117 0 0
    GGGGCTGAAGGCGCTGAGCGAGACGCTGCCGGTCGCGCAGGCGGTGCCGG
    CCGGGCGCGCCGATTTAATCAACCCTCTCCCGGCGGGAGAGGGTTAC**A
    GCTACCC*GCC*A**GG

    View Slide

  20. Inferring information from the .ace file
    ¨ From the .ace file, we extract information that suggests
    which reads a particular contig shares with which other
    contig(s) in the assembly
    CO contig00016 10537 1778 1241 U
    AF EYKMWLY02JASI2.1-151.to428 U 10379
    AF EYKMWLY01CDRT0.1-130 U 10400
    AF EYKMWLY02IF22Q.1-105 U 10426
    AF EYKMWLY02HA9IP.1-105 U 10426
    AF EYKMWLY01CV3A4.1-107 U 10426
    AF EYKMWLY01BNJV1.1-91.to428 U 10440
    AF EYKMWLY01D4P1Q C 10445
    AF EYKMWLY02FFW9C.1-81 U 10450
    AF EYKMWLY01C1SKF.1-75 U 10457
    AF EYKMWLY01A3C1V.1-68 U 10465
    AF EYKMWLY01CCJHV.1-66.to428 U 10465
    AF EYKMWLY01D7WLL.1-52.to428 U 10480
    AF EYKMWLY01CPNZS.1-49.to428 U 10484
    AF EYKMWLY01CBBAK.216-196.to428 C 10508
    AF EYKMWLY01DWD7X.196-176.to428 C 10506
    AF EYKMWLY01A27AI.239-214.to428 C 10510
    AF EYKMWLY01A14U9.208-187.to428 C 10513
    AF EYKMWLY01A4PVY.242-220.to428 C 10513
    CO contig00428 21606 2746 2071 U
    AF EYKMWLY01CPNZS.73-75.fm16 U -71
    AF EYKMWLY01D7WLL.76-83.fm16 U -74
    AF EYKMWLY02JASI2.175-195.fm16 U -173
    AF EYKMWLY01CCJHV.90-117.fm16 U -88
    AF EYKMWLY01BNJV1.115-149.fm16 U -113
    AF EYKMWLY02JIVGZ.41-1.fm16 C -23
    AF EYKMWLY02HY7R3.48-1.fm16 C -25
    AF EYKMWLY02FW7CZ.58-1.fm16 C -27
    AF EYKMWLY01B9IJ6 U 1
    AF EYKMWLY01D4UXN.80-1.fm16 C -34
    AF EYKMWLY01DBA88.146-1 C -93
    AF EYKMWLY02HKXAA.152-1 C -96
    AF EYKMWLY01DWD7X.152-1.fm16 C -47
    AF EYKMWLY01B9SVF.62-221 U -60
    AF EYKMWLY01A14U9.163-1.fm16 C -44
    AF EYKMWLY01CBBAK.172-1.fm16 C -45
    AF EYKMWLY01A27AI.190-1.fm16 C -48
    AF EYKMWLY01A4PVY.196-1.fm16 C -45

    View Slide

  21. Classification of contigs
    § Based on the number of recorded connections, we
    try to classify the contigs into so-called:
    § Repeat contigs; and
    § Non-repeat contigs
    § After classification, we go on to build non-repeat
    contig neighborhoods (pairs of non-repeat contigs
    connected through one or more repeat contigs)
    CONTIG X CONTIG Y
    REPEAT CONTIG X CONTIG Y
    REPEAT A REPEAT B

    View Slide

  22. Results
    Total number of contigs
    In the dataset
    476
    After classification:
    Repeat contigs 31
    Non-repeat contigs 445
    Number of non-repeat contig pairs
    (through one or more repeat contigs)
    184

    View Slide

  23. Generated networks

    View Slide

  24. Output from this method
    ¨ On analysis of the assembly data, biologists are
    provided with the output in the following format
    suffix_C466 prefix_C411 1269
    suffix_C20 prefix_C439 833
    suffix_C20 prefix_C18 833
    suffix_C34 prefix_C81 442
    suffix_C466 prefix_C51 4095
    suffix_C39 prefix_C425 1532
    suffix_C42 prefix_C17 1532
    suffix_C457 prefix_C18 2087
    ¨ This output can be easily plugged into a primer design
    program such as Primer3 for developing unique PCR
    primers for each possible pair

    View Slide

  25. Summary
    § Using this method, we were able to generate probable non-
    repeat contig neighborhoods that can be effectively utilized by
    biologists for gap-closing by PCR
    § Limitations
    § The simulation studies conducted represent simple cases of the gap
    closure problem. In reality, we may encounter more complicated
    repeat structures that cannot be fully resolved by the simple method
    proposed here. We will conduct more simulation experiments that
    resemble the real-world sequencing data.
    § For the contigs that do not possess any connectivity information,
    further directed experiments can be conducted to close the physical
    gaps
    § Future Work
    § The generated contig-neighborhoods can be aligned to the available
    optical mapping data to improve the assembly

    View Slide

  26. References
    ¨ Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000)
    ¨ Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics
    2(3): 231-239 (1988)
    ¨ Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing
    technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009)
    ¨ J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995,
    pp. 7-51.
    ¨ Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly.
    PNAS August 14, 2001 vol. 98 no. 17 9748-9753
    ¨ van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes. Nucleic
    Acids Res. 2003 Nov 15;31(22):e144.
    ¨ van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic
    genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6
    ¨ Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and
    Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29.
    ¨ Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for
    comparative genome assembly. Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462
    ¨ Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics. 2002
    May;18(5):661-5.
    ¨ Mark J. Chaisson1 and Pavel A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Res. 2008. 18:
    324-330

    View Slide

  27. Acknowledgments
    § Dr. Qunfeng Dong
    § Prof. Haixu Tang
    § Prof. Rajarshi Guha
    § Jeong-Hyeon (Justin) Choi – Genomics
    § Prof. Yves Bruns’ Lab
    § Colleagues at CGB
    § Faculty of the School of Informatics
    § CGB Computing Team – For the technical support and
    resources
    § Linda Hostetter & Rachel Lawmaster

    View Slide

  28. THANK YOU!

    View Slide