Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BMMB554 2016 Lecture 3

BMMB554 2016 Lecture 3

2e0188eb9315b0ecb64537ccfe70eab8?s=128

Anton Nekrutenko

January 27, 2016
Tweet

More Decks by Anton Nekrutenko

Other Decks in Education

Transcript

  1. Based on Bioinformatics Algorithms (Compeau/Pevzner) and Mike Schatz slides Lecture

    3 Assembly basics
  2. Whole genome shotgun sequencing (Celera’s assembly of the  human

    genome)
  3. Whole genome shotgun • Randomly fragment genomic DNA • Select

    for fragments of a certain size • Insert fragment into a plasmid, grow up bacteria to create many copies of each fragment • Sequence from each end of the fragment
  4. None
  5. Shotgun assembly • Merge reads into contigs as described previously

  6. None
  7. Shotgun assembly • Merge reads into contigs as described previously

    • Order contigs into scaffolds • Mate-pair information provides order and approximate distance • Multiple insert sizes can make this much more effective
  8. A Whole-Genome Assembly of Drosophila Eugene W. Myers,1* Granger G.

    Sutton,1 Art L. Delcher,1 Ian M. Dew,1 Dan P. Fasulo,1 Michael J. Flanigan,1 Saul A. Kravitz,1 Clark M. Mobarry,1 Knut H. J. Reinert,1 Karin A. Remington,1 Eric L. Anson,1 Randall A. Bolanos,1 Hui-Hsien Chou,1 Catherine M. Jordan,1 Aaron L. Halpern,1 Stefano Lonardi,1 Ellen M. Beasley,1 Rhonda C. Brandon,1 Lin Chen,1 Patrick J. Dunn,1 Zhongwu Lai,1 Yong Liang,1 Deborah R. Nusskern,1 Ming Zhan,1 Qing Zhang,1 Xiangqun Zheng,1 Gerald M. Rubin,2 Mark D. Adams,1 J. Craig Venter1 We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accom- plished it. Three independent external data sources essentially agree with and support the assembly’s sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochro- matin of the centromeres. Comparison with a previously sequenced 2.9- megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community. The primary obstacle to determining the se- quence of a very large genome is that, with current technology, one can directly deter- mine the sequence of at most a thousand consecutive base pairs at a time. The process, dideoxy sequencing, used to produce such sequencing reads was essentially invented by Sanger circa 1980 (1), with subsequent mod- est gains in read length, moderate gains in data accuracy, and significant gains in throughput. Given the limitation on read length, researchers employ a shotgun-se- quencing approach, in which an effectively random sampling of sequencing reads is col- lected from a larger target DNA sequence. With sufficient oversampling, the sequence of the target can be inferred by piecing the genomes were sequenced by first developing a set of cosmids or other clones covering the genomes by a process called physical map- ping, and then shotgun sequencing each clone as in (2–4). In 1994, the sequence of Haemophilus influenzae was obtained from the assembly of a whole-genome data set obtained by shotgun sequencing (5). This bacterial genome, at 1.8 Mbp, was much larger than was previously thought possible by a direct shotgun ap- proach, the largest previous genome so se- quenced being the lambda virus in 1982 (6). Critical to this accomplishment was the con- struction of a computer program capable of performing the assembly and the use of pairs of reads, called mates, from the ends of 2-kbp end to sequence next in an interactive walk across the genome. Weber and Myers then proposed the whole-genome shotgun se- quencing of the human genome in 1997 (8, 9). The protocol involved collecting a 10ϫ oversampling of the genome, with mate pairs from 0.9-kbp and 10-kbp inserts in a 1:1 ratio, and assembling this in conjunction with the long-range information provided by a ge- nome-wide sequence-tagged site (STS) map that is a series of unique, 300- to 500-bp sites ordered across the genome with an average spacing between sites of 100 kbp. In 1998, Venter and colleagues announced the under- taking of a whole-genome shotgun sequenc- ing of the human genome (10) with the se- quencing of Drosophila serving as a pilot project. For Drosophila, we set about collecting a 10ϫ oversampling of a genome using a 1-to-1 ratio of 2-kbp and 10-kbp mate pairs. In addition, enough BACs to provide 15ϫ coverage of the genome were to be collected and end-sequenced, effectively generating a set of mate pairs that give long-range infor- mation similar to that provided by the STS maps described above. Drosophila’s euchro- matic genome is estimated at 120 Mbp. Thus, the protocol would require collecting at least T H E D R O S O P H I L A G E N O M E R E V I E W on September 22, 2011 www.sciencemag.org rom
  9. Hierarchical shotgun sequencing (The public human genome project)

  10. Hierarchical strategy • Construct a set of large (100 to

    200kb) clones, and sequence each independently with the shotgun approach • Clones are mapped and selected to provide an ordered tiling of the genome • Devised to eliminate long-range misassembly and reduce the risk of short-range misassembly • Allows targeting specific regions of the genome for greater sequencing depth
  11. None
  12. Assembly of the Working Draft of the Human Genome with

    GigAssembler W. James Kent1,3 and David Haussler2 1Department of Biology, University of California at Santa Cruz, Santa Cruz, California 95064, USA; 2Howard Hughes Medical Institute, Department of Computer Science, University of California at Santa Cruz, Santa Cruz, California 95064, USA The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in ∼30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler. On May 24, 2000, the public Human Genome Project staged the first “freeze” of all currently available sequence data, co- ordinated by the director, Francis Collins, Greg Schuler at the National Center for Biotechnology Information, Adam Felsenfeld at the National Human Genome Research Institute, and the twenty primary public human sequencing centers (Box 1). Public database accessions for ∼22,000 shotgun- sequenced clones were selected for this freeze, mostly bacte- rial artificial chromosome (BAC) clones (International Hu- man Genome Sequencing Consortium 2001). The sequence contigs were extracted from these accessions and cleaned up as necessary by Schuler. We will refer to these contigs as “ini- tial sequence contigs”. There were ∼375,000 such initial se- quence contigs. The complete public human genome se- quence is not projected to be available until 2003. To get a fingerprint clone contigs. Sequenced clones from a finger- print clone contig are used for the sequence assembly. The May 24 map of sequenced clones consisted of some 1700 fin- gerprint clone contigs, each with an approximate chromo- somal location, plus a few additional contigs that could not be reliably placed on a chromosome. The end points of the in- dividual sequenced clones, as well as their overlaps and rela- tive order along the chromosome, were only very roughly determined in these fingerprint clone contigs. Thus, the prob- lem of clone order needed to be solved along with the prob- lem of initial sequence contig order and orientation. Initial sequence contigs from different clones within a fingerprint clone contig often showed long sequence overlaps, giving strong evidence of clone order, but not giving an entirely unambiguous signal because of the occasional presence of Methods Cold Spring Harbor Laboratory Press on September 22, 2011 - Published by genome.cshlp.org Downloaded from
  13. HGP Assembly (1) • Assembling the reads in a clone:

    PHRAP • Find pairwise word matches between reads • Smith-waterman align all pairs incorporating quality score • Greedily combine, in order of a score based on SW score and Quality • (plus tons of heuristics and trimming)
  14. HGP Assembly (2) • Contigs into genome: gigAssembler • 400,000

    contigs from ~30,000 clones • Use physical map to order clones • Merge clones based on contigs • Build ordering graph
  15. Challenge: clone overlap (ii) For the nth clone, Offset(n +

    1) = Offset Size(n) מ Overlap(n,n + 1), where the size re 4 Three overlapping draft clones: A, B, and C. Each wo initial sequence contigs. Note that initial sequence c b1, and a2 overlap as do b2 and c1.
  16. None
  17. Comparison between assemblies ‣ Celera: 2.9 Gb in 54,061 sequences

    ‣ HGP: 2.9 Gb in 6,094 sequences Aach et al. 2001
  18. Maps ‣ Genetic (Linkage) and Radiation Hybrid (RH) Maps ‣

    Physical Maps
  19. Why maps are important? ‣ Human genome is repeat rich

    ‣ Humans are outbred ‣ Effective against cloning biases
  20. Linkage Mapping Requirement = polymorphic markers

  21. Microsatellite Linkage Map Weissenbach et al. 1992 Distance unit =

    Centimorgan (cM)
  22. Microsatellite Linkage Map Dib et al. 1996

  23. RH Maps Walter et al. 1993

  24. RH Maps Gyapay et al. 1996 Distance unit = centiRay

    (cRN rad)
  25. RH Maps Gyapay et al. 1996 Distance unit = centiRay

    (cRN rad)
  26. Physical Map arrangement of BACs along 8q21 Olson 2001

  27. Fingerprinting BACs McPherson et al. 2001

  28. Physical Map arrangement of BACs along 8q21

  29. A variety of markers ‣ STSs ‣ ESTs ‣ SSRs

    ‣ SNPs
  30. Understanding N50 Given a set of sequences of varying lengths,

    the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N.
  31. 2001 Genome Stats

  32. Close race ‣ June 22, 2000 - HGP assembly is

    produced ‣ June 25, 2000 - Celera assembly is produced ‣ June 7, 2000 - UCSC Genome Browser goes on-line
  33. Principal findings

  34. Compositional variation

  35. Variation in recombination rate

  36. Repetitive landscape

  37. Properties of genes

  38. Duplication complexity

  39. Conservation = Function From Dubchak et al. 2000

  40. Conservation = Function

  41. Disease through population genomics • To cure a disease we

    need to understand its molecular mechanism • Some diseases are caused by a single mutation (color blindness). • Most diseases have VERY complex genotype (cancer, diabetes, allergy, heart conditions, autism, bipolar disorder).
  42. Single Nucleotide Polymorphisms (SNPs) ATGGCGCTGCGCTGT ATGGCGCTGCGGTGT ATGGCGCTGCGGTGT ATGGCGCTGCGGTGT ATGGCGCTGCGATGT C

    = 0; G = 1; A = 2 unrelated individuals
  43. Polymorphism analysis helps understand complex diseases (takes 10-5 - 10-6

    SNPs) 010110201002001 210012021100010 010120120000012 210220001200210 000221012011020 210021000210220 121000210210211 000210222111020 Control (healthy) Case (disease)
  44. https://liorpachter.wordpress.com/seq/

  45. Reconstructing strings from k-mers Composition3(TATGGGGTGC) = {ATG,GGG,GGG,GGT,GTG,TAT,TGC,TGG}

  46. AAT ATG GTT TAA TGT

  47. TAA AAT ATG TGT GTT TAATGTT AAT ATG GTT TAA

    TGT
  48. AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG

    GTT TAA TGC TGG TGT TAA AAT ATG TGC? TGG? TGT?
  49. AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG

    GTT TAA TGC TGG TGT TAA AAT ATG TGT GTT…
  50. AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG

    GTT TAA TGC TGG TGT TAA AAT ATG TGC GCC CCA CAT ATG TGG GGA GAT ATG TGT GTT TAATGCCATGGATGTT
  51. TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG

    GGA GAT ATG TGT GTT TAATGCCATGGGATGTT
  52. None
  53. None
  54. None
  55. Adjacency matrix

  56. Universal strings

  57. DeBrujin graph

  58. Simplifying by ‘gluing’ nodes

  59. Adjacency matrix

  60. Eulerian path through DeBrujin graph

  61. Node gluing construction

  62. k-mer composition construction

  63. Hamilton vs. Euler

  64. None
  65. Balanced/Unbalanced Connected/Disconnected

  66. Every balanced, strongly connected directed graph is Eulerian

  67. Tangledness vs. read length

  68. AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG

    GTT TAA TGC TGG TGT TAA AAT ATG TGC GCC CCA CAT ATG TGG GGA GAT ATG TGT GTT TAATGCCATGGATGTT
  69. None
  70. None
  71. None
  72. None
  73. None
  74. k-mers and coverage

  75. None
  76. None
  77. None
  78. None
  79. None
  80. None
  81. N50 size Def: 50% of the genome is in contigs

    as large as the N50 value Example: 1 Mbp genome N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500kbp) A greater N50 is indicative of improvement in every dimension: •  Better resolution of genes and flanking regulatory regions •  Better resolution of transposons and other complex sequences •  Better resolution of chromosome organization •  Better sequence for all downstream analysis 1000 300 45 30 100 20 15 15 10 . . . . . 45 50%
  82. High-quality draft assemblies of mammalian genomes from massively parallel sequence

    data Gnerre et al (2010) PNAS. doi: 10.1073/pnas.1017351108 Short Read Assembly with ALLPATHS