BMMB554 2016 Lecture 3 - Speaker Deck

Slide 1

Slide 1 text

Based on Bioinformatics Algorithms (Compeau/Pevzner) and Mike Schatz slides

Lecture 3: Assembly basics

Slide 8

Slide 8 text

A Whole-Genome Assembly of Drosophila

Eugene W. Myers, Granger G. Sutton, Art L. Delcher, Ian M. Dew, Dan P. Fasulo, Michael J. Flanigan, Saul A. Kravitz, Clark M. Mobarry, Knut H. J. Reinert, Karin A. Remington, Eric L. Anson, Randall A. Bolanos, Hui-Hsien Chou, Catherine M. Jordan, Aaron L. Halpern, Stefano Lonardi, Ellen M. Beasley, Rhonda C. Brandon, Lin Chen, Patrick J. Dunn, Zhongwu Lai, Yong Liang, Deborah R. Nusskern, Ming Zhan, Qing Zhang, Xiangqun Zheng, Gerald M. Rubin, Mark D. Adams, J. Craig Venter

We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.

The primary obstacle to determining the sequence of a very large genome is that, with current technology, one can directly determine the sequence of at most a thousand consecutive base pairs at a time. The process, dideoxy sequencing, used to produce such sequencing reads was essentially invented by Sanger circa 1980 (1), with subsequent modest gains in read length, moderate gains in data accuracy, and significant gains in throughput. Given the limitation on read length, researchers employ a shotgun-sequencing approach, in which an effectively random sampling of sequencing reads is collected from a larger target DNA sequence.
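The random sampling step described above can be illustrated with a small simulation. This is a minimal sketch, not the procedure used in the paper: the function names, the toy genome, and the read length and read count are all assumptions chosen for illustration, and reads are taken error-free from a single strand.

```python
import random


def shotgun_reads(genome, read_len, n_reads, rng):
    """Sample n_reads error-free reads of length read_len.

    Each read starts at a uniformly random position, mimicking the
    "effectively random sampling" of shotgun sequencing.
    Returns a list of (start, sequence) pairs.
    """
    last_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append((start, genome[start:start + read_len]))
    return reads


def covered_fraction(genome_len, reads):
    """Fraction of genome positions touched by at least one read."""
    covered = [False] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / genome_len


# Toy example (assumed sizes): a 5,000-bp random genome sampled at 10x.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = shotgun_reads(genome, read_len=100, n_reads=500, rng=rng)
fraction = covered_fraction(len(genome), reads)
print(f"{len(reads)} reads cover {fraction:.1%} of the genome")
```

Even at 10× oversampling a few positions near the ends of the toy genome tend to stay uncovered, because fewer possible read start positions reach them.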
With sufficient oversampling, the sequence of the target can be inferred by piecing [...] the genomes were sequenced by first developing a set of cosmids or other clones covering the genomes by a process called physical mapping, and then shotgun sequencing each clone as in (2–4). In 1994, the sequence of Haemophilus influenzae was obtained from the assembly of a whole-genome data set obtained by shotgun sequencing (5). This bacterial genome, at 1.8 Mbp, was much larger than was previously thought possible by a direct shotgun approach, the largest previous genome so sequenced being the lambda virus in 1982 (6). Critical to this accomplishment was the construction of a computer program capable of performing the assembly and the use of pairs of reads, called mates, from the ends of 2-kbp [...] end to sequence next in an interactive walk across the genome.

Weber and Myers then proposed the whole-genome shotgun sequencing of the human genome in 1997 (8, 9). The protocol involved collecting a 10× oversampling of the genome, with mate pairs from 0.9-kbp and 10-kbp inserts in a 1:1 ratio, and assembling this in conjunction with the long-range information provided by a genome-wide sequence-tagged site (STS) map that is a series of unique, 300- to 500-bp sites ordered across the genome with an average spacing between sites of 100 kbp. In 1998, Venter and colleagues announced the undertaking of a whole-genome shotgun sequencing of the human genome (10) with the sequencing of Drosophila serving as a pilot project.

For Drosophila, we set about collecting a 10× oversampling of a genome using a 1-to-1 ratio of 2-kbp and 10-kbp mate pairs. In addition, enough BACs to provide 15× coverage of the genome were to be collected and end-sequenced, effectively generating a set of mate pairs that give long-range information similar to that provided by the STS maps described above. Drosophila's euchromatic genome is estimated at 120 Mbp.
Thus, the protocol would require collecting at least [...]

The Drosophila Genome: Review. www.sciencemag.org
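The protocol figures quoted above (a 10× oversampling of a euchromatic genome estimated at 120 Mbp) lend themselves to a back-of-the-envelope calculation. The sketch below is illustrative only and does not reproduce the figure cut off in the excerpt. The 500-bp read length is an assumed value (the text says only that reads are at most about a thousand bases), the function names are hypothetical, and the Poisson estimate of the covered fraction is a standard idealization that ignores repeats, cloning bias, and edge effects.

```python
import math


def bases_needed(genome_size_bp, coverage):
    """Total sequenced bases for a given fold oversampling."""
    return genome_size_bp * coverage


def reads_needed(genome_size_bp, coverage, read_len_bp):
    """Number of reads of a fixed length that supply that many bases."""
    return math.ceil(bases_needed(genome_size_bp, coverage) / read_len_bp)


def expected_fraction_covered(coverage):
    """Idealized Poisson estimate: a base is missed with probability e^(-c)."""
    return 1.0 - math.exp(-coverage)


GENOME_BP = 120_000_000   # euchromatic genome size, from the text
COVERAGE = 10             # 10x oversampling, from the text
READ_LEN_BP = 500         # ASSUMPTION: not stated in the excerpt

total_bases = bases_needed(GENOME_BP, COVERAGE)
n_reads = reads_needed(GENOME_BP, COVERAGE, READ_LEN_BP)
print(f"bases to sequence: {total_bases:,}")
print(f"reads at {READ_LEN_BP} bp each: {n_reads:,}")
print(f"idealized fraction covered: {expected_fraction_covered(COVERAGE):.5f}")
```

Changing `READ_LEN_BP` shows how strongly the read count depends on the assumed read length, while the total number of bases depends only on genome size and coverage.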

Slide 12

Slide 12 text

Assembly of the Working Draft of the Human Genome with GigAssembler

W. James Kent (1,3) and David Haussler (2)

(1) Department of Biology, University of California at Santa Cruz, Santa Cruz, California 95064, USA; (2) Howard Hughes Medical Institute, Department of Computer Science, University of California at Santa Cruz, Santa Cruz, California 95064, USA

The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in ∼30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler.

On May 24, 2000, the public Human Genome Project staged the first "freeze" of all currently available sequence data, coordinated by the director, Francis Collins, Greg Schuler at the National Center for Biotechnology Information, Adam Felsenfeld at the National Human Genome Research Institute, and the twenty primary public human sequencing centers (Box 1). Public database accessions for ∼22,000 shotgun-sequenced clones were selected for this freeze, mostly bacterial artificial chromosome (BAC) clones (International Human Genome Sequencing Consortium 2001). The sequence contigs were extracted from these accessions and cleaned up as necessary by Schuler. We will refer to these contigs as "initial sequence contigs". There were ∼375,000 such initial sequence contigs. The complete public human genome sequence is not projected to be available until 2003. To get a [...] fingerprint clone contigs.
Sequenced clones from a fingerprint clone contig are used for the sequence assembly. The May 24 map of sequenced clones consisted of some 1700 fingerprint clone contigs, each with an approximate chromosomal location, plus a few additional contigs that could not be reliably placed on a chromosome. The end points of the individual sequenced clones, as well as their overlaps and relative order along the chromosome, were only very roughly determined in these fingerprint clone contigs. Thus, the problem of clone order needed to be solved along with the problem of initial sequence contig order and orientation. Initial sequence contigs from different clones within a fingerprint clone contig often showed long sequence overlaps, giving strong evidence of clone order, but not giving an entirely unambiguous signal because of the occasional presence of [...]

Methods. Published by Cold Spring Harbor Laboratory Press, genome.cshlp.org
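The idea that overlapping contigs can be merged, and that a long overlap is evidence of their relative order, can be sketched with exact suffix-prefix matching. This is a toy illustration and not GigAssembler's algorithm: the function names and the `min_len` threshold are hypothetical, and real data would need alignment that tolerates sequencing errors, reverse-complement orientation, and repeats.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    `min_len` must be >= 1.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a, b, min_len=3):
    """Merge `b` onto the end of `a` if they overlap, otherwise return None."""
    k = overlap_len(a, b, min_len)
    if k == 0:
        return None
    return a + b[k:]


# Toy contigs (made-up sequences): the last 4 bases of the first
# match the first 4 bases of the second.
left = "ACGTTGCA"
right = "TGCAGGT"
print(overlap_len(left, right))   # 4
print(merge(left, right))         # ACGTTGCAGGT
```

The overlap is directional: testing `overlap_len(right, left)` as well indicates which contig comes first, which is the sense in which overlaps hint at order.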
