BMMB554 2016 Lecture 3

Based on Bioinformatics Algorithms (Compeau/Pevzner) and Mike Schatz slides Lecture
3 Assembly basics

Whole genome shotgun sequencing (Celera’s assembly of the human
genome)

Whole genome shotgun • Randomly fragment genomic DNA • Select
for fragments of a certain size • Insert fragment into a plasmid, grow up bacteria to create many copies of each fragment • Sequence from each end of the fragment

Shotgun assembly • Merge reads into contigs as described previously

Shotgun assembly • Merge reads into contigs as described previously
• Order contigs into scaﬀolds • Mate-pair information provides order and approximate distance • Multiple insert sizes can make this much more eﬀective

A Whole-Genome Assembly of Drosophila Eugene W. Myers,1* Granger G.
Sutton,1 Art L. Delcher,1 Ian M. Dew,1 Dan P. Fasulo,1 Michael J. Flanigan,1 Saul A. Kravitz,1 Clark M. Mobarry,1 Knut H. J. Reinert,1 Karin A. Remington,1 Eric L. Anson,1 Randall A. Bolanos,1 Hui-Hsien Chou,1 Catherine M. Jordan,1 Aaron L. Halpern,1 Stefano Lonardi,1 Ellen M. Beasley,1 Rhonda C. Brandon,1 Lin Chen,1 Patrick J. Dunn,1 Zhongwu Lai,1 Yong Liang,1 Deborah R. Nusskern,1 Ming Zhan,1 Qing Zhang,1 Xiangqun Zheng,1 Gerald M. Rubin,2 Mark D. Adams,1 J. Craig Venter1 We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accom- plished it. Three independent external data sources essentially agree with and support the assembly’s sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochro- matin of the centromeres. Comparison with a previously sequenced 2.9- megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientiﬁc community. The primary obstacle to determining the sequence of a very large genome is that, with current technology, one can directly deter- mine the sequence of at most a thousand consecutive base pairs at a time. The process, dideoxy sequencing, used to produce such sequencing reads was essentially invented by Sanger circa 1980 (1), with subsequent mod- est gains in read length, moderate gains in data accuracy, and significant gains in throughput. Given the limitation on read length, researchers employ a shotgun-sequencing approach, in which an effectively random sampling of sequencing reads is collected from a larger target DNA sequence. With sufficient oversampling, the sequence of the target can be inferred by piecing the genomes were sequenced by first developing a set of cosmids or other clones covering the genomes by a process called physical mapping, and then shotgun sequencing each clone as in (2–4). In 1994, the sequence of Haemophilus influenzae was obtained from the assembly of a whole-genome data set obtained by shotgun sequencing (5). This bacterial genome, at 1.8 Mbp, was much larger than was previously thought possible by a direct shotgun approach, the largest previous genome so sequenced being the lambda virus in 1982 (6). Critical to this accomplishment was the construction of a computer program capable of performing the assembly and the use of pairs of reads, called mates, from the ends of 2-kbp end to sequence next in an interactive walk across the genome. Weber and Myers then proposed the whole-genome shotgun sequencing of the human genome in 1997 (8, 9). The protocol involved collecting a 10ϫ oversampling of the genome, with mate pairs from 0.9-kbp and 10-kbp inserts in a 1:1 ratio, and assembling this in conjunction with the long-range information provided by a genome-wide sequence-tagged site (STS) map that is a series of unique, 300- to 500-bp sites ordered across the genome with an average spacing between sites of 100 kbp. In 1998, Venter and colleagues announced the under- taking of a whole-genome shotgun sequencing of the human genome (10) with the sequencing of Drosophila serving as a pilot project. For Drosophila, we set about collecting a 10ϫ oversampling of a genome using a 1-to-1 ratio of 2-kbp and 10-kbp mate pairs. In addition, enough BACs to provide 15ϫ coverage of the genome were to be collected and end-sequenced, effectively generating a set of mate pairs that give long-range information similar to that provided by the STS maps described above. Drosophila’s euchromatic genome is estimated at 120 Mbp. Thus, the protocol would require collecting at least T H E D R O S O P H I L A G E N O M E R E V I E W on September 22, 2011 www.sciencemag.org rom

Hierarchical shotgun sequencing (The public human genome project)

Hierarchical strategy • Construct a set of large (100 to
200kb) clones, and sequence each independently with the shotgun approach • Clones are mapped and selected to provide an ordered tiling of the genome • Devised to eliminate long-range misassembly and reduce the risk of short-range misassembly • Allows targeting speciﬁc regions of the genome for greater sequencing depth

Assembly of the Working Draft of the Human Genome with
GigAssembler W. James Kent1,3 and David Haussler2 1Department of Biology, University of California at Santa Cruz, Santa Cruz, California 95064, USA; 2Howard Hughes Medical Institute, Department of Computer Science, University of California at Santa Cruz, Santa Cruz, California 95064, USA The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in ∼30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler. On May 24, 2000, the public Human Genome Project staged the first “freeze” of all currently available sequence data, co- ordinated by the director, Francis Collins, Greg Schuler at the National Center for Biotechnology Information, Adam Felsenfeld at the National Human Genome Research Institute, and the twenty primary public human sequencing centers (Box 1). Public database accessions for ∼22,000 shotgun- sequenced clones were selected for this freeze, mostly bacterial artificial chromosome (BAC) clones (International Hu- man Genome Sequencing Consortium 2001). The sequence contigs were extracted from these accessions and cleaned up as necessary by Schuler. We will refer to these contigs as “initial sequence contigs”. There were ∼375,000 such initial sequence contigs. The complete public human genome sequence is not projected to be available until 2003. To get a fingerprint clone contigs. Sequenced clones from a fingerprint clone contig are used for the sequence assembly. The May 24 map of sequenced clones consisted of some 1700 fingerprint clone contigs, each with an approximate chromo- somal location, plus a few additional contigs that could not be reliably placed on a chromosome. The end points of the in- dividual sequenced clones, as well as their overlaps and rela- tive order along the chromosome, were only very roughly determined in these fingerprint clone contigs. Thus, the prob- lem of clone order needed to be solved along with the prob- lem of initial sequence contig order and orientation. Initial sequence contigs from different clones within a fingerprint clone contig often showed long sequence overlaps, giving strong evidence of clone order, but not giving an entirely unambiguous signal because of the occasional presence of Methods Cold Spring Harbor Laboratory Press on September 22, 2011 - Published by genome.cshlp.org Downloaded from

HGP Assembly (1) • Assembling the reads in a clone:
PHRAP • Find pairwise word matches between reads • Smith-waterman align all pairs incorporating quality score • Greedily combine, in order of a score based on SW score and Quality • (plus tons of heuristics and trimming)

HGP Assembly (2) • Contigs into genome: gigAssembler • 400,000
contigs from ~30,000 clones • Use physical map to order clones • Merge clones based on contigs • Build ordering graph

Challenge: clone overlap (ii) For the nth clone, Offset(n +
1) = Offset Size(n) מ Overlap(n,n + 1), where the size re 4 Three overlapping draft clones: A, B, and C. Each wo initial sequence contigs. Note that initial sequence c b1, and a2 overlap as do b2 and c1.

Comparison between assemblies ‣ Celera: 2.9 Gb in 54,061 sequences
‣ HGP: 2.9 Gb in 6,094 sequences Aach et al. 2001

Maps ‣ Genetic (Linkage) and Radiation Hybrid (RH) Maps ‣
Physical Maps

Why maps are important? ‣ Human genome is repeat rich
‣ Humans are outbred ‣ Eﬀective against cloning biases

Linkage Mapping Requirement = polymorphic markers

Microsatellite Linkage Map Weissenbach et al. 1992 Distance unit =
Centimorgan (cM)

Microsatellite Linkage Map Dib et al. 1996

RH Maps Walter et al. 1993

RH Maps Gyapay et al. 1996 Distance unit = centiRay
(cRN rad)

Physical Map arrangement of BACs along 8q21 Olson 2001

Fingerprinting BACs McPherson et al. 2001

Physical Map arrangement of BACs along 8q21

A variety of markers ‣ STSs ‣ ESTs ‣ SSRs
‣ SNPs

Understanding N50 Given a set of sequences of varying lengths,
the N50 length is deﬁned as the length N for which 50% of all bases in the sequences are in a sequence of length L < N.

2001 Genome Stats

Close race ‣ June 22, 2000 - HGP assembly is
produced ‣ June 25, 2000 - Celera assembly is produced ‣ June 7, 2000 - UCSC Genome Browser goes on-line

Principal ﬁndings

Compositional variation

Variation in recombination rate

Repetitive landscape

Properties of genes

Duplication complexity

Conservation = Function From Dubchak et al. 2000

Conservation = Function

Disease through population genomics • To cure a disease we
need to understand its molecular mechanism • Some diseases are caused by a single mutation (color blindness). • Most diseases have VERY complex genotype (cancer, diabetes, allergy, heart conditions, autism, bipolar disorder).

Single Nucleotide Polymorphisms (SNPs) ATGGCGCTGCGCTGT ATGGCGCTGCGGTGT ATGGCGCTGCGGTGT ATGGCGCTGCGGTGT ATGGCGCTGCGATGT C
= 0; G = 1; A = 2 unrelated individuals

Polymorphism analysis helps understand complex diseases (takes 10-5 - 10-6
SNPs) 010110201002001 210012021100010 010120120000012 210220001200210 000221012011020 210021000210220 121000210210211 000210222111020 Control (healthy) Case (disease)

https://liorpachter.wordpress.com/seq/

Reconstructing strings from k-mers Composition3(TATGGGGTGC) = {ATG,GGG,GGG,GGT,GTG,TAT,TGC,TGG}

AAT ATG GTT TAA TGT

TAA AAT ATG TGT GTT TAATGTT AAT ATG GTT TAA
TGT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG
GTT TAA TGC TGG TGT TAA AAT ATG TGC? TGG? TGT?

GTT TAA TGC TGG TGT TAA AAT ATG TGT GTT…

GTT TAA TGC TGG TGT TAA AAT ATG TGC GCC CCA CAT ATG TGG GGA GAT ATG TGT GTT TAATGCCATGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG
GGA GAT ATG TGT GTT TAATGCCATGGGATGTT

Adjacency matrix

Universal strings

DeBrujin graph

Simplifying by ‘gluing’ nodes

Adjacency matrix

Eulerian path through DeBrujin graph

Node gluing construction

k-mer composition construction

Hamilton vs. Euler

Balanced/Unbalanced Connected/Disconnected

Every balanced, strongly connected directed graph is Eulerian

Tangledness vs. read length

GTT TAA TGC TGG TGT TAA AAT ATG TGC GCC CCA CAT ATG TGG GGA GAT ATG TGT GTT TAATGCCATGGATGTT

k-mers and coverage

N50 size Def: 50% of the genome is in contigs
as large as the N50 value Example: 1 Mbp genome N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500kbp) A greater N50 is indicative of improvement in every dimension: •  Better resolution of genes and flanking regulatory regions •  Better resolution of transposons and other complex sequences •  Better resolution of chromosome organization •  Better sequence for all downstream analysis 1000 300 45 30 100 20 15 15 10 . . . . . 45 50%

High-quality draft assemblies of mammalian genomes from massively parallel sequence
data Gnerre et al (2010) PNAS. doi: 10.1073/pnas.1017351108 Short Read Assembly with ALLPATHS

BMMB554 2016 Lecture 3

BMMB554 2016 Lecture 3

More Decks by Anton Nekrutenko

Other Decks in Education

Featured

Transcript