Slide 1

Slide 1 text

Based on Bioinformatics Algorithms (Compeau/Pevzner) and Mike Schatz slides Lecture 3 Assembly basics

Slide 2

Slide 2 text

Whole genome shotgun sequencing (Celera’s assembly of the  human genome)

Slide 3

Slide 3 text

Whole genome shotgun • Randomly fragment genomic DNA • Select for fragments of a certain size • Insert fragment into a plasmid, grow up bacteria to create many copies of each fragment • Sequence from each end of the fragment

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Shotgun assembly • Merge reads into contigs as described previously

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Shotgun assembly • Merge reads into contigs as described previously • Order contigs into scaffolds • Mate-pair information provides order and approximate distance • Multiple insert sizes can make this much more effective

Slide 8

Slide 8 text

A Whole-Genome Assembly of Drosophila Eugene W. Myers,1* Granger G. Sutton,1 Art L. Delcher,1 Ian M. Dew,1 Dan P. Fasulo,1 Michael J. Flanigan,1 Saul A. Kravitz,1 Clark M. Mobarry,1 Knut H. J. Reinert,1 Karin A. Remington,1 Eric L. Anson,1 Randall A. Bolanos,1 Hui-Hsien Chou,1 Catherine M. Jordan,1 Aaron L. Halpern,1 Stefano Lonardi,1 Ellen M. Beasley,1 Rhonda C. Brandon,1 Lin Chen,1 Patrick J. Dunn,1 Zhongwu Lai,1 Yong Liang,1 Deborah R. Nusskern,1 Ming Zhan,1 Qing Zhang,1 Xiangqun Zheng,1 Gerald M. Rubin,2 Mark D. Adams,1 J. Craig Venter1 We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accom- plished it. Three independent external data sources essentially agree with and support the assembly’s sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochro- matin of the centromeres. Comparison with a previously sequenced 2.9- megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community. The primary obstacle to determining the se- quence of a very large genome is that, with current technology, one can directly deter- mine the sequence of at most a thousand consecutive base pairs at a time. The process, dideoxy sequencing, used to produce such sequencing reads was essentially invented by Sanger circa 1980 (1), with subsequent mod- est gains in read length, moderate gains in data accuracy, and significant gains in throughput. Given the limitation on read length, researchers employ a shotgun-se- quencing approach, in which an effectively random sampling of sequencing reads is col- lected from a larger target DNA sequence. With sufficient oversampling, the sequence of the target can be inferred by piecing the genomes were sequenced by first developing a set of cosmids or other clones covering the genomes by a process called physical map- ping, and then shotgun sequencing each clone as in (2–4). In 1994, the sequence of Haemophilus influenzae was obtained from the assembly of a whole-genome data set obtained by shotgun sequencing (5). This bacterial genome, at 1.8 Mbp, was much larger than was previously thought possible by a direct shotgun ap- proach, the largest previous genome so se- quenced being the lambda virus in 1982 (6). Critical to this accomplishment was the con- struction of a computer program capable of performing the assembly and the use of pairs of reads, called mates, from the ends of 2-kbp end to sequence next in an interactive walk across the genome. Weber and Myers then proposed the whole-genome shotgun se- quencing of the human genome in 1997 (8, 9). The protocol involved collecting a 10ϫ oversampling of the genome, with mate pairs from 0.9-kbp and 10-kbp inserts in a 1:1 ratio, and assembling this in conjunction with the long-range information provided by a ge- nome-wide sequence-tagged site (STS) map that is a series of unique, 300- to 500-bp sites ordered across the genome with an average spacing between sites of 100 kbp. In 1998, Venter and colleagues announced the under- taking of a whole-genome shotgun sequenc- ing of the human genome (10) with the se- quencing of Drosophila serving as a pilot project. For Drosophila, we set about collecting a 10ϫ oversampling of a genome using a 1-to-1 ratio of 2-kbp and 10-kbp mate pairs. In addition, enough BACs to provide 15ϫ coverage of the genome were to be collected and end-sequenced, effectively generating a set of mate pairs that give long-range infor- mation similar to that provided by the STS maps described above. Drosophila’s euchro- matic genome is estimated at 120 Mbp. Thus, the protocol would require collecting at least T H E D R O S O P H I L A G E N O M E R E V I E W on September 22, 2011 www.sciencemag.org rom

Slide 9

Slide 9 text

Hierarchical shotgun sequencing (The public human genome project)

Slide 10

Slide 10 text

Hierarchical strategy • Construct a set of large (100 to 200kb) clones, and sequence each independently with the shotgun approach • Clones are mapped and selected to provide an ordered tiling of the genome • Devised to eliminate long-range misassembly and reduce the risk of short-range misassembly • Allows targeting specific regions of the genome for greater sequencing depth

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Assembly of the Working Draft of the Human Genome with GigAssembler W. James Kent1,3 and David Haussler2 1Department of Biology, University of California at Santa Cruz, Santa Cruz, California 95064, USA; 2Howard Hughes Medical Institute, Department of Computer Science, University of California at Santa Cruz, Santa Cruz, California 95064, USA The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in ∼30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler. On May 24, 2000, the public Human Genome Project staged the first “freeze” of all currently available sequence data, co- ordinated by the director, Francis Collins, Greg Schuler at the National Center for Biotechnology Information, Adam Felsenfeld at the National Human Genome Research Institute, and the twenty primary public human sequencing centers (Box 1). Public database accessions for ∼22,000 shotgun- sequenced clones were selected for this freeze, mostly bacte- rial artificial chromosome (BAC) clones (International Hu- man Genome Sequencing Consortium 2001). The sequence contigs were extracted from these accessions and cleaned up as necessary by Schuler. We will refer to these contigs as “ini- tial sequence contigs”. There were ∼375,000 such initial se- quence contigs. The complete public human genome se- quence is not projected to be available until 2003. To get a fingerprint clone contigs. Sequenced clones from a finger- print clone contig are used for the sequence assembly. The May 24 map of sequenced clones consisted of some 1700 fin- gerprint clone contigs, each with an approximate chromo- somal location, plus a few additional contigs that could not be reliably placed on a chromosome. The end points of the in- dividual sequenced clones, as well as their overlaps and rela- tive order along the chromosome, were only very roughly determined in these fingerprint clone contigs. Thus, the prob- lem of clone order needed to be solved along with the prob- lem of initial sequence contig order and orientation. Initial sequence contigs from different clones within a fingerprint clone contig often showed long sequence overlaps, giving strong evidence of clone order, but not giving an entirely unambiguous signal because of the occasional presence of Methods Cold Spring Harbor Laboratory Press on September 22, 2011 - Published by genome.cshlp.org Downloaded from

Slide 13

Slide 13 text

HGP Assembly (1) • Assembling the reads in a clone: PHRAP • Find pairwise word matches between reads • Smith-waterman align all pairs incorporating quality score • Greedily combine, in order of a score based on SW score and Quality • (plus tons of heuristics and trimming)

Slide 14

Slide 14 text

HGP Assembly (2) • Contigs into genome: gigAssembler • 400,000 contigs from ~30,000 clones • Use physical map to order clones • Merge clones based on contigs • Build ordering graph

Slide 15

Slide 15 text

Challenge: clone overlap (ii) For the nth clone, Offset(n + 1) = Offset Size(n) מ Overlap(n,n + 1), where the size re 4 Three overlapping draft clones: A, B, and C. Each wo initial sequence contigs. Note that initial sequence c b1, and a2 overlap as do b2 and c1.

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Comparison between assemblies ‣ Celera: 2.9 Gb in 54,061 sequences ‣ HGP: 2.9 Gb in 6,094 sequences Aach et al. 2001

Slide 18

Slide 18 text

Maps ‣ Genetic (Linkage) and Radiation Hybrid (RH) Maps ‣ Physical Maps

Slide 19

Slide 19 text

Why maps are important? ‣ Human genome is repeat rich ‣ Humans are outbred ‣ Effective against cloning biases

Slide 20

Slide 20 text

Linkage Mapping Requirement = polymorphic markers

Slide 21

Slide 21 text

Microsatellite Linkage Map Weissenbach et al. 1992 Distance unit = Centimorgan (cM)

Slide 22

Slide 22 text

Microsatellite Linkage Map Dib et al. 1996

Slide 23

Slide 23 text

RH Maps Walter et al. 1993

Slide 24

Slide 24 text

RH Maps Gyapay et al. 1996 Distance unit = centiRay (cRN rad)

Slide 25

Slide 25 text

RH Maps Gyapay et al. 1996 Distance unit = centiRay (cRN rad)

Slide 26

Slide 26 text

Physical Map arrangement of BACs along 8q21 Olson 2001

Slide 27

Slide 27 text

Fingerprinting BACs McPherson et al. 2001

Slide 28

Slide 28 text

Physical Map arrangement of BACs along 8q21

Slide 29

Slide 29 text

A variety of markers ‣ STSs ‣ ESTs ‣ SSRs ‣ SNPs

Slide 30

Slide 30 text

Understanding N50 Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N.

Slide 31

Slide 31 text

2001 Genome Stats

Slide 32

Slide 32 text

Close race ‣ June 22, 2000 - HGP assembly is produced ‣ June 25, 2000 - Celera assembly is produced ‣ June 7, 2000 - UCSC Genome Browser goes on-line

Slide 33

Slide 33 text

Principal findings

Slide 34

Slide 34 text

Compositional variation

Slide 35

Slide 35 text

Variation in recombination rate

Slide 36

Slide 36 text

Repetitive landscape

Slide 37

Slide 37 text

Properties of genes

Slide 38

Slide 38 text

Duplication complexity

Slide 39

Slide 39 text

Conservation = Function From Dubchak et al. 2000

Slide 40

Slide 40 text

Conservation = Function

Slide 41

Slide 41 text

Disease through population genomics • To cure a disease we need to understand its molecular mechanism • Some diseases are caused by a single mutation (color blindness). • Most diseases have VERY complex genotype (cancer, diabetes, allergy, heart conditions, autism, bipolar disorder).

Slide 42

Slide 42 text

Single Nucleotide Polymorphisms (SNPs) ATGGCGCTGCGCTGT ATGGCGCTGCGGTGT ATGGCGCTGCGGTGT ATGGCGCTGCGGTGT ATGGCGCTGCGATGT C = 0; G = 1; A = 2 unrelated individuals

Slide 43

Slide 43 text

Polymorphism analysis helps understand complex diseases (takes 10-5 - 10-6 SNPs) 010110201002001 210012021100010 010120120000012 210220001200210 000221012011020 210021000210220 121000210210211 000210222111020 Control (healthy) Case (disease)

Slide 44

Slide 44 text

https://liorpachter.wordpress.com/seq/

Slide 45

Slide 45 text

Reconstructing strings from k-mers Composition3(TATGGGGTGC) = {ATG,GGG,GGG,GGT,GTG,TAT,TGC,TGG}

Slide 46

Slide 46 text

AAT ATG GTT TAA TGT

Slide 47

Slide 47 text

TAA AAT ATG TGT GTT TAATGTT AAT ATG GTT TAA TGT

Slide 48

Slide 48 text

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT TAA AAT ATG TGC? TGG? TGT?

Slide 49

Slide 49 text

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT TAA AAT ATG TGT GTT…

Slide 50

Slide 50 text

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT TAA AAT ATG TGC GCC CCA CAT ATG TGG GGA GAT ATG TGT GTT TAATGCCATGGATGTT

Slide 51

Slide 51 text

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TAATGCCATGGGATGTT

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Adjacency matrix

Slide 56

Slide 56 text

Universal strings

Slide 57

Slide 57 text

DeBrujin graph

Slide 58

Slide 58 text

Simplifying by ‘gluing’ nodes

Slide 59

Slide 59 text

Adjacency matrix

Slide 60

Slide 60 text

Eulerian path through DeBrujin graph

Slide 61

Slide 61 text

Node gluing construction

Slide 62

Slide 62 text

k-mer composition construction

Slide 63

Slide 63 text

Hamilton vs. Euler

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Balanced/Unbalanced Connected/Disconnected

Slide 66

Slide 66 text

Every balanced, strongly connected directed graph is Eulerian

Slide 67

Slide 67 text

Tangledness vs. read length

Slide 68

Slide 68 text

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT TAA AAT ATG TGC GCC CCA CAT ATG TGG GGA GAT ATG TGT GTT TAATGCCATGGATGTT

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

k-mers and coverage

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

N50 size Def: 50% of the genome is in contigs as large as the N50 value Example: 1 Mbp genome N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500kbp) A greater N50 is indicative of improvement in every dimension: •  Better resolution of genes and flanking regulatory regions •  Better resolution of transposons and other complex sequences •  Better resolution of chromosome organization •  Better sequence for all downstream analysis 1000 300 45 30 100 20 15 15 10 . . . . . 45 50%

Slide 82

Slide 82 text

High-quality draft assemblies of mammalian genomes from massively parallel sequence data Gnerre et al (2010) PNAS. doi: 10.1073/pnas.1017351108 Short Read Assembly with ALLPATHS