
Lecture 12: Genome Assembly I

shaunmahony
February 22, 2022

BMMB 554 Lecture 12

Transcript

  1. Today’s learning objectives

    • Become familiar with basic concepts in genome assembly
    • Understand the overlap-layout-consensus methods of assembly
    • Introduce graph concepts & terminology
  2. Some Terminology

    • read: an individual sequence that comes out of the sequencer
    • mate pair: a pair of reads from the two ends of the same DNA fragment
    • contig: a contiguous sequence formed by several overlapping reads with no gaps
    • supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
    • consensus sequence: the sequence derived from the multiple sequence alignment of the reads in a contig

    [Figure: Steps to Assemble a Genome, ending in the consensus ..ACTAGGACAAGATTGG..]
  3. Lander-Waterman statistics

    How many reads do we need to cover the whole genome?

    • G = genome length
    • N = number of reads
    • L = length of each read
    • C = NL/G = “coverage”

    The expected number of gaps is given by:

    $$E(\#\,\text{gaps}) = \frac{GC}{L}\, e^{-C} = N e^{-C}$$

    Described by a Poisson probability distribution.
  4. Lander-Waterman statistics

    How many reads do we need to cover the whole genome?

    • G = genome length
    • N = number of reads
    • L = length of each read
    • C = NL/G = “coverage”
    • T = minimum detectable overlap between reads

    The expected number of gaps is given by:

    $$E(\#\,\text{gaps}) = N e^{-C\alpha}, \quad \text{where } \alpha = 1 - T/L$$
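
The Lander-Waterman formulas above can be evaluated directly. The following is a minimal sketch; the function names and the example genome size, read length, and overlap threshold are illustrative assumptions, not values from the lecture. Setting T = 0 gives α = 1 and recovers the simpler formula N e^{-C}.

```python
import math

def expected_gaps(G, N, L, T=0):
    """Expected number of gaps: N * exp(-C * alpha), with C = N*L/G and alpha = 1 - T/L."""
    C = N * L / G          # coverage
    alpha = 1 - T / L      # T = 0 reduces to the basic formula N * exp(-C)
    return N * math.exp(-C * alpha)

def reads_for_coverage(G, L, C):
    """Number of reads of length L needed to reach coverage C on a genome of length G."""
    return math.ceil(C * G / L)

# Hypothetical example: 1 Mb genome, 500 bp reads, 50 bp minimum detectable overlap.
G, L, T = 1_000_000, 500, 50
for C in (1, 5, 10):
    N = reads_for_coverage(G, L, C)
    print(f"C={C:>2}  N={N:>6}  E(gaps)={expected_gaps(G, N, L, T):.2f}")
```

Because the expected number of gaps falls off exponentially with coverage, a modest increase in C removes most gaps, while requiring a longer detectable overlap T (smaller α) leaves more of them.
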
  5. Repeats complicate assembly

    Repeat types:
    • Low-complexity DNA (e.g. ATATATATACATA…)
    • Microsatellite repeats: $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
    • Transposons
      • SINE (Short Interspersed Nuclear Elements), e.g. ALU: ~300 bp long, $10^6$ copies
      • LINE (Long Interspersed Nuclear Elements): ~4000 bp long, 200,000 copies
      • LTR retrotransposons (Long Terminal Repeats (~700 bp) at each end)
    • Gene families: genes duplicate & then diverge (paralogs)
    • Recent duplications: ~100,000 bp long, very similar copies
  6. What can we do about repeats?

    Two main approaches:
    • Cluster the reads experimentally
    • Link the reads
  9. The human genome project: Whole-genome shotgun vs hierarchical

    1997:
    • Gene Myers (→ Celera): “Let’s sequence the human genome with the shotgun strategy”
    • Phil Green (→ public effort): “That’s impossible, and a bad idea anyway”
  10. Steps to Assemble a Genome

    1. Find overlapping reads
    2. Merge some “good” pairs of reads into longer contigs
    3. Link contigs to form supercontigs
    4. Derive consensus sequence: ..ACTAGGACAAGATTGG..

    Overlap, Layout, Consensus. Example: ARACHNE genome assembler
  11. 1. Find Overlapping Reads

    Each read is broken into its overlapping 9-letter words:
    • aaactgcagtacggatct: aaactgcag aactgcagt … gtacggatc tacggatct
    • gggcccaaactgcagtac: gggcccaaa ggcccaaac … actgcagta ctgcagtac
    • gtacggatctactacaca: gtacggatc tacggatct … ctactacac tactacaca

    Words listed by (read, pos., word, orient.):
    aaactgcag aactgcagt actgcagta … gtacggatc tacggatct
    gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac
    gtacggatc tacggatct acggatcta … ctactacac tactacaca

    Words sorted by (word, read, orient., pos.), so that words shared between reads end up adjacent:
    aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca
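
The sorted word table can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are made up for this example, reads are identified by their list index, and read orientation (reverse complements) is ignored.

```python
from itertools import groupby

def sorted_word_table(reads, k):
    """Return (word, read_id, pos) rows for every k-letter word, sorted by word."""
    table = [
        (read[pos:pos + k], read_id, pos)
        for read_id, read in enumerate(reads)
        for pos in range(len(read) - k + 1)
    ]
    table.sort()  # identical words from different reads become adjacent
    return table

def shared_words(table):
    """Map each word found in more than one read to the sorted list of those reads."""
    shared = {}
    for word, rows in groupby(table, key=lambda row: row[0]):
        read_ids = sorted({read_id for _, read_id, _ in rows})
        if len(read_ids) > 1:
            shared[word] = read_ids
    return shared

# The three reads from the slide, with 9-letter words.
reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
for word, read_ids in shared_words(sorted_word_table(reads, 9)).items():
    print(word, read_ids)
```

Sorting replaces an all-against-all comparison of reads with a single pass over adjacent rows of the table.
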
  12. Overlap Alignment

    X = ACCGATGTACTGT
    Y = TACTGTTTAATC

    X: ACCGATGTACTGT------
    Y: -------TACTGTTTAATC
  13. Overlap alignment

    • Problem: Find the optimal overlap alignment between sequences X and Y.
    • Same as global alignment, but don’t penalize overhanging ends.

    [Figure: example overlap alignments of X and Y]
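  14. Overlap alignment algorithm

    Initialization:

    $$F_{0,0} = 0, \qquad F_{0,j} = 0 \;\; (j = 1 \ldots N), \qquad F_{i,0} = 0 \;\; (i = 1 \ldots M)$$

    Iteration: for each i = 1…M, for each j = 1…N:

    $$F_{i,j} = \max \begin{cases} F_{i-1,j-1} + s(X_i, Y_j) & \text{[match]} \\ F_{i-1,j} - d & \text{[gap in X]} \\ F_{i,j-1} - d & \text{[gap in Y]} \end{cases}$$

    $$\mathrm{Ptr}_{i,j} = \begin{cases} \text{DIAG} & \text{if [match]} \\ \text{LEFT} & \text{if [gap in X]} \\ \text{UP} & \text{if [gap in Y]} \end{cases}$$

    Termination: the optimal alignment score is the maximum score in $F_{\{1 \ldots M\},N}$ or $F_{M,\{1 \ldots N\}}$.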
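
The recurrence above translates into a short dynamic-programming routine. This is a minimal sketch rather than a reference implementation: the scoring values (match +1, mismatch -1, gap penalty d = 2) are assumptions chosen for illustration, and the pointer names follow the slide (LEFT for the F[i-1][j] case, UP for the F[i][j-1] case).

```python
def overlap_alignment(x, y, match=1, mismatch=-1, d=2):
    """Overlap alignment of x and y: global alignment without penalizing overhanging ends."""
    M, N = len(x), len(y)
    # Initialization: first row and first column are 0, so overhangs at the start are free.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            choices = [
                (F[i - 1][j - 1] + s, "DIAG"),
                (F[i - 1][j] - d, "LEFT"),   # x[i-1] is aligned to a gap
                (F[i][j - 1] - d, "UP"),     # y[j-1] is aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])

    # Termination: best score anywhere in the last row or the last column.
    candidates = [(F[M][j], M, j) for j in range(N + 1)]
    candidates += [(F[i][N], i, N) for i in range(M + 1)]
    score, i, j = max(candidates)

    # Traceback until the first row or column is reached.
    aligned_x, aligned_y = [], []
    while i > 0 and j > 0:
        if ptr[i][j] == "DIAG":
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "LEFT":
            aligned_x.append(x[i - 1]); aligned_y.append("-"); i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1]); j -= 1
    return score, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))

# The example pair from the Overlap Alignment slide.
print(overlap_alignment("ACCGATGTACTGT", "TACTGTTTAATC"))
```

Only the aligned (overlapping) portion is returned; the unpenalized overhangs are whatever remains of each sequence on either side.
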
  15. 1. Find Overlapping Reads

    • Find pairs of reads sharing a k-mer, k ~ 24
    • Extend to full alignment → throw away if not >98% similar

    [Figure: two reads sharing the k-mer TAGATTACACAGATTAC, with the alignment extended into the flanking sequence]

    • Caveat: repeats
      § A k-mer that occurs N times causes $O(N^2)$ read/read comparisons
      § ALU k-mers could cause up to $1{,}000{,}000^2$ comparisons
    • Solution:
      § Discard all k-mers that occur “too often”
        • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
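
The k-mer seeding step with a frequency cutoff can be sketched as follows. This is an illustration under assumptions: `candidate_pairs` and `max_occurrences` are hypothetical names, orientation is ignored, and the subsequent extension to a full alignment with the >98% similarity filter is not shown.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k, max_occurrences):
    """Return pairs of read indices that share at least one k-mer.

    k-mers occurring more than max_occurrences times are discarded,
    since a k-mer seen N times would trigger O(N^2) comparisons.
    """
    index = defaultdict(list)  # k-mer -> list of (read_id, position)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))

    pairs = set()
    for hits in index.values():
        if len(hits) > max_occurrences:
            continue  # "too often": likely a repeat k-mer
        read_ids = sorted({read_id for read_id, _ in hits})
        pairs.update(combinations(read_ids, 2))
    return pairs

reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
print(sorted(candidate_pairs(reads, k=9, max_occurrences=10)))
```

Lowering `max_occurrences` trades sensitivity for speed: fewer comparisons are made, but true overlaps seeded only by frequent k-mers are missed.
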
  16. 1. Find Overlapping Reads

    Create local multiple alignments from the overlapping reads

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG TTACACAGATTATTGA
    TAGATTACACAGATTACTGA

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
  17. 1. Find Overlapping Reads

    • Correct errors using multiple alignment

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAGATTACACAGATTATTGA   (replace T with C)
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTACTGA   (insert A)

    Correlated errors, probably caused by repeats → disentangle overlaps

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA

    In practice, error correction removes up to 98% of the errors

    TAGATTACACAGATTACTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
    TAGATTACACAGATTACTGA
    TAG-TTACACAGATTATTGA
  18. 2. Merge Reads into Contigs

    • Overlap graph:
      • Nodes: reads $r_1 \ldots r_n$
      • Edges: overlaps $(r_i, r_j, \text{shift}, \text{orientation}, \text{score})$

    [Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]

    Note: of course, we don’t know the “color” of these nodes.
  19. 2. Merge Reads into Contigs

    We want to merge reads up to potential repeat boundaries.

    [Figure labels: repeat region; Unique Contig; Overcollapsed Contig]
  20. 2. Merge Reads into Contigs

    • Ignore “hanging” reads when detecting repeat boundaries

    [Figure labels: sequencing error; repeat boundary???; reads a, b, …]
  21. 2. Merge Reads into Contigs

    • Remove transitively inferable overlaps
      • If read $r$ overlaps reads $r_1$ and $r_2$ to its right, and $r_1$ overlaps $r_2$, then $(r, r_2)$ can be inferred from $(r, r_1)$ and $(r_1, r_2)$

    [Figure: reads r, r1, r2, r3]
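
The rule for removing transitively inferable overlaps can be sketched on a small overlap graph. This is a simplified illustration: the graph is assumed to be a plain dictionary from each read to the set of reads it overlaps to the right, and the shift, orientation, and score attributes of real overlap edges are omitted.

```python
def remove_transitive_edges(overlaps):
    """Drop every edge (r, r2) that is implied by edges (r, r1) and (r1, r2).

    overlaps: dict mapping a read to the set of reads it overlaps to the right.
    Decisions are based on the original graph, so the result does not
    depend on the order in which reads are visited.
    """
    reduced = {read: set(targets) for read, targets in overlaps.items()}
    for read, targets in overlaps.items():
        for r1 in targets:
            for r2 in overlaps.get(r1, ()):
                if r2 in targets:
                    reduced[read].discard(r2)  # (read, r2) is inferable via r1
    return reduced

# Hypothetical example: four reads tiling a region left to right.
overlaps = {
    "r": {"r1", "r2", "r3"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
}
print(remove_transitive_edges(overlaps))
```

After the reduction each read points only to its nearest neighbor on the right, which turns a dense tangle of overlaps into a simple chain that is easier to merge into a contig.
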
  22. 4. Derive Consensus Sequence

    Derive multiple alignment from pairwise read alignments

    TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
    TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
    TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

    Derive each consensus base by weighted voting
    (Alternative: take maximum-quality letter)
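
Weighted voting for the consensus can be sketched as below. The slide does not say what the weights are; this sketch assumes each base votes with its Phred quality score, and it assumes gap-free reads of equal length that are already aligned column by column. The function name and example data are hypothetical.

```python
from collections import defaultdict

def weighted_consensus(aligned_reads, qualities):
    """Pick, for each alignment column, the base with the largest total quality.

    aligned_reads: equal-length strings, one per read.
    qualities: one list of per-base quality scores per read (assumed weights).
    """
    consensus = []
    for col in range(len(aligned_reads[0])):
        votes = defaultdict(float)
        for read, quals in zip(aligned_reads, qualities):
            votes[read[col]] += quals[col]
        consensus.append(max(votes, key=votes.get))
    return "".join(consensus)

# Hypothetical example: the third read disagrees at one position with low quality.
reads = ["TAGATTACA", "TAGATTACA", "TAGCTTACA"]
quals = [[30] * 9, [30] * 9, [10] * 9]
print(weighted_consensus(reads, quals))
```

Unlike a simple majority vote, a single confident base call can outweigh several low-quality calls that disagree with it.
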
  23. Base quality (PHRED scores)

    • PHRED: Phil’s Read Editor (Phil Green, U. Washington)
      • Method for calling base letters from Sanger sequencers
    • Phred quality scores are a convenient way to represent confidence in individual base calls.

    Base:     A  C  G  A  A  T  C  A  G
    Quality: 16 18 21 23 25 15 28 30 32

    Quality scores: $Q = -10 \log_{10} P(\text{error})$
    • Phred score of 40: $10^{-40/10}$ chance of error = 99.99% confidence in call
    • Phred score of 20: $10^{-20/10}$ chance of error = 99% confidence in call
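
The relationship between Phred scores and error probabilities can be checked numerically. A minimal sketch, with illustrative function names:

```python
import math

def phred_to_error_prob(q):
    """Error probability for Phred score q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

# The quality scores shown for the example read on the slide.
for base, q in zip("ACGAATCAG", [16, 18, 21, 23, 25, 15, 28, 30, 32]):
    p = phred_to_error_prob(q)
    print(f"{base}  Q={q}  P(error)={p:.4f}  confidence={100 * (1 - p):.2f}%")
```

Each increase of 10 in the Phred score corresponds to a tenfold drop in the probability that the base call is wrong.
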
  24. Some Assemblers

    Overlap → layout → consensus:
    • PHRAP: Phil’s Revised Assembly Program
      • Early assembler, widely used, good model of read errors
      • Overlap $O(n^2)$ → layout (no mate pairs) → consensus
    • Celera (Myers)
      • First assembler to handle large genomes (fly, human, mouse)
      • Overlap → layout → consensus
    • Arachne (Batzoglou)
      • Public assembler (mouse, several fungi)
      • Overlap → layout → consensus

    String graphs, de Bruijn graphs:
    • Euler (Pevzner)
      • Indexing → de Bruijn graph → picking paths → consensus
    • Velvet (Birney)
      • Short reads → small genomes → simplification → error correction
  25. Graphs

    • A graph is a network composed of two sets of objects:
      • Nodes: each node is represented by a point.
      • Edges: each edge is represented by a segment connecting two nodes.
    • Graph theory can be applied to many different problems:
      • Transportation networks
      • Disease epidemics
      • Computer network behavior
      • Genome sequencing

    Slides adapted from Compeau & Pevzner
  26. Icosian Game Graph (Hamilton, 1857)

    • For the Icosian Game, we create a graph:
      • Nodes = islands
      • Edges = bridges connecting the islands

    Problem: Find a path that visits each island once
  27. Hamiltonian Cycles

    • A Hamiltonian cycle in a graph is a cycle that uses each node exactly once.
    • A graph containing such a cycle is called Hamiltonian.
    • Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.

    [Figure: graph with nodes numbered 1 to 20]
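
For small graphs, the Hamiltonian Cycle Problem can be attacked by exhaustive backtracking. The sketch below is illustrative only: it assumes the graph is given as an adjacency dictionary, and its running time grows exponentially in the worst case, which is consistent with HCP being NP-complete (no efficient general algorithm is known).

```python
def hamiltonian_cycle(adj):
    """Return a Hamiltonian cycle as a list of nodes (first node repeated at the end), or None.

    adj: dict mapping each node to an iterable of its neighbors.
    """
    nodes = list(adj)
    if not nodes:
        return None
    start = nodes[0]
    path, visited = [start], {start}

    def extend():
        if len(path) == len(nodes):
            # Every node is used once; the cycle closes only if we can return to start.
            return start in adj[path[-1]]
        for neighbor in adj[path[-1]]:
            if neighbor not in visited:
                visited.add(neighbor)
                path.append(neighbor)
                if extend():
                    return True
                visited.remove(neighbor)  # dead end: backtrack
                path.pop()
        return False

    return path + [start] if extend() else None

# Hypothetical examples: a square has a Hamiltonian cycle, a simple path does not.
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
line = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(hamiltonian_cycle(square))
print(hamiltonian_cycle(line))
```
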
  28. Königsberg Bridges Graph (Euler, 1735)

    • For the Königsberg Bridge Problem, we create a graph:
      • Nodes = 4 land masses of the city
      • Edges = 7 bridges connecting land areas

    Problem: Find a path that goes over each bridge once
  29. Eulerian Cycles

    • Cycle: a path in a graph where the first and last nodes are the same.
    • An Eulerian cycle is a cycle that traverses each edge exactly once.
    • A graph containing such a cycle is called Eulerian.
    • If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph.
    • However, no such cycle exists.
    • Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.
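
Why the Königsberg graph has no Eulerian cycle can be checked with a degree count. This sketch relies on the undirected form of Euler's theorem, which is not stated on the slide: a connected undirected graph has an Eulerian cycle exactly when every node has even degree. The land-mass labels A to D are hypothetical, with A standing for the central island.

```python
from collections import Counter

# The 7 bridges between the 4 land masses (labels are illustrative).
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

def node_degrees(edges):
    """Count how many edge endpoints touch each node of an undirected multigraph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree

def all_degrees_even(edges):
    """Necessary condition for an Eulerian cycle in an undirected graph."""
    return all(d % 2 == 0 for d in node_degrees(edges).values())

print(dict(node_degrees(bridges)))
print(all_degrees_even(bridges))
```

Every land mass touches an odd number of bridges, so a walk that enters and leaves each land mass using each bridge once cannot close into a cycle.
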
  30. Directed Graphs

    • Directed Graph: A graph in which each edge has a direction (represented by an arrow).
    • An Eulerian cycle in a directed graph is simply a cycle that travels down all the edges in the correct direction.

    [Figure: Undirected Graph; Directed Graph]
  31. Balanced Graphs

    • indegree(v) = the number of edges leading into node v.
    • outdegree(v) = the number of edges leading out of v.
    • A graph is balanced if indegree(v) = outdegree(v) for every node v.
    • Label each node v with (indegree(v), outdegree(v))
    • This graph isn’t balanced since some nodes don’t have equal indegree and outdegree.

    [Figure: node labels (1, 2) (2, 1) (1, 0) (2, 1) (1, 1) (0, 2) (1, 1)]
  32. Euler’s Theorem (directed graphs)

    • A graph is connected if for every pair of nodes {u, v}, we can travel either from u to v or from v to u.

    [Figure: a graph labeled Not Connected; a graph with node labels (2, 2) (2, 2) (1, 1) (2, 2) (1, 1) (2, 2) (1, 1)]

    Connected + Balanced = Eulerian

    Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.
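
Euler's theorem gives a fast test for whether a directed graph is Eulerian. The sketch below assumes the graph is a list of (u, v) edge pairs and checks connectivity while ignoring edge direction; for a balanced graph this is sufficient, because a balanced graph that is connected in this weaker sense can always be traversed in both directions between any two nodes. Function names are illustrative.

```python
from collections import defaultdict

def degree_labels(edges):
    """Label each node with (indegree, outdegree)."""
    indegree, outdegree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdegree[u] += 1
        indegree[v] += 1
    nodes = set(indegree) | set(outdegree)
    return {node: (indegree[node], outdegree[node]) for node in nodes}

def is_balanced(edges):
    """True if indegree equals outdegree at every node."""
    return all(i == o for i, o in degree_labels(edges).values())

def is_connected(edges):
    """True if all nodes are reachable from one another when direction is ignored."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    if not neighbors:
        return True
    start = next(iter(neighbors))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(neighbors)

def has_eulerian_cycle(edges):
    """Euler's theorem: connected + balanced = Eulerian."""
    return is_connected(edges) and is_balanced(edges)

# Hypothetical examples.
print(has_eulerian_cycle([("a", "b"), ("b", "c"), ("c", "a")]))  # directed triangle
print(has_eulerian_cycle([("a", "b"), ("b", "c")]))              # open path
```

Both checks run in time proportional to the number of edges, in contrast to the exhaustive search needed for Hamiltonian cycles.
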
  33. Summary

    • The Overlap-Layout-Consensus approach is used to assemble genomes from long reads (e.g. from Sanger sequencing).
    • Relies on overlap alignment approaches.
    • Builds & processes overlap graphs to account for repeats.