Recent Progress in Long Read Genome Assembly: Theory, Practice and Future Challenges

5633e4eaa009d960042a8f32b55b3d7f?s=47 Jason Chin
December 06, 2016

Recent Progress in Long Read Genome Assembly: Theory, Practice and Future Challenges

With the long reads generated with PacBio® Single Molecule Real-Time Sequencing, many algorithms developed since the late 90’s for assembling reads > 500bp have become important again. Recent advances, in aspects of both computation and algorithm, have made assembling large and difficult genomes more feasible than before. We will overview the progress from both a theoretical and a practical point of view. We compare various assembly results for human genomes with different approaches. Beyond human genomes, we discuss the complexity for assembling high heterozygosity or highly repetitive genomes and the current solutions. Furthermore, we show how to resolve repeat-induced ambiguity with detailed analysis of the assembly graph motifs and how to recover useful information lost when the assembly graph is simplified into contigs. At the end we discuss the current challenges and open problems in the ongoing development work toward generating perfect genome assemblies.

5633e4eaa009d960042a8f32b55b3d7f?s=128

Jason Chin

December 06, 2016
Tweet

Transcript

  1. For Research Use Only. Not for use in diagnostics procedures.

    © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. Recent Progress in Long Read Genome Assembly: Theory, Practice and Future Challenges Jason Chin, Scientific Fellow, Pacific Biosciences Bioinforsummer 2016
  2. WHAT HAPPENS AT THE “KINK”? 2 Discontinuity in the Genomic

    “Moore’s Law” happens in both direction, we all ignore the inconvenient truth. According to NHGRI website, the definition of “sequencing a genome” is changed at year 2008. De Novo Genome “Re-sequencing” 600 bp 50 bp 300 bp PacBio Founded Oxford Nanopore Founded De novo human genome (N50 ~ 1 to 10Mbp) price with long read in 2014 Human Genome Assembled with PacBio Data 12x reduction in read length
  3. MANY GENOME ASSEMBLERS “Why is genome assembly not a solved

    problem?” - Gene Myers (Dagstuhl Workshop, Aug. 2016)
  4. 18 YEARS OLD “QUALITY-FIRST” CREDO FROM HGP 4 Accuracy minimum

    at QV40 Contiguity minimum at 100kb “Contiguity has proven particular elusive.”
  5. IN RETROSPECT…. http://www.pitt.edu/~super7/45011-46001/45771.ppt by Doug Brutlag

  6. Year Technology Assembler Sample 2007 ABI 3730 Celera HuRef 2009

    Illumina GA SOAP de novo BGI YH 2010 454 GS Flx Titanium Newbler KB1 2010 Illumina GA ALLPATHS-LG NA12878 2013 454 GS, HiSeq, MiSeq Newbler RP11_0.7 2014 HiSeq, BAC clones Reference- guided CHM1 2014 PacBio RS II FALCON CHM1 2015 PacBio RS II FALCON CHM13 2015 PacBio RS II FALCON AK1 2015 PacBio RS II FALCON HuRef 2015 PacBio RS II FALCON PC-9* 2015 PacBio RS II FALCON SK-BR-3* HUMAN ASSEMBLY BENCHMARKS (UP TO 2015) Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/ 20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/ early/2010/12/20/1017351108.abstract Table3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/) *cancer cell lines 0.11 0.007 0.006 0.024 0.13 0.14 4.38 12.98 7.28 10.38 3.58 2.56 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 26.9 Mb - NCBI: GCA_001297185.1 (2015) Contig N50 (Mb)
  7. JUST TO BRAG MORE ABOUT THAT WE ARE REALLY GOOD

    AT HUMAN -The PacBio CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with max contigs 107 Mb and 109 Mb long, respectively. Source: PacBio blog post, Tuesday September 29, 2015, http://pacb.com/blog
  8. #1MBCTGCLUB @ 2016 PLANT AND ANIMAL GENOME MEETING

  9. SINGLE MOLECULE REAL-TIME SEQUENCING http://www.pacb.com/smrt-science/smrt-sequencing/ • Optical confinement detecting single-

    molecule biochemistry event • Analog incorporation by single polymerase generates light pulse • Do not need some biochemical steps to synchronize multiple molecules
  10. SYNCHRONOUS AND ASYNCHRONOUS SEQUENCING Maxam-Gilbert Sanger (Chain-termination) SOLiD™ Sequencing by

    synthesis (Illumina®) 454™ Ion Torrent™ • Boost “signal”: chemical steps to synchronize molecule populations for every base • “De-synchronization,” e.g. caused by incomplete reactions, increases errors and limits the read lengths PacBio® SMRT Oxford Nanopore • No chemistry step to synchronize signals -- insertion and deletion errors are most common • Trade-off reduces accuracy to gain read length. Read length is not limited by “de-synchronization.” SYNC Sequencing (Synchronized many molecules) ASYNC Sequencing (Single molecules) Genia Nanopore
  11. IN SILICO SYNC - ALIGNMENT TO CONSENSUS T T 1

    C 1 A A 1 T 1 A 1 A C 1 A 1 T 1 T C 1 G 1 G 1 T 1 1 1 1 A 1 A 1 1 read A T 2 T 2 A 1 C 1 - 1 G 1 1 T T 1 A 1 1 A 3 T 3 A 1 2 1 A C 2 A 3 T 3 T C 1 G 3 G 3 - 1 T 2 T 1 1 - 2 3 2 1 1 2 A 2 1 A 3 2 + 1 read A T 3 A 1 T 4 A 1 1 C 1 - 3 G 1 1 T T 1 C 1 A 2 1 1 A 4 - 1 T 4 1 A 1 4 1 A C 3 C 1 - A 4 1 T 1 4 T C 1 G 5 G 5 - - 1 2 T 3 T 1 1 - 4 1 4 4 1 1 4 A 4 1 A 5 4 + 1 read A T 6 A 1 T 7 A 1 - 1 1 C 1 - 5 T 1 G 1 1 1 T T 1 C 1 A 4 1 1 A 8 - 1 T 7 - 1 1 A 1 7 - 1 1 1 A C 7 C 1 - A 8 1 T 1 6 C 1 A 1 T C 1 G 9 G 9 - - 1 2 T 7 T 1 1 - 8 A 1 1 6 1 1 6 3 1 1 1 7 1 A 7 2 A 9 8 + 1 read A T 11 A 1 T 13 A 1 - 3 1 C 1 - 11 T 1 G 1 1 3 T T 1 C 2 A 10 1 2 A 15 - 2 T 13 - 2 2 A 1 15 - 1 1 1 A C 13 C 2 - 1 A 14 1 1 T 2 12 C 1 A 2 T C 1 T 1 G 17 1 G 17 - - 1 4 T 12 2 T 2 1 1 - 15 A 1 1 14 1 1 13 4 1 2 2 13 2 A 12 4 - 1 C 1 A 15 1 16 + 1 read • Using dynamic programming / sequence alignment to synchronize bases from different reads • Representing all the alignments in a single graph • Finding the best path in the alignment graph to catch the correct sequences of (synchronized) bases Simulation shows the simple algorithm can get > 99.5% accuracy with modest coverage under different error models.
  12. HOW DO WE GET THERE FOR PERFECT ASSEMBLY?

  13. 21 SELECTED WGS ASSEMBLY PAPERS 1979-2013 Statistical theories for WGS

    assembly Greedy Heuristics Shortest Common Superstring Hamiltonian Path Landmark Genome Assemblies Automatic Assembly Overlap Graph / String Graph for Long Read Assembly De Bruijn Graph / Eulerian Path for Sequence By Hybridization / NGS
  14. THEORY FOR PERFECT ASSEMBLY All 35+ years work by many

    people summarized in 140 characters
  15. “WHY ASSEMBLY IS NOT A SOLVED PROBLEM?” Proposed answer: •

    We know too little about the true repeat structures in all genomes • We know too little about the structural difference between different haplotypes (for diploid/polyploid genome assembly) 1984 Peltola, Soderhun & Ukkonen 2016, still the same problem but at different scales A 100 kb nested repeat region
  16. SOLVE REPEAT PROBLEM: LONGER READ LENGTHS For many short read

    assembly, repeats are ignored and are hidden between the gaps. Some maybe junk, some may be important (e.g. CRISPR/Cas-9). If you can’t see them, you can’t study them.
  17. HETEROZYGOSITY INDUCE NEW GRAPH STRUCTURE Most assembly theory development focuses

    on haploid genomes and ignores more complicated assembly graph structure Biology Paired homologous chromosomes Diploid Assembly Graph
  18. - Quick review on how to make assembly contigs -

    Understand and process repeats in genomes - Understand how heterozygosity affects assembly and “unzipping” haplotypes for diploid genome assembly
  19. STRING GRAPH AND DE NOVO GENOME ASSEMBLY 19 String graph

    assembly for continuous long reads: 1. Remove contained reads 2. Convert read overlaps to string graphs 3. Construct contigs by identifying unbranched paths Reads Genome Read overlap pairs node edge
  20. UNDERSTAND AND PROCESS REPEATS IN GENOMES

  21. PRACTICAL CONSIDERATION Repeat Suppression to Reduce Graph Complexity Gene Myers

    2016 - Repeat causes tangled graph - Finding correct paths through hairball is hard - Algorithm complexity increases with more edges Example: A Human Genome Assembly Graph
  22. EXAMPLE OF UNEVEN 5’- AND 3’- OVERLAPS Read overlap pattern

    on a seed read seed read Repeat induced overlaps
  23. OVERLAP FILTERING FOR REPEAT SUPPRESSION 23 Only use these reads

    for assembly By looking at the number of 5’- or 3’- overlap, we can eliminate the reads ends in the long repeats to suppress repeat induced tanglements (Maize genome data shown below) 5’ overlap counts 3’ overlap counts
  24. RELATED RECENT PROGRESS: “HINGE” ASSEMBLER

  25. WHY CONTIGS BREAK? WHAT WE CAN DO ABOUT IT? Break

    points Contigs break because of • Not enough coverage • Repeat Induced ambiguity
  26. PARAMETER CHOICE CONSIDERATIONS • Length cutoff: Coverage vs. Repeat-resolution •

    More coverage at shorter length may not be good • Lower coverage of “enough” longer reads is good • What “enough” is “enough”? RIB limited Coverage limited Read length cutoff Assembly Contiguity (Just a guess) coverage Repeat-induced branching (RID) “Tuning assembler parameters (TAP)” developed by Shoudan may help: https://github.com/pb-sliang/TAP 748,163 1,047,929 1,293,937 1,442,972 1,272,783 753,370 2,184,510,437 2,079,537,346 2,029,248,251 1,984,532,045 1,943,822,269 1,833,838,523 1.8E+09 1.85E+09 1.9E+09 1.95E+09 2E+09 2.05E+09 2.1E+09 2.15E+09 2.2E+09 2.25E+09 - 200,000 400,000 600,000 800,000 1,000,000 1,200,000 1,400,000 1,600,000 0 5000 10000 15000 20000 25000 30000 ASSEMBLY SIZE NN50 LENGTH CUTOFF n50 assembly size
  27. GRAPH TO CONTIGS

  28. Live Demo

  29. None
  30. None
  31. None
  32. 000062 ó 000146F, AMY Ctg 62 Ctg 146 CNV region

    of AMY genes Haplotype difference
  33. 000305 ó 000074, NXF2B, LONG INVERT REPEAT Ctg 305 Ctg

    74 Ctg 305 (INV) Unique region
  34. MIS-ASSEMBLY Ctg 33 Ctg 120 Mis-assembly point

  35. CONTIGUITY: JOIN CONTIGS ACROSS COMPLICATED BUT LOCAL REPEATS • We

    find 91 junctions can be joined in our NA19240 assembly • Boost on contiguity: From N50: 24,239,162 / Max: 81,042,162 to N50: 28,125,580 / Max: 109,042,162 • What is the theoretical contig N50: • Take the current GRCh38 and breaks the reference on seg-dup > 50kb • The N50 of the unbroken segments are 30,332,297bp • Our N50 is close to the theoretical one 30.3Mbp • Largest 5 continuous non-seg-dup segments in GRCh38: 132.529Mb chr8:12,609,996-145,138,636 109.831Mb chr2:132,362,523-242,193,529 * 109.453Mb chr6:6,1159,830-170,612,704 108.732Mb chr4:9,735,201-118,467,376 104.559Mb chr5:71,364,489-175,923,361 * match our longest “scaffold” (of 3 contigs)
  36. HETEROZYGOSITY AND “UNZIPPING” HAPLOTYPES

  37. WE FINALLY DOCUMENTED DIPLOID ASSEMBLY IDEAS Col-0 Cvi-0 Col-0 x

    Cvi-0 - Build good datasets for algorithm development and validation - Testing on other challenging genomes Other diploid genomes have been assembled
  38. A COUPLE YEAR AGO: CLOSE LOOK FOR CHR. 5 38

    When aligning to the reference, contigs overlap but still break. Surprised ”inbred” assembly results observed in 2014
  39. BROKEN CONTIGS AND THE OVERLAPPED GRAPH 39 8 contigs mapped

    to Chr5:17,460,000- 17,490,000 “Best Edge” Overlap Graph of all reads mapped to the region indicate two haplotypes observed.
  40. A QUICK COMPUTATIONAL EXPERIMENT FOR DIPLOID ASSEMBLY How does the

    overlap graph looks like for a “simulated diploid genome”? - Take p-reads from the “good part” (10Mb) of Chr5 and do a assembly with Celera Assembler to check out the overlapped graph. - The overlap graph has a global linear structure + some “bubbles”. This is good from algorithm point of view. Namely, we can generate useful information about a diploid genome easier if overlap graph looks like this. 40 Contig Stats #Seqs 261 Max 912,094 Total 17,070,679 n50 109,783
  41. WHY DO WE SEE BUBBLES? SNPs SNPs SNPs SVs SVs

    SNPs SNPs SNPs SVs SVs Haplotype 1 Haplotype 2 Genome Sequences Assembly Graph In most OLC assembler designs, the overlapper does not catch differences at the SNP level but structural variations are naturally segregated.
  42. SOLVING THE DIPLOID ASSEMBLY PROBLEM - Bubbles = big variants

    between the haplotypes - Collapsed Path = smaller variants between the haplotypes - FALCON (a polyploid-aware assembler): generating the contigs through the bubbles - FALCON-Unzip: identifying smaller variants and using them to separate the haplotypes
  43. THE FALCON UNZIP PROCESS SNPs SNPs SNPs SVs SVs Associate

    contig 1 (Alternative allele) Associate contig 2 (Alternative allele) SNPs SNPs SNPs SVs SVs Primary contig FALCON FALCON-Unzip Updated primary contig + “associate haplotigs” Augmented with haplotype information of each reads Andrew J Makoff, Genome Biol. 2007; 8(6): R114.
  44. PHASING READ INTO HAPLOTYPE GROUPS Haplotype 0 Haplotype 1 Identify

    het- SNPs Phase het- SNPs Group reads with phased SNPs Reconstruct haplotypes Align SMRT reads to the initial primary contig More het-SNPs in longer reads: 8% to 15% sequence error rate is not an issues given enough long-read coverage for phasing.
  45. MERGE HAPLOTYPE INFORMATION AND “UNZIP” Tiling path of haplotype 0

    Tiling path of haplotype 1 Remove edges connecting different haplotypes
  46. PUT EVERYTHING TOGETHER “FALCON-Unzip Process” ~ 4.80 Mb Add missing

    haplotype specific nodes & edges Remove edges that connect different haplotypes The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs. 4 major haplotype phased blocks determined by het-SNPs Un-phased region
  47. ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY STATISTICS Strain Inbred Col-0 Inbred

    Cvi-0 Col-0 x Cvi-0 F1 Assembler FALCON (primary ctg) FALCON (primary ctg) FALCON FALCON-Unzip FALCON-Unzip primary contigs primary contigs haplotigs Assembly Size (Mb) 120 120 143 140 105 # contigs 337 260 426 172 248 N50 size (Mb) 7.35 6.07 7.92 7.96 6.92 Max Contig size (Mb) 12.20 14.37 13.39 13.32 11.65 120 120 143 140 105 0 20 40 60 80 100 120 140 160 (primary ctg) (primary ctg) primary contigs primary contigs haplotigs FALCONFALCON FALCON FALCON -Unzip FALCON -Unzip Assembly Size (Mb) 7.35 6.07 7.92 7.96 6.92 0 2 4 6 8 10 N50 size (Mb)
  48. HUMAN *DIPLOID* ASSEMBLY BENCHMARKS 2 Assembly Sequence Assembly size (Gb)

    # contigs Contig N50 size (kb) Contig N90 (kb) Max contig size (Mb) PacBio Assemblies HG002* p-contigs 2.77 1,833 13,103 1,046 57.2 haplotigs 1.45 17,831 115 37 2.9 WI-38^ p-contigs 2.90 1,849 20,781 2,099 93.7 haplotigs 1.90 7,668 399 116 2.4 NA12878 p-contigs 2.82 1,661 13,394 1,762 70.3 haplotigs 1.63 13,036 192 57 2.5 NA19240-1*,+ p-contigs 2.85 2,238 4,945 1,089 20.0 haplotigs 2.30 9,916 440 111 2.6 NA19240-2+ p-contigs 2.85 1,599 24,314 3,010 81.1 haplotigs 2.36 9,764 468 117 2.6 Linked short read assembly NA19240- short-read* pseudohap2.1 2.82 58,308 166 38 1.5 pseudohap2.2 2.82 58,308 166 38 1.5 * Publically available assemblies from bioRxiv 067447; High-Quality Assembly of an Individual of Yoruban Descent, Karyn Meltz Steinberg, et. al (https://www.dropbox.com/sh/8gsdycmqk04ei7s/AADZJgJtcK_mw16CYpoSpJhca) + NA19240-1 & NA19240-2 are assembled with the same input data. NA19240-2 uses a recently updated assembly algorithm. ^ See poster 2238F in “Genome Structure and Function” section
  49. HETEROZYGOUS VARIATIONS AT MANY SCALES

  50. GENERAL SCHEMATIC ABOUT HOW DIFFERENT LEVELS OF HETEROZYGOSITY CAN AFFECT

    THE CONTIG LAYOUT. Two haplotypes with various degree of heterozygosity The high heterozygosity part (orange) has no overlap with the counter part and becomes its own p-contig The low heterozygosity part (purple) has no associated haplotig as no reliable phasing information available Biological sequence FALCON assembly model FALCON-Unzip assembly model p-contig p-contig p-contig 1 h-contig p-contig 2 Low heterozygosity High heterozygosity ”unzip-able” region
  51. HETEROZYGOSITY BETWEEN HAPLOTYPES CAN BE REALLY HIGH Great “co-linear” (synteny)

    relationship between the two homologous contigs at large scale Detailed view of the alignment: many SVs at various scales (up to 10 to 20 kbp) - A diploid fish genome: - Heterozygosity measured by SNPs ~ 1 SNP / 200 bp - Heterozygosity measured with everything: < 90% base level identity over large regions
  52. SOME INTERESTING OPEN QUESTIONS • What is the right mathematical

    formulation for the diploid assembly graph traversal problem? • From haplo-tig to haplo-scaffolds • Understand the mapping from genome repeat structures to assembly graph motif structures • “Mosaic-like” repeat structures are probably common in plant genomes • Important for understanding the read length vs. coverage tradeoff • Leverage the assembly graph structure to assess the “quality value” (measurement of uncertainty) of the continuity of the output sequences • Go beyond diploid genomes to attack really hard polyploidy genomes
  53. FUTURE OUTLOOK 53 We just see the tip of the

    iceberg…. Re-sequencing: Need a reference genome Mostly SNP information High contiguity assembly: Detect all structural variations Better annotation Build graph genome model Enable comparative genomics at chromosome scale and more
  54. ACKONWLEDGEMENT -All PacBio Users -All PacBio colleagues and managements -Gene

    Myers -Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk Institute ), Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute), Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL, JHU), Dario Cantu (UCD), HLI, WUSTL - Open source toolsets -Daligner (https://dazzlerblog.wordpress.com), Gene Myers -BLASR (https://github.com/PacificBiosciences/blasr), Mark Chaisson -Python, NetworkX for rapid algorithm protyping -Gephi, Graphviz for graph visualization -FALCON (https://github.com/PacificBiosciences/falcon, https://github.com/PacificBiosciences/falcon)
  55. For Research Use Only. Not for use in diagnostics procedures.

    © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com