Recent Progress in Long Read Genome Assembly: Theory, Practice and Future Challenges

For Research Use Only. Not for use in diagnostics procedures.
© Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. Recent Progress in Long Read Genome Assembly: Theory, Practice and Future Challenges Jason Chin, Scientific Fellow, Pacific Biosciences Bioinforsummer 2016

WHAT HAPPENS AT THE “KINK”? 2 Discontinuity in the Genomic
“Moore’s Law” happens in both direction, we all ignore the inconvenient truth. According to NHGRI website, the definition of “sequencing a genome” is changed at year 2008. De Novo Genome “Re-sequencing” 600 bp 50 bp 300 bp PacBio Founded Oxford Nanopore Founded De novo human genome (N50 ~ 1 to 10Mbp) price with long read in 2014 Human Genome Assembled with PacBio Data 12x reduction in read length

MANY GENOME ASSEMBLERS “Why is genome assembly not a solved
problem?” - Gene Myers (Dagstuhl Workshop, Aug. 2016)

18 YEARS OLD “QUALITY-FIRST” CREDO FROM HGP 4 Accuracy minimum
at QV40 Contiguity minimum at 100kb “Contiguity has proven particular elusive.”

IN RETROSPECT…. http://www.pitt.edu/~super7/45011-46001/45771.ppt by Doug Brutlag

Year Technology Assembler Sample 2007 ABI 3730 Celera HuRef 2009
Illumina GA SOAP de novo BGI YH 2010 454 GS Flx Titanium Newbler KB1 2010 Illumina GA ALLPATHS-LG NA12878 2013 454 GS, HiSeq, MiSeq Newbler RP11_0.7 2014 HiSeq, BAC clones Reference- guided CHM1 2014 PacBio RS II FALCON CHM1 2015 PacBio RS II FALCON CHM13 2015 PacBio RS II FALCON AK1 2015 PacBio RS II FALCON HuRef 2015 PacBio RS II FALCON PC-9* 2015 PacBio RS II FALCON SK-BR-3* HUMAN ASSEMBLY BENCHMARKS (UP TO 2015) Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/ 20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/ early/2010/12/20/1017351108.abstract Table3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/) *cancer cell lines 0.11 0.007 0.006 0.024 0.13 0.14 4.38 12.98 7.28 10.38 3.58 2.56 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 26.9 Mb - NCBI: GCA_001297185.1 (2015) Contig N50 (Mb)

JUST TO BRAG MORE ABOUT THAT WE ARE REALLY GOOD
AT HUMAN -The PacBio CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with max contigs 107 Mb and 109 Mb long, respectively. Source: PacBio blog post, Tuesday September 29, 2015, http://pacb.com/blog

#1MBCTGCLUB @ 2016 PLANT AND ANIMAL GENOME MEETING

SINGLE MOLECULE REAL-TIME SEQUENCING http://www.pacb.com/smrt-science/smrt-sequencing/ • Optical confinement detecting single-
molecule biochemistry event • Analog incorporation by single polymerase generates light pulse • Do not need some biochemical steps to synchronize multiple molecules

SYNCHRONOUS AND ASYNCHRONOUS SEQUENCING Maxam-Gilbert Sanger (Chain-termination) SOLiD™ Sequencing by
synthesis (Illumina®) 454™ Ion Torrent™ • Boost “signal”: chemical steps to synchronize molecule populations for every base • “De-synchronization,” e.g. caused by incomplete reactions, increases errors and limits the read lengths PacBio® SMRT Oxford Nanopore • No chemistry step to synchronize signals -- insertion and deletion errors are most common • Trade-off reduces accuracy to gain read length. Read length is not limited by “de-synchronization.” SYNC Sequencing (Synchronized many molecules) ASYNC Sequencing (Single molecules) Genia Nanopore

IN SILICO SYNC - ALIGNMENT TO CONSENSUS T T 1
C 1 A A 1 T 1 A 1 A C 1 A 1 T 1 T C 1 G 1 G 1 T 1 1 1 1 A 1 A 1 1 read A T 2 T 2 A 1 C 1 - 1 G 1 1 T T 1 A 1 1 A 3 T 3 A 1 2 1 A C 2 A 3 T 3 T C 1 G 3 G 3 - 1 T 2 T 1 1 - 2 3 2 1 1 2 A 2 1 A 3 2 + 1 read A T 3 A 1 T 4 A 1 1 C 1 - 3 G 1 1 T T 1 C 1 A 2 1 1 A 4 - 1 T 4 1 A 1 4 1 A C 3 C 1 - A 4 1 T 1 4 T C 1 G 5 G 5 - - 1 2 T 3 T 1 1 - 4 1 4 4 1 1 4 A 4 1 A 5 4 + 1 read A T 6 A 1 T 7 A 1 - 1 1 C 1 - 5 T 1 G 1 1 1 T T 1 C 1 A 4 1 1 A 8 - 1 T 7 - 1 1 A 1 7 - 1 1 1 A C 7 C 1 - A 8 1 T 1 6 C 1 A 1 T C 1 G 9 G 9 - - 1 2 T 7 T 1 1 - 8 A 1 1 6 1 1 6 3 1 1 1 7 1 A 7 2 A 9 8 + 1 read A T 11 A 1 T 13 A 1 - 3 1 C 1 - 11 T 1 G 1 1 3 T T 1 C 2 A 10 1 2 A 15 - 2 T 13 - 2 2 A 1 15 - 1 1 1 A C 13 C 2 - 1 A 14 1 1 T 2 12 C 1 A 2 T C 1 T 1 G 17 1 G 17 - - 1 4 T 12 2 T 2 1 1 - 15 A 1 1 14 1 1 13 4 1 2 2 13 2 A 12 4 - 1 C 1 A 15 1 16 + 1 read • Using dynamic programming / sequence alignment to synchronize bases from different reads • Representing all the alignments in a single graph • Finding the best path in the alignment graph to catch the correct sequences of (synchronized) bases Simulation shows the simple algorithm can get > 99.5% accuracy with modest coverage under different error models.

HOW DO WE GET THERE FOR PERFECT ASSEMBLY?

21 SELECTED WGS ASSEMBLY PAPERS 1979-2013 Statistical theories for WGS
assembly Greedy Heuristics Shortest Common Superstring Hamiltonian Path Landmark Genome Assemblies Automatic Assembly Overlap Graph / String Graph for Long Read Assembly De Bruijn Graph / Eulerian Path for Sequence By Hybridization / NGS

THEORY FOR PERFECT ASSEMBLY All 35+ years work by many
people summarized in 140 characters

“WHY ASSEMBLY IS NOT A SOLVED PROBLEM?” Proposed answer: •
We know too little about the true repeat structures in all genomes • We know too little about the structural difference between different haplotypes (for diploid/polyploid genome assembly) 1984 Peltola, Soderhun & Ukkonen 2016, still the same problem but at different scales A 100 kb nested repeat region

SOLVE REPEAT PROBLEM: LONGER READ LENGTHS For many short read
assembly, repeats are ignored and are hidden between the gaps. Some maybe junk, some may be important (e.g. CRISPR/Cas-9). If you can’t see them, you can’t study them.

HETEROZYGOSITY INDUCE NEW GRAPH STRUCTURE Most assembly theory development focuses
on haploid genomes and ignores more complicated assembly graph structure Biology Paired homologous chromosomes Diploid Assembly Graph

- Quick review on how to make assembly contigs -
Understand and process repeats in genomes - Understand how heterozygosity affects assembly and “unzipping” haplotypes for diploid genome assembly

STRING GRAPH AND DE NOVO GENOME ASSEMBLY 19 String graph
assembly for continuous long reads: 1. Remove contained reads 2. Convert read overlaps to string graphs 3. Construct contigs by identifying unbranched paths Reads Genome Read overlap pairs node edge

UNDERSTAND AND PROCESS REPEATS IN GENOMES

PRACTICAL CONSIDERATION Repeat Suppression to Reduce Graph Complexity Gene Myers
2016 - Repeat causes tangled graph - Finding correct paths through hairball is hard - Algorithm complexity increases with more edges Example: A Human Genome Assembly Graph

EXAMPLE OF UNEVEN 5’- AND 3’- OVERLAPS Read overlap pattern
on a seed read seed read Repeat induced overlaps

OVERLAP FILTERING FOR REPEAT SUPPRESSION 23 Only use these reads
for assembly By looking at the number of 5’- or 3’- overlap, we can eliminate the reads ends in the long repeats to suppress repeat induced tanglements (Maize genome data shown below) 5’ overlap counts 3’ overlap counts

RELATED RECENT PROGRESS: “HINGE” ASSEMBLER

WHY CONTIGS BREAK? WHAT WE CAN DO ABOUT IT? Break
points Contigs break because of • Not enough coverage • Repeat Induced ambiguity

PARAMETER CHOICE CONSIDERATIONS • Length cutoff: Coverage vs. Repeat-resolution •
More coverage at shorter length may not be good • Lower coverage of “enough” longer reads is good • What “enough” is “enough”? RIB limited Coverage limited Read length cutoff Assembly Contiguity (Just a guess) coverage Repeat-induced branching (RID) “Tuning assembler parameters (TAP)” developed by Shoudan may help: https://github.com/pb-sliang/TAP 748,163 1,047,929 1,293,937 1,442,972 1,272,783 753,370 2,184,510,437 2,079,537,346 2,029,248,251 1,984,532,045 1,943,822,269 1,833,838,523 1.8E+09 1.85E+09 1.9E+09 1.95E+09 2E+09 2.05E+09 2.1E+09 2.15E+09 2.2E+09 2.25E+09 - 200,000 400,000 600,000 800,000 1,000,000 1,200,000 1,400,000 1,600,000 0 5000 10000 15000 20000 25000 30000 ASSEMBLY SIZE NN50 LENGTH CUTOFF n50 assembly size

GRAPH TO CONTIGS

Live Demo

000062 ó 000146F, AMY Ctg 62 Ctg 146 CNV region
of AMY genes Haplotype difference

000305 ó 000074, NXF2B, LONG INVERT REPEAT Ctg 305 Ctg
74 Ctg 305 (INV) Unique region

MIS-ASSEMBLY Ctg 33 Ctg 120 Mis-assembly point

CONTIGUITY: JOIN CONTIGS ACROSS COMPLICATED BUT LOCAL REPEATS • We
find 91 junctions can be joined in our NA19240 assembly • Boost on contiguity: From N50: 24,239,162 / Max: 81,042,162 to N50: 28,125,580 / Max: 109,042,162 • What is the theoretical contig N50: • Take the current GRCh38 and breaks the reference on seg-dup > 50kb • The N50 of the unbroken segments are 30,332,297bp • Our N50 is close to the theoretical one 30.3Mbp • Largest 5 continuous non-seg-dup segments in GRCh38: 132.529Mb chr8:12,609,996-145,138,636 109.831Mb chr2:132,362,523-242,193,529 * 109.453Mb chr6:6,1159,830-170,612,704 108.732Mb chr4:9,735,201-118,467,376 104.559Mb chr5:71,364,489-175,923,361 * match our longest “scaffold” (of 3 contigs)

HETEROZYGOSITY AND “UNZIPPING” HAPLOTYPES

WE FINALLY DOCUMENTED DIPLOID ASSEMBLY IDEAS Col-0 Cvi-0 Col-0 x
Cvi-0 - Build good datasets for algorithm development and validation - Testing on other challenging genomes Other diploid genomes have been assembled

A COUPLE YEAR AGO: CLOSE LOOK FOR CHR. 5 38
When aligning to the reference, contigs overlap but still break. Surprised ”inbred” assembly results observed in 2014

BROKEN CONTIGS AND THE OVERLAPPED GRAPH 39 8 contigs mapped
to Chr5:17,460,000- 17,490,000 “Best Edge” Overlap Graph of all reads mapped to the region indicate two haplotypes observed.

A QUICK COMPUTATIONAL EXPERIMENT FOR DIPLOID ASSEMBLY How does the
overlap graph looks like for a “simulated diploid genome”? - Take p-reads from the “good part” (10Mb) of Chr5 and do a assembly with Celera Assembler to check out the overlapped graph. - The overlap graph has a global linear structure + some “bubbles”. This is good from algorithm point of view. Namely, we can generate useful information about a diploid genome easier if overlap graph looks like this. 40 Contig Stats #Seqs 261 Max 912,094 Total 17,070,679 n50 109,783

WHY DO WE SEE BUBBLES? SNPs SNPs SNPs SVs SVs
SNPs SNPs SNPs SVs SVs Haplotype 1 Haplotype 2 Genome Sequences Assembly Graph In most OLC assembler designs, the overlapper does not catch differences at the SNP level but structural variations are naturally segregated.

SOLVING THE DIPLOID ASSEMBLY PROBLEM - Bubbles = big variants
between the haplotypes - Collapsed Path = smaller variants between the haplotypes - FALCON (a polyploid-aware assembler): generating the contigs through the bubbles - FALCON-Unzip: identifying smaller variants and using them to separate the haplotypes

THE FALCON UNZIP PROCESS SNPs SNPs SNPs SVs SVs Associate
contig 1 (Alternative allele) Associate contig 2 (Alternative allele) SNPs SNPs SNPs SVs SVs Primary contig FALCON FALCON-Unzip Updated primary contig + “associate haplotigs” Augmented with haplotype information of each reads Andrew J Makoff, Genome Biol. 2007; 8(6): R114.

PHASING READ INTO HAPLOTYPE GROUPS Haplotype 0 Haplotype 1 Identify
het- SNPs Phase het- SNPs Group reads with phased SNPs Reconstruct haplotypes Align SMRT reads to the initial primary contig More het-SNPs in longer reads: 8% to 15% sequence error rate is not an issues given enough long-read coverage for phasing.

MERGE HAPLOTYPE INFORMATION AND “UNZIP” Tiling path of haplotype 0
Tiling path of haplotype 1 Remove edges connecting different haplotypes

PUT EVERYTHING TOGETHER “FALCON-Unzip Process” ~ 4.80 Mb Add missing
haplotype specific nodes & edges Remove edges that connect different haplotypes The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs. 4 major haplotype phased blocks determined by het-SNPs Un-phased region

ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY STATISTICS Strain Inbred Col-0 Inbred
Cvi-0 Col-0 x Cvi-0 F1 Assembler FALCON (primary ctg) FALCON (primary ctg) FALCON FALCON-Unzip FALCON-Unzip primary contigs primary contigs haplotigs Assembly Size (Mb) 120 120 143 140 105 # contigs 337 260 426 172 248 N50 size (Mb) 7.35 6.07 7.92 7.96 6.92 Max Contig size (Mb) 12.20 14.37 13.39 13.32 11.65 120 120 143 140 105 0 20 40 60 80 100 120 140 160 (primary ctg) (primary ctg) primary contigs primary contigs haplotigs FALCONFALCON FALCON FALCON -Unzip FALCON -Unzip Assembly Size (Mb) 7.35 6.07 7.92 7.96 6.92 0 2 4 6 8 10 N50 size (Mb)

HUMAN *DIPLOID* ASSEMBLY BENCHMARKS 2 Assembly Sequence Assembly size (Gb)
# contigs Contig N50 size (kb) Contig N90 (kb) Max contig size (Mb) PacBio Assemblies HG002* p-contigs 2.77 1,833 13,103 1,046 57.2 haplotigs 1.45 17,831 115 37 2.9 WI-38^ p-contigs 2.90 1,849 20,781 2,099 93.7 haplotigs 1.90 7,668 399 116 2.4 NA12878 p-contigs 2.82 1,661 13,394 1,762 70.3 haplotigs 1.63 13,036 192 57 2.5 NA19240-1*,+ p-contigs 2.85 2,238 4,945 1,089 20.0 haplotigs 2.30 9,916 440 111 2.6 NA19240-2+ p-contigs 2.85 1,599 24,314 3,010 81.1 haplotigs 2.36 9,764 468 117 2.6 Linked short read assembly NA19240- short-read* pseudohap2.1 2.82 58,308 166 38 1.5 pseudohap2.2 2.82 58,308 166 38 1.5 * Publically available assemblies from bioRxiv 067447; High-Quality Assembly of an Individual of Yoruban Descent, Karyn Meltz Steinberg, et. al (https://www.dropbox.com/sh/8gsdycmqk04ei7s/AADZJgJtcK_mw16CYpoSpJhca) + NA19240-1 & NA19240-2 are assembled with the same input data. NA19240-2 uses a recently updated assembly algorithm. ^ See poster 2238F in “Genome Structure and Function” section

HETEROZYGOUS VARIATIONS AT MANY SCALES

GENERAL SCHEMATIC ABOUT HOW DIFFERENT LEVELS OF HETEROZYGOSITY CAN AFFECT
THE CONTIG LAYOUT. Two haplotypes with various degree of heterozygosity The high heterozygosity part (orange) has no overlap with the counter part and becomes its own p-contig The low heterozygosity part (purple) has no associated haplotig as no reliable phasing information available Biological sequence FALCON assembly model FALCON-Unzip assembly model p-contig p-contig p-contig 1 h-contig p-contig 2 Low heterozygosity High heterozygosity ”unzip-able” region

HETEROZYGOSITY BETWEEN HAPLOTYPES CAN BE REALLY HIGH Great “co-linear” (synteny)
relationship between the two homologous contigs at large scale Detailed view of the alignment: many SVs at various scales (up to 10 to 20 kbp) - A diploid fish genome: - Heterozygosity measured by SNPs ~ 1 SNP / 200 bp - Heterozygosity measured with everything: < 90% base level identity over large regions

SOME INTERESTING OPEN QUESTIONS • What is the right mathematical
formulation for the diploid assembly graph traversal problem? • From haplo-tig to haplo-scaffolds • Understand the mapping from genome repeat structures to assembly graph motif structures • “Mosaic-like” repeat structures are probably common in plant genomes • Important for understanding the read length vs. coverage tradeoff • Leverage the assembly graph structure to assess the “quality value” (measurement of uncertainty) of the continuity of the output sequences • Go beyond diploid genomes to attack really hard polyploidy genomes

FUTURE OUTLOOK 53 We just see the tip of the
iceberg…. Re-sequencing: Need a reference genome Mostly SNP information High contiguity assembly: Detect all structural variations Better annotation Build graph genome model Enable comparative genomics at chromosome scale and more

ACKONWLEDGEMENT -All PacBio Users -All PacBio colleagues and managements -Gene
Myers -Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk Institute ), Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute), Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL, JHU), Dario Cantu (UCD), HLI, WUSTL - Open source toolsets -Daligner (https://dazzlerblog.wordpress.com), Gene Myers -BLASR (https://github.com/PacificBiosciences/blasr), Mark Chaisson -Python, NetworkX for rapid algorithm protyping -Gephi, Graphviz for graph visualization -FALCON (https://github.com/PacificBiosciences/falcon, https://github.com/PacificBiosciences/falcon)

For Research Use Only. Not for use in diagnostics procedures.
© Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com

Recent Progress in Long Read Genome Assembly: T...

Recent Progress in Long Read Genome Assembly: Theory, Practice and Future Challenges

More Decks by Jason Chin

Other Decks in Science

Featured

Transcript