Slide 1

Slide 1 text

Variant Detection and de novo assembly with HiFi reads
Billy Rowell, Sr. Scientist, Bioinformatics Applications, PacBio
@nothingclever #PBUGM

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

What could you do with a 10 to 20 kb Sanger quality CCS read?

Slide 3

Slide 3 text

AGENDA
- Refresher: What is a HiFi read?
- Two applications that benefit from HiFi reads:
  - Variant detection
  - De novo assembly

Slide 4

Slide 4 text

PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
[Figure labels: First round; Rolling circle; Subreads (passes); Subread errors; Generate consensus; HiFi read]

Slide 5

Slide 5 text

PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
[Figure labels: First round; Rolling circle; Subreads (passes); Subread errors; Generate consensus; HiFi read]
[Plot: Accuracy (Phred) versus Passes, Sequel II (8M)]
QV30 = 99.9% accuracy
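The "QV30 = 99.9% accuracy" label follows from the standard Phred definition, Q = -10 * log10(error probability). The sketch below shows that conversion in both directions; the function names are illustrative and not taken from any PacBio tool.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality value to per-base accuracy (a fraction)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (a fraction below 1.0) to a Phred value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Print the accuracy for a few common quality thresholds.
    for q in (20, 30, 40):
        print(f"Q{q}: {phred_to_accuracy(q):.4%}")
```

Each additional 10 Phred units corresponds to a tenfold reduction in error rate.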

Slide 6

Slide 6 text

Variant detection Using Whole Genome Sequencing on the Sequel II System

Slide 7

Slide 7 text

A COMPREHENSIVE VIEW OF THE GENOME
[Figure: SNVs (1 bp): 5 Mb; indels (1-49 bp): 3 Mb; structural variants (≥50 bp): 10 Mb]

Slide 8

Slide 8 text

A COMPREHENSIVE VIEW OF THE GENOME
[Figure: SNVs (1 bp): 5 Mb; indels (1-49 bp): 3 Mb; structural variants (≥50 bp): 10 Mb; compared with prior tech]

Slide 9

Slide 9 text

A COMPREHENSIVE VIEW OF THE GENOME
- Unmappable regions
- Segmental duplications and tandem repeats
[Figure: SNVs (1 bp): 5 Mb; indels (1-49 bp): 3 Mb; structural variants (≥50 bp): 10 Mb; compared with prior tech]

Slide 10

Slide 10 text

A COMPREHENSIVE VIEW OF THE GENOME
SMRT Sequencing provides even coverage across difficult-to-sequence regions of the genome.
[Figure: read alignments at STRC for prior tech and for PacBio reads (Haplotype 1, Haplotype 2)]
- Prior tech: almost no coverage
- PacBio reads: sequence straight through and detect variants, some falling in coding regions

Slide 11

Slide 11 text

MORE COVERAGE IN MEDICALLY-RELEVANT GENES

% problem exons resolved, with gene counts and gene names:

100% (152 genes): ABCC6, ABCD1, ACAN, ACSM2B, AKR1C2, ALG1, ANKRD11, BCR, CATSPER2, CD177, CEL, CES1, CFH, CFHR1, CFHR3, CFHR4, CGB, CHEK2, CISD2, CLCNKA, CLCNKB, CORO1A, COX10, CRYBB2, CSH1, CYP11B1, CYP11B2, CYP21A2, CYP2A6, CYP2D6, CYP2F1, CYP4A22, DDX11, DHRS4L1, DIS3L2, DND1, DPY19L2, DUOX2, ESRRA, F8, FAM120A, FAM205A, FANCD2, FCGR1A, FCGR2A, FCGR3A, FCGR3B, FLG, FLNC, FOXD4, FOXO3, FUT3, GBA, GFRA2, GON4L, GRM5, GSTM1, GYPA, GYPB, GYPE, HBA1, HBA2, HBG1, HBG2, HP, HS6ST1, IDS, IFT122, IKBKG, IL9R, KIR2DL1, KIR2DL3, KMT2C, KRT17, KRT6A, KRT6B, KRT6C, KRT81, KRT86, LEFTY2, LPA, MST1, MUC5B, MYH6, MYH7, NEB, NLGN4X, NLGN4Y, NOS2, NOTCH2, NXF5, OPN1LW, OR2T5, OR51A2, PCDH11X, PCDHB4, PGAM1, PHC1, PIK3CA, PKD1, PLA2G10, PLEKHM1, PLG, PMS2, PRB1, PRDM9, PROS1, RAB40AL, RALGAPA1, RANBP2, RHCE, RHD, RHPN2, ROCK1, SAA1, SDHA, SDHC, SFTPA1, SFTPA2, SIGLEC14, SLC6A8, SMG1, SPATA31C1, SPTLC1, SRGAP2, SSX7, STAT5B, STK19, STRC, SULT1A1, SUZ12, TBX20, TCEB3C, TLR1, TLR6, TMEM231, TNXB, TRIOBP, TRPA1, TTN, TUBA1A, TUBB2B, UGT1A5, UGT2B15, UGT2B17, UNC93B1, VCY, VWF, WDR72, ZNF419, ZNF592, ZNF674
[75%, 100%) (11 genes): ANAPC1, C4A, C4B, CHRNA7, CR1, DUX4, FCGR2B, HYDIN, OTOA, PDPK1, TMLHE
[50%, 75%) (7 genes): ADAMTSL2, CDY2A, DAZ1, GTF2I, NAIP, OCLN, RPS17
[25%, 50%) (5 genes): DAZ2, DAZ3, KIR3DL1, OPN1MW, PPIP5K1
(0%, 25%) (2 genes): NCF1, RBMY1A1
0% (16 genes): BPY2, CCL3L1, CCL4L1, CDY1, CFC1, CFC1B, GTF2IRD2, HSFY1, MRC1, OR4F5, PRY, PRY2, SMN1, SMN2, TSPY1, XKRY

Mandelker, D. et al., Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. 2016. Genetics in Medicine 18, 1282-1289
Wenger, A. et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. 2019. Nature Biotechnology.

Slide 12

Slide 12 text

A COMPREHENSIVE VIEW OF THE GENOME
- Long insertions
- Events in repeat regions
[Figure: SNVs (1 bp): 5 Mb; indels (1-49 bp): 3 Mb; structural variants (≥50 bp): 10 Mb; compared with prior tech]

Slide 13

Slide 13 text

A COMPREHENSIVE VIEW OF THE GENOME
SMRT Sequencing provides long read lengths to span large structural variants.
[Figure: read alignments across repeats for PacBio reads (Haplotype 1, Haplotype 2) and prior tech]
- PacBio reads: 1,733 bp deletion
- Prior tech: deletion not detected

Slide 14

Slide 14 text

PRECISION & RECALL – DEFINITIONS
NIST Genome in a Bottle Consortium

Query callset vs. benchmark (“truth”) callset, within benchmark “high confidence” regions

Metric          Abbreviation  Benchmark  Variant calls
True Positive   TP            ✓          ✓
False Positive  FP            -          ✓
False Negative  FN            ✓          -

Recall: percentage of truth that is called = TP/(TP+FN)
Precision: percentage of calls that are correct = TP/(TP+FP)
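The definitions above can be sketched in a few lines of Python. This is a simplified illustration: it assumes variants are represented as hypothetical (chrom, pos, ref, alt) tuples and matched exactly, whereas real benchmarking tools also reconcile different representations of the same variant and restrict the comparison to the high-confidence regions.

```python
def compare_callsets(truth: set, query: set) -> tuple:
    """Count TP, FP, and FN by exact matching of variant tuples."""
    tp = len(truth & query)   # in benchmark and in variant calls
    fp = len(query - truth)   # called, but absent from benchmark
    fn = len(truth - query)   # in benchmark, but not called
    return tp, fp, fn


def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


if __name__ == "__main__":
    # Toy callsets, made up purely for illustration.
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
    query = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"), ("chr2", 75, "T", "C")}
    tp, fp, fn = compare_callsets(truth, query)
    precision, recall = precision_recall(tp, fp, fn)
    print(f"TP={tp} FP={fp} FN={fn} precision={precision:.3f} recall={recall:.3f}")
```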

Slide 15

Slide 15 text

TYPES OF VARIANTS IN A GENOME
SMRT Sequencing provides comprehensive detection of all variant types.
[Figure highlight: Small variants (<50 bp)]

Slide 16

Slide 16 text

VARIANT DETECTION WORKFLOWS
gDNA -> reads -> alignment -> variant detection -> variant calls (vcf)

Slide 17

Slide 17 text

VARIANT DETECTION WORKFLOWS
Prior tech -> BWA MEM (alignment) -> GATK HaplotypeCaller (variant detection) -> variant calls (vcf)

Slide 18

Slide 18 text

VARIANT DETECTION WORKFLOWS
BWA MEM (alignment) -> GATK HaplotypeCaller (variant detection)
HiFi reads -> minimap2 (alignment) -> GATK HaplotypeCaller (variant detection) -> variant calls (vcf)

Slide 19

Slide 19 text

WGS SMALL VARIANT CALLING WORKFLOW WITH DEEPVARIANT
HiFi reads -> pbmm2 (alignment) -> Google DeepVariant (variant detection) -> variant calls (vcf)
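As a rough sketch of how this workflow might look on the command line, the Python below only builds and prints the two commands; it does not run them. The file names are placeholders, and the flags (`--preset CCS`, `--sort`, `-j` for pbmm2; `--model_type=PACBIO` and the other options for `run_deepvariant`) are given as I recall them from each tool's documentation, so they should be checked against the installed versions. DeepVariant is often run inside a container, which is not shown here.

```python
import shlex


def pbmm2_align_cmd(reference, hifi_reads, aligned_bam, threads=16):
    """Build a pbmm2 alignment command for HiFi (CCS) reads (assumed flags)."""
    return ["pbmm2", "align", reference, hifi_reads, aligned_bam,
            "--preset", "CCS", "--sort", "-j", str(threads)]


def deepvariant_cmd(reference, aligned_bam, output_vcf, shards=16):
    """Build a DeepVariant command using the PacBio model type (assumed flags)."""
    return ["run_deepvariant",
            "--model_type=PACBIO",
            f"--ref={reference}",
            f"--reads={aligned_bam}",
            f"--output_vcf={output_vcf}",
            f"--num_shards={shards}"]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    steps = [
        pbmm2_align_cmd("ref.fasta", "sample.ccs.bam", "sample.aligned.bam"),
        deepvariant_cmd("ref.fasta", "sample.aligned.bam", "sample.vcf.gz"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
```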

Slide 20

Slide 20 text

WGS SMALL VARIANT CALLING WORKFLOW WITH DEEPVARIANT HAS HIGH PRECISION AND RECALL
HiFi reads -> pbmm2 (alignment) -> Google DeepVariant (variant detection) -> variant calls (vcf)
- DeepVariant learns error model of HiFi reads from training data
- Improved precision and recall for both SNVs and Indels

DeepVariant v0.9 was released today with training on Sequel II System Chemistry 2.0 and improved performance on all chemistries.

15-fold HiFi vs GIAB v3.3.2 benchmarks
Sample  SNV Recall  SNV Precision  Indel Recall  Indel Precision
HG001   99.1%       99.5%          94.1%         95.0%
HG002   99.2%       99.5%          95.4%         96.6%
HG005   99.4%       99.7%          97.0%         97.5%

Slide 21

Slide 21 text

TYPES OF VARIANTS IN A GENOME
SMRT Sequencing provides comprehensive detection of all variant types.
[Figure highlight: Structural variants (≥50 bp)]

Slide 22

Slide 22 text

WGS STRUCTURAL VARIANT CALLING WORKFLOW WITH PBSV
HiFi reads -> pbmm2 (alignment) -> pbsv (variant detection) -> variant calls (vcf)
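The pbsv step is typically run in two stages, signature discovery followed by calling. The sketch below builds those two commands without executing them. File names are placeholders, and the `--ccs` flag on `pbsv call` is stated from memory of the pbsv documentation of that period, so it is an assumption to verify against the installed version.

```python
import shlex


def pbsv_cmds(reference, aligned_bam, svsig, output_vcf):
    """Build the two pbsv commands: discover signatures, then call variants."""
    discover = ["pbsv", "discover", aligned_bam, svsig]
    # --ccs is assumed here to select HiFi-appropriate calling parameters.
    call = ["pbsv", "call", "--ccs", reference, svsig, output_vcf]
    return [discover, call]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    for cmd in pbsv_cmds("ref.fasta", "sample.aligned.bam",
                         "sample.svsig.gz", "sample.sv.vcf"):
        print(shlex.join(cmd))
```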

Slide 25

Slide 25 text

15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS

SV Type  HG001   HG002   HG005
DEL      21,784  22,074  21,897
INS      24,691  25,354  25,231
DUP      2,906   2,916   2,831
CNV      108     107     97
INV      45      45      44
BND      752     712     708

Recall: 97.5%
Precision: 94.6%

Also, structural variants detected from HiFi reads are reported with base pair precision.

Slide 26

Slide 26 text

15-FOLD HiFi READ COVERAGE RECOMMENDATION
Wenger, A. et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. 2019. Nature Biotechnology.

Slide 27

Slide 27 text

De novo assembly Using Whole Genome Sequencing on the Sequel II System

Slide 28

Slide 28 text

HIFI ASSEMBLY WORKFLOW SKIPS PRE-ASSEMBLY
Long Reads -> [Error Correction/Pre-assembly] -> Corrected Reads (preads) -> [Overlap and Assembly Graph] -> Contigs -> [Read Alignment and Consensus] -> Polished Contigs

Slide 29

Slide 29 text

HIFI ASSEMBLY WORKFLOW SKIPS PRE-ASSEMBLY
HiFi Reads -> [Overlap and Assembly Graph] -> Contigs -> [Read Alignment and Consensus] -> Polished Contigs
Long Reads -> [Error Correction/Pre-assembly] -> Corrected Reads (preads) -> [Overlap and Assembly Graph] -> Contigs -> [Read Alignment and Consensus] -> Polished Contigs

Slide 30

Slide 30 text

HIFI DATA GENERATES CONTIGUOUS, ACCURATE HUMAN ASSEMBLIES

Human (HG002)
Library Name                  Long Reads        HiFi
Library Type                  BluePippin >15kb  11 kb SageELF
Coverage                      50-fold           20-fold
Assembly Length (Gb)          2.85              2.86
Assembly Contig N50 (Mb)      12.6              16.2
Contig Basepair Accuracy (Q)  45.1              50.0

Slide 31

Slide 31 text

CHEMISTRY AND FALCON CHANGES BOOST CONTIGUITY

Human (HG002)
Library Name                  Long Reads        HiFi           HiFi
Library Type                  BluePippin >15kb  11 kb SageELF  15 kb SageELF
Coverage                      50-fold           20-fold        22-fold
Assembly Length (Gb)          2.85              2.86           2.92
Assembly Contig N50 (Mb)      12.6              16.2           30.5
Contig Basepair Accuracy (Q)  45.1              50.0           50.6

Exposed new FALCON parameters to ignore indels and modify the minimum identity during overlap.
[Figure: % identity of repeat-induced overlaps and true overlaps; panels labeled All Diff and Mismatches]
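Contig N50, the contiguity metric in the table, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation, using made-up contig lengths rather than data from the talk:

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, for illustration only.
    print(contig_n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the table reports basepair accuracy alongside it.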

Slide 32

Slide 32 text

CHEMISTRY IMPROVEMENTS CLOSE THE GAPS IN RICE

                  CLR   HiFi 14 kb  HiFi 17 kb  HiFi 24 kb
Asm Length (Mb)   403   400         400         402
Contig N50 (Mb)   11.2  9.20        10.7        10.7
N Contigs         209   296         135         154

[Plot: Contig Base Quality in 100kb Windows (Phred Quality) for CLR, HiFi 14 kb, HiFi 17 kb, HiFi 24 kb]
[Figure: dot plot (Minimap2, dgenies), 24-kb Library: Chr6, Chr7, Chr12 in single contig]

Slide 33

Slide 33 text

GENERATE COMPLETE AND ACCURATE GENOME ASSEMBLIES
- Accuracies >Q40 (99.99%)
- >94% of genes in frame

Slide 34

Slide 34 text

REDUCED ANALYSIS TIME WITH HIFI READS
Compute times for de novo assembly of a human genome

Data Type                HiFi Reads    Long Reads
Input File Type          CCS.FASTQ.GZ  SUBREADS.BAM
Input File Size (GB)     48            323
Read Correction Method   CCS Analysis  Pre-assembly
Time to Results (Hours)
  Read Correction        17.5          43.5
  Contig Assembly        13.7          18.9
  Total                  ~31 hrs       ~62 hrs

Analyses run with PacBio recommended compute infrastructure
Faster compute time, lower compute and storage costs
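The rounded totals on this slide can be reproduced from the per-stage numbers. The short sketch below sums the stage times and also derives two ratios that the slide does not state explicitly (overall speedup and input size reduction); the variable names are illustrative.

```python
# Per-stage times in hours and input sizes in GB, copied from the slide.
HIFI_HOURS = {"read_correction": 17.5, "contig_assembly": 13.7}
LONG_READ_HOURS = {"read_correction": 43.5, "contig_assembly": 18.9}
HIFI_INPUT_GB = 48
LONG_READ_INPUT_GB = 323


def total_hours(stages):
    """Sum the time spent across all stages of one workflow."""
    return sum(stages.values())


if __name__ == "__main__":
    hifi_total = total_hours(HIFI_HOURS)
    long_total = total_hours(LONG_READ_HOURS)
    print(f"HiFi total: {hifi_total:.1f} h, long-read total: {long_total:.1f} h")
    print(f"Time ratio (long reads / HiFi): {long_total / hifi_total:.1f}x")
    print(f"Input size ratio: {LONG_READ_INPUT_GB / HIFI_INPUT_GB:.1f}x")
```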

Slide 35

Slide 35 text

HIFI DATA NOW SUPPORTED BY FALCON-UNZIP
https://github.com/PacificBiosciences/pbbioconda/wiki/Assembling-HiFi-data:-FALCON-Unzip3
- Native support for HiFi
- Faster read tracking and polishing
- Incorporating HiFi-specific overlapping options
- Phasing algorithm improvements
[Slide marks three of these items with ✔]

Slide 36

Slide 36 text

CONCLUSIONS
- HiFi reads enable accurate, comprehensive detection of genetic variation.
- We recommend using pbmm2 for alignment, DeepVariant for small variant detection, and pbsv for structural variant detection.
- Improvements to the Sequel II chemistry and optimization of the FALCON workflow have improved the contiguity and accuracy of HiFi assemblies at a fraction of the compute time and compute costs of CLR assembly.

Slide 37

Slide 37 text

ACKNOWLEDGMENTS

Small variant detection
- Andrew Carroll
- Pi-Chuan Chang
- Mark DePristo
- Richard Hall
- Alexey Kolesnikov
- Justin Zook, Justin Wagner, and the Genome in a Bottle Consortium

Structural variant detection
- Armin Töpfer
- Aaron Wenger

De novo assembly
- Sarah Kingan
- Greg Concepcion
- Jim Drake
- Chris Dunn
- Richard Hall
- Jonas Korlach
- Zev Kronenberg
- Ivan Sovic
- Michelle Vierra
- Aaron Wenger
- Alicia Yang
- Greg Young

Slide 38

Slide 38 text

PACBIO SEQUEL II SYSTEM HIFI DATASETS

Sample         Insert length                     Reads (SRA)
HG001/NA12878  11 kb (30-fold)                   PRJNA540705
HG002/NA24385  11 kb (30-fold)                   PRJNA527278
HG005/NA24631  11 kb (30-fold)                   PRJNA540706
HG002/NA24385  15 kb (30-fold), 20 kb (15-fold)  PRJNA586863
Rice MH63      17 kb (38-fold)                   SRX6908794
Rice MH63      24 kb (64-fold)                   SRX6957825

Slide 39

Slide 39 text

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners. www.pacb.com