Variant detection and de novo assembly with HiFi reads

William Rowell
November 14, 2019

This applications update was presented as part of the 2019 PacBio User Group Meeting for Europe, the Middle East, and Africa in Milan, Italy.

Transcript

  1. 1.

    Variant Detection and de novo assembly with HiFi reads
    Billy Rowell, Sr. Scientist, Bioinformatics Applications, PacBio
    @nothingclever #PBUGM

    For Research Use Only. Not for use in diagnostic procedures.
    © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.
  2. 2.

    What could you do with a 10 to 20 kb, Sanger-quality CCS read?
  3. 3.

    AGENDA
    - Refresher: What is a HiFi read?
    - Two applications that benefit from HiFi reads:
      - Variant detection
      - De novo assembly
  4. 4.

    PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
    [Figure labels: First round; Rolling circle; Subreads (passes); Subread errors; Generate consensus; HiFi read]
  5. 5.

    PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
    [Figure labels: First round; Rolling circle; Subreads (passes); Subread errors; Generate consensus; HiFi read]
    [Plot: Accuracy (Phred) vs. Passes, Sequel II (8M)]
    QV30 = 99.9% accuracy
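The "QV30 = 99.9% accuracy" label on this slide follows from the standard Phred definition, Q = -10 · log10(error probability). A minimal Python sketch of the conversion in both directions; the function names are illustrative and do not come from the talk.

```python
import math


def phred_to_accuracy(qv: float) -> float:
    """Convert a Phred quality value to the expected per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy in [0, 1) to a Phred quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Print the accuracy that corresponds to a few common quality values.
for qv in (20, 30, 40, 50):
    print(f"Q{qv}: {phred_to_accuracy(qv):.5%} accuracy")
```

The same conversion applies to the contig accuracies quoted later in the deck, for example Q50 corresponds to one expected error per 100,000 bases.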
  6. 7.

    A COMPREHENSIVE VIEW OF THE GENOME
    Variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp)
    [Figure labels: 5 Mb; 3 Mb; 10 Mb; vs]
  7. 8.

    A COMPREHENSIVE VIEW OF THE GENOME
    Variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp)
    [Figure labels: 5 Mb; 3 Mb; 10 Mb; Prior tech vs]
  8. 9.

    A COMPREHENSIVE VIEW OF THE GENOME
    - Unmappable regions
    - Segmental duplications and tandem repeats
    Variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp)
    [Figure labels: 5 Mb; 3 Mb; 10 Mb; Prior tech vs]
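The size classes on these slides (1 bp SNVs, 1-49 bp indels, ≥50 bp structural variants) can be expressed as a small classifier. This is a sketch under stated assumptions: alleles are given as sequence-resolved REF/ALT strings in the VCF style, symbolic alleles such as `<DEL>` are not handled, and the function name and the "other" category are illustrative rather than part of the talk.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a sequence-resolved variant by the size thresholds on the slide."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"  # single-base substitution
    size = abs(len(alt) - len(ref))  # net inserted or deleted bases
    if size == 0:
        # Equal-length multi-base substitution; the slide does not cover this case.
        return "other"
    if size < 50:
        return "indel"  # 1-49 bp insertion or deletion
    return "structural variant"  # 50 bp or larger


# Example alleles, made up for illustration.
examples = [("A", "G"), ("A", "ACGT"), ("A" + "C" * 60, "A")]
for ref, alt in examples:
    print(len(ref), len(alt), classify_variant(ref, alt))
```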
  9. 10.

    A COMPREHENSIVE VIEW OF THE GENOME
    SMRT Sequencing provides even coverage across difficult-to-sequence regions of the genome.
    [Figure: read alignments at STRC; tracks: Prior tech, PacBio reads (Haplotype 1, Haplotype 2)]
    - Almost no coverage with prior tech
    - PacBio reads sequence straight through and detect variants, some falling in coding regions
  10. 11.

    MORE COVERAGE IN MEDICALLY-RELEVANT GENES

    % problem exons resolved, with gene count and genes per bin:

    100% (152 genes): ABCC6, ABCD1, ACAN, ACSM2B, AKR1C2, ALG1, ANKRD11, BCR, CATSPER2, CD177, CEL, CES1, CFH, CFHR1, CFHR3, CFHR4, CGB, CHEK2, CISD2, CLCNKA, CLCNKB, CORO1A, COX10, CRYBB2, CSH1, CYP11B1, CYP11B2, CYP21A2, CYP2A6, CYP2D6, CYP2F1, CYP4A22, DDX11, DHRS4L1, DIS3L2, DND1, DPY19L2, DUOX2, ESRRA, F8, FAM120A, FAM205A, FANCD2, FCGR1A, FCGR2A, FCGR3A, FCGR3B, FLG, FLNC, FOXD4, FOXO3, FUT3, GBA, GFRA2, GON4L, GRM5, GSTM1, GYPA, GYPB, GYPE, HBA1, HBA2, HBG1, HBG2, HP, HS6ST1, IDS, IFT122, IKBKG, IL9R, KIR2DL1, KIR2DL3, KMT2C, KRT17, KRT6A, KRT6B, KRT6C, KRT81, KRT86, LEFTY2, LPA, MST1, MUC5B, MYH6, MYH7, NEB, NLGN4X, NLGN4Y, NOS2, NOTCH2, NXF5, OPN1LW, OR2T5, OR51A2, PCDH11X, PCDHB4, PGAM1, PHC1, PIK3CA, PKD1, PLA2G10, PLEKHM1, PLG, PMS2, PRB1, PRDM9, PROS1, RAB40AL, RALGAPA1, RANBP2, RHCE, RHD, RHPN2, ROCK1, SAA1, SDHA, SDHC, SFTPA1, SFTPA2, SIGLEC14, SLC6A8, SMG1, SPATA31C1, SPTLC1, SRGAP2, SSX7, STAT5B, STK19, STRC, SULT1A1, SUZ12, TBX20, TCEB3C, TLR1, TLR6, TMEM231, TNXB, TRIOBP, TRPA1, TTN, TUBA1A, TUBB2B, UGT1A5, UGT2B15, UGT2B17, UNC93B1, VCY, VWF, WDR72, ZNF419, ZNF592, ZNF674
    [75%, 100%) (11 genes): ANAPC1, C4A, C4B, CHRNA7, CR1, DUX4, FCGR2B, HYDIN, OTOA, PDPK1, TMLHE
    [50%, 75%) (7 genes): ADAMTSL2, CDY2A, DAZ1, GTF2I, NAIP, OCLN, RPS17
    [25%, 50%) (5 genes): DAZ2, DAZ3, KIR3DL1, OPN1MW, PPIP5K1
    (0%, 25%) (2 genes): NCF1, RBMY1A1
    0% (16 genes): BPY2, CCL3L1, CCL4L1, CDY1, CFC1, CFC1B, GTF2IRD2, HSFY1, MRC1, OR4F5, PRY, PRY2, SMN1, SMN2, TSPY1, XKRY

    Mandelker, D. et al., Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. 2016. Genetics in Medicine 18, 1282-1289
    Wenger, A. et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. 2019. Nature Biotechnology.
  11. 12.

    A COMPREHENSIVE VIEW OF THE GENOME
    - Long insertions
    - Events in repeat regions
    Variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp)
    [Figure labels: 5 Mb; 3 Mb; 10 Mb; Prior tech vs]
  12. 13.

    A COMPREHENSIVE VIEW OF THE GENOME
    SMRT Sequencing provides long read lengths to span large structural variants.
    [Figure: read alignments around a 1,733 bp deletion; tracks: PacBio reads (Haplotype 1, Haplotype 2), Prior tech (deletion not detected), Repeats]
  13. 14.

    PRECISION & RECALL – DEFINITIONS
    NIST Genome in a Bottle Consortium
    [Figure labels: Query callset; Benchmark ("truth") callset; Benchmark "high confidence" regions]

    Recall: percentage of truth that is called = TP/(TP+FN)
    Precision: percentage of calls that are correct = TP/(TP+FP)

    Metric          Abbreviation  Benchmark  Variant calls
    True Positive   TP            ✓          ✓
    False Positive  FP            -          ✓
    False Negative  FN            ✓          -
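The two definitions on this slide translate directly into code. A minimal Python sketch; the counts below are made up for illustration and are not results from the talk.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of benchmark ("truth") variants that were called: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("benchmark callset contains no variants")
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of calls that are correct: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("query callset contains no variants")
    return tp / (tp + fp)


# Hypothetical counts, for illustration only.
tp, fp, fn = 950, 30, 50
print(f"recall    = {recall(tp, fn):.1%}")
print(f"precision = {precision(tp, fp):.1%}")
```

Both metrics are only meaningful when restricted to the benchmark "high confidence" regions, since variants outside those regions cannot be scored as true or false.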
  14. 15.

    TYPES OF VARIANTS IN A GENOME
    SMRT Sequencing provides comprehensive detection of all variant types.
    Small variants (<50 bp)
  15. 17.

    VARIANT DETECTION WORKFLOWS
    Prior tech → BWA MEM (alignment) → GATK HaplotypeCaller (variant detection) → variant calls (vcf)
  16. 18.

    VARIANT DETECTION WORKFLOWS
    BWA MEM (alignment) → GATK HaplotypeCaller (variant detection)
    HiFi reads → minimap2 (alignment) → GATK HaplotypeCaller (variant detection) → variant calls (vcf)
  17. 19.

    WGS SMALL VARIANT CALLING WORKFLOW WITH DEEPVARIANT
    HiFi reads → pbmm2 (alignment) → Google DeepVariant (variant detection) → variant calls (vcf)
  18. 20.

    WGS SMALL VARIANT CALLING WORKFLOW WITH DEEPVARIANT HAS HIGH PRECISION AND RECALL
    HiFi reads → pbmm2 (alignment) → Google DeepVariant (variant detection) → variant calls (vcf)
    - DeepVariant learns error model of HiFi reads from training data
    - Improved precision and recall for both SNVs and indels
    DeepVariant v0.9 was released today with training on Sequel II System Chemistry 2.0 and improved performance on all chemistries.

    15-fold HiFi vs GIAB v3.3.2 benchmarks
    Sample  SNV Recall  SNV Precision  Indel Recall  Indel Precision
    HG001   99.1%       99.5%          94.1%         95.0%
    HG002   99.2%       99.5%          95.4%         96.6%
    HG005   99.4%       99.7%          97.0%         97.5%
  19. 21.

    TYPES OF VARIANTS IN A GENOME
    SMRT Sequencing provides comprehensive detection of all variant types.
    Structural variants (≥50 bp)
  20. 22.

    WGS STRUCTURAL VARIANT CALLING WORKFLOW WITH PBSV
    HiFi reads → pbmm2 (alignment) → pbsv (variant detection) → variant calls (vcf)
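The two HiFi workflows above (pbmm2 then DeepVariant for small variants, pbmm2 then pbsv for structural variants) can be sketched as command lines. The Python below only assembles and prints the commands; it does not run them. Everything beyond the tool names is an assumption: the file names are placeholders, and the subcommands and flags (`--preset CCS`, `--sort`, `--ccs`, `--model_type=PACBIO`, `--num_shards`) reflect the tools' documented usage as I understand it, so check each against the `--help` output of the installed version. DeepVariant is also commonly run inside a container, which is not shown.

```python
import shlex


def alignment_cmd(ref: str, reads: str, out_bam: str) -> list:
    """pbmm2 alignment of HiFi reads (flags are assumptions; verify with --help)."""
    return ["pbmm2", "align", ref, reads, out_bam, "--preset", "CCS", "--sort"]


def small_variant_cmd(ref: str, bam: str, out_vcf: str, shards: int = 8) -> list:
    """DeepVariant small variant calling with its PacBio model (assumed flags)."""
    return [
        "run_deepvariant",
        "--model_type=PACBIO",
        f"--ref={ref}",
        f"--reads={bam}",
        f"--output_vcf={out_vcf}",
        f"--num_shards={shards}",
    ]


def structural_variant_cmds(ref: str, bam: str, svsig: str, out_vcf: str) -> list:
    """pbsv two-step calling: discover signatures, then call variants (assumed flags)."""
    return [
        ["pbsv", "discover", bam, svsig],
        ["pbsv", "call", "--ccs", ref, svsig, out_vcf],
    ]


# Placeholder file names, for illustration only.
commands = [alignment_cmd("ref.fa", "hifi_reads.bam", "aligned.bam")]
commands.append(small_variant_cmd("ref.fa", "aligned.bam", "small_variants.vcf"))
commands.extend(
    structural_variant_cmds("ref.fa", "aligned.bam", "sample.svsig.gz", "svs.vcf")
)
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building each command as a list keeps the arguments unambiguous and lets the same lists be passed to `subprocess.run` once the flags have been confirmed.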
  21. 23.

    15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS

    SV Type  HG001   HG002   HG005
    DEL      21,784  22,074  21,897
    INS      24,691  25,354  25,231
    DUP      2,906   2,916   2,831
    CNV      108     107     97
    INV      45      45      44
    BND      752     712     708
  22. 24.

    15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS

    SV Type  HG001   HG002   HG005
    DEL      21,784  22,074  21,897
    INS      24,691  25,354  25,231
    DUP      2,906   2,916   2,831
    CNV      108     107     97
    INV      45      45      44
    BND      752     712     708

    Recall: 97.5%
    Precision: 94.6%
  23. 25.

    15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS

    SV Type  HG001   HG002   HG005
    DEL      21,784  22,074  21,897
    INS      24,691  25,354  25,231
    DUP      2,906   2,916   2,831
    CNV      108     107     97
    INV      45      45      44
    BND      752     712     708

    Recall: 97.5%
    Precision: 94.6%

    Also, structural variants detected from HiFi reads are reported with base pair precision.
  24. 26.

    15-FOLD HiFi READ COVERAGE RECOMMENDATION
    Wenger, A. et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. 2019. Nature Biotechnology.
  25. 28.

    HIFI ASSEMBLY WORKFLOW SKIPS PRE-ASSEMBLY
    Long Reads → [Error Correction/Pre-assembly] → Corrected Reads (preads) → [Overlap and Assembly Graph] → Contigs → [Read Alignment and Consensus] → Polished Contigs
  26. 29.

    HIFI ASSEMBLY WORKFLOW SKIPS PRE-ASSEMBLY
    HiFi Reads → [Overlap and Assembly Graph] → Contigs → [Read Alignment and Consensus] → Polished Contigs
    Long Reads → [Error Correction/Pre-assembly] → Corrected Reads (preads) → [Overlap and Assembly Graph] → Contigs → [Read Alignment and Consensus] → Polished Contigs
  27. 30.

    HIFI DATA GENERATES CONTIGUOUS, ACCURATE HUMAN ASSEMBLIES
    Human (HG002)

    Library Name                  Long Reads        HiFi
    Library Type                  BluePippin >15kb  11 kb SageELF
    Coverage                      50-fold           20-fold
    Assembly Length (Gb)          2.85              2.86
    Assembly Contig N50 (Mb)      12.6              16.2
    Contig Basepair Accuracy (Q)  45.1              50.0
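The table reports contig N50 without defining it. By the usual definition, N50 is the length of the contig at which the cumulative length, summed from the longest contig downward, first reaches half of the total assembly length. A minimal Python sketch of that definition; the contig lengths in the example are made up and are not from the assemblies in the talk.

```python
def n50(contig_lengths):
    """Return the contig N50 for an iterable of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the running sum covers half the assembly.
        if running * 2 >= total:
            return length
    raise ValueError("at least one contig length is required")


# Toy contig lengths, for illustration only.
toy_contigs = [2, 3, 4, 5, 6]
print("N50 =", n50(toy_contigs))
```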
  28. 31.

    CHEMISTRY AND FALCON CHANGES BOOST CONTIGUITY
    Human (HG002)

    Library Name                  Long Reads        HiFi           HiFi
    Library Type                  BluePippin >15kb  11 kb SageELF  15 kb SageELF
    Coverage                      50-fold           20-fold        22-fold
    Assembly Length (Gb)          2.85              2.86           2.92
    Assembly Contig N50 (Mb)      12.6              16.2           30.5
    Contig Basepair Accuracy (Q)  45.1              50.0           50.6

    Exposed new FALCON parameters to ignore indels and modify the minimum identity during overlap.
    [Figure labels: All Diff; Mismatches; % identity; Repeat-induced overlaps; True overlaps]
  29. 32.

    CHEMISTRY IMPROVEMENTS CLOSE THE GAPS IN RICE
    24-kb Library: Chr6, Chr7, Chr12 in single contig
    [Figure: dot plot (Minimap2, dgenies); label: Single contig]

    Asm Length (Mb)   403   400   400   402
    Contig N50 (Mb)   11.2  9.20  10.7  10.7
    N Contigs         209   296   135   154

    [Plot: Contig Base Quality in 100kb Windows; y-axis: Phred Quality (40 to 50); x-axis: CLR, HiFi 14 kb, HiFi 17 kb, HiFi 24 kb]
  30. 34.

    REDUCED ANALYSIS TIME WITH HIFI READS
    Compute times for de novo assembly of a human genome

    Data Type                HiFi Reads    Long Reads
    Input File Type          CCS.FASTQ.GZ  SUBREADS.BAM
    Input File Size (GB)     48            323
    Read Correction Method   CCS Analysis  Pre-assembly
    Time to Results (Hours)
      Read Correction        17.5          43.5
      Contig Assembly        13.7          18.9
      Total                  ~31 hrs       ~62 hrs

    Analyses run with PacBio recommended compute infrastructure.
    Faster compute time, lower compute and storage costs.
  31. 35.

    HIFI DATA NOW SUPPORTED BY FALCON-UNZIP
    https://github.com/PacificBiosciences/pbbioconda/wiki/Assembling-HiFi-data:-FALCON-Unzip3
    - Native support for HiFi
    - Faster read tracking and polishing
    - Incorporating HiFi-specific overlapping options
    - Phasing algorithm improvements
    ✔ ✔ ✔
  32. 36.

    CONCLUSIONS
    - HiFi reads enable accurate, comprehensive detection of genetic variation.
    - We recommend using pbmm2 for alignment, DeepVariant for small variant detection, and pbsv for structural variant detection.
    - Improvements to the Sequel II chemistry and optimization of the FALCON workflow have improved the contiguity and accuracy of HiFi assemblies at a fraction of the compute time and compute costs of CLR assembly.
  33. 37.

    ACKNOWLEDGMENTS
    Small variant detection: Andrew Carroll, Pi-Chuan Chang, Mark DePristo, Richard Hall, Alexey Kolesnikov, Justin Zook, Justin Wagner, and the Genome in a Bottle Consortium
    Structural variant detection: Armin Töpfer, Aaron Wenger
    De novo assembly: Sarah Kingan, Greg Concepcion, Jim Drake, Chris Dunn, Richard Hall, Jonas Korlach, Zev Kronenberg, Ivan Sovic, Michelle Vierra, Aaron Wenger, Alicia Yang, Greg Young
  34. 38.

    PACBIO SEQUEL II SYSTEM HIFI DATASETS

    Sample         Insert length                     Reads (SRA)
    HG001/NA12878  11 kb (30-fold)                   PRJNA540705
    HG002/NA24385  11 kb (30-fold)                   PRJNA527278
    HG005/NA24631  11 kb (30-fold)                   PRJNA540706
    HG002/NA24385  15 kb (30-fold), 20 kb (15-fold)  PRJNA586863
    Rice MH63      17 kb (38-fold)                   SRX6908794
    Rice MH63      24 kb (64-fold)                   SRX6957825
  35. 39.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.
    www.pacb.com