Variant detection and de novo assembly with HiFi reads

William Rowell
November 14, 2019

This applications update was presented as part of the 2019 PacBio User Group Meeting for Europe, the Middle East, and Africa in Milan, Italy.

Transcript

  1. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.
    Variant Detection and de novo assembly with HiFi reads
    Billy Rowell, Sr. Scientist, Bioinformatics Applications, PacBio
    @nothingclever
    #PBUGM

  2. What could you do with a 10 to 20 kb Sanger-quality CCS read?

  3. AGENDA
    - Refresher: What is a HiFi read?
    - Two applications that benefit from HiFi reads:
      - Variant detection
      - De novo assembly

  4. PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
    [Figure: CCS schematic. Labels: first round, rolling circle, subreads (passes), subread errors, generate consensus HiFi read.]

  5. PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
    [Figure: CCS schematic. Labels: first round, rolling circle, subreads (passes), subread errors, generate consensus HiFi read.]
    [Plot: accuracy (Phred, 0 to 50) versus number of passes (0 to 20), Sequel II (8M).]
    QV30 = 99.9% accuracy
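
The Phred scale on the accuracy axis relates a quality value Q to the per-base error probability p by Q = -10 * log10(p), which is how QV30 corresponds to 99.9% accuracy. A minimal Python sketch of the conversion follows; the function names are illustrative and do not come from the deck.

```python
import math


def phred_to_accuracy(qv: float) -> float:
    """Convert a Phred quality value to the expected per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


# Print the accuracy for a few round quality values.
for qv in (20, 30, 40, 50):
    print(f"QV{qv} = {phred_to_accuracy(qv) * 100:.3f}% accuracy")
```

The same relation explains the assembly figures later in the deck, where Q40 is quoted as 99.99%.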

  6. Variant detection
    Using Whole Genome Sequencing on the Sequel II System

  7. A COMPREHENSIVE VIEW OF THE GENOME
    [Figure: variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp). Figure labels: 5 Mb, 3 Mb, 10 Mb.]
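
The size classes on this slide can be written as a simple rule on allele lengths. The sketch below is illustrative only: the function name and the "other" label for equal-length multi-base substitutions are assumptions, and real variant callers apply their own logic.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a sequence-resolved variant using the slide's size thresholds."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"  # single-base substitution (1 bp)
    size = abs(len(alt) - len(ref))  # net inserted or deleted bases
    if size == 0:
        return "other"  # equal-length multi-base change, not covered on the slide
    if size >= 50:
        return "structural variant"  # >= 50 bp
    return "indel"  # 1-49 bp


print(classify_variant("A", "G"))
print(classify_variant("A", "ATTT"))
print(classify_variant("A" + "C" * 1733, "A"))
```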

  8. A COMPREHENSIVE VIEW OF THE GENOME
    [Figure: variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp). Figure labels: 5 Mb, 3 Mb, 10 Mb; Prior tech.]

  9. A COMPREHENSIVE VIEW OF THE GENOME
    - Unmappable regions
    - Segmental duplications and tandem repeats
    [Figure: variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp). Figure labels: 5 Mb, 3 Mb, 10 Mb; Prior tech.]

  10. A COMPREHENSIVE VIEW OF THE GENOME
    SMRT Sequencing provides even coverage across difficult to sequence regions of the genome.
    [Figure: read alignments at STRC. Prior tech: almost no coverage. PacBio reads (Haplotype 1, Haplotype 2): sequence straight through and detect variants, some falling in coding regions.]

  11. MORE COVERAGE IN MEDICALLY-RELEVANT GENES
    Mandelker, D. et al., Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. 2016. Genetics in Medicine 18, 1282-1289
    Wenger, A. et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. 2019. Nature Biotechnology.
    % problem exons resolved: genes
    100% ABCC6, ABCD1, ACAN, ACSM2B, AKR1C2, ALG1, ANKRD11, BCR, CATSPER2, CD177, CEL, CES1, CFH,
    CFHR1, CFHR3, CFHR4, CGB, CHEK2, CISD2, CLCNKA, CLCNKB, CORO1A, COX10, CRYBB2, CSH1, CYP11B1,
    CYP11B2, CYP21A2, CYP2A6, CYP2D6, CYP2F1, CYP4A22, DDX11, DHRS4L1, DIS3L2, DND1, DPY19L2,
    DUOX2, ESRRA, F8, FAM120A, FAM205A, FANCD2, FCGR1A, FCGR2A, FCGR3A, FCGR3B, FLG, FLNC, FOXD4,
    FOXO3, FUT3, GBA, GFRA2, GON4L, GRM5, GSTM1, GYPA, GYPB, GYPE, HBA1, HBA2, HBG1, HBG2, HP,
    HS6ST1, IDS, IFT122, IKBKG, IL9R, KIR2DL1, KIR2DL3, KMT2C, KRT17, KRT6A, KRT6B, KRT6C, KRT81, KRT86,
    LEFTY2, LPA, MST1, MUC5B, MYH6, MYH7, NEB, NLGN4X, NLGN4Y, NOS2, NOTCH2, NXF5, OPN1LW, OR2T5,
    OR51A2, PCDH11X, PCDHB4, PGAM1, PHC1, PIK3CA, PKD1, PLA2G10, PLEKHM1, PLG, PMS2, PRB1, PRDM9,
    PROS1, RAB40AL, RALGAPA1, RANBP2, RHCE, RHD, RHPN2, ROCK1, SAA1, SDHA, SDHC, SFTPA1, SFTPA2,
    SIGLEC14, SLC6A8, SMG1, SPATA31C1, SPTLC1, SRGAP2, SSX7, STAT5B, STK19, STRC, SULT1A1, SUZ12,
    TBX20, TCEB3C, TLR1, TLR6, TMEM231, TNXB, TRIOBP, TRPA1, TTN, TUBA1A, TUBB2B, UGT1A5, UGT2B15,
    UGT2B17, UNC93B1, VCY, VWF, WDR72, ZNF419, ZNF592, ZNF674
    [75%, 100%) ANAPC1, C4A, C4B, CHRNA7, CR1, DUX4, FCGR2B, HYDIN, OTOA, PDPK1, TMLHE
    [50%, 75%) ADAMTSL2, CDY2A, DAZ1, GTF2I, NAIP, OCLN, RPS17
    [25%, 50%) DAZ2, DAZ3, KIR3DL1, OPN1MW, PPIP5K1
    (0%, 25%) NCF1, RBMY1A1
    0% BPY2, CCL3L1, CCL4L1, CDY1, CFC1, CFC1B, GTF2IRD2, HSFY1, MRC1, OR4F5, PRY, PRY2, SMN1, SMN2,
    TSPY1, XKRY
    Genes per bin, from 0% up to 100%: 16, 2, 5, 7, 11, 152

  12. A COMPREHENSIVE VIEW OF THE GENOME
    - Long insertions
    - Events in repeat regions
    [Figure: variant classes by size: SNVs (1 bp), indels (1-49 bp), structural variants (≥50 bp). Figure labels: 5 Mb, 3 Mb, 10 Mb; Prior tech.]

  13. A COMPREHENSIVE VIEW OF THE GENOME
    SMRT Sequencing provides long read lengths to span large structural variants.
    [Figure: read alignments over a repeat region. PacBio reads (Haplotype 1, Haplotype 2): 1,733 bp deletion. Prior tech: deletion not detected.]

  14. PRECISION & RECALL – DEFINITIONS
    NIST Genome in a Bottle Consortium
    Query callset vs. benchmark (“truth”) callset, within benchmark “high confidence” regions
    Recall: percentage of truth that is called = TP/(TP+FN)
    Precision: percentage of calls that are correct = TP/(TP+FP)

    Metric           Abbreviation   Benchmark   Variant calls
    True Positive    TP             ✓           ✓
    False Positive   FP             -           ✓
    False Negative   FN             ✓           -
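
The two definitions translate directly into code. In the sketch below the function names are illustrative and the counts are made-up placeholders, not results from the deck.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of benchmark (truth) variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of calls that match the benchmark: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts inside the benchmark high-confidence regions.
tp, fp, fn = 950, 50, 25
print(f"Recall:    {recall(tp, fn):.1%}")
print(f"Precision: {precision(tp, fp):.1%}")
```

Both functions raise ZeroDivisionError when the denominator is zero, which is a reasonable signal that the callset or benchmark is empty.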

  15. TYPES OF VARIANTS IN A GENOME
    SMRT Sequencing provides comprehensive detection of all variant types.
    Small variants (<50 bp)

  16. VARIANT DETECTION WORKFLOWS
    gDNA reads → alignment → variant detection → variant calls (vcf)

  17. VARIANT DETECTION WORKFLOWS
    Prior tech → alignment (BWA MEM) → variant detection (GATK HaplotypeCaller) → variant calls (vcf)

  18. VARIANT DETECTION WORKFLOWS
    Alignment (BWA MEM) → variant detection (GATK HaplotypeCaller)
    HiFi reads → alignment (minimap2) → variant detection (GATK HaplotypeCaller) → variant calls (vcf)

  19. WGS SMALL VARIANT CALLING WORKFLOW WITH DEEPVARIANT
    HiFi reads → alignment (pbmm2) → variant detection (Google DeepVariant) → variant calls (vcf)

  20. WGS SMALL VARIANT CALLING WORKFLOW WITH DEEPVARIANT HAS HIGH PRECISION AND RECALL
    HiFi reads → alignment (pbmm2) → variant detection (Google DeepVariant) → variant calls (vcf)
    - DeepVariant learns the error model of HiFi reads from training data
    - Improved precision and recall for both SNVs and indels
    DeepVariant v0.9 was released today with training on Sequel II System Chemistry 2.0 and improved performance on all chemistries.

    15-fold HiFi vs GIAB v3.3.2 benchmarks
    Sample   SNV Recall   SNV Precision   Indel Recall   Indel Precision
    HG001    99.1%        99.5%           94.1%          95.0%
    HG002    99.2%        99.5%           95.4%          96.6%
    HG005    99.4%        99.7%           97.0%          97.5%

  21. TYPES OF VARIANTS IN A GENOME
    SMRT Sequencing provides comprehensive detection of all variant types.
    Structural variants (≥50 bp)

  22. WGS STRUCTURAL VARIANT CALLING WORKFLOW WITH PBSV
    HiFi reads → alignment (pbmm2) → variant detection (pbsv) → variant calls (vcf)

  23. 15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS
    SV Type   HG001    HG002    HG005
    DEL       21,784   22,074   21,897
    INS       24,691   25,354   25,231
    DUP       2,906    2,916    2,831
    CNV       108      107      97
    INV       45       45       44
    BND       752      712      708

  24. 15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS
    SV Type   HG001    HG002    HG005
    DEL       21,784   22,074   21,897
    INS       24,691   25,354   25,231
    DUP       2,906    2,916    2,831
    CNV       108      107      97
    INV       45       45       44
    BND       752      712      708

    Recall: 97.5%
    Precision: 94.6%

  25. 15-FOLD HIFI READS PROVIDE A COMPREHENSIVE VIEW OF STRUCTURAL VARIANTS
    SV Type   HG001    HG002    HG005
    DEL       21,784   22,074   21,897
    INS       24,691   25,354   25,231
    DUP       2,906    2,916    2,831
    CNV       108      107      97
    INV       45       45       44
    BND       752      712      708

    Recall: 97.5%
    Precision: 94.6%
    Also, structural variants detected from HiFi reads are reported with base pair precision.

  26. 15-FOLD HiFi READ COVERAGE RECOMMENDATION
    Wenger, A. et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. 2019. Nature Biotechnology.

  27. De novo assembly
    Using Whole Genome Sequencing on the Sequel II System

  28. HIFI ASSEMBLY WORKFLOW SKIPS PRE-ASSEMBLY
    Long Reads → Error Correction/Pre-assembly → Corrected Reads (preads) → Overlap and Assembly Graph → Contigs → Read Alignment and Consensus → Polished Contigs

  29. HIFI ASSEMBLY WORKFLOW SKIPS PRE-ASSEMBLY
    HiFi Reads → Overlap and Assembly Graph → Contigs → Read Alignment and Consensus → Polished Contigs
    Long Reads → Error Correction/Pre-assembly → Corrected Reads (preads) → Overlap and Assembly Graph → Contigs → Read Alignment and Consensus → Polished Contigs

  30. HIFI DATA GENERATES CONTIGUOUS, ACCURATE HUMAN ASSEMBLIES
    Human (HG002)
    Library Name                   Long Reads          HiFi
    Library Type                   BluePippin >15 kb   11 kb SageELF
    Coverage                       50-fold             20-fold
    Assembly Length (Gb)           2.85                2.86
    Assembly Contig N50 (Mb)       12.6                16.2
    Contig Basepair Accuracy (Q)   45.1                50.0
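
Contig N50, the contiguity metric in this table, is the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly length. A minimal sketch of the standard definition follows; the function name and the toy lengths are illustrative, not data from the deck.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the running sum covers half the assembly.
        if running * 2 >= total:
            return length
    raise ValueError("contig_n50 requires at least one contig")


# Toy example: total length 20, so the running sum must reach 10.
print(contig_n50([2, 3, 4, 5, 6]))
```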

  31. CHEMISTRY AND FALCON CHANGES BOOST CONTIGUITY
    Human (HG002)
    Library Name                   Long Reads          HiFi            HiFi
    Library Type                   BluePippin >15 kb   11 kb SageELF   15 kb SageELF
    Coverage                       50-fold             20-fold         22-fold
    Assembly Length (Gb)           2.85                2.86            2.92
    Assembly Contig N50 (Mb)       12.6                16.2            30.5
    Contig Basepair Accuracy (Q)   45.1                50.0            50.6

    Exposed new FALCON parameters to ignore indels and modify the minimum identity during overlap.
    [Figure: % identity of repeat-induced overlaps and true overlaps. Panels: All Diff, Mismatches.]
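
The idea of ignoring indels when scoring overlaps can be sketched as follows. This is not FALCON's implementation: the function names, the 99.9% threshold, and the alignment counts are hypothetical, chosen only to show how an identity computed from mismatches alone differs from one computed from all differences.

```python
def overlap_identity(matches: int, mismatches: int, insertions: int,
                     deletions: int, ignore_indels: bool = False) -> float:
    """Percent identity of an overlap, optionally counting mismatches only."""
    errors = mismatches if ignore_indels else mismatches + insertions + deletions
    aligned = matches + errors
    if aligned == 0:
        raise ValueError("overlap has no aligned columns")
    return 100.0 * matches / aligned


def keep_overlap(matches: int, mismatches: int, insertions: int, deletions: int,
                 min_identity: float = 99.9, ignore_indels: bool = True) -> bool:
    """Keep an overlap only if its identity meets the minimum threshold."""
    identity = overlap_identity(matches, mismatches, insertions, deletions,
                                ignore_indels)
    return identity >= min_identity


# Hypothetical overlap: few mismatches, several indels.
counts = dict(matches=9990, mismatches=2, insertions=5, deletions=3)
print(overlap_identity(**counts))                      # all differences
print(overlap_identity(**counts, ignore_indels=True))  # mismatches only
```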

  32. CHEMISTRY IMPROVEMENTS CLOSE THE GAPS IN RICE
    24-kb Library: Chr6, Chr7, Chr12 in single contig
    [Figure: dot plot (Minimap2, dgenies). Label: Single contig.]
    Asm Length (Mb)   403    400    400    402
    Contig N50 (Mb)   11.2   9.20   10.7   10.7
    N Contigs         209    296    135    154
    [Plot: Contig Base Quality in 100kb Windows. Phred Quality (40 to 50) for CLR, HiFi 14 kb, HiFi 17 kb, HiFi 24 kb.]

  33. GENERATE COMPLETE AND ACCURATE GENOME ASSEMBLIES
    Accuracies >Q40 (99.99%)
    >94% of genes in frame

  34. REDUCED ANALYSIS TIME WITH HIFI READS
    Compute times for de novo assembly of a human genome
    Data Type                 HiFi Reads     Long Reads
    Input File Type           CCS.FASTQ.GZ   SUBREADS.BAM
    Input File Size (GB)      48             323
    Read Correction Method    CCS Analysis   Pre-assembly
    Time to Results (Hours)
      Read Correction         17.5           43.5
      Contig Assembly         13.7           18.9
      Total                   ~31 hrs        ~62 hrs
    Analyses run with PacBio recommended compute infrastructure
    Faster compute time, lower compute and storage costs
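
The approximate totals on this slide follow from summing the two stages for each data type. A short sketch of that arithmetic, using only the numbers in the table (the variable names are illustrative):

```python
# Hours per stage, copied from the table on this slide.
hours = {
    "HiFi Reads": {"read_correction": 17.5, "contig_assembly": 13.7},
    "Long Reads": {"read_correction": 43.5, "contig_assembly": 18.9},
}
input_size_gb = {"HiFi Reads": 48, "Long Reads": 323}

# Total time to results for each data type.
totals = {name: sum(stages.values()) for name, stages in hours.items()}
speedup = totals["Long Reads"] / totals["HiFi Reads"]
size_ratio = input_size_gb["Long Reads"] / input_size_gb["HiFi Reads"]

for name, total in totals.items():
    print(f"{name}: {total:.1f} h")
print(f"Long-read / HiFi compute time: {speedup:.1f}x")
print(f"Long-read / HiFi input size: {size_ratio:.1f}x")
```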

  35. HIFI DATA NOW SUPPORTED BY FALCON-UNZIP
    https://github.com/PacificBiosciences/pbbioconda/wiki/Assembling-HiFi-data:-FALCON-Unzip3
    - Native support for HiFi
    - Faster read tracking and polishing
    - Incorporating HiFi-specific overlapping options
    - Phasing algorithm improvements

  36. CONCLUSIONS
    - HiFi reads enable accurate, comprehensive detection of genetic variation.
    - We recommend using pbmm2 for alignment, DeepVariant for small variant detection, and pbsv for structural variant detection.
    - Improvements to the Sequel II chemistry and optimization of the FALCON workflow have improved the contiguity and accuracy of HiFi assemblies at a fraction of the compute time and compute costs of CLR assembly.
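
The recommended tool chain could be assembled roughly as below. This is a sketch under stated assumptions, not a command set taken from the deck: the file names are hypothetical, and the flags (`--preset CCS`, `--sort`, `--sample`, `--model_type=PACBIO`, `--ccs`) are written as recalled from each tool's documentation and may differ between versions, so check them against `--help` before use. The code only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def hifi_variant_commands(ref_fasta: str, hifi_reads: str, sample: str) -> List[List[str]]:
    """Build (but do not run) commands for alignment and variant calling.

    Output file names are hypothetical; flags are assumptions to verify
    against the installed version of each tool.
    """
    aligned = f"{sample}.pbmm2.bam"
    svsig = f"{sample}.svsig.gz"
    return [
        # Alignment with pbmm2, sorted output, sample name set for downstream tools.
        ["pbmm2", "align", ref_fasta, hifi_reads, aligned,
         "--preset", "CCS", "--sort", "--sample", sample],
        # Small variants with DeepVariant's PacBio model.
        ["run_deepvariant", "--model_type=PACBIO", f"--ref={ref_fasta}",
         f"--reads={aligned}", f"--output_vcf={sample}.deepvariant.vcf.gz"],
        # Structural variants with pbsv: signature discovery, then calling.
        ["pbsv", "discover", aligned, svsig],
        ["pbsv", "call", "--ccs", ref_fasta, svsig, f"{sample}.pbsv.vcf"],
    ]


for command in hifi_variant_commands("ref.fa", "HG002.ccs.bam", "HG002"):
    print(" ".join(shlex.quote(arg) for arg in command))
```

Returning argument lists rather than shell strings keeps quoting explicit and lets the same lists be passed to `subprocess.run` if desired.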

  37. ACKNOWLEDGMENTS
    Small variant detection: Andrew Carroll, Pi-Chuan Chang, Mark DePristo, Richard Hall, Alexey Kolesnikov, Justin Zook, Justin Wagner, and the Genome in a Bottle Consortium
    Structural variant detection: Armin Töpfer, Aaron Wenger
    De novo assembly: Sarah Kingnan, Greg Concepcion, Jim Drake, Chris Dunn, Richard Hall, Jonas Korlach, Zev Kronenberg, Ivan Sovic, Michelle Vierra, Aaron Wenger, Alicia Yang, Greg Young

  38. PACBIO SEQUEL II SYSTEM HIFI DATASETS
    Sample          Insert length                      Reads (SRA)
    HG001/NA12878   11 kb (30-fold)                    PRJNA540705
    HG002/NA24385   11 kb (30-fold)                    PRJNA527278
    HG005/NA24631   11 kb (30-fold)                    PRJNA540706
    HG002/NA24385   15 kb (30-fold), 20 kb (15-fold)   PRJNA586863
    Rice MH63       17 kb (38-fold)                    SRX6908794
    Rice MH63       24 kb (64-fold)                    SRX6957825

  39. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.
    www.pacb.com