Scaling-up Production of High-accuracy Long Reads and Haplotype-phased Assemblies

GenomeArk
January 16, 2019

Jonas Korlach

Transcript

  1. For Research Use Only. Not for use in diagnostic procedures.

    © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.

    Scaling-up Production of High-accuracy Long Reads and Haplotype-phased Assemblies. J. Korlach, January 16, 2019
  2. TOPICS

    - Sequel 8M Chip
    - Assembly Developments
    - Iso-Seq Developments
  3. 8M DATA IN HOUSE

    Long-insert human sample (10-hour acquisition):

    | METRICS | 1M | 8M | 8M |
    |---|---|---|---|
    | Number of Bases | 15 Gb | 96 Gb | 99 Gb |
    | Number of Reads | 542,168 | 4,350,437 | 5,089,596 |
    | Pol. Read Length (Mean) | 29,047 bp | 22,079 bp | 19,451 bp |
    | Pol. Read Length (N50) | 43,868 bp | 36,813 bp | 33,088 bp |

    - Up to 8 Cells per machine run
    - Up to 0.8 Tb per machine run (80 hours)
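
The per-run figures on slide 3 follow from multiplying the per-cell numbers by the cell count. Here is a minimal sketch of that arithmetic, assuming roughly 100 Gb per 8M cell (rounded from the 96 Gb and 99 Gb runs in the table) and cells run back to back. The function name `run_totals` is hypothetical.

```python
def run_totals(cells, gb_per_cell, hours_per_cell):
    """Return (total yield in Tb, total hours) for cells run one after another."""
    total_tb = cells * gb_per_cell / 1000  # 1 Tb = 1000 Gb
    total_hours = cells * hours_per_cell
    return total_tb, total_hours


# Assumption: ~100 Gb per 8M cell, rounded from the slide's 96 Gb and 99 Gb runs.
total_tb, total_hours = run_totals(cells=8, gb_per_cell=100, hours_per_cell=10)
print(f"{total_tb:.1f} Tb in {total_hours} hours")
```

With these inputs the sketch reproduces the slide's "0.8 Tb per machine run (80 hours)".
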
  4. CURRENT PERFORMANCE – LONG INSERTS

    - Example yield data: ~80-120 Gb in 10 hours

    Samples: B. subtilis, E. coli, O. sativa

    | METRICS | 1M | 8M - human | 8M - human | 8M - E. coli | 8M - B. subtilis | 8M - rice |
    |---|---|---|---|---|---|---|
    | Number of Bases | 15 Gb | 96 Gb | 99 Gb | 86 Gb | 116 Gb | 85 Gb |
    | Number of Reads | 542,168 | 4,350,437 | 5,089,596 | 5,362,197 | 5,997,224 | 5,095,529 |
    | Pol. Read Length (Mean) | 29,047 bp | 22,079 bp | 19,451 bp | 15,952 bp | 19,303 bp | 16,758 bp |
    | Pol. Read Length (N50) | 43,868 bp | 36,813 bp | 33,088 bp | 27,815 bp | 32,932 bp | 28,975 bp |
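
Every column on slide 4 reports an N50 well above the mean read length. That is expected, because N50 weights each read by the bases it carries. The sketch below shows the standard N50 definition on a toy list of lengths. The data are hypothetical, and this is not the code used to produce the slide's metrics.

```python
def n50(lengths):
    """Largest length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


toy_reads = [2, 3, 4, 5, 6]  # hypothetical read lengths
mean_length = sum(toy_reads) / len(toy_reads)
print(f"mean = {mean_length}, N50 = {n50(toy_reads)}")
```

For the toy list the mean is 4.0 and the N50 is 5.
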
  5. DATA IN HOUSE WITH PROTOTYPE CELLS

    | METRICS | 1M | 8M | 8M |
    |---|---|---|---|
    | Number of Bases | 44 Gb | 320 Gb | 318 Gb |
    | Number of Reads | 517,746 | 4,053,000 | 3,725,756 |
    | Pol. Read Length (Mean) | 84,420 bp | 78,877 bp | 85,294 bp |
    | Pol. Read Length (N50) | 167,267 bp | 166,571 bp | 173,706 bp |

    10-12 kb human CCS template (20-30 hr acquisition):

    | METRICS | |
    |---|---|
    | Number of >QV20 Bases | 21 Gb |
    | Number of >QV20 Reads | 1,855,642 |
    | CCS Insert Length (Mean) | 11,322 bp |
    | CCS Read Score (Mean) | 99.8% |

    - Up to 8 Cells per machine run
    - Up to 2.4 Tb per machine run (200 hrs)
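
A rough way to relate slide 5's two tables is to estimate how many times the polymerase traverses the insert. The sketch below divides mean polymerase read length by mean CCS insert length. This is a back-of-the-envelope assumption: it ignores adapter sequence and the spread of both distributions, and the slide does not state which run the CCS metrics come from. The function name `approx_passes` is hypothetical.

```python
def approx_passes(mean_pol_read_bp, mean_insert_bp):
    """Rough pass count: polymerase read length divided by insert length."""
    if mean_insert_bp <= 0:
        raise ValueError("insert length must be positive")
    return mean_pol_read_bp / mean_insert_bp


# Values from the slide; pairing the third column with the CCS table is an assumption.
passes = approx_passes(mean_pol_read_bp=85_294, mean_insert_bp=11_322)
print(f"approximately {passes:.1f} passes per insert")
```
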
  6. CCS ACCURACY COMPARISON

    PACIFIC BIOSCIENCES® CONFIDENTIAL

    - 1M:
      - ~4 passes for Q20
      - ~10 passes for Q30
    - 8M:
      - ~3 passes for Q20
      - ~8 passes for Q30
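
The Q20 and Q30 thresholds on slide 6 are Phred-scaled quality values, where Q = -10 log10(error probability). Here is a short sketch of the conversion in both directions. The function names are hypothetical.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality value to per-base accuracy."""
    return 1 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (strictly below 1.0) to a Phred quality value."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1 - accuracy)


for q in (20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.1%} accurate")
```

Q20 corresponds to 99.0% accuracy and Q30 to 99.9%, so each additional 10 Q points cuts the error rate tenfold.
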
  7. ASSEMBLY DEVELOPMENTS

    1. FALCON(-UNZIP), Arrow
       - Pb-assembly available in Bioconda (re-engineered Unzip code)
       - Next release will use minimap instead of blasr for polishing
    2. FALCON-Phase
       - https://www.biorxiv.org/content/early/2018/05/21/327064
       - Works well in our hands; VGP has numerous data sets that could be tried
  8. APPLICATION TO LOW AND HIGH HETEROZYGOSITY SAMPLES

    Photo credits: The Cut; Tim Smith; Jim Bendon

    | Sample | Human (HG00733) | F1 Bull | Zebrafinch |
    |---|---|---|---|
    | Heterozygosity | 0.17-0.21% | 0.65-0.93% | 1.57-1.72% |
    | Genome Size | 2.9 Gb | 2.7 Gb | 1.1 Gb |
    | Unzipped | 84% | 87% | 74% |
    | HiC Read Pairs | 504M, 2x100 bp | 203M, 2x80 bp | 319M, 2x150 bp |
    | Accuracy | 80% | 97% | 98% |
  9. PHASING CHROMOSOME-SCALE SCAFFOLDS

    - Scaffold one set of full-length haplotigs with Proximo (Phase Genomics)
    - Scaffolds are chromosome-scale
    - We know:
      - order of contigs along scaffold
      - pairing of phase 0 and phase 1
    - Run FALCON-Phase on scaffolds

    [Figure: workflow labeled "Scaffold Phase0 Contigs" and "Rerun Phasing"; plot "PARENTAL SNVS AFTER SCAFFOLD PHASING" with panels "Scaffold 0" and "Scaffold 1"]

    Output: Chromosome-scale, phased, diploid assembly!
  11. ASSEMBLY DEVELOPMENTS

    1. FALCON(-UNZIP), Arrow
       - Pb-assembly available in Bioconda (re-engineered Unzip code)
       - Next release will use minimap instead of blasr for polishing
    2. FALCON-Phase
       - https://www.biorxiv.org/content/early/2018/05/21/327064
       - Works well in our hands; VGP has numerous data sets that could be tried
    3. HiFi-based assembly
       - https://www.biorxiv.org/content/early/2019/01/13/519025
       - Exploratory, only tested on very limited number of species (human, grape, tuna in progress), not yet a full workflow
  12. CCS READS PRODUCE HIGHLY CONTIGUOUS AND ACCURATE DE NOVO GENOME ASSEMBLIES (TABLE 2, FIGURE 4)

  13. CCS READS PRODUCE HIGHLY CONTIGUOUS AND ACCURATE DE NOVO GENOME ASSEMBLIES (TABLE 2, FIGURE 4)

    Figure labels: 6x, 77x, 202x

  14. 15-FOLD CCS COVERAGE IS SUFFICIENT FOR ALL APPLICATIONS (SUPPLEMENTARY FIGURES 11, 12)
  15. BLUE FIN TUNA (IN PROGRESS)

    Collaboration with B. Block (Stanford)

    - Have good CLR assembly (presented at webinar Nov 1, 2018): https://www.nature.com/webcasts/event/assembling-high-quality-genomes-to-solve-natures-mysteries/
    - Example CCS run performance:

    | METRICS | 1M |
    |---|---|
    | Number of Bases | 42 Gb |
    | Number of Reads | 351,297 |
    | Pol. Read Length (Mean) | 119,705 bp |
    | Pol. Read Length (N50) | 212,769 bp |

    | METRICS | |
    |---|---|
    | Number of >QV20 Bases | 2.6 Gb |
    | Number of >QV20 Reads | 203,524 |
    | CCS Insert Length (Mean) | 12,624 bp |
    | CCS Read Score (Median) | Q33 |
  16. ASSEMBLY RESULTS FOR CABERNET SAUVIGNON CLONE 8

  17. ASSEMBLY DEVELOPMENTS

    1. FALCON(-UNZIP), Arrow
       - Pb-assembly available in Bioconda (re-engineered Unzip code)
       - Next release will use minimap instead of blasr for polishing
    2. FALCON-Phase
       - https://www.biorxiv.org/content/early/2018/05/21/327064
       - Works well in our hands; VGP has numerous data sets that could be tried
    3. HiFi-based assembly
       - https://www.biorxiv.org/content/early/2019/01/13/519025
       - Exploratory, only tested on very limited number of species (human, grape, tuna in progress), not yet a full workflow
    4. HiFi + Hi-C
       - Map Hi-C data directly to the reads
       - Another way of ‘binning’ the reads for samples where parents not available (may be easier than raw reads)
  18. ISO-SEQ DEVELOPMENTS

    https://www.pacb.com/software

    Iso-Seq 3:

    - Improved runtime
    - Increased de-multiplexing accuracy
    - Improved artifact detection
    - Same transcript recovery as Iso-Seq 1 and 2
    - Works for whole and targeted transcriptome

    IsoPhase:

    - Allele-specific expression resolution
  19. CAPACITY CONSIDERATIONS FOR VGL FOR 2019

    - Assuming average genome size of 2 Gb
    - Throughput per day:
      - 60 Gb / SMRT Cell (long-insert, 2 x 10 hr)
        - 60-fold CLR coverage for 2 Gb genome (good for 1 species @60x traditional long-insert assembly)
      - 300 Gb / SMRT Cell (Hi-Fi mode, 24 hr)
        - 10-fold HiFi coverage (good for 0.5 species @20x Hi-Fi read assembly)
    - CLR approach:
      - 1 species per day = ~30 species per instrument per month
      - 2 instruments = ~60 species per month, ~200 species in 4 months

    Sequence ~200 genomes by September
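
The capacity estimate on slide 19 can be checked with a few lines of arithmetic. The sketch below assumes two 10-hour long-insert cells per day (one reading of "2 x 10 hr") and 20 Gb of HiFi bases per 24-hour cell, which is what the slide's 10-fold figure implies for a 2 Gb genome if the 300 Gb is raw yield. The function names and the 30-day month are illustrative choices.

```python
GENOME_GB = 2.0  # average genome size assumed on the slide


def fold_coverage(yield_gb, genome_gb=GENOME_GB):
    """Sequencing depth obtained from a given yield."""
    return yield_gb / genome_gb


def species_per_day(daily_yield_gb, target_fold, genome_gb=GENOME_GB):
    """Number of genomes that reach the target depth per instrument-day."""
    return fold_coverage(daily_yield_gb, genome_gb) / target_fold


# CLR: 60 Gb per cell, two 10-hour cells per day (assumed reading of "2 x 10 hr").
clr_rate = species_per_day(daily_yield_gb=60 * 2, target_fold=60)

# HiFi: assumption of 20 Gb of HiFi bases per day, implied by the 10-fold figure.
hifi_rate = species_per_day(daily_yield_gb=20, target_fold=20)

clr_per_month = clr_rate * 30            # one instrument, 30-day month
clr_4_months = clr_per_month * 2 * 4     # two instruments, four months

print(f"CLR: {clr_rate} species/day, HiFi: {hifi_rate} species/day")
print(f"CLR, 2 instruments, 4 months: {clr_4_months:.0f} species at full uptime")
```

At full uptime the CLR route gives 240 species in four months. The slide quotes the more conservative ~200.
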
  20. DIPLOID GENOMES DESERVE “DIPLOID TRANSCRIPTOMES”

    Allele-specific expression resolution: IsoPhase algorithm (E. Tseng)
  21. ALLELE-SPECIFIC EXPRESSION RESOLUTION

    - Two parents express different isoforms (3’ exon difference)
  22. ALLELE-SPECIFIC EXPRESSION RESOLUTION

    - Two parents express different isoforms (3’ exon difference)
    - Both F1s inherit the allele-specific isoform expression

    For more information see poster PO0087
  23. TOPICS

    - Sequel 8M Chip
    - Assembly Developments
    - Iso-Seq Developments