Falcon Unzip Presentation for Genome Informatics 2015

Jason Chin
October 30, 2015

Transcript

  1. Diploid Genome Assembly and Comprehensive Haplotype Sequence Reconstruction

    Jason Chin, Paul Peluso, David Rank, Fritz Sedlazeck, Maria Nattestad, Michael Schatz, Alicia Clum, Kerrie Barry, Alex Copeland
    For Research Use Only. Not for use in diagnostic procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Acknowledgments

    - All PacBio colleagues
    - Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk Institute)
    - Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute)
    - Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL)
    - Open source toolsets:
      - Daligner, Gene Myers
      - Blasr, Mark Chaisson
      - Python, NetworkX for rapid algorithm prototyping
      - Gephi, Graphviz for graph visualization
  3. IN 2013

  4. DIPLOID ASSEMBLY GRAPH IN 2013

    - A unitig graph from Ler-0 + Col-0 data
    - “Bubbles” caused by SVs between homologous copies
    - Branching points caused by repeats
    - The graph “diameter” ~ 12 Mbp
    - Mean edge size = 17.4 kbp
  5. PACBIO HAPLOID HUMAN ASSEMBLY IN 2015

    - ~12,000 CPU hours
    - N50 ~ 10 Mb to 30 Mb, depending on DNA sample and sequencing quality
    - Longest contig that we ever assembled: ~109 Mb
    - http://www.pacb.com/blog/toward-platinum-genomes-pacbio-releases-a-new-higher-quality-chm1-assembly-to-ncbi/
    - Google search: “pacbio chm1 assembly blog”
  6. SOLVING THE DIPLOID ASSEMBLY PROBLEM

    - Falcon (a polyploid-aware assembler): generating the contigs through the bubbles
    - Falcon Unzip: identifying smaller variants and using them to separate the haplotypes
  7. THE FALCON UNZIP PROCESS

    [Figure: Falcon produces a primary contig plus associate contig 1 and associate contig 2 (alternative alleles); the copies differ by SNPs and SVs. Augmented with the haplotype information of each read, Falcon Unzip produces an updated primary contig + “associate haplotigs”.]
  8. PHASING READS INTO HAPLOTYPE GROUPS

    - Align SMRT reads to the initial primary contig
    - Identify het-SNPs
    - Phase het-SNPs
    - Group reads with phased SNPs
    - Reconstruct haplotypes (haplotype 0, haplotype 1)
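The "group reads with phased SNPs" step can be illustrated with a minimal sketch. It assumes the het-SNPs are already phased and that each read's base at each SNP position is known from its alignment. The function name, the dictionary layout, and the `min_margin` tie-breaking parameter are hypothetical and are not taken from the Falcon Unzip code.

```python
def assign_read_haplotype(read_alleles, phased_snps, min_margin=1):
    """Vote a read into haplotype 0 or 1 (hypothetical helper).

    read_alleles: {position: base observed in the read}
    phased_snps:  {position: (haplotype 0 base, haplotype 1 base)}
    Returns 0, 1, or None when the read cannot be assigned.
    """
    votes = [0, 0]
    for pos, base in read_alleles.items():
        if pos not in phased_snps:
            continue  # not a phased het-SNP site
        hap0_base, hap1_base = phased_snps[pos]
        if base == hap0_base:
            votes[0] += 1
        elif base == hap1_base:
            votes[1] += 1
        # any other base is treated as a sequencing error and ignored
    if abs(votes[0] - votes[1]) < min_margin:
        return None  # tie or no informative site: leave the read unphased
    return 0 if votes[0] > votes[1] else 1


# Toy data (made up for illustration)
phased_snps = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}
reads = {
    "read_a": {100: "A", 250: "C"},
    "read_b": {250: "T", 400: "A"},
    "read_c": {100: "A", 250: "T"},  # conflicting evidence
    "read_d": {900: "C"},            # covers no phased SNP
}
groups = {name: assign_read_haplotype(alleles, phased_snps)
          for name, alleles in reads.items()}
```

Reads that return `None` stay unphased in this sketch; how a real pipeline treats them is a separate design decision.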
  9. MERGE HAPLOTYPE INFORMATION AND “UNZIP”

    - Tiling path of haplotype 0
    - Tiling path of haplotype 1
    - Remove edges connecting different haplotypes
  10. PUT EVERYTHING TOGETHER: “Falcon Unzip Process”

    - Add missing haplotype-specific nodes & edges
    - Remove edges that connect different haplotypes
    - The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs.
    - Figure labels: ~4.80 Mbp; 4 major haplotype-phased blocks; un-phased region
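The edge-removal step on the two slides above can be sketched as follows. This is only an illustration under stated assumptions: the graph is a plain list of read-to-read edges, each phased read carries a hypothetical `(phasing block, haplotype)` tuple, and unphased reads are simply absent from the lookup. The acknowledgments mention NetworkX for prototyping; plain Python containers are used here to keep the sketch dependency-free.

```python
def remove_cross_haplotype_edges(edges, read_phase):
    """Drop edges whose endpoints sit in the same phasing block but in
    different haplotypes (hypothetical helper, not the Falcon Unzip code).

    edges:      list of (source read, target read) pairs
    read_phase: {read: (phasing block id, haplotype)}; unphased reads are absent
    """
    kept = []
    for source, target in edges:
        source_phase = read_phase.get(source)
        target_phase = read_phase.get(target)
        if source_phase is not None and target_phase is not None:
            same_block = source_phase[0] == target_phase[0]
            different_haplotype = source_phase[1] != target_phase[1]
            if same_block and different_haplotype:
                continue  # this edge would fuse the two haplotypes
        kept.append((source, target))
    return kept


# Toy graph: r1, r2 are haplotype 0; r3, r4 are haplotype 1; r5 is unphased
read_phase = {"r1": (0, 0), "r2": (0, 0), "r3": (0, 1), "r4": (0, 1)}
edges = [("r1", "r2"), ("r1", "r3"), ("r3", "r4"),
         ("r2", "r4"), ("r2", "r5"), ("r4", "r5")]
unzipped = remove_cross_haplotype_edges(edges, read_phase)
```

In the toy graph both haplotype paths still reach the unphased read `r5`, which mirrors how phased blocks rejoin in an un-phased region.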
  11. USING IN SILICO F1 FOR EVALUATING PHASING ACCURACY

    - Two inbred lines, CVI-0 and Col-0, were sequenced separately about 1.5 years ago with P5C3 chemistry
    - Characterize the variations between the two strains with the per-strain haploid assemblies:
      - High SV density: big SV every 80 kb
      - High SNP density: SNP every 100 to 300 bp
    - In silico diploid dataset: mixture of the two datasets to emulate a diploid genome at about 80x coverage
    - Figure: 9.49 Mb haplotype-fused assembly graph
  12. ARABIDOPSIS THALIANA ASSEMBLY COMPARISON

    [Bar charts: Assembly Size (Mbp) and N50 Size (Mbp) for COL-0, CA/HGAP; CVI-0, CA/HGAP; COL-0 x CVI-0 CA/HGAP; COL-0 x CVI-0 Falcon Primary; COL-0 x CVI-0 Falcon Associate; COL-0 x CVI-0 Falcon-unzip Primary; COL-0 x CVI-0 Falcon-unzip Associate. Chart annotations: 49 kb, 39 kb.]
    - Falcon Unzip: 85% of the genome resolved to haplotigs, with haplotig N50 = 973 kb
    - Expected 1n genome size = 135 Mb
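For readers unfamiliar with the N50 statistic used in these comparisons, here is a minimal sketch of the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name and the toy lengths are illustrative and are not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the total assembly size
            return length


# Toy contig lengths (arbitrary units, made up for illustration)
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)  # total is 32; the two 8s already cover half
```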
  13. HAPLOTYPE ACCURACY

    [Plot: Switching Rate (0.00% to 2.00%) vs. Contig Length (0 to 3.50 Mbp); labels: COL, CVI]
    - Over the full haplotig assembly, the switching error rate is about 0.5%
    - “Switching rate” is defined as “incorrect junctions / total fragments in the contigs”. For example, switching rate = 1/5 = 0.20
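The switching-rate definition on this slide can be written out as a short sketch. It assumes each contig has already been cut into fragments labeled by their strain of origin (COL or CVI), and that a junction counts as incorrect when two neighboring fragments have different labels. The function name and the example labels are hypothetical.

```python
def switching_rate(fragment_origins):
    """Incorrect junctions divided by total fragments, per the slide's definition.

    fragment_origins: ordered list of strain labels, one per fragment
    """
    if not fragment_origins:
        raise ValueError("need at least one fragment")
    incorrect_junctions = sum(
        1 for left, right in zip(fragment_origins, fragment_origins[1:])
        if left != right
    )
    return incorrect_junctions / len(fragment_origins)


# Five fragments with one haplotype switch, matching the slide's 1/5 example
example_rate = switching_rate(["COL", "COL", "COL", "CVI", "CVI"])
```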
  14. BIOLOGICAL F1: CLAVICORONA PYXIDATA ASSEMBLY

    - First sequenced coral fungus
    - Possible to explore the enzymatic wood-decay systems
    - Potential to uncover the factors related to mushroom formation
    - ~42 Mb genome size (1n)
    - Various orthogonal data sets (Illumina genome / transcripts) available
    [Bar charts: Assembly Size (Mbp) and N50 Size (Mbp) for "Clappy1" JGI contig; "Clappy1" JGI scaffold; CA/HGAP; Falcon Primary; Falcon Associate; Falcon-unzip Primary; Falcon-unzip Associate.]
    - Falcon Unzip: 55% of the genome resolved to haplotigs, with haplotig N50 = 1.07 Mb
    - “Clappy1”: Illumina-based genome assembly for Clavicorona pyxidata HHB10654, URL: http://genome.jgi.doe.gov/Clapy1/Clapy1.home.html
  15. ALLELE-SPECIFIC ALIGNMENT FOR FINAL CONSENSUS

    - “Augmented alignment”: each read has extra attributes (e.g., contig id, phasing block, haplotype phase); an aligner uses that information to place the read on a specific reference sequence or region.
    - Figure: reads from the same region but different haplotypes; one group is aligned to the “red” haplotig, the other to the “blue” haplotig
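One way to picture “augmented alignment” is as a lookup that picks the reference sequence for a read before any alignment is attempted. The sketch below is an assumption-laden illustration, not the aligner described in the talk: the `AugmentedRead` record, the `alignment_target` function, and the sequence names are all hypothetical.

```python
from typing import NamedTuple, Optional


class AugmentedRead(NamedTuple):
    """A read plus the extra attributes named on the slide (hypothetical record)."""
    name: str
    contig_id: str
    phasing_block: Optional[int]
    haplotype: Optional[int]


def alignment_target(read, haplotigs):
    """Pick the reference sequence a read should be aligned to.

    haplotigs: {(contig id, phasing block, haplotype): haplotig name}
    Reads with no matching haplotig fall back to their primary contig.
    """
    key = (read.contig_id, read.phasing_block, read.haplotype)
    return haplotigs.get(key, read.contig_id)


# Toy lookup: block 0 / haplotype 1 of "ctg1" has its own haplotig
haplotigs = {("ctg1", 0, 1): "ctg1_block0_h1"}
reads = [
    AugmentedRead("read_a", "ctg1", 0, 1),        # phased, has a haplotig
    AugmentedRead("read_b", "ctg1", 0, 0),        # phased, stays on the primary
    AugmentedRead("read_c", "ctg1", None, None),  # unphased
]
targets = {read.name: alignment_target(read, haplotigs) for read in reads}
```

Restricting each read to one target up front is what would let two reads from the same region land on different haplotigs instead of competing for the same sequence.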
  16. ACCURACY ASSESSMENT

    - Align Illumina 150 bp reads to the assembly contigs
    - 100% concordance interval = every base in the interval is covered by at least one 150 bp exact match
    - A higher percentage of the Falcon Unzip contigs falls in bigger full-concordance intervals
    - Compared to simulated data, most of the Falcon Unzip assembly is above QV50
    [Plot: inverted cumulative full-concordance length distribution, with reference curves labeled QV30, QV40, QV50 and QV60]
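The full-concordance intervals described above can be derived by merging the footprints of exactly matching reads. The sketch below assumes the exact-match start positions on one contig are already known (0-based, half-open intervals). Both function names are hypothetical, and the second function computes a single point of an inverted cumulative length distribution.

```python
def full_concordance_intervals(match_starts, read_len=150):
    """Merge exact-match footprints into maximal fully covered intervals.

    match_starts: 0-based start positions of exactly matching reads
    Returns a list of half-open (start, end) intervals.
    """
    merged = []
    for start in sorted(match_starts):
        end = start + read_len
        if merged and start <= merged[-1][1]:
            # overlaps or touches the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


def length_in_intervals_at_least(intervals, min_len):
    """Total bases in intervals of length >= min_len."""
    return sum(end - start for start, end in intervals if end - start >= min_len)


# Toy exact-match start positions (made up for illustration)
intervals = full_concordance_intervals([0, 100, 250, 600])
long_enough = length_in_intervals_at_least(intervals, 200)
```

Evaluating `length_in_intervals_at_least` over a range of `min_len` values gives the curve type shown in the plot.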
  17. DIFFERENTIALLY EXPRESSED TRANSCRIPTS NEAR STRUCTURAL VARIATIONS

    [Figure (Fritz Sedlazeck): Genome, CDS and Transcripts tracks for an A-haplotig; size labels 13.3 kb, 1.8 kb, 2.6 kb]
    - Allele-specific transcripts
    - Transcript spanning the SV breakpoints
  18. LARGE GENOME: DIPLOID HUMAN GENOME

    - 9.21 Mb haplotig assembly graph (3079 nodes, 3997 edges)
    - Total 70 haplotigs
    - Total size 14,918,026 bp
    - N50 size 483,236 bp
    - Example: phased SVs across the 150 kb HLA class II, HLA-DQA/B region
    - My personal bold prediction: in 3 to 5 years, we will regularly de novo construct many diploid human genomes to find missing secrets.
  19. SUMMARY

    - Single data type for routine diploid assembly
    - We need to keep developing evaluation frameworks to improve the performance
    - Large genomes are challenging, but it is mostly an engineering problem now:
      - Haplotype phasing improvement, incorporating 3rd-party phasing code
      - Develop a sequence aligner for “augmented alignment” for a faster Quiver consensus process
    - Want to attack the polyploid genome assembly problem? Let us help you!
  20. Thanks for your attention!

    For Research Use Only. Not for use in diagnostic procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.
    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.