PAGXXIV (2016) Bioinformatics Workshop

Jason Chin
January 12, 2016

Diploid Genome Assembly and Comprehensive Haplotype Sequence Reconstruction

This talk is a recent update on FALCON-Unzip, which assembles diploid genomes and generates haplotype-specific contigs.

Transcript

  1. For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.

    Diploid Genome Assembly and Comprehensive Haplotype Sequence Reconstruction

    Jason Chin, Paul Peluso, David Rank, Fritz Sedlazeck, Maria Nattestad, Michael Schatz, Greg Concepcion, Alicia Clum, Kerrie Barry, Alex Copeland, Ronan O’Malley
  2. Acknowledgments

    - All PacBio colleagues
    - Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk Institute)
    - Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute)
    - Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL)
    - Open source toolsets:
      - Daligner (https://dazzlerblog.wordpress.com), Gene Myers
      - BLASR (https://github.com/PacificBiosciences/blasr), Mark Chaisson
      - Python, NetworkX for rapid algorithm prototyping
      - Gephi, Graphviz for graph visualization
      - FALCON (https://github.com/PacificBiosciences/falcon)
  3. SOLVING THE DIPLOID ASSEMBLY PROBLEM

    - Falcon (a polyploid-aware assembler): generating the contigs through the bubbles
    - Falcon Unzip: identifying smaller variants and using them to separate the haplotypes
    • Bubbles = big variants between the haplotypes
    • Collapsed Path = smaller variants between the haplotypes
  4. WHY DO WE SEE BUBBLES?

    [Figure: genome sequences of Haplotype 1 and Haplotype 2, annotated with SNPs and SVs, shown above the corresponding assembly graph]
    In most OLC assembler designs, the overlapper does not catch differences at the SNP level, but structural variations are naturally segregated.
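To make the bubble picture concrete, here is a minimal Python sketch of a toy assembly graph. The node names and the `find_simple_bubbles` helper are hypothetical and are not part of FALCON. The sketch assumes a small acyclic graph and treats a bubble as a branch point whose unbranched paths all reconverge at the same node.

```python
# Toy assembly graph: a structural variant makes the two haplotypes follow
# different paths between "B" and "E" (a bubble). SNP-level differences are
# assumed to stay collapsed inside the shared nodes.
# Node names are hypothetical placeholders, not FALCON output.
toy_graph = {
    "A": ["B"],
    "B": ["C1", "C2"],  # branch point: two haplotype-specific paths
    "C1": ["D1"],
    "D1": ["E"],
    "C2": ["E"],
    "E": ["F"],
    "F": [],
}


def follow_linear_path(graph, start, in_degree):
    """Walk from `start` while the path is unbranched; return (path, end node)."""
    path = [start]
    node = start
    while len(graph[node]) == 1 and in_degree[graph[node][0]] == 1:
        node = graph[node][0]
        path.append(node)
    end = graph[node][0] if len(graph[node]) == 1 else None
    return path, end


def find_simple_bubbles(graph):
    """Return (source, sink, branches) for each branch point that reconverges."""
    in_degree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1

    bubbles = []
    for source, targets in graph.items():
        if len(targets) < 2:
            continue
        branches = [follow_linear_path(graph, t, in_degree) for t in targets]
        ends = {end for _, end in branches}
        if len(ends) == 1 and None not in ends:
            bubbles.append((source, ends.pop(), [path for path, _ in branches]))
    return bubbles


bubbles = find_simple_bubbles(toy_graph)
for source, sink, branches in bubbles:
    print(f"bubble from {source} to {sink}: {branches}")
```

In this toy graph the only bubble opens at `B` and closes at `E`, with one branch per haplotype.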
  5. THE FALCON UNZIP PROCESS

    [Figure: SNPs and SVs along the two haplotypes. FALCON: primary contig, with associate contig 1 (alternative allele) and associate contig 2 (alternative allele). The primary contig is augmented with the haplotype information of each read. FALCON-Unzip: updated primary contig + “associate haplotigs”.]
  6. PHASING READS INTO HAPLOTYPE GROUPS

    - Align SMRT reads to the initial primary contig
    - Identify het-SNPs
    - Phase het-SNPs
    - Group reads with phased SNPs
    - Reconstruct haplotypes (Haplotype 0, Haplotype 1)

    More het-SNPs in longer reads: an 8% to 15% sequence error rate is not an issue given enough long-read coverage for phasing.
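The grouping step can be illustrated with a short Python sketch. It assumes the het-SNPs are already phased and assigns each read by a simple majority vote over the SNPs it covers. The vote rule, the positions, and the read contents are illustrative assumptions, not the method used in FALCON-Unzip.

```python
from collections import Counter

# Hypothetical phased het-SNPs on the primary contig:
# position -> (allele on haplotype 0, allele on haplotype 1).
phased_snps = {
    120: ("A", "G"),
    480: ("C", "T"),
    910: ("G", "A"),
    1500: ("T", "C"),
}

# Hypothetical aligned reads: read id -> {position: observed base}.
reads = {
    "read_1": {120: "A", 480: "C", 910: "G"},
    "read_2": {480: "T", 910: "A", 1500: "C"},
    "read_3": {120: "A", 480: "T", 910: "G", 1500: "T"},  # one miscalled base at 480
    "read_4": {120: "C"},  # observed base matches neither allele
}


def assign_haplotype(observed, phased_snps):
    """Return 0 or 1 by majority vote over covered het-SNPs, or None on a tie."""
    votes = Counter()
    for position, base in observed.items():
        if position not in phased_snps:
            continue
        hap0_allele, hap1_allele = phased_snps[position]
        if base == hap0_allele:
            votes[0] += 1
        elif base == hap1_allele:
            votes[1] += 1
        # Bases matching neither allele are treated as errors and ignored.
    if votes[0] == votes[1]:
        return None
    return 0 if votes[0] > votes[1] else 1


groups = {read_id: assign_haplotype(obs, phased_snps) for read_id, obs in reads.items()}
print(groups)
```

A read that spans several het-SNPs, such as `read_3`, is still placed correctly despite a miscalled base, which is the intuition behind longer reads tolerating a high raw error rate.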
  7. QUESTION: HOW TO RESOLVE STRUCTURAL VARIATIONS & HET-SNPS PHASING AT ONCE

    Information sources: Structural Variations; het-SNP (scale labels on the slide: 3 kb – 100 kb, 300 b – 10 kb)

    Pros & Cons
    - Structural Variations: ✓ Overlap-layout process catches SV haplotypes; ✗ Collapsed paths when there is no SV
    - het-SNP: ✓ Easy to group SNPs/reads into different haplotypes; ✗ No phasing information associated with SVs

    Assembly graph features
    - Structural Variations: ✓ Nearby SVs may be phased automatically; ✗ Haplotype-fused paths
    - het-SNP: ✓ Haplotype-specific paths; ✗ More fragmented contigs
  8. MERGE HAPLOTYPE INFORMATION AND “UNZIP”

    [Figure: assembly graph with the tiling path of haplotype 0 and the tiling path of haplotype 1]
    Remove edges connecting different haplotypes
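The edge-removal idea can be sketched in a few lines of Python. The read names, haplotype labels, and helper functions below are hypothetical. The sketch assumes every graph node carries a haplotype label (0, 1, or None when unphased) and that only edges joining a 0 node to a 1 node are dropped.

```python
# Hypothetical haplotype labels for graph nodes (reads):
# 0 or 1 for phased reads, None for reads in unphased regions.
node_haplotype = {
    "r1": None, "r2": 0, "r3": 1, "r4": 0, "r5": 1, "r6": None,
}

# Hypothetical overlap edges before unzipping.
edges = [
    ("r1", "r2"), ("r1", "r3"),
    ("r2", "r4"), ("r3", "r5"),
    ("r2", "r5"),  # joins haplotype 0 to haplotype 1
    ("r3", "r4"),  # joins haplotype 1 to haplotype 0
    ("r4", "r6"), ("r5", "r6"),
]


def connects_different_haplotypes(u, v, labels):
    """True only when both ends are phased and their phases disagree."""
    hu, hv = labels[u], labels[v]
    return hu is not None and hv is not None and hu != hv


def tiling_path(edge_list, labels, start, haplotype):
    """Greedily follow edges through nodes that are unphased or match `haplotype`."""
    successors = {}
    for u, v in edge_list:
        successors.setdefault(u, []).append(v)
    path = [start]
    node = start
    while node in successors:
        candidates = [v for v in successors[node] if labels[v] in (haplotype, None)]
        if not candidates:
            break
        node = candidates[0]
        path.append(node)
    return path


unzipped = [
    (u, v) for u, v in edges
    if not connects_different_haplotypes(u, v, node_haplotype)
]
path_h0 = tiling_path(unzipped, node_haplotype, "r1", 0)
path_h1 = tiling_path(unzipped, node_haplotype, "r1", 1)
print(path_h0, path_h1)
```

After the two cross-haplotype edges are removed, the graph separates into one path per haplotype that share only the unphased end nodes.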
  9. PUT EVERYTHING TOGETHER: “Falcon Unzip Process”

    - Add missing haplotype-specific nodes & edges
    - Remove edges that connect different haplotypes

    [Figure: unzipped graph, ~ 4.80 Mb, with 4 major haplotype-phased blocks determined by het-SNPs and an un-phased region]
    The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs.
  10. POLISHING: ALLELE-SPECIFIC ALIGNMENT FOR FINAL CONSENSUS

    “Augmented alignment”: each read has extra attributes (e.g., contig identifier, phasing block, haplotype phase); an aligner uses that information to place the read on a specific reference sequence or region.

    [Figure labels: Align the “red” haplotig; Align the “blue” haplotig; Reads from the same region but different haplotypes]
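One way to picture an augmented alignment is as a read record that carries its phasing attributes plus a lookup that picks the reference to align against. The sketch below is only an illustration under that assumption. The class, field names, and sequence names are hypothetical and do not describe the actual FALCON-Unzip data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AugmentedRead:
    """A read plus the extra attributes named on the slide (hypothetical layout)."""
    read_id: str
    contig_id: str
    phasing_block: Optional[int]  # None when the read lies in an unphased region
    haplotype: Optional[int]      # 0 or 1, None when unphased


# Hypothetical lookup: (contig, phasing block, haplotype) -> reference sequence name.
reference_for_phase = {
    ("contig_A", 1, 0): "contig_A",
    ("contig_A", 1, 1): "contig_A_haplotig_1",
}


def choose_reference(read, reference_for_phase):
    """Pick the sequence to align against; unphased reads go to their contig."""
    if read.phasing_block is None or read.haplotype is None:
        return read.contig_id
    key = (read.contig_id, read.phasing_block, read.haplotype)
    return reference_for_phase.get(key, read.contig_id)


reads = [
    AugmentedRead("read_1", "contig_A", 1, 0),
    AugmentedRead("read_2", "contig_A", 1, 1),
    AugmentedRead("read_3", "contig_A", None, None),
]
targets = {read.read_id: choose_reference(read, reference_for_phase) for read in reads}
print(targets)
```

Restricting each read to a single target is also what would let two reads from the same region but different haplotypes end up on different sequences.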
  11. CONSTRUCT ARABIDOPSIS THALIANA COL-0 X CVI-0 DIPLOID F1 LINE

    [Figure: Col-0, Cvi-0, and Col-0 x Cvi-0. Image credits: Pajoro, et al, Trends in plant science 21.1 (2016): 6-8.]
    • Two inbred lines sequenced in 2013 (P4 chemistry), assembled as haploid genomes
    • F1 line constructed and sequenced in 2015 (P6 chemistry), assembled with FALCON and FALCON-Unzip
  12. DIPLOID ASSEMBLY PRIMARY CONTIGS AND HAPLOTIGS

    [Figure: Col-0 x Cvi-0 assembly, showing the Col-0 chromosome, the Cvi-0 chromosome, the primary contig, and the haplotigs]
    • Primary contigs ~ 1n representation of the genome
    • Haplotigs ~ phased sequences from where the homologous chromosomes are distinguishable
  13. ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY STATISTICS

    Strain                 Inbred Col-0   Inbred Cvi-0   Col-0 x Cvi-0 F1   Col-0 x Cvi-0 F1   Col-0 x Cvi-0 F1
    Assembler              CA/HGAP        CA/HGAP        FALCON             FALCON-Unzip       FALCON-Unzip
    Contig set             -              -              primary contigs    primary contigs    haplotigs
    Assembly Size (Mb)     126            119            143                140                105
    # contigs              1325           194            426                172                248
    N50 size (Mb)          6.21           4.79           7.92               7.96               6.92
    Max Contig size (Mb)   10.25          11.25          13.39              13.32              11.65

    [Bar chart, Assembly Size (Mb): Inbred Col-0 126; Inbred Cvi-0 119; F1 FALCON p-contigs 143; F1 FALCON a-contigs 57; F1 Unzip p-contigs 140; F1 Unzip haplotigs 105]
    [Bar chart, N50 size (Mb): Inbred Col-0 6.21; Inbred Cvi-0 4.79; F1 FALCON p-contigs 7.92; F1 FALCON a-contigs 0.146; F1 Unzip p-contigs 7.96; F1 Unzip haplotigs 6.92]
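The slide reports N50 without defining it. Under the common definition (the largest length L such that contigs of length at least L cover half or more of the assembly), it can be computed as sketched below. The contig lengths in the example are made up and are unrelated to the Arabidopsis data.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (common definition)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Accumulate contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (arbitrary units), for illustration only.
toy_lengths = [8, 6, 4, 2, 1, 1]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

In the toy list the total is 22, and the two longest contigs (8 + 6 = 14) are the first to cover half of it, so the N50 is 6.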
  14. EVALUATE THE DIPLOID ASSEMBLY RESULT

    [Figure: the Col-0 x Cvi-0 assembly (primary contig, haplotigs) aligned to the Col-0 assembly and the Cvi-0 assembly, i.e. the haploid-like contigs in the inbred-line assemblies; the alignments are labeled “Many variations” or “Few or no variations”]
    By aligning the haplotigs to the parental genome assemblies, we can evaluate the haplotigs’ quality, e.g. haplotyping accuracy and CDS prediction consistency.
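The “many variations” versus “few or no variations” contrast suggests a simple way to assign a haplotig to a parent: count the variants called against each parental assembly and pick the parent with far fewer. The sketch below illustrates that idea only. The counts, the `max_fraction` threshold, and the function name are hypothetical and are not taken from the talk.

```python
# Hypothetical variant counts (SNPs + SVs) per haplotig against each parent.
variant_counts = {
    "haplotig_1": {"Col-0": 3, "Cvi-0": 412},
    "haplotig_2": {"Col-0": 530, "Cvi-0": 1},
    "haplotig_3": {"Col-0": 95, "Cvi-0": 110},  # ambiguous, e.g. a phase switch
}


def assign_parent(counts, max_fraction=0.1):
    """Return the parent holding at most `max_fraction` of all variants, else None.

    `max_fraction` is an arbitrary illustrative threshold.
    """
    (best, best_n), (_, other_n) = sorted(counts.items(), key=lambda item: item[1])
    total = best_n + other_n
    if total == 0:
        return None  # identical to both parents: no evidence either way
    return best if best_n / total <= max_fraction else None


assignments = {name: assign_parent(counts) for name, counts in variant_counts.items()}
print(assignments)
```

A haplotig that differs substantially from both parents, like `haplotig_3` here, is left unassigned, which would flag it for a closer look at haplotyping accuracy.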
  15. COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES

    - We call SNPs and SVs against the parental inbred assemblies for all primary contigs and haplotigs.
    - Most haplotigs can be fully assigned to one of the parental haplotypes.
  16. COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES

    - We call SNPs and SVs against the parental inbred assemblies for all primary contigs and haplotigs.
    - Most haplotigs can be fully assigned to one of the parental haplotypes.

    [Figure: primary contigs and haplotigs, each compared against Col-0 and Cvi-0]
  17. ANNOTATION COMPARISON

    [Figure panels: Predicted Coding Sequences; TAIR 10 Genome & Predicted Transcripts; Homopolymer Length Distributions]
    Compare de novo gene prediction (with AUGUSTUS (Stanke 2003)) between different assemblies

    Assemblies                    TAIR 10          Col-0            Cvi-0
    Number of predicted CDS       27,946           30,006           27,393

    100% indel-free full-length overlaps:
    Col-0 (30,006 CDS)            25,966 (92.9%)
    Col-0 x Cvi-0 (56,775 CDS)    25,865 (92.5%)   26,537 (88.4%)   27,370 (99.9%)
  18. OTHER SMALLER AND LARGER DIPLOID GENOMES

                           Clavicorona pyxidata (Coral Fungus)   Cabernet Sauvignon+*   Human*
    Haploid Genome Size    ~ 44 Mb                               ~ 500 Mb               ~ 3 Gb

    FALCON-Unzip Results:
    Primary contig size    41.9 Mb                               591.0 Mb               2.76 Gb
    Primary contig N50     1.5 Mb                                2.2 Mb                 22.9 Mb
    Haplotig size          25.5 Mb                               372.2 Mb               2.0 Gb
    Haplotig N50           872 kb                                767 kb                 330 kb

    + Led by Cantu lab, UC Davis and Cramer lab, UN Reno
    * Preliminary results. A fast file system and efficient computational infrastructure are currently needed for large genomes.
  19. SUMMARY

    - Single data type for routine diploid assembly
    - Large genomes are more computationally challenging, but it is mostly an engineering problem now:
    - Haplotype phasing improvement, incorporate 3rd-party phasing code
    - Develop a sequence aligner for “augmented alignment” for a faster Quiver consensus process
    - FALCON-Unzip code: (No code, No truth!!) If you would like to hack on it for now, email me (jchin@pacb.com)
    - Want to attack the algorithm problem for polyploid assembly? Let us help you!

    Thanks for your attention!
  20. For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com