Scaffolding of PacBio assemblies with Hi-C data

ghuryejay
August 16, 2016

This is the work I did at Pacific Biosciences during the summer of 2016. I also presented it at the Research in Progress meeting at CBCB, UMD.

Transcript

  1. PACIFIC BIOSCIENCES® CONFIDENTIAL. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.

    Scaffolding of long-read assemblies using long-range contact information
    Jay Ghurye and Jason Chin
  2. SCAFFOLDING OVERVIEW

    - Find the relative orientation and ordering of pre-assembled contigs
    - Usually done using data not used during assembly
    - Proven to be an NP-hard problem
    - Greedy heuristics are used
    - Used Hi-C data to scaffold a long-read assembly
  3. CURRENT APPROACHES

    - LACHESIS (Burton et al., Nature Biotech): most widely used
      - Needs the number of chromosomes to be pre-specified
      - Super hard to get running from their code on GitHub (still haven't figured it out)
      - Super slow clustering process due to hierarchical clustering
    - DNA-Triangulation (Kaplan et al., Nature Biotech)
      - Paper only shows results for chromosomes 4 and 14
      - Does not orient contigs in clusters
      - Assumes all contigs have the same length
    - Neither tool removes mis-assemblies from the initial assembly
    - Hi-C data was used for scaffolding in the recent Goat Genome project
  4. METHOD

    Pipeline (flow diagram): Assembly + Hi-C paired-end reads -> Map Hi-C reads to contigs using BWA -> Filter low-quality alignments (QV < 30) -> Final alignments
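The filtering step could look roughly like the sketch below. It assumes that "QV" on the slide refers to the MAPQ column (field 5) of SAM-formatted BWA output; the function name `filter_sam_lines` and the per-alignment (rather than per-pair) filtering are illustrative choices, not taken from the slides.

```python
def filter_sam_lines(lines, min_mapq=30):
    """Keep SAM header lines and alignments whose MAPQ is at least min_mapq.

    Assumption: the slide's "QV < 30" filter means MAPQ < 30.
    """
    kept = []
    for line in lines:
        if line.startswith("@"):
            # Header lines carry no alignment and are passed through.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # SAM column 5 is MAPQ
        if mapq >= min_mapq:
            kept.append(line)
    return kept
```

A stricter variant would drop a read pair whenever either mate fails the threshold; the slides do not say which behaviour was used.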
  5. DETECTION OF MIS-ASSEMBLIES

    [Figure: physical coverage of Hi-C read pairs along a contig. Legend: Hi-C read pair, contig, physical coverage of read pair. Panels: Chr 1, Chr 10.]
  6. DETECTION OF MIS-ASSEMBLIES

    - Need to calculate per-base physical coverage
    - Linear-time algorithm: runs in O(M + N), where M is the length of the contig and N is the number of read pairs
    - If coverage falls below a threshold, break the contig at that position
    - Calculated using a variation of the maximum-sum subarray problem
    - For a set C of contigs, the process takes O(|C|(M + N)) time
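One way to get per-base physical coverage in O(M + N) is a difference array followed by a running sum. This is a swapped-in technique for illustration and not necessarily the maximum-sum-subarray variant the slide refers to; the half-open `(start, end)` span convention and both function names are assumptions.

```python
def physical_coverage(contig_length, pair_spans):
    """Per-base count of read pairs spanning each position.

    pair_spans: iterable of (start, end) with 0 <= start < end <= contig_length,
    covering positions start .. end - 1 (half-open; an assumed convention).
    """
    diff = [0] * (contig_length + 1)
    for start, end in pair_spans:   # O(N): one +1 and one -1 per pair
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    running = 0
    for i in range(contig_length):  # O(M): prefix sum over the contig
        running += diff[i]
        coverage.append(running)
    return coverage


def low_coverage_positions(coverage, threshold):
    """Positions whose coverage is below threshold (candidate break points)."""
    return [i for i, c in enumerate(coverage) if c < threshold]
```

In practice the first and last bases of a contig will usually look under-covered, so a real implementation would likely ignore a margin at each end before breaking.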
  7. EDGE WEIGHT SCORING: METHOD 1

    - Linear relationship between log(distance) and the number of mate pairs present
    - For long contigs (> 10 Mb), fit a linear model to get the expected number of links shared for a certain genomic distance
    - For a pair of contigs, find the expected number of links (E) using the linear model for all orientations
    - Let A be the actual number of links shared
    - Score = 1/(|E - A| + 1)
    - Removes length bias
    - [Figure: Contig 1 and Contig 2, each with a B side and an E side]
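A minimal sketch of this scoring, assuming an ordinary least-squares fit of link count against log(distance). The function names are hypothetical, and how the distances and counts are sampled from the long contigs is not specified on the slide.

```python
import math


def fit_log_linear(distances, link_counts):
    """Least-squares fit of link_count = slope * log(distance) + intercept."""
    xs = [math.log(d) for d in distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(link_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, link_counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_links(distance, slope, intercept):
    """Expected number of links (E) at a given genomic distance."""
    return slope * math.log(distance) + intercept


def edge_score(expected, actual):
    """Score = 1 / (|E - A| + 1), as given on the slide."""
    return 1.0 / (abs(expected - actual) + 1.0)
```

The score is 1 when the observed link count matches the model exactly and decays toward 0 as the two diverge.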
  8. EDGE WEIGHT SCORING: METHOD 2

    - Normalizing by the number of sites where the restriction enzyme cuts can also reduce length bias
    - Find the number of cut sites for each contig
    - Only consider pairs aligned to the 'ends' of contigs
    - Score for a particular orientation = # mate pairs / # restriction sites
    - [Figure: two contigs with B and E ends]
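A rough sketch of the normalization. The slides do not name the restriction enzyme, so the `AAGCTT` recognition site below is only a placeholder, and how cut sites from the two contig ends are combined into the denominator is left to the caller; both function names are hypothetical.

```python
def count_cut_sites(sequence, site="AAGCTT"):
    """Count (possibly overlapping) occurrences of a recognition site.

    The default site is a placeholder assumption, not from the slides.
    """
    seq = sequence.upper()
    count = 0
    start = seq.find(site)
    while start != -1:
        count += 1
        start = seq.find(site, start + 1)
    return count


def restriction_site_score(num_mate_pairs, num_restriction_sites):
    """Score = # mate pairs / # restriction sites; 0.0 if there are no sites."""
    if num_restriction_sites == 0:
        return 0.0
    return num_mate_pairs / num_restriction_sites
```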
  9. SCAFFOLD CONSTRUCTION ALGORITHM

    - Sort all the links by score
    - For each edge in the sorted list
      - Add that edge to the graph if neither of its nodes is already present in the graph
    - Add edges between the ':B' and ':E' nodes of the same contig
    - Cycles are broken at the lowest-cost edge of the cycle
    - Automatically eliminates forks by picking higher-weight edges
    - Example nodes: Ctg1:B, Ctg1:E, Ctg2:B, Ctg2:E, Ctg3:B, Ctg3:E
    - Example links:

      Node 1   Node 2   Score
      Ctg1:B   Ctg2:E   0.6
      Ctg1:B   Ctg3:B   0.4
      Ctg2:B   Ctg3:B   0.3
      Ctg1:E   Ctg2:B   0.08
      Ctg2:E   Ctg3:B   0.07
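The greedy step can be sketched as follows, using the example links from the slide. The function name `build_scaffold_graph` and the adjacency-set representation are illustrative, and cycle removal is deliberately left out of this sketch.

```python
def build_scaffold_graph(links):
    """Greedy edge selection over contig-end nodes such as 'Ctg1:B'.

    links: iterable of (node_a, node_b, score). Returns {node: set(neighbors)}.
    Cycle breaking is not handled here.
    """
    graph = {}
    used = set()
    contigs = set()

    def add_edge(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for a, b, _score in sorted(links, key=lambda link: link[2], reverse=True):
        contigs.update((a.split(":")[0], b.split(":")[0]))
        if a in used or b in used:
            continue  # a higher-scoring edge already claimed this contig end
        used.update((a, b))
        add_edge(a, b)

    # Join the two ends of every contig so paths can pass through it.
    for contig in contigs:
        add_edge(contig + ":B", contig + ":E")
    return graph


example_links = [
    ("Ctg1:B", "Ctg2:E", 0.6),
    ("Ctg1:B", "Ctg3:B", 0.4),
    ("Ctg2:B", "Ctg3:B", 0.3),
    ("Ctg1:E", "Ctg2:B", 0.08),
    ("Ctg2:E", "Ctg3:B", 0.07),
]
example_graph = build_scaffold_graph(example_links)
```

On the example, only the 0.6 and 0.3 links are accepted, which leaves the single path Ctg1:E, Ctg1:B, Ctg2:E, Ctg2:B, Ctg3:B, Ctg3:E.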
  10. SCAFFOLD CONSTRUCTION ALGORITHM

    - For each connected component in the graph
      - Find the nodes with degree 1; there will always be 2 such nodes
      - Find a path between these 2 nodes
      - This is a 'backbone' scaffold
    - Assign a backbone scaffold to all the contigs that are not in any seed scaffold, using the score
    - For each unplaced contig
      - Insert it at the place in the seed scaffold that maximizes the total score for that path (sum of all edges in the path)
    - Output the expanded seed scaffolds as final scaffolds
    - Example: [Ctg1:B, Ctg1:E, Ctg2:B, Ctg2:E, Ctg3:E, Ctg3:B, ...] gives orientations Forward, Forward, Reverse
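The backbone walk and the orientation read-off could be sketched like this for a single connected component that is already a simple path. All helper names are hypothetical, and insertion of unplaced contigs is not covered.

```python
def graph_from_edges(edges):
    """Build {node: set(neighbors)} from undirected (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def walk_path(graph):
    """Walk from one degree-1 node to the other in a simple-path component."""
    ends = sorted(node for node, nbrs in graph.items() if len(nbrs) == 1)
    if len(ends) != 2:
        raise ValueError("expected exactly two degree-1 nodes")
    path = [ends[0]]
    previous, current = None, ends[0]
    while current != ends[1]:
        next_node = [n for n in graph[current] if n != previous][0]
        previous, current = current, next_node
        path.append(current)
    return path


def orientations(path):
    """Entering a contig at ':B' means Forward; entering at ':E' means Reverse."""
    result = []
    for first in path[0::2]:  # nodes come in (entry, exit) pairs per contig
        contig, side = first.split(":")
        result.append((contig, "Forward" if side == "B" else "Reverse"))
    return result
```

For the slide's example path, `orientations` returns Ctg1 Forward, Ctg2 Forward, Ctg3 Reverse.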
  11. NA19240 DATASET

    - Used the assembly done at PacBio (Jason Chin)
    - Number of contigs = 3,242
    - Contig N50 = 23.98 Mb
    - Number of Hi-C pairs used: approximately 11.5 million
  12. SCAFFOLD STATISTICS

    Feature                                             Value
    # Scaffolds                                         118
    N50                                                 69.49 Mb
    Number of bases                                     2789088362 (2.78 Gb)
    # chr arms (p and q) covered by a single scaffold   26
    # chr arms covered by 2 scaffolds                   14
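For reference, N50 as reported in these statistics is conventionally the length L such that sequences of length L or longer contain at least half of all bases. A minimal sketch (the function name is illustrative):

```python
def n50(lengths):
    """Smallest length L such that sequences >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer comparison avoids float rounding
            return length
    raise ValueError("lengths must be non-empty")
```

For example, lengths 2, 2, 2, 3, 3, 4, 8, 8 sum to 32, and the two longest sequences already cover 16 bases, so the N50 is 8.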
  13. DOT PLOTS

    [Figure: dot plots for Chr 1 to 7, Chr 8 to 14, Chr 15 to 21, and Chr 22 & X. X axis: true position on chromosome. Y axis: derived scaffold.]
  14. NA12878 DATASET

    - PacBio assembly done by the Icahn School of Medicine at Mt. Sinai
      - Contig N50 = 1.55 Mb
      - Number of contigs = 21,235
      - Number of scaffolds = 18,903
      - Scaffold N50 = 26.83 Mb
      - Scaffolding was done by them using BioNano data
    - We used Hi-C reads (725 million read pairs) along with their assembly
      - Number of scaffolds = 1,555
      - Scaffold N50 = 80 Mb
    - Probably really good Hi-C data?
  15. DOT PLOTS

    [Figure: dot plots for Chr 1 to 6, Chr 7 to 12, Chr 13 to 18, and Chr 19 to X. X axis: true position on chromosome. Y axis: derived scaffold.]
  16. COMPARISON WITH LACHESIS RESULTS

    - Figures taken from the paper, since we could not run their tool
    - [Figure panels: Chromosome 17, Chromosome 7]
  17. NEXT?

    - Scaffold the NA12878 assembly with N50 = 13 Mb
    - With 726 million pairs, got bad scaffolds and a lot of mis-joins
    - After down-sampling to 12 million pairs, got much better scaffolds
    - With a higher N50, down-sampling the Hi-C data improves scaffolds (why?)
    - Possibly a nice information-theoretic argument: longer contigs -> less information for scaffolding?
    - Haplotype-specific scaffolding
  18. SCORING OF LINKS

    - Just considered the reads that map to only 2 distinct contigs
    - Less informative than [figure]
    - Let M and N be the number of chunks mapped to the 2 different contigs
    - When chunks are skewed, M*N will be small
    - Case 1: M*N = 6
    - Case 2: M*N = 12
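To illustrate why a skewed split gives a small product, the sketch below enumerates every way to divide a fixed number of chunks between two contigs. The slide does not give the actual M and N behind its two cases; a total of 7 chunks is a hypothetical choice that happens to reproduce both values (1*6 = 6 and 3*4 = 12).

```python
def split_products(total_chunks):
    """Map each split (m, n) with m <= n and m + n == total_chunks to m * n."""
    return {
        (m, total_chunks - m): m * (total_chunks - m)
        for m in range(1, total_chunks // 2 + 1)
    }
```

With 7 chunks the splits (1, 6), (2, 5) and (3, 4) score 6, 10 and 12, so the most balanced split scores highest.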
  19. NEXT?

    - GIAB NA24385 PacBio assembly with N50 = 4.5 Mb
    - Assembly done at PacBio with N50 ~ 13 Mb
    - Cergentis data generated at PacBio
    - Still in exploration; this type of data is not well studied yet
  20. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com