
Scaffolding of PacBio assemblies with Hi-C data

ghuryejay
August 16, 2016

This is the work I did at Pacific Biosciences during the summer of 2016. I also presented it at the Research in Progress meeting at CBCB, UMD.


Transcript

  1. Scaffolding of long read assemblies using long range contact information

    Jay Ghurye and Jason Chin
    PACIFIC BIOSCIENCES® CONFIDENTIAL. For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.
  2. SCAFFOLDING OVERVIEW

    - Finding the relative orientation and ordering of pre-assembled contigs
    - Usually done using data not used during assembly
    - Proven to be an NP-hard problem
    - Greedy heuristics are used
    - Used Hi-C data to scaffold a long-read assembly
  3. CURRENT APPROACHES

    - LACHESIS (Burton et al., Nature Biotech), the most widely used
      - Needs the number of chromosomes to be pre-specified
      - Very hard to get running from their code on GitHub (still haven't figured it out)
      - Very slow clustering step due to hierarchical clustering
    - DNA-Triangulation (Kaplan et al., Nature Biotech)
      - Paper only shows results for chromosomes 4 and 14
      - Does not orient contigs in clusters
      - Assumes all contigs have the same length
    - Neither tool removes mis-assemblies from the initial assembly
    - Hi-C data was used for scaffolding in the recent Goat Genome project
  4. METHOD

    [Flow diagram: Assembly + Hi-C paired-end reads -> Map Hi-C reads to contigs using BWA -> Filter low-quality alignments (QV < 30) -> Final alignments]
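The filtering step can be sketched roughly as below. This is a minimal illustration, not the pipeline used in the talk: it assumes the "QV" on the slide refers to the SAM mapping quality (MAPQ, column 5), that alignments are available as SAM text lines, and the function and helper names (`filter_sam_lines`, `make_record`) are hypothetical.

```python
def filter_sam_lines(lines, min_mapq=30):
    """Keep SAM header lines and mapped alignments with MAPQ >= min_mapq.

    Assumption: the slide's "QV < 30" cut-off is applied to SAM MAPQ.
    """
    kept = []
    for line in lines:
        if line.startswith("@"):          # header line, always kept
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # SAM column 2: bitwise FLAG
        mapq = int(fields[4])             # SAM column 5: MAPQ
        if flag & 0x4:                    # 0x4 = read is unmapped
            continue
        if mapq >= min_mapq:
            kept.append(line)
    return kept


def make_record(name, flag, contig, pos, mapq):
    """Build a toy SAM record (hypothetical data, for illustration only)."""
    return "\t".join([name, str(flag), contig, str(pos), str(mapq),
                      "50M", "*", "0", "0", "*", "*"])


records = [
    "@SQ\tSN:ctg1\tLN:1000",
    make_record("pair1", 65, "ctg1", 100, 60),   # kept
    make_record("pair2", 65, "ctg1", 200, 12),   # dropped: MAPQ below 30
    make_record("pair3", 129, "ctg2", 300, 30),  # kept: MAPQ exactly 30
    make_record("pair4", 69, "*", 0, 0),         # dropped: unmapped
]
kept = filter_sam_lines(records)
kept_names = [r.split("\t")[0] for r in kept if not r.startswith("@")]
```

In practice the same cut-off is usually applied with an existing alignment toolkit rather than hand-written parsing; the sketch only makes the filter criterion explicit.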
  5. DETECTION OF MIS-ASSEMBLIES

    [Figure: Hi-C read pairs aligned to a contig, with the physical coverage of the read pairs. Panels: Chr 1, Chr 10]
  6. DETECTION OF MIS-ASSEMBLIES

    • Need to calculate per-base physical coverage
    • Linear-time algorithm: runs in O(M + N), where M is the length of the contig and N is the number of read pairs
    • If coverage falls below a threshold, break the contig at that position
    • Calculated using a variation of the maximum sum subarray problem
    • For C contigs, the process takes O(|C|(M + N)) time
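One way to get per-base physical coverage in O(M + N) is a difference array followed by a prefix sum. The slide only says a variation of the maximum sum subarray problem was used and gives no details, so the sketch below swaps in the difference-array technique as a stand-in; the function names, the half-open interval convention, and the `margin` parameter are assumptions made for illustration.

```python
from typing import List, Tuple


def physical_coverage(contig_len: int, pairs: List[Tuple[int, int]]) -> List[int]:
    """Per-base physical coverage of a contig.

    Each pair is a half-open interval [start, end) spanning from the
    leftmost base of one mate to the rightmost base of the other.
    Runs in O(M + N): one pass over the pairs, one pass over the contig.
    """
    diff = [0] * (contig_len + 1)
    for start, end in pairs:
        diff[start] += 1      # coverage rises where the pair begins
        diff[end] -= 1        # and falls just past where it ends
    coverage = []
    running = 0
    for i in range(contig_len):
        running += diff[i]
        coverage.append(running)
    return coverage


def low_coverage_positions(coverage: List[int], threshold: int, margin: int = 0) -> List[int]:
    """Positions with coverage below threshold, ignoring `margin` bases at each end."""
    return [i for i in range(margin, len(coverage) - margin) if coverage[i] < threshold]


# Toy contig of 10 bases with four read pairs (hypothetical data).
cov = physical_coverage(10, [(0, 6), (1, 5), (6, 10), (7, 10)])
candidates = low_coverage_positions(cov, threshold=2, margin=1)
```

The `margin` argument reflects that coverage naturally drops near contig ends, so candidate break points there are ignored; the talk does not state how contig ends were treated.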
  7. EDGE WEIGHT SCORING – METHOD 1

    - Linear relationship between log(distance) and the number of mate pairs present
    - For long contigs (> 10 Mb), fit a linear model to get the expected number of links shared at a given genomic distance
    - For a pair of contigs, find the expected number of links (E) using the linear model, for all orientations
    - Let A be the actual number of links shared
    - Score = 1/(|E - A| + 1)
    - Removes length bias

    [Figure: Contig 1 and Contig 2, each with a B side and an E side]
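A rough sketch of this scoring, taking the slide literally: the number of links is modelled as linear in log(distance), fitted by ordinary least squares, and the score is 1/(|E - A| + 1). The function names and the toy observations are hypothetical, and how the distance between two contigs is defined for each orientation is not stated in the slides, so it is left as an input here.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_links(distance, slope, intercept):
    """Expected number of links E at a genomic distance, from the fitted model."""
    return slope * math.log(distance) + intercept


def edge_score(expected, actual):
    """Score = 1 / (|E - A| + 1); equals 1.0 when observed matches expected."""
    return 1.0 / (abs(expected - actual) + 1.0)


# Made-up (distance, link count) observations from within long contigs.
distances = [1e4, 1e5, 1e6]
xs = [math.log(d) for d in distances]
ys = [40 - 2 * x for x in xs]          # perfectly linear toy data
slope, intercept = fit_line(xs, ys)

# Score one candidate orientation of a contig pair (hypothetical numbers).
e = expected_links(50_000, slope, intercept)
score = edge_score(e, actual=20)
```

The orientation whose observed link count is closest to the model's expectation gets the highest score, which is what makes the score independent of raw contig length.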
  8. EDGE WEIGHT SCORING – METHOD 2

    - Using the number of sites where the restriction enzyme cuts can also reduce length bias
    - Find the number of cut sites for each contig
    - Consider only pairs aligned to the 'ends' of contigs
    - Score for a particular orientation = # mate pairs / # restriction sites

    [Figure: two contigs with their B and E ends]
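The normalization above can be sketched as follows. The slides do not name the restriction enzyme, the size of the contig 'end' window, or exactly which sites go into the denominator, so the HindIII recognition sequence `AAGCTT`, the `window` parameter, and summing the sites of the two end regions are all assumptions for illustration; the function names are hypothetical.

```python
def count_sites(seq, site="AAGCTT"):
    """Count occurrences of a restriction site in a sequence (case-insensitive)."""
    seq = seq.upper()
    count = 0
    pos = seq.find(site)
    while pos != -1:
        count += 1
        pos = seq.find(site, pos + 1)
    return count


def end_region(seq, side, window):
    """Return the first (side 'B') or last (side 'E') `window` bases of a contig."""
    return seq[:window] if side == "B" else seq[-window:]


def site_normalized_score(n_pairs, n_sites):
    """Score = # mate pairs / # restriction sites (0.0 if there are no sites)."""
    return n_pairs / n_sites if n_sites > 0 else 0.0


# Toy contigs (hypothetical sequences) scored for the orientation ctg1:E - ctg2:B.
ctg1 = "CCCCAAGCTTGGGGAAGCTT"
ctg2 = "AAGCTTTTTTCCCCGGGG"
sites = (count_sites(end_region(ctg1, "E", 10))
         + count_sites(end_region(ctg2, "B", 10)))
score = site_normalized_score(n_pairs=6, n_sites=sites)
```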
  9. SCAFFOLD CONSTRUCTION ALGORITHM

    - Sort all the links by score
    - For each edge in the sorted list
      - Add that edge to the graph if neither of its nodes is already present in the graph
    - Add edges between the ':B' and ':E' nodes of the same contig
    - Cycles are removed using the lowest-cost edge of the cycle
    - Automatically eliminates forks by picking higher-weight edges

    Example nodes: CTG1:B CTG1:E CTG2:B CTG2:E CTG3:B CTG3:E

    Example links (score):
    - Ctg1:B Ctg2:E 0.6
    - Ctg1:B Ctg3:B 0.4
    - Ctg2:B Ctg3:B 0.3
    - Ctg1:E Ctg2:B 0.08
    - Ctg2:E Ctg3:B 0.07
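The greedy edge selection can be sketched as below, using the five example links from the slide. This is only an illustration of the selection rule (highest score first, skip any link that touches an already used contig end); cycle removal is not covered, and the function name is hypothetical.

```python
def greedy_links(links):
    """Pick links in decreasing score order, using each contig end at most once."""
    used = set()
    chosen = []
    for a, b, score in sorted(links, key=lambda link: link[2], reverse=True):
        if a in used or b in used:
            continue              # one end is already joined: this would create a fork
        chosen.append((a, b, score))
        used.update((a, b))
    return chosen


links = [
    ("Ctg1:B", "Ctg2:E", 0.6),
    ("Ctg1:B", "Ctg3:B", 0.4),
    ("Ctg2:B", "Ctg3:B", 0.3),
    ("Ctg1:E", "Ctg2:B", 0.08),
    ("Ctg2:E", "Ctg3:B", 0.07),
]
chosen = greedy_links(links)
```

On this input the 0.6 and 0.3 links are kept and the other three are skipped because they reuse Ctg1:B, Ctg2:B, or Ctg3:B. Together with the B-E edges inside each contig, that gives the chain Ctg1:E, Ctg1:B, Ctg2:E, Ctg2:B, Ctg3:B, Ctg3:E.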
  10. SCAFFOLD CONSTRUCTION ALGORITHM

    - For each connected component in the graph
      - Find the nodes with degree 1; there will always be 2 such nodes
      - Find the path between these 2 nodes
      - This is a 'backbone' scaffold
    - Assign a backbone scaffold to each contig that is not in any seed scaffold, using the score
    - For each unplaced contig
      - Insert it at the position in the seed scaffold that maximizes the total score for that path (sum of all edges in the path)
    - Output the expanded seed scaffolds as the final scaffolds

    Example path: [Ctg1:B, Ctg1:E, Ctg2:B, Ctg2:E, Ctg3:E, Ctg3:B, ...] gives Ctg1 forward, Ctg2 forward, Ctg3 reverse
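Extracting backbone paths and reading off contig orientations can be sketched as follows. The sketch assumes cycles have already been removed, so every connected component is a simple path and every node has degree at most 2; it does not cover the insertion of unplaced contigs. Function names are hypothetical.

```python
from collections import defaultdict


def backbone_paths(contigs, links):
    """Walk each connected component from one degree-1 node to the other."""
    adj = defaultdict(set)
    for c in contigs:                       # edge between the two ends of a contig
        adj[f"{c}:B"].add(f"{c}:E")
        adj[f"{c}:E"].add(f"{c}:B")
    for a, b in links:                      # selected inter-contig links
        adj[a].add(b)
        adj[b].add(a)

    visited = set()
    paths = []
    for start in sorted(adj):
        if start in visited or len(adj[start]) != 1:
            continue                        # only start from an unvisited path end
        path = [start]
        visited.add(start)
        node = start
        while True:
            unvisited = [n for n in adj[node] if n not in visited]
            if not unvisited:
                break
            node = unvisited[0]
            path.append(node)
            visited.add(node)
        paths.append(path)
    return paths


def orientations(path):
    """A contig entered at its B end is forward, at its E end is reverse."""
    result = []
    for i in range(0, len(path), 2):
        name, side = path[i].rsplit(":", 1)
        result.append((name, "forward" if side == "B" else "reverse"))
    return result


paths = backbone_paths(["Ctg1", "Ctg2", "Ctg3"],
                       [("Ctg1:E", "Ctg2:B"), ("Ctg2:E", "Ctg3:E")])
layout = orientations(paths[0])
```

With these two links the walk reproduces the example path on the slide, and the orientations come out as forward, forward, reverse.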
  11. NA19240 DATASET

    - Used the assembly done at PacBio (Jason Chin)
    - Number of contigs = 3242
    - Contig N50 = 23.98 Mb
    - Number of Hi-C pairs used: approximately 11.5 million
  12. SCAFFOLD STATISTICS

    Feature                                              Value
    # Scaffolds                                          118
    N50                                                  69.49 Mb
    Number of bases                                      2789088362 (2.78 Gb)
    # chr arms (p and q) covered by a single scaffold    26
    # chr arms covered by 2 scaffolds                    14
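For reference, N50 as reported in these statistics is conventionally the length L such that sequences of length at least L contain at least half of all bases. A minimal sketch under that standard definition (the function name and toy lengths are hypothetical, not values from the talk):

```python
def n50(lengths):
    """Smallest length L such that sequences of length >= L cover half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0                      # empty input


toy_n50 = n50([2, 3, 4, 5, 6])    # total 20; 6 + 5 = 11 reaches half at length 5
```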
  13. DOT PLOTS

    [Figure: dot plots for Chr 1 to 7, Chr 8 to 14, Chr 15 to 21, and Chr 22 & X. X axis: true position on the chromosome; Y axis: derived scaffold]
  14. NA12878 DATASET

    - PacBio assembly done by the Icahn School of Medicine at Mt. Sinai
      - Contig N50 = 1.55 Mb
      - Number of contigs = 21,235
      - Number of scaffolds = 18,903
      - Scaffold N50 = 26.83 Mb
      - Scaffolding was done by them using BioNano data
    - We used Hi-C reads (725 million read pairs) along with their assembly
      - Number of scaffolds = 1555
      - Scaffold N50 = 80 Mb
    - Probably really good Hi-C data?
  15. DOT PLOTS

    [Figure: dot plots for Chr 1 to 6, Chr 7 to 12, Chr 13 to 18, and Chr 19 to X. X axis: true position on the chromosome; Y axis: derived scaffold]
  16. COMPARISON WITH LACHESIS RESULTS

    - Figures taken from the paper, since we could not run their tool

    [Figure panels: Chromosome 17, Chromosome 7]
  17. NEXT?

    - Scaffold the NA12878 assembly with N50 = 13 Mb
      - With 726 million pairs, got bad scaffolds and a lot of mis-joins
      - After down-sampling to 12 million pairs, got much better scaffolds
    - The higher the N50, the more down-sampling the Hi-C data improves scaffolds (why?)
    - Possibly a nice information-theoretic argument: longer contigs -> less information for scaffolding?
    - Haplotype-specific scaffolding
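The slides do not say how the Hi-C pairs were down-sampled. One simple option, shown here purely as an assumption, is uniform random sampling of read pairs in a single pass (reservoir sampling), which avoids loading hundreds of millions of pairs into memory. The function name and toy read names are hypothetical.

```python
import random


def downsample_pairs(pairs, k, seed=0):
    """Uniformly sample k items from an iterable of read pairs in one pass."""
    rng = random.Random(seed)         # fixed seed so the subsample is reproducible
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < k:
            reservoir.append(pair)    # fill the reservoir with the first k pairs
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                reservoir[j] = pair   # replace with probability k / (i + 1)
    return reservoir


# Toy stream of 1000 read-pair names, down-sampled to 12.
sample = downsample_pairs((f"pair{i}" for i in range(1000)), k=12)
```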
  18. SCORING OF LINKS

    - Only considered reads that map to exactly 2 distinct contigs
    - Less informative than
    - Let M and N be the number of chunks mapped to the 2 different contigs
    - When chunks are skewed, M*N will be small
    - Case 1: M*N = 6
    - Case 2: M*N = 12
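The M*N weighting can be illustrated with a small sketch. The slide gives only the two products (6 and 12), not the chunk counts behind them, so the splits used below (1 + 6 chunks versus 3 + 4 chunks of a 7-chunk read) are hypothetical values chosen to reproduce those products; the function name is also hypothetical.

```python
from collections import Counter


def link_weight(chunk_contigs):
    """M * N for a read whose chunks map to exactly 2 distinct contigs, else 0."""
    counts = Counter(chunk_contigs)
    if len(counts) != 2:
        return 0                      # reads hitting 1 or 3+ contigs are ignored
    m, n = counts.values()
    return m * n


skewed = link_weight(["ctgA"] * 1 + ["ctgB"] * 6)     # 1 * 6
balanced = link_weight(["ctgA"] * 3 + ["ctgB"] * 4)   # 3 * 4
```

For a fixed number of chunks the product is largest when the chunks are split evenly, so a read that touches one contig only once contributes less evidence for the link than a read split evenly between the two.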
  19. NEXT?

    - GIAB NA24385 PacBio assembly with N50 = 4.5 Mb
    - Assembly done at PacBio with N50 ~ 13 Mb
    - Cergentis data generated at PacBio
    - Still in exploration; this type of data is not well studied yet
  20. For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com