Scaffolding of PacBio assemblies with Hi-C data

PACIFIC BIOSCIENCES® CONFIDENTIAL
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.

Scaffolding of long read assemblies using long range contact information
Jay Ghurye and Jason Chin

SCAFFOLDING OVERVIEW
- Finding the relative orientation and ordering of pre-assembled contigs
- Usually done using data not used during assembly
- Proven to be an NP-hard problem
- Greedy heuristics are used
- Used Hi-C data to scaffold a long-read assembly

HI-C TECHNOLOGY
[Figure: image taken from http://science.sciencemag.org/content/326/5950/289]

CURRENT APPROACHES
- LACHESIS (Burton et al., Nature Biotech), the most widely used
  - Needs the number of chromosomes to be pre-specified
  - Very hard to get running from the code on GitHub (we have not managed it yet)
  - Clustering step is very slow because it is hierarchical
- DNA-Triangulation (Kaplan et al., Nature Biotech)
  - Paper only shows results for chromosomes 4 and 14
  - Does not orient contigs within clusters
  - Assumes all contigs have the same length
- Neither tool removes mis-assemblies from the initial assembly
- Hi-C data was used for scaffolding in the recent Goat Genome project

METHOD
- Inputs: assembly and Hi-C paired-end reads
- Map Hi-C reads to contigs using BWA
- Filter low quality alignments (QV < 30)
- Final alignments
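The filtering step can be sketched as below. This is a minimal illustration, not the authors' pipeline: it assumes that "QV" on the slide refers to the mapping quality (MAPQ, column 5 of a SAM record) and that the BWA output is available as SAM text. The function name `filter_sam_lines` is hypothetical.

```python
def filter_sam_lines(lines, min_mapq=30):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq.

    Assumption: the slide's "QV < 30" filter is applied to the SAM MAPQ field.
    """
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:  # 0x4 = read unmapped
            continue
        if mapq >= min_mapq:
            yield line


if __name__ == "__main__":
    sam = [
        "@SQ\tSN:ctg1\tLN:1000\n",
        "r1\t0\tctg1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n",
        "r2\t0\tctg1\t20\t12\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    ]
    for kept in filter_sam_lines(sam):
        print(kept, end="")
```

In practice one would probably also require that both mates of a pair pass the filter; that pairing step is left out here.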

DETECTION OF MIS-ASSEMBLIES
[Figure labels: Hi-C read pair, Contig, Physical coverage of read pair, Chr 1, Chr 10]

DETECTION OF MIS-ASSEMBLIES
- Need to calculate per-base physical coverage
- Linear time algorithm: runs in O(M + N), where M is the length of the contig and N is the number of read pairs
- If coverage falls below a threshold, break the contig at that position
- Calculated using a variation of the maximum sum subarray problem
- For a set C of contigs, the whole process takes O(|C|(M + N)) time
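One way to get per-base physical coverage in O(M + N) is a difference array followed by a prefix sum. The sketch below uses that technique rather than the maximum-subarray variation named on the slide, so it should be read as an illustration of the complexity bound only. The span convention (0-based, half-open, from the leftmost base of one mate to the rightmost base of the other) and the threshold value are assumptions.

```python
from itertools import accumulate


def physical_coverage(contig_len, pair_spans):
    """Per-base physical coverage of a contig in O(M + N).

    pair_spans: (start, end) spans covered by each read pair,
    0-based and half-open (an assumed convention).
    """
    diff = [0] * (contig_len + 1)
    for start, end in pair_spans:
        diff[start] += 1  # coverage rises where a pair begins
        diff[end] -= 1    # and drops just past where it ends
    return list(accumulate(diff[:-1]))


def low_coverage_positions(coverage, threshold):
    """Positions whose coverage is below the threshold (candidate break points)."""
    return [i for i, c in enumerate(coverage) if c < threshold]


if __name__ == "__main__":
    cov = physical_coverage(10, [(0, 5), (2, 6), (7, 10)])
    print(cov)
    print(low_coverage_positions(cov, threshold=1))
```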

HI-C DATA PROPERTY

EDGE WEIGHT SCORING – METHOD 1
- Linear relationship between log(distance) and the number of mate pairs present
- For long contigs (> 10 Mb), fit a linear model to get the expected number of links shared at a given genomic distance
- For a pair of contigs, find the expected number of links (E) using the linear model, for all orientations
- Let A be the actual number of links shared
- Score = 1/(|E - A| + 1)
- Removes length bias
[Figure labels: Contig 1 and Contig 2, each with a B side and an E side]
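The scoring rule can be sketched as follows. This assumes an ordinary least-squares fit of link count against log10(distance); the slide does not say which log base or fitting procedure was used, and the function names and toy numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_links(distance, slope, intercept):
    """Expected number of links E at a genomic distance (log10 is an assumption)."""
    return slope * math.log10(distance) + intercept


def link_score(expected, actual):
    """Score = 1 / (|E - A| + 1), as on the slide."""
    return 1.0 / (abs(expected - actual) + 1.0)


if __name__ == "__main__":
    # Toy (distance, link count) observations taken within long contigs.
    distances, counts = [10, 100, 1000], [30, 20, 10]
    slope, intercept = fit_line([math.log10(d) for d in distances], counts)
    e = expected_links(100, slope, intercept)
    print(e, link_score(e, actual=18))
```

The score is highest (1.0) when the observed count matches the model exactly, and it would be evaluated once per candidate orientation of the contig pair.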

EDGE WEIGHT SCORING – METHOD 2
- The number of sites where the restriction enzyme cuts can also be used to reduce length bias
- Find the number of cut sites for each contig
- Only consider pairs aligned to the ‘end’ of contigs
- Score for a particular orientation = # mate pairs / # restriction sites
[Figure labels: two contigs, each with B and E ends]
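A small sketch of this normalization is given below. Two things are assumptions, not taken from the slides: the recognition site (AAGCTT, the HindIII site, used only as an example) and the choice to divide by the sum of the cut sites of the two contigs involved, since the slide does not say how the two counts are combined.

```python
def count_cut_sites(sequence, site="AAGCTT"):
    """Count occurrences (overlapping allowed) of a recognition site.

    The default site is an illustrative assumption.
    """
    count, pos = 0, sequence.find(site)
    while pos != -1:
        count += 1
        pos = sequence.find(site, pos + 1)
    return count


def restriction_score(n_mate_pairs, sites_a, sites_b):
    """Score = # mate pairs / # restriction sites.

    Assumption: the denominator is the combined site count of both contigs.
    """
    total_sites = sites_a + sites_b
    return n_mate_pairs / total_sites if total_sites else 0.0


if __name__ == "__main__":
    a = count_cut_sites("AAGCTTAAGCTT")
    b = count_cut_sites("CCAAGCTTGGAAGCTTTTAAGCTTAAAAGCTT")
    print(a, b, restriction_score(12, a, b))
```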

SCAFFOLD CONSTRUCTION ALGORITHM
- Sort all the links by score
- For each edge in the sorted list
  - Add that edge to the graph if neither of its nodes is already present in the graph
- Add edges between the ‘:B’ and ‘:E’ nodes of the same contig
- Cycles are broken by removing the lowest-cost edge in the cycle
- Automatically eliminates forks by picking higher-weight edges

Example links (node, node, score):
Ctg1:B  Ctg2:E  0.6
Ctg1:B  Ctg3:B  0.4
Ctg2:B  Ctg3:B  0.3
Ctg1:E  Ctg2:B  0.08
Ctg2:E  Ctg3:B  0.07

[Figure nodes: CTG1:B, CTG1:E, CTG2:B, CTG2:E, CTG3:B, CTG3:E]
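The greedy selection can be sketched as below, run on the example links from the slide. This is one reading of the slide, not the authors' code. In particular, cycle handling is done with a union-find over contigs: because links are visited from highest to lowest score, the link that would close a cycle is the lowest-scoring link in that cycle, so dropping it has the same effect as removing the lowest-cost edge afterwards. The names `contig_of` and `greedy_links` are hypothetical.

```python
def contig_of(node):
    """'Ctg1:B' -> 'Ctg1'."""
    return node.rsplit(":", 1)[0]


def greedy_links(links):
    """Pick links greedily by score; each contig end is used at most once."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    used, accepted = set(), []
    for a, b, score in sorted(links, key=lambda link: link[2], reverse=True):
        if contig_of(a) == contig_of(b):
            continue  # a contig's own B-E edge is added separately
        if a in used or b in used:
            continue  # fork: a higher-weight edge already claimed this end
        used.update((a, b))
        root_a, root_b = find(contig_of(a)), find(contig_of(b))
        if root_a == root_b:
            continue  # would close a cycle; this is its lowest-score edge
        parent[root_a] = root_b
        accepted.append((a, b, score))
    return accepted


if __name__ == "__main__":
    links = [
        ("Ctg1:B", "Ctg2:E", 0.6),
        ("Ctg1:B", "Ctg3:B", 0.4),
        ("Ctg2:B", "Ctg3:B", 0.3),
        ("Ctg1:E", "Ctg2:B", 0.08),
        ("Ctg2:E", "Ctg3:B", 0.07),
    ]
    for link in greedy_links(links):
        print(link)
```

On the slide's example only the 0.6 and 0.3 links survive, which chains the three contigs as Ctg1, Ctg2, Ctg3 once the B-E edges of each contig are added.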

SCAFFOLD CONSTRUCTION ALGORITHM - ASSIGNMENT
[Figure labels: Backbone scaffolds, Small scaffolds]

SCAFFOLD CONSTRUCTION ALGORITHM - PLACEMENT
[Figure labels: Small scaffold, Backbone scaffold]

SCAFFOLD CONSTRUCTION ALGORITHM
- For each connected component in the graph
  - Find the nodes with degree 1; there will always be 2 such nodes
  - Find a path between these 2 nodes
  - This is a ‘backbone’ scaffold
- Using the score, assign a backbone scaffold to every contig that is not in any seed scaffold
- For each unplaced contig
  - Insert it at the place in the seed scaffold that maximizes the total score for that path (sum of all edges in the path)
- Output the expanded seed scaffolds as final scaffolds

Example path: [Ctg1:B, Ctg1:E, Ctg2:B, Ctg2:E, Ctg3:E, Ctg3:B, ...]
Orientations read from the path: Ctg1 forward, Ctg2 forward, Ctg3 reverse
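The path extraction, orientation reading, and insertion steps can be sketched together as below. This is a minimal sketch under stated assumptions: the graph is already cycle-free, node names follow the `Contig:B` / `Contig:E` convention from the slides, and `score` is a hypothetical callable returning the link score between two contig ends (0 when there is no link). All function names are illustrative.

```python
def backbone_paths(accepted_links, contigs):
    """Walk every path component starting from a degree-1 node."""
    adj = {}
    for c in contigs:  # B-E edge of each contig
        adj.setdefault(f"{c}:B", []).append(f"{c}:E")
        adj.setdefault(f"{c}:E", []).append(f"{c}:B")
    for a, b, _ in accepted_links:
        adj[a].append(b)
        adj[b].append(a)
    seen, paths = set(), []
    for start in sorted(adj):
        if len(adj[start]) != 1 or start in seen:
            continue
        path, prev, node = [], None, start
        while node is not None:
            path.append(node)
            seen.add(node)
            following = [n for n in adj[node] if n != prev]
            prev, node = node, (following[0] if following else None)
        paths.append(path)
    return paths


def orientations(path):
    """B before E means forward; E before B means reverse."""
    return [(first.rsplit(":", 1)[0], "forward" if first.endswith(":B") else "reverse")
            for first in path[0::2]]


def best_insertion(path, contig, score):
    """Try every slot and both orientations; keep the highest total path score."""
    best_total, best_path = None, None
    for ends in ([f"{contig}:B", f"{contig}:E"], [f"{contig}:E", f"{contig}:B"]):
        for i in range(0, len(path) + 1, 2):
            cand = path[:i] + ends + path[i:]
            # Junctions between neighbouring contigs sit at odd indices.
            total = sum(score(cand[j], cand[j + 1]) for j in range(1, len(cand) - 1, 2))
            if best_total is None or total > best_total:
                best_total, best_path = total, cand
    return best_path


if __name__ == "__main__":
    links = [("Ctg1:B", "Ctg2:E", 0.6), ("Ctg2:B", "Ctg3:B", 0.3)]
    for p in backbone_paths(links, ["Ctg1", "Ctg2", "Ctg3"]):
        print(p, orientations(p))
```

Note that a path and its reverse complement describe the same scaffold, so the walk may report the opposite orientation for every contig compared with a drawing that starts from the other end.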

NA19240 DATASET
- Used assembly done at PacBio (Jason Chin)
- Number of contigs = 3242
- Contig N50 = 23.98 Mb
- Number of Hi-C pairs used: approximately 11.5 million

SCAFFOLD STATISTICS

Feature                                            Value
# Scaffolds                                        118
N50                                                69.49 Mb
Number of bases                                    2789088362 (2.78 Gb)
# chr arms (p and q) covered by a single scaffold  26
# chr arms covered by 2 scaffolds                  14
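For reference, N50 is the length L such that sequences of length at least L together contain at least half of all bases. A minimal sketch of the usual computation follows; the function name and the toy lengths are illustrative and not taken from the dataset above.

```python
def n50(lengths):
    """Length at which the running total, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```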

DOT PLOTS
[Figure: dot plots in four panels: Chr 1 to 7, Chr 8 to 14, Chr 15 to 21, Chr 22 & X. X axis: true position on chromosome; Y axis: derived scaffold]

ALIGNMENTS VISUALIZATION - Chromosome 19

ALIGNMENTS VISUALIZATION - Chromosome 11

NA12878 DATASET
- PacBio assembly done by Icahn School of Medicine at Mt. Sinai
  - Contig N50 = 1.55 Mb
  - Number of contigs = 21,235
  - Number of scaffolds = 18,903
  - Scaffold N50 = 26.83 Mb
  - Scaffolding was done by them using BioNano data
- We used Hi-C reads (725 million read pairs) along with their assembly
  - Number of scaffolds = 1555
  - Scaffold N50 = 80 Mb
- Probably really good Hi-C data?

DOT PLOTS
[Figure: dot plots in four panels: Chr 1 to 6, Chr 7 to 12, Chr 13 to 18, Chr 19 to X. X axis: true position on chromosome; Y axis: derived scaffold]

COMPARISON WITH LACHESIS RESULTS
- Figures taken from the paper, since we could not run their tool
[Figure panels: Chromosome 17, Chromosome 7]

NEXT?
- Scaffold the NA12878 assembly with N50 = 13 Mb
  - With 726 million pairs, got bad scaffolds and a lot of mis-joins
  - After down-sampling to 12 million pairs, got much better scaffolds
- With a higher contig N50, down-sampling the Hi-C data improves scaffolds (why?)
- Possibly a nice information-theoretic argument: longer contigs -> less information for scaffolding?
- Haplotype-specific scaffolding

Scaffolding using TLA data

TLA TECHNOLOGY
[Figure: taken from www.nature.com/nbt/journal/v32/n10/full/nbt.2959.html]

SIMPLIFIED VERSION
[Figure labels: TLA Read, Chromosome]

SCORING OF LINKS
- Only considered reads that map to exactly 2 distinct contigs
- Less informative than
- Let M and N be the number of chunks mapped to the 2 different contigs
- When chunks are skewed, M*N will be small
- Case 1: M*N = 6
- Case 2: M*N = 12
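A toy version of this score is sketched below. The chunk splits used in the example (1 and 6 chunks versus 3 and 4 chunks of a seven-chunk read) are an assumption chosen only because they reproduce the products 6 and 12 from the slide; the slide does not give the actual splits. The function name is hypothetical.

```python
from collections import Counter


def tla_link_score(chunk_contigs):
    """Return (contig_a, contig_b, M * N) for a read whose chunks hit exactly 2 contigs.

    chunk_contigs: the contig each chunk of one TLA read maps to.
    Reads touching any other number of contigs are ignored (None).
    """
    counts = Counter(chunk_contigs)
    if len(counts) != 2:
        return None
    (contig_a, m), (contig_b, n) = sorted(counts.items())
    return contig_a, contig_b, m * n


if __name__ == "__main__":
    skewed = ["ctg1"] + ["ctg2"] * 6          # M = 1, N = 6
    balanced = ["ctg1"] * 3 + ["ctg2"] * 4    # M = 3, N = 4
    print(tla_link_score(skewed))
    print(tla_link_score(balanced))
```

For a fixed number of chunks the product is largest when the chunks are split evenly between the two contigs, which is why a skewed read contributes a smaller score.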

NEXT?
- GIAB NA24385 PacBio assembly with N50 = 4.5 Mb
- Assembly done at PacBio with N50 ~ 13 Mb
- Cergentis data generated at PacBio
- Still in exploration; this type of data is not well studied yet

For Research Use Only. Not for use in diagnostics procedures.
© Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com
