Estimating the effects of repeats on assembly contiguity

For a perfect assembler at high coverage, the contiguity of an assembly at a finite read length is limited by repetitive sequences. We study the limit imposed by repeat structures in plants, and contrast it with human, as the read length is increased. We start with contigs assembled from long reads and perform an all-against-all alignment. Non-unique regions of the contigs define repeats. We compare the repetitive sequences in human, a fish and several plants including coffee, grape and maize. We show the tendency of repeats to cluster in several plant genomes. Clustered repeats are especially difficult to assemble from short reads: even when all short reads are known to come from the same 100 kb region, they remain ambiguous within the repeat cluster.

Shoudan Liang

June 17, 2016

Transcript

  1. Estimating the effects of repeats on assembly contiguity. Shoudan Liang and Jason Chin.

    For Research Use Only. Not for use in diagnostic procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.
  2. TWO INTERLACED REPEATS → AMBIGUITY

    [Figure: two diagrams with segments labeled Y, A, A, B, B, Z, and a question mark marking the ambiguity.] Ukkonen E, Theor Comput Sci. 1992, 92 (1): 191-211.
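
    The ambiguity can be illustrated with a toy example. The sketch below is an illustration under assumed inputs, not material from the slides: the repeats `A` and `B`, the unique segments `x`, `y`, `z`, `w`, `v`, and the helper `kmer_spectrum` are all made up. Two genomes that differ only by swapping the segments that follow the two copies of `A` share the same k-mer multiset whenever k - 1 does not exceed the repeat length, so reads that short cannot tell them apart.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Return the multiset of all length-k substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Made-up 10 bp repeats and 5 bp unique segments (illustrative assumptions).
A, B = "ACGGTCATGA", "TTGCAAGCTA"
x, y, z, w, v = "GATTC", "CCCCC", "GAGAG", "TTTTT", "CATCA"

# Interlaced layout ...A...B...A...B...; the two genomes swap y and w.
genome_1 = x + A + y + B + z + A + w + B + v
genome_2 = x + A + w + B + z + A + y + B + v


def indistinguishable(k):
    """True if length-k reads give identical spectra for both genomes."""
    return kmer_spectrum(genome_1, k) == kmer_spectrum(genome_2, k)


if __name__ == "__main__":
    for k in (8, 11, 12):
        print(k, indistinguishable(k))
```

    In this toy case the two genomes become distinguishable only once k reaches 12, the first length at which a single k-mer can span a whole 10 bp repeat plus one base on each side.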
  3. REPEATS IN BACTERIA

    - From Koren & Phillippy (Koren et al. Genome Biology 2013, 14:R101)
    - Density plot for 2,267 bacteria and archaea genomes
    - Long repeats are due to ribosomal RNA operons
    - The CRISPR/Cas9 system (used for genome editing) also shows up as a cluster of repeats
  4. GENOMES EXAMINED

    Animals: human and sea bass. Plants: coffee, grape, maize.

    | | human (CHM1, haploid hydatidiform mole) | sea bass | coffee | grape | maize |
    |---|---|---|---|---|---|
    | total genome sequence | 3.0 Gb | 0.67 Gb | 1.2 Gb | 0.96 Gb | 2.1 Gb |
    | # of contigs | 3641 | 3807 | 3929 | 2755 | 11436 |
    | contig N50 | 26.9 Mb | 1.1 Mb | 1.5 Mb | 1.4 Mb | 0.48 Mb |
  5. WE DEFINE A REPEAT BY ALIGNMENT

    - Take a PacBio assembly
    - All-versus-all alignment of contigs
    - Gene Myers’ daligner, which finds all local alignments
    - Require 99% identity and a minimum length of L (125 bp)

    [Figure: a genome with its repeats marked. In this example, the number of hits is 4.]
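
    The filtering step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `LocalAlignment` record and the function names are hypothetical, and the record is not daligner's actual output format. It only assumes that each local alignment carries its two intervals and an identity fraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalAlignment:
    """Hypothetical record for one local alignment between two contigs."""
    a_contig: str
    a_start: int
    a_end: int
    b_contig: str
    b_start: int
    b_end: int
    identity: float  # fraction of matching bases, between 0 and 1


def is_repeat_hit(aln, min_identity=0.99, min_length=125):
    """Apply the slide's thresholds: at least 99% identity and 125 bp."""
    # Use the shorter of the two aligned intervals as the alignment length.
    length = min(aln.a_end - aln.a_start, aln.b_end - aln.b_start)
    return aln.identity >= min_identity and length >= min_length


def count_hits(alignments, min_identity=0.99, min_length=125):
    """Count alignments that qualify as repeat hits."""
    return sum(is_repeat_hit(a, min_identity, min_length) for a in alignments)


if __name__ == "__main__":
    example = [
        LocalAlignment("ctg1", 0, 300, "ctg2", 1000, 1300, 0.995),  # kept
        LocalAlignment("ctg1", 500, 600, "ctg3", 40, 140, 0.999),   # too short
        LocalAlignment("ctg2", 0, 400, "ctg3", 900, 1300, 0.97),    # too divergent
    ]
    print(count_hits(example))
```

    Taking the shorter interval as the alignment length is a design choice of this sketch; the slides do not say how length is measured when the two intervals differ.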
  6. HUMAN ASSEMBLIES: SHORT READS VS LONG READS

    PacBio: CHM1. ALLPATHS-LG, DISCOVAR: NA12878.

    [Figure: comparison of the ALLPATHS-LG, DISCOVAR and PacBio assemblies, with a "< 400" annotation.]
  7. HOW ARE REPEATS CLUSTERED?

    - Going local: can one assemble repetitive regions
    - By knowing all reads are from the same 100 kb region
    - Using BAC libraries or synthetic long reads
  8. MULTIPLE REPEATS

    - From Cell, Vol. 100, 377–386 (2000)
    - Sequenced by eight tiling BACs
    - 0.51 Mb of isolated heterochromatin in Arabidopsis thaliana
    - Resembles the chromosomal knobs described by Barbara McClintock in maize
    - Involved in compacting of chromosomes

    [Figure: map of the region, with scale labels 100 kb, 1 bp and 510 kb.]
  9. BAC LIBRARY AND LINKED LONG READS

    Are there repeats within a 100 kb region?

    - Pick a 100 kb region at random from the PacBio assemblies.
    - Use a 125 bp minimum alignment length and 99% identity to count the number of repetitive links within that 100 kb region.
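
    The window-sampling procedure above could look roughly like the sketch below. It rests on several assumptions that are not stated in the slides: a "link" is taken to be an alignment whose two intervals both lie on the same contig and inside the window, the links are assumed to be already filtered at 125 bp and 99% identity, and the `Link` type and both function names are hypothetical.

```python
import random
from typing import NamedTuple


class Link(NamedTuple):
    """Hypothetical pre-filtered alignment between two loci on one contig."""
    contig: str
    a_start: int
    a_end: int
    b_start: int
    b_end: int


def count_local_links(links, contig, start, window=100_000):
    """Count links with both intervals inside [start, start + window)."""
    end = start + window
    return sum(
        1
        for link in links
        if link.contig == contig
        and start <= link.a_start and link.a_end <= end
        and start <= link.b_start and link.b_end <= end
    )


def sample_windows(contig_lengths, n, window=100_000, seed=0):
    """Draw n random window starts, weighting contigs by available positions."""
    rng = random.Random(seed)
    eligible = [(name, size) for name, size in contig_lengths.items() if size >= window]
    weights = [size - window + 1 for _, size in eligible]
    chosen = rng.choices(eligible, weights=weights, k=n)
    return [(name, rng.randint(0, size - window)) for name, size in chosen]


if __name__ == "__main__":
    links = [Link("ctg1", 1_000, 1_400, 60_000, 60_400),
             Link("ctg1", 5_000, 5_300, 250_000, 250_300)]
    for name, start in sample_windows({"ctg1": 300_000}, n=3):
        print(name, start, count_local_links(links, name, start))
```

    Tallying the counts over many sampled windows would give a histogram of hits per region, one per genome.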
  10. IN HUMAN, ABOUT 50% OF 100 KB REGIONS HAVE NO LOCAL REPEAT IN THEM.

    [Figure: two histograms, Human and Sea Bass. Axis labels: number_of_alignments_hits, percentage.]
  11. IN PLANTS, ALMOST ALL 100 KB REGIONS HAVE MULTIPLE REPEATS IN THEM.

    [Figure: three histograms, Grape, Coffee and Maize. Axis labels: number_of_alignments_hits, percentage.]
  12. REPEAT MOSAIC

    [Figure: a genome with its repeats marked, contrasting a long repeat with a repeat mosaic.]

    - A region can be covered by short repeats.
    - However, the region as a whole is unique because the particular combination of repeats is not found elsewhere.
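
    A toy sequence makes the mosaic idea concrete. The sketch below uses made-up 8 bp repeats `R1`, `R2`, `R3` and made-up spacers, all of which are assumptions for illustration: every short window of the mosaic region has another copy elsewhere in the genome, yet the region as a whole occurs only once.

```python
def count_occurrences(text, query):
    """Count occurrences of query in text, allowing overlaps."""
    count = 0
    pos = text.find(query)
    while pos != -1:
        count += 1
        pos = text.find(query, pos + 1)
    return count


def min_window_copies(region, genome, k):
    """Smallest genome-wide copy number among all length-k windows of region."""
    return min(
        count_occurrences(genome, region[i:i + k])
        for i in range(len(region) - k + 1)
    )


# Made-up repeats and spacers (illustrative assumptions, not real data).
R1, R2, R3 = "ACGTTGCA", "GGATCCTA", "TTAGGCAT"
MOSAIC = R1 + R2 + R3
# Elsewhere the genome holds R1+R2 and R2+R3, but never R1+R2+R3 together.
GENOME = MOSAIC + "CCCCCC" + R1 + R2 + "AAAAAA" + R2 + R3


if __name__ == "__main__":
    print("copies of the least-repeated 8 bp window:",
          min_window_copies(MOSAIC, GENOME, 8))
    print("copies of the whole 24 bp region:",
          count_occurrences(GENOME, MOSAIC))
```

    In this toy genome a read has to span all of `R2` plus at least one flanking base on each side before it can place the region uniquely.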
  13. EVIDENCE FOR REPEAT MOSAIC IN COFFEE GENOME

    [Figure: axis labels "Screening length" (1000, 4000, 8000) and "Length of Merged Repeats" (16K, 32K), with annotations 4Kb and 1Kb.]
  14. SUMMARY

    - Plants have significantly more repeats than human.
    - Within a 100 kb region, we find repeats with high probability, especially for plant genomes.
    - Repeat mosaic is one mechanism by which long reads are more unique.

    True long reads ≠ synthetic long reads