Learning Genome Structures From De Novo Assembly and Long-read Mapping

Jason Chin
September 20, 2014

Presented at the GRC workshop before the Genome Informatics Meeting 2014. Some examples of analyzing an assembly graph to find unusual spots in a genome, and of centromere repeat characterization with long reads.

Transcript

  1. Learning Genomic Structures From De Novo Assembly and Long-read Mapping

    Jason Chin (@infoecho) / Sept. 20 2014, GRC Workshop, Cambridge, UK

    FIND MEANING IN COMPLEXITY. © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
  2. Cost per Genome Dilemma

    Sequencing cost is down for sure, but getting a de novo human genome that meets the same scientific standard as the initial work does NOT follow Moore’s law.

    Chart labels: HGP: N50 ~100 kb; NCBI-34: contig N50 29 Mb; HuRef: 107 kb; BGI YH: 7.4 kb; KB1: 5.5 kb; NA12878: 24 kb; CHM1: 144 kb; RP11: 127 kb; PacBio® CHM1: 4378 kb from just a single random fragment library.

    According to the NHGRI website, the definition of “sequencing a genome” changed in 2008. The 1000 Genomes Project started in 2008, too.
  3. Question Asked!!

    •  Since the 1000 Genomes Project, we have learned a lot about point mutations. Can we go beyond that?
    •  What if we had 50, 100 or more human assemblies, so that we could address as much genetic variation as possible?
    •  Will all human genome sequencing one day be done in de novo fashion?
       –  If so, how can we get ready for that as bioinformaticians?

    Evan Eichler, in Future Opportunities for Genome Sequencing and Beyond, July 28-29, 2014
  4. Where We Are Now

    •  One PacBio® human data set is publicly available; more are likely to come
    •  Multiple groups have independently assembled the public CHM1 data set from raw data with new algorithms
    •  With the new alignment/assembly tools from Gene Myers, one can assemble a genome in ~20,000 CPU-hours (20X faster than the 400,000+ CPU-hours of the previous effort)

    New assembly statistics with Daligner (http://dazzlerblog.wordpress.com):

      #Seqs   5,058
      Mean    562,695
      Max     27,292,514
      N50     5,265,098
      Total   2,846,115,586
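As a reminder of what the N50 in the statistics above measures, here is a minimal sketch using the common definition (the length L such that contigs of length at least L cover half of the assembled bases). The function name `n50` and the toy lengths are illustrative only; the real CHM1 contig lengths are not in this transcript.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Made-up contig lengths (NOT the CHM1 assembly).
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))

# Sanity check on the slide's own numbers: Total / #Seqs gives the Mean.
print(2_846_115_586 // 5_058)
```

The last line reproduces the reported mean of 562,695 from the reported total and sequence count, which suggests the mean on the slide is truncated rather than rounded.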
  5. What Can We Learn from High-contiguity Human Assemblies?

    •  Low-hanging fruit
       –  Calling SNPs (assembly not needed, but it helps)
       –  Calling structural variants with whole-genome alignment approaches
       –  Inferring repeats by coverage analysis
    •  The assembly graph can provide information for understanding more complicated polymorphisms
  6. Call Structural Variation By Whole-genome Alignment

    •  Whole-genome alignments (~1 hr on a 32-core machine)
       –  With multi-threaded MUMmer
       –  Cluster the hits with mgaps, identify “gaps” in the alignments, and convert them to BED format for visualization

    [Figure: Structural Variants Called in Chromosome 1]
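The "gaps to BED" step can be sketched roughly as follows. This is not the pipeline used in the talk: the input layout (already clustered, collinear, forward-strand alignment blocks given as `(ref_start, ref_end, qry_start, qry_end)` tuples), the function name `call_gaps`, and the naming of the BED fourth column are all assumptions made for illustration. The 100 bp cutoff follows the threshold quoted on the SV size slide.

```python
def call_gaps(chrom, blocks, min_size=100):
    """Turn gaps between consecutive alignment blocks into BED lines.

    blocks: list of (ref_start, ref_end, qry_start, qry_end) tuples,
    0-based half-open, sorted by ref_start, all on the forward strand
    (an assumed, simplified input format).
    """
    bed = []
    for left, right in zip(blocks, blocks[1:]):
        ref_gap = right[0] - left[1]   # unaligned reference bases
        qry_gap = right[2] - left[3]   # unaligned assembly bases
        diff = qry_gap - ref_gap       # >0: extra sequence in the assembly
        if abs(diff) <= min_size:
            continue
        kind = "insertion" if diff > 0 else "deletion"
        start = left[1]
        end = max(right[0], start + 1)  # BED intervals must be non-empty
        bed.append(f"{chrom}\t{start}\t{end}\t{kind}_{abs(diff)}bp")
    return bed


# Hypothetical blocks: a 500 bp insertion, then an 800 bp deletion.
toy_blocks = [
    (0, 1000, 0, 1000),
    (1000, 2000, 1500, 2500),
    (2800, 3500, 2500, 3200),
]
for line in call_gaps("chr1", toy_blocks):
    print(line)
```

A real implementation would also have to handle inversions, overlapping blocks and repeats, which this sketch ignores.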
  7. Distribution of The Structural Variation Sizes

    •  Number of insertions/deletions: 13796 SV calls (insertions or deletions > 100 bp against hg19)
  8. Assembly Graph

    Each edge is associated with a sequence. Every path is a candidate model of part of the genome. From Gene Myers’ ISMB 2014 Keynote talk
  9. Dissect a Contig from a String Graph

    The anatomy of a contig from a string graph layout

    A contig: a linear non-branching path
    Each node: the beginning (5’) or end (3’) of a read
    Each edge: a continuous subsequence from one read

    E_k : (V1, V2, Read, Range) = (00099576_1:B, 00101043_0:B, 00101043_0, 1991-0)
    Read 1: 00099576_1, Read 2: 00101043_0

    In practice, we might just encode the paths in a contig rather than every single edge:

    C = (E_k, E_{k+1}, E_{k+2}, E_{k+3}) = (P_j, P_{j+1})

    [Figure: nodes V1 to V5 joined by edges E_k to E_{k+3}; the same contig C drawn as V1, V3, V5 joined by paths P_j and P_{j+1}]
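The edge tuple on this slide maps naturally onto a small record type. The sketch below parses the example edge and pulls out the edge's sequence from a read. The names `Edge`, `parse_edge` and `edge_sequence` are hypothetical, and one interpretation is assumed rather than stated on the slide: a range written high-to-low (such as 1991-0) is taken to mean the reverse complement of that stretch of the read.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    v1: str     # source node, e.g. "00099576_1:B" (read id plus end label)
    v2: str     # target node
    read: str   # read that supplies the edge sequence
    start: int  # range start within the read
    end: int    # range end within the read


def parse_edge(text):
    """Parse '(V1, V2, Read, start-end)' as written on the slide."""
    v1, v2, read, rng = [field.strip() for field in text.strip("() ").split(",")]
    start, end = (int(x) for x in rng.split("-"))
    return Edge(v1, v2, read, start, end)


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def edge_sequence(edge, reads):
    """Return the sub-sequence an edge contributes.

    ASSUMPTION: start > end means the range is read on the reverse strand.
    """
    seq = reads[edge.read]
    if edge.start <= edge.end:
        return seq[edge.start:edge.end]
    return seq[edge.end:edge.start].translate(_COMPLEMENT)[::-1]


e_k = parse_edge("(00099576_1:B, 00101043_0:B, 00101043_0, 1991-0)")
print(e_k)

# Toy read, not real data.
print(edge_sequence(Edge("a:B", "b:E", "r", 4, 0), {"r": "AACGTT"}))
```

Under this reading, a contig is simply an ordered list of such edges in which each edge's `v2` equals the next edge's `v1`.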
  10. Assembly String Graph of CHM1 Genome

    •  Largest connected component: 31998 nodes, 39399 edges, ~36.5% (~1 Gbp) of the human genome (total: 87572 nodes, 94530 edges)

    Figure annotation: Centromere?
    Casey Bergman: “it almost looks like an electron micrograph of the nucleus” #convergence
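Finding the largest connected component of an assembly graph is a plain graph traversal. The sketch below treats the string graph as undirected for this purpose and uses breadth-first search on a toy edge list; the function name `largest_component` and the toy graph are illustrative, and the talk does not say which tool produced the counts.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return (nodes, edge_count) of the largest connected component.

    edges: iterable of (u, v) pairs, treated as undirected here.
    """
    edges = list(edges)
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component

    # Both ends of an edge lie in the same component, so testing u suffices.
    edge_count = sum(1 for u, _ in edges if u in best)
    return best, edge_count


toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("x", "y")]
nodes, n_edges = largest_component(toy_edges)
print(len(nodes), n_edges)

# Node fraction of the largest component, from the counts on the slide.
print(f"{31998 / 87572:.1%}")
```

The node fraction computed from the slide's counts comes to about 36.5%, the same figure the slide quotes; the slide does not state whether its percentage was derived from nodes or from bases.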
  11. Polymorphism Structure vs. Local Assembly Graph Structure

    [Figure labels: SNPs, SVs; Diploid Genome; Segmental Duplication; Similar String Graph]
  12. Identify Contigs: A New Proposal

    [Figure labels: SNPs, SVs; Primary contig; Associated contig 1; Associated contig 2]

    1 full-length contig + 2 associated contigs. Keep the long-range information while maintaining the relations of the alternative alleles.
  13. Contig Graph and Segmental Duplication

    Contig 4076: one primary contig, 3 associated contigs, aligned to Chr7 and Chr12
  14. Examining an Assembly Graph at Contig Level Around 1q21

    •  Contig graph, 1q21, contig 4108: another potential segmental duplication?
  15. Another Intriguing Case

    •  Contig 4006 mapped to chr9

    The aligned region changes a lot in GRCh38.
  16. Contig Coverage Analysis

    [Figure: contig coverage plot, marked at 18.5 X, 2 * 18.5 X and 3 * 18.5 X]

    Plot annotations:
    •  High-coverage long contigs: 40 contigs > 100 kbp and > 2.5 * 18.5 X
    •  Poor assemblies, alignment artifacts, or sequence errors?
    •  High repeat elements
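The filter implied by this slide (long contigs whose read coverage is well above the 18.5 X baseline) can be sketched in a few lines. The thresholds come from the slide; reading "> 100kbp > 2.5 * 18.5 X" as "longer than 100 kbp and covered more than 2.5 times the baseline" is an interpretation. The three real contigs and their coverages are taken from the next slide, while `contig_X` is a made-up control.

```python
BASE_COVERAGE = 18.5   # baseline coverage from the slide
MIN_LENGTH = 100_000   # 100 kbp
MIN_FOLD = 2.5         # flag contigs above 2.5 * baseline


def high_coverage_contigs(contigs):
    """Return names of long contigs with unusually high coverage.

    contigs: iterable of (name, length_bp, coverage) tuples.
    """
    threshold = MIN_FOLD * BASE_COVERAGE
    return [
        name
        for name, length, coverage in contigs
        if length > MIN_LENGTH and coverage > threshold
    ]


contigs = [
    ("contig_4006", 687_000, 53.0),
    ("contig_4235", 453_000, 59.0),
    ("contig_3842", 235_000, 54.0),
    ("contig_X", 500_000, 19.0),  # hypothetical single-copy contig
]
print(high_coverage_contigs(contigs))

# Coverage relative to baseline is a crude copy-number hint.
for name, _, coverage in contigs:
    print(name, round(coverage / BASE_COVERAGE, 2))
```

A coverage ratio near 3 is only a hint of collapsed repeat copies; as the slide notes, poor assembly or alignment artifacts can produce the same signal.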
  17. Checking the Complexity of the High-coverage Contigs

    •  Contig 4006, 687 kb, 53x coverage
    •  Contig 4235, 453 kb, 59x coverage
    •  Contig 3842, 235 kb, 54x coverage

    Warning: These contigs may not be 100% correctly assembled due to some nasty repeats. However, the local graphs give hints about the true genome structures.
  18. Identify Centromere Alpha-satellite Structure

    •  Most of the nasty contig graphs are around the centromeres. Currently, it remains hard to get long contigs around those very long tandem repeats.
    •  However, we can still learn many useful things from long-read data
    •  Tool in development: α-Centauri, for identifying different higher-order repeat structures (https://github.com/volkansevim/alpha-CENTAURI, Volkan Sevim, Ali Bashir & Karen Miga)
  19. Example: A Read Reconstructs a 24-mer HOR

    Align monomers to each other to identify near-identical monomers
    Identify the HOR with the monomer IDs and positions

    [Figure: monomers labeled 1 to 24 along the read]
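Once near-identical monomers share an ID, spotting the HOR reduces to finding the repeat period in the sequence of IDs along the read. The sketch below does this with a brute-force period search. It is not the alpha-CENTAURI algorithm; the function name `hor_period` and the requirement of at least two full copies of the unit are assumptions made for illustration.

```python
def hor_period(monomer_ids):
    """Return the smallest period of the monomer-ID sequence, or None.

    A period p means monomer_ids[i] == monomer_ids[i + p] for every
    valid i. Only periods up to half the sequence length are tried, so
    at least two full copies of the repeat unit must be present.
    """
    n = len(monomer_ids)
    for period in range(1, n // 2 + 1):
        if all(
            monomer_ids[i] == monomer_ids[i + period]
            for i in range(n - period)
        ):
            return period
    return None


# Toy read: two full copies of a 24-monomer unit plus a partial third.
unit = list(range(1, 25))
read_ids = unit * 2 + unit[:10]
print(hor_period(read_ids))
```

Real reads need tolerance for mis-clustered or missing monomers, so an exact-match search like this one is only a starting point.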
  20. Many Other Open Topics

    •  Low-coverage assembly: cost vs. quality analysis
    •  Phasing for haplotypes
    •  Crowd-sourcing infrastructure for examining / annotating / correcting genome assemblies
    •  Evaluation of SNP calling with short reads on better assemblies
    •  Large-scale comparative genomics with de novo assemblies
    •  Assembly-graph data format
    •  Visualization techniques
    •  Combining other data types, e.g. optical mapping

    It is a very exciting time. We still need more tools to harvest information to generate new knowledge.
  21. For Research Use Only. Not for use in diagnostic procedures.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.