Learning Genome Structures From De Novo Assembly and Long-read Mapping

Learning Genomic Structures From De Novo Assembly and Long-read Mapping
Jason Chin (@infoecho) / Sept. 20 2014, GRC Workshop, Cambridge, UK
FIND MEANING IN COMPLEXITY © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.

Cost per Genome Dilemma
Sequencing cost is down for sure, but getting a de novo human genome that meets the same scientific standard as the initial work does NOT follow Moore’s law.
Labels on the chart:
•  HGP, N50 ~100 kb
•  NCBI-34, contig N50 29 Mb
•  HuRef: 107 kb
•  BGI YH: 7.4 kb
•  KB1: 5.5 kb
•  NA12878: 24 kb
•  CHM1: 144 kb
•  RP11: 127 kb
•  PacBio® CHM1: 4378 kb, from just a single random fragment library
According to the NHGRI website, the definition of “sequencing a genome” changed in 2008. The 1000 Genomes Project started in 2008, too.

Question Asked!!
•  Since the 1000 Genomes Project, we have learned a lot about point mutations. Can we go beyond that?
•  What if we had 50, 100 or more human assemblies, so that we could address as much genetic variation as possible?
•  Will all human genome sequencing one day be done in de novo fashion?
–  If so, how can we get ready for that as bioinformaticians?
Evan Eichler, in Future Opportunities for Genome Sequencing and Beyond, July 28-29, 2014

Where We Are Now
•  One PacBio® human data set is publicly available; more are likely to come
•  Multiple groups have independently assembled the public CHM1 data set from raw data with new algorithms
•  With the new alignment/assembly tools from Gene Myers, one can assemble a genome in ~20,000 CPU-hours (20X faster than the 400,000+ CPU-hours of the previous effort)
New assembly statistics, done with Daligner:
#Seqs  5,058
Mean   562,695
Max    27,292,514
n50    5,265,098
Total  2,846,115,586
http://dazzlerblog.wordpress.com
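The statistics quoted on this slide (sequence count, mean, max, N50, total) can be recomputed from a list of contig lengths. Below is a minimal Python sketch, assuming the usual definition of N50: the largest length L such that contigs of length L or longer hold at least half of the assembled bases. The function name and the toy lengths are illustrative and do not come from the talk.

```python
def assembly_stats(lengths):
    """Return basic contiguity statistics for a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    # Walk from the longest contig down until half of the bases are covered.
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "n_seqs": len(ordered),
        "mean": total / len(ordered),
        "max": ordered[0],
        "n50": n50,
        "total": total,
    }


# Toy contig lengths (hypothetical, not the CHM1 assembly).
stats = assembly_stats([100, 80, 50, 30, 20, 10])
```

Feeding it the real contig lengths of an assembly, for example parsed from a FASTA index, should reproduce a table like the one on the slide.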

What Can We Learn from High-contiguity De Novo Human Assemblies?

What Can We Learn from High-contiguity Human Assemblies?
•  Low-hanging fruit
–  Calling SNPs (assembly not needed, but it helps)
–  Calling structural variants with whole-genome alignment approaches
–  Inferring repeats by coverage analysis
•  The assembly graph can provide information for understanding more complicated polymorphisms

Call SNPs / Example: HLA-B

Call Structural Variation By Whole-genome Alignment
•  Whole-genome alignments (~1 hr on a 32-core machine)
–  With multi-threaded MUMmer
–  Cluster the hits with mgaps, identify “gaps” in the alignments, and convert them to BED format for visualization
[Figure: structural variants called in chromosome 1]
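The gap-finding step can be sketched as follows. This is a simplified illustration, not the pipeline used for the talk. It assumes the clustered hits for one chromosome/contig pair have already been parsed into collinear, forward-strand blocks of the form (ref_start, ref_end, qry_start, qry_end), and it labels each gap by comparing its length on the reference with its length on the contig. The function name, the tuple layout and the BED name field are all assumptions.

```python
def gaps_to_bed(chrom, blocks, min_size=100):
    """Return BED lines for gaps between consecutive collinear blocks.

    blocks: iterable of (ref_start, ref_end, qry_start, qry_end) tuples,
    0-based, half-open, forward strand (an assumed, pre-parsed layout).
    Only events strictly larger than min_size bp are reported.
    """
    ordered = sorted(blocks)
    bed = []
    for prev, curr in zip(ordered, ordered[1:]):
        ref_gap = curr[0] - prev[1]   # unaligned bases on the reference
        qry_gap = curr[2] - prev[3]   # unaligned bases on the contig
        size = qry_gap - ref_gap      # > 0 means extra sequence in the contig
        if abs(size) <= min_size:
            continue
        kind = "insertion" if size > 0 else "deletion"
        start = min(prev[1], curr[0])
        end = max(prev[1], curr[0], start + 1)  # BED intervals need end > start
        bed.append(f"{chrom}\t{start}\t{end}\t{kind}_{abs(size)}bp")
    return bed


# Toy blocks (hypothetical): a 318 bp insertion, then a 500 bp deletion.
blocks = [
    (0, 1000, 0, 1000),
    (1000, 2000, 1318, 2318),
    (2500, 3000, 2318, 2818),
]
bed_lines = gaps_to_bed("chr1", blocks)
```

A real implementation would also need to handle inversions, overlapping blocks and repeats at the gap boundaries, which this sketch ignores.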

Distribution of The Structural Variation Sizes
•  Number of insertions/deletions: 13796 SV calls (for insertions or deletions > 100 bp against hg19)

PacBio® vs. Short-read Alignment View for SV in the MHC region
[Figure label: 318 bp insertion]

Assembly Graph
Each edge is associated with a sequence. Every path is a candidate model of part of the genome.
From Gene Myers’ ISMB 2014 keynote talk

Dissect a Contig from a String Graph
The anatomy of a contig from a string graph layout
A contig: a linear non-branching path
Each node: the beginning (5’) or end (3’) of a read
Each edge: a contiguous subsequence from one read
E_k: (V1, V2, Read, Range) = (00099576_1:B, 00101043_0:B, 00101043_0, 1991-0)
Read 1: 00099576_1, Read 2: 00101043_0
In practice, we might just encode the paths in a contig rather than every single edge:
C = (E_k, E_k+1, E_k+2, E_k+3) = (P_j, P_j+1)
[Figure: nodes V1 to V5 joined by edges E_k to E_k+3; the same contig drawn as V1, V3, V5 joined by paths P_j and P_j+1]
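The edge tuple above maps naturally onto a small data structure. The sketch below assumes that a range written with start greater than end (as in 1991-0) denotes the reverse complement of that read interval. That convention, the class and function names, and the toy reads are assumptions made for illustration and are not confirmed by the slide.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    v1: str      # source node, e.g. "00099576_1:B"
    v2: str      # target node, e.g. "00101043_0:B"
    read: str    # read that supplies the edge sequence
    start: int   # range start on the read
    end: int     # range end on the read


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def edge_sequence(edge, reads):
    """Sequence carried by one edge; start > end is read as reverse strand."""
    seq = reads[edge.read]
    if edge.start <= edge.end:
        return seq[edge.start:edge.end]
    return seq[edge.end:edge.start].translate(_COMPLEMENT)[::-1]


def contig_sequence(path, reads):
    """Concatenate edge sequences along a linear, non-branching path."""
    for a, b in zip(path, path[1:]):
        if a.v2 != b.v1:
            raise ValueError(f"edges do not chain: {a.v2} != {b.v1}")
    return "".join(edge_sequence(e, reads) for e in path)


# Toy reads and node names (hypothetical).
reads = {"r1": "AACCGGTT", "r2": "GATTACA"}
path = [
    Edge("r0:E", "r1:E", "r1", 2, 6),   # forward interval of r1
    Edge("r1:E", "r2:B", "r2", 4, 0),   # reverse-strand interval of r2
]
contig = contig_sequence(path, reads)
```

Encoding a contig as a list of paths rather than single edges, as the slide suggests, would only change the container: each path would itself be a list of `Edge` records.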

Assembly String Graph of CHM1 Genome
•  Largest connected component: 31998 nodes, 39399 edges, ~36.5% (~1 Gbp) of the human genome (total: 87572 nodes, 94530 edges)
[Figure label: Centromere?]
Casey Bergman: “it almost looks like an electron micrograph of the nucleus” #convergence
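Node and edge counts for the largest connected component can be obtained with a plain breadth-first search over the edge list. The sketch below ignores edge direction (weak connectivity), which is an assumption about how the component on the slide was defined. The function name and the toy graph are illustrative.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return components of an edge list as sets of nodes, largest first.

    Edge direction is ignored (weak connectivity).
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for node in adjacency:
        if node in seen:
            continue
        seen.add(node)
        queue = deque([node])
        component = set()
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph (hypothetical): one triangle and one isolated edge.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
largest = connected_components(edges)[0]
# Every edge lies inside a single component, so testing one endpoint suffices.
n_edges_in_largest = sum(1 for u, _ in edges if u in largest)
```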

Polymorphism Structure vs. Local Assembly Graph Structure
[Figure labels: SNPs, SVs, Diploid Genome, Segmental Duplication, Similar String Graph]

Identify Contigs: A New Proposal
[Figure labels: SNPs, SVs, Primary contig, Associated contig 1, Associated contig 2]
1 full-length contig + 2 associated contigs
Keep the long-range information while maintaining the relations of the alternative alleles.

Contig 4076 Alignment Around DPY19L2 Locus
[Figure label: Same contig]

Contig Graph and Segmental Duplication
Contig 4076: one primary contig, 3 associated contigs, aligned to Chr7 and Chr12

Contig 4076 Alignment to Chr7
[Figure labels: Same contig; SV calls from CHM1 asm; SV calls from GRCh38]

Local Neighborhood Subgraph of Contig 4076

Examining an Assembly Graph at Contig Level Around 1q21
•  Contig graph, 1q21, contig 4108: another potential segmental duplication?

Another Intriguing Case
•  Contig 4006 mapped to chr 9
The aligned region changes a lot in GRCh38.

Contig Coverage Analysis
[Figure: contig coverage, marked at 18.5 X, 2 * 18.5 X and 3 * 18.5 X]
High-coverage long contigs: 40 contigs > 100 kbp with coverage > 2.5 * 18.5 X
Poor assemblies, alignment artifacts, or sequence errors?
High repeat elements
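The filter described on this slide (contigs longer than 100 kbp whose coverage exceeds 2.5 times the 18.5 X baseline) amounts to a simple threshold test. The sketch below uses the three contigs named on the next slide plus two made-up entries. The function name and the dictionary layout are assumptions.

```python
BASELINE = 18.5  # baseline coverage quoted on the slide


def flag_high_coverage(contigs, baseline=BASELINE, factor=2.5,
                       min_length=100_000):
    """Return IDs of long contigs whose coverage exceeds factor * baseline.

    contigs: dict mapping contig ID -> (length in bp, mean coverage).
    """
    cutoff = factor * baseline
    return [
        contig_id
        for contig_id, (length, coverage) in contigs.items()
        if length > min_length and coverage > cutoff
    ]


contigs = {
    "4006": (687_000, 53.0),
    "4235": (453_000, 59.0),
    "3842": (235_000, 54.0),
    "toy_single_copy": (500_000, 18.0),    # hypothetical
    "toy_short_repeat": (40_000, 90.0),    # hypothetical
}
flagged = flag_high_coverage(contigs)
```

A flagged contig is only a candidate: as the slide notes, excess coverage can come from collapsed repeats, but also from misassembly or alignment artifacts.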

Checking the Complexity of the High-coverage Contigs
Contig 4006, 687 kb, 53x coverage
Contig 4235, 453 kb, 59x coverage
Contig 3842, 235 kb, 54x coverage
Warning: These contigs may not be 100% correctly assembled due to some nasty repeats. However, the local graphs give hints about the true genome structures.

How Does the High-coverage Contig Look?
[Figure label: >2000X in this region]

How Does the High-coverage Contig Look?
[Figure labels: High-coverage region; Alpha satellites?]

Extreme Repeats

Identify Centromere Alpha-satellite Structure
•  Most of the nasty contig graphs are around the centromere. Currently, it remains hard to get long contigs around those very long tandem repeats.
•  However, we can still learn many useful things from long-read data
•  Tool in development: α-Centauri, for identifying different high-order repeat structures (https://github.com/volkansevim/alpha-CENTAURI; Volkan Sevim, Ali Bashir & Karen Miga)

Centromere Alpha Satellites Have Non-trivial High-order Repeat Structure
Karen Miga

Example: A Read Reconstructs a 24-mer HOR
Align monomers to each other to identify near-identical monomers
Identify the HOR with the monomer IDs and positions
[Figure: monomers numbered 1 to 24 along the read]
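Once near-identical monomers share an ID, finding the HOR reduces to finding the period of the ID sequence along the read. The sketch below illustrates that idea only. It is not the alpha-CENTAURI algorithm, and it assumes the monomers have already been clustered into integer IDs with no clustering errors. The function name and the toy read are hypothetical.

```python
def hor_period(monomer_ids, min_repeats=2):
    """Smallest period p such that ids[i] == ids[i + p] for every valid i.

    Returns None when no period fits at least min_repeats full copies.
    """
    n = len(monomer_ids)
    for period in range(1, n // min_repeats + 1):
        if all(monomer_ids[i] == monomer_ids[i + period]
               for i in range(n - period)):
            return period
    return None


# Toy read (hypothetical): two full copies of a 24-monomer unit plus
# the first five monomers of a third copy.
unit = list(range(1, 25))
read_ids = unit * 2 + unit[:5]
period = hor_period(read_ids)
```

Real reads contain sequencing errors and diverged monomers, so a practical tool would need to tolerate mismatches in the ID sequence rather than require an exact period.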

Many Other Open Topics
•  Low-coverage assembly: cost vs. quality analysis
•  Phasing for haplotypes
•  Crowd-sourcing infrastructure for examining / annotating / correcting genome assemblies
•  Evaluation of SNP calling with short reads on better assemblies
•  Large-scale comparative genomics with de novo assemblies
•  Assembly-graph data format
•  Visualization techniques
•  Combining other data types, e.g. optical mapping
It is a very exciting time. We still need more tools to harvest information to generate new knowledge.

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
