The importance of high quality reference genome assemblies to personal and medical genomics

#gi2015

Karyn Meltz Steinberg

October 28, 2015

Transcript

  1. The importance of high quality reference genome assemblies to personal and medical genomics
    Karyn Meltz Steinberg
    Genome Informatics 2015
    @KMS_Meltzy
  2. Last year… (Figure 1, Steinberg et al, 2014)
    [Figure: Contig Number and Contig N50 for CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
  3. This year…
    [Figure: Contig Number and Contig N50 for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
  4. This year… Log scale
    [Figure: Contig Number and Contig N50 on a log scale for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
  5. How do we define platinum and gold standards?

    Metric                                          GRCh38     Platinum (CHM1)   Gold (NA19240)
    % Reference genome covered                      100        98.40             90.80
    % Assigned chromosomes                          99.60      98.40             90.80
    % gene models covered (>95% id, >90% length)    99.96      98.78             94.26
    Contig N50                                      67.8 Mb    26.9 Mb           6.0 Mb
    Number of gaps                                  875        3,640             3,568
    Total Assembled size                            3.067 Gb   2.996 Gb          2.745 Gb
    % haplotype blocks (>1kb) resolved              NA         >95               >80

    http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
  6. CHM13 Draft Assembly (GCA_000983455.1)
    •  60X PacBio (P5 and P6 chemistry)
    •  Average read length ~11kb
    •  Daligner/Falcon v 0.2

    Total sequence length    2,851,367,788
    Number of contigs        2,873
    Contig N50               12,981,785
    Contig L50               68
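
Contig N50 and L50, as reported on slide 6, are commonly defined this way: sort contigs from longest to shortest and accumulate their lengths until half of the total assembly size is reached. N50 is the length of the contig that crosses that point, and L50 is the number of contigs it took. The Python sketch below follows that common definition; the slide does not state which convention was used, so that is an assumption, and both the helper name `n50_l50` and the toy input are made up for illustration.

```python
from typing import Iterable, Tuple


def n50_l50(contig_lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths, invented for illustration (not CHM13 data).
toy = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy))  # (70, 2)
```

Read this way, the slide's figures say that the 68 longest contigs, each at least 12,981,785 bp long, together cover at least half of the 2,851,367,788 bp assembly.
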
  7. CHM13 Hybrid Scaffolds Improve Contiguity

                            BioNano Map   PacBio Assembly   Hybrid Scaffold
    # of Contigs            3593          1590 *            254
    Min Contig Length       0.08 Mb       0                 0.27 Mb
    Median Contig Length    0.61 Mb       0.06 Mb           4.35 Mb
    Mean Contig Length      0.78 Mb       1.78 Mb           9.68 Mb
    Contig N50              1.02 Mb       12.98 Mb          20.79 Mb
    Max Contig Length       5.27 Mb       63.15 Mb          82.83 Mb
    Total Contig Length     2812 Mb       2824 Mb           2457.75 Mb

    *Number of contigs used in hybrid scaffolding
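
The figures on slide 7 can be cross-checked with simple arithmetic. The sketch below assumes that mean contig length is total contig length divided by contig count; the slide does not say how the means were derived, so treat that as an assumption. It also computes the N50 gain from hybrid scaffolding. The dictionary name `TABLE` is illustrative, and the values are copied from the slide.

```python
# Values copied from slide 7: (contig count, total length in Mb, contig N50 in Mb).
TABLE = {
    "BioNano Map": (3593, 2812.0, 1.02),
    "PacBio Assembly": (1590, 2824.0, 12.98),
    "Hybrid Scaffold": (254, 2457.75, 20.79),
}

# Assumed definition: mean contig length = total length / number of contigs.
means = {name: round(total / count, 2) for name, (count, total, _) in TABLE.items()}
print(means)

# Fold change in contig N50 from the PacBio assembly to the hybrid scaffold.
fold = TABLE["Hybrid Scaffold"][2] / TABLE["PacBio Assembly"][2]
print(round(fold, 2))  # 1.6
```

Under that assumption the recomputed means (0.78, 1.78 and 9.68 Mb) match the slide, and the hybrid scaffold N50 is about 1.6 times the PacBio assembly N50.
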
  8. BioNano can be used to size gaps and identify structural variants
    [Figure: PacBio Assembly vs. BioNano Map, with examples labeled Collapse, Expansion in Assembly, and Gap in Sequence]

    SV_TYPES      Count
    DELETIONS     41
    INVERSIONS    10
    INSERTIONS    15
    TOTAL         66

    BioNano alignment to CHM13
  9. BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications
    SD = 96%

    CHR1   46746040   46857004   40   W   LBHZ01000938.1   110965
    CHR1   46857005   47034202   41   N   177198   gap
    CHR1   47034203   52157695   42   W   LBHZ01000245.1   5123493

    [Figure: PacBio Assembly vs. BioNano Map]
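
The three rows on slide 9 resemble AGP records: a `W` row places a sequence contig and an `N` row declares a gap. The sketch below parses them exactly as they appear on the slide, which is an abbreviated layout and not the full AGP column set. It checks that each row's coordinate span equals its stated length and that the rows are contiguous. The names `AgpRow`, `parse_row`, and `declared_length` are hypothetical helpers, and the column interpretation is an assumption based on the slide.

```python
from typing import List, NamedTuple


class AgpRow(NamedTuple):
    obj: str           # placed object, e.g. CHR1
    start: int         # 1-based inclusive start on the object
    end: int           # 1-based inclusive end on the object
    part: int          # part number
    kind: str          # "W" = sequence component, "N" = gap
    fields: List[str]  # remaining columns as shown on the slide


def parse_row(line: str) -> AgpRow:
    cols = line.split()
    return AgpRow(cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4], cols[5:])


def declared_length(row: AgpRow) -> int:
    # Assumed slide layout: W rows list "accession length", N rows list "length gap".
    return int(row.fields[1] if row.kind == "W" else row.fields[0])


ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""

rows = [parse_row(line) for line in ROWS.splitlines()]

# Each span (end - start + 1) should equal the length stated on the row.
for row in rows:
    assert row.end - row.start + 1 == declared_length(row)

# Consecutive rows should abut with no overlap or missing bases.
for prev, nxt in zip(rows, rows[1:]):
    assert nxt.start == prev.end + 1

gap_bp = sum(r.end - r.start + 1 for r in rows if r.kind == "N")
print(gap_bp)  # 177198
```

The rows are internally consistent: the two contigs flank a single gap of 177,198 bp.
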
  10. This region is rich in medically relevant genes
    [Figure: chr1 (p33) view with SegDups, Genes, and CHM13 PacBio tracks]
    Genes: CYP4Z2P, CYP4A11, CYP4X1, CYP4Z1, CYP4A22
    CHM13 PacBio: LBHZ010000938.1, LBHZ010000938.1, LBHZ010000245.1
    This locus has an assigned GRC issue due to unresolved variation and may be a candidate locus for alternative representation in the reference
  11. Reference based Analyses
    •  100X Illumina sequence from CHM13
    •  Align to GRCh37 and GRCh38 with BWA-MEM
    •  Variant calling via SpeedSeq (Chiang et al, 2015)
        •  SNVs, indels: FreeBayes
        •  SVs: LUMPY, SVTyper
        •  CNV: CNVnator
  12. tl;dpa*
    •  The reference genome assembly is constantly being improved
    •  New PacBio-based assemblies are orders of magnitude more contiguous than previous WGS assemblies
    •  Integration of other data (e.g. BioNano, Dovetail) can improve contiguity even further and be used to identify structurally variant haplotypes that can be added to reference as alternative loci
    •  Platinum genome sequences integrated into GRCh38 have greatly improved read mapping and variant calling

    *too long; didn’t pay attention
  13. Acknowledgements
    The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam, The Finishing and Bioinformatics Teams at The Genome Institute
    University of Washington: Evan Eichler, John Huddleston, Archana Raja
    NCBI: Valerie Schneider
    University of Pittsburgh School of Medicine (CHM13 cell line): Urvashi Surti
    Personalis: Deanna Church
    BioNano Genomics: Palak Sheth
    Pacific Biosciences: Jason Chin, Nick Sisneros