
Assembling the genome sequences of the plastid and mitochondrion of white spruce

Shaun Jackman
January 14, 2014


The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) are assembled from whole-genome Illumina sequencing data using ABySS and aligned to those of Norway spruce (Picea abies) using BWA-MEM. The putative mitochondrial sequences are classified using k-means clustering in R. The plastid genome is 120 kbp and the putative mitochondrial genome is 6 Mbp.


Transcript

  1. Assembling the genome
    sequences of the plastid and
    mitochondrion of white spruce
    PAG 2014 Bioinformatics Workshop

    Shaun Jackman @sjackman

    2014-01-14
    Shaun D Jackman (1), Anthony Raymond (1), Ben Vandervalk (1), Hamid Mohamadi (1), René Warren (1), Stephen Pleasance (1), Robin Coope (1), Macaire MS Yuen (2), Christopher Keeling (2), Carol Ritland (2), Jean Bousquet (3), Alvin Yanchuk (4), Kermit Ritland (2), John MacKay (3), Steven JM Jones (1), Jörg C Bohlmann (2) and İnanç Birol (1)
    (1) BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, Canada
    (2) University of British Columbia, Vancouver, BC, Canada
    (3) Université Laval, Quebec, QC, Canada
    (4) British Columbia Ministry of Forests, Victoria, BC, Canada
    Photo credit: Joseph O'Brien, USDA Forest Service, bugwood.org

  2. Assembling the Genome Sequences of the Plastid
    and Mitochondrion of White Spruce (Picea glauca)
    PAG 2014 Bioinformatics Workshop

    Shaun Jackman @sjackman

    2014-01-14

  3. [Slide image: first pages of three publications]
    • Jared T. Simpson, Kim Wong, Shaun D. Jackman, et al. ABySS: A parallel assembler for short read sequence data. Genome Res. 2009 19: 1117-1123. doi:10.1101/gr.089532.108
    • The Norway spruce genome sequence and conifer genome evolution. Nature. doi:10.1038/nature12211. Draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), with 28,354 well-supported genes.
    • Inanc Birol, Anthony Raymond, Shaun D. Jackman, et al. Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics 2013 29(12): 1492–1497. doi:10.1093/bioinformatics/btt178. A 20.8 Gbp draft genome in 4.9 million scaffolds with a scaffold N50 of 20,356 bp, available through NCBI (accession ALWZ0100000000, BioProject PRJNA83435), http://www.ncbi.nlm.nih.gov/bioproject/83435

  4. Sequencing Data
    • 65-fold coverage with HiSeq
    • 4-fold coverage with MiSeq
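    HiSeq 2000
    • Paired-end 2 × 150 bp reads, 250 bp fragments: 11x
    • Paired-end 2 × 150 bp reads, 500 bp fragments: 54x
    • Mate-pair 2 × 100 bp reads, 6 kb, 8 kb and 12 kb fragments
    MiSeq
    • Paired-end 2 × 300 bp reads, 500 bp fragments: 3x
    • Paired-end 2 × 500 bp reads, 500 bp fragments: 1x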

  5. 500-bp MiSeq reads
    Courtesy of Robin Coope @robincoope
    Figure labels: cartridge splitter; MiSeq-XL cartridge base; MiSeq-XL reagent tray & lid; screws for reagent tray lid; splash guard

  6. Merge overlapping reads
    FastQC plot of base quality Courtesy of Tony Raymond @tgjraymond
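
The idea of merging an overlapping read pair can be sketched as follows. This is a minimal illustration only: the slides do not say which tool merged the reads, and the function names, the minimum overlap and the mismatch tolerance are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a forward read and its reverse mate if their ends overlap.

    Hypothetical helper: tries the longest overlap first and accepts the
    first one whose mismatch rate is at or below max_mismatch_rate.
    Returns None when no acceptable overlap is found.
    """
    mate = reverse_complement(read2)  # put both reads on the same strand
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            return read1 + mate[overlap:]
    return None


# Made-up 30 bp fragment sequenced as two 20 bp reads that overlap by 10 bp.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2)
```

A production merger would also use base qualities to resolve disagreements inside the overlap, which this sketch ignores.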

  7. Genome Assembly of White Spruce
    • Assembled using ABySS
    • Unitigs: 1,560 cores and 5,460 GB of RAM for two days
    • Contigs: 288 cores and 73 GB of RAM for four days
    • Scaffolds: 36 cores and 62 GB of RAM for four days

    White Spruce PG29             Published   Latest
    ABySS version                 1.3.5       1.3.7
    Number of contigs (≥500 bp)   4.9 M       4.2 M
    N50                           20.4 kbp    34.5 kbp
    Largest scaffold              1.05 Mbp    1.45 Mbp
    Assembled genome size         20.8 Gbp    20.8 Gbp
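
The N50 statistic quoted here and on later slides can be computed as in the sketch below. It uses the common definition (the length L such that sequences of length at least L hold half or more of the assembled bases). The function name is illustrative, and ABySS's own statistics tool may differ in details such as minimum-length filtering.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and return the length at which
    the running total first reaches half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Toy example: total is 54, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```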

  8. Organellar Sequence in the Genome Assembly
    Courtesy of Tony Raymond @tgjraymond
    Figure annotation: ~6 Mbp

  9. Plastid Genome
    Photo credit: Kristian Peters

  10. Plastid Genome Sequence
    • 4.7 million MiSeq read pairs of 300 bp
    • Merged the overlapping paired reads
    • 3.0 million merged reads of 492 bp median
    • Assembled these reads using ABySS
    • Separated six plastidial sequences by length and depth of coverage

  11. The plastid genome
    Six scaffolds with depth of coverage >70x and length >5 kbp reconstruct the plastid
    Figure annotation: Six plastid sequences
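
The selection by length and depth of coverage can be sketched as a simple filter. The two thresholds (>5 kbp and >70x) come from the slide. The `Scaffold` record, the function name and the example values are hypothetical, and the slides do not describe how depth was measured.

```python
from typing import NamedTuple


class Scaffold(NamedTuple):
    name: str
    length: int    # scaffold length in bp
    depth: float   # mean depth of coverage (fold)


def select_plastid(scaffolds, min_length=5_000, min_depth=70.0):
    """Keep scaffolds longer than min_length bp and deeper than min_depth."""
    return [s for s in scaffolds if s.length > min_length and s.depth > min_depth]


# Made-up example values, for illustration only.
example = [
    Scaffold("s1", 70_000, 85.0),  # long and deep: kept
    Scaffold("s2", 12_000, 3.5),   # long but shallow: dropped
    Scaffold("s3", 800, 120.0),    # deep but short: dropped
]
print([s.name for s in select_plastid(example)])  # prints ['s1']
```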

  12. Plastid Genome Assembly
    • 125 kbp in six scaffolds with a 70 kbp N50
    • Scaffolded using 230 M mate-pair HiSeq read pairs
    • One circular scaffold of 125 kbp
    • 21 thousand reads (1/140 or 0.7%) map to the assembled plastid
    • 80-fold coverage of the plastid

  13. Plastid Genome Comparison
    • Aligned the white spruce plastid to the Norway spruce plastid
    • 99.2% identity and 98.8% coverage of the Norway spruce plastid
    • All 117 annotated genes are covered
    • 114 full length and 3 partial

  14. Mitochondrial Genome
    Illustration courtesy of Gary Carlson http://gcarlson.com/

  15. Mitochondrial Genome Sequence
    • 133 million HiSeq read pairs of 150 bp
    • Filled the gap between the paired-end reads using a Bloom filter de Bruijn graph (ABySS-connectpairs)
    • 1.4 million merged reads of 465 bp median
    • Assembled these reads using ABySS
    • 377 thousand merged reads (1/350 or 0.3%) map to the assembled mitochondrion
    • 30-fold coverage of the mitochondrion
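
The fold-coverage figures for the two organelles can be checked with a back-of-the-envelope calculation: mapped reads times read length, divided by genome size. The sketch below assumes the median merged-read length is a fair stand-in for the mean, which the slides do not state. The function name is illustrative.

```python
def fold_coverage(num_reads, read_length_bp, genome_size_bp):
    """Approximate depth of coverage as total read bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


# Plastid: 21 thousand merged reads of 492 bp over a 125 kbp genome.
plastid = fold_coverage(21_000, 492, 125_000)
# Mitochondrion: 377 thousand merged reads of 465 bp over about 6.0 Mbp.
mitochondrion = fold_coverage(377_000, 465, 6_000_000)

print(f"plastid: {plastid:.1f}x")              # prints plastid: 82.7x
print(f"mitochondrion: {mitochondrion:.1f}x")  # prints mitochondrion: 29.2x
```

Both values agree with the rounded 80-fold and 30-fold coverage reported on the slides.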

  16. Mitochondrial Genome Assembly
    • Assembled one lane of HiSeq data using ABySS
    • 8.4 Mbp in 1001 scaffolds larger than 2 kbp with a 29 kbp N50
    • Separated putative mitochondrial sequence by length, depth of coverage and GC content
    • 6.0 Mbp in 223 scaffolds larger than 2 kbp with a 39 kbp N50
    • Scaffolded using 230 M mate-pair HiSeq read pairs
    • 6.0 Mbp in 78 scaffolds larger than 2 kbp with a 157 kbp N50
    • The largest scaffold is 519 kbp

  17. k-mer coverage vs GC content
    Figure annotation: Putative mitochondrion
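
One of the two axes of this plot, GC content, is simple to compute per scaffold. The sketch below is an assumption about the definition (G and C as a fraction of unambiguous bases). The slides do not say how ambiguous bases such as N were handled.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence has no unambiguous bases")
    return gc / (gc + at)


print(gc_content("GGCCAATT"))  # prints 0.5
```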

  18. Classifying the sequences using k-means clustering
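
The talk's abstract says the clustering was done with k-means in R. As a rough stand-in, the sketch below implements plain Lloyd's algorithm in Python using only the standard library. The two features (log10 of k-mer coverage, GC content), the data points and the choice of k = 2 are made up for illustration and are not taken from the slides.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (centroids, labels). Initial centroids are k points drawn
    with a seeded random generator so that a run is repeatable.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


# Hypothetical scaffolds as (log10 k-mer coverage, GC fraction).
points = [
    (1.0, 0.38), (1.1, 0.37), (0.9, 0.39),  # low-coverage group
    (3.0, 0.45), (3.1, 0.44), (2.9, 0.46),  # high-coverage group
]
centroids, labels = kmeans(points, k=2)
```

Features on very different scales would normally be standardized first, since Euclidean distance otherwise lets the widest-ranging feature dominate.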

  19. Mitochondrial Genome Comparison
    • The white spruce putative mitochondrial sequence is 6.0 Mbp in 78 scaffolds larger than 2 kbp with a 157 kbp N50
    • The Norway spruce putative mitochondrial sequence is 5.5 Mbp in 294 scaffolds larger than 4 kbp with a 28 kbp N50
    • 3.3 Mbp of these two assemblies align to each other with BWA
    • 98.3% identity and 59.6% coverage of the Norway spruce putative mitochondrial sequence
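
Coverage of a reference by alignments can be sketched as the union of the aligned intervals divided by the reference length. The interval representation (0-based, half-open) and the function names are assumptions for illustration. The slides do not describe how the BWA output was summarized.

```python
def covered_bases(intervals):
    """Count reference bases covered by at least one alignment.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping or adjacent intervals are merged so no base counts twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def coverage_fraction(intervals, reference_length):
    """Fraction of the reference covered by the given alignments."""
    return covered_bases(intervals) / reference_length


# Toy example: 20 + 10 covered bases on a 50 bp reference.
print(coverage_fraction([(0, 10), (5, 20), (30, 40)], 50))  # prints 0.6
```

With the rounded figures on the slide, 3.3 Mbp aligned out of 5.5 Mbp is about 60%, in line with the reported 59.6%.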

  20. Summary of Results
    • One lane of MiSeq data assembles the 124 kbp plastid genome of white spruce
    • One lane of HiSeq data assembles the estimated 6 Mbp mitochondrial genome of white spruce
    • Aligned to the complete plastid genome (NC_021456) and putative mitochondrial sequences of Norway spruce

    Alignment       Identity   Coverage
    Plastid         99.2%      98.8%
    Mitochondrion   98.3%      59.6%

  21. Further Work
    • Improve both assemblies by scaffolding and closing gaps
    • Annotate the genes of the plastid and mitochondrion
    • Determine whether the putative mitochondrial sequences are in fact mitochondrial (BLAST, circular scaffolds)
    • Investigate how the mitochondrial genome grew to such a large size

  22. Assembling the genome
    sequences of the plastid and
    mitochondrion of white spruce
    PAG 2014 Bioinformatics Workshop

    Shaun Jackman @sjackman

    2014-01-14

  23. Population Structure - Skimikin
    • Initial structure analyses
    • 5k random SNPs in HWE
    – 100k burn in
    – 200k MCMC generations
    – 1-15 genetic components (K), and K = 3
    1/7/14

  24. Genome Sequencing of White Spruce PG29
    Read     Read          Sequencing   Fragment      # Libraries   # Reads (M)   Fold
    Format   Length (bp)   Platform     Length (bp)                               Coverage
    PET      150           HiSeq 2000   250           2             1,520         11.4
    PET      150           HiSeq 2000   500           19            7,000         52.5
    PET      300           MiSeq        500           4             170           2.6
    PET      500           MiSeq        500           1             46            1.2
    MPET     100           HiSeq 2000   6,000         1             268           7%
    MPET     100           HiSeq 2000   8,000         1             248           15%
    MPET     100           HiSeq 2000   12,000        7             34            60%
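
The fold-coverage values of the four paired-end (PET) rows can be reproduced from the other columns. The worked arithmetic below assumes that "# Reads (M)" counts individual reads in millions and that the genome size G is 20 Gbp, a round figure close to the 20.8 Gbp assembly. Neither assumption is stated on the slide.

```latex
\begin{align*}
\text{fold coverage} &= \frac{N_{\text{reads}} \times L_{\text{read}}}{G} \\[4pt]
\text{HiSeq, 250 bp fragments:}\quad
  \frac{1520 \times 10^{6} \times 150}{20 \times 10^{9}} &= 11.4 \\[4pt]
\text{HiSeq, 500 bp fragments:}\quad
  \frac{7000 \times 10^{6} \times 150}{20 \times 10^{9}} &= 52.5 \\[4pt]
\text{MiSeq, 300 bp reads:}\quad
  \frac{170 \times 10^{6} \times 300}{20 \times 10^{9}} &= 2.55 \approx 2.6 \\[4pt]
\text{MiSeq, 500 bp reads:}\quad
  \frac{46 \times 10^{6} \times 500}{20 \times 10^{9}} &= 1.15 \approx 1.2
\end{align*}
```

The mate-pair (MPET) rows list percentages in the last column, so the same formula does not apply to them.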

  25. Align the white spruce plastid to the Norway spruce plastid
    99.2% identity and 98.8% coverage

  26. Connecting Paired-end Reads
    Courtesy of İnanç Birol
    Figure labels: 2x250, 2x150 and 2x300 reads; 400 bp, 500 bp and 600 bp fragments; Bloom filter "Exists?" query
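
The idea of connecting a read pair through a Bloom filter de Bruijn graph can be sketched as follows: every k-mer seen in the reads is stored in a Bloom filter, and the gap between two mates is closed by searching for a path of k-mers that the filter reports as present. This is an illustration of the concept only and not the ABySS implementation. The class, function names, filter size, hash scheme and breadth-first search are all assumptions made for the example.

```python
import hashlib
from collections import deque


class BloomFilter:
    """A small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every k-mer of seq in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def connect_pair(read1, mate, bloom, k, max_length):
    """Search the implicit de Bruijn graph for a path joining two reads.

    read1 and mate must be on the same strand (mate already reverse
    complemented) and separated by a gap. Returns the connected sequence,
    or None if no path shorter than max_length bases is found.
    """
    start, goal = read1[-k:], mate[:k]
    queue = deque([(start, read1)])
    seen = {start}
    while queue:
        kmer, path = queue.popleft()
        if kmer == goal:
            return path + mate[k:]
        if len(path) + len(mate) - k >= max_length:
            continue  # extending further would exceed the length limit
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in bloom and successor not in seen:
                seen.add(successor)
                queue.append((successor, path + base))
    return None


# Made-up 30 bp fragment; the filter holds the k-mers of all reads over it.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
k = 5
bloom = BloomFilter()
for kmer in kmers(fragment, k):
    bloom.add(kmer)
connected = connect_pair(fragment[:10], fragment[20:], bloom, k, max_length=40)
```

A real implementation has to handle branching caused by repeats and sequencing errors, for example by limiting the number of branches explored or rejecting pairs with more than one path.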

  27. Classifying using principal component analysis