Improving genome annotation strategies for biodiverse species using cloud technologies

IMPROVING GENOME ANNOTATION STRATEGIES FOR BIODIVERSE SPECIES USING CLOUD TECHNOLOGIES
Rebecca B. Dikow*, Paul B. Frandsen*, David Cruley, Daniel Davis, Sandeep Gupta, Stephanie Speirs, Beth A. Stern, Mathew Taylor, and Deron Burba

THE BIODIVERSE GENOME CHALLENGE
▸ Huge range of genome sizes and complexities
▸ Every genome assembly and annotation is truly de novo: no reference genomes
▸ We often need special techniques: hybrid assembly, lots of long-read sequencing to get a contiguous genome worth annotating
▸ Usually less funding out there on the lonely branches of the tree of life
▸ Not everyone has HPC

WHO WE ARE
Smithsonian Institution
Paul & Rebecca: Smithsonian Data Science Team (Office of the CIO)
Beth & Dan: OCIO Office of Research Information Services
Deron: CIO
Intel Corporation
Sandeep: Platform Applications Engineer
Mathew: Senior Solutions Strategist and Architect
Amazon Web Services
Stephanie: Federal Account Manager
Dave: Senior Solution Architect
Dikow, R. B., Gupta, S., Taylor, M. H. 2016. Accelerating Plant and Animal Genomics for Biodiversity with the Latest Intel Technologies. Intel Corporation white paper.

GOALS OF THIS PROJECT
▸ The 1 hour, $10 annotation
▸ Annotate 25 Smithsonian genomes in 12 months
▸ Figure out the best way to implement existing annotation pipelines in the cloud
▸ Write our own pipeline that takes advantage of cloud strengths
  ▸ fast AND easy to spin up

CLOUD VS. LOCAL HPC
▸ Cloud is an incremental investment, but you have to be careful!
▸ Cloud users can easily deploy other people’s AMIs and use Docker containers
▸ HPC often appears free to most users (but it’s not!)
▸ HPC is often not agile in terms of software installation
▸ HPC has queue limits
▸ When HPC is full, it’s full

SI HPC Usage, December 2016
3,300 CPUs, max 64 CPUs per node; 18 TB total RAM, max 1 TB RAM per node

SOME SMITHSONIAN GENOMES AT BIOGENOMICS CONFERENCE
check out our posters!
check out HC Lim’s talk!
check out Mirian Tsuchiya’s poster!
Other Smithsonian genomes:
Amakihi
Golden collared manakin
Raccoon
Heliconius spp.
Greater bamboo lemur
…

ANNOTATION IN A NUTSHELL
[Flowchart: from fasta file to annotation]
▸ Repeat masking (RepeatMasker, RepeatRunner, WinMasker, repbase libraries): fasta file → fasta with masked repeats
▸ Align RNAseq data (tBLASTx, StringTie, TopHat) and polish BLAST alignments (exonerate) → intermediate files with genome alignment info (GFFs)
▸ ab initio gene prediction (Augustus, SNAP, GENEMARK, FGENESH)
▸ Intermediate files with two sets of evidence → chooser algorithm (JIGSAW, EvidenceModeler, GLEAN, Evigan) → annotation

EXAMPLE RESULTING GFF FILE
#gff-version 3
flattened_line_246 . contig 1 496513 . . . ID=flattened_line_246;Name=flattened_line
flattened_line_246 maker gene 61267 68049 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker mRNA 61267 68049 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 61267 61319 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 61481 61621 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 64776 64905 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 64965 65339 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 65391 65585 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 67078 67162 . + . ID=maker-flattened_line_246-augustus
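To make the nine-column layout of the GFF records above concrete, here is a minimal Python sketch that parses such records and totals exon lengths. The `Feature` type and the function names are illustrative assumptions, not part of MAKER. The GFF3 specification uses tab-delimited columns; the sketch splits on any whitespace so that it also accepts the space-separated form shown on the slide.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """One GFF3 record (hypothetical helper type for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    strand: str
    attributes: str


def parse_gff3(text: str) -> List[Feature]:
    """Parse GFF3 records, skipping comment and blank lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split into at most 9 columns; the last column keeps the attributes.
        cols = line.split(None, 8)
        if len(cols) != 9:
            continue  # ignore malformed records in this sketch
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
        features.append(
            Feature(seqid, source, ftype, int(start), int(end), strand, attrs)
        )
    return features


def exon_length(features: List[Feature]) -> int:
    """Total exon length; GFF coordinates are 1-based and inclusive."""
    return sum(f.end - f.start + 1 for f in features if f.type == "exon")


# A few records copied from the slide above.
EXAMPLE = """#gff-version 3
flattened_line_246 . contig 1 496513 . . . ID=flattened_line_246;Name=flattened_line
flattened_line_246 maker gene 61267 68049 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 61267 61319 . + . ID=maker-flattened_line_246-augustus
flattened_line_246 maker exon 61481 61621 . + . ID=maker-flattened_line_246-augustus
"""

features = parse_gff3(EXAMPLE)
```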

MAKER
▸ MAKER (Cantarel et al. 2008; Holt & Yandell 2011) came out in 2008
▸ Major steps: identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and synthesizes these into gene annotations
▸ Operationally, each of these processes is performed on each contig in turn, producing a GFF (General Feature Format) file for each contig
▸ Large contigs are “chunked” into smaller pieces and the resulting GFFs are then knitted together, to optimize compute time (more operations can run in parallel) and to save RAM
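The chunk-and-knit idea can be sketched in a few lines of Python. This is not MAKER's implementation: the function names and the chunk size are assumptions chosen for illustration, and the sketch ignores features that span a chunk boundary, which a real pipeline has to handle.

```python
from typing import List, Tuple


def chunk_contig(length: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) windows that cover a contig."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, length))
        for start in range(1, length + 1, chunk_size)
    ]


def to_contig_coords(chunk_start: int, feat_start: int, feat_end: int) -> Tuple[int, int]:
    """Shift chunk-relative 1-based coordinates back onto the contig."""
    offset = chunk_start - 1
    return feat_start + offset, feat_end + offset


# The contig from the GFF example is 496,513 bp long; the 100,000 bp
# chunk size is a hypothetical value used only for this sketch.
chunks = chunk_contig(496513, 100000)
```

Each chunk can then be annotated independently, and every feature found in a chunk is shifted with `to_contig_coords` before the per-chunk GFFs are merged.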

MAKER
▸ This analysis workflow is well suited to massive parallelization (each contig can in theory be processed simultaneously), but implementation is a bit tricky
▸ MAKER2 allows parallelization using MPI, but has problems utilizing NFS (Network File System) data storage, making it hard to run on HPC
▸ It’s also frustrating to install, with lots of dependencies:
  ▸ BioPerl
  ▸ RepeatMasker
  ▸ exonerate
  ▸ SNAP
  ▸ Augustus
  ▸ OpenMPI
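Before attempting an install, it can help to check which of the dependencies listed above are already on a machine. The Python sketch below is one way to do that. The executable names (`RepeatMasker`, `exonerate`, `snap`, `augustus`, `mpiexec`) are assumptions about how these tools are typically installed and may differ on your system; BioPerl is probed by asking Perl to load the `Bio::SeqIO` module.

```python
import shutil
import subprocess

# Assumed executable names; adjust to match the local installation.
REQUIRED_EXECUTABLES = ["RepeatMasker", "exonerate", "snap", "augustus", "mpiexec"]


def missing_executables(names, which=shutil.which):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def has_bioperl(run=subprocess.run):
    """Return True if Perl can load Bio::SeqIO (a core BioPerl module)."""
    try:
        result = run(
            ["perl", "-MBio::SeqIO", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # Perl itself is not installed
    return result.returncode == 0


if __name__ == "__main__":
    print("Missing executables:", missing_executables(REQUIRED_EXECUTABLES))
    print("BioPerl available:", has_bioperl())
```

The `which` and `run` parameters exist only so the checks can be exercised without the real tools installed.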

WQ-MAKER
▸ Work Queue is a framework for building large master-worker applications that span thousands of machines drawn from clusters, clouds, and grids
▸ Thrasher et al. (2012) implemented MAKER in the Work Queue framework (wq-maker)
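The master-worker pattern itself can be illustrated with Python's standard library alone. The sketch below is not the Work Queue API: it uses threads on one machine in place of remote workers, and `run_master_worker` and the per-contig task are hypothetical names. The master puts one task per contig on a queue, workers pull tasks until they receive a stop signal, and results are collected back on the master.

```python
import queue
import threading


def run_master_worker(tasks, work_fn, n_workers=4):
    """Distribute {name: payload} tasks to workers; return {name: result}."""
    task_queue = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = task_queue.get()
            if item is None:  # sentinel: no more work for this worker
                return
            name, payload = item
            output = work_fn(payload)
            with results_lock:
                results[name] = output

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for item in tasks.items():
        task_queue.put(item)
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


def gc_count(sequence):
    """Stand-in for real per-contig work such as annotation."""
    return sequence.count("G") + sequence.count("C")


contigs = {"contig_1": "ACGT", "contig_2": "GGCC"}
summary = run_master_worker(contigs, gc_count, n_workers=2)
```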

WQ-MAKER ON AWS
▸ Workflow on AWS:
  ▸ Configure and install Work Queue, MAKER, and dependencies
  ▸ Save your environment as an AMI (Amazon Machine Image)
  ▸ Spin up as many workers with that same AMI as you’d like
  ▸ Run wq-maker on the master node and send off jobs to the workers
  ▸ Results are written to the master node
▸ While we were working on this, we saw a presentation by the CyVerse folks showing a similar implementation in Jetstream (XSEDE’s “cloud”) and tested that too after requesting an allocation
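The "spin up workers from the same AMI" step of the AWS workflow can be sketched as follows. This is an assumption about how one might script it, not the authors' setup: the function names are hypothetical, the EC2 client is passed in (for example `boto3.client("ec2")`), and the start-up script assumes that `work_queue_worker` is on the image's PATH and is started with the master's host and port.

```python
def worker_user_data(master_host, port):
    """Build a start-up script that connects a worker to the master."""
    return "#!/bin/bash\nwork_queue_worker {} {}\n".format(master_host, port)


def launch_workers(ec2_client, ami_id, instance_type, count, master_host, port):
    """Start `count` identical workers from one AMI; return their instance IDs.

    `ec2_client` is expected to behave like a boto3 EC2 client.
    """
    response = ec2_client.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        UserData=worker_user_data(master_host, port),
    )
    return [instance["InstanceId"] for instance in response["Instances"]]
```

With boto3 installed and credentials configured, a call would look like `launch_workers(boto3.client("ec2"), ami_id, instance_type, 10, master_host, port)`, where the AMI ID, instance type, host, and port are values from your own account and master node.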

CYVERSE WQ-MAKER IMPLEMENTATION
▸ Jobs are sent off with an Ansible playbook
▸ Ansible is an open source automation platform
▸ Under active development and the developers are very friendly!
▸ https://wiki.cyverse.org/wiki/display/TUT/MAKER+2.31.8+with+CCTOOLS+Jetstream+Tutorial

WQ-MAKER ON AWS
▸ Hints:
  ▸ Workers must write to “instance store” (“ephemeral”) storage
  ▸ The Augustus config directory with gene models must be in the working directory
  ▸ wq-maker relies on MAKER 2.31.8
  ▸ The Augustus version that you can install through MAKER is not the most recent version; install Augustus separately to get all the gene models

DOWNSIDES TO THESE IMPLEMENTATIONS
▸ The user relies on developers to keep AMIs and/or Jetstream instances updated
▸ Dependencies are complex and very specific
▸ Pipeline is not modular: it runs from start to finish
▸ MAKER was written for a specific use case and has been built out to accomplish more with legacy code

USING WORKFLOW ENGINES TO BUILD OUR OWN PIPELINE
▸ Toil: written in Python, uses Common Workflow Language (CWL) definitions
▸ First Toil paper: analyzed 20,000 RNAseq samples on 32,000 cores in 4 days (Vivian et al. 2016 bioRxiv doi: 10.1101/062497)
▸ Why we like Toil:
  ▸ can also be used on HPC, local workstation, or laptop
  ▸ fault tolerant
  ▸ easy to install: can provide virtual machines or containerize
  ▸ steps are modular because it uses CWL
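To illustrate what modular, fault-tolerant steps buy us, here is a small sketch in plain Python. It does not use Toil or CWL; `run_pipeline` and the checkpoint layout are assumptions made for illustration. Each named step saves its result, so a rerun after a failure skips the steps that already finished instead of starting from the beginning.

```python
import json
from pathlib import Path


def run_pipeline(steps, data, checkpoint_dir):
    """Run (name, function) steps in order, reusing any saved results.

    Step outputs must be JSON-serializable in this sketch.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    for name, step in steps:
        checkpoint = checkpoint_dir / (name + ".json")
        if checkpoint.exists():
            # This step finished in an earlier run; load its output.
            data = json.loads(checkpoint.read_text())
        else:
            data = step(data)
            checkpoint.write_text(json.dumps(data))
    return data
```

Because each step is an independent function with a saved output, a single step (for example repeat masking or gene prediction) can be swapped out or rerun without touching the others.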

ACKNOWLEDGMENTS
▸ Intel: Ketan Paranjape, Alice Borrelli, Claudine Conway
▸ Smithsonian: Lesli Creedon, John Kress, Rob Fleischer, Sylvain Korzennik, DJ Ding
▸ Technical help: Upendra Kumar Devisetty, Mark Corwin
