Galaxy CME Class

Galaxy: Data Intensive Biology for Everyone Stephen Turner Bioinformatics Core
Director [email protected] bit.ly/uva-galaxy

Contact Me: Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog
Galaxy: Web: http://galaxyproject.org (Mailing list, wiki, screencasts, etc). Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy

Some slides adapted with permission from: The Galaxy Team (Dave
Clements) galaxyproject.org/wiki/ Slides available at: bit.ly/uva-galaxy

Bioinformatics Core Mission: help scientists publish their work and obtain
new funding through service and training. Service / data analysis must be: • Transparent • Reproducible • Accessible

As science becomes increasingly dependent on computation: How best to
ensure that analyses are reproducible? How can methods best be made accessible to scientists? How to facilitate transparent communication of analyses?

A crisis in genomics research: reproducibility

Key reproducibility problems • Datasets: not all available, difficult to
access • Tools: inaccessible, poor version control, difficult to record details of workflow • Publication: results, data, methods, separate. Data isn't in the papers anymore.

Microarray reproducibility • 18 Nat. Genet. microarray experiments • Less
than 50% reproducible • Problems: ◦ Missing data (38%) ◦ Missing software/hardware details (50%) ◦ Missing method/processing details (66%) Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009)

What about next-gen sequencing? Next Generation Genomics: World Map of
High-throughput Sequencers Nick Loman, James Hadfield omicsmaps.com

NGS (ir)reproducibility Variant Calling (common NGS application) Procedure: sequence genomic
DNA, compare to reference to catalog SNPs, SVs, etc. Workflow: remove PCR duplicates, recalibrate quality scores, genotype calling, refining calls, annotating variants. (1000 Genomes) Software: Picard, SAMtools, GATK, etc.

NGS (ir)reproducibility • 299 articles published in 2011 citing the
1000 Genomes project pilot publication • 19 were NGS studies with similar design • Only 10 used tools recommended by 1000G. • Only 4 used full 1000G workflow (realignment & quality score recalibration) Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).

NGS (ir)reproducibility Most straightforward part of analysis: alignment. Survey of 50 papers
using BWA: • 31 provide neither the software version, the parameters, nor the version of the genomic reference. • Of remaining 19: ◦ 4 provide settings ◦ 8 list version information ◦ Only 7 provide all necessary details. • In 2 cases, authors provided links to their own website where the primary data were deposited. In both cases, the links were broken. Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).

If these studies were representative: Then most results reported in
today's publications using NGS data cannot be accurately verified, reproduced, adapted, or used to educate others. This creates an alarming reproducibility crisis. Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).

Galaxy Interface: A web-based genomic analysis toolkit & workflow management
system Type "usegalaxy.org" in your browser

Galaxy Interface: A web-based genomic analysis toolkit & workflow management
system Tools Display data and tool dialog History

Galaxy Interface: Integrating command-line tools into a web-based platform • Nearly
any tool that can be run at the command line can be described as a Galaxy tool. • Tools are described with inputs and outputs. E.g. Tophat: ◦ input = sequence reads (FASTQ) ◦ output = alignments (BAM) • Hundreds of tools currently available. • Easily extensible with an XML description of the interface and how to generate a command line.
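The idea of describing a tool by its inputs, outputs, and a command template can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the tool name `example_aligner`, and its flags are hypothetical and are not Galaxy's actual XML schema or any real program's options.

```python
import shlex
from string import Template

# Hypothetical, simplified tool description (NOT Galaxy's real XML format).
TOOL = {
    "name": "example_aligner",                 # made-up tool name
    "inputs": {"reads": "fastq", "index": "path"},
    "outputs": {"alignments": "bam"},
    # Made-up command template; $names are filled from the user's choices.
    "command": Template("example_aligner --index $index --out $alignments $reads"),
}

def build_command(tool, values):
    """Fill the command template with user-selected values and return argv."""
    required = set(tool["inputs"]) | set(tool["outputs"])
    missing = required - set(values)
    if missing:
        raise ValueError(f"missing values for: {sorted(missing)}")
    # Quote each value so file names with spaces survive the split below.
    quoted = {name: shlex.quote(str(value)) for name, value in values.items()}
    return shlex.split(tool["command"].substitute(quoted))

argv = build_command(
    TOOL,
    {"reads": "sample.fastq", "index": "hg19", "alignments": "out.bam"},
)
print(argv)
```

The point of the sketch is the separation of concerns: the description declares what the tool needs and produces, and a generic function turns the user's selections into a command line.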

Tools: Hundreds built-in; thousands available at toolshed.g2.bx.psu.edu

Galaxy Interface: Consistent interface for integrated tools Read Mapping ChIP-seq
Peak Calling

Galaxy Interface: Consistent interface for integrated tools Read Mapping Select
an index (upload own or use built-in) Select a reference genome Select sequencing configuration Select input dataset (from history) Use common settings or change them? Run the tool

Galaxy Interface: Automatically tracks every step of every analysis Tools
Display data and tool dialog History • History system facilitates and tracks multi-step analyses • Exact parameters of a step can always be inspected, and easily rerun

Galaxy histories "View details": inspect which parameters, software version, &
input data used.

Galaxy histories "Run this job again": run with new parameters
or on new input data

Galaxy histories Supply user generated metadata and annotation on any
analysis step: Tags Free text annotation

Galaxy workflows

Workflows can be created from scratch or extracted from existing
analysis histories. Workflows facilitate sharing and reuse, and provide precise reproducibility of an arbitrarily complex analysis.

Transparency: sharing & publishing

Transparency: sharing & publishing • All analysis components (datasets, histories,
workflows) can be shared among Galaxy users and published. • Published pages and annotation allow analyses to be augmented with textual content and provided in the form of an integrated document.

Transparency: sharing & publishing https://main.g2.bx.psu.edu/u/aun1/p/windshield-splatter

Data Visualization Send data to external genome browsers: UCSC, Ensembl,
IGV, etc. Trackster: Galaxy's built-in genome browser

Trackster View data from within Galaxy (no data transfer to
external site) Supports common filetypes: BAM, BED, WIG

Just another genome browser? From static browsing to visual analysis.

Just another genome browser? Visual feedback and experimentation needed for
complex tools with many parameters.

http://usegalaxy.org (a.k.a. "Galaxy Main") • Free public website • No
registration required • Anyone can use it • Hundreds of tools • >24,000 registered users • >300 TB user data • >140,000 jobs/month Hundreds of tools, but not all: thousands are available, but implementation is not trivial. Disk storage and compute time are not infinite. A central solution is not scalable.

http://getgalaxy.org • Galaxy is open-source • Designed for local installation/customization
• Easy to deploy/manage: ◦ $ hg clone https://bitbucket.org/galaxy/galaxy-dist/ ◦ $ sh run.sh ◦ Point browser to http://localhost:8080 • Requires existing computational resources ◦ Large server (bioconnector.virginia.edu) ◦ Compute cluster • ...Or on the cloud

Local Galaxy Installation: bioconnector.virginia.edu • Partnership between: ◦ Bioinformatics core ◦ Health Sciences Library
◦ Division of Clinical Informatics • Mission: get researchers connected to the tools and people they need. • bioconnector.virginia.edu • Tools: ◦ Galaxy server ◦ VIVO (collaboration) ◦ CDR/MUSIC ◦ Awesome space

Galaxy CloudMan http://usegalaxy.org/cloud • Start your own fully configured and
populated (tools + data) Galaxy instance • ~Infinitely scalable (pay on-demand) • Someone else manages the data center

Step-by-step instructions for AWS: http://usegalaxy.org/cloud

Instant CloudMan Launch CloudMan instance from Galaxy Main, or transfer
your current history.

Live Demo galaxyproject.org/wiki/Learn "Hey [Mom/Dad], which coding exon has the
highest number of SNPs on chromosome 22?" Simple question. You know where to find the data. But how do you answer quickly? http://usegalaxy.org

Get data from UCSC Click "Get data, UCSC Main". Set
position to "chr22". Output format: "BED". Check the "Send output to Galaxy" box. Hit "get output".

Get data from UCSC At this screen, select one BED
record per Coding Exons

Get data from UCSC At this screen, the history pane
will go from gray (preparing), to yellow (running) to green (done):

Get SNPs Same as last time, but set group to
"Variation and Repeats" to select table "snp131".

History management Rename the two history items to "Exons" and
"SNPs" by clicking pencil. Rename history to "Galaxy 101 Demo"

Join exons with SNPs Recall: task is to find exons
containing the most SNPs. First step: join exons with SNPs (print exons and SNPs that overlap side-by-side). Use the "Operate on genomic intervals --> Join" tool. Select exons first, SNPs second.

Join exons with SNPs Once you do this, you'll see a third history item containing the following data (columns 1-6: data for exons; columns 7-12: data for SNPs):
chr22 16258185 16258303 uc002zlh.1_cds_1_0_chr22_16258186_r 0 - chr22 16258278 16258279 rs2845178 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267011 16267012 rs7290262 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16266963 16266964 rs10154680 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267037 16267038 rs2818572 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267031 16267032 rs7292200 0 +
Note that the exon with ID uc002zlh.1_cds_2_0_chr22_16266929_r contains four SNPs with IDs rs7290262, rs10154680, rs2818572, and rs7292200.
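What the join step does conceptually can be sketched in Python using the exon and SNP records from this slide. This is a naive illustration, not Galaxy's implementation: it assumes BED-style half-open coordinates and treats any overlap of at least one base as a match, and the function name `join_intervals` is made up for this sketch.

```python
# Records copied from the slide: (chrom, start, end, name, score, strand).
exons = [
    ("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r", 0, "-"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r", 0, "-"),
]
snps = [
    ("chr22", 16258278, 16258279, "rs2845178", 0, "+"),
    ("chr22", 16267011, 16267012, "rs7290262", 0, "+"),
    ("chr22", 16266963, 16266964, "rs10154680", 0, "+"),
    ("chr22", 16267037, 16267038, "rs2818572", 0, "+"),
    ("chr22", 16267031, 16267032, "rs7292200", 0, "+"),
]

def join_intervals(first, second):
    """Return first+second rows, side by side, for every overlapping pair.

    Assumes half-open [start, end) intervals; O(n*m), fine for a demo only.
    """
    joined = []
    for a in first:
        for b in second:
            same_chrom = a[0] == b[0]
            overlaps = a[1] < b[2] and b[1] < a[2]
            if same_chrom and overlaps:
                joined.append(a + b)  # 6 exon columns followed by 6 SNP columns
    return joined

joined = join_intervals(exons, snps)
for row in joined:
    print("\t".join(str(field) for field in row))
```

Each output row has twelve columns, exon first and SNP second, which mirrors selecting exons first and SNPs second in the tool.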

Join exons with SNPs We can easily compute the number
of SNPs per exon by counting the number of repetitions of name for each exon. This can be easily done with the "Join, Subtract, and Group -> Group" tool. Choose column 4 by selecting "c4" in Group by column.

Counting SNPs per exon Then click on Add new Operation
and make sure the interface looks exactly as shown below: Then Execute. Your history now looks like this: The result of grouping (dataset #4) contains two columns. The first contains the exon name while the second shows the number of times this name has been repeated in dataset #3.
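The grouping step amounts to counting how often each exon name repeats in column 4 of the joined dataset. A minimal Python sketch, assuming the five joined rows from the earlier excerpt and keeping only the four exon columns needed here:

```python
from collections import Counter

# Exon part (columns 1-4) of the five joined rows in the excerpt;
# one row per overlapping SNP.
joined_rows = [
    ("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
]

NAME_COLUMN = 3  # "c4" in the tool is the fourth column, index 3 in Python

# Group by exon name and count the rows in each group.
snps_per_exon = Counter(row[NAME_COLUMN] for row in joined_rows)

for exon_name, count in snps_per_exon.items():
    print(exon_name, count, sep="\t")
```

The result has the same two-column shape as dataset #4: exon name, then the number of times it appeared.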

Sorting exons by SNP count To see which exon has
the highest number of SNPs, sort dataset #4 on the second column in descending order. This is done with "Filter and Sort -> Sort": This generates a 5th history item. The highest number of SNPs is 67.

Select the top five Select the top five with "Text
manipulation -> Select First" tool Executing this creates a sixth history item with only 5 lines:
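Sorting on the count column in descending order and keeping the first five lines can be sketched as follows. The exon names and counts are placeholders invented for illustration; they are not the values from the chr22 analysis.

```python
# Hypothetical (exon_name, snp_count) rows standing in for dataset #4.
counts = [
    ("exon_a", 3),
    ("exon_b", 7),
    ("exon_c", 1),
    ("exon_d", 9),
    ("exon_e", 4),
    ("exon_f", 6),
    ("exon_g", 2),
]

# Sort on the second column, largest first (the "Sort" step).
sorted_counts = sorted(counts, key=lambda row: row[1], reverse=True)

# Keep only the first five lines (the "Select First" step).
top_five = sorted_counts[:5]

for exon_name, count in top_five:
    print(exon_name, count, sep="\t")
```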

Recover exon info Now we know that in this dataset
the five top exons contain between 41 and 67 SNPs. To know more we need to get back the positional information of these exons. This information was lost at the grouping step and now all we have is just two columns. To get coordinates back we will match the names of exons in dataset #6 (column 1) against names of the exons in the original dataset #1 (column 4). This can be done with "Join, Subtract and Group -> Compare two Queries" tool (note the settings of the tool in the middle pane). This creates a seventh history item.
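Recovering the coordinates is a matter of matching names between the two datasets. The sketch below assumes the two exon records from the earlier excerpt as the original dataset and a one-row stand-in for the top list; it illustrates the matching idea only and is not how the Galaxy tool is implemented.

```python
# Original exon records (dataset #1 style): chrom, start, end, name, score, strand.
exons = [
    ("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r", 0, "-"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r", 0, "-"),
]

# Stand-in for the top list (dataset #6 style): name, SNP count.
# The count of 4 comes from the excerpt, not from the full chr22 result.
top_exons = [
    ("uc002zlh.1_cds_2_0_chr22_16266929_r", 4),
]

# Names in column 1 of the top list, matched against column 4 of the exons.
wanted_names = {name for name, _count in top_exons}
recovered = [row for row in exons if row[3] in wanted_names]

for row in recovered:
    print("\t".join(str(field) for field in row))
```

Using a set for the names keeps the lookup fast even when the original exon dataset is large.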

Display in genome browser

Extract workflow from history

Homework: RNA-seq http://bit.ly/galaxy-rnaseq • Get some data (Illumina BodyMap) •
QC / trim your reads • Map to hg19 with tophat • Visualize where reads map • Assemble with cufflinks • Differential expression with cuffdiff

Resources • Twitter: @galaxyproject • Tutorials: http://galaxyproject.org/wiki/Learn • Mailing list:
http://user.list.galaxyproject.org • Biostar ◦ Web: http://biostars.org/ ◦ Twitter: @BioStarQuestion • SEQanswers: ◦ Web: http://seqanswers.com/ ◦ Twitter: @SEQquestions

Contact Me: Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog
Galaxy: Web: http://galaxyproject.org (Mailing list, wiki, screencasts, etc). Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy Evaluation forms in the back. Please fill these out!
