access • Tools: inaccessible, poor version control, difficult to record details of workflow • Publication: results, data, methods, separate. Data isn't in the papers anymore.
1000 Genomes project pilot publication • 19 were NGS studies with similar design • Only 10 used tools recommended by 1000G • Only 4 used the full 1000G workflow (realignment & quality score recalibration). Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).
using BWA: • 31 provide neither the software version, parameters, nor version of genomic reference. • Of remaining 19: ◦ 4 provide settings ◦ 8 list version information ◦ Only 7 provide all necessary details. • In 2 cases, authors provided links to their own website where the primary data were deposited. In both cases, the links were broken. NGS (ir)reproducibility Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).
today's publications using NGS data cannot be accurately verified, reproduced, adapted, or used to educate others. This creates an alarming reproducibility crisis. Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).
any tool that can be run at the command line can be described as a Galaxy tool. • Tools are described by their inputs and outputs. E.g. Tophat: ◦ input = sequence reads (FASTQ) ◦ output = alignments (BAM) • Hundreds of tools currently available. • Easily extensible with an XML description of the interface and of how to generate a command line.
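The idea of "inputs, outputs, and a recipe for the command line" can be sketched in a few lines of Python. This is only an analogy: real Galaxy tools are described in an XML file, not in Python, and the `ToolDescription` class, the `example_aligner` program, and its `--reads` / `--out` flags are all hypothetical names made up for illustration.

```python
import shlex
from dataclasses import dataclass
from string import Template


@dataclass
class ToolDescription:
    """Hypothetical stand-in for a tool description: named inputs and
    outputs plus a template that is filled in to build a command line."""
    name: str
    inputs: dict       # parameter name -> expected format, e.g. "fastq"
    outputs: dict      # parameter name -> produced format, e.g. "bam"
    command: Template  # command-line template with $placeholders

    def build_command(self, **values) -> str:
        # Every declared input and output must be given a value.
        missing = (set(self.inputs) | set(self.outputs)) - set(values)
        if missing:
            raise ValueError(f"missing values for: {sorted(missing)}")
        # Quote each value so file names with spaces stay one argument.
        quoted = {key: shlex.quote(str(val)) for key, val in values.items()}
        return self.command.substitute(quoted)


# Made-up aligner: reads in (FASTQ), alignments out (BAM).
aligner = ToolDescription(
    name="example_aligner",
    inputs={"reads": "fastq"},
    outputs={"alignments": "bam"},
    command=Template("example_aligner --reads $reads --out $alignments"),
)

cmd = aligner.build_command(reads="sample.fastq", alignments="sample.bam")
print(cmd)
```

The point of the design is that the interface (what the user selects) and the command line (what actually runs) are both derived from one description, so the exact command behind each step can be recorded.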
an index (upload own or use built-in) Select a reference genome Select sequencing configuration Select input dataset (from history) Use common settings or change them? Run the tool
Display data and tool dialog History • History system facilitates and tracks multi-step analyses • Exact parameters of a step can always be inspected, and easily rerun
workflows) can be shared among Galaxy users and published. • Published pages and annotation allow analyses to be augmented with textual content and provided in the form of an integrated document.
registration required • Anyone can use it • Hundreds of tools • >24,000 registered users • >300 TB user data • >140,000 jobs/month But not all. 1000s available but implementation is not trivial. Disk storage and compute time are not infinite. Central solution not scalable.
• Easy to deploy/manage: ◦ $ hg clone https://bitbucket.org/galaxy/galaxy-dist/ ◦ $ sh run.sh ◦ Point browser to http://localhost:8080 • Requires existing computational resources ◦ Large server (bioconnector.virginia.edu) ◦ Compute cluster • ...Or on the cloud
◦ Division of Clinical Informatics • Mission: get researchers connected to the tools and people they need. • bioconnector.virginia.edu • Tools: ◦ Galaxy server ◦ VIVO (collaboration) ◦ CDR/MUSIC ◦ Awesome space Local Galaxy Installation: bioconnector.virginia.edu
containing the most SNPs. First step: join exons with SNPs (print exons and SNPs that overlap side-by-side). Use the "Operate on genomic intervals --> Join" tool. Select exons first, SNPs second.
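For readers who want to see what the join step does conceptually, here is a minimal Python sketch. It is not how Galaxy implements the tool; it assumes BED-style half-open coordinates (start inclusive, end exclusive), and the exon and SNP records are toy values invented for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED-style assumption)
    end: int    # exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def join_overlapping(first, second):
    """Pair every interval in `first` with each overlapping one in `second`."""
    return [(a, b) for a in first for b in second if overlaps(a, b)]


# Toy data, not taken from the tutorial datasets.
exons = [
    Interval("chr22", 100, 200, "exonA"),
    Interval("chr22", 300, 400, "exonB"),
]
snps = [
    Interval("chr22", 150, 151, "rs1"),
    Interval("chr22", 199, 200, "rs2"),
    Interval("chr22", 350, 351, "rs3"),
    Interval("chr22", 500, 501, "rs4"),  # overlaps no exon, so it is dropped
]

# Exons first, SNPs second, so each output row is exon columns then SNP columns.
joined = join_overlapping(exons, snps)
for exon, snp in joined:
    print(*exon, *snp, sep="\t")
```

An exon appears once per overlapping SNP, which is why counting repeated exon names in the next step gives the number of SNPs per exon. The nested loop is quadratic and only suitable for small examples.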
of SNPs per exon by counting the number of repetitions of name for each exon. This can be easily done with the "Join, Subtract, and Group -> Group" tool. Choose column 4 by selecting "c4" in Group by column.
and make sure the interface looks exactly as shown below: Then Execute. Your history now looks like this: The result of grouping (dataset #4) contains two columns. The first contains the exon name while the second shows the number of times this name has been repeated in dataset #3.
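The grouping step amounts to counting how often each value of column 4 occurs. A rough Python equivalent, using toy joined rows rather than the real dataset #3, might look like this (the `count_by_column` helper is a hypothetical name):

```python
from collections import Counter


def count_by_column(rows, column):
    """Count occurrences of each value in a 1-based column, like grouping on "c4"."""
    return Counter(row[column - 1] for row in rows)


# Toy joined rows: exon columns (1-4) followed by SNP columns (5-8).
joined_rows = [
    ["chr22", "100", "200", "exonA", "chr22", "150", "151", "rs1"],
    ["chr22", "100", "200", "exonA", "chr22", "199", "200", "rs2"],
    ["chr22", "300", "400", "exonB", "chr22", "350", "351", "rs3"],
]

counts = count_by_column(joined_rows, 4)

# Two columns out: exon name and number of SNPs that overlapped it.
for name, n_snps in counts.items():
    print(name, n_snps, sep="\t")
```

Note that only the grouped column and the count survive, which is why the positional information has to be recovered later.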
the highest number of SNPs, sort dataset #4 on the second column in descending order. This is done with "Filter and Sort -> Sort": This generates a 5th history item. The highest number of SNPs is 67.
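As a sketch of the sort step, sorting the two-column result on its second column in descending order could be written as follows. The exon names and counts are toy values, not the tutorial's results.

```python
# Toy grouped output: (exon name, SNP count).
grouped = [("exonA", 3), ("exonB", 7), ("exonC", 5)]

# Sort on the second column, largest count first.
ranked = sorted(grouped, key=lambda row: row[1], reverse=True)

# The first row now holds the exon with the most SNPs.
top_name, top_count = ranked[0]
print(top_name, top_count, sep="\t")
```

The count column must be compared as a number; sorting the same values as text would place "10" before "9".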
the five top exons contain between 41 and 67 SNPs. To know more we need to get back the positional information of these exons. This information was lost at the grouping step and now all we have is just two columns. To get coordinates back we will match the names of exons in dataset #6 (column 1) against names of the exons in the original dataset #1 (column 4). This can be done with "Join, Subtract and Group -> Compare two Queries" tool (note the settings of the tool in the middle pane). This creates a seventh history item.
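The comparison step keeps the rows of the original exon dataset whose name (column 4) appears in column 1 of the short top-exon list. A minimal Python sketch under that reading, with toy data and a hypothetical `keep_matching` helper:

```python
def keep_matching(original_rows, key_rows, original_col, key_col):
    """Keep rows of `original_rows` whose 1-based `original_col` value
    appears in the 1-based `key_col` of `key_rows`."""
    wanted = {row[key_col - 1] for row in key_rows}
    return [row for row in original_rows if row[original_col - 1] in wanted]


# Toy stand-in for the original exon dataset: chrom, start, end, name.
exons = [
    ["chr22", "100", "200", "exonA"],
    ["chr22", "300", "400", "exonB"],
    ["chr22", "600", "700", "exonC"],
]

# Toy stand-in for the top-exon list: name, SNP count.
top_exons = [
    ["exonB", "7"],
    ["exonC", "5"],
]

# Match column 4 of the exons against column 1 of the top list.
with_coordinates = keep_matching(exons, top_exons, original_col=4, key_col=1)
for row in with_coordinates:
    print(*row, sep="\t")
```

The output rows come from the original dataset, so the coordinates lost at the grouping step are restored; the counts themselves are not carried over in this sketch.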
Galaxy: Web: http://galaxyproject.org (Mailing list, wiki, screencasts, etc). Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy Evaluation forms in the back. Please fill these out!