Slide 1

Slide 1 text

Galaxy: Data Intensive Biology for Everyone Stephen Turner Bioinformatics Core Director [email protected] bit.ly/uva-galaxy

Slide 2

Slide 2 text

Contact Me: Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog Galaxy: Web: http://galaxyproject.org (mailing list, wiki, screencasts, etc.) Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy

Slide 3

Slide 3 text

Some slides adapted with permission from: The Galaxy Team (Dave Clements) galaxyproject.org/wiki/

Slide 4

Slide 4 text

Bioinformatics Core Mission: help scientists publish their work and obtain new funding through service and training. Service / data analysis must be: ● Transparent ● Reproducible ● Accessible

Slide 5

Slide 5 text

As science becomes increasingly dependent on computation: How can we best ensure that analyses are reproducible? How can methods best be made accessible to scientists? How can we facilitate transparent communication of analyses?

Slide 6

Slide 6 text

A crisis in genomics research: reproducibility

Slide 7

Slide 7 text

Key reproducibility problems ● Datasets: not all available, difficult to access ● Tools: inaccessible, poor version control, difficult to record details of workflow ● Publication: results, data, and methods are kept separate. Data isn't in the papers anymore.

Slide 8

Slide 8 text

Microarray reproducibility ● 18 Nat. Genet. microarray experiments ● Less than 50% reproducible ● Problems: ○ Missing data (38%) ○ Missing software/hardware details (50%) ○ Missing method/processing details (66%) Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009)

Slide 9

Slide 9 text

What about next-gen sequencing? Next Generation Genomics: World Map of High-throughput Sequencers Nick Loman, James Hadfield omicsmaps.com

Slide 10

Slide 10 text

NGS (ir)reproducibility Variant Calling (common NGS application) Procedure: sequence genomic DNA, compare to reference to catalog SNPs, SVs, etc. Workflow: remove PCR duplicates, recalibrate quality scores, genotype calling, refining calls, annotating variants. (1000 Genomes) Software: Picard, SAMtools, GATK, etc.

Slide 11

Slide 11 text

NGS (ir)reproducibility ● 299 articles published in 2011 citing the 1000 Genomes project pilot publication ● 19 were NGS studies with similar design ● Only 10 used tools recommended by 1000G. ● Only 4 used full 1000G workflow (realignment & quality score recalibration) Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).

Slide 12

Slide 12 text

NGS (ir)reproducibility Most straightforward part of analysis: alignment. Survey of 50 papers using BWA: ● 31 provide none of the software version, the parameters, or the version of the genomic reference. ● Of the remaining 19: ○ 4 provide settings ○ 8 list version information ○ Only 7 provide all necessary details. ● In 2 cases, authors provided links to their own website where the primary data were deposited. In both cases, the links were broken. Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).

Slide 13

Slide 13 text

If these studies were representative: Then most results reported in today's publications using NGS data cannot be accurately verified, reproduced, adapted, or used to educate others. This creates an alarming reproducibility crisis. Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).

Slide 14

Slide 14 text

Galaxy Interface: A web-based genomic analysis toolkit & workflow management system Type "usegalaxy.org" in your browser

Slide 15

Slide 15 text

Galaxy Interface: A web-based genomic analysis toolkit & workflow management system Tools Display data and tool dialog History

Slide 16

Slide 16 text

Galaxy Interface: Integrating command-line tools into a web-based platform ● Nearly any tool that can be run at the command line can be described as a Galaxy tool. ● Tools are described by their inputs and outputs. E.g. Tophat: ○ input = sequence reads (FASTQ) ○ output = alignments (BAM) ● Hundreds of tools currently available. ● Easily extensible with an XML description of the interface and of how to generate a command line.
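The idea of describing a tool by its inputs, outputs, and a command-line recipe can be sketched in a few lines of Python. This is only an illustration of the concept, not Galaxy's actual XML schema or code: the ToolDescription class, its field names, and the simplified TopHat command template are all assumptions made for this example.

```python
import shlex
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToolDescription:
    """Hypothetical, simplified stand-in for a tool description."""
    name: str
    inputs: Dict[str, str]     # parameter name -> expected kind of value
    outputs: Dict[str, str]    # output name -> format
    command_template: str      # placeholders are filled from the inputs

    def build_command(self, values: Dict[str, str]) -> str:
        """Fill the template, refusing to run with missing inputs."""
        missing = set(self.inputs) - set(values)
        if missing:
            raise ValueError(f"missing inputs: {sorted(missing)}")
        # Quote every value so odd file names cannot break the shell command.
        quoted = {key: shlex.quote(value) for key, value in values.items()}
        return self.command_template.format(**quoted)


# Assumed, simplified TopHat invocation: output directory, index, reads.
tophat = ToolDescription(
    name="tophat",
    inputs={"out_dir": "directory", "index": "bowtie index", "reads": "fastq"},
    outputs={"alignments": "bam"},
    command_template="tophat -o {out_dir} {index} {reads}",
)

cmd = tophat.build_command(
    {"out_dir": "tophat_out", "index": "hg19", "reads": "sample.fastq"}
)
print(cmd)
```

Because the description is plain data, the same record can drive both the web form (one field per input) and the generated command, which is what keeps the interface consistent across tools.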

Slide 17

Slide 17 text

Tools: Hundreds built-in; thousands available at toolshed.g2.bx.psu.edu

Slide 18

Slide 18 text

Galaxy Interface: Consistent interface for integrated tools Read Mapping ChIP-seq Peak Calling

Slide 19

Slide 19 text

Galaxy Interface: Consistent interface for integrated tools Read Mapping Select an index (upload own or use built-in) Select a reference genome Select sequencing configuration Select input dataset (from history) Use common settings or change them? Run the tool

Slide 20

Slide 20 text

Galaxy Interface: Automatically tracks every step of every analysis Tools Display data and tool dialog History ● History system facilitates and tracks multi-step analyses ● Exact parameters of a step can always be inspected, and easily rerun

Slide 21

Slide 21 text

Galaxy histories "View details": inspect which parameters, software version, & input data used.

Slide 22

Slide 22 text

Galaxy histories "Run this job again": run with new parameters or on new input data

Slide 23

Slide 23 text

Galaxy histories Supply user generated metadata and annotation on any analysis step: Tags Free text annotation

Slide 24

Slide 24 text

Galaxy workflows

Slide 25

Slide 25 text

Galaxy workflows

Slide 26

Slide 26 text

Workflows can be created from scratch or extracted from existing analysis histories. Workflows facilitate sharing and reuse, and provide precise reproducibility of an arbitrarily complex analysis.

Slide 27

Slide 27 text

Transparency: sharing & publishing

Slide 28

Slide 28 text

Transparency: sharing & publishing ● All analysis components (datasets, histories, workflows) can be shared among Galaxy users and published. ● Published pages and annotation allow analyses to be augmented with textual content and provided in the form of an integrated document.

Slide 29

Slide 29 text

Transparency: sharing & publishing https://main.g2.bx.psu.edu/u/aun1/p/windshield-splatter

Slide 30

Slide 30 text

Data Visualization Send data to external genome browsers: UCSC, Ensembl, IGV, etc. Trackster: Galaxy's built-in genome browser

Slide 31

Slide 31 text

Trackster View data from within Galaxy (no data transfer to external site) Supports common filetypes: BAM, BED, WIG

Slide 32

Slide 32 text

Just another genome browser? From static browsing to visual analysis.

Slide 33

Slide 33 text

Just another genome browser? Visual feedback and experimentation needed for complex tools with many parameters.

Slide 34

Slide 34 text

http://usegalaxy.org (a.k.a. "Galaxy Main") ● Free public website ● No registration required ● Anyone can use it ● Hundreds of tools (but not all: thousands are available, and installing them is not trivial) ● >24,000 registered users ● >300 TB user data ● >140,000 jobs/month Disk storage and compute time are not infinite. A central solution is not scalable.

Slide 35

Slide 35 text

http://getgalaxy.org ● Galaxy is open-source ● Designed for local installation/customization ● Easy to deploy/manage: ○ $ hg clone https://bitbucket.org/galaxy/galaxy-dist/ ○ $ sh run.sh ○ Point browser to http://localhost:8080 ● Requires existing computational resources ○ Large server (bioconnector.virginia.edu) ○ Compute cluster ● ...Or on the cloud

Slide 36

Slide 36 text

Local Galaxy Installation: bioconnector.virginia.edu ● Partnership between: ○ Bioinformatics core ○ Health Sciences Library ○ Division of Clinical Informatics ● Mission: get researchers connected to the tools and people they need. ● bioconnector.virginia.edu ● Tools: ○ Galaxy server ○ VIVO (collaboration) ○ CDR/MUSIC ○ Awesome space

Slide 38

Slide 38 text

Galaxy CloudMan http://usegalaxy.org/cloud ● Start your own fully configured and populated (tools + data) Galaxy instance ● ~Infinitely scalable (pay on-demand) ● Someone else manages the data center

Slide 39

Slide 39 text

Step-by-step instructions for AWS: http://usegalaxy.org/cloud

Slide 40

Slide 40 text

Instant CloudMan Launch CloudMan instance from Galaxy Main, or transfer your current history.

Slide 41

Slide 41 text

Live Demo galaxyproject.org/wiki/Learn "Hey [Mom/Dad], which coding exon has the highest number of SNPs on chromosome 22?" Simple question. You know where to find the data. But how do you answer it quickly? http://usegalaxy.org

Slide 42

Slide 42 text

Get data from UCSC Click "Get data -> UCSC Main". Set position to "chr22". Set output format to "BED". Check the "Send output to Galaxy" box. Click "get output".

Slide 43

Slide 43 text

Get data from UCSC At this screen, select one BED record per "Coding Exons".

Slide 44

Slide 44 text

Get data from UCSC At this screen, the history pane will go from gray (preparing), to yellow (running) to green (done):

Slide 45

Slide 45 text

Get SNPs Same as last time, but set group to "Variation and Repeats" and select table "snp131".

Slide 46

Slide 46 text

History management Rename the two history items to "Exons" and "SNPs" by clicking the pencil icon. Rename the history to "Galaxy 101 Demo".

Slide 47

Slide 47 text

Join exons with SNPs Recall: the task is to find the exons containing the most SNPs. First step: join exons with SNPs (print overlapping exons and SNPs side by side). Use the "Operate on genomic intervals -> Join" tool. Select exons first, SNPs second.

Slide 48

Slide 48 text

Join exons with SNPs Once you do this, you'll see a third history item containing the following data (columns 1-6: data for exons; columns 7-12: data for SNPs):
chr22 16258185 16258303 uc002zlh.1_cds_1_0_chr22_16258186_r 0 - chr22 16258278 16258279 rs2845178 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267011 16267012 rs7290262 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16266963 16266964 rs10154680 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267037 16267038 rs2818572 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267031 16267032 rs7292200 0 +
Note that the exon with ID uc002zlh.1_cds_2_0_chr22_16266929_r contains four SNPs, with IDs rs7290262, rs10154680, rs2818572, and rs7292200.
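For readers who want to see what the join computes, here is a minimal Python sketch using the rows shown on this slide. It assumes BED-style half-open coordinates and uses a simple nested loop. The function name join_intervals is made up for this example, and this is not how Galaxy's Join tool is implemented.

```python
from typing import List, Tuple

# (chrom, start, end, name); coordinates assumed half-open, as in BED.
Interval = Tuple[str, int, int, str]


def join_intervals(
    exons: List[Interval], snps: List[Interval]
) -> List[Tuple[Interval, Interval]]:
    """Return every (exon, snp) pair that overlaps by at least one base."""
    pairs = []
    for exon in exons:
        e_chrom, e_start, e_end, _ = exon
        for snp in snps:
            s_chrom, s_start, s_end, _ = snp
            # Two half-open intervals overlap when each starts before the other ends.
            if e_chrom == s_chrom and s_start < e_end and e_start < s_end:
                pairs.append((exon, snp))
    return pairs


exons = [
    ("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
]
snps = [
    ("chr22", 16258278, 16258279, "rs2845178"),
    ("chr22", 16267011, 16267012, "rs7290262"),
    ("chr22", 16266963, 16266964, "rs10154680"),
    ("chr22", 16267037, 16267038, "rs2818572"),
    ("chr22", 16267031, 16267032, "rs7292200"),
]

joined = join_intervals(exons, snps)
for exon, snp in joined:
    print(exon[3], snp[3])
```

The nested loop is quadratic, which is fine for a handful of rows but far too slow for all of chr22; real interval tools sort or index the intervals first.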

Slide 49

Slide 49 text

Join exons with SNPs We can compute the number of SNPs per exon by counting how many times each exon's name is repeated. This is easily done with the "Join, Subtract, and Group -> Group" tool. Choose column 4 by selecting "c4" in "Group by column".

Slide 50

Slide 50 text

Counting SNPs per exon Then click on "Add new Operation" and make sure the interface looks exactly as shown below. Then click Execute. Your history now looks like this: The result of grouping (dataset #4) contains two columns. The first contains the exon name, while the second shows the number of times this name is repeated in dataset #3.
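The grouping step amounts to counting how often each value of column 4 occurs. A short Python sketch of that idea, using the joined rows from Slide 48 and the standard library's Counter, might look like this. It illustrates the operation only and is not Galaxy's Group tool.

```python
from collections import Counter

# Joined rows from Slide 48, one list of 12 columns per row.
joined_rows = [
    "chr22 16258185 16258303 uc002zlh.1_cds_1_0_chr22_16258186_r 0 - chr22 16258278 16258279 rs2845178 0 +".split(),
    "chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267011 16267012 rs7290262 0 +".split(),
    "chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16266963 16266964 rs10154680 0 +".split(),
    "chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267037 16267038 rs2818572 0 +".split(),
    "chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267031 16267032 rs7292200 0 +".split(),
]

EXON_NAME_COLUMN = 3  # Galaxy's "c4" is the fourth column, index 3 in Python

# Group by exon name and count the rows in each group.
snps_per_exon = Counter(row[EXON_NAME_COLUMN] for row in joined_rows)

for exon_name, snp_count in snps_per_exon.items():
    print(exon_name, snp_count)
```

Each output line has the same two columns as dataset #4: the exon name and the number of joined rows that carry it.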

Slide 51

Slide 51 text

Sorting exons by SNP count To see which exon has the highest number of SNPs, sort dataset #4 on the second column in descending order. This is done with "Filter and Sort -> Sort": This generates a 5th history item. The highest number of SNPs is 67.

Slide 52

Slide 52 text

Select the top five Select the top five with the "Text manipulation -> Select First" tool. Executing this creates a sixth history item with only 5 lines:
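Sorting by count and keeping the first five rows can be sketched in Python as follows. The exon names and most of the counts are hypothetical placeholders invented for this example. Only the largest value (67) and the fifth-largest (41) echo figures quoted in these slides.

```python
# (exon name, SNP count) pairs standing in for dataset #4.
# Names and most counts are made-up placeholders, not real results.
counts = [
    ("exon_a", 12),
    ("exon_b", 67),
    ("exon_c", 41),
    ("exon_d", 3),
    ("exon_e", 55),
    ("exon_f", 48),
    ("exon_g", 60),
    ("exon_h", 9),
]

# "Filter and Sort -> Sort": order by the second column, descending.
ranked = sorted(counts, key=lambda row: row[1], reverse=True)

# "Text manipulation -> Select First": keep only the first five lines.
top_five = ranked[:5]

for exon_name, snp_count in top_five:
    print(exon_name, snp_count)
```

Python's sort is stable, so exons with equal counts keep their original relative order.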

Slide 53

Slide 53 text

Recover exon info Now we know that in this dataset the five top exons contain between 41 and 67 SNPs. To know more we need to get back the positional information of these exons. This information was lost at the grouping step and now all we have is just two columns. To get coordinates back we will match the names of exons in dataset #6 (column 1) against names of the exons in the original dataset #1 (column 4). This can be done with "Join, Subtract and Group -> Compare two Queries" tool (note the settings of the tool in the middle pane). This creates a seventh history item.
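Matching names to recover coordinates is a lookup of one dataset's column 1 in the other dataset's column 4. The sketch below uses the two exons from Slide 48 as a stand-in for dataset #1 and a one-row stand-in for dataset #6. It is an assumption-laden illustration, not the implementation of the "Compare two Queries" tool.

```python
# Stand-in for dataset #1: (chrom, start, end, name) rows from Slide 48.
exons = [
    ("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
]

# Stand-in for dataset #6: (name, SNP count). One row here instead of five.
top_exons = [
    ("uc002zlh.1_cds_2_0_chr22_16266929_r", 4),
]

# Collect the names of interest (column 1 of dataset #6) in a set for fast lookup.
wanted_names = {name for name, _count in top_exons}

# Keep the rows of dataset #1 whose column 4 matches one of those names.
recovered = [row for row in exons if row[3] in wanted_names]

for chrom, start, end, name in recovered:
    print(chrom, start, end, name)
```

The result has full BED-style coordinates again, so it can be sent to a genome browser.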

Slide 54

Slide 54 text

Display in genome browser

Slide 55

Slide 55 text

Extract workflow from history

Slide 56

Slide 56 text

Homework: RNA-seq http://bit.ly/galaxy-rnaseq ● Get some data (Illumina BodyMap) ● QC / trim your reads ● Map to hg19 with tophat ● Visualize where reads map ● Assemble with cufflinks ● Differential expression with cuffdiff

Slide 57

Slide 57 text

Resources ● Twitter: @galaxyproject ● Tutorials: http://galaxyproject.org/wiki/Learn ● Mailing list: http://user.list.galaxyproject.org ● Biostar ○ Web: http://biostars.org/ ○ Twitter: @BioStarQuestion ● SEQanswers: ○ Web: http://seqanswers.com/ ○ Twitter: @SEQquestions

Slide 58

Slide 58 text

Contact Me: Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog Galaxy: Web: http://galaxyproject.org (mailing list, wiki, screencasts, etc.) Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy Evaluation forms are in the back. Please fill these out!