
Galaxy CME Class

Overview and demo of Galaxy for UVA CME class. Slides adapted with permission from the Galaxy team.

Stephen Turner

October 09, 2013

Transcript

  1. Contact Me: Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog

    Galaxy: Web: http://galaxyproject.org (Mailing list, wiki, screencasts, etc). Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy
  2. Some slides adapted with permission from: The Galaxy Team (Dave Clements), galaxyproject.org/wiki/. Slides available at: bit.ly/uva-galaxy
  3. Bioinformatics Core Mission: help scientists publish their work and obtain new funding through service and training. Service / data analysis must be:
    • Transparent
    • Reproducible
    • Accessible
  4. As science becomes increasingly dependent on computation: How best to ensure that analyses are reproducible? How can methods best be made accessible to scientists? How to facilitate transparent communication of analyses?
  5. Key reproducibility problems
    • Datasets: not all available, difficult to access
    • Tools: inaccessible, poor version control, difficult to record details of workflow
    • Publication: results, data, and methods are separate. Data isn't in the papers anymore.
  6. Microarray reproducibility
    • 18 Nat. Genet. microarray experiments
    • Less than 50% reproducible
    • Problems:
      ◦ Missing data (38%)
      ◦ Missing software/hardware details (50%)
      ◦ Missing method/processing details (66%)
    Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009).
  7. What about next-gen sequencing? Next Generation Genomics: World Map of High-throughput Sequencers (Nick Loman, James Hadfield; omicsmaps.com)
  8. NGS (ir)reproducibility: Variant Calling (common NGS application). Procedure: sequence genomic DNA, compare to reference to catalog SNPs, SVs, etc. Workflow: remove PCR duplicates, recalibrate quality scores, genotype calling, refining calls, annotating variants (1000 Genomes). Software: Picard, SAMtools, GATK, etc.
  9. NGS (ir)reproducibility
    • 299 articles published in 2011 citing the 1000 Genomes project pilot publication
    • 19 were NGS studies with similar design
    • Only 10 used tools recommended by 1000G.
    • Only 4 used full 1000G workflow (realignment & quality score recalibration)
    Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).
  10. NGS (ir)reproducibility. Most straightforward part of analysis: alignment. Survey of 50 papers using BWA:
    • 31 provide neither the software version, parameters, nor version of genomic reference.
    • Of the remaining 19:
      ◦ 4 provide settings
      ◦ 8 list version information
      ◦ Only 7 provide all necessary details.
    • In 2 cases, authors provided links to their own website where the primary data were deposited. In both cases, the links were broken.
    Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).
  11. If these studies were representative: Then most results reported in today's publications using NGS data cannot be accurately verified, reproduced, adapted, or used to educate others. This creates an alarming reproducibility crisis.
    Taylor J & Nekrutenko A. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13:667-672 (2012).
  12. Galaxy Interface: A web-based genomic analysis toolkit & workflow management system. Interface panes: Tools; Display data and tool dialog; History.
  13. Galaxy Interface: Integrating command-line tools into web-based platform
    • Nearly any tool that can be run at the command line can be described as a Galaxy Tool.
    • Tools are described with inputs and outputs. E.g. Tophat:
      ◦ input = sequence reads (FASTQ)
      ◦ output = alignments (BAM)
    • Hundreds of tools currently available.
    • Easily extensible with an XML description of the interface and how to generate a command line.
  14. Galaxy Interface: Consistent interface for integrated tools. Example: Read Mapping
    • Select an index (upload own or use built-in)
    • Select a reference genome
    • Select sequencing configuration
    • Select input dataset (from history)
    • Use common settings or change them?
    • Run the tool
  15. Galaxy Interface: Automatically tracks every step of every analysis. Interface panes: Tools; Display data and tool dialog; History.
    • History system facilitates and tracks multi-step analyses
    • Exact parameters of a step can always be inspected, and the step easily rerun
  16. Workflows can be created from scratch or extracted from existing analysis histories. Workflows facilitate sharing and reuse, and provide precise reproducibility of an arbitrarily complex analysis.
  17. Transparency: sharing & publishing
    • All analysis components (datasets, histories, workflows) can be shared among Galaxy users and published.
    • Published pages and annotation allow analyses to be augmented with textual content and provided in the form of an integrated document.
  18. Data Visualization: Send data to external genome browsers: UCSC, Ensembl, IGV, etc. Trackster: Galaxy's built-in genome browser.
  19. Trackster: View data from within Galaxy (no data transfer to external site). Supports common filetypes: BAM, BED, WIG.
  20. http://usegalaxy.org (a.k.a. "Galaxy Main")
    • Free public website
    • No registration required
    • Anyone can use it
    • Hundreds of tools
    • >24,000 registered users
    • >300 TB user data
    • >140,000 jobs/month
    Hundreds of tools, but not all: thousands are available, but implementation is not trivial. Disk storage and compute time are not infinite. A central solution is not scalable.
  21. http://getgalaxy.org
    • Galaxy is open-source
    • Designed for local installation/customization
    • Easy to deploy/manage:
      ◦ $ hg clone https://bitbucket.org/galaxy/galaxy-dist/
      ◦ $ sh run.sh
      ◦ Point browser to http://localhost:8080
    • Requires existing computational resources
      ◦ Large server (bioconnector.virginia.edu)
      ◦ Compute cluster
    • ...Or on the cloud
  22. Local Galaxy Installation: bioconnector.virginia.edu
    • Partnership between:
      ◦ Bioinformatics core
      ◦ Health Sciences Library
      ◦ Division of Clinical Informatics
    • Mission: get researchers connected to the tools and people they need.
    • bioconnector.virginia.edu
    • Tools:
      ◦ Galaxy server
      ◦ VIVO (collaboration)
      ◦ CDR/MUSIC
      ◦ Awesome space
  24. Galaxy CloudMan: http://usegalaxy.org/cloud
    • Start your own fully configured and populated (tools + data) Galaxy instance
    • ~Infinitely scalable (pay on-demand)
    • Someone else manages the data center
  25. Live Demo: galaxyproject.org/wiki/Learn. "Hey [Mom/Dad], which coding exon has the highest number of SNPs on chromosome 22?" Simple question. You know where to find the data. But how do you answer quickly? http://usegalaxy.org
  26. Get data from UCSC: Click "Get data, UCSC Main". Set position to "chr22". Set output format to "BED". Check the "Send output to Galaxy" box. Hit "get output".
  27. Get data from UCSC: At this screen, the history pane will go from gray (preparing), to yellow (running), to green (done).
  28. Get SNPs: Same as last time, but set group to "Variation and Repeats" and select table "snp131".
  29. History management: Rename the two history items to "Exons" and "SNPs" by clicking the pencil icon. Rename the history to "Galaxy 101 Demo".
  30. Join exons with SNPs: Recall: the task is to find the exons containing the most SNPs. First step: join exons with SNPs (print exons and SNPs that overlap side-by-side). Use the "Operate on genomic intervals --> Join" tool. Select exons first, SNPs second.
  31. Join exons with SNPs: Once you do this, you'll see a third history item containing the following data (columns 1-6: data for exons; columns 7-12: data for SNPs):

    chr22  16258185  16258303  uc002zlh.1_cds_1_0_chr22_16258186_r  0  -  chr22  16258278  16258279  rs2845178   0  +
    chr22  16266928  16267095  uc002zlh.1_cds_2_0_chr22_16266929_r  0  -  chr22  16267011  16267012  rs7290262   0  +
    chr22  16266928  16267095  uc002zlh.1_cds_2_0_chr22_16266929_r  0  -  chr22  16266963  16266964  rs10154680  0  +
    chr22  16266928  16267095  uc002zlh.1_cds_2_0_chr22_16266929_r  0  -  chr22  16267037  16267038  rs2818572   0  +
    chr22  16266928  16267095  uc002zlh.1_cds_2_0_chr22_16266929_r  0  -  chr22  16267031  16267032  rs7292200   0  +

    Note that the exon with ID uc002zlh.1_cds_2_0_chr22_16266929_r contains four SNPs with IDs rs7290262, rs10154680, rs2818572, and rs7292200.
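To make the join step concrete, here is a minimal Python sketch of an interval join on the exon and SNP records shown on the slide above. It is not Galaxy's implementation: the `Interval` type and the `overlaps` and `join_intervals` helpers are hypothetical names introduced for illustration, and the sketch assumes BED-style half-open coordinates and a minimum overlap of one base.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """One BED-like record (hypothetical helper type for this sketch)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


# Records copied from the joined output on the slide.
exons = [
    Interval("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r"),
    Interval("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r"),
]
snps = [
    Interval("chr22", 16258278, 16258279, "rs2845178"),
    Interval("chr22", 16267011, 16267012, "rs7290262"),
    Interval("chr22", 16266963, 16266964, "rs10154680"),
    Interval("chr22", 16267037, 16267038, "rs2818572"),
    Interval("chr22", 16267031, 16267032, "rs7292200"),
]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def join_intervals(first, second):
    """Pair every record in `first` with each overlapping record in `second`."""
    return [(a, b) for a in first for b in second if overlaps(a, b)]


joined = join_intervals(exons, snps)
for exon, snp in joined:
    print(exon.name, snp.name)
```

The nested loop is quadratic in the number of records, which is fine for a handful of rows; a real tool working on all of chr22 would sort or index the intervals first.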
  32. Join exons with SNPs: We can compute the number of SNPs per exon by counting how many times each exon name is repeated. This can be done with the "Join, Subtract, and Group -> Group" tool. Choose column 4 by selecting "c4" in Group by column.
  33. Counting SNPs per exon: Then click on Add new Operation and make sure the interface looks exactly as shown below: Then Execute. Your history now looks like this: The result of grouping (dataset #4) contains two columns. The first contains the exon name, while the second shows the number of times this name has been repeated in dataset #3.
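The grouping step amounts to counting how often each value of column 4 occurs. A minimal Python sketch, using the five joined rows from the earlier slide as input, could look like this. The `count_by_column` helper is a hypothetical name, and the sketch only mimics the "group by c4, count" behavior described here rather than reproducing the Galaxy tool.

```python
from collections import Counter

# The two exon records from the joined output (columns 1-6).
EXON_1 = ["chr22", "16258185", "16258303", "uc002zlh.1_cds_1_0_chr22_16258186_r", "0", "-"]
EXON_2 = ["chr22", "16266928", "16267095", "uc002zlh.1_cds_2_0_chr22_16266929_r", "0", "-"]

# Joined rows: exon columns followed by the overlapping SNP (columns 7-12).
joined_rows = [
    EXON_1 + ["chr22", "16258278", "16258279", "rs2845178", "0", "+"],
    EXON_2 + ["chr22", "16267011", "16267012", "rs7290262", "0", "+"],
    EXON_2 + ["chr22", "16266963", "16266964", "rs10154680", "0", "+"],
    EXON_2 + ["chr22", "16267037", "16267038", "rs2818572", "0", "+"],
    EXON_2 + ["chr22", "16267031", "16267032", "rs7292200", "0", "+"],
]


def count_by_column(rows, column):
    """Count rows per distinct value of a 1-based column (like Galaxy's "c4")."""
    return Counter(row[column - 1] for row in rows)


counts = count_by_column(joined_rows, 4)
for exon_name, snp_count in counts.items():
    print(exon_name, snp_count, sep="\t")
```

The result has the same two-column shape as dataset #4: the exon name and the number of SNP rows that were joined to it.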
  34. Sorting exons by SNP count: To see which exon has the highest number of SNPs, sort dataset #4 on the second column in descending order. This is done with "Filter and Sort -> Sort". This generates a 5th history item. The highest number of SNPs is 67.
  35. Select the top five: Select the top five with the "Text manipulation -> Select First" tool. Executing this creates a sixth history item with only 5 lines.
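The sort and select-first steps can be sketched together in a few lines of Python. The exon names and counts below are made-up toy values, not results from the demo, and `sort_descending` and `select_first` are hypothetical helper names chosen to mirror the tool names.

```python
# Toy (exon name, SNP count) rows; values are invented for illustration only.
counts = [
    ("exon_a", 3),
    ("exon_b", 12),
    ("exon_c", 7),
    ("exon_d", 1),
    ("exon_e", 9),
    ("exon_f", 12),
    ("exon_g", 5),
]


def sort_descending(rows, column):
    """Sort rows on a 1-based column, largest value first (ties keep input order)."""
    return sorted(rows, key=lambda row: row[column - 1], reverse=True)


def select_first(rows, n):
    """Keep only the first n rows."""
    return rows[:n]


top_five = select_first(sort_descending(counts, 2), 5)
for name, snp_count in top_five:
    print(name, snp_count, sep="\t")
```

Note that the counts are stored as integers here. If they were read from a text file as strings, they would need converting first, otherwise "9" would sort above "12".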
  36. Recover exon info: Now we know that in this dataset the five top exons contain between 41 and 67 SNPs. To know more we need to get back the positional information of these exons. This information was lost at the grouping step and now all we have is just two columns. To get coordinates back we will match the names of exons in dataset #6 (column 1) against names of the exons in the original dataset #1 (column 4). This can be done with the "Join, Subtract and Group -> Compare two Queries" tool (note the settings of the tool in the middle pane). This creates a seventh history item.
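Recovering the coordinates is a lookup of one column against another. A minimal Python sketch of that matching, assuming the exon rows from the earlier joined output stand in for dataset #1 and a one-row list stands in for dataset #6, might look like this. The `matching_rows` helper is a hypothetical name and the single "top" row is illustrative, not the real top-five result.

```python
# Stand-in for dataset #1: full exon records (name in column 4).
exons = [
    ["chr22", "16258185", "16258303", "uc002zlh.1_cds_1_0_chr22_16258186_r", "0", "-"],
    ["chr22", "16266928", "16267095", "uc002zlh.1_cds_2_0_chr22_16266929_r", "0", "-"],
]

# Stand-in for dataset #6: (exon name, SNP count) rows (name in column 1).
top_exons = [
    ["uc002zlh.1_cds_2_0_chr22_16266929_r", "4"],
]


def matching_rows(first, first_column, second, second_column):
    """Return rows of `first` whose key column value appears in `second`.

    Columns are 1-based, like Galaxy's c1, c2, ... notation.
    """
    wanted = {row[second_column - 1] for row in second}
    return [row for row in first if row[first_column - 1] in wanted]


recovered = matching_rows(exons, 4, top_exons, 1)
for row in recovered:
    print("\t".join(row))
```

The output keeps the row order of the first dataset, so the recovered exons come back in coordinate-file order rather than ranked by SNP count.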
  37. Homework: RNA-seq, http://bit.ly/galaxy-rnaseq
    • Get some data (Illumina BodyMap)
    • QC / trim your reads
    • Map to hg19 with tophat
    • Visualize where reads map
    • Assemble with cufflinks
    • Differential expression with cuffdiff
  38. Resources
    • Twitter: @galaxyproject
    • Tutorials: http://galaxyproject.org/wiki/Learn
    • Mailing list: http://user.list.galaxyproject.org
    • Biostar
      ◦ Web: http://biostars.org/
      ◦ Twitter: @BioStarQuestion
    • SEQanswers:
      ◦ Web: http://seqanswers.com/
      ◦ Twitter: @SEQquestions
  39. Contact Me: Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog

    Galaxy: Web: http://galaxyproject.org (Mailing list, wiki, screencasts, etc). Twitter: @galaxyproject #useGalaxy IRC: Server: irc.freenode.net Channel: #galaxyproject Slides available at: bit.ly/uva-galaxy Evaluation forms in the back. Please fill these out!