Galaxy: Biology of Genomes 2010

Presented in May 2010 at Biology of Genomes: an analysis of heteroplasmy using Galaxy Cloud, given when Galaxy Cloud and CloudMan were first becoming mature.

James Taylor

May 01, 2010

Transcript

  1. Discovery of human heteroplasmic sites enabled by an accessible interface to cloud-computing infrastructure

    Enis Afgan, Hiroki Goto, Ian Paul, Francesca Chiaromonte, Kateryna Makova, Anton Nekrutenko, James Taylor
    Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors
  2. Mitochondrial heteroplasmy

    • Typical human cells have ~100s of mitochondria, each with ~10 copies of the mitochondrial genome
    • Heteroplasmy refers to variation among the mitochondrial genomes within a cell or individual
    • Maternal, multi-copy inheritance
  3. Pilot study

    Figure: study design with three mother/child pairs. Pair 1: mother age 40, child age 14. Pair 2: mother age 44, child age 9. Pair 3: mother age 39, child age 5. Each individual sampled from blood and cheek swab, with two PCR replicates (PCR 1, PCR 2) per sample.
    24 datasets sequenced using Illumina (76bp reads)
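The design on this slide can be enumerated with a minimal Python sketch. It is only an illustration, not part of the talk: the names `PAIRS`, `TISSUES`, `PCR_REPLICATES`, and `enumerate_datasets` are hypothetical, and the assignment of ages to Pair 1, 2, and 3 assumes the pairs appear in slide order.

```python
from itertools import product

# Ages as listed on the slide; mapping ages to pair numbers by slide order is an assumption.
PAIRS = {
    "Pair 1": {"mother": 40, "child": 14},
    "Pair 2": {"mother": 44, "child": 9},
    "Pair 3": {"mother": 39, "child": 5},
}
TISSUES = ("blood", "cheek swab")
PCR_REPLICATES = ("PCR 1", "PCR 2")


def enumerate_datasets():
    """Return one (pair, individual, age, tissue, pcr) tuple per sequenced dataset."""
    datasets = []
    for pair, ages in PAIRS.items():
        for (individual, age), tissue, pcr in product(
            ages.items(), TISSUES, PCR_REPLICATES
        ):
            datasets.append((pair, individual, age, tissue, pcr))
    return datasets


datasets = enumerate_datasets()
print(len(datasets))  # 24 = 3 pairs x 2 individuals x 2 tissues x 2 PCR replicates
```

The 12 replicate pairs mentioned in the workflow slides correspond to the 12 distinct (pair, individual, tissue) samples, each with two PCR replicates.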
  4. Producing the sequence was easy enough... Now what?

  5. As science becomes increasingly dependent on computation:

    • How can methods best be made accessible to scientists?
    • How to facilitate transparent communication of analyses?
    • How best to ensure that analyses are reproducible?
  6. Galaxy: accessible analysis system

    • Consistent tool user interfaces automatically generated
    • History system facilitates and tracks multistep analyses
    • Exact parameters of a step can always be inspected, and easily rerun
    • Workflow system
  7. Could use the Galaxy web site for this data...

    However, we would be sharing compute availability and bandwidth with other users, and would need to upload our potentially private data
  8. Could use a local Galaxy instance...

    • Galaxy is designed for local installation and customization
    • Easily integrate new tools
    • Easy to deploy and manage on nearly any (unix) system
    • Run jobs on existing compute clusters
    • But, requires an existing computational resource on which to be deployed
  9. Welcome to Galaxy on the Cloud(s) http://usegalaxy.org/cloud

  10. Cloud computing

    • Computing using resources acquired on demand
    • Spectrum from infrastructure as a service (virtual machines, e.g. Amazon EC2) to software as a service (e.g. Google Docs)
    • Goal for Galaxy: deliver the provider independence of an IaaS-based solution, while approaching the ease of use of a SaaS-based solution
  11. Using Amazon EC2: Startup in 3 steps

  12. None
  13. None
  14. None
  15. Analysis defined by three Galaxy workflows... each created by example, and extracted for reuse
  16. Workflow 1: determine maximum variation at any site between PCR replicates (run for each of 12 replicate pairs)
  17. Workflow 2: pool replicates and identify sites with variability greater than the maximum PCR variation (run for each of 12 replicate pairs)
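The logic of Workflows 1 and 2 can be sketched in a few lines of Python. This is not the Galaxy workflow from the talk, which is built from Galaxy tools. It is a minimal illustration under stated assumptions: "variation" is taken to be the minor allele frequency at a site, each replicate is represented as a hypothetical mapping from site to base counts, and all function names are invented for the example.

```python
from collections import Counter


def minor_allele_frequency(counts):
    """Fraction of reads at a site that do not carry the most common base."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - max(counts.values()) / total


def max_replicate_variation(rep1, rep2):
    """Workflow 1 (sketch): largest per-site difference in minor allele
    frequency between two PCR replicates of the same sample."""
    sites = set(rep1) | set(rep2)
    return max(
        (
            abs(
                minor_allele_frequency(rep1.get(site, {}))
                - minor_allele_frequency(rep2.get(site, {}))
            )
            for site in sites
        ),
        default=0.0,
    )


def pooled_variable_sites(rep1, rep2, threshold):
    """Workflow 2 (sketch): pool base counts from both replicates and keep
    sites whose pooled minor allele frequency exceeds the threshold."""
    flagged = {}
    for site in sorted(set(rep1) | set(rep2)):
        pooled = Counter(rep1.get(site, {})) + Counter(rep2.get(site, {}))
        frequency = minor_allele_frequency(pooled)
        if frequency > threshold:
            flagged[site] = frequency
    return flagged


# Toy counts for illustration only; these are not data from the study.
rep1 = {100: {"C": 990, "T": 10}, 200: {"A": 1000}, 300: {"A": 700, "G": 300}}
rep2 = {100: {"C": 985, "T": 15}, 200: {"A": 998, "C": 2}, 300: {"A": 690, "G": 310}}

threshold = max_replicate_variation(rep1, rep2)
print(sorted(pooled_variable_sites(rep1, rep2, threshold)))
```

Whether the threshold is taken per replicate pair or as the maximum across all 12 pairs is not stated on the slides, so the sketch leaves it as a parameter.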
  18. Workflow 3: Combine sites within pair to produce final tables (run once for each of the 3 mother/child pairs)
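The combining step of Workflow 3 might look like the following Python sketch. It assumes, hypothetically, that each sample in a pair contributes a set of flagged sites plus a per-site frequency lookup. The function name `combine_pair_table` and the input layout are illustrative and do not come from the talk.

```python
def combine_pair_table(flagged, frequencies):
    """Build a site-by-sample table for one mother/child pair.

    flagged:     sample name -> set of sites flagged in that sample
    frequencies: sample name -> {site: frequency} for that sample
    Returns a list of rows; the first row is the header.
    """
    samples = sorted(frequencies)
    # A site enters the table if it was flagged in any sample of the pair.
    sites = sorted(set().union(*flagged.values()))
    header = ["site"] + samples
    rows = [
        [site] + [frequencies[sample].get(site) for sample in samples]
        for site in sites
    ]
    return [header] + rows


# Toy values for illustration only; these are not data from the study.
flagged = {"mother_blood": {100}, "child_blood": {100, 300}}
frequencies = {
    "mother_blood": {100: 0.02, 300: 0.001},
    "child_blood": {100: 0.40, 300: 0.03},
}
for row in combine_pair_table(flagged, frequencies):
    print(row)
```

Reporting every sample's frequency at each site in the union, rather than only at the sites flagged in that sample, makes it possible to compare mother and child at the same position.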
  19. After running first set of workflows, many jobs waiting to run but cluster is completely utilized...
  20. None
  21. None
  22. ~1 hour and ~$20 later...

  23. Figure: sites by sample

    Rows: Mother (cheek), Mother (blood), Child (cheek), Child (blood) for each of Pair 1, Pair 2, Pair 3
    Columns (mitochondrial genome positions): 299, 302, 310, 3434, 3480, 5063, 8992, 10398, 10550, 11299, 14053, 16184, 16189, 16190
    Gene track: TRNF, RNR1, TRNV, RNR2, TRNL1, ND1, TRNI, TRNQ, TRNM, ND2, TRNW, TRNA, TRNN, TRNC, TRNY, COX1, TRNS1, TRND, COX2, TRNK, ATP8, ATP6, COX3, TRNG, ND3, TRNR, ND4L, ND4, TRNH, TRNS2, TRNL2, ND5, ND6, TRNE, CYTB, TRNT, TRNP
  24. Figure: same layout as slide 23 (samples by position, with gene track), with frequencies labeled

    Frequencies shown: 0.08%, 0.24%, 1.92%, 2.05%, 47.4%, 42.2%, 31.2%, 31.4%, 1.86%, 1.68%, 0.19%, 0.35%
  25. Figure: same layout as slide 23 (samples by position, with gene track), with frequencies labeled

    Frequencies shown: 0.08%, 0.11%, 1.82%, 0.26%, 0.15%, 0.00%, 2.13%, 0.17%, 0.06%, 0.22%, 2.07%, 0.42%, 0.04%, 0.03%, 1.78%, 0.09%
  26. Even this simple analysis gets complicated with this much data. How do we communicate it transparently?
  27. None
  28. None
  29. None
  30. None
  31. None
  32. Looking forward: Galaxy Cloud

    • Dynamic and predictive scaling
    • Should not require user to manage cluster size, but should allow setting limits
    • Models for defining parallelization at the within tool level
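One simple way to picture scaling within user-set limits is the sketch below. It is a hypothetical policy, not Galaxy Cloud's or CloudMan's actual logic, which the slides do not describe. The function name and all parameters are assumptions made for illustration, and it covers only the dynamic part of the goal, not the predictive part.

```python
import math


def desired_worker_count(queued_jobs, running_jobs, slots_per_worker,
                         min_workers, max_workers):
    """Return how many worker nodes to run, clamped to user-set limits.

    The target is enough workers to hold every queued and running job,
    but never fewer than min_workers or more than max_workers.
    """
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    needed = math.ceil((queued_jobs + running_jobs) / slots_per_worker)
    return max(min_workers, min(max_workers, needed))


# 30 queued + 8 running jobs at 4 slots per worker would need 10 workers,
# but the user-set limit of 5 caps the cluster size.
print(desired_worker_count(30, 8, 4, min_workers=1, max_workers=5))  # 5
```

The user sets only the limits, and the cluster size between them follows the job queue, which matches the stated aim of not requiring the user to manage cluster size directly.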
  33. Looking forward: Galaxy Pages

    • What happens when a Galaxy instance goes away?
    • Big problem as instances get more distributed, but even “Galaxy main” may not be there forever
    • Enhanced support for dumping data, pages, workflows from Galaxy with human-readable data and sufficient metadata to load into a new Galaxy
    • But some mechanism must exist to archive the data that underlies publications
    • Implementing support for depositing data and metadata directly into such repositories, e.g. with Dryad
  34. Acknowledgements

    • co-PIs David Bader (Georgia Tech), Sergei Kosakovsky Pond (UCSD), Ross Lazarus (Harvard Medical School)
    • Funding from NHGRI, NSF, Penn State, Emory, Beckman Foundation, Pennsylvania Department of Health
    • The Galaxy Team at Penn State and Emory
  35. Enis Afgan, Guru Ananda, Dan Blankenberg, Ramkrishna Chakrabarty, Nate Coraor, Jeremy Goecks, Greg von Kuster, Kanwei Li
  36. None
  37. Want next cover to be yours?

    Makova Lab (makova.bx.psu.edu) is hiring. Contact kmakova@bx.psu.edu with your CV.
    Galaxy Developer Conference, May 15 - 17, 2010, HERE! Immediately following this meeting! Join us!
    The Galaxy community is all about you (and we have actual job openings too, for Galaxy and other stuff; contact james.taylor@emory.edu)
  38. None