PyData 2013: Python Tools for Reproducible Research in Brain Imaging

The goal of this talk is to motivate reproducible research and survey the Python tools available for it, and to show how some of these tools are used in brain imaging.

Satrajit Ghosh

July 27, 2013

Transcript

  1. Python Tools for Reproducible Research in Brain Imaging

    Satrajit S. Ghosh - satra@mit.edu, July 27, 2013
  2. (1) Introduction; (2) Provenance; (3) Python Tools; (4) Challenges & Future directions
  3. Introduction
  4. Introduction: Literary and Philosophical Society of Manchester

    “The sanction which the Society gives to the work now published under its auspices, extends only to the novelty, ingenuity or importance of the several memoirs which it contains. Responsibility concerning the truth of facts, the soundness of reasoning, or the accuracy of calculations is wholly disclaimed: and must rest alone, on the knowledge, judgement, or ability of the authors who have respectfully furnished such communications.” (1785; Kronick, 1990)
  5. Introduction: Research workflow in Brain Imaging

    Reproducibility can mean many things. (Poline et al., 2012)
  6. Introduction: Reproducibility is complicated

    Components to reproduce (1). Participants: screening criteria, demographics. Experimental setup: stimuli, experiment control software.
  7. Introduction: Components to reproduce (2)

    Data acquisition: MR scanner, pulse sequences, reconstruction algorithms, cognitive or neuropsychological assessments. Data analysis: software tools, environments, quality control, analysis scripts, figure creation.
  8. Introduction: Reproducible scientific computing

    “In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error. I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts. Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.” (Donoho, 2010)
  9. Introduction: Where does the information end up?

    Journals; web pages, blogs; documents; code repositories; databases.
  10. Introduction: Can I trust the information?

    Are the claims valid? Were the computations accurate? Where did the data come from? What are the characteristics of the data? What analysis methods were used? What hardware, OS and software versions?
  11. Introduction: Aims of reproducible analysis

    Ability to reproduce analysis. Increase accuracy: ability to verify analyses are consistent with intentions; ability to review analysis choices. Increase clarity of communication. Increased trustworthiness: ability for others to verify. Extensibility: ability to easily modify or re-use existing analyses. Granularity: reproducible for a given application or within a given tolerance. Source: https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/talk.md
  12. Introduction: Capturing research

    The laboratory notebook (e.g., Documents, Google, Dropbox). Code: directories on filesystem; code repositories (e.g., GitHub, SourceForge). Data (e.g., Databases, Archives). Environments: Python requirements.txt; virtual machines; cloud (e.g., Amazon Web Services, Azure, Rackspace). Supplementary information: MIT DSpace; journal archives.
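One way to capture the software side of an environment, as a minimal sketch rather than anything shown in the talk, is to record the interpreter, operating system and installed package versions next to the results. The function name `capture_environment` and the layout of the snapshot are assumptions made for illustration; only the standard library is used.

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Return a JSON-serializable snapshot of the running environment."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with unreadable metadata
            packages.add(f"{name}=={dist.version}")
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        # Same "name==version" form that a requirements.txt file uses.
        "packages": sorted(packages),
    }


snapshot = capture_environment()
print(json.dumps({k: v for k, v in snapshot.items() if k != "packages"}, indent=2))
print(len(snapshot["packages"]), "installed distributions recorded")
```

Saving such a snapshot alongside each analysis output gives a later reader a starting point for the "what software versions?" question, although it does not capture compiled libraries or hardware details.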
  13. Introduction: Example of reproducible analysis (Ragan-Kelley et al., 2012)

    Data availability: The IPython notebooks and all data files referenced here are available at http://qiime.org/home_static/nih-cloud-apr2012/. The Amazon Machine Identifier used for these analyses is ami-9f69c1f6. Tutorials for using QIIME are available at http://www.qiime.org; tutorials for using the IPython notebook are available at http://ipython.org/ipython-doc/rel-0.13/index.html and http://ipython.org/videos.html; tutorials for using StarCluster are available at http://web.mit.edu/star/cluster/.
  14. Introduction: A central theme: Capturing information
  15. Provenance
  16. Provenance: Some definitions

    What is Provenance? Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. (source: W3C) What is a ‘data model’? A data model is an abstract conceptual formulation of information that explicitly determines the structure of data and allows software and people to communicate and interpret data precisely. (source: Wikipedia)
  17. Provenance: What is PROV-DM?

    PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. PROV-DM provides a generic basis that captures relationships associated with the creation and modification of entities by activities and agents.
  18. Provenance: PROV-DM components

    (1) Entities and activities, and the time at which they were created, used, or ended; (2) Derivations of entities from entities; (3) Agents bearing responsibility for entities that were generated or activities that happened; (4) A notion of bundle, a mechanism to support provenance of provenance; (5) Properties to link entities that refer to the same thing; (6) Collections forming a logical structure for its members. Source: PROV-DM
  19. Provenance

    [Figure: PROV-O starting-point terms. Classes: Entity, Activity, Agent. Properties: used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf, startedAtTime and endedAtTime (xsd:dateTime). Source: http://www.w3.org/TR/prov-o/diagrams/starting-points.svg]
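The starting-point terms named on this slide can be illustrated with a small, self-contained sketch. It is a toy model written only with the standard library, not the Prov Python library and not a complete PROV-DM implementation; the class `ProvGraph`, its methods and the `ex:` identifiers are all hypothetical. The argument order in the printed statements follows PROV-N for these four relations.

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    """Toy container for entities, activities, agents and their relations."""
    entities: set = field(default_factory=set)
    activities: dict = field(default_factory=dict)  # id -> (started, ended)
    agents: set = field(default_factory=set)
    relations: list = field(default_factory=list)   # (name, subject, object)

    def entity(self, identifier):
        self.entities.add(identifier)
        return identifier

    def activity(self, identifier, started=None, ended=None):
        self.activities[identifier] = (started, ended)
        return identifier

    def agent(self, identifier):
        self.agents.add(identifier)
        return identifier

    def relate(self, name, subject, obj):
        self.relations.append((name, subject, obj))

    def statements(self):
        """Render each relation as name(subject, object)."""
        return [f"{name}({subject}, {obj})" for name, subject, obj in self.relations]

    def sources_of(self, identifier):
        """Follow wasDerivedFrom edges back to every upstream entity."""
        found, stack = [], [identifier]
        while stack:
            current = stack.pop()
            for name, subject, obj in self.relations:
                if name == "wasDerivedFrom" and subject == current and obj not in found:
                    found.append(obj)
                    stack.append(obj)
        return found


g = ProvGraph()
raw = g.entity("ex:raw_scan")
aligned = g.entity("ex:realigned_scan")
realign = g.activity("ex:realign", started="2013-07-27T10:00:00", ended="2013-07-27T10:05:00")
tool = g.agent("ex:realign_tool")

g.relate("used", realign, raw)                # activity used entity
g.relate("wasGeneratedBy", aligned, realign)  # entity generated by activity
g.relate("wasAssociatedWith", realign, tool)  # activity associated with agent
g.relate("wasDerivedFrom", aligned, raw)      # entity derived from entity

print("\n".join(g.statements()))
print(g.sources_of(aligned))
```

Even this reduced form shows why a graph representation is convenient: asking where a result came from becomes a traversal over `wasDerivedFrom` edges.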
  20. Provenance

    [Figure: PROV-O expanded terms. Classes: Entity, Activity, Agent (Person, SoftwareAgent, Organization), Collection, Bundle, Plan, Location. Properties: generatedAtTime and invalidatedAtTime (xsd:dateTime), value, hadMember, atLocation, wasStartedBy / wasEndedBy, wasInvalidatedBy, wasInfluencedBy / wasQuotedFrom / wasRevisionOf / hadPrimarySource, alternateOf / specializationOf. Source: http://www.w3.org/TR/prov-o/diagrams/expanded.svg]
  21. Provenance: Why PROV-DM?

    (1) A formal, technology-agnostic representation of information (can be translated to RDF/XML/JSON); (2) Machine-accessible structured representation of data; (3) Federated queries using SPARQL when represented as RDF; (4) Provenance is not an afterthought: captures data and metadata (about entities, activities and agents) within the same context; (5) A standard simplifies app development; (6) A W3C recommendation.
  22. Python Tools
  23. Python Tools: Solutions for provenance tracking
  24. Python Tools

    (1) IPython history; (2) Sumatra; (3) Synapse (with Python client); (4) Prov Python library (with RDF extensions)
  25. Python Tools: IPython history

    Notebook serves as a good summary of the final analysis. Lacks history of things that you may have tried. Needs user to craft the analysis so that it is re-usable. Tracking history as a graph will require code changes. Beyond the polished script: “In most cases we know that the material contained in a paper or scientific communication is wrong. It is probably more useful when it is wrong than when it is right, because when it is wrong, you can do something to make it better. When it is right, it finishes the whole thing” (Bernal, 1960).
  26. Python Tools: Sumatra

    A command-line interface, smt, for launching simulations/analyses with automatic recording of information about the experiment, annotating these records, linking to data files, etc. A web interface with a built-in web-server, smtweb, for browsing and annotating simulation/analysis results. A LaTeX package and Sphinx extension for including Sumatra-tracked figures and links to provenance information in papers and other documents. A Python API, on which smt and smtweb are based, that can be used in your own scripts in place of using smt, or could be integrated into a GUI-based application.
  27. Python Tools

        $ smt init myProject
        $ smt configure --simulator=python --main=main.py
        $ smt run default.param
        $ smt info
        $ smt comment "Ran something really useful"
  28. Python Tools
  29. Python Tools: Synapse

    Uses a central store model. Works with Projects, Folders, Files and Activities. Conforms to a subset of W3C PROV.

        wondrous_project = syn.get('syn1901847')
        stored_file = syn.get('syn1906479')
        import numpy as np
        my_array = np.loadtxt(stored_file.path)
  30. Python Tools

        ## GET THE FILE WITH RESPONSES AND EXPRESSION DATA
        resp_file = syn.get('syn1906479')
        resp = np.loadtxt(resp_file.path, ...)
        expr_file = syn.get('syn1906480')
        expr = np.loadtxt(expr_file.path, ...)

        ## CODE WHICH WAS USED TO GENERATE THE P-VALUE HISTOGRAM
        code_file = syn.get('syn1917825')

        ## STORE IN OUR FOLDER WE CREATED EARLIER
        plot_file = File('hist.png', ...)
        plot_file = syn.store(plot_file)

        ## LINK THESE TOGETHER WITH PROVENANCE
        act = Activity(name="p-value histogram",
                       used=[expr_file, resp_file],
                       executed=code_file)
        plot_file = syn.store(plot_file, activity=act)
  31. Python Tools: Prov Python Library

    Recent library to support the W3C PROV Data Model. In spirit similar to Sumatra and Synapse. All terms are formalized. Supports all PROV-DM constructs. PROV-N notation output can be converted to RDF/JSON/XML. Base library on which applications can be built.
  32. Python Tools

        import prov.model as prov
        g = prov.ProvBundle()
        e1 = g.entity('e1', [('foo:toast', "b")])
        a1 = g.activity('a1', startTime="2012-04-03T23:59:59Z")
        g.wasGeneratedBy(e1, a1, time="2012-04-03T23:59:59Z",
                         other_attributes=[('foo:jam', "d")])

        In [6]: print g.get_provn()
        document
          entity(e1, [foo:toast="b"])
          activity(a1, 2012-04-03T23:59:59, -)
          wasGeneratedBy(e1, a1, 2012-04-03T23:59:59, [foo:jam="d"])
        endDocument
  33. Python Tools

        In [8]: print g.rdf().serialize(format='turtle')
        @prefix ns1: <foo:> .
        @prefix prov: <http://www.w3.org/ns/prov#> .
        @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

        <e1> a prov:Entity ;
            ns1:toast "b" ;
            prov:qualifiedGeneration [ a prov:Generation ;
                    ns1:jam "d" ;
                    prov:activity <a1> ;
                    prov:time "2012-04-03T23:59:59"^^xsd:dateTime ] ;
            prov:wasGeneratedBy <a1> .

        <a1> a prov:Activity ;
            prov:startTime "2012-04-03T23:59:59"^^xsd:dateTime .
  34. Python Tools: Python Workflow tools supporting provenance

    (1) VisTrails: Scientific Workflow and Provenance Management; manage rapidly evolving workflows; can be used via a graphical interface. (2) Nipype: a brain imaging focused workflow environment; flexible semantics for scripting complex workflows.
  35. Python Tools: Common properties of workflows

    Processing pipeline is a graph (typically a DAG). Nodes are processes. Edges represent data flow. Compact representation for any process. Much like a function, creates code and data separation.
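The properties listed on this slide can be sketched in a few lines: nodes are plain functions, edges say whose output feeds whom, and a topological sort gives a valid execution order. This is an illustrative toy, not how VisTrails or Nipype are implemented; `run_workflow` and the node names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(functions, dependencies):
    """Run each node after its predecessors and pass their outputs along.

    functions:    maps node name -> callable (the process)
    dependencies: maps node name -> list of upstream node names (the edges)
    """
    results = {}
    # static_order() yields every node after all of its predecessors and
    # raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = functions[name](*upstream)
    return results


functions = {
    "load": lambda: [3.0, 1.0, 2.0],
    "demean": lambda xs: [x - sum(xs) / len(xs) for x in xs],
    "peak": lambda xs: max(xs),
    "report": lambda xs, peak: f"{len(xs)} samples, peak {peak:+.1f}",
}
dependencies = {
    "load": [],
    "demean": ["load"],
    "peak": ["demean"],
    "report": ["demean", "peak"],
}

results = run_workflow(functions, dependencies)
print(results["report"])  # 3 samples, peak +1.0
```

Because the graph is data and the functions are code, the same `dependencies` description can be re-run on different inputs or handed to a different executor, which is the code and data separation the slide refers to.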
  36. Python Tools: Nipype: A Workflow environment for brain imaging (Gorgolewski et al., 2011)
  37. Python Tools: Nipype components

    Interface: wraps a program or function (Executables, MATLAB, Python, JAVA) and provides uniform semantics. Node/MapNode: wraps an Interface for use in a Workflow that provides caching and other goodies (e.g., pseudo-sandbox). Workflow: a graph or forest of graphs whose nodes are of type Node, MapNode or Workflow and whose edges represent data flow. Plugin: a component that describes how a Workflow should be executed.
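The caching mentioned for Node/MapNode can be pictured with a generic sketch: hash the inputs, and skip the computation when a result for that hash is already on disk. This is only an illustration of the idea under simplifying assumptions (inputs must be JSON-serializable, results picklable); `CachedNode` is a hypothetical class and does not reproduce Nipype's API or its actual caching logic.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


class CachedNode:
    """Wrap a function so that repeated calls with equal inputs hit a disk cache."""

    def __init__(self, name, function, cache_dir):
        self.name = name
        self.function = function
        self.cache_dir = Path(cache_dir)
        self.run_count = 0  # how many times the wrapped function really ran

    def _cache_file(self, inputs):
        # A stable text form of the inputs determines the cache key.
        payload = json.dumps({"node": self.name, "inputs": inputs}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{self.name}-{digest[:16]}.pkl"

    def run(self, **inputs):
        cache_file = self._cache_file(inputs)
        if cache_file.exists():
            return pickle.loads(cache_file.read_bytes())
        result = self.function(**inputs)
        self.run_count += 1
        cache_file.write_bytes(pickle.dumps(result))
        return result


with tempfile.TemporaryDirectory() as tmp:
    scale = CachedNode("scale", lambda data, factor: [x * factor for x in data], tmp)
    first = scale.run(data=[1, 2, 3], factor=2)
    second = scale.run(data=[1, 2, 3], factor=2)  # same inputs: read from cache
    third = scale.run(data=[1, 2, 3], factor=4)   # new inputs: computed again
    print(first, second, third, scale.run_count)
```

The same input hash that decides whether to re-run a step is also a natural identifier to record in a provenance graph.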
  38. Python Tools: Execution Plugins

    Allows seamless execution across many architectures. Local: serial, multicore. Clusters: SSH (via IPython), PBS/Torque/SGE/LSF (native and via IPython), HTCondor.
  39. Python Tools: Attempts at provenance in Nipype

    Logging to file. Restructured text output per interface. Exporting the script. Executable IPython notebooks. Using Prov library and storing RDF in a file or triplestore.
  40. Python Tools: Example storage of provenance

    From the analysis of a single participant we get: 429 statements from an interface (runtime dependencies (fingerprint), inputs, outputs, md5/sha512 hashes and pointers to files); 6021 statements from the workflow (includes relations between processes; includes links to shared input output entities).
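The "md5/sha512 hashes and pointers to files" item can be made concrete with a short sketch that fingerprints one file. It shows the general technique only; the function name `file_fingerprint` and the record layout are assumptions, not the statements Nipype actually emits.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path, algorithms=("md5", "sha512"), chunk_size=1 << 20):
    """Return a pointer to the file plus one hex digest per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read in chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    record = {"location": Path(path).resolve().as_uri()}
    record.update({name: hasher.hexdigest() for name, hasher in hashers.items()})
    return record


with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "output.txt"
    output.write_bytes(b"hello")
    record = file_fingerprint(output)

for key, value in record.items():
    print(f"{key}: {value}")
```

Storing the digest next to the location means a later reader can check that the file a provenance record points to is still the file that was produced.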
  41. Python Tools: What does this buy us?

    Rich information that a human can but won’t read. Structured graphical information. No need for building a relational schema. Can be repurposed for applications: execution duration could be used to instrument schedulers; parametric failure modes can be tracked across large databases; coupled with testing frameworks (e.g., testkraut).
  42. Challenges & Future directions
  43. Challenges & Future directions: Where are we now?
  44. Challenges & Future directions: Where are we headed?

    NI-DM: a brain imaging extension of PROV-DM. Relate to the Linked Data Web. Federation query of provenance and other metadata. Common vocabulary across software. Common data representation (perhaps RDF) to store metadata. App instrumentation and development with built in provenance tracking. Cooperation is essential.
  45. Challenges & Future directions: Thanks

    Neuroimaging in Python Community; Nipype contributors; NeuroDebian; International Neuroinformatics Coordinating Facility; BIRN derived-data working group; W3C PROV Working group; NIH, INCF for support.
  46. Challenges & Future directions: The picture of the future (Bechhofer et al., 2013)

    http://www.researchobject.org/
  47. Challenges & Future directions

    Bechhofer, S., Buchan, I., Roure, D. D., Missier, P., Ainsworth, J., Bhagat, J., Couch, P., et al. (2013). Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2), 599–611. doi:10.1016/j.future.2011.08.004

    Bernal, J. D. (1960). Scientific information and its users. In Aslib proceedings (Vol. 12, pp. 432–438). MCB UP Ltd.

    Donoho, D. L. (2010). An invitation to reproducible computational research. Biostatistics, 11(3), 385–388.

    Gorgolewski, K., Burns, C. D., Madison, C., Clark, D., Halchenko, Y. O., Waskom, M. L., & Ghosh, S. S. (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Frontiers in neuroinformatics, 5.

    Kronick, D. A. (1990). Peer review in 18th-century scientific journalism. JAMA: the journal of the American Medical Association, 263(10), 1321–1322.

    Poline, J.-B., Breeze, J. L., Ghosh, S., Gorgolewski, K., Halchenko, Y. O., Hanke, M., Haselgrove, C., et al. (2012). Data sharing in neuroimaging research. Frontiers in neuroinformatics, 6.

    Ragan-Kelley, B., Walters, W. A., McDonald, D., Riley, J., Granger, B. E., Gonzalez, A., Knight, R., et al. (2012). Collaborative cloud-enabled tools allow rapid, reproducible biological insights. The ISME journal.