PyData 2013: Python Tools for Reproducible Research in Brain Imaging

The goal of this talk is to motivate and survey the landscape of Python tools available for reproducible research, and to show how some of these tools are used in brain imaging.

Satrajit Ghosh

July 27, 2013

Transcript

  1. Python Tools for Reproducible Research
    in Brain Imaging
    Satrajit S. Ghosh
    July 27, 2013
    Satrajit S. Ghosh - Python Tools for Reproducible Research in Brain Imaging - July 27, 2013 - 1 / 47

  2. 1 Introduction
    2 Provenance
    3 Python Tools
    4 Challenges & Future directions

  3. Introduction
    Introduction

  4. Introduction
    Literary and Philosophical Society of Manchester
    “The sanction which the Society gives to the work now published under its
    auspices, extends only to the novelty, ingenuity or importance of the
    several memoirs which it contains. Responsibility concerning the truth of
    facts, the soundness of reasoning, or the accuracy of calculations is wholly
    disclaimed: and must rest alone, on the knowledge, judgement, or ability
    of the authors who have respectfully furnished such communications. -
    1785” (Kronick, 1990)

  5. Introduction
    Research workflow in Brain Imaging
    Reproducibility can mean many things.
    (Poline et al., 2012)

  6. Introduction
    Reproducibility is complicated
    Components to reproduce 1
    Participants
    Screening criteria
    Demographics
    Experimental setup
    Stimuli
    Experiment control software

  7. Introduction
    Components to reproduce 2
    Data acquisition
    MR scanner
    Pulse sequences
    Reconstruction algorithms
    Cognitive or Neuropsychological assessments
    Data analysis:
    Software tools
    Environments
    Quality control
    Analysis scripts
    Figure creation

  8. Introduction
    Reproducible scientific computing
    “In my own experience, error is ubiquitous in scientific computing, and one
    needs to work very diligently and energetically to eliminate it. One needs a
    very clear idea of what has been done in order to know where to look for
    likely sources of error. I often cannot really be sure what a student or
    colleague has done from his/her own presentation, and in fact often
    his/her description does not agree with my own understanding of what has
    been done, once I look carefully at the scripts. Actually, I find that
    researchers quite generally forget what they have done and misrepresent
    their computations.” (Donoho, 2010)

  9. Introduction
    Where does the information end up?
    Journals
    Web pages, blogs
    Documents
    Code repositories
    Databases

  10. Introduction
    Can I trust the information?
    Are the claims valid?
    Were the computations accurate?
    Where did the data come from?
    What are the characteristics of the data?
    What analysis methods were used?
    What hardware, OS and software versions?

  11. Introduction
    Aims of reproducible analysis
    Ability to reproduce analysis
    Increase accuracy
    Ability to verify analyses are consistent with intentions
    Ability to review analysis choices
    Increase clarity of communication
    Increased trustworthiness
    Ability for others to verify
    Extensibility
    Ability to easily modify or re-use existing analyses
    Granularity
    Reproducible for a given application or within a given tolerance
    Source: https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/talk.md

  12. Introduction
    Capturing research
    The laboratory notebook (e.g., Documents, Google, Dropbox)
    Code
    Directories on filesystem
    Code repositories (e.g., Github, Sourceforge)
    Data (e.g., Databases, Archives)
    Environments
    Python requirements.txt
    Virtual Machines
    Cloud (e.g., Amazon Web Services, Azure, Rackspace)
    Supplementary information
    MIT DSpace
    Journal archives
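
One way to capture the "Environments" item above is to snapshot the interpreter, platform and installed package versions alongside each analysis. The following is a minimal sketch using only the standard library; the function name `capture_environment` and the dictionary layout are illustrative assumptions, not part of any tool mentioned in the talk.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return a JSON-serializable snapshot of the running Python environment."""
    packages = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages.append(f"{name}=={dist.version}")
    packages.sort()
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": packages,  # same "name==version" form as requirements.txt
    }


snapshot = capture_environment()
# Print only the start of the record; the package list can be long
print(json.dumps(snapshot, indent=2)[:400])
```

Writing such a snapshot next to the results of every run records which software versions produced them, which a hand-maintained requirements.txt may not reflect.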

  13. Introduction
    Example of reproducible analysis
    (Ragan-Kelley et al., 2012)
    Data availability
    The IPython notebooks and all data files referenced here are available at
    http://qiime.org/home_static/nih-cloud-apr2012/. The Amazon Machine
    Identifier used for these analyses is ami-9f69c1f6. Tutorials for using
    QIIME are available at http://www.qiime.org; tutorials for using the
    IPython notebook are available at
    http://ipython.org/ipython-doc/rel-0.13/index.html and
    http://ipython.org/videos.html; tutorials for using StarCluster are available
    at http://web.mit.edu/star/cluster/.

  14. Introduction
    A central theme: Capturing information

  15. Provenance
    Provenance

  16. Provenance
    Some definitions
    What is Provenance?
    Provenance is information about entities, activities, and people involved in
    producing a piece of data or thing, which can be used to form assessments
    about its quality, reliability or trustworthiness. (source: w3c)
    What is a ‘data model’?
    A data model is an abstract conceptual formulation of information that
    explicitly determines the structure of data and allows software and people
    to communicate and interpret data precisely. (source: wikipedia)

  17. Provenance
    What is PROV-DM?
    PROV-DM is the conceptual data model that forms a basis for the W3C
    provenance (PROV) family of specifications.
    PROV-DM provides a generic basis that captures relationships associated
    with the creation and modification of entities by activities and agents.

  18. Provenance
    PROV-DM components
    1 Entities and activities, and the time at which they were created, used,
    or ended
    2 Derivations of entities from entities
    3 Agents bearing responsibility for entities that were generated or
    activities that happened
    4 A notion of bundle, a mechanism to support provenance of provenance
    5 Properties to link entities that refer to the same thing
    6 Collections forming a logical structure for its members
    Source: PROV-DM
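
To make the first three components concrete, here is a minimal sketch of entities, activities and agents held in a plain Python structure, with derivations followed transitively. The class `ProvGraph`, its methods and the identifiers (`raw_scan`, `smoothing`, and so on) are hypothetical names chosen for illustration; only the relation names come from PROV. This is not the Prov Python library's API.

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    """Tiny in-memory provenance record: typed nodes plus named relations."""
    nodes: dict = field(default_factory=dict)      # identifier -> "entity" | "activity" | "agent"
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def add(self, kind, identifier):
        self.nodes[identifier] = kind
        return identifier

    def relate(self, relation, subject, obj):
        self.relations.append((relation, subject, obj))

    def lineage(self, entity):
        """Return every entity that `entity` was (transitively) derived from."""
        seen, stack = [], [entity]
        while stack:
            current = stack.pop()
            for relation, subject, obj in self.relations:
                if relation == "wasDerivedFrom" and subject == current and obj not in seen:
                    seen.append(obj)
                    stack.append(obj)
        return seen


g = ProvGraph()
raw = g.add("entity", "raw_scan")
smooth = g.add("entity", "smoothed_scan")
stats = g.add("entity", "stat_map")
step = g.add("activity", "smoothing")
me = g.add("agent", "analyst")

g.relate("used", step, raw)
g.relate("wasGeneratedBy", smooth, step)
g.relate("wasAssociatedWith", step, me)
g.relate("wasDerivedFrom", smooth, raw)
g.relate("wasDerivedFrom", stats, smooth)

print(g.lineage("stat_map"))  # ['smoothed_scan', 'raw_scan']
```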

  19. Provenance
    PROV-O starting-point terms (diagram)
    Classes: Entity, Activity, Agent
    Properties: used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf
    Timing (xsd:dateTime): startedAtTime, endedAtTime
    http://www.w3.org/TR/prov-o/diagrams/starting-points.svg

  20. Provenance
    PROV-O expanded terms (diagram)
    Classes: Entity, Activity, Agent, Collection, Bundle, Plan, Location, Person, SoftwareAgent, Organization
    Properties: value, hadMember, atLocation, wasStartedBy / wasEndedBy, wasInvalidatedBy, wasInfluencedBy / wasQuotedFrom / wasRevisionOf / hadPrimarySource, alternateOf / specializationOf
    Timing (xsd:dateTime): generatedAtTime, invalidatedAtTime
    http://www.w3.org/TR/prov-o/diagrams/expanded.svg

  21. Provenance
    Why PROV-DM?
    1 A formal, technology-agnostic representation of information (can be
    translated to RDF/XML/JSON)
    2 Machine-accessible structured representation of data
    3 Federated queries using SPARQL when represented as RDF
    4 Provenance is not an afterthought
    Captures data and metadata (about entities, activities and agents)
    within the same context
    5 A standard simplifies app development
    6 A W3C recommendation
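
The value of a machine-accessible representation shows up when provenance is queried. The sketch below stores statements as subject/predicate/object triples and answers a question with wildcard pattern matching. Plain Python matching is used here as a stand-in for SPARQL over RDF; the `match` helper and all `ex:` identifiers are hypothetical, while the `prov:` terms are PROV-O property names.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Data and metadata live side by side in the same set of statements
triples = [
    ("ex:stat_map", "rdf:type", "prov:Entity"),
    ("ex:glm_fit", "rdf:type", "prov:Activity"),
    ("ex:stat_map", "prov:wasGeneratedBy", "ex:glm_fit"),
    ("ex:glm_fit", "prov:used", "ex:smoothed_scan"),
    ("ex:glm_fit", "prov:startedAtTime", "2013-07-27T10:00:00"),
]

# "Which activity generated ex:stat_map, and what did it use?"
answers = []
for _, _, activity in match(triples, subject="ex:stat_map", predicate="prov:wasGeneratedBy"):
    inputs = [o for _, _, o in match(triples, subject=activity, predicate="prov:used")]
    answers.append((activity, inputs))

print(answers)  # [('ex:glm_fit', ['ex:smoothed_scan'])]
```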

  22. Python Tools
    Python Tools

  23. Python Tools
    Solutions for provenance tracking

  24. Python Tools
    1 IPython history
    2 Sumatra
    3 Synapse (with Python client)
    4 Prov Python library (with RDF extensions)

  25. Python Tools
    IPython history
    Notebook serves as a good summary of the final analysis
    Lacks history of things that you may have tried
    Needs user to craft the analysis so that it is re-usable
    Tracking history as a graph will require code changes
    Beyond the polished script
    “In most cases we know that the material contained in a paper or scientific
    communication is wrong. It is probably more useful when it is wrong than
    when it is right, because when it is wrong, you can do something to make
    it better. When it is right, it finishes the whole thing” (Bernal, 1960).

  26. Python Tools
    Sumatra
    a command-line interface, smt, for launching simulations/analyses
    with automatic recording of information about the experiment,
    annotating these records, linking to data files, etc.
    a web interface with a built-in web-server, smtweb, for browsing and
    annotating simulation/analysis results.
    a LaTeX package and Sphinx extension for including Sumatra-tracked
    figures and links to provenance information in papers and other
    documents.
    a Python API, on which smt and smtweb are based, that can be used
    in your own scripts in place of using smt, or could be integrated into
    a GUI-based application.

  27. Python Tools
    $ smt init myProject
    $ smt configure --simulator=python --main=main.py
    $ smt run default.param
    $ smt info
    $ smt comment "Ran something really useful"

  28. Python Tools

  29. Python Tools
    Synapse
    Uses a central store model
    Works with Projects, Folders, Files and Activities
    Conforms to a subset of W3C PROV
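    import numpy as np
    import synapseclient
    syn = synapseclient.Synapse()
    syn.login()  # assumes stored Synapse credentials
    wondrous_project = syn.get('syn1901847')
    stored_file = syn.get('syn1906479')
    my_array = np.loadtxt(stored_file.path)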

  30. Python Tools
    from synapseclient import Activity, File
    ## GET THE FILE WITH RESPONSES AND EXPRESSION DATA
    resp_file = syn.get('syn1906479')
    resp = np.loadtxt(resp_file.path, ...)
    expr_file = syn.get('syn1906480')
    expr = np.loadtxt(expr_file.path, ...)
    ## CODE WHICH WAS USED TO GENERATE THE P-VALUE HISTOGRAM
    code_file = syn.get('syn1917825')
    ## STORE IN OUR FOLDER WE CREATED EARLIER
    plot_file = File('hist.png', ...)
    plot_file = syn.store(plot_file)
    ## LINK THESE TOGETHER WITH PROVENANCE
    act = Activity(name="p-value histogram",
                   used=[expr_file, resp_file],
                   executed=code_file)
    plot_file = syn.store(plot_file, activity=act)

  31. Python Tools
    Prov Python Library
    Recent library to support the W3C PROV Data Model
    In spirit similar to Sumatra and Synapse
    All terms are formalized
    Supports all PROV-DM constructs
    PROV-N notation output can be converted to RDF/JSON/XML
    Base library on which applications can be built

  32. Python Tools
    import prov.model as prov
    g = prov.ProvBundle()
    e1 = g.entity('e1', [('foo:toast', "b")])
    a1 = g.activity('a1', startTime="2012-04-03T23:59:59Z")
    g.wasGeneratedBy(e1, a1, time="2012-04-03T23:59:59Z",
                     other_attributes=[('foo:jam', "d")])
    In [6]: print(g.get_provn())
    document
    entity(e1, [foo:toast="b"])
    activity(a1, 2012-04-03T23:59:59, -)
    wasGeneratedBy(e1, a1, 2012-04-03T23:59:59,
                   [foo:jam="d"])
    endDocument

  33. Python Tools
    In [8]: print(g.rdf().serialize(format='turtle'))
    @prefix ns1: .
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    a prov:Entity ;
        ns1:toast "b" ;
        prov:qualifiedGeneration [ a prov:Generation ;
            ns1:jam "d" ;
            prov:activity ;
            prov:time "2012-04-03T23:59:59"^^xsd:dateTime ] ;
        prov:wasGeneratedBy .
    a prov:Activity ;
        prov:startTime "2012-04-03T23:59:59"^^xsd:dateTime .

  34. Python Tools
    Python Workflow tools supporting provenance
    1 VisTrails
    Scientific Workflow and Provenance Management
    Manage rapidly evolving workflows
    Can be used via a graphical interface
    2 Nipype
    A brain imaging focused workflow environment
    Flexible semantics for scripting complex workflows

  35. Python Tools
    Common properties of workflows
    Processing pipeline is a graph (typically a DAG)
    Nodes are processes
    Edges represent data flow
    Compact representation for any process
    Much like a function, creates code and data separation
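
The properties above can be sketched in a few lines: processes are nodes, edges carry data, and a topological sort gives a valid execution order. This is a minimal illustration, assuming Python 3.9+ for the standard-library `graphlib` module; `run_pipeline` and the toy processes are hypothetical names and do not reflect how VisTrails or Nipype are implemented.

```python
from graphlib import TopologicalSorter


def run_pipeline(processes, edges, inputs):
    """Run each process once, in dependency order, passing results along edges.

    processes: name -> function taking a list of upstream results
    edges:     (upstream, downstream) pairs describing data flow
    inputs:    name -> initial data for nodes without upstream processes
    """
    upstream = {name: [] for name in processes}
    for src, dst in edges:
        upstream[dst].append(src)

    results = {}
    # static_order() raises CycleError if the graph is not a DAG
    for name in TopologicalSorter(upstream).static_order():
        args = [results[src] for src in upstream[name]] or [inputs[name]]
        results[name] = processes[name](args)
    return results


processes = {
    "load": lambda args: args[0],
    "double": lambda args: [2 * x for x in args[0]],
    "offset": lambda args: [x + 1 for x in args[0]],
    "merge": lambda args: [a + b for a, b in zip(*args)],
}
edges = [("load", "double"), ("load", "offset"), ("double", "merge"), ("offset", "merge")]
results = run_pipeline(processes, edges, inputs={"load": [1, 2, 3]})
print(results["merge"])  # [4, 7, 10]
```

Because the graph is data rather than control flow, the same description can be inspected, drawn or handed to a different executor without touching the processes themselves.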

  36. Python Tools
    Nipype: A Workflow environment for brain imaging
    (Gorgolewski et al., 2011)

  37. Python Tools
    Nipype components
    Interface: Wraps a program or function (Executables, MATLAB,
    Python, JAVA) and provides uniform semantics
    Node/MapNode: Wraps an Interface for use in a Workflow that
    provides caching and other goodies (e.g., pseudo-sandbox)
    Workflow: A graph or forest of graphs whose nodes are of type
    Node, MapNode or Workflow and whose edges represent data flow
    Plugin: A component that describes how a Workflow should be
    executed
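
The caching idea behind a Node can be illustrated by hashing a node's inputs and re-running the wrapped function only when that hash changes. The sketch below is a stand-alone illustration under that assumption; `CachedNode` is a hypothetical class that keeps results in memory and is not Nipype's actual implementation or API.

```python
import hashlib
import json


class CachedNode:
    """Wrap a function and skip re-execution when its inputs are unchanged."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self._cache = {}    # input hash -> stored result
        self.run_count = 0  # number of real executions, for illustration

    @staticmethod
    def _hash_inputs(inputs):
        # Canonical JSON so that keyword order does not change the hash
        payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
        return hashlib.sha512(payload).hexdigest()

    def run(self, **inputs):
        key = self._hash_inputs(inputs)
        if key not in self._cache:
            self._cache[key] = self.func(**inputs)
            self.run_count += 1
        return self._cache[key]


smooth = CachedNode("smooth", lambda data, fwhm: [round(x * fwhm, 2) for x in data])
first = smooth.run(data=[1.0, 2.0], fwhm=6)
second = smooth.run(fwhm=6, data=[1.0, 2.0])  # same inputs, different order: cached
third = smooth.run(data=[1.0, 2.0], fwhm=8)   # changed parameter: re-runs
print(smooth.run_count)  # 2
```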

  38. Python Tools
    Execution Plugins
    Allows seamless execution across many architectures
    Local
    Serial
    Multicore
    Clusters
    SSH (via IPython)
    PBS/Torque/SGE/LSF (native and via IPython)
    HTCondor

  39. Python Tools
    Attempts at provenance in Nipype
    Logging to file
    Restructured text output per interface
    Exporting the script
    Executable IPython notebooks
    Using Prov library and storing RDF in a file or triplestore

  40. Python Tools
    Example storage of provenance
    From the analysis of a single participant we get:
    429 statements from an interface
    runtime dependencies (fingerprint)
    inputs
    outputs
    md5/sha512 hashes and pointers to files
    6021 statements from the workflow
    includes relations between processes
    includes links to shared input output entities
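
The "hashes and pointers to files" item can be sketched with the standard library: read a file in chunks, compute both digests, and keep the path as the pointer. The function name `file_fingerprint` and the record layout are assumptions made for illustration, not the format Nipype writes.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1 << 20):
    """Return md5 and sha512 digests plus a pointer (absolute path) for a file."""
    md5, sha512 = hashlib.md5(), hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large imaging files do not have to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
    return {
        "path": os.path.abspath(path),
        "size_bytes": os.path.getsize(path),
        "md5": md5.hexdigest(),
        "sha512": sha512.hexdigest(),
    }


# Demonstrate on a throwaway file
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"hello")
record = file_fingerprint(tmp.name)
os.remove(tmp.name)
print(record["size_bytes"], record["md5"])
```

Storing the digest next to the path lets a later query detect that a file was changed or replaced even when its name stayed the same.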

  41. Python Tools
    What does this buy us?
    Rich information that a human can but won’t read
    Structured graphical information
    No need for building a relational schema
    Can be repurposed for applications
    execution duration could be used to instrument schedulers
    parametric failure modes can be tracked across large databases
    coupled with testing frameworks (e.g., testkraut)

  42. Challenges & Future directions
    Challenges & Future directions

  43. Challenges & Future directions
    Where are we now?

  44. Challenges & Future directions
    Where are we headed?
    NI-DM: a brain imaging extension of PROV-DM
    Relate to the Linked Data Web
    Federated queries of provenance and other metadata
    Common vocabulary across software
    Common data representation (perhaps RDF) to store metadata
    App instrumentation and development with built-in provenance
    tracking
    Cooperation is essential

  45. Challenges & Future directions
    Thanks
    Neuroimaging in Python Community
    Nipype contributors
    NeuroDebian
    International Neuroinformatics Coordinating Facility
    BIRN derived-data working group
    W3C PROV Working group
    NIH, INCF for support

  46. Challenges & Future directions
    The picture of the future
    (Bechhofer et al., 2013) http://www.researchobject.org/

  47. Challenges & Future directions
    Bechhofer, S., Buchan, I., Roure, D. D., Missier, P., Ainsworth, J., Bhagat, J., Couch, P., et al. (2013). Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2), 599–611. doi:10.1016/j.future.2011.08.004
    Bernal, J. D. (1960). Scientific information and its users. In Aslib Proceedings (Vol. 12, pp. 432–438). MCB UP Ltd.
    Donoho, D. L. (2010). An invitation to reproducible computational research. Biostatistics, 11(3), 385–388.
    Gorgolewski, K., Burns, C. D., Madison, C., Clark, D., Halchenko, Y. O., Waskom, M. L., & Ghosh, S. S. (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Frontiers in Neuroinformatics, 5.
    Kronick, D. A. (1990). Peer review in 18th-century scientific journalism. JAMA: the Journal of the American Medical Association, 263(10), 1321–1322.
    Poline, J.-B., Breeze, J. L., Ghosh, S., Gorgolewski, K., Halchenko, Y. O., Hanke, M., Haselgrove, C., et al. (2012). Data sharing in neuroimaging research. Frontiers in Neuroinformatics, 6.
    Ragan-Kelley, B., Walters, W. A., McDonald, D., Riley, J., Granger, B. E., Gonzalez, A., Knight, R., et al. (2012). Collaborative cloud-enabled tools allow rapid, reproducible biological insights. The ISME Journal.