Enabling knowledge generation and reproducible research by embedding provenance models in metadata stores

Reproducible research requires that information pertaining to all aspects of a research activity is captured and represented richly. However, most scientific domains, including neuroscience, capture only the pieces of information that are deemed relevant. In this talk, we provide an overview of the components necessary to create this information-rich landscape and describe a prototype platform for knowledge exploration. In particular, we focus on a technology-agnostic data provenance model as the core representation and on Semantic Web technologies that leverage such a representation. While the data and analysis methods are related to brain imaging, the same principles and architecture are applicable to any scientific domain.

Satrajit Ghosh

August 28, 2013

Transcript

  1. Enabling knowledge generation and reproducible research by embedding provenance models in metadata stores

    Satrajit S. Ghosh - satra@mit.edu, August 28, 2013
  2. (1) Knowledge Generation and Reproducible Analysis, (2) Provenance and Semantic Web Tools, (3) A Prototype Platform, (4) Challenges & Future directions
  3. Knowledge Generation and Reproducible Analysis
  4. Huh! How did that happen?

    Source: Timothy Lebo - http://bit.ly/lebo_cogsci_issues_2011
  5. Structured questions

    From journal articles published between 2008 and 2010, retrieve all brain volumes and ADOS scores of persons with autism spectrum disorder who are right-handed and under the age of 10. Rerun the analysis used in publication X on my data. Is the volume of the caudate nucleus smaller in persons with Obsessive Compulsive Disorder compared to controls? Find data-use agreements for openly accessible datasets used in articles by author Y.
  6. Why can’t we do this?

    There is no formal vocabulary to describe all entities, activities and agents in the domain, and vocabulary creation is a time-consuming process. Standardized provenance tracking tools are typically not integrated into scientific software, making the curation process time consuming, resource intensive, and error prone. Binary data formats do not provide standardized access to metadata. The actual data can vary in size from 1-bit survey answers to terabytes. In many research laboratories much of the derived data are deleted, keeping only the bits essential for publication. There are no standards for computational platforms.
  7. Why should we do this?

    A fundamental challenge in neuroscience is to integrate data across species, spatial scales (nanometers to inches), temporal scales (microseconds to years), instrumentation (e.g., electron microscopy, magnetic resonance imaging) and disorders (e.g., autism, schizophrenia). Datasets contain ad hoc metadata and are processed with methods specific to the sub-domain, limiting integration. The lack of shared and relevant metadata and the lack of provenance about data and computation in neuroscience preclude or complicate machine readability and reproducibility. Beyond the significant human effort needed to answer the previous queries, errors can arise from incomplete specification of data or methods, as well as from misinterpretation of methods.
  8. Research workflow in Brain Imaging

    Reproducibility can mean many things. (Poline et al., 2012)
  9. Reproducibility is complicated

    Components to reproduce 1. Participants: screening, inclusion and exclusion criteria; demographic matching. Experimental setup: stimuli; experiment control software.
  10. Components to reproduce 2

    Data acquisition: MR scanner; pulse sequences and reconstruction algorithms; cognitive or neuropsychological assessments. Data analysis: software tools; environments; quality control/assurance; analysis scripts; figure creation.
  11. Reproducibility is necessary

    “In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error. I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts. Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.” (Donoho, 2010)
  12. Aims of reproducible analysis

    Ability to reproduce analysis; increase accuracy; ability to verify analyses are consistent with intentions; ability to review analysis choices; increase clarity of communication; increased trustworthiness; ability for others to verify; extensibility; ability to easily modify and/or re-use existing analyses; contextualize; ability to establish bounds of a given application or within a given tolerance. Source: https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/talk.md
  13. Capturing research today

    The laboratory notebook (e.g., Documents, Google, Dropbox). Code: directories on filesystem; code repositories (e.g., GitHub, SourceForge). Data (e.g., databases, archives). Environments: Python requirements.txt; virtual machines; cloud (e.g., Amazon Web Services, Azure, Rackspace). Supplementary information: MIT DSpace; journal archives.
  14. Provenance and Semantic Web Tools
  15. A central theme: Capturing information
  16. What will this platform look like and enable?

    Provide a decentralized linked data and computational network. Encode information in standardized and machine-accessible form. View data from a provenance perspective, as products of activities or transformations carried out by people, software or machines. Allow any individual, laboratory, or institution to discover and share data and computational services, along with the provenance of that data. Immediately re-test an algorithm, re-validate results or test a new hypothesis on new data. Develop applications based on a consistent, federated query and update interface.
  17. Some definitions

    What is Provenance? Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. (source: W3C) What is a ‘data model’? A data model is an abstract conceptual formulation of information that explicitly determines the structure of data and allows software and people to communicate and interpret data precisely. (source: Wikipedia) What is PROV-DM? PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. PROV-DM provides a generic basis that captures relationships associated with the creation and modification of entities by activities and agents.
  18. PROV-DM components

    (1) Entities and activities, and the time at which they were created, used, or ended. (2) Derivations of entities from entities. (3) Agents bearing responsibility for entities that were generated or activities that happened. (4) A notion of bundle, a mechanism to support provenance of provenance. (5) Properties to link entities that refer to the same thing. (6) Collections forming a logical structure for its members. Source: PROV-DM
  19. Figure: PROV-O starting-point terms

    Classes: Entity, Activity, Agent. Properties: used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, wasInformedBy, startedAtTime (xsd:dateTime), endedAtTime (xsd:dateTime). Source: http://www.w3.org/TR/prov-o/diagrams/starting-points.svg
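
The starting-point terms above can be tried out without any RDF library. The following is a minimal sketch, assuming provenance statements are held as plain (subject, predicate, object) tuples; the `ex:` identifiers and both helper functions are hypothetical and are not part of the talk or of any PROV tooling.

```python
# Hypothetical example: every ex: identifier is invented for illustration.
# Each statement is a (subject, predicate, object) triple using PROV-O terms.
triples = [
    ("ex:raw_scan", "rdf:type", "prov:Entity"),
    ("ex:brain_volumes", "rdf:type", "prov:Entity"),
    ("ex:segmentation_run", "rdf:type", "prov:Activity"),
    ("ex:analyst", "rdf:type", "prov:Agent"),
    ("ex:segmentation_run", "prov:used", "ex:raw_scan"),
    ("ex:brain_volumes", "prov:wasGeneratedBy", "ex:segmentation_run"),
    ("ex:brain_volumes", "prov:wasDerivedFrom", "ex:raw_scan"),
    ("ex:segmentation_run", "prov:wasAssociatedWith", "ex:analyst"),
    ("ex:brain_volumes", "prov:wasAttributedTo", "ex:analyst"),
]


def lineage(entity, statements):
    """Follow prov:wasDerivedFrom links back from an entity to its sources."""
    sources = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in statements:
            if s == current and p == "prov:wasDerivedFrom" and o not in sources:
                sources.append(o)
                frontier.append(o)
    return sources


def to_lines(statements):
    """Render statements as Turtle-like lines (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


print(lineage("ex:brain_volumes", triples))
print(to_lines(triples))
```

Plain tuples are used only to keep the sketch dependency-free; a real deployment would more likely rely on an RDF library and a triple store.
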
  20. Figure: PROV-O expanded terms

    Classes: Entity, Activity, Agent, Collection, Location, Person, SoftwareAgent, Organization, Plan, Bundle. Properties: generatedAtTime (xsd:dateTime), invalidatedAtTime (xsd:dateTime), value, hadMember, wasStartedBy / wasEndedBy, wasInvalidatedBy, wasInfluencedBy / wasQuotedFrom / wasRevisionOf / hadPrimarySource, alternateOf / specializationOf, atLocation. Source: http://www.w3.org/TR/prov-o/diagrams/expanded.svg
  21. Why PROV-DM?

    Provenance is not an afterthought. Captures data and metadata (about entities, activities and agents) within the same context. A formal, technology-agnostic representation of machine-accessible structured information. Federated queries using SPARQL when represented as RDF. A W3C recommendation simplifies app development and allows integration with other future services.
  22. Semantic Web Tools

    The Semantic Web provides a common framework that allows data sharing and reuse, is based on the Resource Description Framework (RDF), and extends the principles of the Web from pages to machine-useful data. Data and descriptors are accessed using uniform resource identifiers (URIs). Unlike the traditional Web, the source and the target, along with the relationship itself, are unambiguously named with URIs and form a ‘triple’ of a subject, a relationship and an object: nif:tbi rdf:type nif:mental disorder . This flexible approach allows data to be easily added and the nature of the relations to evolve, resulting in an architecture that allows retrieving answers to more complex queries.
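
To make the triple idea concrete, here is a minimal sketch of pattern matching over triples in plain Python, where `None` plays the role of a query variable. The `match` helper, the `ex:` terms and the participant data are hypothetical, and the slide's object term is written with an underscore (`nif:mental_disorder`) only so that it forms a single token.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


# The triple from the slide, with the object written as one token.
graph = [("nif:tbi", "rdf:type", "nif:mental_disorder")]

# New statements are appended without changing any schema (invented data).
graph.append(("ex:participant_01", "ex:hasDiagnosis", "nif:tbi"))
graph.append(("ex:participant_01", "ex:age", "9"))

# "Who has a diagnosis of nif:tbi?"
diagnosed = match(graph, predicate="ex:hasDiagnosis", obj="nif:tbi")
print(diagnosed)
```

A SPARQL engine generalizes this by joining several such patterns through shared variables.
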
  23. RDF Example

        @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
        @prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .

        <http://www.w3.org/People/EM/contact#me>
            rdf:type contact:Person ;
            contact:fullName "Eric Miller" ;
            contact:mailbox <mailto:em@w3.org> ;
            contact:personalTitle "Dr." .
  24. SPARQL Protocol and RDF Query Language (SPARQL)

    A query language for RDF. Triples can be represented with compact syntaxes (e.g., Turtle). The queries are themselves similar in syntax. SPARQL 1.1 (official W3C recommendation in March, 2013). SPARQL allows users to write unambiguous queries. Supports federation: a query can be distributed to multiple SPARQL endpoints, computed, and the results gathered. A SPARQL client library can query a static RDF document or a SPARQL endpoint.
  25. SPARQL examples

        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name ?email
        WHERE {
          ?person a foaf:Person .
          ?person foaf:name ?name .
          ?person foaf:mbox ?email .
        }

        PREFIX abc: <http://example.com/exampleOntology#>
        SELECT ?capital ?country
        WHERE {
          ?x abc:cityname ?capital ;
             abc:isCapitalOf ?y .
          ?y abc:countryname ?country ;
             abc:isInContinent abc:Africa .
        }
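
A query like the first one above is typically sent to an endpoint over HTTP, and the answer is read back in the SPARQL JSON results format. The sketch below uses only the Python standard library and assumes a hypothetical endpoint at `http://example.org/sparql`; `build_request` and `rows` are illustrative helper names, and the request is built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows(results_json):
    """Flatten a SPARQL JSON results document into a list of dicts."""
    doc = json.loads(results_json)
    names = doc["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]


# Hypothetical endpoint; nothing is sent over the network here.
request = build_request("http://example.org/sparql", QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` and feeding the response body to `rows` would complete the round trip against a live endpoint.
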
  26. A Prototype Platform
  27. Requirements

    A standardized data model (NI-DM); provenance tracking (Prov + workflow tools); decentralized content creation and storage (workflow tools, RDF triples, triple stores); federated query (SPARQL).
  28. 1. A standardized data model

    Neuroimaging Data Model (NI-DM) (Keator et al., 2013). Based on PROV-DM, hence borrows the PROV ontology (PROV-O). Structured information encoding. Consistent vocabulary. Metadata standards via domain-specific object models.
  29. NIDM components

    Terms: a lexicon of all things brain imaging (e.g., DICOM terms, software-specific terms, statistic terms, paradigm terms). Object Models: structured information in brain imaging (e.g., directory structures, CSV/tab-delimited files, brain imaging file formats). Integrated provenance: how are entities generated or derived, and by what or whom?
  30. NIDM platform
  31. 2. Provenance tracking
  32. Provenance tracking tools in Python

    (1) IPython notebook, (2) Sumatra, (3) Synapse (with Python client), (4) Prov Python library (with RDF extensions). Similar tools exist for other languages, and some of the above systems allow HTTP-based tracking with a RESTful service API.
  33. Workflow tools supporting W3C PROV

    (1) Nipype: a brain imaging focused workflow environment; flexible semantics for scripting complex workflows. (2) VisTrails: scientific workflow and provenance management; manage rapidly evolving workflows; can be used via a graphical interface. (3) Taverna/Kepler/Galaxy (these support PROV or its precursor, the Open Provenance Model).
  34. Nipype: A Workflow environment for brain imaging

    (Gorgolewski et al., 2011)
  35. Attempts at provenance in Nipype

    Logging to file; restructured text output per interface; exporting the script; executable IPython notebooks; using the Prov library and storing RDF in a file or triplestore.
  36. 3. Decentralized content creation and storage

    Create and expose metadata where you do analysis. Register dataset with central authority. Use robots.
  37. Example storage of provenance

    From the Nipype analysis of a single participant we get: 429 statements/triples from a single interface/function (runtime dependencies, inputs, outputs, md5/sha512 hashes and pointers to files); 6021 statements/triples from the workflow (includes relations between processes; includes links to shared input output entities).
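
The "hashes and pointers to files" item can be illustrated with a short standard-library sketch. It is not Nipype's implementation: `file_digests`, `describe_output`, the `ex:` terms and the `ex:run_1` activity identifier are hypothetical, and only `prov:atLocation` and `prov:wasGeneratedBy` come from PROV-O.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Compute md5 and sha512 digests of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
    return {"md5": md5.hexdigest(), "sha512": sha512.hexdigest()}


def describe_output(path, activity_id):
    """Return provenance triples that point at a file and record its hashes."""
    digests = file_digests(path)
    # Hypothetical naming scheme: identify the entity by its md5 digest.
    entity_id = "ex:file_" + digests["md5"]
    return [
        (entity_id, "rdf:type", "prov:Entity"),
        (entity_id, "prov:atLocation", path),
        (entity_id, "ex:md5", digests["md5"]),
        (entity_id, "ex:sha512", digests["sha512"]),
        (entity_id, "prov:wasGeneratedBy", activity_id),
    ]
```

Calling `describe_output("stats.csv", "ex:run_1")` on an existing file (both names invented here) yields five statements; recording the digests lets a later reader check that the file a pointer refers to is still the one the analysis produced.
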
  38. 4. Federated Query using SPARQL on triplestores

        select ?id ?age ?vol ?viq ?dx
        where {
          ?c fs:subject_id ?id ;
             prov:hadMember ?e1 .
          ?sc prov:wasDerivedFrom ?e1 ;
              a nidm:FreeSurferStatsCollection ;
              prov:hadMember [ nidm:AnatomicalAnnotation ?annot ;
                               fs:Volume_mm3 ?vol ] .
          FILTER regex(?annot, "Right-Amy")
          SERVICE <http://computor.mit.edu:8890/sparql> {
            ?c2 nidm:ID ?id .
            ?c2 nidm:Age ?age .
            ?c2 nidm:Verbal_IQ ?viq .
            ?c2 nidm:DX ?dx .
          }
        }
        LIMIT 100
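
Conceptually, the SERVICE block is evaluated at the remote endpoint and its bindings are joined with the local ones on the shared variable ?id. The sketch below shows only that join step in plain Python; `join_on` is a hypothetical helper, all values are invented, and a real SPARQL engine does considerably more (query planning, typed values, optional bindings).

```python
def join_on(local_rows, remote_rows, key):
    """Inner-join two lists of variable bindings on a shared variable."""
    index = {}
    for row in remote_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in local_rows:
        for partner in index.get(row[key], []):
            joined.append({**row, **partner})
    return joined


# Invented bindings: volumes from the local store, demographics from the remote one.
local_rows = [
    {"id": "s01", "vol": 1520.0},
    {"id": "s02", "vol": 1610.5},
]
remote_rows = [
    {"id": "s01", "age": 9, "viq": 104, "dx": "control"},
    {"id": "s03", "age": 11, "viq": 98, "dx": "control"},
]

joined = join_on(local_rows, remote_rows, "id")
print(joined)
```

Only subjects present on both sides survive the join, which is why the federated query returns rows solely for identifiers known to both stores.
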
  39. Example applications

    Extracting data; Javascript; using PROV for determining relations; federated query; Python + Javascript.
  40. Javascript example

        select ?val (count(?s) as ?nsubjects)
        WHERE {
          ?c fs:subject_id ?s ;
             prov:hadMember ?e1 .
          ?sc prov:wasDerivedFrom ?e1 ;
              a nidm:FreeSurferStatsCollection ;
              prov:hadMember [ nidm:AnatomicalAnnotation ?annot ;
                               fs:Volume_mm3 ?val ] .
          FILTER regex(?annot, "Right-Amy")
        }
        GROUP BY ?val
  41. Output visualized directly via Javascript
  42. Federated query

        select ?id ?age ?vol ?viq ?dx
        where {
          ?c fs:subject_id ?id ;
             prov:hadMember ?e1 .
          ?sc prov:wasDerivedFrom ?e1 ;
              a nidm:FreeSurferStatsCollection ;
              prov:hadMember [ nidm:AnatomicalAnnotation ?annot ;
                               fs:Volume_mm3 ?vol ] .
          FILTER regex(?annot, "Right-Amy")
          SERVICE <http://computor.mit.edu:8890/sparql> {
            ?c2 nidm:ID ?id .
            ?c2 nidm:Age ?age .
            ?c2 nidm:Verbal_IQ ?viq .
            ?c2 nidm:DX ?dx .
          }
        }
        LIMIT 100
  43. Interactive CSV browser

    App call: http://localhost:5000/u?url=http://bit.ly/1atAL00 Scatterize: https://github.com/njvack/scatterize
  44. Challenges & Future directions
  45. What does all this buy us?

    Common vocabulary for communication. Rich structured information including provenance. Domain-specific object models that are embedded in the common structure. Data/content can be repurposed differentially for applications. Execution duration could be used to instrument schedulers. Parametric failure modes can be tracked across large databases. Determine “amount” of existing data on a particular topic.
  46. Where are we headed?

    Formalize NI-DM object models as extensions to PROV-O. Common vocabulary across software. Relate to the Linked Data Web (publications, authors, grants). Re-use existing vocabularies and ontologies. Integrate with existing databases. App instrumentation and development with built-in provenance tracking. Publish more structured data. Reproduce analysis on a VM with an existing analysis pathway.
  47. A lightweight decentralized architecture
  48. Thanks

    International Neuroinformatics Coordinating Facility; BIRN derived-data working group; Neuroimaging in Python community; W3C PROV Working Group; NIH, INCF for support.
  49. The picture of the future

    (Bechhofer et al., 2013) http://www.researchobject.org/
  50. References

    Bechhofer, S., Buchan, I., De Roure, D., Missier, P., Ainsworth, J., Bhagat, J., Couch, P., et al. (2013). Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2), 599–611. doi:10.1016/j.future.2011.08.004

    Donoho, D. L. (2010). An invitation to reproducible computational research. Biostatistics, 11(3), 385–388.

    Gorgolewski, K., Burns, C. D., Madison, C., Clark, D., Halchenko, Y. O., Waskom, M. L., & Ghosh, S. S. (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Frontiers in Neuroinformatics, 5.

    Keator, D. B., Helmer, K., Steffener, J., Turner, J. A., Van Erp, T. G., Gadde, S., Ashish, N., et al. (2013). Towards structured sharing of raw and derived neuroimaging data across existing resources. NeuroImage.

    Poline, J.-B., Breeze, J. L., Ghosh, S., Gorgolewski, K., Halchenko, Y. O., Hanke, M., Haselgrove, C., et al. (2012). Data sharing in neuroimaging research. Frontiers in Neuroinformatics, 6.