Enabling knowledge generation and reproducible research by embedding provenance models in metadata stores

Satrajit S. Ghosh - [email protected]
August 28, 2013

1. Knowledge Generation and Reproducible Analysis
2. Provenance and Semantic Web Tools
3. A Prototype Platform
4. Challenges & Future directions

Knowledge Generation and Reproducible Analysis

Huh! How did that happen?
  Source: Timothy Lebo - http://bit.ly/lebo_cogsci_issues_2011

Structured questions
  From journal articles published between 2008 and 2010, retrieve all brain volumes and ADOS scores of persons with autism spectrum disorder who are right handed and under the age of 10.
  Rerun the analysis used in publication X on my data.
  Is the volume of the caudate nucleus smaller in persons with Obsessive Compulsive Disorder compared to controls?
  Find data-use agreements for open-accessible datasets used in articles by author Y.

Why can’t we do this?
  There is no formal vocabulary to describe all entities, activities and agents in the domain, and vocabulary creation is a time-consuming process
  Standardized provenance tracking tools are typically not integrated into scientific software, making the curation process time consuming, resource intensive, and error prone
  Binary data formats do not provide standardized access to metadata
  The actual data can vary in size from 1-bit survey answers to terabytes
  In many research laboratories much of the derived data are deleted, keeping only the bits essential for publication
  There are no standards for computational platforms

Why should we do this?
  A fundamental challenge in neuroscience is to integrate data across species, spatial scales (nanometers to inches), temporal scales (microseconds to years), instrumentation (e.g., electron microscopy, magnetic resonance imaging) and disorders (e.g., autism, schizophrenia).
  Datasets contain ad hoc metadata and are processed with methods specific to the sub-domain, limiting integration.
  The lack of shared and relevant metadata and the lack of provenance about data and computation in neuroscience preclude or complicate machine readability and reproducibility.
  Beyond the significant human effort needed to answer the previous queries, errors can arise from incomplete specification of data or methods, as well as from misinterpretation of methods.

Research workflow in Brain Imaging
  Reproducibility can mean many things. (Poline et al., 2012)

Reproducibility is complicated
  Components to reproduce 1
    Participants
      Screening, inclusion and exclusion criteria
      Demographic matching
    Experimental setup
      Stimuli
      Experiment control software

Components to reproduce 2
  Data acquisition
    MR scanner
    Pulse sequences and reconstruction algorithms
    Cognitive or neuropsychological assessments
  Data analysis
    Software tools
    Environments
    Quality control/assurance
    Analysis scripts
    Figure creation

Reproducibility is necessary
  “In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error. I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts. Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.” (Donoho, 2010)

Aims of reproducible analysis
  Ability to reproduce analysis
  Increase accuracy
    Ability to verify analyses are consistent with intentions
    Ability to review analysis choices
  Increase clarity of communication
  Increased trustworthiness
    Ability for others to verify
  Extensibility
    Ability to easily modify and/or re-use existing analyses
  Contextualize
    Ability to establish bounds of a given application or within a given tolerance
  Source: https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/talk.md

Capturing research today
  The laboratory notebook (e.g., Documents, Google, Dropbox)
  Code
    Directories on filesystem
    Code repositories (e.g., Github, Sourceforge)
  Data (e.g., Databases, Archives)
  Environments
    Python requirements.txt
    Virtual Machines
    Cloud (e.g., Amazon Web Services, Azure, Rackspace)
  Supplementary information
    MIT DSpace
    Journal archives

Provenance and Semantic Web Tools

A central theme: Capturing information

What will this platform look like and enable?
  Provide a decentralized linked data and computational network
  Encode information in standardized and machine accessible form
  View data from a provenance perspective as products of activities or transformations carried out by people, software or machines
  Allow any individual, laboratory, or institution to discover and share data and computational services, along with the provenance of that data
  Immediately re-test an algorithm, re-validate results or test a new hypothesis on new data
  Develop applications based on a consistent, federated query and update interface

Some definitions
  What is Provenance?
    Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. (source: W3C)
  What is a ‘data model’?
    A data model is an abstract conceptual formulation of information that explicitly determines the structure of data and allows software and people to communicate and interpret data precisely. (source: Wikipedia)
  What is PROV-DM?
    PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. PROV-DM provides a generic basis that captures relationships associated with the creation and modification of entities by activities and agents.

PROV-DM components
  1. Entities and activities, and the time at which they were created, used, or ended
  2. Derivations of entities from entities
  3. Agents bearing responsibility for entities that were generated or activities that happened
  4. A notion of bundle, a mechanism to support provenance of provenance
  5. Properties to link entities that refer to the same thing
  6. Collections forming a logical structure for its members
  Source: PROV-DM

Figure: PROV-O starting-point terms
  Classes: Entity, Activity, Agent
  Properties: used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, wasInformedBy, startedAtTime (xsd:dateTime), endedAtTime (xsd:dateTime)
  Source: http://www.w3.org/TR/prov-o/diagrams/starting-points.svg
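
The starting-point terms above can be read as typed links between the three core classes. The following is a minimal sketch in Python, assuming the domains and ranges given in the PROV-O specification; the dictionary layout, the helper names, and the `ex:` identifiers are illustrative assumptions and not part of any PROV library.

```python
# Domain and range of the PROV-O starting-point object properties.
# The data layout and function names are illustrative assumptions.
STARTING_POINT_RELATIONS = {
    "wasGeneratedBy": ("Entity", "Activity"),
    "used": ("Activity", "Entity"),
    "wasDerivedFrom": ("Entity", "Entity"),
    "wasAttributedTo": ("Entity", "Agent"),
    "wasAssociatedWith": ("Activity", "Agent"),
    "actedOnBehalfOf": ("Agent", "Agent"),
    "wasInformedBy": ("Activity", "Activity"),
}


def check_statement(subject_type, relation, object_type):
    """Return True if the relation links the expected PROV classes."""
    domain, range_ = STARTING_POINT_RELATIONS[relation]
    return (subject_type, object_type) == (domain, range_)


# Hypothetical provenance of a derived brain-volume file.
TYPES = {
    "ex:t1_scan": "Entity",
    "ex:volume_stats": "Entity",
    "ex:segmentation_run": "Activity",
    "ex:analyst": "Agent",
}

STATEMENTS = [
    ("ex:segmentation_run", "used", "ex:t1_scan"),
    ("ex:volume_stats", "wasGeneratedBy", "ex:segmentation_run"),
    ("ex:volume_stats", "wasDerivedFrom", "ex:t1_scan"),
    ("ex:segmentation_run", "wasAssociatedWith", "ex:analyst"),
]


def invalid_statements(statements, types):
    """Collect statements whose subject or object has the wrong class."""
    return [
        (s, p, o)
        for s, p, o in statements
        if not check_statement(types[s], p, types[o])
    ]
```

Calling `invalid_statements(STATEMENTS, TYPES)` on the hypothetical statements returns an empty list, while a reversed statement such as an entity that `used` an activity would be reported.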

Figure: PROV-O expanded terms
  Classes: Entity, Activity, Agent, Collection, Location, Person, SoftwareAgent, Organization, Plan, Bundle
  Properties: generatedAtTime (xsd:dateTime), invalidatedAtTime (xsd:dateTime), value, hadMember, wasStartedBy / wasEndedBy, wasInvalidatedBy, wasInfluencedBy / wasQuotedFrom / wasRevisionOf / hadPrimarySource, alternateOf / specializationOf, atLocation
  Source: http://www.w3.org/TR/prov-o/diagrams/expanded.svg

Why PROV-DM?
  Provenance is not an afterthought
  Captures data and metadata (about entities, activities and agents) within the same context
  A formal, technology-agnostic representation of machine-accessible structured information
  Federated queries using SPARQL when represented as RDF
  A W3C recommendation simplifies app development and allows integration with other future services

Semantic Web Tools
  The Semantic Web provides a common framework that allows data sharing and reuse, is based on the Resource Description Framework (RDF), and extends the principles of the Web from pages to machine-useful data
  Data and descriptors are accessed using uniform resource identifiers (URIs)
  Unlike the traditional Web, the source and the target along with the relationship itself are unambiguously named with URIs and form a ‘triple’ of a subject, a relationship and an object
    nif:tbi rdf:type nif:mental disorder .
  This flexible approach allows data to be easily added and for the nature of the relations to evolve, resulting in an architecture that allows retrieving answers to more complex queries

RDF Example
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
    @prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#>.

    <http://www.w3.org/People/EM/contact#me>
        rdf:type contact:Person;
        contact:fullName "Eric Miller";
        contact:mailbox <mailto:[email protected]>;
        contact:personalTitle "Dr.".
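
To make the role of the `@prefix` lines concrete, here is a small sketch that expands the prefixed names of the example into full URIs and prints one N-Triples line per statement. It uses only the Python standard library; the function names are illustrative assumptions, literal escaping is not handled, and the mailbox statement is left out because its address is elided in this transcript.

```python
# Prefix table copied from the Turtle example.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "contact": "http://www.w3.org/2000/10/swap/pim/contact#",
}

SUBJECT = "http://www.w3.org/People/EM/contact#me"

# Each entry is (predicate, object, object_is_literal).
STATEMENTS = [
    ("rdf:type", "contact:Person", False),
    ("contact:fullName", "Eric Miller", True),
    ("contact:personalTitle", "Dr.", True),
]


def expand(name):
    """Turn a prefixed name such as rdf:type into a full URI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriples(subject, statements):
    """Serialize the statements as N-Triples lines (no literal escaping)."""
    lines = []
    for predicate, obj, is_literal in statements:
        obj_text = '"%s"' % obj if is_literal else "<%s>" % expand(obj)
        lines.append("<%s> <%s> %s ." % (subject, expand(predicate), obj_text))
    return lines


for line in to_ntriples(SUBJECT, STATEMENTS):
    print(line)
```

Every printed line is a complete subject, predicate, object statement, which is why triples from different sources can be merged simply by concatenating them.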

SPARQL Protocol and RDF Query Language (SPARQL)
  A query language for RDF
  Triples can be represented with compact syntaxes (e.g., Turtle)
  The queries are themselves similar in syntax
  SPARQL 1.1 (official W3C recommendation in March, 2013)
  SPARQL allows users to write unambiguous queries
  Supports federation: a query can be distributed to multiple SPARQL endpoints, computed and results gathered
  A SPARQL client library can query a static RDF document or a SPARQL endpoint

SPARQL examples
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?email
    WHERE {
      ?person a foaf:Person.
      ?person foaf:name ?name.
      ?person foaf:mbox ?email.
    }

    PREFIX abc: <http://example.com/exampleOntology#>
    SELECT ?capital ?country
    WHERE {
      ?x abc:cityname ?capital ;
         abc:isCapitalOf ?y .
      ?y abc:countryname ?country ;
         abc:isInContinent abc:Africa .
    }
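
How a query like the second example is answered can be sketched as repeated pattern matching over triples, with variables shared between patterns acting as joins. The code below is a toy illustration under stated assumptions: the triples and `ex:` identifiers are made up, and this naive matcher is not how a real SPARQL engine is implemented.

```python
# Made-up triples for illustration only.
TRIPLES = [
    ("ex:nairobi", "abc:cityname", "Nairobi"),
    ("ex:nairobi", "abc:isCapitalOf", "ex:kenya"),
    ("ex:kenya", "abc:countryname", "Kenya"),
    ("ex:kenya", "abc:isInContinent", "abc:Africa"),
    ("ex:paris", "abc:cityname", "Paris"),
    ("ex:paris", "abc:isCapitalOf", "ex:france"),
    ("ex:france", "abc:countryname", "France"),
    ("ex:france", "abc:isInContinent", "abc:Europe"),
]

# The WHERE clause of the capital query, one tuple per triple pattern.
PATTERNS = [
    ("?x", "abc:cityname", "?capital"),
    ("?x", "abc:isCapitalOf", "?y"),
    ("?y", "abc:countryname", "?country"),
    ("?y", "abc:isInContinent", "abc:Africa"),
]


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if term in extended and extended[term] != value:
                return None  # variable already bound to something else
            extended[term] = value
        elif term != value:
            return None  # constant term does not match
    return extended


def solve(patterns, triples):
    """Return all variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


def select(variables, bindings):
    """Project each binding onto the requested variables."""
    return [tuple(b[v] for v in variables) for b in bindings]


RESULT = select(["?capital", "?country"], solve(PATTERNS, TRIPLES))
print(RESULT)
```

With the toy data, only the Nairobi and Kenya pair survives the last pattern, because the French triples place the country in `abc:Europe`.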

A Prototype Platform

Requirements
  A standardized data model (NI-DM)
  Provenance tracking (Prov + workflow tools)
  Decentralized content creation and storage (Workflow tools, RDF triples, triple stores)
  Federated query (SPARQL)

1. A standardized data model
  Neuroimaging Data Model (NI-DM) (Keator et al., 2013)
  Based on PROV-DM, hence borrows the PROV ontology (PROV-O)
  Structured information encoding
  Consistent vocabulary
  Metadata standards via domain specific object models

NIDM components
  Terms
    A lexicon of all things brain imaging (e.g., DICOM terms, software specific terms, statistic terms, paradigm terms)
  Object Models
    Structured information in brain imaging (e.g., directory structures, CSV/Tab delimited files, brain imaging file formats)
  Integrated provenance
    How are entities generated or derived, and by what or whom?

NIDM platform

2. Provenance tracking

Provenance tracking tools in Python
  1. IPython notebook
  2. Sumatra
  3. Synapse (with Python client)
  4. Prov Python library (with RDF extensions)
  Similar tools exist for other languages and some of the above systems allow HTTP based tracking with a RESTful service API.
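
What such tools capture can be illustrated with a small decorator that records each function call as an activity, with start and end times, the inputs it used, and the value it generated. This is a hedged sketch using only the standard library: `track`, `PROVENANCE_LOG` and the record keys are hypothetical names chosen to echo PROV terms, and they are not the API of IPython, Sumatra, Synapse or the Prov library.

```python
import functools
from datetime import datetime, timezone

# In-memory log of activity records (a hypothetical structure).
PROVENANCE_LOG = []


def track(func):
    """Record every call of func as a PROV-like activity record."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        ended = datetime.now(timezone.utc)
        PROVENANCE_LOG.append(
            {
                "activity": func.__name__,
                "startedAtTime": started.isoformat(),
                "endedAtTime": ended.isoformat(),
                "used": {"args": list(args), "kwargs": dict(kwargs)},
                "generated": result,
            }
        )
        return result

    return wrapper


@track
def mean_volume(volumes):
    """Toy analysis step: average a list of volumes."""
    return sum(volumes) / len(volumes)


mean_volume([1.0, 2.0, 3.0])
print(PROVENANCE_LOG[0]["activity"], PROVENANCE_LOG[0]["generated"])
```

The design point is that tracking happens as a side effect of running the analysis, so no separate curation step is needed afterwards.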

Workflow tools supporting W3C PROV
  1. Nipype
    A brain imaging focused workflow environment
    Flexible semantics for scripting complex workflows
  2. VisTrails
    Scientific Workflow and Provenance Management
    Manage rapidly evolving workflows
    Can be used via a graphical interface
  3. Taverna/Kepler/Galaxy (or support the precursor to PROV, the Open Provenance Model)

Nipype: A Workflow environment for brain imaging
  (Gorgolewski et al., 2011)

Attempts at provenance in Nipype
  Logging to file
  Restructured text output per interface
  Exporting the script
  Executable IPython notebooks
  Using Prov library and storing RDF in a file or triplestore

3. Decentralized content creation and storage
  Create and expose metadata where you do analysis
  Register dataset with central authority
  Use robots

Example storage of provenance
  From the Nipype analysis of a single participant we get:
    429 statements/triples from a single interface/function
      runtime dependencies
      inputs
      outputs
      md5/sha512 hashes and pointers to files
    6021 statements/triples from the workflow
      includes relations between processes
      includes links to shared input output entities
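
The "hashes and pointers to files" part of such a record can be sketched with the standard library alone. In the sketch below the `ex:md5` and `ex:sha512` predicate names, the function names, and the throwaway demo file are assumptions for illustration; Nipype's actual provenance output uses its own vocabulary.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def file_digests(path, chunk_size=1 << 20):
    """Compute md5 and sha512 digests of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
    return {"md5": md5.hexdigest(), "sha512": sha512.hexdigest()}


def file_statements(path):
    """Describe a file as (subject, predicate, object) statements.

    The subject is a file:// URI pointing at the file; the ex:
    predicates are hypothetical names.
    """
    uri = Path(path).resolve().as_uri()
    digests = file_digests(path)
    return [(uri, "ex:" + name, value) for name, value in digests.items()]


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
DEMO_DIGESTS = file_digests(tmp.name)
DEMO_STATEMENTS = file_statements(tmp.name)
os.unlink(tmp.name)

for statement in DEMO_STATEMENTS:
    print(statement)
```

Storing the digest next to the pointer lets a later reader check that the file behind the URI is still the one the analysis used or produced.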

4. Federated Query using SPARQL on triplestores
    select ?id ?age ?vol ?viq ?dx
    where {
      ?c fs:subject_id ?id;
         prov:hadMember ?e1 .
      ?sc prov:wasDerivedFrom ?e1;
          a nidm:FreeSurferStatsCollection;
          prov:hadMember [ nidm:AnatomicalAnnotation ?annot;
                           fs:Volume_mm3 ?vol ] .
      FILTER regex(?annot, "Right-Amy")
      SERVICE <http://computor.mit.edu:8890/sparql> {
        ?c2 nidm:ID ?id .
        ?c2 nidm:Age ?age .
        ?c2 nidm:Verbal_IQ ?viq .
        ?c2 nidm:DX ?dx .
      }
    }
    LIMIT 100
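
In a federated query the client still talks to a single endpoint; the SERVICE clause asks that endpoint to evaluate part of the pattern remotely. A minimal sketch of the client side follows, assuming the SPARQL 1.1 Protocol convention of passing the query text in a `query` parameter of a GET request. The endpoint URL and the short query are placeholders, and the request is only built, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Placeholder endpoint and query, not taken from the talk.
ENDPOINT = "http://example.org/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint, query):
    """Build a GET request carrying the query in the `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format.
    headers = {"Accept": "application/sparql-results+json"}
    return Request(url, headers=headers)


REQUEST = build_sparql_request(ENDPOINT, QUERY)
print(REQUEST.full_url)
```

Passing `REQUEST` to `urllib.request.urlopen` would submit it to a live endpoint; that step is left out here so the sketch does not depend on network access.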

Example applications
  Extracting data
    Javascript
  Using PROV for determining relations
  Federated query
    Python + Javascript

Javascript example
    select ?val (count(?s) as ?nsubjects)
    WHERE {
      ?c fs:subject_id ?s;
         prov:hadMember ?e1 .
      ?sc prov:wasDerivedFrom ?e1;
          a nidm:FreeSurferStatsCollection;
          prov:hadMember [ nidm:AnatomicalAnnotation ?annot;
                           fs:Volume_mm3 ?val ] .
      FILTER regex(?annot, "Right-Amy")
    }
    GROUP BY ?val
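
Before results of a query like this can be plotted, the endpoint's response has to be turned into rows. The sketch below assumes the response uses the SPARQL 1.1 Query Results JSON format and is written in Python for consistency with the other sketches, although the talk's application does this in Javascript. The sample document and its values are made up.

```python
import json

# Made-up response in the SPARQL 1.1 Query Results JSON format.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["val", "nsubjects"]},
  "results": {
    "bindings": [
      {
        "val": {"type": "literal", "value": "1500.2"},
        "nsubjects": {
          "type": "literal",
          "datatype": "http://www.w3.org/2001/XMLSchema#integer",
          "value": "1"
        }
      },
      {
        "val": {"type": "literal", "value": "1620.7"},
        "nsubjects": {
          "type": "literal",
          "datatype": "http://www.w3.org/2001/XMLSchema#integer",
          "value": "2"
        }
      }
    ]
  }
}
"""


def result_rows(document):
    """Return (variable names, rows) from a SPARQL JSON results document.

    Unbound variables become None; values are kept as strings.
    """
    data = json.loads(document)
    names = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        row = tuple(
            binding[name]["value"] if name in binding else None
            for name in names
        )
        rows.append(row)
    return names, rows


NAMES, ROWS = result_rows(SAMPLE_RESPONSE)
print(NAMES, ROWS)
```

Each row pairs a volume with the number of subjects reporting it, which is the shape a histogram or scatter plot needs.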

Output visualized directly via Javascript

Federated query
    select ?id ?age ?vol ?viq ?dx
    where {
      ?c fs:subject_id ?id;
         prov:hadMember ?e1 .
      ?sc prov:wasDerivedFrom ?e1;
          a nidm:FreeSurferStatsCollection;
          prov:hadMember [ nidm:AnatomicalAnnotation ?annot;
                           fs:Volume_mm3 ?vol ] .
      FILTER regex(?annot, "Right-Amy")
      SERVICE <http://computor.mit.edu:8890/sparql> {
        ?c2 nidm:ID ?id .
        ?c2 nidm:Age ?age .
        ?c2 nidm:Verbal_IQ ?viq .
        ?c2 nidm:DX ?dx .
      }
    }
    LIMIT 100

Interactive csv browser
  App call: http://localhost:5000/u?url=http://bit.ly/1atAL00
  Scatterize: https://github.com/njvack/scatterize

Challenges & Future directions

What does all this buy us?
  Common vocabulary for communication
  Rich structured information including provenance
  Domain specific object models that are embedded in the common structure
  Data/Content can be repurposed differentially for applications
  Execution duration could be used to instrument schedulers
  Parametric failure modes can be tracked across large databases
  Determine “amount” of existing data on a particular topic

Where are we headed?
  Formalize NI-DM object models as extensions to PROV-O
  Common vocabulary across software
  Relate to the Linked Data Web
    Publications, authors, grants
  Re-use existing vocabularies and ontologies
  Integrate with existing databases
  App instrumentation and development with built in provenance tracking
  Publish more structured data
  Reproduce analysis on a VM with an existing analysis pathway

A lightweight decentralized architecture

Thanks
  International Neuroinformatics Coordinating Facility
  BIRN derived-data working group
  Neuroimaging in Python Community
  W3C PROV Working group
  NIH, INCF for support

The picture of the future
  (Bechhofer et al., 2013) http://www.researchobject.org/

Bechhofer, S., Buchan, I., Roure, D. D., Missier, P., Ainsworth, J., Bhagat, J., Couch, P., et al. (2013). Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2), 599–611. doi:10.1016/j.future.2011.08.004
Donoho, D. L. (2010). An invitation to reproducible computational research. Biostatistics, 11(3), 385–388.
Gorgolewski, K., Burns, C. D., Madison, C., Clark, D., Halchenko, Y. O., Waskom, M. L., & Ghosh, S. S. (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Frontiers in Neuroinformatics, 5.
Keator, D. B., Helmer, K., Steffener, J., Turner, J. A., Van Erp, T. G., Gadde, S., Ashish, N., et al. (2013). Towards structured sharing of raw and derived neuroimaging data across existing resources. NeuroImage.
Poline, J.-B., Breeze, J. L., Ghosh, S., Gorgolewski, K., Halchenko, Y. O., Hanke, M., Haselgrove, C., et al. (2012). Data sharing in neuroimaging research. Frontiers in Neuroinformatics, 6.
