Linking Knowledge and Reproducible Research Via Standardized Provenance Models

UCLA symposium on Tools for Integrating and Planning Experiments in Neuroscience


Satrajit Ghosh

March 10, 2014


  1. Linking Knowledge and Reproducible Research Via Standardized Provenance Models
    Satrajit Ghosh, Massachusetts Institute of Technology
  2. Outline
    • Knowledge and Reproducibility
    • Provenance and Semantic Web Tools
    • Tools and Infrastructure
  4. Neuroscience data (source: Sean Hill)

  6. Two approaches
    Computational modeling, Brain imaging
    Knowledge driven, Data driven

  7. How do we plan experiments?
    We need to know:
    - What has been done?
    - How was it done?
    - Why was it done?
    What are the current sources?
    - Publications
    - Unpublished experiments
    - Conversations
  8. The dataflow in brain imaging Poline, Breeze, Ghosh, et al.

  10. Verification
    - Method sections
    - Fit general knowledge
    - Implicit trust (cf. Donoho)
    - Stochastic verification: many studies
  11. Example 1: Knowledge piece(s)

  12. Example 1: Variation in white matter in stuttering
    Cai, Tourville, Perkell, Guenther, Ghosh (2014): variation in locations of reported white matter differences between people who stutter and people with fluent speech.
  13. Example 2: Knowledge piece(s)

  14. Example 2: Variation in functional analysis
    Carp (2013): variation in functional activity when the same task data were analyzed by different workflows.
  15. Example 3: Knowledge piece(s)

  16. xkcd:242

  18. Example 3: Underpowered studies
    Lefebvre, Beggiato, Bourgeron, Toro (2014): corpus callosum volume difference between ASD and controls could not be replicated in a cohort of 694 participants.
  19. Knowledge gap
    Knowledge can be represented as a collection of assertions.
    Typically these assertions are derived from data transformations: experiments, analyses, simulations and theory.
    The information supporting such assertions is often sparse, cryptic, incorrect, or unavailable.
    [Figure labels: Data, Tools, Knowledge, People]
  20. Missing info; Assertion in publication; Provenance. Source: Timothy Lebo
  23. The dataflow in brain imaging Poline, Breeze, Ghosh, et al.

  24. Ubiquity of error in scientific computing
    “In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error.
    I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts.
    Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.” Donoho (2010)
  25. Houston, we have a problem
    Limited reliability
    Limited data

  26. How would reproducibility help?
    Increase accuracy
    - enable verification that analyses are consistent with intentions
    - enable review of analysis choices
    Increase credibility
    - enable others to verify
    Increase reusability
    - enable easy modification and/or re-use of existing analyses
    Adapted from:
  27. How can reproducibility and provenance help?
    • From journal articles published between 2008 and 2010, retrieve all brain volumes and ADOS scores of persons with autism spectrum disorder who are right handed and under the age of 10.
    • Rerun the analysis used in publication X on my data.
    • Is the volume of the caudate nucleus smaller in persons with Obsessive Compulsive Disorder compared to controls?
    • Find data-use agreements for openly accessible datasets used in articles by author Y.
  28. Why should we attempt to do this?
    Beyond the significant human effort needed to answer the previous queries, errors can arise from incomplete specification of data or methods, as well as from misinterpretation of methods.
  30. Why is it difficult?
    • Datasets contain ad hoc metadata and are processed with methods specific to the sub-domain, limiting integration.
    • Provenance tracking tools are typically not integrated into scientific software, making the curation process time consuming, resource intensive, and error prone.
    • In many research laboratories much of the derived data are deleted, keeping only the bits essential for publication.
    • There are no standards for computational platforms.
    • Research involves proprietary tools and binary data formats that are harder to instrument.
    • There is no formal vocabulary to describe all entities, activities and agents in the domain, and vocabulary creation is a time-consuming process.
    • Information is behind barriers.
  31. State of provenance encapsulation

  32. Capturing research today
    • The laboratory notebook (e.g., Documents, Google, Dropbox)
    • Code
      • Directories on filesystem
      • Code repositories (e.g., Github, Sourceforge)
    • Data (e.g., Databases, Archives)
    • Environments
      • Python requirements.txt
      • Virtual Machines
      • Cloud (e.g., Amazon Web Services, Azure, Rackspace)
    • Supplementary information
      • MIT DSpace
      • Journal archives
  33. Outline
    • Knowledge and Reproducibility
    • Provenance and Semantic Web Tools
    • Tools and Infrastructure
  34. Some definitions
    What is Provenance?
    Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. (source: W3C)
  35. Some definitions
    What is a ‘data model’?
    A data model is an abstract conceptual formulation of information that explicitly determines the structure of data and allows software and people to communicate and interpret data precisely. (source: wikipedia)
    What is PROV-DM?
    PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. PROV-DM provides a generic basis that captures relationships associated with the creation and modification of entities by activities and agents.
  36. PROV-DM components
    1. Entities and activities, and the time at which they were created, used, or ended
    2. Derivations of entities from entities
    3. Agents bearing responsibility for entities that were generated or activities that happened
    4. A notion of bundle, a mechanism to support provenance of provenance
    5. Properties to link entities that refer to the same thing
    6. Collections forming a logical structure for its members
    Source: PROV-DM
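A minimal sketch of how the first three components (entities, activities, agents and the relations between them) might be held in memory. This is an illustration only, not a PROV library: the class names `Relation` and `ProvRecord`, the `lineage` helper, and every `ex:` identifier are hypothetical. Only the relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasAssociatedWith`) come from PROV-DM.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One provenance statement, e.g. wasGeneratedBy(entity, activity)."""
    kind: str
    subject: str
    obj: str


@dataclass
class ProvRecord:
    """A tiny in-memory provenance record (illustrative, not a PROV library)."""
    relations: list = field(default_factory=list)

    def add(self, kind, subject, obj):
        self.relations.append(Relation(kind, subject, obj))

    def lineage(self, entity):
        """List every entity that `entity` was derived from, in discovery order."""
        found, stack = [], [entity]
        while stack:
            current = stack.pop()
            for rel in self.relations:
                if (rel.kind == "wasDerivedFrom" and rel.subject == current
                        and rel.obj not in found):
                    found.append(rel.obj)
                    stack.append(rel.obj)
        return found


# Hypothetical brain-imaging example; all identifiers are placeholders.
prov = ProvRecord()
prov.add("used", "ex:skull_strip", "ex:raw_scan")
prov.add("wasGeneratedBy", "ex:brain_mask", "ex:skull_strip")
prov.add("wasDerivedFrom", "ex:brain_mask", "ex:raw_scan")
prov.add("used", "ex:measure_volume", "ex:brain_mask")
prov.add("wasGeneratedBy", "ex:volume_table", "ex:measure_volume")
prov.add("wasDerivedFrom", "ex:volume_table", "ex:brain_mask")
prov.add("wasAssociatedWith", "ex:skull_strip", "ex:analyst")

sources = prov.lineage("ex:volume_table")
print(sources)
```

Following `wasDerivedFrom` links backwards is one way to answer "how was it done?" for a reported result: the final table traces back through the intermediate mask to the raw scan.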
  37. Why PROV-DM?
    • Provenance is not an afterthought
    • Captures data and metadata (about entities, activities and agents) within the same context
    • A formal, technology-agnostic representation of machine-accessible structured information
    • Federated queries using SPARQL when represented as RDF
    • A W3C recommendation simplifies app development and allows integration with other future services
  38. The Semantic Web
    • The Semantic Web provides a common framework that allows data sharing and reuse, is based on the Resource Description Framework (RDF), and extends the principles of the Web from pages to machine-useful data
    • Data and descriptors are accessed using uniform resource identifiers (URIs)
    • Unlike the traditional Web, the source and the target along with the relationship itself are unambiguously named with URIs and form a ‘triple’ of a subject, a relationship and an object:
      nif:tbi rdf:type nif:mental_disorder .
    • This flexible approach allows data to be easily added and for the nature of the relations to evolve, resulting in an architecture that allows retrieving answers to more complex queries
  39. The Semantic Web source:

  40. The Semantic Web
    :satra :a :person .
    :satra :works_at :mit .
    :satra :attending :workshop1 .
    :workshop1 :at :ucla .
    :satra :knows :maryann .
    URIs
    :satra <>
    :satra <>
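The triples on this slide can be treated as plain (subject, predicate, object) tuples. The sketch below assumes nothing beyond the Python standard library; the `match` helper is a hypothetical name introduced for illustration, with `None` acting as a wildcard.

```python
# Each triple is (subject, predicate, object); names mirror the slide's example.
TRIPLES = {
    (":satra", ":a", ":person"),
    (":satra", ":works_at", ":mit"),
    (":satra", ":attending", ":workshop1"),
    (":workshop1", ":at", ":ucla"),
    (":satra", ":knows", ":maryann"),
}


def match(triples, s=None, p=None, o=None):
    """Return the sorted triples that fit the pattern; None matches anything."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything asserted about :satra.
about_satra = match(TRIPLES, s=":satra")

# Follow two links: where is the event that :satra is attending?
locations = [
    place
    for _, _, event in match(TRIPLES, s=":satra", p=":attending")
    for _, _, place in match(TRIPLES, s=event, p=":at")
]
print(locations)
```

Because every statement has the same three-part shape, new facts can be added without changing a schema, and a question that spans several statements becomes a chain of pattern matches.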
  41. The Linked Open Data Web

  43. The Knowledge Graph

  44. SPARQL
    • A query language for RDF
    • SPARQL 1.1 (official W3C recommendation in March 2013)
    • SPARQL allows users to write unambiguous queries
    • Supports federation:
      • a query can be distributed to multiple SPARQL endpoints, computed, and the results gathered
    • A SPARQL client library can query a static RDF document or a SPARQL endpoint
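To make the federation idea concrete, here is a toy sketch in plain Python. It is not a SPARQL engine: it only evaluates a list of triple patterns with `?variables` against in-memory graphs and gathers the results per source. The function names, the two "endpoints", and the `:maryann :attending :workshop1` triple are assumptions made for illustration. The pattern corresponds roughly to the SPARQL body `?who :attending ?event . ?event :at :ucla .`

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so `pattern` equals `triple`; return None on conflict."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable binds on first use and must agree afterwards.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(graph, patterns):
    """Evaluate a list of triple patterns against a single graph."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


def federated_query(endpoints, patterns):
    """Send the same patterns to every endpoint and gather the answers."""
    results = []
    for name, graph in endpoints.items():
        for binding in query(graph, patterns):
            results.append({"endpoint": name, **binding})
    return results


# Two hypothetical endpoints, each holding part of the data.
endpoints = {
    "endpoint_a": [
        (":satra", ":attending", ":workshop1"),
        (":workshop1", ":at", ":ucla"),
        (":satra", ":works_at", ":mit"),
    ],
    "endpoint_b": [
        (":maryann", ":attending", ":workshop1"),
        (":workshop1", ":at", ":ucla"),
    ],
}

patterns = [("?who", ":attending", "?event"), ("?event", ":at", ":ucla")]
results = federated_query(endpoints, patterns)
print([r["?who"] for r in results])
```

In a real deployment the graphs would live behind SPARQL endpoints and the query would be written in SPARQL itself; the sketch only shows the distribute, compute, and gather steps named on the slide.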
  45. Outline
    • Knowledge and Reproducibility
    • Provenance and Semantic Web Tools
    • Tools and Infrastructure
  46. What will this platform look like and enable?
    • Encode information in standardized and machine-accessible form
    • View data from a provenance perspective
      • as products of activities or transformations carried out by people, software or machines
    • Allow any individual, laboratory, or institution to discover and share data and computational services
    • Immediately re-test an algorithm, re-validate results or test a new hypothesis on new data
    • Develop applications based on a consistent, federated query and update interface
    • Provide a decentralized linked data and computational network
  47. Reproducibility in brain imaging Poline, Breeze, Ghosh, et al. (2012)

  48. Integrated tools and resources Interoperability

  49. Common vocabulary and data standards

  50. Standards

  51. Data sharing platforms

  52. Open source software

  53. Nipype: Interoperable and reproducible computing Gorgolewski et al., 2012

  54. Neurosynth: Automated meta-analysis Yarkoni et al., 2012

  55. BrainSpell: Human curation of literature

  56. Neurovault: Result sharing

  57. Community Q+A platform

  58. CIRRUSScience: Cloud Infrastructure for Reproducible Research and Utilities for Sharing Science
    Apps Metadata Data Objects
  61. CIRRUSScience: a bitly demo http://localhost:5000/u?url=

  62. Research objects Bechhofer et al., 2013

  63. Research objects

  64. Research objects Bechhofer et al., 2013

  65. Scientific automation
    Adam: A robot scientist that “autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation.”
    Abe: Identify the full dynamical model from scratch without prior knowledge or structural assumptions. … The method performed well to high levels of noise for most states, could identify the correct model de novo, and make better predictions than ordinary parametric regression and neural network models.
    Adam: King et al. (2009); Abe: Schmidt et al. (2011)
  66. What does all this infrastructure buy us?
    • Common vocabulary for communication
    • Rich structured information including provenance
    • Domain specific object models that are embedded in the common structure
    • Data/Content can be repurposed differentially for applications
    • Execution duration could be used to instrument schedulers
    • Parametric failure modes can be tracked across large databases
    • Determine “amount” of existing data on a particular topic
  67. What does all this infrastructure buy us? Trust

  68. Move away from our current approach (source: wikipedia)

  69. Summary
    Knowledge comes from transformations.
    Transformations are often lossy.
    Provenance tracking can:
    - improve credibility (reduce loss)
    - aggregate information
    - extract knowledge
    CIRRUSScience aims to integrate computational resources and data using common object models.
  70. Thank you