
Linking Knowledge and Reproducible Research Via Standardized Provenance Models

UCLA symposium on Tools for Integrating and Planning Experiments in Neuroscience

Satrajit Ghosh

March 10, 2014
Transcript

  1. Linking Knowledge and Reproducible Research Via Standardized Provenance Models
    satrajit ghosh [email protected]
    massachusetts institute of technology

  2. Outline
    - Knowledge and Reproducibility
    - Provenance and Semantic Web Tools
    - Tools and Infrastructure


  4. Neuroscience data
    source: Sean Hill



  6. Two approaches
    - Computational modeling: knowledge driven
    - Brain imaging: data driven

  7. How do we plan experiments?
    We need to know:
    - What has been done?
    - How was it done?
    - Why was it done?

    What are the current sources?
    - Publications
    - Unpublished experiments
    - Conversations

  8. The dataflow in brain imaging
    Poline, Breeze, Ghosh, et al. (2012)



  10. Verification
    - Method sections
    - Fit general knowledge
    - Implicit trust (cf. Donoho)

    - Stochastic verification
      - many studies

  11. Example 1: Knowledge piece(s)


  12. Example 1: Variation in white matter in stuttering
    Cai, Tourville, Perkell, Guenther, Ghosh (2014)
    Variation in locations of reported white matter differences between people who stutter and people with fluent speech

  13. Example 2: Knowledge piece(s)


  14. Example 2: Variation in functional analysis
    Carp (2013)
    Variation in functional activity when the same task data were analyzed by different workflows

  15. Example 3: Knowledge piece(s)


  16. Example 3: Underpowered studies
    Lefebvre, Beggiato, Bourgeron, Toro (2014)


  17. Example 3: Underpowered studies
    Lefebvre, Beggiato, Bourgeron, Toro (2014)
    Corpus callosum volume difference between ASD and controls could not be replicated in a cohort of 694 participants.

  18. Knowledge gap
    Knowledge can be represented as a collection of assertions.

    Typically these assertions are derived from data transformations: experiments, analyses, simulations and theory.

    The information supporting such assertions is often sparse, cryptic, incorrect, or unavailable.

    (Diagram: Data, Tools, Knowledge, People)

  19. Missing info
    Assertion in publication
    Provenance
    Source: Timothy Lebo - http://bit.ly/lebo_cogsci_issues_2011


  22. The dataflow in brain imaging
    Poline, Breeze, Ghosh, et al. (2012)


  23. Ubiquity of error in scientific computing
    “In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error.

    I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts.

    Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.”
    Donoho (2010)

  24. Houston, we have a problem
    Limited reliability

    Limited data

  25. How would reproducibility help?
    Increase accuracy
    - enable verification that analyses are consistent with intentions
    - enable review of analysis choices

    Increase credibility
    - enable others to verify

    Increase reusability
    - enable easy modification and/or re-use of existing analyses
    Adapted from: https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/talk.md

  26. How can reproducibility and provenance help?
    • From journal articles published between 2008 and 2010, retrieve all brain volumes and ADOS scores of persons with autism spectrum disorder who are right-handed and under the age of 10.

    • Rerun the analysis used in publication X on my data.

    • Is the volume of the caudate nucleus smaller in persons with Obsessive Compulsive Disorder compared to controls?

    • Find data-use agreements for openly accessible datasets used in articles by author Y.

  27. Why should we attempt this?
    Beyond the significant human effort required to answer the previous queries, errors can arise from incomplete specification of data or methods, as well as from misinterpretation of methods.

  28. Why is it difficult?
    • Datasets contain ad hoc metadata and are processed with methods specific to the sub-domain, limiting integration.
    • Provenance tracking tools are typically not integrated into scientific software, making the curation process time-consuming, resource-intensive, and error-prone.
    • In many research laboratories much of the derived data are deleted, keeping only the bits essential for publication.
    • There are no standards for computational platforms.
    • Research involves proprietary tools and binary data formats that are harder to instrument.

    • There is no formal vocabulary to describe all entities, activities and agents in the domain, and vocabulary creation is a time-consuming process.
    • Information is behind barriers.

  29. State of provenance encapsulation


  30. Capturing research today
    • The laboratory notebook (e.g., Documents, Google, Dropbox)
    • Code
    • Directories on filesystem
    • Code repositories (e.g., GitHub, SourceForge)
    • Data (e.g., databases, archives)
    • Environments
      • Python requirements.txt
      • Virtual machines
      • Cloud (e.g., Amazon Web Services, Azure, Rackspace)
    • Supplementary information
      • MIT DSpace
      • Journal archives

  31. Outline
    - Knowledge and Reproducibility
    - Provenance and Semantic Web Tools
    - Tools and Infrastructure

  32. Some definitions
    What is provenance?
    Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. (source: W3C)

  33. Some definitions
    What is a ‘data model’?
    A data model is an abstract conceptual formulation of information that explicitly determines the structure of data and allows software and people to communicate and interpret data precisely. (source: Wikipedia)

    What is PROV-DM?
    PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. PROV-DM provides a generic basis that captures relationships associated with the creation and modification of entities by activities and agents.

  34. PROV-DM components
    1. Entities and activities, and the time at which they were created, used, or ended
    2. Derivations of entities from entities
    3. Agents bearing responsibility for entities that were generated or activities that happened
    4. A notion of bundle, a mechanism to support provenance of provenance
    5. Properties to link entities that refer to the same thing
    6. Collections forming a logical structure for their members

    Source: PROV-DM
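The components above can be sketched in code. The following is a minimal, illustrative Python encoding of PROV-DM's core record types and relations; all class and identifier names (`ProvBundle`, `ex:stat_map`, etc.) are invented for this sketch, and a real implementation would use a PROV serialization and library rather than bare tuples:

```python
from dataclasses import dataclass, field

# Illustrative encoding of PROV-DM components 1-4:
# entities, activities, agents, and the relations among them.

@dataclass(frozen=True)
class Entity:
    id: str

@dataclass(frozen=True)
class Activity:
    id: str
    started: str = ""   # ISO timestamps in a real record
    ended: str = ""

@dataclass(frozen=True)
class Agent:
    id: str

@dataclass
class ProvBundle:
    # Component 4: a bundle groups provenance statements so the bundle
    # itself can carry provenance ("provenance of provenance").
    statements: list = field(default_factory=list)

    def used(self, a: Activity, e: Entity):
        self.statements.append(("used", a.id, e.id))

    def was_generated_by(self, e: Entity, a: Activity):
        self.statements.append(("wasGeneratedBy", e.id, a.id))

    def was_derived_from(self, e2: Entity, e1: Entity):
        # Component 2: derivation of one entity from another.
        self.statements.append(("wasDerivedFrom", e2.id, e1.id))

    def was_attributed_to(self, e: Entity, ag: Agent):
        # Component 3: agents bearing responsibility.
        self.statements.append(("wasAttributedTo", e.id, ag.id))

# A toy imaging analysis: raw scan -> preprocessing -> statistical map.
raw = Entity("ex:raw_scan")
stat = Entity("ex:stat_map")
preproc = Activity("ex:preprocessing")
analyst = Agent("ex:analyst")

bundle = ProvBundle()
bundle.used(preproc, raw)
bundle.was_generated_by(stat, preproc)
bundle.was_derived_from(stat, raw)
bundle.was_attributed_to(stat, analyst)
```

Even this toy record is enough to answer "how was the statistical map produced, from what, and by whom", which is exactly the information the slides note is usually lost.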

  35. Why PROV-DM?
    • Provenance is not an afterthought
    • Captures data and metadata (about entities, activities and agents) within the same context

    • A formal, technology-agnostic representation of machine-accessible structured information

    • Federated queries using SPARQL when represented as RDF

    • A W3C recommendation simplifies app development and allows integration with other future services

  36. The Semantic Web
    • The Semantic Web provides a common framework that allows data sharing and reuse, is based on the Resource Description Framework (RDF), and extends the principles of the Web from pages to machine-useful data
    • Data and descriptors are accessed using uniform resource identifiers (URIs)
    • Unlike the traditional Web, the source and the target along with the relationship itself are unambiguously named with URIs and form a ‘triple’ of a subject, a relationship and an object:
      nif:tbi rdf:type nif:mental_disorder .
    • This flexible approach allows data to be easily added and the nature of the relations to evolve, resulting in an architecture that allows retrieving answers to more complex queries

  37. The Semantic Web
    source: http://blog.soton.ac.uk/hive/2012/05/10/recommendation-system-of-hive/


  38. The Semantic Web
    :satra a :person .
    :satra :works_at :mit .
    :satra :attending :workshop1 .
    :workshop1 :at :ucla .
    :satra :knows :maryann .
    (Diagram: URIs such as :satra name the same resource across statements)
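The triples above can be sketched as a tiny in-memory store with wildcard matching. This is a hypothetical stand-in for an RDF store, purely for illustration; the `match` helper is invented, and a real system would use an RDF library with SPARQL:

```python
# A tiny in-memory triple store mirroring the slide's statements.
triples = {
    (":satra", "a", ":person"),
    (":satra", ":works_at", ":mit"),
    (":satra", ":attending", ":workshop1"),
    (":workshop1", ":at", ":ucla"),
    (":satra", ":knows", ":maryann"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a subject/predicate/object pattern.

    None acts as a wildcard, like a variable in a graph-pattern query.
    """
    return {
        (ts, tp, to) for (ts, tp, to) in triples
        if s in (None, ts) and p in (None, tp) and o in (None, to)
    }

# Everything asserted about :satra.
about_satra = match(s=":satra")

# Chained lookup: where is the workshop that :satra is attending?
workshops = {o for (_, _, o) in match(s=":satra", p=":attending")}
places = {o for w in workshops for (_, _, o) in match(s=w, p=":at")}
```

The chained lookup is the point: because subjects and objects share names, new triples compose with old ones, so questions like "where will :satra be?" fall out of two pattern matches.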

  39. The Linked Open Data Web



  41. The Knowledge Graph


  42. SPARQL
    • A query language for RDF
    • SPARQL 1.1 became an official W3C recommendation in March 2013
    • SPARQL allows users to write unambiguous queries

    • Supports federation: a query can be distributed to multiple SPARQL endpoints, computed, and the results gathered
    • A SPARQL client library can query a static RDF document or a SPARQL endpoint
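Federation can be illustrated with a toy sketch: the same pattern is evaluated against several "endpoints" and the results merged. The endpoint contents and the `query` helper here are invented for illustration; real federation would issue SPARQL 1.1 queries over HTTP endpoints rather than iterate over Python sets:

```python
# Two "endpoints", each a plain set of (subject, predicate, object) triples.
endpoint_a = {
    (":study1", ":measured", ":caudate_volume"),
    (":study1", ":population", ":ocd"),
}
endpoint_b = {
    (":study2", ":measured", ":caudate_volume"),
    (":study2", ":population", ":controls"),
}

def query(endpoint, p=None, o=None):
    """Return subjects whose triples match the predicate/object pattern.

    None acts as a wildcard, like a variable in a SPARQL basic graph pattern.
    """
    return {s for (s, tp, to) in endpoint
            if p in (None, tp) and o in (None, to)}

# Federated query: which studies, across all endpoints, measured caudate volume?
studies = set()
for ep in (endpoint_a, endpoint_b):
    studies |= query(ep, p=":measured", o=":caudate_volume")
```

Because every endpoint speaks the same triple vocabulary, the client only merges result sets; no per-source format conversion is needed, which is the practical payoff of federation.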

  43. Outline
    - Knowledge and Reproducibility
    - Provenance and Semantic Web Tools
    - Tools and Infrastructure

  44. What will this platform look like, and what will it enable?
    • Encode information in standardized and machine-accessible form
    • View data from a provenance perspective
      • as products of activities or transformations carried out by people, software or machines
    • Allow any individual, laboratory, or institution to discover and share data and computational services
      • Immediately re-test an algorithm, re-validate results or test a new hypothesis on new data
    • Develop applications based on a consistent, federated query and update interface
    • Provide a decentralized linked data and computational network

  45. Reproducibility in brain imaging
    Poline, Breeze, Ghosh, et al. (2012)


  46. Integrated tools and resources
    Interoperability


  47. Common vocabulary and data standards


  48. Data sharing platforms


  49. Open source software


  50. Nipype: Interoperable and reproducible computing
    Gorgolewski et al., 2012


  51. Neurosynth: Automated meta-analysis
    Yarkoni et al., 2012


  52. BrainSpell: Human curation of literature


  53. Neurovault: Result sharing


  54. Community Q+A platform


  55. CIRRUSScience
    Cloud Infrastructure for Reproducible Research and Utilities for Sharing Science
    (Diagram: Apps, Metadata, Data, Objects)


  58. CIRRUSScience: a bitly demo
    http://localhost:5000/u?url=http://bit.ly/1bDqJM8


  59. Research objects
    Bechhofer et al., 2013



  62. Scientific automation
    Adam
    A robot scientist that “autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation.”

    Abe
    Identify the full dynamical model from scratch without prior knowledge or structural assumptions. … The method performed well to high levels of noise for most states, could identify the correct model de novo, and make better predictions than ordinary parametric regression and neural network models.

    Adam - King et al. (2009); Abe - Schmidt et al. (2011)

  63. What does all this infrastructure buy us?
    • Common vocabulary for communication
    • Rich structured information including provenance
    • Domain-specific object models that are embedded in the common structure
    • Data/content can be repurposed differentially for applications
    • Execution duration could be used to instrument schedulers
    • Parametric failure modes can be tracked across large databases
    • Determine the “amount” of existing data on a particular topic

  64. What does all this infrastructure buy us?
    Trust


  65. Move away from our current approach
    wikipedia


  66. Summary
    Knowledge comes from transformations.

    Transformations are often lossy.

    Provenance tracking can:
    - improve credibility (reduce loss)
    - aggregate information
    - extract knowledge

    CIRRUSScience aims to integrate computational resources and data using common object models.