Data sharing and reproducible research

Satrajit Ghosh
September 21, 2011

A presentation given at Janelia Farm during the workshop on BioImage Informatics.

Transcript

  1. data sharing and reproducible research
    barriers and solutions
    Satrajit Ghosh
    Massachusetts Institute of Technology

  2. problems: barriers to data and code sharing
    solutions: current approaches in neuroimaging
    future: can we have reproducible research?

  3. let’s step backwards
    Oliver Doehermann, Anisha Keshavan, Franzi H.

  4. let’s step backwards
    [Diagram: Data, Folds, Train, Test, SPM, Extract Cluster Means, Linear Regression, Predict]
    Step 1: Split data into a training and test set
    Step 2: Find significant clusters
    Step 3: Extract mean intensity for the test and training sets
    Step 4: Use training data to run a linear regression
    Step 5: Use the model and the test data to predict lsas‐Δ score.
    Oliver Doehermann, Anisha Keshavan, Franzi H.

  10. let’s step backwards
    [Same pipeline diagram and steps as slide 4]
    Enough details?
    Oliver Doehermann, Anisha Keshavan, Franzi H.

  11. let’s step backwards
    [Same pipeline diagram and steps as slide 4]
    Enough details? No
    Oliver Doehermann, Anisha Keshavan, Franzi H.
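
The five steps on the pipeline slide amount to a cross-validated prediction loop. Below is a minimal sketch in Python with NumPy, under stated assumptions: the SPM cluster search and mean extraction (steps 2 and 3) are stood in for by a synthetic feature matrix, the data are simulated, and the names `kfold_indices` and `cross_validated_predictions` are hypothetical rather than taken from the talk.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds roughly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


def cross_validated_predictions(features, scores, n_folds=5, seed=0):
    """Predict every subject's score from a model fit on the other folds only."""
    n_samples = len(scores)
    predictions = np.empty(n_samples)
    for test_idx in kfold_indices(n_samples, n_folds, seed):
        # Step 1: everything outside the current fold is training data.
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_idx] = False
        # Steps 2-3 (cluster search, mean extraction) are assumed to have
        # produced `features`; here they are synthetic.
        # Step 4: ordinary least squares with an intercept column.
        X_train = np.column_stack([np.ones(train_mask.sum()), features[train_mask]])
        coef, *_ = np.linalg.lstsq(X_train, scores[train_mask], rcond=None)
        # Step 5: apply the fitted model to the held-out subjects.
        X_test = np.column_stack([np.ones(len(test_idx)), features[test_idx]])
        predictions[test_idx] = X_test @ coef
    return predictions


# Synthetic stand-in data: 40 subjects, 3 cluster means, noiseless linear score.
rng = np.random.default_rng(42)
features = rng.normal(size=(40, 3))
scores = 1.0 + features @ np.array([2.0, -1.0, 0.5])

predicted = cross_validated_predictions(features, scores)
print("correlation between predicted and observed:",
      np.corrcoef(predicted, scores)[0, 1])
```

Even this sketch leaves open the questions the slide is driving at: which subjects landed in which fold, which software versions ran, and which thresholds defined a "significant" cluster.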

  12. let’s step backwards

  13. let’s step backwards
    What else?
    Where did the input data come from?
    What procedure was used to collect clinical data?
    What were the process parameters?

  14. the scientific process
    Generate hypotheses
    Design experiment
    Collect data
    Analyze data
    Interpret results
    Publish

  15. the scientific process
    Generate hypotheses
    Design experiment
    Collect data
    Analyze data
    Interpret results
    Publish
    [Chart: Hypothesize, Design, Collect, Analyze, Interpret, Publish; slice labels as extracted: 18%, 9%, 18%, 27%, 9%, 18%]
    dramatization. not to be taken too seriously

  16. replicating methods
    [Workflow diagram; boxes in extraction order:]
    Slice timing correction
    Realignment (to first image)
    Functional data
    Structural data
    Skull stripping (FSL)
    Bias correction (SPM2)
    Normalization (SPM2 template)
    Smoothing (4 mm FWHM)
    Normalization
    Coregistration
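
A methods diagram like this one becomes easier to replicate when it is written down as data. The sketch below uses only the Python standard library and records each box with the tool and parameters the slide states, leaving `None` where the slide is silent. The dependency edges, the step names, and the assignment of the SPM2 template to the structural branch are assumptions about how the diagram reads, not facts recovered from the talk.

```python
import json
from graphlib import TopologicalSorter  # Python 3.9+

# Tool and parameters per step, as far as the slide states them.
# None marks a detail the slide does not give.
steps = {
    "slice_timing_correction": {"tool": None, "params": {}},
    "realignment": {"tool": None, "params": {"reference": "first image"}},
    "skull_stripping": {"tool": "FSL", "params": {}},
    "bias_correction": {"tool": "SPM2", "params": {}},
    "structural_normalization": {"tool": None, "params": {"template": "SPM2"}},
    "coregistration": {"tool": None, "params": {}},
    "functional_normalization": {"tool": None, "params": {}},
    "smoothing": {"tool": None, "params": {"fwhm_mm": 4}},
}

# ASSUMED edges: each step maps to the set of steps it depends on.
depends_on = {
    "realignment": {"slice_timing_correction"},
    "bias_correction": {"skull_stripping"},
    "structural_normalization": {"bias_correction"},
    "coregistration": {"realignment", "bias_correction"},
    "functional_normalization": {"coregistration", "structural_normalization"},
    "smoothing": {"functional_normalization"},
}

# A valid execution order, plus the steps whose tool is left unstated.
order = list(TopologicalSorter(depends_on).static_order())
unstated = sorted(name for name, spec in steps.items() if spec["tool"] is None)

print(json.dumps({"order": order, "tool_unstated": unstated}, indent=2))
```

Listing the `None` entries makes the replication gap explicit: most boxes name neither the software nor its settings.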

  21. the metamorphosis
    Structural
    Diffusion
    Functional
    ?

  24. the world of neuroimaging analysis software
    [Timeline, 1990 to 2010, of analysis packages: Afni, Brainvoyager, Freesurfer, R, Caret, Fmristat, FSL, MVPA, NiPy, ANTS, SPM, Brainvisa]
    data source: pymvpa.org
    different algorithms
    different assumptions
    different platforms
    different interfaces
    different file formats

  25. the scientific process
    “The scientific method’s central motivation is the ubiquity of error - the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s effort is primarily expended in recognizing and rooting out error.”
    Donoho et al. (2009)
    assumed veracity of publications
    dependence on peer review as a proxy for testing

  34. but why share data and code?
    special populations:
    - enables aggregation of large data sets (e.g., Autism, ADHD, Schizophrenia)
    cross-discipline interaction:
    - provides data to test their algorithms
    - increases sample size for learning algorithms
    pedagogy:
    - provides easy mechanism to train new personnel

  39. current barriers
    most publications do not include data, code
    some journals mandate but provide no infrastructure for storage, distribution
    most scientists do not have the time to curate data
    no standard ontology for describing experiments, data, derived data, workflows

  44. other practical matters
    how to share data/code?
    what to share?
    where to share?
    how to include manual intervention?

  45. problems: barriers to data and code sharing
    solutions: current approaches in neuroimaging
    future: can we have reproducible research?

  46. data sharing
    Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)
    XNAT (Wash U) + HID (BIRN) + IDA (LONI) databases
    BrainMap (brainmap.org)
    National Database for Autism Research (NDAR)
    Personal web sites

  47. open-source scientific computing stack
    ipython, scipy + numpy, networkx, sympy, matplotlib, mayavi + tvtk

  48. code sharing: formalized interfaces and workflows
    Neuroimaging Pipelines and Interfaces
    nipy.org/nipype
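
To make "formalized interfaces" concrete, here is a minimal sketch of the idea in plain Python: a command-line tool's inputs are declared once, with types and flags, and every call is validated against that declaration before a command line is built. This is an illustration only and not Nipype's actual API; `InputSpec`, `CommandLineInterface`, and the input names are hypothetical. The example wraps FSL's `bet`, whose basic usage is `bet <input> <output> -f <threshold>`, and only builds the command string without running it.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InputSpec:
    """One declared input: its flag (None means positional), type, and whether required."""
    flag: Optional[str]
    type_: type
    mandatory: bool = False


class CommandLineInterface:
    """Hypothetical wrapper that turns validated inputs into a command line."""

    def __init__(self, command, input_spec):
        self.command = command
        self.input_spec = input_spec  # dict order defines argument order

    def cmdline(self, **inputs):
        unknown = set(inputs) - set(self.input_spec)
        if unknown:
            raise ValueError(f"undeclared inputs: {sorted(unknown)}")
        args = [self.command]
        for name, spec in self.input_spec.items():
            if name not in inputs:
                if spec.mandatory:
                    raise ValueError(f"missing mandatory input: {name}")
                continue
            value = inputs[name]
            if not isinstance(value, spec.type_):
                raise TypeError(f"{name} must be {spec.type_.__name__}")
            if spec.flag is not None:
                args.append(spec.flag)
            args.append(str(value))
        return shlex.join(args)


# Declare the interface once; every later call is checked against it.
bet = CommandLineInterface("bet", {
    "in_file": InputSpec(None, str, mandatory=True),
    "out_file": InputSpec(None, str, mandatory=True),
    "frac": InputSpec("-f", float),
})

command = bet.cmdline(in_file="struct.nii.gz", out_file="struct_brain.nii.gz", frac=0.5)
print(command)
```

Because the declaration is data, the same specification can drive documentation, validation, and workflow wiring across packages with very different native interfaces.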

  49. part of a family

  50. nipype: components

  52. nipype: workflow engine

  53. nipype: execution plugins

  54. nipype

  55. using nipype: integrating across packages

  56. capturing provenance: open provenance model
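
The Open Provenance Model describes a computation as a graph of artifacts, processes, and agents linked by relations such as `used`, `wasGeneratedBy`, and `wasControlledBy`. The following is a minimal sketch of such a record in plain Python, not an implementation of the OPM specification or of how Nipype stores provenance. The class `ProvenanceGraph`, the `lineage` query, and the example node names are hypothetical.

```python
import hashlib

RELATIONS = {"used", "wasGeneratedBy", "wasControlledBy",
             "wasTriggeredBy", "wasDerivedFrom"}


def artifact(name, content):
    """Describe a piece of data by name and by a hash of its bytes."""
    return {"type": "artifact", "id": name,
            "sha256": hashlib.sha256(content).hexdigest()}


class ProvenanceGraph:
    """Hypothetical in-memory store of OPM-style nodes and edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # (relation, source id, target id)

    def add_node(self, node):
        self.nodes[node["id"]] = node
        return node["id"]

    def add_edge(self, relation, source, target):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of an edge must be known nodes")
        self.edges.append((relation, source, target))

    def lineage(self, artifact_id):
        """List each process that generated the artifact, with its parameters and inputs."""
        records = []
        for relation, source, target in self.edges:
            if relation == "wasGeneratedBy" and source == artifact_id:
                inputs = [t for r, s, t in self.edges if r == "used" and s == target]
                records.append({"process": target,
                                "params": self.nodes[target].get("params", {}),
                                "inputs": inputs})
        return records


# Example: a smoothing step run by an analyst on one raw image.
graph = ProvenanceGraph()
raw = graph.add_node(artifact("raw_image", b"raw voxel data"))
smoothed = graph.add_node(artifact("smoothed_image", b"smoothed voxel data"))
smooth = graph.add_node({"type": "process", "id": "smoothing",
                         "params": {"fwhm_mm": 4}})
analyst = graph.add_node({"type": "agent", "id": "analyst"})
graph.add_edge("used", smooth, raw)
graph.add_edge("wasGeneratedBy", smoothed, smooth)
graph.add_edge("wasControlledBy", smooth, analyst)

print(graph.lineage(smoothed))
```

A record like this answers the earlier questions directly: where the input data came from and what the process parameters were.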

  59. problems: barriers to data and code sharing
    solutions: current approaches in neuroimaging
    future: can we have reproducible research?

  60. technical options
    Electronic computable notebooks (CDF: Computable Document Format)
    Virtual environments
    Standard ontology
    Formalized interfaces
    Federated databases