Data sharing and reproducible research

Satrajit Ghosh
September 21, 2011

A presentation given at Janelia Farm during the workshop on BioImage Informatics.

Transcript

  1. data sharing and reproducible research: barriers and solutions
    Satrajit Ghosh, satra@mit.edu, Massachusetts Institute of Technology
  2. problems: barriers to data and code sharing; solutions: current approaches in neuroimaging; future: can we have reproducible research?
  3. let’s step backwards (Oliver Doehermann, Anisha Keshavan, Franzi H.)
  4. let’s step backwards (Oliver Doehermann, Anisha Keshavan, Franzi H.)
    Diagram labels: Data, Folds, Train, Test, SPM, Extract Cluster Means, Linear Regression, Predict
    Step 1: Split data into a training and test set
    Step 2: Find significant clusters
    Step 3: Extract mean intensity for the test and training sets
    Step 4: Use training data to run a linear regression
    Step 5: Use the model and the test data to predict LSAS-Δ score
  10. let’s step backwards (same diagram and steps as slide 4): Enough details?
  11. let’s step backwards (same diagram and steps as slide 4): Enough details? No
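
The five steps on the slide can be sketched in code, which also shows how many choices the prose leaves open. This is a minimal sketch under stated assumptions, not the analysis the authors ran: the slide uses SPM to find significant clusters, whereas the stand-in below simply picks the candidate feature most correlated with the score, using the training data only. The function names, the fold scheme, and the seed are all hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (Step 4)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def cross_validated_predictions(features, scores, n_folds=5, seed=0):
    """features[i][c] is the mean intensity of candidate cluster c for subject i."""
    order = list(range(len(scores)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    predictions = {}
    for test_idx in folds:
        # Step 1: split data into a training and a test set.
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        y_train = [scores[i] for i in train_idx]
        # Step 2 (stand-in for SPM): select a cluster from training data only,
        # so the test subjects never influence the selection.
        best = max(
            range(len(features[0])),
            key=lambda c: abs_correlation([features[i][c] for i in train_idx], y_train),
        )
        # Step 3: extract the selected cluster's mean for the training set.
        x_train = [features[i][best] for i in train_idx]
        # Step 4: fit the linear regression on training data.
        slope, intercept = fit_line(x_train, y_train)
        # Step 5: predict the score for each held-out subject.
        for i in test_idx:
            predictions[i] = slope * features[i][best] + intercept
    return [predictions[i] for i in range(len(scores))]
```

Even this short sketch has to fix the number of folds, the random seed, and the selection rule, none of which appear in the five-step description.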
  12. let’s step backwards
  13. let’s step backwards: What else?
    Where did the input data come from?
    What procedure was used to collect clinical data?
    What were the process parameters?
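
One way to answer these three questions alongside an analysis is to write a small machine-readable record next to its outputs. The sketch below is an assumption about what such a record could contain, not a format used in the talk: the function names and the dictionary keys are hypothetical, and the collection procedure is stored as free text supplied by the researcher.

```python
import hashlib
import platform
import sys
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(input_paths, collection_procedure, parameters):
    """Bundle the answers to the slide's questions into one JSON-friendly dict."""
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        # Where did the input data come from? Path plus content checksum.
        "inputs": {str(path): sha256_of_file(path) for path in input_paths},
        # What procedure was used to collect clinical data?
        "collection_procedure": collection_procedure,
        # What were the process parameters?
        "parameters": dict(parameters),
        # The software environment also affects results.
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

The returned dictionary can be passed to `json.dumps` and saved with the results, so the checksums tie each output to the exact input files.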
  14. the scientific process: Generate hypotheses, Design experiment, Collect data, Analyze data, Interpret results, Publish
  15. the scientific process: Generate hypotheses, Design experiment, Collect data, Analyze data, Interpret results, Publish
    Chart labels: Hypothesize, Design, Collect, Analyze, Interpret, Publish
    Chart values as listed: 18%, 9%, 18%, 27%, 9%, 18%
    dramatization. not to be taken too seriously
  16. replicating methods
    Flowchart labels: Functional data; Structural data; Slice timing correction; Realignment (to first image); Skull stripping (FSL); Bias correction (SPM2); Normalization (SPM2 template); Normalization; Coregistration; Smoothing (4 mm FWHM)
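
A methods flowchart like this becomes easier to replicate when it is written down as data. The sketch below is illustrative only: the step names, tools, and the two stated parameters come from the slide, but the dependencies between steps are assumptions, because the arrows of the original figure are not recoverable from the transcript. The slide's second, unlabeled Normalization box is omitted. The helper names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# tool is None wherever the slide does not say which package was used.
# The "needs" lists are ASSUMED orderings, not taken from the slide.
STEPS = {
    "slice_timing": {"tool": None, "params": {}, "needs": []},
    "realignment": {"tool": None, "params": {"reference": "first image"},
                    "needs": ["slice_timing"]},
    "skull_strip": {"tool": "FSL", "params": {}, "needs": []},
    "bias_correct": {"tool": "SPM2", "params": {}, "needs": ["skull_strip"]},
    "coregistration": {"tool": None, "params": {},
                       "needs": ["realignment", "bias_correct"]},
    "normalization": {"tool": "SPM2", "params": {"template": "SPM2 template"},
                      "needs": ["coregistration"]},
    "smoothing": {"tool": None, "params": {"fwhm_mm": 4},
                  "needs": ["normalization"]},
}


def execution_order(steps):
    """Return one valid run order in which every step follows its dependencies."""
    graph = {name: set(spec["needs"]) for name, spec in steps.items()}
    return list(TopologicalSorter(graph).static_order())


def steps_without_tool(steps):
    """List steps whose description names no software, a gap for replication."""
    return sorted(name for name, spec in steps.items() if spec["tool"] is None)
```

Calling `steps_without_tool(STEPS)` makes the replication problem concrete: four of the seven steps name no software at all.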
  18. the metamorphosis
  19. the metamorphosis: Structural, Diffusion, Functional
  21. the metamorphosis: ? Structural, Diffusion, Functional

  23. the world of neuroimaging analysis software (data source: pymvpa.org)
    Timeline figure, 1990 to 2010: AFNI, BrainVoyager, FreeSurfer, R, Caret, Fmristat, FSL, MVPA, NiPy, ANTS, SPM, BrainVISA
  24. the world of neuroimaging analysis software (same timeline figure as slide 23; data source: pymvpa.org)
    different algorithms
    different assumptions
    different platforms
    different interfaces
    different file formats
  25. the scientific process
    “The scientific method’s central motivation is the ubiquity of error - the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s effort is primarily expended in recognizing and rooting out error.” Donoho et al. (2009)
    assumed veracity of publications
    dependence on peer review as a proxy for testing
  34. but why share data and code?
    special populations: enables aggregation of large data sets (e.g., Autism, ADHD, Schizophrenia)
    cross-discipline interaction:
      - provides data to test their algorithms
      - increases sample size for learning algorithms
    pedagogy: provides easy mechanism to train new personnel
  39. current barriers
    most publications do not include data or code
    some journals mandate sharing but provide no infrastructure for storage or distribution
    most scientists do not have the time to curate data
    no standard ontology for describing experiments, data, derived data, workflows
  44. other practical matters
    how to share data/code?
    what to share?
    where to share?
    how to include manual intervention?
  45. problems: barriers to data and code sharing; solutions: current approaches in neuroimaging; future: can we have reproducible research?
  46. data sharing
    Neuroimaging Tools and Resources Clearinghouse (NITRC)
    XNAT (Wash U) + HID (BIRN) + IDA (LONI) databases
    Brain Map (brainmap.org)
    National Database for Autism Research (NDAR)
    Personal web sites
  47. open-source scientific computing stack: ipython, scipy + numpy, networkx, sympy, matplotlib, mayavi + tvtk
  48. code sharing: formalized interfaces and workflows
    Neuroimaging Pipelines and Interfaces, nipy.org/nipype
  49. part of a family
  50. nipype: components
  52. nipype: workflow engine
  53. nipype: execution plugins
  54. nipype
  55. using nipype: integrating across packages
  56. capturing provenance: open provenance model
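
The slide content is not preserved in the transcript, so the following is only a minimal sketch of the idea behind the Open Provenance Model, not the implementation shown in the talk. The model describes a run as a graph of artifacts, processes, and agents joined by a small set of typed relations. The class name, method names, and file names below are hypothetical.

```python
# Each relation maps to the (source kind, target kind) it is allowed to connect.
EDGE_RULES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}
NODE_KINDS = {"artifact", "process", "agent"}


class ProvenanceGraph:
    """Tiny in-memory provenance graph with typed nodes and checked edges."""

    def __init__(self):
        self.kinds = {}   # node name -> kind
        self.edges = []   # (relation, source, target)

    def add_node(self, name, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, relation, source, target):
        expected = EDGE_RULES[relation]
        actual = (self.kinds[source], self.kinds[target])
        if actual != expected:
            raise ValueError(f"{relation} must link {expected}, got {actual}")
        self.edges.append((relation, source, target))

    def source_artifacts(self, artifact):
        """Follow edges backwards to find every artifact this one depends on."""
        found, stack = set(), [artifact]
        while stack:
            node = stack.pop()
            for relation, source, target in self.edges:
                follows = relation in ("wasGeneratedBy", "used", "wasDerivedFrom")
                if source == node and follows and target not in found:
                    found.add(target)
                    stack.append(target)
        return {name for name in found if self.kinds[name] == "artifact"}
```

For example, after recording that a smoothing process used `raw.nii` and that `smoothed.nii` was generated by it, `source_artifacts("smoothed.nii")` returns the raw image, which answers the question of where a derived result came from.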

  59. problems: barriers to data and code sharing; solutions: current approaches in neuroimaging; future: can we have reproducible research?
  60. technical options
    Electronic computable notebooks (CDF: Computable Document Format)
    Virtual environments
    Standard ontology
    Formalized interfaces
    Federated databases