DACETS: research data management for individual scientists

Ivan
June 14, 2013

Handling research data (raw data and all associated experimental metadata) is still very hard for individual scientists and small research groups, and few tools are available to lessen the burden.

Transcript

  1. The mission of the WLCG project is to provide global computing resources to store, distribute and analyse the ~25 Petabytes (25 million Gigabytes) of data annually generated by the Large Hadron Collider (LHC) at CERN on the Franco-Swiss border.
  2. 60 hrs of video / min: 60 * 100 MB * 60 min * 24 = 8 TB/day (3 PB/year)
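
The back-of-the-envelope estimate on this slide can be reproduced in a few lines. This is only a sketch: the figure of 100 MB per hour of video is read off the slide's formula, and the use of decimal units (1 TB = 10^6 MB) is an assumption. The constant names are made up for illustration.

```python
# Rough estimate of upload volume, following the slide's formula.
# Assumptions: 100 MB per hour of video, decimal units (1 TB = 1e6 MB).
HOURS_UPLOADED_PER_MINUTE = 60
MB_PER_VIDEO_HOUR = 100

# MB per minute * 60 minutes * 24 hours
mb_per_day = HOURS_UPLOADED_PER_MINUTE * MB_PER_VIDEO_HOUR * 60 * 24
tb_per_day = mb_per_day / 1e6
pb_per_year = tb_per_day * 365 / 1e3

print(f"{tb_per_day:.2f} TB/day, {pb_per_year:.2f} PB/year")
```

Under these assumptions the result is 8.64 TB/day and about 3.15 PB/year, which the slide rounds to 8 TB/day and 3 PB/year.
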
  3. data processing pipeline (figure stages): original, brain extract, brain split, GM/WM segm, sulci skeleton, simple surfaces, 3D recon & labeling
  4. data organization

          ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree -L 2
          .
          ├── Anat_variability/
          │   ├── 01_GRV/
          │   ├── 02_GLZ/
          │   ├── 03_LIB/
          │   ├── 04_RAC/
          │   ├── 05_DUB/
          │   └── 06_WIL/
          └── fMRI_Lang/
              ├── 01_GRV/
              ├── 02_GLZ/
              ├── 03_LIB/
              ├── 04_RAC/
              ├── 05_DUB/
              └── 06_WIL/
          ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree fMRI_Lang/
          fMRI_Lang/
          ├── 01_GRV/
          │   ├── analyze/
          │   │   ├── fmri_noun.hdr
          │   │   ├── fmri_noun.img
          │   │   ├── fmri_verb.hdr
          │   │   ├── fmri_verb.img
          │   │   ├── t1_3d.hdr
          │   │   ├── t1_3d.img
          │   │   ├── t1_axe.hdr
          │   │   └── t1_axe.img
          │   ├── maps/
          │   │   ├── zscore_noun.hdr
          │   │   ├── zscore_noun.img
          │   │   ├── zscore_verb.hdr
          │   │   └── zscore_verb.img
          │   └── raw/
          │       ├── DICOM/
          │       │   ├── 0000001.ima
          │       │   ├── 0000002.ima
          │       │   └── ... (~1000 files)
          │       └── DICOMDIR
          ├── 02_GLZ/
          ├── 03_LIB/

    (slide annotations: original project, image sequence, header file, derived project)
  5. data organization

          ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree fMRI_Lang/
          fMRI_Lang/
          ├── 01_GRV/
          │   ├── analyze/
          │   │   ├── fmri_noun.hdr
          │   │   ├── fmri_noun.img
          │   │   ├── fmri_verb.hdr
          │   │   ├── fmri_verb.img
          │   │   ├── t1_3d.hdr
          │   │   ├── t1_3d.img
          │   │   ├── t1_axe.hdr
          │   │   └── t1_axe.img
          │   ├── maps/
          │   │   ├── zscore_noun.hdr
          │   │   ├── zscore_noun.img
          │   │   ├── zscore_verb.hdr
          │   │   └── zscore_verb.img
          │   └── raw/
          │       ├── DICOM/
          │       │   ├── 0000001.ima
          │       │   ├── 0000002.ima
          │       │   └── ... (~1000 files)
          │       └── DICOMDIR
          ├── 02_GLZ/
          ├── 03_LIB/
          ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree Anat_variability/
          Anat_variability/
          ├── 01_GRV/
          │   ├── READMEs/
          │   ├── anatomy/
          │   │   ├── T1.hdr
          │   │   ├── T1.img
          │   │   ├── T1_norm.hdr
          │   │   └── T1_norm.img
          │   ├── graphe/
          │   ├── segment/
          │   │   ├── T1_brain.hdr
          │   │   └── T1_brain.img
          │   └── tri/
          │       ├── LHemi.dat
          │       └── RHemi.dat
          ├── 02_GLZ/
          ├── 03_LIB/
          ├── 04_RAC/
          ├── 05_DUB/
          └── 06_WIL/
  6. data organization

          $ find . -name \*3d.img -exec script.py {} \;

    useful, but clearly not enough (binary headers)
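
Because file names say little about an image, a script usually has to open the binary header as well. As a minimal sketch, assuming the `.hdr` files are Analyze 7.5 headers (348 bytes, with the `dim` array of 16-bit integers at byte offset 40 and the `pixdim` array of 32-bit floats at offset 76), the image dimensions can be read with the standard library alone. The function name `read_analyze_dims` is hypothetical.

```python
import struct

HEADER_SIZE = 348  # size of an Analyze 7.5 header in bytes


def read_analyze_dims(path):
    """Return (shape, voxel_sizes) read from an Analyze 7.5 .hdr file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError(f"{path}: header too short")

    # The first field (sizeof_hdr) should equal 348; use it to detect byte order.
    for endian in ("<", ">"):
        if struct.unpack_from(endian + "i", raw, 0)[0] == HEADER_SIZE:
            break
    else:
        raise ValueError(f"{path}: not an Analyze 7.5 header")

    dim = struct.unpack_from(endian + "8h", raw, 40)     # dim[0] = number of axes
    pixdim = struct.unpack_from(endian + "8f", raw, 76)  # pixdim[1..] = voxel sizes
    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError(f"{path}: implausible number of dimensions: {ndim}")
    return dim[1:ndim + 1], pixdim[1:ndim + 1]
```

A call such as `read_analyze_dims("t1_3d.hdr")` could then be combined with `find` or `os.walk` to select images by matrix size or voxel size rather than by name.
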
  7. what’s needed? help with data org structure; tracking notes/comments about data; tracking provenance; search (beyond filename/{c,m}time); efficient local/remote sync; ...
  8. “Before someone suggests OME, we don’t have the wherewithal to move to OMERO – the server setup is beyond me and not something that we can implement easily through our IT support. This is one for the future…”
  9. isn’t it a solved problem? Amazon AMIs; media management (iTunes et al.); SW packages (deb, rpm, egg, npm, ...)
  10. setup.py

          from setuptools import setup  # import not shown on the slide

          # _read() is a helper defined elsewhere in setup.py (not shown on the slide)
          setup(
              name='beets',
              version='1.1.1',
              description='music tagger and library organizer',
              author='Adrian Sampson',
              author_email='[email protected]',
              url='http://beets.radbox.org/',
              license='MIT',
              long_description=_read('README.rst'),
              packages=[
                  'beets',
                  'beets.ui',
              ],
              install_requires=[
                  'mutagen>=1.21',
                  'musicbrainzngs>=0.4',
                  'pyyaml',
              ],
              classifiers=[
                  'Topic :: Multimedia :: Sound/Audio',
                  'License :: OSI Approved :: MIT License',
                  'Environment :: Console',
                  'Programming Language :: Python :: 2.7',
              ],
          )
  11. package.json

          {
            "name": "http-server",
            "version": "0.3.0",
            "description": "a simple command-line http server",
            "license": "MIT",
            "author": "Nodejitsu <[email protected]>",
            "contributors": [
              { "name": "Marak Squires", "email": "[email protected]" }
            ],
            "repository": {
              "type": "git",
              "url": "https://github.com/nodejitsu/http-server.git"
            },
            "keywords": [ "cli", "http", "server" ],
            "dependencies": {
              "flatiron": "0.1.x"
            }
          }
  12. data-package.json

          {
            # general "metadata"
            name: "a-unique-human-readable-and-url-usable-identifier",
            title: "A nice title",
            licenses: [...],
            sources: [...],
            resources: [
              { ... resource info described below ... }
            ],
            # optional "metadata"
            ... additional information ...
          }
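
A descriptor like this is easy to produce and sanity-check from a script. The following is a minimal sketch, not a reference implementation of any specification: the helper names `make_descriptor` and `check_descriptor`, the set of required keys (taken from the general metadata listed on the slide), the "URL-usable" pattern, and the example name, title and path are all assumptions made for illustration.

```python
import json
import re

# Keys listed under general "metadata" on the slide (assumed required here).
REQUIRED_KEYS = ("name", "title", "licenses", "sources", "resources")
# Assumption: "URL-usable" means lowercase letters, digits, '-', '_' and '.'.
NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")


def make_descriptor(name, title, resources, licenses=(), sources=()):
    """Build a descriptor dict with the general metadata keys."""
    return {
        "name": name,
        "title": title,
        "licenses": list(licenses),
        "sources": list(sources),
        "resources": list(resources),
    }


def check_descriptor(desc):
    """Return a list of problems found; an empty list means no problems."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in desc]
    if "name" in desc and not NAME_PATTERN.match(desc["name"]):
        problems.append(f"name is not URL-usable: {desc['name']!r}")
    if not desc.get("resources"):
        problems.append("no resources listed")
    return problems


# Hypothetical example values, loosely based on the directory names shown earlier.
descriptor = make_descriptor(
    name="fmri-lang-var",
    title="fMRI language variability",
    resources=[{"path": "fMRI_Lang/01_GRV/analyze/t1_3d.img"}],
)
print(json.dumps(descriptor, indent=2))
print(check_descriptor(descriptor))
```

Returning a list of problems instead of raising on the first one makes it possible to report everything that is wrong with a descriptor in a single pass.
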
  13. data-package.yaml

          ---
          project:
            name: "project name"
            license: ...
            description: ...
            resources: ...
          experiment:
            sample: "sample_id"
            date: "2013-01-01"
            equipment: ...
          acquisition:  # (simulation, data-analysis)
            settings: ...
          data:
            - path: url
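
The `data` list in such a descriptor does not have to be written by hand. As a rough sketch (the function name `list_data_entries`, the default file extensions, and the choice of relative, forward-slash paths are assumptions), the entries could be collected by walking the project directory:

```python
import os


def list_data_entries(project_dir, extensions=(".img", ".hdr", ".ima")):
    """Return a sorted list of {'path': relative_path} dicts for matching files."""
    extensions = tuple(extensions)  # str.endswith() needs a tuple, not a list
    entries = []
    for root, _dirs, files in os.walk(project_dir):
        for fname in files:
            if fname.lower().endswith(extensions):
                rel = os.path.relpath(os.path.join(root, fname), project_dir)
                # Store paths with forward slashes so the list is portable.
                entries.append({"path": rel.replace(os.sep, "/")})
    return sorted(entries, key=lambda e: e["path"])
```

The resulting list can be dumped into the `data` section of the descriptor by whatever YAML or JSON writer is in use.
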
  14. dacets

          $ dct import /path/to/project/
          $ dct add /path/to/file
          $ dct remove /path/to/file
          $ dct find "mri+brain+3T+T1" -o dacets.json
          $ dct sync -f dacets.json <src> <dest>

          >>> from dacets import Dacets
          >>> dc1 = Dacets.load('dacets.json')
          >>> dc2 = Dacets.find('mri+brain+3T+T1')
          >>> run_pipeline(on=dc2)
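
To make the intended workflow concrete, here is a toy stand-in for the Python interface shown on the slide. It is purely illustrative and is not the dacets implementation: the class name `ToyDacets`, the JSON layout (a list of records with `path` and `tags`), the example tags, and the reading of `+` in a query as "all of these tags" are assumptions. Unlike the slide, `find` is an instance method here, so it searches an already loaded collection.

```python
import json


class ToyDacets:
    """A toy collection of data records, each with a path and a list of tags."""

    def __init__(self, records):
        self.records = list(records)

    @classmethod
    def load(cls, filename):
        # Assumed file layout: [{"path": "...", "tags": ["mri", "brain"]}, ...]
        with open(filename) as f:
            return cls(json.load(f))

    def save(self, filename):
        with open(filename, "w") as f:
            json.dump(self.records, f, indent=2)

    def find(self, query):
        """Return a new collection of records carrying every '+'-separated tag."""
        wanted = {t.lower() for t in query.split("+") if t}
        hits = [r for r in self.records
                if wanted <= {t.lower() for t in r.get("tags", [])}]
        return ToyDacets(hits)


# Hypothetical records; the tags are invented for this example.
catalog = ToyDacets([
    {"path": "Anat_variability/01_GRV/anatomy/T1.img",
     "tags": ["mri", "brain", "3T", "T1"]},
    {"path": "fMRI_Lang/01_GRV/analyze/fmri_noun.img",
     "tags": ["mri", "brain", "3T", "fMRI"]},
])
subset = catalog.find("mri+brain+3T+T1")
print([r["path"] for r in subset.records])
```

Because `find` returns another collection rather than a plain list, the result can be saved, synced, or handed to a processing pipeline in the same way as the full catalog.
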