DACETS: research data management for individual scientists

research data management for individual scientists Ivan Zimine 2013-06-14 DACETS

$ whoami
Physicist? Neuroscientist? Programmer? Manager

https://speakerdeck.com/kennethreitz/api-driven-development

Open Science & Reproducible Research open {access, source, data}

CERN, LHC

The mission of the WLCG project is to provide global
computing resources to store, distribute and analyse the ~25 Petabytes (25 million Gigabytes) of data annually generated by the Large Hadron Collider (LHC) at CERN on the Franco-Swiss border.

60 hrs of video / min
60 * 100 MB * 60 min * 24 h ≈ 8.6 TB/day (≈ 3 PB/year)
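The back-of-the-envelope figure on this slide can be reproduced in a few lines. This is only a sketch: it assumes, as the slide's arithmetic implies, about 100 MB per hour of video and decimal units (1 TB = 10^6 MB). Neither number is a measured value.

```python
# Rough estimate of the daily and yearly upload volume.
# Assumptions (not measured): ~100 MB per hour of video, decimal units.
HOURS_OF_VIDEO_PER_MINUTE = 60
MB_PER_HOUR_OF_VIDEO = 100  # assumed average size

# 60 minutes per hour, 24 hours per day
mb_per_day = HOURS_OF_VIDEO_PER_MINUTE * MB_PER_HOUR_OF_VIDEO * 60 * 24
tb_per_day = mb_per_day / 1e6
pb_per_year = tb_per_day * 365 / 1e3

print(f"{tb_per_day:.2f} TB/day, {pb_per_year:.2f} PB/year")
# prints: 8.64 TB/day, 3.15 PB/year
```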

BIG TOYS SMALL TOYS

Research Data
publications
raw data
derived data
summary data
annotations
analysis workflow
code/scripts
...

... brain variability (an old project...)

data processing pipeline
original brain
extract brain
split
GM/WM segm
sulci skeleton
simple surfaces
3D recon & labeling

data organization

ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree -L 2
.
├── Anat_variability/
│   ├── 01_GRV/
│   ├── 02_GLZ/
│   ├── 03_LIB/
│   ├── 04_RAC/
│   ├── 05_DUB/
│   └── 06_WIL/
└── fMRI_Lang/
    ├── 01_GRV/
    ├── 02_GLZ/
    ├── 03_LIB/
    ├── 04_RAC/
    ├── 05_DUB/
    └── 06_WIL/

ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree fMRI_Lang/
fMRI_Lang/
├── 01_GRV/
│   ├── analyze/
│   │   ├── fmri_noun.hdr
│   │   ├── fmri_noun.img
│   │   ├── fmri_verb.hdr
│   │   ├── fmri_verb.img
│   │   ├── t1_3d.hdr
│   │   ├── t1_3d.img
│   │   ├── t1_axe.hdr
│   │   └── t1_axe.img
│   ├── maps/
│   │   ├── zscore_noun.hdr
│   │   ├── zscore_noun.img
│   │   ├── zscore_verb.hdr
│   │   └── zscore_verb.img
│   └── raw/
│       ├── DICOM/
│       │   ├── 0000001.ima
│       │   ├── 0000002.ima
│       │   └── ... (~1000 files)
│       └── DICOMDIR
├── 02_GLZ/
├── 03_LIB/

labels: original project, image sequence, header file, derived project

data organization

ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree fMRI_Lang/
fMRI_Lang/
├── 01_GRV/
│   ├── analyze/
│   │   ├── fmri_noun.hdr
│   │   ├── fmri_noun.img
│   │   ├── fmri_verb.hdr
│   │   ├── fmri_verb.img
│   │   ├── t1_3d.hdr
│   │   ├── t1_3d.img
│   │   ├── t1_axe.hdr
│   │   └── t1_axe.img
│   ├── maps/
│   │   ├── zscore_noun.hdr
│   │   ├── zscore_noun.img
│   │   ├── zscore_verb.hdr
│   │   └── zscore_verb.img
│   └── raw/
│       ├── DICOM/
│       │   ├── 0000001.ima
│       │   ├── 0000002.ima
│       │   └── ... (~1000 files)
│       └── DICOMDIR
├── 02_GLZ/
├── 03_LIB/

ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree Anat_variability/
Anat_variability/
├── 01_GRV/
│   ├── READMEs/
│   ├── anatomy/
│   │   ├── T1.hdr
│   │   ├── T1.img
│   │   ├── T1_norm.hdr
│   │   └── T1_norm.img
│   ├── graphe/
│   ├── segment/
│   │   ├── T1_brain.hdr
│   │   └── T1_brain.img
│   └── tri/
│       ├── LHemi.dat
│       └── RHemi.dat
├── 02_GLZ/
├── 03_LIB/
├── 04_RAC/
├── 05_DUB/
└── 06_WIL/

data organization

$ find . -name \*3d.img -exec script.py {} \;

useful, but clearly not enough (binary headers)
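To illustrate why filename matching alone falls short, here is a minimal sketch that walks a project tree like the one above and pulls the image dimensions out of each binary header. It assumes the .hdr/.img pairs are Analyze 7.5 files (348-byte header, dim[] stored as 16-bit integers at byte offset 40). The function names read_analyze_dims and find_volumes are made up for this example and are not part of any tool mentioned in the talk.

```python
import struct
from pathlib import Path


def read_analyze_dims(hdr_path):
    """Return (nx, ny, nz, nt) from an Analyze 7.5 header (assumed format)."""
    raw = Path(hdr_path).read_bytes()[:56]
    if len(raw) < 56:
        raise ValueError(f"{hdr_path}: too short for an Analyze header")
    # sizeof_hdr (first 4 bytes) equals 348 when read in the right byte order.
    for endian in ("<", ">"):
        if struct.unpack(endian + "i", raw[:4])[0] == 348:
            break
    else:
        raise ValueError(f"{hdr_path}: not an Analyze 7.5 header")
    # dim[0..7]: eight 16-bit integers starting at byte offset 40.
    dim = struct.unpack(endian + "8h", raw[40:56])
    return dim[1:5]


def find_volumes(root, pattern="*3d.img"):
    """Yield (image path, dims) for every image under root matching pattern."""
    for img in sorted(Path(root).rglob(pattern)):
        hdr = img.with_suffix(".hdr")
        if hdr.exists():
            yield img, read_analyze_dims(hdr)


if __name__ == "__main__":
    for img, dims in find_volumes("."):
        print(img, dims)
```

Even with this, the acquisition parameters live only in the binary header, so every script has to re-parse them before it can select data.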

what’s the problem?
I cannot reproduce my own research results

what’s the actual problem? lack of simple tools for handling
collections of [large] binary objects

what’s needed?
help with data org structure
tracking notes/comments about data
tracking provenance
search (beyond filename/{c,m}time)
efficient local/remote sync
...

isn’t it a solved problem?
http://wlcg.web.cern.ch/
http://www.mygrid.org.uk/
http://galaxyproject.org/
https://www.globusonline.org/
https://www.opensciencegrid.org/
http://www.openmicroscopy.org/

“Before someone suggests OME, we don’t have the wherewithal to
move to OMERO – the server setup is beyond me and not something that we can implement easily through our IT support. This is one for the future…”

isn’t it a solved problem?
Amazon AMIs
Media management (iTunes et al.)
SW packages (deb, rpm, egg, npm,...)

setup.py

from setuptools import setup

# _read() is a small helper defined elsewhere in setup.py (not shown on the slide)
setup(
    name='beets',
    version='1.1.1',
    description='music tagger and library organizer',
    author='Adrian Sampson',
    author_email='[email protected]',
    url='http://beets.radbox.org/',
    license='MIT',
    long_description=_read('README.rst'),
    packages=[
        'beets',
        'beets.ui',
    ],
    install_requires=[
        'mutagen>=1.21',
        'musicbrainzngs>=0.4',
        'pyyaml',
    ],
    classifiers=[
        'Topic :: Multimedia :: Sound/Audio',
        'License :: OSI Approved :: MIT License',
        'Environment :: Console',
        'Programming Language :: Python :: 2.7',
    ],
)

package.json

{
  "name": "http-server",
  "version": "0.3.0",
  "description": "a simple command-line http server",
  "license": "MIT",
  "author": "Nodejitsu <[email protected]>",
  "contributors": [
    {
      "name": "Marak Squires",
      "email": "[email protected]"
    }
  ],
  "repository": {
    "type": "git",
    "url": "https://github.com/nodejitsu/http-server.git"
  },
  "keywords": [
    "cli",
    "http",
    "server"
  ],
  "dependencies": {
    "flatiron": "0.1.x"
  }
}

data-package.json

{
  # general "metadata"
  name: "a-unique-human-readable-and-url-usable-identifier",
  title: "A nice title",
  licenses: [...],
  sources: [...],
  resources: [
    { ... resource info described below ... }
  ],
  # optional "metadata"
  ... additional information ...
}
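A descriptor like this can be generated from an existing project directory. The sketch below is one possible approach, not the data package tooling itself: the top-level keys (name, title, licenses, sources, resources) come from the slide, while the per-resource fields path, bytes and sha256 and the function names are assumptions chosen for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binary files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_project(root, name, title):
    """Build a data-package-style descriptor for every file under root."""
    root = Path(root)
    resources = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        resources.append({
            "path": path.relative_to(root).as_posix(),  # assumed field name
            "bytes": path.stat().st_size,               # assumed field name
            "sha256": sha256_of(path),                  # assumed field name
        })
    return {
        "name": name,
        "title": title,
        "licenses": [],  # left empty; to be filled in by hand
        "sources": [],   # left empty; to be filled in by hand
        "resources": resources,
    }


if __name__ == "__main__":
    descriptor = describe_project(".", "my-project", "My project")
    print(json.dumps(descriptor, indent=2))
```

Storing a checksum per resource is a design choice that makes it possible to detect later whether a binary file has changed, which file names and modification times alone cannot guarantee.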

data-package.yaml

---
project:
  name: "project name"
  license: ...
  description: ...
  resources: ...
experiment:
  sample: "sample_id"
  date: "2013-01-01"
  equipment: ...
acquisition:  # (simulation, data-analysis)
  settings: ...
data:
  - path: url

dacets

$ dct import /path/to/project/
$ dct add /path/to/file
$ dct remove /path/to/file
$ dct find "mri+brain+3T+T1" -o dacets.json
$ dct sync -f dacets.json <src> <dest>

>>> from dacets import Dacets
>>> dc1 = Dacets.load('dacets.json')
>>> dc2 = Dacets.find('mri+brain+3T+T1')
>>> run_pipeline(on=dc2)
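The slide shows only the intended interface. As a rough illustration of how a faceted query such as "mri+brain+3T+T1" could work, here is a hypothetical sketch. The record layout (a JSON list of objects with "path" and "facets" keys) and the matching rule (a record must carry every facet in the query) are assumptions, and unlike the slide, find is a method on a loaded collection because this sketch has no global index.

```python
import json


class Dacets:
    """Hypothetical collection of dataset records with facet tags."""

    def __init__(self, records):
        self.records = list(records)

    @classmethod
    def load(cls, filename):
        """Load records from a JSON file holding a list of objects."""
        with open(filename) as fh:
            return cls(json.load(fh))

    def save(self, filename):
        """Write the records back out as JSON."""
        with open(filename, "w") as fh:
            json.dump(self.records, fh, indent=2)

    def find(self, query):
        """Return the records that carry every facet in a '+'-separated query."""
        wanted = {facet.lower() for facet in query.split("+") if facet}
        hits = [r for r in self.records
                if wanted <= {f.lower() for f in r["facets"]}]
        return Dacets(hits)

    def __len__(self):
        return len(self.records)


if __name__ == "__main__":
    dc = Dacets([
        {"path": "01_GRV/analyze/t1_3d.img", "facets": ["mri", "brain", "3T", "T1"]},
        {"path": "01_GRV/analyze/fmri_noun.img", "facets": ["mri", "brain", "3T", "fMRI"]},
    ])
    for record in dc.find("mri+brain+3T+T1").records:
        print(record["path"])
```

Returning a new Dacets object from find keeps results chainable, so a narrowed collection can be saved and handed to a sync or pipeline step.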

dacets
datasets + facets
binary files
DVCS
python module (integration with ipython notebook)

interesting tech
git-annex
hg LargefilesExtension
http://neuralensemble.org/sumatra/
https://github.com/okfn/dpm/
