DACETS: research data management for individual scientists

Slide 1

Slide 1 text

research data management for individual scientists Ivan Zimine 2013-06-14 DACETS

Slide 2

Slide 2 text

$ whoami Physicist ? Neuroscientist ? Programmer ? Manager

Slide 3

Slide 3 text

https://speakerdeck.com/kennethreitz/api-driven-development

Slide 4

Slide 4 text

Open Science & Reproducible Research open {access, source, data}

Slide 5

Slide 5 text

CERN, LHC

Slide 6

Slide 6 text

The mission of the WLCG project is to provide global computing resources to store, distribute and analyse the ~25 Petabytes (25 million Gigabytes) of data annually generated by the Large Hadron Collider (LHC) at CERN on the Franco-Swiss border.

Slide 7

Slide 7 text

60 hrs of video / min
60 * 100 MB * 60 min * 24 h ≈ 8.6 TB/day (≈ 3 PB/year)
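The estimate above can be checked with a few lines of Python. This is only a sketch: it assumes that the slide's 100 MB figure means the size of one hour of video and that decimal units are used (1 TB = 10^6 MB). The constant and function names are made up for illustration.

```python
# Back-of-the-envelope estimate of the upload volume quoted on the slide.
HOURS_OF_VIDEO_PER_MINUTE = 60  # from the slide
MB_PER_HOUR_OF_VIDEO = 100      # assumption: 100 MB per hour of video
MINUTES_PER_DAY = 60 * 24
DAYS_PER_YEAR = 365


def upload_volume_tb_per_day():
    """Daily volume in terabytes (decimal units, 1 TB = 10**6 MB)."""
    mb_per_day = HOURS_OF_VIDEO_PER_MINUTE * MB_PER_HOUR_OF_VIDEO * MINUTES_PER_DAY
    return mb_per_day / 1_000_000


def upload_volume_pb_per_year():
    """Yearly volume in petabytes (1 PB = 1000 TB)."""
    return upload_volume_tb_per_day() * DAYS_PER_YEAR / 1_000


if __name__ == "__main__":
    print(f"{upload_volume_tb_per_day():.2f} TB/day")
    print(f"{upload_volume_pb_per_year():.2f} PB/year")
```

Under these assumptions the result is about 8.6 TB/day and just over 3 PB/year, in line with the rounded figures on the slide.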

Slide 8

Slide 8 text

BIG TOYS SMALL TOYS

Slide 9

Slide 9 text

BIG TOYS SMALL TOYS

Slide 10

Slide 10 text

Research Data
publications
raw data
derived data
summary data
annotations
analysis workflow
code/scripts
...

Slide 11

Slide 11 text

... brain variability (an old project...)

Slide 12

Slide 12 text

original brain extract brain split GM/WM segm sulci skeleton simple surfaces 3D recon & labeling data processing pipeline

Slide 13

Slide 13 text

data organization
ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree -L 2
.
├── Anat_variability/
│   ├── 01_GRV/
│   ├── 02_GLZ/
│   ├── 03_LIB/
│   ├── 04_RAC/
│   ├── 05_DUB/
│   └── 06_WIL/
└── fMRI_Lang/
    ├── 01_GRV/
    ├── 02_GLZ/
    ├── 03_LIB/
    ├── 04_RAC/
    ├── 05_DUB/
    └── 06_WIL/
ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree fMRI_Lang/
fMRI_Lang/
├── 01_GRV/
│   ├── analyze/
│   │   ├── fmri_noun.hdr
│   │   ├── fmri_noun.img
│   │   ├── fmri_verb.hdr
│   │   ├── fmri_verb.img
│   │   ├── t1_3d.hdr
│   │   ├── t1_3d.img
│   │   ├── t1_axe.hdr
│   │   └── t1_axe.img
│   ├── maps/
│   │   ├── zscore_noun.hdr
│   │   ├── zscore_noun.img
│   │   ├── zscore_verb.hdr
│   │   └── zscore_verb.img
│   └── raw/
│       ├── DICOM/
│       │   ├── 0000001.ima
│       │   ├── 0000002.ima
│       │   └── ... (~1000 files)
│       └── DICOMDIR
├── 02_GLZ/
├── 03_LIB/
(slide labels: original project; image sequence; header file; derived project)

Slide 14

Slide 14 text

data organization
ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree fMRI_Lang/
fMRI_Lang/
├── 01_GRV/
│   ├── analyze/
│   │   ├── fmri_noun.hdr
│   │   ├── fmri_noun.img
│   │   ├── fmri_verb.hdr
│   │   ├── fmri_verb.img
│   │   ├── t1_3d.hdr
│   │   ├── t1_3d.img
│   │   ├── t1_axe.hdr
│   │   └── t1_axe.img
│   ├── maps/
│   │   ├── zscore_noun.hdr
│   │   ├── zscore_noun.img
│   │   ├── zscore_verb.hdr
│   │   └── zscore_verb.img
│   └── raw/
│       ├── DICOM/
│       │   ├── 0000001.ima
│       │   ├── 0000002.ima
│       │   └── ... (~1000 files)
│       └── DICOMDIR
├── 02_GLZ/
├── 03_LIB/
ivan@pavlov:~/mri_data/fMRI_Lang_var $ tree Anat_variability/
Anat_variability/
├── 01_GRV/
│   ├── READMEs/
│   ├── anatomy/
│   │   ├── T1.hdr
│   │   ├── T1.img
│   │   ├── T1_norm.hdr
│   │   └── T1_norm.img
│   ├── graphe/
│   ├── segment/
│   │   ├── T1_brain.hdr
│   │   └── T1_brain.img
│   └── tri/
│       ├── LHemi.dat
│       └── RHemi.dat
├── 02_GLZ/
├── 03_LIB/
├── 04_RAC/
├── 05_DUB/
└── 06_WIL/

Slide 15

Slide 15 text

data organization
$ find . -name \*3d.img -exec script.py {} \;
useful, but clearly not enough (binary headers)
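To illustrate why a filename search alone falls short, here is a rough Python sketch of the same search, followed by a peek into the binary header that sits next to each image. It assumes the `.hdr` files are Analyze 7.5 headers (348 bytes, with the `dim` array of eight 16-bit integers at byte offset 40). The function names `find_images` and `read_dims` are invented for this example.

```python
import struct
from pathlib import Path

ANALYZE_HEADER_SIZE = 348  # fixed size in bytes of an Analyze 7.5 header
DIM_OFFSET = 40            # byte offset of the dim[8] array of int16 values


def find_images(root, pattern="*3d.img"):
    """Recursively list files matching the pattern, like `find root -name`."""
    return sorted(Path(root).rglob(pattern))


def read_dims(hdr_path):
    """Return the eight dim values stored in an Analyze 7.5 header file."""
    raw = Path(hdr_path).read_bytes()[:ANALYZE_HEADER_SIZE]
    if len(raw) < ANALYZE_HEADER_SIZE:
        raise ValueError(f"{hdr_path}: header is shorter than 348 bytes")
    # The first int32 holds the header size; use it to detect byte order.
    for byte_order in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(byte_order + "i", raw, 0)
        if sizeof_hdr == ANALYZE_HEADER_SIZE:
            return struct.unpack_from(byte_order + "8h", raw, DIM_OFFSET)
    raise ValueError(f"{hdr_path}: not an Analyze 7.5 header")


if __name__ == "__main__":
    for image in find_images("."):
        print(image, read_dims(image.with_suffix(".hdr")))
```

Even this small step needs format-specific knowledge (offsets, byte order) that the file system cannot provide, which is the point of the slide's "binary headers" remark.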

Slide 16

Slide 16 text

what’s the problem?
I cannot reproduce my own research results

Slide 17

Slide 17 text

what’s the actual problem? lack of simple tools for handling collections of [large] binary objects

Slide 18

Slide 18 text

what’s needed?
help with data org structure
tracking notes/comments about data
tracking provenance
search (beyond filename/{c,m}time)
efficient local/remote sync
...

Slide 19

Slide 19 text

isn’t it a solved problem?
http://wlcg.web.cern.ch/
http://www.mygrid.org.uk/
http://galaxyproject.org/
https://www.globusonline.org/
https://www.opensciencegrid.org/
http://www.openmicroscopy.org/

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

“Before someone suggests OME, we don’t have the wherewithal to move to OMERO – the server setup is beyond me and not something that we can implement easily through our IT support. This is one for the future…”

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

isn’t it a solved problem?
Amazon AMIs
Media management (iTunes et al.)
SW packages (deb, rpm, egg, npm, ...)

Slide 25

Slide 25 text

setup.py
setup(
    name='beets',
    version='1.1.1',
    description='music tagger and library organizer',
    author='Adrian Sampson',
    author_email='[email protected]',
    url='http://beets.radbox.org/',
    license='MIT',
    long_description=_read('README.rst'),
    packages=[
        'beets',
        'beets.ui',
    ],
    install_requires=[
        'mutagen>=1.21',
        'musicbrainzngs>=0.4',
        'pyyaml',
    ],
    classifiers=[
        'Topic :: Multimedia :: Sound/Audio',
        'License :: OSI Approved :: MIT License',
        'Environment :: Console',
        'Programming Language :: Python :: 2.7',
    ],
)

Slide 26

Slide 26 text

package.json
{
  "name": "http-server",
  "version": "0.3.0",
  "description": "a simple command-line http server",
  "license": "MIT",
  "author": "Nodejitsu",
  "contributors": [
    {
      "name": "Marak Squires",
      "email": "[email protected]"
    }
  ],
  "repository": {
    "type": "git",
    "url": "https://github.com/nodejitsu/http-server.git"
  },
  "keywords": [
    "cli",
    "http",
    "server"
  ],
  "dependencies": {
    "flatiron": "0.1.x"
  }
}

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

data-package.json
{
  # general "metadata"
  name: "a-unique-human-readable-and-url-usable-identifier",
  title: "A nice title",
  licenses: [...],
  sources: [...],
  resources: [
    { ... resource info described below ... }
  ],
  # optional "metadata"
  ... additional information ...
}
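A descriptor like the one above is easy to check by machine. The sketch below loads a JSON descriptor and reports missing keys. It treats the five general-metadata keys from the slide as required, which is an assumption made for illustration. The example values and the `path` key inside `resources` are hypothetical too.

```python
import json

# Assumption: the general-metadata keys shown on the slide are all required.
REQUIRED_KEYS = ("name", "title", "licenses", "sources", "resources")


def missing_keys(descriptor):
    """Return the required keys that are absent from the descriptor."""
    return [key for key in REQUIRED_KEYS if key not in descriptor]


def load_descriptor(text):
    """Parse a JSON descriptor and fail loudly if required keys are missing."""
    descriptor = json.loads(text)
    missing = missing_keys(descriptor)
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return descriptor


# Hypothetical descriptor for the fMRI project used earlier in the talk.
EXAMPLE = """
{
  "name": "fmri-lang-var",
  "title": "fMRI language variability",
  "licenses": [],
  "sources": [],
  "resources": [{"path": "fMRI_Lang/01_GRV/analyze/t1_3d.img"}]
}
"""

if __name__ == "__main__":
    print(load_descriptor(EXAMPLE)["title"])
```

Note that the slide's version uses `#` comments and unquoted keys, so it is pseudo-JSON. A real descriptor would need strict JSON, as in `EXAMPLE`.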

Slide 29

Slide 29 text

data-package.yaml
---
project:
  name: "project name"
  license: ...
  description: ...
  resources: ...
experiment:
  sample: "sample_id"
  date: "2013-01-01"
  equipment: ...
acquisition:  # (simulation, data-analysis)
  settings: ...
data:
  - path: url

Slide 30

Slide 30 text

dacets
$ dct import /path/to/project/
$ dct add /path/to/file
$ dct remove /path/to/file
$ dct find "mri+brain+3T+T1" -o dacets.json
$ dct sync -f dacets.json
>>> from dacets import Dacets
>>> dc1 = Dacets.load('dacets.json')
>>> dc2 = Dacets.find('mri+brain+3T+T1')
>>> run_pipeline(on=dc2)
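The `+`-separated query in `dct find` suggests a search where every facet must match. The sketch below shows one possible reading of that idea. It is not the dacets implementation: the `Record` class, the helper functions and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One dataset entry: a path plus a set of lowercase facet labels."""
    path: str
    facets: frozenset


def make_record(path, *facets):
    """Build a record, normalizing facet labels to lowercase."""
    return Record(path, frozenset(label.lower() for label in facets))


def parse_query(query):
    """Split 'mri+brain+3T+T1' into a set of lowercase facet labels."""
    return frozenset(term.strip().lower() for term in query.split("+") if term.strip())


def find(records, query):
    """Return records that carry every facet named in the query (AND search)."""
    wanted = parse_query(query)
    return [record for record in records if wanted <= record.facets]


# Hypothetical catalog loosely based on the directory trees shown earlier.
CATALOG = [
    make_record("Anat_variability/01_GRV/anatomy/T1.img", "mri", "brain", "3T", "T1"),
    make_record("fMRI_Lang/01_GRV/analyze/fmri_noun.img", "mri", "brain", "3T", "fMRI"),
    make_record("fMRI_Lang/01_GRV/analyze/t1_3d.img", "mri", "brain", "3T", "T1"),
]

if __name__ == "__main__":
    for record in find(CATALOG, "mri+brain+3T+T1"):
        print(record.path)
```

Storing facets as sets keeps the match a simple subset test. A real tool would also need persistence and an index, which are left out here.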

Slide 31

Slide 31 text

dacets
datasets + facets
binary files DVCS
python module (integration with ipython notebook)

Slide 32

Slide 32 text

interesting tech
git-annex
hg LargefilesExtension
http://neuralensemble.org/sumatra/
https://github.com/okfn/dpm/