Data + Datadex @ ScienceExchange

The awesome folks over at @ScienceExchange had me over to talk about data management in science and the tools I'm working on.

Juan Batiz-Benet

March 11, 2014

Transcript

  1. data + datadex Juan Batiz-Benet http://juan.benet.ai @juanbenet 2014-03-11 @ScienceExchange

  2. Introduction - Data management has big, silly problems - Software engineering solved similar issues - New tools + culture can help! - Think open source, GitHub, package managers read: http://juan.benet.ai/data/2014-02-21/lets-solve-data-management/
  3. 1. The Problems in Data Management 2. The Case for Data Package Managers 3. data and datadex 4. Demos 5. Future, Q&A
  4. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  5. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  6. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  7. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  8. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  9. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  10. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  11. 1. The Problems in Data Management 2. The Case for Data Package Managers 3. data and datadex 4. Demos 5. Future, Q&A
  12. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  13. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  14. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  15. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access v1.2.0 v1.3.0 v2.0.0 read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  16. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  17. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access hinton/cifar.numpy@v1.2.0 hinton/cifar.matlab@v1.2.0 jbenet/cifar.matlab@v1.2.1 read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  18. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  19. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access ... Enter License: MIT ... read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  20. - Distribution - Versioning - Permanence - Indexing - Formatting - Licensing - Open Access > data fork hinton/cifar > data publish jbenet/cifar read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  21. 1. The Problems in Data Management 2. The Case for Data Package Managers 3. data and datadex 4. Demos 5. Future, Q&A
  22. data - package manager command line tool & datadex - dataset package index
  23. Main Goals: - SAVE TIME for scientists and data workers - incentivize data publishing + reduce friction - permanent research data repository - enable wide collaboration
  24. data - package manager command line tool > data github.com/jbenet/data

  25. data - package manager command line tool > data get jbenet/cifar github.com/jbenet/data
  26. data - package manager command line tool > data get jbenet/cifar > data get jbenet/cifar@v1.1 github.com/jbenet/data
  27. data - package manager command line tool > data get jbenet/cifar > data get jbenet/cifar@v1.1 > data publish github.com/jbenet/data
  28. > cat Datafile
    dataset: jbenet/zipcodes-example@1.2
    tagline: Example dataset with zipcodes.
    website: http://datadex.io/jbenet/zipcodes-example@1.0
    license: CC-BY-SA
    sources:
    - http://federalgovernmentzipcodes.us/
    note: will support OKFN’s data-packages.
    github.com/jbenet/data
  29. today (cutting corners to ship):
    data ----> poor man’s git version control
    data ----> uploads to s3 (for now)
    data ----> publishes to datadex
    github.com/jbenet/data
  30. tomorrow (once gfs is built):
    data ----> full git object model versioning
    data ----> uploads to distributed fs
    data ----> publishes to datadex
    github.com/jbenet/data
  31. None
  32. datadex - dataset package index github.com/jbenet/datadex

  33. http://datadex.io http://github.com/jbenet/datadex

  34. datadex - dataset package index - free and open source - can run your own, private or public - view + download files from website github.com/jbenet/datadex
  35. datadex - dataset package index - encryption? - access control? - visualizations? - code? github.com/jbenet/datadex
  36. 1. The Problems in Data Management 2. The Case for Data Package Managers 3. data and datadex 4. Demos 5. Future, Q&A
  37. DEMOS

  38. 1. The Problems in Data Management 2. The Case for Data Package Managers 3. data and datadex 4. Demos 5. Future, Q&A
  39. planned for the future: - Distributed FS (p2p + bittorrent + git = gfs. stay tuned. p0.) - GUI Client (hard. do most in web. maybe dropbox-like) - Web additions (viz, download, edit - but leverage github too) - Social Data (science needs what github did to coding) - Citations (DOI, CrossRef, anything is citable)
  40. datadex could also manage: - Formats (+ conversions locally, or server-side on-demand) - Protocols (need a good language, type system, compiler, web textbox) - Data-related Programs (with github repo urls, like npm) - Anything, really (let’s experiment! add if it makes sense)
  41. LINKS! - datadex.io (datadex public service) - github.com/jbenet/data (pm cli repo) - github.com/jbenet/datadex (index repo) - juan.benet.ai/data (my data chronicles)
  42. Must See Related Work: - okfn.org/opendata (lots of open data efforts!) - dat-data.com (awesome tabular data version control) - academictorrents.com (torrent tracker for sci data) and more at juan.benet.ai/data/2014-03-11/related-work
  43. thanks!
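
The slides name datasets with handles such as jbenet/cifar@v1.1 or hinton/cifar.numpy@v1.2.0, and slide 28 shows a Datafile made of simple `key: value` lines. As a rough illustration of how such handles and files could be read, here is a minimal Python sketch. It is not taken from the data tool: the handle pattern, the function names `parse_handle` and `parse_datafile`, and the handling of `- item` lists are all assumptions inferred from the examples on the slides.

```python
import re
from typing import NamedTuple, Optional

# Assumed handle shape: owner/name, optionally followed by @version.
# This pattern is inferred from the slide examples, not from the data tool.
HANDLE_RE = re.compile(
    r"^(?P<owner>[A-Za-z0-9_-]+)/(?P<name>[A-Za-z0-9._-]+)"
    r"(?:@(?P<version>[A-Za-z0-9._-]+))?$"
)


class Handle(NamedTuple):
    owner: str
    name: str
    version: Optional[str]  # None when the handle carries no @version


def parse_handle(text: str) -> Handle:
    """Split a handle like 'jbenet/cifar@v1.1' into owner, name, version."""
    match = HANDLE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a dataset handle: {text!r}")
    return Handle(match["owner"], match["name"], match["version"])


def parse_datafile(text: str) -> dict:
    """Read flat 'key: value' lines and '- item' lists.

    Hypothetical reader for a small YAML-like subset; a key with an
    empty value starts a list that collects the '- item' lines after it.
    """
    result = {}
    list_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if list_key is None:
                raise ValueError(f"list item without a key: {line!r}")
            result[list_key].append(line[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {line!r}")
        key, value = key.strip(), value.strip()
        if value:
            result[key] = value
            list_key = None
        else:
            result[key] = []
            list_key = key
    return result


if __name__ == "__main__":
    datafile = """\
dataset: jbenet/zipcodes-example@1.2
tagline: Example dataset with zipcodes.
license: CC-BY-SA
sources:
- http://federalgovernmentzipcodes.us/
"""
    meta = parse_datafile(datafile)
    print(parse_handle(meta["dataset"]))
    print(meta["sources"])
```

Keeping the version optional mirrors the two forms on slide 26, where `data get jbenet/cifar` and `data get jbenet/cifar@v1.1` both appear; how the real tool resolves a handle without a version is not stated in the slides.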