Data + Datadex @ ScienceExchange

The awesome folks over at @ScienceExchange had me over to talk about data management in science and the tools I'm working on.

Juan Batiz-Benet

March 11, 2014

Transcript

  1. Introduction
     - Data management has big, silly problems
     - Software engineering solved similar issues
     - New tools + culture can help!
     - Think open source, GitHub, package managers
     read: http://juan.benet.ai/data/2014-02-21/lets-solve-data-management/
  2. 1. The Problems in Data Management
     2. The Case for Data Package Managers
     3. data and datadex
     4. Demos
     5. Future, Q&A
  3. - Distribution
     - Versioning
     - Permanence
     - Indexing
     - Formatting
     - Licensing
     - Open Access
     read: http://juan.benet.ai/data/2014-02-21/data-management-problems/
  10. 1. The Problems in Data Management
      2. The Case for Data Package Managers
      3. data and datadex
      4. Demos
      5. Future, Q&A
  11. - Distribution
      - Versioning
      - Permanence
      - Indexing
      - Formatting
      - Licensing
      - Open Access
      read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  14. - Distribution
      - Versioning
      - Permanence
      - Indexing
      - Formatting
      - Licensing
      - Open Access
      v1.2.0 v1.3.0 v2.0.0
      read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  16. - Distribution
      - Versioning
      - Permanence
      - Indexing
      - Formatting
      - Licensing
      - Open Access
      hinton/[email protected]
      hinton/[email protected]
      jbenet/[email protected]
      read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  18. - Distribution
      - Versioning
      - Permanence
      - Indexing
      - Formatting
      - Licensing
      - Open Access
      ... Enter License: MIT ...
      read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  19. - Distribution
      - Versioning
      - Permanence
      - Indexing
      - Formatting
      - Licensing
      - Open Access
      > data fork hinton/cifar
      > data publish jbenet/cifar
      read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
  20. 1. The Problems in Data Management
      2. The Case for Data Package Managers
      3. data and datadex
      4. Demos
      5. Future, Q&A
  21. Main Goals:
      - SAVE TIME for scientists and data workers
      - incentivize data publishing + reduce friction
      - permanent research data repository
      - enable wide collaboration
  22. data - package manager command line tool
      > data get jbenet/cifar
      github.com/jbenet/data
  23. data - package manager command line tool
      > data get jbenet/cifar
      > data get jbenet/[email protected]
      github.com/jbenet/data
  24. data - package manager command line tool
      > data get jbenet/cifar
      > data get jbenet/[email protected]
      > data publish
      github.com/jbenet/data
  25. > cat Datafile
      dataset: jbenet/[email protected]
      tagline: Example dataset with zipcodes.
      website: http://datadex.io/jbenet/ [email protected]
      license: CC-BY-SA
      sources:
        - http://federalgovernmentzipcodes.us/
      note: will support OKFN’s data-packages.
      github.com/jbenet/data
  26. today (cutting corners to ship):
      data ----> poor man’s git version control
      data ----> uploads to s3 (for now)
      data ----> publishes to datadex
      github.com/jbenet/data
  27. tomorrow (once gfs is built):
      data ----> full git object model versioning
      data ----> uploads to distributed fs
      data ----> publishes to datadex
      github.com/jbenet/data
  28. datadex - dataset package index
      - free and open source
      - can run your own, private or public
      - view + download files from website
      github.com/jbenet/datadex
  29. datadex - dataset package index
      - encryption?
      - access control?
      - visualizations?
      - code?
      github.com/jbenet/datadex
  30. 1. The Problems in Data Management
      2. The Case for Data Package Managers
      3. data and datadex
      4. Demos
      5. Future, Q&A
  32. planned for the future:
      - Distributed FS (p2p + bittorrent + git = gfs. stay tuned. p0.)
      - GUI Client (hard. do most in web. maybe dropbox-like)
      - Web additions (viz, download, edit - but leverage github too)
      - Social Data (science needs what github did to coding)
      - Citations (DOI, CrossRef, anything is citable)
  33. datadex could also manage:
      - Formats (+ conversions locally, or server-side on-demand)
      - Protocols (need a good language, type system, compiler, web textbox)
      - Data-related Programs (with github repo urls, like npm)
      - Anything, really (let’s experiment! add if it makes sense)
  34. LINKS!
      - datadex.io (datadex public service)
      - github.com/jbenet/data (pm cli repo)
      - github.com/jbenet/datadex (index repo)
      - juan.benet.ai/data (my data chronicles)
  35. Must See Related Work:
      - okfn.org/opendata (lots of open data efforts!)
      - dat-data.com (awesome tabular data version control)
      - academictorrents.com (torrent tracker for sci data)
      and more at juan.benet.ai/data/2014-03-11/related-work
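
For anyone who wants to poke at the Datafile format from slide 25, here is a minimal sketch in Python. It is not the parser the data tool uses. It assumes the file is a flat set of `key: value` lines plus simple `- item` lists, which matches what the slide shows but is far short of a general YAML parser. The function name `parse_datafile` and the dataset handle `someuser/somedataset` are hypothetical placeholders, not values from the talk.

```python
def parse_datafile(text):
    """Parse a flat `key: value` file with optional `- item` lists.

    Assumption: no nesting, no quoting, no multi-line values.
    """
    result = {}
    current_list_key = None  # key that is currently collecting list items
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("- "):
            if current_list_key is None:
                raise ValueError(f"list item without a key: {raw!r}")
            result[current_list_key].append(line[2:].strip())
            continue
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {raw!r}")
        key, value = key.strip(), value.strip()
        if value:
            result[key] = value
            current_list_key = None
        else:
            # A key with no value opens a list (e.g. `sources:`).
            result[key] = []
            current_list_key = key
    return result


# Placeholder Datafile modeled on the fields shown on slide 25.
EXAMPLE = """\
dataset: someuser/somedataset
tagline: Example dataset with zipcodes.
license: CC-BY-SA
sources:
  - http://federalgovernmentzipcodes.us/
"""

info = parse_datafile(EXAMPLE)
```

After running this, `info` is a plain dict, so `info["license"]` and `info["sources"]` can be checked before publishing. A real tool would more likely hand the file to a proper YAML library.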