Slide 1

Slide 1 text

data + datadex
Juan Batiz-Benet
http://juan.benet.ai
@juanbenet
2014-03-11 @ScienceExchange

Slide 2

Slide 2 text

Introduction
- Data management has big, silly problems
- Software engineering solved similar issues
- New tools + culture can help!
- Think open source, GitHub, package managers
read: http://juan.benet.ai/data/2014-02-21/lets-solve-data-management/

Slide 3

Slide 3 text

1. The Problems in Data Management
2. The Case for Data Package Managers
3. data and datadex
4. Demos
5. Future, Q&A

Slide 4

Slide 4 text

- Distribution
- Versioning
- Permanence
- Indexing
- Formatting
- Licensing
- Open Access
read: http://juan.benet.ai/data/2014-02-21/data-management-problems/

Slide 11

Slide 11 text

1. The Problems in Data Management
2. The Case for Data Package Managers
3. data and datadex
4. Demos
5. Future, Q&A

Slide 12

Slide 12 text

- Distribution
- Versioning
- Permanence
- Indexing
- Formatting
- Licensing
- Open Access
read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/

Slide 15

Slide 15 text

- Distribution
- Versioning
- Permanence
- Indexing
- Formatting
- Licensing
- Open Access
v1.2.0 v1.3.0 v2.0.0
read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/
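The version tags on this slide can be compared mechanically. The sketch below is illustrative only: it assumes the tags follow a MAJOR.MINOR.PATCH scheme in the style of semantic versioning, which the slide suggests but does not state, and the function names (`parse_version`, `latest`, `is_breaking`) are hypothetical, not part of the `data` tool.

```python
import re


def parse_version(tag):
    """Turn a tag such as 'v1.3.0' into a tuple of ints, e.g. (1, 3, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def latest(tags):
    """Return the tag with the highest version, compared numerically."""
    return max(tags, key=parse_version)


def is_breaking(old, new):
    """Assumption: a major-version bump signals an incompatible change."""
    return parse_version(new)[0] > parse_version(old)[0]


if __name__ == "__main__":
    tags = ["v1.3.0", "v2.0.0", "v1.2.0"]
    print(latest(tags))
    print(is_breaking("v1.3.0", "v2.0.0"))
```

Comparing tuples of integers rather than raw strings matters once a component reaches two digits: as strings, "v1.10.0" sorts before "v1.9.0".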

Slide 17

Slide 17 text

- Distribution
- Versioning
- Permanence
- Indexing
- Formatting
- Licensing
- Open Access
hinton/[email protected] hinton/[email protected] jbenet/[email protected]
read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/

Slide 19

Slide 19 text

- Distribution
- Versioning
- Permanence
- Indexing
- Formatting
- Licensing
- Open Access
...
Enter License: MIT
...
read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/

Slide 20

Slide 20 text

- Distribution
- Versioning
- Permanence
- Indexing
- Formatting
- Licensing
- Open Access
> data fork hinton/cifar
> data publish jbenet/cifar
read: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers/

Slide 21

Slide 21 text

1. The Problems in Data Management
2. The Case for Data Package Managers
3. data and datadex
4. Demos
5. Future, Q&A

Slide 22

Slide 22 text

data - package manager command line tool
&
datadex - dataset package index

Slide 23

Slide 23 text

Main Goals:
- SAVE TIME for scientists and data workers
- incentivize data publishing + reduce friction
- permanent research data repository
- enable wide collaboration

Slide 24

Slide 24 text

data - package manager command line tool
> data
github.com/jbenet/data

Slide 25

Slide 25 text

data - package manager command line tool
> data get jbenet/cifar
github.com/jbenet/data

Slide 26

Slide 26 text

data - package manager command line tool
> data get jbenet/cifar
> data get jbenet/[email protected]
github.com/jbenet/data

Slide 27

Slide 27 text

data - package manager command line tool
> data get jbenet/cifar
> data get jbenet/[email protected]
> data publish
github.com/jbenet/data

Slide 28

Slide 28 text

> cat Datafile
dataset: jbenet/[email protected]
tagline: Example dataset with zipcodes.
website: http://datadex.io/jbenet/ [email protected]
license: CC-BY-SA
sources:
- http://federalgovernmentzipcodes.us/

note: will support OKFN’s data-packages.
github.com/jbenet/data
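To make the Datafile layout concrete, here is a minimal sketch of reading its `key: value` lines and `- item` lists. It is an assumption-laden illustration, not the tool's actual parser: the real `data` tool may use a full YAML library, `parse_datafile` is a hypothetical name, and the `dataset` and `website` fields are left out of the sample because their values are obscured in this transcript.

```python
def parse_datafile(text):
    """Parse 'key: value' lines, and 'key:' followed by '- item' lines."""
    fields = {}
    current_list = None  # key whose list items are currently being collected
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if current_list is None:
                raise ValueError(f"list item without a key: {line!r}")
            fields[current_list].append(line[2:].strip())
            continue
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        key, value = key.strip(), value.strip()
        if value:
            fields[key] = value
            current_list = None
        else:
            fields[key] = []  # an empty value opens a list
            current_list = key
    return fields


SAMPLE = """\
tagline: Example dataset with zipcodes.
license: CC-BY-SA
sources:
- http://federalgovernmentzipcodes.us/
"""

if __name__ == "__main__":
    for key, value in parse_datafile(SAMPLE).items():
        print(key, "=", value)
```

Keeping the metadata in a small text file next to the data means it can be versioned and published together with the dataset itself.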

Slide 29

Slide 29 text

today (cutting corners to ship):
data ----> poor man’s git version control
data ----> uploads to s3 (for now)
data ----> publishes to datadex
github.com/jbenet/data

Slide 30

Slide 30 text

tomorrow (once gfs is built):
data ----> full git object model versioning
data ----> uploads to distributed fs
data ----> publishes to datadex
github.com/jbenet/data
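For readers unfamiliar with the git object model mentioned here, the sketch below shows its core idea: a file's identity is a hash of its contents, so two dataset versions can be compared by comparing hashes. `git_blob_id` reproduces how git itself computes blob ids; `manifest` and `changed` are hypothetical helpers for illustration and say nothing about how `data` or gfs actually store versions.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the id git assigns to a blob: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def manifest(files):
    """Map each path to the blob id of its contents (hypothetical helper)."""
    return {path: git_blob_id(data) for path, data in files.items()}


def changed(old, new):
    """List paths that were added, removed, or modified between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


if __name__ == "__main__":
    v1 = manifest({"zipcodes.csv": b"zip,city\n", "README": b"hello\n"})
    v2 = manifest({"zipcodes.csv": b"zip,city,state\n", "README": b"hello\n"})
    print(changed(v1, v2))
```

Because ids depend only on content, an unchanged file keeps the same id across versions and never needs to be uploaded twice.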

Slide 32

Slide 32 text

datadex - dataset package index
github.com/jbenet/datadex

Slide 33

Slide 33 text

http://datadex.io
http://github.com/jbenet/datadex

Slide 34

Slide 34 text

datadex - dataset package index
- free and open source
- can run your own, private or public
- view + download files from website
github.com/jbenet/datadex

Slide 35

Slide 35 text

datadex - dataset package index
- encryption?
- access control?
- visualizations?
- code?
github.com/jbenet/datadex

Slide 36

Slide 36 text

1. The Problems in Data Management
2. The Case for Data Package Managers
3. data and datadex
4. Demos
5. Future, Q&A

Slide 37

Slide 37 text

DEMOS

Slide 38

Slide 38 text

1. The Problems in Data Management
2. The Case for Data Package Managers
3. data and datadex
4. Demos
5. Future, Q&A

Slide 39

Slide 39 text

planned for the future:
- Distributed FS (p2p + bittorrent + git = gfs. stay tuned. p0.)
- GUI Client (hard. do most in web. maybe dropbox-like)
- Web additions (viz, download, edit - but leverage github too)
- Social Data (science needs what github did to coding)
- Citations (DOI, CrossRef, anything is citable)

Slide 40

Slide 40 text

datadex could also manage:
- Formats (+ conversions locally, or server-side on-demand)
- Protocols (need a good language, type system, compiler, web textbox)
- Data-related Programs (with github repo urls, like npm)
- Anything, really (let’s experiment! add if makes sense)

Slide 41

Slide 41 text

LINKS!
- datadex.io (datadex public service)
- github.com/jbenet/data (pm cli repo)
- github.com/jbenet/datadex (index repo)
- juan.benet.ai/data (my data chronicles)

Slide 42

Slide 42 text

Must See Related Work:
- okfn.org/opendata (lots of open data efforts!)
- dat-data.com (awesome tabular data version control)
- academictorrents.com (torrent tracker for sci data)
and more at juan.benet.ai/data/2014-03-11/related-work

Slide 43

Slide 43 text

thanks!