Slide 1
Slide 1 text
There is no magic There is only awesome Deepak Singh Platforms for data science
Slide 2
Slide 2 text
bioinformatics image: Ethan Hein
Slide 3
Slide 3 text
3
Slide 4
Slide 4 text
collection
Slide 5
Slide 5 text
curation
Slide 6
Slide 6 text
analysis
Slide 7
Slide 7 text
what’s the big deal?
Slide 8
Slide 8 text
No content
Slide 9
Slide 9 text
Source: http://www.nature.com/news/specials/bigdata/index.html
Slide 10
Slide 10 text
Image: Yael Fitzpatrick (AAAS)
Slide 11
Slide 11 text
Image: Yael Fitzpatrick (AAAS)
Slide 12
Slide 12 text
lots of data
Slide 13
Slide 13 text
lots of people
Slide 14
Slide 14 text
lots of places
Slide 15
Slide 15 text
constant change
Slide 16
Slide 16 text
we want to make our data more effective
Slide 17
Slide 17 text
versioning
Slide 18
Slide 18 text
provenance
Slide 19
Slide 19 text
filter
Slide 20
Slide 20 text
aggregate
Slide 21
Slide 21 text
extend
Slide 22
Slide 22 text
mashup
Slide 23
Slide 23 text
human interfaces
Slide 24
Slide 24 text
No content
Slide 25
Slide 25 text
image: Leo Reynolds
Slide 26
Slide 26 text
hard problem
Slide 27
Slide 27 text
really hard problem
Slide 28
Slide 28 text
so how do we get there?
Slide 29
Slide 29 text
information platforms
Slide 30
Slide 30 text
Image: Drew Conway
Slide 31
Slide 31 text
dataspaces Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data
Slide 32
Slide 32 text
the unreasonable effectiveness of data Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)
Slide 33
Slide 33 text
accept all data formats
Slide 34
Slide 34 text
evolve APIs
Slide 35
Slide 35 text
beyond databases and the data warehouse
Slide 36
Slide 36 text
data as a programmable resource
Slide 37
Slide 37 text
data is a royal garden
Slide 38
Slide 38 text
compute is a fungible commodity
Slide 39
Slide 39 text
optimizing the most valuable resource
Slide 40
Slide 40 text
compute, storage, workflows, memory, transmission, algorithms, cost, …
Slide 41
Slide 41 text
people Credit: Pieter Musterd under a CC-BY-NC-ND license
Slide 42
Slide 42 text
Image: Chris Dagdigian
Slide 43
Slide 43 text
my bias
Slide 44
Slide 44 text
cloud services
Slide 45
Slide 45 text
distributed systems
Slide 46
Slide 46 text
scale
Slide 47
Slide 47 text
global
Slide 48
Slide 48 text
consumption models
Slide 49
Slide 49 text
on-demand
Slide 50
Slide 50 text
what is the value of your data?
Slide 51
Slide 51 text
No content
Slide 52
Slide 52 text
No content
Slide 53
Slide 53 text
Credit: Angel Pizarro, U. Penn
Slide 54
Slide 54 text
mapreduce for genomics http://bowtie-bio.sourceforge.net/crossbow/index.shtml http://contrail-bio.sourceforge.net http://bowtie-bio.sourceforge.net/myrna/index.shtml
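For readers new to the pattern named on this slide, here is a minimal single-machine sketch of the map, shuffle, and reduce steps applied to counting k-mers in sequencing reads. It is an illustration only and is not how Crossbow, Contrail, or Myrna are implemented; the function names, the toy reads, and the choice of k = 3 are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (the k-mer)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each k-mer."""
    return {key: sum(values) for key, values in groups.items()}


def kmer_counts(reads, k=3):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


# Hypothetical toy reads, not real sequencing data.
reads = ["ACGTAC", "GTACGT"]
counts = kmer_counts(reads)
print(counts)
```

In a real MapReduce framework the map calls would run in parallel across many machines and the shuffle would move data over the network; here everything runs in one process to keep the three phases visible.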
Slide 55
Slide 55 text
No content
Slide 56
Slide 56 text
Bioproximity http://aws.amazon.com/solutions/case-studies/bioproximity/
Slide 57
Slide 57 text
No content
Slide 58
Slide 58 text
No content
Slide 59
Slide 59 text
30,472 cores
Slide 60
Slide 60 text
$1279/hr
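As a rough worked calculation, assuming the 30,472 cores and the $1279/hr figure on these two slides describe the same cluster (the transcript does not say so explicitly), the implied price per core-hour can be sketched as follows. The variable names are introduced here for illustration.

```python
# Figures taken from the two slides; pairing them is an assumption.
cores = 30_472
cost_per_hour_usd = 1279.0

# Implied cost of one core for one hour.
cost_per_core_hour = cost_per_hour_usd / cores
print(f"${cost_per_core_hour:.3f} per core-hour")
```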
Slide 61
Slide 61 text
http://cloudbiolinux.org/
Slide 62
Slide 62 text
http://usegalaxy.org/cloud
Slide 63
Slide 63 text
in summary
Slide 64
Slide 64 text
large scale data requires a rethink
Slide 65
Slide 65 text
data architecture
Slide 66
Slide 66 text
compute architecture
Slide 67
Slide 67 text
distributed, programmable infrastructure
Slide 68
Slide 68 text
cloud services
Slide 69
Slide 69 text
remove constraints
Slide 70
Slide 70 text
can we build data science platforms?
Slide 71
Slide 71 text
there is no magic there is only awesome
Slide 72
Slide 72 text
[email protected]
Twitter: @mndoci http://slideshare.net/mndoci http://mndoci.com Inspiration and ideas from Matt Wood & Larry Lessig Credit: Oberazzi under a CC-BY-NC-SA license