Platforms for Data Science
Deepak Singh
Principal Product Manager
Slide 2
Slide 2 text
No content
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
bioinformatics
Slide 5
Slide 5 text
collection
Slide 6
Slide 6 text
curation
Slide 7
Slide 7 text
analysis
Slide 8
Slide 8 text
so what?
Slide 9
Slide 9 text
Image: Yael Fitzpatrick (AAAS)
Slide 10
Slide 10 text
lots of data
Slide 11
Slide 11 text
lots of people
Slide 12
Slide 12 text
lots of places
Slide 13
Slide 13 text
constant change
Slide 14
Slide 14 text
we want to make our
data more effective
Slide 15
Slide 15 text
versioning
Slide 16
Slide 16 text
provenance
Slide 17
Slide 17 text
filter
Via asklar under a CC-BY license
Slide 18
Slide 18 text
aggregate
Image: Chris Heiler
Slide 19
Slide 19 text
extend
Image: Bethan
Slide 20
Slide 20 text
human interfaces
Image: Sebastian Anthony
Slide 21
Slide 21 text
share
Slide 22
Slide 22 text
communicate
Image: Leo Reynolds
Slide 23
Slide 23 text
hard problem
Slide 24
Slide 24 text
really hard problem
Slide 25
Slide 25 text
so how do we
get there?
Slide 26
Slide 26 text
information
platforms
Slide 27
Slide 27 text
Image: Drew Conway
Slide 28
Slide 28 text
dataspaces
Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data
Slide 29
Slide 29 text
the unreasonable
effectiveness of data
Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)
Slide 30
Slide 30 text
accept all data
formats
Slide 31
Slide 31 text
evolve APIs
Slide 32
Slide 32 text
data as a
programmable
resource
Slide 33
Slide 33 text
data is a
royal garden
Slide 34
Slide 34 text
compute is a
fungible commodity
Slide 35
Slide 35 text
constraints
everywhere
Slide 36
Slide 36 text
Hardware: CPU, storage, memory
Data management: collections, datasets, provenance
Software: parallelization, optimization
Availability: backup, redundant, replicated
Cost Small
Slide 37
Slide 37 text
remove constraints
Credit: Pieter Musterd under a CC-BY-NC-ND license
Slide 38
Slide 38 text
amazon web services
Slide 39
Slide 39 text
your
infrastructure
Slide 40
Slide 40 text
No content
Slide 41
Slide 41 text
ec2-run-instances
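As a rough illustration of what "programmable" infrastructure means here, the sketch below assembles an `ec2-run-instances` command line from Python without executing it. The `-n` (instance count), `-t` (instance type), and `-k` (key pair) options belong to the legacy EC2 API tools; the AMI ID, key pair name, and helper function name are hypothetical placeholders, not values from the talk.

```python
import shlex


def build_run_instances_command(ami_id, count=1, instance_type="m1.small", key_name=None):
    """Return the argv list for an ec2-run-instances call (nothing is executed)."""
    argv = ["ec2-run-instances", ami_id, "-n", str(count), "-t", instance_type]
    if key_name is not None:
        argv += ["-k", key_name]
    return argv


# Hypothetical AMI ID and key pair name, for illustration only.
command = build_run_instances_command("ami-00000000", count=4, key_name="example-key")
print(shlex.join(command))
```

Building the argument list as data, rather than typing the command by hand, is what lets a script request one machine or a thousand with the same code path.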
Slide 42
Slide 42 text
No content
Slide 43
Slide 43 text
secure
global
on demand
Slide 44
Slide 44 text
programmable
Slide 45
Slide 45 text
No content
Slide 46
Slide 46 text
No content
Slide 47
Slide 47 text
No content
Slide 48
Slide 48 text
elastic
Slide 49
Slide 49 text
Netflix needed to transcode
17,000 titles (80TB of data) to
support the launch of Sony PS3.
They provisioned 1200 Amazon
EC2 instances and completed
the transcoding process in just
days.
Source: Adrian Cockcroft (Netflix)
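To put the Netflix figures in perspective, a back-of-the-envelope calculation is sketched below. It assumes the work was spread evenly across instances and that 1 TB means 1000 GB; neither assumption comes from the slide.

```python
TITLES = 17_000
DATA_TB = 80
INSTANCES = 1_200

# Assumption: decimal units (1 TB = 1000 GB) and a perfectly even split.
data_gb = DATA_TB * 1000

titles_per_instance = TITLES / INSTANCES
gb_per_instance = data_gb / INSTANCES
gb_per_title = data_gb / TITLES

print(f"titles per instance: {titles_per_instance:.1f}")
print(f"GB per instance:     {gb_per_instance:.1f}")
print(f"GB per title:        {gb_per_title:.1f}")
```

Under these assumptions each instance handles roughly 14 titles, which is why adding instances shortens the job almost linearly.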
Slide 50
Slide 50 text
Source: Adrian Cockroft (Netflix)
Slide 51
Slide 51 text
durable
Slide 52
Slide 52 text
99.999999999%
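One way to read an eleven-nines figure is as an expected loss rate. The sketch below assumes the percentage is a per-object, per-year durability, which is an interpretation and not something the slide states. Exact fractions are used because floating point cannot represent a number this close to 1 precisely.

```python
from fractions import Fraction

# Assumption: the figure is the probability that one object survives one year.
DURABILITY = Fraction("99.999999999") / 100
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # exactly 1 / 10**11


def expected_annual_losses(object_count):
    """Expected number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


print(years_per_expected_loss(10_000))
```

With 10,000 stored objects, this reading gives one expected loss every 10 million years.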
Slide 53
Slide 53 text
I did say data was a
royal garden
Slide 54
Slide 54 text
performance
Slide 55
Slide 55 text
“Our 40-instance (m2.2xlarge) cluster can
scan, filter, and aggregate 1 billion rows in
950 milliseconds.”
Mike Driscoll - Metamarkets
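The quoted numbers translate into a throughput figure with simple arithmetic. The sketch below assumes the load is spread evenly over the 40 instances, which the quote does not say.

```python
ROWS = 1_000_000_000
ELAPSED_SECONDS = 0.950
INSTANCES = 40

rows_per_second = ROWS / ELAPSED_SECONDS
# Assumption: every instance scans an equal share of the rows.
rows_per_second_per_instance = rows_per_second / INSTANCES

print(f"cluster:      {rows_per_second:,.0f} rows/s")
print(f"per instance: {rows_per_second_per_instance:,.0f} rows/s")
```

That is a little over a billion rows per second for the cluster, or roughly 26 million rows per second per instance.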
Slide 56
Slide 56 text
WIEN2K Parallel Performance
H size 56,000 (25GB)
Runtime (16x8 processors):
Local (Infiniband): 3h:48
Cloud (10Gbps): 1h:30 ($40)
1200 atom unit cell; SCALAPACK+MPI diagonalization, matrix size 50k-100k
Credit: K. Jorissen, F. D. Vila, and J. J. Rehr (U. Washington)
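The two WIEN2K runtimes can be compared directly. The sketch below assumes the $40 on the slide is the total cost of the cloud run; the slide does not spell that out.

```python
def to_minutes(hours, minutes):
    """Convert an h:mm runtime into whole minutes."""
    return hours * 60 + minutes


local_minutes = to_minutes(3, 48)  # local cluster, Infiniband
cloud_minutes = to_minutes(1, 30)  # cloud cluster, 10Gbps

speedup = local_minutes / cloud_minutes

# Assumption: $40 is the total bill for the 1h:30 cloud run.
CLOUD_RUN_COST_USD = 40
cloud_cost_per_hour = CLOUD_RUN_COST_USD / (cloud_minutes / 60)

print(f"cloud speedup over local: {speedup:.2f}x")
print(f"implied cloud cost:       ${cloud_cost_per_hour:.2f}/hour")
```

On these numbers the cloud run finishes about 2.5 times faster than the local one.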
Slide 57
Slide 57 text
“Our tests have shown more than 90
percent scaling efficiency on
clusters of up to 128 GPUs each”
Slide 58
Slide 58 text
consumption
models
Slide 59
Slide 59 text
on-demand
Slide 60
Slide 60 text
Reserved Instances
Slide 61
Slide 61 text
what is the value of
your data?
Slide 62
Slide 62 text
No content
Slide 63
Slide 63 text
No content
Slide 64
Slide 64 text
the cloud's
biggest value
Slide 65
Slide 65 text
remove constraints
Slide 66
Slide 66 text
No content
Slide 67
Slide 67 text
Image: Chris Dagdigian
Slide 68
Slide 68 text
Credit: Angel Pizarro, U. Penn
Slide 69
Slide 69 text
13k sequences
- 10 min
- 0.1s per sequence
Slide 70
Slide 70 text
mapreduce for
genomics
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://contrail-bio.sourceforge.net
http://bowtie-bio.sourceforge.net/myrna/index.shtml
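For readers new to the pattern, here is a minimal single-process sketch of MapReduce applied to a genomics-flavored task: counting k-mers across sequencing reads. It only illustrates the map, shuffle, and reduce phases. It is not how the tools linked above are implemented, and the function names and sample reads are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-length window of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(reads, k):
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


# Made-up reads, for illustration only.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, 3)
print(counts)
```

Because each read is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes the pattern attractive for large read sets.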
Slide 71
Slide 71 text
No content
Slide 72
Slide 72 text
No content
Slide 73
Slide 73 text
No content
Slide 74
Slide 74 text
30,472 cores
Slide 75
Slide 75 text
$1279/hr
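Taken together with the 30,472 cores on the previous slide, the hourly price implies a cost per core-hour. The sketch below assumes both figures describe the same cluster, which the slides suggest but do not state. The `run_cost` helper is a hypothetical convenience, not anything from the talk.

```python
CORES = 30_472
COST_PER_HOUR_USD = 1279

# Assumption: the hourly price covers all 30,472 cores.
cost_per_core_hour = COST_PER_HOUR_USD / CORES


def run_cost(hours):
    """Total cost in USD of holding the whole cluster for a number of hours."""
    return hours * COST_PER_HOUR_USD


print(f"cost per core-hour: ${cost_per_core_hour:.4f}")
print(f"8-hour run:         ${run_cost(8):,}")
```

That works out to roughly four cents per core-hour.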
Slide 76
Slide 76 text
http://cloudbiolinux.org/
Slide 77
Slide 77 text
http://usegalaxy.org/cloud
Slide 78
Slide 78 text
“The process of moving StarMolsim over to the cloud to support the ‘Introduction to
Modeling and Simulation’ course at MIT was a huge success. The cloud enabled the STAR
group to move away from the responsibility of owning and maintaining dedicated hardware
and instead focus on their core mission of developing software and services for faculty,
students, and researchers at MIT.”
http://web.mit.edu/stardev/cluster/about.html
Slide 79
Slide 79 text
in summary
Slide 80
Slide 80 text
large scale data
requires a rethink
Slide 81
Slide 81 text
data architecture
Slide 82
Slide 82 text
compute architecture
Slide 83
Slide 83 text
distributed,
programmable
infrastructure
Slide 84
Slide 84 text
amazon web services
Slide 85
Slide 85 text
remove constraints
Slide 86
Slide 86 text
can we build data
science platforms?
Slide 87
Slide 87 text
there is no magic
there is only awesome
Slide 88
Slide 88 text
[email protected]
Twitter:@mndoci
http://slideshare.net/mndoci
http://mndoci.com
Inspiration and ideas from
Matt Wood & Larry Lessig
Credit: Oberazzi under a CC-BY-NC-SA license