A platform for data science - Systems Bioinformatics Workshop

Talk given at Systems Bioinformatics Workshop at ISB

Deepak Singh

October 01, 2011

Transcript

  1. There is no magic. There is only awesome.
     Deepak Singh: A platform for data science
  2. 3

  3. 3000 CPUs for one firm’s risk management application
     [Chart: “Number of EC2 Instances” per day across one week,
     roughly 3000 on weekdays vs 300 CPUs on weekends]
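The swing in that chart, from roughly 300 CPUs on weekends to 3000 on weekdays, is exactly what a programmable infrastructure makes cheap. A minimal sketch of that elasticity with the boto EC2 API of the era; the scale_to helper, the AMI id, and the c1.xlarge instance type are illustrative placeholders, not details from the talk:

    import boto

    conn = boto.connect_ec2()  # reads keys from the environment or ~/.boto

    def scale_to(target, running, image_id='ami-00000000'):  # placeholder AMI
        """Grow or shrink a pool of instances to the target size."""
        if len(running) < target:
            # launch enough instances to reach the target
            r = conn.run_instances(image_id,
                                   min_count=1,
                                   max_count=target - len(running),
                                   instance_type='c1.xlarge')
            running.extend(r.instances)
        elif len(running) > target:
            # terminate the surplus and drop it from the pool
            surplus = [i.id for i in running[target:]]
            conn.terminate_instances(instance_ids=surplus)
            del running[target:]
        return running

    # Monday morning: burst to 3000. Friday evening: fall back to 300.
    pool = scale_to(3000, [])
    pool = scale_to(300, pool)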
  4. # Chef recipe: install Phusion Passenger and its Apache module,
     # choosing packages per platform
     include_recipe "packages"
     include_recipe "ruby"
     include_recipe "apache2"

     if platform?("centos", "redhat")
       if dist_only?
         # just the gem, we'll install the apache module within apache2
         package "rubygem-passenger"
         return
       else
         package "httpd-devel"
       end
     else
       %w{apache2-prefork-dev libapr1-dev}.each do |pkg|
         package pkg do
           action :upgrade
         end
       end
     end

     gem_package "passenger" do
       version node[:passenger][:version]
     end

     execute "passenger_module" do
       command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
       creates node[:passenger][:module_path]
     end
  5. import boto
     import boto.emr
     from boto.emr.step import StreamingStep
     from boto.emr.bootstrap_action import BootstrapAction
     import time

     # set your aws keys and S3 bucket, e.g. from environment or .boto
     AWSKEY = ''
     SECRETKEY = ''
     S3_BUCKET = ''
     NUM_INSTANCES = 1

     # Connect to Elastic MapReduce
     conn = boto.connect_emr(AWSKEY, SECRETKEY)

     # Install packages via a bootstrap action
     bootstrap_step = BootstrapAction("download.tst",
         "s3://elasticmapreduce/bootstrap-actions/download.sh", None)

     # Set up mappers & reducers
     step = StreamingStep(name='Wordcount',
         mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
         cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
         reducer='aggregate',
         input='s3n://elasticmapreduce/samples/wordcount/input',
         output='s3n://' + S3_BUCKET + '/output/wordcount_output')

     jobid = conn.run_jobflow(
         name="testbootstrap",
         log_uri="s3://" + S3_BUCKET + "/logs",
         steps=[step],
         bootstrap_actions=[bootstrap_step],
         num_instances=NUM_INSTANCES)
     print "finished spawning job (note: starting still takes time)"

     # Poll the job state until the flow completes
     state = conn.describe_jobflow(jobid).state
     print "job state = ", state
     print "job id = ", jobid
     while state != u'COMPLETED':
         print time.localtime()
         time.sleep(30)
         state = conn.describe_jobflow(jobid).state
         print "job state = ", state
         print "job id = ", jobid

     print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
     print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
  6. “I terminate the instance and relaunch it. That’s my error
     handling.” Source: @jtimberman on Twitter
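The quote is the disposable-infrastructure idea in one line: when configuration is automated, a misbehaving server is cheaper to replace than to repair. A sketch of what that might look like with boto; the relaunch helper and the AMI id are hypothetical:

    import boto

    conn = boto.connect_ec2()  # reads keys from the environment or ~/.boto

    def relaunch(instance, image_id='ami-00000000'):  # placeholder AMI
        """Terminate a misbehaving instance and start a fresh copy of it."""
        conn.terminate_instances(instance_ids=[instance.id])
        reservation = conn.run_instances(image_id,
                                         instance_type=instance.instance_type)
        return reservation.instances[0]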
  7. WIEN2k Parallel Performance
     KS (Kohn-Sham) calculation for a huge system at 1 k-point: VERY
     DEMANDING of network performance. H size 56,000 (25 GB).

     Runtime (16x8 processors):
       Local (InfiniBand): 3h:48
       Cloud (10 Gbps):    1h:30 ($40)

     1200-atom unit cell; SCALAPACK+MPI diagonalization, matrix size 50k-100k
     Credit: K. Jorissen, F. D. Villa, and J. J. Rehr (U. Washington)
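The 10 Gbps figure points at EC2 Cluster Compute instances, which boto of that era could launch into a cluster placement group to get the full-bisection network a run like this needs. A hedged sketch, assuming a placeholder AMI and group name; the slide itself shows no launch code:

    import boto

    conn = boto.connect_ec2()  # reads keys from the environment or ~/.boto

    # one cluster placement group keeps all nodes on the 10 Gbps fabric
    conn.create_placement_group('wien2k-run', strategy='cluster')

    # 16 nodes x 8 cores matches the 16x8 processor count on the slide
    reservation = conn.run_instances('ami-00000000',  # placeholder HPC AMI
                                     min_count=16, max_count=16,
                                     instance_type='cc1.4xlarge',
                                     placement_group='wien2k-run')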