A Platform for Data Science

Deepak Singh
February 19, 2012

Data, life science, productivity and cloud computing

Transcript

  1. A Platform for Data Science

    Deepak Singh, Principal Product Manager
  2. Data Clips allow the results of SQL queries on a Heroku Postgres database to be easily shared. Simply create a query on postgres.heroku.com, and then share the resulting URL with co-workers, colleagues, or the world. Data clips can be shared through e-mail, Twitter, IRC, or any other medium; they are just URLs. The recipients of a data clip are able to view the data in their browser or download it in JSON, CSV, XML, or Microsoft Excel formats.
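Because a data clip can be downloaded as CSV, a recipient can load it with nothing but a standard library. The following is a minimal sketch in Python; the function name `parse_clip_csv` and the sample columns (`signup_day`, `users`) are hypothetical and stand in for whatever a real clip would return.

```python
import csv
import io


def parse_clip_csv(text):
    """Parse the CSV export of a data clip into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical CSV body, standing in for a file downloaded from a data clip URL.
sample_csv = "signup_day,users\n2012-02-17,42\n2012-02-18,57\n"

rows = parse_clip_csv(sample_csv)
total_users = sum(int(row["users"]) for row in rows)
print(rows[0]["signup_day"], total_users)
```

Every CSV value arrives as a string, so numeric columns need an explicit conversion, as with `int(row["users"])` above.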
  3. Hardware: CPU, storage, memory

    Data management: collections, datasets, provenance
    Software: parallelization, optimization
    Availability: backup, redundant, replicated
    Cost: small
  4. CloudFormation Templates

    "ClusterUserKeys" : {
      "Type" : "AWS::IAM::AccessKey",
      "Properties" : {
        "UserName" : { "Ref": "ClusterUser" }
      }
    },

    "Ec2Instance" : {
      "Type" : "AWS::EC2::Instance",
      "Properties" : {
        "SecurityGroups" : [ { "Ref" : "InstanceSecurityGroup" } ],
        "InstanceType" : "t1.micro",
        "ImageId" : "ami-7341831a",
        "KeyName" : { "Ref" : "KeyName" },
        "Tags" : [{
          "Key" : "Role",
          "Value" : "Controller"
        }],
        "UserData" : { "Fn::Base64" : { "Fn::Join" : ["",
          [
            "#!/bin/sh\n",
            "/opt/aws/bin/cfn-init ", " -s ", { "Ref" : "AWS::StackName" }, " -r Ec2Instance ",
            "--access-key=", { "Ref" : "ClusterUserKeys" }, " ", "--secret-key=", { "Fn::GetAtt" : ["ClusterUserKeys", "SecretAccessKey"]}, "\n",
            "cd /usr/src/pycrypto/pycrypto-2.4; /usr/bin/python setup.py build\n",
            "cd /usr/src/pycrypto/pycrypto-2.4; /usr/bin/python setup.py install\n",
            "cd /home/ec2-user/starcluster; /usr/bin/python distribute_setup.py\n",
            "cd /home/ec2-user/starcluster; /usr/bin/python setup.py install\n",
            "/bin/mkdir /home/ec2-user/.starcluster\n",
            "/bin/chown ec2-user:ec2-user -R /home/ec2-user/.starcluster\n",
            "/usr/bin/ruby /home/ec2-user/parser.rb /home/ec2-user/cc2-template.erb /home/ec2-user/values.yml > /home/ec2-user/.starcluster/config\n",
            "/usr/bin/starcluster -c /home/ec2-user/.starcluster/config createkey ", { "Ref" : "ClusterKeypair" }, " -o /home/ec2-user/.ssh/rsa-", { "Ref" : "ClusterKeypair" }, "\n",
            "/bin/chown ec2-user:ec2-user -R /home/ec2-user/.ssh/rsa-", { "Ref" : "ClusterKeypair" }, "\n",
            "cd /home/ec2-user/; /usr/bin/starcluster -c /home/ec2-user/.starcluster/config start ec2-cluster \n"
          ]]}}
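The `UserData` value in the template is built by `Fn::Join`, which concatenates literal strings with values that CloudFormation fills in at stack creation (`Ref`, `Fn::GetAtt`), and `Fn::Base64`, which encodes the result. The sketch below is a toy Python resolver for just those four intrinsics, meant only to show how the fragments become one shell script. It is not how CloudFormation itself is implemented, and the stack name and key values are made-up placeholders.

```python
import base64


def resolve(node, refs, attrs):
    """Resolve a small subset of CloudFormation intrinsics to a plain string."""
    if isinstance(node, str):
        return node
    if "Ref" in node:
        return refs[node["Ref"]]
    if "Fn::GetAtt" in node:
        resource, attribute = node["Fn::GetAtt"]
        return attrs[(resource, attribute)]
    if "Fn::Join" in node:
        delimiter, parts = node["Fn::Join"]
        return delimiter.join(resolve(part, refs, attrs) for part in parts)
    if "Fn::Base64" in node:
        text = resolve(node["Fn::Base64"], refs, attrs)
        return base64.b64encode(text.encode("utf-8")).decode("ascii")
    raise ValueError("unsupported node: %r" % (node,))


# First two lines of the UserData script from the slide.
user_data = {"Fn::Base64": {"Fn::Join": ["", [
    "#!/bin/sh\n",
    "/opt/aws/bin/cfn-init ", " -s ", {"Ref": "AWS::StackName"}, " -r Ec2Instance ",
    "--access-key=", {"Ref": "ClusterUserKeys"}, " ",
    "--secret-key=", {"Fn::GetAtt": ["ClusterUserKeys", "SecretAccessKey"]}, "\n",
]]}}

# Placeholder values (assumptions); real ones are supplied by CloudFormation.
refs = {"AWS::StackName": "demo-stack", "ClusterUserKeys": "EXAMPLEKEYID"}
attrs = {("ClusterUserKeys", "SecretAccessKey"): "EXAMPLESECRET"}

encoded = resolve(user_data, refs, attrs)
script = base64.b64decode(encoded).decode("utf-8")
print(script)
```

Decoding the Base64 value back, as the last lines do, recovers the script the instance would run at boot: a `#!/bin/sh` line followed by the `cfn-init` call with the stack name and credentials substituted in.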
  5. Netflix needed to transcode 17,000 titles (80TB of data) to support the launch of Sony PS3. They provisioned 1200 Amazon EC2 instances and completed the transcoding process in just days. Source: Adrian Cockcroft (Netflix)
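To put the Netflix figures in proportion, the quoted totals can be divided out. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and an even spread of work across instances, neither of which the slide states.

```python
titles = 17_000
total_bytes = 80 * 10**12   # 80 TB, assuming decimal terabytes
instances = 1_200

gb_per_title = total_bytes / titles / 10**9
titles_per_instance = titles / instances
gb_per_instance = total_bytes / instances / 10**9

print(f"{gb_per_title:.1f} GB per title")
print(f"{titles_per_instance:.1f} titles per instance")
print(f"{gb_per_instance:.1f} GB per instance")
```

Under those assumptions each title averages a little under 5 GB, and each instance handles roughly 14 titles.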
  6. Storage: Scale of Amazon S3

    Total number of objects stored in Amazon S3:
    Q4 2006: 2.9 billion
    Q4 2007: 14 billion
    Q4 2008: 40 billion
    Q4 2009: 102 billion
    Q4 2010: 262 billion
    Q4 2011: 762 billion
    Peak requests: 500,000+ per second
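The growth rate implied by these object counts can be worked out directly. The sketch below assumes the counts rise monotonically with time, that is, 262 billion for Q4 2010 and 762 billion for Q4 2011, and computes the year-over-year factors and the compound annual growth factor over the five years.

```python
# Objects stored in Amazon S3, in billions, keyed by the Q4 of each year.
objects_billions = {
    2006: 2.9,
    2007: 14,
    2008: 40,
    2009: 102,
    2010: 262,
    2011: 762,
}

years = sorted(objects_billions)

# Growth factor from each year to the next.
yearly_growth = {
    year: objects_billions[year] / objects_billions[year - 1]
    for year in years[1:]
}

# Compound annual growth factor across the whole period.
span = years[-1] - years[0]
compound_growth = (objects_billions[years[-1]] / objects_billions[years[0]]) ** (1 / span)

for year, factor in yearly_growth.items():
    print(f"Q4 {year - 1} -> Q4 {year}: x{factor:.2f}")
print(f"compound annual growth: x{compound_growth:.2f}")
```

On these numbers the object count roughly triples each year on average.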
  7. “Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.” Mike Driscoll - Metamarkets
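The quoted figure translates into a per-instance scan rate with a little arithmetic. This sketch assumes the work is spread evenly over the 40 instances, which the quote does not state.

```python
rows = 1_000_000_000
seconds = 0.950
instances = 40

cluster_rows_per_second = rows / seconds
instance_rows_per_second = cluster_rows_per_second / instances

print(f"cluster:  {cluster_rows_per_second / 1e9:.2f} billion rows/s")
print(f"instance: {instance_rows_per_second / 1e6:.1f} million rows/s")
```

That works out to a little over one billion rows per second for the cluster, or roughly 26 million rows per second per instance.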
  8. [Chart: cost ($0 to $25,000) against utilization (5% to 100%) for On Demand, Light Utilization, Medium Utilization, and High Utilization instances, with callouts at 40%, 80%, and 10%]
  9. “Now clients don’t have to wait in a queue… All projects get priority.” -- Adam Kraut
  10. Services: Amazon Elastic Compute Cloud (EC2), Amazon Elastic MapReduce, Amazon Simple Storage Service, AWS Import/Export, Amazon Elastic Block Store, Amazon RDS, Amazon DynamoDB, Amazon VPC, AWS CloudFormation
  11. Services: DynamoDB is a fully managed NoSQL database service that provides extremely fast and predictable performance with seamless scalability.

    Zero administration
    Low-latency SSDs
    Reserved capacity
    Unlimited potential storage and throughput
  12. [Architecture diagram: AWS Public Data, EBS Snapshot, S3 Bucket, EC2 Instance (Orchestrator), CloudFormation Template, Route 53 Zone Apex (optional), Chef Cookbook, Public AMIs, Autoscale (optional), Web Server, Chef, Pre/Post processing, End User, Developer, Console, API/CLI, HTTP/HTTPS Client]
  13. [Architecture diagram: EC2 Instance (Orchestrator), Route 53 Zone Apex (optional), Autoscale (optional), Web Server, Chef, Pre/Post processing, End User, HTTP/HTTPS Client, five EC2 instances, StarCluster, Blastz]