A platform for data science - Systems Bioinformatics Workshop

Talk given at Systems Bioinformatics Workshop at ISB

Deepak Singh

October 01, 2011

Transcript

  1. There is no magic. There is only awesome.
     Deepak Singh: A platform for data science
  2. 3

  3. 3000 CPUs for one firm’s risk management application
     [Chart: “Number of EC2 Instances” per day across one week,
     roughly 3000 on weekdays vs 300 CPUs on weekends]
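The swing in that chart, from roughly 300 CPUs on weekends to 3000 on weekdays, is exactly what a programmable infrastructure makes cheap. A minimal sketch of that elasticity with the boto EC2 API of the era; the scale_to helper, the AMI id, and the c1.xlarge instance type are illustrative placeholders, not details from the talk:

    import boto

    conn = boto.connect_ec2()  # reads keys from the environment or ~/.boto

    def scale_to(target, running, image_id='ami-00000000'):  # placeholder AMI
        """Grow or shrink a pool of instances to the target size."""
        if len(running) < target:
            # launch enough instances to reach the target
            r = conn.run_instances(image_id,
                                   min_count=1,
                                   max_count=target - len(running),
                                   instance_type='c1.xlarge')
            running.extend(r.instances)
        elif len(running) > target:
            # terminate the surplus and drop it from the pool
            surplus = [i.id for i in running[target:]]
            conn.terminate_instances(instance_ids=surplus)
            del running[target:]
        return running

    # Monday morning: burst to 3000. Friday evening: fall back to 300.
    pool = scale_to(3000, [])
    pool = scale_to(300, pool)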
  4. # Chef recipe: install Phusion Passenger and its Apache module,
     # choosing packages per platform
     include_recipe "packages"
     include_recipe "ruby"
     include_recipe "apache2"

     if platform?("centos", "redhat")
       if dist_only?
         # just the gem, we'll install the apache module within apache2
         package "rubygem-passenger"
         return
       else
         package "httpd-devel"
       end
     else
       %w{apache2-prefork-dev libapr1-dev}.each do |pkg|
         package pkg do
           action :upgrade
         end
       end
     end

     gem_package "passenger" do
       version node[:passenger][:version]
     end

     execute "passenger_module" do
       command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
       creates node[:passenger][:module_path]
     end
  5. import boto
     import boto.emr
     from boto.emr.step import StreamingStep
     from boto.emr.bootstrap_action import BootstrapAction
     import time

     # set your aws keys and S3 bucket, e.g. from environment or .boto
     AWSKEY = ''
     SECRETKEY = ''
     S3_BUCKET = ''
     NUM_INSTANCES = 1

     # Connect to Elastic MapReduce
     conn = boto.connect_emr(AWSKEY, SECRETKEY)

     # Install packages via a bootstrap action
     bootstrap_step = BootstrapAction("download.tst",
         "s3://elasticmapreduce/bootstrap-actions/download.sh", None)

     # Set up mappers & reducers
     step = StreamingStep(name='Wordcount',
         mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
         cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
         reducer='aggregate',
         input='s3n://elasticmapreduce/samples/wordcount/input',
         output='s3n://' + S3_BUCKET + '/output/wordcount_output')

     jobid = conn.run_jobflow(
         name="testbootstrap",
         log_uri="s3://" + S3_BUCKET + "/logs",
         steps=[step],
         bootstrap_actions=[bootstrap_step],
         num_instances=NUM_INSTANCES)
     print "finished spawning job (note: starting still takes time)"

     # Poll the job state until the flow completes
     state = conn.describe_jobflow(jobid).state
     print "job state = ", state
     print "job id = ", jobid
     while state != u'COMPLETED':
         print time.localtime()
         time.sleep(30)
         state = conn.describe_jobflow(jobid).state
         print "job state = ", state
         print "job id = ", jobid

     print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
     print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
  6. “I terminate the instance and relaunch it. That’s my error
     handling.” Source: @jtimberman on Twitter
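The quote is the disposable-infrastructure idea in one line: when configuration is automated, a misbehaving server is cheaper to replace than to repair. A sketch of what that might look like with boto; the relaunch helper and the AMI id are hypothetical:

    import boto

    conn = boto.connect_ec2()  # reads keys from the environment or ~/.boto

    def relaunch(instance, image_id='ami-00000000'):  # placeholder AMI
        """Terminate a misbehaving instance and start a fresh copy of it."""
        conn.terminate_instances(instance_ids=[instance.id])
        reservation = conn.run_instances(image_id,
                                         instance_type=instance.instance_type)
        return reservation.instances[0]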
  7. WIEN2k Parallel Performance
     KS (Kohn-Sham) calculation for a huge system at 1 k-point: VERY
     DEMANDING of network performance. H size 56,000 (25 GB).

     Runtime (16x8 processors):
       Local (InfiniBand): 3h:48
       Cloud (10 Gbps):    1h:30 ($40)

     1200-atom unit cell; SCALAPACK+MPI diagonalization, matrix size 50k-100k
     Credit: K. Jorissen, F. D. Villa, and J. J. Rehr (U. Washington)
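The 10 Gbps figure points at EC2 Cluster Compute instances, which boto of that era could launch into a cluster placement group to get the full-bisection network a run like this needs. A hedged sketch, assuming a placeholder AMI and group name; the slide itself shows no launch code:

    import boto

    conn = boto.connect_ec2()  # reads keys from the environment or ~/.boto

    # one cluster placement group keeps all nodes on the 10 Gbps fabric
    conn.create_placement_group('wien2k-run', strategy='cluster')

    # 16 nodes x 8 cores matches the 16x8 processor count on the slide
    reservation = conn.run_instances('ami-00000000',  # placeholder HPC AMI
                                     min_count=16, max_count=16,
                                     instance_type='cc1.4xlarge',
                                     placement_group='wien2k-run')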