Right Place, Right Time, Right Science: Lessons Learned from Utility HPC/Big Data - James Cuff

Right Place, Right Time, Right Science: Lessons Learned from Utility HPC / Big Data
James Cuff, CTO @jamesdotcuff, @cyclecomputing

For over sixteen years I have seen a steady evolution in research computing
– 1996 @ Oxford: 1 cpu @ 200MHz / 18GB
– 2000 @ Sanger / EBI: 360 cpu @ 168GHz / 50TB
– 2003-2006 @ Harvard / MIT: 200 cpu @ 400GHz / 250-600TB
– 2012 @ Harvard: >25,000 cpu @ 32THz / 10.0PB

Massive increases in compute
Massive increases in storage
Massive increases in scale
Massive increases in throughput
Massive increases in performance
Researchers need it all, and all need it now!
Computational growth is not going away
We need new systems, teams and methods to effectively support our scholarly and scientific research

I don’t try and keep up with the big boys and girls…

University of Manchester, British Nuclear Fuels Limited, Oxford University, European Bioinformatics Institute, Inpharmatica, Wellcome Trust Sanger Institute, Whitehead Genome Center, Broad Institute of MIT and Harvard, Harvard University, Cycle Computing

The Human Genome Project ca. 2000
– 360 node DEC Alpha DS10L 1U
– 9 racks
– 100KW power
– 1,440 cat5 crimps…
– 466MHz x 360 CPU = 168GHz

THIS is not HPC…

Neither is THIS…

Or THIS … !!

THIS is HPC!

At Cycle, we believe Scientists and Researchers are shackled by a lack of access to compute

History teaches how to collaborate and remove shackles surrounding our science

From centralized to decentralized, collaborative to independent and right back again!
The 60’s, The 70’s, The 80’s, The 90’s, The 00’s, The 10’s
Mainframes, VAX, The PC, Beowulf Clusters, Central Clusters, Clouds/VMware (IaaS, SaaS, PaaS)
Centers provide access to compute; The supercomputing famine, funding gap; Individual computing; Computing is too big to fit under desk, Linux explodes
SHARING: 100%, 60%, 0%, 40%, ???%
~0Mbit, ~1Mbit, ~10Mbit, ~1000Mbit, ~10,000Mbit
Bigger, better but further and further away from the scientist’s lab

The Scientific Method: Ask a Question, Hypothesize, Predict, Experiment / Test, Analyze, Final Results
“Test and Analyze” require the most time, compute, data and effort

The Scientific Method: any improvements to this cycle yield multiplicative benefits

If we democratize access to high performance compute, we WILL accelerate science

Not neuroscience…

Or a good excuse for MRI data…

We make software tools to easily orchestrate complex workloads and data access across Utility HPC
– NIMBUS Discovery: 12 years of compute in 3 hours, $20M of infrastructure for < $3,000
– Big 10 Pharma: built 10,600 server cluster ($44M) in 2 hours, 40 years of compute in 11 hours for $4,372
– Genomics Research Institute: 1 million hours or 115 years of compute in 1 week for $19,555

Utility HPC in the News
WSJ, NYTimes, Wired, Bio-IT World, BusinessWeek

We solve this challenge across many industries
SLED/PS; Insurance; Financial Services; Life Sciences; Manufacturing & Electronics; Energy, Media & Other

Before, Local Cluster: too small when you need it most, too large every other time…

Remember this from earlier?
– 360 node DEC Alpha DS10L 1U
– 9 racks
– 100KW power
– 1,440 cat5 crimps…
– 466MHz x 360 CPU = 168GHz

The world is a very different place now…

Life Science Activities: Compute vs. Data (chart axes: Compute, Data/Bandwidth)
Activities plotted: NGS, Molecular Modeling, PK/PD, CAD/CAM, GWAS, Neuroscience, Genomics, Proteomics, Biomarker/Image Analysis, Sensor Data Import
Creating Fake Charts, with Fake data

When you’re not limited by fixed-size compute & data, what happens?

#1: “Better” Science
“Answer the question we want to ask”, not constrained to what fits on local compute power
– all desired samples
– all desired queries
– easier collaboration

#2: “Faster” Science
Run our “better” science that would have taken months or years in hours or days

A couple of use cases…

Scientific Process for Molecular Modeling
Before: Trade-off compute time vs. accuracy
Now: Better analysis, fewer false negatives, better results faster
Diagram labels: Initial Coarse Screen, Higher Quality Analysis, Highest Quality Analysis, Best Quality, Otherwise Unexplored Molecules

Computational Chemistry: Novartis
– Need: Enable push-button Utility Supercomputing for molecular modeling
– Solution: 30,000 CPU run across US/EU Cloud (AWS)
– 10 years of compute in 8 hours for $10,000
– Found 3 compounds now in the wetlab!

Lessons learned
– $$$/science
– Application-aware data management
– Data security

Another big 10 pharma… Built a 10,600 server cluster ($44M) in 2 hours, running 40 years of compute in 11 hours for $4,372

Big 10 Pharma created 10,600 instance cluster ($44M) in 2 hours, running 40 years of compute in 11 hours for $4,372

Lessons learned
– Capacity is no longer an issue
– Hardware = software
– Testing (error handling, unit testing, etc.)
– Cycle has spent >$1M on AWS over 5yrs

Gene Expression Analysis: Morgridge Institute for Research
– Need: Run a comparison of 78TB stem cell RNA samples to build a unique gene expression database
– Make it easier to replicate disease in petri dishes w/induced stem cells
– Solution: Enable massive RNAseq run using Bowtie that was impossible before

1 Million compute hours: 115 years of computing in 1 week for $19,555
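The headline figures on these use-case slides can be sanity-checked with a little arithmetic. The sketch below is illustrative only, not something from the talk: it assumes an 8,760-hour year, treats "compute hours" as single-core hours, and assumes each run was spread evenly over its stated wall-clock time. The helper name `summarize` is hypothetical.

```python
# Back-of-the-envelope check of the use-case figures quoted in the slides.
# Assumptions (not from the talk): 8,760 hours per year, "compute hours"
# means single-core hours, and load is spread evenly over the wall-clock time.

HOURS_PER_YEAR = 24 * 365  # 8,760; leap days ignored


def summarize(name, compute_hours, wall_hours, cost_usd):
    """Derive years of compute, average concurrency and cost per compute hour."""
    return {
        "name": name,
        "compute_years": compute_hours / HOURS_PER_YEAR,
        "avg_concurrency": compute_hours / wall_hours,
        "usd_per_compute_hour": cost_usd / compute_hours,
    }


runs = [
    # 1 million compute hours in 1 week for $19,555
    summarize("Genomics (Morgridge)", 1_000_000, 7 * 24, 19_555),
    # 40 years of compute in 11 hours for $4,372
    summarize("Big 10 pharma", 40 * HOURS_PER_YEAR, 11, 4_372),
    # 10 years of compute in 8 hours for $10,000
    summarize("Novartis", 10 * HOURS_PER_YEAR, 8, 10_000),
]

for r in runs:
    print(
        f"{r['name']}: {r['compute_years']:.1f} compute-years, "
        f"~{r['avg_concurrency']:,.0f} cores busy on average, "
        f"${r['usd_per_compute_hour']:.4f} per compute hour"
    )
```

Under these assumptions, 1 million hours comes out at roughly 114 years, close to the slide's rounded 115. The average concurrency is only a lower bound on cluster size, since real runs ramp up and down.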

What have we learned here?

Servers are not house plants

Servers are wheat!

Protein Binding / GPU: Large BioTech
– 128 GPU cluster
– 13 GPU-Years of computing in 1.5 months for $150,000 vs. 5 months of CPU for $450,000
– 3x the science, ¼ the cost
Diagram labels: Drug designer, Local Data, Corporate Firewall, Scheduled Data, Secure HPC Cluster, 8 TB FS, External Cloud, 128 GPU cluster

Genomic Analysis: Research Lab
Diagram labels: Cloud, HPC File System (100TB) (Track Directories), Internal compute, HiSeq instruments, Blob data (S3), Cloud Filer, Glacier (Archive), Auto-scaling external environment, HPC Cluster, LIMS, Internal HPC, Data Scheduling

DataManager: data-aware scheduler for science and HPC

So we can do reliable HPC and data movement in the cloud…

– Hardware is HARD!
– Great Software tools yield happiness
– IT solving scientific problems vs. low-level ops
– Replicating to AWS Glacier offers DR options…

– Take advantage of Cloud storage scale (S3)
– Capacity isn’t an issue
– Large public data sets + secure, massive compute provide huge opportunities for new science

Let us quickly recap

This isn’t neuroscience…

Servers are not house plants!

Servers are wheat!

At Cycle, we believe Scientists and Researchers are shackled by a lack of access to compute

The Scientific Method on Utility HPC yields “Better”, “Faster” research for way less $$$

Oh, and one more thing…

2013 BigScience Challenge
$10,000 of free AWS and CycleComputing powered services to any science benefitting humanity
The 2012 winner was a 115yr genomic analysis
Enter at: cyclecomputing.com/big-science-challenge/enter

Thank You! Questions?
blog.cyclecomputing.com, www.cyclecomputing.com, @cyclecomputing, @jamesdotcuff
