Data intensive biology in the cloud: instrumenting ALL the things by Titus Brown

PyCon 2014
April 12, 2014

Cloud computing offers some great opportunities for science, but most cloud computing platforms are I/O and memory limited, and hence are poor matches for data-intensive computing. After 4 years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source.

Transcript

  1. Instrument ALL the things: Studying data-intensive workflows in the
     clowd. C. Titus Brown, Michigan State University. (See blog post.)
  2. A few upfront definitions.
     Big Data, n: whatever is still inconvenient to compute on.
     Data scientist, n: a statistician who lives in San Francisco.
     Professor, n: someone who writes grants to fund people who do the
     work (cf. Fernando Perez).
     I am a professor (not a data scientist) who writes grants so that
     others can do data-intensive biology.
  3. This talk dedicated to Terry Peppers. "Titus, I no longer understand
     what you actually do…" "Daddy, what do you do at work!?"
  4. I assemble puzzles for a living. Well, ok, I strategize about solving
     multi-dimensional puzzles with billions of pieces and no box.
  5. Three bioinformatic strategies in use:
     • Greedy: "if the piece sorta fits…"
     • N²: "Do these two pieces match? How about this next one?"
     • The Dutch approach.
  6. The Dutch Solution. Algorithmically:
     • Linear in time with the number of pieces :) (way better than N²!)
     • Linear in memory with the volume of data :( (this is due to errors
       in the digitization process).
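
     (The "Dutch approach" above is a nod to de Bruijn graph assembly,
     after the Dutch mathematician N. G. de Bruijn. A toy Python sketch of
     the idea follows; the reads and k are made up, and this is nothing
     like khmer's actual implementation.)

         # Toy de Bruijn graph: one pass over the reads, so time is linear
         # in the number of pieces. Memory is linear in the number of
         # *distinct* k-mers, and sequencing errors add spurious k-mers --
         # hence the sad face about memory.
         from collections import defaultdict

         K = 5  # illustrative; real assemblies use k around 21-31+

         def kmers(seq, k=K):
             for i in range(len(seq) - k + 1):
                 yield seq[i:i + k]

         def build_graph(reads):
             graph = defaultdict(set)  # (k-1)-mer prefix -> suffixes
             for read in reads:
                 for km in kmers(read):
                     graph[km[:-1]].add(km[1:])
             return graph

         graph = build_graph(["ATGGCGTGCA", "GGCGTGCAAT"])  # made-up reads
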
  7. Our research challenges:
     1. It costs only $10k & 1 week to generate enough sequence data that
        no commodity computer (and few supercomputers) can assemble it.
     2. Hundreds -> thousands of such data sets are being generated each
        year.
  8. Our research (i) - CS:
     • A streaming lossy compression approach that discards pieces we've
       seen before.
     • Low-memory probabilistic data structures.
     (…see the PyCon 2013 talk)
     => RAM now scales better: O(I) where I << N
     (I is sample-dependent, but typically I < N/20)
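
     (The streaming lossy compression here is digital normalization: a
     read is kept only if its k-mers haven't already been seen often
     enough, with counts held in a fixed-size probabilistic structure. A
     minimal sketch, assuming a count-min sketch and a median-count rule;
     the parameters are illustrative and this is not khmer's code.)

         # Digital-normalization sketch: constant-memory k-mer counting
         # (count-min sketch; counts can only overestimate), discarding
         # reads whose median k-mer count already exceeds the cutoff.
         import hashlib
         from statistics import median

         class CountMin:
             def __init__(self, ntables=4, width=2**16):
                 self.width = width
                 self.tables = [[0] * width for _ in range(ntables)]

             def _slots(self, item):
                 for i in range(len(self.tables)):
                     h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                     yield i, int(h, 16) % self.width

             def add(self, item):
                 for i, j in self._slots(item):
                     self.tables[i][j] += 1

             def count(self, item):
                 return min(self.tables[i][j] for i, j in self._slots(item))

         def diginorm(reads, k=20, cutoff=20):
             sketch = CountMin()
             for read in reads:
                 kms = [read[i:i + k] for i in range(len(read) - k + 1)]
                 if kms and median(sketch.count(km) for km in kms) < cutoff:
                     for km in kms:
                         sketch.add(km)
                     yield read  # keep: read still adds new information
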
  9. Our research (ii) - approach:
     • Open source, open data, open science, and reproducible
       computational research:
       – GitHub
       – automated testing, CI, & literate reSTing
       – blogging, Twitter
       – IPython Notebook for data analysis and figures
     • Protocols for assembling in the cloud.
  10. Molgula oculata, Molgula occulta, Molgula oculata. Real solutions,
      tackling squishy biology! (Elijah Lowe & Billie Swalla)
  11. Doing things right => #awesomesauce:
      • Protocols in English for running analyses in the cloud
      • Literate reSTing => shell scripts (see the sketch below)
      • Tool competitions
      • Benchmarking
      • Education
      • Acceptance tests
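
      ("Literate reSTing" means the shell commands live in the reST
      tutorials themselves, and a script pulls them out so the same text
      serves as documentation, shell script, and acceptance test. The
      real tool is ged-lab/literate-resting; this bare-bones extraction
      is just illustrative.)

          # Bare-bones sketch: pull the indented literal blocks (the ones
          # introduced by a trailing '::') out of a reST tutorial and emit
          # them as a shell script. The real literate-resting does more.
          import sys

          def extract_commands(lines):
              in_block = False
              for line in lines:
                  if line.rstrip().endswith("::"):
                      in_block = True
                  elif in_block and line.startswith("   "):
                      yield line.strip()   # indented: inside the block
                  elif in_block and line.strip():
                      in_block = False     # dedented text ends the block

          if __name__ == "__main__":
              print("#!/bin/bash")
              with open(sys.argv[1]) as tutorial:
                  for cmd in extract_commands(tutorial):
                      print(cmd)
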
  12. Benchmarking strategy:
      • Rent a bunch of cloud VMs from Amazon and Rackspace.
      • Extract commands from tutorials using literate-resting.
      • Use 'sar' (from the sysstat package) to sample CPU, RAM, and disk
        I/O (see the sketch below).
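
      (In practice the sampling can be as simple as running sar in the
      background for the life of the pipeline. A sketch of such a
      harness, assuming a stock sysstat install; 'pipeline.sh' and the
      interval are placeholders.)

          # Harness sketch: sample CPU (-u), memory (-r), and disk I/O
          # (-b) every few seconds while the pipeline runs, keeping the
          # raw sar log for later analysis.
          import subprocess

          def benchmark(cmd, logfile="sar.log", interval=5):
              with open(logfile, "w") as log:
                  sar = subprocess.Popen(
                      ["sar", "-u", "-r", "-b", str(interval)], stdout=log)
                  try:
                      subprocess.check_call(cmd, shell=True)
                  finally:
                      sar.terminate()

          benchmark("bash pipeline.sh")  # hypothetical pipeline script
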
  13. Observation #1: Rackspace is faster.

      machine          data disk       working disk    hours   cost
      rackspace-15gb   200 GB          100 GB          34.9    $23.70
      m2.xlarge        EBS             ephemeral       44.7    $18.34
      m1.xlarge        EBS             ephemeral       45.5    $21.82
      m1.xlarge        EBS, max IOPS   ephemeral       49.1    $23.56
      m1.xlarge        EBS, max IOPS   EBS, max IOPS   52.5    $25.20
  14. Surprise #1: AWS ephemeral storage is FASTER. (Same table as above:
      on m1.xlarge with an EBS max-IOPS data disk, the ephemeral working
      disk finishes in 49.1 h / $23.56 vs 52.5 h / $25.20 with an EBS
      max-IOPS working disk.)
  15. Can't we just use a faster computer?
      • Demo data on m1.xlarge: 2789 s.
      • Demo data on m3.xlarge: 1970 s, ~30% faster!
      (Why? m3.xlarge has 2x40 GB SSD drives & 40% faster cores.)
      Great! Let's try it out!
  16. Observation #3: a multifaceted problem!
      • Full data on m1.xlarge: 45.5 h.
      • Full data on m3.xlarge: out of disk space.
      We need about 200 GB to run the full pipeline. You can have fast
      disk or lots of disk, but not both, for the moment.
  17. Future directions:
      1. Invest in cache-local data structures and algorithms.
      2. Invest in streaming/in-memory approaches.
      3. It's not clear (to me) that straight code optimization or
         infrastructure engineering is a worthwhile investment.
  18. Frequently Offered Solutions:
      1. "You should, like, totally multithread that."
         (See: McDonald & Brown, POSA.)
      2. "Hadoop will just crush that workload, dude."
         (Unlikely to be cost-effective.)
      3. "Have you tried <my proprietary Big Data technology stack>?"
         (Thatz Not Science.)
  19. Optimization vs scaling:
      • Linear time/memory improvements would not have addressed our core
        problem. (2 years, 20x improvement, 100x increase in data.)
      • The puzzle problem is a graph problem with big data, no locality,
        and little compute. Not friendly.
      • We need(ed) to scale our algorithms.
      • Can now run on a single chassis, in ~15 GB of RAM.
  20. Optimization vs scaling.
      [plot: compute resources (abstract) vs. size of problem]
  21. Scaling can be more important!
      [plot: compute resources (abstract) vs. size of problem]
  22. What are we losing by focusing our engineering on pleasantly
      parallel problems?
      • Hadoop is fundamentally not that interesting.
      • Research is about the 100x.
      • Scaling new problems, evaluating/creating new data structures and
        algorithms, etc.
  23. (From my PyCon 2011 talk.) Theme: Life's too short to tackle the
      easy problems – come to academia!
      [plot: parallelizability vs. resources ($$, etc.), contrasting
      "awesome stuff to research" with "easy stuff like Google Search"]
  24. Thanks!
      • Leigh Sheneman, for starting the benchmarking project.
      • Labbies: Michael R. Crusoe, Luiz Irber, Likit Preeyanon, Camille
        Scott, and Qingpeng Zhang.
  25. Thanks!
      • github.com/ged-lab/
        – khmer: core project
        – khmer-protocols: tutorials/acceptance tests
        – literate-resting: script to pull out code from reST tutorials
      • Blog post at: http://ivory.idyll.org/blog/2014-pycon.html
      • Michael R. Crusoe, Likit Preeyanon, Camille Scott, and Qingpeng
        Zhang are here at PyCon.
      …note, you can probably afford to buy them off me :)