CERN: Accelerating Science with Puppet

October 25, 2012


"Accelerating Science with Puppet" by Tim Bell of CERN, at PuppetConf 2012.

Abstract: This talk will review CERN's objectives and how the computing infrastructure is evolving to address the challenges at scale using community-supported software such as Puppet and OpenStack.

Speaker Bio: Tim Bell is responsible for the CERN IT Operating System and Infrastructure Group, which supports Windows, Mac and Linux across the site along with virtualisation, email and web services. These systems are used by over 10,000 scientists researching fundamental physics, finding out what the Universe is made of and how it works. Prior to working at CERN, Tim worked for Deutsche Bank managing private banking infrastructure in Europe, and for IBM as a Unix kernel developer and deploying large-scale technical computing solutions.

  1. Accelerating Science with Puppet
     Tim Bell, Tim.Bell@cern.ch, @noggin143
     PuppetConf San Francisco, 28th September 2012
     PuppetConf 2012, Tim Bell, CERN
  2. What is CERN?
     • Conseil Européen pour la Recherche Nucléaire, aka European Laboratory for Particle Physics
     • Between Geneva and the Jura mountains, straddling the Swiss-French border
     • Founded in 1954 with an international treaty
     • Our business is fundamental physics: what is the universe made of and how does it work?
  3. Answering fundamental questions…
     • How do we explain why particles have mass?
       We have theories and accumulating experimental evidence... Getting close…
     • What is 96% of the universe made of?
       We can only see 4% of its estimated mass!
     • Why isn't there anti-matter in the universe?
       Nature should be symmetric…
     • What was the state of matter just after the « Big Bang »?
       Travelling back to the earliest instants of the universe would help…
  4. Community collaboration on an international scale
  5. The Large Hadron Collider
  7. LHC construction
  8. The Large Hadron Collider (LHC) tunnel
  10. Superconducting magnets, October 2008
     A faulty connection between two superconducting magnets led to the release of a large amount of helium into the LHC tunnel and forced the machine to shut down for repairs for one year.
  11. Accumulating events in 2009-2011
  13. Heavy Ion Collisions
  15. Tier-0 (CERN):
       • Data recording
       • Initial data reconstruction
       • Data distribution
     Tier-1 (11 centres):
       • Permanent storage
       • Re-processing
       • Analysis
     Tier-2 (~200 centres):
       • Simulation
       • End-user analysis
     • Data is recorded at CERN and Tier-1s and analysed in the Worldwide LHC Computing Grid
     • In a normal day, the grid provides 100,000 CPU days executing 1 million jobs
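As a rough check of what the grid figures above imply, the sketch below divides the daily CPU time by the daily job count. It assumes that a "CPU day" means one CPU busy for 24 hours and that CPU time is spread evenly across jobs; neither assumption comes from the slide, and the function names are illustrative only.

```python
"""Back-of-the-envelope arithmetic on the WLCG daily figures (slide 15)."""

CPU_DAYS_PER_DAY = 100_000   # "the grid provides 100,000 CPU days" on a normal day
JOBS_PER_DAY = 1_000_000     # "executing 1 million jobs"
HOURS_PER_DAY = 24


def mean_job_cpu_hours(cpu_days: float, jobs: float) -> float:
    """Average CPU time per job in hours, assuming CPU time is spread evenly over jobs."""
    return cpu_days * HOURS_PER_DAY / jobs


def mean_busy_cpus(cpu_days_delivered: float, calendar_days: float = 1.0) -> float:
    """Average number of CPUs kept busy: CPU days delivered divided by elapsed days."""
    return cpu_days_delivered / calendar_days


if __name__ == "__main__":
    print(f"Average job: {mean_job_cpu_hours(CPU_DAYS_PER_DAY, JOBS_PER_DAY):.1f} CPU hours")
    print(f"CPUs busy around the clock: {mean_busy_cpus(CPU_DAYS_PER_DAY):,.0f}")
```

With the slide's numbers this gives an average of about 2.4 CPU hours per job, and 100,000 CPU days per day corresponds to roughly 100,000 CPUs kept busy around the clock.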
  16. Data Centre by Numbers
     • Hardware installation & retirement
       – ~7,000 hardware movements/year; ~1,800 disk failures/year
     Processors: Xeon 5150 2%, Xeon 5160 10%, Xeon E5335 7%, Xeon E5345 14%, Xeon E5405 6%, Xeon E5410 16%, Xeon L5420 8%, Xeon L5520 33%, Xeon 3GHz 4%
     Disks: Fujitsu 3%, Hitachi 23%, HP 0%, Maxtor 0%, Seagate 15%, Western Digital 59%, Other 0%
     High speed routers (640 Mbps → 2.4 Tbps): 24
     Ethernet switches: 350
     10 Gbps ports: 2,000
     Switching capacity: 4.8 Tbps
     1 Gbps ports: 16,939
     10 Gbps ports: 558
     Racks: 828
     Servers: 11,728
     Processors: 15,694
     Cores: 64,238
     HEPSpec06: 482,507
     Disks: 64,109
     Raw disk capacity (TiB): 63,289
     Memory modules: 56,014
     Memory capacity (TiB): 158
     RAID controllers: 3,749
     Tape drives: 160
     Tape cartridges: 45,000
     Tape slots: 56,000
     Tape capacity (TiB): 73,000
     IT power consumption: 2,456 kW
     Total power consumption: 3,890 kW
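A few ratios follow directly from the totals in the Data Centre by Numbers table. The sketch below computes them; it assumes TiB means 1024 GiB, and it treats total power divided by IT power as the power usage effectiveness (PUE), a standard data-centre metric that the slide itself does not mention.

```python
"""Derived ratios from the 'Data Centre by Numbers' totals (slide 16)."""

TOTALS = {
    "servers": 11_728,
    "cores": 64_238,
    "memory_tib": 158,
    "tape_cartridges": 45_000,
    "tape_capacity_tib": 73_000,
    "it_power_kw": 2_456,
    "total_power_kw": 3_890,
}

GIB_PER_TIB = 1024  # binary units, as the slide's TiB columns suggest


def derived_ratios(t: dict) -> dict:
    """Compute per-server, per-core, per-cartridge and power ratios from the totals."""
    return {
        "cores_per_server": t["cores"] / t["servers"],
        "gib_memory_per_core": t["memory_tib"] * GIB_PER_TIB / t["cores"],
        "tib_per_tape_cartridge": t["tape_capacity_tib"] / t["tape_cartridges"],
        # PUE: total facility power divided by the power delivered to IT equipment.
        "pue": t["total_power_kw"] / t["it_power_kw"],
    }


if __name__ == "__main__":
    for name, value in derived_ratios(TOTALS).items():
        print(f"{name}: {value:.2f}")
```

With the slide's totals this works out to about 5.5 cores per server, about 2.5 GiB of memory per core (in line with the "2GB per core" on slide 37), about 1.6 TiB per tape cartridge, and a ratio of about 1.58 between total and IT power.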
  17. Our Challenges: Data storage
     • 25PB/year to record
     • >20 years retention
     • 6GB/s average
     • 25GB/s peaks
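The storage figures can be related with simple arithmetic. The sketch assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes), a constant 25 PB per year and no deletion during the retention period. The slide does not say how recording is distributed over the year, so the "days to write" values only show how long one year's volume would take at each stated rate.

```python
"""Arithmetic on the data storage challenge figures (slide 17), decimal units assumed."""

PB = 10**15  # bytes
GB = 10**9   # bytes
SECONDS_PER_DAY = 86_400

YEARLY_VOLUME = 25 * PB   # "25PB/year to record"
RETENTION_YEARS = 20      # ">20 years retention", so the archive size is a lower bound
AVERAGE_RATE = 6 * GB     # bytes per second
PEAK_RATE = 25 * GB       # bytes per second


def retained_pb(years: int, yearly_volume: int = YEARLY_VOLUME) -> float:
    """Archive size in PB after `years` of recording, assuming nothing is deleted."""
    return years * yearly_volume / PB


def days_to_write(volume: int, rate: float) -> float:
    """Days needed to write `volume` bytes at a sustained `rate` in bytes per second."""
    return volume / rate / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Archive after {RETENTION_YEARS} years: {retained_pb(RETENTION_YEARS):.0f} PB")
    print(f"One year's data at 6 GB/s: {days_to_write(YEARLY_VOLUME, AVERAGE_RATE):.1f} days")
    print(f"One year's data at 25 GB/s: {days_to_write(YEARLY_VOLUME, PEAK_RATE):.1f} days")
```

Under these assumptions the archive holds at least 500 PB after 20 years, and one year's 25 PB takes about 48 days to write at 6 GB/s, or about 12 days at the 25 GB/s peak.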
  19. 45,000 tapes holding 73PB of physics data
  20. New data centre to expand capacity
     • Data centre in Geneva reaches limit of electrical capacity at 3.5MW
     • New centre chosen in Budapest, Hungary
     • Additional 2.7MW of usable power
     • Hands-off facility
     • Deploying from 2013
  21. Time to change strategy
     • Rationale
       – Need to manage twice as many servers as today
       – No increase in staff numbers
       – Tools becoming increasingly brittle and will not scale as-is
     • Approach
       – We are no longer a special case for compute
       – Adopt an open source tool chain model
       – Strong engineering skills allow rapid adoption of new technologies
         • Evaluate solutions in the problem domain
         • Identify functional gaps and challenge them
       – Contribute new function back to the community
  22. Building Blocks
     Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP
  23. Training and Support
     • Buy the book rather than guru mentoring
     • Newcomers are rapidly productive (and often know more than us)
     • Community and Enterprise support means we're not on our own
  24. Staff Motivation
     • Skills valuable outside of CERN when an engineer's contract ends
  25. Prepare the move to the clouds
     • Improve operational efficiency
       – Machine reception and testing
       – Hardware interventions with long-running programs
       – Multiple operating system demand
     • Improve resource efficiency
       – Exploit idle resources, especially those waiting for tape I/O
       – Highly variable load such as interactive or build machines
     • Improve responsiveness
       – Self-Service
       – Coffee break response time
  26. Service Model
     • Pets are given names like pussinboots.cern.ch
     • They are unique, lovingly hand-raised and cared for
     • When they get ill, you nurse them back to health
     • Cattle are given numbers like vm0042.cern.ch
     • They are almost identical to other cattle
     • When they get ill, you get another one
     • Future application architectures tend towards Cattle, but Pets with configuration management are also viable
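The cattle side of this service model can be sketched in a few lines: machines get sequential names in the vmNNNN.cern.ch pattern from the slide, and a failed machine is replaced rather than repaired. The class and method names are hypothetical, and the sketch only tracks names; it does not call OpenStack, Foreman or Puppet.

```python
"""Minimal sketch of 'cattle' handling: numbered, interchangeable VMs replaced, not repaired."""
from dataclasses import dataclass, field


@dataclass
class CattleHerd:
    """Hypothetical bookkeeping for a pool of identical VMs."""
    domain: str = "cern.ch"
    next_id: int = 1
    healthy: set[str] = field(default_factory=set)

    def spawn(self) -> str:
        """Create a new, identical VM with the next sequential name, e.g. vm0042.cern.ch."""
        name = f"vm{self.next_id:04d}.{self.domain}"
        self.next_id += 1
        self.healthy.add(name)
        return name

    def handle_failure(self, name: str) -> str:
        """Do not nurse a sick VM back to health: drop it and spawn a replacement."""
        self.healthy.discard(name)
        return self.spawn()


if __name__ == "__main__":
    herd = CattleHerd(next_id=42)
    first = herd.spawn()                      # vm0042.cern.ch
    replacement = herd.handle_failure(first)  # vm0043.cern.ch
    print(first, "->", replacement, sorted(herd.healthy))
```

A pet, by contrast, would keep its name and be repaired in place, which is where configuration management keeps that model viable.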
  27. OpenStack
     • Open source cloud run by an independent foundation with over 6,000 members from 850 organisations
     • Started in 2010 but maturing rapidly, with public cloud services from Rackspace, HP and Ubuntu
     Platinum Members
  28. Many OpenStack Components to Configure
     KEYSTONE; HORIZON; NOVA (Compute, Scheduler, Network, Volume); GLANCE (Registry, Image)
  29. When communities combine…
     • OpenStack's many components and options make configuration complex out of the box
     • Puppet Forge module from PuppetLabs (Thanks, Dan Bode)
     • The Foreman adds OpenStack provisioning for a user kiosk
  30. Scaling up with Puppet and OpenStack
     • Use LHC@Home, based on BOINC, for simulating the magnetics guiding particles around the LHC
     • Naturally, there is a Puppet module, puppet-boinc
     • 1000 VMs spun up to stress test the hypervisors with Puppet, Foreman and OpenStack
  31. Next Steps
     • Expand tool chain
       – Mcollective
       – Puppet-DB
     • Deploy at scale in production
       – Move towards 15,000 hypervisors over the next two years
       – Estimate 100,000-300,000 virtual machines
     • Work with labs on common solutions for scientific computing
       – Batch system configurations
       – Grids
       – Publishing to http://github.com/cernops
     • Investigate desktop and device management
       – Linux desktops
       – Macs
       – KVMs, PDUs
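The deployment targets above imply an average consolidation ratio. The sketch divides the estimated VM counts by the planned hypervisor count; it assumes VMs are spread evenly across hypervisors, which real placement would not guarantee.

```python
"""Average VMs per hypervisor implied by the 'Next Steps' targets (slide 31)."""

HYPERVISORS = 15_000
VM_ESTIMATES = (100_000, 300_000)  # low and high ends of the estimate


def vms_per_hypervisor(vms: int, hypervisors: int = HYPERVISORS) -> float:
    """Average number of VMs per hypervisor, assuming an even spread."""
    return vms / hypervisors


if __name__ == "__main__":
    for vms in VM_ESTIMATES:
        print(f"{vms:,} VMs on {HYPERVISORS:,} hypervisors: "
              f"{vms_per_hypervisor(vms):.1f} per hypervisor")
```

The estimated range works out to roughly 7 to 20 VMs per hypervisor on average.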
  32. Final Thoughts
     • A small project to share documents at CERN in the '90s created the massive phenomenon that is today's World Wide Web
       – Open Source
       – Vibrant community and ecosystem
     • Working with the Puppet and OpenStack communities has shown the power of collaboration
       – We have built a toolchain in one year with part-time resources
     • Running 15,000 servers and up to 300,000 VMs is scary but achievable
     • Looking forward to further contributions as we move to large-scale deployment
  33. For more details, see Ben Jones' talk at 15:50 today
     Configuration Management at CERN: From Homegrown to Industry Standard
  34. References
     CERN: http://public.web.cern.ch/public/
     Scientific Linux: http://www.scientificlinux.org/
     Worldwide LHC Computing Grid: http://lcg.web.cern.ch/lcg/, http://rtm.hep.ph.ic.ac.uk/
     Jobs: http://cern.ch/jobs
     Detailed Report on Agile Infrastructure: http://cern.ch/go/N8wp
  35. Backup Slides
  36. CERN's tools
     • The world's most powerful accelerator: LHC
       – A 27 km long tunnel filled with high-tech instruments
       – Equipped with thousands of superconducting magnets
       – Accelerates particles to energies never before obtained
       – Produces particle collisions creating microscopic "big bangs"
     • Very large, sophisticated detectors
       – Four experiments, each the size of a cathedral
       – A hundred million measurement channels each
       – Data acquisition systems processing petabytes per second
     • Top level computing to distribute and analyse the data
       – A Computing Grid linking ~200 computer centres around the globe
       – Sufficient computing power and storage to handle 25 petabytes per year, making them available to thousands of physicists for analysis
  37. Our Infrastructure
     • Hardware is generally based on commodity, white-box servers
       – Open tendering process based on SpecInt/CHF, CHF/Watt and GB/CHF
       – Compute nodes typically dual processor, 2GB per core
       – Bulk storage on 24x2TB disk storage-in-a-box with a RAID card
     • Vast majority of servers run Scientific Linux, developed by Fermilab and CERN, based on Red Hat Enterprise Linux
       – Focus is on stability in view of the number of centres on the WLCG
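The typical node descriptions translate into simple sizing arithmetic. The sketch assumes four cores per processor (for example the quad-core Xeon L5520, the most common model in the slide 16 processor mix) and ignores RAID overhead, since the slide gives neither the core count per processor nor the RAID level; the function names are illustrative.

```python
"""Sizing sketch for the typical nodes described on slide 37 (assumptions noted inline)."""

GB_PER_CORE = 2      # "2GB per core"
SOCKETS = 2          # "typically dual processor"
DISKS_PER_BOX = 24   # "24x2TB disk storage-in-a-box"
TB_PER_DISK = 2


def compute_node_memory_gb(cores_per_socket: int, sockets: int = SOCKETS) -> int:
    """Memory for a compute node at 2 GB per core; cores_per_socket is an assumption."""
    return sockets * cores_per_socket * GB_PER_CORE


def storage_box_raw_tb(disks: int = DISKS_PER_BOX, tb_per_disk: int = TB_PER_DISK) -> int:
    """Raw capacity of one storage box, before any RAID overhead is subtracted."""
    return disks * tb_per_disk


if __name__ == "__main__":
    print(f"Compute node (assumed 4 cores per socket): {compute_node_memory_gb(4)} GB")
    print(f"Storage box raw capacity: {storage_box_raw_tb()} TB")
```

Under these assumptions a compute node carries 16 GB of memory and a storage box offers 48 TB raw; the usable figure depends on the RAID layout chosen on the card.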
  38. New architecture data flows
  39. OpenStack
     Gold Members