Slide 1

Slide 1 text

Accelerating Science with Puppet
Tim Bell
[email protected]
@noggin143
PuppetConf, San Francisco
28th September 2012
PuppetConf 2012, Tim Bell, CERN

Slide 2

Slide 2 text

What is CERN?
• Conseil Européen pour la Recherche Nucléaire, aka European Laboratory for Particle Physics
• Between Geneva and the Jura mountains, straddling the Swiss-French border
• Founded in 1954 with an international treaty
• Our business is fundamental physics: what is the universe made of and how does it work?

Slide 3

Slide 3 text

Answering fundamental questions…
• How to explain particles have mass?
  We have theories and accumulating experimental evidence. Getting close…
• What is 96% of the universe made of?
  We can only see 4% of its estimated mass!
• Why isn't there anti-matter in the universe?
  Nature should be symmetric…
• What was the state of matter just after the « Big Bang »?
  Travelling back to the earliest instants of the universe would help…

Slide 4

Slide 4 text

Community collaboration on an international scale

Slide 5

Slide 5 text

The Large Hadron Collider

Slide 6

Slide 6 text


Slide 7

Slide 7 text

LHC construction

Slide 8

Slide 8 text

The Large Hadron Collider (LHC) tunnel

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Superconducting magnets – October 2008
A faulty connection between two superconducting magnets led to the release of a large amount of helium into the LHC tunnel and forced the machine to shut down for repairs for one year

Slide 11

Slide 11 text

Accumulating events in 2009-2011

Slide 12

Slide 12 text


Slide 13

Slide 13 text

Heavy Ion Collisions

Slide 14

Slide 14 text


Slide 15

Slide 15 text

Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution
Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~200 centres):
• Simulation
• End-user analysis

• Data is recorded at CERN and Tier-1s and analysed in the Worldwide LHC Computing Grid
• In a normal day, the grid provides 100,000 CPU days executing 1 million jobs
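To give a rough sense of scale for the two grid figures above, here is a minimal sketch. It assumes the work is spread evenly across jobs, which the slide does not claim, and the function names are illustrative only.

```python
# Figures quoted on the slide for a normal day on the grid.
CPU_DAYS_PER_DAY = 100_000
JOBS_PER_DAY = 1_000_000


def average_job_cpu_hours(cpu_days: float, jobs: int) -> float:
    """Mean CPU time per job in hours, assuming an even spread of work."""
    return cpu_days * 24 / jobs


def cores_busy_around_the_clock(cpu_days: float) -> float:
    """CPU days delivered in one day equal the number of cores kept busy for 24 hours."""
    return cpu_days


if __name__ == "__main__":
    print(average_job_cpu_hours(CPU_DAYS_PER_DAY, JOBS_PER_DAY))  # 2.4
    print(cores_busy_around_the_clock(CPU_DAYS_PER_DAY))          # 100000
```

Under that assumption, a typical job uses about 2.4 CPU hours, and the grid behaves like roughly 100,000 cores running non-stop.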

Slide 16

Slide 16 text

Data Centre by Numbers
• Hardware installation & retirement
  – ~7,000 hardware movements/year; ~1,800 disk failures/year

Processor types (share)
  Xeon 5150     2%
  Xeon 5160    10%
  Xeon E5335    7%
  Xeon E5345   14%
  Xeon E5405    6%
  Xeon E5410   16%
  Xeon L5420    8%
  Xeon L5520   33%
  Xeon 3GHz     4%

Disk vendors (share)
  Fujitsu           3%
  Hitachi          23%
  HP                0%
  Maxtor            0%
  Seagate          15%
  Western Digital  59%
  Other             0%

Network
  High Speed Routers (640 Mbps → 2.4 Tbps)   24
  Ethernet Switches                          350
  10 Gbps ports                              2,000
  Switching Capacity                         4.8 Tbps
  1 Gbps ports                               16,939
  10 Gbps ports                              558

Servers and storage
  Racks                     828
  Servers                   11,728
  Processors                15,694
  Cores                     64,238
  HEPSpec06                 482,507
  Disks                     64,109
  Raw disk capacity (TiB)   63,289
  Memory modules            56,014
  Memory capacity (TiB)     158
  RAID controllers          3,749

Tape
  Tape Drives           160
  Tape Cartridges       45,000
  Tape slots            56,000
  Tape Capacity (TiB)   73,000

Power
  IT Power Consumption      2,456 kW
  Total Power Consumption   3,890 kW
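A few ratios fall out of the numbers on this slide. The sketch below derives them; the constants are copied from the slide, while the function names and the choice of ratios are illustrative assumptions rather than figures the slide reports.

```python
# Figures copied from the "Data Centre by Numbers" slide.
SERVERS = 11_728
CORES = 64_238
DISKS = 64_109
DISK_FAILURES_PER_YEAR = 1_800  # quoted as approximate on the slide
IT_POWER_KW = 2_456
TOTAL_POWER_KW = 3_890


def cores_per_server() -> float:
    """Average number of cores per server."""
    return CORES / SERVERS


def annual_disk_failure_fraction() -> float:
    """Fraction of the installed disks that fail in a year."""
    return DISK_FAILURES_PER_YEAR / DISKS


def total_to_it_power_ratio() -> float:
    """Total power divided by IT power (the same form of ratio as a PUE figure)."""
    return TOTAL_POWER_KW / IT_POWER_KW


if __name__ == "__main__":
    print(f"{cores_per_server():.1f} cores per server")
    print(f"{annual_disk_failure_fraction():.1%} of disks fail per year")
    print(f"{total_to_it_power_ratio():.2f} total-to-IT power ratio")
```

On these figures that is about 5.5 cores per server, just under 3% of disks failing per year, and a total-to-IT power ratio of about 1.58.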

Slide 17

Slide 17 text

Our Challenges - Data storage
• 25 PB/year to record
• >20 years retention
• 6 GB/s average
• 25 GB/s peaks
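As a back-of-the-envelope check on the recording figure, the sketch below converts 25 PB/year into a sustained rate and into the volume added over a 20-year retention window. It assumes decimal units (1 PB = 1,000,000 GB), a constant recording rate, and exactly 20 years; none of these is stated on the slide, and the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years
GB_PER_PB = 1_000_000               # decimal units: an assumption


def sustained_rate_gb_per_s(pb_per_year: float) -> float:
    """Average rate in GB/s needed to record the given volume evenly over a year."""
    return pb_per_year * GB_PER_PB / SECONDS_PER_YEAR


def archive_growth_pb(pb_per_year: float, years: int) -> float:
    """Volume added over the given number of years at a constant yearly rate."""
    return pb_per_year * years


if __name__ == "__main__":
    print(f"{sustained_rate_gb_per_s(25):.2f} GB/s")  # about 0.79 GB/s
    print(f"{archive_growth_pb(25, 20)} PB")          # 500 PB
```

This recording-only rate is a different quantity from the 6 GB/s average quoted above, which the slide does not define further.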

Slide 18

Slide 18 text


Slide 19

Slide 19 text

45,000 tapes holding 73 PB of physics data

Slide 20

Slide 20 text

New data centre to expand capacity
• Data centre in Geneva reaches limit of electrical capacity at 3.5 MW
• New centre chosen in Budapest, Hungary
• Additional 2.7 MW of usable power
• Hands off facility
• Deploying from 2013

Slide 21

Slide 21 text

Time to change strategy
• Rationale
  – Need to manage twice as many servers as today
  – No increase in staff numbers
  – Tools becoming increasingly brittle and will not scale as-is
• Approach
  – We are no longer a special case for compute
  – Adopt an open source tool chain model
  – Strong engineering skills allow rapid adoption of new technologies
    • Evaluate solutions in the problem domain
    • Identify functional gaps and challenge them
  – Contribute new function back to the community

Slide 22

Slide 22 text

Building Blocks
Diagram labels: Bamboo, Koji, Mock, AIMS/PXE, Foreman, Yum repo, Pulp, Puppet-DB, mcollective, yum, JIRA, Lemon / Hadoop, git, OpenStack Nova, Hardware database, Puppet, Active Directory / LDAP

Slide 23

Slide 23 text

Training and Support
• Buy the book rather than guru mentoring
• Newcomers are rapidly productive (and often know more than us)
• Community and Enterprise support means we're not on our own

Slide 24

Slide 24 text

Staff Motivation
• Skills valuable outside of CERN when an engineer's contract ends

Slide 25

Slide 25 text

Prepare the move to the clouds
• Improve operational efficiency
  – Machine reception and testing
  – Hardware interventions with long running programs
  – Multiple operating system demand
• Improve resource efficiency
  – Exploit idle resources, especially waiting for tape I/O
  – Highly variable load such as interactive or build machines
• Improve responsiveness
  – Self-Service
  – Coffee break response time

Slide 26

Slide 26 text

Service Model
• Pets are given names like pussinboots.cern.ch
  – They are unique, lovingly hand raised and cared for
  – When they get ill, you nurse them back to health
• Cattle are given numbers like vm0042.cern.ch
  – They are almost identical to other cattle
  – When they get ill, you get another one
• Future application architectures tend towards Cattle but Pets with configuration management are also viable
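The pets and cattle distinction above can be illustrated with a small sketch that picks a remediation from the hostname. The naming rule (a `vm` prefix followed by digits means cattle) is an assumption generalised from the single example on the slide, and the function names are hypothetical; this is not CERN's tooling.

```python
import re

# Assumed convention: numbered hosts such as vm0042.cern.ch are cattle.
CATTLE_PATTERN = re.compile(r"^vm\d+\.cern\.ch$")


def service_model(hostname: str) -> str:
    """Classify a host as 'cattle' or 'pet' from its name alone."""
    return "cattle" if CATTLE_PATTERN.match(hostname) else "pet"


def remediation(hostname: str) -> str:
    """Return the slide's response to an ill machine of each kind."""
    if service_model(hostname) == "cattle":
        return "get another one"
    return "nurse it back to health"


if __name__ == "__main__":
    for host in ("pussinboots.cern.ch", "vm0042.cern.ch"):
        print(host, "->", service_model(host), "->", remediation(host))
```

In practice the model would more likely be recorded as explicit metadata than inferred from a name; the regular expression only keeps the example short.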

Slide 27

Slide 27 text

OpenStack
• Open source cloud run by an independent foundation with over 6,000 members from 850 organisations
• Started in 2010 but maturing rapidly with public cloud services from Rackspace, HP and Ubuntu
Platinum Members

Slide 28

Slide 28 text

Many OpenStack Components to Configure
• NOVA: Compute, Scheduler, Network, Volume
• GLANCE: Registry, Image
• KEYSTONE
• HORIZON

Slide 29

Slide 29 text

When communities combine…
• OpenStack's many components and options make configuration complex out of the box
• Puppet Forge module from PuppetLabs (Thanks, Dan Bode)
• The Foreman adds OpenStack provisioning for user kiosk

Slide 30

Slide 30 text

Scaling up with Puppet and OpenStack
• Use LHC@Home based on BOINC for simulating magnetics guiding particles around the LHC
• Naturally, there is a Puppet module, puppet-boinc
• 1000 VMs spun up to stress test the hypervisors with Puppet, Foreman and OpenStack

Slide 31

Slide 31 text

Next Steps
• Expand tool chain
  – MCollective
  – Puppet-DB
• Deploy at scale in production
  – Move towards 15,000 hypervisors over next two years
  – Estimate 100,000-300,000 virtual machines
• Work with labs on common solutions for scientific computing
  – Batch system configurations
  – Grids
  – Publishing to http://github.com/cernops
• Investigate desktop and device management
  – Linux desktops
  – Macs
  – KVMs, PDUs
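The deployment targets above imply a consolidation ratio. The sketch below works it out from the slide's two figures; it assumes the virtual machines are spread evenly over the hypervisors, and the function name is illustrative.

```python
# Targets quoted on the slide.
HYPERVISORS = 15_000
VMS_LOW = 100_000
VMS_HIGH = 300_000


def vms_per_hypervisor(vms: int, hypervisors: int) -> float:
    """Average number of virtual machines per hypervisor, assuming an even spread."""
    return vms / hypervisors


if __name__ == "__main__":
    low = vms_per_hypervisor(VMS_LOW, HYPERVISORS)
    high = vms_per_hypervisor(VMS_HIGH, HYPERVISORS)
    print(f"{low:.1f} to {high:.1f} VMs per hypervisor")  # 6.7 to 20.0
```

So the estimate corresponds to somewhere between roughly 7 and 20 virtual machines per hypervisor on average.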

Slide 32

Slide 32 text

Final Thoughts
• A small project to share documents at CERN in the '90s created the massive phenomenon that is today's world wide web
• Open Source
• Vibrant community and eco-system
• Working with the Puppet and OpenStack communities has shown the power of collaboration
• We have built a toolchain in one year with part time resources
• Running 15,000 servers and up to 300,000 VMs is scary but achievable
• Looking forward to further contributions as we move to large scale deployment

Slide 33

Slide 33 text

For more details, see Ben Jones' talk at 15:50 today:
Configuration Management at CERN – From Homegrown to Industry Standard
Tim Bell

Slide 34

Slide 34 text

References
• CERN: http://public.web.cern.ch/public/
• Scientific Linux: http://www.scientificlinux.org/
• Worldwide LHC Computing Grid: http://lcg.web.cern.ch/lcg/ and http://rtm.hep.ph.ic.ac.uk/
• Jobs: http://cern.ch/jobs
• Detailed Report on Agile Infrastructure: http://cern.ch/go/N8wp

Slide 35

Slide 35 text

Backup Slides

Slide 36

Slide 36 text

CERN's tools
• The world's most powerful accelerator: LHC
  – A 27 km long tunnel filled with high-tech instruments
  – Equipped with thousands of superconducting magnets
  – Accelerates particles to energies never before obtained
  – Produces particle collisions creating microscopic "big bangs"
• Very large sophisticated detectors
  – Four experiments each the size of a cathedral
  – Hundred million measurement channels each
  – Data acquisition systems treating Petabytes per second
• Top level computing to distribute and analyse the data
  – A Computing Grid linking ~200 computer centres around the globe
  – Sufficient computing power and storage to handle 25 Petabytes per year, making them available to thousands of physicists for analysis

Slide 37

Slide 37 text

Our Infrastructure
• Hardware is generally based on commodity, white-box servers
  – Open tendering process based on SpecInt/CHF, CHF/Watt and GB/CHF
  – Compute nodes typically dual processor, 2 GB per core
  – Bulk storage on 24x2TB disk storage-in-a-box with a RAID card
• Vast majority of servers run Scientific Linux, developed by Fermilab and CERN, based on Red Hat Enterprise Linux
  – Focus is on stability in view of the number of centres on the WLCG

Slide 38

Slide 38 text

New architecture data flows

Slide 39

Slide 39 text

OpenStack
Gold Members