
Berlin 2013 - Session - Reza Spagnolo

Monitorama
September 20, 2013


Transcript

  1. Adaptive Application Architecture
     Reza Spagnolo
     @rmspagnolo

  2. Hey there! Who am I?
     • A student
     • An engineer, for 9 years now
     • Interested in building systems
     • Dev & Ops since the beginning
  3. #monitoringsocks
     but never sucked for real

  4. Monitoring is an architecture component

  5. Infrastructure is code

  6. Monitoring is code
     • Development process
     • Testing
     • Deployment
  7. Monitoring is service
     • Metrics
     • Alerts

  8. Namespaces
     There are only two hard things in Computer Science: cache invalidation and naming things.
     -- Phil Karlton
  9. #softwaresucks
     without namespaces

  10. Metrics namespaces
      • Helps your mental model
      • Helps identifying things
      • Dimensions: location, versions, etc.
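One way to picture a metrics namespace is as a dotted path built from dimensions. The sketch below is only an illustration: the function `metric_name`, the dimension order, and the example values are assumptions, not something taken from the slides.

```python
def metric_name(*parts: str) -> str:
    """Join namespace components into one dotted metric path.

    Dots and spaces inside a component are replaced with underscores so that
    a value such as a version number cannot add accidental namespace levels.
    """
    cleaned = [p.strip().lower().replace(".", "_").replace(" ", "_") for p in parts]
    if any(not p for p in cleaned):
        raise ValueError("namespace components must be non-empty")
    return ".".join(cleaned)


# Hypothetical dimensions: environment, location, service, version, measurement.
name = metric_name("production", "eu-west", "checkout", "v1.4.2", "latency_ms")
print(name)
```

Keeping the dimension order fixed is what makes the same metric easy to find and compare across locations and versions.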
  11. Monitoring based promotion
      Acceptance, Development, Production
      • Production configuration
      • Comparison
      • Log analysis
  12. Monitoring deployment
      • Push changes
      • Keep correspondence
      • Automate
      • Namespaces
  13. Synthetic traffic

  14. Canaries  

  15. Miner's canary
      • If a customer lets you know about a problem, then you have already failed at least twice
      • The right quantity
      • Filtering – see the right picture
      • Document changes to your baselines
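A canary only helps if it is compared against a known baseline. The following is a minimal sketch of such a comparison, assuming the signal is an error rate and that a relative tolerance is acceptable; the function name, the tolerance value, and the choice of metric are all illustrative assumptions.

```python
def canary_healthy(baseline_error_rate: float,
                   canary_error_rate: float,
                   tolerance: float = 0.5) -> bool:
    """Return True while the canary stays within `tolerance` of the baseline.

    `tolerance` is relative: 0.5 means the canary may show up to 50% more
    errors than the documented baseline before it is considered unhealthy.
    """
    allowed = baseline_error_rate * (1.0 + tolerance)
    return canary_error_rate <= allowed


# Hypothetical numbers: a 2% baseline, checked against two canary readings.
print(canary_healthy(0.02, 0.025))
print(canary_healthy(0.02, 0.05))
```

When the baseline itself changes, for example after a release, the new value needs to be recorded, otherwise a check like this compares against a stale number.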
  16. Other types of birds

  17. The pretty ones we just saw

  18. The Angry ones

  19. And monkeys!

  20. Auditing
      Events timeline
      • Changes
      • Deployments
      • Rollbacks
      • Alarms
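An events timeline becomes useful for auditing once alarms can be lined up against the changes that preceded them. Here is a small sketch under assumed names: the `Event` record, its `kind` values, and `events_before` are hypothetical and only mirror the four event types listed on the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str    # "change", "deployment", "rollback" or "alarm"
    detail: str


def events_before(timeline, alarm, window):
    """Return the non-alarm events in the `window` leading up to `alarm`, oldest first."""
    start = alarm.at - window
    hits = [e for e in timeline if e.kind != "alarm" and start <= e.at <= alarm.at]
    return sorted(hits, key=lambda e: e.at)


# Hypothetical timeline: one deployment shortly before a latency alarm.
noon = datetime(2013, 9, 20, 12, 0)
timeline = [
    Event(noon, "deployment", "checkout v2"),
    Event(noon + timedelta(minutes=4), "alarm", "latency above threshold"),
]
for event in events_before(timeline, timeline[1], timedelta(minutes=10)):
    print(event.kind, event.detail)
```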
  21. Architecture
      • Single responsibility principle
      • Orchestration or Choreography
      • Dynamic configuration
      • Failover and feedback cycles
      • Rate limiting
      • Integration patterns
  22. Single responsibility principle
      • (Micro-)Services
      • Components
      • Small number of dependencies
      • Predictable failure modes
      • Easier adaptation
      • Expectation on metrics
  23. Orchestration or Choreography
      • Orchestration
        – May be simpler to reason about
        – Coupling with the director
      • Choreography
        – Possibly more flexible
        – Beware of corruption of state
  24. Dynamic configuration
      • Reconfigurable at runtime
      • Fast reaction
      • Beware of snowflakes
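Runtime reconfiguration can be sketched as a small in-process settings holder that notifies interested components when a value changes. The class `DynamicConfig` and its methods are hypothetical names chosen for this illustration; a real system would typically back this with a shared configuration store.

```python
import threading


class DynamicConfig:
    """Thread-safe settings holder that can be changed while the service runs."""

    def __init__(self, **initial):
        self._lock = threading.Lock()
        self._values = dict(initial)
        self._listeners = []

    def get(self, key):
        with self._lock:
            return self._values[key]

    def subscribe(self, listener):
        """Register a callable invoked as listener(key, old_value, new_value)."""
        self._listeners.append(listener)

    def update(self, key, value):
        with self._lock:
            old = self._values.get(key)
            self._values[key] = value
        # Notify outside the lock so listeners may safely call get().
        for listener in list(self._listeners):
            listener(key, old, value)


# Hypothetical usage: lower a limit at runtime without a restart.
config = DynamicConfig(max_requests_per_second=100)
config.subscribe(lambda key, old, new: print(f"{key}: {old} -> {new}"))
config.update("max_requests_per_second", 50)
```

Pushing every change through one audited path like this is also one way to avoid snowflakes, since no instance ends up hand-edited.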
  25. Failover and feedback cycles
      • Automated failover
      • Failover stress
      • Beware of amplifying effects
      • Break cycles
  26. Rate limiting
      • Degraded is better than nothing
      • Not only at the top level
      • Component rate limiting
      • Rate limiting should be dynamic
      • Rate limiting can be partitioned
      • Clients should be part of the contract
      • Rate limiting is after all handshaking
      • Handshaking: within the protocol or out of band
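The "dynamic" and "partitioned" points above can be illustrated with a token bucket, one common rate-limiting technique (the slides do not name a specific algorithm, so this choice is an assumption). `TokenBucket`, `PartitionedLimiter`, and the partition key are hypothetical names; the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Token bucket whose refill rate can be changed at runtime."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self):
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def set_rate(self, rate):
        self._refill()              # settle tokens earned at the old rate first
        self.rate = rate

    def allow(self, cost=1.0):
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                # caller should degrade or back off


class PartitionedLimiter:
    """One bucket per partition (for example per client), created on first use."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets = {}

    def allow(self, partition):
        bucket = self._buckets.get(partition)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[partition] = bucket
        return bucket.allow()
```

Partitioning means one noisy client exhausts only its own bucket, and `set_rate` gives an operator or a feedback loop a way to tighten or relax a limit while the service runs.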
  27. Integration and component patterns
      • Timeouts
      • Circuit breakers
      • Resource pools
      • Fail fast
      • Queue and retry
      • Application pings and sanity checks
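Of the patterns listed, the circuit breaker is perhaps the easiest to show in a few lines, and it also demonstrates failing fast. This is a simplified sketch with assumed names and thresholds (`CircuitBreaker`, `max_failures`, `reset_after`), not a production implementation; it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the protected dependency receives no traffic at all, which gives it room to recover instead of being hit by retries.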
  28. None
  29. Additional practices
      • Quarantine
      • Regenerative infrastructure
      • Rollback and monitoring
      • Automation of SOP – Runbook
  30. Automated runbooks and checklists
      • Automate your SOP
      • Respond to failure with a checklist
      • Automate checklists too
      • Helps to avoid the cognitive bias and other nasty stuff your brain does
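An automated checklist can be as simple as an ordered list of named checks that stops at the first failure. The sketch below assumes each step is a callable returning a truthy value on success; `run_checklist` and the example steps are hypothetical.

```python
def run_checklist(steps):
    """Run (description, check) pairs in order and stop at the first failing step.

    Returns a list of (description, passed) tuples for the steps that ran,
    so the outcome can be recorded alongside the incident.
    """
    results = []
    for description, check in steps:
        passed = bool(check())
        results.append((description, passed))
        if not passed:
            break
    return results


# Hypothetical steps; real checks would query the systems involved.
steps = [
    ("load balancer reports healthy backends", lambda: True),
    ("error rate back under baseline", lambda: False),
    ("queue depth draining", lambda: True),
]
for description, passed in run_checklist(steps):
    print("PASS" if passed else "FAIL", description)
```

Running the same steps in the same order every time is what removes the room for guesswork during an incident.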
  31. Discipline!

  32. Sources
      • Recovery Oriented Computing Papers
      • James Hamilton LISA paper
      • Release It!
      • Scalable Internet Architectures
      • A ton of other great books and papers
  33. The value
      Among the kinds of overhead:
      • The operational one
      • The customers' one
      No matter how sophisticated our monitoring infrastructure is, issues reported by customers are in the end the most important ones, as they impact the customers' experience directly and often uncover unknown bugs.
      Freeing the team as much as possible from overhead of the first type gives more time to focus on the issues of the product itself.
  34. Thank you!