
Berlin 2013 - Session - Reza Spagnolo

Monitorama
September 20, 2013

Transcript

  1. Adaptive Application Architecture
    Reza Spagnolo
    @rmspagnolo

  2. Hey there!
    Who am I?
    • A student
    • An engineer, for 9 years now
    • Interested in building systems
    • Dev & Ops since the beginning

  3. #monitoringsocks
    but never sucked for real

  4. Monitoring is an architecture component

  5. Infrastructure is code

  6. Monitoring is code
    • Development process
    • Testing
    • Deployment

  7. Monitoring is service
    • Metrics
    • Alerts

  8. Namespaces
    There are only two hard things in Computer Science: cache invalidation
    and naming things. -- Phil Karlton

  9. #softwaresucks
    without namespaces

  10. Metrics namespaces
    • Help your mental model
    • Help identify things
    • Dimensions: location, versions, etc.
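
The idea of a metrics namespace with dimensions can be sketched as follows. This is only an illustration: the `MetricName` class, the dot-separated layout, and the example values are assumptions, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricName:
    """A metric identified by a hierarchical namespace plus dimensions."""

    location: str  # hypothetical dimension, e.g. a data centre or region
    service: str   # hypothetical dimension
    version: str   # hypothetical dimension
    name: str

    def path(self) -> str:
        # Dots separate namespace levels, so dots inside a value are replaced.
        parts = (self.location, self.service, self.version, self.name)
        return ".".join(p.replace(".", "_") for p in parts)


metric = MetricName("berlin", "checkout", "1.4.2", "requests.count")
print(metric.path())  # berlin.checkout.1_4_2.requests_count
```

Keeping the dimensions in a fixed order is one way to make the same metric easy to locate and compare across locations and versions.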

  11. Monitoring based promotion
    Environments: Development, Acceptance, Production
    • Production configuration
    • Comparison
    • Log analysis

  12. Monitoring deployment
    • Push changes
    • Keep correspondence
    • Automate
    • Namespaces

  13. Synthetic traffic

  14. Canaries

  15. Miner’s canary
    • If a customer lets you know about a problem, then you have already failed at least twice
    • The right quantity
    • Filtering – see the right picture
    • Document changes to your baselines
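
A canary is only useful if it is compared against a documented baseline. The sketch below shows one possible comparison, assuming a simple "mean within N standard deviations" rule. The function name, the tolerance, and the sample error rates are all hypothetical; the slides do not prescribe a method.

```python
from statistics import mean, pstdev


def canary_deviates(baseline: list[float], canary: list[float],
                    tolerance: float = 3.0) -> bool:
    """Return True when the canary mean leaves the baseline band."""
    centre = mean(baseline)
    spread = pstdev(baseline)
    # Guard against a perfectly flat baseline, where spread would be zero.
    band = tolerance * max(spread, 1e-9)
    return abs(mean(canary) - centre) > band


# Hypothetical error rates sampled from the baseline and from two canaries.
baseline_error_rate = [0.010, 0.012, 0.011, 0.009, 0.010]
print(canary_deviates(baseline_error_rate, [0.011, 0.010]))  # False
print(canary_deviates(baseline_error_rate, [0.050, 0.060]))  # True
```

Whenever the baseline samples are replaced, recording why they changed keeps later comparisons interpretable.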

  16. Other types of birds

  17. The pretty ones we just saw

  18. The Angry ones

  19. And monkeys!

  20. Auditing
    Events timeline
    • Changes
    • Deployments
    • Rollbacks
    • Alarms
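
An events timeline becomes useful for auditing once alarms can be lined up against the changes that preceded them. A minimal sketch, assuming an in-memory list of events; the `Event` type, the `events_before` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str    # "change", "deployment", "rollback" or "alarm"
    detail: str


def events_before(timeline: list[Event], alarm: Event,
                  window: timedelta) -> list[Event]:
    """List non-alarm events that happened within `window` before `alarm`."""
    start = alarm.at - window
    return sorted(
        (e for e in timeline if e.kind != "alarm" and start <= e.at <= alarm.at),
        key=lambda e: e.at,
    )


t0 = datetime(2013, 9, 20, 10, 0)
timeline = [
    Event(t0, "deployment", "checkout 1.4.2"),
    Event(t0 + timedelta(minutes=5), "change", "raise pool size"),
    Event(t0 - timedelta(hours=3), "rollback", "search 2.0.1"),
]
alarm = Event(t0 + timedelta(minutes=12), "alarm", "error rate high")
for event in events_before(timeline, alarm, timedelta(minutes=30)):
    print(event.kind, event.detail)
```

Here the rollback from three hours earlier falls outside the 30-minute window, so only the deployment and the change are listed as candidates.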

  21. Architecture
    • Single responsibility principle
    • Orchestration or Choreography
    • Dynamic configuration
    • Failover and feedback cycles
    • Rate limiting
    • Integration patterns

  22. Single responsibility principle
    • (Micro-)Services
    • Components
    • Small number of dependencies
    • Predictable failure modes
    • Easier adaptation
    • Expectation on metrics

  23. Orchestration or Choreography
    • Orchestration
    – May be simpler to reason about
    – Coupling with the director
    • Choreography
    – Possibly more flexible
    – Beware of corruption of state

  24. Dynamic configuration
    • Reconfigurable at runtime
    • Fast reaction
    • Beware of snowflakes

  25. Failover and feedback cycles
    • Automated failover
    • Failover stress
    • Beware of amplifying effects
    • Break cycles
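
The slide does not name a technique for breaking feedback cycles. One common source of amplification is clients retrying immediately against a component that is already failing over, so the sketch below assumes bounded retries with capped exponential backoff as one possible reading. Both helper names and all parameters are hypothetical.

```python
import time


def backoff_delays(base: float, factor: float, cap: float,
                   attempts: int) -> list[float]:
    """Delays (seconds) between attempts; the list length bounds the retries."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retries(operation, delays, sleep=time.sleep):
    """Try `operation` once per delay plus one final time, then give up."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            sleep(delay)  # wait longer each time instead of hammering
    return operation()  # last attempt: let the error propagate


print(backoff_delays(0.5, 2.0, 5.0, 5))  # [0.5, 1.0, 2.0, 4.0, 5.0]
```

Because the number of attempts is fixed, a failing dependency sees a bounded amount of extra load rather than an ever-growing retry loop.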

  26. Rate limiting
    • Degraded is better than nothing
    • Not only at the top level
    • Component rate limiting
    • Rate limiting should be dynamic
    • Rate limiting can be partitioned
    • Clients should be part of the contract
    • Rate limiting is after all handshaking
    • Handshaking: within the protocol or out of band
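
Dynamic and partitioned rate limiting can be illustrated with a token bucket. The slides do not say which algorithm the speaker has in mind, so the token bucket, the class names, and the parameters below are assumptions chosen for the sketch.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def set_rate(self, rate: float) -> None:
        # Dynamic limiting: settle tokens at the old rate, then switch.
        self._refill()
        self.rate = rate

    def _refill(self) -> None:
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def allow(self) -> bool:
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller can degrade or reject instead of overloading


class PartitionedLimiter:
    """One bucket per partition key (for example, per client)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self._args = (rate, capacity, clock)
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, key: str) -> bool:
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = self._buckets[key] = TokenBucket(*self._args)
        return bucket.allow()


limiter = PartitionedLimiter(rate=1.0, capacity=2.0)
print([limiter.allow("client-a") for _ in range(3)])
print(limiter.allow("client-b"))  # a separate partition has its own budget
```

A rejected call is a signal the client can act on, which is one way to read "clients should be part of the contract": the refusal is an in-protocol form of handshaking.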

  27. Integration and component patterns
    • Timeouts
    • Circuit breakers
    • Resource pools
    • Fail fast
    • Queue and retry
    • Application pings and sanity checks
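
Of the patterns listed, the circuit breaker combines several of the others (it fails fast and relies on a timeout-like cool-down). The following is a minimal sketch under stated assumptions: the class, its thresholds, and the single-trial "half-open" behaviour are illustrative choices, not taken from the talk.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying it (fail fast)."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is known to be failing, which also gives that dependency room to recover.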

  28. [Slide with no text]

  29. Additional practices
    • Quarantine
    • Regenerative infrastructure
    • Rollback and monitoring
    • Automation of SOP – Runbook

  30. Automated runbooks and checklists
    • Automate your SOP
    • Respond to failure with a checklist
    • Automate checklists too
    • Helps to avoid the cognitive bias and other nasty stuff your brain does
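
An automated checklist can be as small as a list of named checks that are always run in full. The sketch below assumes each step is a function returning True or False; the `run_checklist` helper and the three steps are hypothetical stand-ins for real queries against the systems involved.

```python
from typing import Callable

Check = Callable[[], bool]


def run_checklist(steps: list[tuple[str, Check]]) -> list[tuple[str, bool]]:
    """Run every step in order and record the outcome, never skipping one."""
    results = []
    for description, check in steps:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as a failed step
        results.append((description, ok))
    return results


# Hypothetical steps with hard-coded values standing in for live readings.
steps = [
    ("Disk usage below 90%", lambda: 72 < 90),
    ("Replica lag below 5 s", lambda: 12.0 < 5.0),
    ("Health endpoint answers", lambda: True),
]
for description, ok in run_checklist(steps):
    print("PASS" if ok else "FAIL", description)
```

Running every step regardless of earlier results is the point: the responder sees the whole picture rather than stopping at the first explanation that looks plausible.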

  31. Discipline!

  32. Sources
    • Recovery Oriented Computing Papers
    • James Hamilton LISA paper
    • Release It!
    • Scalable Internet Architectures
    • A ton of other great books and papers

  33. The value
    Among the kinds of overhead:
    • The operational one
    • The customers' one
    No matter how sophisticated our monitoring infrastructure is, issues
    reported by customers are in the end the most important ones, as they
    impact the customers' experience directly and often reveal unknown bugs.

    Freeing the team as much as possible from the first kind of overhead
    gives it more time to focus on the issues of the product itself.

  34. Thank you!