How we made an API service that never went down...

More Decks by API Strategy & Practice Conference

Transcript

  1. How we made an API service that never went down and why that wasn’t the victory I’d intended.
     Ken Wronkiewicz (@wirehead)
     Manager, Software Development
     Rackspace
  2. [Architecture diagram. Labels: NYSE / NASDAQ; Ticker Plant (x2); Central Distributor (x2); Regional Distributor; Customer Distributor (x2); Stock Trader (x3); Control Plane (x2); DB (x4)]
  3. [Architecture diagram. Labels: NYSE / NASDAQ; Ticker Plant (x2); Central Distributor (x2); Rack-level Distributor (x3)]
  4. - Test coverage:
         - Unit tests
         - Integration tests
     - CI/CD pipelines
     - Code review practice
     - Operational practice
     - Automation instead of manual steps.
  5. [Architecture diagram. Labels: Cloud Load Balancer; API / Worker (x4); Cassandra (x4)]
  6. Some lies we tell ourselves about production operations:
     • Servers won’t go down
     • You’ll never lose a colo
     • You’ll never see a net split inside of a colo
     • Users will never notice a brief blip
  7. [Diagram. Labels: Get Lock; Release Lock; Request; Response]
  8. [Diagram. Labels: 2, 4, 1; 10 buckets; [1-3], [4-7], [8-10]; Front end nodes; Stale Events; Waiting Events]
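Slide 8 shows 10 buckets divided into the ranges [1-3], [4-7], and [8-10] next to a set of front end nodes. The slide does not say how those ranges were chosen or assigned, so the following is only a minimal sketch of one way to split numbered buckets into contiguous ranges, one per node. The function names `bucket_ranges` and `owner_of` are hypothetical and do not come from the talk.

```python
def bucket_ranges(num_buckets: int, num_nodes: int) -> list[tuple[int, int]]:
    """Split buckets 1..num_buckets into num_nodes contiguous inclusive ranges.

    Assumption: an even split with integer-rounded boundaries. The talk
    does not describe the actual assignment scheme.
    """
    if num_nodes < 1 or num_buckets < num_nodes:
        raise ValueError("need at least one bucket per node")
    # Boundary i is roughly i * num_buckets / num_nodes, rounded with
    # integer arithmetic so no floating-point error creeps in.
    bounds = [
        (i * num_buckets + num_nodes // 2) // num_nodes
        for i in range(num_nodes + 1)
    ]
    return [(bounds[i] + 1, bounds[i + 1]) for i in range(num_nodes)]


def owner_of(bucket: int, ranges: list[tuple[int, int]]) -> int:
    """Return the index of the node whose range contains the bucket."""
    for node_index, (low, high) in enumerate(ranges):
        if low <= bucket <= high:
            return node_index
    raise ValueError(f"bucket {bucket} is outside every range")


if __name__ == "__main__":
    ranges = bucket_ranges(10, 3)
    print(ranges)
    print(owner_of(5, ranges))
```

When a node joins or leaves, calling `bucket_ranges` again with the new node count yields the new ranges; how work in flight moves between nodes is outside this sketch.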
  9. Disaster day drills
     • Start in staging, but do it in production
     • Pick a Dungeon Master
     • If your monitoring/alerting/paging pipeline is working, the second the DM breaks something, you should notice.
     • You make sure that people are comfortable with each and every runbook.
  10. 90% = More than a month downtime per year
      99% = Less than 4 days per year
      99.9% = Less than 9 hours per year
      99.99% = Less than an hour per year
      99.999% = ~5 minutes per year
      99.9999% = ~31 seconds per year
      HTTP 500 Nein!
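The downtime figures on slide 10 follow from multiplying the length of a year by the unavailable fraction. A minimal sketch of that arithmetic, assuming a 365-day year (the helper name `downtime_seconds_per_year` is illustrative, not from the talk):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the allowed downtime per year, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        seconds = downtime_seconds_per_year(target)
        # Show the same budget in several units for easy comparison.
        print(
            f"{target}%: {seconds / 86400:.2f} days, "
            f"{seconds / 3600:.2f} hours, "
            f"{seconds / 60:.2f} minutes, "
            f"{seconds:.1f} seconds"
        )
```

Under this assumption, 99.9% works out to about 8.76 hours and 99.9999% to about 31.5 seconds per year, matching the rounded values on the slide.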