
Velocity NY 2014: Deploying on the Edge

Rob Peters
September 17, 2014

Slides from http://velocityconf.com/velocityny2014/public/schedule/detail/35815

Video at https://www.youtube.com/watch?v=PJbKTJN3ThY

See https://speakerdeck.com/rjpcal/velocity-ny-2014-deploying-on-the-edge-with-notes for slides with speaker notes

We operate a global edge network that delivers many types of modern web traffic, including dynamic applications, websites, mobile apps, live and on-demand streams, and large-file downloads. We strive to maintain reliability, performance, and functionality as we develop and deploy the http server software that handles this traffic. In this talk we’ll cover the evolution of our deployment best practices as we have learned from the community and from our own experiences, including the following:

* Go fast — the deployment cycle should be as short as possible in order to minimize batch size (so as to constrain the scope of the unexpected, because there is always something unexpected) and reduce risk and mean-time-to-recovery.
* But not too fast — the deployment cycle should be long enough to be very confident that the latest release has no new issues before moving on to the next release. The time to “long enough” may vary significantly depending on the layer of the software stack.
* Monitor everything — you can’t fix a problem until you can visualize it.
* But really monitor just a few things — find the smallest set of vital signs that can reliably indicate “is everything running smoothly?”
* Be able to roll forward/backward almost instantly, keeping in mind that the links between command/control and edge systems may be slow and/or lossy.
* Be “lazy”: a programmer’s “lazy” can mean spending days building something that turns a 10-second task into a 2-second task. This is exactly the right approach when it comes to deployment, where it means spending extra time to ensure that new code and configurations are built in such a way that they can be deployed painlessly. This often amounts to strict compatibility between the default behaviors of adjacent versions, configurability to easily turn new functionality on/off, and comprehensive hooks for testing and monitoring.
* Minimize risk in the deployment process itself — while each new software update may have different and unique changes, the procedure for deploying the update can be the same every time.
* Don’t be too portable/configurable — if the application should never run in production without package XYZ, then it shouldn’t pretend to be portable to an environment without XYZ.

Transcript

  1. COPYRIGHT © 2014 VERIZON, ALL RIGHTS RESERVED. INFORMATION CONTAINED HEREIN

    IS PROVIDED AS IS AND SUBJECT TO CHANGE WITHOUT NOTICE. DEPLOYING ON THE EDGE Rob Peters | Chief Architect | @rjpcal Verizon EdgeCast Velocity Conference | New York | September 17, 2014 http://velocityconf.com/velocityny2014/public/schedule/detail/35815
  2. About Verizon EdgeCast
     * EdgeCast founded in 2006
     * Became part of Verizon Digital Media Services in 2013
     * Now delivering 4-7% of end-user internet traffic
     * 6,000+ customers across all segments
     * About me:
       * 6 years at EdgeCast in Core Engineering
       * Focused on functionality, performance, reliability of our edge server network
       * Background of Ph.D. / Post-Doc in Computational Neuroscience:
         * measurement: human visual psychophysics and eye tracking
         * modeling: biologically-inspired computer vision systems
  3. Outline
     * The problem landscape
     * Current practices
       * go fast (but know when you might have to go slow, and how to deal)
       * monitor everything (but find your vital signs)
       * simplify the process
     * Guiding principles
  4. The big picture
     * 60+ Super POPs
     * 5 Continents / 20 Countries
     * 2,000+ Peering Connections
     * 8,000,000+ Objects/Second
     * Edge Services:
       * HTTP Content/Application Delivery
       * Live & On-Demand Streaming
       * DNS
       * Security (WAF & DDoS)
  5. Content delivery in action (diagram). Elements: customer origins, back-office POP, edge POPs, end users, customer ops, CDN ops. Flows: data (http traffic); metadata (monitoring; command+control).
  7. Anatomy of an http edge server (diagram). Components: core libs & OS kernel; OS configs (sysctl conf, iproute conf); system daemons (cron conf, monit conf, rsyslog conf); edge helper daemons (infosrv conf, cache mgr conf); core application sailfish (customer conf, rules, lua; app conf; env info).
  10. Anatomy of an edge server (the diagram from slide 7, annotated with change frequencies: 100x/day, 100x/day, 100x/day, 1-10x/week, 2-4x/year)
  11. Change flows / deployment pipelines (type of change: frequency)
      * network-level configs (e.g. info about peer servers, VIPs, address blocks): ~100x per day
      * customer configs (what customers specify in portals/APIs): ~100x per day
      * OS + app configs: ~1-10x per week
      * scripts + glue + base code (e.g. daemon control, cron, monit, snmp, sysctl, ...): ~1-5x per week
      * core application (“sailfish” http server): ~1-4x per month
      * kernel: ~2-4x per year
      * OS distro: ~1x per 2 years
  12. Anatomy of an http edge server (diagram repeated, with “customer conf, rules, lua” called out)
  13. edge configs are a ~1 million LOC program with thousands of maintainers and 100x deploys per day
  14. The usual factors
      * performance
      * uptime
      * competitive feature set
      * pleasant + fulfilling work environment
  15. The UNusual factors
      * scale
      * inertia: lots of content in cache
      * geographical diversity
      * customers’ failure/risk tolerance varies widely
      * software development at many layers of the stack
      * individual component failure is no problem
  16. Go fast (but not too fast)
  17. Why go fast?
      * deployment cycle should be short to
        * minimize batch size
        * minimize risk & time-to-recovery
        * minimize latency to reap benefits of a new feature
      * see:
        * Flowcon – http://flowcon.org/
        * Continuous Delivery – Jez Humble, David Farley – Addison-Wesley 2010
        * How To Win Computers and Influence Reality – Adam Jacob – Velocity 2012 [http://velocityconf.com/velocity2013/public/schedule/detail/29503]
        * DevOps Means Business – Gene Kim, Jez Humble, Nigel Kersten, Nicole Forsgren Velasquez – Velocity 2014 [http://velocityconf.com/velocity2014/public/schedule/detail/35184]
  18. Going faster (plot: confidence, 0 to 1, versus time, from “idea(s)!” to “fully deployed”)
  20. Going faster (plot: confidence, 0 to 1, versus time, from “idea(s)!” to “fully deployed”). Milestones along the curve: compiles? passes CI tests? passes load test? doesn’t crash in prod? prod metrics healthy? customers happy?
  21. Going faster (two confidence-versus-time plots, from “idea(s)!” to “fully deployed”; “compiles?” and “passes CI tests?” are grouped separately from “passes load test?”, “doesn’t crash in prod?”, “prod metrics healthy?” and “customers happy?”)
  24. Change flows (diagram: each kind of change reaches the edge server through a tool)
      * customer config changes: customer portal
      * professional services: admin portal
      * provisioning servers: admin portal
      * provisioning networks addrs: admin portal
      * update helper apps: cmdline tool
      * updating app configs: cmdline tool
      * updating app code: cmdline tool
  25. Change flows (same diagram, with a “test” step added to every flow: customer config changes, professional services, provisioning servers, provisioning networks addrs, update helper apps, updating app configs, updating app code)
  28. Change flows (same diagram, with a “test” step on every flow and “edgeverify” added)
  29. EdgeVerify
          {
            "name": "TKT0033517 - ec_country cookie exists",
            "request": {
              "url": "http://localhost/fr/sub/",
              "headers": {
                "Host": "www.customer.com",
                "x-enable-country-check": "1",
                "X-Forwarded-For": "72.21.82.34"
              }
            },
            "response": {
              "status": 404,
              "headers": {
                "Set-Cookie": "=~ ec_country=us"
              }
            }
          }
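The test case above pairs a request with an expected status and expected response headers. The slides do not show how EdgeVerify evaluates such a file, so the following is only a minimal sketch of one plausible reading: it assumes that a header value starting with `=~ ` is a regular expression and that anything else must match exactly. The function names `header_matches` and `check_response` are hypothetical.

```python
import re


def header_matches(expected: str, actual: str) -> bool:
    """Compare one expected header value with the actual value.

    Assumption (not confirmed by the slides): a leading "=~ " marks the
    rest of the string as a regular expression; otherwise match exactly.
    """
    if expected.startswith("=~ "):
        return re.search(expected[3:], actual) is not None
    return expected == actual


def check_response(expected: dict, status: int, headers: dict) -> list:
    """Return a list of failure messages; an empty list means the test passed."""
    failures = []
    if "status" in expected and expected["status"] != status:
        failures.append(f"status: got {status}, expected {expected['status']}")
    for name, want in expected.get("headers", {}).items():
        got = headers.get(name)
        if got is None or not header_matches(want, got):
            failures.append(f"header {name}: got {got!r}, expected {want!r}")
    return failures


# The "response" part of the test case shown on the slide.
expected = {"status": 404, "headers": {"Set-Cookie": "=~ ec_country=us"}}

passing = check_response(expected, 404, {"Set-Cookie": "ec_country=us; path=/"})
failing = check_response(expected, 503, {})
```

Here `passing` is empty, while `failing` reports both the wrong status and the missing cookie header.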
  32. EdgeVerify
      EdgeVerify failed while testing the adn platform. The last known good revision has not been updated. Changes will NOT make it out to this platform until the error is corrected.
      EdgeVerify failed while checking changes made between revision 3868707 and 3868835. Failures were:
          # Failed test '[31601.json] TKT0089074 -- Increase max compression size - status 200'
          # got: '503'
          # expected: '200'
          # Looks like you failed 1 test of 122.
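The failure notice above describes a gate: the “last known good” revision only moves forward when the test suite passes, and otherwise changes stay off the platform. A rough sketch of that gating logic follows. The function `advance_last_known_good` and the stand-in test runner `fake_tests` are hypothetical; only the revision numbers come from the slide.

```python
def advance_last_known_good(last_good, candidate, run_tests):
    """Return (revision to publish, failures).

    run_tests(revision) is a hypothetical callable that returns a list of
    failure messages; an empty list means every test passed.
    """
    failures = run_tests(candidate)
    if failures:
        # Keep serving the previous revision until the error is corrected.
        return last_good, failures
    return candidate, []


def fake_tests(revision):
    # Stand-in test runner: pretend revision 3868835 breaks one test.
    if revision == 3868835:
        return ["status: got 503, expected 200"]
    return []


published, failures = advance_last_known_good(3868707, 3868835, fake_tests)
```

With the failing candidate, `published` stays at 3868707 and `failures` carries the message for whoever made the change.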
  33. Going faster (plot: confidence, 0 to 1, versus time, from “first time in prod” to “deployment complete”)
  34. each deployment is an experiment. null hypothesis: code/config version N+1 behaves identically to code/config version N (except for expected changes X, Y, Z)
  35. Controlled A/B experiments
      * server group A (control): old version in time period 1 (pre); old version in time period 2 (post)
      * server group B (test): old version in time period 1 (pre); new version in time period 2 (post)
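One common way to read a design with a control group and a pre/post period is a difference-in-differences comparison: subtract the shift seen in the control group (which captures traffic cycles and other background drift) from the shift seen in the test group. The talk does not say which statistic was actually used, so this is only an illustrative sketch with made-up metric samples.

```python
from statistics import mean


def diff_in_diff(control_pre, control_post, test_pre, test_post):
    """Change in the test group beyond the change seen in the control group."""
    control_shift = mean(control_post) - mean(control_pre)
    test_shift = mean(test_post) - mean(test_pre)
    return test_shift - control_shift


# Made-up samples of some metric (for example, requests served per second).
effect = diff_in_diff(
    control_pre=[100, 102],   # group A, old version, period 1
    control_post=[110, 112],  # group A, old version, period 2
    test_pre=[101, 103],      # group B, old version, period 1
    test_post=[111, 119],     # group B, new version, period 2
)
```

Under the null hypothesis that version N+1 behaves like version N, `effect` should stay close to zero; a persistent non-zero value is evidence that the new version changed the metric.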
  36. A/B Comparisons (graphs)
  37. Anatomy of an edge server (diagram repeated). Components: OS kernel & core libs; OS configs (sysctl conf, iproute conf); system daemons (cron conf, monit conf, rsyslog conf); edge helper daemons (infosrv conf, cache mgr conf); core application sailfish (customer conf, rules, lua; app conf; env info).
  39. All the things: kernel 208.5-day bug
      * 64-bit counter of nanoseconds since boot
      * is supposed to wrap at (1<<64) nanoseconds (i.e. 585 years)
      * actually wraps around (1<<54) nanoseconds (i.e. 208.5 days)
      * http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=4cecf6d401a01d054afc1e5f605bcbfe553cb9b9
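The two wrap-around figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year for the conversion to years.

```python
NS_PER_SECOND = 10**9
SECONDS_PER_DAY = 86_400


def ns_to_days(ns: int) -> float:
    """Convert a nanosecond count to days."""
    return ns / NS_PER_SECOND / SECONDS_PER_DAY


expected_wrap_years = ns_to_days(1 << 64) / 365  # intended wrap point
actual_wrap_days = ns_to_days(1 << 54)           # wrap point with the bug

print(round(expected_wrap_years), "years")  # 585 years
print(round(actual_wrap_days, 1), "days")   # 208.5 days
```

Losing ten bits of range divides the wrap time by 1024, which is what turns centuries into roughly seven months of uptime.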
  40. All the things: traffic cycles
  41. All the things: sporadic traffic
  42. All the things: sporadic errors
  43. All the things: geographic diversity (plot: % TCP segments retransmitted, 0 to 0.08, versus kB delivered per http request, 0 to 600, for Asia, Europe and North America)
  44. Compensating for “slow”
      * server group A (control): old version in time period 1 (pre); old version in time period 2 (post)
      * server group B (test): old version in time period 1 (pre); new version in time period 2 (post)
  45. Compensating for “slow”
      * time period 1 (pre) / time period 2 (post):
        * server group A (control): old version, then old version
        * server group B (test): old version, then new version
      * time period (live):
        * server group A (prod): old version
        * server group A’ (control): old version
        * server group A’’ (test): new version
      * see: Amir Khakpour, “Ghostfish”, http://velocityconf.com/velocity2014/public/schedule/detail/36846
  46. Compensating for “slow” (plot: confidence, 0 to 1, versus time, from “first time in prod” to “deployment complete”)
  47. Decompose with feature flags
      code:
          if (config.new_parser == true) {
              run_new_parser();
          } else {
              run_old_parser();
          }
      config:
          [customerid == 123]
          new_parser = true
          [serverid == 456]
          new_parser = true
      http://code.flickr.net/2009/12/02/flipping-out/
  48. Decompose with feature flags
      * “dark launches”: control end users’ perceived launch timing
      * simplify version control (avoids feature branching; see also “branch by abstraction”)
      * reduce batch size (work-in-progress sits in trunk and can be deployed)
  49. Decompose with feature flags
      * added (lesser-known?) bonuses:
        * reduce need to roll back an entire release due to just one broken feature
        * allow independent / parallel experiments on the individual features
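The code/config pair on these slides suggests flags that default to off and are switched on for a specific customer or server. The slides do not show how sailfish resolves its config sections, so the following is a hypothetical sketch: `DEFAULTS`, `OVERRIDES`, `resolve_flags` and `parse` are illustrative names, and the precedence rule (later matching sections win) is an assumption.

```python
# Default behavior matches the previous release: the new parser is off.
DEFAULTS = {"new_parser": False}

# (scope key, scope value, overrides), mirroring the two config sections
# shown on the slide: [customerid == 123] and [serverid == 456].
OVERRIDES = [
    ("customerid", 123, {"new_parser": True}),
    ("serverid", 456, {"new_parser": True}),
]


def resolve_flags(context: dict) -> dict:
    """Start from defaults and apply every section whose scope matches."""
    flags = dict(DEFAULTS)
    for key, value, overrides in OVERRIDES:
        if context.get(key) == value:
            flags.update(overrides)
    return flags


def parse(request: str, context: dict) -> str:
    """Pick the code path per request, based on the resolved flags."""
    if resolve_flags(context)["new_parser"]:
        return "new:" + request
    return "old:" + request


flagged = parse("GET /", {"customerid": 123, "serverid": 1})
unflagged = parse("GET /", {"customerid": 9, "serverid": 1})
```

Because the flag is data rather than code, turning one broken feature off is a config change, and different features can be trialled on different customers or servers at the same time.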
  50. Monitor everything but know your vital signs
  51. In software development, you can’t fix a bug until you can reproduce it. (paraphrased from http://www.mehdi-khalili.com/bug-fixing-help-reproduce-a-bug)
  52. Measurement Stream – Examples (samples / month: measurement; details)
      * PASSIVE
        * 10^13: http edge server access logs; remote/local IP address, timestamp/duration, Host/URL, User-Agent, byte counts, TCP stats
        * 10^12: DNS server logs; remote/local IP address, timestamp, hostname
        * 10^10: application metrics; status codes, error counts, customer stats
        * 10^10: OS/hardware metrics; disk, network, cpu, memory usage stats
        * 10^9: core routers, switches; netflow, per-port usage & errors
      * ACTIVE
        * 10^11: internally generated synthetic probes; local and inter-POP
        * 10^7: third-party synthetic probes; download timing, traceroutes
        * 10^9: real-user beacons; from html/javascript or video players
  53. Measurement Maturity Model (maturity stage: description)
      * M1 Manual capture: Data must be explicitly manually captured when / where desired
      * M2 Automated recording: Automated processes record data continuously from target systems
      * M3 Automated aggregation / visualization: Automated tools offer visualization and inspection of aggregated data
      * M4 Proactive alerts: Automated systems proactively notify appropriate teams of anomalies
      * M5 Self-healing / self-adaptation: System acts autonomously to correct for detected anomalies
      http://velocityconf.com/velocity2014/public/schedule/detail/36847
  54. Going faster (plot: confidence, 0 to 1, versus time, from “idea(s)!” to “fully deployed”). Milestones along the curve: compiles? passes CI tests? passes load test? doesn’t crash in prod? prod metrics healthy? customers happy?
  55. Timing info in logs
          1226600390 0 93.184.208.68 19 93.184.208.105 80 TCP_HIT/200 369 GET http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" "EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/ tt=0.4/cmp=-/rc=1/93.184.208.105:80
          1226600391 0 93.184.208.69 19 93.184.208.105 80 TCP_HIT/200 369 GET http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" "EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/ tt=0.5/cmp=-/rc=1/93.184.208.105:80
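Given the slide title, the `tt=` field in these access-log lines appears to carry per-request timing, though its exact meaning and units are not stated. As a rough sketch of moving from raw recording toward aggregation, the snippet below pulls that field out of each line and summarizes it; the regular expression and the helper name `extract_tt` are assumptions, not part of any EdgeCast tooling.

```python
import re
from statistics import median

# Matches "tt=<number>" anywhere in the line (assumed to be a timing value).
TT_FIELD = re.compile(r"\btt=([0-9]+(?:\.[0-9]+)?)")


def extract_tt(line: str):
    """Return the tt value of one log line as a float, or None if absent."""
    match = TT_FIELD.search(line)
    return float(match.group(1)) if match else None


# The two sample lines from the slide.
lines = [
    '1226600390 0 93.184.208.68 19 93.184.208.105 80 TCP_HIT/200 369 GET '
    'http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" '
    '"EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/ tt=0.4/cmp=-/rc=1/93.184.208.105:80',
    '1226600391 0 93.184.208.69 19 93.184.208.105 80 TCP_HIT/200 369 GET '
    'http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" '
    '"EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/ tt=0.5/cmp=-/rc=1/93.184.208.105:80',
]

timings = [t for t in (extract_tt(line) for line in lines) if t is not None]
typical = median(timings)
```

A per-server or per-POP median of this value is the kind of aggregate that can then be graphed and compared between control and test groups.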
  56. Aggregate graphs
  57. App metrics – “live grid”, w/ graphs
  58. Cross-measurement correlation
  59. Alarm dashboard
  60. Simplify the process
  61. Code with the deployment in mind
      * spend extra time to make sure that new code and configurations are built in such a way that they can be deployed painlessly and without downtime, including if rollback is required
        * feature flags
        * strict feature/bug-compatibility between adjacent versions
        * comprehensive testing & monitoring hooks
  62. this is not the time to start your preflight checks (photo: By D464-Darren Hall (2 5 R in Frankfurt where all the magic happens) [CC-BY-SA-], via Wikimedia Commons)
  63. The painfully slow rollback
          $ svn up -r5 /code/sailfish/
          # takes a while but no matter
          # now start new version
          $ sailfish.sh reload
          # something broken; revert
          $ svn up -r4 /code/sailfish/
          # internet is slow …
          # … fetching 100s MBs
          # … meanwhile things are broken
          # … finally, restart old version
          $ sailfish.sh reload
  64. The painfully slow rollback (the slow sequence from slide 63, shown next to this alternative)
          $ svn co -r5 /code/sailfish/5
          # takes a while but no matter
          $ echo 5 > /config/sailfish-rev
          # sailfish.sh reads sailfish-rev
          # start new version
          $ sailfish.sh reload
          # something broken; revert
          # /code/sailfish/4 still exists
          $ echo 4 > /config/sailfish-rev
          # sailfish.sh reads sailfish-rev
          # restart old version
          $ sailfish.sh reload
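The second sequence separates the slow step (fetching a revision into its own directory) from the fast step (changing which revision is active). A small self-contained sketch of that pattern follows. It uses a temporary directory in place of `/code` and `/config`, and the helper names `install`, `activate` and `active_payload` are hypothetical; the real `sailfish.sh` is not shown in the slides.

```python
import tempfile
from pathlib import Path


def install(root: Path, rev: int, payload: str) -> None:
    """Slow step, done ahead of time: put revision `rev` in its own directory."""
    rev_dir = root / "code" / str(rev)
    rev_dir.mkdir(parents=True, exist_ok=True)
    (rev_dir / "VERSION").write_text(payload)


def activate(root: Path, rev: int) -> None:
    """Fast step: point the launcher at a revision that is already on disk."""
    if not (root / "code" / str(rev)).is_dir():
        raise FileNotFoundError(f"revision {rev} is not installed")
    (root / "config").mkdir(exist_ok=True)
    (root / "config" / "active-rev").write_text(str(rev))


def active_payload(root: Path) -> str:
    """What a launcher would do on reload: read the pointer, then load that tree."""
    rev = (root / "config" / "active-rev").read_text().strip()
    return (root / "code" / rev / "VERSION").read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    install(root, 4, "old build")
    install(root, 5, "new build")  # may be slow; nothing is live yet
    activate(root, 5)
    after_deploy = active_payload(root)
    activate(root, 4)              # rollback: no network fetch needed
    after_rollback = active_payload(root)
```

Rolling back is now a local write of a few bytes, so it still works when the link between command/control and the edge is slow or lossy.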
  65. Phased rollout: simple? (plot: confidence, 0 to 1, versus time, stepping through 1 server, 5% servers, 20% servers, 50%)
  66. Make changes be cookie-cutter
      A phased rollout plan:
      * deploy to 1 server
      * deploy to 5% of servers
      * deploy to 20% of servers
      * deploy to 50% of servers
      * deploy to all servers
      Questions to ask:
      * are we using the same command for each step?
      * how much redundant information do we have to enter?
      * is each phased rollout done the same way?
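One way to make the plan above cookie-cutter is to compute every phase from the same server list with the same function, so that no step needs hand-entered host names. The sketch below is illustrative only: `rollout_batches` and the `edge-NNN` host names are hypothetical, and the phase percentages are taken from the slide.

```python
def rollout_batches(servers, percents=(5, 20, 50, 100)):
    """Split `servers` into phases: 1 server first, then up to each percentage.

    Each batch contains only the servers newly targeted in that phase.
    """
    total = len(servers)
    # Cumulative targets; integer ceiling of percent * total / 100.
    targets = [1] + [(p * total + 99) // 100 for p in percents]
    batches, done = [], 0
    for target in targets:
        target = min(max(target, done), total)
        batches.append(servers[done:target])
        done = target
    return batches


servers = [f"edge-{i:03d}" for i in range(100)]  # hypothetical host names
batches = rollout_batches(servers)
sizes = [len(batch) for batch in batches]
print(sizes)  # [1, 4, 15, 30, 50]
```

Every deployment then runs the same five steps with the same command, and the only input that changes between releases is the version being deployed.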
  69. Recognize non-cookie-cutter changes
      * Tools are great for reducing the deployment risk for standard changes
      * But there are always non-standard changes
      * Pro-tip: one type of high-risk non-standard change, ironically, is the deployment of new tools/infrastructure to be used for low-risk standard changes
  70. standard change (photo: By D464-Darren Hall (2 5 R in Frankfurt where all the magic happens) [CC-BY-SA-], via Wikimedia Commons)
  71. Index: src/log.h.in
          =====================================================
          --- src/log.h.in (revision 6450)
          +++ src/log.h.in (revision 6464)
          @@ -28,6 +28,7 @@
           #include <vector>
           #include <cstdlib> // for abort()
           #include <malloc.h>
          +#include <fcntl.h> // defines O_ASYNC
           ///////////////////////////////////////
  73. in our code:
          #if defined(HAVE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
          // read ahead
          const int ret = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
          // ...
          #endif
      in /usr/include/bits/fcntl.h:
          # define POSIX_FADV_WILLNEED 3 /* Will need these pages. */
  74. The unexpected #ifdef
      * if you have portability flags, you should be testing them
      * if you don’t feel the need to test them (because your prod environments won’t ever need them) then you don’t need that “portability”
  75. Thanks!
      * Go fast:
        * use your tests for max benefit
        * but you might have to go slow
        * use feature flags for speed/safety
        * parallelize experiments
      * Monitor everything:
        * you need to visualize it
        * mature your metrics
        * focus on vital signs
      * Simplify the process:
        * code with deployment in mind
        * instant roll-back
        * (non-)cookie-cutter changes
        * watch out for new tools
        * don’t be (too) portable
      Find me online: Rob Peters @rjpcal
      Find us online: http://verizondigitalmedia.com/ http://www.edgecast.com/ @edgecast