Velocity NY 2014: Deploying on the Edge

COPYRIGHT © 2014 VERIZON, ALL RIGHTS RESERVED. INFORMATION CONTAINED HEREIN IS PROVIDED AS IS AND SUBJECT TO CHANGE WITHOUT NOTICE.

DEPLOYING ON THE EDGE
Rob Peters | Chief Architect | @rjpcal
Verizon EdgeCast
Velocity Conference | New York | September 17, 2014
http://velocityconf.com/velocityny2014/public/schedule/detail/35815

About Verizon EdgeCast
•  EdgeCast founded in 2006
•  Became part of Verizon Digital Media Services in 2013
•  Now delivering 4-7% of end-user internet traffic
•  6,000+ customers across all segments
•  About me:
   –  6 years at EdgeCast in Core Engineering
   –  Focused on functionality, performance, reliability of our edge server network
   –  Background of Ph.D. / Post-Doc in Computational Neuroscience:
      •  measurement: human visual psychophysics and eye tracking
      •  modeling: biologically-inspired computer vision systems

Outline
•  The problem landscape
•  Current practices
   –  go fast (but know when you might have to go slow, and how to deal)
   –  monitor everything (but find your vital signs)
   –  simplify the process
•  Guiding principles

The big picture
•  60+ Super POPs
•  5 Continents / 20 Countries
•  2,000+ Peering Connections
•  8,000,000+ Objects/Second
•  Edge Services:
   –  HTTP Content/Application Delivery
   –  Live & On-Demand Streaming
   –  DNS
   –  Security (WAF & DDoS)

Content delivery in action
[Diagram labels: customer origins, back-office POP, edge POPs, end users, customer ops, CDN ops. Two kinds of flow are shown: data (http traffic) and metadata (monitoring; command+control).]

Anatomy of an http edge server
[Diagram of the layers on an edge server]
•  core libs & OS kernel
•  OS configs: sysctl conf, iproute conf
•  system daemons: cron conf, monit conf, rsyslog conf
•  edge helper daemons: infosrv conf, cache mgr conf
•  core application (sailfish): customer conf, rules, lua; app conf; env info

Anatomy of an edge server
[Diagram of the layers on an edge server, annotated with change frequencies: 100x/day, 100x/day, 100x/day, 1-10x/week, 2-4x/year]
•  core libs & OS kernel
•  OS configs: sysctl conf, iproute conf
•  system daemons: cron conf, monit conf, rsyslog conf
•  edge helper daemons: infosrv conf, cache mgr conf
•  core application (sailfish): customer conf, rules, lua; app conf; env info

Change flows / deployment pipelines

TYPE OF CHANGE                                                                        FREQUENCY
network-level configs (e.g. info about peer servers, VIPs, address blocks)            ~100x per day
customer configs (what customers specify in portals/APIs)                             ~100x per day
OS + app configs                                                                      ~1-10x per week
scripts + glue + base code (e.g. daemon control, cron, monit, snmp, sysctl, ...)      ~1-5x per week
core application (“sailfish” http server)                                             ~1-4x per month
kernel                                                                                ~2-4x per year
OS distro                                                                             ~1x per 2 years

code == configs (*)

Anatomy of an http edge server
[Diagram of the layers on an edge server, with "customer conf, rules, lua" called out]
•  core libs & OS kernel
•  OS configs: sysctl conf, iproute conf
•  system daemons: cron conf, monit conf, rsyslog conf
•  edge helper daemons: infosrv conf, cache mgr conf
•  core application (sailfish): app conf; env info
•  customer conf, rules, lua

edge configs are a ~1 million LOC program with thousands
of maintainers and 100x deploys per day

The usual factors
•  performance
•  uptime
•  competitive feature set
•  pleasant + fulfilling work environment

The UNusual factors
•  scale
•  inertia: lots of content in cache
•  geographical diversity
•  customers’ failure/risk tolerance varies widely
•  software development at many layers of the stack
•  individual component failure is no problem

code with the deployment in mind ← i.e. dev+ops

Go fast (but not too fast)

Why go fast?
•  deployment cycle should be short to
   –  minimize batch size
   –  minimize risk & time-to-recovery
   –  minimize latency to reap benefits of a new feature
•  see:
   –  Flowcon – http://flowcon.org/
   –  Continuous Delivery – Jez Humble, David Farley – Addison-Wesley 2010
   –  How To Win Computers and Influence Reality – Adam Jacob – Velocity 2012 [http://velocityconf.com/velocity2013/public/schedule/detail/29503]
   –  DevOps Means Business – Gene Kim, Jez Humble, Nigel Kersten, Nicole Forsgren Velasquez – Velocity 2014 [http://velocityconf.com/velocity2014/public/schedule/detail/35184]

Going faster
[Graph: confidence (0 to 1) vs. time, running from "idea(s)!" to "fully deployed"]

Going faster
[Graph: confidence (0 to 1) vs. time, running from "idea(s)!" to "fully deployed", with checkpoints along the way: compiles? passes CI tests? passes load test? doesn’t crash in prod? prod metrics healthy? customers happy?]

Going faster
[Two graphs of confidence (0 to 1) vs. time, each running from "idea(s)!" to "fully deployed" through the same checkpoints: compiles? passes CI tests? passes load test? doesn’t crash in prod? prod metrics healthy? customers happy?]

Change flows
[Diagram, built up over three slides: each type of change enters through its own tool and flows to the edge server. The second build adds a "test" stage to every flow; the third adds "edgeverify".]

TYPE OF CHANGE                 ENTRY POINT
customer config changes        customer portal
professional services          admin portal
provisioning servers           admin portal
provisioning networks addrs    admin portal
update helper apps             cmdline tool
updating app configs           cmdline tool
updating app code              cmdline tool

EdgeVerify

    {
      "name": "TKT0033517 - ec_country cookie exists",
      "request": {
        "url": "http://localhost/fr/sub/",
        "headers": {
          "Host": "www.customer.com",
          "x-enable-country-check": "1",
          "X-Forwarded-For": "72.21.82.34"
        }
      },
      "response": {
        "status": 404,
        "headers": {
          "Set-Cookie": "=~ ec_country=us"
        }
      }
    }
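The talk does not show how EdgeVerify evaluates such a test case, so the following is only a minimal sketch of one way a runner could compare a response against the "response" section above. It assumes that a header value starting with "=~" means "match this regular expression" and that any other value must match exactly; `check_response` and its arguments are hypothetical names, not part of EdgeVerify.

```python
import re


def check_response(expected, status, headers):
    """Return a list of failure messages; an empty list means the test passed.

    `expected` mirrors the "response" section of a test case. Header names
    are compared with exact case here to keep the sketch short.
    """
    failures = []
    if "status" in expected and status != expected["status"]:
        failures.append(f"status: got {status!r}, expected {expected['status']!r}")
    for name, want in expected.get("headers", {}).items():
        got = headers.get(name)
        if want.startswith("=~"):
            # Assumption: "=~ <pattern>" requests a regex search on the value.
            pattern = want[2:].strip()
            ok = got is not None and re.search(pattern, got) is not None
        else:
            ok = got == want
        if not ok:
            failures.append(f"header {name}: got {got!r}, expected {want!r}")
    return failures


spec = {"status": 404, "headers": {"Set-Cookie": "=~ ec_country=us"}}
passing = check_response(spec, 404, {"Set-Cookie": "ec_country=us; path=/"})
failing = check_response(spec, 503, {})
```

Returning a list of failures rather than a boolean makes it easy to report every mismatch for a test case at once.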


EdgeVerify

    EdgeVerify failed while testing the adn platform. The last known good
    revision has not been updated. Changes will NOT make it out to this
    platform until the error is corrected.

    EdgeVerify failed while checking changes made between revision 3868707
    and 3868835. Failures were:

    # Failed test '[31601.json] TKT0089074 -- Increase max compression size - status 200'
    # got: '503'
    # expected: '200'
    # Looks like you failed 1 test of 122.

Going faster
[Graph: confidence (0 to 1) vs. time, marking "first time in prod" and "deployment complete"]

each deployment is an experiment
null hypothesis: code/config version N+1 behaves identically to code/config version N (except for expected changes X, Y, Z)

Controlled A/B experiments

                            time period 1 (pre)    time period 2 (post)
server group A (control)    old version            old version
server group B (test)       old version            new version
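The talk does not say which statistic is computed from this 2x2 design. One common way to read it is a difference-in-differences: subtract the pre-to-post shift seen in the control group from the shift seen in the test group, so that changes affecting both groups (time of day, traffic mix) cancel out. The sketch below assumes that approach; the sample values are made up for illustration.

```python
from statistics import mean


def diff_in_diff(control_pre, control_post, test_pre, test_post):
    """Estimate the effect of the new version on some per-server metric.

    The control group's shift captures drift unrelated to the deployment;
    removing it from the test group's shift leaves the deployment's effect.
    """
    control_shift = mean(control_post) - mean(control_pre)
    test_shift = mean(test_post) - mean(test_pre)
    return test_shift - control_shift


# Hypothetical latency samples in milliseconds.
control_pre = [10.0, 11.0, 9.0]     # mean 10
control_post = [12.0, 13.0, 11.0]   # mean 12: background drift of +2
test_pre = [10.0, 10.0, 10.0]       # mean 10
test_post = [15.0, 14.0, 16.0]      # mean 15: drift plus new version

effect = diff_in_diff(control_pre, control_post, test_pre, test_post)
```

Under the null hypothesis that version N+1 behaves like version N, the estimated effect should be close to zero; a real analysis would also need a noise estimate before rejecting it.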

A/B Comparisons (graphs)

go fast… but go slow enough

Anatomy of an edge server
[Diagram of the layers on an edge server]
•  OS kernel & core libs
•  OS configs: sysctl conf, iproute conf
•  system daemons: cron conf, monit conf, rsyslog conf
•  edge helper daemons: infosrv conf, cache mgr conf
•  core application (sailfish): customer conf, rules, lua; app conf; env info

All the things: kernel 208.5-day bug
•  64-bit counter of nanoseconds since boot
•  is supposed to wrap at (1<<64) nanoseconds (i.e. 585 years)
•  actually wraps around (1<<54) nanoseconds (i.e. 208.5 days)
•  http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=4cecf6d401a01d054afc1e5f605bcbfe553cb9b9
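The two wrap-around figures above follow directly from the bit widths. A short worked check, assuming a year of 365.25 days:

```python
NS_PER_SECOND = 10**9
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25  # assumption: average Julian year


def ns_to_days(nanoseconds):
    """Convert a nanosecond count to days."""
    return nanoseconds / NS_PER_SECOND / SECONDS_PER_DAY


# Where the counter is supposed to wrap: 2**64 nanoseconds.
intended_wrap_years = ns_to_days(1 << 64) / DAYS_PER_YEAR

# Where it actually wraps: 2**54 nanoseconds.
actual_wrap_days = ns_to_days(1 << 54)

print(f"intended: about {intended_wrap_years:.0f} years")
print(f"actual:   about {actual_wrap_days:.1f} days")
```

Losing 10 bits shortens the period by a factor of 1024, which is why a problem expected centuries away shows up after roughly seven months of uptime.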

All the things: traffic cycles

All the things: sporadic traffic

All the things: sporadic errors

All the things: geographic diversity
[Graph: % TCP segments retransmitted (0 to 0.08) vs. kB delivered per http request (0 to 600), with separate series for Asia, Europe, and North America]

Compensating for “slow”

                            time period 1 (pre)    time period 2 (post)
server group A (control)    old version            old version
server group B (test)       old version            new version

                             time period (live)
server group A (prod)        old version
server group A’ (control)    old version
server group A’’ (test)      new version

Amir Khakpour “Ghostfish” http://velocityconf.com/velocity2014/public/schedule/detail/36846

Compensating for “slow”
[Graph: confidence (0 to 1) vs. time, marking "first time in prod" and "deployment complete"]

Decompose with feature flags

code:
    if (config.new_parser == true) {
        run_new_parser();
    } else {
        run_old_parser();
    }

config:
    [customerid == 123]
    new_parser = true
    [serverid == 456]
    new_parser = true

http://code.flickr.net/2009/12/02/flipping-out/

Decompose with feature flags
•  “dark launches”: control end users’ perceived launch timing
•  simplify version control (avoids feature branching; see also “branch by abstraction”)
•  reduce batch size (work-in-progress sits in trunk and can be deployed)

Decompose with feature flags
•  added (lesser-known?) bonuses:
   –  reduce need to roll back an entire release due to just one broken feature
   –  allow independent / parallel experiments on the individual features
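The slides show the flag check and the scoped config only as pseudocode. The sketch below is one hypothetical way to resolve such scoped flags: every flag has a default, and a scope block whose condition matches the current request context overrides it. The names `DEFAULTS`, `SCOPED_OVERRIDES`, and `resolve_flags`, and the rule that later matching blocks win, are assumptions made for illustration.

```python
# Default value for every flag: new code paths ship disabled.
DEFAULTS = {"new_parser": False}

# Each entry: (context key, value to match, flag overrides), mirroring
# "[customerid == 123] new_parser = true" in the config above.
SCOPED_OVERRIDES = [
    ("customerid", 123, {"new_parser": True}),
    ("serverid", 456, {"new_parser": True}),
]


def resolve_flags(context):
    """Return the effective flags for one request context."""
    flags = dict(DEFAULTS)
    for key, value, overrides in SCOPED_OVERRIDES:
        if context.get(key) == value:
            flags.update(overrides)  # assumption: later matches win
    return flags


def parse(request, context):
    """Dispatch to the old or new code path based on the resolved flag."""
    if resolve_flags(context)["new_parser"]:
        return "new:" + request  # stand-in for run_new_parser()
    return "old:" + request      # stand-in for run_old_parser()


by_customer = parse("x", {"customerid": 123, "serverid": 1})
by_server = parse("x", {"customerid": 1, "serverid": 456})
everyone_else = parse("x", {"customerid": 1, "serverid": 1})
```

Because the switch lives in config rather than code, turning one feature off for one customer or one server is a config change and does not require rolling back the release.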

(*) code changes: serialized
    config changes: parallelized

Monitor everything but know your vital signs

In software development, you can’t fix a bug until you can reproduce it.
paraphrased from http://www.mehdi-khalili.com/bug-fixing-help-reproduce-a-bug

In web performance, you can’t fix a problem until you
can visualize it clearly.

Measurement Stream – Examples

Samples / month    Measurement                                Details
PASSIVE
10^13              http edge server access logs               remote/local IP address, timestamp/duration, Host/URL, User-Agent, byte counts, TCP stats
10^12              DNS server logs                            remote/local IP address, timestamp, hostname
10^10              application metrics                        status codes, error counts, customer stats
10^10              OS/hardware metrics                        disk, network, cpu, memory usage stats
10^9               core routers, switches                     netflow, per-port usage & errors
ACTIVE
10^11              internally generated synthetic probes      local and inter-POP
10^7               third-party synthetic probes               download timing, traceroutes
10^9               real-user beacons                          from html/javascript or video players

mature your measurement streams

Measurement Maturity Model

Maturity stage                                 Description
M1  Manual capture                             Data must be explicitly manually captured when / where desired
M2  Automated recording                        Automated processes record data continuously from target systems
M3  Automated aggregation / visualization      Automated tools offer visualization and inspection of aggregated data
M4  Proactive alerts                           Automated systems proactively notify appropriate teams of anomalies
M5  Self-healing / self-adaptation             System acts autonomously to correct for detected anomalies

http://velocityconf.com/velocity2014/public/schedule/detail/36847

Going faster
[Graph: confidence (0 to 1) vs. time, running from "idea(s)!" to "fully deployed", with checkpoints along the way: compiles? passes CI tests? passes load test? doesn’t crash in prod? prod metrics healthy? customers happy?]

Timing info in logs

    1226600390 0 93.184.208.68 19 93.184.208.105 80 TCP_HIT/200 369 GET http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" "EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/tt=0.4/cmp=-/rc=1/93.184.208.105:80
    1226600391 0 93.184.208.69 19 93.184.208.105 80 TCP_HIT/200 369 GET http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" "EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/tt=0.5/cmp=-/rc=1/93.184.208.105:80
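The slide does not document the log format, so the sketch below makes two assumptions: that the `tt=` field in the trailing slash-separated section is the timing value the slide title refers to (its unit is not stated), and that the `TCP_HIT/200` token is a cache result followed by the HTTP status. It pulls both out of each line so they can be aggregated.

```python
import re
from statistics import mean

# Assumption: "tt=<number>" is the timing field; units are not stated.
TT_FIELD = re.compile(r"\btt=(\d+(?:\.\d+)?)")
# Assumption: "TCP_<RESULT>/<code>" is cache result plus HTTP status.
CACHE_STATUS = re.compile(r"\b(TCP_[A-Z_]+)/(\d{3})\b")


def parse_line(line):
    """Extract cache result, status, and timing; None if a field is missing."""
    tt = TT_FIELD.search(line)
    cache = CACHE_STATUS.search(line)
    if tt is None or cache is None:
        return None
    return {
        "cache": cache.group(1),
        "status": int(cache.group(2)),
        "tt": float(tt.group(1)),
    }


LINES = [
    '1226600390 0 93.184.208.68 19 93.184.208.105 80 TCP_HIT/200 369 GET '
    'http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" '
    '"EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/tt=0.4/cmp=-/rc=1/93.184.208.105:80',
    '1226600391 0 93.184.208.69 19 93.184.208.105 80 TCP_HIT/200 369 GET '
    'http://93.184.208.105:80/000002/sla/health_check.html - 0 147 "-" '
    '"EdgeDirector/1.0" 2 hit/1/-/-/-/-/ok/-/-/nq=1/tt=0.5/cmp=-/rc=1/93.184.208.105:80',
]

records = [r for r in (parse_line(line) for line in LINES) if r is not None]
mean_tt = mean(r["tt"] for r in records)
```

Keeping the timing in the access log means every request is a measurement, which is what makes the aggregate views on the following slides possible.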

Aggregate graphs

App metrics – “live grid”, w/ graphs

Cross-measurement correlation

focus on vital signs

Alarm dashboard

Simplify the process

be “lazy” (by programmer’s definition)

Code with the deployment in mind
•  spend extra time to make sure that new code and configurations are built in such a way that they can be deployed painlessly and without downtime, including if rollback is required
   –  feature flags
   –  strict feature/bug-compatibility between adjacent versions
   –  comprehensive testing & monitoring hooks

this is not the time to start your preflight checks
[Photo: By D464-Darren Hall (2 5 R in Frankfurt where all the magic happens) [CC-BY-SA-], via Wikimedia Commons]

fast (instant!) roll-back / roll-forward

The painfully slow rollback

    $ svn up -r5 /code/sailfish/
    # takes a while but no matter
    # now start new version
    $ sailfish.sh reload
    # something broken; revert
    $ svn up -r4 /code/sailfish/
    # internet is slow …
    # … fetching 100s MBs
    # … meanwhile things are broken
    # … finally, restart old version
    $ sailfish.sh reload

The painfully slow rollback
[Two columns: the slow in-place update, and next to it a per-revision checkout selected by a revision file]

    $ svn co -r5 /code/sailfish/5
    # takes a while but no matter
    $ echo 5 > /config/sailfish-rev
    # sailfish.sh reads sailfish-rev
    # start new version
    $ sailfish.sh reload
    # something broken; revert
    # /code/sailfish/4 still exists
    $ echo 4 > /config/sailfish-rev
    # sailfish.sh reads sailfish-rev
    # restart old version
    $ sailfish.sh reload
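The idea in the second column is that every revision lives in its own directory and a small pointer file selects the active one, so rolling back never waits on a download. The sketch below simulates that layout in a temporary directory; the function names are hypothetical, and the write-then-rename step for the pointer file is an addition that the slide does not show.

```python
import tempfile
from pathlib import Path


def install_revision(code_root, rev, payload):
    """Place a revision in its own directory (the slow step, done up front)."""
    rev_dir = code_root / str(rev)
    rev_dir.mkdir(parents=True, exist_ok=True)
    (rev_dir / "VERSION").write_text(payload)


def activate(pointer_file, code_root, rev):
    """Point the service at an already-installed revision."""
    if not (code_root / str(rev)).is_dir():
        raise FileNotFoundError(f"revision {rev} is not installed")
    tmp = pointer_file.with_suffix(".tmp")
    tmp.write_text(f"{rev}\n")
    tmp.replace(pointer_file)  # rename, so readers never see a partial file


def active_dir(pointer_file, code_root):
    """Directory a reload would run from, as the start script would read it."""
    return code_root / pointer_file.read_text().strip()


with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)
    code_root = root / "code" / "sailfish"
    pointer = root / "config" / "sailfish-rev"
    pointer.parent.mkdir(parents=True)

    install_revision(code_root, 4, "r4")
    install_revision(code_root, 5, "r5")

    activate(pointer, code_root, 5)   # roll forward
    after_deploy = active_dir(pointer, code_root).name

    activate(pointer, code_root, 4)   # roll back: revision 4 is still on disk
    after_rollback = active_dir(pointer, code_root).name
```

Roll-forward and roll-back become the same cheap operation; the cost is the disk space for keeping older revisions around.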

Phased rollout – simple?
[Graph: confidence (0 to 1) vs. time, with rollout phases marked: 1 server, 5% servers, 20% servers, 50%]

Make changes be cookie-cutter
A phased rollout plan:
•  deploy to 1 server
•  deploy to 5% of servers
•  deploy to 20% of servers
•  deploy to 50% of servers
•  deploy to all servers

•  are we using the same command for each step?
•  how much redundant information do we have to enter?
•  is each phased rollout done the same way?

Phased rollout – simple?
[Two graphs of confidence (0 to 1) vs. time, with rollout phases marked: 1 server, 5% servers, 20% servers, 50%]

each step of phased rollout must use the same action
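One way to make every phase use the same action is to have a single function that takes only a percentage, with the server selection and the deploy step identical each time. The sketch below illustrates that; the phase list follows the rollout plan in the talk, while `rollout_targets`, `run_rollout`, and the `deploy` and `healthy` callbacks are hypothetical stand-ins for real tooling.

```python
# Percent of the fleet covered at each phase; 0 means "exactly one server".
PHASES_PERCENT = [0, 5, 20, 50, 100]


def rollout_targets(servers, percent):
    """Servers covered at this phase: a stable prefix, at least one server."""
    ordered = sorted(servers)
    count = max(1, (len(ordered) * percent + 99) // 100)  # integer ceiling
    return ordered[:count]


def run_rollout(servers, deploy, healthy):
    """Run every phase through the same code path; stop if a check fails."""
    done = set()
    for percent in PHASES_PERCENT:
        for server in rollout_targets(servers, percent):
            if server not in done:
                deploy(server)
                done.add(server)
        if not healthy():
            return False, done
    return True, done


fleet = [f"edge{i:03d}" for i in range(100)]
deployed = []
ok, covered = run_rollout(fleet, deployed.append, lambda: True)
halted_ok, halted_covered = run_rollout(fleet, lambda s: None, lambda: False)
```

Because each phase is a prefix of the next, the only input that changes between steps is the percentage, which leaves little room for a typo to send step three somewhere step two never tested.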

Recognize non-cookie-cutter changes
•  Tools are great for reducing the deployment risk for standard changes
•  But there are always non-standard changes
•  Pro-tip: one type of high-risk non-standard change, ironically, is the deployment of new tools/infrastructure to be used for low-risk standard changes

standard change
[Photo: By D464-Darren Hall (2 5 R in Frankfurt where all the magic happens) [CC-BY-SA-], via Wikimedia Commons]

NOT a standard change!
[Photo: http://www.portseattle100.org/properties/runways]

using simple process == safe
simplifying the process == risky

don’t be portable/configurable (any more than necessary)

    Index: src/log.h.in
    =====================================================
    --- src/log.h.in (revision 6450)
    +++ src/log.h.in (revision 6464)
    @@ -28,6 +28,7 @@
     #include <vector>
     #include <cstdlib> // for abort()
     #include <malloc.h>
    +#include <fcntl.h> // defines O_ASYNC
     ///////////////////////////////////////

in our code:
    #if defined(HAVE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
    // read ahead
    const int ret = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    // ...
    #endif

in /usr/include/bits/fcntl.h:
    # define POSIX_FADV_WILLNEED 3 /* Will need these pages. */

The unexpected #ifdef
•  if you have portability flags, you should be testing them
•  if you don’t feel the need to test them (because your prod environments won’t ever need them) then you don’t need that “portability”
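The slides show the problem in C++ preprocessor terms. As a rough Python analogue (not the talk's code), the sketch below guards a read-ahead hint on whether `os.posix_fadvise` and `os.POSIX_FADV_WILLNEED` exist, and then tests both sides of that guard by injecting stand-ins. The function name `read_ahead` and the injection parameters are assumptions made for illustration.

```python
import os


def read_ahead(fd, offset, length,
               fadvise=getattr(os, "posix_fadvise", None),
               willneed=getattr(os, "POSIX_FADV_WILLNEED", None)):
    """Hint that a byte range will be needed soon; report whether we did.

    The guard plays the role of the #if in the slide: on platforms without
    the call, the function silently becomes a no-op.
    """
    if fadvise is None or willneed is None:
        return False
    fadvise(fd, offset, length, willneed)
    return True


# Exercise BOTH sides of the portability switch, whatever this host supports.
calls = []
enabled = read_ahead(3, 0, 4096,
                     fadvise=lambda *args: calls.append(args), willneed=3)
disabled = read_ahead(3, 0, 4096, fadvise=None)
```

If only one side of such a switch is ever exercised, an unrelated change such as a new include can flip it in production, which is the surprise the diff above illustrates.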

Thanks!
•  Go fast:
   –  use your tests for max benefit
   –  but you might have to go slow
   –  use feature flags for speed/safety
   –  parallelize experiments
•  Monitor everything:
   –  you need to visualize it
   –  mature your metrics
   –  focus on vital signs
•  Simplify the process:
   –  code with deployment in mind
   –  instant roll-back
   –  (non-)cookie-cutter changes
   –  watch out for new tools
   –  don’t be (too) portable

Find me online: Rob Peters @rjpcal
Find us online: http://verizondigitalmedia.com/ http://www.edgecast.com/ @edgecast
