ITT 2019 - Constance Caramanolis - High severity incident response leveraging Envoy

Incident management is inherently stressful, and it is made worse when diagnostics and observability data are lacking or heterogeneous. Lyft runs Envoy at every hop of the network, providing best-in-class observability across the entirety of Lyft’s network topology. Homogeneous data reduces the time it takes to identify production issues. This talk introduces Envoy, shows how Lyft configures it, and simulates a production incident at Lyft. Attendees are guided from the dreaded notification of a production issue to its resolution, seeing how engineers use Envoy’s extensive observability to identify and root-cause the incident and remedy the situation.


Istanbul Tech Talks

April 02, 2019

Transcript

  1. LEVERAGING ENVOY WHEN RESPONDING TO HIGH-SEVERITY INCIDENTS CONSTANCE CARAMANOLIS @

    LYFT TWITTER: @CCARAMANOLIS / CCARAMANOLIS@LYFT.COM
  2. None
  3. PANIC AND CHAOS

  4. None
  5. WHY BUILD ENVOY? Service Oriented Architecture gets complicated quickly. •

    Languages and frameworks • Protocols • Distributed Systems best practices • Libraries for service calls • Observability outputs • Load Balancers
  6. WHAT IS ENVOY? The network should be transparent to applications. When network and application problems do occur, it should be easy to determine the source of the problem.
  7. ENVOY IS PRETTY COOL… • Performance • Reliability • Modern

    Codebase • Configuration API • Observability • Community
  8. LYFT ARCHITECTURE*

  9. CONFIGURING EDGE ENVOY • Ordered list of domains and routes. • Each route can associate a cluster for proxying the request to. • Request is matched to the first route that satisfies the constraints.

     virtual_hosts:
       - name: www
         domains:
           - www.yourcompany.com
         routes:
           - match:
               prefix: "/foo/bar"
             route:
               cluster: "service2"
           - match:
               prefix: "/"
             route:
               cluster: "service1"
       - name: api
         domains:
           - api.yourcompany.com
         routes:
           - match:
               prefix: "/"
             route:
               cluster: "service3"
  10. CONFIGURING INTERNAL SERVICES • Requests are made to Envoy through localhost:<port number> ◦ Application no longer needs to maintain connections, handle errors, or emit metrics for requests! ◦ Envoy will handle service discovery for you! • General rule of thumb ◦ One port for ingress traffic to a service ◦ One port for egress traffic to other internal services ◦ One port per external dependency (database, 3rd-party API, etc.)

     port: 9001
     virtual_hosts:
       - name: driver
         domains:
           - driver
         routes:
           - match:
               prefix: "/"
             route:
               cluster: "driver"
       - name: locations
         domains:
           - locations
         routes:
           - match:
               prefix: "/"
             route:
               cluster: "locations"
  11. UPSTREAM AND DOWNSTREAM Downstream: the direction the water flows. Upstream: the direction against the flow of water. Response ~ water: responses flow downstream. In Envoy terms, Service B is upstream of Service A: A’s request flows upstream, and B’s response flows back downstream. (Diagram: Service A sends a request to Service B; the response flows back from Service B to Service A.)
  12. CRASH COURSE: ENVOY METRICS • HTTP status code metrics ◦ upstream_rq_200, upstream_rq_404, upstream_rq_503 per upstream cluster • Request and connection errors ◦ upstream_rq_retries, upstream_rq_maintenance_mode, upstream_cx_connect_fail, … • HTTP response classes per listener ◦ http.listener.downstream_rq_2xx, http.listener.downstream_rq_4xx, … Success rate is the ratio of successful requests (2xx) over total requests sent. Remember: Envoy is used at every hop!
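     As a quick worked example with illustrative numbers (not from the talk): if a cluster served 9,500 2xx responses out of 10,000 total requests in a window, its success rate is 9,500 / 10,000 = 95%, well below a paging threshold of, say, 99%.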
  13. P0 EDGE ENVOY DEGRADATION Incident Report Email Time: 3:44 am

    Edge Envoy is paging due to success rate dip. Investigation ongoing in #operations. Limited understanding of impact. Update: When more is known.
  14. EDGE ENVOY SUCCESS RATE DIP • P0 page at Lyft. • Requests coming into Lyft are experiencing a degraded state. • Usually indicates customer impact.
  15. EDGE ENVOY UPSTREAM ERRORS All 5xx errors seen in Edge

    Envoy grouped by cluster.
  16. • Ordered list of domains and routes. • Identifying the route can be done in a few ways: ◦ Visual inspection of routes ◦ Virtual cluster metrics ◦ Access logs [2019-03-26] "GET /show_maps HTTP/1.1" 503 UH "1-2-3-4" "api.lyft.net" "maprender"
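     Reading the sample access log above (the slide abbreviates the log format, so this field mapping is a best guess): 503 is the HTTP response code; UH is Envoy’s response flag for "no healthy upstream hosts in the upstream cluster"; "1-2-3-4" is a request id; "api.lyft.net" is the authority; and "maprender" is the upstream cluster the route selected, which is exactly the hint needed to pin down the failing route.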
  17. WHAT WE KNOW SO FAR

  18. AUTOGENERATED PANEL USING ENVOY METRICS

  19. ERRORS TO UPSTREAM SERVICES FROM MAPRENDER

  20. MAKING MORE SENSE...

  21. DRIVER SERVICE HEALTH Reapply the technique used to investigate MapRender

    errors.
  22. ROOT CAUSE

  23. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time:

    3:55 am Root cause has been identified. Photo Filters API is returning errors. Remediation is being discussed in #operations. Impact: Maps aren’t rendering in clients.
  24. CONTACT SUPPORT AWARE OF ISSUE NO ETA FOR FIX

  25. REMEDIATION REDUCING THE IMPACT TO OUR CUSTOMERS

  26. MANAGING PHOTO FILTER ERRORS 1. Make changes to the Driver code at 4 am. 2. Reduce the load on the Photo Filters API. Maintenance mode is a runtime key that sheds a percentage of traffic to an upstream cluster. (Late-night coding photo: https://www.flickr.com/photos/jjackowski/15659707052, used in its original format under Creative Commons.)
  27. APPLYING MAINTENANCE MODE
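     A minimal sketch of what applying maintenance mode could look like as a static runtime layer (the layer name incident_overrides, the cluster name photo_filters, and the 50% value are illustrative assumptions, not Lyft’s actual settings; the router’s runtime key is upstream.maintenance_mode.<cluster name>):

     layered_runtime:
       layers:
         - name: incident_overrides    # assumed layer name, for illustration only
           static_layer:
             upstream:
               maintenance_mode:
                 photo_filters: 50     # fail 50% of requests to this cluster fast with a 503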

  28. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time:

    4:00 am Maintenance mode applied for Photo Filters API. Waiting to see results. Update in 15 minutes or with new information.
  29. IMPACT TO DRIVER SERVICE • Increase the number of Driver instances. ◦ Scaling can be SLOW. • Reduce the number of requests that can be made to the Driver service by configuring circuit breakers.
  30. CIRCUIT BREAKERS • Quickly fails and allows for backpressure to be applied throughout the system. • Configure maximum number of connections, pending requests, requests, active retries, and concurrent connection pools. • Different levers for HTTP/1 and HTTP/2! • Different levers for request priorities! • Metrics emitted! ◦ upstream_rq_pending_overflow, upstream_cx_overflow, … • Runtime configurable
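     A minimal sketch of circuit breaker thresholds on a cluster (the cluster name driver and every number here are illustrative assumptions, not Lyft’s production values):

     clusters:
       - name: driver
         circuit_breakers:
           thresholds:
             - priority: DEFAULT          # separate thresholds can be set per request priority
               max_connections: 1024      # cap on upstream connections
               max_pending_requests: 256  # cap on requests queued for a connection
               max_requests: 1024         # cap on parallel requests (HTTP/2)
               max_retries: 3             # cap on concurrently active retries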
  31. Driver service is steady now...

  32. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time: 4:10 am Maintenance mode for the Photo Filters API was insufficient. In addition, Driver service CPU was running hot due to misconfigured circuit breaker settings. Reducing the number of requests per host and scaling up. Update in 10 minutes.
  33. EDGE TRAFFIC TO MAPRENDER • Routes are matched in the order specified in the configuration. • All request paths prefixed with ‘/show_map’ are proxied to MapRender. • All other requests are proxied to Render.
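     A sketch of what that ordered route list could look like (the domain and exact syntax are assumptions for illustration; the cluster names follow the talk):

     virtual_hosts:
       - name: api
         domains:
           - api.lyft.net
         routes:
           - match:
               prefix: "/show_map"    # matched first: map traffic goes to MapRender
             route:
               cluster: "maprender"
           - match:
               prefix: "/"            # catch-all: everything else goes to Render
             route:
               cluster: "render"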
  34. SHIFTING TRAFFIC AWAY FROM MAP RENDER
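     One way to express that shift is with weighted clusters, sketched below (Envoy also supports simply repointing the route at the render cluster; the weights are illustrative):

     routes:
       - match:
           prefix: "/show_map"
         route:
           weighted_clusters:
             clusters:
               - name: "maprender"
                 weight: 0       # drain the failing MapRender path
               - name: "render"
                 weight: 100     # send all /show_map traffic to the legacy Render service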

  35. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time: 4:20 am The decision was made to disable calls to MapRender for the /show_map API (which invoked the Driver service and Photo Filters API), defaulting to the legacy Render service. This is expected to return Lyft to a normal state. Update: 10 mins.
  36. Leveraging Envoy remediation options

  37. RESOLVED: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time:

    4:25 am Confirmed map is correctly rendering in all clients. Application success rate is back at 100%. Post mortem is scheduled for Wednesday at 1pm. No further updates. Good night!
  38. HOW HAS THIS HELPED LYFT?

  39. ONCE LEARNED, ANYONE CAN HELP DURING AN INCIDENT.

  40. FASTER ROOT CAUSING Instead of following every hop to the failing service (• Edge to MapRender • MapRender to Driver • Driver to Photo Filters API), look at all upstream failures at once.
  41. JUST THE BEGINNING! • Circuit breaker settings • Outlier detection • Access logging • Tracing • Request mirroring • Dynamic Envoy configurations • HTTP header options • Traffic shifting • Maintenance mode https://www.envoyproxy.io/
  42. Recap

  43. THANK YOU! • Twitter: @ccaramanolis • Email: ccaramanolis@lyft.com