ITT 2019 - Constance Caramanolis - High severity incident response leveraging Envoy

Incident management is inherently stressful, and it is made worse when diagnostics and observability data are lacking or heterogeneous. Lyft runs Envoy at every hop of the network, providing best-in-class observability across the entirety of Lyft’s network topology. Homogeneous data reduces the time it takes to identify production issues. This talk introduces Envoy, shows how Lyft configures it, and simulates a production incident at Lyft. Attendees are guided from the dreaded notification of a production issue to its resolution, seeing how engineers use Envoy’s extensive observability to identify and root-cause the incident and remedy the situation.


Istanbul Tech Talks

April 02, 2019

Transcript

  1. LEVERAGING ENVOY WHEN RESPONDING TO HIGH-SEVERITY INCIDENTS CONSTANCE CARAMANOLIS @

    LYFT TWITTER: @CCARAMANOLIS / CCARAMANOLIS@LYFT.COM
  2. None
  3. PANIC AND CHAOS

  4. None
  5. WHY BUILD ENVOY? Service Oriented Architecture gets complicated quickly. •

    Languages and frameworks • Protocols • Distributed Systems best practices • Libraries for service calls • Observability outputs • Load Balancers
  6. WHAT IS ENVOY? The network should be transparent to applications. When network and application problems do occur, it should be easy to determine the source of the problem.
  7. ENVOY IS PRETTY COOL… • Performance • Reliability • Modern

    Codebase • Configuration API • Observability • Community
  8. LYFT ARCHITECTURE*

  9. CONFIGURING EDGE ENVOY • Ordered list of domains and routes. • Each route can associate a cluster for proxying the request to. • Request is matched to the first route that satisfies the constraints.

     virtual_hosts:
       - name: www
         domains:
           - www.yourcompany.com
         routes:
           - match:
               prefix: "/foo/bar"
             route:
               cluster: "service2"
           - match:
               prefix: "/"
             route:
               cluster: "service1"
       - name: api
         domains:
           - api.yourcompany.com
         routes:
           - match:
               prefix: "/"
             route:
               cluster: "service3"
  10. CONFIGURING INTERNAL SERVICES • Requests are made to Envoy through localhost:<port number> ◦ Application no longer needs to maintain connections, handle errors, or emit metrics for requests! ◦ Envoy will handle service discovery for you! • General rule of thumb ◦ One port for ingress traffic to a service ◦ One port for egress traffic to other internal services ◦ One port per external dependency (database, 3rd-party API, etc.)

     port: 9001
     virtual_hosts:
       - name: driver
         domains:
           - driver
         routes:
           - match:
               prefix: "/"
             route:
               cluster: "driver"
       - name: locations
         domains:
           - locations
         routes:
           - match:
               prefix: "/"
             route:
               cluster: "locations"
  11. UPSTREAM AND DOWNSTREAM Downstream: the direction the water flows. Upstream: the direction against the flow of water. Response ~ water: responses flow downstream. In Envoy terms, Service B is upstream of Service A: A’s request flows upstream, and B’s response flows back downstream. (Diagram: Service A sends a request to Service B; the response flows back from Service B to Service A.)
  12. CRASH COURSE: ENVOY METRICS • HTTP status code metrics ◦ upstream_rq_200, upstream_rq_404, upstream_rq_503 per upstream cluster • Request and connection errors ◦ upstream_rq_retries, upstream_rq_maintenance_mode, upstream_cx_connect_fail, … • HTTP response classes per listener ◦ http.listener.downstream_rq_2xx, http.listener.downstream_rq_4xx, … Success rate is the ratio of successful requests (2xx) over total requests sent. Remember: Envoy is used at every hop!
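     As a quick worked example with illustrative numbers (not from the talk): if a cluster served 9,500 2xx responses out of 10,000 total requests in a window, its success rate is 9,500 / 10,000 = 95%, well below a paging threshold of, say, 99%.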
  13. P0 EDGE ENVOY DEGRADATION Incident Report Email Time: 3:44 am

    Edge Envoy is paging due to success rate dip. Investigation ongoing in #operations. Limited understanding of impact. Update: When more is known.
  14. EDGE ENVOY SUCCESS RATE DIP • P0 page at Lyft. • Requests coming into Lyft are experiencing a degraded state. • Usually indicates customer impact.
  15. EDGE ENVOY UPSTREAM ERRORS All 5xx errors seen in Edge

    Envoy grouped by cluster.
  16. • Ordered list of domains and routes. • Identifying the route can be done in a few ways: ◦ Visual inspection of routes ◦ Virtual cluster metrics ◦ Access logs [2019-03-26] "GET /show_maps HTTP/1.1" 503 UH "1-2-3-4" "api.lyft.net" "maprender"
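     Reading the sample access log above (the slide abbreviates the log format, so this field mapping is a best guess): 503 is the HTTP response code; UH is Envoy’s response flag for "no healthy upstream hosts in the upstream cluster"; "1-2-3-4" is a request id; "api.lyft.net" is the authority; and "maprender" is the upstream cluster the route selected, which is exactly the hint needed to pin down the failing route.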
  17. WHAT WE KNOW SO FAR

  18. AUTOGENERATED PANEL USING ENVOY METRICS

  19. ERRORS TO UPSTREAM SERVICES FROM MAPRENDER

  20. MAKING MORE SENSE...

  21. DRIVER SERVICE HEALTH Reapply the technique used to investigate MapRender

    errors.
  22. ROOT CAUSE

  23. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time:

    3:55 am Root cause has been identified. Photo Filters API is returning errors. Remediation is being discussed in #operations. Impact: Maps aren’t rendering in clients.
  24. CONTACT SUPPORT AWARE OF ISSUE NO ETA FOR FIX

  25. REMEDIATION REDUCING THE IMPACT TO OUR CUSTOMERS

  26. MANAGING PHOTO FILTER ERRORS 1. Make changes to the Driver code at 4 am. 2. Reduce the load on the Photo Filters API. Maintenance mode is a runtime key that sheds a percentage of traffic to an upstream cluster. (Late-night coding photo: https://www.flickr.com/photos/jjackowski/15659707052, used in its original format under Creative Commons.)
  27. APPLYING MAINTENANCE MODE
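     A minimal sketch of what applying maintenance mode could look like as a static runtime layer (the layer name incident_overrides, the cluster name photo_filters, and the 50% value are illustrative assumptions, not Lyft’s actual settings; the router’s runtime key is upstream.maintenance_mode.<cluster name>):

     layered_runtime:
       layers:
         - name: incident_overrides    # assumed layer name, for illustration only
           static_layer:
             upstream:
               maintenance_mode:
                 photo_filters: 50     # fail 50% of requests to this cluster fast with a 503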

  28. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time:

    4:00 am Maintenance mode applied for Photo Filters API. Waiting to see results. Update in 15 minutes or with new information.
  29. IMPACT TO DRIVER SERVICE • Increase the number of Driver instances. ◦ Scaling can be SLOW. • Reduce the number of requests that can be made to the Driver service by configuring circuit breakers.
  30. CIRCUIT BREAKERS • Quickly fails and allows for backpressure to be applied throughout the system. • Configure maximum number of connections, pending requests, requests, active retries, and concurrent connection pools. • Different levers for HTTP/1 and HTTP/2! • Different levers for request priorities! • Metrics emitted! ◦ upstream_rq_pending_overflow, upstream_cx_overflow, … • Runtime configurable
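     A minimal sketch of circuit breaker thresholds on a cluster (the cluster name driver and every number here are illustrative assumptions, not Lyft’s production values):

     clusters:
       - name: driver
         circuit_breakers:
           thresholds:
             - priority: DEFAULT          # separate thresholds can be set per request priority
               max_connections: 1024      # cap on upstream connections
               max_pending_requests: 256  # cap on requests queued for a connection
               max_requests: 1024         # cap on parallel requests (HTTP/2)
               max_retries: 3             # cap on concurrently active retries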
  31. Driver service is steady now...

  32. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time: 4:10 am Maintenance mode for the Photo Filters API was insufficient. In addition, Driver service CPU was running hot due to misconfigured circuit breaker settings. Reducing the number of requests per host and scaling up. Update in 10 minutes.
  33. EDGE TRAFFIC TO MAPRENDER • Routes are matched in the order specified in the configuration. • All request paths prefixed with ‘/show_map’ are proxied to MapRender. • All other requests are proxied to Render.
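     A sketch of what that ordered route list could look like (the domain and exact syntax are assumptions for illustration; the cluster names follow the talk):

     virtual_hosts:
       - name: api
         domains:
           - api.lyft.net
         routes:
           - match:
               prefix: "/show_map"    # matched first: map traffic goes to MapRender
             route:
               cluster: "maprender"
           - match:
               prefix: "/"            # catch-all: everything else goes to Render
             route:
               cluster: "render"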
  34. SHIFTING TRAFFIC AWAY FROM MAP RENDER
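     One way to express that shift is with weighted clusters, sketched below (Envoy also supports simply repointing the route at the render cluster; the weights are illustrative):

     routes:
       - match:
           prefix: "/show_map"
         route:
           weighted_clusters:
             clusters:
               - name: "maprender"
                 weight: 0       # drain the failing MapRender path
               - name: "render"
                 weight: 100     # send all /show_map traffic to the legacy Render service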

  35. RE: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time: 4:20 am The decision was made to disable calls to MapRender for the /show_map API (which invoked the Driver service and Photo Filters API), defaulting to the legacy Render service. This is expected to return Lyft to a normal state. Update: 10 mins.
  36. Leveraging Envoy remediation options

  37. RESOLVED: P0 EDGE ENVOY DEGRADATION Incident Report Email Update Time:

    4:25 am Confirmed map is correctly rendering in all clients. Application success rate is back at 100%. Post mortem is scheduled for Wednesday at 1pm. No further updates. Good night!
  38. HOW HAS THIS HELPED LYFT?

  39. ONCE LEARNED, ANYONE CAN HELP DURING AN INCIDENT.

  40. FASTER ROOT CAUSING Instead of following every hop to the failing service (• Edge to MapRender • MapRender to Driver • Driver to Photo Filters API), look at all upstream failures at once.
  41. JUST THE BEGINNING! • Circuit breaker settings • Outlier detection • Access logging • Tracing • Request mirroring • Dynamic Envoy configurations • HTTP header options • Traffic shifting • Maintenance mode https://www.envoyproxy.io/
  42. Recap

  43. THANK YOU! • Twitter: @ccaramanolis • Email: ccaramanolis@lyft.com