
ITT 2019 - Constance Caramanolis - High severity incident response leveraging Envoy


Incident management is inherently stressful, and it is made worse when diagnostics and observability data are lacking and heterogeneous. Lyft runs Envoy at every hop of the network, providing best-in-class observability across the entirety of Lyft’s network topology. Homogeneous data reduces the time it takes to identify production issues. This talk introduces Envoy, shows how Lyft configures it, and simulates a production incident at Lyft. Attendees are guided from the dreaded notification of an issue in production through to resolution, seeing how engineers use Envoy’s extensive observability to identify and root-cause the incident and remedy the situation.

Istanbul Tech Talks

April 02, 2019



Transcript

  1. LEVERAGING ENVOY WHEN RESPONDING TO
    HIGH-SEVERITY INCIDENTS
    CONSTANCE CARAMANOLIS @ LYFT
    TWITTER: @CCARAMANOLIS / [email protected]


  2. View Slide

  3. PANIC AND CHAOS


  4. View Slide

  5. WHY BUILD ENVOY?
    Service Oriented Architecture gets complicated quickly.
    • Languages and frameworks
    • Protocols
    • Distributed Systems best practices
    • Libraries for service calls
    • Observability outputs
    • Load Balancers


  6. WHAT IS ENVOY?
    The network should be
    transparent to applications.
    When network and
    application problems do occur
    it should be easy to determine
    the source of the problem.


  7. ENVOY IS PRETTY COOL…
    • Performance
    • Reliability
    • Modern Codebase
    • Configuration API
    • Observability
    • Community


  8. LYFT ARCHITECTURE*


  9. CONFIGURING EDGE ENVOY
    ● Ordered list of domains and routes.
    ● Each route names the cluster that the request is proxied to.
    ● A request is matched to the first route that satisfies the constraints.
    virtual_hosts:
    - name: www
      domains:
      - www.yourcompany.com
      routes:
      - match:
          prefix: "/foo/bar"
        route:
          cluster: "service2"
      - match:
          prefix: "/"
        route:
          cluster: "service1"
    - name: api
      domains:
      - api.yourcompany.com
      routes:
      - match:
          prefix: "/"
        route:
          cluster: "service3"
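
The first-match rule on this slide can be illustrated with a small Python sketch. It reuses the domains, prefixes and cluster names from the slide's example config; the `pick_cluster` function and the tuple layout are hypothetical simplifications for illustration, not how Envoy represents or evaluates routes internally.

```python
from typing import Optional

# Mirrors the virtual_hosts example on the slide. The (prefix, cluster)
# tuples are an illustrative simplification, not Envoy's data model.
VIRTUAL_HOSTS = [
    {"name": "www", "domains": ["www.yourcompany.com"],
     "routes": [("/foo/bar", "service2"), ("/", "service1")]},
    {"name": "api", "domains": ["api.yourcompany.com"],
     "routes": [("/", "service3")]},
]


def pick_cluster(host: str, path: str) -> Optional[str]:
    """Return the cluster of the first route whose prefix matches, else None."""
    for vhost in VIRTUAL_HOSTS:
        if host not in vhost["domains"]:
            continue
        for prefix, cluster in vhost["routes"]:
            if path.startswith(prefix):
                return cluster  # first match wins; later routes are ignored
        return None  # the domain matched but no route did
    return None  # no virtual host serves this domain
```

Because "/" matches every path, it has to come after "/foo/bar" in the list; swapping the two routes would send all www traffic to service1.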


  10. CONFIGURING INTERNAL SERVICES
    ● Requests are made to Envoy through localhost:
    ○ Application no longer needs to maintain connections, handle errors or
    emit metrics for requests!
    ○ Envoy will handle service discovery for you!
    ● General rule of thumb
    ○ One port for ingress traffic to a service
    ○ One port for egress traffic to other internal services
    ○ One port per external dependency (database, 3rd party API, etc)
    port: 9001
    virtual_hosts:
    - name: driver
      domains: driver
      routes:
      - match:
          prefix: "/"
        route:
          cluster: "driver"
    - name: locations
      domains: locations
      routes:
      - match:
          prefix: "/"
        route:
          cluster: "locations"
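
From the application's side, calling another service through the local Envoy might look like the Python sketch below. It only builds the request and does not send it. The port 9001 and the service name `locations` come from the slide's example config; the `build_request` helper and the `/v1/ping` path are hypothetical, and the sketch assumes the virtual host is selected by the Host header.

```python
from urllib.request import Request

EGRESS_PORT = 9001  # egress port from the slide's example config


def build_request(service: str, path: str) -> Request:
    """Build (but do not send) a request addressed to the local Envoy."""
    url = f"http://localhost:{EGRESS_PORT}{path}"
    # Assumption: the Host header is what the "domains" entries match on.
    return Request(url, headers={"Host": service})


req = build_request("locations", "/v1/ping")  # hypothetical path
```

The application never resolves where `locations` actually runs; that is left to Envoy's service discovery.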


  11. UPSTREAM AND DOWNSTREAM
    Downstream: the direction in which the water flows.
    Upstream: the direction against the flow of water.
    Response ~ Water
    [Diagram: Request and Response arrows between Service A and Service B]


  12. CRASH COURSE ENVOY METRICS
    ● HTTP Status Codes metrics
    ○ upstream_rq_200, upstream_rq_404, upstream_rq_503 per upstream cluster
    ● Request and Connection errors
    ○ upstream_rq_retries, upstream_rq_maintenance_mode, upstream_cx_connect_fail …
    ● HTTP errors per listener
    ○ http.listener.downstream_rq_2xx, http.listener.downstream_rq_4xx, …
    Success Rate is the ratio of successful requests (2xx) to total requests sent.
    Remember - Envoy is used at every hop!
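
The success-rate definition above can be written out as a short Python sketch. The counter names follow the slide's `downstream_rq_2xx` style; the sample values and the choice to treat "no traffic" as 100% are assumptions made for illustration.

```python
def success_rate(counters: dict) -> float:
    """Percentage of 2xx responses among all counted responses."""
    total = sum(counters.values())
    if total == 0:
        return 100.0  # assumption: no traffic is treated as healthy
    return 100.0 * counters.get("downstream_rq_2xx", 0) / total


# Made-up sample values for one listener.
sample = {
    "downstream_rq_2xx": 970,
    "downstream_rq_4xx": 10,
    "downstream_rq_5xx": 20,
}
rate = success_rate(sample)
```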


  13. P0
    EDGE ENVOY
    DEGRADATION
    Incident Report Email
    Time: 3:44 am
    Edge Envoy is paging due to success rate dip.
    Investigation ongoing in #operations. Limited
    understanding of impact.
    Update: When more is known.


  14. EDGE ENVOY SUCCESS RATE DIP
    ● P0 page at Lyft.
    ● Requests coming into Lyft are experiencing a degraded state.
    ● Usually indicates customer impact.


  15. EDGE ENVOY UPSTREAM ERRORS
    All 5xx errors seen
    in Edge Envoy
    grouped by cluster.
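
Grouping 5xx errors by cluster amounts to a simple aggregation, sketched below in Python. The sketch assumes stat names shaped like `cluster.<name>.upstream_rq_<code>`; the `errors_by_cluster` function and the sample values are hypothetical and are not taken from Lyft's dashboards.

```python
from collections import Counter


def errors_by_cluster(stats: dict) -> list:
    """Sum per-status 5xx counters for each cluster, worst cluster first."""
    totals = Counter()
    for name, value in stats.items():
        parts = name.split(".", 2)
        if len(parts) != 3 or parts[0] != "cluster":
            continue  # not a per-cluster stat in the assumed shape
        _, cluster, metric = parts
        code = metric.rsplit("_", 1)[-1]
        if metric.startswith("upstream_rq_") and code.isdigit() and code.startswith("5"):
            totals[cluster] += value
    return totals.most_common()


# Made-up sample values.
sample = {
    "cluster.maprender.upstream_rq_503": 120,
    "cluster.maprender.upstream_rq_200": 30,
    "cluster.render.upstream_rq_500": 2,
    "cluster.render.upstream_rq_200": 900,
}
ranked = errors_by_cluster(sample)
```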


  16. ● Ordered list of domains and routes.
    ● Identifying the route can be done a few ways:
    ○ Visual inspection of routes
    ○ Virtual Cluster metrics
    ○ Access Logs
    [2019-03-26] "GET /show_maps HTTP/1.1" 503 UH "1-2-3-4" "api.lyft.net" "maprender"
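
An access log line like the one on this slide can be split into fields with a regular expression, as in the Python sketch below. The field names `request_id`, `authority` and `cluster` are guesses at what the quoted values represent, since the slide does not show the log format. `UH` is Envoy's response flag for "no healthy upstream hosts".

```python
import re

LINE = ('[2019-03-26] "GET /show_maps HTTP/1.1" 503 UH '
        '"1-2-3-4" "api.lyft.net" "maprender"')

# Field names for the three quoted values are assumptions for illustration.
PATTERN = re.compile(
    r'\[(?P<date>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<flags>\S+) "(?P<request_id>[^"]*)" '
    r'"(?P<authority>[^"]*)" "(?P<cluster>[^"]*)"'
)


def parse(line: str) -> dict:
    """Return the named fields of one access log line."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized access log line")
    return match.groupdict()


fields = parse(LINE)
```

Here the path and the last field together point at the route and upstream cluster involved in the 503.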


  17. WHAT WE KNOW SO FAR


  18. AUTOGENERATED PANEL USING ENVOY METRICS


  19. ERRORS TO UPSTREAM SERVICES FROM MAPRENDER


  20. MAKING MORE SENSE...


  21. DRIVER SERVICE HEALTH
    Reapply the technique used to investigate MapRender errors.


  22. ROOT CAUSE


  23. RE: P0
    EDGE ENVOY
    DEGRADATION
    Incident Report Email Update
    Time: 3:55 am
    Root cause has been identified. Photo Filters API is
    returning errors. Remediation is being discussed in
    #operations.
    Impact: Maps aren’t rendering in clients.


  24. CONTACT SUPPORT
    AWARE OF ISSUE
    NO ETA FOR FIX


  25. REMEDIATION
    REDUCING THE IMPACT TO OUR CUSTOMERS


  26. MANAGING PHOTO FILTER ERRORS
    1. Make changes to the Driver code at 4 am.
    2. Reduce the load on Photo Filter API.
    * LATE NIGHT CODING PHOTO PROVIDED BY HTTPS://WWW.FLICKR.COM/PHOTOS/JJACKOWSKI/15659707052 IN ITS ORIGINAL FORMAT. PLEASE REFER TO CREATIVE COMMONS FOR MORE INFO.
    Maintenance mode is a runtime key that
    sheds a percentage of traffic to an upstream
    host.
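
The effect of shedding a percentage of traffic can be simulated with the Python sketch below. It is a toy model, not Envoy's implementation: the `shed` and `simulate` functions are hypothetical, and the seeded random generator is only there to make runs repeatable. Envoy's documentation lists the corresponding runtime key as `upstream.maintenance_mode.<cluster name>`, with shed requests receiving an immediate 503.

```python
import random


def shed(percent: float, rng: random.Random) -> bool:
    """Return True when this request should be rejected immediately."""
    return rng.random() * 100 < percent


def simulate(total: int, percent: float, seed: int = 0) -> int:
    """Count how many of `total` requests still reach the upstream."""
    rng = random.Random(seed)
    return sum(1 for _ in range(total) if not shed(percent, rng))


forwarded = simulate(1000, 50)  # roughly half the requests get through
```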


  27. APPLYING MAINTENANCE MODE


  28. RE: P0
    EDGE ENVOY
    DEGRADATION
    Incident Report Email Update
    Time: 4:00 am
    Maintenance mode applied for Photo Filters API. Waiting
    to see results.
    Update in 15 minutes or with new information.


  29. IMPACT TO DRIVER SERVICE
    ● Increase the number of Driver instances.
    ○ Scaling can be SLOW.
    ● Reduce the number of requests that can be made to
    Driver services by configuring Circuit Breakers.


  30. CIRCUIT BREAKERS
    ● Fail quickly and allow backpressure to be applied throughout the system.
    ● Configure the maximum number of connections, pending requests, requests, active retries and concurrent connection pools.
    ● Different levers for HTTP 1 and HTTP 2!
    ● Different levers for request priorities!
    ● Metrics emitted!
    ○ upstream_rq_pending_overflow, upstream_cx_overflow, …
    ● Runtime configurable
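
The fail-fast idea behind a request-limit circuit breaker can be sketched in Python as below. This is a single-threaded toy, not Envoy's implementation; the `CircuitBreaker` class and its `overflow` counter are hypothetical names, loosely modeled on the `*_overflow` metrics listed on the slide.

```python
class CircuitBreaker:
    """Caps concurrent requests and counts the ones rejected over the limit."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.active = 0
        self.overflow = 0  # loosely analogous to an *_overflow counter

    def try_acquire(self) -> bool:
        """Admit the request if under the limit, otherwise reject at once."""
        if self.active >= self.max_requests:
            self.overflow += 1
            return False  # fail fast instead of queueing
        self.active += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.active > 0:
            self.active -= 1
```

Rejecting immediately is what creates the backpressure: the caller learns at once that the upstream is saturated instead of piling more work onto it.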


  31. Driver service is steady now...


  32. RE: P0
    EDGE ENVOY
    DEGRADATION
    Incident Report Email Update
    Time: 4:10 am
    Maintenance mode to Photo Filter API was insufficient. In
    addition, Driver service CPU was running hot due to
    misconfigured circuit breaker settings. Reducing the
    number of requests per host and scaling up.
    Update in 10 mins


  33. EDGE TRAFFIC TO MAPRENDER
    ● Routes are matched in the order
    specified in the configuration.
    ● All request paths prefixed with
    ‘/show_map’ are proxied to
    MapRender.
    ● All other requests are proxied to
    Render.


  34. SHIFTING TRAFFIC AWAY FROM MAP RENDER


  35. RE: P0
    EDGE ENVOY
    DEGRADATION
    Incident Report Email Update
    Time: 4:20 am
    Decision was made to disable calls to MapRender for
    API /show_map (which invoked the Driver service and
    Photo Filters API). Defaulting to legacy Render service.
    This is expected to return Lyft to normal state.
    Update: 10 mins.


  36. Leveraging Envoy remediation options


  37. RESOLVED: P0
    EDGE ENVOY
    DEGRADATION
    Incident Report Email Update
    Time: 4:25 am
    Confirmed map is correctly rendering in all clients.
    Application success rate is back at 100%.
    Post mortem is scheduled for Wednesday at 1pm.
    No further updates.
    Good night!


  38. HOW HAS THIS HELPED LYFT?


  39. ONCE LEARNED,
    ANYONE CAN
    HELP DURING AN
    INCIDENT.


  40. FASTER ROOT CAUSING
    Instead of following every hop to the
    failing service
    ● Edge to Map Render
    ● Map Render to Driver
    ● Driver to Photo Filter API
    Look at all upstream failures at once


  41. JUST THE BEGINNING!
    ● Circuit breaker settings
    ● Outlier detection
    ● Access logging
    ● Tracing
    ● Request Mirroring
    ● Dynamic Envoy configurations
    ● HTTP header options
    ● Traffic shifting
    ● Maintenance mode
    https://www.envoyproxy.io/


  42. Recap


  43. THANK YOU!
    ● Twitter: @ccaramanolis
    ● Email: [email protected]
