Building and operating service mesh at mid-size company

taiki45
September 07, 2018

Transcript

  1. Building and operating
    service mesh
    at mid-size company
    Taiki Ono, Cookpad Inc.

  2. Agenda
    • Background
    • Problems
    • Introduction and operations
    • Key results
    • Next challenges

  3. Background

  4. Cookpad
    • "Make everyday cooking fun!"
    • Originally started in Japan in 1997
    • Operating in over 23 languages and 68 countries

  5. Scale
    • 200+ product developers
    • 100+ production services
    • 90M monthly average users

  6. Organization structure

  7. Technology stack
    • Ruby on Rails for both web frontend and backend apps
    • Python for ML apps
    • Go for backend apps
    • Rust, Swift, Java, etc. for internal apps

  9. Problems

  10. Operational problems
    • Decrease in system reliability
    • Hard to troubleshoot and debug
    ‣ Increase in time to detect root causes of incidents
    ‣ Capacity planning

  11. Solutions
    • Expeditor
    ‣ Ruby library inspired by Netflix's
    Hystrix
    • aws-xray
    ‣ Ruby library for distributed tracing
    using AWS's X-Ray service
    https://github.com/cookpad
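
To make the library approach concrete, here is a minimal sketch of the circuit-breaker pattern that Hystrix popularized. It is written in Python for illustration only and is not Expeditor's API: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are assumptions made for this sketch.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical, minimal circuit breaker (not Expeditor's API)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: allow one trial call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Every application has to embed and configure logic like this in its own language, which is the limitation the following slides discuss.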

  12. http://techlife.cookpad.com/entry/2017/09/06/115710

  14. Go, Python, Rust, Java, Swift apps?
    • Limitations of the library approach
    ‣ More for product development
    ‣ Controlling library versions is hard
    • Planned to develop our own proxy, combined with consul-template

  15. Service mesh to the rescue

  16. At SREcon17 Americas:
    "Lyft's Envoy: Experiences
    Operating a Large Service Mesh"

  17. Replacing libraries with a proxy

  18. control-plane

  19. Introducing and
    operating service mesh

  20. Timeline
    • Early 2017: making the plan
    • Late 2017: building an MVP
    • Early 2018: generally available

  21. Envoy
    • Publicly released in mid-2016
    • Lightweight
    • Graceful reloading
    • gRPC support
    https://github.com/envoyproxy/envoy

  22. Plan: in-house
    • Early 2017: no Istio
    • We use Amazon ECS
    • Not using full features of Envoy
    • Resiliency and observability

  23. Goals
    • Control resiliency settings by Ops
    ‣ Centrally managed
    ‣ Review flow
    • All metrics should go into Prometheus
    • Low operation cost
    ‣ Fewer components, use of managed services

  25. Configuration contents
    • Jsonnet
    • Route config
    ‣ Retries and timeouts per path and domain
    • Cluster config
    ‣ DNS name of internal ELB
    ‣ Circuit breaker settings
    https://github.com/cookpad/kumonos
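
As a rough illustration of what such a definition might carry, the sketch below expands one per-service definition into a route entry and a cluster entry. The input schema, the helper names `build_route` and `build_cluster`, and the output field names are assumptions made for this example; they are not kumonos's actual schema and should not be read as Envoy's exact configuration format. Python stands in for Jsonnet here.

```python
import json

# Hypothetical service definition; every key and value here is illustrative.
DEFINITION = {
    "name": "user",
    "elb_dns_name": "user.internal.example.com",
    "port": 80,
    "timeout_ms": 3000,
    "retry": {"num_retries": 3, "per_try_timeout_ms": 1000},
    "circuit_breaker": {"max_connections": 64, "max_pending_requests": 128},
}


def build_route(defn):
    """Route entry: match a domain, then apply the timeout and retry settings."""
    return {
        "domains": [defn["name"]],
        "routes": [{
            "prefix": "/",
            "cluster": defn["name"],
            "timeout_ms": defn["timeout_ms"],
            "retry_policy": dict(defn["retry"], retry_on="5xx"),
        }],
    }


def build_cluster(defn):
    """Cluster entry: the internal ELB's DNS name plus circuit breaker limits."""
    return {
        "name": defn["name"],
        "type": "strict_dns",
        "hosts": [{"url": "tcp://{}:{}".format(defn["elb_dns_name"], defn["port"])}],
        "circuit_breakers": {"default": dict(defn["circuit_breaker"])},
    }


if __name__ == "__main__":
    print(json.dumps({"route": build_route(DEFINITION),
                      "cluster": build_cluster(DEFINITION)}, indent=2))
```

Keeping the definition as data in one repository is what makes the centrally managed review flow from the goals slide possible.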

  27. Drop statsd-relay
    • Adding tags to metrics
    with DogStatsd format
    • Fewer components are preferable
    ‣ Sent PRs to Envoy
    ‣ dog_statsd sink and fixed tag configuration are now available
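
For readers unfamiliar with the DogStatsD format, the sketch below renders a metric with tags as a single datagram line of the form `name:value|type|#key:value,...`. The function name `format_dogstatsd`, the metric name, and the tag values are illustrative assumptions; this is not Envoy's sink code.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Render one metric as a DogStatsD datagram: name:value|type|#k:v,k:v."""
    line = "{}:{}|{}".format(name, value, metric_type)
    if tags:
        # Sort for deterministic output; tag order carries no meaning.
        rendered = ",".join("{}:{}".format(k, v) for k, v in sorted(tags.items()))
        line += "|#" + rendered
    return line


# Fixed tags attached to every metric, e.g. identifying the emitting service.
FIXED_TAGS = {"service": "user", "cluster": "production"}  # illustrative values

print(format_dogstatsd("envoy.cluster.upstream_rq_total", 1, "c", FIXED_TAGS))
# prints: envoy.cluster.upstream_rq_total:1|c|#cluster:production,service:user
```

Because the tags travel inside each datagram, no separate relay is needed to attach them afterwards.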

  28. gRPC infrastructure
    • Need L7 proxy for
    HTTP/2 traffic
    • Let's extend
    control-plane

  29. Service Discovery Service (SDS) API
    • lyft/discovery
    ‣ Reference
    implementation of
    SDS API
    • Moved to cookpad/sds
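
A minimal in-memory sketch of what a discovery service of this kind does: hosts register under a service name, and readers receive only the hosts whose registration has not expired. The response shape (`hosts`, `ip_address`, `port`, `tags`) is modeled on the Envoy v1 SDS API and should be checked against its documentation. The class name, method names, and TTL value are assumptions, and this is not the lyft/discovery or cookpad/sds implementation.

```python
import time


class Registry:
    """Hypothetical in-memory host registry with TTL-based expiry."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._hosts = {}  # (service, ip, port) -> (tags, last_seen)

    def register(self, service, ip_address, port, tags=None):
        """Register a host, or refresh its TTL if it is already known."""
        self._hosts[(service, ip_address, port)] = (dict(tags or {}), self.clock())

    def hosts(self, service):
        """Return live hosts for a service in an SDS-like response shape."""
        now = self.clock()
        live = []
        for svc, ip, port in sorted(self._hosts):
            tags, seen = self._hosts[(svc, ip, port)]
            if svc == service and now - seen <= self.ttl:
                live.append({"ip_address": ip, "port": port, "tags": tags})
        return {"hosts": live}
```

In this sketch each host would call `register` periodically, so a host that stops refreshing simply drops out of the next `hosts` response.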

  30. The hard point of ECS
    • Copy current ECS
    service
    • Wait then switch
    • Delete old one
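
The three steps above can be sketched as a small routine. The `client` object and its methods (`copy_service`, `is_stable`, `switch_traffic`, `delete_service`) are a hypothetical interface invented for this sketch, not the AWS SDK; a real implementation would have to map them onto the corresponding ECS and traffic-switching operations.

```python
import time


def replace_service(client, old_name, new_name, poll_interval=5.0,
                    max_polls=60, sleep=time.sleep):
    """Copy a service, wait until the copy is stable, switch, delete the old one."""
    client.copy_service(old_name, new_name)
    for _ in range(max_polls):
        if client.is_stable(new_name):
            break
        sleep(poll_interval)
    else:
        # The copy never became stable: leave the old service untouched.
        raise TimeoutError("{} did not become stable".format(new_name))
    client.switch_traffic(old_name, new_name)
    client.delete_service(old_name)
```

The old service is deleted only after the switch, so a failed wait leaves the original service running.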

  31. Generally available

  32. Operations

  33. Dashboards
    • Prometheus
    • Grafana
    ‣ Per service
    ‣ Per service-to-service
    ‣ Envoy instances
    • Vizceral
    ‣ promviz, promviz-front

  39. Envoy on EC2
    • Build and distribute as an in-house deb package
    • Manage as a systemd service
    • Use hot-restarter.py
    ‣ Generate starter script for each host
    role

  40. wait-side-car
    • Sidecar Envoy containers need a
    few seconds to be up
    ‣ For background jobs
    • Wrapper command-line tool
    ‣ cookpad/wait-side-car
    https://github.com/cookpad/wait-side-car
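
The idea can be sketched as follows: poll the sidecar until it accepts connections, then replace the current process with the real command. This is an illustrative sketch, not the actual cookpad/wait-side-car implementation; the TCP readiness check, the address `127.0.0.1:9211`, the script name `wait_sidecar.py`, and the function names are all assumptions.

```python
import os
import socket
import sys
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def main(argv):
    # Usage (hypothetical): wait_sidecar.py COMMAND [ARGS...]
    if not argv:
        print("usage: wait_sidecar.py COMMAND [ARGS...]", file=sys.stderr)
        return 2
    if not wait_for_port("127.0.0.1", 9211):  # assumed sidecar address
        print("sidecar did not become ready", file=sys.stderr)
        return 1
    os.execvp(argv[0], argv)  # replaces this process; does not return
```

Used as a wrapper script, it would end with `sys.exit(main(sys.argv[1:]))`, so a short-lived background job starts only after its sidecar is reachable.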

  41. https://techlife.cookpad.com/entry/2018/04/02/140846

  42. Key results

  43. Resiliency
    • Eliminated temporary bursts of errors from backend services
    • Faster review and deployment of settings
    • Fault isolation: no remarkable results yet

  44. Observability
    • Decrease in time to detect root causes of service communication issues
    • Visualization of how the resilience mechanisms are working
    • One of the sources of Service Level Indicators

  45. Growth of platform
    • Improve the application platform without application deployments
    • Increased velocity of the platform development team

  46. Next challenges

  47. Next challenges
    • v2 xDS migration / Istio
    • Chaos engineering platform
    • Distributed tracing
    • Auth[z, n]

  48. Wrap up

  49. Wrap up
    • Issues around service communications
    • Introduced a service mesh instead of the library approach
    • Key results: resiliency, observability,
    platform improvement

  50. Q&A
    • Twitter: @taiki45
    • These slides will be published later
    • http://techlife.cookpad.com/
