Building and operating service mesh at mid-size company

taiki45
September 07, 2018

Transcript

  1. Building and operating
    service mesh
    at mid-size company
    Taiki Ono, Cookpad Inc.

  2. Agenda
    • Background
    • Problems
    • Introducing and operations
    • Key results
    • Next challenges

  3. Cookpad
    • "Make everyday cooking fun!"
    • Originally started in Japan in 1997
    • Operate in over 23 languages, 68
    countries

  4. Scale
    • 200+ product developers
    • 100+ production services
    • 90M monthly average users

  5. Organization structure

  6. Technology stack
    • Ruby on Rails for both web
    frontend and backend apps
    • Python for ML apps
    • Go for backend apps
    • Rust, Swift, Java, etc. for internal
    apps

  7. Operational problems
    • Decrease in system reliability
    • Hard to troubleshoot and debug
    ‣ Increased time to detect root causes
    of incidents
    ‣ Capacity planning

    View full-size slide

  8. Solutions
    • Expeditor
    ‣ Ruby library inspired by Netflix's
    Hystrix
    • aws-xray
    ‣ Ruby library for distributed tracing
    using AWS's X-Ray service
    https://github.com/cookpad
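
The slide describes Expeditor as inspired by Hystrix. As a rough illustration of the circuit breaker pattern that this family of libraries provides, here is a minimal Python sketch. It is not Expeditor's API: the class name, parameters, and thresholds are all hypothetical, and a production library would also handle concurrency, timeouts, and fallbacks.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling backend.
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_user, user_id)`, where `fetch_user` stands in for any function that talks to a backend service.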

  9. http://techlife.cookpad.com/entry/2017/09/06/115710

  10. Go, Python, Rust, Java, Swift apps?
    • Limitation of library model
    approach
    ‣ More for product development
    ‣ Controlling library versions is hard
    • Planned to develop our own proxy,
    combined with consul-template

  11. Service mesh to the rescue

  12. at SRECON America 2017
    "Lyft's Envoy: Experiences
    Operating a Large Service Mesh"

  13. Replacing libraries with a proxy

  14. control-plane

  15. Introducing and
    operating service mesh

  16. Timeline
    • Early 2017: planning
    • Late 2017: building MVP
    • Early 2018: generally available

  17. Envoy
    • Publicly released in mid-2016
    • Lightweight
    • Graceful reloading
    • gRPC support
    https://github.com/envoyproxy/envoy

  18. Plan: in-house
    • Early 2017: no Istio
    • We use Amazon ECS
    • Not using full features of Envoy
    • Resiliency and observability

  19. Goals
    • Control resiliency settings by Ops
    ‣ Centrally managed
    ‣ Review flow
    • All metrics should go into Prometheus
    • Low operation cost
    ‣ Fewer components, use of managed services

  20. Configuration contents
    • Jsonnet
    • Route config
    ‣ Retry, timeouts for
    paths, domains
    • Cluster config
    ‣ DNS name of internal ELB
    ‣ Circuit breaker settings
    https://github.com/cookpad/kumonos
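
To make the split between route config and cluster config concrete, here is a small Python sketch that expands one compact service definition into both. It does not reproduce the kumonos schema or the Jsonnet definitions from the talk: the input keys, the output field names (loosely modeled on Envoy-style configuration), and the DNS name are assumptions chosen for illustration.

```python
import json


def build_route(dep):
    """Route entry: timeout and retry settings for a path prefix."""
    return {
        "prefix": dep.get("prefix", "/"),
        "cluster": dep["name"],
        "timeout_ms": dep["timeout_ms"],
        "retry_policy": {
            "retry_on": dep["retry_on"],
            "num_retries": dep["num_retries"],
        },
    }


def build_cluster(dep):
    """Cluster entry: where the upstream lives and its circuit breaker limits."""
    return {
        "name": dep["name"],
        "type": "strict_dns",
        "hosts": [{"url": "tcp://{}:{}".format(dep["elb_dns_name"], dep["port"])}],
        "circuit_breakers": {
            "max_pending_requests": dep["max_pending_requests"],
            "max_retries": dep["max_retries"],
        },
    }


def build_config(dependencies):
    """Expand every dependency into one route and one cluster."""
    return {
        "routes": [build_route(d) for d in dependencies],
        "clusters": [build_cluster(d) for d in dependencies],
    }


# Hypothetical upstream service fronted by an internal ELB.
USER_SERVICE = {
    "name": "user",
    "elb_dns_name": "user.internal.example",
    "port": 80,
    "timeout_ms": 3000,
    "retry_on": "5xx,connect-failure",
    "num_retries": 3,
    "max_pending_requests": 1024,
    "max_retries": 3,
}

print(json.dumps(build_config([USER_SERVICE]), indent=2))
```

Keeping the compact definition in one reviewed file is what allows the resiliency settings to be centrally managed, as the Goals slide asks.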

  21. Drop statsd-relay
    • Adding tags to metrics
    with DogStatsD format
    • Fewer components are
    preferable
    ‣ Sent PRs to Envoy
    ‣ dog_statsd sink and
    fixed tag configuration
    are now available
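
For readers unfamiliar with the DogStatsD format mentioned above: it extends a plain StatsD line (`name:value|type`) with a `|#key:value,...` tag suffix. The Python sketch below formats such a line. The function name, metric name, and tag values are made up for illustration, and sample rates are left out for brevity.

```python
def format_dogstatsd(name, value, metric_type="c", tags=None):
    """Format one metric as a DogStatsD datagram line.

    metric_type is a StatsD type code such as "c" (counter),
    "g" (gauge), or "ms" (timer).
    """
    line = "{}:{}|{}".format(name, value, metric_type)
    if tags:
        # Tags are appended as a comma-separated key:value list after "|#".
        pairs = ",".join("{}:{}".format(k, v) for k, v in sorted(tags.items()))
        line += "|#" + pairs
    return line


# Hypothetical metric name and tags.
print(format_dogstatsd("upstream_rq_total", 1, "c", {"service": "user", "role": "web"}))
```

Fixed tags of this kind let each proxy label its own metrics at the source.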

  22. gRPC infrastructure
    • Need an L7 proxy for
    HTTP/2 traffic
    • Let's extend
    control-plane

  23. ServiceDiscoveryService API
    • lyft/discovery
    ‣ Reference
    implementation of
    SDS API
    • Moved to cookpad/sds
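
As a rough picture of what a service discovery service does, the Python sketch below keeps an in-memory registry and renders the host list for one service. It assumes the response takes the shape of a `hosts` list whose entries carry `ip_address`, `port`, and `tags`; the class and method names are hypothetical and do not come from lyft/discovery or cookpad/sds.

```python
import json


class Registry:
    """In-memory host registry; a real service would use a shared store."""

    def __init__(self):
        self._hosts = {}  # service name -> {(ip, port): tags}

    def register(self, service, ip, port, tags=None):
        """Add or refresh one host of a service."""
        self._hosts.setdefault(service, {})[(ip, port)] = dict(tags or {})

    def deregister(self, service, ip, port):
        """Remove a host; unknown hosts are ignored."""
        self._hosts.get(service, {}).pop((ip, port), None)

    def hosts_response(self, service):
        """Build the assumed response body for one service."""
        entries = self._hosts.get(service, {})
        return {
            "hosts": [
                {"ip_address": ip, "port": port, "tags": tags}
                for (ip, port), tags in sorted(entries.items())
            ]
        }


registry = Registry()
registry.register("user", "10.0.0.12", 8080, {"az": "zone-a"})
registry.register("user", "10.0.0.11", 8080, {"az": "zone-b"})
print(json.dumps(registry.hosts_response("user"), indent=2))
```

The proxy polls such an endpoint periodically and updates its upstream host set, which is how gRPC traffic can be balanced per request without an intermediate load balancer.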

  24. The hard point of ECS
    • Copy current ECS
    service
    • Wait then switch
    • Delete old one

  25. Generally available

  26. Dashboards
    • Prometheus
    • Grafana
    ‣ Per service
    ‣ Per service-to-service
    ‣ Envoy instances
    • Vizceral
    ‣ promviz, promviz-front

  27. Envoy on EC2
    • Build and distribute as an in-house
    deb package
    • Manage as a systemd service
    • Use hot-restarter.py
    ‣ Generate starter script for each host
    role

  28. wait-side-car
    • Sidecar Envoy containers need a
    few seconds to be up
    ‣ For background jobs
    • Wrapper command-line tool
    ‣ cookpad/wait-side-car
    https://github.com/cookpad/wait-side-car
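
The idea of a wrapper that waits for the sidecar can be sketched in Python as below. This is not the implementation of cookpad/wait-side-car: it assumes readiness can be approximated by the sidecar's port accepting TCP connections, and the function names, port, and timeouts are hypothetical. The actual tool may check readiness differently.

```python
import os
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Poll until host:port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def run_after_sidecar(argv, host="127.0.0.1", port=9211, timeout=10.0):
    """Replace this process with argv once the sidecar is reachable."""
    if not wait_for_port(host, port, timeout=timeout):
        raise SystemExit("sidecar did not become ready in time")
    os.execvp(argv[0], argv)
```

A background job would then be started as `run_after_sidecar(["bundle", "exec", "rake", "some:job"])`, so that its first outbound request does not fail while the proxy is still starting.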

  29. https://techlife.cookpad.com/entry/2018/04/02/140846

  30. Resiliency
    • Eliminates temporary bursts of
    errors from backend services
    • Faster review and deployment of
    settings
    • Fault isolation: no remarkable
    results yet

  31. Observability
    • Decreased time to detect root causes
    of service communication issues
    • Visualization of how resilience
    mechanisms are working
    • One source of Service Level
    Indicators

  32. Growth of platform
    • Improve application platform without
    application deployment
    • Increase velocity of platform
    development team

  33. Next challenges

  34. Next challenges
    • v2 xDS migration / Istio
    • Chaos engineering platform
    • Distributed tracing
    • Auth[z, n]

  35. Wrap up
    • Issues around service communications
    • Introduced a service mesh instead of
    the library approach
    • Key results: resiliency, observability,
    platform improvement

  36. Q&A
    • Twitter: @taiki45
    • These slides will be published later
    • http://techlife.cookpad.com/
