Cookpad Tech Kitchen #20: Service Mesh at Cookpad

taiki45
November 28, 2018

Cookpad Tech Kitchen #20: The Current State of Cookpad's Microservice Platform https://cookpad.connpass.com/event/106913/

Transcript

  1. Service mesh at Cookpad
    Taiki Ono, Cookpad Inc.

  2. Agenda
    • Background
    • Problems
    • Introducing and operations
    • Key results
    • Next challenges

  3. Background

  4. Cookpad
    • "Make everyday cooking fun!"
    • Originally started in Japan in 1997
    • Operate in over 23 languages, 68
    countries

  5. Scale
    • 200+ product developers
    • 100+ production services
    • 90M monthly average users

  6. Organization structure
    Service Team
    SRE team etc

  7. Technology stack
    • Ruby on Rails for both web frontend
    and backend apps
    • Python for ML apps
    • Go for backend apps
    • Rust, Swift, Java, etc. for internal
    apps

  9. Problems

  10. Operational problems
    • Decrease in system reliability
    • Hard to troubleshoot and debug
    ‣ Increased time to detect root causes of
    incidents
    ‣ Capacity planning

  11. Solutions
    • Expeditor
    ‣ Ruby library inspired by Netflix's
    Hystrix
    • aws-xray
    ‣ Ruby library for distributed tracing
    using AWS's X-Ray service
    https://github.com/cookpad
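
The deck does not show Expeditor's API, so the following is only a minimal Python sketch of the Hystrix-style circuit breaker pattern that such a library provides. The class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration, not Expeditor's interface.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hitting an unhealthy backend.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_user, user_id)` where `fetch_user` is a hypothetical client function, and treat `CircuitOpenError` as a signal to fall back.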

  12. http://techlife.cookpad.com/entry/2017/09/06/115710

  14. Go, Python, Rust, Java, Swift apps?
    • Limitation of library model
    approach
    ‣ More for product development
    ‣ Controlling library versions is hard
    • Were planning to develop our own proxy,
    combined with consul-template

  15. Service mesh to the rescue

  16. at SREcon17 Americas
    "Lyft's Envoy: Experiences Operating a Large Service Mesh"

  17. Replacing libraries with a proxy

  18. control-plane

  19. Introducing and
    operating service mesh

  20. Timeline
    • Early 2017: planning
    • Late 2017: building an MVP
    • Early 2018: generally available

  21. Envoy
    • Publicly released in mid-2016
    • Lightweight
    • Graceful reloading
    • gRPC support
    https://github.com/envoyproxy/envoy

  22. Plan: in-house
    • Early 2017: no Istio
    • We use Amazon ECS
    • Not using full features of Envoy
    • Resiliency and observability parts

  23. Goals
    • Control resiliency settings by Ops
    ‣ Centrally managed
    ‣ Review flow
    • All metrics should go into Prometheus
    • Low operation cost
    ‣ Fewer components, use of managed services

  25. Configuration contents
    • Jsonnet
    • Route config
    ‣ Retry, timeouts for
    paths, domains
    • Cluster config
    ‣ DNS name of internal ELB
    ‣ Circuit breaker settings
    https://github.com/cookpad/kumonos
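
To make the two kinds of configuration concrete, here is a minimal Python sketch that turns one service dependency into a route entry (retry and timeout) and a cluster entry (internal ELB DNS name and circuit breaker limits). The input schema, the example host name, and the default values are assumptions for illustration; the output field names are loosely modeled on Envoy's v1 JSON API and should be checked against the Envoy documentation. This is not how kumonos itself is implemented.

```python
import json


def build_route(dep):
    """Route entry: which cluster a path prefix maps to, with retry and timeout."""
    return {
        "prefix": dep.get("prefix", "/"),
        "cluster": dep["name"],
        "timeout_ms": dep["timeout_ms"],
        "retry_policy": {
            "retry_on": dep["retry_on"],
            "num_retries": dep["num_retries"],
            "per_try_timeout_ms": dep["per_try_timeout_ms"],
        },
    }


def build_cluster(dep):
    """Cluster entry: upstream address plus circuit breaker thresholds."""
    return {
        "name": dep["name"],
        "type": "strict_dns",
        "lb_type": "round_robin",
        "connect_timeout_ms": dep.get("connect_timeout_ms", 250),
        "hosts": [{"url": "tcp://{}:{}".format(dep["elb_dns"], dep.get("port", 80))}],
        "circuit_breakers": {
            "default": {
                "max_connections": dep["max_connections"],
                "max_pending_requests": dep["max_pending_requests"],
                "max_retries": dep["max_retries"],
            }
        },
    }


# Hypothetical dependency definition; every value here is illustrative.
user_service = {
    "name": "user",
    "elb_dns": "user.internal.example.com",
    "timeout_ms": 3000,
    "retry_on": "5xx,connect-failure",
    "num_retries": 3,
    "per_try_timeout_ms": 1000,
    "max_connections": 64,
    "max_pending_requests": 128,
    "max_retries": 3,
}

if __name__ == "__main__":
    print(json.dumps({"route": build_route(user_service),
                      "cluster": build_cluster(user_service)}, indent=2))
```

Keeping these definitions in one reviewed repository is what allows resiliency settings to be managed centrally rather than per application.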

  26. Drop statsd-relay
    • Adding tags to metrics
    with DogStatsd format
    • Fewer components are
    preferable
    ‣ Send PRs to Envoy
    ‣ dog_statsd sink and
    fixed tag configuration
    are available
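
For readers unfamiliar with the format: plain statsd lines carry only a name, value, and type, while DogStatsD appends a `|#key:value,...` tag section. A minimal Python sketch of building such a line follows; the metric name and tag values are made up for illustration and are not taken from Cookpad's setup.

```python
def dogstatsd_line(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram line.

    metric_type is a statsd type code such as "c" (counter) or "g" (gauge).
    Tags are emitted in sorted key order so the output is deterministic.
    """
    line = "{}:{}|{}".format(name, value, metric_type)
    if tags:
        tag_section = ",".join("{}:{}".format(k, v) for k, v in sorted(tags.items()))
        line += "|#" + tag_section
    return line


# Hypothetical counter with fixed tags identifying the caller and the upstream.
example = dogstatsd_line(
    "envoy.cluster.upstream_rq_total", 1, "c",
    {"service": "recipe", "cluster": "user"},
)
```

With tags attached at the source, a separate relay that rewrites metric names is no longer needed, which is the point of dropping statsd-relay.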

  27. gRPC infrastructure
    • Need L7 proxy for
    HTTP/2 traffic
    • Let's extend
    control-plane

  28. Service Discovery Service (SDS) API
    • lyft/discovery
    ‣ Reference
    implementation of SDS
    API
    • Moved to cookpad/sds
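
As a rough illustration of what an SDS backend does, the Python sketch below keeps an in-memory registry in which hosts register themselves and expire unless they refresh within a TTL, and it answers lookups with a `hosts` list. The class, the TTL value, and the exact response shape are assumptions modeled loosely on Envoy's v1 SDS response; this is not the code of lyft/discovery or cookpad/sds.

```python
import time


class Registry:
    """Hypothetical in-memory service registry with TTL-based expiry."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._hosts = {}  # service name -> {(ip, port): last registration time}

    def register(self, service, ip, port):
        """Record (or refresh) a host; hosts must re-register before the TTL ends."""
        self._hosts.setdefault(service, {})[(ip, port)] = self.clock()

    def hosts_response(self, service):
        """Return live hosts in a shape a proxy could consume for a lookup."""
        now = self.clock()
        live = [
            addr
            for addr, seen in self._hosts.get(service, {}).items()
            if now - seen <= self.ttl
        ]
        return {"hosts": [{"ip_address": ip, "port": port} for ip, port in sorted(live)]}
```

An HTTP layer in front of `register` and `hosts_response` would complete the picture; it is omitted here to keep the sketch short.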

  29. The hard point of ECS
    • Copy current ECS
    service
    • Wait then switch
    • Delete old one

  30. Generally available

  31. Operations

  32. Dashboards
    • Prometheus
    • Grafana
    ‣ Per service
    ‣ Per service-to-service
    ‣ Envoy instances
    • Vizceral
    ‣ promviz, promviz-front

  38. Envoy on EC2
    • Build and distribute as an in-house
    deb package
    • Manage as a systemd service
    • Use hot-restarter.py
    ‣ Generate starter script for each host
    role
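
Envoy's hot-restarter.py expects a start script that launches Envoy with the restart epoch it passes in through the `RESTART_EPOCH` environment variable. The Python sketch below shows one way such a script could be generated per host role. The binary path, config directory, and the choice to use the role as the service cluster name are assumptions for illustration, not Cookpad's actual layout.

```python
import shlex


def render_starter(role, envoy_bin="/usr/bin/envoy", config_dir="/etc/envoy"):
    """Render a bash start script for one host role (paths are hypothetical)."""
    config = "{}/{}.json".format(config_dir, role)
    return (
        "#!/bin/bash\n"
        # hot-restarter.py sets RESTART_EPOCH; bash expands it at run time.
        "exec {} -c {} ".format(shlex.quote(envoy_bin), shlex.quote(config))
        + "--restart-epoch $RESTART_EPOCH "
        + "--service-cluster {} ".format(shlex.quote(role))
        + '--service-node "$(hostname)"\n'
    )


if __name__ == "__main__":
    print(render_starter("web"))
```

The rendered script would then be handed to hot-restarter.py from the systemd unit, so a reload can start a new Envoy process while the old one drains.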

  39. wait-side-car
    • Sidecar Envoy containers need a few
    seconds to be up
    ‣ For background jobs
    • Wrapper command-line tool
    ‣ cookpad/wait-side-car
    https://github.com/cookpad/wait-side-car
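
The idea of such a wrapper can be sketched in a few lines of Python: poll a readiness probe until the sidecar answers, then run the real command. The function names, the probe URL passed on the command line, and the timing defaults are assumptions for illustration; this is not the interface of cookpad/wait-side-car.

```python
import subprocess
import sys
import time
import urllib.request


def http_ready(url, timeout=1.0):
    """Return True if the URL answers with HTTP 200; any failure means not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def wait_until_ready(probe, deadline_s=30.0, interval_s=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns True or the deadline passes."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


def main(argv):
    """Usage (hypothetical): wrapper.py <sidecar-url> <command> [args...]"""
    url, command = argv[1], argv[2:]
    if not wait_until_ready(lambda: http_ready(url)):
        print("sidecar did not become ready in time", file=sys.stderr)
        return 1
    # Run the wrapped job only after the sidecar is reachable.
    return subprocess.call(command)
```

This matters most for short-lived background jobs, which may issue their first outbound request before the sidecar has finished starting.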

  40. https://techlife.cookpad.com/entry/2018/04/02/140846

  41. Key results

  42. Resiliency
    • Eliminates temporary bursts of errors
    from backend services
    • Faster review and deployment of
    settings
    • Fault isolation: no remarkable results
    yet

  43. Observability
    • Less time to detect root causes of
    service communication issues
    • Visualization of how resilience
    mechanisms are working
    • One source of Service Level
    Indicators

  44. Continuous growth of the platform
    • Improve the application platform without redeploying
    applications
    • Increase velocity of the platform development team

  45. Next challenges

  46. Next challenges
    • v2 xDS migration
    • More effective traffic control
    • Chaos engineering platform
    • Distributed tracing
    • Auth[z, n]

  47. Wrap up

  48. Wrap up
    • Issues around service communications
    • Introduced a service mesh instead of
    the library approach
    • Key results: resiliency, observability,
    platform improvement

  49. Q&A
    • Twitter: @taiki45
    • http://techlife.cookpad.com/
    • EnvoyCon 2018 https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/co-located-events/envoycon/
