Cookpad Tech Kitchen #20: Service Mesh at Cookpad

taiki45
November 28, 2018

Cookpad Tech Kitchen #20: The Current State of Cookpad's Microservice Platform https://cookpad.connpass.com/event/106913/

Transcript

  1. Service Mesh at Cookpad (Taiki Ono, Cookpad Inc.)

  2. Agenda • Background • Problems • Introducing and operations • Key results • Next challenges
  3. Background

  4. Cookpad • "Make everyday cooking fun!" • Originally started in Japan in 1997 • Operates in over 23 languages and 68 countries
  5. Scale • 200+ product developers • 100+ production services • 90M monthly average users
  6. Organization structure • Service team • SRE team, etc.

  7. Technology stack • Ruby on Rails for both web frontend and backend apps • Python for ML apps • Go for backend apps • Rust, Swift, Java, etc. for internal apps
  8. None
  9. Problems

  10. Operational problems • Decrease in system reliability • Hard to troubleshoot and debug ‣ Increased time to detect root causes of incidents ‣ Capacity planning
  11. Solutions • Expeditor ‣ Ruby library inspired by Netflix's Hystrix • aws-xray ‣ Ruby library for distributed tracing using AWS's X-Ray service https://github.com/cookpad
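
Slide 11 mentions a Hystrix-inspired library. As a rough illustration of the circuit-breaker idea behind such libraries, here is a minimal Python sketch. It is not Expeditor's API: the class name, parameters (`threshold`, `reset_after`), and half-open behaviour are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hitting an unhealthy backend.
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
            # A single failure of that trial reopens the circuit.
            self.opened_at = None
            self.failures = self.threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Wrapping each outbound request in `breaker.call(...)` lets callers fail fast while a backend is unhealthy, which is the per-application behaviour the later slides move out of libraries and into the proxy.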
  12. http://techlife.cookpad.com/entry/2017/09/06/115710

  13. None
  14. Go, Python, Rust, Java, Swift apps? • Limitations of the library approach ‣ More for product development ‣ Controlling library versions is hard • Were planning to develop our own proxy, combined with consul-template
  15. Service mesh to the rescue

  16. At SREcon Americas 2017: "Lyft's Envoy: Experiences Operating a Large Service Mesh"
  17. Replacing libraries with a proxy

  18. control-plane

  19. Introducing and operating service mesh

  20. Timeline • Early 2017: making a plan • Late 2017: building MVP • Early 2018: generally available
  21. Envoy • Publicly released in mid 2016 • Lightweight • Graceful reloading • gRPC support https://github.com/envoyproxy/envoy
  22. Plan: in-house • Early 2017: no Istio • We use Amazon ECS • Not using full features of Envoy • Resiliency and observability parts
  23. Goals • Resiliency settings controlled by Ops ‣ Centrally managed ‣ Review flow • All metrics should go into Prometheus • Low operational cost ‣ Fewer components, use of managed services
  24. None
  25. Configuration contents • Jsonnet • Route config ‣ Retry, timeouts for paths, domains • Cluster config ‣ DNS name of internal ELB ‣ Circuit breaker settings https://github.com/cookpad/kumonos
  26. Drop statsd-relay • Adding tags to metrics with DogStatsD format • Fewer components are preferable ‣ Sent PRs to Envoy ‣ dog_statsd sink and fixed tag configuration are available
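
For readers unfamiliar with the DogStatsD format mentioned on slide 26, it extends the plain StatsD line `name:value|type` with a trailing `|#key:value,...` tag section. A minimal formatter is sketched below; the metric and tag names in the usage line are made up for illustration and are not taken from Cookpad's setup.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram line.

    metric_type is a StatsD type code such as "c" (counter) or "g" (gauge).
    Tags are emitted in sorted key order so the output is deterministic.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


# Hypothetical metric and tag names:
print(format_dogstatsd("upstream_rq_total", 1, "c",
                       {"service": "user", "cluster": "recipe"}))
```

Tags carried on the metric itself are what make a separate relay that rewrites metric names unnecessary, which matches the slide's "fewer components" point.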
  27. gRPC infrastructure • Need an L7 proxy for HTTP/2 traffic • Let's extend the control plane
  28. Service Discovery Service (SDS) API • lyft/discovery ‣ Reference implementation of SDS API • Moved to cookpad/sds
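
The idea behind an SDS-style registry on slide 28 can be sketched as an in-memory store: hosts register themselves periodically, and the proxy asks for the live hosts of a service. The response shape below (`hosts` entries with `ip_address`, `port`, `tags`) follows Envoy's v1 SDS API as best recalled; the class, its TTL-based expiry, and the storage are assumptions and do not describe how lyft/discovery or cookpad/sds are implemented.

```python
import time


class Registry:
    """Hypothetical in-memory registry with heartbeat-style expiry."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._hosts = {}  # service name -> {(ip, port): (tags, last_seen)}

    def register(self, service, ip, port, tags=None):
        # Re-registering the same host refreshes its last-seen time.
        entry = (dict(tags or {}), self.clock())
        self._hosts.setdefault(service, {})[(ip, port)] = entry

    def hosts(self, service):
        """Return live hosts for `service` in an SDS-like response shape."""
        now = self.clock()
        live = []
        for (ip, port), (tags, seen) in sorted(self._hosts.get(service, {}).items()):
            if now - seen <= self.ttl:
                live.append({"ip_address": ip, "port": port, "tags": tags})
        return {"hosts": live}
```

In this sketch a host that stops refreshing its registration simply drops out of the response once the TTL passes, so no explicit deregistration step is needed.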
  29. The hard part of ECS • Copy the current ECS service • Wait, then switch • Delete the old one
  30. Generally available

  31. Operations

  32. Dashboards • Prometheus • Grafana ‣ Per service ‣ Per service-to-service ‣ Envoy instances • Vizceral ‣ promviz, promviz-front
  33. None
  34. None
  35. None
  36. None
  37. None
  38. Envoy on EC2 • Build and distribute as an in-house deb package • Manage as a systemd service • Use hot-restarter.py ‣ Generate starter script for each host role
  39. wait-side-car • Sidecar Envoy containers need a few seconds to be up ‣ For background jobs • Wrapper command-line tool ‣ cookpad/wait-side-car https://github.com/cookpad/wait-side-car
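
The wrapper idea on slide 39 can be sketched as "poll until the sidecar answers, then exec the real command". The code below is not the cookpad/wait-side-car implementation: the readiness URL, the timeout values, and the function names are assumptions for illustration.

```python
import os
import sys
import time
import urllib.request


def wait_until_ready(check, timeout=10.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def http_ok(url):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeouts, HTTP errors
        return False


def main(argv):
    # Expected arguments: READINESS_URL COMMAND [ARGS...]
    url, command = argv[1], argv[2:]
    if not wait_until_ready(lambda: http_ok(url)):
        sys.exit("sidecar did not become ready in time")
    # Replace this process with the wrapped command.
    os.execvp(command[0], command)
```

A short-lived background job would call `main(sys.argv)` from its entrypoint with a hypothetical sidecar health URL as the first argument, so that its first outbound request does not race the proxy's startup.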
  40. https://techlife.cookpad.com/entry/2018/04/02/140846

  41. Key results

  42. Resiliency • Eliminated temporary bursts of errors from backend services • Faster review and deployment of settings • Fault isolation: no remarkable results yet
  43. Observability • Decreased time to detect root causes of service communication issues • Visualization of how resilience mechanisms are working • One of the sources of Service Level Indicators
  44. Continuous growth of platform • Improve application platform without application deployment • Increase velocity of platform development team
  45. Next challenges

  46. Next challenges • v2 xDS migration • More effective traffic control • Chaos engineering platform • Distributed tracing • Auth[z, n]
  47. Wrap up

  48. Wrap up • Issues around service communications • Introduced a service mesh instead of taking the library approach • Key results: resiliency, observability, platform improvement
  49. Q&A • Twitter: @taiki45 • http://techlife.cookpad.com/ • EnvoyCon 2018 https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/co-located-events/envoycon/