Cookpad Tech Kitchen #20: Service Mesh at Cookpad
taiki45
November 28, 2018
Technology
Cookpad Tech Kitchen #20: The Current State of Cookpad's Microservice Platform
https://cookpad.connpass.com/event/106913/
Transcript
Service Mesh at Cookpad
Taiki Ono, Cookpad Inc.
Agenda
• Background
• Problems
• Introduction and operations
• Key results
• Next challenges
Background
Cookpad
• "Make everyday cooking fun!"
• Originally started in Japan in 1997
• Operates in over 23 languages, 68 countries
Scale
• 200+ product developers
• 100+ production services
• 90M monthly average users
Organization structure: service teams, SRE team, etc.
Technology stack
• Ruby on Rails for both web frontend and backend apps
• Python for ML apps
• Go for backend apps
• Rust, Swift, Java, etc. for internal apps
Problems
Operational problems
• Decrease in system reliability
• Hard to troubleshoot and debug
‣ Increase in time to detect root causes of incidents
‣ Capacity planning
Solutions
• Expeditor
‣ Ruby library inspired by Netflix's Hystrix
• aws-xray
‣ Ruby library for distributed tracing using AWS's X-Ray service
https://github.com/cookpad
http://techlife.cookpad.com/entry/2017/09/06/115710
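The Hystrix-style protection mentioned on this slide is commonly built around a circuit breaker. The following is a minimal Python sketch of that general pattern, not Expeditor's actual Ruby API: the class name, parameters, and thresholds are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures (hypothetical API)."""

    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of sending more load to a struggling backend.
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_user, user_id)` with a hypothetical `fetch_user`, and treat `CircuitOpenError` as a signal to fall back or return a degraded response.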
Go, Python, Rust, Java, Swift apps?
• Limitation of the library model approach
‣ More for product development
‣ Controlling library versions is hard
• Planned to develop our own proxy, combined with consul-template
Service mesh to the rescue
At SREcon Americas 2017: "Lyft's Envoy: Experiences Operating a Large Service Mesh"
Replacing libraries with a proxy
control-plane
Introducing and operating service mesh
Timeline
• Early 2017: making a plan
• Late 2017: building an MVP
• Early 2018: generally available
Envoy
• Publicly released in mid-2016
• Lightweight
• Graceful reloading
• gRPC support
https://github.com/envoyproxy/envoy
Plan: in-house
• Early 2017: no Istio
• We use Amazon ECS
• Not using the full features of Envoy
• Resiliency and observability parts
Goals
• Control resiliency settings by Ops
‣ Centrally managed
‣ Review flow
• All metrics should go into Prometheus
• Low operation cost
‣ Fewer components, use of managed services
Configuration contents
• Jsonnet
• Route config
‣ Retry, timeouts for paths, domains
• Cluster config
‣ DNS name of internal ELB
‣ Circuit breaker settings
https://github.com/cookpad/kumonos
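To make the route and cluster split on this slide concrete, here is a rough Python sketch that expands one service definition into route settings (retries, timeouts per path) and cluster settings (ELB DNS name, circuit breaker limits). The talk uses Jsonnet and kumonos for this; the function name, every key name, and the example host below are assumptions for illustration and are not taken from the kumonos schema.

```python
def build_proxy_config(service):
    """Expand a hypothetical service definition into route and cluster settings."""
    routes = [
        {
            "prefix": route["prefix"],
            "cluster": service["name"],
            "timeout_ms": route["timeout_ms"],
            "retry_policy": {
                "retry_on": route["retry_on"],
                "num_retries": route["num_retries"],
            },
        }
        for route in service["routes"]
    ]
    cluster = {
        "name": service["name"],
        # The cluster points at the DNS name of an internal load balancer.
        "hosts": [{"url": "tcp://" + service["elb_dns_name"] + ":80"}],
        "circuit_breakers": {"default": dict(service["circuit_breaker"])},
    }
    return {"routes": routes, "cluster": cluster}


# Hypothetical dependency definition, reviewed and managed centrally.
user_service = {
    "name": "user",
    "elb_dns_name": "user.internal.example.com",
    "routes": [
        {"prefix": "/", "timeout_ms": 3000, "retry_on": "5xx", "num_retries": 2},
    ],
    "circuit_breaker": {"max_connections": 64, "max_pending_requests": 128},
}

config = build_proxy_config(user_service)
```

Keeping the definition as plain data is what makes the central management and review flow practical: a change to a timeout or retry count is a small, reviewable diff.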
Drop statsd-relay
• Adding tags to metrics with DogStatsD format
• Fewer components are preferable
‣ Sent PRs to Envoy
‣ dog_statsd sink and fixed tag configuration are available
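For readers unfamiliar with the DogStatsD format, it extends the plain StatsD line `name:value|type` with a trailing `|#key:value,...` tag section. The sketch below formats such a line in Python; the function name, metric name, and tag values are made up for illustration and are not actual Envoy output.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram line."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are appended as a comma-separated key:value list after "|#".
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


# Hypothetical counter tagged with the calling service and upstream cluster.
line = format_dogstatsd(
    "upstream_rq_total", 1, "c", {"service": "recipe", "cluster": "user"}
)
```

Because the tags travel inside each datagram, no intermediate relay is needed to attach them.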
gRPC infrastructure
• Need an L7 proxy for HTTP/2 traffic
• Let's extend the control plane
ServiceDiscoveryService API
• lyft/discovery
‣ Reference implementation of SDS API
• Moved to cookpad/sds
The hard part of ECS
• Copy current ECS service
• Wait then switch
• Delete old one
Generally available
Operations
Dashboards
• Prometheus
• Grafana
‣ Per service
‣ Per service-to-service
‣ Envoy instances
• Vizceral
‣ promviz, promviz-front
Envoy on EC2
• Build and distribute as an in-house deb package
• Manage as a systemd service
• Use hot-restarter.py
‣ Generate starter script for each host role
wait-side-car
• Sidecar Envoy containers need a few seconds to be up
‣ For background jobs
• Wrapper command-line tool
‣ cookpad/wait-side-car
https://github.com/cookpad/wait-side-car
https://techlife.cookpad.com/entry/2018/04/02/140846
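The idea behind such a wrapper can be sketched as "poll until the sidecar accepts connections, then start the real command". The Python below is a minimal illustration of that idea only: the function name, parameters, and the TCP-connect readiness check are assumptions, and the actual wait-side-car tool may check readiness differently.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A background job's entrypoint could call `wait_for_port("127.0.0.1", sidecar_port)` with a hypothetical `sidecar_port` and only then launch the job, so that its first outbound requests do not fail while the proxy is still starting.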
Key results
Resiliency
• Eliminates temporary bursts of errors from backend services
• Speed of reviewing settings and deployment
• Fault isolation: no remarkable results yet
Observability
• Decrease in time to detect root causes of service communication issues
• Visualization of how resilience mechanisms are working
• One of the sources of Service Level Indicators
Continuous growth of platform
• Improve the application platform without application deployment
• Increase velocity of the platform development team
Next challenges
Next challenges
• v2 xDS migration
• More effective traffic control
• Chaos engineering platform
• Distributed tracing
• AuthN/AuthZ
Wrap up
Wrap up
• Issues around service communications
• Introducing a service mesh instead of the library approach
• Key results: resiliency, observability, platform improvement
Q&A
• Twitter: @taiki45
• http://techlife.cookpad.com/
• EnvoyCon 2018 https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/co-located-events/envoycon/