Cookpad Tech Kitchen #20: Service Mesh at Cookpad
taiki45
November 28, 2018
Event: Cookpad Tech Kitchen #20: The Current State of Cookpad's Microservice Platform
https://cookpad.connpass.com/event/106913/
Transcript
Service Mesh at Cookpad
Taiki Ono, Cookpad Inc.
Agenda
• Background
• Problems
• Introduction and operations
• Key results
• Next challenges
Background
Cookpad
• "Make everyday cooking fun!"
• Originally started in Japan in 1997
• Operates in over 23 languages and 68 countries
Scale
• 200+ product developers
• 100+ production services
• 90M monthly average users
Organization structure: service teams, SRE team, etc.
Technology stack
• Ruby on Rails for both web frontend and backend apps
• Python for ML apps
• Go for backend apps
• Rust, Swift, Java, etc. for internal apps
Problems
Operational problems
• Decrease in system reliability
• Hard to troubleshoot and debug
  ‣ Increase in time to detect root causes of incidents
  ‣ Capacity planning
Solutions
• Expeditor
  ‣ Ruby library inspired by Netflix's Hystrix
• aws-xray
  ‣ Ruby library for distributed tracing using AWS's X-Ray service
https://github.com/cookpad
http://techlife.cookpad.com/entry/2017/09/06/115710
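For readers unfamiliar with the Hystrix-style pattern mentioned above, the core idea is a circuit breaker: after repeated failures, calls to a dependency fail fast for a while instead of piling up. The following is a minimal Python sketch of that idea only. It is not Expeditor's API; the class name, method names, and thresholds are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Single-threaded circuit breaker sketch (hypothetical, not Expeditor)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a backend, for example `breaker.call(fetch_user, user_id)`, where `fetch_user` is whatever function performs the remote call. Every application has to embed and configure something like this, which is the per-language cost the library approach carries.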
Go, Python, Rust, Java, Swift apps?
• Limitation of library model approach
  ‣ More for product development
  ‣ Controlling library versions is hard
• Planned to develop our own proxy, combined with consul-template
Service mesh to the rescue
At SREcon Americas 2017: "Lyft's Envoy: Experiences Operating a Large Service Mesh"
Replacing libraries with a proxy
control-plane
Introducing and operating service mesh
Timeline
• Early 2017: planning
• Late 2017: building the MVP
• Early 2018: generally available
Envoy
• Publicly released in mid-2016
• Lightweight
• Graceful reloading
• gRPC support
https://github.com/envoyproxy/envoy
Plan: in-house
• Early 2017: no Istio yet
• We use Amazon ECS
• Not using the full feature set of Envoy
• Only the resiliency and observability parts
Goals
• Resiliency settings controlled by Ops
  ‣ Centrally managed
  ‣ Review flow
• All metrics should go into Prometheus
• Low operational cost
  ‣ Fewer components, use of managed services
Configuration contents
• Jsonnet
• Route config
  ‣ Retry, timeouts for paths, domains
• Cluster config
  ‣ DNS name of internal ELB
  ‣ Circuit breaker settings
https://github.com/cookpad/kumonos
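To make the split between route config and cluster config concrete, here is a rough Python sketch that turns one service's dependency list into those two sections. It only illustrates the kind of information listed above. The field names, the `build_mesh_config` function, and the example hostname are assumptions made for this sketch; they are neither the kumonos schema nor Envoy's actual configuration format.

```python
import json


def build_mesh_config(service_name, dependencies):
    """Turn per-dependency resiliency settings into route and cluster sections.

    All keys here are hypothetical names chosen for illustration.
    """
    routes, clusters = [], []
    for dep in dependencies:
        # Route side: how requests to this dependency are retried and timed out.
        routes.append({
            "domain": dep["name"],
            "path_prefix": dep.get("path_prefix", "/"),
            "timeout_ms": dep["timeout_ms"],
            "retry": {"num_retries": dep["num_retries"], "retry_on": "5xx"},
        })
        # Cluster side: where the dependency lives and how it is protected.
        clusters.append({
            "name": dep["name"],
            "dns_name": dep["elb_dns_name"],  # internal ELB in front of the service
            "circuit_breaker": {"max_connections": dep["max_connections"]},
        })
    return {"service": service_name, "routes": routes, "clusters": clusters}


def render(config):
    """Serialize the config deterministically so that diffs are easy to review."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Calling `render(build_mesh_config("recipe", [{"name": "user", "elb_dns_name": "user.example.internal", "timeout_ms": 3000, "num_retries": 2, "max_connections": 64}]))` yields a JSON document that can be kept in a repository and changed through pull requests, which matches the centrally managed, reviewable settings named in the goals.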
Drop statsd-relay
• Add tags to metrics using the DogStatsD format
• Fewer components are preferable
  ‣ Sent PRs to Envoy
  ‣ dog_statsd sink and fixed tag configuration are now available
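For reference, the DogStatsD format extends a plain StatsD line with a trailing `|#key:value,...` tag section. The short Python sketch below formats such a line. The function name and the metric and tag names used with it are illustrative only and are not taken from Cookpad's setup.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram line.

    metric_type is a StatsD type code such as "c" (counter), "g" (gauge),
    or "ms" (timer). Tags are appended as "|#key:value,key:value".
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the output is stable regardless of dict ordering.
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line
```

With tags carried on the metric itself, a fixed tag such as the service name does not have to be attached by a separate relay process.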
gRPC infrastructure
• Need an L7 proxy for HTTP/2 traffic
• Let's extend the control plane
ServiceDiscoveryService API
• lyft/discovery
  ‣ Reference implementation of SDS API
• Moved to cookpad/sds
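As a rough picture of what a service discovery service does, the Python sketch below keeps an in-memory registry where hosts check in periodically and lookups return only hosts seen recently. It is not the code of lyft/discovery or cookpad/sds. The `Registry` class, the TTL behaviour, and the `{"hosts": [{"ip_address": ..., "port": ...}]}` response shape are assumptions made for illustration.

```python
import time


class Registry:
    """In-memory host registry with check-in expiry (hypothetical sketch)."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._hosts = {}  # (service, ip_address, port) -> last check-in time

    def register(self, service, ip_address, port):
        """Record a check-in; calling it again refreshes the host's timestamp."""
        self._hosts[(service, ip_address, port)] = self.clock()

    def hosts(self, service):
        """Return the hosts of a service that checked in within the TTL."""
        now = self.clock()
        live = [
            {"ip_address": ip, "port": port}
            for (svc, ip, port), seen in sorted(self._hosts.items())
            if svc == service and now - seen <= self.ttl
        ]
        return {"hosts": live}
```

A proxy would poll such an endpoint for each upstream cluster and route only to the hosts returned, so instances that stop checking in drop out of rotation once their TTL expires.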
The hard part with ECS
• Copy the current ECS service
• Wait, then switch
• Delete the old one
Generally available
Operations
Dashboards
• Prometheus
• Grafana
  ‣ Per service
  ‣ Per service-to-service
  ‣ Envoy instances
• Vizceral
  ‣ promviz, promviz-front
Envoy on EC2
• Build and distribute as an in-house deb package
• Manage as a systemd service
• Use hot-restarter.py
  ‣ Generate a starter script for each host role
wait-side-car
• Sidecar Envoy containers need a few seconds to be up
  ‣ For background jobs
• Wrapper command-line tool
  ‣ cookpad/wait-side-car https://github.com/cookpad/wait-side-car
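The wrapper idea can be sketched in a few lines of Python: poll until the sidecar accepts connections, then run the real command. This is only an illustration of the approach and not the implementation of cookpad/wait-side-car, which may check readiness differently. The function names, the TCP check, and the default port are assumptions.

```python
import socket
import subprocess
import sys
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Poll until a TCP connection to host:port succeeds or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def run_after_sidecar(command, host="127.0.0.1", port=9211, timeout=10.0):
    """Run command (a list of arguments) once the sidecar is reachable.

    Returns the command's exit code, or 1 if the sidecar never came up.
    The default port is a placeholder, not a value from the talk.
    """
    if not wait_for_port(host, port, timeout):
        print(f"sidecar not reachable at {host}:{port}", file=sys.stderr)
        return 1
    return subprocess.call(command)
```

A background job would then start through the wrapper, for example `run_after_sidecar(["bundle", "exec", "rake", "some:job"])`, so that its first outbound request does not race the proxy's startup.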
https://techlife.cookpad.com/entry/2018/04/02/140846
Key results
Resiliency
• Eliminated temporary bursts of errors from backend services
• Faster review and deployment of settings
• Fault isolation: no notable results yet
Observability
• Less time needed to detect root causes of service communication issues
• Visualization of how the resilience mechanisms are working
• One of the sources of Service Level Indicators
Continuous growth of the platform
• Improve the application platform without application deployments
• Increased velocity of the platform development team
Next challenges
Next challenges
• v2 xDS migration
• More effective traffic control
• Chaos engineering platform
• Distributed tracing
• Authentication and authorization (authn/authz)
Wrap up
Wrap up
• Issues around service communications
• Introduced a service mesh instead of taking the library approach
• Key results: resiliency, observability, platform improvement
Q&A
• Twitter: @taiki45
• http://techlife.cookpad.com/
• EnvoyCon 2018 https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/co-located-events/envoycon/