Walking the minefield of Service Mesh

drequena
September 23, 2024

Running a production-ready service mesh is VERY hard. From fully understanding the mesh architecture and installing it, to scaling, upgrading, monitoring, and securing it, building the confidence to release it into production can take many months.

For more than 2 years, our Traffic team has been running a large-scale, Istio-based service mesh. We faced tons of problems during the many phases of a project this big: from picking initial features and choosing the Mesh architecture, to hiding Mesh complexity from users, to deciding when to promote the Mesh to production.

In this talk we share our biggest stumbles across a variety of subjects related to operation, maintenance, training, monitoring, and abstraction of our Mesh solution. We believe that by sharing some hard-to-find tips and tricks, people and organizations can save a lot of time when adopting a Service Mesh.

Join us in this minefield called Service Mesh and learn how to avoid its main dangers.

Presented at: DevOpsDays Portugal (Porto), 24/09/2024

Transcript

  1. Agenda ➔ Whoami ➔ History and Context ➔ The BIG question ➔ Avoiding mines ➔ Conclusions ➔ Questions? ➔ References
  2. $Whoami: Dad, Husband, Nerd. Bachelor's in Computer Science, Master's in Computer Engineering. 20+ years of experience in Sysadmin/DevOps/SRE, etc. Senior Software Engineer at iFood, @traffic team. Daniel Requena
  3. Special Thanks: Jhonn Frazão, Eduardo Baitello, Débora Berte, Fagner Luz, Jhonatan Morais, Edson Almeida, Fernando Junior, Kelvin Lopes
  4. History and Context: iFood? Big numbers • Brazilian food delivery company • 100+ million orders per month • ~5,500 employees / ~2k engineers • ~250k/600k RPS • 8k+ deploys per month • ~3k microservices • 54+ Kubernetes clusters
  5. History and Context: Our Mesh • Istio based ◦ sidecar model • Single Mesh / Multi-Cluster / Multi-Primary / Multi-Network (see the sketch below) ◦ sandbox / production ◦ k8s only (no VMs) • Running since Q1 2022 • Current workload adoption: 70%+ • Current traffic flow: 75%+
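As a rough illustration of what "Single Mesh / Multi-Cluster / Multi-Primary / Multi-Network" means in Istio terms, here is a minimal sketch of the per-cluster install values such a topology sets. The mesh, cluster, and network names are invented for the example and are not iFood's actual values.

```yaml
# Per-cluster IstioOperator values for a multi-primary, multi-network
# mesh: meshID is shared by all clusters, while clusterName and
# network identify this cluster (all names here are hypothetical).
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: mesh1
      multiCluster:
        clusterName: cluster-east-1
      network: network-east
```

In a multi-primary setup, each cluster runs its own istiod and reaches workloads on other networks through east-west gateways.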
  6. History and Context: Our Mesh • Features ◦ mTLS (sketched below) ◦ Authn/Authz ◦ Traffic management ▪ Canary ▪ Retry policy ▪ Circuit breaking ▪ Rate limit ◦ Telemetry ◦ Traces ◦ Service Map ◦ + some custom extensions • Important role in our multi-region strategy
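To make the first feature concrete: in Istio, mesh-wide mTLS can be a single resource. This is a minimal sketch of the standard approach, not necessarily how iFood rolled it out.

```yaml
# Mesh-wide strict mTLS: a PeerAuthentication in the root namespace
# (istio-system by default) requires mutual TLS for all
# sidecar-to-sidecar traffic in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

During gradual adoption, PERMISSIVE mode lets meshed and non-meshed workloads keep talking while sidecars roll out.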
  7. The BIG question: Should I use Service Mesh? • "one"-person team? • lots of other responsibilities? • only a few applications? • one or two programming languages and/or frameworks? • need only the "basic" features? Probably not.
  8. The BIG question: How did we do it? • Multidisciplinary team (6 months): research && tests, sandbox setup • Hybrid Ops team (8 months): production setup, controlled adoption, full gRPC adoption • Mesh team (16 months): platform maturity, new features, 70%+ prod adoption • Traffic team (ongoing): continuous improvement, new features, 100% adoption
  9. Avoiding Mines: Start SLOW! • Features ◦ mTLS ◦ Authn/Authz ◦ Traffic management ▪ Canary ▪ Retry policy ▪ Circuit breaking ▪ Rate limit ◦ Telemetry ◦ Traces ◦ Service Map ◦ + some custom extensions
  10. Avoiding Mines: Start SLOW! Remember: just because the Mesh can do it, it DOESN'T mean it's the BEST place to do it. ◦ Egress control ◦ Service Map ◦ Chaos Engineering
  11. Avoiding Mines: Start small • All Meshes offer granular "opt-in" choices ◦ per namespace ◦ per workload (see the sketch below) • Per environment ◦ sandbox ◦ production • Be aware ◦ small downtime can happen
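In Istio, for instance, the namespace-level opt-in is a single label; the namespace name below is hypothetical.

```yaml
# Namespace-level opt-in: new pods in this namespace get a sidecar
# injected automatically.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout            # hypothetical namespace
  labels:
    istio-injection: enabled
```

Within an opted-in namespace, a single workload can still opt out with the `sidecar.istio.io/inject: "false"` label (an annotation in older Istio versions) on its pod template, which is what makes gradual, per-workload rollouts possible. Note that flipping a running workload in or out requires a pod restart, which is where the "small downtime" warning comes from.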
  12. Avoiding Mines: Start small. How did we do it? • Sandbox setup ◦ cherry-pick workloads ▪ different clusters ▪ different types (BFFs, APIs, Workers, SSR, etc.) ◦ sandbox free-to-go <2-3 months> • Production setup ◦ continuous adoption • We never forced adoption on teams (until now)
  13. Avoiding Mines: Abstract it, but don't make it invisible • Not exclusive to the Mesh ◦ too many parameters ▪ which ones to make configurable? ◦ too dangerous ◦ guardrails are a must • 80/20 rule of thumb ◦ e.g. the Retry feature: retry on (503, refused-stream, unavailable), sketched below • Abstraction examples ◦ how to turn on the Mesh ◦ Authn/Authz
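A sketch of what that 80/20 retry default could look like as an Istio VirtualService. The service name and timeout are hypothetical; the retryOn values are the ones named on the slide, not necessarily iFood's exact manifest.

```yaml
# An opinionated retry default the platform can expose to users as a
# simple on/off knob, rendered into a VirtualService like this.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders              # hypothetical service
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
      retries:
        attempts: 2
        perTryTimeout: 500ms      # illustrative value
        retryOn: "503,refused-stream,unavailable"
```

Pinning retryOn while exposing only an on/off switch (or a small attempts range) is the guardrail: users cannot accidentally retry conditions that are unsafe for their workload.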
  14. Avoiding Mines: Abstract it, but don't make it invisible • A new way to understand app-to-app communication • from server-side to client-side features
  15. Avoiding Mines: Abstract it, but don't make it invisible • A new way to understand app-to-app communication • Avoid misconfigurations (see the sketch below) ◦ retries ◦ load balancing ◦ timeout ◦ etc.
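These client-side behaviors map mostly onto Istio's DestinationRule (timeouts live in the VirtualService). A hypothetical sketch of the kind of defaults a platform might pin down so teams don't get them wrong; every value is illustrative.

```yaml
# Sane client-side defaults applied per service: load-balancing
# algorithm, idle-timeout handling, and outlier detection (a form of
# circuit breaking).
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders              # hypothetical service
spec:
  host: orders
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    connectionPool:
      http:
        idleTimeout: 60s    # keep below upstream idle timeouts
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```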
  16. Avoiding Mines: The Mesh effect. "The assumption that ANY problem that a workload has is the MESH's fault." • "Hello support team, we enabled the Mesh in our app two weeks ago and now we see an increase in latency" • "Our app NEVER had this behavior before, maybe it is the Mesh" • "We keep seeing 503 after a deploy and we believe the reason is the Mesh" • "The Mesh is making our POD restart" • "Could that be the Mesh?" • "I can't build the app on my machine because of the Mesh" • "I'm miserable BECAUSE of the Mesh!" • "My DOG is sick after our app adopted the Mesh, how can I fix that?" • "According to the Traces, there is something called Envoy that is holding our requests…"
  17. Avoiding Mines: The Mesh effect. Spread empathy and knowledge • Documentation • Videos • Presentations • Enablers (10/10) • "With new metrics come old, unknown problems" (uncle Istio) • Support-team-focused training • New dashboards customized for Mesh workloads • Mesh workload alerts (an illustrative rule below)
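As one illustration of a "Mesh workload alert", here is a hypothetical Prometheus rule over the standard Istio request metric; the threshold, window, and grouping are invented for the example.

```yaml
# Alert when a meshed workload's server-side 5xx ratio exceeds 5%.
# istio_requests_total is the standard Istio telemetry metric.
groups:
  - name: mesh-workloads
    rules:
      - alert: MeshWorkloadHigh5xxRatio
        expr: |
          sum by (destination_workload) (
            rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
          /
          sum by (destination_workload) (
            rate(istio_requests_total{reporter="destination"}[5m])) > 0.05
        for: 10m
        annotations:
          summary: "High 5xx ratio on {{ $labels.destination_workload }}"
```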
  18. Avoiding Mines: Know your proxy! • Unless you run L4-only (in a sidecarless setup), your application is now AS resilient AS your proxy ◦ resources ◦ scalability ◦ alerts ◦ metrics
  19. Avoiding Mines: Know your proxy! • Some proxy problems ◦ timeouts ◦ connection terminations ◦ idle timeout ◦ connection drain ◦ different header-handling standards • Proxy Layer 4 (see the sketch below) ◦ port exclusion (e.g. DB port, Kafka, AWS metadata, etc.) ◦ protocol exclusion (UDP?) ◦ IP ranges/networks exclusion
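In Istio, these exclusions are pod-template annotations that carve traffic out of the sidecar's iptables redirection. The ports and CIDR below are examples matching the slide (a DB port, Kafka, the AWS metadata endpoint), not a recommendation.

```yaml
# Traffic matching these annotations bypasses the sidecar entirely
# and flows directly, unproxied.
apiVersion: v1
kind: Pod
metadata:
  name: example-app         # hypothetical workload
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: "5432,9092"
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254/32"
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
```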
  20. Avoiding Mines: Know your proxy! Sidecar nightmares • start/stop ordering (solved by k8s 1.28+ native sidecars?) • resource definition (see the sketch below) • resource-based HPA ◦ memory ◦ cpu. Slide example: HPA targets CPU 80% / Mem 70%; main app requests memory 512MB / cpu 500m; sidecar requests memory 200MB / cpu 200m.
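The root of the HPA problem is that resource-based HPA computes utilization over the requests of all containers in the pod, so an oversized or undersized proxy skews the target. Istio lets you size the proxy per workload with annotations; a hypothetical sketch with illustrative values:

```yaml
# Per-workload sidecar sizing on the pod template. Right-sizing the
# proxy keeps pod-level CPU/memory utilization targets meaningful.
apiVersion: v1
kind: Pod
metadata:
  name: example-app         # hypothetical workload
  annotations:
    sidecar.istio.io/proxyCPU: "200m"
    sidecar.istio.io/proxyMemory: "200Mi"
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
```

Kubernetes' ContainerResource HPA metric type, which targets a single container instead of the pod-wide sum, is another way around the skew.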
  21. Avoiding Mines: It's a platform. Happy with Kubernetes upgrades? /s Cool, now you have another layer to upgrade • CRDs • APIs • internal structures • proxy behaviour. Create docs/procedures/tests (automate whenever possible). Be aware: even with tests, things can go south.
  22. Avoiding Mines: It's a platform. The Istio upgrade monster 👻 • Benchmarks scared us ◦ difficult ◦ error-prone ◦ "we are far behind the supported version" • Sandbox for the win! • began at 1.12 • today at 1.22 • Revision-based FROM DAY 1 (see the sketch below)
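Revision-based (canary) upgrades mean installing each Istio version as a named revision and moving namespaces over one at a time. A hypothetical sketch of the namespace side, assuming the revision is named after the version:

```yaml
# After installing the new control plane as a revision, e.g.:
#   istioctl install --set revision=1-22
# point a namespace at it and restart its workloads to pick up the
# new sidecar. The istio.io/rev label replaces istio-injection=enabled.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout            # hypothetical namespace
  labels:
    istio.io/rev: "1-22"
```

The old control plane keeps serving untouched namespaces, so a bad upgrade can be rolled back per namespace instead of mesh-wide.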
  23. Conclusions • Service Mesh is a powerful tool! Make it a super-powerful platform • Adopting Service Mesh is a BIG challenge • Consider creating a Mesh team (at least a temporary one) • Start slow, start small, share empathy and knowledge • At least for us, it was worth it.
  24. References • [Istio slow adoption] https://youtu.be/8pgn5UaHkmQ?si=AqcErZ1aq4oJ7zFt • [Service Mesh is not the solution] https://youtu.be/5quqbj8npRo?si=ByLVtoOJ7qtCzE5l • [Istio Ratelimit in iFood] https://www.youtube.com/watch?v=GGnCq3B2J8A • [Load Balancing gRPC] https://majidfn.com/blog/grpc-load-balancing/ • [LB gRPC with Service Mesh] https://www.useanvil.com/blog/engineering/load-balancing-grpc-in-kubernetes-with-istio/ • [Kubernetes native sidecars] https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/ • [Istio sidecar k8s support] https://istio.io/latest/blog/2023/native-sidecars/