Slide 1

Walking the Minefield of Service Mesh
DevOpsDays Portugal, Sep/24
by: Daniel Requena

Slide 2

Scan me

Slide 3

Agenda
➔ Whoami
➔ History and Context
➔ The BIG question
➔ Avoiding mines
➔ Conclusions
➔ Questions?
➔ References

Slide 4

$Whoami
Daniel Requena
● Dad, Husband, Nerd
● Bachelor's in Computer Science, Master's in Computer Engineering
● 20+ years of experience in Sysadmin/DevOps/SRE, etc.
● Senior Software Engineer at iFood @ traffic team

Slide 5

Special Thanks Jhonn Frazão Eduardo Baitello Débora Berte Fagner Luz Jhonatan Morais Edson Almeida Fernando Junior Kelvin Lopes

Slide 6

History and Context: iFood?
Big numbers:
● Brazilian food delivery company
● 100+ million orders per month
● ~5,500 employees / ~2k engineers
● ~250k/600k RPS
● 8k+ deploys per month
● ~3k microservices
● 54+ Kubernetes clusters

Slide 7

What is Service Mesh? The shift left of distributed systems

Slide 8

History and Context: Our Mesh
● Istio based
  ○ sidecar model
● Single mesh / multi-cluster / multi-primary / multi-network
  ○ sandbox / production
  ○ k8s only (no VMs)
● Running since Q1 2022
● Current workload adoption: 70%+
● Current traffic flow: 75%+

Slide 9

History and Context Our Mesh

Slide 10

History and Context: Our Mesh
● Features
  ○ mTLS
  ○ Authn/Authz
  ○ Traffic management
    ■ Canary
    ■ Retry policy
    ■ Circuit breaking
    ■ Rate limiting
  ○ Telemetry
  ○ Traces
  ○ Service map
  ○ + some custom extensions
● Important role in our multi-region strategy
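For illustration, a feature like mTLS is usually a one-line policy in an Istio-based mesh. A minimal sketch, assuming Istio's standard `PeerAuthentication` API and an illustrative namespace name (the deck does not show its actual policies):

```yaml
# Hypothetical example: enforce STRICT mTLS for every workload in one namespace.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: checkout   # assumption: illustrative namespace
spec:
  mtls:
    mode: STRICT        # sidecars reject plaintext traffic to these pods
```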

Slide 11

The BIG question: Should I use Service Mesh?
● "one" person team?
● lots of other responsibilities?
● only a few applications?
● one or two programming languages and/or frameworks?
● need the "basic" features?
Probably not.

Slide 12

The BIG question: How did we do it?
● Multi-disciplinary team (6 months): research && tests, sandbox setup
● Hybrid Ops team (8 months): production setup, controlled adoption, full gRPC adoption
● Mesh team (16 months): platform maturity, new features, 70%+ prod adoption
● Traffic team: continuous improvement, new features, 100% adoption

Slide 13

The BIG question

Slide 14

Avoiding Mines Start SLOW!

Slide 15

Avoiding Mines: Start SLOW!
● Features
  ○ mTLS
  ○ Authn/Authz
  ○ Traffic management
    ■ Canary
    ■ Retry policy
    ■ Circuit breaking
    ■ Rate limiting
  ○ Telemetry
  ○ Traces
  ○ Service map
  ○ + some custom extensions

Slide 16

Avoiding Mines: Start SLOW!
Remember:
● Just because the Mesh CAN do it DOESN'T mean it's the BEST place to do it.
  ○ Egress control
  ○ Service map
  ○ Chaos engineering

Slide 17

Avoiding Mines: Start small
● All meshes offer granular "opt-in" choices
  ○ per namespace
  ○ per workload
● Per environment
  ○ sandbox
  ○ production
● Be aware
  ○ small downtime can happen
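As a hedged illustration of the per-namespace opt-in, assuming Istio's standard injection label and an illustrative namespace name (a revision-based setup like the one described later would use the `istio.io/rev` label instead):

```yaml
# Hypothetical example: opt a single namespace into the mesh.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout               # assumption: illustrative namespace
  labels:
    istio-injection: enabled   # new pods here get the sidecar; other namespaces are untouched
```

Individual workloads can still opt out with the `sidecar.istio.io/inject: "false"` pod label.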

Slide 18

Avoiding Mines: Start small
How we did it:
● Sandbox setup
  ○ cherry-pick workloads
    ■ different clusters
    ■ different types (BFFs, APIs, workers, SSR, etc.)
  ○ sandbox free to go (2-3 months)
● Production setup
  ○ continuous adoption
● Never forced adoption on the teams (until now)

Slide 19

Avoiding Mines Start small (if you can 😅)

Slide 20

Avoiding Mines: Abstract it, but don't make it invisible
● Not a Mesh exclusivity
  ○ too many parameters
    ■ which ones to make configurable?
  ○ too dangerous
  ○ guardrails are a must
● 80/20 rule of thumb
  ○ retry feature
  ○ retry on (503, refused-stream, unavailable)
● Abstraction examples
  ○ how to turn on the Mesh
  ○ Authn/Authz
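The 80/20 retry rule above maps directly onto Istio's retry policy. A minimal sketch, assuming a hypothetical service name and route (not from the deck):

```yaml
# Hypothetical example: the "safe default" retry policy the platform could expose to teams.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders                  # assumption: illustrative service
spec:
  hosts:
    - orders.checkout.svc.cluster.local
  http:
    - route:
        - destination:
            host: orders.checkout.svc.cluster.local
      retries:
        attempts: 2
        perTryTimeout: 1s
        retryOn: 503,refused-stream,unavailable   # the deck's 80/20 default conditions
```

Teams would toggle this via the abstraction; the guardrail is that only `attempts` and a vetted `retryOn` set are configurable.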

Slide 21

Avoiding Mines Abstract it, but don't make it invisible

Slide 22

Avoiding Mines Abstract it, but don't make it invisible A new way to understand app-to-app communication ● from server-side to client-side features

Slide 23

Avoiding Mines: Abstract it, but don't make it invisible
A new way to understand app-to-app communication
● Avoid misconfigured settings
  ○ retries
  ○ load balancing
  ○ timeouts
  ○ etc.

Slide 24

Avoiding Mines: The Mesh effect
"The assumption that ANY problem a workload has is the MESH's fault"
● "Hello support team, we enabled the Mesh in our app two weeks ago and now we see an increase in latency"
● "Our app NEVER had this behavior before, maybe it is the Mesh"
● "We keep seeing 503 after a deploy and we believe the reason is the Mesh"
● "The Mesh is making our POD restart"
● "Could that be the Mesh?"
● "I can't build the app in my machine because of the Mesh"
● "I'm miserable BECAUSE of the Mesh!"
● "My DOG is sick after our app adopted the Mesh, how can I fix that?"
● "According to the traces, there is something called Envoy that is holding our requests…"

Slide 25

Avoiding Mines: The Mesh effect
Spread empathy and knowledge
● Documentation
● Videos
● Presentations
● Enablers (10/10)
● "With new metrics come old, unknown problems" (uncle Istio)
● Focused training for the support team
● New dashboards customized for Mesh workloads
● Alerts for Mesh workloads

Slide 26

Avoiding Mines The Mesh effect

Slide 27

Avoiding Mines The Mesh effect

Slide 28

Avoiding Mines The Mesh effect

Slide 29

Avoiding Mines The Mesh effect

Slide 30

Avoiding Mines: Know your proxy!
● Unless you run L4 only (in a sidecarless setup)…
● …your application is now AS resilient AS your proxy
  ○ resources
  ○ scalability
  ○ alerts
  ○ metrics

Slide 31

Avoiding Mines: Know your proxy!
● Some proxy problems
  ○ timeouts
  ○ connection terminations
  ○ idle timeout
  ○ connection drain
  ○ different header-handling standards
● Proxy at Layer 4
  ○ port exclusion (e.g.: DB port, Kafka, AWS metadata, etc.)
  ○ protocol exclusion (UDP?)
  ○ IP ranges/networks exclusion
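For the Layer 4 exclusions above, Istio exposes per-pod traffic annotations. A hedged sketch with illustrative ports and ranges (not iFood's actual values):

```yaml
# Hypothetical example: keep the sidecar out of the DB, Kafka, and AWS metadata paths.
apiVersion: v1
kind: Pod
metadata:
  name: example-app   # assumption: illustrative pod
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: "5432,9092"               # e.g. Postgres, Kafka
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254/32"   # AWS metadata endpoint
spec:
  containers:
    - name: app
      image: example/app:latest   # assumption: illustrative image
```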

Slide 32

Avoiding Mines: Know your proxy!
Sidecar nightmares
● start/stop ordering (solved in k8s 1.28+?)
● resource definition
● resource-based HPA
  ○ memory
  ○ cpu
[Diagram: main app (requests: 512MB memory / 500m CPU) and sidecar (requests: 200MB / 200m) behind an HPA targeting CPU 80% / memory 70%]
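Kubernetes 1.28's native sidecar support (linked in the references) addresses the start/stop ordering pain. A minimal sketch with illustrative names and tags:

```yaml
# Hypothetical example: a native sidecar is an init container with restartPolicy: Always.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar   # assumption: illustrative pod
spec:
  initContainers:
    - name: proxy
      image: istio/proxyv2:1.22.0   # assumption: illustrative image tag
      restartPolicy: Always         # sidecar semantics: starts before and stops after the app
  containers:
    - name: app
      image: example/app:latest     # assumption: illustrative image
```

This removes the classic races: the app no longer starts before the proxy is ready, and Jobs no longer hang because the sidecar never exits.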

Slide 33

Avoiding Mines: It's a platform
Happy with Kubernetes upgrades? /s
Cool, now you have another layer to upgrade:
● CRDs
● APIs
● internal structures
● proxy behaviour
Create:
● docs/procedures/tests (automate whenever possible)
Be aware: even with tests, things can go south.

Slide 34

Avoiding Mines: It's a platform
The Istio upgrade monster 👻
● Benchmarks scared us
  ○ difficult
  ○ error prone
  ○ "we are far behind the supported version"
● Sandbox for the win!
● began on 1.12
● today on 1.22
Revision-based FROM DAY 1
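Revision-based (canary) upgrades run old and new control planes side by side and move namespaces between them gradually. A hedged sketch of the flow, assuming standard `istioctl` commands and an illustrative namespace and revision pair (the deck does not show its actual procedure):

```shell
# Hypothetical example: canary-upgrade the control plane with revisions.
istioctl install --set revision=1-22 -y   # install the new control plane alongside the old one

# Point one namespace at the new revision, then restart its workloads to re-inject sidecars.
kubectl label namespace checkout istio.io/rev=1-22 --overwrite   # assumption: namespace name
kubectl rollout restart deployment -n checkout

# Once every namespace is migrated and healthy, remove the old revision.
istioctl uninstall --revision=1-12 -y
```

The payoff is a rollback path: relabel the namespace back to the old revision and restart, instead of an all-or-nothing in-place upgrade.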

Slide 35

Avoiding Mines It's a platform

Slide 36

Conclusions
● Service Mesh is a powerful tool! Make it a super-powerful platform
● Adopting Service Mesh is a BIG challenge
● Consider creating a Mesh team (at least a temporary one)
● Start slow, start small, share empathy and knowledge
● At least for us, it was worth it.

Slide 37

Conclusions

Slide 38

Questions?

Slide 39

Contacts
Daniel Requena
● https://www.linkedin.com/in/danielrequena/
● https://bolha.us/@requena
● https://github.com/drequena/
● https://speakerdeck.com/drequena
● https://twitter.com/Daniel_Requena

Slide 40

References
● [Istio slow adoption] https://youtu.be/8pgn5UaHkmQ?si=AqcErZ1aq4oJ7zFt
● [Service Mesh is not the solution] https://youtu.be/5quqbj8npRo?si=ByLVtoOJ7qtCzE5l
● [Istio Ratelimit at iFood] https://www.youtube.com/watch?v=GGnCq3B2J8A
● [Load balancing gRPC] https://majidfn.com/blog/grpc-load-balancing/
● [LB gRPC with Service Mesh] https://www.useanvil.com/blog/engineering/load-balancing-grpc-in-kubernetes-with-istio/
● [Kubernetes native sidecar] https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/
● [Istio sidecar k8s support] https://istio.io/latest/blog/2023/native-sidecars/