Walking the minefield of Service Mesh

drequena
September 23, 2024

Running a production-ready service mesh is VERY hard. From fully understanding the mesh architecture and installing it, to scaling, upgrading, monitoring, and securing it, building the confidence to release it into production can take many months.

For more than 2 years, our Traffic team has been running a large-scale, Istio-based service mesh. We faced tons of problems during the many phases of a project this big: from picking initial features and choosing the Mesh architecture, to hiding Mesh complexity from users, to deciding when to promote the Mesh to production.

In this talk we share our biggest stumbles across a variety of subjects related to operation, maintenance, training, monitoring, and abstraction of our Mesh solution. We believe that by sharing some hard-to-find tips and tricks, people and organizations can save a lot of time when adopting a Service Mesh.

Join us in this minefield called Service Mesh and learn how to avoid its main dangers.

Presented at: DevOpsDays Portugal (Porto), 24/09/2024

Transcript

  1. Agenda ➔ Whoami ➔ History and Context ➔ The BIG question ➔ Avoiding mines ➔ Conclusions ➔ Questions? ➔ References
  2. $Whoami: Dad, Husband, Nerd. Bachelor's in Computer Science, Master's in Computer Engineering. 20+ years of experience in Sysadmin/DevOps/SRE, etc. Senior Software Engineer at iFood, @traffic team. Daniel Requena
  3. Special Thanks: Jhonn Frazão, Eduardo Baitello, Débora Berte, Fagner Luz, Jhonatan Morais, Edson Almeida, Fernando Junior, Kelvin Lopes
  4. History and Context: iFood? Big numbers • Brazilian food delivery company • 100+ million orders per month • ~5,500 employees / ~2k engineers • ~250k/600k RPS • 8k+ deploys per month • ~3k microservices • 54+ Kubernetes clusters
  5. History and Context: Our Mesh • Istio based ◦ sidecar model • Single Mesh / Multi-Cluster / Multi-Primary / Multi-Network (see the sketch below) ◦ sandbox / production ◦ k8s only (no VMs) • Running since Q1 2022 • Current workload adoption: 70%+ • Current traffic flow: 75%+
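As a rough illustration of what "Single Mesh / Multi-Cluster / Multi-Primary / Multi-Network" means in Istio terms, here is a minimal sketch of the per-cluster install values such a topology sets. The mesh, cluster, and network names are invented for the example and are not iFood's actual values.

```yaml
# Per-cluster IstioOperator values for a multi-primary, multi-network
# mesh: meshID is shared by all clusters, while clusterName and
# network identify this cluster (all names here are hypothetical).
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: mesh1
      multiCluster:
        clusterName: cluster-east-1
      network: network-east
```

In a multi-primary setup, each cluster runs its own istiod and reaches workloads on other networks through east-west gateways.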
  6. History and Context: Our Mesh • Features ◦ mTLS (sketched below) ◦ Authn/Authz ◦ Traffic management ▪ Canary ▪ Retry policy ▪ Circuit breaking ▪ Rate limit ◦ Telemetry ◦ Traces ◦ Service Map ◦ + some custom extensions • Important role in our multi-region strategy
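To make the first feature concrete: in Istio, mesh-wide mTLS can be a single resource. This is a minimal sketch of the standard approach, not necessarily how iFood rolled it out.

```yaml
# Mesh-wide strict mTLS: a PeerAuthentication in the root namespace
# (istio-system by default) requires mutual TLS for all
# sidecar-to-sidecar traffic in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

During gradual adoption, PERMISSIVE mode lets meshed and non-meshed workloads keep talking while sidecars roll out.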
  7. The BIG question: Should I use Service Mesh? • "one"-person team? • lots of other responsibilities? • only a few applications? • one or two programming languages and/or frameworks? • need only the "basic" features? Probably not.
  8. The BIG question: How did we do it? • Multidisciplinary team (6 months): research && tests, sandbox setup • Hybrid Ops team (8 months): production setup, controlled adoption, full gRPC adoption • Mesh team (16 months): platform maturity, new features, 70%+ prod adoption • Traffic team (ongoing): continuous improvement, new features, 100% adoption
  9. Avoiding Mines: Start SLOW! • Features ◦ mTLS ◦ Authn/Authz ◦ Traffic management ▪ Canary ▪ Retry policy ▪ Circuit breaking ▪ Rate limit ◦ Telemetry ◦ Traces ◦ Service Map ◦ + some custom extensions
  10. Avoiding Mines: Start SLOW! Remember: just because the Mesh can do it, it DOESN'T mean it's the BEST place to do it. ◦ Egress control ◦ Service Map ◦ Chaos Engineering
  11. Avoiding Mines: Start small • All Meshes offer granular "opt-in" choices ◦ per namespace ◦ per workload (see the sketch below) • Per environment ◦ sandbox ◦ production • Be aware ◦ small downtime can happen
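In Istio, for instance, the namespace-level opt-in is a single label; the namespace name below is hypothetical.

```yaml
# Namespace-level opt-in: new pods in this namespace get a sidecar
# injected automatically.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout            # hypothetical namespace
  labels:
    istio-injection: enabled
```

Within an opted-in namespace, a single workload can still opt out with the `sidecar.istio.io/inject: "false"` label (an annotation in older Istio versions) on its pod template, which is what makes gradual, per-workload rollouts possible. Note that flipping a running workload in or out requires a pod restart, which is where the "small downtime" warning comes from.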
  12. Avoiding Mines: Start small. How did we do it? • Sandbox setup ◦ cherry-pick workloads ▪ different clusters ▪ different types (BFFs, APIs, Workers, SSR, etc.) ◦ sandbox free-to-go <2-3 months> • Production setup ◦ continuous adoption • We never forced adoption on teams (until now)
  13. Avoiding Mines: Abstract it, but don't make it invisible • Not exclusive to the Mesh ◦ too many parameters ▪ which ones to make configurable? ◦ too dangerous ◦ guardrails are a must • 80/20 rule of thumb ◦ e.g. the Retry feature: retry on (503, refused-stream, unavailable), sketched below • Abstraction examples ◦ how to turn on the Mesh ◦ Authn/Authz
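A sketch of what that 80/20 retry default could look like as an Istio VirtualService. The service name and timeout are hypothetical; the retryOn values are the ones named on the slide, not necessarily iFood's exact manifest.

```yaml
# An opinionated retry default the platform can expose to users as a
# simple on/off knob, rendered into a VirtualService like this.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders              # hypothetical service
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
      retries:
        attempts: 2
        perTryTimeout: 500ms      # illustrative value
        retryOn: "503,refused-stream,unavailable"
```

Pinning retryOn while exposing only an on/off switch (or a small attempts range) is the guardrail: users cannot accidentally retry conditions that are unsafe for their workload.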
  14. Avoiding Mines: Abstract it, but don't make it invisible • A new way to understand app-to-app communication • from server-side to client-side features
  15. Avoiding Mines: Abstract it, but don't make it invisible • A new way to understand app-to-app communication • Avoid misconfigurations (see the sketch below) ◦ retries ◦ load balancing ◦ timeout ◦ etc.
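These client-side behaviors map mostly onto Istio's DestinationRule (timeouts live in the VirtualService). A hypothetical sketch of the kind of defaults a platform might pin down so teams don't get them wrong; every value is illustrative.

```yaml
# Sane client-side defaults applied per service: load-balancing
# algorithm, idle-timeout handling, and outlier detection (a form of
# circuit breaking).
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders              # hypothetical service
spec:
  host: orders
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    connectionPool:
      http:
        idleTimeout: 60s    # keep below upstream idle timeouts
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```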
  16. Avoiding Mines: The Mesh effect. "The assumption that ANY problem that a workload has is the MESH's fault." • "Hello support team, we enabled the Mesh in our app two weeks ago and now we see an increase in latency" • "Our app NEVER had this behavior before, maybe it is the Mesh" • "We keep seeing 503 after a deploy and we believe the reason is the Mesh" • "The Mesh is making our POD restart" • "Could that be the Mesh?" • "I can't build the app on my machine because of the Mesh" • "I'm miserable BECAUSE of the Mesh!" • "My DOG is sick after our app adopted the Mesh, how can I fix that?" • "According to the Traces, there is something called Envoy that is holding our requests…"
  17. Avoiding Mines: The Mesh effect. Spread empathy and knowledge • Documentation • Videos • Presentations • Enablers (10/10) • "With new metrics come old, unknown problems" (uncle Istio) • Support-team-focused training • New dashboards customized for Mesh workloads • Mesh workload alerts (an illustrative rule below)
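As one illustration of a "Mesh workload alert", here is a hypothetical Prometheus rule over the standard Istio request metric; the threshold, window, and grouping are invented for the example.

```yaml
# Alert when a meshed workload's server-side 5xx ratio exceeds 5%.
# istio_requests_total is the standard Istio telemetry metric.
groups:
  - name: mesh-workloads
    rules:
      - alert: MeshWorkloadHigh5xxRatio
        expr: |
          sum by (destination_workload) (
            rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
          /
          sum by (destination_workload) (
            rate(istio_requests_total{reporter="destination"}[5m])) > 0.05
        for: 10m
        annotations:
          summary: "High 5xx ratio on {{ $labels.destination_workload }}"
```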
  18. Avoiding Mines: Know your proxy! • Unless you run L4-only (in a sidecarless setup), your application is now AS resilient AS your proxy ◦ resources ◦ scalability ◦ alerts ◦ metrics
  19. Avoiding Mines: Know your proxy! • Some proxy problems ◦ timeouts ◦ connection terminations ◦ idle timeout ◦ connection drain ◦ different header-handling standards • Proxy Layer 4 (see the sketch below) ◦ port exclusion (e.g. DB port, Kafka, AWS metadata, etc.) ◦ protocol exclusion (UDP?) ◦ IP ranges/networks exclusion
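In Istio, these exclusions are pod-template annotations that carve traffic out of the sidecar's iptables redirection. The ports and CIDR below are examples matching the slide (a DB port, Kafka, the AWS metadata endpoint), not a recommendation.

```yaml
# Traffic matching these annotations bypasses the sidecar entirely
# and flows directly, unproxied.
apiVersion: v1
kind: Pod
metadata:
  name: example-app         # hypothetical workload
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: "5432,9092"
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254/32"
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
```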
  20. Avoiding Mines: Know your proxy! Sidecar nightmares • start/stop ordering (solved by k8s 1.28+ native sidecars?) • resource definition (see the sketch below) • resource-based HPA ◦ memory ◦ cpu. Slide example: HPA targets CPU 80% / Mem 70%; main app requests memory 512MB / cpu 500m; sidecar requests memory 200MB / cpu 200m.
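The root of the HPA problem is that resource-based HPA computes utilization over the requests of all containers in the pod, so an oversized or undersized proxy skews the target. Istio lets you size the proxy per workload with annotations; a hypothetical sketch with illustrative values:

```yaml
# Per-workload sidecar sizing on the pod template. Right-sizing the
# proxy keeps pod-level CPU/memory utilization targets meaningful.
apiVersion: v1
kind: Pod
metadata:
  name: example-app         # hypothetical workload
  annotations:
    sidecar.istio.io/proxyCPU: "200m"
    sidecar.istio.io/proxyMemory: "200Mi"
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
```

Kubernetes' ContainerResource HPA metric type, which targets a single container instead of the pod-wide sum, is another way around the skew.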
  21. Avoiding Mines: It's a platform. Happy with Kubernetes upgrades? /s Cool, now you have another layer to upgrade • CRDs • APIs • internal structures • proxy behaviour. Create docs/procedures/tests (automate whenever possible). Be aware: even with tests, things can go south.
  22. Avoiding Mines: It's a platform. The Istio upgrade monster 👻 • Benchmarks scared us ◦ difficult ◦ error-prone ◦ "we are far behind the supported version" • Sandbox for the win! • began at 1.12 • today at 1.22 • Revision-based FROM DAY 1 (see the sketch below)
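Revision-based (canary) upgrades mean installing each Istio version as a named revision and moving namespaces over one at a time. A hypothetical sketch of the namespace side, assuming the revision is named after the version:

```yaml
# After installing the new control plane as a revision, e.g.:
#   istioctl install --set revision=1-22
# point a namespace at it and restart its workloads to pick up the
# new sidecar. The istio.io/rev label replaces istio-injection=enabled.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout            # hypothetical namespace
  labels:
    istio.io/rev: "1-22"
```

The old control plane keeps serving untouched namespaces, so a bad upgrade can be rolled back per namespace instead of mesh-wide.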
  23. Conclusions • Service Mesh is a powerful tool! Make it a super-powerful platform • Adopting Service Mesh is a BIG challenge • Consider creating a Mesh team (at least a temporary one) • Start slow, start small, share empathy and knowledge • At least for us, it was worth it.
  24. References • [Istio slow adoption] https://youtu.be/8pgn5UaHkmQ?si=AqcErZ1aq4oJ7zFt • [Service Mesh is not the solution] https://youtu.be/5quqbj8npRo?si=ByLVtoOJ7qtCzE5l • [Istio Ratelimit in iFood] https://www.youtube.com/watch?v=GGnCq3B2J8A • [Load Balancing gRPC] https://majidfn.com/blog/grpc-load-balancing/ • [LB gRPC with Service Mesh] https://www.useanvil.com/blog/engineering/load-balancing-grpc-in-kubernetes-with-istio/ • [Kubernetes native sidecars] https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/ • [Istio sidecar k8s support] https://istio.io/latest/blog/2023/native-sidecars/