All I wish I knew before running Istio in Production
KCD Porto - Portugal
Sep/24
by: Daniel Requena
Slide 2
Agenda
➔ Whoami
➔ Our environment
➔ What I wish I knew
➔ Questions?
➔ References
Slide 3
$Whoami
Daniel Requena
Dad, Husband, Nerd
Bachelor in Computer Science
Master in Computer Engineering
20+ years of experience in Sysadmin/DevOps/SRE…etc.
Staff Engineer at iFood
@traffic team
Slide 4
Special Thanks
Jhonn Frazão
Eduardo Baitello
Débora Berte
Fagner Luz
Jhonatan Morais
Edson Almeida
Fernando Junior
Kelvin Lopes
Slide 5
Our environment
iFood? Big numbers
● Brazilian food delivery company
● 100+ million orders per month
● ~5500 employees / ~2k engineers
● ~250k/600k RPS
● 8k+ deploys per month
● ~3k microservices
● 54+ Kubernetes clusters
Slide 6
Why this talk?
Slide 7
Our environment
Mesh
● Istio based
○ sidecar model
● Kubernetes only (no VMs)
● Running since Q1-2022
● Current workload adoption: +70%
● Current traffic flow: +75%
Slide 8
Our environment
Mesh
● Features
○ mTLS
○ Authn/Authz
○ Traffic management
■ Canary
■ Retry policy
■ Circuit breaking
■ Rate limit
○ Telemetry
○ Traces
○ Service Map (?)
○ + some custom extensions
● Important role in our multi-region strategy
Slide 9
What I wish I knew
Let's divide it into topics
● Concepts and mental model
● Setup/Upgrades
● Scalability
● Monitoring
● Sidecar/Proxy stuff
● Cost
● Misc
Slide 10
What I wish I knew
Concepts and mental model
● Istio is an "Envoy configurator", at least in sidecar mode (please, don't be mad)
Slide 11
What I wish I knew
Concepts and mental model
[Diagram: Istiod reads Istio CRDs, services, endpoints, … from the API Server and services, endpoints from the Remote API Server, and talks to the proxies over the xDS protocol]
Slide 12
What I wish I knew
Concepts and mental model
● What else does it do?
○ Adds its own rules and validations
○ It can choose different Envoy features
○ Has a mechanism for precedence and merge of objects
■ local ns
■ external ns
■ root ns (istio-system)
■ This rule can be affected by "ExportTo" configurations
■ some CRDs have different merge rules
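To make the "ExportTo" bullet concrete, here is a minimal sketch of a DestinationRule whose visibility is limited to its own namespace. All names (`team-a`, `service-b-defaults`, the host) are hypothetical and not taken from the talk; `exportTo` itself is a standard field on this resource.

```yaml
# Hypothetical example: names and namespace are illustrative only.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: service-b-defaults
  namespace: team-a
spec:
  host: service-b.team-a.svc.cluster.local
  exportTo:
    - "."            # "." = only the namespace this object lives in; "*" = all namespaces
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
```

With `exportTo: ["."]`, sidecars outside `team-a` should not see this rule, which changes which object wins when the local, external and root namespaces each define one.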
Slide 14
What I wish I knew
Concepts and mental model
● Most of the features are enforced on the client side (sidecar mode)
○ Load Balancing
○ Retry
○ Locality
○ Timeout
○ etc…
[Diagram: service-b.namespace.svc.cluster.local (service-b) → endpoint 100.127.2.1]
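Because retries and timeouts are applied by the caller's sidecar, the objects that configure them describe how clients talk to a service, not how the service behaves. A minimal sketch, assuming a plain HTTP service; the values (2s, 1 attempt, 500ms) are illustrative and not from the talk:

```yaml
# Hypothetical example: applied by every client sidecar that calls service-b.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b
  namespace: namespace
spec:
  hosts:
    - service-b.namespace.svc.cluster.local
  http:
    - route:
        - destination:
            host: service-b.namespace.svc.cluster.local
      timeout: 2s                 # total budget per request, enforced by the caller
      retries:
        attempts: 1               # one retry on top of the original try
        perTryTimeout: 500ms
        retryOn: connect-failure,refused-stream
```

A caller without a sidecar gets none of this, which is one practical consequence of client-side enforcement.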
Slide 15
What I wish I knew
Concepts and mental model
[Diagram: service-b.namespace.svc.cluster.local (service-b) → endpoints 100.127.2.1, 100.127.2.2, 100.127.2.3, 100.127.2.4]
Slide 16
What I wish I knew
Concepts and mental model
[Diagram: endpoint lists shown for each service]
service-b.namespace.svc.cluster.local (service-b): 100.127.2.1, 100.127.2.2, 100.127.2.3, 100.127.2.4
service-d: 100.127.3.1, 100.127.3.2, 100.127.3.3, 100.127.3.4
service-e: 100.127.5.1, 100.127.5.2, 100.127.5.3, 100.127.5.4
Slide 17
What I wish I knew
Concepts and mental model
● Envoy request workflow and "structures"
Endpoint list:
- 100.67.1.2
- 100.67.2.1
- 100.67.10.5
- …
Slide 18
What I wish I knew
Concepts and mental model
● Envoy request workflow and "structures"
○ istioctl proxy-config [structure] args…
○ istioctl proxy-config logs
○ istioctl proxy-status
Slide 20
What I wish I knew
Setup
● Choose WISELY
○ Mesh type
■ Single Mesh
■ Isolated Meshes
○ Network Model
■ Single
■ Multi
○ Control plane setup
■ Centralized
■ Decentralized
Our Setup
● Single Mesh
○ Per environment
● Multi-Cluster
○ Business units
● Multi-primary
○ Each cluster has its Istiod
● Multi-Network
○ AWS setup
○ k8s network setup
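For reference, a multi-primary, multi-network install is usually expressed per cluster with a few global values. This is a minimal sketch following the layout of the upstream install guides; `mesh1`, `cluster1` and `network1` are placeholders, not the values used at iFood.

```yaml
# Hypothetical per-cluster IstioOperator file; IDs are placeholders.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: mesh1              # same value in every cluster of the mesh
      multiCluster:
        clusterName: cluster1    # unique per cluster
      network: network1          # differs between clusters in a multi-network setup
```

Each cluster gets its own file like this, which matches the "each cluster has its Istiod" bullet.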
Slide 21
What I wish I knew
Setup
Slide 22
What I wish I knew
Setup
Slide 23
What I wish I knew
Setup
● Downsides
○ N:N K8s-to-Istio ratio (scalability)
○ Multiple upgrade processes
○ Namespace + Service "uniqueness"
○ Istio service discovery scope
○ East-West L4 is "problematic"
● Setup/maintenance processes
○ istioctl + istiooperator.yaml file (GitOps)
Slide 24
What I wish I knew
Upgrades
The mesh is a platform on its own…
● CRDs
● APIs
● Internal structures
● Proxy behaviour
Slide 25
What I wish I knew
Upgrades
The Istio upgrade monster 👻
● Benchmarks scared us
○ Difficult
○ Error-prone
○ "We are far behind the supported version"
● Sandbox for the win!
● Began on 1.12
● Today on 1.22
Revision-based FROM DAY 1
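With revision-based installs, a namespace opts into a specific control plane through the `istio.io/rev` label instead of `istio-injection=enabled`. A minimal sketch; the namespace name and the revision string `1-22-0` are hypothetical:

```yaml
# Hypothetical example: the revision name must match an installed istiod revision.
apiVersion: v1
kind: Namespace
metadata:
  name: my-team
  labels:
    istio.io/rev: 1-22-0   # sidecars in this namespace are injected by this revision
```

Moving a namespace to a new version then means changing this label and restarting the workloads, while the old revision keeps serving everything that has not moved yet.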
Slide 27
What I wish I knew
Upgrades
Slide 28
What I wish I knew
Upgrades
Slide 29
What I wish I knew
Scalability
● Istio, by default, is greedy
○ All namespaces and services are "consumed"
○ Proxy configs are one of the biggest reasons for
■ added latency
■ resource consumption
Slide 30
What I wish I knew
Scalability
● Let's "fix" that.
○ discoverySelectors
meshConfig:
  discoverySelectors:
    - matchExpressions:
        - key: istio-discovery
          operator: NotIn
          values:
            - disabled
○ All kubernetes and "machinery" namespaces
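With the selector above, a namespace is dropped from discovery by giving it the `istio-discovery: disabled` label. A minimal sketch, assuming this is how the Kubernetes and "machinery" namespaces are marked; the namespace name is hypothetical:

```yaml
# Hypothetical example: a tooling namespace that istiod should ignore.
apiVersion: v1
kind: Namespace
metadata:
  name: cluster-tooling
  labels:
    istio-discovery: disabled   # matched by the NotIn expression, so excluded
```

Namespaces without the label still match `NotIn`, so new namespaces are discovered by default and exclusion is opt-in.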
Slide 31
What I wish I knew
Scalability
● Let's "fix" that.
○ Default service "ExportTo"
meshConfig:
  defaultServiceExportTo:
    - "~"
services:
  labels:
    networking.istio.io/exportTo: '*'
○ Only Mesh services will be recognized
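When the mesh-wide default is "export to nobody", each mesh service has to opt back in. Upstream Istio documents `networking.istio.io/exportTo` as a Service annotation; the `services: labels:` layout on the slide may reflect an internal chart format. A minimal sketch with hypothetical names and ports:

```yaml
# Hypothetical example: a Service made visible to the whole mesh.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-team
  annotations:
    networking.istio.io/exportTo: "*"   # overrides defaultServiceExportTo for this Service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
```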
Slide 32
What I wish I knew
Scalability
● Sidecar Object
○ Limits the "knowledge" of a sidecar about mesh
■ reduces configs/cost
○ How we solved this
■ Pipeline code scan (meh)
■ Consul Service Discovery 👍
○ Sidecar Objects DON'T WORK for Gateways
■ see costs slides
spec:
  egress:
    - hosts:
        - ./*
        - istio-system/*
        - '*/consumed.workload.svc.cluster.local'
  workloadSelector:
    labels:
      app.kubernetes.io/name: my-app
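A common starting point before writing per-workload objects is a namespace-wide default Sidecar, which is simply one without a `workloadSelector`. This is a sketch of that pattern, not necessarily what the speaker's team deployed; the namespace name is hypothetical:

```yaml
# Hypothetical example: default scope for every workload in my-team.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-team
spec:
  egress:
    - hosts:
        - "./*"              # services in the same namespace
        - "istio-system/*"   # control-plane and gateway services
```

Workloads that call other namespaces then need their own Sidecar with the extra hosts listed, which is where knowing the real dependency graph becomes necessary.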
Slide 33
What I wish I knew
Scalability
Slide 34
What I wish I knew
Scalability
Slide 35
What I wish I knew
Scalability
● Ingress Gateways
○ cpu, memory, connections, requests
● Some components just can't scale by themselves (istiod)
○ 30 min connection
○ Flip-flop (unless big spike)
○ Just create a warm up routine
Slide 36
What I wish I knew
Monitoring
● 3 components
○ Istiod
○ Gateways (N/S - E/W)
○ Sidecars
● But, there are A LOT of Metrics
Slide 37
What I wish I knew
Monitoring
● Istiod
○ Convergence Time
○ Config errors (stall)
○ Certificate validation and issuance
● Gateways (N/S - E/W)
○ Basic Resources
○ Envoy Open connections
● Sidecars (basic)
○ Resources (avoid overload or restarts)
What I wish I knew
Sidecar/Proxy stuff
● Start/Stop
● HPA
● Flags
○ UH/UF/UO/NR
● Connection "imbalance"
Slide 41
What I wish I knew
Sidecar
● Port/protocol/Network exclusions
○ traffic.sidecar.istio.io/excludeOutboundPorts
○ traffic.sidecar.istio.io/excludeOutboundIPRanges
● Connection draining
meshConfig:
  defaultConfig:
    proxyMetadata:
      MINIMUM_DRAIN_DURATION: "5s"
      EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"
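The exclusion annotations go on the pod template, not on the Deployment itself. A minimal sketch; the port, CIDR, image and names are hypothetical examples of traffic that should bypass the proxy:

```yaml
# Hypothetical example: outbound traffic to port 5432 and to 10.20.0.0/16 skips the sidecar.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
      annotations:
        traffic.sidecar.istio.io/excludeOutboundPorts: "5432"
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.20.0.0/16"
    spec:
      containers:
        - name: app
          image: example.com/my-app:1.0.0
```

Excluded traffic gets no mTLS, retries or telemetry from the mesh, so the list is best kept short.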
Slide 42
What I wish I knew
Sidecar/Proxy stuff
● HPA
HPA
CPU: 80%
Mem: 70%
Main APP
resources:
  requests:
    memory: 512M
    cpu: 500m
Sidecar
resources:
  requests:
    memory: 200M
    cpu: 200m
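One reading of this diagram: a plain `Resource` metric in the HPA is computed over all containers in the pod, so the sidecar's requests dilute the target. Assuming the CPU figures are millicores, the pod requests 500m + 200m = 700m; if the app container is at 100% of its own request and the sidecar is nearly idle, pod utilization is about 500/700 ≈ 71%, which stays below the 80% target and never triggers a scale-up. A possible mitigation, not confirmed as the speaker's approach, is to scale on the app container only. The container name `app` and the replica bounds are hypothetical:

```yaml
# Hypothetical example: HPA that ignores the istio-proxy container's usage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        container: app          # only this container's usage/request ratio counts
        target:
          type: Utilization
          averageUtilization: 80
```

`ContainerResource` metrics need a reasonably recent Kubernetes version, so check cluster support before relying on it.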
Slide 43
What I wish I knew
Cost
● Two major cost factors
○ Sidecar resources (already "solved")
■ CPU
■ Memory
■ Ambient Mesh?
○ Data transfer
■ Gateways receive ALL configs
■ Huge mesh
■ Lots of workloads
■ Lots of gateway replicas
Slide 44
What I wish I knew
Cost
● Data Transfer
○ Not a solved problem in production yet
■ Total TX: ~160G per day
■ AZ-isolated ASGs (gateways + istiod?)
■ Maybe Topology Aware Routing feature?
○ In sandbox - Pilot
env:
  - name: PILOT_FILTER_GATEWAY_CLUSTER_CONFIG
    value: "true"
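Since the setup described earlier is driven by istioctl plus an IstioOperator file, the variable would typically be set on the pilot component. A minimal sketch of where it could live, assuming the standard IstioOperator layout:

```yaml
# Sketch: istiod environment variable set through the IstioOperator file.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
          - name: PILOT_FILTER_GATEWAY_CLUSTER_CONFIG
            value: "true"
```

The intent of the flag is that gateways only receive clusters referenced by their own routes instead of the whole mesh, which is why it is being trialled against the data-transfer cost.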
Slide 45
What I wish I knew
Misc
● Envoy filters
○ No compatibility guarantee during upgrades
○ We had problems
■ Internal structure problem (lua code)
■ Rate limit
● Default retry attempts: 2!
○ Highly elastic apps (↑↓)
○ Endpoints update process can fail (503 increase)
Warning: EnvoyFilter exposes internal
implementation details that may change at
any time. Prefer other APIs if possible, and
exercise extreme caution, especially around
upgrades.
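On the default retry point: the mesh-wide default can be changed in `meshConfig` through `defaultHttpRetryPolicy`. The sketch below disables default retries entirely; whether that is the right choice depends on the workloads, and the talk does not say which value the team settled on.

```yaml
# Sketch: turn off the implicit 2 retries mesh-wide; individual VirtualServices can still opt in.
meshConfig:
  defaultHttpRetryPolicy:
    attempts: 0
```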
Slide 46
What I wish I knew
Misc
● Guardrails
○ Block direct access and validate:
■ Gateway
■ PeerAuthentication
■ VirtualService
■ DestinationRule
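One way to approach the "block direct access" half is plain Kubernetes RBAC: give teams read-only access to these resources and let a pipeline with validation own the writes. This is a hypothetical sketch of that idea, not the guardrail implementation described in the talk; the role name is made up.

```yaml
# Hypothetical example: read-only access to the sensitive Istio resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: istio-config-readonly
rules:
  - apiGroups: ["networking.istio.io"]
    resources: ["gateways", "virtualservices", "destinationrules"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["security.istio.io"]
    resources: ["peerauthentications"]
    verbs: ["get", "list", "watch"]
```

The "validate" half would sit in whatever applies the objects on the teams' behalf, for example an admission policy or a CI check.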
Slide 47
What I wish I knew
Slide 48
Questions?
Slide 49
Contacts
Daniel Requena
https://www.linkedin.com/in/danielrequena/
https://bolha.us/@requena
https://github.com/drequena/
https://speakerdeck.com/drequena
https://twitter.com/Daniel_Requena