Media.net_A_Journey.pptx.pdf

Our journey from 500 servers to 3500 servers

Ajay, SRE @ Media.net
• Responsible for managing production
• Responsible for the Internal Developer Platform (IDP)
• Attendee of KubeCon Amsterdam '23
• Certified Kubernetes Administrator

History

2020
• 3 services on on-prem & AWS
• 500+ backend servers
• Consul: service discovery
• Script for moving logs to buckets
• Prometheus + Grafana
• HAProxy + Nginx

Present 2023
• 5+ services on GCP only
• 3500+ backend servers
• and !

• Deployments
  ◦ Script for deploying new JAR
  ◦ 2+ hrs deployment duration
  ◦ Scripts for deployment verification
• Uptime
  ◦ ~95% SLO
• Issues
  ◦ Longer downtimes
  ◦ Deployment failures
  ◦ Redundant issues
• Onboarding new services
• Observability
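
To put the ~95% SLO in perspective, here is a minimal sketch of the downtime budget it implies. The 30-day window and the 99.9% comparison target are assumptions chosen for illustration; neither appears in the slides.

```python
def allowed_downtime_hours(slo: float, window_days: int = 30) -> float:
    """Hours of downtime an availability SLO permits over a window.

    slo is a fraction (0.95 means 95%). window_days is an assumed
    measurement window, not a figure from the talk.
    """
    return (1.0 - slo) * window_days * 24


if __name__ == "__main__":
    # The ~95% SLO from the slide: roughly 36 hours per 30 days.
    print(round(allowed_downtime_hours(0.95), 2))
    # A hypothetical 99.9% target, for comparison only: under one hour.
    print(round(allowed_downtime_hours(0.999), 2))
```

Under these assumptions a 95% SLO tolerates about a day and a half of downtime per month, which is consistent with the "longer downtimes" listed among the issues.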

• Mesos DC/OS
  ◦ Scalability issue due to high load on master node at scale
• Nomad
  ◦ Multiple components
  ◦ Always going to be self-managed
• Kubernetes
  ◦ Steep learning curve
  ◦ Cloud-managed clusters
  ◦ Extensive community & tools

• Software Engineers
  ◦ Deployment benefits
  ◦ Hear and understand their issues with containerisation
  ◦ Help them to mitigate the issues
  ◦ Containerised their application for them, but one time only
  ◦ Automate
• Senior management
  ◦ Reasons for the big step
  ◦ Benefits out of it
  ◦ How reliable will it be
  ◦ How much time will be spent
  ◦ Possible consequence

• Operational ad hocs
• Infrastructure management problems
• Reinventing the wheel?
• On-call issues
• Can we target two critical components with one orchestration tool?
  ◦ Platform engineering (IDP)
  ◦ Production infrastructure

• Containerize applications
• Mirror production
• Test traffic
• In parallel, migrate monitoring stack
• Make all SREs comfortable
• Compare metrics
• Optimizations

• 5 cloud zonal data centers
• 3 databases
• 5+ backend services
• 3 months
• Our rescuers

Issues with on-premise DC without Kubernetes
• Maintaining two code bases for main applications
  ◦ Dockerised vs non-dockerised
• Maintaining two deployment stacks
  ◦ Deployments failing now and then
• Monitoring two differently orchestrated infrastructures
• Finally moved out of it with some resistance

• Discovery
• Auto scaling
• Healthcheck
• Support for stacking side-car dependencies
• Support for keeping config files
• Ease of onboarding application
• Access management

• Cluster YAML sync
• Advanced deployment strategies
• Auto scaling on custom integration

Cherry on top

When 500 pods scaled to 3500 pods
• Catch-up scaling with traffic
• Pod IPs exhausted
• Prometheus running out of memory (OOM)
• Sudden traffic bombardment on newly scaled-up pods
• Slow discovery at LB (OpenResty + Consul)
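
The slide does not say why the pod IPs ran out. One common way this happens is that the cluster hands each node a fixed-size slice of the pod address range, so the range is consumed per node rather than per pod. The sketch below works through that arithmetic; the CIDR, the per-node prefix length, and the pods-per-node figures are all hypothetical values, not the speaker's configuration.

```python
import ipaddress


def max_nodes(pod_cidr: str, per_node_prefix: int) -> int:
    """Number of per-node slices (/per_node_prefix) that fit in pod_cidr."""
    network = ipaddress.ip_network(pod_cidr)
    if per_node_prefix < network.prefixlen:
        # A single node slice would be larger than the whole pod range.
        return 0
    return 2 ** (per_node_prefix - network.prefixlen)


def pod_capacity(pod_cidr: str, per_node_prefix: int, pods_per_node: int) -> int:
    """Pods that can actually be scheduled before node slices run out."""
    return max_nodes(pod_cidr, per_node_prefix) * pods_per_node


if __name__ == "__main__":
    # Hypothetical range: a /18 pod CIDR carved into one /24 per node.
    print(max_nodes("10.0.0.0/18", 24))          # nodes that fit
    # With many small pods per node the range goes a long way...
    print(pod_capacity("10.0.0.0/18", 24, 110))
    # ...but with a few large pods per node it caps out far earlier.
    print(pod_capacity("10.0.0.0/18", 24, 2))
```

Under these assumed numbers, a range with over 16,000 addresses supports only 64 nodes, so large pods that fill a node on their own exhaust the range long before the raw address count suggests.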

And when your production scales with Java
  ◦ 0 to 300 QPS
  ◦ Application pods running on 16 cores each
• Causing initial 5 min of traffic thrashing
  ◦ Because Java does lazy loading
• Need for slow start
• Other issues
  ◦ Slow backend discovery via Consul
  ◦ No support for HTTP/2 upstream connections on Nginx
  ◦ No tracing support on load balancers
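
The idea behind slow start can be sketched as a weight that ramps up with a backend's age, so a freshly started pod receives only a small share of traffic while it warms up. This is an illustrative linear ramp, not the algorithm of any particular load balancer; the function names and the 10% floor are assumptions, and the 300-second window simply mirrors the 5 minutes mentioned above.

```python
def slow_start_weight(age_s: float, window_s: float = 300.0,
                      min_weight: float = 0.1) -> float:
    """Share of a full traffic weight for a backend that joined age_s ago.

    Ramps linearly from min_weight to 1.0 over window_s seconds.
    """
    if age_s >= window_s:
        return 1.0
    ramp = max(age_s, 0.0) / window_s
    return max(min_weight, ramp)


def split_qps(total_qps: float, ages_s: list[float]) -> list[float]:
    """Divide total_qps across backends in proportion to their weights."""
    weights = [slow_start_weight(age) for age in ages_s]
    total = sum(weights)
    return [total_qps * w / total for w in weights]


if __name__ == "__main__":
    # Two warm pods and one pod that started 30 seconds ago.
    for age, qps in zip([600, 600, 30], split_qps(300, [600, 600, 30])):
        print(f"age={age:>4}s  qps={qps:6.1f}")
```

In this sketch the new pod starts at roughly a tenth of a warm pod's share and reaches parity after the window, which gives lazily loaded classes time to initialise before full load arrives.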

• Load balancer's job breakdown
• xDS support
• Natively supports
• Core of Istio, Kong, Ambassador, Gloo

• Load balancer on VM ❓
• In-house built control plane 😵
• Search for an open-source control plane
• Istio VM in router mode
• Istiod as control plane, EnvoyFilters

Thank you
