Media.net_A_Journey.pptx.pdf

Cloud Native Community

October 16, 2023

Transcript

  1. Ajay, SRE @ Media.net
     • Responsible for managing production
     • Responsible for the Internal Developer Platform (IDP)
     • Attendee of KubeCon Amsterdam '23
     • Certified Kubernetes Administrator
  2. History
     2020
     • 3 services on on-prem & AWS
     • 500+ backend servers
     • Consul: service discovery
     • Script for moving logs to buckets
     • Prometheus + Grafana
     • HAProxy + Nginx
     Present (2023)
     • 5+ services on GCP only
     • 3500+ backend servers
     • and !
  3. • Deployments
       ◦ Script for deploying new JAR
       ◦ 2+ hrs deployment duration
       ◦ Scripts for deployment verification
     • Uptime
       ◦ ~95% SLO
     • Issues
       ◦ Longer downtimes
       ◦ Deployment failures
       ◦ Redundant issues
     • Onboarding new services
     • Observability
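
To put the ~95% SLO above in perspective, the downtime it tolerates can be worked out directly. This is a small sketch, assuming a 30-day measurement window; the window length and the function name are illustrative, since the slide does not state how the SLO was measured.

```python
def allowed_downtime_hours(slo_percent: float, window_days: int = 30) -> float:
    """Hours of downtime an availability SLO tolerates over the window.

    window_days=30 is an assumption, not a figure from the talk.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_hours = window_days * 24
    return total_hours * (100 - slo_percent) / 100


if __name__ == "__main__":
    # A 95% SLO leaves roughly 36 hours of downtime in a 30-day window.
    print(round(allowed_downtime_hours(95), 2))
```

Under the same assumption, a 99.9% target would leave well under an hour per window, which is why 2+ hour deployments and long downtimes fit inside a ~95% SLO but not a stricter one.
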
  4. • Mesos DC/OS
       ◦ Scalability issue due to high load on master node at scale
     • Nomad
       ◦ Multiple components
       ◦ Always going to be self-managed
     • Kubernetes
       ◦ Steep learning curve
       ◦ Cloud-managed clusters
       ◦ Extensive community & tools
  5. • Software Engineers
       ◦ Deployment benefits
       ◦ Hear and understand their issues with containerisation
       ◦ Help them to mitigate the issues
       ◦ Containerised their applications for them, but one time only
       ◦ Automate
     • Senior management
       ◦ Reasons for the big step
       ◦ Benefits out of it
       ◦ How reliable will it be
       ◦ How much time will be spent
       ◦ Possible consequences
  6. • Operational ad hocs
     • Infrastructure management problems
     • Reinventing the wheel?
     • On-call issues
     • Can we target two critical components with one orchestration tool?
       ◦ Platform engineering (IDP)
       ◦ Production infrastructure
  7. • Containerize applications
     • Mirror production
     • Test traffic
     • In parallel, migrate monitoring stack
     • Make all SREs comfortable
     • Compare metrics
     • Optimizations
  8. • 5 cloud zonal data centers
     • 3 databases
     • 5+ backend services
     • 3 months
     • Our rescuers
  9. Issues with on-premise DC without Kubernetes
     • Maintaining two code bases for main applications
       ◦ Dockerised vs non-Dockerised
     • Maintaining two deployment stacks
       ◦ Deployments failing now and then
     • Monitoring two differently orchestrated infrastructures
     • Finally moved out of it with some resistance
  10. • Discovery
      • Auto scaling
      • Healthcheck
      • Support for stacking side-car dependencies
      • Support for keeping config files
      • Ease of onboarding applications
      • Access management
  11. When 500 pods scaled to 3500 pods
      • Catch-up scaling with traffic
      • Pod IPs exhausted
      • Prometheus running out of memory (OOM)
      • Sudden traffic bombardment on newly scaled-up pods
      • Slow discovery at LB (OpenResty + Consul)
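
Pod IP exhaustion of the kind listed above usually comes down to CIDR arithmetic. The sketch below assumes a cluster that carves one fixed-size block per node out of a single pod range, as VPC-native managed clusters commonly do. The `10.8.0.0/20` range, the `/24` per-node block, and the 32 pods per node are hypothetical values chosen for illustration, not figures from the talk.

```python
import ipaddress
import math


def max_nodes(pod_cidr: str, per_node_prefix: int) -> int:
    """Number of per-node blocks that fit inside the cluster pod range."""
    network = ipaddress.ip_network(pod_cidr)
    if per_node_prefix < network.prefixlen:
        raise ValueError("per-node block is larger than the pod range")
    return 2 ** (per_node_prefix - network.prefixlen)


def pods_fit(pod_cidr: str, per_node_prefix: int,
             pods_per_node: int, wanted_pods: int) -> bool:
    """True if the pod range has enough node blocks to host wanted_pods."""
    nodes_needed = math.ceil(wanted_pods / pods_per_node)
    return nodes_needed <= max_nodes(pod_cidr, per_node_prefix)


if __name__ == "__main__":
    # Hypothetical sizing: a /20 pod range split into /24 blocks.
    print(max_nodes("10.8.0.0/20", 24))           # 16 node blocks
    print(pods_fit("10.8.0.0/20", 24, 32, 500))   # True
    print(pods_fit("10.8.0.0/20", 24, 32, 3500))  # False
```

With these illustrative numbers, a range that comfortably holds 500 pods runs out long before 3500, because each node reserves a whole block regardless of how many pods it actually runs.
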
  12. And when your production scales with Java
      • 0 to 300 QPS
      • Application pods running on 16 cores each
      • Causing an initial 5 min of traffic thrashing
        ◦ Because Java does lazy loading
      • Need for slow start
      • Other issues
        ◦ Slow backend discovery via Consul
        ◦ No support for HTTP/2 upstream connections on Nginx
        ◦ No tracing support on load balancers
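
The need for slow start noted above can be illustrated with a weight ramp: a freshly started pod receives a small share of traffic that grows over a warm-up window, giving the JVM time to load classes before it takes full load. This is a minimal sketch of a linear ramp. The 300-second window (chosen to match the 5 minutes of thrashing above) and the 10% floor are assumptions, the function names are hypothetical, and real load balancers may use a different curve.

```python
from typing import List


def slow_start_weight(age_seconds: float,
                      window_seconds: float = 300.0,
                      min_fraction: float = 0.1) -> float:
    """Fraction of full load-balancing weight for a pod of the given age.

    Linear ramp from min_fraction up to 1.0 over window_seconds.
    Both defaults are illustrative assumptions.
    """
    age = max(age_seconds, 0.0)
    if age >= window_seconds:
        return 1.0
    return max(min_fraction, age / window_seconds)


def traffic_shares(pod_ages: List[float]) -> List[float]:
    """Share of traffic per pod under weighted balancing with slow start."""
    weights = [slow_start_weight(age) for age in pod_ages]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    # Two warm pods and one that has just started.
    for share in traffic_shares([600.0, 600.0, 0.0]):
        print(f"{share:.1%}")
```

Without the ramp, the new pod in this example would receive a third of the traffic immediately; with it, the pod starts at under 5% and reaches an equal share only once the window has elapsed.
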
  13. • Load balancer's job breakdown
      • xDS support
      • Natively supports
      • Core of Istio, Kong, Ambassador, Gloo
  14. • Load balancer on VM ❓
      • In-house built control plane 😵
      • Search for an open-source control plane
      • Istio VM in router mode
      • Istiod as control plane, EnvoyFilters