Kubernetes for 14 months

KUBERNETES FOR 14 MONTHS JOSHUA PEPER - VERIFAI

PROJECT: DOCKER GRUNN
DATE: 26 02 2019
CLIENT: JOSHUA PEPER, CTO VERIFAI

Spin-off of the largest online phone retailer in the Netherlands

kubernetes

Why Kubernetes
- Hipster tech
- Verifai started from scratch
- Experience in Docker
- Whole dev setup in Docker
- Need to be able to hyper scale

Why all at once
- No infra to replace
- Starting phase
- Downtime is not a disaster
- Should not be able to go down by design
- Help from OSSO

CURRENT STACK

Current stack (running services)
- Django Backend
- Celery workers
- RabbitMQ
- Redis
- Galera / MariaDB
- Public Docker registry
- Website
- Pootle
- Elasticsearch (ELK)
- Kibana
- APM
- Review env
- Public docs
- Sentry
- PostgreSQL
- Redis
- Minio
- Backup systems

What's next
- Android Builds
- TensorFlow Jobs
- CEPH
- Gitlab
- Private Docker registry
- Verifai Webservices

7 HASHTAG PRO TIPS

#1 LIMIT YOUR RESOURCES
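
As an illustration of this tip, here is a minimal sketch that builds a container spec with CPU and memory requests and limits. The `resources.requests` and `resources.limits` fields are standard Kubernetes; the helper name, the image name, and the numbers are hypothetical and not taken from the talk. The manifest is written as a Python dict and printed as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def container_with_limits(name, image, cpu_request, cpu_limit, mem_request, mem_limit):
    """Return a container spec dict with resource requests and limits set."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # What the scheduler reserves for the container on a node.
            "requests": {"cpu": cpu_request, "memory": mem_request},
            # Hard ceiling: CPU is throttled, exceeding memory gets the container killed.
            "limits": {"cpu": cpu_limit, "memory": mem_limit},
        },
    }


# Hypothetical values, chosen only for illustration.
backend = container_with_limits(
    "backend", "example/backend:latest", "250m", "500m", "256Mi", "512Mi"
)
print(json.dumps(backend, indent=2))
```

Setting both requests and limits on every container keeps a single misbehaving workload from starving its neighbours on the same node.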

#2 START WITH MANUAL DEPLOYMENTS

#3 USE KUBE-LEGO FOR SSL (AND READ THE MANUAL WHEN UPGRADING)

#4 FIX NODES THAT USE STORAGE

Fix nodes that use storage
For example:
- Static files (NGINX)
- Database servers (Galera / MariaDB)
- Object storage (Minio (S3 compatible))
- Blob storage (CEPH)

Fix nodes that use storage
- Problem: pinning means downtime (if that node is down)
- Minio has options to replicate itself
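
One common way to pin a pod to a fixed node is a `nodeSelector` on the well-known `kubernetes.io/hostname` label. The sketch below assumes that approach; the talk does not say which mechanism was used, and the pod name and node name are hypothetical.

```python
import json


def pinned_pod(name, image, node_hostname):
    """Return a Pod manifest that can only be scheduled on one named node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The pod is only eligible for nodes carrying this label value.
            # If that node is down, the pod stays Pending: this is the
            # downtime trade-off named on the slide.
            "nodeSelector": {"kubernetes.io/hostname": node_hostname},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical pod and node names.
static_files = pinned_pod("static-files", "nginx", "node-1")
print(json.dumps(static_files, indent=2))
```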

#5 MONITOR YOUR JOBS

Monitor your jobs
We have a few large jobs:
- Packaging all the data required to train neural networks
- Lots of IO
- Lots of DB queries
- A lot of calculations
- 100k loops, and memory leaks
It has literally taken down our entire cluster a dozen times

But it couldn’t go down?
Yes, because of Services:
- Spreads load to all servers
- Uses all cores in the cluster
- Using all DB servers
- Writing so fast to the disk
- Cluster down
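
Besides monitoring, a heavy job can be fenced in so that it cannot consume the whole cluster. The sketch below is one possible shape, not the setup from the talk: it caps parallelism, sets a deadline, and puts resource limits on the job's container. `parallelism`, `backoffLimit` and `activeDeadlineSeconds` are standard `batch/v1` Job fields; the job name, image, and all values are hypothetical.

```python
import json


def bounded_job(name, image, parallelism, deadline_seconds, cpu_limit, mem_limit):
    """Return a Job manifest with capped parallelism, runtime and resources."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": parallelism,                 # max pods running at once
            "backoffLimit": 2,                          # retries before the job is marked failed
            "activeDeadlineSeconds": deadline_seconds,  # job is terminated after this long
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # A leaking process gets killed at the memory limit
                            # instead of exhausting the node.
                            "resources": {
                                "limits": {"cpu": cpu_limit, "memory": mem_limit}
                            },
                        }
                    ],
                }
            },
        },
    }


# Hypothetical job: at most 2 pods, 1 hour, 2 cores and 4 GiB per pod.
job = bounded_job("package-training-data", "example/packager:latest", 2, 3600, "2", "4Gi")
print(json.dumps(job, indent=2))
```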

#6 POD ANTI-AFFINITY

Pod anti-affinity
- Kubernetes schedules by default to the node with the most spare resources
- Can be that all your backend pods run on 1 node
- Gets redistributed when node goes down
- Takes 2-5 seconds (for our backend)
- Anti-affinity also prevents the all-pods-on-one-node issue
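
A minimal sketch of what such a rule can look like, assuming the backend pods carry an `app` label (the label key and value are hypothetical). It uses the soft `preferredDuringSchedulingIgnoredDuringExecution` variant, so the scheduler spreads pods over nodes when it can but still places them when it cannot. The resulting dict belongs under `spec.template.spec.affinity` of a Deployment.

```python
import json


def spread_across_nodes(app_label):
    """Return an affinity block that prefers one pod of this app per node."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,  # 1-100; higher means a stronger preference
                    "podAffinityTerm": {
                        # Avoid nodes already running pods with this label...
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        # ...where "node" is defined by the hostname label.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


affinity = spread_across_nodes("backend")
print(json.dumps(affinity, indent=2))
```

The hard variant, `requiredDuringSchedulingIgnoredDuringExecution`, refuses to schedule a second matching pod on the same node, which is stricter but can leave pods Pending when there are more replicas than nodes.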

#7 MONITOR THE CONTAINER LOGS

THANK YOU VERY MUCH


- Python / Django
- TensorFlow / AI
- Kubernetes / DevOps
- Swift for iOS
- Kotlin for Android
- C++
- PHP / Laravel
- Hypervisors / BGP
