Our fantastic journey through the Cloud Native World.

From our k8s meetup on 30 July 2019

https://www.meetup.com/dreamIT-Hamburg/events/263190198/

We would like to tell you about the result of our journey through the world of Kubernetes and our successful launch at the beginning of 2019.
We look back on our path to Kubernetes adoption and describe the hurdles we had to overcome to run our applications on Kubernetes in a stable and scalable way. We introduce the tools we use from the CNCF ecosystem, show how we use them, and give you some insights, lessons learned, and experiences we've had with these tools.

dreamIT

July 30, 2019

Transcript

1. Our fantastic journey: Lessons learned
   Our adoption of Kubernetes, cloud native tools and thinking
   André Veelken, DevOps Engineer @ dreamIT

2. Overview (1/3)
   • The beginning
   • The challenge of running a Java monolith on Kubernetes
   • Changing the mindset
   • Our favourite k8s distro: kops
   • The big switch
   • Elasticsearch on Kubernetes
   • Fluentd & fluentbit

3. Overview (2/3)
   • Helm: organizing the YAML mess
   • Security
   • Monitoring: Prometheus with prometheus-operator and kube-prometheus
   • Infrastructure testing
   • Authentication & authorization
   • GitLab CI/CD

4. The beginning
   • Three Kubernetes clusters in the beginning: dev, CICD, prod; version 1.7
   • First microservice going live mid 2017, second and third in December 2017
   • "kubectl apply -f" deployments (see the sketch below)
   • No RBAC, manual kubeconfig handling
   • OS: Debian; fluentd, traefik ingress, Sysdig monitoring
   • Hard to convince the business and to get developer time for the cloud/k8s project

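   For readers unfamiliar with this style of rollout, a minimal sketch of what a "kubectl apply -f" deployment looks like; the service name and image are hypothetical, and on k8s 1.7 the apiVersion would still have been extensions/v1beta1 rather than apps/v1:

      # deployment.yaml -- applied with: kubectl apply -f deployment.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: example-service                # hypothetical microservice name
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: example-service
        template:
          metadata:
            labels:
              app: example-service
          spec:
            containers:
              - name: example-service
                image: registry.example.com/example-service:1.0.0   # hypothetical image
                ports:
                  - containerPort: 8080
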
5. Running a Java monolith on k8s
   • State is hard
   • Payara, Hazelcast ("hasslecast")
   • Migrating a classical three-tier architecture
   • Long time to get pods warmed up and ready

6. Changing the mindset
   • Pods are mortal, short-lived -> more disruption
   • Cattle, not pets
   • More flexibility for devs
   • Several months of prod deployments to both the old env and k8s -> gain trust

7. Kops — Our k8s distro of choice
   • Well tested, rock solid
   • Rather old k8s versions
   • No problems with k8s updates, even to 1.12 with etcd3
   • Only once did a cluster die, last year (it was about a year old)
   • Container Linux with CLUO (update operator)
   • Spin up a test cluster via CI (see the sketch below)

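   A rough sketch of how spinning up a test cluster from CI can look with kops; the cluster name, state-store bucket, tooling image and job layout here are illustrative, not the exact pipeline from the deck:

      # .gitlab-ci.yml (excerpt) -- create a short-lived kops test cluster
      create-test-cluster:
        stage: test
        image: registry.example.com/tooling/kops:latest   # hypothetical image with kops + kubectl
        when: manual
        script:
          - export KOPS_STATE_STORE=s3://example-kops-state   # hypothetical state store bucket
          - kops create cluster
              --name test.k8s.example.com
              --zones eu-central-1a
              --node-count 2
              --yes
          - kops validate cluster --name test.k8s.example.com
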
8. The big switch — moving to the cloud
   • Went smoothly in general
   • Problems with secrets and helm (chicken-or-egg problem)
   • Prometheus and Elasticsearch collapsed several times under load

9. Elasticsearch on Kubernetes
   • We use the chart from the helm/stable repo (see the values sketch below)
   • Recommendable!
   • Almost no reason to keep an ES cluster outside of k8s
   • Exception: logs were lost when the cluster crashed

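   As a flavour of how the chart is driven, a sketch of the kind of values override one would pass to stable/elasticsearch; the exact keys depend on the chart version, and the sizes are hypothetical:

      # values.yaml (sketch) -- installed roughly via: helm upgrade --install elasticsearch stable/elasticsearch -f values.yaml
      master:
        replicas: 3            # dedicated master-eligible nodes
      data:
        replicas: 2
        persistence:
          size: 100Gi          # hypothetical volume size per data node
      client:
        replicas: 2            # coordinating-only nodes
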
10. Fluentd & fluentbit
   • Fluentd in use since the beginning
   • Need for encrypted logs
   • Complex filter chain (fragment below)
   • Trial and error
   • Poor documentation
   • Fluentbit evaluated a short time ago

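   To give a sense of what "complex filter chain" means in practice, a simplified fragment of the kind of fluentd configuration involved, shipped to the DaemonSet via a ConfigMap; the names, namespace and patterns here are illustrative only:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: fluentd-filters            # hypothetical name
        namespace: logging               # hypothetical namespace
      data:
        filters.conf: |
          # enrich every record with pod/namespace metadata
          <filter kubernetes.**>
            @type kubernetes_metadata
          </filter>
          # drop noisy health-check log lines before they are shipped
          <filter kubernetes.**>
            @type grep
            <exclude>
              key log
              pattern /GET \/healthz/
            </exclude>
          </filter>
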
11. Helm
   • We use v2.13.1 at the moment
   • Complex deployments
   • Rollback feature & installation of previous versions broken
   • Deployments end up in broken states
   • No real alternatives; kustomize, for example, does something different
   • Hopes for helm 3

12. Security
   • RBAC
   • No root user in containers (see the sketch below)
   • No Kubernetes dashboard / web UI
   • AWS audit trail: CloudTrail with alerting
   • Clair container vulnerability scanning in CI

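   Enforcing "no root user in containers" is largely a matter of the pod securityContext; a minimal sketch, where the pod name, image and uid are arbitrary examples:

      apiVersion: v1
      kind: Pod
      metadata:
        name: example-pod                # hypothetical
      spec:
        securityContext:
          runAsNonRoot: true             # kubelet refuses containers that would run as uid 0
          runAsUser: 10001               # arbitrary non-root uid
        containers:
          - name: app
            image: registry.example.com/example-service:1.0.0   # hypothetical image
            securityContext:
              allowPrivilegeEscalation: false
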
13. Monitoring: Prometheus with prometheus-operator & kube-prometheus
   • CoreOS Prometheus operator runs Prometheus, Grafana, Alertmanager (ServiceMonitor sketch below)
   • Initially installed with helm — painful
   • Better solution: jsonnet
   • CI

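   With the prometheus-operator, scrape targets are declared as ServiceMonitor resources instead of being hand-edited into prometheus.yml; a minimal sketch, with hypothetical labels and port name:

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: example-service            # hypothetical
        labels:
          release: prometheus            # must match the serviceMonitorSelector of the Prometheus resource
      spec:
        selector:
          matchLabels:
            app: example-service         # selects the Service exposing the metrics port
        endpoints:
          - port: metrics                # named port on that Service
            interval: 30s
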
14. Authentication & authorization
   • At first: manual management of kubeconfig files
   • Then: Dex (OpenID Connect provider), kuberos, OAuth proxy -> RBAC authorization; authentication through GitLab
   • Now: Heptio Gangway, Keycloak -> RBAC authorization (binding sketch below); authentication through GitLab

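   Once the OIDC provider hands out tokens, authorization is plain RBAC; a sketch of binding a group claim to read-only access, where the binding and group names are hypothetical:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: developers-view            # hypothetical
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: view                       # built-in read-only ClusterRole
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: oidc:developers          # group as it appears in the token; prefix depends on apiserver OIDC flags
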
15. GitLab
   • Single sign-on for our Karma services, e.g. Kibana, Grafana and Gangway
   • Most rollouts via helm CI jobs into different clusters (job sketch below)
   • Docker builds
   • Test cluster setup

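   A condensed sketch of what a helm rollout job from GitLab CI into a cluster can look like; the release, chart, namespace and tooling image are hypothetical, using the helm v2 syntax mentioned above:

      # .gitlab-ci.yml (excerpt)
      deploy-prod:
        stage: deploy
        image: registry.example.com/tooling/helm:2.13.1   # hypothetical tooling image
        environment: production
        script:
          - helm upgrade --install example-service ./chart
              --namespace example
              --values values-prod.yaml
              --wait
        only:
          - master
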
16. Cluster networking: CNI
   • Weave in the beginning
   • Then: canal (calico and flannel combined)
   • No use of calico features for a long time
   • Special case: the GCE cluster uses kubenet; a switch to weave (a real CNI) is planned:
     https://github.com/kubernetes/kops/issues/2087

17. DNS inside and outside of clusters
   • Inside clusters:
     • kube-dns for a long time
     • 5-second responses from time to time
     • Switch to CoreDNS
     • Better performance and the autopath plugin
   • Outside clusters:
     • external-dns by Zalando -> Route53, CloudFlare
     • Just annotate your deployment (example below)

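   "Just annotate" in practice means putting an external-dns annotation on the Service (or Ingress) that should get a DNS record; a sketch with a hypothetical service and hostname:

      apiVersion: v1
      kind: Service
      metadata:
        name: example-service            # hypothetical
        annotations:
          external-dns.alpha.kubernetes.io/hostname: example-service.example.com   # record created in Route53/CloudFlare
      spec:
        type: LoadBalancer
        selector:
          app: example-service
        ports:
          - port: 80
            targetPort: 8080
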
18. Ingress & SSL
   • Traefik in the beginning
   • Switch to nginx-ingress before the big migration
   • Rock solid, better performance than traefik
   • cert-manager by Jetstack
   • Let's Encrypt usage (example below)

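   A sketch of the nginx-ingress plus cert-manager combination; host, secret and issuer names are hypothetical, and the annotation prefix depends on the cert-manager version (certmanager.k8s.io in the releases current in 2019, cert-manager.io later):

      apiVersion: extensions/v1beta1     # Ingress API group of that era
      kind: Ingress
      metadata:
        name: example-service            # hypothetical
        annotations:
          kubernetes.io/ingress.class: nginx
          certmanager.k8s.io/cluster-issuer: letsencrypt-prod   # hypothetical ClusterIssuer name
      spec:
        tls:
          - hosts:
              - example-service.example.com
            secretName: example-service-tls    # cert-manager stores the Let's Encrypt cert here
        rules:
          - host: example-service.example.com
            http:
              paths:
                - path: /
                  backend:
                    serviceName: example-service
                    servicePort: 80
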
19. Conclusion (1/2)
   • Better scaling: horizontal pod autoscaler (HPA) and aws-autoscaler for more nodes (example below) ☁ ☁ ☁
   • Self-healing, but tuning apps and load testing was a lot of effort
   • Far better operations even though many 12-factor rules are neglected
   • It's always good to have a failover cluster in place ⛈
   • Databases are located outside of Kubernetes

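   The HPA mentioned above in its simplest form; the target deployment name, replica bounds and CPU threshold are illustrative:

      apiVersion: autoscaling/v1
      kind: HorizontalPodAutoscaler
      metadata:
        name: example-service            # hypothetical
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: example-service
        minReplicas: 2
        maxReplicas: 10
        targetCPUUtilizationPercentage: 70   # scale out when average CPU goes above 70%

   The cluster autoscaler then adds nodes once pending pods no longer fit on the existing ones.
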
20. Conclusion (2/2)
   • No network problems anymore
   • Just scale with customer traffic
   • Six full-time DevOps engineers plus one dev team needed for the project
   • Engaging with the community helps a lot
   • Steep learning curve with new tools

21. The future
   • Pod security policies
   • Network policies (example below)
   • Anomaly detection in pods
   • More microservices

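   As a taste of the network policies on the roadmap, a per-namespace default-deny policy is the usual starting point; a minimal sketch with a hypothetical namespace:

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: default-deny-ingress
        namespace: example               # hypothetical namespace
      spec:
        podSelector: {}                  # selects all pods in the namespace
        policyTypes:
          - Ingress                      # no ingress rules listed -> all inbound traffic denied
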
22. Sources
   • Pictures from https://commons.wikimedia.org
   • kops: https://github.com/kubernetes/kops
   • fluentd: https://www.fluentd.org/
   • Elasticsearch chart: https://github.com/helm/charts/tree/master/stable/elasticsearch