Our fantastic journey
Lessons learned: our adoption of Kubernetes, cloud-native tools, and thinking
André Veelken, DevOps Engineer @ DreamIT


Overview (1/3)
● The beginning
● The challenge of running a Java monolith on Kubernetes
● Changing the mindset
● Our favourite k8s distro: kops
● The big switch
● Elasticsearch on Kubernetes
● Fluentd & Fluent Bit

Overview (2/3)
● Helm: organizing the YAML mess
● Security
● Monitoring: Prometheus with prometheus-operator and kube-prometheus
● Infrastructure testing
● Authentication & authorization
● GitLab CI/CD

Overview (3/3)
● Networking: CNI
● DNS
● Ingress & SSL
● Conclusion
● The future

The beginning
● Three Kubernetes clusters in the beginning: dev, CI/CD, prod; version 1.7
● First microservice went live mid-2017, the second and third in December 2017
● „kubectl apply -f“ deployments
● No RBAC, manual kubeconfig handling
● OS: Debian; Fluentd, Traefik ingress, Sysdig monitoring
● It was hard to convince the business and get developer time for the cloud/k8s project
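The early workflow can be sketched roughly as follows; the manifest file and kubeconfig path are illustrative, not from the talk:

```shell
# Minimal sketch of the early, pre-RBAC workflow: no Helm, no pipeline,
# just kubectl pointed at a hand-managed kubeconfig and a plain manifest.
kubectl --kubeconfig ~/.kube/prod-config apply -f my-service.yaml

# Check that the rollout actually completed before calling it done.
kubectl --kubeconfig ~/.kube/prod-config rollout status deployment/my-service
```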

Running a Java monolith on k8s
● State is hard
● Payara, Hazelcast („hasslecast“)
● Migrating a classical three-tier architecture
● Pods take a long time to warm up and become ready
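The warm-up problem is usually handled with probe tuning. A generic sketch (not the talk's actual manifest; names, port and timings are illustrative):

```yaml
# Sketch: give a slow-starting Java pod time to warm up before it gets
# traffic, without the liveness probe killing it during startup.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monolith
spec:
  replicas: 3
  selector:
    matchLabels:
      app: monolith
  template:
    metadata:
      labels:
        app: monolith
    spec:
      containers:
        - name: payara
          image: registry.example.com/monolith:latest
          ports:
            - containerPort: 8080
          readinessProbe:          # keep traffic away until warm-up is done
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:           # restart only on real hangs, not slow starts
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 180
            periodSeconds: 20
```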

Changing the mindset
● Pods are mortal, short-lived -> more disruption
● Cattle, not pets
● More flexibility for devs
● Several months of prod deployments to both the old environment and k8s -> gain trust
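Mortal pods mean more planned disruption; one standard way to cap it (a generic sketch, not shown in the talk) is a PodDisruptionBudget:

```yaml
# Sketch: a PodDisruptionBudget limits how many pods of an app voluntary
# disruptions (node drains, rolling updates of nodes) may take down at once.
apiVersion: policy/v1beta1   # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: monolith-pdb
spec:
  minAvailable: 2            # always keep at least two pods serving
  selector:
    matchLabels:
      app: monolith
```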

kops — our k8s distro of choice
● Well tested, rock solid
● Ships rather old k8s versions
● No problems with k8s updates, even to 1.12 with etcd3
● Only once did a cluster die, last year (it was about one year old)
● Container Linux with CLUO (update operator)
● Spin up a test cluster via CI
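Spinning up a throwaway test cluster with kops looks roughly like this; the state-store bucket, cluster name and instance sizes are illustrative placeholders:

```shell
# Sketch: create, validate, and tear down a disposable test cluster via kops.
export KOPS_STATE_STORE=s3://example-kops-state

kops create cluster \
  --name test.k8s.example.com \
  --zones eu-central-1a \
  --node-count 2 \
  --node-size t3.medium \
  --yes

kops validate cluster --name test.k8s.example.com

# When the CI run is done:
kops delete cluster --name test.k8s.example.com --yes
```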

The big switch — moving to the cloud
● Went smoothly in general
● Problems with secrets and Helm (chicken-and-egg problem)
● Prometheus and Elasticsearch collapsed several times under load

Elasticsearch on Kubernetes
● We use the chart from the helm/stable repo
● Recommendable!
● Almost no reason to run an ES cluster outside of k8s
● Exception: logs were lost when the cluster crashed
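Installing that chart with the Helm v2 the talk mentions looks roughly like this; the release name, namespace and replica override are illustrative:

```shell
# Sketch: install the stable Elasticsearch chart (Helm v2 syntax).
helm repo update
helm install stable/elasticsearch \
  --name logging-es \
  --namespace logging \
  --set data.replicas=3
```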

Fluentd & Fluent Bit
● Fluentd in use since the beginning
● Need for encrypted logs
● Complex filter chain
● Trial and error
● Bad documentation
● Fluent Bit evaluated a short time ago
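A Fluentd pipeline of the kind described (tail container logs, filter, forward encrypted) can be sketched as below; tags, paths and the aggregator host are illustrative, not the talk's actual configuration:

```
# Hypothetical excerpt of a Fluentd filter chain for container logs.
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.pos
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata   # enrich records with pod/namespace metadata
</filter>

<match kubernetes.**>
  @type forward               # ship to an aggregator, encrypted with TLS
  transport tls
  <server>
    host logs.example.internal
    port 24224
  </server>
</match>
```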

Helm
● We use v2.13.1 at the moment
● Complex deployments
● Rollback feature & installation of previous versions broken
● Deployments get stuck in broken states
● No alternatives; kustomize, for example, does something different
● Hopes for Helm 3
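The Helm v2 commands involved when a release gets stuck are roughly these (release name illustrative); the rollback step is the one the talk found unreliable:

```shell
# Sketch: inspecting and recovering a broken Helm v2 release.
helm ls --all my-service         # shows status, e.g. FAILED or PENDING_UPGRADE
helm history my-service          # list revisions of the release
helm rollback my-service 3       # roll back to revision 3 (flaky in v2)
helm delete --purge my-service   # last resort: wipe the release record
```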

Security
● RBAC
● No root user in containers
● No Kubernetes dashboard / web UI
● AWS audit trail: CloudTrail with alerting
● Clair container vulnerability scanning in CI
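The "no root in containers" rule is typically enforced via a pod securityContext. A generic sketch (not the talk's actual manifest; UID and image are illustrative):

```yaml
# Sketch: refuse to run as root and tighten the container's privileges.
apiVersion: v1
kind: Pod
metadata:
  name: no-root-example
spec:
  securityContext:
    runAsNonRoot: true        # kubelet refuses to start root containers
    runAsUser: 10001
    fsGroup: 10001
  containers:
    - name: app
      image: registry.example.com/app:1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```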

Monitoring: Prometheus with prometheus-operator & kube-prometheus
● The CoreOS Prometheus operator runs Prometheus, Grafana, Alertmanager
● Initially installed with Helm — painful
● Better solution: jsonnet
● CI
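With prometheus-operator, scrape targets become declarative resources instead of prometheus.yml edits. A generic sketch (names, labels and namespaces are illustrative):

```yaml
# Sketch: a ServiceMonitor tells the operator's Prometheus which Services
# to scrape, selected by label, instead of editing prometheus.yml.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-service
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics      # named port on the target Service
      interval: 30s
```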

Monitoring infrastructure services

Infrastructure (cluster) testing
● Self-developed testing setup, running every day
● Move to kube-bench planned

Authentication & authorization
● At first: manual management of kubeconfig files
● Then: Dex (OpenID Connect provider), kuberos, OAuth proxy -> RBAC authorization
● Now: Heptio Gangway, Keycloak -> RBAC authorization
● Authentication through GitLab in both setups
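The RBAC side of such a setup binds an OIDC group claim to a role. A generic sketch (the group name and namespace are illustrative, not from the talk):

```yaml
# Sketch: grant read-only access to a namespace for an OIDC-authenticated
# group coming from Keycloak/Dex.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: devs-view
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view               # built-in read-only role
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: oidc:team-a-devs # group claim from the identity provider
```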

GitLab
● Single sign-on for our Karma services, e.g. Kibana, Grafana and Gangway
● Most rollouts via Helm in CI into different clusters
● Docker builds
● Test cluster setup
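A GitLab CI pipeline doing Docker builds plus Helm rollouts can be sketched as below; the stage names, chart path and the KUBECONFIG_PROD variable are hypothetical, not the talk's actual pipeline:

```yaml
# Hypothetical .gitlab-ci.yml fragment: build an image, then roll it out
# to a cluster with Helm, using CI/CD variables for credentials.
stages:
  - build
  - deploy

build-image:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy-prod:
  stage: deploy
  script:
    - helm upgrade --install my-service ./chart --kubeconfig "$KUBECONFIG_PROD" --set image.tag=$CI_COMMIT_SHA
  only:
    - master
```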

Cluster networking: CNI
● Weave in the beginning
● Then: Canal (Calico and Flannel combined)
● Calico features unused for a long time
● Special case: the GCE cluster uses kubenet; a switch to Weave (a real CNI) is planned: https://github.com/kubernetes/kops/issues/2087

DNS inside and outside of clusters
● Inside clusters
  ● kube-dns for a long time
  ● 5-second responses from time to time
  ● Switch to CoreDNS
  ● Better performance and the autopath plugin
● Outside clusters
  ● external-dns by Zalando -> Route53, CloudFlare
  ● Just annotate your deployment
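The "just annotate" workflow looks roughly like this; the hostname and service name are illustrative:

```yaml
# Sketch: external-dns watches for this annotation and creates the matching
# Route53/CloudFlare record pointing at the Service's load balancer.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    external-dns.alpha.kubernetes.io/hostname: my-service.example.com
spec:
  type: LoadBalancer
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
```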

Ingress & SSL
● Traefik in the beginning
● Switch to nginx-ingress before the big migration
● Rock solid, better performance than Traefik
● cert-manager by Jetstack
● Let's Encrypt usage
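A cert-manager + Let's Encrypt setup of that era can be sketched as below; the e-mail address and secret name are illustrative, and the API group shown is the old certmanager.k8s.io one used by cert-manager 0.x:

```yaml
# Sketch: a ClusterIssuer for Let's Encrypt, referenced from Ingress
# resources so cert-manager obtains and renews certificates automatically.
apiVersion: certmanager.k8s.io/v1alpha1   # cert-manager 0.x API group
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    http01: {}               # solve challenges over HTTP-01
```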

Conclusion (1/2)
● Better scaling: horizontal pod autoscaler (HPA) and aws-autoscaler (more nodes)
● Self-healing, but tuning apps and load testing took a lot of effort
● Far better operations even though many 12-factor rules are neglected
● It's always good to have a failover cluster in place
● Databases are located outside of Kubernetes
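An HPA of the kind mentioned can be sketched as follows; the target deployment, replica bounds and CPU threshold are illustrative:

```yaml
# Sketch: scale a Deployment between 3 and 10 replicas on CPU utilization
# (autoscaling/v1, available on the cluster versions discussed in the talk).
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```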

Conclusion (2/2)
● No network problems anymore
● Just scale with customer traffic
● Six full-time DevOps engineers and one dev team needed for the project
● Engaging with the community helps a lot
● Steep learning curve with new tools

The future
● Pod security policies
● Network policies
● Anomaly detection in pods
● More microservices
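Network policies of the kind planned usually start with a default-deny rule. A generic sketch (namespace is illustrative):

```yaml
# Sketch: deny all inbound pod traffic in a namespace by default; allow
# rules are then added per app in further NetworkPolicy objects.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all inbound is denied
```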

Sources
● Pictures from https://commons.wikimedia.org
● kops: https://github.com/kubernetes/kops
● Fluentd: https://www.fluentd.org/
● Elasticsearch chart: https://github.com/helm/charts/tree/master/stable/elasticsearch

Questions & Feedback