
Schlammschlacht auf der grünen Wiese (A Mud Fight on the Greenfield) – Lessons from Three Years of Cloud Native Development
Robert Hoffmann, Deutsche Telekom

The "Hallo Magenta" smart speaker platform is one of Deutsche Telekom's first consumer products to be developed "cloud native" from day one. As the platform's technical product manager, Robert looks back and takes stock: where did technology, organization, and ambition complement each other? And where didn't they?

Transcript

  1. A complete solution to build voice assistants
     • full stack to understand, process and respond
     • UX incl. default skills
     • fully customizable
     • SaaS, API-driven
     • GDPR by design
  2. Comprehensive in-house know-how
     • Voice processing, agnostic APIs
     • European cooperation with Orange
     • Close collaboration with European partners and local businesses
  3. SOME DATA
     • Languages & frameworks: Java, Kotlin, Spring (some apps reactive), Python, Golang, Angular
     • > 300 active git repos
     • > 600 active Mattermost users
  4. GOAL: „ACCELERATED“ DEVELOPMENT
     Industry experts (1) have identified the following metrics that measure software delivery and organizational performance:
     • low product delivery lead time – the time it takes to go from code committed to code successfully running in production
     • high deployment frequency – a proxy metric for a small batch size (Lean paradigm) that is easy to measure
     • low time to repair – how long it generally takes to restore service for the application when there is an incident
     • low change fail rate – what percentage of changes to production (for example, software releases and infrastructure configuration changes) fail
     (1) "Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations" by Nicole Forsgren PhD, Jez Humble, Gene Kim
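As a sketch with hypothetical deployment records and simplified definitions (not the platform's actual tooling), the four Accelerate metrics can be computed like this:

```python
from datetime import datetime, timedelta

# Hypothetical records: (commit_time, deploy_time, failed, restored_time)
deploys = [
    (datetime(2019, 10, 1, 9), datetime(2019, 10, 1, 15), False, None),
    (datetime(2019, 10, 2, 10), datetime(2019, 10, 2, 11), True,
     datetime(2019, 10, 2, 11, 30)),
    (datetime(2019, 10, 3, 8), datetime(2019, 10, 3, 9), False, None),
]

# Lead time: from code committed to code running in production
lead_times = [deploy - commit for commit, deploy, _, _ in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Deployment frequency over the observed window
window_days = (deploys[-1][1] - deploys[0][1]).days or 1
deploy_frequency = len(deploys) / window_days

# Change fail rate and mean time to restore
failures = [d for d in deploys if d[2]]
fail_rate = len(failures) / len(deploys)
mttr = sum((d[3] - d[1] for d in failures), timedelta()) / len(failures)

print(avg_lead_time, deploy_frequency, fail_rate, mttr)
```

With the sample data above this yields a lead time of about 2:40 hours, 3 deployments per day, a 1/3 change fail rate and a 30-minute time to repair.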
  5. GOAL: AUTONOMOUS TEAMS
     Teams should be able to work:
     • autonomously, without blocking each other
     • without requiring permission from, or depending on, other teams
     • without communicating and coordinating with outsiders
     • deploy on demand, regardless of the other services a service depends upon
     • test on demand, without requiring an integrated test environment
     • perform deployments during normal business hours with negligible downtime
     • little communication is required between delivery teams to get their work done
     • the architecture of the system is designed to enable teams to test, deploy, and change their systems without dependencies on other teams
     • architecture and teams are loosely coupled
     • delivery teams are cross-functional, with all the skills necessary to design, develop, test, deploy, and operate the system on the same team
  6. GOAL: AUTONOMOUS TEAMS (same bullets as the previous slide, with the punchline:) SCALE BY MAKING TEAMS AS AUTONOMOUS AS POSSIBLE
  7. CLOUD NATIVE OBSERVABILITY IS A BIG WIN
     Make the system transparent for everybody: Dev, QA, POs, Users ... Share all the insights.
  8. OVERVIEW (06.10.19)
     [Architecture diagram: services ("Awesome Service", "Other Awesome Service") in the application world are instrumented with Prometheus clients and tracers; metrics flow to Prometheus, traces to Zipkin, and logs to the ElasticStack in the diagnosability world, with Grafana on top.]
     Operational data: • Metrics (Prometheus) • Events (Logs) • Traces (Zipkin)
  9. „CLOUD NATIVE COLLABORATION“
     Using modern observability / diagnosability / APM tools to collaborate.
     • Building first-class observability into apps.
     • Linking observability tools & data.
     • Making observability easy to access.
     • Supporting documentation with observability.
     • Offering the same view to the whole (technical) audience.
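One concrete way to link observability tools & data – sketched here with the Python standard library, not the platform's actual implementation – is to stamp every log line with the current trace ID, so a Zipkin trace can be found from an ElasticStack log entry and vice versa:

```python
import logging
from contextvars import ContextVar

# Hypothetical trace context; in a real service the tracing middleware sets this.
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the active trace ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
logger = logging.getLogger("awesome-service")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6")  # normally done per request by the tracer
logger.info("order processed")  # log line now carries trace=4bf92f3577b34da6
```

Any log pipeline that indexes the `trace_id` field then gives you a direct jump between logs and traces.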
  10. METRICS WITH PROMETHEUS + GRAFANA
     • Part of the Dev(Ops) flow: comes naturally now whenever a new feature is built
     • What metrics do we need?
     • Devs contribute their own Grafana dashboards
     • Create new metrics now and throw them away later when they become less relevant
     • Cool pattern: deploy the new service in the shadow, collect metrics, make fixes, officially release when confident
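To illustrate the metric pattern, here is a minimal, self-contained sketch of a Prometheus-style counter and the text exposition format that Prometheus scrapes from a `/metrics` endpoint. A real service would use an official Prometheus client library instead of hand-rolling this:

```python
class Counter:
    """Toy Prometheus-style counter, for illustration only."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def expose(self) -> str:
        # The text format a Prometheus server scrapes from /metrics
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {self.value}\n")

requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
for _ in range(3):
    requests_total.inc()  # e.g. once per handled request
print(requests_total.expose())
```

Grafana dashboards are then built on top of queries against such series, which is what makes "devs contribute their own dashboards" cheap.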
  11. USING KUBERNETES
     • Is it the next big marketing ploy to move you to the public cloud?
     • Is it too complex / overkill?
     • Is it actually helping your DevOps movement?
     • Is it really able to make your apps more portable?
  12. USING KUBERNETES (same questions as the previous slide) YES.
  13. USING KUBERNETES IS LIKE…
     https://dragonball.fandom.com/wiki/Weighted_Clothing
     Weighted clothing is clothing that adds weight to various parts of the body, usually as part of resistance training.
  14. USING KUBERNETES – KEY TAKEAWAYS
     • Containerizing our apps is the important part: CI/CD, immutability, repeatability
     • API-driven deployments promote everything-as-code, GitOps
     • Weighted clothing: constraints & complexity enforce better practices
     • We want to use it but not operate it.
     • We shot ourselves in the foot multiple times by writing bad deployment config.
     • We need to be able to create clusters fully automatically & on demand.
  15. USING KUBERNETES AS CATTLE
     • State (databases, storage) is outside the cluster
     • Origin: storage wasn’t reliable in certain environments.
     • Close to immutable infrastructure now; we can throw away the whole cluster, do canary testing etc.
     • Important capability! You do not want to upgrade a live Kubernetes cluster.
     • Cool use cases for outage / intrusion scenarios: lock down the cluster for analysis / forensics, create a new one.
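Because state lives outside the cluster, a replacement cluster can take over gradually. The slide doesn't spell out the mechanics, but the canary idea boils down to a weight schedule like the following sketch (the real shift would happen via DNS weights or a load balancer):

```python
def canary_schedule(steps: int) -> list[tuple[int, int]]:
    """Return (old_cluster_weight, new_cluster_weight) pairs in percent,
    shifting traffic linearly from the old cluster to its replacement."""
    assert steps >= 2, "need at least a start and an end state"
    schedule = []
    for i in range(steps):
        new_weight = round(100 * i / (steps - 1))
        schedule.append((100 - new_weight, new_weight))
    return schedule

print(canary_schedule(5))
# [(100, 0), (75, 25), (50, 50), (25, 75), (0, 100)]
```

At each step you watch metrics on the new cluster before advancing; once it carries 100%, the old cluster can simply be deleted.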
  16. USING KUBERNETES IS JUST 10% OF YOUR DEV STORY
     • Integrate IAM, ideally with your cloud / IaaS provider
     • Set up logging, tracing, metrics, alerting
     • Configure security policies, define a team partitioning strategy
     • Know how to write sane k8s config
     • Watch out for objects that are not “garbage-collected”
  17. USING KUBERNETES IS JUST 10% OF YOUR DEV STORY
     • Use auto node scaling?
     • Keep track of infra constraints leaking into your container world: max connections, bandwidth, file handles, attached volumes…
     • Adversarial application profiles on the same node (I/O- or CPU-heavy, etc.)
     • What’s the SLA on your CI/CD and collaboration tools?
     • You can’t cleanly change deployments when git or CI jobs don’t work.
     • You can’t easily communicate during incidents when team chat is down.
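A simple way to keep track of leaking infra constraints is to check app profiles against node-level limits before scheduling them together. This is a hypothetical sketch with made-up numbers, not a real admission check:

```python
# Illustrative node-level limits that "leak" into the container world.
NODE_LIMITS = {
    "max_open_files": 65536,
    "max_connections": 10000,
    "max_attached_volumes": 16,
}

def check_fit(app_requirements: dict, node_limits: dict = NODE_LIMITS) -> list[str]:
    """Return a human-readable list of node constraints an app profile violates."""
    return [
        f"{key}: needs {need}, node allows {node_limits[key]}"
        for key, need in app_requirements.items()
        if need > node_limits.get(key, float("inf"))
    ]

io_heavy_app = {"max_open_files": 100000, "max_connections": 8000}
print(check_fit(io_heavy_app))
# ['max_open_files: needs 100000, node allows 65536']
```

The same idea extends to the "adversarial profiles" bullet: two apps can each fit individually while their summed requirements exceed the node.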
  18. CI/CD ARCHITECTURE
     [Architecture diagram: app repos (DEV: READ + MR) trigger the CI Gateway (Compliance Checker) on commit; the Promotor (Deployment Orchestrator) reads the Central Config Repo (DEV: MASTER), the Secrets Repo and the IAC Repo (DEV: READ + MR), and deploys to the Infrastructure (K8S, IaaS, PaaS); a Base Image Updater updates the Dockerfiles.]
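The core of a deployment orchestrator like the Promotor can be sketched as a GitOps-style reconcile step: compare the desired versions in the central config repo with what actually runs, and emit the difference as actions. All names and data here are hypothetical; the real Promotor also handles compliance checks, secrets and IaC:

```python
def reconcile(desired: dict, running: dict) -> list[str]:
    """Return the deployment actions that bring the infrastructure
    in line with the central config repo (GitOps-style)."""
    actions = []
    for app, version in desired.items():
        if running.get(app) != version:
            actions.append(f"deploy {app}@{version}")
    for app in running.keys() - desired.keys():
        actions.append(f"remove {app}")  # no longer in the config repo
    return actions

desired = {"awesome-service": "1.4.0", "other-awesome-service": "2.1.3"}
running = {"awesome-service": "1.3.9", "legacy-service": "0.9.0"}
print(reconcile(desired, running))
```

Run in a loop (or on every merge to the config repo), this is what makes the git repository the single source of truth for the whole stack.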
  19. CENTRALIZED VS DECENTRALIZED CI/CD
     • We chose centralized: a single git repo contains everything needed to deploy the whole app stack from scratch.
     • Teams are not as independent and knowledgeable when it comes to deployments & configuration; the masters are the system engineering team.
     • Devs can see and interact with the config though, by inspecting the central git repository and creating merge requests.
     • Allowed us to easily prepare and execute multiple backend migrations.
  20. THE ZALANDO TEST – DO YOU REALLY NEED KUBERNETES?
     • It’s hard to operate unless you are as experienced as Zalando.
     • It’s hard to use right unless you are as experienced as Zalando.
     • You need to do a lot of integration work until you reach the level of Zalando.
     [Chart: below Zalando’s “Kubernetes level”, at least check other options; at or above it, use Kubernetes.]
  21. HOT TAKE
     You can use a proprietary container product like Cloud Run, ECS or Service Fabric Mesh as long as
     • it uses Docker images
     • and has a lean runtime contract.
     https://blog.hypescaler.com/2019-08-05-containers-intro/
  22. QA & RELEASE PROCESS
     The trap: when boundaries aren‘t clearly defined and communicated.
     • Technical: the software is testable & releasable in a decoupled, microservice-y way; the services expose APIs and have a large suite of unit, integration and E2E (API) tests.
     • Organizational: the project sees itself as one giant, single coupled organization, and our processes follow the same pattern.
     • Watch out: due to incorrect boundaries, you can work too closely together in the wrong areas.
  23. QA & RELEASE PROCESS
     • In the beginning: a centralized, slow, manual QA & release process that takes days
     • Traditional, “monolithic” QA is defeated by simple math: if you have 100 microservices and each of them can be released every two weeks, you have 10 releases per working day.
     • You have to trust your developers to deliver stable, tested applications.
     • “Manual”, long-running QA is still very important: as asynchronous, exploratory testing, not as a synchronous release gate!
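The "simple math" on the slide works out as follows (a two-week cycle has 10 working days):

```python
services = 100
release_cycle_working_days = 10  # two weeks of working days

# If every service ships once per cycle, releases pile up per day:
releases_per_day = services / release_cycle_working_days
print(releases_per_day)  # 10.0 releases every working day
```

No manual, synchronous QA gate can keep up with that rate, which is the argument for trusting automated tests and moving manual QA to asynchronous exploration.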
  24. QA & RELEASE PROCESS – STRATEGIES TO DRIVE CHANGE
     • Collect metrics – where do you find your bugs? For us, it is during development and in production, almost never in QA.
     • Release process evolution is hindered by heroism – remove it
     • Internal system boundaries were not clearly defined, communicated, agreed on – make this explicit
     • You need leadership understanding & support for the above
     • Technical capabilities (canary deployments) as a basis for discussion on how to limit the blast radius; error budgets etc. as a way to allow change without pressure.
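Error budgets, mentioned above as a way to allow change without pressure, can be made concrete. A sketch assuming an availability SLO (e.g. 99.9%, a value chosen here for illustration, not stated in the talk):

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime per period for a given availability SLO.
    E.g. a 99.9% SLO leaves a (1 - 0.999) share of the period as budget."""
    return (1 - slo) * period_days * 24 * 60

budget = error_budget_minutes(0.999)  # 99.9% over 30 days
print(round(budget, 1))  # 43.2 minutes
```

As long as the budget isn't spent, teams are free to release aggressively; once it is, the focus shifts to stability – a neutral rule that replaces release-gate negotiations.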