
Schlammschlacht auf der grünen Wiese (A Mud Fight on the Greenfield) – Lessons from Three Years of Cloud Native Development
Robert Hoffmann, Deutsche Telekom

The "Hallo Magenta" smart speaker platform is one of Deutsche Telekom's first consumer products to be developed "cloud native" from day one. As the platform's technical product manager, Robert looks back and takes stock: where did technology, organization, and ambition complement each other? And where didn't they?

Transcript

  1. A complete solution to build voice assistants
     • full stack to understand, process and respond
     • UX incl. default skills
     • fully customizable
     • SaaS, API-driven
     • GDPR by design
  2. Comprehensive in-house know-how
     • Voice processing, agnostic APIs
     • European cooperation with Orange
     • Close collaboration with European partners and local businesses
  3. SOME DATA
     • Languages & frameworks: Java, Kotlin, Spring (some apps reactive), Python, Golang, Angular
     • > 300 active git repos
     • > 600 active Mattermost users
  4. GOAL: „ACCELERATED“ DEVELOPMENT
     Industry experts (1) have identified the following metrics that measure software delivery and organizational performance:
     • low product delivery lead time – the time it takes to go from code committed to code successfully running in production
     • high deployment frequency – a proxy metric for a small batch size (Lean paradigm) that is easy to measure
     • low time to repair – how long it generally takes to restore service for the application when there is an incident
     • low change fail rate – what percentage of changes to production (for example, software releases and infrastructure configuration changes) fail
     (1) "Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations" by Nicole Forsgren PhD, Jez Humble, Gene Kim
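As a sketch with hypothetical deployment records and simplified definitions (not the platform's actual tooling), the four Accelerate metrics can be computed like this:

```python
from datetime import datetime, timedelta

# Hypothetical records: (commit_time, deploy_time, failed, restored_time)
deploys = [
    (datetime(2019, 10, 1, 9), datetime(2019, 10, 1, 15), False, None),
    (datetime(2019, 10, 2, 10), datetime(2019, 10, 2, 11), True,
     datetime(2019, 10, 2, 11, 30)),
    (datetime(2019, 10, 3, 8), datetime(2019, 10, 3, 9), False, None),
]

# Lead time: from code committed to code running in production
lead_times = [deploy - commit for commit, deploy, _, _ in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Deployment frequency over the observed window
window_days = (deploys[-1][1] - deploys[0][1]).days or 1
deploy_frequency = len(deploys) / window_days

# Change fail rate and mean time to restore
failures = [d for d in deploys if d[2]]
fail_rate = len(failures) / len(deploys)
mttr = sum((d[3] - d[1] for d in failures), timedelta()) / len(failures)

print(avg_lead_time, deploy_frequency, fail_rate, mttr)
```

With the sample data above this yields a lead time of about 2:40 hours, 3 deployments per day, a 1/3 change fail rate and a 30-minute time to repair.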
  5. GOAL: AUTONOMOUS TEAMS
     Teams should be able to work:
     • autonomously, without blocking each other
     • without requiring permission from, or depending on, other teams
     • without communicating and coordinating with outsiders
     • deploy on demand, regardless of the other services a service depends upon
     • test on demand, without requiring an integrated test environment
     • perform deployments during normal business hours with negligible downtime
     • little communication is required between delivery teams to get their work done
     • the architecture of the system is designed to enable teams to test, deploy, and change their systems without dependencies on other teams
     • architecture and teams are loosely coupled
     • delivery teams are cross-functional, with all the skills necessary to design, develop, test, deploy, and operate the system on the same team
  6. GOAL: AUTONOMOUS TEAMS (same bullets as the previous slide, with the punchline:) SCALE BY MAKING TEAMS AS AUTONOMOUS AS POSSIBLE
  7. CLOUD NATIVE OBSERVABILITY IS A BIG WIN
     Make the system transparent for everybody: Dev, QA, POs, Users ... Share all the insights.
  8. OVERVIEW (06.10.19)
     [Architecture diagram: services ("Awesome Service", "Other Awesome Service") in the application world are instrumented with Prometheus clients and tracers; metrics flow to Prometheus, traces to Zipkin, and logs to the ElasticStack in the diagnosability world, with Grafana on top.]
     Operational data: • Metrics (Prometheus) • Events (Logs) • Traces (Zipkin)
  9. „CLOUD NATIVE COLLABORATION“
     Using modern observability / diagnosability / APM tools to collaborate.
     • Building first-class observability into apps.
     • Linking observability tools & data.
     • Making observability easy to access.
     • Supporting documentation with observability.
     • Offering the same view to the whole (technical) audience.
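One concrete way to link observability tools & data – sketched here with the Python standard library, not the platform's actual implementation – is to stamp every log line with the current trace ID, so a Zipkin trace can be found from an ElasticStack log entry and vice versa:

```python
import logging
from contextvars import ContextVar

# Hypothetical trace context; in a real service the tracing middleware sets this.
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the active trace ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
logger = logging.getLogger("awesome-service")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6")  # normally done per request by the tracer
logger.info("order processed")  # log line now carries trace=4bf92f3577b34da6
```

Any log pipeline that indexes the `trace_id` field then gives you a direct jump between logs and traces.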
  10. METRICS WITH PROMETHEUS + GRAFANA
     • Part of the Dev(Ops) flow: comes naturally now whenever a new feature is built
     • What metrics do we need?
     • Devs contribute their own Grafana dashboards
     • Create new metrics now and throw them away later when they become less relevant
     • Cool pattern: deploy the new service in the shadow, collect metrics, make fixes, officially release when confident
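To illustrate the metric pattern, here is a minimal, self-contained sketch of a Prometheus-style counter and the text exposition format that Prometheus scrapes from a `/metrics` endpoint. A real service would use an official Prometheus client library instead of hand-rolling this:

```python
class Counter:
    """Toy Prometheus-style counter, for illustration only."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def expose(self) -> str:
        # The text format a Prometheus server scrapes from /metrics
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {self.value}\n")

requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
for _ in range(3):
    requests_total.inc()  # e.g. once per handled request
print(requests_total.expose())
```

Grafana dashboards are then built on top of queries against such series, which is what makes "devs contribute their own dashboards" cheap.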
  11. USING KUBERNETES
     • Is it the next big marketing ploy to move you to the public cloud?
     • Is it too complex / overkill?
     • Is it actually helping your DevOps movement?
     • Is it really able to make your apps more portable?
  12. USING KUBERNETES (same questions as the previous slide) YES.
  13. USING KUBERNETES IS LIKE…
     https://dragonball.fandom.com/wiki/Weighted_Clothing
     Weighted clothing is clothing that adds weight to various parts of the body, usually as part of resistance training.
  14. USING KUBERNETES – KEY TAKEAWAYS
     • Containerizing our apps is the important part: CI/CD, immutability, repeatability
     • API-driven deployments promote everything-as-code, GitOps
     • Weighted clothing: constraints & complexity enforce better practices
     • We want to use it but not operate it.
     • We shot ourselves in the foot multiple times by writing bad deployment config.
     • We need to be able to create clusters fully automatically & on demand.
  15. USING KUBERNETES AS CATTLE
     • State (databases, storage) is outside the cluster
     • Origin: storage wasn’t reliable in certain environments.
     • Close to immutable infrastructure now; we can throw away the whole cluster, do canary testing etc.
     • Important capability! You do not want to upgrade a live Kubernetes cluster.
     • Cool use cases for outage / intrusion scenarios: lock down the cluster for analysis / forensics, create a new one.
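Because state lives outside the cluster, a replacement cluster can take over gradually. The slide doesn't spell out the mechanics, but the canary idea boils down to a weight schedule like the following sketch (the real shift would happen via DNS weights or a load balancer):

```python
def canary_schedule(steps: int) -> list[tuple[int, int]]:
    """Return (old_cluster_weight, new_cluster_weight) pairs in percent,
    shifting traffic linearly from the old cluster to its replacement."""
    assert steps >= 2, "need at least a start and an end state"
    schedule = []
    for i in range(steps):
        new_weight = round(100 * i / (steps - 1))
        schedule.append((100 - new_weight, new_weight))
    return schedule

print(canary_schedule(5))
# [(100, 0), (75, 25), (50, 50), (25, 75), (0, 100)]
```

At each step you watch metrics on the new cluster before advancing; once it carries 100%, the old cluster can simply be deleted.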
  16. USING KUBERNETES IS JUST 10% OF YOUR DEV STORY
     • Integrate IAM, ideally with your cloud / IaaS provider
     • Set up logging, tracing, metrics, alerting
     • Configure security policies, define a team partitioning strategy
     • Know how to write sane k8s config
     • Watch out for objects that are not “garbage-collected”
  17. USING KUBERNETES IS JUST 10% OF YOUR DEV STORY
     • Use auto node scaling?
     • Keep track of infra constraints leaking into your container world: max connections, bandwidth, file handles, attached volumes…
     • Adversarial application profiles on the same node (I/O- or CPU-heavy, etc.)
     • What’s the SLA on your CI/CD and collaboration tools?
     • You can’t cleanly change deployments when git or CI jobs don’t work.
     • You can’t easily communicate during incidents when team chat is down.
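A simple way to keep track of leaking infra constraints is to check app profiles against node-level limits before scheduling them together. This is a hypothetical sketch with made-up numbers, not a real admission check:

```python
# Illustrative node-level limits that "leak" into the container world.
NODE_LIMITS = {
    "max_open_files": 65536,
    "max_connections": 10000,
    "max_attached_volumes": 16,
}

def check_fit(app_requirements: dict, node_limits: dict = NODE_LIMITS) -> list[str]:
    """Return a human-readable list of node constraints an app profile violates."""
    return [
        f"{key}: needs {need}, node allows {node_limits[key]}"
        for key, need in app_requirements.items()
        if need > node_limits.get(key, float("inf"))
    ]

io_heavy_app = {"max_open_files": 100000, "max_connections": 8000}
print(check_fit(io_heavy_app))
# ['max_open_files: needs 100000, node allows 65536']
```

The same idea extends to the "adversarial profiles" bullet: two apps can each fit individually while their summed requirements exceed the node.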
  18. CI/CD ARCHITECTURE
     [Architecture diagram: app repos (DEV: READ + MR) trigger the CI Gateway (Compliance Checker) on commit; the Promotor (Deployment Orchestrator) reads the Central Config Repo (DEV: MASTER), the Secrets Repo and the IAC Repo (DEV: READ + MR), and deploys to the Infrastructure (K8S, IaaS, PaaS); a Base Image Updater updates the Dockerfiles.]
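The core of a deployment orchestrator like the Promotor can be sketched as a GitOps-style reconcile step: compare the desired versions in the central config repo with what actually runs, and emit the difference as actions. All names and data here are hypothetical; the real Promotor also handles compliance checks, secrets and IaC:

```python
def reconcile(desired: dict, running: dict) -> list[str]:
    """Return the deployment actions that bring the infrastructure
    in line with the central config repo (GitOps-style)."""
    actions = []
    for app, version in desired.items():
        if running.get(app) != version:
            actions.append(f"deploy {app}@{version}")
    for app in running.keys() - desired.keys():
        actions.append(f"remove {app}")  # no longer in the config repo
    return actions

desired = {"awesome-service": "1.4.0", "other-awesome-service": "2.1.3"}
running = {"awesome-service": "1.3.9", "legacy-service": "0.9.0"}
print(reconcile(desired, running))
```

Run in a loop (or on every merge to the config repo), this is what makes the git repository the single source of truth for the whole stack.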
  19. CENTRALIZED VS DECENTRALIZED CI/CD
     • We chose centralized: a single git repo contains everything needed to deploy the whole app stack from scratch.
     • Teams are not as independent and knowledgeable when it comes to deployments & configuration; the masters are the system engineering team.
     • Devs can see and interact with the config though, by inspecting the central git repository and creating merge requests.
     • Allowed us to easily prepare and execute multiple backend migrations.
  20. THE ZALANDO TEST – DO YOU REALLY NEED KUBERNETES?
     • It’s hard to operate unless you are as experienced as Zalando.
     • It’s hard to use right unless you are as experienced as Zalando.
     • You need to do a lot of integration work until you reach the level of Zalando.
     [Chart: below Zalando’s “Kubernetes level”, at least check other options; at or above it, use Kubernetes.]
  21. HOT TAKE
     You can use a proprietary container product like Cloud Run, ECS or Service Fabric Mesh as long as
     • it uses Docker images
     • and has a lean runtime contract.
     https://blog.hypescaler.com/2019-08-05-containers-intro/
  22. QA & RELEASE PROCESS
     The trap: when boundaries aren‘t clearly defined and communicated.
     • Technical: the software is testable & releasable in a decoupled, microservice-y way; the services expose APIs and have a large suite of unit, integration and E2E (API) tests.
     • Organizational: the project sees itself as one giant, single coupled organization, and our processes follow the same pattern.
     • Watch out: due to incorrect boundaries, you can work too closely together in the wrong areas.
  23. QA & RELEASE PROCESS
     • In the beginning: a centralized, slow, manual QA & release process that takes days
     • Traditional, “monolithic” QA is defeated by simple math: if you have 100 microservices and each of them can be released every two weeks, you have 10 releases per working day.
     • You have to trust your developers to deliver stable, tested applications.
     • “Manual”, long-running QA is still very important: as asynchronous, exploratory testing, not as a synchronous release gate!
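The "simple math" on the slide works out as follows (a two-week cycle has 10 working days):

```python
services = 100
release_cycle_working_days = 10  # two weeks of working days

# If every service ships once per cycle, releases pile up per day:
releases_per_day = services / release_cycle_working_days
print(releases_per_day)  # 10.0 releases every working day
```

No manual, synchronous QA gate can keep up with that rate, which is the argument for trusting automated tests and moving manual QA to asynchronous exploration.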
  24. QA & RELEASE PROCESS – STRATEGIES TO DRIVE CHANGE
     • Collect metrics – where do you find your bugs? For us, it is during development and in production, almost never in QA.
     • Release process evolution is hindered by heroism – remove it
     • Internal system boundaries were not clearly defined, communicated, agreed on – make this explicit
     • You need leadership understanding & support for the above
     • Technical capabilities (canary deployments) as a basis for discussion on how to limit the blast radius; error budgets etc. as a way to allow change without pressure.
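Error budgets, mentioned above as a way to allow change without pressure, can be made concrete. A sketch assuming an availability SLO (e.g. 99.9%, a value chosen here for illustration, not stated in the talk):

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime per period for a given availability SLO.
    E.g. a 99.9% SLO leaves a (1 - 0.999) share of the period as budget."""
    return (1 - slo) * period_days * 24 * 60

budget = error_budget_minutes(0.999)  # 99.9% over 30 days
print(round(budget, 1))  # 43.2 minutes
```

As long as the budget isn't spent, teams are free to release aggressively; once it is, the focus shifts to stability – a neutral rule that replaces release-gate negotiations.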