Around & After Kubernetes: The Principles and Ideas that Guide Us

King'ori Maina
September 06, 2019

In late 2015 we decided to move all our applications to containerised environments managed by Kubernetes. The migration took roughly three years to complete. Along the way we learnt a lot about containerisation, distributed systems, complicated migrations and automating systems used by over 85 developers. This talk shares some of the key principles and ideas that guided us. It touches only lightly on the technical side of Kubernetes and focuses on those ideas.

Links:

• Conference: DevOpsDays 2019 (Cape Town) - Full Talk
• Program: https://devopsdays.org/events/2019-cape-town/program/
• Video: https://youtu.be/sHZVD0fmVWg

PS: I'd recommend you watch the talk with speed set to 1.5x. I really need to work on my speed. 🙈

Transcript

  1. Around & After Kubernetes: The Principles and Ideas that Guide Us

    DevOpsDays Cape Town 2019. King’ori Maina, “King”.
  2. Meet The Team

    Together, we make the Infrastructure/DevOps/DevSecOps team ...
    • Hadrian Valentine, Infrastructure Engineer, hadrian@zappistore.com, @hadrianvale
    • King’ori Maina, Infrastructure Engineer, kingori@zappistore.com, @itskingori
    • Hati Chindove, Head of Information Security, hati@zappistore.com, @hatitye
    • Zac Blazic, Infrastructure Engineer, zac.blazic@zappistore.com, @zacblazic
    Head of In-Security (glorified task manager). Product Owner (paid to worry).
  3. What We Do

    Business Value: what clients pay us for. Provide insights to allow global brands to make better decisions. Predict effectiveness. Monitor performance. Validate ideas. Creating charts. Then creating more.
    Supporting Services:
    • Open-Sauce: not a typo. Lots of integration with awesome tools to make developers’ lives easier.
    • Port Control: our workflow. Internal stuff that’s not available in any off-the-shelf tool ... at least in one place.
    • Infrastructure: we use a bunch of technology from people way smarter than us.
  4. It All Started In 2015 ...

    Difficult Is Now Easy Easier.
  5. Our Goals

    To know where we want to go, we needed some long-term objectives …
    • Immutability: throw away any stateless component when we want. Because … immutability is a requirement for scaling up and scaling down.
    • Extensibility: build upon what we have when we want. Because … we want to stand on the shoulders of giants. No need to re-invent the wheel.
    • Reviewability: represent everything in source code when we want. Because … we want to be able to commit changes into source code and have a single source of truth.
    • Observability: debug it when we want, i.e. logging + metrics. Because … it’s not a matter of if things will go wrong but when.
    • Scalability: add capacity when we want … ideally, automatically. Because … we want to be able to handle the thundering herd without always running at capacity.
  6. Our Journey on Docker

    Challenge: We had a server provisioning problem. We had a packaging problem. We had a process isolation problem.
    Approach
    • We accepted the potential guaranteed risks.
    • We committed to set up any new services in Docker going forward.
    • We, in hindsight, did not know what we were setting ourselves up for!
    Impact
    • We now upgrade dependencies in isolation, reducing blast radius.
    • We now have repeatable development, build, test and production environments.
    • We now limit resources per application based on requirements.
    • We spin up new servers in ~3 minutes.
  7. Our Journey on Kubernetes

    Challenge: We had an orchestration problem. We had a serious peer-pressure problem. We had a hard-problem problem.
    Approach
    • We embraced potential guaranteed complexity but resolved to keep it at a minimum (opportunity cost).
    • We committed to migrating existing services to our new infrastructure one by one.
    • We rebuilt tooling step by step.
    Impact
    • We do not regret our decision.
    • We have an API to hard problems regarding infrastructure.
    • We spend more of our time on developer enablement than infrastructure problems.
    • We sleep better (declarative configuration, self-healing).
  8. Our Journey on Port Control

    Challenge: We had an internal-workflow problem. We had a retrofitting problem. We had a one “ring” to rule them all desire.
    Approach
    • We invested in building a tool from scratch, for us … by us.
    • We build it with a one-year horizon (more if possible).
    • We prioritize extensibility to cover unknown use cases.
    Impact
    • We do not wait for or hack tools to work how we work.
    • We have an API for our internal workflows.
    • We can on-board a new developer in less than 5 minutes (self-serve, immediately informed + productive).
    • 50k deployments since April 2017, 3.9k last month.
  9. In Retrospect

    • What is it that we’re looking forward to? Then we can be more intentional at building for the future by laying the right foundation as we go.
    • What is it that we’ve done right? So that we can keep doing it and guard against complacency.
    • What is it that we could have done better? Then we can focus on those areas and see what more potential we can unlock.
    It’s all a narrative fallacy.
  10. Part 1: Reduce Cognitive Load

    “We want to exploit all of the advantages that come from having a small number of well-known tools. When you have a small number of well-known tools, you can then focus on the product.”
    John Allspaw, Former Etsy CTO (@allspaw)
    Zappi Confidential & Proprietary Information
  11. Halt The Proliferation of Tools

    We’re living in amazing overwhelming times ... can we go back to LAMP stacks?
  12. Keep The Main Thing, The Main Thing

    We don’t want to be doing engineering for engineering’s sake …
    • Optimise pushing code to production.
    • Simplify processes so that self-service unblocks most people.
    • Make deployments robust and atomic.
    Because … if people are confident about the deploy process they will deploy more!
    Because … we want less work for ourselves so that we can focus on features, not crisis!
    Because … deployments are a unit of work and a representation of business value going out!
    ... of course, all of this has to be underpinned by … the system is stable and performant.
  13. The Cost of Abstractions

    “Post-mortem debriefings every day are littered with the artefacts of people insisting, the second before an outage, that ‘I don’t have to care about that.’”
    John Allspaw, Former Etsy CTO (@allspaw)
    Realities
    • Knowledge of Kubernetes is not an operational requirement for a developer.
    • Not all developers care about infrastructure.
    • Not all developers can care (context switching is expensive).
    • The right abstractions can have a multiplier effect on developer efficiency (consistency & predictability, e.g. labels).
  14. Part 2: Getting Out of the Way

    “It doesn’t make sense to hire smart people and tell them what to do; we hire smart people so they can tell us what to do.”
    Steve Jobs, Former Apple CEO
  15. Automate As Much As You Can

    • Need (once / month): Developer needs to figure out a way to do task-X.
    • Empowerment (multiple times / week): DevOps team provides a tool to do task-X (albeit manually).
    • Tomorrow (multiple times / day): DevOps team teaches the system to do task-X (automagically).

    $ portctl redeploy team --team=supa-team \
        --exclude-app=someapp-1 --exclude-app=someapp-2 \
        --refresh
  16. Delegate Responsibility Via Tooling

    • Need (once / month): Developer needs to figure out a way to do task-X.
    • Empowerment (multiple times / week): DevOps team provides a tool to do task-X (albeit manually).
    • Tomorrow (multiple times / day): DevOps team teaches the system to do task-X (automagically).

    $ portctl backup full --application=reports \
        --environment=production
    $ portctl restore full --application=reports \
        --environment=sandbox --team=supa-team --backup-id=123
  17. Part 3: Shared Ownership & Responsibility

    “Engineering, as a discipline and as an activity, is multidisciplinary. It’s just messy. And that’s actually the best part of engineering. It’s not about everyone knowing everything. It’s about paying attention to the shared, mutual understanding.”
    John Allspaw, Former Etsy CTO (@allspaw)
  18. Proactive Education

    Approach
    • We encourage questions and invest in detailed explanations.
    • We train on tooling where it’s not obvious, e.g. Kibana (for logs) and Grafana (for metrics).
    • We view being viewed as wizards as proof of our failure to educate.
    • We haven’t done a good job at high-level write-ups (documentation is code, for now).
    “As an engineer who starts day one, I am [not] the best one to know how network protocols at Etsy work, and I’m going to be encouraged to seek out the experts in those domains until I do. And maybe something will break, and then I’m going to learn something new.”
    John Allspaw, Former Etsy CTO (@allspaw)
  19. Open Participation

    Approach
    • We don’t own infrastructure, we just guide its vision & evolution.
    • We view our relationship with developers as a partnership.
    • We encourage developers to design their underlying systems (doors are open for consultation).
    • We do not dictate what we run, i.e. versions, programming languages etc.
    • Everyone has access to our infrastructure (as code) … except secrets (work in progress).
    • Everyone can participate in infrastructure, i.e. send pull requests.
  20. Part 4: Security is an Endless Journey

    “When you decide to take on the [chief security officer] title, you decide that you’re going to run the risk of having decisions made above you or issues created by tens of thousands of people making decisions that will be stapled to your resume.”
    Alex Stamos, Former Facebook CSO (@alexstamos)
  21. Security Is A Team Effort

    “We want to develop generative cultures, where risk is shared. It’s everyone’s concern. If you build security responsibility into every team, you can scale much more powerfully than if security is only the security staff’s responsibility.”
    Dai Zovi, Cash App CTO at Square (@dinodaizovi)
    Approach
    • We generally have a high-trust environment.
    • We have trust scopes (varying degrees of trust).
    • We have audit logs.
    • We have a penguin team with 37 volunteers (43%).
  22. Security Is Not A Destination

    Realities
    • It’s involving and continuously evolving work.
    • We haven’t figured everything out (some security measures aren’t pragmatic).
    • Fundamentally, we want to avoid the front-page news.
    What Works For Us
    • We use SSO everywhere.
    • We pen-test as often as we can.
    • We automate user management: provisioning & revocation.
  23. Part 5: Work Processes That Work For Us

    “The way a team plays as a whole determines its success.”
    Babe Ruth, Baseball Player
  24. Empathy Underlies Our Processes

    • Infrastructure as code: We use Terraform to plan and apply infrastructure changes, which are reviewed in pull requests (trust but verify).
    • Feedback loops: We view port-control as a product and developers as our clients … listen, fix, listen, improve, listen, adapt.
    • Document everything: We memorialize what’s not code in Slack, Google Docs and wikis for posterity (if you’re not there, can someone else do it without you?).
    • Proactive support: We view ourselves as guides, not enforcers. Always having the bird’s-eye view and jumping in to address an issue before it’s raised.
    • Dog-fooding: We use port-control to deploy port-control (api/dashboard) and release portctl (cli).
  25. Where Do We Go From Here?

  26. Stuff We Need To Improve On

    • Measure The Four Golden Signals (Better): Latency, traffic, errors and saturation are becoming increasingly important to track how well we’re doing.
    • Implement More White-Box Monitoring: Get a closer look into our applications and supporting services (not just your standard system metrics).
    • Improve Alerting: Avoid setting up alerts only as a reaction to a failure. Codify alerting.
  27. In The Next Year

    • What tools can we use to debug network calls across microservices?
    • How can we simplify local development in a microservices world?
    • What can we do to democratize the management of secrets?
    • How can we implement different deployment strategies?
    • Can we use machine learning to auto-suggest resolutions to developer issues?
    ??? Service meshes? Tracing? Training a model? Vault + Port Control?
  28. In Summary ...

    • Invest in your own internal-workflow tools. High initial cost, but returns are worth it.
    • Keep the main thing, the main thing.
    • Use empathy as your key driver and you’ll never go wrong.
    • Automate, automate, automate. Delegate, delegate, delegate.
    • Scale yourself through empowerment.
    • Security is like a long road trip with friends with no end.
    • Figure out what works for you and get started. It’s a long road ahead, don’t get overwhelmed ... take a step at a time.
    • It’s never been a better time than now to rethink your infrastructure.
  29. Thank You!

    That’s how we Dev + Sec + Ops. @kingori @itskingori
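
Slide 26 names latency, traffic, errors and saturation as the signals to measure, and asks for alerting to be codified rather than added after a failure. A minimal sketch of what that could look like in Python follows. It is an illustration only and not taken from the talk or from Port Control: the names Request, golden_signals, ALERT_RULES and firing_alerts are hypothetical, the thresholds are made up, and CPU usage against a CPU limit is just one assumed stand-in for saturation.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarise latency, traffic, errors and saturation for one time window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile of the request durations.
    rank = ceil(95 * len(durations) / 100)
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": server_errors / len(requests),
        "saturation": cpu_used / cpu_limit,
    }


# Alert thresholds kept as data so they can live in source control and be
# reviewed in a pull request. The values here are assumptions.
ALERT_RULES = {"latency_p95_ms": 500.0, "error_rate": 0.01, "saturation": 0.8}


def firing_alerts(signals, rules=ALERT_RULES):
    """Return the names of the signals that exceed their threshold, sorted."""
    return sorted(name for name, limit in rules.items() if signals[name] > limit)
```

Calling golden_signals on the requests seen in a window and passing the result to firing_alerts gives the list of breached rules. Keeping the rules as plain data fits the deck's goal of representing everything in source code.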