Around & After Kubernetes: The Principles and Ideas that Guide Us

King'ori Maina
September 06, 2019

We decided in late 2015 to move all our applications to containerised environments managed by Kubernetes, and the migration took roughly 3 years to complete. Along the way we learnt a lot about containerisation, distributed systems, complicated migrations and automating systems that support over 85 developers. This talk shares some of the key principles and ideas that guided us. It touches only lightly on the technical side of Kubernetes and focuses on those ideas.

Links:

• Conference: DevOpsDays 2019 (Cape Town) - Full Talk
• Program: https://devopsdays.org/events/2019-cape-town/program/
• Video: https://youtu.be/sHZVD0fmVWg

PS: I'd recommend you watch the talk with speed set to 1.5x. I really need to work on my speed. 🙈

Transcript

  1. Around &
    After
    Kubernetes
    The Principles
    and Ideas that
    guide us
    DevOps Days
    Cape Town 2019 King’ori Maina
    “King”

  2. Meet The Team
    Together, we make the Infrastructure/DevOps/DevSecOps team ...
    Hadrian Valentine
    Infrastructure Engineer
    @hadrianvale
    King’ori Maina
    Infrastructure Engineer
    @itskingori
    Hati Chindove
    Head of Information Security
    @hatitye
    Zac Blazic
    Infrastructure Engineer
    @zacblazic
    Head of In-Security
    (glorified task manager)
    Product Owner
    (paid to worry)

  3. Provide insights to allow
    global brands to make better
    decisions.
    What clients pay us for.
    Predict effectiveness.
    Monitor performance.
    Validate ideas.
    Creating charts. Then
    creating more.
    Not a typo. Lots of
    integration with
    awesome tools to
    make developers’
    lives easier.
    Open-Sauce
    Internal stuff that’s
    not available in any
    off the shelf tool ... at
    least in one place.
    Our Workflow.
    We use a bunch of
    technology from people way
    smarter than us.
    Infrastructure.
    Business
    Value
    Port
    Control
    What We Do
    Supporting
    Services
    @itskingori

  4. Difficult
    Is Now Easy
    Easier
    It All Started
    In 2015 ...
    @itskingori

  5. Our Goals
    To know where we want to go, we needed some long term objectives …
    • Immutability: Throw away any stateless component when we want. Because … immutability is a requirement for scaling up and scaling down.
    • Extensibility: Build upon what we have when we want. Because … we want to stand on the shoulders of giants. No need to re-invent the wheel.
    • Reviewability: Represent everything in source code when we want. Because … we want to be able to commit changes into source-code and have a single source of truth.
    • Observability: Debug it when we want i.e. logging + metrics. Because … it’s not a matter of if things will go wrong but when.
    • Scalability: Add capacity when we want … ideally, automatically. Because … we want to be able to handle the thundering herd without always running at capacity.
    @itskingori

  6. Our Journey on Docker
    Challenge
    • We had a server provisioning problem.
    • We had a packaging problem.
    • We had a process isolation problem.
    Approach
    • We accepted the guaranteed (not just potential) risks.
    • We committed to set up any new services in Docker going forward.
    • We, in hindsight, did not know what we were setting ourselves up for!
    Impact
    • We now upgrade dependencies in isolation, reducing blast radius.
    • We now have repeatable development, build, test and production environments.
    • We now limit resources per application based on requirements.
    • We spin up new servers in ~3 minutes.
    @itskingori

  7. Our Journey on Kubernetes
    Challenge
    • We had an orchestration problem.
    • We had a serious peer-pressure problem.
    • We had a hard-problem, problem.
    Approach
    • We embraced guaranteed (not just potential) complexity but resolved to keep it at a minimum (opportunity-cost).
    • We committed to migrating existing services to our new infrastructure one-by-one.
    • We rebuilt tooling step-by-step.
    Impact
    • We do not regret our decision.
    • We have an API to hard problems regarding infrastructure.
    • We spend more of our time on developer enablement than infrastructure problems.
    • We sleep better (declarative configuration, self-healing).
    @itskingori

  8. Our Journey on Port Control
    Challenge
    • We had an internal workflow problem.
    • We had a retrofitting problem.
    • We had a one “ring” to rule them all desire.
    Approach
    • We invested in building a tool from scratch, for us … by us.
    • We build it with a one year horizon (more if possible).
    • We prioritize extensibility to cover unknown use-cases.
    Impact
    • We do not wait for or hack tools to work how we work.
    • We have an API for our internal-workflows.
    • We can on-board a new developer in less than 5 minutes (self-serve, immediately informed + productive).
    • 50k deployments since April 2017, 3.9k last month.
    @itskingori

  9. In Retrospect
    What is it that we’re looking
    forward to?
    Then we can be more intentional at
    building for the future by laying the
    right foundation as we go.
    What is it that we’ve done right?
    So that we can keep doing it and
    guard against complacency.
    What is it that we could have done
    better?
    Then we can focus on those areas and
    see what more potential we can unlock.
    It’s all a narrative fallacy.
    @itskingori

  10. Zappi Confidential & Proprietary Information
    1. Reduce Cognitive Load
    We want to exploit all of the
    advantages that come from
    having a small number of well-
    known tools. When you have a
    small number of well-known
    tools, you can then focus on the
    product.
    — John Allspaw,
    Former Etsy CTO
    @allspaw
    @itskingori

  11. Halt The Proliferation of Tools
    We’re living in amazing, overwhelming times ...
    @itskingori
    … can we go back to LAMP stacks?

  12. Keep The Main Thing, The Main Thing
    We don’t want to be doing engineering for engineering’s sake …
    • Optimise pushing code to production. Because … if people are confident about the deploy process they will deploy more!
    • Simplify processes so that self-service unblocks most people. Because … we want less work for ourselves so that we can focus on features not crises!
    • Make deployments robust and atomic. Because … deployments are a unit of work and a representation of business value going out!
    ... of course, all of this has to be underpinned by … the system is stable and performant
    @itskingori

  13. Post-mortem debriefings every day are
    littered with the artefacts of people insisting,
    the second before an outage, that “I don’t
    have to care about that.”
    — John Allspaw
    Former Etsy CTO
    @allspaw
    The Cost of Abstractions
    Realities
    • Knowledge of Kubernetes is not an
    operational requirement for a developer.
    • Not all developers care about infrastructure.
    • Not all developers can care (context switching
    is expensive).
    • The right abstractions can have a multiplier
    effect on developer efficiency (consistency &
    predictability e.g. labels).
    @itskingori

  14. 2. Getting Out of the Way
    It doesn’t make sense to hire
    smart people and tell them
    what to do; we hire smart
    people so they can tell us what
    to do.
    — Steve Jobs,
    Former Apple CEO

  15. Automate As Much As You Can
    • Need (once / month): Developer needs to figure out a way to do task-X
    • Empowerment (multiple times / week): DevOps team provides a tool to do task-X (albeit manually)
    • Tomorrow (multiple times / day): DevOps team teaches the system to do task-X (automagically)
    @itskingori
    $ portctl redeploy team --team=supa-team \
    --exclude-app=someapp-1 --exclude-app=someapp-2 \
    --refresh

  16. Delegate Responsibility Via Tooling
    • Need (once / month): Developer needs to figure out a way to do task-X
    • Empowerment (multiple times / week): DevOps team provides a tool to do task-X (albeit manually)
    • Tomorrow (multiple times / day): DevOps team teaches the system to do task-X (automagically)
    @itskingori
    $ portctl backup full --application=reports \
    --environment=production
    $ portctl restore full --application=reports \
    --environment=sandbox --team=supa-team --backup-id=123
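
One way to delegate a routine task like this is a thin wrapper around the restore command shown on the slide. The sketch below is illustrative only: `restore_to_sandbox` and the `PORTCTL` override are hypothetical names, and it uses only the `portctl` subcommand and flags that appear above.

```shell
#!/bin/sh
# PORTCTL defaults to the internal CLI named on the slide.
# Hypothetical convenience: set PORTCTL=echo to print the command instead of running it.
PORTCTL="${PORTCTL:-portctl}"

# restore_to_sandbox APP TEAM BACKUP_ID
# Restores an existing full backup of APP into TEAM's sandbox environment.
restore_to_sandbox() {
  app="$1"
  team="$2"
  backup_id="$3"

  # Refuse to run with missing arguments rather than restoring the wrong thing.
  if [ -z "$app" ] || [ -z "$team" ] || [ -z "$backup_id" ]; then
    echo "usage: restore_to_sandbox APP TEAM BACKUP_ID" >&2
    return 1
  fi

  "$PORTCTL" restore full --application="$app" \
    --environment=sandbox --team="$team" --backup-id="$backup_id"
}
```

A developer could then run `restore_to_sandbox reports supa-team 123` without needing to remember the individual flags.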

  17. 3. Shared Ownership & Responsibility
    Engineering, as a discipline and
    as an activity, is multi-
    disciplinary. It’s just messy. And
    that’s actually the best part of
    engineering. It’s not about
    everyone knowing everything.
    It’s about paying attention to
    the shared, mutual
    understanding.
    — John Allspaw,
    Former Etsy CTO
    @allspaw
    @itskingori

  18. Proactive Education
    @itskingori
    Approach
    • We encourage questions and invest in
    detailed explanations.
    • We train on tooling where it’s not obvious e.g.
    Kibana (for logs) and Grafana (for metrics).
    • We view being viewed as wizards as proof of
    our failure to educate.
    • We haven’t done a good job at high-level
    write-ups (documentation is code, for now).
    As an engineer who starts day one, I am [not] the best
    one to know how network protocols at Etsy work, and I’m
    going to be encouraged to seek out the experts in those
    domains until I do. And maybe something will break, and
    then I’m going to learn something new.
    — John Allspaw
    Former Etsy CTO
    @allspaw

  19. Open Participation
    @itskingori
    Approach
    • We don’t own infrastructure, we just guide its
    vision & evolution.
    • We view our relationship with developers as a
    partnership.
    • We encourage developers to design their
    underlying systems (doors are open for
    consultation).
    • We do not dictate what we run i.e. versions,
    programming languages etc.
    • Everyone has access to our infrastructure (as
    code) … except secrets (work-in-progress).
    • Everyone can participate in infrastructure
    i.e. send pull-requests.

  20. 4. Security is an Endless Journey
    When you decide to take on
    the [chief security officer]
    title, you decide that you’re
    going to run the risk of having
    decisions made above you or
    issues created by tens of
    thousands of people making
    decisions that will be stapled
    to your resume
    — Alex Stamos,
    Former Facebook CSO
    @alexstamos

  21. Security Is A Team Effort
    @itskingori
    We want to develop generative cultures, where risk is
    shared. It’s everyone’s concern. If you build security
    responsibility into every team, you can scale much
    more powerfully than if security is only the security
    staff’s responsibility.
    — Dino Dai Zovi
    Cash App CTO at Square
    @dinodaizovi
    Approach
    • We generally have a high trust environment.
    • We have trust scopes (varying degrees of trust).
    • We have audit logs.
    • We have a penguin team with 37 volunteers (43%).

  22. Security Is Not A Destination
    @itskingori
    Realities
    • It’s involving and continuously evolving
    work.
    • We haven’t figured everything out (some
    security measures aren’t pragmatic).
    • Fundamentally, we want to avoid the front-
    page news.
    What Works For Us
    • We use SSO everywhere.
    • We pen-test as often as we can.
    • We automate user management; provisioning
    & revocation.

  23. 5. Work Processes That Work For Us
    The way a team plays as a
    whole determines its success.
    — Babe Ruth,
    Baseball Player
    @itskingori

  24. Empathy Underlies Our Processes
    Infrastructure as code:
    We use Terraform to plan and apply
    infrastructure changes which are reviewed in
    pull requests (trust but verify)
    Feedback Loops:
    We view port-control as a
    product and developers
    as our clients … listen,
    fix, listen, improve, listen,
    adapt
    Document everything:
    We memorialize what’s not code in
    Slack, Google Docs, wikis for posterity
    (if you’re not there can someone else
    do it without you?)
    Proactive Support:
    We view ourselves as guides,
    not enforcers. Always having
    the bird’s eye view and jumping
    in to address an issue before
    it’s raised
    Dog-fooding:
    We use port-control to
    deploy port-control
    (api/dashboard) and
    release portctl (cli)
    @itskingori
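
The plan-then-apply flow for infrastructure as code can be sketched as below, assuming the plan is saved to a file (here `tfplan`, an arbitrary name) so that what reviewers see is what gets applied. How the plan reaches the pull request is not described in the talk and is left out. The function names and the `TERRAFORM` override are hypothetical conveniences.

```shell
#!/bin/sh
# TERRAFORM defaults to the real binary.
# Hypothetical convenience: set TERRAFORM=echo to print commands for a dry run.
TERRAFORM="${TERRAFORM:-terraform}"

plan_for_review() {
  # Save the plan to a file so the reviewed plan is exactly what gets applied later.
  "$TERRAFORM" plan -out=tfplan
}

apply_reviewed_plan() {
  # Apply only the saved plan file rather than computing a fresh, unreviewed plan.
  "$TERRAFORM" apply tfplan
}
```

Applying a saved plan file is one way to honour "trust but verify": the change that runs is the change that was looked at.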

  25. Where Do We
    Go From
    Here?
    @itskingori

  26. Stuff We Need To Improve On
    • Measure The Four Golden Signals (Better): Latency, traffic, errors and saturation are becoming increasingly important to track how well we’re doing.
    • Implement More White-Box Monitoring: Get a closer look into our applications and supporting services (not just your standard system metrics).
    • Improve Alerting: Avoid setting up alerts only as a reaction to a failure. Codify alerting.
    @itskingori
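
A minimal sketch of how three of the four golden signals could be derived from request logs, assuming a hypothetical log format of one `<http_status> <latency_ms>` pair per line (the format and the `golden_signals` name are illustrative, not from the talk). Saturation is left out because it needs resource metrics such as CPU or memory rather than request logs.

```shell
#!/bin/sh
# golden_signals: reads "<http_status> <latency_ms>" lines on stdin (assumed format)
# and prints traffic (request count), errors (5xx count and rate) and mean latency.
golden_signals() {
  awk '
    { total++; latency += $2 }      # every line is one request
    $1 >= 500 { errors++ }          # count server-side errors only
    END {
      if (total == 0) { print "traffic=0"; exit }
      printf "traffic=%d errors=%d error_rate=%.2f avg_latency_ms=%.1f\n",
             total, errors, errors / total, latency / total
    }
  '
}

# Example: four requests, one of which failed.
printf '200 120\n200 80\n500 300\n200 100\n' | golden_signals
# prints: traffic=4 errors=1 error_rate=0.25 avg_latency_ms=150.0
```

A mean hides slow outliers, so a real setup would more likely track latency percentiles, but the mean keeps the sketch short.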

  27. In The Next Year
    What tools can we
    use to debug
    network calls across
    microservices?
    How can we
    simplify local
    development in a
    micro-services
    world?
    What can we do to
    democratize the
    management of
    secrets?
    How can we
    implement
    different
    deployment
    strategies?
    Can we use machine
    learning to auto-
    suggest resolutions
    to developer issues?
    ??? Service meshes?
    Tracing? Training a
    model?
    Vault +
    Port Control?

  28. In Summary ...
    • Invest in your own internal-workflow tools.
    High initial cost, but returns are worth it.
    • Keep the main thing, the main thing.
    • Use empathy as your key driver and you’ll
    never go wrong.
    • Automate, automate, automate. Delegate,
    delegate, delegate.
    • Scale yourself through empowerment.
    • Security is like a long road-trip with friends with
    no end.
    • Figure out what works for you and get started. It’s
    a long road ahead, don’t get overwhelmed ...
    take a step at a time.
    • It’s never been a better time than now to
    rethink your infrastructure.
    @itskingori

  29. Thank You!
    That’s how we
    Dev + Sec + Ops
    @
    @kingori
    @itskingori
