Syntasso
December 03, 2024

Engineering Resilience into the Foundations: Key Strategies for Building Robust Internal Development Platforms

In today's rapidly evolving technical landscape, the resilience of internal development platforms (IDPs) is paramount. This presentation draws on a series of case studies to showcase how to design and maintain platforms that can withstand various challenges, such as enabling teams to go faster, ensuring policy and workflows are enforced, and scaling from managing one cluster to multiple clusters.

Stories shared include a two-sided marketplace company that built and migrated to a Mesos-based platform to ensure resilient operation during Black Friday sale events, a UK bank that moved from changes requiring approval from multiple boards to compliance and governance being automated within a Kubernetes-based platform, and more.

Transcript

  1. Engineering Resilience into the Foundations: Key Strategies for Building Robust Internal Development Platforms (IDPs)
     Daniel Bryant, Platform Engineer and PMM
  2. tl;dr: A well-designed and implemented platform enables resilience throughout the SDLC:
     • Simplifying platform interactions reduces errors
     • Graceful degradation supports business goals
     • Shared responsibility promotes resilience
  3. What is a platform, anyway?
     “A digital platform is a foundation of self-service APIs, tools, services, knowledge and support which are arranged as a compelling internal product. Autonomous delivery teams can make use of the platform to deliver product features at a higher pace, with reduced coordination.”
     Evan Bottcher, martinfowler.com/articles/talk-about-platforms.html
  4. What is a platform, anyway?
     “A digital platform is a foundation of self-service APIs, tools, services, knowledge and support which are arranged as a compelling internal product. Autonomous delivery teams can make use of the platform to deliver product features at a higher pace, with reduced coordination.”
     Evan Bottcher, martinfowler.com/articles/talk-about-platforms.html
  5. Gartner: What is platform engineering?
     “Platform engineering improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations. It is trending because of its promise to optimise the developer experience and accelerate product teams’ delivery of customer value.”
     gartner.com/en/articles/what-is-platform-engineering
  6. My hypothesis for today
     A well-designed and implemented platform enables resilience throughout the SDLC:
     • Simplifying platform interactions reduces errors
       ◦ Developer-centric design (focus on customer)
     • Graceful degradation supports business goals
       ◦ Built-in policy and fault tolerance
     • Shared responsibility promotes resilience
       ◦ Strong collaboration and governance
  7. Our case studies
     • Large software company headquartered in the UK
       ◦ Provides business management software to SMBs across the UK, USA, and APAC
       ◦ Aimed to improve resilience for developers by reducing cognitive load
     • Not on the High Street
       ◦ Two-sided ecommerce marketplace focusing on crafts and gifts
       ◦ Focused on ensuring resilient tech and processes for Black Friday
     • NatWest Group
       ◦ British banking and insurance holding company, based in Edinburgh, Scotland
       ◦ Desire to promote resiliency through continuous improvement of platform
  8. Large UK business management software company
     • Context
       ◦ Business growing rapidly
       ◦ Looking to scale software delivery across teams and geographies
     • Challenge
       ◦ Diverse range of technologies due to M&A
       ◦ Software delivery teams experiencing cognitive overload with cloud native tech
       ◦ Small platform team
  9. Large UK business management software company
     • Platform solution
       ◦ Created “golden path” for delivery
       ◦ Used Kubernetes + Kratix platform
       ◦ Devs interact with abstraction rather than K8s
     • Learnings
       ◦ APIs, abstraction, and automation reduce developer cognitive load
       ◦ Hackathons can be valuable for driving adoption and experimentation
     docs.kratix.io/main/quick-start
  10. Not on the High Street
     • Context
       ◦ Struggling to meet “Black Friday” demand
       ◦ Recently migrated to the cloud
       ◦ Looking to automate scaling and DR/BC
     • Challenge
       ◦ Software delivery teams experiencing cognitive overload with cloud native tech
       ◦ Knowledge in static runbooks
       ◦ Ops not involved with testing
     youtube.com/watch?v=g-1oAKSBBJM
  11. Not on the High Street
     • Platform solution
       ◦ Module architecture and platform
       ◦ Mesos-based platform with CLI tools and Jenkins
       ◦ Load testing scenarios with previous year’s data
     • Learnings
       ◦ Form a “guiding coalition” with clear goals/KPIs
       ◦ Automate failover and DR/BC into the platform
       ◦ “Platform team” is more resilient than “ops people”
  12. NatWest Group
     • Context
       ◦ Time to market increasingly important in FinTech space
       ◦ Desire to increase contributions to the platform
     • Challenge
       ◦ Developer environment provisioning a limiting factor
       ◦ Lots of manual processes and patterns
       ◦ Limited resources (and need to “keep the lights on”)
     youtube.com/watch?v=RgAutqzxw5U
  13. NatWest Group
     • Platform solution
       ◦ Implement “platform as a product” with Kubernetes, GitLab, flux, Kratix, and more
       ◦ Enabling teams facilitate platform contributions
     • Learnings
       ◦ A composable platform enabled flexibility
       ◦ Automating manual processes improved resilience
       ◦ “Inner sourcing” enabled scalability and promoted ownership
  14. Conclusion
     • Well-designed platforms enable resilience throughout the SDLC
     • APIs, abstraction, and automation reduce developer cognitive load
     • Software architecture and platform architecture are symbiotic
     • Form a “guiding coalition” with KPIs: Platform team > Ops people
     • Inner sourcing pools knowledge and increases ownership
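
The learning that APIs, abstraction, and automation reduce developer cognitive load can be pictured with a small Python sketch. It is not taken from the talk, from Kratix, or from any of the case-study platforms. The names `AppRequest`, `expand_request`, and `PLATFORM_DEFAULTS`, and every default value, are hypothetical. The output is only shaped like a Kubernetes Deployment to show the idea: the developer fills in three fields and the platform supplies the rest.

```python
from dataclasses import dataclass

# Hypothetical platform-owned defaults; real values would be set by a platform team.
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "cpu_limit": "500m",
    "memory_limit": "256Mi",
}


@dataclass(frozen=True)
class AppRequest:
    """The small surface a developer fills in (hypothetical schema)."""
    name: str
    image: str
    team: str


def expand_request(req: AppRequest) -> dict:
    """Expand a short request into a fuller Deployment-shaped document."""
    if not req.name or not req.image or not req.team:
        raise ValueError("name, image and team are all required")
    labels = {"app": req.name, "team": req.team}
    container = {
        "name": req.name,
        "image": req.image,
        # Resource limits come from the platform, not from the developer.
        "resources": {
            "limits": {
                "cpu": PLATFORM_DEFAULTS["cpu_limit"],
                "memory": PLATFORM_DEFAULTS["memory_limit"],
            }
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": req.name, "labels": labels},
        "spec": {
            "replicas": PLATFORM_DEFAULTS["replicas"],
            "selector": {"matchLabels": {"app": req.name}},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    request = AppRequest(name="invoices", image="example/invoices:1.0", team="billing")
    manifest = expand_request(request)
    print(manifest["kind"], manifest["spec"]["replicas"])
```

In this sketch the policy (replica count and resource limits) lives in one place owned by the platform, so a change there applies to every request without developers editing cluster-level configuration themselves.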