Embracing DevOps at JUST EAT, within a Microsoft platform

Peter Mounce
September 13, 2014

JUST EAT changed its culture towards embracing DevOps principles, and heavily leveraged AWS to achieve it.

We're a successful online takeaway ecommerce website running on a Microsoft-based platform.

Come learn how we:

* re-organised our teams and our platform to loosely couple them
* re-organised our architecture to be more modular
* made it possible for developers to operate their code in production directly, starting with shoot-it-in-the-head debugging
* made it possible for developers to continuously ship changes
* eliminated most differences between production and QA environments
* became more resilient as a happy by-product

(The tooling descriptions are at the end because I didn't present them, but still got questions)

Transcript

  1. 1.

    JUST EAT: EMBRACING DEVOPS
    OR: HOW WE MAKE A WINDOWS-BASED ECOMMERCE PLATFORM WORK (WITH AWS)

    @PETEMOUNCE & @JUSTEAT_TECH
  2. 2.

    Who am I?

    Peter Mounce
    @petemounce
    Senior Engineer at JUST EAT

    Peter Mounce - @petemounce - pete@neverrunwithscissors.com
  3. 3.

    JUST EAT: Who are we?

    • In business since 2001 in DK, 2005 in UK
    • Engineering team
      ◦ ~70 people in UK
      ◦ ~20 people in Ukraine
      ◦ new office in Bristol
    • Cloud native in AWS
      ◦ Except for the bits that aren’t (yet)
    • Very predictable load
      ◦ ~900 orders/minute at peak in UK
  4. 4.

    JUST EAT: Who are we?

    Oh, yeah - we do online takeaway.
    We’re the online sales channel for our restaurant partners.
    Challenging! 45-60 minute cycle from online purchase to still-warm food.
    We make this work. (On Windows)
  5. 5.

    What are we?

    We do high-volume ecommerce.
    Windows platform. Most production code is C#, .NET 4 or 4.5.
    Most automation is ruby, some powershell.
    Ongoing legacy transformation; no big rewrites.
    Splitting up a monolithic system into SOA/APIs, incrementally.
  6. 7.

    Data centre life, pre 2013

    Physical hardware
    Snowflake servers - no configuration management tooling
    Manual deployments, done by operations team
    No real time monitoring - SQL queries only, every 5m
    Monolithic applications, not much fast-running test coverage
    … But at least we had source control and decent continuous integration! (since 2010)
  7. 9.

    Estate & High Availability by default

    At peak, we run ~500-600 EC2 instances.
    We migrated from the single data centre in DK, to eu-west-1.
    We run everything multi-AZ, auto-scaling by default. (Almost).
  8. 10.

    Delivery pipeline

    Very standard. Nothing to see here.
    Multi-tenant. Tenants are isolated against bad-neighbour issues; individually scalable.
    This basically means our tools take a tenant parameter as well as an environment parameter.
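As a rough illustration of tools that take both a tenant and an environment parameter, here is a minimal sketch of a command-line entry point. The tenant and environment names, the `build_parser` and `stack_name` helpers, and the naming scheme are all assumptions made for the example; the slides do not show the real tooling.

```python
import argparse

# Hypothetical tenant and environment names, for illustration only.
TENANTS = ["uk", "dk"]
ENVIRONMENTS = ["qa", "staging", "production"]


def build_parser():
    """Build a CLI parser that requires both a tenant and an environment."""
    parser = argparse.ArgumentParser(description="Deploy a feature (illustrative sketch)")
    parser.add_argument("feature", help="platform feature name, e.g. publicweb")
    parser.add_argument("--tenant", required=True, choices=TENANTS)
    parser.add_argument("--environment", required=True, choices=ENVIRONMENTS)
    return parser


def stack_name(feature, tenant, environment):
    """Derive one isolated stack name per (environment, tenant, feature)."""
    return f"{environment}-{tenant}-{feature}"


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["publicweb", "--tenant", "uk", "--environment", "qa"])
print(stack_name(args.feature, args.tenant, args.environment))
```

Keying every resource name on both parameters is one simple way to keep tenants from sharing infrastructure by accident.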
  9. 11.

    Tech organisation structure

    We stole from AWS - “two-pizza teams”
    We have a team each for
    • consumer web app
    • consumer native apps (one iOS, one Android)
    • consumer apps’ test automation
    • restaurant apps
    • business-support apps
    • ePOS
    • APIs (actually, four teams in one unit)
    • PaaS
      ◦ responsible for internal services; monitoring/alerting/logs
      ◦ systems automation, deployment, traffic routing
  10. 12.

    Tech culture

    “You ship it, you operate it”
    Each team owns their own features, infrastructure-up.
    Minimise dependencies between teams.
    Each team has autonomy to work on what they want within some constraints.
    Rules:
    • don’t break backwards compatibility
    • use what you want - but operate it yourself
    • other teams must be able to launch & verify your stuff in their environments
  11. 13.

    But how?

    Table-stakes for this to work (well):
    1. Persistent group chat
    2. Real-time monitoring
    3. Real-time alerting
    4. Centralised logging
    Make it easier to debug in production without a debugger.
  12. 14.

    Anatomy of a feature

    We decompose the platform into its component parts
    Imaginatively, we call these “platform features”
    For example
    • consumer web app == publicweb
    • back office tools == handle, guard
    • etc
  13. 15.

    Platform features

    Features are defined by AWS CloudFormation.
    • Everything is pull-deployment, from S3.
    • No state is kept (for long) on the instance itself.
    • No external actor can tell an instance to do something, beyond what the feature itself allows.
    Instances boot, and then bootstrap themselves from content in S3 based on CloudFormation::Init metadata
  14. 16.

    Platform feature: Servers

    We have several “baseline” AMIs.
    These have required system dependencies like .NET framework, ruby, 7-zip, etc.
    Periodically we update them for OS-level patches, and roll out new baseline AMIs.
    We deprecate the older AMIs.
  15. 17.

    Platform feature: Infrastructure

    Defined by CloudFormation.
    Each one stands up everything that feature needs to run, excluding cross-cutting dependencies (like DNS, firewall rules).
    Mostly standard:
    • ELB
    • AutoScaling Group + Launch Configuration
    • IAM as necessary
    • … anything else required by the feature
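To make the "mostly standard" stack concrete, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and prints it as JSON. It is only a skeleton under assumptions: the logical resource names, the `BaselineAmi` and `InstanceType` parameters, the port numbers and the group sizes are invented for illustration, and IAM and the CloudFormation::Init metadata are left out. The resource type names are the standard CloudFormation ones for a classic ELB, a launch configuration and an auto-scaling group.

```python
import json


def feature_template(feature_name):
    """Return a minimal CloudFormation template (as a dict) for one feature."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Illustrative stack for feature '{feature_name}'",
        "Parameters": {
            # Hypothetical parameters: which baseline AMI and instance size to launch.
            "BaselineAmi": {"Type": "String"},
            "InstanceType": {"Type": "String", "Default": "m3.medium"},
        },
        "Resources": {
            "LoadBalancer": {
                "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
                "Properties": {
                    # An empty string asks for the AZs of the stack's own region.
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "Listeners": [
                        {"LoadBalancerPort": "80", "InstancePort": "80", "Protocol": "HTTP"}
                    ],
                },
            },
            "LaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": {"Ref": "BaselineAmi"},
                    "InstanceType": {"Ref": "InstanceType"},
                },
            },
            "AutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "LaunchConfig"},
                    "LoadBalancerNames": [{"Ref": "LoadBalancer"}],
                    "MinSize": "2",
                    "MaxSize": "4",
                },
            },
        },
    }


print(json.dumps(feature_template("publicweb"), indent=2))
```

Spreading both the load balancer and the auto-scaling group across every availability zone in the region mirrors the "multi-AZ, auto-scaling by default" stance described earlier in the talk.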
  16. 19.

    Platform feature: code package

    • A standardised package containing
      ◦ built code (website, service, combinations)
      ◦ configuration + deltas to run any tenant/environment
      ◦ automation to deploy the feature
    • CloudFormation::Init has a configSet to
      ◦ unzip
      ◦ install automation dependencies
      ◦ execute the deployment automation
      ◦ warm up the feature, post-install
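One way to read "configuration + deltas" is a base configuration with small per-tenant and per-environment overrides layered on top at deploy time. The sketch below shows that idea as a recursive dictionary merge. The key names, the order of application (tenant first, then environment) and the `deep_merge` and `resolve_config` helpers are assumptions for illustration; the slides do not describe the real package format.

```python
import copy


def deep_merge(base, delta):
    """Return a new dict where values in delta override those in base, recursively."""
    merged = copy.deepcopy(base)
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(base, deltas, tenant, environment):
    """Apply the tenant delta, then the environment delta, on top of the base."""
    config = deep_merge(base, deltas.get(("tenant", tenant), {}))
    return deep_merge(config, deltas.get(("environment", environment), {}))


# Hypothetical settings, for illustration only.
base = {"site": {"currency": "GBP", "debug": False}, "replicas": 2}
deltas = {
    ("tenant", "dk"): {"site": {"currency": "DKK"}},
    ("environment", "qa"): {"site": {"debug": True}, "replicas": 1},
}

print(resolve_config(base, deltas, "dk", "qa"))
```

Because the same package carries every delta, one build artefact can be deployed unchanged to any tenant and environment, with only the two parameters differing.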
  17. 20.

    What have we gained?

    Instances are disposable and short lived.
    • Enables “shoot it in the head” debugging
    • Disks no longer ever fill up
    • Minimal environmental differences
    • New environment == mostly automated
    • Infrastructure as code == testable, repeatable - and we do!
  18. 21.

    Culture again: On-call

    Teams are on-call for their features.
    Decide own rota; coverage minimums for peak-time
    But: teams (must!) have autonomy to improve their features so they don’t get called as often.
    Otherwise, constant fire-fighting
  19. 22.

    Things still break!

    Page me once, shame on you. Page me twice, shame on me.
    Teams do root-cause analysis of the problems that triggered incidents.
    … An operations team / NOC does not.
    Warn call-centre proactively
    Take action proactively
    Automate mitigation steps!
    Feature toggles: not just for launching new stuff.
  20. 23.

    The role of our PaaS team

    Enablement.
    • Run monitoring & alerting
    • Run centralised logging
    • Run deployment service
    • Apply security updates
    • Maintain traffic routing + DDoS shield
  21. 24.

    The future

    Immutable/golden instances; faster provisioning.
    Failover to secondary region (we operate in CA but host in IE).
    Centralised configuration / service discovery.
    Always: more test coverage, more confidence.
    Publish some of our tools as OSS https://github.com/justeat
  22. 25.

    The most important things

    • Culture
    • Principles that everyone lives by
    • Devolve autonomy down to people on the ground
    • (Tools)
  23. 26.

    Did we mention we’re hiring?

    We’re pragmatic. We’re successful. We support each other.
    We use sharp tools that we pick ourselves based on merit.
    Join us!
      ◦ http://tech.just-eat.com/jobs/
      ◦ http://tech.just-eat.com/jobs/senior-software-engineer-platform-services/
      ◦ Lots of other roles
  24. 28.

    Persistent group chat

    We use HipChat. You could use IRC / Campfire / Hangouts.
    • Persistent - jump in, read up
    • Searchable history
    • Integrate other tools to it
    • hubot for fun and profit
      ◦ @jebot trg pd emergency with msg “we’re out of champagne in the office fridge”
  25. 29.

    Real-time monitoring

    Microsoft’s SCOM requires an AD
    Publish OS-level performance counters with perftap - windows analogue of collectd we found and customised
    Receive metrics into statsd
    Visualise time-series data with graphite
      ◦ 10s granularity retained for 13 months
      ◦ AWS’ CloudWatch gives you 1min / 2 weeks
    Addictive!
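For readers who have not used statsd, the sketch below shows roughly what "receive metrics into statsd" looks like from the sending side: a metric is a short text line of the form `name:value|type`, usually sent over UDP (port 8125 is the conventional default). The metric name, host and helper functions are assumptions for illustration; this is not how perftap itself is implemented.

```python
import socket

# statsd type codes: counter, gauge, timer (milliseconds).
VALID_TYPES = ("c", "g", "ms")


def format_metric(name, value, metric_type):
    """Format one metric in the statsd line protocol: name:value|type."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported statsd metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one formatted metric line to a statsd daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric name; only the formatting is exercised here.
print(format_metric("publicweb.orders.placed", 1, "c"))
```

UDP keeps the sender cheap and non-blocking: if the statsd daemon is down, the application loses metrics but not requests.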
  26. 30.

    Real-time alerting

    This is the 21st century; emailing someone their server is down doesn’t cut it.
    seyren runs our checks. Publishes to
    • HipChat
    • PagerDuty
    • SMS
    • statsd event metrics (coming soon, hopefully)
  27. 31.

    Centralised logging

    Windows doesn’t have syslog. Out of the box EventLog isn’t quite it.
    Publish logs via nxlog agent.
    Receive logs into logstash cluster.
    Filter, transform and enrich into elasticsearch cluster.
    Query, visualise and dashboard via kibana.
  28. 32.

    Without these things, operating a distributed system on Windows is hard.

    Windows at scale assumes that you have an Active Directory. We don’t.
    • No Windows network load-balancing.
    • No centrally trusted authentication.
    • No central monitoring (SCOM) to harvest performance counters.
    • No easy remote command execution (WinRM wants an AD, too)
    • Other stuff; these are the highlights.
  29. 33.

    Open source & build vs buy

    We treat Microsoft as just another third party vendor dependency.
    We lean on open-source libraries and tools a lot.
  30. 34.

    Why not Azure / OpenStack et al?

    Decision to migrate to AWS made in late 2011.
    AWS was more mature than alternatives at the time.
    It offered many hosted services on top of the IaaS offering.
    Still is, even accounting for Azure’s recent advances.