Embracing DevOps at JUST EAT, within a Microsoft platform

Peter Mounce
September 13, 2014

JUST EAT changed its culture to embrace DevOps principles, and leaned heavily on AWS to achieve it.

We're a successful online takeaway ecommerce website running on a Microsoft-based platform.

Come learn how we:

* re-organised our teams and our platform to loosely couple them
* re-organised our architecture to be more modular
* made it possible for developers to operate their code in production directly, starting with shoot-it-in-the-head debugging
* made it possible for developers to continuously ship changes
* eliminated most differences between production and QA environments
* became more resilient as a happy by-product

(The tooling descriptions are at the end because I didn't present them, but I still got questions about them.)

Transcript

  1. JUST EAT: EMBRACING DEVOPS
    OR: HOW WE MAKE A WINDOWS-BASED ECOMMERCE PLATFORM WORK (WITH AWS)
    @PETEMOUNCE & @JUSTEAT_TECH

  2. Who am I?
    Peter Mounce
    @petemounce
    Senior Engineer at JUST EAT
    Peter Mounce - @petemounce - [email protected]

  3. JUST EAT: Who are we?
    ● In business since 2001 in DK, 2005 in UK
    ● Engineering team
    ○ ~70 people in UK
    ○ ~20 people in Ukraine
    ○ new office in Bristol
    ● Cloud native in AWS
    ○ Except for the bits that aren’t (yet)
    ● Very predictable load
    ○ ~900 orders/minute at peak in UK

  4. JUST EAT: Who are we?
    Oh, yeah - we do online takeaway.
    We’re the online sales channel for our restaurant partners.
    Challenging! 45-60 minute cycle from online purchase to still-warm food.
    We make this work.
    (On Windows)

  5. What are we?
    We do high-volume ecommerce.
    Windows platform.
    Most production code is C#, .NET 4 or 4.5.
    Most automation is ruby, some powershell.
    Ongoing legacy transformation; no big rewrites.
    Splitting up a monolithic system into SOA/APIs, incrementally.

  6. Architecture, before AWS

  7. Data centre life, pre 2013
    Physical hardware
    Snowflake servers - no configuration management tooling
    Manual deployments, done by operations team
    No real-time monitoring - SQL queries only, every 5 minutes
    Monolithic applications, not much fast-running test coverage
    … But at least we had source control and decent continuous
    integration! (since 2010)

  8. Architecture, post AWS migration

  9. Estate & High Availability by default
    At peak, we run ~500-600 EC2 instances
    We migrated from the single data centre in DK, to eu-west-1.
    We run everything multi-AZ, auto-scaling by default.
    (Almost).

  10. Delivery pipeline
    Very standard. Nothing to see here.
    Multi-tenant.
    Tenants are isolated against bad-neighbour issues; individually
    scalable.
    This basically means our tools take a tenant parameter as well
    as an environment parameter.
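
A minimal sketch of what "tools take a tenant parameter as well as an environment parameter" could look like. The tenant names, environment names and S3 key layout below are assumptions made up for illustration; they are not taken from the talk.

```python
import argparse

# Hypothetical tenant and environment names, for illustration only.
TENANTS = {"uk", "dk"}
ENVIRONMENTS = {"qa", "staging", "production"}


def package_key(feature: str, version: str, tenant: str, environment: str) -> str:
    """Build a hypothetical S3 key for a feature's deployment package."""
    if tenant not in TENANTS:
        raise ValueError(f"unknown tenant: {tenant}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    # Assumed layout: one prefix per environment, then per tenant.
    return f"{environment}/{tenant}/{feature}/{version}.zip"


def parse_args(argv):
    """Every tool accepts both --tenant and --environment."""
    parser = argparse.ArgumentParser(description="Locate a feature package")
    parser.add_argument("--feature", required=True)
    parser.add_argument("--version", required=True)
    parser.add_argument("--tenant", required=True, choices=sorted(TENANTS))
    parser.add_argument("--environment", required=True, choices=sorted(ENVIRONMENTS))
    return parser.parse_args(argv)


args = parse_args(
    ["--feature", "publicweb", "--version", "1.2.3", "--tenant", "uk", "--environment", "qa"]
)
print(package_key(args.feature, args.version, args.tenant, args.environment))
# prints qa/uk/publicweb/1.2.3.zip
```

Making both parameters required means no tool can silently default to the wrong tenant or environment.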

  11. Tech organisation structure
    We stole from AWS - “two-pizza teams”
    We have a team each for
    ● consumer web app
    ● consumer native apps (one iOS, one Android)
    ● consumer apps’ test automation
    ● restaurant apps
    ● business-support apps
    ● ePOS
    ● APIs (actually, four teams in one unit)
    ● PaaS
    ○ responsible for internal services; monitoring/alerting/logs
    ○ systems automation, deployment, traffic routing

  12. Tech culture
    “You ship it, you operate it”
    Each team owns their own features, infrastructure-up.
    Minimise dependencies between teams.
    Each team has autonomy to work on what they want within
    some constraints.
    Rules:
    ● don’t break backwards compatibility
    ● use what you want - but operate it yourself
    ● other teams must be able to launch & verify your stuff in
    their environments

  13. But how?
    Table-stakes for this to work (well):
    1. Persistent group chat
    2. Real-time monitoring
    3. Real-time alerting
    4. Centralised logging
    Make it easier to debug in production without a debugger.

  14. Anatomy of a feature
    We decompose the platform into its component parts
    Imaginatively, we call these “platform features”
    For example
    ● consumer web app == publicweb
    ● back office tools == handle, guard
    ● etc

  15. Platform features
    Features are defined by AWS CloudFormation.
    ● Everything is pull-deployment, from S3.
    ● No state is kept (for long) on the instance itself.
    ● No external actor can tell an instance to do something,
    beyond what the feature itself allows.
    Instances boot, and then bootstrap themselves from content in
    S3 based on CloudFormation::Init metadata

  16. Platform feature: Servers
    We have several “baseline” AMIs.
    These have required system dependencies like .NET
    framework, ruby, 7-zip, etc.
    Periodically we update them for OS-level patches, and roll out
    new baseline AMIs. We deprecate the older AMIs.

  17. Platform feature: Infrastructure
    Defined by CloudFormation. Each one stands up everything
    that feature needs to run, excluding cross-cutting
    dependencies (like DNS, firewall rules).
    Mostly standard:
    ● ELB
    ● AutoScaling Group + Launch Configuration
    ● IAM as necessary
    ● … anything else required by the feature
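
The resource list above could be sketched as a CloudFormation template skeleton, built here as a Python dictionary and serialised to JSON. The resource type names are real CloudFormation types; the logical names, instance type, port numbers and group sizes are assumptions for illustration, and the template is deliberately incomplete (no security groups, health checks or scaling policies), so it is not deployable as-is.

```python
import json


def feature_template(feature: str, ami_id: str, instance_type: str = "m3.medium") -> dict:
    """Return a skeleton template for one platform feature (illustrative only)."""
    all_zones = {"Fn::GetAZs": ""}  # every AZ in the stack's region
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Platform feature: {feature}",
        "Resources": {
            "InstanceRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": ["ec2.amazonaws.com"]},
                            "Action": ["sts:AssumeRole"],
                        }],
                    }
                },
            },
            "InstanceProfile": {
                "Type": "AWS::IAM::InstanceProfile",
                "Properties": {"Roles": [{"Ref": "InstanceRole"}]},
            },
            "LoadBalancer": {
                "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
                "Properties": {
                    "AvailabilityZones": all_zones,
                    "Listeners": [
                        {"LoadBalancerPort": "80", "InstancePort": "80", "Protocol": "HTTP"}
                    ],
                },
            },
            "LaunchConfiguration": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": ami_id,  # one of the baseline AMIs
                    "InstanceType": instance_type,
                    "IamInstanceProfile": {"Ref": "InstanceProfile"},
                },
            },
            "AutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": all_zones,  # multi-AZ by default
                    "LaunchConfigurationName": {"Ref": "LaunchConfiguration"},
                    "LoadBalancerNames": [{"Ref": "LoadBalancer"}],
                    "MinSize": "2",
                    "MaxSize": "6",
                },
            },
        },
    }


template_json = json.dumps(feature_template("publicweb", "ami-12345678"), indent=2)
```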

  18. Platform feature: Infrastructure

  19. Platform feature: code package
    ● A standardised package containing
    ○ built code (website, service, combinations)
    ○ configuration + deltas to run any tenant/environment
    ○ automation to deploy the feature
    ● CloudFormation::Init has a configSet to
    ○ unzip
    ○ install automation dependencies
    ○ execute the deployment automation
    ○ warm up the feature, post-install
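
The four configSet steps above could be sketched as `AWS::CloudFormation::Init` metadata, again built as a Python dictionary. The `configSets`, `sources` and `commands` keys are part of cfn-init's metadata format; the config names, the target directory and the command lines (`bundle install`, `rake deploy`, `rake warmup`) are hypothetical stand-ins, since the talk does not show the real ones.

```python
import json

STEPS = ["unzip", "dependencies", "install", "warmup"]


def init_metadata(package_url: str, target_dir: str = "C:\\feature") -> dict:
    """Return cfn-init metadata with one ordered configSet (illustrative only)."""

    def command(name: str, line: str) -> dict:
        # A single named command, run from the unpacked package directory.
        return {"commands": {name: {"command": line, "cwd": target_dir}}}

    return {
        "AWS::CloudFormation::Init": {
            # cfn-init runs the configs of a configSet in list order.
            "configSets": {"deploy": list(STEPS)},
            # `sources` downloads an archive and unpacks it into the target directory.
            "unzip": {"sources": {target_dir: package_url}},
            # Hypothetical commands; the real automation is not shown in the talk.
            "dependencies": command("01-dependencies", "bundle install"),
            "install": command("01-install", "rake deploy"),
            "warmup": command("01-warmup", "rake warmup"),
        }
    }


metadata = init_metadata("https://s3-eu-west-1.amazonaws.com/example-bucket/publicweb.zip")
metadata_json = json.dumps(metadata, indent=2)
```

On the instance, cfn-init would be invoked with the `deploy` configSet, so the instance pulls its own package rather than having it pushed by an external actor.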

  20. What have we gained?
    Instances are disposable and short lived.
    ● Enables “shoot it in the head” debugging
    ● Disks no longer ever fill up
    ● Minimal environmental differences
    ● New environment == mostly automated
    ● Infrastructure as code == testable, repeatable - and we do!

  21. Culture again: On-call
    Teams are on-call for their features.
    Teams decide their own rota; there are coverage minimums for peak time
    But: teams (must!) have autonomy to improve their features so
    they don’t get called as often.
    Otherwise, constant fire-fighting

  22. Things still break!
    Page me once, shame on you.
    Page me twice, shame on me.
    Teams do root-cause analysis of the problems that triggered
    incidents.
    … An operations team / NOC does not.
    Warn call-centre proactively
    Take action proactively
    Automate mitigation steps!
    Feature toggles: not just for launching new stuff.
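
One way to read "automate mitigation steps" together with "feature toggles" is a toggle that is switched off automatically when a feature misbehaves. The sketch below assumes an in-memory toggle store and an error-rate threshold; the class, function and toggle names are hypothetical and do not describe JUST EAT's actual tooling.

```python
class FeatureToggles:
    """A tiny in-memory toggle store (a real one would be shared and persistent)."""

    def __init__(self, **initial: bool) -> None:
        self._state = dict(initial)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles are treated as off.
        return self._state.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._state[name] = enabled


def mitigate(toggles: FeatureToggles, name: str, error_rate: float, threshold: float) -> bool:
    """Switch a feature off when its error rate exceeds the threshold.

    Returns True if the toggle was flipped by this call.
    """
    if error_rate > threshold and toggles.is_enabled(name):
        toggles.set(name, False)
        return True
    return False


toggles = FeatureToggles(recommendations=True)
flipped = mitigate(toggles, "recommendations", error_rate=0.25, threshold=0.05)
```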

  23. The role of our PaaS team
    Enablement.
    ● Run monitoring & alerting
    ● Run centralised logging
    ● Run deployment service
    ● Apply security updates
    ● Maintain traffic routing + DDoS shield

  24. The future
    Immutable/golden instances; faster provisioning.
    Failover to secondary region (we operate in CA but host in IE).
    Centralised configuration / service discovery.
    Always: more test coverage, more confidence.
    Publish some of our tools as OSS
    https://github.com/justeat

  25. The most important things
    ● Culture
    ● Principles that everyone lives by
    ● Devolve autonomy down to people on the ground
    ● (Tools)

  26. Did we mention we’re hiring?
    We’re pragmatic.
    We’re successful.
    We support each other.
    We use sharp tools that we pick ourselves based on merit.
    Join us!
    ○ http://tech.just-eat.com/jobs/
    ○ http://tech.just-eat.com/jobs/senior-software-engineer-platform-services/
    ○ Lots of other roles

  27. ANY QUESTIONS?

  28. Persistent group chat
    We use HipChat.
    You could use IRC / Campfire / Hangouts.
    ● Persistent - jump in, read up
    ● Searchable history
    ● Integrate other tools with it
    ● hubot for fun and profit
    ○ @jebot trg pd emergency with msg “we’re out of champagne in the
    office fridge”

  29. Real-time monitoring
    Microsoft’s SCOM requires an Active Directory (AD)
    Publish OS-level performance counters with perftap, a Windows
    analogue of collectd that we found and customised
    Receive metrics into statsd
    Visualise time-series data with graphite
    ○ 10s granularity retained for 13 months
    ○ AWS’ CloudWatch gives you 1min / 2 weeks
    Addictive!
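
For readers who have not used statsd: a metric is a short plain-text datagram of the form `name:value|type`, normally sent over UDP. The sketch below shows that wire format using only the standard library; the metric names are made up, and the host and port (8125 is the conventional statsd port) are assumptions.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Format one statsd line: 'c' = counter, 'g' = gauge, 'ms' = timer."""
    return f"{name}:{value}|{metric_type}"


def send_metric(name: str, value, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a metric over UDP; no reply is expected."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, for illustration only.
print(format_metric("publicweb.orders.placed", 1, "c"))
# prints publicweb.orders.placed:1|c
print(format_metric("publicweb.checkout.duration", 320, "ms"))
# prints publicweb.checkout.duration:320|ms
```

Because the transport is UDP, emitting a metric does not block the application if the metrics host is unreachable, which is a large part of why instrumenting everything is cheap.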

  30. Real-time alerting
    This is the 21st century; emailing someone their server is down
    doesn’t cut it.
    seyren runs our checks.
    Publishes to
    ● HipChat
    ● PagerDuty
    ● SMS
    ● statsd event metrics (coming soon, hopefully)
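
A threshold check of the kind an alerting tool runs against a time series can be sketched as follows. This is an illustration of the idea (a warn level and an error level per check), not seyren's code; the function name and the rule for inferring the direction of the check are assumptions.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    WARN = "WARN"
    ERROR = "ERROR"


def evaluate(value: float, warn: float, error: float) -> State:
    """Compare the latest metric value against warn and error thresholds.

    If error >= warn, higher values are worse (for example, response time).
    If error < warn, lower values are worse (for example, orders per minute).
    """
    if error >= warn:
        if value >= error:
            return State.ERROR
        if value >= warn:
            return State.WARN
        return State.OK
    if value <= error:
        return State.ERROR
    if value <= warn:
        return State.WARN
    return State.OK


# Hypothetical check: response time in ms, warn at 500, error at 1000.
state = evaluate(750, warn=500, error=1000)
```

A state change (for example OK to WARN) is what would then be published to the chat room, the pager and SMS.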

  31. Centralised logging
    Windows doesn’t have syslog.
    Out of the box EventLog isn’t quite it.
    Publish logs via nxlog agent.
    Receive logs into logstash cluster.
    Filter, transform and enrich into elasticsearch cluster.
    Query, visualise and dashboard via kibana.
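
The "filter, transform and enrich" step can be pictured as a function applied to each log event before it is indexed. In the pipeline described above this work is done by logstash configuration; the Python below is only a sketch of the idea, and the field names (`level`, `tenant`, `environment`, `feature`) are assumptions.

```python
import json
from typing import Optional


def enrich(raw_line: str, *, tenant: str, environment: str, feature: str) -> Optional[dict]:
    """Parse one JSON log line, normalise it and attach routing context.

    Returns None for lines that should be filtered out.
    """
    event = json.loads(raw_line)
    # Transform: normalise the log level so queries do not depend on casing.
    event["level"] = str(event.get("level", "info")).lower()
    # Filter: drop debug noise before it reaches the index.
    if event["level"] == "debug":
        return None
    # Enrich: add the context needed to slice dashboards by tenant and feature.
    event.update({"tenant": tenant, "environment": environment, "feature": feature})
    return event


line = '{"level": "ERROR", "message": "payment provider timed out"}'
event = enrich(line, tenant="uk", environment="production", feature="publicweb")
```

With every event carrying the same context fields, one query can follow a problem across features without logging on to any instance.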

  32. Without these things, operating a distributed system on Windows is hard.
    Windows at scale assumes that you have an Active Directory.
    We don’t.
    ● No Windows network load-balancing.
    ● No centrally trusted authentication.
    ● No central monitoring (SCOM) to harvest performance
    counters.
    ● No easy remote command execution (WinRM wants an AD,
    too)
    ● Other stuff; these are the highlights.

  33. Open source & build vs buy
    We treat Microsoft as just another third party vendor
    dependency.
    We lean on open-source libraries and tools a lot.

  34. Why not Azure / OpenStack et al?
    Decision to migrate to AWS made in late 2011.
    AWS was more mature than the alternatives at the time, and still
    is, even accounting for Azure’s recent advances.
    It offered many hosted services on top of the IaaS offering.