Speeding up without slowing down

Rhys Evans
November 07, 2017


At FT we built one of the world's fastest media websites, and release to production dozens of times a day. But the architectural and organisational decisions aimed at allowing us to deliver reliable features quickly and consistently don't always fit neatly with our desire to optimise performance.

In this warts-and-all talk, you'll learn

- how we build FT.com
- how a highly componentised, microservices stack with a rapid release cycle can sometimes get in the way of performance
- some ideas for working around these obstacles
- that web performance is hard, and no-one's perfect


Transcript

  1. Speeding up without slowing down Building a faster FT.com, fast

  2. www.ft.com Faster than the average media site

  3. Who is this bearded hippie? • Engineer on FT.com •

    Worked on performance for over a year • Still plenty of gaps in my knowledge • Likes: Birds, Father Ted, w3c • Dislikes: Dogs, Game of Thrones, React @wheresrhys Rhys Evans
  4. We’re hiring! www.ft.com/dev/null

  5. • Our tech stack, how and why we got here

    • How our choices can conflict with performance optimisation • Some examples of how we’ve worked around these problems • Group therapy What’s this talk about?
  6. A brief history lesson … it is the year 3BP*

    *before perf
  7. Welcome to the old FT.com, ‘Falcon’

  8. Welcome to the ironically named old FT.com, ‘Falcon’

  9. Welcome to the ironically named old FT.com, ‘Falcon’ Not just

    slow, Falcon was also slow and dangerous to develop • Multiple environments • Roughly one big bang, cluster-bug release a month • Proliferation of hacks in order to circumvent the release cycle ‘Strategic Products’ was formed in order to release small, well-made, experimental features quickly
  10. If we ever need another ‘Strategic Products’, then we’ve gone

    badly wrong
  11. 5 pillars of FT.com • Take back control • Straight

    to prod deployment • Feature flags • Microservices • Componentisation
  12. On Falcon nobody had real ownership of the codebase. Free

    for all of: • Tag managers • Third party bloatware and vulnerabilities • Bad ideas nobody ever validated or switched off The tech team insisted on full control over what was allowed on the new FT.com Take back control
  13. We have 2 environments – local development and production. Every

    merged pull request serves production traffic within about 10 minutes. Straight to prod deployment GitHub CircleCI Heroku
  14. Advantages: • Smaller releases, so bugs are easier to find

    and fix • Fewer environment/config bugs • Easy to validate we’re building the right thing before building it right • The adrenaline rush Straight to prod deployment
  15. Feature flags • Hide work in progress • Enable QA

    in prod • Split bigger features into many small releases to prod Without flags, it’d be difficult to work as we do without unacceptably buggy results.
  16. FT.com is made of 100+ independent microservices Microservices

  17. Microservices Advantages: • Quick and easy to comprehend and test

    • Confidence and speed deploying (and if necessary rolling back) • Scaling is, in most cases, trivial • Greater fault tolerance through isolation, so easier to experiment
  18. The FT has, over the past few years, invested in

    a set of configurable client-side components, origami.ft.com. Our user-facing apps typically use 30+ components Componentisation
  19. Componentisation Advantages: • Quick and easy to comprehend and test

    • Avoids duplication of effort and client-side code bloat • Single source of truth for branding across many sites • High standards e.g. accessibility
  20. What do these things have in common? — They make

    it easy to release good quality software quickly, frequently and with confidence
  21. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  22. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  23. The critical path isn’t just about inlining CSS It’s every

    request involved in serving a meaningful page to the user Microservices have the potential to multiply the number of requests in the critical path The critical path… extended User-facing app Dependency Dependency Dependency
  24. With good caching, the number of requests in the critical

    path can be reduced By sharing a cached response between many users we can reduce the likelihood of a long critical path chain Caching and the critical path User-facing app Dependency Dependency Cache Dependency
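The slide's point, sketched in code: put a shared cache in front of a slow dependency so many users share one response instead of each paying the full critical-path cost. This is an illustrative in-memory sketch, not FT's actual caching layer (which sits in Fastly).

```javascript
// Minimal sketch: wrap a dependency call in a TTL cache so a response
// is fetched once and then shared. Names are illustrative.
function cachedFetcher(fetchFn, ttlMs) {
  const cache = new Map(); // key -> { value, expires }
  return async function get(key) {
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) {
      return hit.value; // cache hit: dependency stays off the critical path
    }
    const value = await fetchFn(key); // cache miss: pay the full cost once
    cache.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}
```

The same shape applies whether the cache is in-process, in a CDN, or in a shared store; the win is that only the first request in a window traverses the long dependency chain.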
  25. Having a paywall means we serve a different page depending

    on who you are Can we share cached pages somehow? Caching and the paywall
  26. A poor strategy Cookie: FT_Session=blahblah; FT_edition=uk;... CDN (Fastly) Application Vary:

    Cookie Aim to vary based on something with fewer unique values than Cookie Preflight
  27. User cohorting with headers Preflight Cookie: FT_Session=blahblah;FT_edition=uk;... FT-Authorized: true FT-Edition:

    uk FT-AB-Tests: fake-news:on; Unique per user 2 x 2 x n x … Shared by everyone in a cohort of users Service that converts cookies and other per user data to more generic headers
  28. Varying content per cohort Preflight Cookie: FT_Session=blahblah; FT_edition=uk;... FT-Authorized: true

    FT-Edition: uk FT-AB-Tests: fake-news:on; CDN (Fastly) Application Vary: FT-Authorized, FT-Edition, FT-AB-Tests Highly cacheable Perf bottleneck Not in the critical path (when cache hit)
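The Preflight idea from the last two slides can be sketched as a pure function: collapse a unique-per-user Cookie header into a handful of cohort headers the CDN can vary on. Header names follow the slides; the parsing and the session check are illustrative, not FT's real code.

```javascript
// Convert a per-user Cookie header into per-cohort headers.
// One unique value per user becomes 2 x n x ... shared cohort values.
function cookiesToCohortHeaders(cookieHeader) {
  const cookies = Object.fromEntries(
    cookieHeader.split(';').map((pair) => {
      const [name, ...rest] = pair.trim().split('=');
      return [name, rest.join('=')];
    })
  );
  return {
    // two possible values instead of one per session token
    'FT-Authorized': String(Boolean(cookies.FT_Session)),
    // a small, fixed set of editions
    'FT-Edition': cookies.FT_edition || 'uk',
  };
}
// The application then responds with
// `Vary: FT-Authorized, FT-Edition` and the CDN keeps one copy per cohort.
```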
  29. Turtles all the way down Preflight Session Access Barriers Vanity

    urls A/B testing Perf bottleneck
  30. 1. Find the microservice that’s ‘sticking up’, i.e. the slowest

    one 2. Whack it 3. Repeat Microservice whack-a-mole
  31. • Measure everything – how you gonna find a mole

    with your eyes closed? • Measure granularly – different code paths mean different perf • Use median, 95th and other percentiles, not the mean • Keep an eye on timeouts Microservice whack-a-mole: some whacking strategies
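Why medians and p95s rather than means: one slow mole barely moves a mean but shows up clearly at the tail. A minimal percentile helper (nearest-rank method) makes the point concrete:

```javascript
// Nearest-rank percentile: the smallest sample covering p% of the data.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}
```

For response times like `[1, 2, 3, 4, 100]` the mean is 22ms, which describes no real request; the median (3ms) and p95 (100ms) describe the typical case and the mole respectively.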
  32. • Geography matters – Look out for bugs in your

    DNS & routing layer. • Minimise request overhead ◦ in Node.js, using HTTP agents with keepAlive is easy and very effective • Persuade your business to impose a perf budget to avoid regressions • The case we made http://bit.ly/2zmGZ4H Microservice whack-a-mole: some whacking strategies
  33. • Median paywall decision within 20ms • No paywall decision

    slower than 200ms • High cache hit rate: speed & resilience • Cost savings on computing power All delivered without impacting our ability to work and release software efficiently The results
  34. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  35. Any performant front-end should aim to: • Reuse assets between

    visits to the site • Reuse assets between page views on the same visit • Implement modern performance best practices ◦ responsive images ◦ lazy loading ◦ resource hint headers ◦ inlining critical path CSS Some front-end fundamentals
  36. • Shared source code & design: > 50% • Shared

    JS & CSS assets: 0% • Shared JS & CSS between visits: probably 0% • Inlined CSS: 0% How did FT.com measure up?
  37. • Rapid release cycle & many components update often, busting

    the cache • Sharing CSS and JS between independent user-facing services is hard • Decisions about which CSS to inline are… also hard • Hard to retrofit optimisations to 1 app, let alone 12 Why was it so bad? App1 App2 Build1 Assets1
  38. • Semver gets a bad rap… remember left-pad? • But

    our components are generally very high quality and maintained by people close to our team • We trust semver, and this rewards us with consistency and efficiency • Locking down our versions would make it harder to release software Couldn’t we just lock down our versions?
  39. • Each app’s build has its own gremlins • Combining

    all apps means combining all builds • This multiplies the number of things that can stop us releasing code • Also increases the surface area of potential bugs • A single front-end app would make it harder to release software Couldn’t we just have one front-end app?
  40. It became clear that performance optimisation is: • Too tricky

    and time-consuming to leave it to each app • An intricate dance between the server, the build and the client side code, so any solution would have to be full stack And so n-ui came into being Build a performance thing
  41. n-ui CDN serving assets unique to each app What is

    n-ui? Bundle of preconfigured components used in all our apps npm and Bower component Server with knowledge of all relevant assets and tools Build tool with rudimentary JS and stylesheet splitting App1 CDN serving shared assets App2 Templates and asset loading tools running in the browser Deploy tool for delivering assets to the CDN
  42. None
  43. None
  44. …or don’t build a performance thing could someone get a

    grip on whoever is f**king with the styling on the site this week I’ve *really* had enough of trying to get n-ui updates to work problem started occurring after a n-ui update what I did see was n-ui as a dev-dependency, but I guess that is still going to lead to mind-melting that’s presumably from an n-ui update? what is going on with n-ui? Apologies on the n-ui issues all, am on it. Also added to my todo list to stop us breaking stuff Has anyone managed to recently successfully bower link/npm link n-ui in an app? how to fix this motherf***ing `Projects using n-ui must maintain parity between versions` error?! just pushing a bug fix in n-ui right now If I can get the build to pass! Damn n-ui!!!
  45. • Don’t build a performance thing on your own •

    Benefits to isolating complexity from the rest of your codebase • But it’s an illusion to imagine your abstractions will be perfect • Collaboration ensures comprehensibility and maintainability What went wrong?
  46. • Retreated from a few of the too-clever-by-half ideas —

    be prepared to say “it’s not worth the pain” • Rolling out updates is a team effort • Ideas for what to do next now come from the wider team • Tooling and the web platform will, in time, make the problems easier • Still complex, but getting closer to being as complex as it needs to be Still able to release our front end easily Where are we now?
  47. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  48. Why should we want a service worker on ft.com? •

    Persistent caching • Gateway to lots of whizzy performance optimisations • Greater resilience when the network lets us down • App-like behaviour e.g. push notifications • Nothing to do with wanting to look cool… who told you that?
  49. • “Service workers essentially act as proxy servers that sit

    between web applications, and the browser and network” – MDN • “I'll eat anything you want me to eat, I'll swallow anything you want me to swallow, So, come on down and I'll chew on a dog!” – Beetlejuice What are service workers?
  50. https://github.com/popeindustries

  51. Bugs included… • Showing users barriers even after they’d signed

    in • Replacing all pages on Falcon with blank error pages • Users with cookies longer than 4000 characters got permanently stuck on bad versions FT.com service worker round one
  52. • No automated unit or integration tests in CI •

    No way to test changes to service worker without releasing to all users • No way to turn off individual features of the service worker • No easy way to roll back or turn off a broken service worker What went wrong?
  53. How to test your service worker • A surprising number

    of SW APIs are available in the DOM • https://github.com/popeindustries – some great resources, including a mock SW environment • Test runners, e.g. Karma, can be adapted to spin up sandboxed service workers • Instrument your SW and use postMessage to interrogate it from tests running in your page • Careful of CORS
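The "instrument your SW and use postMessage" idea becomes much easier to test if the message handling is a pure function: tests can drive it directly, and the worker just wires it to `self.addEventListener('message', ...)`. A hedged sketch with illustrative names:

```javascript
// Keep the inspection logic pure so tests need no real worker.
// `state` is whatever internal state the worker wants to expose.
function createInspector(state) {
  return function handleMessage(data) {
    // a test page posts e.g. { type: 'GET_STATE', key: 'cachedUrls' };
    // the worker sends this reply back via event.source.postMessage
    if (data && data.type === 'GET_STATE') {
      return { key: data.key, value: state[data.key] };
    }
    return { error: 'unknown message type' };
  };
}
```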
  54. • Feature flags are used to change the url a

    SW is installed from • So we have /__sw-prod, /__sw-qa and /__sw-canary • Differently tagged commits deploy to each of these destinations • Testers can override flags locally to test a QA release • Canary releases target 3% of users Test ‘environments’ for the service worker
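The flag-to-URL mapping above can be sketched as a tiny function the page runs before registering. The `/__sw-*` URLs come from the slide; the flag names and precedence (canary wins over QA) are assumptions for illustration.

```javascript
// Pick which service worker endpoint to install from, based on flags.
function serviceWorkerUrl(flags) {
  if (flags.swCanary) return '/__sw-canary'; // 3% of users
  if (flags.swQa) return '/__sw-qa';         // testers overriding flags locally
  return '/__sw-prod';                       // everyone else
}
// In the page: navigator.serviceWorker.register(serviceWorkerUrl(flags));
```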
  55. • Store feature flags in IndexedDB • Write helpers that

    check a flag is on before carrying out an action Feature flags in service workers
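A sketch of the "check a flag before carrying out an action" helper. Real code would read the flag from IndexedDB inside the worker; here the async flag store is injected so the shape is clear and testable. All names are illustrative.

```javascript
// Run `action` only when the named flag is on; otherwise do nothing.
async function withFlag(flagStore, flagName, action) {
  const enabled = await flagStore.get(flagName); // e.g. an IndexedDB read
  if (enabled) {
    return action();
  }
  return undefined; // flag off: fall through
}
// Assumed usage in a fetch handler:
// event.respondWith(
//   withFlag(flags, 'swCacheAssets', () => caches.match(event.request))
//     .then((hit) => hit || fetch(event.request))
// );
```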
  56. • Don’t cut corners turning your SW off! • Make

    sure you’re able to overwrite any bad SW with a good one • Put code to unregister your SW behind a flag or similar • Serve your SW from an unversioned URL • Once more, https://github.com/popeindustries has lots of advice Try switching it off and off again
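One way to read "put code to unregister your SW behind a flag": serve, from the same unversioned URL, a worker that checks a kill flag on activation and cleans itself up. The flag name is illustrative; the worker-side calls are standard APIs, shown as comments since they only run in a worker.

```javascript
// Decide whether this worker should remove itself.
function shouldSelfDestruct(flags) {
  return Boolean(flags['sw-kill-switch']);
}
// In the worker's activate handler (assumed wiring):
// if (shouldSelfDestruct(flags)) {
//   await self.registration.unregister();
//   const clients = await self.clients.matchAll();
//   clients.forEach((client) => client.navigate(client.url)); // reload pages off the SW
// }
```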
  57. • Gradual roll out of features since spring 2017 •

    Push notifications operating for 6 months without a hitch • Most front-end assets now cached in the SW • Running an experiment using SW to improve ad load times • 0 bugs impacting the end user • Bring on the PWA! With a bit of extra effort, service workers are compatible with our way of working FT.com service worker round two
  58. So what’s the meaning of all this?

  59. • How we choose to build complex things is really

    important • These choices won’t necessarily always be good for performance • Finding compromises between these takes thought, effort, and a healthy relationship with failure • You can be a long way from perfect and still get great results PS we’re hiring www.ft.com/dev/null Summary