Speeding up without slowing down

Rhys Evans
November 07, 2017


At FT we built one of the world's fastest media websites, and release to production dozens of times a day. But the architectural and organisational decisions aimed at allowing us to deliver reliable features quickly and consistently don't always fit neatly with our desire to optimise performance.

In this warts-and-all talk, you'll learn

- how we build FT.com
- how a highly componentised, microservices stack with a rapid release cycle can sometimes get in the way of performance
- some ideas for working around these obstacles
- that web performance is hard, and no-one's perfect


Transcript

  1. Speeding up without slowing down Building a faster FT.com, fast

  2. www.ft.com Faster than the average media site

  3. Who is this bearded hippie? • Engineer on FT.com •

    Worked on performance for over a year • Still plenty of gaps in my knowledge • Likes: Birds, Father Ted, w3c • Dislikes: Dogs, Game of Thrones, React @wheresrhys Rhys Evans
  4. We’re hiring! www.ft.com/dev/null

  5. • Our tech stack, how and why we got here

    • How our choices can conflict with performance optimisation • Some examples of how we’ve worked around these problems • Group therapy What’s this talk about?
  6. A brief history lesson … it is the year 3BP*

    *before perf
  7. Welcome to the old FT.com, ‘Falcon’

  8. Welcome to the ironically named old FT.com, ‘Falcon’

  9. Welcome to the ironically named old FT.com, ‘Falcon’ Not just

    slow, Falcon was also slow and dangerous to develop • Multiple environments • Roughly one big bang, cluster-bug release a month • Proliferation of hacks in order to circumvent the release cycle ‘Strategic Products’ was formed in order to release small, well-made, experimental features quickly
  10. If we ever need another ‘Strategic Products’, then we’ve gone

    badly wrong
  11. 5 pillars of FT.com • Take back control • Straight

    to prod deployment • Feature flags • Microservices • Componentisation
  12. On Falcon nobody had real ownership of the codebase. Free

    for all of: • Tag managers • Third party bloatware and vulnerabilities • Bad ideas nobody ever validated or switched off The tech team insisted on full control over what was allowed on the new FT.com Take back control
  13. We have 2 environments – local development and production. Every

    merged pull request serves production traffic within about 10 minutes. Straight to prod deployment GitHub CircleCI Heroku
  14. Advantages: • Smaller releases, so bugs are easier to find

    and fix • Fewer environment/config bugs • Easy to validate we’re building the right thing before building it right • The adrenaline rush Straight to prod deployment
  15. Feature flags • Hide work in progress • Enable QA

    in prod • Split bigger features into many small releases to prod Without flags, it’d be difficult to work as we do without unacceptably buggy results.
  16. FT.com is made of 100+ independent microservices Microservices

  17. Microservices Advantages: • Quick and easy to comprehend and test

    • Confidence and speed deploying (and if necessary rolling back) • Scaling is, in most cases, trivial • Greater fault tolerance through isolation, so easier to experiment
  18. The FT has, over the past few years, invested in

    a set of configurable client-side components, origami.ft.com. Our user-facing apps typically use 30+ components Componentisation
  19. Componentisation Advantages: • Quick and easy to comprehend and test

    • Avoids duplication of effort and client-side code bloat • Single source of truth for branding across many sites • High standards e.g. accessibility
  20. What do these things have in common? — They make

    it easy to release good quality software quickly, frequently and with confidence
  21. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  22. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  23. The critical path isn’t just about inlining CSS It’s every

    request involved in serving a meaningful page to the user Microservices have the potential to multiply the number of requests in the critical path The critical path… extended User-facing app Dependency Dependency Dependency
  24. With good caching, the number of requests in the critical

    path can be reduced By sharing a cached response between many users we can reduce the likelihood of a long critical path chain Caching and the critical path User-facing app Dependency Dependency Cache Dependency
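The slide's point, sketched in code: put a shared cache in front of a slow dependency so many users share one response instead of each paying the full critical-path cost. This is an illustrative in-memory sketch, not FT's actual caching layer (which sits in Fastly).

```javascript
// Minimal sketch: wrap a dependency call in a TTL cache so a response
// is fetched once and then shared. Names are illustrative.
function cachedFetcher(fetchFn, ttlMs) {
  const cache = new Map(); // key -> { value, expires }
  return async function get(key) {
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) {
      return hit.value; // cache hit: dependency stays off the critical path
    }
    const value = await fetchFn(key); // cache miss: pay the full cost once
    cache.set(key, { value, expires: Date.now() + ttlMs });
    return value;
  };
}
```

The same shape applies whether the cache is in-process, in a CDN, or in a shared store; the win is that only the first request in a window traverses the long dependency chain.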
  25. Having a paywall means we serve a different page depending

    on who you are Can we share cached pages somehow? Caching and the paywall
  26. A poor strategy Cookie: FT_Session=blahblah; FT_edition=uk;... CDN (Fastly) Application Vary:

    Cookie Aim to vary based on something with fewer unique values than Cookie Preflight
  27. User cohorting with headers Preflight Cookie: FT_Session=blahblah;FT_edition=uk;... FT-Authorized: true FT-Edition:

    uk FT-AB-Tests: fake-news:on; Unique per user 2 x 2 x n x … Shared by everyone in a cohort of users Service that converts cookies and other per user data to more generic headers
  28. Varying content per cohort Preflight Cookie: FT_Session=blahblah; FT_edition=uk;... FT-Authorized: true

    FT-Edition: uk FT-AB-Tests: fake-news:on; CDN (Fastly) Application Vary: FT-Authorized, FT-Edition, FT-AB-Tests Highly cacheable Perf bottleneck Not in the critical path (when cache hit)
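The Preflight idea from the last two slides can be sketched as a pure function: collapse a unique-per-user Cookie header into a handful of cohort headers the CDN can vary on. Header names follow the slides; the parsing and the session check are illustrative, not FT's real code.

```javascript
// Convert a per-user Cookie header into per-cohort headers.
// One unique value per user becomes 2 x n x ... shared cohort values.
function cookiesToCohortHeaders(cookieHeader) {
  const cookies = Object.fromEntries(
    cookieHeader.split(';').map((pair) => {
      const [name, ...rest] = pair.trim().split('=');
      return [name, rest.join('=')];
    })
  );
  return {
    // two possible values instead of one per session token
    'FT-Authorized': String(Boolean(cookies.FT_Session)),
    // a small, fixed set of editions
    'FT-Edition': cookies.FT_edition || 'uk',
  };
}
// The application then responds with
// `Vary: FT-Authorized, FT-Edition` and the CDN keeps one copy per cohort.
```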
  29. Turtles all the way down Preflight Session Access Barriers Vanity

    urls A/B testing Perf bottleneck
  30. 1. Find the microservice that’s ‘sticking up’, i.e. the slowest

    one 2. Whack it 3. Repeat Microservice whack-a-mole
  31. • Measure everything – how you gonna find a mole

    with your eyes closed? • Measure granularly – different code paths mean different perf • Use median, 95th and other percentiles, not the mean • Keep an eye on timeouts Microservice whack-a-mole: some whacking strategies
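Why medians and p95s rather than means: one slow mole barely moves a mean but shows up clearly at the tail. A minimal percentile helper (nearest-rank method) makes the point concrete:

```javascript
// Nearest-rank percentile: the smallest sample covering p% of the data.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}
```

For response times like `[1, 2, 3, 4, 100]` the mean is 22ms, which describes no real request; the median (3ms) and p95 (100ms) describe the typical case and the mole respectively.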
  32. • Geography matters – Look out for bugs in your

    DNS & routing layer. • Minimise request overhead ◦ in Node.js, using HTTP agents with keepAlive is easy and very effective • Persuade your business to impose a perf budget to avoid regressions • The case we made http://bit.ly/2zmGZ4H Microservice whack-a-mole: some whacking strategies
  33. • Median paywall decision within 20ms • No paywall decision

    slower than 200ms • High cache hit rate: speed & resilience • Cost savings on computing power All delivered without impacting our ability to work and release software efficiently The results
  34. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  35. Any performant front-end should aim to: • Reuse assets between

    visits to the site • Reuse assets between page views on the same visit • Implement modern performance best practices ◦ responsive images ◦ lazy loading ◦ resource hint headers ◦ inlining critical path CSS Some front-end fundamentals
  36. • Shared source code & design: > 50% • Shared

    JS & CSS assets: 0% • Shared JS & CSS between visits: probably 0% • Inlined CSS: 0% How did FT.com measure up?
  37. • Rapid release cycle & many components update often, busting

    the cache • Sharing CSS and JS between independent user-facing services is hard • Decisions about which CSS to inline are… also hard • Hard to retrofit optimisations to 1 app, let alone 12 Why was it so bad? App1 App2 Build1 Assets1
  38. • Semver gets a bad rap… remember left-pad? • But

    our components are generally very high quality and maintained by people close to our team • We trust semver, and this rewards us with consistency and efficiency • Locking down our versions would make it harder to release software Couldn’t we just lock down our versions?
  39. • Each app’s build has its own gremlins • Combining

    all apps means combining all builds • This multiplies the number of things that can stop us releasing code • Also increases the surface area of potential bugs • A single front-end app would make it harder to release software Couldn’t we just have one front-end app?
  40. It became clear that performance optimisation is: • Too tricky

    and time-consuming to leave it to each app • An intricate dance between the server, the build and the client side code, so any solution would have to be full stack And so n-ui came into being Build a performance thing
  41. n-ui CDN serving assets unique to each app What is

    n-ui? Bundle of preconfigured components used in all our apps npm and Bower component Server with knowledge of all relevant assets and tools Build tool with rudimentary JS and stylesheet splitting App1 CDN serving shared assets App2 Templates and asset loading tools running in the browser Deploy tool for delivering assets to the CDN
  42. None
  43. None
  44. …or don’t build a performance thing could someone get a

    grip on whoever is f**king with the styling on the site this week I’ve *really* had enough of trying to get n-ui updates to work problem started occurring after a n-ui update what I did see was n-ui as a dev-dependency, but I guess that is still going to lead to mind-melting that’s presumably from an n-ui update? what is going on with n-ui? Apologies on the n-ui issues all, am on it. Also added to my todo list to stop us breaking stuff Has anyone managed to recently successfully bower link/npm link n-ui in an app? how to fix this motherf***ing `Projects using n-ui must maintain parity between versions` error?! just pushing a bug fix in n-ui right now If I can get the build to pass! Damn n-ui!!!
  45. • Don’t build a performance thing on your own •

    Benefits to isolating complexity from the rest of your codebase • But it’s an illusion to imagine your abstractions will be perfect • Collaboration ensures comprehensibility and maintainability What went wrong?
  46. • Retreated from a few of the too-clever-by-half ideas —

    be prepared to say “it’s not worth the pain” • Rolling out updates is a team effort • Ideas for what to do next now come from the wider team • Tooling and the web platform will, in time, make the problems easier • Still complex, but getting closer to being as complex as it needs to be Still able to release our front end easily Where are we now?
  47. Chronicles of FT.com • A fast paywall using microservices •

    Performance optimisation in a distributed front-end • Straight to prod service workers
  48. Why should we want a service worker on ft.com? •

    Persistent caching • Gateway to lots of whizzy performance optimisations • Greater resilience when the network lets us down • App-like behaviour e.g. push notifications • Nothing to do with wanting to look cool… who told you that?
  49. • “Service workers essentially act as proxy servers that sit

    between web applications, and the browser and network” – MDN • “I'll eat anything you want me to eat, I'll swallow anything you want me to swallow, So, come on down and I'll chew on a dog!” – Beetlejuice What are service workers?
  50. https://github.com/popeindustries

  51. Bugs included… • Showing users barriers even after they’d signed

    in • Replacing all pages on Falcon with blank error pages • Users with cookies longer than 4000 characters got permanently stuck on bad versions FT.com service worker round one
  52. • No automated unit or integration tests in CI •

    No way to test changes to service worker without releasing to all users • No way to turn off individual features of the service worker • No easy way to roll back or turn off a broken service worker What went wrong?
  53. How to test your service worker • A surprising number

    of SW APIs are available in the DOM • https://github.com/popeindustries – some great resources, including a mock SW environment • Test runners, e.g. Karma, can be adapted to spin up sandboxed service workers • Instrument your SW and use postMessage to interrogate it from tests running in your page • Careful of CORS
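The "instrument your SW and use postMessage" idea becomes much easier to test if the message handling is a pure function: tests can drive it directly, and the worker just wires it to `self.addEventListener('message', ...)`. A hedged sketch with illustrative names:

```javascript
// Keep the inspection logic pure so tests need no real worker.
// `state` is whatever internal state the worker wants to expose.
function createInspector(state) {
  return function handleMessage(data) {
    // a test page posts e.g. { type: 'GET_STATE', key: 'cachedUrls' };
    // the worker sends this reply back via event.source.postMessage
    if (data && data.type === 'GET_STATE') {
      return { key: data.key, value: state[data.key] };
    }
    return { error: 'unknown message type' };
  };
}
```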
  54. • Feature flags are used to change the url a

    SW is installed from • So we have /__sw-prod, /__sw-qa and /__sw-canary • Differently tagged commits deploy to each of these destinations • Testers can override flags locally to test a QA release • Canary releases target 3% of users Test ‘environments’ for the service worker
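The flag-to-URL mapping above can be sketched as a tiny function the page runs before registering. The `/__sw-*` URLs come from the slide; the flag names and precedence (canary wins over QA) are assumptions for illustration.

```javascript
// Pick which service worker endpoint to install from, based on flags.
function serviceWorkerUrl(flags) {
  if (flags.swCanary) return '/__sw-canary'; // 3% of users
  if (flags.swQa) return '/__sw-qa';         // testers overriding flags locally
  return '/__sw-prod';                       // everyone else
}
// In the page: navigator.serviceWorker.register(serviceWorkerUrl(flags));
```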
  55. • Store feature flags in IndexedDB • Write helpers that

    check a flag is on before carrying out an action Feature flags in service workers
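A sketch of the "check a flag before carrying out an action" helper. Real code would read the flag from IndexedDB inside the worker; here the async flag store is injected so the shape is clear and testable. All names are illustrative.

```javascript
// Run `action` only when the named flag is on; otherwise do nothing.
async function withFlag(flagStore, flagName, action) {
  const enabled = await flagStore.get(flagName); // e.g. an IndexedDB read
  if (enabled) {
    return action();
  }
  return undefined; // flag off: fall through
}
// Assumed usage in a fetch handler:
// event.respondWith(
//   withFlag(flags, 'swCacheAssets', () => caches.match(event.request))
//     .then((hit) => hit || fetch(event.request))
// );
```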
  56. • Don’t cut corners turning your SW off! • Make

    sure you’re able to overwrite any bad SW with a good one • Put code to unregister your SW behind a flag or similar • Serve your SW from an unversioned URL • Once more, https://github.com/popeindustries has lots of advice Try switching it off and off again
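One way to read "put code to unregister your SW behind a flag": serve, from the same unversioned URL, a worker that checks a kill flag on activation and cleans itself up. The flag name is illustrative; the worker-side calls are standard APIs, shown as comments since they only run in a worker.

```javascript
// Decide whether this worker should remove itself.
function shouldSelfDestruct(flags) {
  return Boolean(flags['sw-kill-switch']);
}
// In the worker's activate handler (assumed wiring):
// if (shouldSelfDestruct(flags)) {
//   await self.registration.unregister();
//   const clients = await self.clients.matchAll();
//   clients.forEach((client) => client.navigate(client.url)); // reload pages off the SW
// }
```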
  57. • Gradual roll out of features since spring 2017 •

    Push notifications operating for 6 months without a hitch • Most front-end assets now cached in the SW • Running an experiment using SW to improve ad load times • 0 bugs impacting the end user • Bring on the PWA! With a bit of extra effort, service workers are compatible with our way of working FT.com service worker round two
  58. So what’s the meaning of all this?

  59. • How we choose to build complex things is really

    important • These choices won’t necessarily always be good for performance • Finding compromises between these takes thought, effort, and a healthy relationship with failure • You can be a long way from perfect and still get great results PS we’re hiring www.ft.com/dev/null Summary