Slide 1

Speeding up without slowing down
Building a faster FT.com, fast

Slide 2

www.ft.com
Faster than the average media site

Slide 3

Who is this bearded hippie?
● Engineer on FT.com
● Worked on performance for over a year
● Still plenty of gaps in my knowledge
● Likes: birds, Father Ted, W3C
● Dislikes: dogs, Game of Thrones, React
Rhys Evans, @wheresrhys

Slide 4

We’re hiring! www.ft.com/dev/null

Slide 5

What’s this talk about?
● Our tech stack, how and why we got here
● How our choices can conflict with performance optimisation
● Some examples of how we’ve worked around these problems
● Group therapy

Slide 6

A brief history lesson… it is the year 3BP*
*before perf

Slide 7

Welcome to the old FT.com, ‘Falcon’

Slide 8

Welcome to the ironically named old FT.com, ‘Falcon’

Slide 9

Welcome to the ironically named old FT.com, ‘Falcon’
Not just slow for users, Falcon was also slow and dangerous to develop:
● Multiple environments
● Roughly one big-bang, cluster-bug release a month
● A proliferation of hacks to circumvent the release cycle
‘Strategic Products’ was formed in order to release small, well-made, experimental features quickly

Slide 10

If we ever need another ‘Strategic Products’, then we’ve gone badly wrong

Slide 11

5 pillars of FT.com
● Take back control
● Straight to prod deployment
● Feature flags
● Microservices
● Componentisation

Slide 12

Take back control
On Falcon nobody had real ownership of the codebase. It was a free-for-all of:
● Tag managers
● Third-party bloatware and vulnerabilities
● Bad ideas nobody ever validated or switched off
The tech team insisted on full control over what was allowed on the new FT.com

Slide 13

Straight to prod deployment
We have 2 environments: local development and production. Every merged pull request serves production traffic within about 10 minutes.
GitHub → CircleCI → Heroku

Slide 14

Straight to prod deployment
Advantages:
● Smaller releases, so bugs are easier to find and fix
● Fewer environment/config bugs
● Easy to validate we’re building the right thing before building it right
● The adrenaline rush

Slide 15

Feature flags
● Hide work in progress
● Enable QA in prod
● Split bigger features into many small releases to prod
Without flags, working the way we do would produce unacceptably buggy results.
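
As a loose illustration of how a flag hides work in progress in production (this is not FT’s actual flags API; the flag name, route and Express setup are made up):

```js
// Hypothetical flag guard: the work-in-progress code ships to prod, but stays
// dark until the flag is switched on for QA or a gradual release.
const express = require('express');
const app = express();

// Stand-in for a real flag service; in practice flags are toggled at runtime
const flags = new Map([['newRelatedContent', false]]);

app.get('/article/:id', (req, res) => {
  const model = { id: req.params.id, related: [] };
  if (flags.get('newRelatedContent')) {
    model.related = ['work-in-progress recommendations']; // hidden behind the flag
  }
  res.json(model);
});

app.listen(3000);
```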

Slide 16

Microservices
FT.com is made of 100+ independent microservices

Slide 17

Microservices
Advantages:
● Quick and easy to comprehend and test
● Confidence and speed deploying (and if necessary rolling back)
● Scaling is, in most cases, trivial
● Greater fault tolerance through isolation, so easier to experiment

Slide 18

Componentisation
The FT has, over the past few years, invested in a set of configurable client-side components: origami.ft.com. Our user-facing apps typically use 30+ components.

Slide 19

Componentisation
Advantages:
● Quick and easy to comprehend and test
● Avoids duplication of effort and client-side code bloat
● Single source of truth for branding across many sites
● High standards, e.g. accessibility

Slide 20

What do these things have in common?
They make it easy to release good quality software quickly, frequently and with confidence

Slide 21

Chronicles of FT.com
● A fast paywall using microservices
● Performance optimisation in a distributed front-end
● Straight to prod service workers

Slide 22

Chronicles of FT.com
● A fast paywall using microservices
● Performance optimisation in a distributed front-end
● Straight to prod service workers

Slide 23

The critical path… extended
The critical path isn’t just about inlining CSS: it’s every request involved in serving a meaningful page to the user. Microservices have the potential to multiply the number of requests in the critical path.
[Diagram: User-facing app → Dependency → Dependency → Dependency]

Slide 24

Caching and the critical path
With good caching, the number of requests in the critical path can be reduced. By sharing a cached response between many users we can reduce the likelihood of a long critical path chain.
[Diagram: User-facing app → Dependency → Dependency → Cache → Dependency]
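
For example (a sketch only; the route, timings and Express setup are illustrative, not FT’s actual services), a dependency can mark its responses as shareable so a cache in front of it answers most requests and keeps that hop off the critical path:

```js
// Illustrative downstream microservice: responses are publicly cacheable, so a
// shared cache (CDN / reverse proxy) can serve many users from one response.
const express = require('express');
const app = express();

app.get('/recommendations/:contentId', (req, res) => {
  // Anyone can reuse this response for a minute; stale copies may be served
  // while the cache revalidates in the background.
  res.set('Cache-Control', 'public, max-age=60, stale-while-revalidate=300');
  res.json({ contentId: req.params.contentId, recommendations: [] });
});

app.listen(3001);
```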

Slide 25

Caching and the paywall
Having a paywall means we serve a different page depending on who you are. Can we share cached pages somehow?

Slide 26

A poor strategy
[Diagram: browser sends Cookie: FT_Session=blahblah; FT_edition=uk;... → CDN (Fastly) → Application, which responds with Vary: Cookie]
Aim to vary based on something with fewer unique values than Cookie: enter Preflight

Slide 27

User cohorting with headers
Preflight: a service that converts cookies and other per-user data to more generic headers
[Diagram: Cookie: FT_Session=blahblah; FT_edition=uk;... (unique per user) becomes FT-Authorized: true, FT-Edition: uk, FT-AB-Tests: fake-news:on; (2 x 2 x n x … possible values, shared by everyone in a cohort of users)]

Slide 28

Varying content per cohort
[Diagram: Cookie: FT_Session=blahblah; FT_edition=uk;... → Preflight → FT-Authorized: true, FT-Edition: uk, FT-AB-Tests: fake-news:on; → CDN (Fastly) → Application, which responds with Vary: FT-Authorized, FT-Edition, FT-AB-Tests]
Responses become highly cacheable. Preflight is the perf bottleneck, but it’s not in the critical path when the cache is hit.
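
A minimal sketch of the idea (the cookie and header names follow the slides, but the cohort logic is illustrative, and Preflight is collapsed into an Express middleware here rather than being a separate service fronted by Fastly):

```js
// "Preflight" collapses unique per-user cookies into a few cohort headers, and
// the app varies its cached response on those headers instead of on Cookie.
const express = require('express');
const cookieParser = require('cookie-parser');

const app = express();
app.use(cookieParser());

// Preflight: unique per-user data in, shared cohort headers out
app.use((req, res, next) => {
  req.headers['ft-authorized'] = req.cookies.FT_Session ? 'true' : 'false';
  req.headers['ft-edition'] = req.cookies.FT_edition || 'uk';
  req.headers['ft-ab-tests'] = req.cookies.FT_ab_tests || 'fake-news:off'; // assumed cookie name
  next();
});

// Application: one cached copy per cohort at the CDN, not one per user
app.get('/', (req, res) => {
  res.set('Vary', 'FT-Authorized, FT-Edition, FT-AB-Tests');
  res.set('Cache-Control', 'public, max-age=300');
  res.send(`edition=${req.headers['ft-edition']}, authorized=${req.headers['ft-authorized']}`);
});

app.listen(3000);
```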

Slide 29

Turtles all the way down
[Diagram: Preflight itself depends on further microservices: Session, Access, Barriers, Vanity URLs, A/B testing. This chain is the perf bottleneck.]

Slide 30

Microservice whack-a-mole
1. Find the microservice that’s ‘sticking up’, i.e. the slowest one
2. Whack it
3. Repeat

Slide 31

Microservice whack-a-mole: some whacking strategies
● Measure everything – how you gonna find a mole with your eyes closed?
● Measure granularly – different code paths mean different perf
● Use median, 95th and other percentiles, not the mean
● Keep an eye on timeouts

Slide 32

Microservice whack-a-mole: some whacking strategies
● Geography matters – look out for bugs in your DNS & routing layer
● Minimise request overhead
  ○ in Node.js, HTTP agents with keepAlive enabled are easy to set up and very effective (see the sketch below)
● Persuade your business to impose a perf budget to avoid regressions
● The case we made: http://bit.ly/2zmGZ4H
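
The keepAlive tip can be as small as this sketch (the dependency URL and socket limit are made up):

```js
// Reuse TCP/TLS connections between requests to a dependency instead of paying
// the connection setup cost on every call.
const https = require('https');

const keepAliveAgent = new https.Agent({ keepAlive: true, maxSockets: 50 });

https.get('https://example-dependency.internal/health', { agent: keepAliveAgent }, res => {
  res.resume(); // drain the body so the socket can go back into the pool
  console.log('status:', res.statusCode);
});
```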

Slide 33

The results
● Median paywall decision within 20ms
● No paywall decision slower than 200ms
● High cache hit rate: speed & resilience
● Cost savings on computing power
All delivered without impacting our ability to work and release software efficiently

Slide 34

Chronicles of FT.com
● A fast paywall using microservices
● Performance optimisation in a distributed front-end
● Straight to prod service workers

Slide 35

Some front-end fundamentals
Any performant front-end should aim to:
● Reuse assets between visits to the site
● Reuse assets between page views on the same visit
● Implement modern performance best practices
  ○ responsive images
  ○ lazy loading
  ○ resource hint headers (see the sketch below)
  ○ inlining critical path CSS
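
As an illustration of the resource-hint point (a sketch assuming an Express app; the asset paths and CDN host are made up), the server can send Link headers so the browser starts fetching critical assets before it has parsed the HTML:

```js
// Illustrative Express middleware: send resource hints as HTTP Link headers so
// the browser can start fetching key assets early.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  res.set('Link', [
    '</assets/main.css>; rel=preload; as=style',
    '</assets/main.js>; rel=preload; as=script',
    '<https://cdn.example.com>; rel=preconnect'
  ].join(', '));
  next();
});

app.get('/', (req, res) => res.send('<!DOCTYPE html><html><body>…</body></html>'));

app.listen(3000);
```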

Slide 36

How did FT.com measure up?
● Shared source code & design: > 50%
● Shared JS & CSS assets: 0%
● Shared JS & CSS between visits: probably 0%
● Inlined CSS: 0%

Slide 37

Why was it so bad?
● Rapid release cycle & many components update often, busting the cache
● Sharing CSS and JS between independent user-facing services is hard
● Decisions about which CSS to inline are… also hard
● Hard to retrofit optimisations to 1 app, let alone 12
[Diagram: each app (App1, App2, …) has its own build (Build1) and its own assets (Assets1)]

Slide 38

Couldn’t we just lock down our versions?
● Semver gets a bad rap… remember left-pad?
● But our components are generally very high quality and maintained by people close to our team
● We trust semver, and this rewards us with consistency and efficiency
● Locking down our versions would make it harder to release software

Slide 39

Couldn’t we just have one front-end app?
● Each app’s build has its own gremlins
● Combining all apps means combining all builds
● This multiplies the number of things that can stop us releasing code
● It also increases the surface area of potential bugs
● A single front-end app would make it harder to release software

Slide 40

Build a performance thing
It became clear that performance optimisation is:
● Too tricky and time-consuming to leave to each app
● An intricate dance between the server, the build and the client-side code, so any solution would have to be full stack
And so n-ui came into being

Slide 41

What is n-ui?
● An npm and Bower component: a bundle of preconfigured components used in all our apps
● A build tool with rudimentary JS and stylesheet splitting
● A server with knowledge of all the relevant assets and tools
● Templates and asset-loading tools running in the browser
● A deploy tool for delivering assets to the CDN: one CDN serving assets unique to each app, another serving shared assets

Slide 44

…or don’t build a performance thing
● “could someone get a grip on whoever is f**king with the styling on the site this week”
● “I’ve *really* had enough of trying to get n-ui updates to work”
● “problem started occurring after a n-ui update”
● “what I did see was n-ui as a dev-dependency, but I guess that is still going to lead to mind-melting”
● “that’s presumably from an n-ui update?”
● “what is going on with n-ui?”
● “Apologies on the n-ui issues all, am on it. Also added to my todo list to stop us breaking stuff”
● “Has anyone managed to recently successfully bower link/npm link n-ui in an app?”
● “how to fix this motherf***ing `Projects using n-ui must maintain parity between versions` error?!”
● “just pushing a bug fix in n-ui right now If I can get the build to pass! Damn n-ui!!!”

Slide 45

What went wrong?
● Don’t build a performance thing on your own
● Benefits to isolating complexity from the rest of your codebase
● But it’s an illusion to imagine your abstractions will be perfect
● Collaboration ensures comprehensibility and maintainability

Slide 46

Where are we now?
● Retreated from a few of the too-clever-by-half ideas: be prepared to say “it’s not worth the pain”
● Rolling out updates is a team effort
● Ideas for what to do next now come from the wider team
● Tooling and the web platform will, in time, make the problems easier
● Still complex, but getting closer to being as complex as it needs to be
We’re still able to release our front end easily

Slide 47

Chronicles of FT.com
● A fast paywall using microservices
● Performance optimisation in a distributed front-end
● Straight to prod service workers

Slide 48

Why should we want a service worker on ft.com?
● Persistent caching
● Gateway to lots of whizzy performance optimisations
● Greater resilience when the network lets us down
● App-like behaviour, e.g. push notifications
● Nothing to do with wanting to look cool… who told you that?

Slide 49

What are service workers?
● “Service workers essentially act as proxy servers that sit between web applications, and the browser and network” – MDN
● “I’ll eat anything you want me to eat, I’ll swallow anything you want me to swallow, So, come on down and I’ll chew on a dog!” – Beetlejuice

Slide 50

https://github.com/popeindustries

Slide 51

FT.com service worker round one
Bugs included…
● Showing users barriers even after they’d signed in
● Replacing all pages on Falcon with blank error pages
● Users with cookies longer than 4000 characters got permanently stuck on bad versions

Slide 52

What went wrong?
● No automated unit or integration tests in CI
● No way to test changes to the service worker without releasing to all users
● No way to turn off individual features of the service worker
● No easy way to roll back or turn off a broken service worker

Slide 53

How to test your service worker
● A surprising number of SW APIs are available in the DOM
● https://github.com/popeindustries – some great resources, including a mock SW environment
● Test runners, e.g. Karma, can be adapted to spin up sandboxed service workers
● Instrument your SW and use postMessage to interrogate it from tests running in your page (see the sketch below)
● Careful of CORS
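
A rough sketch of the postMessage approach (the message shape and the __flags property on the worker are illustrative, not FT’s actual instrumentation):

```js
// In the page (e.g. inside a Karma test): ask the active service worker a question
async function askServiceWorker(message) {
  const registration = await navigator.serviceWorker.ready;
  return new Promise(resolve => {
    const channel = new MessageChannel();
    channel.port1.onmessage = event => resolve(event.data);
    registration.active.postMessage(message, [channel.port2]);
  });
}
// e.g. const flags = await askServiceWorker({ type: 'GET_FLAGS' });

// In the service worker (sw.js): reply on the port the page handed over
self.addEventListener('message', event => {
  if (event.data && event.data.type === 'GET_FLAGS') {
    event.ports[0].postMessage(self.__flags || {}); // illustrative instrumentation
  }
});
```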

Slide 54

Test ‘environments’ for the service worker
● Feature flags are used to change the URL the SW is installed from
● So we have /__sw-prod, /__sw-qa and /__sw-canary
● Differently tagged commits deploy to each of these destinations
● Testers can override flags locally to test a QA release
● Canary releases target 3% of users
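
Roughly, the client-side registration can pick its URL from a flag (the getFlag helper and flag name here are hypothetical; the /__sw-* paths are from the slide):

```js
// Illustrative registration: the install URL comes from a feature flag, so QA
// and canary workers only reach the users they're meant to.
const swPaths = {
  prod: '/__sw-prod',
  qa: '/__sw-qa',
  canary: '/__sw-canary'
};

// Stand-in for a real flag lookup; in practice this reads the user's flags
const getFlag = name => ({ serviceWorkerRelease: 'prod' })[name];

if ('serviceWorker' in navigator) {
  const release = getFlag('serviceWorkerRelease') || 'prod'; // 'prod' | 'qa' | 'canary'
  navigator.serviceWorker
    .register(swPaths[release])
    .catch(err => console.warn('SW registration failed', err));
}
```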

Slide 55

Feature flags in service workers
● Store feature flags in IndexedDB
● Write helpers that check a flag is on before carrying out an action
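
A minimal sketch of such a helper inside the worker (the database, store and flag names are made up, and the page is assumed to have already written the flags into IndexedDB):

```js
// Flag helper for a service worker: read a boolean flag from IndexedDB before
// deciding whether to act on a fetch.
function getFlag(name) {
  return new Promise((resolve, reject) => {
    const open = indexedDB.open('flags', 1);
    open.onupgradeneeded = () => open.result.createObjectStore('flags');
    open.onerror = () => reject(open.error);
    open.onsuccess = () => {
      const tx = open.result.transaction('flags', 'readonly');
      const get = tx.objectStore('flags').get(name);
      get.onsuccess = () => resolve(Boolean(get.result));
      get.onerror = () => reject(get.error);
    };
  });
}

// Only serve from the cache when the relevant flag is on
self.addEventListener('fetch', event => {
  event.respondWith(
    getFlag('swAssetCaching').then(enabled =>
      enabled
        ? caches.match(event.request).then(hit => hit || fetch(event.request))
        : fetch(event.request)
    )
  );
});
```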

Slide 56

Try switching it off and off again
● Don’t cut corners turning your SW off!
● Make sure you’re able to overwrite any bad SW with a good one
● Put code to unregister your SW behind a flag or similar (see the sketch below)
● Serve your SW from an unversioned URL
● Once more, https://github.com/popeindustries has lots of advice
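
And a rough sketch of the flag-guarded kill switch (reusing the hypothetical getFlag helper from the previous sketch; the flag name is made up): the worker unregisters itself and reloads its pages when the flag is flipped.

```js
// Kill switch: when the 'disableSw' flag is on, unregister this worker and
// navigate open pages back to the network-served site.
self.addEventListener('activate', event => {
  event.waitUntil(
    getFlag('disableSw').then(disable => {
      if (!disable) return;
      return self.registration.unregister()
        .then(() => self.clients.matchAll({ type: 'window' }))
        .then(clients => Promise.all(clients.map(client => client.navigate(client.url))));
    })
  );
});
```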

Slide 57

FT.com service worker round two
● Gradual roll-out of features since spring 2017
● Push notifications operating for 6 months without a hitch
● Most front-end assets now cached in the SW
● Running an experiment using the SW to improve ad load times
● 0 bugs impacting the end user
● Bring on the PWA!
With a bit of extra effort, service workers are compatible with our way of working

Slide 58

So what’s the meaning of all this?

Slide 59

Summary
● How we choose to build complex things is really important
● These choices won’t necessarily always be good for performance
● Finding compromises between these takes thought, effort, and a healthy relationship with failure
● You can be a long way from perfect and still get great results
PS we’re hiring: www.ft.com/dev/null