Speeding up without slowing down

Rhys Evans
November 07, 2017

At the FT we built one of the world's fastest media websites, and we release to production dozens of times a day. But the architectural and organisational decisions that let us deliver reliable features quickly and consistently don't always fit neatly with our desire to optimise performance.

In this warts-and-all talk, you'll learn

- how we build FT.com
- how a highly componentised, microservices stack with a rapid release cycle can sometimes get in the way of performance
- some ideas for working around these obstacles
- that web performance is hard, and no-one's perfect

Transcript

  1. Speeding up
    without slowing down
    Building a faster FT.com, fast

  2. www.ft.com
    Faster than the average media site

  3. Who is this bearded hippie?
    ● Engineer on FT.com
    ● Worked on performance for over a year
    ● Still plenty of gaps in my knowledge
    ● Likes: Birds, Father Ted, W3C
    ● Dislikes: Dogs, Game of Thrones, React
    @wheresrhys Rhys Evans

  4. We’re hiring!
    www.ft.com/dev/null

  5. ● Our tech stack, how and why we got here
    ● How our choices can conflict with performance
    optimisation
    ● Some examples of how we’ve worked around these
    problems
    ● Group therapy
    What’s this talk about?

  6. A brief history lesson
    … it is the year 3BP*
    *before perf

  7. Welcome to the old FT.com, ‘Falcon’

  8. Welcome to the ironically named old FT.com, ‘Falcon’

  9. Welcome to the ironically named old FT.com, ‘Falcon’
    Not just slow, Falcon was also slow and dangerous to develop
    ● Multiple environments
    ● Roughly one big bang, cluster-bug release a month
    ● Proliferation of hacks in order to circumvent the release cycle
    ‘Strategic Products’ formed in order to release small, well made, experimental
    features quickly

  10. If we ever need another ‘Strategic
    Products’, then we’ve gone badly
    wrong

  11. 5 pillars of FT.com
    ● Take back control
    ● Straight to prod deployment
    ● Feature flags
    ● Microservices
    ● Componentisation

  12. On Falcon nobody had real ownership of the codebase.
    Free for all of:
    ● Tag managers
    ● Third party bloatware and vulnerabilities
    ● Bad ideas nobody ever validated or switched off
    The tech team insisted on full control over what was allowed on the new FT.com
    Take back control

  13. We have 2 environments – local development and production.
    Every merged pull request serves production traffic within about 10 minutes.
    Straight to prod deployment
    GitHub CircleCI Heroku

  14. Advantages:
    ● Smaller releases, so bugs are easier to find and fix
    ● Fewer environment/config bugs
    ● Easy to validate we’re building the right it before building it right
    ● The adrenaline rush
    Straight to prod deployment

  15. Feature flags
    ● Hide work in progress
    ● Enable QA in prod
    ● Split bigger features into many small
    releases to prod
    Without flags, it’d be difficult to work as we
    do without unacceptably buggy results.

  16. FT.com is made of
    100+ independent
    microservices
    Microservices

  17. Microservices
    Advantages:
    ● Quick and easy to comprehend and test
    ● Confidence and speed deploying (and if necessary rolling back)
    ● Scaling is, in most cases, trivial
    ● Greater fault tolerance through isolation, so easier to experiment

  18. The FT has, over the past
    few years, invested in a set
    of configurable client-side
    components, origami.ft.com.
    Our user-facing apps
    typically use 30+
    components
    Componentisation

  19. Componentisation
    Advantages:
    ● Quick and easy to comprehend and test
    ● Avoids duplication of effort and client-side code bloat
    ● Single source of truth for branding across many sites
    ● High standards e.g. accessibility

  20. What do these things have in common?
    — They make it easy to release good
    quality software quickly, frequently and
    with confidence

  21. Chronicles of FT.com
    ● A fast paywall using microservices
    ● Performance optimisation in a
    distributed front-end
    ● Straight to prod service workers

  23. The critical path… extended
    The critical path isn’t just about inlining CSS
    It’s every request involved in serving a meaningful page to the user
    Microservices have the potential to multiply the number of requests in the critical path
    [Diagram: a user-facing app and three dependencies]

  24. Caching and the critical path
    With good caching, the number of requests in the critical path can be reduced
    By sharing a cached response between many users we can reduce the likelihood of a long critical path chain
    [Diagram: a user-facing app, three dependencies and a cache]

  25. Having a paywall means we serve a different page depending on who you are
    Can we share cached pages somehow?
    Caching and the paywall

  26. A poor strategy
    Cookie: FT_Session=blahblah; FT_edition=uk;...
    CDN (Fastly)
    Application
    Vary: Cookie
    Aim to vary based on something with fewer unique values than Cookie
    Preflight

  27. User cohorting with headers
    Preflight: service that converts cookies and other per user data to more generic headers
    Unique per user:
    Cookie: FT_Session=blahblah;FT_edition=uk;...
    Shared by everyone in a cohort of users:
    FT-Authorized: true (2 x)
    FT-Edition: uk (2 x)
    FT-AB-Tests: fake-news:on; (n x)
    …

  28. Varying content per cohort
    Preflight
    Cookie: FT_Session=blahblah; FT_edition=uk;...
    FT-Authorized: true
    FT-Edition: uk
    FT-AB-Tests: fake-news:on;
    CDN (Fastly)
    Application
    Vary: FT-Authorized, FT-Edition, FT-AB-Tests
    [Diagram labels: highly cacheable; perf bottleneck; not in the critical path (when cache hit)]
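
The cohorting idea on the last two slides can be sketched roughly as below. This is an illustration under assumptions, not the FT's Preflight code: the function names, the injected `isAuthorized` check and the "uk" fallback edition are all hypothetical. Only the header and cookie names come from the slides.

```typescript
// Sketch: reduce per-user cookies to a few low-cardinality cohort headers,
// so a CDN can vary its cache on those headers instead of on Cookie.
type CohortHeaders = {
  "FT-Authorized": "true" | "false";
  "FT-Edition": string;
  "FT-AB-Tests": string;
};

// Parse a Cookie header into a name -> value lookup.
function parseCookies(cookieHeader: string): Record<string, string | undefined> {
  const cookies: Record<string, string | undefined> = {};
  for (const part of cookieHeader.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip fragments without a value
    cookies[part.slice(0, eq).trim()] = part.slice(eq + 1).trim();
  }
  return cookies;
}

// Hypothetical preflight step. `isAuthorized` stands in for the real session
// and access checks; `abTests` for the real A/B allocation.
function toCohortHeaders(
  cookieHeader: string,
  isAuthorized: (session: string) => boolean,
  abTests: string,
): CohortHeaders {
  const cookies = parseCookies(cookieHeader);
  const session = cookies["FT_Session"];
  return {
    "FT-Authorized": session !== undefined && isAuthorized(session) ? "true" : "false",
    "FT-Edition": cookies["FT_edition"] ?? "uk", // assumed default edition
    "FT-AB-Tests": abTests,
  };
}

// What the application would send so the CDN keys its cache per cohort.
const VARY_HEADER = "FT-Authorized, FT-Edition, FT-AB-Tests";

// Two users in the same cohort produce the same key, so they can share a cached page.
function cacheKey(headers: CohortHeaders): string {
  return [headers["FT-Authorized"], headers["FT-Edition"], headers["FT-AB-Tests"]].join("|");
}
```

Two signed-in UK users with different `FT_Session` values end up with identical cohort headers, which is what makes the page behind the CDN shareable.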

  29. Turtles all the way down
    [Diagram: Preflight and the services behind it: Session, Access, Barriers, Vanity urls, A/B testing]
    Perf bottleneck

  30. 1. Find the microservice that’s
    ‘sticking up’, i.e. the slowest one
    2. Whack it
    3. Repeat
    Microservice whack-a-mole

  31. ● Measure everything – how you gonna find a mole with your eyes closed?
    ● Measure granularly – different code paths mean different perf
    ● Use median, 95th and other percentiles, not the mean
    ● Keep an eye on timeouts
    Microservice whack-a-mole: some whacking strategies
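
To illustrate why percentiles are preferred over the mean, here is a small sketch using the nearest-rank method. The timing samples are invented for the example and the helper names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Made-up response times in ms: nine fast requests and one that hit a timeout.
const timingsMs = [12, 14, 15, 16, 18, 19, 20, 22, 25, 900];

const median = percentile(timingsMs, 50); // typical request
const p95 = percentile(timingsMs, 95);    // the slow tail
const average = mean(timingsMs);          // describes neither
```

With these samples the median is 18ms and the 95th percentile is 900ms, while the mean comes out at about 106ms, a figure no real request experienced.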

  32. ● Geography matters – Look out for bugs in your DNS & routing layer.
    ● Minimise request overhead
    ○ in Node.js, using HTTP agents with keepAlive is easy and very effective
    ● Persuade your business to impose a perf budget to avoid regressions
    ● The case we made http://bit.ly/2zmGZ4H
    Microservice whack-a-mole: some whacking strategies
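
A minimal sketch of the keepAlive suggestion, using Node's built-in `https` module. The `getJson` helper is hypothetical; the point is only that one shared agent lets calls between services reuse open connections instead of paying for a new TCP and TLS handshake each time.

```typescript
import * as https from "https";

// One agent shared by the whole process, so sockets are kept open and reused.
const keepAliveAgent = new https.Agent({ keepAlive: true });

// Hypothetical helper: GET a URL through the shared agent and parse the body as JSON.
function getJson(url: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent: keepAliveAgent }, (res) => {
        let body = "";
        res.setEncoding("utf8");
        res.on("data", (chunk) => {
          body += chunk;
        });
        res.on("end", () => {
          try {
            resolve(JSON.parse(body));
          } catch (err) {
            reject(err);
          }
        });
      })
      .on("error", reject);
  });
}
```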

  33. ● Median paywall decision within 20ms
    ● No paywall decision slower than 200ms
    ● High cache hit rate: speed & resilience
    ● Cost savings on computing power
    All delivered without impacting our ability to work
    and release software efficiently
    The results

  34. Chronicles of FT.com
    ● A fast paywall using microservices
    ● Performance optimisation in a
    distributed front-end
    ● Straight to prod service workers

  35. Any performant front-end should aim to:
    ● Reuse assets between visits to the site
    ● Reuse assets between page views on the same visit
    ● Implement modern performance best practices
    ○ responsive images
    ○ lazy loading
    ○ resource hint headers
    ○ inlining critical path CSS
    Some front-end fundamentals
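
As an illustration of resource hint headers, the sketch below builds a `Link` response header value. The `Hint` type, the helper name and the example URLs are assumptions made for the example.

```typescript
// A resource hint to be sent as part of a Link response header.
type Hint = {
  href: string;
  rel: "preload" | "preconnect" | "dns-prefetch";
  as?: "style" | "script" | "font" | "image";
  crossorigin?: boolean;
};

// Serialise hints into a single Link header value, e.g.
// </main.css>; rel=preload; as=style
function linkHeader(hints: Hint[]): string {
  return hints
    .map((hint) => {
      const params = [`<${hint.href}>`, `rel=${hint.rel}`];
      if (hint.as) params.push(`as=${hint.as}`);
      if (hint.crossorigin) params.push("crossorigin");
      return params.join("; ");
    })
    .join(", ");
}

// Example: preload the main stylesheet and warm up a connection to an asset host.
const exampleLink = linkHeader([
  { href: "/main.css", rel: "preload", as: "style" },
  { href: "https://assets.example.com", rel: "preconnect" },
]);
```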

  36. ● Shared source code & design:
    > 50%
    ● Shared JS & CSS assets:
    0%
    ● Shared JS & CSS between visits:
    probably 0%
    ● Inlined CSS:
    0%
    How did FT.com measure up?

  37. Why was it so bad?
    ● Rapid release cycle & many components update often, busting the cache
    ● Sharing CSS and JS between independent user-facing services is hard
    ● Decisions about which CSS to inline are… also hard
    ● Hard to retrofit optimisations to 1 app, let alone 12
    [Diagram labels: App1, App2, Build1, Assets1]

  38. ● Semver gets a bad rap… remember left-pad?
    ● But our components are generally very high quality and maintained by people
    close to our team
    ● We trust semver, and this rewards us with consistency and efficiency
    ● Locking down our versions would make it harder to release software
    Couldn’t we just lock down our versions?

  39. ● Each app’s build has its own gremlins
    ● Combining all apps means combining all builds
    ● This multiplies the number of things that can stop
    us releasing code
    ● Also increases the surface area of potential bugs
    ● A single front-end app would make it harder to
    release software
    Couldn’t we just have one front-end app?

  40. It became clear that performance optimisation is:
    ● Too tricky and time-consuming to leave to each app
    ● An intricate dance between the server, the build and the client side code, so
    any solution would have to be full stack
    And so n-ui came into being
    Build a performance thing

  41. What is n-ui?
    [Diagram of n-ui alongside App1 and App2, with these labels:]
    ● Bundle of preconfigured components used in all our apps
    ● npm and Bower component
    ● Server with knowledge of all relevant assets and tools
    ● Build tool with rudimentary JS and stylesheet splitting
    ● Templates and asset loading tools running in the browser
    ● Deploy tool for delivering assets to the CDN
    ● CDN serving assets unique to each app
    ● CDN serving shared assets

  44. …or don’t build a performance thing
    [Quoted messages:]
    ● “could someone get a grip on whoever is f**king with the styling on the site this week”
    ● “I’ve *really* had enough of trying to get n-ui updates to work”
    ● “problem started occurring after a n-ui update”
    ● “what I did see was n-ui as a dev-dependency, but I guess that is still going to lead to mind-melting”
    ● “that’s presumably from an n-ui update?”
    ● “what is going on with n-ui?”
    ● “Apologies on the n-ui issues all, am on it. Also added to my todo list to stop us breaking stuff”
    ● “Has anyone managed to recently successfully bower link/npm link n-ui in an app?”
    ● “how to fix this motherf***ing `Projects using n-ui must maintain parity between versions` error?!”
    ● “just pushing a bug fix in n-ui right now”
    ● “If I can get the build to pass! Damn n-ui!!!”

  45. ● Don’t build a performance thing on your own
    ● Benefits to isolating complexity from the rest of your codebase
    ● But it’s an illusion to imagine your abstractions will be perfect
    ● Collaboration ensures comprehensibility and maintainability
    What went wrong?

  46. ● Retreated from a few of the too-clever-by-half ideas
    — be prepared to say “it’s not worth the pain”
    ● Rolling out updates is a team effort
    ● Ideas for what to do next now come from the wider team
    ● Tooling and the web platform will, in time, make the problems easier
    ● Still complex, but getting closer to being as complex as it needs to be
    Still able to release our front end easily
    Where are we now?

  47. Chronicles of FT.com
    ● A fast paywall using microservices
    ● Performance optimisation in a
    distributed front-end
    ● Straight to prod service workers

  48. Why should we want a service worker on ft.com?
    ● Persistent caching
    ● Gateway to lots of whizzy performance optimisations
    ● Greater resilience when the network lets us down
    ● App-like behaviour e.g. push notifications
    ● Nothing to do with wanting to look cool… who told you that?

  49. ● “Service workers essentially act as
    proxy servers that sit between web
    applications, and the browser and
    network” – MDN
    ● “I'll eat anything you want me to eat,
    I'll swallow anything you want me to
    swallow, So, come on down and I'll
    chew on a dog!” – Beetlejuice
    What are service workers?

  50. https://github.com/popeindustries

  51. Bugs included…
    ● Showing users barriers even after
    they’d signed in
    ● Replacing all pages on Falcon with
    blank error pages
    ● Users with cookies longer than
    4000 characters got permanently
    stuck on bad versions
    FT.com service worker round one

  52. ● No automated unit or integration tests in CI
    ● No way to test changes to service worker without releasing to all users
    ● No way to turn off individual features of the service worker
    ● No easy way to roll back or turn off a broken service worker
    What went wrong?

  53. How to test your service worker
    ● A surprising number of SW APIs are available in the DOM
    ● https://github.com/popeindustries – some great resources, including a mock
    SW environment
    ● Test runners, e.g. Karma, can be adapted to spin up sandboxed service
    workers
    ● Instrument your SW and use postMessage to interrogate it from tests running
    in your page
    ● Careful of CORS

  54. ● Feature flags are used to change the url a SW is installed from
    ● So we have /__sw-prod, /__sw-qa and /__sw-canary
    ● Differently tagged commits deploy to each of these destinations
    ● Testers can override flags locally to test a QA release
    ● Canary releases target 3% of users
    Test ‘environments’ for the service worker
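
One way the flag-driven install URL could look is sketched below. The three URLs are from the slide; the flag names and the hash-based 3% bucketing are assumptions for illustration, not a description of how the FT allocates canary users.

```typescript
// Hypothetical flag names; only the three URLs come from the talk.
type SwFlags = { swQaRelease?: boolean; swCanaryRelease?: boolean };

// Pick the URL the service worker should be installed from.
function serviceWorkerUrl(flags: SwFlags): string {
  if (flags.swQaRelease) return "/__sw-qa"; // testers override this flag locally
  if (flags.swCanaryRelease) return "/__sw-canary";
  return "/__sw-prod";
}

// Assumed bucketing: hash a stable user id into 0..99 and compare with the
// canary percentage, so a given user always gets the same answer.
function inCanary(userId: string, percent: number = 3): boolean {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep within 32 bits
  }
  return hash % 100 < percent;
}
```

In the page, the result would be passed to `navigator.serviceWorker.register(...)`, for example `serviceWorkerUrl({ swCanaryRelease: inCanary(userId) })`.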

  55. ● Store feature flags in IndexedDB
    ● Write helpers that check a flag is on before carrying out an action
    Feature flags in service workers
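
A rough sketch of such a helper follows. To keep it runnable outside a browser, the flag store is an interface with an in-memory stand-in; in a real service worker the `get` method would read from IndexedDB. All names here are hypothetical.

```typescript
// Where flags live. In a service worker this would be backed by IndexedDB;
// the interface is an assumption made for this sketch.
interface FlagStore {
  get(name: string): Promise<boolean>;
}

// In-memory stand-in, useful for tests.
function memoryFlagStore(flags: Record<string, boolean>): FlagStore {
  return { get: async (name: string) => flags[name] === true };
}

// Run `action` only when the named flag is on; otherwise skip it and
// resolve to undefined.
async function ifFlagOn<T>(
  store: FlagStore,
  flag: string,
  action: () => T | Promise<T>,
): Promise<T | undefined> {
  if (!(await store.get(flag))) return undefined;
  return action();
}
```

Inside a fetch handler this might wrap each optional behaviour, for example `ifFlagOn(store, "swCacheAssets", () => caches.match(request))`, so any single feature can be switched off without shipping a new worker.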

  56. ● Don’t cut corners turning your SW off!
    ● Make sure you’re able to overwrite any bad SW with a good one
    ● Put code to unregister your SW behind a flag or similar
    ● Serve your SW from an unversioned URL
    ● Once more, https://github.com/popeindustries has lots of advice
    Try switching it off and off again
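
The flag-guarded off switch could be sketched like this. The two interfaces mirror the parts of `navigator.serviceWorker` and `ServiceWorkerRegistration` that are used (`register`, `getRegistrations`, `unregister`), declared locally so the sketch runs outside a browser. The function name and the `serviceWorker` flag are assumptions.

```typescript
// Minimal shapes of the browser objects this sketch relies on.
interface RegistrationLike {
  unregister(): Promise<boolean>;
}
interface ContainerLike {
  register(url: string): Promise<unknown>;
  getRegistrations(): Promise<ReadonlyArray<RegistrationLike>>;
}

// If the flag is on, (re)register from the unversioned URL so a good worker can
// always replace a bad one. If it is off, unregister everything.
async function syncServiceWorker(
  container: ContainerLike,
  flags: { serviceWorker?: boolean },
  url: string = "/__sw-prod",
): Promise<"registered" | "unregistered"> {
  if (flags.serviceWorker) {
    await container.register(url);
    return "registered";
  }
  const registrations = await container.getRegistrations();
  await Promise.all(registrations.map((registration) => registration.unregister()));
  return "unregistered";
}
```

In a page this would be called on load as `syncServiceWorker(navigator.serviceWorker, flags)`.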

  57. ● Gradual roll out of features since spring 2017
    ● Push notifications operating for 6 months without a hitch
    ● Most front-end assets now cached in the SW
    ● Running an experiment using SW to improve ad load times
    ● 0 bugs impacting the end user
    ● Bring on the PWA!
    With a bit of extra effort, service workers are compatible with our way of working
    FT.com service worker round two

  58. So what’s the meaning of all this?

  59. ● How we choose to build complex things is really important
    ● These choices won’t necessarily always be good for performance
    ● Finding compromises between these takes thought, effort, and a healthy
    relationship with failure
    ● You can be a long way from perfect and still get great results
    PS we’re hiring www.ft.com/dev/null
    Summary
