Slide 1

Slide 1 text

A field guide to the Financial Times
Rhys Evans
Principal Engineer, Financial Times
@wheresrhys

Slide 2

Slide 2 text

Who I am
● Worked in tech 10+ years
● Gradually moved into tooling
● Co-lead the FT’s Reliability Engineering team
● Lifelong birdwatcher

Slide 3

Slide 3 text

What is a field guide?
From Wikipedia: A book designed to help the reader identify wildlife (plants or animals) or other objects of natural occurrence (e.g. minerals).

Slide 4

Slide 4 text

● Why the FT needs a field guide
● Organising our guide with neo4j and GraphQL
● Filling in the details

Slide 5

Slide 5 text

Why the FT needs a field guide

Slide 6

Slide 6 text

Insert non dramatic screenshot

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

“A tool dating from before the trees that built the ark. Unowned, unknown, and worth £250k of business. One day it fell over. We found docs dated 1999... which helped”
Greg Cope, Tech Director, FT

Slide 13

Slide 13 text

Starting about 5 years ago, the range of tech we have to support exploded

Slide 14

Slide 14 text

Previously
● Centralised decision making
● Monolithic architectures
● Data centres
● Infrequent releases

Slide 15

Slide 15 text

Move slow and achieve little

Slide 16

Slide 16 text

Microservices
FT were early adopters of microservices architecture
Lots of independently deployed services make it easier to:
● Pick the right tool for the job
● Release and iterate
● Replace and decommission

Slide 17

Slide 17 text

Liberalisation
“[...] follow the mechanics of free-market economy. Teams are allowed and encouraged to pick the best value tools for the job at hand”
Matt Chadburn http://matt.chadburn.co.uk/notes/teams-as-services.html

Slide 18

Slide 18 text

OUT | IN
Data Centre | Your favourite cloud
‘The FT Platform’ | Pick your own SaaS
Java, Java, Java | I hear Rust’s good...
Ivory tower | What works

Slide 19

Slide 19 text

“The upside of this is teams, left to their own devices, and trusted to make responsible decisions will choose what is best for themselves and the business in the long-term.”
Matt Chadburn http://matt.chadburn.co.uk/notes/teams-as-services.html

Slide 20

Slide 20 text

Build stuff and disappear

Slide 21

Slide 21 text

Legacy is sooner than you think
● All images appearing on our websites relied on 1 person... who left
● A vanity url service built by a feature team that disbanded shortly after
● Part of our membership platform built in a niche language
● And many, many more

Slide 22

Slide 22 text

5 years is a long time in tech
Long enough for:
● Shiny new things to become legacy
● Budgets and business priorities to move on
● People to leave

Slide 23

Slide 23 text

In summary
● Have to keep lots of tech ticking over
● Generating more new stuff than ever before to keep track of
● Liberalising the tech department leads to ownership & maintenance problems
Need a field guide to help us navigate the space

Slide 24

Slide 24 text

Unowned & unknown

Slide 25

Slide 25 text

Owned & known

Slide 26

Slide 26 text

Organising our guide with neo4j and GraphQL

Slide 27

Slide 27 text

3 priorities to improve reliability
● Reaffirm who owns the various bits of FT tech
● Improve information about what is actually running and why
● Determine what state it’s in at any given time

Slide 28

Slide 28 text

Who is our audience?
Operations team
● Active 24/7
● Broad knowledge of our tech platforms
● Need to know which approaches can be applied to incident X
● If nothing works, who to call

Slide 29

Slide 29 text

Not the first attempt
CMDB (configuration management database) versions 1 - 3 were:
● Too inert - Enter once and forget about it
● Too brittle - Chains of responsibility easily lost
● Too discrete - Hard to make important connections

Slide 30

Slide 30 text

Who can help me with system X?
● The natural question to ask when addressing a problem
● Links between people and things dotted all over our previous CMDBs
● Intuitive but brittle

Slide 31

Slide 31 text

Relational databases constrain
● Hard to connect data, so we get overly simplified models of reality
● Several degrees of separation are modelled as a systemOwner field
● Simple, but inaccurate and hard to maintain

Slide 32

Slide 32 text

Graph databases liberate
● Designed to model complex relationships
● No need to simplify and abstract away details that actually matter
● If person X is a stakeholder via 4 degrees of separation, represent them as such

Slide 33

Slide 33 text

A graph restatement of the problem
‘How can I ensure systems are assigned to the right people’
→ ‘How can I ensure systems are connected somehow to the right people’
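
To make the "connected somehow" idea concrete, here is a minimal sketch in plain Python rather than neo4j. It walks a small in-memory graph breadth-first and reports every person within a few degrees of separation of a system. The record codes, relationship names and the `connected_people` helper are all invented for illustration and are not taken from Biz Ops.

```python
from collections import deque

# Hypothetical records and relationships, invented for illustration.
# Each edge is (from_node, relationship, to_node); nodes are "Type:code".
EDGES = [
    ("System:image-service", "DELIVERED_BY", "Team:content-platform"),
    ("Team:content-platform", "HAS_TECH_LEAD", "Person:alice"),
    ("Team:content-platform", "PART_OF", "Group:publishing"),
    ("Group:publishing", "HAS_TECH_DIRECTOR", "Person:bob"),
]


def connected_people(start, edges, max_hops=4):
    """Breadth-first search returning {person: hops} within max_hops of start."""
    neighbours = {}
    for source, _rel, target in edges:
        # Treat relationships as undirected when walking the graph.
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not walk further than the allowed distance
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return {n: d for n, d in hops.items() if n.startswith("Person:")}


people = connected_people("System:image-service", EDGES)
```

Nobody in this toy graph is a direct "owner" of the system, yet both people are reachable from it, which is the shift the restated question is asking for.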

Slide 34

Slide 34 text

[Diagram: System and eight unknowns (?)]

Slide 35

Slide 35 text

Model the stable stuff first

Slide 36

Slide 36 text

When systems are created we:
● Pick a unique, human readable code
● Kill infrastructure not tagged with it
● In our graph, the System record must be connected to a Team
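
The three rules above could be checked mechanically. The sketch below is one possible shape for such checks, under stated assumptions: the code format, the `systemCode` tag name and both function names are hypothetical, and the slides do not say how the FT actually enforces these rules.

```python
import re

# Hypothetical format for a system code: lowercase words joined by hyphens.
CODE_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def validate_new_system(code, team, existing_codes):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if not CODE_PATTERN.match(code):
        problems.append("code must be lowercase words joined by hyphens")
    if code in existing_codes:
        problems.append("code is already in use")
    if not team:
        problems.append("system must be connected to a team")
    return problems


def untagged_resources(resources, known_codes, tag_name="systemCode"):
    """Return ids of resources whose tag is missing or names no known system."""
    return [
        r["id"]
        for r in resources
        if r.get("tags", {}).get(tag_name) not in known_codes
    ]


known = {"image-service"}
resources = [
    {"id": "vm-1", "tags": {"systemCode": "image-service"}},
    {"id": "vm-2", "tags": {}},
    {"id": "vm-3", "tags": {"systemCode": "mystery"}},
]
candidates_for_removal = untagged_resources(resources, known)
```

Returning a list of problems rather than raising on the first one lets a form or API response show everything that needs fixing in one go.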

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Our tech is connected to:
● Stable, manageable subdivisions of the organisation
● Tech director who is ultimately responsible
On top of this stable foundation we can add the more ephemeral things

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

BIZ-OPS MAN

Slide 41

Slide 41 text

Data warehouse free
● Self-service
● No such thing as a power user
● Extensible
● API first, but UI a close second

Slide 42

Slide 42 text

Some poor API options
REST API
● OK when fetching a single record type
● Painful to traverse
‘Canned query’ endpoints
● Less generic
● Limited by our imagination

Slide 43

Slide 43 text

GraphQL to the rescue
“GraphQL is a query language for APIs [...] gives clients the power to ask for exactly what they need [...] not just the properties of one resource but also smoothly follows references between them”
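
The difference between traversing a REST API and following references in one query can be sketched with a toy in-memory store. This is not GraphQL itself and not the Biz Ops API: the records, field names and both stand-in functions are invented for illustration. It only counts how many requests each style needs to get from a system to its team's tech lead.

```python
# Hypothetical records; the types and field names are invented for illustration.
RECORDS = {
    ("System", "image-service"): {
        "code": "image-service",
        "deliveredBy": ("Team", "content-platform"),
    },
    ("Team", "content-platform"): {
        "code": "content-platform",
        "techLead": ("Person", "alice"),
    },
    ("Person", "alice"): {"code": "alice", "name": "Alice"},
}


def rest_get(record_type, code, counter):
    """Stand-in for GET /{type}/{code}: one call returns one flat record."""
    counter["calls"] += 1
    return RECORDS[(record_type, code)]


def graph_query(record_type, code, selection, counter=None):
    """Stand-in for one GraphQL-style request that follows nested selections."""
    if counter is not None:
        counter["calls"] += 1  # only the top-level request is a network call
    record = RECORDS[(record_type, code)]
    result = {}
    for field, sub_selection in selection.items():
        value = record[field]
        if sub_selection is None:
            result[field] = value  # plain property
        else:
            result[field] = graph_query(*value, sub_selection)  # follow reference
    return result


# REST style: one round trip per hop.
rest = {"calls": 0}
system = rest_get("System", "image-service", rest)
team = rest_get(*system["deliveredBy"], rest)
lead = rest_get(*team["techLead"], rest)

# Query style: the client describes the whole shape it wants up front.
gql = {"calls": 0}
selection = {"code": None, "deliveredBy": {"code": None, "techLead": {"name": None}}}
answer = graph_query("System", "image-service", selection, gql)
```

The nested `selection` dictionary plays the role that a GraphQL query document would play in a real client.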

Slide 44

Slide 44 text

neo4j-graphql-js
● GraphQL normally talks to multiple APIs and combines the results
● neo4j-graphql-js converts GraphQL queries to Cypher, and talks to neo4j directly

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

GraphQL big wins
● User friendly: Single, grokable query to get unlimited connected info
● Future proof: Mirrors the neo4j graph as its complexity grows
● More efficient: Fewer API calls and fewer and faster DB calls

Slide 47

Slide 47 text

Pitfalls of GraphQL
● Hungry users: Allows unwitting construction of very expensive queries
● Caching: Not obvious what caching behaviour to implement
● To write or not to write: Not persuaded to move away from REST yet
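
One common guard against the "hungry users" pitfall is to reject queries that nest too deeply before running them. The slides do not say whether Biz Ops does this, so the following is only a sketch of the general technique. It assumes a query has already been parsed into nested dictionaries, and `selection_depth`, `check_query` and the default limit are hypothetical.

```python
def selection_depth(selection):
    """Depth of a nested selection; a query with only leaf fields has depth 1."""
    if not selection:
        return 0
    return 1 + max(
        selection_depth(sub) if isinstance(sub, dict) else 0
        for sub in selection.values()
    )


def check_query(selection, max_depth=5):
    """Raise ValueError when a query nests deeper than max_depth."""
    depth = selection_depth(selection)
    if depth > max_depth:
        raise ValueError(f"query depth {depth} exceeds limit of {max_depth}")
    return depth


# A hypothetical query: system -> team -> tech lead.
query = {"code": None, "deliveredBy": {"code": None, "techLead": {"name": None}}}
depth = check_query(query)

try:
    check_query(query, max_depth=2)
    rejected = False
except ValueError:
    rejected = True
```

Depth is a crude proxy for cost. It says nothing about how many records each level fans out to, so a real limit would probably need to account for that as well.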

Slide 48

Slide 48 text

An extensible UI

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

#GRANDstack
GraphQL + React + Apollo + Neo4j Database
https://grandstack.io/

Slide 52

Slide 52 text

In summary
● Some confidence that Biz Ops won’t degrade into a data graveyard
● Unlimited access to data for any person or machine
But is the data actually any good?

Slide 53

Slide 53 text

Filling in the details

Slide 54

Slide 54 text

Not the first attempt
CMDB versions 1 - 3 were:
● Too inert - Enter once and forget about it
● Too brittle - Chains of responsibility easily lost
● Too discrete - Hard to make important connections

Slide 55

Slide 55 text

Don’t rely on good behaviour
● Automate
● More carrot, less stick
● Gamify
● UX

Slide 56

Slide 56 text

Automate
● Machines don’t forget to update information
● Restrict write access for certain records/types to privileged clients
  ○ people-api → Writes details of FT staff
  ○ github-importer → Writes details of repositories
  ○ …
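
Restricting writes by record type can be as simple as an allowlist lookup. The sketch below assumes one: the client names `people-api` and `github-importer` come from the slide, while the record type names, the `LOCKED_TYPES` table and the `can_write` function are hypothetical.

```python
# Hypothetical mapping of record types to the only clients allowed to write them.
# "people-api" and "github-importer" come from the slide; the type names are assumed.
LOCKED_TYPES = {
    "Person": {"people-api"},
    "Repository": {"github-importer"},
}


def can_write(client_id, record_type):
    """Locked types accept writes only from their privileged clients."""
    allowed = LOCKED_TYPES.get(record_type)
    # Types that are not listed stay open to every client.
    return allowed is None or client_id in allowed


importer_ok = can_write("github-importer", "Repository")
human_blocked = not can_write("some-engineer", "Person")
```

Keeping unlisted types open by default means the restriction only applies where an automated source of truth already exists.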

Slide 57

Slide 57 text

More carrot, less stick

Slide 58

Slide 58 text

Gamify
Teams respond well to seeing how they compare, and how they can improve
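
One way to show teams how they compare is a completeness score per record, averaged per team. The slides do not describe the actual scoring, so this is only a sketch: the list of expected fields, the percentage formula and both function names are assumptions.

```python
# Hypothetical fields that a complete System record is expected to fill in.
EXPECTED_FIELDS = ("description", "primaryURL", "supportContact", "runbook")


def completeness(record):
    """Percentage of expected fields that are present and non-empty."""
    filled = sum(1 for field in EXPECTED_FIELDS if record.get(field))
    return round(100 * filled / len(EXPECTED_FIELDS))


def leaderboard(systems):
    """Average completeness per team, best first."""
    scores = {}
    for system in systems:
        scores.setdefault(system["team"], []).append(completeness(system))
    averages = {team: sum(vals) / len(vals) for team, vals in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


systems = [
    {"team": "a", "description": "Serves images", "runbook": "Restart it"},
    {"team": "a"},
    {
        "team": "b",
        "description": "Vanity URLs",
        "primaryURL": "https://example.com",
        "supportContact": "b-team",
        "runbook": "See docs",
    },
]
ranking = leaderboard(systems)
```

A per-record score also tells a team which fields to fill in next, which covers the "how they can improve" half of the slide.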

Slide 59

Slide 59 text

UX

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

Not just visual design
● Understand your users
● Uncover sources of friction
● Learn about their existing/ideal workflow
● Don’t expect them to come to you
● “Good design is invisible”

Slide 62

Slide 62 text

Example: runbook authorship
● System source code changes in GitHub
● But runbook authorship in Biz Ops
● Bound to get out of step
● What if they happened concurrently?

Slide 63

Slide 63 text

Example: runbook authorship
● Runbooks written in RUNBOOK.md with front matter metadata
● Content pulled into Biz Ops when a production code release is detected
● GitHub PR integrations to follow
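
A RUNBOOK.md with front matter is a metadata block between two `---` lines followed by the Markdown body. The sketch below splits the two apart using only the standard library. It assumes flat `key: value` metadata, and the keys `systemCode` and `serviceTier` are made-up examples, since the slides do not list the real fields or say how Biz Ops parses them.

```python
def split_front_matter(text):
    """Split a document into (metadata dict, body).

    Handles only flat `key: value` lines between two `---` delimiters;
    real front matter is usually YAML and would need a proper parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter at all
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text  # no closing delimiter: treat everything as body
    metadata = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


# Hypothetical runbook; the metadata keys are invented for illustration.
RUNBOOK = """---
systemCode: image-service
serviceTier: bronze
---
# Image service

Restart the app.
"""
metadata, body = split_front_matter(RUNBOOK)
```

Keeping the metadata next to the code means one pull request can change the system and its runbook together, which is the point of the workflow above.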

Slide 64

Slide 64 text

Beyond operational info
● Underpinning how we handle GDPR requests
● Quicker triaging of security incidents
● Integrating with leavers process
More benefits → more incentives to improve data

Slide 65

Slide 65 text

What have we learned today?

Slide 66

Slide 66 text

Legacy code comes to us all

Slide 67

Slide 67 text

Documented legacy is good legacy

Slide 68

Slide 68 text

Graphs enable more powerful modelling

Slide 69

Slide 69 text

Using #GRANDstack is like being the film version of Mark Zuckerberg

Slide 70

Slide 70 text

Your data won’t update itself

Slide 71

Slide 71 text

UX and other feedback loops can keep it fresh

Slide 72

Slide 72 text

Thank you
The team: Geoff Thorpe, Laura Carvajal, Charlie Briggs, Katie Koschland, Simon Legg, Maggie Allen, Courtney Osborn, Kat Downes, Sentayhu Mekoonnali, David Balfour
Images from: https://www.audubon.org/birds-of-america/
@wheresrhys www.ft.com/dev/null