Slide 1

Slide 1 text

Let's Learn to Identify Technical Requirements for Better Design Philippe Duval & Alexandre Touret Questions: sli.do / #geecon

Slide 2

Slide 2 text

69% Source: Standish Group Chaos Report 2020

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

❑ 1 million registered users ❑ 500 GB of data ❑ 50 TPS (peak) ❑ 40 million active users ❑ 40 TB secured storage ❑ 1,200 TPS Sensitive data repository Customer’s goals After 2 years…

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

You are user #1,492,573 in the waiting queue

Slide 8

Slide 8 text

Backstage… It doesn’t handle the load / Unavailable / It crashed again / The database is dying… yet again… / Latency is too high / It lags / The app doesn’t scale

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

How to be part of the 31%?
❑ Understand the technical requirements
❑ Draft the simplest design possible
❑ Adapt design costs to the business value
→ Go beyond the functional requirements!

Slide 11

Slide 11 text

Two projects for the following platforms: Available 24/7 and beyond / Available « We don’t know »

Slide 12

Slide 12 text

Who are we?
Alexandre TOURET, Software Architect, @touret_alex, blog.touret.info
Philippe DUVAL, Cloud Architect, @malkav30, malkav30.gitlab.io

Slide 13

Slide 13 text

We design payments technology that powers the growth of millions of businesses around the world.
7000+ engineers in over 40 countries
Managing 43+ billion transactions per year
€250M spent in R&D every year
Handling 150+ payment methods
#1 European payment processor

Slide 14

Slide 14 text

Definitions

Slide 15

Slide 15 text

Volumetry Response time Data Requests per second

Slide 16

Slide 16 text

SLO/SLI/SLA?
❑ Service Level Objective: INTERNAL service level objective. Example: 99% of the web pages must be loaded in less than 2 sec.
❑ Service Level Indicator: indicator for checking our SLO. Example: effective measurement of the rendering time captured from the HTTP server access logs.
❑ Service Level Agreement: contractual agreement (SLA < SLO).
[Diagram labels: provides web pages and provides APIs / The customer who uses our application]
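A minimal sketch of how the SLI could be checked against the SLO from this slide. It assumes the rendering times have already been extracted from the HTTP server access logs as seconds; the function names and the sample data are hypothetical.

```python
def sli_fraction_within(render_times_s, threshold_s):
    """SLI: fraction of page loads strictly faster than threshold_s."""
    if not render_times_s:
        raise ValueError("no measurements")
    fast = sum(1 for t in render_times_s if t < threshold_s)
    return fast / len(render_times_s)


def slo_met(render_times_s, threshold_s=2.0, target=0.99):
    """SLO from the slide: 99% of the pages loaded in less than 2 s."""
    return sli_fraction_within(render_times_s, threshold_s) >= target


# Hypothetical sample: 995 fast page loads and 5 slow ones.
sample = [0.4] * 995 + [3.1] * 5
print(sli_fraction_within(sample, 2.0))  # 0.995
print(slo_met(sample))                   # True
```

Keeping the indicator (what is measured) separate from the objective (the target it is compared to) mirrors the SLI/SLO split above.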

Slide 17

Slide 17 text

Availability
❑ Ability of a platform to be available and provide a service to users when they need it (e.g. 99%)
❑ It can be restricted to a specific time slot (e.g. from 8 AM to 8 PM)

Availability | Authorized interruption time (per 365-day year)
90% | 36 days and 12 hours
95% | 18 days and 6 hours
99% | 3 days, 15 hours and 36 minutes
99.9% | 8 hours, 45 minutes and 36 seconds
99.95% | 4 hours, 22 minutes and 48 seconds
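The interruption time allowed by an availability target can be derived rather than memorised. The sketch below assumes a 365-day year by default; `allowed_downtime` is a hypothetical helper name, and exact fractions are used to avoid floating-point rounding.

```python
from datetime import timedelta
from fractions import Fraction


def allowed_downtime(availability_pct, period=timedelta(days=365)):
    """Interruption time allowed over `period` for a target such as 99.9 (%)."""
    unavailable = 1 - Fraction(str(availability_pct)) / 100
    if not 0 <= unavailable <= 1:
        raise ValueError("availability must be between 0 and 100")
    period_us = period // timedelta(microseconds=1)  # whole microseconds
    return timedelta(microseconds=round(period_us * unavailable))


print(allowed_downtime(99))    # 3 days, 15:36:00
print(allowed_downtime(99.9))  # 8:45:36
```

When availability is restricted to a time slot, the same helper can be called with a shorter period, for example `timedelta(hours=12 * 365)` for a 12-hour daily window.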

Slide 18

Slide 18 text

RTO/RPO, DRS?
❑ Recovery Time Objective: how long does it take to restore the service after a crash? Example: recover in 1 h.
❑ Recovery Point Objective: how much data can you afford to lose? Example: orders from the last five minutes.
❑ Disaster Recovery Site: what do we do when us-east1 is down?
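As an illustration, the two objectives can be checked against the timeline of an incident. This is only a sketch: the RTO and RPO values come from the slide, while `check_recovery` and the incident timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=1)    # slide: "Recover in 1h"
RPO = timedelta(minutes=5)  # slide: "Orders from the last five minutes"


def check_recovery(last_replicated_at, failed_at, restored_at, rpo=RPO, rto=RTO):
    """Return (data_loss_window, downtime, rpo_met, rto_met) for one incident."""
    data_loss_window = failed_at - last_replicated_at  # data written but not saved
    downtime = restored_at - failed_at                 # time without service
    return data_loss_window, downtime, data_loss_window <= rpo, downtime <= rto


# Hypothetical incident: last replication 3 min before the crash, service back after 75 min.
failed = datetime(2024, 1, 1, 12, 0)
loss, down, rpo_ok, rto_ok = check_recovery(
    last_replicated_at=failed - timedelta(minutes=3),
    failed_at=failed,
    restored_at=failed + timedelta(minutes=75),
)
print(loss, rpo_ok)  # 0:03:00 True
print(down, rto_ok)  # 1:15:00 False
```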

Slide 19

Slide 19 text

Application #1: A 24/7 available (and beyond) application

Slide 20

Slide 20 text

24/7 available application, eleven nines Inc.
✓ Ordering ✓ Payment
« THIS APPLICATION MUST NEVER CRASH, 24/7, 100% OF THE TIME »

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Service | SLA
App Engine | 99.95%
Spanner | 99.99%
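When a request has to go through both services, the SLAs combine. A minimal sketch, assuming the components fail independently and that every one of them is required to serve a request; an SLA is a contractual commitment, not a measured availability, so the result is only an order of magnitude. `serial_availability` is a hypothetical helper name.

```python
from math import prod


def serial_availability(*availabilities_pct):
    """Combined availability (in %) when every component is required."""
    if not availabilities_pct:
        raise ValueError("at least one component is required")
    return prod(a / 100 for a in availabilities_pct) * 100


combined = serial_availability(99.95, 99.99)
print(f"{combined:.3f}%")  # 99.940%
```

The chain is always less available than its weakest component, which is why each extra dependency deserves a conscious decision.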

Slide 24

Slide 24 text

The actual need…

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Service | SLA
Cloud Run | 99.95%
Cloud SQL | ?%

Slide 27

Slide 27 text

Financial impacts
❑ Cloud Spanner • 100 processing units • 10 GiB storage • 50 GiB backup • EUR 72.90
❑ App Engine standard instances • 24/7, F4_1G x 4 instances • EUR 752.91
Total: EUR 825.81
❑ Cloud SQL • 24/7, db-standard-2 instance • 10.0 GiB storage • EUR 38.65
❑ Cloud Run • CPU per request, 2 CPU • Memory: 1 GiB • 5/7, 11h-22h, 5,000 orders/month • EUR 0.00
Total: EUR 38.65

Slide 28

Slide 28 text

Application #2: « We don’t know yet »

Slide 29

Slide 29 text

An online French pastry Short description of the purpose: « An e-commerce platform for French local products»

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Architecture (V1) [Diagram labels: Client, Amazon API Gateway, Lambda, Amazon DynamoDB, Lambda, Kafka]

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90

Slide 34

Slide 34 text

Architecture (V2) [Diagram labels: Client, Amazon API Gateway, EC2, Amazon DynamoDB]

Slide 35

Slide 35 text

Why these failures?

Slide 36

Slide 36 text

Why may these platforms belong to the 31%?
❑ The NFRs weren’t adapted to the business model
❑ The cost management wasn’t considered (vs scalability)
❑ The NFRs weren’t fully aligned with the business domain
OK, let’s fix them!

Slide 37

Slide 37 text

Our approach

Slide 38

Slide 38 text

Our approach
Understand the functional requirements
Technical goals
Know and get feedback from Ops
Evaluate the risks
Answer the 1-billion-dollar question: « is it worth it? »

Slide 39

Slide 39 text

Understand the functional requirements
Technical goals
Know and get feedback from Ops
Evaluate the risks
Answer the 1-billion-dollar question: « is it worth it? »

Slide 40

Slide 40 text

Some NFRs you don’t get to choose… HIPAA, HDS, KRITIS, PSD2, PCI-DSS, CERD, OIV

Slide 41

Slide 41 text

And what if our application #1 was an essential service?

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Service | SLA
GKE Multi-Region | 99.99%
Spanner Multi-Region | 99.999%

Slide 45

Slide 45 text

Business requirements

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Understand the functional requirements
Technical goals
Know and get feedback from Ops
Evaluate the risks
Answer the 1-billion-dollar question: « is it worth it? »

Slide 49

Slide 49 text

Technical goals Don’t take any technical requirement for granted!

Slide 50

Slide 50 text

Main characteristics of an architecture: Evolvability, Modularity, Cost, Performance, Simplicity, Testability, Fault tolerance

Slide 51

Slide 51 text

Evolutionary Architectures

Slide 52

Slide 52 text

Resilience

Slide 53

Slide 53 text

High Availability Auto Recovery New paradigm From « Hope it will never happen » to « recover easily »

Slide 54

Slide 54 text

What does resilience really mean ? CAP / PACELC theorem

Slide 55

Slide 55 text

Database Resiliency = two distinct problems vs

Slide 56

Slide 56 text

How to mitigate latency problems?
Timeouts: “avoid resource overload / avoid waiting forever”
Retries: “Mitigate occasional unavailability”
Exponential Backoff: “…while avoiding the domino effect”
Circuit breakers: “do not …”
Graceful degradation: “what is my MVP”
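Retries and exponential backoff work together, so a small sketch may help. It assumes the remote call signals a transient failure by raising `ConnectionError` and that the call enforces its own timeout; `retry_with_backoff`, its parameters and the `flaky` function are hypothetical. The random factor (often called jitter) keeps many clients from retrying at the same instant.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `call()` until it succeeds or the attempts run out.

    The wait before retry k is base_delay * 2**k (capped at max_delay),
    scaled by a random factor in [0, 1) to spread the retries over time.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())


# Hypothetical flaky dependency: fails twice, then answers.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"


print(retry_with_backoff(flaky))  # ok
```

Capping both the number of attempts and the delay bounds the total waiting time, which is what prevents retries from turning a short outage into a domino effect.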

Slide 57

Slide 57 text

Understand the functional requirements
Technical goals
Know and get feedback from Ops
Evaluate the risks
Answer the 1-billion-dollar question: « is it worth it? »

Slide 58

Slide 58 text

Think about Observability by Design!
It’s more a feature of the platform than just a bunch of tools...
[Diagram labels: Visualize, Alert / Metrics, Traces, Logs / Global Dashboard / Technical console for advanced logs / Prometheus ecosystem / Elastic ecosystem / CaaS specific / OpenTelemetry ecosystem / Legacy]

Slide 59

Slide 59 text

Operability: adapt your technical choices to your skills (or the other way around)

Slide 60

Slide 60 text

Understand the functional requirements
Technical goals
Know and get feedback from Ops
Evaluate the risks
Answer the 1-billion-dollar question: « is it worth it? »

Slide 61

Slide 61 text

Risk management Source: riskstorming.com

Slide 62

Slide 62 text

Risk matrix: Probability (Low 1, Medium 2, High 3) × Impact (Low 1, Medium 2, High 3). Risk: Database access issue

Slide 63

Slide 63 text

Risk matrix: Probability (Low 1, Medium 2, High 3) × Impact (Low 1, Medium 2, High 3). Risk: Database access issue
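A probability/impact matrix like this one can be turned into a ranking. The sketch below assumes the score is the product of the two levels, which is a common convention but not necessarily the one used in the talk; the ratings given to each risk and the second risk are hypothetical.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}


def risk_score(probability, impact):
    """Score a risk as probability level x impact level (1 to 9)."""
    return LEVELS[probability] * LEVELS[impact]


def rank_risks(risks):
    """risks maps a name to (probability, impact); highest score first."""
    scored = ((risk_score(p, i), name) for name, (p, i) in risks.items())
    return sorted(scored, reverse=True)


# Hypothetical ratings, for illustration only.
risks = {
    "Database access issue": ("medium", "high"),
    "Expired TLS certificate": ("low", "high"),
}
print(rank_risks(risks))
# [(6, 'Database access issue'), (3, 'Expired TLS certificate')]
```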

Slide 64

Slide 64 text

Error budget
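An error budget is the share of failures that an SLO tolerates: with a 99.9% objective, 0.1% of the requests may fail before the objective is missed. A minimal sketch, assuming the budget is counted in requests over a fixed window; the function names and the figures are hypothetical.

```python
from fractions import Fraction
from math import floor


def allowed_failures(slo_pct, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    error_rate = 1 - Fraction(str(slo_pct)) / 100  # exact, no float rounding
    return floor(total_requests * error_rate)


def budget_remaining(slo_pct, total_requests, failed_requests):
    """Failures still affordable; a negative value means the budget is spent."""
    return allowed_failures(slo_pct, total_requests) - failed_requests


print(allowed_failures(99.9, 1_000_000))       # 1000
print(budget_remaining(99.9, 1_000_000, 400))  # 600
```

A team can spend the remaining budget on risky changes and slow down when it approaches zero.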

Slide 65

Slide 65 text

Understand the functional requirements
Technical goals
Know and get feedback from Ops
Evaluate the risks
Answer the 1-billion-dollar question: « is it worth it? »

Slide 66

Slide 66 text

Is it worth it?
« Does the creation of pastries require 99.95% availability? »
« Is the mobile app mandatory for ordering a kebab? »

Slide 67

Slide 67 text

Wrap up

Slide 68

Slide 68 text

What did our approach bring?
❑ Clear and fitting technical goals
❑ A pragmatic risk management assessment
❑ Simplicity & evolvability

Slide 69

Slide 69 text

If you already operate platforms…
Onboard the Ops
Gather, pinpoint & the risks
Build a knowledge base

Slide 70

Slide 70 text

To sum up
Dig into the user needs
Pinpoint the requirements
Get down to basics!
Iterate!

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

Don’t be a stranger! Follow & get in touch
Feedback
Follow us: @malkav30, linkedin.com/in/phduval/ / @touret_alex, linkedin.com/in/atouret
Follow our tech team: blog.worldline.tech, @Worldlinetech

Slide 73

Slide 73 text

Want to shape how the world pays and gets paid? Explore our jobs in tech: careers.worldline.com