Slide 1

Slide 1 text

Putting a Red Nose on the Cloud

Slide 2

Slide 2 text

About Comic Relief o  Comic Relief is a major charity based in the UK which strives to create a just world free from poverty o  Since we first set up shop in 1985, we’ve been doing three main things: o  We raise millions of pounds through two big fundraising campaigns – Red Nose Day and Sport Relief. o  We spend that money in the best possible way to tackle the root causes of poverty and social injustice. o  We use the power of our brand to raise awareness of the issues that we care most about.

Slide 3

Slide 3 text

o  Every two years, we encourage thousands of people to do something funny for money. o  A year of planning o  6-week media campaign o  7 hours of TV on the 15th March

Slide 4

Slide 4 text

What we had o  8-year-old Java application o  Deployed and scaled with the help of 12 partners o  Took months to deploy and to run through user testing, penetration testing and authentication o  Changes were kept to an absolute minimum between years for stability and to reduce risk

Slide 5

Slide 5 text

Key Aims of New Platform o  Unlimited by technology o  Minimise PCI exposure o  Remove reliance on any single third party supplier o  Cost-effective o  All the money raised by the public is spent by Comic Relief to help poor and disadvantaged people in the UK and the world's poorest countries.

Slide 6

Slide 6 text

What we have now o  Reminder: QCon Session Code: 9221 o  Over to you Tim...

Slide 7

Slide 7 text

Thanks Zenon... This talk is a case study that intends to: o  Give you an insight into the solution we have delivered over the last 9 months o  Discuss the patterns we have applied and how we (and as a consequence, Comic Relief) have benefitted

Slide 8

Slide 8 text

Platform Requirements o  The platform is required to: o  serve a donation page for the public o  manage a lightweight call centre interface o  process in the region of 600,000 transactions in 7 hours o  handle in excess of 10,000 call centre operators o  handle a peak of 300 donations completing per second o  be out of scope for PCI
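
A quick back-of-the-envelope check of these figures may help show why peak capacity, rather than average throughput, drives the design. This is only a sketch: the numbers come from the requirements above, while the function and constant names are ours for illustration.

```python
def average_rate(transactions: int, hours: float) -> float:
    """Average transactions per second over the whole event window."""
    return transactions / (hours * 3600)


TOTAL_TRANSACTIONS = 600_000  # "in the region of 600,000 transactions"
EVENT_HOURS = 7               # length of the broadcast window
PEAK_PER_SECOND = 300         # required peak of donations completing per second

avg = average_rate(TOTAL_TRANSACTIONS, EVENT_HOURS)
peak_to_average = PEAK_PER_SECOND / avg

print(f"average: {avg:.1f}/s, peak is {peak_to_average:.1f}x the average")
# prints "average: 23.8/s, peak is 12.6x the average"
```

In other words, the platform has to be sized for a burst more than twelve times the average rate.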

Slide 9

Slide 9 text

What does that look like?

Slide 10

Slide 10 text

Donations Per Minute

Slide 11

Slide 11 text

Donations Per Second

Slide 12

Slide 12 text

Challenges o  We don't get a second chance o  It's only used once a year for 7 hours

Slide 13

Slide 13 text

Previous Issues o  Testing, Integration and deployment problems o  Lack of consistency o  Single Points of Failure o  Infrastructure provider o  Platform & Networking o  Bandwidth o  Multiple provider relationships o  1 year feedback cycle

Slide 14

Slide 14 text

Solution Patterns o  Distributed architecture o  Multiple Infrastructure as a Service (IaaS) o  Multiple Platform as a Service (PaaS) o  Stateless pattern o  Eventually consistent data o  Minimum Time to Recovery

Slide 15

Slide 15 text

Solution Patterns Stateless/Eventual Consistency o  No High Availability datastore o  Message Queue architecture o  Enables a distributed architecture
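
The stateless, queue-based pattern on this slide could be sketched roughly as follows. This is not the platform's code: the in-process `queue.Queue` stands in for a real message broker, and `handle_donation`, `drain` and the message fields are hypothetical names chosen for the example.

```python
import json
import queue

# Stand-in for a real message broker (assumption for this sketch).
donation_queue = queue.Queue()


def handle_donation(request: dict) -> dict:
    """Stateless handler: everything it needs arrives in the request.

    It only enqueues a message; no datastore is touched on the request path.
    """
    message = json.dumps({"donor": request["donor"], "pence": request["pence"]})
    donation_queue.put(message)
    return {"status": "accepted"}


def drain(totals: dict) -> None:
    """Worker: applies queued messages later, so totals are eventually consistent."""
    while True:
        try:
            message = donation_queue.get_nowait()
        except queue.Empty:
            return
        donation = json.loads(message)
        donor = donation["donor"]
        totals[donor] = totals.get(donor, 0) + donation["pence"]


totals = {}
handle_donation({"donor": "alice", "pence": 500})
handle_donation({"donor": "bob", "pence": 1000})
handle_donation({"donor": "alice", "pence": 250})

pending = donation_queue.qsize()  # messages accepted but not yet applied
drain(totals)                     # totals catch up once the worker runs
```

Because the handler keeps no state between requests, any instance on any shard can serve any request, which is what makes the distributed layout possible.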

Slide 16

Slide 16 text

Solution Patterns PaaS & IaaS o  PaaS o  Homogenised platform o  Enables multi-IaaS o  Multi-IaaS o  Cost benefits for Comic Relief o  Prevents vendor lock-in for Comic Relief o  Enabled rapid rollout of supporting applications

Slide 17

Slide 17 text

Solution Patterns Minimum time to recovery o  History o  Build for failure o  Reduce time to recovery

Slide 18

Slide 18 text

Commoditise Dependencies o  Dependency on 3rd parties o  Usage commoditised o  IaaS o  We can easily deploy across multiple service providers o  Info provided by OpenCloudBrokers o  Payment Service Providers o  We load balance across multiple providers, which keeps our service continuous and able to cope with projected loads.
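
One way to picture load balancing across interchangeable payment providers is a rotating selection with failover. The sketch below is an assumption about the general shape only, not the platform's implementation; the provider names, `make_balancer`, `charge` and `ProviderUnavailable` are all hypothetical.

```python
import itertools


class ProviderUnavailable(Exception):
    """Raised by a provider stub when it cannot take the payment."""


def make_balancer(providers):
    """Return a charge function that rotates its starting provider per call."""
    start = itertools.cycle(range(len(providers)))

    def charge(amount_pence):
        first = next(start)
        # Try each provider once, beginning at the rotating start position.
        for offset in range(len(providers)):
            name, send = providers[(first + offset) % len(providers)]
            try:
                send(amount_pence)
                return name
            except ProviderUnavailable:
                continue
        raise ProviderUnavailable("all providers failed")

    return charge


def ok(amount_pence):
    return None


def down(amount_pence):
    raise ProviderUnavailable("timeout")


# psp_b is simulated as failing; its traffic falls through to psp_c.
charge = make_balancer([("psp_a", ok), ("psp_b", down), ("psp_c", ok)])
used = [charge(500) for _ in range(6)]
```

Treating every provider as the same kind of callable is the "commoditised" part: a provider can be added or dropped by editing the list, without touching the calling code.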

Slide 19

Slide 19 text

What does the platform look like? [Architecture diagram. Labels recoverable from the diagram: Internet; DNS; PaaS 1 - AWS US, PaaS 1 - AWS EU and PaaS 1 - Cz, each Cloud Foundry (BOSH); Presentation Layer; Service Layer; Workers; Insight Layer (View, API, Insight); MGMT; Shared Services (Logging, Metrics, Alerting).]

Slide 20

Slide 20 text

Pipelines Continuous Deployment to Production o  2 pipelines integrated o  Infrastructure o  Applications o  Converging on multiple test platforms o  Development team managing services

Slide 21

Slide 21 text

Pipeline - Infrastructure [Diagram: local changes to deployed platform]

Slide 22

Slide 22 text

Pipeline - Applications [Diagram: local changes to deployed platform]

Slide 23

Slide 23 text

Continuous Integration Testing The value in our pipeline comes from the testing that gives us confidence in the consistency of our solution o  RSpec - unit tests o  Cucumber - feature/integration tests o  ZAProxy - security tests o  Grinder - benchmarking load tests

Slide 24

Slide 24 text

Other Testing Load Testing o  In addition to small scale load testing as part of our CI deployments o  Grinder, using chef to deploy o  20 minutes lead time, up to 120 nodes used, 60,000 concurrent users (zero wait times) o  Global capability

Slide 25

Slide 25 text

Failure Tolerant o  DNS round robin across multiple shards o  Scripted DNS enabling a measure of load balancing o  "Failure wagons" that stand in if a shard fails and hand off to alternate shards
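
As a rough illustration of round-robin DNS with a stand-in for a failed shard, consider the sketch below. It assumes a failure wagon is simply an alternative address published in place of the failed shard's address; the shard names, the documentation-range IP addresses and both function names are hypothetical.

```python
from collections import deque


def build_answers(shards, healthy):
    """Use each shard's own address, or its stand-in address if the shard is down."""
    return [s["address"] if s["name"] in healthy else s["wagon"] for s in shards]


def round_robin(answers, queries):
    """Rotate the record list by one per query, as round-robin DNS commonly does."""
    ring = deque(answers)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)


shards = [
    {"name": "us", "address": "192.0.2.10", "wagon": "198.51.100.10"},
    {"name": "eu", "address": "192.0.2.20", "wagon": "198.51.100.20"},
    {"name": "cz", "address": "192.0.2.30", "wagon": "198.51.100.30"},
]

# Simulate the "eu" shard failing: its stand-in address is published instead.
answers = build_answers(shards, healthy={"us", "cz"})
responses = list(round_robin(answers, 3))
```

Clients that pick the first record in each response are spread across the shards, and none of them is sent to the failed shard's own address.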

Slide 26

Slide 26 text

Failure Tolerant o  Minimum time to recovery vs high availability (HA) o  Eventual consistency o  Stateless requests o  Message queue architecture o  Expecting failure
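
The trade-off between recovery time and high availability can be made concrete with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The figures below are invented purely for illustration and are not measurements from the platform.

```python
def availability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Steady-state availability: uptime as a fraction of total time."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


# Hypothetical numbers: a component that fails every 2 hours on average.
MTBF = 120.0

slow_recovery = availability(MTBF, mttr_minutes=10.0)  # manual recovery
fast_recovery = availability(MTBF, mttr_minutes=1.0)   # scripted recovery

print(f"10 min recovery: {slow_recovery:.3f}, 1 min recovery: {fast_recovery:.3f}")
```

With the failure rate held constant, shortening recovery alone raises availability, which is the reasoning behind expecting failure and optimising for recovery rather than trying to prevent every fault.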

Slide 27

Slide 27 text

Solution Challenges o  Reliance on inflexible third-party providers o  By using multiple payment providers, we are able to ensure that we have the redundancy we need. o  Managing and automating complexity

Slide 28

Slide 28 text

Flexibility - Load testing Performance confidence - results (TPS) [Chart: transactions per second (axis 0 to 500) across Monday 10/9 and Tuesday 11/9; labelled value: 501. Annotated changes, in order: Redis config; added DEAs; added DEAs; increased load test threads; moved load test platform to EU; added HA proxy & 3 Nginx nodes; increased load test threads; 8 Nginx nodes.]

Slide 29

Slide 29 text

Flexibility - Supporting Platforms Whilst building the main platform, we have also built a range of supporting platforms, including: o  Payment provider mocks (>= 500 Donations/sec) o  An email service mock o  A data api mock o  Globally-distributed load test platform (zero to hero in 20 minutes)

Slide 30

Slide 30 text

Flexibility - Payment Service Providers o  We have performed implementations with 11 different payment providers/interfaces (several of which are not being used). o  These 3rd party integrations are key to the delivery of our service, so this work enabled us to really understand how they worked and what performance issues we might encounter.

Slide 31

Slide 31 text

The part that's missing! o  no actual data/results o  please watch this space o  only 9 days to go o  The last 9 months have been tough but fun o  The pipelines, once created, have been the driving force of this project o  3rd party service commoditisation has allowed Comic Relief to stay in control of the risk o  Thank you

Slide 32

Slide 32 text

In Conclusion o  QCon is two weeks too soon o  By using the cloud we have put ourselves in a strong position o  New Platform will only be proven on 15th March o  Don't forget to use the engage feature on the QCon app to rate the talk and ask questions

Slide 33

Slide 33 text

z.hannick@comicrelief.com @zenonhannick tim.savage@armakuni.com @timjsavage