Reviving the platform every day

Reviving the platform every day Emmanouil Kiagias & Josh Hill
Platform Recovery, Pivotal #bbr on Cloud Foundry Slack

Agenda • Disaster Recovery • First steps • Ownership and
Integration • DRATs • Shared CI Tasks • Ongoing development • What’s next?

Disaster Recovery How do we test it? Framing the problem

Disaster Recovery A plan describing how to recover my system
in case of a catastrophe! (Assuming I have taken backups!)

Disaster Recovery Can Cloud Foundry recover from a disaster? - Sure! BBR (BOSH Backup and Restore) can back up CF and help you restore it when a calamity hits your system

Disaster Recovery Good, BBR has laid the technical foundation for
implementing disaster recovery. Can I feel safe then? How do I know that my platform can actually recover? How do I know that my disaster recovery solution works?

Framing the problem DR strategy: ➔ Take a backup of
my system (record state) ➔ DISASTER ➔ Restore from backup ➔ Things should be back to normal Now how do I test that this works?

Framing the problem Test DR strategy of a single component:
➔ Set up a fresh component ➔ Create state A ➔ Take a backup ➔ DESTROY THINGS! ➔ Restore from backup ➔ Assert state A is back, safe and sound Simple test case, isn’t it?
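
The single-component test above can be sketched in Go. This is a minimal illustration, not BBR or DRATS code: `Component`, `Backup`, `Restore` and `RunDRTest` are hypothetical names, and the "component" is just an in-memory map.

```go
package main

import "fmt"

// Component is a hypothetical stand-in for one stateful service.
type Component struct {
	data map[string]string
}

// Backup returns a copy of the current state.
func (c *Component) Backup() map[string]string {
	snapshot := make(map[string]string, len(c.data))
	for k, v := range c.data {
		snapshot[k] = v
	}
	return snapshot
}

// Restore replaces the current state with a copy of the snapshot.
func (c *Component) Restore(snapshot map[string]string) {
	c.data = make(map[string]string, len(snapshot))
	for k, v := range snapshot {
		c.data[k] = v
	}
}

// RunDRTest walks through the steps and reports whether state A came back.
func RunDRTest() bool {
	c := &Component{data: map[string]string{}} // set up a fresh component
	c.data["user"] = "alice"                   // create state A
	snapshot := c.Backup()                     // take a backup
	c.data = map[string]string{}               // DESTROY THINGS!
	c.Restore(snapshot)                        // restore from backup
	return c.data["user"] == "alice"           // assert state A is back
}

func main() {
	fmt.Println("state A restored:", RunDRTest())
}
```

The hard part in a real platform is that state lives in many components at once, which is what the rest of the talk is about.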

Framing the problem (diagram: Component, State) What would this look like in a
distributed system like CF?

Framing the problem CF: UAA, CF Networking, CredHub, CAPI, ...

First steps Challenge Cross-cutting work

Challenge Platform Recovery was given the challenge to deliver this
feature across all the teams. - How to test DR on each system component? - How to test all the components as a single system? How can we coordinate this across the teams?

Challenge • Many components and teams ◦ CAPI ◦ UAA
◦ CF-Networking ◦ Credhub ◦ Platform Recovery ◦ ... • Different time zones: ◦ San Francisco (GMT-7), New York (GMT-4), London (GMT+1)...

First steps Big problem: CF ➔ Smaller problems: individual components

First steps Break down the problem into smaller pieces -
Each component will have its own test case(s) Write first sample test cases

First steps Test DR strategy of a single component: ➔
Set up a fresh component ➔ Create state A ➔ Take a backup ➔ DESTROY THINGS! ➔ Restore from backup ➔ Assert state A is back, safe and sound

First steps What do we do with these test cases
now? Implement a test suite which can run the test cases - A framework which orchestrates the lifecycle of DR tests (This is how DRATs came to life)

Ownership & Integration

Ownership We now have: - Test suite - Sample test
cases It is time to: Divide and Conquer - Distributed system? Distribute ownership Each CF team: - Owns and develops the test cases for their component - Works independently

Integration Things are now in motion DR test pieces are
being developed CF teams are test driving their component’s DR strategy But DR is not useful unless it works for the whole system!

Integration Integrate all the things together! - Gather all test
cases from the teams in a common place - Run them all against a CF deployment using our test suite

Integration Now we have a bunch of tests running as
one Who should run these? - Platform Recovery team runs them all - We own and coordinate DR of the platform - Each CF team should run them to test their components’ DR - Release Integration team runs them before cutting a new cf-deployment

DRATS Disaster Recovery Acceptance Tests

DRATS • Automates a full backup and restore of Cloud
Foundry • Provides hooks for test cases • Allows teams to choose which test cases to run

Full backup and restore drats: ➔ bbr deployment backup ➔ bbr deployment restore

Hooks drats: ➔ Create state ➔ bbr deployment backup ➔ Change state
➔ bbr deployment restore ➔ Check state ➔ Clean up state

Test cases Every test case must implement the hooks:
type TestCase interface {
    Name() string
    BeforeBackup(Config)
    AfterBackup(Config)
    AfterRestore(Config)
    Cleanup(Config)
}

Test case for UAA Before backup • create a user
• verify can log in After backup • delete the user • verify cannot log in After restore • verify can log in Cleanup • delete the user

Test cases • App • App uptime • UAA •
CredHub • Networking • Router groups • NFS broker

Test case for app uptime Before backup • push an
app • start polling the app After backup • stop polling the app • verify every request was successful After restore Cleanup • delete app

Full recovery drats: ➔ bbr deployment backup ➔ Delete CF and redeploy
➔ bbr deployment restore

Focus • Every team can choose which test cases to
run • Allows teams to iterate independently • Every team can run DRATS in their CI pipelines { "include_cf-uaa": true, "include_cf-credhub": false, ... }

Shared CI Tasks Running DRATS in many pipelines

Shared CI Tasks • All CF teams use Concourse CI
(thankfully!) • Pipelines change often • Tasks are more stable

drats-with-integration-config
inputs:
- name: disaster-recovery-acceptance-tests
- name: bbr-binary-release
- name: drats-integration-config

update-integration-config
inputs:
- name: integration-config
- name: vars-store
- name: bbl-state-store
outputs:
- name: updated-integration-config

CF team pipelines • CAPI • UAA • CredHub •
CF Networking • Persistence • Release Integration • Platform Recovery

CF team pipelines (diagram) capi, uaa, credhub, cf-networking, persistence • drats (Platform
Recovery) • cf-deployment (Release Integration)

CredHub

Release Integration

Platform Recovery

Ongoing development

Core components • DRATS maintained by Platform Recovery team •
Core CF teams submit pull requests • Support multiple versions e.g. Credhub v1 & v2

Datastore configurations • Extended support to external databases • MySQL
• MariaDB & PXC clusters • Postgres • Extending support for external blobstores • AWS S3-compatible • Azure • Google Cloud

P-DRATS • We extended DRATS to test Pivotal Application Service
• Uses the same open source test cases • Additional test cases for Pivotal components • Tests every version of PAS since 1.11

P-DRATS

What’s next?

Thank you for listening! Emmanouil Kiagias & Josh Hill Platform
Recovery, Pivotal #bbr on Cloud Foundry Slack
