Writing & Maintaining an Open Source Backup Tool in Python on AWS Lambda

Martin Smith
Rackspace, Inc.
[email protected]

How 12 Factor, Serverless, APIs for everything, and NoOps all
break down at some point...

12 Factor
1. Version control
2. Dependencies
3. Config by environment
4. Backing services
5. Build, release, run
6. Stateless Processes
7. Port binding for services
8. Concurrency via process model (ASG)
9. Fast startup, graceful shutdown
10. Keep environment parity
11. Treat logs as event streams
12. Run admin/mgmt tasks one-off

Serverless
• Are there really no servers?
• Functions-as-a-service vs. Backend-as-a-service
• Advantages: Cost, Programming model
• Disadvantages: Performance, resource limits, monitoring & debugging

APIs for everything
• All teams will henceforth expose their data and functionality through service interfaces.
• Teams must communicate with each other through these interfaces.
• There will be no other form of inter-process communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
• It doesn’t matter what technology they use.
• All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
• Anyone who doesn’t do this will be fired.
• Thank you; have a nice day!

NoOps
Forrester coined the term NoOps, which they define as "the goal of completely automating the deployment, monitoring and management of applications and the infrastructure on which they run." According to Forrester Senior Analyst Glenn O’Donnell, who co-authored the report "Augment DevOps with NoOps," it is more likely that although some operations positions will become unnecessary, others will simply evolve from a technical orientation toward a more business-oriented focus.

Headlines about NoOps (No & Ops meaning…)
• Why NoOps is a DevOps disaster waiting to happen
• There is no such thing as NoOps: it is an awful word
• The fallacy of NoOps
• Netflix is Not Doing "NoOps"
• Is NoOps the End of DevOps? Think Again
• Yet, another post about Ops, DevOps, NoOps!

Elastic Compute Cloud (EC2)
• Virtual Computing Environment
• Images, “Hardware” Types
• Persistent or Ephemeral Storage
• Multiple Physical Locations
• IP & Port based firewall
• Metadata service
• Software-defined Networking

Elastic Block Store (EBS) provides storage for EC2 instances in the AWS cloud.
• SSD/Magnetic Storage
• 99.999% availability
• Encryption (rest, transit), ACLs
• Snapshots
• Elastic provisioning/tuning

Requirements
• Snapshotting all EBS volumes on your account at regular intervals
• Ability to select volumes for snapshot by entire ASG, EC2 tags, or instance names
• EBS volume snapshotting of select volumes, based on configuration settings or defaults
• Flexible scheduling of snapshots per instance, based on configuration settings or defaults
• Configurable snapshot retention periods of a select instance's volumes, based on configuration settings or defaults
• Ability to retain a minimum number of snapshots regardless of retention period
• All tags from a volume should be transferred to snapshots
• Rackspace ticket notification and response should an EBS snapshot failure occur
• All volumes of an instance will currently have the same setting. This restriction could be loosened later.
• Workflow of: Shut down, snapshot, and start up EC2 instance

Out of Scope
• File level backups: currently a customer responsibility; not provided by Rackspace.
• Inconsistent snapshots: customers must work with Rackspace to ensure consistent data is written to disk, e.g. local file-level backups of a database server, so that EBS snapshots are consistent and usable.
• Snapshot replication: This tool will not replicate snapshots between regions at this time.
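The retention requirements above (a retention period plus a minimum number of snapshots that always survive) can be sketched as a small pure function. This is a minimal illustration, not the tool's actual code: the function name `snapshots_to_delete` and its parameters are hypothetical, and snapshots are represented only by their creation times.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshot_times, retention, min_keep, now):
    """Return creation times of snapshots that are expired and safe to delete.

    The newest `min_keep` snapshots are always kept, even if they are
    older than the retention period (hypothetical policy sketch).
    """
    newest_first = sorted(snapshot_times, reverse=True)
    # Everything after the newest `min_keep` entries is a deletion candidate.
    candidates = newest_first[min_keep:]
    cutoff = now - retention
    return [t for t in candidates if t < cutoff]


now = datetime(2017, 1, 31, tzinfo=timezone.utc)
times = [now - timedelta(days=d) for d in (1, 10, 20, 30)]
expired = snapshots_to_delete(times, timedelta(days=7), min_keep=2, now=now)
```

With a 7-day retention and `min_keep=2`, the 10-day-old snapshot survives even though it has expired, because it is one of the two newest.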

Aside: Lambda restrictions
• 5 minute runtime
• API rate limit
• Memory sizes

Original algorithm
- Look at each instance
- Look at each volume
- Decide if a snapshot should be taken
- Find most recent snapshot
- Look at each snapshot
- Try to track it back to a volume, instance, configuration
- Determine how many other Snapshots exist
- Clean up or skip

Aside: Lambda restrictions
• No parallelism options
• Can’t control retry
• Stateless chunk’d hard

V2 Algorithm
- Look at each instance
- Look at each volume
- Chunk into groups of 5 volumes at a time
- Parallel operations for each chunk (workers = 5)
- Decide if a snapshot should be taken
- Find most recent snapshot
- Group snapshots into chunks of 5 (workers = 5)
- Try to track it back to a volume, instance, configuration
- Determine how many other Snapshots exist
- Clean up or skip
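The chunking step above (groups of 5, 5 workers) might look roughly like the sketch below. The slides do not say which parallelism mechanism was used; a thread pool from the standard library is an assumption here, and `process_in_chunks` and `handle_volume` are hypothetical names standing in for whatever per-volume work the tool does.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(volume_ids, handle_volume, chunk_size=5, workers=5):
    """Apply `handle_volume` to every volume id, one chunk at a time.

    `handle_volume` is a placeholder for the per-volume decision
    (for example "should a snapshot be taken?").
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunked(volume_ids, chunk_size):
            # map() preserves input order; list() waits for the whole chunk
            # to finish before the next chunk starts.
            outcomes = list(pool.map(handle_volume, chunk))
            results.update(zip(chunk, outcomes))
    return results
```

Finishing one chunk before starting the next keeps the number of in-flight API calls bounded, which matters given the API rate limit mentioned earlier.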

Aside: Lambda restrictions
• Orphaned snapshots
• /dev/shm missing
• Still running out of time

V2 Algorithm
- Collect all snapshots
- Collect all volumes
- Collect all running instances
- Build lookup table that maps
  - snapshot to volume to instance
  - volume to snapshot count
  - volume to last snapshot taken
  - snapshot to configuration
- 1: Review volumes, decide if snapshot should be taken based on recent
- 2: Review snapshots, decide to cleanup if expired and plenty remain
- On any run, keep checking execution time and exit gracefully
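A rough sketch of the lookup-table idea and the execution-time check follows. It is an illustration under assumptions, not the project's code: snapshots and volumes are plain dicts with hypothetical keys, and `build_lookups` and `run_with_budget` are made-up names. Inside a Lambda handler, `remaining_ms` could be the context object's `get_remaining_time_in_millis` method.

```python
from collections import defaultdict


def build_lookups(snapshots, volumes):
    """Index snapshots and volumes once so later steps avoid repeated scans.

    snapshots: list of {"id", "volume_id", "start_time"} dicts (assumed shape)
    volumes:   list of {"id", "instance_id"} dicts (assumed shape)
    """
    volume_to_instance = {v["id"]: v["instance_id"] for v in volumes}
    snapshot_to_volume = {}
    snapshot_count = defaultdict(int)
    last_snapshot = {}
    for snap in snapshots:
        vol = snap["volume_id"]
        snapshot_to_volume[snap["id"]] = vol
        snapshot_count[vol] += 1
        if vol not in last_snapshot or snap["start_time"] > last_snapshot[vol]:
            last_snapshot[vol] = snap["start_time"]
    return {
        "snapshot_to_volume": snapshot_to_volume,
        "volume_to_instance": volume_to_instance,
        "snapshot_count": dict(snapshot_count),
        "last_snapshot": last_snapshot,
    }


def run_with_budget(tasks, remaining_ms, reserve_ms=10000):
    """Run callables in order; stop early when little execution time is left.

    Returns the number of tasks that completed.
    """
    done = 0
    for task in tasks:
        if remaining_ms() < reserve_ms:
            break  # exit gracefully; a later run can pick up the rest
        task()
        done += 1
    return done
```

Because the function is stateless between runs, stopping early is safe only if each task is idempotent, so that the next scheduled run can redo or continue the work.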

Now how do I deploy this thing?
Releases go to an S3 bucket with Semantic Versioning.
Lambda has version numbers, but (1...2...3…)
Lambda has "CodeSHA256": "OjRFuuHKizEE8tHFIMsI+iHR6BPAfJ5S0rW31Mh6jKg=",
• Crawl every account, compare hashes.
• Only deploy where needed.
• Bonus S3 integration.
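The hash comparison can be sketched as follows. Lambda reports the code hash as the base64-encoded SHA-256 digest of the deployment package, so the same value can be computed locally from the release archive. The function names and the `deployed_hashes` mapping are hypothetical; in practice the deployed value would come from the Lambda API for each account.

```python
import base64
import hashlib


def code_sha256(package_bytes):
    """Base64-encoded SHA-256 digest, the format Lambda uses for its code hash."""
    digest = hashlib.sha256(package_bytes).digest()
    return base64.b64encode(digest).decode("ascii")


def accounts_needing_deploy(release_bytes, deployed_hashes):
    """Return account ids whose deployed hash differs from the release.

    deployed_hashes: {account_id: hash string, or None if not installed}
    (an assumed input shape for this sketch).
    """
    wanted = code_sha256(release_bytes)
    return sorted(
        account for account, have in deployed_hashes.items() if have != wanted
    )
```

Comparing hashes rather than version numbers means an account is redeployed only when its code actually differs from the release.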

How much does this cost?
For a handful of servers, daily backups:
- $2/mo for Dynamo
- $0.05/mo for Lambda
Usage-based cost is hard to track down; it depends on:
- Frequency of Snapshot
- Snapshot Retention
- Instance count
- Regional pricing

When would I kill it?
- This is a stopgap tool. If AWS provides scheduled EBS Snapshots via API, I’d migrate to that immediately.
- There are very few people who need a more complex schedule or retention policy; most of them need to migrate to S3 or RDS.
- As soon as possible.

The End
https://github.com/rackerlabs/ebs_snapper
Martin Smith - @martinb3
Questions? Thank you!
