Slide 1

Slide 1 text

Writing & Maintaining an Open Source Backup Tool in Python on AWS Lambda
Martin Smith
Rackspace, Inc.
[email protected]

Slide 2

Slide 2 text

How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

Slide 3

Slide 3 text

How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...
1. Version control
2. Dependencies
3. Config by environment
4. Backing services
5. Build, release, run
6. Stateless Processes
7. Port binding for services
8. Concurrency via process model (ASG)
9. Fast startup, graceful shutdown
10. Keep environment parity
11. Treat logs as event streams
12. Run admin/mgmt tasks one-off

Slide 4

Slide 4 text

How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...
Are there really no servers?
Functions-as-a-service vs. Backend-as-a-service
Advantages: Cost, Programming model
Disadvantages: Performance, resource limits, monitoring & debugging

Slide 5

Slide 5 text

How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...
All teams will henceforth expose their data and functionality through service interfaces.
Teams must communicate with each other through these interfaces.
There will be no other form of inter-process communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
It doesn’t matter what technology they use.
All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
Anyone who doesn’t do this will be fired.
Thank you; have a nice day!

Slide 6

Slide 6 text

How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...
Forrester coined the term NoOps, which they define as "the goal of completely automating the deployment, monitoring and management of applications and the infrastructure on which they run."
According to Forrester Senior Analyst Glenn O’Donnell, who co-authored the report "Augment DevOps with NoOps," it is more likely that although some operations positions will become unnecessary, others will simply evolve from a technical orientation toward a more business-oriented focus.

Slide 7

Slide 7 text

Headlines about NoOps (No & Ops meaning…)
● Why NoOps is a DevOps disaster waiting to happen
● There is no such thing as NoOps: it is an awful word
● The fallacy of NoOps
● Netflix is Not Doing "NoOps"
● Is NoOps the End of DevOps? Think Again
● Yet, another post about Ops, DevOps, NoOps!

Slide 8

Slide 8 text

Elastic Compute Cloud (EC2)
Virtual Computing Environment
- Images, “Hardware” Types
- Persistent or Ephemeral Storage
- Multiple Physical Locations
- IP & Port based firewall
- Metadata service
- Software-defined Networking

Elastic Block Store (EBS) provides storage for EC2 instances in the AWS cloud.
- SSD/Magnetic Storage
- 99.999% availability
- Encryption (rest, transit), ACLs
- Snapshots
- Elastic provisioning/tuning

Slide 9

Slide 9 text

Requirements
● Snapshotting all EBS volumes on your account at regular intervals
● Ability to select volumes for snapshot by entire ASG, EC2 tags, or instance names
● EBS volume snapshotting of select volumes, based on configuration settings or defaults
● Flexible scheduling of snapshots per instance, based on configuration settings or defaults
● Configurable snapshot retention periods of a select instance's volumes, based on configuration settings or defaults
● Ability to retain a minimum number of snapshots regardless of retention period
● All tags from a volume should be transferred to snapshots
● Rackspace ticket notification and response should an EBS snapshot failure occur
● All volumes of an instance will currently have the same setting. This restriction could be loosened later.
● Workflow of: Shut down, snapshot, and start up EC2 instance

Out of Scope
● File level backups: currently a customer responsibility; not provided by Rackspace.
● Inconsistent snapshots: customers must work with Rackspace to ensure consistent data is written to disk, e.g. local file-level backups of a database server, so that EBS snapshots are consistent and usable.
● Snapshot replication: This tool will not replicate snapshots between regions at this time.
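The two retention requirements above (a configurable retention period, plus a minimum number of snapshots kept regardless of age) interact, so a small sketch may help. This is not the tool's actual code: the function name `snapshots_to_delete`, its parameters, and the `(snapshot_id, start_time)` tuple shape are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, retention_days, min_keep, now=None):
    """Return ids of snapshots that are expired AND not needed for the minimum.

    snapshots: list of (snapshot_id, start_time) tuples for ONE volume
               (hypothetical shape, chosen for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Newest first, so the snapshots we always keep are the most recent ones.
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    # Skip the newest `min_keep` snapshots no matter how old they are.
    candidates = ordered[min_keep:]
    return [snap_id for snap_id, started in candidates if started < cutoff]
```

With a 7-day retention and a minimum of 2, a volume whose snapshots are 1, 10, 20 and 30 days old would lose only the 20- and 30-day-old ones: the 10-day-old snapshot is expired but still counts toward the minimum.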

Slide 10

Slide 10 text

Aside: Lambda restrictions
- 5 minute runtime
- API rate limit
- Memory sizes

Slide 11

Slide 11 text

Original algorithm
- Look at each instance
- Look at each volume
- Decide if a snapshot should be taken
- Find most recent snapshot
- Look at each snapshot
- Try to track it back to a volume, instance, configuration
- Determine how many other Snapshots exist
- Clean up or skip

Slide 12

Slide 12 text

Aside: Lambda restrictions
- No parallelism options
- Can’t control retry
- Stateless chunk’d hard

Slide 13

Slide 13 text

V2 Algorithm
- Look at each instance
- Look at each volume
- Chunk into groups of 5 volumes at a time
- Parallel operations for each chunk (workers = 5)
- Decide if a snapshot should be taken
- Find most recent snapshot
- Group snapshots into chunks of 5 (workers = 5)
- Try to track it back to a volume, instance, configuration
- Determine how many other Snapshots exist
- Clean up or skip
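The chunk-and-parallelize step above can be sketched roughly as follows. The slides do not say which concurrency primitive was used, so the thread pool here is an assumption, as are the names `chunks`, `process_in_chunks`, and the `handle_volume` callback, which stands in for the per-volume "decide if a snapshot should be taken" work.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 5  # "groups of 5 volumes at a time"
WORKERS = 5     # "workers = 5"


def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(volume_ids, handle_volume):
    """Run handle_volume(volume_id) for every volume, one chunk at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for chunk in chunks(volume_ids, CHUNK_SIZE):
            # map() keeps input order; iterating it waits for the whole chunk
            # to finish before the next chunk is submitted.
            for vol_id, outcome in zip(chunk, pool.map(handle_volume, chunk)):
                results[vol_id] = outcome
    return results
```

Working one chunk at a time keeps the number of in-flight API calls bounded, which matters given the API rate limit mentioned earlier.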

Slide 14

Slide 14 text

Aside: Lambda restrictions
- Orphaned snapshots
- /dev/shm missing
- Still running out of time

Slide 15

Slide 15 text

V2 Algorithm
- Collect all snapshots
- Collect all volumes
- Collect all running instances
- Build lookup table that maps
  - snapshot to volume to instance
  - volume to snapshot count
  - volume to last snapshot taken
  - snapshot to configuration
- 1: Review volumes, decide if snapshot should be taken based on recent
- 2: Review snapshots, decide to cleanup if expired and plenty remain
- On any run, keep checking execution time and exit gracefully
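A minimal sketch of the lookup-table idea and the execution-time check, under stated assumptions: the input dicts are simplified (a real EC2 volume description nests the instance id under its attachments), the snapshot-to-configuration map is omitted, and `build_lookups`, `out_of_time`, and the 30-second safety margin are illustrative names and values, not the tool's own. The one real API used is `context.get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the handler's context object.

```python
from collections import defaultdict


def build_lookups(snapshots, volumes):
    """Build the lookup tables once, instead of querying per volume.

    snapshots: dicts with "SnapshotId", "VolumeId", "StartTime" (simplified).
    volumes:   dicts with "VolumeId", "InstanceId" (simplified).
    """
    volume_to_instance = {v["VolumeId"]: v.get("InstanceId") for v in volumes}
    snapshot_to_volume = {}
    volume_snapshot_count = defaultdict(int)
    volume_last_snapshot = {}
    for snap in snapshots:
        vol = snap["VolumeId"]
        snapshot_to_volume[snap["SnapshotId"]] = vol
        volume_snapshot_count[vol] += 1
        last = volume_last_snapshot.get(vol)
        if last is None or snap["StartTime"] > last:
            volume_last_snapshot[vol] = snap["StartTime"]
    return {
        "snapshot_to_volume": snapshot_to_volume,
        "volume_to_instance": volume_to_instance,
        "volume_snapshot_count": dict(volume_snapshot_count),
        "volume_last_snapshot": volume_last_snapshot,
    }


def out_of_time(context, safety_margin_ms=30000):
    """True when the Lambda invocation is close enough to its timeout to stop."""
    return context.get_remaining_time_in_millis() < safety_margin_ms
```

A snapshot whose volume is missing from `volume_to_instance` is an orphan candidate, and a loop over volumes or snapshots can call `out_of_time(context)` on each iteration and return early rather than being killed mid-operation.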

Slide 16

Slide 16 text

Now how do I deploy this thing?
Releases go to an S3 bucket with Semantic Versioning.
Lambda has version numbers, but (1...2...3…)
Lambda has "CodeSHA256": "OjRFuuHKizEE8tHFIMsI+iHR6BPAfJ5S0rW31Mh6jKg=",
● Crawl every account, compare hashes.
● Only deploy where needed.
● Bonus S3 integration.
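The hash comparison can be sketched locally. Lambda reports the code hash as the base64-encoded SHA-256 digest of the deployment package, so the same value can be computed from a release zip and compared with what each account reports. The helper names `code_sha256` and `needs_deploy` are assumptions for this sketch; fetching the deployed hash from each account is left out.

```python
import base64
import hashlib


def code_sha256(zip_bytes):
    """Base64-encoded SHA-256 of a deployment package, in Lambda's format."""
    digest = hashlib.sha256(zip_bytes).digest()
    return base64.b64encode(digest).decode("ascii")


def needs_deploy(release_zip_bytes, deployed_code_sha256):
    """True when the deployed function's hash differs from the release's."""
    return code_sha256(release_zip_bytes) != deployed_code_sha256
```

Comparing hashes rather than Lambda's own version numbers means a redeploy is triggered only where the code actually differs from the release.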

Slide 17

Slide 17 text

How much does this cost?
For a handful of servers, daily backups:
- $2/mo for Dynamo
- $0.05/mo for Lambda
Usage-based cost is hard to track down; it depends on:
- Frequency of Snapshot
- Snapshot Retention
- Instance count
- Regional pricing

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

When would I kill it?
- This is a stopgap tool. If AWS provides scheduled EBS Snapshots via API, I’d migrate to that immediately.
- Very few people need a more complex schedule or retention policy; most of them need to migrate to S3 or RDS.
- As soon as possible.

Slide 22

Slide 22 text

The End
https://github.com/rackerlabs/ebs_snapper
Martin Smith - @martinb3
Questions?
Thank you!