Writing & Maintaining an Open Source Backup Tool in Python on AWS

Or: How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

Martin Smith

April 19, 2017

Transcript

  1. Writing & Maintaining an Open Source Backup Tool in Python on AWS Lambda

    Martin Smith, Rackspace, Inc., martin@mbs3.org
  2. How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...
  3. How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

    1. Version control
    2. Dependencies
    3. Config by environment
    4. Backing services
    5. Build, release, run
    6. Stateless Processes
    7. Port binding for services
    8. Concurrency via process model (ASG)
    9. Fast startup, graceful shutdown
    10. Keep environment parity
    11. Treat logs as event streams
    12. Run admin/mgmt tasks one-off
  4. How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

    Are there really no servers?
    Functions-as-a-service vs. Backend-as-a-service
    Advantages: Cost, Programming model
    Disadvantages: Performance, resource limits, monitoring & debugging
  5. How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

    All teams will henceforth expose their data and functionality through service interfaces.
    Teams must communicate with each other through these interfaces.
    There will be no other form of inter-process communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
    It doesn’t matter what technology they use.
    All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
    Anyone who doesn’t do this will be fired.
    Thank you; have a nice day!
  6. How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

    Forrester coined the term NoOps, which they define as "the goal of completely automating the deployment, monitoring and management of applications and the infrastructure on which they run."
    According to Forrester Senior Analyst Glenn O’Donnell, who co-authored the report "Augment DevOps with NoOps," it is more likely that although some operations positions will become unnecessary, others will simply evolve from a technical orientation toward a more business-oriented focus.
  7. Headlines about NoOps (No & Ops meaning…)

    • Why NoOps is a DevOps disaster waiting to happen
    • There is no such thing as NoOps: it is an awful word
    • The fallacy of NoOps
    • Netflix is Not Doing "NoOps"
    • Is NoOps the End of DevOps? Think Again
    • Yet, another post about Ops, DevOps, NoOps!
  8. Elastic Compute Cloud (EC2)

    Virtual Computing Environment
    Images, “Hardware” Types
    Persistent or Ephemeral Storage
    Multiple Physical Locations
    IP & Port based firewall
    Metadata service
    Software-defined Networking

    Elastic Block Store (EBS) provides storage for EC2 instances in the AWS cloud.

    SSD/Magnetic Storage
    99.999% availability
    Encryption (rest, transit), ACLs
    Snapshots
    Elastic provisioning/tuning
  9. Requirements

    • Snapshotting all EBS volumes on your account at regular intervals
    • Ability to select volumes for snapshot by entire ASG, EC2 tags, or instance names
    • EBS volume snapshotting of select volumes, based on configuration settings or defaults
    • Flexible scheduling of snapshots per instance, based on configuration settings or defaults
    • Configurable snapshot retention periods of a select instance's volumes, based on configuration settings or defaults
    • Ability to retain a minimum number of snapshots regardless of retention period
    • All tags from a volume should be transferred to snapshots
    • Rackspace ticket notification and response should an EBS snapshot failure occur
    • All volumes of an instance will currently have the same setting. This restriction could be loosened later.
    • Workflow of: Shut down, snapshot, and start up EC2 instance
    • File level backups: currently a customer responsibility; not provided by Rackspace.
    • Inconsistent snapshots: customers must work with Rackspace to ensure consistent data is written to disk, e.g. local file-level backups of a database server, so that EBS snapshots are consistent and usable.
    • Snapshot replication: This tool will not replicate snapshots between regions at this time.

    Out of Scope
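The retention requirements above (a retention period plus a minimum number of snapshots that is always kept) could be sketched roughly as follows. This is an illustrative sketch only, not the tool's actual code: the function name `snapshots_to_delete` and its parameters are hypothetical, and snapshots are assumed to be simple `(snapshot_id, start_time)` tuples.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, retention_days, minimum_keep, now=None):
    """Return ids of snapshots that are expired AND not needed for the minimum.

    snapshots: list of (snapshot_id, start_time) tuples (hypothetical shape),
    where start_time is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)

    # Newest first, so the first `minimum_keep` entries are always protected.
    ordered = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    candidates = ordered[minimum_keep:]

    # Of the unprotected snapshots, only those older than the cutoff expire.
    return [snap_id for snap_id, started in candidates if started < cutoff]
```

With a 7-day retention and a minimum of 2, a volume whose snapshots are all weeks old would still keep its two newest ones.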
  10. Aside: Lambda restrictions

    5 minute runtime
    API rate limit
    Memory sizes
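One common way to live with an API rate limit is to retry throttled calls with exponential backoff and jitter. The talk does not say the tool does this; the sketch below is only an illustration, and `ThrottledError` and `call_with_backoff` are hypothetical names standing in for whatever error the real API client raises.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an API 'rate exceeded' error."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller see the error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads retries out so parallel workers do not sync up.
            sleep(delay + random.uniform(0, delay))
```

Inside a function with a hard runtime limit, the total time spent sleeping also counts against the budget, so `max_attempts` and `base_delay` would need to stay small.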
  11. Original algorithm

    - Look at each instance
    - Look at each volume
    - Decide if a snapshot should be taken
    - Find most recent snapshot
    - Look at each snapshot
    - Try to track it back to a volume, instance, configuration
    - Determine how many other Snapshots exist
    - Clean up or skip
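The "decide if a snapshot should be taken" step could look something like the sketch below: find the most recent snapshot for a volume and compare its age with the configured frequency. The function name `snapshot_due` and the use of a `timedelta` for the frequency are assumptions made for illustration, not details from the talk.

```python
from datetime import datetime, timedelta, timezone


def snapshot_due(snapshot_times, frequency, now):
    """Return True if a volume needs a new snapshot.

    snapshot_times: start times of the volume's existing snapshots.
    frequency: desired interval between snapshots, as a timedelta (assumed).
    """
    if not snapshot_times:
        return True  # never snapshotted, so one is due
    most_recent = max(snapshot_times)
    return now - most_recent >= frequency
```

In the original algorithm this check runs once per volume, and each check needs its own snapshot lookup, which is part of why the per-volume loop is slow.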
  12. Aside: Lambda restrictions

    No parallelism options
    Can’t control retry
    Stateless chunk’d hard
  13. V2 Algorithm

    - Look at each instance
    - Look at each volume
    - Chunk into groups of 5 volumes at a time
    - Parallel operations for each chunk (workers = 5)
    - Decide if a snapshot should be taken
    - Find most recent snapshot
    - Group snapshots into chunks of 5 (workers = 5)
    - Try to track it back to a volume, instance, configuration
    - Determine how many other Snapshots exist
    - Clean up or skip
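The chunking step above (groups of 5, workers = 5) might be sketched as follows. The talk does not say which concurrency primitive was used; this sketch assumes a thread pool from the standard library, and `chunked` and `process_in_chunks` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(items, handler, chunk_size=5, workers=5):
    """Run handler over items, one chunk at a time, in parallel per chunk."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunked(items, chunk_size):
            # pool.map keeps input order; extend() waits for the whole chunk.
            results.extend(pool.map(handler, chunk))
    return results
```

Processing one chunk at a time caps the number of in-flight API calls at the chunk size, which may help with rate limits at the cost of some idle time at the end of each chunk.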
  14. Aside: Lambda restrictions

    Orphaned snapshots
    /dev/shm missing
    Still running out of time
  15. V2 Algorithm

    - Collect all snapshots
    - Collect all volumes
    - Collect all running instances
    - Build lookup table that maps
      - snapshot to volume to instance
      - volume to snapshot count
      - volume to last snapshot taken
      - snapshot to configuration
    - 1: Review volumes, decide if snapshot should be taken based on recent
    - 2: Review snapshots, decide to cleanup if expired and plenty remain
    - On any run, keep checking execution time and exit gracefully
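A rough sketch of the lookup-table idea and the execution-time check follows. It assumes the input dicts use the field names from EC2 describe responses (`SnapshotId`, `VolumeId`, `StartTime`, `Attachments`, `InstanceId`) and that `context` is the Lambda context object, whose `get_remaining_time_in_millis()` method reports the time left. The function names and the 10-second reserve are hypothetical, and the snapshot-to-configuration table is left out.

```python
from collections import defaultdict


def build_lookups(snapshots, volumes):
    """Build the lookup tables once, instead of querying per volume."""
    volume_to_instance = {}
    for vol in volumes:
        attachments = vol.get("Attachments", [])
        volume_to_instance[vol["VolumeId"]] = (
            attachments[0]["InstanceId"] if attachments else None
        )

    snapshot_count = defaultdict(int)  # volume -> number of snapshots
    last_snapshot = {}                 # volume -> newest snapshot start time
    snapshot_to_instance = {}          # snapshot -> instance (via its volume)
    for snap in snapshots:
        vol_id = snap["VolumeId"]
        snapshot_count[vol_id] += 1
        if vol_id not in last_snapshot or snap["StartTime"] > last_snapshot[vol_id]:
            last_snapshot[vol_id] = snap["StartTime"]
        snapshot_to_instance[snap["SnapshotId"]] = volume_to_instance.get(vol_id)

    return volume_to_instance, dict(snapshot_count), last_snapshot, snapshot_to_instance


def run_with_deadline(work_items, handler, context, reserve_ms=10000):
    """Process items until the remaining execution time drops below reserve_ms."""
    done = 0
    for item in work_items:
        if context.get_remaining_time_in_millis() < reserve_ms:
            break  # exit gracefully; a later run picks up the rest
        handler(item)
        done += 1
    return done
```

A snapshot whose volume no longer exists maps to `None` here, which is one way an orphaned snapshot would show up.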
  16. Now how do I deploy this thing?

    Releases go to an S3 bucket with Semantic Versioning.
    Lambda has version numbers, but (1...2...3…)
    Lambda has "CodeSHA256": "OjRFuuHKizEE8tHFIMsI+iHR6BPAfJ5S0rW31Mh6jKg=",
    • Crawl every account, compare hashes.
    • Only deploy where needed.
    • Bonus S3 integration.
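The hash comparison above could be sketched like this. Lambda reports the base64-encoded SHA-256 digest of the deployment package, so the same value can be computed locally from the release zip and compared against what each account has deployed. The helper names `code_sha256` and `needs_deploy` are hypothetical, and fetching the deployed hash from each account is not shown.

```python
import base64
import hashlib


def code_sha256(zip_bytes):
    """Base64-encoded SHA-256 digest of a deployment package's bytes."""
    digest = hashlib.sha256(zip_bytes).digest()
    return base64.b64encode(digest).decode("ascii")


def needs_deploy(release_zip_bytes, deployed_code_sha256):
    """True if the deployed function's code differs from the release."""
    return code_sha256(release_zip_bytes) != deployed_code_sha256
```

Comparing hashes rather than Lambda's own version numbers ties the decision to the actual package contents, so an account only gets a deploy when its code really differs from the release.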
  17. How much does this cost?

    For a handful of servers, daily backups:
    - $2/mo for Dynamo
    - $0.05/mo for Lambda

    Usage-based cost is hard to track down; it depends on:
    - Frequency of Snapshot
    - Snapshot Retention
    - Instance count
    - Regional pricing
  18. None
  19. None
  20. None
  21. When I’d kill it?

    - This is a stopgap tool. If AWS provides scheduled EBS Snapshots via API, I’d migrate to that immediately.
    - There are very few people who need a more complex schedule or retention policy; most of them need to migrate to S3 or RDS.
    - As soon as possible.
  22. The End

    https://github.com/rackerlabs/ebs_snapper
    Martin Smith - @martinb3
    Questions? Thank you!