Writing & Maintaining an Open Source Backup Tool in Python on AWS

Or: How 12 Factor, Serverless, APIs for everything, and NoOps all break down at some point...

Martin Smith

April 19, 2017

Transcript

  1. Writing & Maintaining an Open Source Backup
    Tool in Python on AWS Lambda
    Martin Smith
    Rackspace, Inc.

  2. How 12 Factor, Serverless, APIs for everything, and NoOps all
    break down at some point...

  3. How 12 Factor, Serverless, APIs for everything, and NoOps all
    break down at some point...
    1. Version control
    2. Dependencies
    3. Config by environment
    4. Backing services
    5. Build, release, run
    6. Stateless Processes
    7. Port binding for services
    8. Concurrency via process model (ASG)
    9. Fast startup, graceful shutdown
    10. Keep environment parity
    11. Treat logs as event streams
    12. Run admin/mgmt tasks one-off

  4. How 12 Factor, Serverless, APIs for everything, and NoOps all
    break down at some point...
    Are there really no servers?
    Functions-as-a-service vs. Backend-as-a-service
    Advantages: Cost, Programming model
    Disadvantages: Performance, resource limits, monitoring & debugging

  5. How 12 Factor, Serverless, APIs for everything, and NoOps all
    break down at some point...
    All teams will henceforth expose their data and functionality through service interfaces.
    Teams must communicate with each other through these interfaces.
    There will be no other form of inter-process communication allowed: no direct linking, no direct
    reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only
    communication allowed is via service interface calls over the network.
    It doesn’t matter what technology they use.
    All service interfaces, without exception, must be designed from the ground up to be
    externalizable. That is to say, the team must plan and design to be able to expose the interface to
    developers in the outside world. No exceptions.
    Anyone who doesn’t do this will be fired. Thank you; have a nice day!

  6. How 12 Factor, Serverless, APIs for everything, and NoOps all
    break down at some point...
    Forrester coined the term NoOps, which they define as "the goal of completely
    automating the deployment, monitoring and management of applications and the
    infrastructure on which they run."
    According to Forrester Senior Analyst Glenn O’Donnell, who co-authored the
    report "Augment DevOps with NoOps," it is more likely that although some
    operations positions will become unnecessary, others will simply evolve from
    a technical orientation toward a more business-oriented focus.

  7. Headlines about NoOps (No & Ops meaning…)
    ● Why NoOps is a DevOps disaster waiting to happen
    ● There is no such thing as NoOps: it is an awful word
    ● The fallacy of NoOps
    ● Netflix is Not Doing "NoOps"
    ● Is NoOps the End of DevOps? Think Again
    ● Yet, another post about Ops, DevOps, NoOps!

  8. Elastic Compute Cloud (EC2)
    Virtual Computing Environment
    Images, “Hardware” Types
    Persistent or Ephemeral Storage
    Multiple Physical Locations
    IP & Port based firewall
    Metadata service
    Software-defined Networking

    Elastic Block Store (EBS)
    Provides storage for EC2 instances in the AWS cloud.
    SSD/Magnetic Storage
    99.999% availability
    Encryption (rest, transit), ACLs
    Snapshots
    Elastic provisioning/tuning

  9. Requirements
    ● Snapshotting all EBS volumes on your account at regular intervals
    ● Ability to select volumes for snapshot by entire ASG, EC2 tags, or instance names
    ● EBS volume snapshotting of select volumes, based on configuration settings or defaults
    ● Flexible scheduling of snapshots per instance, based on configuration settings or defaults
    ● Configurable snapshot retention periods of a select instance's volumes, based on configuration settings or defaults
    ● Ability to retain a minimum number of snapshots regardless of retention period
    ● All tags from a volume should be transferred to snapshots
    ● Rackspace ticket notification and response should an EBS snapshot failure occur

    Out of Scope
    ● All volumes of an instance will currently have the same setting. This restriction could be loosened later.
    ● Workflow of: Shut down, snapshot, and start up EC2 instance
    ● File level backups: currently a customer responsibility; not provided by Rackspace.
    ● Inconsistent snapshots: customers must work with Rackspace to ensure consistent data is written to disk, e.g. local file-level backups of a database server, so that EBS snapshots are consistent and usable.
    ● Snapshot replication: This tool will not replicate snapshots between regions at this time.
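
The two retention requirements above (a configurable retention period, plus a minimum number of snapshots that always survive) interact, so a small sketch may help. This is not the tool's actual code: the function name, the tuple layout, and the sample data are all hypothetical, and it assumes the snapshots passed in belong to a single volume.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, now, retention, min_keep):
    """Return ids of expired snapshots, always sparing the newest `min_keep`.

    `snapshots` is a list of (snapshot_id, start_time) tuples for ONE volume
    (a hypothetical layout chosen for this sketch).
    """
    # Newest first, so the first `min_keep` entries are protected.
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    candidates = newest_first[min_keep:]
    cutoff = now - retention
    return [snap_id for snap_id, started in candidates if started < cutoff]


now = datetime(2017, 4, 19, tzinfo=timezone.utc)
snaps = [
    ("snap-a", now - timedelta(days=1)),
    ("snap-b", now - timedelta(days=10)),
    ("snap-c", now - timedelta(days=20)),
    ("snap-d", now - timedelta(days=30)),
]
expired = snapshots_to_delete(snaps, now, retention=timedelta(days=7), min_keep=2)
print(expired)
```

With a 7-day retention and `min_keep=2`, `snap-b` is older than the retention period but is kept because it is one of the two newest snapshots.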

  10. Aside: Lambda restrictions
    5 minute runtime
    API rate limit
    Memory sizes

  11. Original algorithm
    - Look at each instance
      - Look at each volume
        - Decide if a snapshot should be taken
        - Find most recent snapshot
    - Look at each snapshot
      - Try to track it back to a volume, instance, configuration
      - Determine how many other snapshots exist
      - Clean up or skip
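
The two passes above can be sketched as nested loops. This is a minimal illustration rather than the tool's implementation: `INSTANCES` and `SNAPSHOTS` are hypothetical in-memory stand-ins for EC2 API responses, and the function names are invented for the example. The point to notice is that every volume triggers its own scan of the snapshot list, which presumably corresponds to repeated API traffic in a real account.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical inventory standing in for EC2 API responses.
INSTANCES = {"i-1": ["vol-1", "vol-2"]}
SNAPSHOTS = [
    {"id": "snap-1", "volume": "vol-1",
     "started": datetime(2017, 4, 18, tzinfo=timezone.utc)},
    {"id": "snap-2", "volume": "vol-9",
     "started": datetime(2017, 4, 1, tzinfo=timezone.utc)},
]


def plan_backups(instances, snapshots, now, interval):
    """First pass: instance -> volume -> most recent snapshot -> decision."""
    to_snapshot = []
    for instance_id, volumes in instances.items():
        for volume_id in volumes:
            # One full scan of the snapshot list per volume.
            times = [s["started"] for s in snapshots if s["volume"] == volume_id]
            latest = max(times, default=None)
            if latest is None or now - latest >= interval:
                to_snapshot.append(volume_id)
    return to_snapshot


def find_untracked(instances, snapshots):
    """Second pass: try to trace each snapshot back to a known volume."""
    known = {v for volumes in instances.values() for v in volumes}
    return [s["id"] for s in snapshots if s["volume"] not in known]


now = datetime(2017, 4, 19, tzinfo=timezone.utc)
due = plan_backups(INSTANCES, SNAPSHOTS, now, timedelta(days=2))
untracked = find_untracked(INSTANCES, SNAPSHOTS)
print(due, untracked)
```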

  12. Aside: Lambda restrictions
    No parallelism options
    Can’t control retry
    Stateless chunking is hard

  13. V2 Algorithm
    - Look at each instance
      - Look at each volume
    - Chunk into groups of 5 volumes at a time
    - Parallel operations for each chunk (workers = 5)
      - Decide if a snapshot should be taken
      - Find most recent snapshot
    - Group snapshots into chunks of 5 (workers = 5)
      - Try to track it back to a volume, instance, configuration
      - Determine how many other snapshots exist
      - Clean up or skip
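
A rough sketch of the chunk-and-fan-out step, using groups of 5 and 5 workers as on the slide. The slides do not say which concurrency mechanism the tool used; threads from the standard library are an assumption here, chosen because a later slide notes that /dev/shm is missing on Lambda, which tends to rule out `multiprocessing` pools and queues. `check_volume` is a hypothetical placeholder for the per-volume decision.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def check_volume(volume_id):
    # Hypothetical placeholder for "decide if a snapshot should be taken".
    return volume_id, volume_id.endswith(("1", "3"))


def process_in_chunks(volume_ids, size=5, workers=5):
    """Process one chunk at a time, with up to `workers` threads per chunk."""
    results = {}
    for chunk in chunked(volume_ids, size):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns results in the same order as the inputs.
            for volume_id, needs_snapshot in pool.map(check_volume, chunk):
                results[volume_id] = needs_snapshot
    return results


volumes = [f"vol-{n}" for n in range(12)]
results = process_in_chunks(volumes)
print(sorted(v for v, needed in results.items() if needed))
```

Processing chunk by chunk keeps the number of in-flight requests bounded, which matters when an API rate limit is in play.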

  14. Aside: Lambda restrictions
    Orphaned snapshots
    /dev/shm missing
    Still running out of time

  15. V2 Algorithm
    - Collect all snapshots
    - Collect all volumes
    - Collect all running instances
    - Build lookup tables that map
      - snapshot to volume to instance
      - volume to snapshot count
      - volume to last snapshot taken
      - snapshot to configuration
    - 1: Review volumes, decide if a snapshot should be taken based on the most recent one
    - 2: Review snapshots, decide to clean up if expired and enough remain
    - On any run, keep checking execution time and exit gracefully
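
The collect-once, index, then decide approach might look roughly like the sketch below. The record layout, function names, `FakeContext`, and the 10-second reserve are assumptions made for illustration; the one real API used is `get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object passed to a handler.

```python
from collections import defaultdict


def build_lookups(snapshots, volumes):
    """Index everything once so later decisions are dictionary lookups."""
    volume_to_instance = {v["id"]: v["instance"] for v in volumes}
    snapshot_count = defaultdict(int)
    last_snapshot = {}
    snapshot_to_instance = {}
    for snap in snapshots:
        vol = snap["volume"]
        snapshot_count[vol] += 1
        if vol not in last_snapshot or snap["started"] > last_snapshot[vol]:
            last_snapshot[vol] = snap["started"]
        # None marks a snapshot whose volume is no longer known.
        snapshot_to_instance[snap["id"]] = volume_to_instance.get(vol)
    return dict(snapshot_count), last_snapshot, snapshot_to_instance


class FakeContext:
    """Local stand-in for the Lambda context object (testing only)."""

    def __init__(self, remaining_ms):
        self.remaining_ms = list(remaining_ms)

    def get_remaining_time_in_millis(self):
        return self.remaining_ms.pop(0)


def process(items, context, reserve_ms=10_000):
    """Handle items until fewer than `reserve_ms` milliseconds remain."""
    done = []
    for item in items:
        if context.get_remaining_time_in_millis() < reserve_ms:
            break  # stop early instead of being killed mid-operation
        done.append(item)
    return done


volumes = [{"id": "vol-1", "instance": "i-1"}]
snapshots = [
    {"id": "snap-1", "volume": "vol-1", "started": 100},
    {"id": "snap-2", "volume": "vol-1", "started": 200},
    {"id": "snap-3", "volume": "vol-gone", "started": 50},
]
counts, latest, owners = build_lookups(snapshots, volumes)
handled = process(["a", "b", "c"], FakeContext([60_000, 30_000, 5_000]))
print(counts, latest, owners, handled)
```

Because the tables are built in one pass over each collection, the decision steps no longer need a fresh search per volume or per snapshot.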

  16. Now how do I deploy this thing?
    Releases go to an S3 bucket with Semantic Versioning.
    Lambda has version numbers, but they are plain counters (1, 2, 3, …).
    Lambda also reports a hash of each function's code:
    "CodeSHA256": "OjRFuuHKizEE8tHFIMsI+iHR6BPAfJ5S0rW31Mh6jKg=",
    ● Crawl every account, compare hashes.
    ● Only deploy where needed.
    ● Bonus S3 integration.
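
A sketch of the compare-hashes step. Lambda reports the code hash as a base64-encoded SHA-256 digest of the deployment package (boto3 spells the response key `CodeSha256`), so the same value can be computed locally from the release archive. The account ids, sample bytes, and function names below are hypothetical; in practice the deployed hashes would come from the Lambda API rather than a hard-coded dictionary.

```python
import base64
import hashlib


def code_sha256(package_bytes):
    """Base64-encoded SHA-256 digest, the format Lambda reports for code."""
    digest = hashlib.sha256(package_bytes).digest()
    return base64.b64encode(digest).decode("ascii")


def accounts_needing_deploy(release_bytes, deployed_hashes):
    """Return account ids whose deployed hash differs from the release."""
    wanted = code_sha256(release_bytes)
    return sorted(
        account for account, current in deployed_hashes.items()
        if current != wanted
    )


release = b"example zip contents"
deployed = {
    "111111111111": code_sha256(release),         # already current
    "222222222222": code_sha256(b"older build"),  # stale
    "333333333333": None,                         # not installed yet
}
stale = accounts_needing_deploy(release, deployed)
print(stale)
```

Comparing hashes rather than Lambda's own version counters ties the check to the release artifact itself, so accounts that already run the current build are skipped.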

  17. How much does this cost?
    For a handful of servers, daily backups:
    - $2/mo for DynamoDB
    - $0.05/mo for Lambda
    Usage-based cost is hard to track down; it depends on:
    - Snapshot frequency
    - Snapshot retention
    - Instance count
    - Regional pricing

  21. When would I kill it?
    - This is a stopgap tool. If AWS provides scheduled EBS snapshots via API,
    I’d migrate to that immediately.
    - Very few people need a more complex schedule or retention policy;
    most of them need to migrate to S3 or RDS.
    - As soon as possible.

  22. The End
    https://github.com/rackerlabs/ebs_snapper
    Martin Smith - @martinb3
    Questions? Thank you!
