
Building an AMI Factory with Open Source Tools

DevOpsDays Portland - 2013

Jeremy Carroll

November 06, 2013

Transcript

  1. Building an AMI Factory
    with open source tools
    Jeremy Carroll
    DevOps Days - Portland

  2. - Bake: produce a complete virtual machine offline, before first use.
    - Fry: produce a skeleton virtual machine by booting a basic VM, and then applying
    configuration.
    - Both approaches have tradeoffs.

  3. - You see a lot of appliance VMs among the EC2 public images.
    - Make it easy: press start, get application.
    - Bake once, use many times. A ‘network effect’ over time: do something once, then
    reuse it.
    - You can always ‘fry’ something later as well. Example: launching an instance, then using
    Puppet or another configuration management tool for live management.

  4. Why?
    - Makes things more efficient. Bake your base roles and classes to reduce work at instance
    launch time.
    - Example: Your package repository server is probably sad. Every machine launch installs the
    same packages over and over.
    - Operations and Development love the system. It is not much more complicated than launching
    an instance.
    - You will have a better time.

  5. `Fry` Method - On-Demand System Provisioning
    • Puppet run at boot
    • Many dependent services - 3rd party network calls
    • Boot times vary. Usually around 10-15 minutes
    - On-demand provisioning at boot is slow. It can take 15+ minutes
    for a system to become fully ready.
    - For us, autoscaling was a primary driver.
    - Autoscaling uses the EC2 APIs as a lightweight ‘state’ machine: always keep ‘x’ of
    these running in multiple availability zones.
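
    A minimal sketch of the "always keep x running in multiple AZs" idea, assuming boto3 (the current AWS SDK for Python, not necessarily what the talk used). The group, launch configuration and zone names are hypothetical. The function only builds the keyword arguments for boto3's `create_auto_scaling_group`, so it can be checked without AWS credentials.

```python
def asg_request(name, launch_config, zones, capacity):
    """Build kwargs for boto3's autoscaling create_auto_scaling_group call.

    Pinning MinSize == MaxSize == DesiredCapacity makes AutoScaling act as a
    lightweight state machine: it replaces any instance that dies so that
    `capacity` instances stay running across the given zones.
    """
    if not zones:
        raise ValueError("need at least one availability zone")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": capacity,
        "MaxSize": capacity,
        "DesiredCapacity": capacity,
        "AvailabilityZones": list(zones),
    }


# Hypothetical usage; with credentials configured the real call would be:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**kwargs)
kwargs = asg_request("myapp-asg", "myapp-lc-0.1.42",
                     ["us-east-1a", "us-east-1b", "us-east-1c"], 6)
```

    Because the group is launched from a baked image, each replacement instance is ready as soon as it boots, which is what makes this loop practical.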

  6. AutoScale - In Production
    - Feels like this picture. When it works, it’s like being a rockstar: taming the beast.
    - Orchestration of legend. Lots of moving parts, thousands of times per day.
    - But there are tradeoffs; this applies to our AutoScaling services. We still provision
    instances for persistent services on demand at this time.
    - The previous system was ‘fry’ based: it had Just Enough Operating System (JeOS), and provisioned
    via Puppet at boot.
    - Anybody who checked a bug into Puppet trunk could break autoscaling.
    - Increased load on dependent services (Puppet, Apt, internal APIs, etc.). You have to provision
    for spikiness (launching hundreds of instances in case of failure).
    - Spot pricing story: we lost an entire AZ due to a run on spot instances, and had to relaunch the entire data center for the
    application.

  7. Reliability
    - The previous system was fragile. Patterns to increase reliability:
    - Decouple from lower-SLA systems to increase availability.
    - The application can continue to function without the remote system.
    - A system that does not require any third-party services can offer an improved SLA.
    - Reduce or eliminate dependency on the CMDB. Eliminate or reduce third-party network calls.
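
    The reliability argument is simple arithmetic: if a boot depends on several services in series, its availability is (assuming independent failures) the product of theirs. The figures below are illustrative, not measurements from the talk.

```python
import math


def serial_availability(availabilities):
    """Availability of a path that needs every dependency up (independent failures assumed)."""
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability out of range: {a}")
    return math.prod(availabilities)  # prod of an empty list is 1


# Hypothetical fry-style boot: Puppet master, apt mirror, internal API.
fried = serial_availability([0.999, 0.99, 0.995])
# A baked image that boots without calling them has none of these in its path
# (EC2 itself is ignored in both cases).
baked = serial_availability([])

print(f"fried boot: {fried:.4%}, baked boot: {baked:.0%}")
```

    Three individually decent dependencies compound to roughly 98.4%, so every third-party call removed from the boot path raises the ceiling on the SLA.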

  8. Tools Used
    • Image Creation: Packer.io (https://github.com/mitchellh/packer)
    • Build Pipeline: Jenkins (https://github.com/jenkinsci/jenkins)
    • Testing: BATS (https://github.com/sstephenson/bats)
    • Configuration Management: Puppet (https://github.com/puppetlabs/puppet)
    • Runtime Modifications: CloudInit (https://help.ubuntu.com/community/CloudInit)
    - Tool availability is increasing: Aminator, BoxGrinder (kinda stale), Packer, or in-house scripts with the EC2 tools (bundle-volume).
    - We looked at Aminator, but it was EBS-specific. A lot of patterns were gleaned from contributing to it.
    - Packer felt right due to its multi-cloud approach and its ability to deal with both EBS and instance-store systems.
    - Puppet for deterministic configuration management. Insert your CMDB here.
    - Packer concurrency was a must-have feature: we build multiple images at a time, since it is a 'launch and provision via SSH' style system.
    - Jenkins for workflow. It is really just a scheduler, so resource utilization on the Jenkins slaves is low.
    - CloudInit for run-time modifications. Packer builds can modify launch-time operations such as 'Do not register in Route53', 'Which Puppet
    role to apply', and 'Which BATS tests to run'.
    - BATS for testing due to its simplicity. Just write bash; exit codes are your friend.
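
    Runtime switches like these travel to the instance as cloud-config user-data. A minimal sketch, assuming hypothetical keys (`route53_register`, `bats_tests`) read by custom CloudInit handlers; only the `#cloud-config` header and a `role` key appear in the packer.json shown later. Values are written as JSON, which YAML parsers accept for simple scalars and lists, so no YAML library is needed.

```python
import json


def cloud_config(options):
    """Render a #cloud-config user-data string from a flat dict of options.

    Each value is emitted as JSON; for strings, booleans and lists of strings
    this is also valid YAML, so CloudInit can parse the result.
    """
    lines = ["#cloud-config"]
    for key in sorted(options):
        lines.append(f"{key}: {json.dumps(options[key])}")
    return "\n".join(lines) + "\n"


user_data = cloud_config({
    "role": "myapp",                    # which Puppet role to apply
    "route53_register": False,          # hypothetical: skip DNS registration during the build
    "bats_tests": ["puppet", "myapp"],  # hypothetical: which BATS suites to run
})
print(user_data)
```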

  9. Image Evolution
    - Images have to start somewhere.
    - Foundation AMI to start. You can use the Ubuntu Cloud Images, or roll your own: just the base
    operating system with patches, not much else.
    - Golden AMI to hous

  10. How do we create these images?
    Packer.io
    • Packer is a tool for creating identical machine images for
    multiple platforms from a single source configuration.
    • Written in Go
    • Supports a lot of platforms
    • EC2, DigitalOcean, OpenStack, VirtualBox, VMware
    • A lot better than scripts + ec2-bundle-volume
    • Some bugs, but it’s getting better all the time
    - Written in Go.
    - Supports parallel builds, i.e. multiple builds at one time.
    - DSL for describing images.
    - Looked at Aminator (too EBS-specific). BoxGrinder was CentOS only, and almost
    abandonware.
    - Does one job really well.

  11. {
      "provisioners": [
        {
          "type": "shell",
          "scripts": [
            "provision.sh"
          ]
        }
      ],
      "builders": [
        {
          "access_key": "{{user `ec2_access_key`}}",
          "source_ami": "ami-1234567",
          "account_id": "1234",
          "bundle_destination": "/mnt",
          "region": "{{user `region`}}",
          "tags": {
            "application": "myapp",
            "environment": "test",
            "release": "precise",
            "host": "packerci-slave001",
            "owner": "root",
            "ancestor": "ami-34567890"
          },
          "user_data": "#cloud-config\nrole: myapp",
          "x509_key_path": "{{user `ec2_private_key`}}",
          "instance_type": "{{user `size`}}",
          "x509_upload_path": "/mnt",
          "x509_cert_path": "{{user `ec2_cert`}}",
          "iam_instance_profile": "provisioning",
          "ami_name": "{{user `application`}}-{{user `version`}}.{{user `build_number`}}-{{user `architecture`}}-{{user `user_timestamp`}}-{{user `type`}}",
          "ami_description": "store=amazon-instance,ancestor_name=golden-12.04-precise-amd64-201311010411-instance_store,ancestor_id=ami-12345678,version=0.1,env=test,app=myapp,release=precise",
          "secret_key": "{{user `ec2_secret_key`}}",
          "security_group_id": "sg-12345678",
          "type": "amazon-instance",
          "s3_bucket": "mybucket",
          "ssh_timeout": "15m"
        }
      ]
    }
    Packer.json
    - Example of packer.json output: a JSON configuration with a lot of options.
    - Supports variable substitution through user variables (e.g. {{user `region`}}), which can default to environment variables.
    - Supports a lot of options, such as ‘instance-store’, EBS, or chroot builds.
    - We call a single script, ‘provision.sh’, once the instance launches.
    - The script waits for CloudInit to drop a file telling the system that Puppet has finished.
    - We then run our tests to determine whether we have a successful build: Puppet has
    converged, etc.
    - If successful, the script then cleans up temporary data: SSH keys, log files, /etc/hosts (if
    managed), and CloudInit data.
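
    The "wait for CloudInit" step of provision.sh could look like the sketch below, written in Python rather than shell. It assumes the status file written by the CloudInit handler on slide 19, which lives under CloudInit's data directory (assumed here to be the common default, `/var/lib/cloud/data/puppet`) and contains `ok` or `failed`. A non-zero exit status is what makes Packer abort the build.

```python
import os
import time

# Assumed default location of cloud.get_cpath('data') + '/puppet'.
STATUS_FILE = "/var/lib/cloud/data/puppet"


def wait_for_puppet(path=STATUS_FILE, timeout=900.0, interval=5.0):
    """Poll until the handler writes its status; return True only for 'ok'."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            with open(path) as fh:
                status = fh.read().strip()
            if status:  # ignore a half-written (still empty) file
                return status == "ok"
        if time.monotonic() >= deadline:
            return False  # never converged in time: treat as a failed build
        time.sleep(interval)


def main():
    """Exit status for the provisioner: 0 lets Packer continue, 1 aborts the build."""
    return 0 if wait_for_puppet() else 1

# In the provisioning script: sys.exit(main())
```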

  12. Packer Wrapper
    - Packer does not know anything about our image management systems.
    - What’s the current ‘Golden’ AMI I need to use to create a machine in us-east-1 with an EBS
    store, on Raring, etc.?
    - Jenkins calls this script with the build parameters to generate a usable packer.json file.
    - We also put business logic on how to create the packer file here, so individuals do not need
    to know all the syntax.
    - What tags do I need to create on this AMI? What’s the AMI naming convention, etc.?
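
    A minimal sketch of such a wrapper, assuming the naming convention visible in the packer.json on slide 11 (`application-version.build_number-architecture-timestamp-type`, with a `YYYYmmddHHMM` timestamp as in the ancestor name) and a caller that has already looked up the Golden AMI. The parameter names are hypothetical, and only a subset of the builder fields is filled in.

```python
from datetime import datetime, timezone


def ami_name(p, now=None):
    """Apply the naming convention: app-version.build-arch-timestamp-type."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d%H%M")
    return (f"{p['application']}-{p['version']}.{p['build_number']}"
            f"-{p['architecture']}-{stamp}-{p['type']}")


def packer_template(p, golden_ami, now=None):
    """Build a packer.json dict from (hypothetical) Jenkins build parameters."""
    return {
        "provisioners": [{"type": "shell", "scripts": ["provision.sh"]}],
        "builders": [{
            "type": p["type"],
            "source_ami": golden_ami,
            "region": p["region"],
            "instance_type": p["size"],
            "ami_name": ami_name(p, now),
            "user_data": f"#cloud-config\nrole: {p['application']}",
            # Tags encode the business rules so later tools can filter on them.
            "tags": {
                "application": p["application"],
                "environment": p["environment"],
                "ancestor": golden_ami,
            },
        }],
    }
```

    The wrapper would then write the dict out with `json.dump(..., indent=2)` and hand the file to `packer build`, so nobody has to hand-edit the template syntax.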

  13. Image Query
    - Currently using Edda, a cache of the EC2 API, to store information.
    - Enriches data structures to allow dynamic querying on a variety of tags. Ex: prod vs. dev
    vs. test.
    - Query for certain types of images: EBS vs. instance store, different regions (us-west vs. us-east),
    etc.
    - You can use Boto with EC2 tag filters as well. Sort by version number and take the highest
    version number. That’s your build.
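
    A sketch of the "sort by version, take the highest" rule, assuming boto3 and a hypothetical `version` tag holding dotted numbers such as `0.1.42`. Versions are compared as tuples of integers, so `0.1.10` correctly sorts after `0.1.9`, which a plain string sort would get wrong.

```python
def version_key(image):
    """Turn the image's 'version' tag into a tuple of ints for numeric ordering."""
    tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
    return tuple(int(part) for part in tags["version"].split("."))


def latest_image(images):
    """Pick the highest-versioned image from describe_images()['Images']."""
    versioned = [img for img in images
                 if any(t["Key"] == "version" for t in img.get("Tags", []))]
    if not versioned:
        return None
    return max(versioned, key=version_key)


def find_latest(application, environment, region="us-east-1"):
    """Query EC2 with tag filters (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_images(
        Owners=["self"],
        Filters=[
            {"Name": "tag:application", "Values": [application]},
            {"Name": "tag:environment", "Values": [environment]},
        ],
    )
    return latest_image(resp["Images"])
```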

  14. The Factory
    - It all starts with Jenkins, which can build on a schedule, on a code commit, etc.
    - Using the packer wrapper, it creates a JSON template, boots a machine inside of EC2, and starts
    provisioning.
    - Integration testing is tricky, and depends on the service.
    - For some of our stateless autoscaling services, we build nightly and deploy a canary
    for testing. Once reviewed (automatically or manually), the image can be moved to production.
    - Tags and metadata about images are important. They allow us to filter / move images along the
    pipeline.
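
    Moving an image along the pipeline can be as small as rewriting one tag. A sketch assuming boto3's EC2 `create_tags` call and a hypothetical stage order of `test`, `canary`, `prod` stored in the `environment` tag. The client is passed in so the logic can be exercised with a stub.

```python
STAGES = ["test", "canary", "prod"]  # hypothetical promotion order


def next_stage(current):
    """Return the stage after `current`, or None if it is already in production."""
    i = STAGES.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None


def promote(ec2, image_id, current):
    """Retag an AMI with its next stage; `ec2` is a boto3 EC2 client (or a stub)."""
    target = next_stage(current)
    if target is None:
        return None
    # create_tags overwrites the value of an existing tag with the same key.
    ec2.create_tags(Resources=[image_id],
                    Tags=[{"Key": "environment", "Value": target}])
    return target
```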

  15. Packer Build
    - Video of Packer Build

  16. Packer Build
    - Video of Packer Build

  17. So What Just Happened?
    - The whole process: take the latest Golden AMI and launch it with user-data.
    - Run Puppet on the machine.
    - Using the provision.sh script, run tests that return exit codes. Packer looks at the exit codes
    to decide whether to abort the build or continue.
    - If successful, prepare the image for bundling.
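
    "Prepare the image for bundling" means removing per-instance state, following the cleanup list on slide 11 (SSH keys, log files, CloudInit data). A minimal Python sketch: the exact paths are assumptions about an Ubuntu layout, and everything is resolved under a `root` argument so it can be tried against a scratch directory instead of a live machine.

```python
import glob
import os
import shutil

# Assumed Ubuntu locations of per-instance state (hypothetical list; adjust per image).
SSH_HOST_KEYS = "etc/ssh/ssh_host_*"
LOG_GLOB = "var/log/**/*.log"
CLOUDINIT_STATE = ["var/lib/cloud/instance", "var/lib/cloud/instances"]


def prepare_for_bundling(root="/"):
    """Delete host keys and CloudInit state, truncate logs; return the paths touched."""
    touched = []
    for key in glob.glob(os.path.join(root, SSH_HOST_KEYS)):
        os.remove(key)  # expected to be regenerated on each new instance's first boot
        touched.append(key)
    for log in glob.glob(os.path.join(root, LOG_GLOB), recursive=True):
        if os.path.isfile(log):
            open(log, "w").close()  # truncate rather than delete, keeping owner and mode
            touched.append(log)
    for rel in CLOUDINIT_STATE:
        path = os.path.join(root, rel)
        if os.path.islink(path):  # 'instance' is usually a symlink into 'instances'
            os.remove(path)
            touched.append(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)
            touched.append(path)
    return touched
```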

  18. Parameterized Build Pipeline
    Jenkins
    • Takes parameters to launch a build
    • Uses ‘Packer Wrapper’ to query for Golden AMI
    • Triggers downstream jobs on success / failure
    • If the AMI builds successfully, trigger a test run.
    - The Jenkins job is a parameterized build pipeline.
    - Create parameters describing what type of image to create.
    - Want a new image? Copy any current job and modify the parameters. A little cumbersome,
    but it works.
    - Jenkins will do multi-step builds: create the AMI, launch the AMI, do some testing (Ex: does the
    code work?), then tag the image as production, etc.
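
    The multi-step job behaves like a chain in which a failure stops everything downstream. A minimal sketch of that control flow; the step names are hypothetical stand-ins for the Jenkins jobs, and each step returns True on success, the way an exit code of 0 does.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (succeeded, completed_step_names) so a caller can decide which
    downstream job (success or failure handler) to trigger.
    """
    completed = []
    for name, step in steps:
        if not step():
            return False, completed
        completed.append(name)
    return True, completed


# Hypothetical stages mirroring the slide; real ones would call Packer, EC2 and BATS.
pipeline = [
    ("build_ami", lambda: True),
    ("launch_ami", lambda: True),
    ("run_tests", lambda: False),  # simulate a failing test run
    ("tag_production", lambda: True),
]
ok, done = run_pipeline(pipeline)
print(ok, done)
```

    The failing test run means `tag_production` never executes, which is exactly the guarantee you want before an image is marked as production.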

  19. import os

    from cloudinit import util

    # `puppet` (the Puppet wrapper object) and `do_puppet_run` are site-specific
    # helpers that are not shown on the slide.


    def handle(name, cfg, cloud, log, args):
        # Example user-data:
        # puppet:
        #   run_on_boot: true
        #   shutdown_on_error: true
        #   shutdown_timer: 5
        config = {}
        # Create puppet object
        p = puppet.puppet()
        if 'puppet' in cfg:
            if isinstance(cfg['puppet'], dict):
                config = cfg['puppet']
        # Puppet CloudInit result file: provision.sh waits for this during the build
        puppet_status = os.path.join(cloud.get_cpath('data'), 'puppet')
        run_on_boot = util.get_cfg_option_bool(config, 'run_on_boot', True)
        shutdown_on_error = util.get_cfg_option_bool(config, 'shutdown_on_error', True)
        if run_on_boot:
            try:
                log.info('Puppet: running puppet')
                do_puppet_run(p, log)
                log.info('Puppet: convergence successful')
                # 0o644 works on Python 2.6+ and 3; the bare 0644 literal is Python 2 only
                util.write_file(puppet_status, "ok\n", 0o644)
            except Exception:
                log.error('Puppet: failed to converge')
                util.write_file(puppet_status, "failed\n", 0o644)
                if shutdown_on_error:
                    log.error('Puppet: shutting down instance')
                    # (the rest of the handler is cut off on the slide)
    CloudInit Handlers
    - CloudInit allows runtime modification.
    - Example of a CloudInit handler for Puppet.
    - Any Python you write can take user-data in the form of strings, booleans, etc.
    - We use it to turn features on / off. Ex: create Route53 DNS entries on boot, or choose which Puppet
    group to provision.
    - Should you shut down or terminate if Puppet cannot converge? Otherwise you have a bad box out there for
    a while.
    - Phone home to notify a system that we are ready for operation. You name it.
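
    Turning features on and off from user-data mostly comes down to reading loosely typed values: YAML may deliver `true`, `"yes"` or `1`. A sketch of a hypothetical Route53 toggle, written standalone so it runs outside CloudInit; inside a real handler, `util.get_cfg_option_bool` (as used in the handler above) plays the role of `as_bool`. The `route53` section and its keys are assumptions.

```python
TRUE_STRINGS = {"true", "yes", "on", "1"}


def as_bool(value, default=False):
    """Coerce a user-data value (bool, number or string) to a bool."""
    if value is None:
        return default
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return value
    if isinstance(value, (int, float)):
        return value != 0
    return str(value).strip().lower() in TRUE_STRINGS


def route53_settings(cfg):
    """Read a hypothetical 'route53' section, e.g. {'register': 'yes', 'zone': 'example.com'}."""
    section = cfg.get("route53") or {}
    return {
        "register": as_bool(section.get("register"), default=True),
        "zone": section.get("zone"),
    }
```

    During a Packer build the user-data would set `register: false`, while the launch configuration for real instances leaves it at the default.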

  20. Puppet
    Configuration Management
    • Still use a configuration management product to manage
    state on baked images. Repeatable process. Deterministic.
    • Works with CloudInit to pull down the modules required to
    make this image. Not much different from the `Fry` model.
    • Beware: dynamic variables (IP addresses, hostnames)
    - Beware of dynamic, fact-driven templating.
    - The system you launch will not have the same IP address, hostname, availability zone, etc.
    - You still need run-time modification of these attributes. You can run something lightweight on
    boot, or use CloudInit to manage these types of files.
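
    One lightweight boot-time fix for per-instance facts is to re-render the files that embed them instead of baking them in. A sketch that produces the local entries of `/etc/hosts` from the instance's current hostname and IP; the template is an assumption, and the function only returns text, leaving where the values come from (e.g. the EC2 instance metadata service) and where to write the result to the caller.

```python
def render_hosts(hostname, ip, domain=None):
    """Return /etc/hosts content naming this instance, from values known only at boot."""
    names = [f"{hostname}.{domain}", hostname] if domain else [hostname]
    return "\n".join([
        "127.0.0.1 localhost",
        f"{ip} {' '.join(names)}",
        "",  # trailing newline
    ])
```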

  21. #!/usr/bin/env bats

    @test "addition using bc" {
      result="$(echo 2+2 | bc)"
      [ "$result" -eq 4 ]
    }

    @test "addition using dc" {
      result="$(echo 2 2+p | dc)"
      [ "$result" -eq 4 ]
    }
    BATS
    - Drop-dead simple.
    - Uses exit codes and bash.
    - For example, when testing Puppet, we parse last_run_summary.yaml and determine whether we
    have failed resources / classes.
    - Can look for other things, such as ‘Does the mount point exist?’, ‘Is the package installed?’, ‘Is the
    service running and listening on its network port?’, ‘Can we query the web health interface?’. Packer works
    with exit codes, so this fits well.
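
    The Puppet check can be a few lines of code. A sketch in Python rather than bash that reads the `resources` and `events` sections of Puppet's `last_run_summary.yaml` and reports failure through its return code. It deliberately parses only the simple two-level `section: / key: value` layout of that file instead of pulling in a YAML library, and since the file's location varies by Puppet version, the path is left to the caller.

```python
def parse_summary(text):
    """Parse the two-level 'section:' / 'key: value' layout into nested dicts of ints."""
    summary, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "---":
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value == "":
            section = summary.setdefault(key, {})  # a header such as 'resources:'
        elif section is not None:
            try:
                section[key] = int(value)
            except ValueError:
                pass  # floats and version strings are irrelevant to this check


    return summary


def puppet_failed(summary):
    """True if any resource failed or failed to restart, or any event failed."""
    resources = summary.get("resources", {})
    events = summary.get("events", {})
    return bool(resources.get("failed", 0) or resources.get("failed_to_restart", 0)
                or events.get("failure", 0))


def check(path):
    """Exit status for a BATS test or Packer step: 0 if Puppet converged cleanly."""
    with open(path) as fh:
        return 1 if puppet_failed(parse_summary(fh.read())) else 0
```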

  22. Lots, and lots of appliances.
    We Have Appliances!
    • EC2 console becomes pretty much unusable
    • Console does not encapsulate your business rules, and image
    management practices
    • What’s the latest Golden AMI for 12.04?
    - We wish we had known about this before we started rolling this out.
    - Users have a hard time finding images when there is an explosion of them.
    - It is not so simple to launch a new instance now. You need something that can query your metadata
    and tags about image state.

  23. Tools
    - Internal tool. Still a work in progress.
    - We have command-line tools at this time, but not many web interfaces.
    - Influenced the cloud image finder on the Canonical website.
    - Filter for the images you would like. Give a big red launch button.
    - We need more work on a unified tool that represents our business logic. Ex:
    Asgard.

  24. Janitor Your Packer Builds
    Image Management
    • Packer only builds images. It does not attempt to manage them
    in any way. After they're built, it is up to you to launch or
    destroy them as you see fit.
    • Process like Netflix ‘JanitorMonkey’ to clean up unused
    images / snapshots. Tags used heavily to influence the process
    • Cleanup candidate: image not in use by any ASG launch configuration, and not a
    ‘Golden’ AMI
    • https://github.com/Netflix/SimianArmy/blob/master/src/main/java/com/netflix/simianarmy/aws/janitor/rule/ami/UnusedImageRule.java
    - Not so simple. Lots of subtle business rules.
    - When did an instance last launch with this AMI? Track this information, and clean up after
    a set number of days.
    - You cannot delete an AMI that is part of an ASG, or a Base AMI.
    - Multi-region or multi-account cleanup. Look at Janitor Monkey for lots of good examples.
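
    The business rules above translate into a filter over image metadata. A sketch assuming hypothetical inputs: each image record carries its ID, creation time, an optional `golden` flag derived from tags, and the last time an instance launched from it (which, as noted above, you have to track yourself); the set of AMIs referenced by ASG launch configurations is gathered elsewhere. It only selects candidates and deletes nothing.

```python
from datetime import timedelta


def cleanup_candidates(images, in_use_ids, now, max_idle_days=30):
    """Return IDs of AMIs that the janitor rules allow to be deregistered.

    An image is kept if it is Golden/Base, referenced by an ASG launch
    configuration, or used (or created) within the last `max_idle_days`.
    The 30-day default is illustrative, not a recommendation.
    """
    cutoff = now - timedelta(days=max_idle_days)
    doomed = []
    for img in images:
        if img.get("golden") or img["id"] in in_use_ids:
            continue
        # Fall back to creation time so brand-new, never-launched images survive.
        last = img.get("last_launched") or img["created"]
        if last < cutoff:
            doomed.append(img["id"])
    return doomed
```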

  25. Future Work
    • EBS images for speed of creation
    • Generic ‘Application AMI’
    • Would allow for ‘Just add your configuration’ images. Ex: Jetty.
    • More work on Janitor process
    • Business workflow UI (Ex: Asgard)
    - Foundation AMI is the core operating system you will build from. Barely any modifications
    at this stage.
    - An example would be the Ubuntu Cloud Images.
    - Modifications to the Foundation AMI then create our ‘Base’ AMI for an application stack.
    - Example: Foundation + Jetty = Jetty Base AMI, which can then be used to install ‘Jetty’ applications.
    - Example 2: Foundation + Hadoop + HBase RegionServer = RegionServer Base AMI.
    - Adding code + configuration creates an ‘application AMI’. You can launch these
    instances and they will go to work on launch.
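
    Because each build records its parent in an `ancestor` tag (as in the packer.json on slide 11), the Foundation, Base and Application layering can be reconstructed by walking those tags. A sketch over an in-memory map of image ID to tags; the IDs are made up, and the last element may be an image you hold no tags for, such as an upstream Ubuntu Cloud Image.

```python
def lineage(image_id, tags_by_image):
    """Walk 'ancestor' tags from an image back to its root.

    Returns the chain starting at `image_id`. It stops at an image with no
    known ancestor (e.g. an untagged Foundation AMI) or if a cycle appears.
    """
    chain, seen = [], set()
    current = image_id
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = tags_by_image.get(current, {}).get("ancestor")
    return chain
```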

  26. - We are hiring. http://about.pinterest.com
