Configuring Autoscaled Instances via Ansible and Ansible Tower

Don't spend your time baking AMIs! As your autoscaled boxes come online, have them get their configuration at startup.

Jim Altieri

July 23, 2015

Transcript

  1. Deploying and Configuring Autoscale Groups Using Ansible Tower ...when you're

    way too impatient to wait for an image to build Jim Altieri - software engineer at AllTrails [email protected]
  2. Me: [email protected] • Programmer for artists and musicians • Previous

    job: ultra-HA events, hybrid cloud/metal • Current job: AllTrails
  3. AllTrails • Tools (web and mobile) to help you in

    all phases of hikes and other outdoor activities: discovery, planning, doing, and sharing. • 2m+ monthly users • Unusual traffic pattern (weekends and holidays) • Small engineering team - multiple roles. On weekends and holidays we can see up to 10x our baseline traffic.
  4. Spoiler Alert • Our DevOps progression • The tricky problems

    we encountered • Our solutions • Tips / Tricks • Q & A
  5. Here’s the basic architecture we’ll be referring to throughout the

    presentation. Two different types of boxes: Front End and Back End. All boxes are in communication with a database and an ElastiCache cluster. The Front End boxes are attached to an ELB. Where we started in our DevOps journey - a developer would run a script from her/his box, which talked to a hosted Chef server, which knew about all of our long-running EC2 instances. It would deploy code to the running boxes, and restart the apps.
  6. Manual Rolling Deploys • Manual?!?! • When things go wrong…

    • Way over-provisioned • Except when it wasn’t 1. No need for manual deploys - use CI/CD as a layer of abstraction: devs write code, the pipeline handles the rest. 2. Many of you have encountered this - things go wrong and you end up in a weird state. Some boxes have the new version, some have the old, some are incomplete? A mess. 3. And with a fixed fleet of instances, we had to over-provision in order to handle our peak traffic.
  7. Ansible • Agentless - only need sshd • Declarative -

    YAML files • Variables, roles, playbooks • “Idempotent”
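    As a rough illustration of those bullets, here is a minimal playbook sketch (the host group, package, and file names are illustrative, not from the deck) - running it twice changes nothing the second time:

    ---
    - name: Configure a front end box
      hosts: front_end
      become: yes
      tasks:
        - name: Install nginx (no-op if already present)
          apt:
            name: nginx
            state: present

        - name: Drop in the site config from a Jinja2 template
          template:
            src: nginx_site.conf.j2
            dest: /etc/nginx/sites-enabled/default
          notify: restart nginx

      handlers:
        - name: restart nginx
          service:
            name: nginx
            state: restarted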
  8. An example of an Ansible role which tries to download

    a file from S3, and if unable to do that, will precompile assets. Notice the mustache templated variables…
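    The role itself isn’t reproduced in the transcript; a minimal sketch of the idea might look like the tasks below (the bucket, paths, and variable names are illustrative, not taken from the slide):

    - name: try to fetch precompiled assets from S3
      s3:
        bucket: "{{ assets_bucket }}"
        object: "assets/{{ git_commit }}.tar.gz"
        dest: "/tmp/assets-{{ git_commit }}.tar.gz"
        mode: get
      register: asset_download
      ignore_errors: yes

    - name: unpack the downloaded assets
      command: tar xzf /tmp/assets-{{ git_commit }}.tar.gz -C {{ app_root }}/public
      when: asset_download|success

    - name: fall back to precompiling assets on the box
      command: bundle exec rake assets:precompile chdir={{ app_root }}
      when: asset_download|failed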
  9. Working in Ansible Tower turns DevOps into a game of

    “turn the donut orange (or green).” No red donuts == happy.
  10. Blue-Green? • Let your new deploy be a completely separate

    stack. • Bring it up while letting the previous stack continue to work. • Only once the new site is connected and working do you tear down the old stack.
  11. Same general architecture. The rounded rectangle containers represent CloudFormation stacks.

    Each CF stack has a complete set of all EC2 instances needed. Ansible brings up the stack…
  12. Now the new boxes are connected to the ELB, db

    and ElastiCache. There is a brief period where both new and old are serving traffic. We haven’t found a good way to avoid this and still provide zero downtime deploys.
  13. The old CF stack is brought down (atomically) and we’re

    left with an identical architecture, but with new code.
  14. Autoscaling Deploys • Already no state on boxes, thanks to

    Blue-Green • No more over-provisioning for peak traffic Refactoring your app for a blue-green deploy will require you to remove state from your instances. Which is handy, because you also need that for autoscaling.
  15. So what’s the problem? • Autoscaled boxes are launched asynchronously

    - how do we know when they should be configured? • How do we accommodate different roles for different boxes? Asynchronous - on deploying, you create the ASG, and as soon as the group is created, it’s “there,” but the individual boxes come up on their own time. Also, as new autoscaled instances come online during scale-out events, this is completely outside of whatever our original provisioning flow might be. And how do we handle different types of boxes coming up? How do we do autoscaling and still have these different box types?
  16. Make separate AMIs? • Aminator (outdated, limited) • Packer (slow!)

    • Docker? (the containers still need to be provisioned - just passing the buck) The traditional route is “baking” an AMI for each of your box types and using that AMI in your LaunchConfiguration. Aminator - outdated, limited to certain Linux distributions. Packer - slow; the workflow is to spin up a box, provision it, and image it. Docker - we haven’t looked into it, but it seems promising as an extra layer of abstraction/indirection between code and the infrastructure it runs on.
  17. Step 1 • Install any users, packages, or configuration files

    that are needed for all boxes in your stack. • Use Ansible for this, too! • Make a common “meta” role. Make a common base AMI
  18. roles/common/meta/main.yml

    ---
    dependencies:
      - role: rbenv
      - role: ruby
      - role: opencv
      - role: …

    Using the “meta” folder of a role allows you to better match your roles to your mental model. It also allows for a level of indirection.
  19. playbook

    ---
    - name: Spin up box
      hosts: localhost     # implied on the slide; the box is launched from the control machine
      roles:
        - bare_box

    - name: Set up deploy user
      hosts: new_box
      roles:
        - base_packages
        - deploy_user

    - name: configure for role
      hosts: new_box
      remote_user: deploy
      vars:
        ansible_ssh_private_key_file: *LOCAL PATH*
      roles:
        - common
  20. playbook (continued)

    - name: create image
      hosts: new_box
      remote_user: deploy
      vars:
        ansible_ssh_private_key_file: …
      tasks:
        - action: ec2_facts
        - local_action: ec2_ami
          args:
            instance_id: "{{ ansible_ec2_instance_id }}"
            region: "us-west-1"
            name: "{{ image_name }}"
          when: create_image

    - name: Terminate instances
      hosts: localhost
      connection: local
      tasks:
        - local_action: ec2
          args:
            state: 'absent'
            instance_ids: '{{ ec2.instance_ids }}'
            region: "us-west-1"
          when: terminate_on_finish
  21. Step 2 • On tests passing, launch an Ansible Tower

    job to provision the stack using CloudFormation (from Travis, or whatever) • “Bake” needed info into LaunchConfigs and ASGs at this point. • Two places where we need to put info: EC2 tags, and the Ansible facts dir • The “UserData” field will run a script at startup
  22. EC2 Tags

    "FrontEndASG" : {
      "Type" : "AWS::AutoScaling::AutoScalingGroup",
      …
      "Tags" : [
        {"Key" : "AnsibleRole", "Value" : "FrontEnd", "PropagateAtLaunch" : "true"},
        {"Key" : "Environment", "Value" : {"Ref" : "DeployEnvironment"}, "PropagateAtLaunch" : "true"},
        {"Key" : "SomethingASGSpecific", "Value" : "foo", "PropagateAtLaunch" : "false"}],
      …
    }

    When declaring ASGs in a CloudFormation template, you can tag the ASG and declare whether or not each tag should be passed on to its instances. Notice here that we are putting tags on the instances that identify the role each will be playing and the environment it belongs in.
  23. /etc/ansible/facts.d/ • Any YAML or JSON files in this directory

    are available to Ansible tasks running on the machine • Bake them on during CloudFormation provisioning using LaunchConfiguration metadata

    /etc/ansible/facts.d/deploy_params.fact:
    {"git_commit": "af7b666…"}

    playbook:
    …
    vars:
      commit: "{{ ansible_local.deploy_params.git_commit }}"
    …

    There is a magical location on boxes - the Ansible facts.d directory. Any JSON or YAML files with the extension “.fact” will be read in by any Ansible plays running on that box (assuming you’re gathering facts). You can bake files onto autoscaled instances while provisioning via CloudFormation by putting the file contents in the metadata of the LaunchConfiguration, and then running cfn-init from the UserData startup script.
  24. Step 3 • The job to configure a box is

    launched via the callback API • This happens both during deployment AND during scale-out events • Have all your boxes call the same job, and let the information you have now baked onto the boxes work its magic
  25. (nearly) Agentless • Ansible is agentless - the computer being

    configured doesn’t need anything except sshd. • Ansible Tower offers a great “phone home” option - jobs can be kicked off via a REST API. • The only thing we need to run on the box being configured is: curl <endpoint>
  26. phoning home • call request_tower_configuration.sh from the UserData startup script The

    kind folks at Ansible provide a script that wraps the curl call and retries it if it fails for some reason.
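    Under the hood the callback boils down to a single POST authenticated by a host config key. A rough equivalent as an Ansible play (the Tower hostname, job template id, and host_config_key here are placeholders; the wrapper script adds the retry logic):

    - name: phone home to Ansible Tower
      hosts: localhost
      connection: local
      tasks:
        - name: hit the job template's provisioning callback
          uri:
            url: "https://{{ tower_host }}/api/v1/job_templates/{{ job_template_id }}/callback/"
            method: POST
            body: "host_config_key={{ host_config_key }}"
            headers:
              Content-Type: "application/x-www-form-urlencoded"
            status_code: [201, 202]   # Tower answers with a 2xx once the job is queued
            validate_certs: no        # often needed for a self-signed Tower certificate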
  27. playbook

    - name: "configure front end boxes for role"
      hosts: "tag_AnsibleRole_FrontEnd:&tag_Environment_{{ deploy_env }}"
      vars:
        db_host: "{{ ansible_local.deploy_params.db_host }}"
      roles:
        - front_end

    - name: "configure backend boxes for role"
      hosts: "tag_AnsibleRole_BackEnd:&tag_Environment_{{ deploy_env }}"
      vars:
        db_host: "{{ ansible_local.deploy_params.db_host }}"
      roles:
        - backend

    This same playbook gets called by all boxes, but it will only run roles and tasks on the hosts that match the tags. Notice the front_end and backend roles - these utilize the meta dependencies similarly to the common role.
  28. Boxes configured! • Step 1: Build Common Image (only once

    in a while) • Step 2: Launch stack via CloudFormation, baking in necessary information • Step 3: Phone home to Tower, let Tower configure new box(es)
  29. ResourceSignals • Tell CloudFormation that a resource isn’t successfully created

    unless it’s received the proper signals

    "FrontEndASG" : {
      "Type" : "AWS::AutoScaling::AutoScalingGroup",
      "CreationPolicy": {
        "ResourceSignal": {
          "Count": { "Fn::FindInMap" : [ "AsgDesiredSizes", "FrontEnd", {"Ref" : "DeployEnvironment"} ]},
          "Timeout": "PT20M"}},
      …

    This way, if something goes wrong, the job setting up the new CloudFormation stack will fail, and you’ll never tear down the old one.
  30. ResourceSignals • Tell CloudFormation that a resource isn’t successfully created

    unless it’s received the proper signals

    cfn-signal --stack {{ cf_stack_name }} --resource {{ cf_resource_name }} --region us-west-1

    This way, if something goes wrong, the job setting up the new CloudFormation stack will fail, and you’ll never tear down the old one.
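    One way to run that command from the configuration play itself, as a sketch (cf_stack_name and cf_resource_name would come from the baked-on facts or tags):

    - name: signal CloudFormation that this instance is configured
      command: >
        cfn-signal
        --stack {{ cf_stack_name }}
        --resource {{ cf_resource_name }}
        --region us-west-1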
  31. Waiting for 200s • If your Ansible job is setting

    up a web server, have it wait to make sure all of the boxes in the front end ASG are returning 200s before moving on.

    - name: wait for site to start returning 200s
      local_action: uri
      args:
        url: "http://{{ ansible_ec2_public_hostname }}/"
      register: result
      until: result|success
      retries: 30
      delay: 10

    Again, don’t tear down your old site until your new site is working!
  32. Notify all the ASG things • Utilize ASG notifications. Publish

    to an SNS topic, and use Lambda to subscribe and post to your chat API of choice.

    "NotificationConfigurations" : [
      {"NotificationTypes" : [
         "autoscaling:EC2_INSTANCE_LAUNCH",
         "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
         "autoscaling:EC2_INSTANCE_TERMINATE",
         "autoscaling:EC2_INSTANCE_TERMINATE_ERROR"],
       "TopicARN" : <SNS ARN> }]
  33. RoR folks: precompile separately • If precompiling assets takes a

    long time, have your CI system spin up a box and precompile while the tests are running. • When it’s done, it can upload to a known place in S3, and from then on, any box that needs the assets can just download them, rather than having to precompile
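    The upload half of that flow isn’t shown in the deck; a sketch of what the CI-side tasks could look like (bucket and path names are illustrative, mirroring the earlier download sketch):

    - name: precompile assets while the tests run
      command: bundle exec rake assets:precompile chdir={{ app_root }}

    - name: tar up the compiled assets
      command: tar czf /tmp/assets-{{ git_commit }}.tar.gz -C {{ app_root }}/public .

    - name: upload the bundle to a known place in S3
      s3:
        bucket: "{{ assets_bucket }}"
        object: "assets/{{ git_commit }}.tar.gz"
        src: "/tmp/assets-{{ git_commit }}.tar.gz"
        mode: put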