all phases of hikes and other outdoor activities: discovery, planning, doing, and sharing.
• 2m+ monthly users
• Unusual traffic pattern (weekends and holidays)
• Small engineering team - multiple roles
On weekends and holidays we can see up to 10x our baseline traffic.
presentation. Two different types of boxes: Front End and Back End. All boxes talk to a database and an ElastiCache cluster. The Front End boxes sit behind an ELB.
Where we started in our DevOps journey: a developer would run a script from her/his box, which talked to a hosted Chef server, which knew about all of our long-running EC2 instances. It would deploy code to the running boxes and restart the apps.
• Way over-provisioned
• Except when it wasn’t
1. No need for manual deploys - use CI/CD as a layer of abstraction. Devs write code, they see code.
2. Many of you have encountered this - things go wrong, weird state. Some boxes have the new version, some have the old, some are incomplete? A mess.
3. And with a fixed fleet of instances, we had to over-provision in order to handle our peak traffic.
and ElastiCache. There is a brief period where both the new and old stacks are serving traffic. We haven’t found a good way to avoid this and still provide zero-downtime deploys.
Blue-Green
• No more over-provisioning for peak traffic
Refactoring your app for a blue-green deploy will require you to remove state from your instances - which is handy, because you also need that for autoscaling.
- how do we know when they should be configured?
• How do we accommodate different roles for different boxes?
Asynchronous - on deploying, you create the ASG, and as soon as the group is created, it’s “there” - but the individual boxes come up on their own time. Also, as new autoscaled instances come online during scale-out events, they are completely outside of whatever our original provisioning flow might be. And - how do we handle different types of boxes coming up? How do we do autoscaling and still have these different boxes?
• Docker? (the containers still need to be provisioned - just passing the buck)
The traditional route is to “bake” an AMI for each of your box types, and just use that AMI in your LaunchConfiguration.
Aminator - outdated, limited to certain Linux distros
Packer - slow. The workflow is to spin up a box, provision it, and image it.
Docker - haven’t looked into it, but it seems promising as an extra layer of abstraction / indirection between code and the infrastructure it’s running on
roles/common/meta/main.yml
  dependencies:
    - role: opencv
    - role: …
Using the “meta” folder of a role allows you to better match your roles to your mental model. It also allows for a level of indirection.
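For illustration only - a sketch of what a role’s meta dependencies can look like; apart from common and opencv (mentioned on these slides), the role names are hypothetical:

roles/front_end/meta/main.yml
---
dependencies:
  - role: common        # shared setup every box gets (hypothetical - the deck doesn’t show front_end’s actual list)
  - role: nginx         # hypothetical front-end-only dependency

With dependencies declared in meta, a play only needs to name the top-level role for a box type and Ansible applies the whole chain - that’s the level of indirection mentioned above.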
- name: Set up deploy user
  hosts: new_box
  roles:
    - base_packages
    - deploy_user

- name: configure for role
  hosts: new_box
  remote_user: deploy
  vars:
    ansible_ssh_private_key_file: *LOCAL PATH*
  roles:
    - common

playbook
job to provision the stack using CloudFormation (from Travis, or whatever) - sketched below
• “Bake” needed info into LaunchConfigs and ASGs at this point.
• Two places where we need to put info: EC2 metadata tags, and the Ansible facts dir!
• “UserData” will run a script on startup
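One way to drive that provisioning job from CI is Ansible’s cloudformation module - a minimal sketch, assuming hypothetical stack and template names (the deck doesn’t show how the stack actually gets created):

- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - name: create or update the CloudFormation stack
      cloudformation:
        stack_name: "myapp-{{ deploy_env }}"        # hypothetical naming scheme
        state: present
        region: us-west-1
        template: cloudformation/stack.json          # hypothetical template path
        template_parameters:
          DeployEnvironment: "{{ deploy_env }}"
          GitCommit: "{{ git_commit }}"              # hypothetical parameter feeding the baked facts

The module waits for stack creation to finish, so a failed CreationPolicy (later slides) fails this job and the old stack stays untouched.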
: [ {"Key" : “AnsibleRole", "Value" : “FrontEnd", "PropagateAtLaunch" : "true"}, {"Key" : “Environment", "Value" : {"Ref" : “DeployEnvironment"}, "PropagateAtLaunch" : “true"}, {"Key" : “SomethingASGSpecific", "Value" : “foo”, "PropagateAtLaunch" : “false”}], … } In declaring ASGs in a cloudformation, you can tag the ASG and declare whether or not the tag should be passed on to its instances Notice here that we are putting tags on the instances that identify the role it will be playing and the environment that it belongs in
are available to Ansible tasks running on the machine
• Bake on during CloudFormation provisioning using LaunchConfiguration metadata

/etc/ansible/facts.d/deploy_params.fact
{"git_commit": "af7b666…"}

playbook:
…
  vars:
    commit: "{{ ansible_local.deploy_params.git_commit }}"
…

There is a magical location on boxes - the Ansible facts.d directory. Any JSON or YAML file with the extension “.fact” will be read in by any Ansible play running on that box (assuming you’re gathering facts). You can bake files onto autoscaled instances while provisioning via CloudFormation by putting the file info in the metadata of the LaunchConfig, and then running cfn-init from the UserData startup script.
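A minimal sketch of that wiring, written in CloudFormation YAML for readability (the other templates in this deck are JSON; names like GitCommit and BaseAmiId are hypothetical parameters):

FrontEndLaunchConfig:
  Type: AWS::AutoScaling::LaunchConfiguration
  Metadata:
    AWS::CloudFormation::Init:
      config:
        files:
          /etc/ansible/facts.d/deploy_params.fact:
            content:
              git_commit: !Ref GitCommit      # hypothetical template parameter
            mode: "000644"
  Properties:
    ImageId: !Ref BaseAmiId                    # hypothetical parameter
    InstanceType: t2.medium
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash -xe
        # write the files declared in AWS::CloudFormation::Init above
        /opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource FrontEndLaunchConfig --region us-west-1
        # then phone home to kick off configuration (the Tower callback on the next slides)
        curl <endpoint>

cfn-init pulls the resolved Metadata from CloudFormation, so the fact file lands on every instance the ASG launches - including boxes launched later by scale-out events.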
launched via the callback API
• This happens both during deployment AND during scale-out events
• Have all your boxes call the same job, and let the information that you have now baked onto the boxes work its magic
configured doesn’t need anything except sshd.
• Ansible Tower offers a great “phone home” option - jobs can be kicked off via a REST API.
• The only action we need to take on the box that needs to be configured is: curl <endpoint>
"tag_AnsibleRole_FrontEnd:&tag_Environment_{{ deploy_env }}” vars: db_host: "{{ ansible_local.deploy_params.db_host }}" roles: - front_end - name: "configure backend boxes for role" hosts: "tag_AnsibleRole_BackEnd:&tag_Environment_{{ deploy_env }}” vars: db_host: "{{ ansible_local.deploy_params.db_host }}” roles: - backend playbook this same playbook gets called by all boxes, but it will only run roles and tasks on the hosts that match the tags notice the roles called front_end and app_box - these utilize the meta dependencies similarly to the common role
unless it’s received the proper signals

"FrontEndASG" : {
  "Type" : "AWS::AutoScaling::AutoScalingGroup",
  "CreationPolicy": {
    "ResourceSignal": {
      "Count": { "Fn::FindInMap" : [ "AsgDesiredSizes", "FrontEnd", {"Ref" : "DeployEnvironment"} ]},
      "Timeout": "PT20M"}},

this way, if something goes wrong, the job setting up the new CloudFormation stack will fail, and you’ll never tear down the old one
unless it’s received the proper signals

cfn-signal --stack {{ cf_stack_name }} --resource {{ cf_resource_name }} --region us-west-1

this way, if something goes wrong, the job setting up the new CloudFormation stack will fail, and you’ll never tear down the old one
up a web server, have it wait to make sure all of the boxes in the front end ASG are returning 200s before moving on.

- name: wait for site to start returning 200s
  local_action: uri
  args:
    url: "http://{{ ansible_ec2_public_hostname }}/"
  register: result
  until: result|success
  retries: 30
  delay: 10

again, don’t tear down your old site until your new site is working!
to an SNS topic, and use Lambda to subscribe and post to your chat API of choice.

"NotificationConfigurations" : [
  {"NotificationTypes" : [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE",
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR"],
   "TopicARN" : <SNS ARN> }]
long time, have your CI system spin up a box and precompile while the tests are running.
• When it’s done, it can upload to a known place in S3, and from then on, any box that needs the assets can just download them, rather than having to precompile.
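A minimal sketch of the download side as an Ansible task, assuming a hypothetical bucket and an asset layout keyed by the git commit baked into the local facts:

- name: fetch precompiled assets from S3
  s3:
    bucket: myapp-precompiled-assets                                         # hypothetical bucket
    object: "assets/{{ ansible_local.deploy_params.git_commit }}.tar.gz"     # hypothetical key layout
    dest: /tmp/assets.tar.gz
    mode: get

- name: unpack assets into the app’s public dir
  unarchive:
    src: /tmp/assets.tar.gz
    dest: /var/www/myapp/public                                              # hypothetical path
    remote_src: yes

Keying the object by commit (as assumed here) means every box in a given stack pulls the same bundle, whether it came up during the deploy or during a later scale-out.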