A Year with ECS

Slide 1

Slide 1 text

A Year With ECS Greg Poirier CTO - Opsee

Slide 2

Slide 2 text

Origin Story • Fifteen years of operations experience • Observability, infrastructure architecture • Operations/Systems/Software engineer

Slide 3

Slide 3 text

Opsee • Effortless monitoring for your AWS environment • Reacts to changes in your infrastructure • Monitoring that stays out of the way

Slide 4

Slide 4 text

Infrastructure: What do we want? • As little code as possible • Reproducibility • Easy engineer onboarding • Automate all the things

Slide 5

Slide 5 text

Development Process • Open PR • Run build against PR • Merge to master • Build master and publish artifacts • Deploy

Slide 6

Slide 6 text

Reproducibility • If it builds in CI, it should build locally. • CI should not be a magical, mystical place that makes tests pass. • In case of emergency, break glass and deploy from my laptop to production with confidence.

Slide 7

Slide 7 text

I kind of like Docker.

Slide 8

Slide 8 text

Docker? • "Docker is great for local dev." • "I don't do anything in production with Docker yet." • "I'm not sure Docker is ready for production."

Slide 9

Slide 9 text

Really?

Slide 10

Slide 10 text

One year of Docker and ECS later… • A year of Docker in production • Started with CoreOS's Fleet • Migrated to ECS a year ago

Slide 11

Slide 11 text

Why Fleet? • It was there (we use CoreOS) • Trivial to set up/use (already running) • It did a thing (deployed services to VMs)

Slide 12

Slide 12 text

What could go wrong?

Slide 13

Slide 13 text

Highly problematic. • Etcd outage -> cannot deploy • Etcd and Docker share same disk space by default • Systemd • (Very) Bad default configurations

Slide 14

Slide 14 text

How do I even Docker? • Container fills up /var/lib/docker • Node crashes • Naïve scheduling in Fleet causes containers to be scheduled to new nodes • Next node crashes…

Slide 15

Slide 15 text

Ffffffffffffffleet

Slide 16

Slide 16 text

We could fix Fleet, but… • Seed-stage startup • Must focus solely on our own product • Just Make it Work Mode

Slide 17

Slide 17 text

What are we trying to do here?

Slide 18

Slide 18 text

Make Building and Deploying Easy • Commit • Push • Merge • Deploy

Slide 19

Slide 19 text

Make building and deploying easy. • Commit • Push • Merge • Done.

Slide 20

Slide 20 text

How do we do it?

Slide 21

Slide 21 text

Components we need. • Container scheduler • Continuous integration • Automated deploys

Slide 22

Slide 22 text

Don't build what you don't sell.

Slide 23

Slide 23 text

What did we use? • Container scheduler - ECS • Continuous integration - CircleCI • Automated deploys - Ansible

Slide 24

Slide 24 text

Allons-y!

Slide 25

Slide 25 text

How do we ECS? • Deploy ecs-agent with CoreOS cloud-config • Service and Task definitions in YAML w/ Ansible • Ansible playbook to deploy Task definitions / trigger service updates

Slide 26

Slide 26 text

ECS on CoreOS w/ cloud-config • Cloud-config w/ EC2 instance userdata • Add a systemd unit and configuration data • We do this with Ansible • https://coreos.com/os/docs/latest/booting-on-ecs.html

Slide 27

Slide 27 text

ECS and Ansible • Task and service definitions in YAML • Small Ansible libraries using boto3 for task/service definitions • Deploy with a simple playbook that creates a task definition and updates the service
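The deploy step above (create a task definition, then update the service) can be sketched in a few lines of Python. This is not the Ansible library from the talk, which the slides do not show; it is a minimal illustration assuming a boto3 ECS client is passed in as `ecs`. The function name `deploy` and its parameters are hypothetical.

```python
def deploy(ecs, cluster, service, family, image, container_name, memory=256):
    """Register a new task definition revision, then point the service at it.

    `ecs` is assumed to be a boto3 ECS client, e.g. boto3.client("ecs").
    All argument values are placeholders chosen for illustration.
    """
    registered = ecs.register_task_definition(
        family=family,
        containerDefinitions=[
            {
                "name": container_name,
                "image": image,
                "memory": memory,
                "essential": True,
            }
        ],
    )
    task_def_arn = registered["taskDefinition"]["taskDefinitionArn"]

    # Updating the service with the new revision makes ECS replace running tasks.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=task_def_arn)
    return task_def_arn
```

Taking the client as an argument keeps the function easy to exercise with a stub before it is pointed at a real cluster.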

Slide 28

Slide 28 text

But not everything is totally automated…

Slide 29

Slide 29 text

It’s true. • Commit • Push • Merge • Push-button deploy

Slide 30

Slide 30 text

What did we do wrong? • Didn’t use CloudFormation • Deployed 'latest' tag • Used default configurations • Poor management of service configuration material

Slide 31

Slide 31 text

If I could turn back time… • Tag Docker image w/ Git SHA • Use CloudFormation: stack all the things • Remember that defaults are usually bad • Configuration…
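Tagging an image with the Git SHA instead of 'latest' might look like the sketch below. The helper names, the repository name, and the 12-character tag length are assumptions made for illustration; the slides do not say how the tag would be built.

```python
import subprocess


def current_git_sha(repo_dir="."):
    """Return the full commit SHA of HEAD in repo_dir (requires git on PATH)."""
    out = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=repo_dir)
    return out.decode().strip()


def image_reference(repository, sha, length=12):
    """Build 'repository:tag' where the tag is a prefix of the commit SHA."""
    sha = sha.lower()
    if len(sha) < length or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError("expected a hexadecimal commit SHA, got %r" % sha)
    return "%s:%s" % (repository, sha[:length])
```

In CI this could be used as `image_reference("example/web", current_git_sha())`, so that every deployed task definition names exactly one build.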

Slide 32

Slide 32 text

What about configuration? • Configuration data stored in S3 with KMS encryption • Startup script pulls config data from S3 • Sources the config data (all in env vars) • Starts the service
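The startup flow above can be sketched roughly as follows. This is not the actual startup script from the talk. It assumes the config object is a file of `KEY=VALUE` lines, that `s3` is a boto3 S3 client, and that the object uses server-side KMS encryption so that a permitted caller receives plaintext from `get_object`. The bucket, key, and function names are hypothetical.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines (optionally prefixed with 'export ') into a dict."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a KEY=VALUE line: %r" % raw)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def start_service(s3, bucket, key, argv):
    """Fetch config from S3, merge it into the environment, then exec the service."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    env = dict(os.environ)
    env.update(parse_env(body))
    # Replace the current process with the service; this call does not return.
    os.execvpe(argv[0], argv, env)
```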

Slide 33

Slide 33 text

What to do instead? • Separate container for config. • Export config file as a volume. • Mount config volume in service’s container.
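One way to express the config-container pattern in an ECS task definition is with `volumesFrom`, sketched here as a Python dict that could be passed to `register_task_definition`. It assumes the config image declares its config directory with a `VOLUME` instruction in its Dockerfile; the image names, container names, and memory values are placeholders.

```python
def task_definition_with_config(family, service_image, config_image):
    """Build a task definition where the service mounts volumes from a config container."""
    return {
        "family": family,
        "containerDefinitions": [
            {
                # Assumed to declare its config directory via VOLUME in the image.
                "name": "config",
                "image": config_image,
                "memory": 16,
                # The config container may exit after starting; it must not be
                # essential or its exit would stop the whole task.
                "essential": False,
            },
            {
                "name": "service",
                "image": service_image,
                "memory": 256,
                "essential": True,
                # Mount every volume of the config container, read-only.
                "volumesFrom": [{"sourceContainer": "config", "readOnly": True}],
            },
        ],
    }
```

With this layout the service image carries no environment-specific configuration, and a config change becomes a new config image rather than a new service build.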

Slide 34

Slide 34 text

Mostly, I am happy… Mostly.

Slide 35

Slide 35 text

What does ECS do well? • Easy to deploy • “Transactional” changes to ECS cluster • ELB Integration • Keeps pace with Docker

Slide 36

Slide 36 text

Let ECS tell you its secrets. • Task-specific information • Exposed ports • Service information • ELB name, Service name, desired count • Metrics • CPU Utilization, Memory utilization
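Much of the information listed above is available from the ECS API. The sketch below assumes `ecs` is a boto3 ECS client and reads the service name, desired count, and ELB names from `describe_services`, and the exposed ports from `describe_tasks`. The function names and the shape of the returned values are illustrative choices, not something the slides specify.

```python
def summarize_service(ecs, cluster, service):
    """Return service name, desired/running counts, and attached ELB names."""
    response = ecs.describe_services(cluster=cluster, services=[service])
    if not response["services"]:
        raise LookupError("service %r not found in cluster %r" % (service, cluster))
    svc = response["services"][0]
    return {
        "service_name": svc["serviceName"],
        "desired_count": svc["desiredCount"],
        "running_count": svc["runningCount"],
        "elb_names": [
            lb["loadBalancerName"]
            for lb in svc.get("loadBalancers", [])
            if "loadBalancerName" in lb
        ],
    }


def exposed_ports(ecs, cluster, task_arns):
    """Return (container name, container port, host port) for each port binding."""
    response = ecs.describe_tasks(cluster=cluster, tasks=task_arns)
    ports = []
    for task in response["tasks"]:
        for container in task["containers"]:
            for binding in container.get("networkBindings", []):
                ports.append(
                    (container["name"], binding["containerPort"], binding["hostPort"])
                )
    return ports
```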

Slide 37

Slide 37 text

Go a step further. • CloudWatch -> Lambda -> Autoscale ECS Services • Compute nodes in ASG • CloudWatch on cluster utilization to scale up compute ASG
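The service-scaling half of this pipeline could be sketched as below: a function a Lambda handler might call once an alarm has decided to scale up or down. The slides do not describe how the event reaches Lambda or what limits were used, so the event parsing is left out, and the `direction` argument, the min/max bounds, and the step size are all assumptions. `ecs` is assumed to be a boto3 ECS client.

```python
def next_desired_count(current, direction, minimum=1, maximum=10, step=1):
    """Move the desired count one step up or down, clamped to [minimum, maximum]."""
    if direction == "up":
        target = current + step
    elif direction == "down":
        target = current - step
    else:
        raise ValueError("direction must be 'up' or 'down', got %r" % direction)
    return max(minimum, min(maximum, target))


def scale_service(ecs, cluster, service, direction, **limits):
    """Read the service's desired count and update it if the target differs."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    current = svc["desiredCount"]
    target = next_desired_count(current, direction, **limits)
    if target != current:
        ecs.update_service(cluster=cluster, service=service, desiredCount=target)
    return target
```

Clamping keeps a flapping alarm from scaling a service to zero or without bound. Scaling the compute ASG on cluster utilization is a separate mechanism and is not shown here.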

Slide 38

Slide 38 text

What is ECS missing? • IP-per-Container w/ ELB integration • Persistent storage • Per-Task/Service IAM roles • Per-Task/Service security groups

Slide 39

Slide 39 text

IP per Container • Don’t have to worry about port collisions. • Directly addressable containers are useful. • Plenty of private IP space to go around.

Slide 40

Slide 40 text

Least-Privileged Access • Every service has the same IAM privileges as the service with the most IAM privileges. • This goes directly against AWS best practices.

Slide 41

Slide 41 text

I dreamed a dream…

Slide 42

Slide 42 text

What do I want, though? • I want to think less about infrastructure. • No more “instances.” • I want a framework for building and deploying applications.

Slide 43

Slide 43 text

Questions?

Slide 44

Slide 44 text

Thank you! • Thanks AdvAWS and sponsors. • Thank you for attending. • Greg Poirier - @grepory • Opsee - https://opsee.com - @GetOpsee