CFEngine on AWS: a Stateless Infrastructure

Slide 1

Slide 1 text

CFEngine on AWS: a Stateless Infrastructure
Laurent Raufaste (@_LR_)

Slide 2

Slide 2 text

Hello Ops

Slide 3

Slide 3 text

I work at Percolate

Slide 4

Slide 4 text

We are a tech company
Percolate helps brands create content at social scale.

Slide 5

Slide 5 text

We live in the cloud
We are a SaaS

Slide 6

Slide 6 text

We use a bunch of servers
• 5% serving data
• 10% doing chores
• 85% working on data

Slide 7

Slide 7 text

What those 85% do
• ingest data
• digest data
• close to real time (RT)

Slide 8

Slide 8 text

It’s expensive
We need to act smart to keep the business sustainable.

Slide 9

Slide 9 text

CFEngine

Slide 10

Slide 10 text

#WTF is #CFE?
A tool to gently "dictate" what your infrastructure should be #dontgetmadmarkburgess

Slide 11

Slide 11 text

Some history
• 1993 CFEngine
• 2003 Puppet
• 2006 EC2
• 2008 CFEngine 3
• 2009 Chef

Slide 12

Slide 12 text

A simple example
Our Redis policy

Slide 13

Slide 13 text

Why CFEngine?
Chef, Puppet, Ansible

Slide 14

Slide 14 text

Convergence
CFEngine keeps its promises if it can. No need to start from a known state.

Slide 15

Slide 15 text

Portability
Same policies on Solaris, GNU/Linux, *BSD, AIX, HP-UX, Windows, OSX, …

Slide 16

Slide 16 text

It’s old
CFEngine v1 was released in 1993. Like a “teddy bear”, it’s reassuring: it has been used for this long without any big problem, cf. OpenBSD’s “2 holes since 1996”.

Slide 17

Slide 17 text

Let’s focus
Here come the deal breakers

Slide 18

Slide 18 text

Dedicated Language
The CFEngine DSL has been tailored for this purpose: no legacy, and it is based on promise theory.

Slide 19

Slide 19 text

Documented Infrastructure
Solves the problem of outdated and useless documentation

Slide 20

Slide 20 text

Documented Infrastructure
• grep the whole cluster
• what's in there is what's live
• no need to SSH
• knowledge is shared
• history is kept
• company is more valuable

Slide 21

Slide 21 text

Scalability
We want to build for success, not failure.
We hope what we build will succeed.

Slide 22

Slide 22 text

Scalability
• Decentralized by nature
• Can scale both ways
• Largest cluster is X00,000s
• m1.small on AWS

Slide 23

Slide 23 text

Reusability
It lets us build things that last and can be reused.

Slide 24

Slide 24 text

Reusability
• DRY
• Build service/server blocks
• Reuse them on live, staging, dev
• Change them once and for all

Slide 25

Slide 25 text

Footprint
It’s tailored for the job

Slide 26

Slide 26 text

Footprint
• Package to install is < 3 MB
• Largest binary is 320 kB (96% C, 3% C++)
• The server just lets clients download policies
• Clients try to apply the policies locally

Slide 27

Slide 27 text

It’s GPL
It’s free (libre) and always will be. It’s in Debian, so it passed the DFSG test: that is the fastest way to check.

Slide 28

Slide 28 text

Open & active community
You can open bug reports and submit Pull Requests on GitHub, a must nowadays.

Slide 29

Slide 29 text

Here’s what CFEngine allows us to do

Slide 30

Slide 30 text

Pwn our infrastructure
We don’t let it pwn us

Slide 31

Slide 31 text

Normalized Infrastructure
Minimize redundancy and dependency

Slide 32

Slide 32 text

Being unpredictable, it’s fun
Like the Netflix Chaos Monkey, I randomly kill instances.

Slide 33

Slide 33 text

Maintain costs
2011-2013: Employees x10, Clients x20, Servers x2, Infrastructure cost x1.2
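
To make the growth factors on this slide easier to compare, here is a small arithmetic sketch. It only divides the figures quoted above; the assumption that infrastructure cost is spread evenly across clients or servers is ours, not the speaker's, and the function name is hypothetical.

```python
# Growth factors quoted on the slide for 2011-2013.
growth_2011_2013 = {
    "employees": 10,
    "clients": 20,
    "servers": 2,
    "infrastructure_cost": 1.2,
}


def relative_unit_cost(total_cost_factor: float, volume_factor: float) -> float:
    """Return how the cost per unit changed, given total cost and volume growth.

    Assumption: cost is spread evenly across units (clients or servers).
    """
    return total_cost_factor / volume_factor


cost_per_client = relative_unit_cost(
    growth_2011_2013["infrastructure_cost"], growth_2011_2013["clients"]
)
cost_per_server = relative_unit_cost(
    growth_2011_2013["infrastructure_cost"], growth_2011_2013["servers"]
)

if __name__ == "__main__":
    print(f"cost per client: x{cost_per_client:.2f}")
    print(f"cost per server: x{cost_per_server:.2f}")
```

Under that even-spread assumption, the infrastructure cost per client fell to roughly 6% of its 2011 level, and the cost per server to roughly 60%.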

Slide 34

Slide 34 text

Keep your infrastructure homogeneous
Don’t let exceptions waste your time

Slide 35

Slide 35 text

Not scared of changes
Ops should not slow things down

Slide 36

Slide 36 text

Ops at Percolate

Slide 37

Slide 37 text

Ops are not DevOps
Ops are sysadmins who do their job well: Build + Automate + Maintain + Monitor + Document

Slide 38

Slide 38 text

Devs are not DevOps
Ask your devs for the commands, then make them a policy.

Slide 39

Slide 39 text

Same infrastructure on all environments
Live policies are used to build staging, with smaller and fewer instances, and it’s always up to date.

Slide 40

Slide 40 text

Same infrastructure on all environments
It takes a few minutes to get a small replica of live on your workstation, and it’s always up to date.

Slide 41

Slide 41 text

GitHub Flow applied to Ops
• Develop in a branch
• Test (Vagrant)
• Review (Pull Request)
• Merge
• Deploy

Slide 42

Slide 42 text

Be the Heroku or the GAE of your team
Ops use IaaS + Metal to provide a PaaS to devs

Slide 43

Slide 43 text

It does not solve it all
Pieces we added around CFEngine

Slide 44

Slide 44 text

Bootstrapping
CFEngine is missing the bootstrap process. Is it really its job? We did it in-house, in Python/Bash.

Slide 45

Slide 45 text

Bootstrapping
• Request an instance
• Name it
• Install CFEngine
• CFEngine handles the rest

Slide 46

Slide 46 text

Bootstrapping
We define all our servers in an INI file

Slide 47

Slide 47 text

Bootstrapping
Everything can be overridden per instance type
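
The slides do not show the INI file itself, so the following is only a sketch of how such an inventory could work with Python's standard `configparser`. The section names, keys (`instance_type`, `region`, `count`) and values are hypothetical; the `[DEFAULT]` section supplies values that any server section may override.

```python
import configparser

# Hypothetical inventory: every key and value here is illustrative,
# not taken from the talk.
INVENTORY = """
[DEFAULT]
instance_type = m1.small
region = us-east-1
count = 1

[worker.live]
instance_type = m1.large
count = 4

[smtp.live]
"""


def load_inventory(text: str) -> dict:
    """Parse the INI text and return {server name: merged settings}.

    configparser merges [DEFAULT] into every section, so a section only
    needs to list the settings it overrides. All values stay strings.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    for name, settings in load_inventory(INVENTORY).items():
        print(name, settings)
```

A bootstrap script could then loop over this dictionary to request, name and provision instances; that part depends on the cloud API in use and is left out here.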

Slide 48

Slide 48 text

Bootstrapping
Easy to define, easy to launch

Slide 49

Slide 49 text

We don’t use CFEngine for complex stuff
3 ordered dependencies max, e.g. “Hell” or deploying a Python app with on-demand pip requirements

Slide 50

Slide 50 text

Naming convention to leverage CFEngine classes
• [id.][subrole.]role.environment
• smtp.live.com
• i-1ab345.worker.live.com
• i-23f432.api.staging.com
• lb.api.staging.com
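
As an illustration of the `[id.][subrole.]role.environment` pattern, here is a minimal parsing sketch. It assumes that the trailing `com` in the examples is the domain, and that an optional id is an EC2-style label starting with `i-`; both are our reading of the examples, not something the slides state. The helper names are hypothetical, and `candidate_classes` only imitates the underscore-style names CFEngine classes use; it is not CFEngine's own class discovery.

```python
import re
from typing import List, NamedTuple, Optional


class HostName(NamedTuple):
    instance_id: Optional[str]
    subrole: Optional[str]
    role: str
    environment: str


def parse_hostname(fqdn: str, domain: str = "com") -> HostName:
    """Split '[id.][subrole.]role.environment.<domain>' into its parts."""
    suffix = "." + domain
    if not fqdn.endswith(suffix):
        raise ValueError(f"{fqdn!r} is not in domain {domain!r}")
    labels = fqdn[: -len(suffix)].split(".")
    if not 2 <= len(labels) <= 4:
        raise ValueError(f"unexpected number of labels in {fqdn!r}")
    *prefix, role, environment = labels
    instance_id = subrole = None
    for label in prefix:
        # Assumption: ids look like EC2 instance ids and come first.
        if label.startswith("i-") and instance_id is None and subrole is None:
            instance_id = label
        elif subrole is None:
            subrole = label
        else:
            raise ValueError(f"cannot interpret label {label!r} in {fqdn!r}")
    return HostName(instance_id, subrole, role, environment)


def candidate_classes(host: HostName) -> List[str]:
    """Return class-like names (environment, role, subrole) for a host."""
    parts = [host.environment, host.role, host.subrole]
    return [re.sub(r"[^A-Za-z0-9_]", "_", part) for part in parts if part]


if __name__ == "__main__":
    for name in ("smtp.live.com", "i-1ab345.worker.live.com", "lb.api.staging.com"):
        host = parse_hostname(name)
        print(name, host, candidate_classes(host))
```

The point of the convention is that the name alone says which environment and role a server belongs to, so policies can be selected without a separate inventory lookup.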

Slide 51

Slide 51 text

Naming convention to leverage CFEngine classes
• Our DNS is our inventory
• We leverage it with a coordination service (AWS Tags, which do not scale; ZooKeeper; …)

Slide 52

Slide 52 text

Server Structure
• Application layer
• CFE: Specialized layer (Role)
• CFE: Basic layer (Environment)
• Pristine Ubuntu
• EC2

Slide 53

Slide 53 text

CFEngine does not take care of it all
It takes care of all the basics

Slide 54

Slide 54 text

CFEngine does not take care of it all
It makes sure the complex pieces are there and operational

Slide 55

Slide 55 text

We started with the simple and obvious
syslog, smtp, ... You don’t want to fail big.

Slide 56

Slide 56 text

We finished with the critical
When we reached the big stuff, it was easy, and we had all the bricks to reuse.

Slide 57

Slide 57 text

Achievements

Slide 58

Slide 58 text

Recap of previous benefits
• Documentation
• Scalability
• Reusability
• Easy and fast to change

Slide 59

Slide 59 text

But our huge win is ...

Slide 60

Slide 60 text

Our infrastructure has no state
What’s the big deal?

Slide 61

Slide 61 text

Our infrastructure has no state
• Policies in git
• App code in git
• Data in datastores
• No backup: images are only a cache

Slide 62

Slide 62 text

No instance backup at all?
2 exceptions:
• S3 for cryptic generated config files (Jenkins)
• EBS for large non-vital changing data (RabbitMQ)

Slide 63

Slide 63 text

We are independent
No state is left on AWS (no AMI), so we can migrate away for better prices, stability, features, mood.

Slide 64

Slide 64 text

We know and hear everything
But tell everyone to shut up (email). When something happens, you'll know. Your goal is silence: 0 emails.

Slide 65

Slide 65 text

We don’t push to deploy
Pushing does not scale. We update the live version and every server updates itself. You can do this if your infrastructure is transparent, CFEnginized.

Slide 66

Slide 66 text

We are resilient
Anything can go down; it will come back up and rebuild itself automatically. It happens nightly.

Slide 67

Slide 67 text

We can change our shape
Upgrading a server takes 2 commands:
1. Launch a beefier instance with the same name
2. Kill the weak one

Slide 68

Slide 68 text

We use spot instances, it’s cheap!
We can launch and kill any server anytime. It happens while we sleep.

Slide 69

Slide 69 text

For the smaller instance types

Slide 70

Slide 70 text

We are almost there
Some free tips

Slide 71

Slide 71 text

Watch Mark’s videos
They’re pretty dense, e.g. “The Promise of System Configuration” enlightened me.

Slide 72

Slide 72 text

Buy Diego’s book
Don’t bother with anything else; it will give you the “I understand” feeling we all love.