Slide 1

The Perils of Writing a PaaS
Andrew Godwin
http://www.flickr.com/photos/jannem/2719976702/

Slide 2

Hi, I'm Andrew.
Serial Python developer
Django core committer
Sysadmin by night

Slide 3

We're ep.io
Python Platform-as-a-Service
Utility billing
PostgreSQL, Redis, Celery, and more

Slide 4

We built a… prototype.
Me and Ben Firshman
Three or four days' hacking at DjangoCon
Ran code, had simple deployment

Slide 5

The last 10%...
A month or two of hibernation
Went part-time in December
Private beta since February
Public launch later this year

Slide 6

Why?
Why not?

Slide 7

Why?
Why not?
Lack of good solutions
Strong, technical team
Writing backend code is fun

Slide 8

It's a challenge
We're still a closed beta
300+ apps, on 4 servers
Some people just have crazy code
Security, security, security

Slide 9

Our Architecture

Slide 10

[Diagram: ep.io Cloud. Labels: Request, Sugar, XML, Response, Code, Magic]

Slide 11

[Diagram: Balancer; three Runners hosting app instances (App 1, App 2, App 3, App 2, App 4, App 1); Databases; File Storage]

Slide 12

Load Balancer
Started with HAProxy
Moved to custom Python load balancer
Still needs refinement
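
As a rough illustration of what the core of a Python load balancer might look like, here is a minimal sketch of per-app round-robin backend selection. The class name, hostnames, and addresses are hypothetical; the slides do not describe ep.io's actual routing logic.

```python
import itertools


class RoundRobinBalancer:
    """Hypothetical sketch: maps a hostname to backends and rotates through them."""

    def __init__(self):
        self._backends = {}   # hostname -> list of (ip, port)
        self._cycles = {}     # hostname -> itertools.cycle over that list

    def set_backends(self, hostname, backends):
        # Replace the backend list for an app, e.g. after it moves servers.
        self._backends[hostname] = list(backends)
        self._cycles[hostname] = itertools.cycle(self._backends[hostname])

    def pick(self, hostname):
        # Return the next backend for this app, or None if none is known.
        cycle = self._cycles.get(hostname)
        if cycle is None or not self._backends[hostname]:
            return None
        return next(cycle)
```

Calling `pick("app1.example.com")` repeatedly walks through that app's backends in order and wraps around. A real balancer would also need health checks and connection handling, which this sketch leaves out.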

Slide 13

Runners
Daemon on each machine
Nginx + gunicorn for each app instance
Output captured, CPU time measured
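
To make "output captured, CPU time measured" concrete, here is a minimal sketch using only the standard library on a Unix-like system. It assumes a short-lived child process and is not ep.io's runner code: the function name `run_and_measure` is hypothetical, and a daemon supervising long-running app processes would need to sample usage periodically instead.

```python
import resource
import subprocess
import sys


def run_and_measure(argv):
    """Run a child process, capture its output, and return CPU seconds used."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    result = subprocess.run(argv, capture_output=True, text=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN counts children that have exited and been waited for,
    # so the difference is the user + system CPU time of the process we ran.
    cpu_seconds = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return result.stdout, result.stderr, cpu_seconds
```

For example, `run_and_measure([sys.executable, "-c", "print(sum(range(10**6)))"])` returns the captured stdout, the captured stderr, and the CPU seconds consumed, which is the kind of figure utility billing could be based on.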

Slide 14

Coordinator
Analyses whole system
Juggles apps between servers
Detects dead servers
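
One common way to detect dead servers is a heartbeat timeout. The sketch below assumes that approach, with apps on a dead server moved to the least-loaded live one. The slides do not say how ep.io's coordinator actually works, so the class, its method names, and the 30-second timeout are all hypothetical.

```python
import time


class Coordinator:
    """Hypothetical sketch: track heartbeats and move apps off dead servers."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}    # server name -> time of last heartbeat
        self.placement = {}    # app name -> server name

    def heartbeat(self, server):
        self.last_seen[server] = self.clock()

    def dead_servers(self):
        now = self.clock()
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)

    def rebalance(self):
        """Reassign apps on dead servers to the least-loaded live server."""
        dead = set(self.dead_servers())
        live = [s for s in self.last_seen if s not in dead]
        moved = {}
        if not live:
            return moved
        load = {s: 0 for s in live}
        for server in self.placement.values():
            if server in load:
                load[server] += 1
        for app, server in sorted(self.placement.items()):
            if server in dead:
                target = min(live, key=lambda s: (load[s], s))
                self.placement[app] = target
                load[target] += 1
                moved[app] = target
        return moved
```

The injectable `clock` exists only to make the timeout logic testable without sleeping.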

Slide 15

PostgreSQL
Normal PostgreSQL 9 install
Daemon to read query logs, make users

Slide 16

Redis
Custom Redis load balancer/manager
Starts processes on demand
Handles multi-user security

Slide 17

Upload Receiver
SSH endpoint for git, hg, commands
Wraps VCSs, extracts uploaded files
Creates filesystem images

Slide 18

Other Services
Log aggregation
UID assignment
Calculate costs

Slide 19

Statistics
Queued in Redis
Consumed asynchronously
Currently stored in Redis, changing soon
Graphed and profiled
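
The queue-then-consume pattern can be sketched without a Redis server. The example below swaps in Python's in-process `queue.Queue` and a worker thread as a stand-in for the Redis queue, so it shows the shape of the idea rather than ep.io's pipeline. The event format `(app, metric, value)` and the function names are hypothetical.

```python
import queue
import threading
from collections import Counter


def consume(stats_queue, totals):
    """Drain events from the queue and aggregate them until a None sentinel."""
    while True:
        event = stats_queue.get()
        if event is None:
            break
        app, metric, value = event
        totals[(app, metric)] += value


def record_and_aggregate(events):
    stats_queue = queue.Queue()
    totals = Counter()
    worker = threading.Thread(target=consume, args=(stats_queue, totals))
    worker.start()
    for event in events:
        # Producers only enqueue; they never wait for aggregation.
        stats_queue.put(event)
    stats_queue.put(None)   # sentinel: no more events
    worker.join()
    return totals
```

The point of the design is that the code serving requests only pays the cost of an enqueue, while aggregation, graphing, and storage happen elsewhere at their own pace.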

Slide 20

Configuration Management
Puppet for the simpler stuff
Daemons handle complex stuff
Don't try to reinvent the wheel

Slide 21

Monitoring
Nagios
SaaS monitoring
Nagios
Emails, texts, pager
Several custom checks

Slide 22

Backups
Currently just rdiff-backup
Moving to btrfs snapshots + DRBD
HA is not a backup solution

Slide 23

Perils

Slide 24

Initial bad design
(To be fair, it was a prototype)

Slide 25

Networks really aren't reliable
(Well, EC2's, at least.)

Slide 26

Memory pressure is bad
(Prepare to have a fallback. And another.)

Slide 27

Raw file handles are… fun.
(As is the PTY subsystem. Be very careful.)

Slide 28

Write just enough automation
(If a server dies, I now just go and get a drink)

Slide 29

HAProxy doesn't like 500+ backends
(it's not exactly common)

Slide 30

Single redundancy is only so good
(and remember, HA is not backups!)

Slide 31

Future Perils

Slide 32

Payment
(Already underway, still hard)

Slide 33

Oversized Sites
(we need to get a lot bigger first)

Slide 34

European Servers
(people really do want them)

Slide 35

More Databases
(how on earth do you measure MongoDB use?)

Slide 36

More Languages
(easy to get it working, hard to polish)

Slide 37

The Potential Big Outage
(quite useful as a motivational tool)

Slide 38

Thank you.
Andrew Godwin
@andrewgodwin
andrew@ep.io