The Perils of Writing a PaaS

Andrew Godwin
http://www.flickr.com/photos/jannem/2719976702/

Hi, I'm Andrew.
Serial Python developer
Django core committer
Sysadmin by night

We're ep.io
Python Platform-as-a-Service
Utility billing
PostgreSQL, Redis, Celery, and more

We built a… prototype.
Me and Ben Firshman
Three or four days' hacking at DjangoCon
Ran code, had simple deployment

The last 10%...
A month or two of hibernation
Went part-time in December
Private beta since February
Public launch later this year

Why? Why not?
Lack of good solutions
Strong, technical team
Writing backend code is fun

It's a challenge
We're still a closed beta
300+ apps, on 4 servers
Some people just have crazy code
Security, security, security

Our Architecture

[Diagram: ep.io Cloud. Labels: Request, Sugar, XML, Response, Code, Magic]

[Diagram: a balancer in front of three runners, which host instances of App 1 to App 4 (App 1 and App 2 each appear twice), backed by databases and file storage]

Load Balancer
Started with HAProxy
Moved to custom Python load balancer
Still needs refinement
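
As a rough illustration of the decision a custom Python load balancer makes per request, here is a minimal round-robin sketch. The class name `BackendPool`, the hostname, and the addresses are hypothetical; the slides do not describe ep.io's actual routing algorithm.

```python
import itertools


class BackendPool:
    """Maps a hostname to its app instances and rotates through them."""

    def __init__(self):
        self._backends = {}   # hostname -> list of (host, port)
        self._cursors = {}    # hostname -> round-robin iterator

    def set_backends(self, hostname, addresses):
        # Replace the instance list and restart the rotation for this site.
        self._backends[hostname] = list(addresses)
        self._cursors[hostname] = itertools.cycle(self._backends[hostname])

    def pick(self, hostname):
        # Return the next instance for this site, or None if unknown or empty.
        if not self._backends.get(hostname):
            return None
        return next(self._cursors[hostname])


# Hypothetical site with two app instances.
pool = BackendPool()
pool.set_backends("example.test", [("10.0.0.1", 8001), ("10.0.0.2", 8001)])
first = pool.pick("example.test")
second = pool.pick("example.test")
third = pool.pick("example.test")   # wraps around to the first instance
```

Because `set_backends` swaps the whole list at once, a pool like this can be updated at runtime whenever app instances move between servers.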

Runners
Daemon on each machine
Nginx + gunicorn for each app instance
Output captured, CPU time measured
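
One way to capture output and measure CPU time for a child process on Unix is sketched below. This is an assumption about the general technique, not ep.io's runner code; `run_and_measure` is a hypothetical helper, and a long-running gunicorn worker would need periodic sampling rather than a single measurement at exit.

```python
import resource
import subprocess
import sys


def run_and_measure(argv):
    """Run a command, capture its output, and return the CPU seconds it used."""
    # RUSAGE_CHILDREN accumulates usage of children that have been waited for,
    # so the difference around a blocking run() is this child's CPU time.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_seconds = (after.ru_utime - before.ru_utime) + (
        after.ru_stime - before.ru_stime
    )
    return proc.returncode, proc.stdout.decode(), cpu_seconds


# Stand-in workload: a short Python one-liner instead of a real app instance.
code, output, cpu = run_and_measure(
    [sys.executable, "-c", "print(sum(range(100000)))"]
)
```

User and system time are summed here because both are CPU time that utility billing would plausibly charge for.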

Coordinator
Analyses whole system
Juggles apps between servers
Detects dead servers
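
The two coordinator duties of detecting dead servers and moving apps can be sketched with heartbeat timestamps. Everything here is assumed for illustration: the function names, the 30-second timeout, the server names, and the least-loaded placement rule. The slides do not say how ep.io's coordinator decides.

```python
def find_dead_servers(last_seen, now, timeout=30.0):
    """Return servers whose last heartbeat is older than `timeout` seconds."""
    return sorted(name for name, seen in last_seen.items() if now - seen > timeout)


def reassign(placement, dead):
    """Move apps off dead servers onto the least-loaded live server."""
    live = {srv: list(apps) for srv, apps in placement.items() if srv not in dead}
    if not live:
        raise RuntimeError("no live servers left to host apps")
    for srv in dead:
        for app in placement.get(srv, []):
            # Fewest apps wins; ties are broken by server name for determinism.
            target = min(live, key=lambda s: (len(live[s]), s))
            live[target].append(app)
    return live


# Hypothetical cluster state: runner-c last reported 61 seconds ago.
last_seen = {"runner-a": 100.0, "runner-b": 95.0, "runner-c": 40.0}
placement = {
    "runner-a": ["app1"],
    "runner-b": ["app2", "app3"],
    "runner-c": ["app4", "app5"],
}
dead = find_dead_servers(last_seen, now=101.0)
new_placement = reassign(placement, dead)
```

A real coordinator would weigh memory and CPU rather than app counts, but the shape of the loop is the same.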

PostgreSQL
Normal PostgreSQL 9 install
Daemon to read query logs, make users

Redis
Custom Redis load balancer/manager
Starts processes on demand
Handles multi-user security

Upload Receiver
SSH endpoint for git, hg, commands
Wraps VCSs, extracts uploaded files
Creates filesystem images
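
An SSH endpoint that wraps a VCS usually has to validate the command the client asked for before running anything. The sketch below shows that check for git only, assuming the requested command arrives as a string (OpenSSH exposes it to forced commands as `SSH_ORIGINAL_COMMAND`). The `ALLOWED` table and `parse_ssh_command` are hypothetical, and hg and ep.io's own commands are left out.

```python
import shlex

# Server-side git programs the endpoint will run, with their argument count.
ALLOWED = {
    "git-receive-pack": 1,   # invoked by "git push"
    "git-upload-pack": 1,    # invoked by "git fetch" and "git clone"
}


def parse_ssh_command(original_command):
    """Validate the command a client requested; return (program, args)."""
    parts = shlex.split(original_command)
    if not parts:
        raise ValueError("interactive shells are not allowed")
    program, args = parts[0], parts[1:]
    if program not in ALLOWED:
        raise ValueError("command not allowed: %s" % program)
    if len(args) != ALLOWED[program]:
        raise ValueError("unexpected arguments for %s" % program)
    # Refuse paths that try to climb out of the user's own repositories.
    if any(".." in arg.split("/") for arg in args):
        raise ValueError("path traversal is not allowed")
    return program, args


result = parse_ssh_command("git-receive-pack 'myapp.git'")
```

Parsing with `shlex` and then executing the program with an argument list, rather than passing the raw string to a shell, keeps quoting tricks in the client's command from becoming shell injection.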

Other Services
Log aggregation
UID assignment
Calculate costs

Statistics
Queued in Redis
Consumed asynchronously
Currently stored in Redis, changing soon
Graphed and profiled
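
The queue-then-consume pattern for statistics can be sketched without a Redis server by using an in-memory deque as a stand-in for a Redis list. The event fields, metric names, and function names are hypothetical; in a real deployment the `appendleft` and `pop` calls would correspond to Redis `LPUSH` and `RPOP`.

```python
import collections
import json

queue = collections.deque()      # stand-in for a Redis list
totals = collections.Counter()   # stand-in for wherever aggregates are stored


def record(app, metric, value):
    # Producer side: a cheap append on the request path (LPUSH in Redis terms).
    queue.appendleft(json.dumps({"app": app, "metric": metric, "value": value}))


def consume(batch_size=100):
    # Consumer side: drain up to batch_size events in a separate worker (RPOP).
    handled = 0
    while queue and handled < batch_size:
        event = json.loads(queue.pop())
        totals[(event["app"], event["metric"])] += event["value"]
        handled += 1
    return handled


record("app1", "requests", 1)
record("app1", "requests", 1)
record("app2", "cpu_ms", 12)
processed = consume()
```

Keeping the producer to a single append means a slow or stalled consumer delays the graphs, not the requests being measured.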

Configuration Management
Puppet for the simpler stuff
Daemons handle complex stuff
Don't try to reinvent the wheel

Monitoring
Nagios
SaaS monitoring
Emails, texts, pager
Several custom checks

Backups
Currently just rdiff-backup
Moving to btrfs snapshots + DRBD
HA is not a backup solution

Perils

Initial bad design (To be fair, it was a prototype)

Networks really aren't reliable (Well, EC2's, at least.)

Memory pressure is bad (Prepare to have a fallback. And another.)

Raw file handles are… fun. (As is the PTY subsystem. Be very careful.)

Write just enough automation (If a server dies, I now just go and get a drink)

HAProxy doesn't like 500+ backends (it's not exactly common)

Single redundancy is only so good (and remember, HA is not backups!)

Future Perils

Payment (Already underway, still hard)

Oversized Sites (we need to get a lot bigger first)

European Servers (people really do want them)

More Databases (how on earth do you measure MongoDB use?)

More Languages (easy to get it working, hard to polish)

The Potential Big Outage (quite useful as a motivational tool)

Thank you.
Andrew Godwin
@andrewgodwin