Slide 1
Slide 1 text
The Perils of Writing a PaaS. Andrew Godwin. http://www.flickr.com/photos/jannem/2719976702/
Slide 2
Slide 2 text
Hi, I'm Andrew. Serial Python developer; Django core committer; sysadmin by night.
Slide 3
Slide 3 text
We're ep.io: Python Platform-as-a-Service; utility billing; PostgreSQL, Redis, Celery, and more.
Slide 4
Slide 4 text
We built a… prototype. Me and Ben Firshman; three or four days' hacking at DjangoCon; ran code, had simple deployment.
Slide 5
Slide 5 text
The last 10%... A month or two of hibernation; went part-time in December; private beta since February; public launch later this year.
Slide 6
Slide 6 text
Why? Why not?
Slide 7
Slide 7 text
Why? Why not? Lack of good solutions; strong, technical team; writing backend code is fun.
Slide 8
Slide 8 text
It's a challenge: we're still a closed beta; 300+ apps, on 4 servers; some people just have crazy code; security, security, security.
Slide 9
Slide 9 text
Our Architecture
Slide 10
Slide 10 text
[Diagram labels: ep.io Cloud; Request; Response; Code; Sugar; XML; Magic]
Slide 11
Slide 11 text
[Architecture diagram: Balancer; three Runners hosting app instances (App 1, App 2, App 3, App 2, App 4, App 1); Databases; File Storage]
Slide 12
Slide 12 text
Load Balancer: started with HAProxy; moved to a custom Python load balancer; still needs refinement.
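The slides do not show how the custom Python load balancer chooses a backend. As a minimal sketch, assuming plain round-robin per hostname (the class name, method names, hostnames, and addresses below are all hypothetical), the routing table could look like this:

```python
import itertools


class RoundRobinBalancer:
    """Maps a hostname to a rotating list of (host, port) backends."""

    def __init__(self):
        self._backends = {}  # hostname -> list of (host, port)
        self._cursors = {}   # hostname -> itertools.cycle over that list

    def set_backends(self, hostname, backends):
        # Replacing the list resets the rotation for that hostname.
        self._backends[hostname] = list(backends)
        self._cursors[hostname] = itertools.cycle(self._backends[hostname])

    def pick(self, hostname):
        # Return None when nothing is registered, so the caller can serve
        # an error page instead of raising.
        if not self._backends.get(hostname):
            return None
        return next(self._cursors[hostname])


# Hypothetical hostname and addresses, for illustration only.
balancer = RoundRobinBalancer()
balancer.set_backends("app1.example.com", [("10.0.0.1", 8001), ("10.0.0.2", 8004)])
picks = [balancer.pick("app1.example.com") for _ in range(3)]
```

Keeping the table in a plain dictionary means backends can be swapped at runtime as instances move between machines, which is one plausible reason to prefer a custom balancer over a static configuration file.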
Slide 13
Slide 13 text
Runners: daemon on each machine; Nginx + gunicorn for each app instance; output captured, CPU time measured.
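How the runner daemon captures output and measures CPU time is not described. One way to do both on a Unix system, sketched here with the standard `subprocess` and `resource` modules (the function name `run_and_measure` is made up for this example, and a real runner would supervise long-lived processes rather than wait for them to exit):

```python
import resource
import subprocess
import sys


def run_and_measure(argv):
    """Run argv to completion; return (exit_code, output, cpu_seconds)."""
    # RUSAGE_CHILDREN accumulates CPU time of children that have been
    # waited for, so the difference around run() is this child's usage
    # (assuming no other child is reaped in between).
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr into the captured output
        text=True,
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return proc.returncode, proc.stdout, cpu


code, output, cpu_seconds = run_and_measure(
    [sys.executable, "-c", "print('hello from app')"]
)
```

The sum of user and system time is what a utility-billing model would most likely charge for, since wall-clock time also counts periods when the process is idle.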
Slide 14
Slide 14 text
Coordinator: analyses whole system; juggles apps between servers; detects dead servers.
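The coordinator's actual algorithm is not given. A rough sketch of the two duties named above, assuming servers report heartbeats and that apps from a dead server go to whichever live server currently hosts the fewest instances (the function names, the 30-second timeout, and the placement policy are all assumptions):

```python
def find_dead(last_seen, now, timeout=30.0):
    """Return servers whose last heartbeat is older than timeout seconds."""
    return sorted(s for s, t in last_seen.items() if now - t > timeout)


def reassign(placement, dead):
    """Move apps off dead servers onto the least-loaded live server.

    placement maps server name -> list of app names; a new dict is returned.
    """
    live = {s: list(apps) for s, apps in placement.items() if s not in dead}
    if not live:
        raise RuntimeError("no live servers left")
    for server in dead:
        for app in placement.get(server, []):
            # Ties are broken by server name so the result is deterministic.
            target = min(live, key=lambda s: (len(live[s]), s))
            live[target].append(app)
    return live


# Hypothetical heartbeat timestamps (seconds) and app placement.
last_seen = {"runner1": 100.0, "runner2": 95.0, "runner3": 40.0}
placement = {
    "runner1": ["app1", "app2"],
    "runner2": ["app3"],
    "runner3": ["app2", "app4"],
}
dead = find_dead(last_seen, now=100.0)
new_placement = reassign(placement, dead)
```

This sketch ignores memory and CPU load and does not try to keep two instances of the same app on different machines; a production coordinator would need to weigh both.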
Slide 15
Slide 15 text
PostgreSQL: normal PostgreSQL 9 install; daemon to read query logs, make users.
Slide 16
Slide 16 text
Redis: custom Redis load balancer/manager; starts processes on demand; handles multi-user security.
Slide 17
Slide 17 text
Upload Receiver: SSH endpoint for git, hg, commands; wraps VCSs, extracts uploaded files; creates filesystem images.
Slide 18
Slide 18 text
Other Services: log aggregation; UID assignment; calculate costs.
Slide 19
Slide 19 text
Statistics: queued in Redis; consumed asynchronously; currently stored in Redis, changing soon; graphed and profiled.
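The queue-then-consume pattern can be illustrated without a Redis server. The sketch below swaps in an in-memory stand-in for a Redis list (the `ListQueue` class, the JSON event shape, and the metric names are all assumptions, not ep.io's format); with a real Redis client the producer would push onto a named list and the consumer would pop from the other end.

```python
import json
from collections import defaultdict, deque


class ListQueue:
    """In-memory stand-in for a Redis list, so the sketch runs anywhere."""

    def __init__(self):
        self._items = deque()

    def lpush(self, value):
        self._items.appendleft(value)

    def rpop(self):
        # Mirrors Redis semantics: None when the list is empty.
        return self._items.pop() if self._items else None


def record(queue, app, metric, value):
    # Producers only enqueue; nothing is aggregated on the request path.
    queue.lpush(json.dumps({"app": app, "metric": metric, "value": value}))


def drain(queue, totals):
    # Consumer side: pop until empty, summing per (app, metric).
    while True:
        raw = queue.rpop()
        if raw is None:
            return totals
        event = json.loads(raw)
        totals[(event["app"], event["metric"])] += event["value"]


queue = ListQueue()
record(queue, "app1", "cpu_seconds", 0.5)
record(queue, "app1", "cpu_seconds", 0.25)
record(queue, "app2", "requests", 1)
totals = drain(queue, defaultdict(float))
```

Pushing on the left and popping on the right keeps events in arrival order, and the consumer can run on its own schedule without slowing down the processes that emit statistics.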
Slide 20
Slide 20 text
Configuration Management: Puppet for the simpler stuff; daemons handle complex stuff; don't try to reinvent the wheel.
Slide 21
Slide 21 text
Monitoring: Nagios; SaaS monitoring; Nagios; emails, texts, pager; several custom checks.
Slide 22
Slide 22 text
Backups: currently just rdiff-backup; moving to btrfs snapshots + DRBD; HA is not a backup solution.
Slide 23
Slide 23 text
Perils
Slide 24
Slide 24 text
Initial bad design (To be fair, it was a prototype)
Slide 25
Slide 25 text
Networks really aren't reliable (Well, EC2's, at least.)
Slide 26
Slide 26 text
Memory pressure is bad (Prepare to have a fallback. And another.)
Slide 27
Slide 27 text
Raw file handles are… fun. (As is the PTY subsystem. Be very careful.)
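The slide does not say which pitfalls were hit. As an illustration of the care needed, here is a small sketch that runs a command attached to a pseudo-terminal using Python's standard `pty` module (the helper name `run_in_pty` is invented for this example). Two details are easy to get wrong: the parent must close its copy of the slave end, and on Linux the end of output arrives as an `OSError` rather than an empty read.

```python
import os
import pty
import subprocess
import sys


def run_in_pty(argv):
    """Run argv with stdout/stderr on a pseudo-terminal; return bytes written."""
    master_fd, slave_fd = pty.openpty()
    try:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=slave_fd,
            stderr=slave_fd,
        )
    except Exception:
        os.close(master_fd)
        raise
    finally:
        # The parent must drop its slave handle, or reads on the master
        # never see end-of-output after the child exits.
        os.close(slave_fd)

    chunks = []
    try:
        while True:
            try:
                data = os.read(master_fd, 4096)
            except OSError:
                break  # Linux reports EIO once the slave side is fully closed
            if not data:
                break  # other platforms may signal EOF with an empty read
            chunks.append(data)
    finally:
        os.close(master_fd)
    proc.wait()
    return b"".join(chunks)


pty_output = run_in_pty([sys.executable, "-c", "print('hello')"])
```

Note that the terminal layer usually rewrites line endings, so the captured bytes typically end in `\r\n` rather than `\n`.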
Slide 28
Slide 28 text
Write just enough automation (If a server dies, I now just go and get a drink)
Slide 29
Slide 29 text
HAProxy doesn't like 500+ backends (it's not exactly common)
Slide 30
Slide 30 text
Single redundancy is only so good (and remember, HA is not backups!)
Slide 31
Slide 31 text
Future Perils
Slide 32
Slide 32 text
Payment (Already underway, still hard)
Slide 33
Slide 33 text
Oversized Sites (we need to get a lot bigger first)
Slide 34
Slide 34 text
European Servers (people really do want them)
Slide 35
Slide 35 text
More Databases (how on earth do you measure MongoDB use?)
Slide 36
Slide 36 text
More Languages (easy to get it working, hard to polish)
Slide 37
Slide 37 text
The Potential Big Outage (quite useful as a motivational tool)
Slide 38
Slide 38 text
Thank you. Andrew Godwin; @andrewgodwin; andrew@ep.io