A talk I gave at London Devops in May of 2011.
The Perils of Writing a PaaS. Andrew Godwin. http://www.flickr.com/photos/jannem/2719976702/
Hi, I'm Andrew. Serial Python developer. Django core committer. Sysadmin by night.
We're ep.io: Python Platform-as-a-Service. Utility billing. PostgreSQL, Redis, Celery, and more.
We built a… prototype. Me and Ben Firshman. Three or four days' hacking at DjangoCon. Ran code, had simple deployment.
The last 10%... A month or two of hibernation. Went part-time in December. Private beta since February. Public launch later this year.
Why? Why not? Lack of good solutions. Strong, technical team. Writing backend code is fun.
It's a challenge: We're still a closed beta. 300+ apps, on 4 servers. Some people just have crazy code. Security, security, security.
Our Architecture
[Diagram labels: ep.io Cloud; Request; Sugar; XML; Response; Code Magic]
[Diagram labels: Balancer; Runner, Runner, Runner; App 1, App 2, App 3, App 2, App 4, App 1; Databases; File Storage]
Load Balancer: Started with HAProxy. Moved to a custom Python load balancer. Still needs refinement.
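To make the load balancer slide concrete, here is a minimal sketch of one piece such a balancer needs: choosing a backend for each app. The slides do not say how ep.io's balancer picks backends, so the round-robin policy and the names BackendPool, add, remove and pick are all assumptions made for illustration.

```python
class BackendPool:
    """Round-robin choice among the backends registered for each app.

    Hypothetical helper: the talk does not describe ep.io's real policy.
    """

    def __init__(self):
        self._backends = {}  # app name -> list of (host, port) tuples
        self._cursors = {}   # app name -> index of the next backend to use

    def add(self, app, host, port):
        """Register a running instance of an app."""
        self._backends.setdefault(app, []).append((host, port))

    def remove(self, app, host, port):
        """Forget an instance; raises ValueError if it was never registered."""
        self._backends.get(app, []).remove((host, port))

    def pick(self, app):
        """Return the next backend for this app, cycling through all of them."""
        backends = self._backends.get(app)
        if not backends:
            raise LookupError("no backends registered for %r" % app)
        index = self._cursors.get(app, 0) % len(backends)
        self._cursors[app] = index + 1
        return backends[index]
```

Because the cursor is taken modulo the current list length, instances can be added or removed at any time without resetting any state, which matters when apps are moved between servers.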
Runners: Daemon on each machine. Nginx + gunicorn for each app instance. Output captured, CPU time measured.
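As an illustration of "output captured, CPU time measured", the sketch below runs a child process, collects its output, and reads the CPU time it consumed from the operating system. It is Unix-only, and it is an assumption about one way to do this rather than a description of the ep.io runner daemon; the function name run_and_measure is hypothetical.

```python
import resource
import subprocess
import sys


def run_and_measure(argv):
    """Run argv to completion; return (combined output bytes, CPU seconds).

    CPU time is the change in user + system time accounted to waited-for
    child processes, so it is only accurate if no other child of this
    process exits while the command is running.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_seconds = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return result.stdout, cpu_seconds


if __name__ == "__main__":
    output, cpu = run_and_measure([sys.executable, "-c", "print('hello')"])
    print(output, cpu)
```

A long-running app server would need periodic sampling rather than a single measurement at exit, but the accounting idea is the same.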
Coordinator: Analyses whole system. Juggles apps between servers. Detects dead servers.
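The coordinator's two jobs, detecting dead servers and juggling apps between servers, can be sketched roughly as follows. Everything here is an assumption: the slides do not say how liveness is detected or how placement is decided, so this uses a heartbeat timeout and a "least loaded live server" rule, and the names find_dead and reassign are hypothetical.

```python
def find_dead(last_seen, now, timeout=30.0):
    """Return servers whose last heartbeat is older than `timeout` seconds.

    last_seen maps server name -> timestamp of its most recent heartbeat.
    """
    return sorted(s for s, seen in last_seen.items() if now - seen > timeout)


def reassign(placement, dead):
    """Move app instances off dead servers onto live ones.

    placement maps server name -> list of app names running there.
    Each orphaned instance goes to a live server that is not already
    running that app if possible, then to the one with the fewest apps.
    Returns a new placement containing only live servers.
    """
    live = {s: list(apps) for s, apps in placement.items() if s not in dead}
    if not live:
        raise RuntimeError("no live servers left")
    for server in dead:
        for app in placement.get(server, []):
            target = min(live, key=lambda s: (app in live[s], len(live[s]), s))
            live[target].append(app)
    return live
```

Preferring a server that does not already run the app keeps duplicate instances of one app spread across machines, so a single failure is less likely to take every copy down.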
PostgreSQL: Normal PostgreSQL 9 install. Daemon to read query logs, make users.
Redis: Custom Redis load balancer/manager. Starts processes on demand. Handles multi-user security.
Upload Receiver: SSH endpoint for git, hg, commands. Wraps VCSs, extracts uploaded files. Creates filesystem images.
Other Services: Log aggregation. UID assignment. Calculate costs.
Statistics: Queued in Redis. Consumed asynchronously. Currently stored in Redis, changing soon. Graphed and profiled.
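The "queued, then consumed asynchronously" pattern on this slide might look something like the sketch below. The code only calls lpush and rpop on the client it is given, which are real Redis list commands, but to stay runnable without a server it is exercised against InMemoryQueue, a hypothetical in-memory stand-in. The key name, event fields, and function names are likewise assumptions, not ep.io's actual schema.

```python
import json
from collections import defaultdict, deque

QUEUE_KEY = "stats:queue"  # hypothetical key name


class InMemoryQueue:
    """Stand-in that mimics only the two list commands used below."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def lpush(self, key, value):
        self._lists[key].appendleft(value)

    def rpop(self, key):
        items = self._lists[key]
        return items.pop() if items else None


def record(client, app, metric, value):
    """Producer side: enqueue one measurement and return immediately."""
    event = {"app": app, "metric": metric, "value": value}
    client.lpush(QUEUE_KEY, json.dumps(event))


def drain(client, totals):
    """Consumer side: pop every queued event, oldest first, into totals.

    totals maps (app, metric) -> running sum. Returns the number of
    events processed.
    """
    processed = 0
    while True:
        raw = client.rpop(QUEUE_KEY)
        if raw is None:
            return processed
        event = json.loads(raw)
        totals[(event["app"], event["metric"])] += event["value"]
        processed += 1
```

Pushing on the left and popping on the right gives first-in, first-out order, and the producer never waits on whatever storage or graphing the consumer does afterwards.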
Configuration Management: Puppet for the simpler stuff. Daemons handle complex stuff. Don't try to reinvent the wheel.
Monitoring: Nagios. SaaS monitoring Nagios. Emails, texts, pager. Several custom checks.
Backups: Currently just rdiff-backup. Moving to btrfs snapshots + DRBD. HA is not a backup solution.
Perils
Initial bad design (To be fair, it was a prototype)
Networks really aren't reliable (Well, EC2's, at least.)
Memory pressure is bad (Prepare to have a fallback. And another.)
Raw file handles are… fun. (As is the PTY subsystem. Be very careful.)
Write just enough automation (If a server dies, I now just go and get a drink)
HAProxy doesn't like 500+ backends (it's not exactly common)
Single redundancy is only so good (and remember, HA is not backups!)
Future Perils
Payment (Already underway, still hard)
Oversized Sites (we need to get a lot bigger first)
European Servers (people really do want them)
More Databases (how on earth do you measure MongoDB use?)
More Languages (easy to get it working, hard to polish)
The Potential Big Outage (quite useful as a motivational tool)
Thank you. Andrew Godwin, @andrewgodwin