This talk offers a frank insight into the mistakes made and lessons learned from running a cloud service provider, and into the processes and systems we have in place today to avoid outages for our customers.
[Diagram: deployment pipeline. Labels, in order: Continuous Integration; Integration tests; Unit tests; Deployment on integration servers; Acceptance Tests; Code Review; Mercurial SCM (stable/prod); Review changes; Merge to stable/prod; Mercurial SCM (separate repo for deployment); Re-deployment script; Production servers; Approve for Deployment]
storage machines
• A dedicated project, internally called steigy (roughly translates to ‘CRATES’)
  ◦ preparing and maintaining the environment for production
  ◦ applying system/security patches
Automation
you know, for the crates :-)
We created a powerful testing framework using Python unit tests, SaltStack and a headless hypervisor agent (npax). It is used to verify the status of a hardware box prior to going to production by:
  ◦ testing disk performance, network, etc., on bare metal
  ◦ spinning up virtual machines (with a salted disk image)
  ◦ testing CPU, memory and public/private network performance inside the virtual machines
Automation
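A minimal sketch of what one bare-metal acceptance check could look like as a Python unit test. It is not the framework described on the slide: the function name `measure_write_throughput`, the class `DiskAcceptanceTest` and the threshold `MIN_WRITE_MB_PER_S` are all hypothetical, and a real check would target the actual storage device with a much larger write.

```python
import os
import tempfile
import time
import unittest

# Hypothetical acceptance threshold; a real value depends on the hardware.
MIN_WRITE_MB_PER_S = 1.0


def measure_write_throughput(directory, size_mb=8):
    """Write size_mb MiB into a temp file in `directory`, fsync, return MiB/s."""
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mb):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # force the data to disk before timing stops
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mb / max(elapsed, 1e-9)


class DiskAcceptanceTest(unittest.TestCase):
    """Fails the box if sequential writes are slower than the threshold."""

    def test_sequential_write_speed(self):
        with tempfile.TemporaryDirectory() as scratch:
            speed = measure_write_throughput(scratch)
        self.assertGreaterEqual(speed, MIN_WRITE_MB_PER_S)
```

Saved as a module, this can be run with `python -m unittest`. Network, CPU and memory checks would follow the same pattern: one measuring function plus one assertion against an agreed threshold.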
and task management really simple
  ◦ we have different queues for different groups of tasks (starting servers, drive operations, etc.)
  ◦ if something goes wrong, the task can be retried or executed later
  ◦ gives flexibility: you can chain tasks together, add callbacks and track progress
Prioritised Message Management
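The queue behaviour listed above (priorities, retry on failure, chained callbacks) can be illustrated with a small in-memory sketch. This is an assumption-laden toy built on the standard library, not the task system used in production; `TaskQueue`, `submit`, `run` and `max_attempts` are hypothetical names.

```python
import heapq
import itertools


class TaskQueue:
    """In-memory priority queue of tasks with retries and chained callbacks."""

    def __init__(self, max_attempts=3):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority
        self.max_attempts = max_attempts
        self.failed = []  # (function name, args) of tasks that never succeeded

    def submit(self, func, *args, priority=10, then=None, _attempt=0):
        """Queue func(*args); lower priority numbers run first.

        `then` is an optional callback that receives the result (chaining).
        """
        entry = (priority, next(self._order), func, args, then, _attempt)
        heapq.heappush(self._heap, entry)

    def run(self):
        """Run until the queue is empty, re-queueing tasks that raise."""
        while self._heap:
            priority, _, func, args, then, attempt = heapq.heappop(self._heap)
            try:
                result = func(*args)
            except Exception:
                if attempt + 1 < self.max_attempts:
                    # Retry later: the task goes back to the end of its priority.
                    self.submit(func, *args, priority=priority,
                                then=then, _attempt=attempt + 1)
                else:
                    self.failed.append((func.__name__, args))
                continue
            if then is not None:
                self.submit(then, result, priority=priority)
```

For example, submitting a "drive operation" task at priority 5 and a "start server" task at priority 1 runs the server start first, regardless of submission order. A distributed setup would replace the heap with a message broker, but the retry and chaining logic keeps the same shape.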
  ◦ qnez - our custom storage solution
  ◦ npax - local agent for managing guest servers
  ◦ background - our agent for managing all kinds of background tasks and callbacks, such as
    ▪ receiving runtime data from hosts and VMs
    ▪ tracking task progress
    ▪ billing
    ▪ usage reports
Prioritised Message Management
the capabilities of a VM in terms of performance and memory utilization
• Libvirt is sufficient for simple scenarios, but we needed a bit more
• cgroupspy was born - a Python library for managing cgroups
• Under the new BSD license on GitHub: http://cld.sg/cgroupspy
• cgroups to manage Docker: http://cld.sg/dockercgroups
Go check them out!
The Challenge of Managing Load
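To make the cgroups idea concrete, here is a minimal sketch that limits a VM by writing to cgroup v1 control files directly. It deliberately does not use the cgroupspy API; the helper names are hypothetical, and the directories are passed in as parameters because the real location of a VM's cgroup depends on the host setup.

```python
import os


def write_cgroup_value(cgroup_dir, control_file, value):
    """Write a value into one control file of a cgroup directory."""
    with open(os.path.join(cgroup_dir, control_file), "w") as handle:
        handle.write(str(value))


def read_cgroup_value(cgroup_dir, control_file):
    """Read the current value of one control file as a stripped string."""
    with open(os.path.join(cgroup_dir, control_file)) as handle:
        return handle.read().strip()


def limit_vm(cpu_dir, memory_dir, cpu_shares, memory_bytes):
    """Set the relative CPU weight and the hard memory limit for one VM.

    cpu_dir and memory_dir are the VM's directories in the cgroup v1
    `cpu` and `memory` hierarchies (assumed to exist already).
    """
    write_cgroup_value(cpu_dir, "cpu.shares", cpu_shares)
    write_cgroup_value(memory_dir, "memory.limit_in_bytes", memory_bytes)
```

On a cgroup v1 host the directories would sit somewhere under `/sys/fs/cgroup/cpu/` and `/sys/fs/cgroup/memory/`, and writing to them requires root. A library such as cgroupspy wraps this file-level access in objects so that callers do not handle paths and string conversion themselves.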
- nfacct - captures *flow, or even traffic directly from port mirroring, and dumps it in any way imaginable; in our case, into a PostgreSQL database.
- exabgp - scriptable BGP daemon. We use it to announce blackholes and (in the future) to graph our network.
- some code to analyze traffic and trigger blackholes on obvious DoS patterns.
Existential threats
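A rough sketch of the last step, detecting an obvious flood and producing a blackhole announcement. The detection rule (a simple packets-per-second threshold per destination), the function names and the sample addresses are assumptions for illustration, not the analysis code from the talk. The command string follows the text form that an ExaBGP helper process prints on stdout, and 65535:666 is the well-known BLACKHOLE community from RFC 7999; whether upstream peers honour it depends on their policy.

```python
from collections import Counter

BLACKHOLE_COMMUNITY = "65535:666"  # well-known BLACKHOLE community (RFC 7999)


def find_flood_targets(flows, pps_threshold):
    """Return destination IPs whose summed packets/s exceed the threshold.

    `flows` is an iterable of (dst_ip, packets_per_second) pairs, for
    example rows aggregated from the flow table in the database.
    """
    totals = Counter()
    for dst_ip, pps in flows:
        totals[dst_ip] += pps
    return sorted(ip for ip, pps in totals.items() if pps > pps_threshold)


def blackhole_command(dst_ip, next_hop):
    """Build the announce line for a /32 blackhole route to dst_ip."""
    return "announce route {}/32 next-hop {} community [{}]".format(
        dst_ip, next_hop, BLACKHOLE_COMMUNITY
    )


# Sample data using documentation addresses (RFC 5737), not real traffic.
sample_flows = [
    ("192.0.2.10", 400000),
    ("192.0.2.10", 700000),
    ("192.0.2.20", 1200),
]
commands = [
    blackhole_command(ip, "192.0.2.1")
    for ip in find_flood_targets(sample_flows, pps_threshold=1000000)
]
```

Here only 192.0.2.10 crosses the threshold, so `commands` holds a single announce line for it. A production version would also need a way to withdraw the route once the attack subsides and a guard against blackholing critical addresses.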