Lessons from a cloud service provider

This talk offers a frank insight into the mistakes made and lessons learned from running a cloud service provider, and into the processes and systems we have in place today to avoid outages for our customers.

Robert Jenkins

May 13, 2015

Transcript

  1. Our Dev Process

    • Mercurial SCM, feature branch/trunk: commit code
    • Jenkins continuous integration: integration tests, unit tests, deployment on integration servers, acceptance tests
    • Code review
    • Mercurial SCM, stable/prod: review changes, merge to stable/prod
    • Mercurial SCM, separate repo for deployment: re-deployment script, production servers
    • Approve for deployment
  2. Automation

    • Puppet for automatic deployment and provisioning of hypervisors and storage machines
    • A dedicated internal project called steigy (roughly translates to ‘CRATES’)
      ◦ preparing and maintaining the environment for production
      ◦ applying system/security patches
  3. Automation

    Using saltstack: the MOTOCAR project (it means a forklift; you know, for the crates :-)

    We created a powerful testing framework using Python unit tests, saltstack and a headless hypervisor agent (npax). It verifies the status of a hardware box prior to going to production by:
      ◦ testing disk performance, network, etc., on bare metal
      ◦ spinning up virtual machines (with a salted disk image)
      ◦ testing CPU, memory and public/private network performance inside the virtual machines
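A hardware acceptance check of the kind described on this slide could be sketched as a plain Python unit test. This is only an illustration, not the MOTOCAR code: the threshold `MIN_WRITE_MB_PER_S`, the helper `measure_write_throughput` and the runner `run_acceptance` are hypothetical names, and the saltstack and npax parts are left out entirely.

```python
import os
import tempfile
import time
import unittest

# Hypothetical acceptance threshold; the talk does not give real numbers.
MIN_WRITE_MB_PER_S = 1.0


def measure_write_throughput(directory, size_mb=8):
    """Write size_mb of zeros to a temp file in directory and return MB/s."""
    chunk = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        for _ in range(size_mb):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # force the data to disk before timing stops
        elapsed = time.perf_counter() - start
    return size_mb / max(elapsed, 1e-9)


class DiskAcceptanceTest(unittest.TestCase):
    """One check out of many a box would need to pass before production."""

    def test_sequential_write_meets_threshold(self):
        with tempfile.TemporaryDirectory() as scratch:
            throughput = measure_write_throughput(scratch)
        self.assertGreaterEqual(throughput, MIN_WRITE_MB_PER_S)


def run_acceptance():
    """Run the suite and report whether the box passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiskAcceptanceTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_acceptance()` returns `True` only if every check passed, which makes it easy to gate a "ready for production" step on the result.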
  4. Cloud Architecture

    • API Orchestration Node: API, Background Services, PostgreSQL, RabbitMQ
    • Compute Host Node: libvirt, cgroups, iptables, ebtables, qemu/kvm, npax agent micro service
    • Storage Node: zfs, ComSTAR, qnez agent micro service
  5. Prioritised Message Management

    • We are using celery with rabbitmq. It makes RPC and task management really simple
      ◦ we have different queues for different groups of tasks (starting servers, drive operations, etc.)
      ◦ if something goes wrong, the task can be retried or executed later
      ◦ gives flexibility: you can chain tasks together, add callbacks and track progress
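The three ideas on this slide (separate queues per task group, retrying failed tasks, and chaining with callbacks) can be illustrated without a broker. The sketch below is a standard-library stand-in for the pattern, not Celery or RabbitMQ themselves; `TaskRunner`, `chain` and the queue names are hypothetical and chosen only for the example.

```python
from collections import deque


class TaskRunner:
    """Toy in-process stand-in for a broker plus its workers."""

    def __init__(self, queue_names):
        # Queues are drained in the order given, so earlier names have priority.
        self.queues = {name: deque() for name in queue_names}
        self.failed = []

    def submit(self, queue_name, func, *args, max_retries=3, callback=None):
        """Put a task on the named queue."""
        self.queues[queue_name].append(
            {"func": func, "args": args,
             "retries_left": max_retries, "callback": callback}
        )

    def _next_task(self):
        for pending in self.queues.values():
            if pending:
                return pending, pending.popleft()
        return None, None

    def run(self):
        """Process tasks until every queue is empty."""
        while True:
            pending, task = self._next_task()
            if task is None:
                break
            try:
                result = task["func"](*task["args"])
            except Exception as exc:
                if task["retries_left"] > 0:
                    task["retries_left"] -= 1
                    pending.append(task)  # retry behind other tasks in this queue
                else:
                    self.failed.append((task["func"].__name__, exc))
                continue
            if task["callback"] is not None:
                task["callback"](result)


def chain(*funcs):
    """Compose functions so each one receives the previous result."""
    def chained(value):
        for func in funcs:
            value = func(value)
        return value
    return chained
```

A runner created with `TaskRunner(["servers", "drives"])` drains the `servers` queue before `drives`, re-queues a task that raises until its retries run out, and passes each successful result to the task's callback.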
  6. Prioritised Message Management

    We use this approach for all of our micro services
      ◦ qnez - our custom storage solution
      ◦ npax - local agent for managing guest servers
      ◦ background - our agent for managing all kinds of background tasks and callbacks like
        ▪ receiving runtime data from hosts and VMs
        ▪ tracking task process
        ▪ billing
        ▪ usage reports
  7. The Challenge of Managing Load

    • We wanted more fine-grained control over the capabilities of a VM in terms of performance and memory utilization
    • Libvirt is sufficient for simple scenarios, but we needed a bit more
    • cgroupspy was born: a Python library for managing cgroups
    • Released under the new BSD license on GitHub: http://cld.sg/cgroupspy
    • cgroups to manage Docker: http://cld.sg/dockercgroups
    Go check them out!
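To make the idea of per-VM limits concrete, here is a minimal sketch that writes CPU and memory limits straight into cgroup control files. It assumes the cgroup v1 layout (`cpu.shares`, `memory.limit_in_bytes`) and does not use or reproduce the cgroupspy API; `apply_vm_limits` and the VM name are hypothetical.

```python
from pathlib import Path


def apply_vm_limits(cgroup_root, vm_name, cpu_shares, memory_limit_bytes):
    """Write CPU and memory limits for one VM into cgroup v1 control files.

    cgroup_root is usually /sys/fs/cgroup on a v1 host (assumption); passing
    a scratch directory instead makes the function safe to try out.
    """
    root = Path(cgroup_root)
    settings = {
        root / "cpu" / vm_name / "cpu.shares": cpu_shares,
        root / "memory" / vm_name / "memory.limit_in_bytes": memory_limit_bytes,
    }
    for control_file, value in settings.items():
        # On a real host, creating the directory creates the cgroup itself.
        control_file.parent.mkdir(parents=True, exist_ok=True)
        control_file.write_text(f"{value}\n")
    return [str(path) for path in settings]
```

Pointing `cgroup_root` at a temporary directory shows which files would be written without touching the host; on a real machine the call needs root privileges.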
  8. Existential threats

    DDoS attacks: for simple things, the building blocks are there:
    - nfacct: captures *flow, or even traffic directly from port mirroring, and dumps it in any way imaginable; in our case, into a PostgreSQL database
    - exabgp: a scriptable BGP daemon; we use it to announce blackholes and, in the future, to graph our network
    - some code to analyze traffic and trigger blackholes on obvious DoS patterns
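The "some code" step could look roughly like the sketch below: sum traffic per destination, pick the addresses over a threshold, and format blackhole announcements as ExaBGP text-API commands. This is an assumption-laden illustration rather than the production code: the flow records are plain dicts instead of PostgreSQL rows, the threshold and next-hop are placeholders, the function names are hypothetical, and the use of the well-known BLACKHOLE community is a guess at how the routes are tagged.

```python
from collections import defaultdict

# Well-known BLACKHOLE community (RFC 7999); using it here is an assumption.
BLACKHOLE_COMMUNITY = "65535:666"


def find_targets(flows, pps_threshold):
    """Sum packets per second by destination; return those over the threshold."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["dst"]] += flow["pps"]
    return sorted(dst for dst, pps in totals.items() if pps > pps_threshold)


def blackhole_commands(targets, next_hop):
    """Format one ExaBGP text-API announcement per attacked IPv4 address."""
    return [
        f"announce route {dst}/32 next-hop {next_hop} "
        f"community [{BLACKHOLE_COMMUNITY}]"
        for dst in targets
    ]
```

A helper process started by ExaBGP would typically print each returned line to standard output, which is how the daemon receives commands from scripts.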
  9. Recap

    • Eliminate human error through process
    • Robust messaging to scale services
    • Powerful tools to manage computing load
    • Network edge systems to protect cloud
  10. Instant Access

    • Get instant access today
    • Ping us and we’ll upgrade your resources
    • Free support 24/7