Lessons from a cloud service provider

Experiences of a Cloud Service Provider

Next generation power station for digital energy

We offer a virtual data center platform with equivalent performance
and configurability to hardware.

Even devops superheroes...

...make mistakes...

...which can lead to this.

The Main Root Causes
• Human error
• System Timeouts
• Customer Computing Load
• DDoS

Our Dev Process
• Mercurial SCM (feature branch/trunk): commit code
• Jenkins continuous integration: unit tests, integration tests, deployment on integration servers, acceptance tests
• Code review: review changes
• Mercurial SCM (stable/prod): merge to stable/prod
• Mercurial SCM (separate repo for deployment): approve for deployment, re-deployment script
• Production servers

Automation
• Puppet for automatic deployment and provisioning of hypervisors and storage machines
• A dedicated internal project called steigy (roughly translates to ‘CRATES’)
  ◦ preparing and maintaining the environment for production
  ◦ applying system/security patches

Automation
Using saltstack - the MOTOCAR project (it means a forklift - you know, for the crates :-)
We created a powerful testing framework using python unit tests, saltstack and a headless hypervisor agent (npax). It is used for verifying the status of a hardware box prior to going to production by:
  ◦ testing disk performance, network, etc., on bare metal
  ◦ spinning up virtual machines (with a salted disk image)
  ◦ testing CPU, memory and public/private network performance inside the virtual machines
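To make the bare-metal disk check concrete, here is a minimal sketch of one such check written as a python unit test. It is not the MOTOCAR code: the function name, the 8 MB sample size and the 1 MB/s threshold are all assumptions chosen for illustration, and a real check would use values calibrated for the hardware and would be dispatched to the box by saltstack.

```python
import os
import tempfile
import time
import unittest

# Hypothetical acceptance threshold; a real check would use a value
# calibrated for the hardware being verified.
MIN_WRITE_MB_PER_S = 1.0


def measure_write_throughput(directory, size_mb=8):
    """Write size_mb of zeros to a temp file in directory and return MB/s."""
    chunk = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        for _ in range(size_mb):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # force the data out of the page cache
        elapsed = time.perf_counter() - start
    return size_mb / max(elapsed, 1e-9)


class BareMetalDiskTest(unittest.TestCase):
    def test_sequential_write_is_fast_enough(self):
        with tempfile.TemporaryDirectory() as scratch:
            throughput = measure_write_throughput(scratch)
        self.assertGreater(throughput, MIN_WRITE_MB_PER_S)


def run_checks():
    """Run the suite and report whether the box passed."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(BareMetalDiskTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("box ready for production:", run_checks())
```

The same pattern extends to network and in-VM checks: one measurement function per resource, one test case asserting it against a threshold.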

Cloud Architecture
• API Orchestration Node: API, Background Services, PostgreSQL, RabbitMQ
• Compute Host Node: libvirt, cgroups, iptables, ebtables, qemu/kvm, npax agent micro service
• Storage Node: zfs, ComSTAR, qnez agent micro service

Prioritised Message Management
• We are using celery with rabbitmq. It makes RPC and task management really simple
  ◦ we have different queues for different groups of tasks (starting servers, drive operations, etc.)
  ◦ if something goes wrong, the task can be retried or executed later
  ◦ gives flexibility - you can chain tasks together, add callbacks and track progress
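The three ideas on this slide (a queue per task group, retry on failure, chaining) can be modelled in a few lines of plain python. This is a standard-library stand-in for illustration only: it is not celery's API and does not talk to rabbitmq, and the names Broker, RetryLater, create_drive and start_server are hypothetical.

```python
from collections import deque


class RetryLater(Exception):
    """Raised by a task to ask for another attempt."""


class Broker:
    """In-memory stand-in for a message broker with named queues."""

    def __init__(self, max_retries=3):
        self.queues = {}
        self.max_retries = max_retries
        self.log = []  # (task name, result) pairs, for tracking progress

    def send(self, queue, func, arg, then=(), attempt=0):
        message = (func, arg, tuple(then), attempt)
        self.queues.setdefault(queue, deque()).append(message)

    def work(self, queue):
        """Drain one queue, as a worker dedicated to that task group would."""
        pending = self.queues.setdefault(queue, deque())
        while pending:
            func, arg, then, attempt = pending.popleft()
            try:
                result = func(arg)
            except RetryLater:
                if attempt < self.max_retries:
                    # Put the task back on its queue to be retried later.
                    self.send(queue, func, arg, then, attempt + 1)
                else:
                    self.log.append((func.__name__, "failed"))
                continue
            self.log.append((func.__name__, result))
            if then:
                # Chain: hand the result to the next task on its own queue.
                (next_queue, next_func), rest = then[0], then[1:]
                self.send(next_queue, next_func, result, rest)


_calls = {"create_drive": 0}


def create_drive(name):
    # Fails on its very first call to exercise the retry path.
    _calls["create_drive"] += 1
    if _calls["create_drive"] == 1:
        raise RetryLater()
    return name + ".img"


def start_server(drive):
    return "server booted from " + drive


def demo():
    broker = Broker()
    broker.send("drives", create_drive, "disk0",
                then=[("servers", start_server)])
    broker.work("drives")   # drive operations have their own queue
    broker.work("servers")  # so do server starts
    return broker.log
```

Calling demo() runs the drive task (with one retry) and then the chained server task, each on its own queue.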

Prioritised Message Management
We use this approach for all of our micro services
  ◦ qnez - our custom storage solution
  ◦ npax - local agent for managing guest servers
  ◦ background - our agent for managing all kinds of background tasks and callbacks like
    ▪ receiving runtime data from hosts and VMs
    ▪ tracking task process
    ▪ billing
    ▪ usage reports

The Challenge of Managing Load
• We wanted to have more fine-grained control over the capabilities of a VM in terms of performance and memory utilization
• Libvirt is sufficient for simple scenarios, but we needed a bit more
• cgroupspy was born - a python library for managing cgroups
• Under new BSD license on github – http://cld.sg/cgroupspy
• cgroups to manage Docker: http://cld.sg/dockercgroups
Go check them out!
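For readers new to cgroups, the sketch below shows the underlying mechanism: limits are set by writing values into control files. It uses plain file writes against the cgroup v1 layout (cpu.shares, memory.limit_in_bytes) rather than cgroupspy's own API, and the function name, the per-VM directory naming and the example values are assumptions for illustration. The demo writes into a temporary directory; on a real host the root would be the mounted cgroup filesystem and would need root privileges.

```python
import os
import tempfile


def limit_vm(cgroup_root, vm_name, cpu_shares, memory_bytes):
    """Create per-VM cgroups and write CPU and memory limits (v1 layout)."""
    settings = {
        ("cpu", "cpu.shares"): cpu_shares,
        ("memory", "memory.limit_in_bytes"): memory_bytes,
    }
    written = {}
    for (controller, filename), value in settings.items():
        # One directory per controller and VM, e.g. <root>/cpu/<vm_name>
        group_dir = os.path.join(cgroup_root, controller, vm_name)
        os.makedirs(group_dir, exist_ok=True)
        path = os.path.join(group_dir, filename)
        with open(path, "w") as handle:
            handle.write(str(value))
        written[path] = value
    return written


def demo():
    # A temp directory stands in for the cgroup mount point.
    with tempfile.TemporaryDirectory() as fake_root:
        written = limit_vm(fake_root, "guest-42",
                           cpu_shares=512,
                           memory_bytes=2 * 1024 ** 3)
        return sorted(os.path.relpath(p, fake_root) for p in written)


if __name__ == "__main__":
    for path in demo():
        print(path)
```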

Existential threats
DDoS attacks: for simple things, the building blocks are there:
- nfacct - capture *flow, or even directly from port mirroring, and dump in any way imaginable; in our case, in a PostgreSQL database.
- exabgp - scriptable BGP daemon. We use it to announce blackholes and (in the future) to graph our network.
- some code to analyze traffic and trigger blackholes on obvious DoS patterns.
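The "some code" step might look roughly like the sketch below: sum packets per destination from flow records, pick out destinations above a threshold, and produce route announcements for exabgp. This is an illustration, not the production code. The record shape, threshold, next-hop and function names are assumptions; the command text follows the form exabgp accepts from helper processes as we understand it and should be checked against the version in use. 65535:666 is the well-known BLACKHOLE community from RFC 7999, and all addresses are documentation ranges.

```python
from collections import Counter

BLACKHOLE_COMMUNITY = "65535:666"  # well-known BLACKHOLE community (RFC 7999)


def find_targets(flows, max_packets):
    """Return destination IPs whose summed packet count exceeds max_packets.

    flows is an iterable of (destination_ip, packets) pairs, e.g. rows
    read from the flow table for the last measurement window.
    """
    totals = Counter()
    for destination, packets in flows:
        totals[destination] += packets
    return sorted(ip for ip, count in totals.items() if count > max_packets)


def blackhole_commands(targets, next_hop):
    """Build one announce command per target, as text lines for exabgp."""
    return [
        "announce route {}/32 next-hop {} community [{}]".format(
            ip, next_hop, BLACKHOLE_COMMUNITY)
        for ip in targets
    ]


def demo():
    flows = [
        ("198.51.100.7", 900000),
        ("198.51.100.7", 400000),
        ("198.51.100.9", 1200),
    ]
    targets = find_targets(flows, max_packets=1000000)
    return blackhole_commands(targets, next_hop="192.0.2.1")


if __name__ == "__main__":
    # A helper process started by exabgp would print these to stdout.
    for line in demo():
        print(line)
```

A simple packet-count threshold only catches the obvious patterns mentioned above; anything subtler needs more context than a single window of flow totals.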

So we got a lot better & more cautious...

Recap
• Eliminate human error through process
• Robust messaging to scale services
• Powerful tools to manage computing load
• Network edge systems to protect cloud

Use mistakes as opportunities

Instant Access
• Get instant access today
• Ping us and we’ll upgrade your resources
• Free support 24/7

Thank you! [email protected] www.cloudsigma.com/blog status.cloudsigma.com

Robert Jenkins
