
Deploying a Private OpenStack Cloud at Scale


Matt Fischer

April 14, 2015



Transcript

  1. •  Both principal engineers at TWC on the OpenStack team
        o  Different backgrounds ranging from software development to operations to IT and engineering
        o  If you'd like to get in touch with us after the talk
        o  Here today to discuss our story
     •  Audience - show of hands:
        o  How many people here have used OpenStack?
        o  How many of you are using OpenStack hosted by someone else (RAX/HP, etc.)?
        o  How many are running their own OpenStack cloud?
        o  How many are planning to run their own OpenStack cloud?
  2. •  We'll start by explaining our deployment and some of our architectural decisions
     •  Then we'll dive into some things you need to consider when planning your deployment
  3. OPENSTACK IS COMPLICATED
     •  Multi-node, multi-service interdependencies
     •  New version every 6 months
     So...
     •  We cannot design your entire infrastructure during this talk.
     •  You will need to learn a lot, and there's a lot to think about.
     •  What we can do is help you get started thinking about how you will deploy OpenStack.
  4. •  Innovation
        o  Over 400 companies (IBM, Red Hat, HP, Intel, Percona, and TWC!)
        o  Thousands of engineers
        o  This drives innovation
     •  API focused
        o  Everything has a REST API
        o  The GUI and CLI are just wrappers
        o  Makes it easier to integrate with other tools
     •  Self-service culture
        o  Gives teams access to the APIs so they can design their own workflows
        o  Teams can create their own VMs, networks, and services
        o  This makes teams inside TWC more efficient
     •  Open
        o  Designed in public, with a chance to influence it via email, IRC, and summits
        o  You can have input on the design
     •  DevOps philosophy
        o  Affords rapid deployments and failing fast
        o  Stand up your app in minutes
        o  Bring up a new copy for A/B or nightly testing
        o  Scale up and down rapidly
     •  OpenStack is a platform, not a product
  5. •  We'll be mentioning OpenStack releases occasionally throughout the talk, and they're kind of weird if you're not used to how they work
     •  They're named after places near, or related to, where the associated OpenStack Summit occurred
     •  They started off with Austin at A
     •  The current release is J for Juno, which is named after a town near Atlanta
     •  The last design summit was in Paris; Paris is home to the Kilogram, the only remaining metric unit tied to an artifact
     •  There are also numeric names you may hear
        o  Bug-fix releases: juno.1, juno.2
  6. Clayton speaking: I want to talk a little bit about our deployment, so that you have some context to keep in mind for the rest of the talk.
  7. When we were originally standing up our OpenStack cloud, there were three key decisions we needed to make about the architecture.
     •  Identity
        o  We wanted a shared identity across our data centers, and we wanted to do that by leveraging our existing Active Directory infrastructure
     •  Network
        o  We wanted to give our customers the ability to fulfill their own network requests, instead of filing tickets for someone else to do it
     •  Storage
        o  We wanted live migration to allow us to do maintenance on hosts without downtime for our customers
        o  We wanted replication between our data centers for DR and performance purposes
     In the next few slides we'll talk about each of these areas in more detail.
  8. •  We had a requirement to allow customers to run their apps in either DC. Given this requirement, this is the architecture we have today.
     •  Made identity shared across data centers:
        o  6-node Keystone cluster - 3 nodes per data center
           §  MySQL on each node, with Galera for replication between nodes
           §  An arbitrator in a 3rd DC to prevent split-brain scenarios
        o  Active Directory integration using the Hybrid-Auth Keystone driver, which we forked from SuSE and adapted
           §  User authentication against MySQL first, falling back to AD
        o  Service accounts in MySQL only (see the sketch after this slide)
           §  We don't want to store personal AD credentials in text files on servers
        o  In this architecture, all other identity information is stored in MySQL
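To give a sense of what a service account talking to Keystone looks like from the client side, here is a minimal sketch using python-keystoneclient's v2.0 API. The endpoint, account names, and password are hypothetical placeholders, not the deployment's real values.

```python
# Minimal sketch: authenticating a service account against Keystone and
# listing tenants. Endpoint, credentials, and tenant name are hypothetical.
from keystoneclient.v2_0 import client as ks_client

keystone = ks_client.Client(
    username='svc-deploy',            # service account stored in MySQL, not AD
    password='example-password',
    tenant_name='infrastructure',
    auth_url='http://keystone.example.com:5000/v2.0',
)

# The token comes back from whichever backend matched (MySQL first, then AD).
print(keystone.auth_token)

# Admin-scoped accounts can enumerate tenants/projects.
for tenant in keystone.tenants.list():
    print(tenant.name)
```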
  9. •  OpenStack Networking is very flexible
        o  Challenging because there are so many options
        o  You will need to work closely with your network team to determine what is right for you
     •  We're using VXLAN to provide our customers with private virtual networks (tenant networks)
        o  NAT allows external connectivity for virtual machines
        o  Public IPs are assigned as needed from an external (provider) network for applications that the outside world needs to talk to
     •  Tenant networking is all self-service (see the sketch after this slide)
        o  If a customer wants another network, security policy changes, etc., they can do it themselves
        o  There is a firewall in front of the environment, but the defaults are fairly permissive
     •  We chose this approach because our network provisioning process was manual and slow
        o  Our customers come to us because they're slowed down by the more traditional processes
     •  Do you want your customers to be able to create their own networks, define their own security policies, etc.?
        o  There are many options for self-service networks
        o  Flexibility comes with a lot more complexity and less maturity
        o  When we first set this up, it didn't work that well under real load
        o  We've put a lot of time into patching, upgrading, and working out the kinks
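As a rough illustration of what "self-service" means in practice, here is a minimal sketch of a tenant creating its own network, subnet, and router with python-neutronclient. The credentials, names, CIDR, and external network name are hypothetical.

```python
# Minimal sketch: a tenant building its own private network via the Neutron API.
# Credentials, names, CIDR, and the provider network name are hypothetical.
from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(
    username='app-team',
    password='example-password',
    tenant_name='app-project',
    auth_url='http://keystone.example.com:5000/v2.0',
)

net = neutron.create_network({'network': {'name': 'app-net'}})['network']
subnet = neutron.create_subnet({'subnet': {
    'network_id': net['id'],
    'ip_version': 4,
    'cidr': '10.10.0.0/24',
}})['subnet']

# Attach the subnet to a router and uplink it to the shared provider network,
# which gives the VMs NAT'd external connectivity.
router = neutron.create_router({'router': {'name': 'app-router'}})['router']
neutron.add_interface_router(router['id'], {'subnet_id': subnet['id']})

ext = neutron.list_networks(name='provider-ext')['networks'][0]
neutron.add_gateway_router(router['id'], {'network_id': ext['id']})
```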
 10. •  Storage architecture was designed with our DR and operational requirements in mind
     •  Object Storage - Swift
        o  Similar to Amazon S3 - same use cases
        o  We replicate between DCs
        o  Internally we use this for storing things like backups and images/snapshots (see the sketch after this slide)
     •  Block Storage - Ceph
        o  Also working on adding different tiers of storage - for example, an SSD tier using SolidFire
        o  The main reason we're using Ceph is for shared storage, which allows live migration
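For a sense of how backups end up in Swift, here is a minimal sketch using python-swiftclient; the auth endpoint, credentials, container, and file names are hypothetical.

```python
# Minimal sketch: pushing a database backup into a Swift container.
# Auth URL, credentials, and object names are hypothetical.
import swiftclient

conn = swiftclient.client.Connection(
    authurl='http://keystone.example.com:5000/v2.0',
    user='svc-backup',
    key='example-password',
    tenant_name='infrastructure',
    auth_version='2',
)

conn.put_container('db-backups')
with open('/var/backups/galera-2015-04-14.tar.gz', 'rb') as f:
    conn.put_object('db-backups', 'galera-2015-04-14.tar.gz', contents=f)

# get_container returns (headers, [objects]); list what is stored.
headers, objects = conn.get_container('db-backups')
for obj in objects:
    print(obj['name'], obj['bytes'])
```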
 11. •  You'll need to think about whether you want to support live migration
        o  What is LM? It enables moving running instances between compute hosts (a host-evacuation sketch follows this slide)
           §  Really useful when doing maintenance work
        o  Polarizing subject - not very "cloudy"
        o  Downside: requires shared storage, which is generally more expensive than local storage in servers
        o  Upside: enables more traditional, less cloudy applications, ones that may not handle being taken offline or having the host rebooted
        o  If your customers care about their instances going away, this can make operations much simpler
        o  One challenge for us was that not all storage vendors support live migration
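Operationally, clearing a compute host for maintenance looks roughly like the sketch below, using python-novaclient. The hostname and credentials are hypothetical, and exact arguments vary between releases.

```python
# Minimal sketch: live-migrating every instance off a compute host before
# maintenance. Credentials and the hostname are hypothetical.
from novaclient import client as nova_client

nova = nova_client.Client('2', 'admin', 'example-password', 'admin',
                          'http://keystone.example.com:5000/v2.0')

source_host = 'compute-017.example.com'

# Find instances on the host (admin-only filters) and move each one;
# with shared storage (Ceph), block_migration stays False.
for server in nova.servers.list(search_opts={'all_tenants': True,
                                             'host': source_host}):
    print('live-migrating %s off %s' % (server.name, source_host))
    nova.servers.live_migrate(server,
                              host=None,            # let the scheduler pick
                              block_migration=False,
                              disk_over_commit=False)
```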
 12. Over the past year we've iterated on our tool chain. This shows some of our main tools for our OpenStack development and deployment. Rather than trying to draw a picture as complex as the OpenStack design, we'll be talking about how we use these tools during the rest of this presentation.
 13. Matt speaking: Now that we've explained our architecture, let's dive into the seven areas that we think you need to consider when planning your deployment.
 14. Matt speaking: All snowflakes are special and different, which is great for snowflakes, but it is not a good model for managing servers. Hardware will fail, links will fail, power will fail.
     •  How many of you can rebuild a server if it dies in the middle of the night? We can destroy any node we want and have it back up and online in less than 45 minutes.
     •  How many of you spend too much time troubleshooting weird one-off problems? Don't - rebuild instead.
        o  It's not a bug if it doesn't happen twice. You don't have time to diagnose it in many cases.
     •  Design your system with HA in mind. Tools like Galera mean that losing one box is not a big deal.
 15. OS installs:
     •  We use Cobbler for OS installs
        o  We're not endorsing Cobbler, but it's simple and works well for us
        o  There are a lot of good tools in this space
     •  Cobbler is configured by Puppet on the build server
     •  The goal of Cobbler is to install Ubuntu and drop a Puppet config file such that the node can successfully talk to the Puppet master
     Puppet:
     •  Installs packages
     •  Upgrades the kernel
     •  Manages configs
     •  Ensures services are running
     Orchestration:
     •  Manages inter-node dependencies
        o  For example, don't restart Galera on all nodes at once
     •  Pre- and post-puppet-run checks during deployments (see the sketch after this slide)
        o  If you break the first control node, you should stop deploying before you break all 3
     Why both? We use each tool to its strengths.
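The orchestration idea - deploy one node at a time, with a health check before and after each Puppet run, and stop at the first failure - could be sketched in Python like this. The host list, the ssh invocation, and the health-check command are hypothetical placeholders, not the actual tooling described in the talk.

```python
# Minimal sketch: rolling deploy with pre/post checks, one control node at a
# time. Hostnames and the health-check command are hypothetical placeholders.
import subprocess
import sys

CONTROL_NODES = ['control-001', 'control-002', 'control-003']

def healthy(host):
    """Pre/post check: e.g. APIs responding, Galera node synced."""
    return subprocess.call(['ssh', host, '/usr/local/bin/node-health-check']) == 0

for host in CONTROL_NODES:
    if not healthy(host):
        sys.exit('%s unhealthy before deploy; aborting' % host)

    # 'puppet agent -t' uses detailed exit codes: 0 = no changes, 2 = changes
    # applied successfully, anything else indicates a failure.
    rc = subprocess.call(['ssh', host, 'puppet agent -t'])
    if rc not in (0, 2):
        sys.exit('puppet run failed on %s (exit %d); aborting' % (host, rc))

    if not healthy(host):
        sys.exit('%s unhealthy after deploy; stopping before the next node' % host)

    print('%s deployed and healthy' % host)
```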
 16. •  External dependencies are unreliable and change all the time
        o  External dependencies are things outside of your environment that you depend on: package repos, GPG keys, puppet modules, etc.
        o  Mirroring alone isn't enough; you also need to version, so that you can hold old versions until you decide to upgrade
     •  Could you still deploy or rebuild a box if Percona's software repos went down? What about the main Ubuntu keyserver? Within the past year we've had failures like these several times. Even if they're up, what if the repo you're using deletes the version you need?
     •  Story: Late last year we wanted to rebuild some nodes, but when we did we found out that we'd accidentally upgraded some of them from Icehouse to Icehouse.2 in the process. Now we had a cluster with a mix of versions, making problems nearly impossible to track down. Fortunately this was in staging, but it was a real driver for us to start mirroring and managing our external dependencies.
        o  Upgrading regularly is important, but you want to do it intentionally
     •  External dependency & repo management leads to repeatable builds, which provide the ability to have throw-away environments
 17. •  A prod node goes down at 3am. Your cluster is now compromised and only has 2 members; what if you lose one more? You've set up automation, so you should be able to rebuild this node, but when was the last time you did it? 3 months ago? How much code has changed since then? How many packages in the OS? Kernels? Think it will work? Do you want to be debugging this at 3am?
     •  Using Vagrant and the Vagrant OpenStack plugin, we can build any of our node types on top of OpenStack. This includes build nodes, control, compute, keystone, monasca, swift, etc. It's as simple as vagrant up dev02-keystone-001.
     •  Automating installs is only valuable if you know the rebuilds will be successful.
     •  So every hour we rebuild all our major node types using Jenkins (see the sketch after this slide).
     •  We also use these Vagrant-based dev environments to do all our development work, testing code, config changes, and upgrades.
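As a rough sketch of what an hourly "can we still rebuild everything?" job can look like, the loop below shells out to Vagrant for each node type and fails loudly if any build breaks. The node-type names are hypothetical, and in the setup described here the job itself runs under Jenkins.

```python
# Minimal sketch of an hourly rebuild-verification job.
# Node-type names are hypothetical; in practice this runs under Jenkins.
import subprocess
import sys

NODE_TYPES = ['dev02-build-001', 'dev02-control-001', 'dev02-compute-001',
              'dev02-keystone-001', 'dev02-swift-001']

failures = []
for node in NODE_TYPES:
    print('rebuilding %s' % node)
    # 'vagrant up' provisions the node on OpenStack via the provider plugin;
    # 'vagrant destroy -f' tears it down again afterwards.
    if subprocess.call(['vagrant', 'up', node]) != 0:
        failures.append(node)
    subprocess.call(['vagrant', 'destroy', '-f', node])

if failures:
    sys.exit('rebuild failed for: %s' % ', '.join(failures))
print('all node types rebuilt successfully')
```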
 18. Clayton speaking: Very early in the process you're going to have to think about your plans for high availability. One of the big choices is whether to do Active/Active or Active/Passive HA for services, so let's talk about the pros and cons of each.
 19. •  Active/Active
        o  Cluster of peers
           §  All nodes are typically sharing load
           §  You don't have to worry about a broken passive node
        o  Can be more complex
           §  Frequently requires specialized application support
              •  Galera for replication, RabbitMQ clustering, etc.
           §  This will require learning the intricacies of these features
        o  Faster failover
           §  Clients just reconnect to another active node
           §  There is no delay to transfer resources and bring up the service on another node
        o  Easier maintenance
           §  Take a node out of your load balancer and the cluster and do whatever you need
        o  More hardware
           §  Have to avoid split-brain in active/active clusters (a quorum-check sketch follows this slide)
           §  Typically requires an odd number of nodes, so you need 3 instead of 2 minimum
        o  Works well with OpenStack
           §  State is generally stored in the database and services communicate using RabbitMQ
           §  Standard scaling for OpenStack services is to run more of them
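One concrete piece of running a Galera-based Active/Active cluster is confirming that the cluster still has quorum and that each node is synced, which the wsrep status variables expose. A minimal sketch in Python using the PyMySQL driver; the host list and credentials are hypothetical.

```python
# Minimal sketch: verify Galera quorum and sync state on each node.
# Hostnames and credentials are hypothetical.
import pymysql

NODES = ['control-001', 'control-002', 'control-003']
EXPECTED_SIZE = len(NODES)

def wsrep_status(host, variable):
    conn = pymysql.connect(host=host, user='monitor', password='example-password')
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE %s", (variable,))
            row = cur.fetchone()
            return row[1] if row else None
    finally:
        conn.close()

for host in NODES:
    size = wsrep_status(host, 'wsrep_cluster_size')
    state = wsrep_status(host, 'wsrep_local_state_comment')
    ok = size is not None and int(size) == EXPECTED_SIZE and state == 'Synced'
    print('%s: cluster_size=%s state=%s %s'
          % (host, size, state, 'OK' if ok else 'PROBLEM'))
```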
 20. HA Active/Passive
     •  Generally the pros and cons are just reversed
     •  Only requires two nodes
        o  This can be a big advantage in smaller environments
     •  Better tested
        o  This is a very traditional, well-tested approach to HA
        o  The software is generally very mature (Corosync, Pacemaker, etc.)
     •  More configuration
        o  May require shared IP addresses, shared storage
        o  May require writing scripts to transfer and start up services
     •  We prefer the Active/Active approach
        o  Most OpenStack operators are doing Active/Active also
        o  Enough people are using Active/Passive that you still see support for it in automation tools
 21. •  HA - other considerations
        o  All environments should be HA
           §  Even if that means most dev environments are "single node HA"
           §  It's important that your configuration be the same
           §  It's important that you can test HA in dev
              •  You want to be able to test failover, test upgrades, etc.
 22. Clayton speaking: Everyone has to make changes to their OpenStack environments... What should that process look like?
 23. •  Have a process
        o  Do changes the same way, every time
        o  Do them regularly! Deploy often; smaller changes are better understood and carry less risk
     •  It's OK to start with a written, manual process
        o  It will take time to learn what will work for your team
        o  It will change rapidly in the short term, and a manual process can be easier to change
        o  Start by automating pieces at a time, chaining those pieces together by hand
        o  Over time, iterate towards a completely automated process
     •  Characterize the types of deployments you have
        o  Regular cadence: weekly, bi-weekly, whatever
        o  Or...
     •  One-off deployments, upgrades
        o  Automate!
           §  Usually more possible to automate than you'd think
        o  Simplifies testing
           §  For example, for our most recent OpenStack upgrade, we upgraded dozens of times in a dev environment in order to test the process
 24. •  Overview - shows our deployment process
        o  Almost all dev work (Puppet/Python/whatever) happens in virtualized environments
        o  Changes are submitted to Gerrit for code review and automated testing by Jenkins
        o  Merged to master
        o  Master is deployed to shared dev
        o  Release tags are cut at least once a week
        o  Deployed to staging
           §  Standard validation process + one-off validations for specific changes
           §  Bakes for x days, depending on the changes
        o  Deployed to production
     •  We started with just production - and worked backwards
     •  Ansible & Puppet for deployment & config management in all environments
 25. Matt speaking: OpenStack is driven by the community, so you need to consider how you will or will not participate in it.
 26. Everyone should join the operators list: everyone has the same issues, and some may have solved them for you.
     Processes:
     •  Bugs
     •  Features
     •  Fixes (Gerrit) - even just doc fixes
     Participation:
     •  Summit every 6 months
     •  Operators meetup between summits
     •  Doesn't just mean meetings; it can mean mailing lists, IRC, meetups, etc.
 27. Tools:
     •  Obvious ones: Jenkins, Git
     •  Gerrit & git-review, etc.
     •  Nodepool
     •  Jenkins Job Builder
     •  git-upstream
 28. •  OpenStack upgrades used to be very painful; they are better now but still require careful planning and testing
        o  You do not want to get stuck 3 releases behind
     •  Automate your upgrades
        o  Automation of upgrades allows extensive testing, less downtime, and less human error
           §  It doesn't save development time, but it saves you from mistakes
     •  Have an environment where you can test your upgrades
        o  If possible, test with production data; we ran into issues that we only discovered when we used production data
     •  DB migrations are the biggest source of downtime during upgrades
        o  In most cases you can't run code from the old version of a service against the new schema
        o  This is getting better
           §  In Kilo, Nova is moving to no-downtime migrations
           §  Deprecating schema before dropping columns
           §  Adding new columns instead of changing the semantics of existing ones
 29. •  Database, RabbitMQ, your OS, kernel & reboots
     •  Handle these with the same approach as upgrading OpenStack
        o  Automation
        o  Extensive testing
 30. •  We use Icinga and are rolling out Monasca, but it doesn't matter what monitoring tool you set up; the tools matter less than the process.
     •  Start small and build.
        o  We're still adding checks almost every week
     •  Make alerts actionable: send them somewhere, and make sure you can do something about them; if they just sit in Icinga nobody will see them. You need to document what happens when an alert comes in. (A minimal check sketch follows this slide.)
     •  Make sure someone is responsible. You need to have an on-call rotation or someone deemed responsible per service.
        o  You need to be sure someone is monitoring staging and dev environments too
        o  There's no point in changes sitting in staging if no one is checking whether they work
     •  Don't configure your checks by hand. When a new node comes up, its profile defines its Icinga checks and the Icinga server just picks them up. This makes your configuration very flexible.
        o  Don't require updates in two places for new servers - it will always be out of sync
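To make "actionable" concrete: an Icinga/Nagios-style check is just a small program that prints one line and exits with a conventional code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of a check that verifies an API endpoint answers quickly; the URL and thresholds are hypothetical.

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios/Icinga-style check: verify an HTTP API answers
# quickly. The URL and thresholds are hypothetical.
import sys
import time
import requests

URL = 'http://keystone.example.com:5000/v2.0'   # hypothetical endpoint
WARN_SECONDS = 1.0
CRIT_SECONDS = 5.0

try:
    start = time.time()
    resp = requests.get(URL, timeout=CRIT_SECONDS)
    elapsed = time.time() - start
except requests.RequestException as exc:
    print('CRITICAL: %s unreachable: %s' % (URL, exc))
    sys.exit(2)

if resp.status_code >= 500:
    print('CRITICAL: %s returned %d' % (URL, resp.status_code))
    sys.exit(2)
elif elapsed > WARN_SECONDS:
    print('WARNING: %s took %.2fs' % (URL, elapsed))
    sys.exit(1)

print('OK: %s answered %d in %.2fs' % (URL, resp.status_code, elapsed))
sys.exit(0)
```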
 31. •  RabbitMQ
        o  The message broker used by almost all OpenStack components to communicate with each other
        o  Rabbit failures are difficult to detect and have wide effects
           §  May manifest as, for example, volume attaches randomly failing
           §  It takes time to notice a pattern of behavior for these types of failures and then figure out that Rabbit is having issues
        o  This is a focus area for us and the operators community (see the sketch after this slide)
           §  Adding monitoring
           §  Adding some automated queue cleanup
           §  Heartbeat support
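One way to watch for trouble is to poll the RabbitMQ management API for queues that are backing up or have lost their consumers. A minimal sketch, assuming the management plugin is enabled on its default port; the host, credentials, and threshold are hypothetical.

```python
# Minimal sketch: flag RabbitMQ queues that are backing up or unconsumed,
# via the management plugin's HTTP API. Host, credentials, and the message
# threshold are hypothetical.
import requests

RABBIT_API = 'http://rabbit-001.example.com:15672/api/queues'
AUTH = ('monitor', 'example-password')
MAX_MESSAGES = 500

queues = requests.get(RABBIT_API, auth=AUTH, timeout=10).json()

for q in queues:
    messages = q.get('messages', 0)
    consumers = q.get('consumers', 0)
    # Flag queues that are piling up, or that have messages but nobody reading.
    if messages > MAX_MESSAGES or (messages and consumers == 0):
        print('PROBLEM: %s/%s messages=%d consumers=%d'
              % (q['vhost'], q['name'], messages, consumers))
```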
 32. •  Neutron (OpenStack Networking)
        o  When Neutron has problems, guest VMs can lose network access and customers get upset
        o  Open vSwitch crashes; newer is usually better
        o  Don't be the first one to try a new feature (Distributed Virtual Router or HA routers)
        o  Have maintenance/migration plans in place
           §  We created Ansible tools for this
 33. •  Kernel panics
        o  We've had issues in the past with kernel panics on compute nodes (hypervisors) and control nodes
        o  We've upgraded the kernel several times and hope to have mitigated this somewhat
        o  You might need a plan for how you will do maintenance like kernel upgrades
        o  Debugging kernel panics is not our strong suit, so we relied heavily on Canonical to provide kernel fixes for us
 34. •  Users
        o  Yes, users can be challenging
        o  Users need to be educated about OpenStack; we invested in user training
        o  OpenStack, and more broadly cloud computing, represents a cultural shift for some users who may be used to running on dedicated (and expensive) five-nines hardware; their app may not be cloud-ready
        o  Finally, you need to build tooling here for users
           §  You need an on-boarding process to set up accounts, projects, etc. You need a similar process to off-board when people leave the company. You might need to plan what happens to a VM that a whole team is using but was created by a guy who just resigned - who takes ownership of the resources?
           §  You need a ticketing system, and you might need at least an informal SLA with your users
     •  Finally, one of our key takeaways is that you need to know your strengths and weaknesses; you might need a DB support contract, you might need an OS contract
 35. We've been live now for almost a year, so what are our future plans for OpenStack?
 36. •  Integration testing
        o  Add more automated validation in our multi-node testing
        o  Automated deployment testing in multi-node environments
     •  Deployment tool improvements
        o  Would like more visibility into what exactly is in each deployment
        o  Investigating tooling around generating release notes from git
     •  Integration testing & deployment tooling allow more frequent, better understood deployments
     •  Python virtual environments for deploying OpenStack services
        o  If you want to deploy something newer than your vendor provides, you have to package it
           §  Packages are heavy-weight and dependencies are a pain
        o  Virtual environments make it easy to have multiple versions installed and to switch between versions
        o  Experimenting with this right now with Designate (DNS as a Service)
 37. Now that we have the underlying pieces of OpenStack pretty solid, including our tool chain and processes, we're focusing on providing more services for our customers. The services listed here are the most requested features from our customers, and over the next few months we'll be rolling them out.
     •  DNS as a Service (Designate)
        o  Provides a way for customers to make their own DNS records
        o  Released to limited beta in early April
     •  Load Balancer as a Service
        o  A large part of HA for many of our customers' apps is load balancing and failover; we plan on rolling this out by summer
     •  Monitoring as a Service (Monasca)
        o  We're working closely with the upstream Monasca team to get this rolled out. Two members of our team are writing the puppet module to deploy this.
     •  Looking even further out into 2015, we plan on offering services like:
        o  Database as a Service (Trove)
        o  File Share as a Service (Manila)
 38. Matt speaking: In summary:
     •  We did not solve all seven of these things the exact way we have them now on day one
     •  You may not address all these issues on day one either, but you should plan to iterate on them. Plan to constantly improve.
     •  You need to have a cadence for deployments and upgrades; you don't want to fall behind
     •  Automation prevents human error and leads to testable and repeatable processes
     Our goal today was to help you start considering how you plan to address the seven areas we've brought up, and hopefully you've got some notes or questions to think about.
 39. •  Clayton and I will be giving several talks at the OpenStack Summit in Vancouver in May which will expand on some of the areas that we covered today
     •  If you want to dive into any of these areas, we'd love to see you there
 40. •  BONUS!
     •  We're all human, and any time you start making config or code changes, you're going to make dumb mistakes.
     •  One of the ways we try to mitigate this is through code review.
 41. •  We use Gerrit for code review
     •  We see a lot of benefits from doing code review
        o  Code quality goes up any time you have someone else even skim over a change that you're proposing
        o  This is also a good opportunity for knowledge sharing and mentoring
        o  One thing we didn't really anticipate is that it provides a better sense of shared ownership of your configuration and code
           §  If you make a mistake and 3 other people approved it, then it's really hard to point fingers
        o  One nice feature of Gerrit is that it's very easy to integrate pre-merge testing with Jenkins
           §  This means that you can prevent merging changes to your master branch that don't work
     •  Code review can be hard to sell
        o  Our management had experience with it and was very supportive
        o  There is definitely a learning curve and a change in process
        o  We feel really strongly that if your infrastructure is defined by code, you should be doing code reviews