So Long, Public Cloud

So Long, Public Cloud
Seattle DevOps Meetup
Lessons learned migrating 400 servers to an OpenStack private cloud
Chris Snell
Revinate, Inc.

Who is Revinate?
SaaS for the hospitality industry
We deliver rich guest data and feedback to hotels that improve their customers’ experiences

Genesis
In Q1 2014, we were…
…running ~200 instances in Rackspace Public Cloud
…using about 800 GB of RAM in aggregate
…and a bunch of Cloud Databases (DBaaS) instances
…and a dozen Cloud Load Balancers (LBaaS)
…and growing 20% month-over-month

Genesis
It was really expensive. Five-figure hosting bill, every month.

Genesis
And it was a crappy experience. We had…
…unpredictable and often poor network performance.
…noisy neighbor problems.
…constant scanning from unknowns probing for vulnerabilities.

Why Private Cloud?
Three reasons:
- Cost
- Performance
- Security

Private Cloud: Cost
Public Cloud
- 800 GB RAM
- Shared networks
- Shared SW load balancers
- IPtables firewalls
- $X / month
Private Cloud
- 2.75 TB RAM
- 22 dedicated physical servers
- Dedicated networks
- Dedicated HW load balancers
- Dedicated HW firewalls
- < 0.9 x $X / month
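To make the comparison concrete, the slide's figures can be turned into a relative cost per GB of RAM. This is a back-of-the-envelope sketch, not a figure from the talk: it assumes 1 TB = 1024 GB, treats the private cloud bill as exactly 0.9 x $X (the slide says less than that, so the result is an upper bound), and ignores everything except RAM.

```python
# Figures from the slide; X is the (undisclosed) public cloud monthly bill.
PUBLIC_RAM_GB = 800
PRIVATE_RAM_GB = 2.75 * 1024   # assumption: 1 TB = 1024 GB
PRIVATE_COST_FACTOR = 0.9      # slide says "< 0.9 x $X", so this is an upper bound


def relative_cost_per_gb(public_ram_gb, private_ram_gb, private_cost_factor):
    """Return private cloud cost per GB as a fraction of public cloud cost per GB."""
    public_per_gb = 1.0 / public_ram_gb                    # in units of $X per GB
    private_per_gb = private_cost_factor / private_ram_gb  # in units of $X per GB
    return private_per_gb / public_per_gb


ram_ratio = PRIVATE_RAM_GB / PUBLIC_RAM_GB
cost_ratio = relative_cost_per_gb(PUBLIC_RAM_GB, PRIVATE_RAM_GB, PRIVATE_COST_FACTOR)

print(f"RAM: {ram_ratio:.2f}x the public cloud footprint")
print(f"Cost per GB of RAM: at most {cost_ratio:.0%} of the public cloud figure")
```

Under these assumptions the private cloud offers roughly 3.5 times the RAM at about a quarter of the per-GB cost.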

Private Cloud: Performance
The network is incredible.
1 Gbit/sec (bidirectional) from any instance, no matter the size
No more paying for larger instances than you need just to get the bandwidth you need.

Private Cloud: Performance
You are your own noisy neighbor.
Steal all the CPU and disk I/O you need! (We don’t use much)

Private Cloud: Security
- No more automated scans hitting instances
- No more maintaining complex IPtables rulesets
- Dom0 threat exposure greatly reduced
- IPsec VPN for errr’body!
- Dedicated, HA firewalls and load balancers in front of all instances

Other Benefits
Flexible instance sizing
Create any combination you desire!

Other Benefits
API-driven
Use the same tools you used in public cloud.

Division of Responsibilities
[Diagram: the stack layers DATACENTER, NETWORK / LB / FW, SERVER HARDWARE, OPENSTACK, INSTANCE, and APPLICATION, each marked as Revinate Responsibility, Revinate Visibility, or Rackspace Responsibility]

The Migration

The Migration
Problem: We had far too many moving pieces to move everything at once.
Solution: Move component-by-component, connecting old to new with hybrid cloud

Connecting the Clouds
RackConnect: Rackspace’s networking technology to tie public & private clouds together

How We Moved
Building a base OS image
We used Ubuntu 14.04 LTS.
Added a run-once /etc/rc.local to do first-boot provisioning:
• Set up apt to use RAX mirrors
• Set up /etc/hosts
• Enable OpenStack ohai hints
• Alert us via HipChat if anything goes wrong

How We Moved
/etc/rc.local
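The script itself is not part of this transcript. As a rough illustration of the run-once idea only, here is a sketch in Python rather than the shell an /etc/rc.local would normally contain. The names `run_once` and `enable_ohai_hint`, the marker-file approach, and the choice to retry on the next boot after a failure are all assumptions for illustration; `notify` stands in for whatever sends the HipChat alert. On a real Chef client the Ohai hints directory is typically /etc/chef/ohai/hints.

```python
import os
import tempfile


def run_once(marker_path, steps, notify):
    """Run each provisioning step once; skip everything if the marker exists.

    steps  -- list of (name, callable) pairs, run in order
    notify -- callable taking a message string; called when a step fails
    Returns True if provisioning ran to completion on this call.
    """
    if os.path.exists(marker_path):
        return False  # already provisioned on an earlier boot
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            notify(f"first-boot step '{name}' failed: {exc}")
            return False  # no marker is written, so the next boot retries
    with open(marker_path, "w") as marker:
        marker.write("provisioned\n")
    return True


def enable_ohai_hint(hints_dir, name="openstack"):
    """Create an empty Ohai hint file, e.g. <hints_dir>/openstack.json."""
    os.makedirs(hints_dir, exist_ok=True)
    with open(os.path.join(hints_dir, name + ".json"), "w") as hint:
        hint.write("{}\n")


if __name__ == "__main__":
    # Demo in a scratch directory instead of touching the real system.
    root = tempfile.mkdtemp()
    marker = os.path.join(root, "provisioned")
    steps = [("ohai hints", lambda: enable_ohai_hint(os.path.join(root, "hints")))]
    print(run_once(marker, steps, notify=print))  # first call does the work
    print(run_once(marker, steps, notify=print))  # second call is a no-op
```

Writing the marker only after every step succeeds means a half-provisioned instance tries again on its next boot, at the cost of requiring each step to be safe to repeat.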

How We Moved
In preparation for the move, we built some infrastructure:
DNS
- Authoritative DNS servers (DJB’s djbdns) x 2
- DNS caches (DJB’s dnscache) x 2
Logging
- rsyslog
- forwarding to Papertrail
Chef
- Open Source Chef server
SMTP
- Postfix forwarders (Sendgrid, Mailgun, etc)

How We Moved
Moving with Chef
• We started with a fresh Chef repo.
• Set up our own Chef Open Source server
• Built cookbooks as we moved each service
• Gangnam-style cookbook structure
• Base role to install common components

How We Moved
We started with the data layer
- It’s the hardest to move.
- Use replication to move the data.
- One cluster at a time.
[Diagram: STEP 1 and STEP 2, showing MYSQL MASTER and MYSQL SLAVE nodes across PUBLIC CLOUD and PRIVATE CLOUD]
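The slide does not spell out the cutover procedure. One plausible reading, stated here as an assumption, is that a replica in the private cloud follows the public cloud master until it has caught up, and is then promoted. The sketch below shows only the "has it caught up?" decision, using column names from MySQL's `SHOW SLAVE STATUS` output; the function name and the zero-lag threshold are illustrative choices.

```python
def replica_ready_for_cutover(status, max_lag_seconds=0):
    """Decide whether a replica has caught up enough to be promoted.

    status is one row of SHOW SLAVE STATUS as a dict (column name -> value).
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")  # None (NULL) when replication is broken
    return io_ok and sql_ok and lag is not None and int(lag) <= max_lag_seconds


# Example rows with made-up values.
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 0}
lagging = dict(healthy, Seconds_Behind_Master=42)
broken = dict(healthy, Slave_SQL_Running="No", Seconds_Behind_Master=None)

for name, row in [("healthy", healthy), ("lagging", lagging), ("broken", broken)]:
    print(name, replica_ready_for_cutover(row))
```

Checking both replication threads as well as the lag matters because a stopped replica reports no lag value at all rather than a large one.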

How We Moved
Elasticsearch was much harder
Playing the egg toss game while driving down the highway at 90 MPH

How We Moved
The Elasticsearch Egg Toss
- Move one node at a time - recover to “green” cluster state after each move
- Designate master-only (no data) nodes
- Use cluster.routing.allocation.exclude._ip to exclude public cloud nodes when decommissioning public cloud nodes and turning on new private cloud nodes
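A minimal sketch of how that setting and the "green" check might be driven from a script follows. It relies on two Elasticsearch REST endpoints, `PUT /_cluster/settings` and `GET /_cluster/health`; the helper names, the choice of a transient (rather than persistent) setting, and the IP addresses are assumptions for illustration, not details from the talk.

```python
import json
import urllib.request


def exclusion_settings(ips):
    """Build a cluster-settings body that drains shards off the given IPs."""
    return {"transient": {"cluster.routing.allocation.exclude._ip": ",".join(ips)}}


def safe_to_move_next_node(health):
    """True when a _cluster/health response shows a green, settled cluster."""
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


def put_cluster_settings(base_url, body):
    """Send the body to PUT /_cluster/settings and return the parsed reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # 10.0.0.11 and 10.0.0.12 are placeholder addresses for public cloud nodes.
    print(json.dumps(exclusion_settings(["10.0.0.11", "10.0.0.12"])))
    print(safe_to_move_next_node({"status": "yellow", "relocating_shards": 2}))
```

Excluding a node's IP asks the cluster to relocate its shards elsewhere, so the node holds no data by the time it is shut down; waiting for green with no shards in flight before touching the next node is what keeps the "egg" from dropping.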

Monitoring + Metrics

OpenStack Monitoring
Built-in monitoring w/ OpenStack Havana is a joke.
LOL, what?

• Inaccurate metrics displayed in Horizon
• No historical trends/graphs
• No cluster-wide metrics
• Useless for capacity planning
• No alerting
• Ceilometer API is broken (in Havana, anyway…)

Rackspace’s response: “We’re working on it.” One year later and we still haven’t seen anything. We’re not holding our breath.

OpenStack Monitoring
Solution: Build our own!

Datadog
- Metrics + Alerting
- Agents installed on every (production) instance
- Many out-of-the-box app integrations
- Custom checks written with Python (we have about 10 so far)
- Very powerful graph builder
- Cons: it’s expensive
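The deck does not show any of those custom checks. For a sense of their general shape, here is a hedged sketch of an Agent check that reports one gauge. `AgentCheck`, its `check(instance)` method, and `self.gauge` come from the Datadog Agent's check API; the import path shown is the one used by Agent v6 and later (Agents of the talk's era used `from checks import AgentCheck`). The check itself, the `directory` option, and the metric name `custom.spool.files` are hypothetical.

```python
import os

try:
    from datadog_checks.base import AgentCheck  # import path for Agent v6 and later
except ImportError:
    class AgentCheck:
        """Minimal stand-in so the sketch can be imported outside an Agent."""

        def gauge(self, name, value, tags=None):
            print(name, value, tags or [])


def count_files(directory):
    """Count regular files directly inside a directory."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())


class SpoolDepthCheck(AgentCheck):
    """Hypothetical check: report how many files are waiting in a spool directory."""

    def check(self, instance):
        directory = instance["directory"]  # set per instance in the check's YAML config
        tags = instance.get("tags", [])
        self.gauge("custom.spool.files", count_files(directory), tags=tags)
```

Keeping the measurement in a plain function such as `count_files` makes it easy to test without a running Agent; the check class then only translates the result into a metric.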

Datadog Demo

Thanks!
[email protected]
http://github.com/chrissnell
http://output.chrissnell.com
https://www.linkedin.com/in/csnell
I don’t tweet.

Christopher Snell
