Slide 1

Slide 1 text

So Long, Public Cloud
Seattle DevOps Meetup
Lessons learned migrating 400 servers to an OpenStack private cloud
Chris Snell, Revinate, Inc.

Slide 2

Slide 2 text

Who is Revinate?
SaaS for the hospitality industry.
We deliver rich guest data and feedback to hotels that improve their customers’ experiences.

Slide 3

Slide 3 text

Genesis
In Q1 2014, we were…
- …running ~200 instances in Rackspace Public Cloud
- …using about 800 GB of RAM in aggregate
- …and a bunch of Cloud Databases (DBaaS) instances
- …and a dozen Cloud Load Balancers (LBaaS)
- …and growing 20% month-over-month

Slide 4

Slide 4 text

Genesis
It was really expensive. Five-figure hosting bill, every month.

Slide 5

Slide 5 text

Genesis
And it was a crappy experience. We had…
- …unpredictable and often poor network performance.
- …noisy neighbor problems.
- …constant scanning from unknowns probing for vulnerabilities.

Slide 6

Slide 6 text

Why Private Cloud?
Three reasons:
- Cost
- Performance
- Security

Slide 7

Slide 7 text

Private Cloud: Cost
Public Cloud
- 800 GB RAM
- Shared networks
- Shared SW load balancers
- iptables firewalls
- $X / month
Private Cloud
- 2.75 TB RAM
- 22 dedicated physical servers
- Dedicated networks
- Dedicated HW load balancers
- Dedicated HW firewalls
- < 0.9 x $X / month
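The comparison above implies a much lower cost per GB of RAM. A rough sketch of that arithmetic follows. It assumes 1 TB = 1024 GB, normalizes $X to 1.0, and takes the private cloud bill at its stated upper bound of 0.9 x $X; none of these choices come from the talk itself.

```python
# Rough cost-per-GB comparison using the figures from the slide.
# Assumptions (not from the talk): 1 TB = 1024 GB, X normalized to 1.0,
# and the private cloud bill taken at its upper bound of 0.9 * X.

PUBLIC_RAM_GB = 800
PRIVATE_RAM_GB = 2.75 * 1024  # 2.75 TB expressed in GB


def cost_per_gb(monthly_cost, ram_gb):
    """Monthly cost divided by aggregate RAM."""
    return monthly_cost / ram_gb


public = cost_per_gb(1.0, PUBLIC_RAM_GB)
private = cost_per_gb(0.9, PRIVATE_RAM_GB)

ram_ratio = PRIVATE_RAM_GB / PUBLIC_RAM_GB  # how much more RAM the private cloud has
relative_cost = private / public            # private cost per GB relative to public

print(f"RAM ratio: {ram_ratio:.2f}x")
print(f"Private cost per GB is at most {relative_cost:.0%} of public")
```

Under these assumptions the private cloud has about 3.5 times the RAM at roughly a quarter of the cost per GB.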

Slide 8

Slide 8 text

Private Cloud: Performance
The network is incredible.
1 Gbit/sec (bidirectional) from any instance, no matter the size.
No more paying for larger instances than you need just to get the bandwidth you need.

Slide 9

Slide 9 text

Private Cloud: Performance
You are your own noisy neighbor.
Steal all the CPU and disk I/O you need! (We don’t use much)

Slide 10

Slide 10 text

Private Cloud: Security
- No more automated scans hitting instances
- No more maintaining complex iptables rulesets
- Dom0 threat exposure greatly reduced
- IPsec VPN for errr’body!
- Dedicated, HA firewalls and load balancers in front of all instances

Slide 11

Slide 11 text

Other Benefits
Flexible instance sizing
Create any combination you desire!

Slide 12

Slide 12 text

Other Benefits
API-driven
Use the same tools you used in public cloud.

Slide 13

Slide 13 text

Division of Responsibilities
[Diagram layers: DATACENTER, NETWORK / LB / FW, SERVER HARDWARE, OPENSTACK, INSTANCE, APPLICATION]
[Diagram legend: Revinate Responsibility, Revinate Visibility, Rackspace Responsibility]

Slide 14

Slide 14 text

The Migration

Slide 15

Slide 15 text

The Migration
Problem: We had far too many moving pieces to move everything at once.
Solution: Move component-by-component, connecting old to new with hybrid cloud.

Slide 16

Slide 16 text

Connecting the Clouds
RackConnect: Rackspace’s networking technology to tie public & private clouds together

Slide 17

Slide 17 text

How We Moved
Building a base OS image
We used Ubuntu 14.04 LTS.
Added a run-once /etc/rc.local to do first-boot provisioning:
• Set up apt to use RAX mirrors
• Set up /etc/hosts
• Enable OpenStack ohai hints
• Alert us via HipChat if anything goes wrong

Slide 18

Slide 18 text

How We Moved /etc/rc.local
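The script shown on this slide did not survive in the transcript. What follows is a minimal sketch of the run-once pattern the previous slide describes, written in Python rather than shell. The marker file, the step list, and the `notify` callback (standing in for the HipChat alert) are all hypothetical names chosen for illustration, not the author's implementation.

```python
import os


def run_once(marker_path, steps, notify):
    """Run each (name, callable) step once; skip entirely if the marker exists.

    Returns True if provisioning ran and succeeded, False if it was
    skipped or a step failed.
    """
    if os.path.exists(marker_path):
        return False  # already provisioned on an earlier boot
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Hypothetical alert hook; the talk used HipChat for this.
            notify(f"first-boot step '{name}' failed: {exc}")
            return False  # no marker is written, so the next boot retries
    # Record success so later boots skip provisioning.
    with open(marker_path, "w") as fh:
        fh.write("done\n")
    return True


def hosts_entry(ip, hostname, domain):
    """Format one /etc/hosts line: IP, fully qualified name, short name."""
    return f"{ip}\t{hostname}.{domain}\t{hostname}"
```

Writing the marker only after every step succeeds means a failed first boot is retried on the next boot instead of being silently skipped.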

Slide 19

Slide 19 text

How We Moved
In preparation for the move, we built some infrastructure:
DNS
- Authoritative DNS servers (DJB’s djbdns) x 2
- DNS caches (DJB’s dnscache) x 2
Logging
- rsyslog
- forwarding to Papertrail
Chef
- Open Source Chef server
SMTP
- Postfix forwarders (Sendgrid, Mailgun, etc.)

Slide 20

Slide 20 text

How We Moved
Moving with Chef
• We started with a fresh Chef repo.
• Set up our own Chef Open Source server
• Built cookbooks as we moved each service
• Gangnam-style cookbook structure
• Base role to install common components

Slide 21

Slide 21 text

How We Moved
We started with the data layer
It’s the hardest to move.
Use replication to move the data.
One cluster at a time.
[Diagram labels: STEP 1, STEP 2; PUBLIC CLOUD, PRIVATE CLOUD; MYSQL MASTER, MYSQL SLAVE]
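Moving a cluster by replication depends on knowing when the new replica has caught up. The talk does not describe how that was checked, so the following is only a sketch under an assumption: it inspects a dict shaped like one row of MySQL's SHOW SLAVE STATUS output and decides whether the replica looks ready to be promoted. The function name and the lag threshold are hypothetical.

```python
def replica_ready_for_cutover(status, max_lag_seconds=0):
    """Decide whether a replica has caught up enough to be promoted.

    `status` is a dict shaped like one row of MySQL's SHOW SLAVE STATUS.
    """
    # Both replication threads must be running.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # NULL lag means replication is stopped or the lag is unknown.
        return False
    return int(lag) <= max_lag_seconds
```

For example, a row with both threads at "Yes" and a lag of 0 passes, while a stopped SQL thread or a NULL lag does not.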

Slide 22

Slide 22 text

How We Moved
Elasticsearch was much harder
Playing the egg toss game while driving down the highway at 90 MPH

Slide 23

Slide 23 text

How We Moved
The Elasticsearch Egg Toss
- Move one node at a time; recover to “green” cluster state after each move
- Designate master-only (no data) nodes
- Use cluster.routing.allocation.exclude._ip to exclude public cloud nodes when decommissioning public cloud nodes and turning on new private cloud nodes
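The node-by-node procedure above can be sketched as two small helpers. The first builds the JSON body for a PUT to /_cluster/settings that sets cluster.routing.allocation.exclude._ip. The second inspects a dict shaped like the GET /_cluster/health response to decide whether the next node can be moved. The helper names, the example IPs, and the choice of a transient rather than persistent setting are assumptions, not details from the talk.

```python
import json


def exclude_ips_body(ips):
    """Build the body for PUT /_cluster/settings that excludes nodes by IP."""
    return json.dumps({
        "transient": {
            # Comma-separated list of node IPs that shards should move away from.
            "cluster.routing.allocation.exclude._ip": ",".join(ips)
        }
    })


def safe_to_move_next(health):
    """True when the cluster is green and no shards are still relocating.

    `health` is a dict shaped like the GET /_cluster/health response.
    """
    return (health.get("status") == "green"
            and health.get("relocating_shards", 0) == 0)


# Hypothetical public cloud node IPs being drained.
print(exclude_ips_body(["10.0.0.1", "10.0.0.2"]))
```

Polling the health endpoint until `safe_to_move_next` returns True before touching the next node is one way to enforce the "recover to green after each move" rule.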

Slide 24

Slide 24 text

Monitoring + Metrics

Slide 25

Slide 25 text

OpenStack Monitoring
Built-in monitoring w/ OpenStack Havana is a joke.
LOL, what?

Slide 26

Slide 26 text

OpenStack Monitoring
Built-in monitoring w/ OpenStack Havana is a joke.
• Inaccurate metrics displayed in Horizon
• No historical trends/graphs
• No cluster-wide metrics
• Useless for capacity planning
• No alerting
• Ceilometer API is broken (in Havana, anyway…)

Slide 27

Slide 27 text

OpenStack Monitoring
Built-in monitoring w/ OpenStack Havana is a joke.
Rackspace’s response: “We’re working on it.”
One year later and we still haven’t seen anything. We’re not holding our breath.

Slide 28

Slide 28 text

OpenStack Monitoring
Solution: Build our own!

Slide 29

Slide 29 text

Datadog
- Metrics + Alerting
- Agents installed on every (production) instance
- Many out-of-the-box app integrations
- Custom checks written with Python (we have about 10 so far)
- Very powerful graph builder
- Cons: it’s expensive
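The talk does not show any of the custom checks, so here is a hypothetical one to illustrate the general shape: a check class subclasses the Agent's AgentCheck, implements check(instance), and reports values with gauge(). The import path below is the one used by newer Agent versions; the metric name, the "path" instance key, and the fallback stub (included only so the sketch runs outside the Agent) are assumptions.

```python
import os

try:
    # Base class provided by the Datadog Agent (newer Agent versions).
    from datadog_checks.base import AgentCheck
except ImportError:
    # Minimal stand-in so this sketch runs outside the Agent.
    class AgentCheck:
        def __init__(self, *args, **kwargs):
            self.reported = []

        def gauge(self, name, value, tags=None):
            self.reported.append((name, value, tags))


def count_files(path):
    """Number of regular files directly inside `path`."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_file())


class QueueDirCheck(AgentCheck):
    """Hypothetical check: report how many files wait in a spool directory."""

    def check(self, instance):
        # `path` is an assumed key in this check's instance configuration.
        path = instance["path"]
        self.gauge("custom.queue_dir.files", count_files(path),
                   tags=[f"path:{path}"])
```

Keeping the measurement in a plain function such as `count_files` makes it testable without an Agent, while the class stays a thin reporting wrapper.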

Slide 30

Slide 30 text

Datadog Demo

Slide 31

Slide 31 text

Thanks!
[email protected]
http://github.com/chrissnell
http://output.chrissnell.com
https://www.linkedin.com/in/csnell
I don’t tweet.