So Long, Public Cloud

Lessons learned migrating 400 servers from Rackspace public cloud to a hosted OpenStack private cloud. Presented by Chris Snell at the Seattle DevOps Meetup Group.

Christopher Snell

March 31, 2015

Transcript

  1. So Long, Public Cloud

    Seattle DevOps Meetup
    Lessons learned migrating 400 servers to an OpenStack private cloud
    Chris Snell, Revinate, Inc.
  2. Who is Revinate?

    SaaS for the hospitality industry
    We deliver rich guest data and feedback to hotels that improve their customers’ experiences
  3. Genesis

    In Q1 2014, we were…
    …running ~200 instances in Rackspace Public Cloud
    …using about 800 GB of RAM in aggregate
    …and a bunch of Cloud Databases (DBaaS) instances
    …and a dozen Cloud Load Balancers (LBaaS)
    …and growing 20% month-over-month
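
To put 20% month-over-month growth in perspective, the compounding can be worked out directly. This is a minimal sketch, assuming the rate stays constant and applies to the ~200-instance count; the slide does not say which quantity was growing, so both are assumptions.

```python
def compound(start: float, monthly_rate: float, months: int) -> float:
    """Value after `months` of constant month-over-month growth."""
    return start * (1 + monthly_rate) ** months


# Assumption: the 20% rate holds steady and applies to the instance count.
for months in (3, 6, 12):
    print(months, round(compound(200, 0.20, months)))
```

At a steady 20% per month, anything being measured grows by a factor of about 8.9 in a year, which is why capacity cost matters so early.
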
  4. Genesis

    And it was a crappy experience. We had…
    …unpredictable and often poor network performance.
    …noisy neighbor problems.
    …constant scanning from unknowns probing for vulnerabilities.
  5. Private Cloud: Cost

    Public Cloud:
      800 GB RAM
      Shared networks
      Shared SW load balancers
      IPtables firewalls
      $X / month

    Private Cloud:
      2.75 TB RAM
      22 dedicated physical servers
      Dedicated networks
      Dedicated HW load balancers
      Dedicated HW firewalls
      < 0.9 x $X / month
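
The comparison above can be reduced to a cost per GB of RAM. This is a rough sketch that normalises the public cloud bill $X to 1.0, treats 0.9 as an upper bound on the private cloud ratio (the slide says "less than"), and assumes binary terabytes; it ignores everything else the two bills cover.

```python
PUBLIC_RAM_GB = 800
PRIVATE_RAM_GB = 2.75 * 1024  # assumption: 1 TB = 1024 GB


def relative_cost_per_gb(cost_ratio: float = 0.9) -> float:
    """Private cost per GB of RAM as a fraction of the public cost per GB."""
    public_per_gb = 1.0 / PUBLIC_RAM_GB           # $X normalised to 1.0
    private_per_gb = cost_ratio / PRIVATE_RAM_GB  # upper bound from the slide
    return private_per_gb / public_per_gb


print(f"{relative_cost_per_gb():.2f}")
```

Under these assumptions each GB of RAM in the private cloud costs at most about a quarter of what it cost in the public cloud.
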
  6. Private Cloud: Performance

    The network is incredible.
    1 Gbit/sec (bidirectional) from any instance, no matter the size
    No more paying for larger instances than you need just to get the bandwidth you need.
  7. Private Cloud: Performance

    You are your own noisy neighbor.
    Steal all the CPU and disk I/O you need!
    (We don’t use much)
  8. Private Cloud: Security

    No more automated scans hitting instances
    No more maintaining complex IPtables rulesets
    Dom0 threat exposure greatly reduced
    IPsec VPN for errr’body!
    Dedicated, HA firewalls and load balancers in front of all instances
  9. Division of Responsibilities

    [Diagram] Stack layers: Datacenter, Network / LB / FW, Server Hardware, OpenStack, Instance, Application
    Legend: Revinate Responsibility, Revinate Visibility, Rackspace Responsibility
  10. The Migration

    Problem: We had far too many moving pieces to move everything at once.
    Solution: Move component-by-component, connecting old to new with hybrid cloud
  11. How We Moved

    Building a base OS image
    We used Ubuntu 14.04 LTS.
    Added a run-once /etc/rc.local to do first-boot provisioning:
    • Set up apt to use RAX mirrors
    • Set up /etc/hosts
    • Enable OpenStack ohai hints
    • Alert us via HipChat if anything goes wrong
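
The run-once pattern behind that first-boot script can be sketched as follows. This is an illustration in Python, not the /etc/rc.local script from the talk: the marker file, the step functions and the `alert` callback (standing in for the HipChat notification) are all hypothetical.

```python
import os
import tempfile
from typing import Callable, Iterable


def run_once(marker_path: str,
             steps: Iterable[Callable[[], None]],
             alert: Callable[[str], None]) -> bool:
    """Run provisioning steps once; return True only if all completed now."""
    if os.path.exists(marker_path):
        return False  # already provisioned on an earlier boot
    for step in steps:
        try:
            step()
        except Exception as exc:
            name = getattr(step, "__name__", repr(step))
            alert(f"first-boot step {name} failed: {exc}")
            return False  # no marker is written, so the next boot retries
    with open(marker_path, "w") as fh:
        fh.write("provisioned\n")
    return True


if __name__ == "__main__":
    # Demo in a temporary directory; a real image would use a fixed path.
    marker = os.path.join(tempfile.mkdtemp(), "first-boot.done")

    def configure_hosts() -> None:  # hypothetical step
        print("would write /etc/hosts here")

    print(run_once(marker, [configure_hosts], alert=print))  # steps run
    print(run_once(marker, [configure_hosts], alert=print))  # skipped
```

Writing the marker only after every step succeeds means a failed boot both raises an alert and gets retried, rather than leaving a half-provisioned instance that looks finished.
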
  12. How We Moved

    In preparation for the move, we built some infrastructure:
    DNS
    - Authoritative DNS servers (DJB’s djbdns) x 2
    - DNS caches (DJB’s dnscache) x 2
    Logging
    - rsyslog
    - forwarding to Papertrail
    Chef
    - Open Source Chef server
    SMTP
    - Postfix forwarders (Sendgrid, Mailgun, etc)
  13. How We Moved

    Moving with Chef
    • We started with a fresh Chef repo.
    • Set up our own Chef Open Source server
    • Built cookbooks as we moved each service
    • Gangnam-style cookbook structure
    • Base role to install common components
  14. How We Moved

    We started with the data layer
    It’s the hardest to move.
    Use replication to move the data.
    One cluster at a time.
    [Diagram: Step 1 and Step 2, showing MySQL master and slave nodes across the public cloud and the private cloud]
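
One plausible reading of the replication move is: attach a new private-cloud replica to the public-cloud master, let it catch up, then promote it. The sketch below only builds the SQL for those two steps using the classic MySQL statements; the host name, credentials and binlog coordinates are hypothetical, and the talk does not show the exact commands used.

```python
def replica_setup_sql(master_host: str, repl_user: str, repl_password: str,
                      log_file: str, log_pos: int) -> list:
    """Statements that point a new replica at the existing master."""
    for value in (master_host, repl_user, repl_password, log_file):
        if "'" in value or "\\" in value:
            # Illustrative sketch only: values are not escaped.
            raise ValueError("quotes and backslashes are not supported")
    return [
        "CHANGE MASTER TO "
        f"MASTER_HOST='{master_host}', "
        f"MASTER_USER='{repl_user}', "
        f"MASTER_PASSWORD='{repl_password}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={int(log_pos)}",
        "START SLAVE",
    ]


def promotion_sql() -> list:
    """Statements run on the caught-up replica to make it standalone."""
    return ["STOP SLAVE", "RESET SLAVE ALL"]


# Hypothetical values for illustration.
for stmt in replica_setup_sql("db1.example.internal", "repl", "secret",
                              "mysql-bin.000042", 120):
    print(stmt + ";")
```

Newer MySQL releases prefer `CHANGE REPLICATION SOURCE TO` and `START REPLICA`; the older names are used here because they match the period of the talk.
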
  15. How We Moved

    Elasticsearch was much harder
    Playing the egg toss game while driving down the highway at 90 MPH
  16. How We Moved

    The Elasticsearch Egg Toss
    Move one node at a time - recover to “green” cluster state after each move
    Designate master-only (no data) nodes
    Use cluster.routing.allocation.exclude._ip to exclude public cloud nodes when decommissioning public cloud nodes and turning on new private cloud nodes
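
The exclusion setting named above is applied through the cluster settings API (`PUT /_cluster/settings`). A minimal sketch of the request body and of a "safe to move the next node" test against the cluster health response follows; the IP addresses are hypothetical, and the choice of a transient rather than persistent setting is an assumption, not something the talk states.

```python
import json


def exclude_ips_body(ips, persistent: bool = False) -> dict:
    """Body for PUT /_cluster/settings that moves shards off the given IPs."""
    scope = "persistent" if persistent else "transient"
    return {scope: {"cluster.routing.allocation.exclude._ip": ",".join(ips)}}


def is_green(health: dict) -> bool:
    """True when a GET /_cluster/health response shows a settled cluster."""
    return (health.get("status") == "green"
            and health.get("relocating_shards", 0) == 0)


# Hypothetical public-cloud node addresses being drained.
print(json.dumps(exclude_ips_body(["10.0.0.11", "10.0.0.12"])))
```

Waiting for `is_green` to hold before touching the next node is what keeps the "one node at a time" rule honest: shards finish relocating before another copy is put at risk.
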
  17. OpenStack Monitoring

    Built-in monitoring w/ OpenStack Havana is a joke.
    • Inaccurate metrics displayed in Horizon
    • No historical trends/graphs
    • No cluster-wide metrics
    • Useless for capacity planning
    • No alerting
    • Ceilometer API is broken (in Havana, anyway…)
  18. OpenStack Monitoring

    Built-in monitoring w/ OpenStack Havana is a joke.
    Rackspace’s response: “We’re working on it.”
    One year later and we still haven’t seen anything.
    We’re not holding our breath.
  19. Datadog

    - Metrics + Alerting
    - Agents installed on every (production) instance
    - Many out-of-the-box app integrations
    - Custom checks written with Python (we have about 10 so far)
    - Very powerful graph builder
    - Cons: it’s expensive
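
For a sense of what a custom check looks like, here is a small hypothetical one that reports the number of files waiting in a spool directory. It is a sketch, not one of Revinate's checks: the metric name and the `path` instance key are invented, the import path is the one used by current Agent releases (Agents from the time of the talk exposed `AgentCheck` from a different module), and a stand-in base class is defined so the file also runs outside the Agent.

```python
import os

try:
    from datadog_checks.base import AgentCheck  # current Agent package
except ImportError:
    class AgentCheck:  # stand-in so the sketch runs without the Agent
        def __init__(self, *args, **kwargs):
            self.metrics = []

        def gauge(self, name, value, tags=None):
            self.metrics.append((name, value, tags or []))


def spool_depth(path: str) -> int:
    """Number of entries currently sitting in the spool directory."""
    return len(os.listdir(path))


class SpoolDepthCheck(AgentCheck):
    """Hypothetical check: one gauge per configured spool directory."""

    def check(self, instance):
        path = instance["path"]  # assumed key in the check's YAML config
        self.gauge("example.spool.depth", spool_depth(path),
                   tags=[f"path:{path}"])
```

Keeping the measurement in a plain function such as `spool_depth` makes the logic easy to unit test without an Agent, while the `check` method stays a thin wrapper that submits the metric.
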