Making clouds go faster, for fun and profit!

Everyone loves it when things are fast, and that holds true whether you're visiting http://www.livingsocial.com or hitting the OpenStack Nova API and requesting, "Please show me all the instances which I've got running." Nobody ever writes in asking for support and saying, "All of my API calls are completing far too quickly. Slow it down!"

Optimizing the performance of software is arguably a never-ending crusade. At some point you'll get things fast enough that you can say, "Any effort invested beyond this point is not adding value for the business," but then along comes new code which adds a zillion awesome features and also regresses performance back to a level where it needs another tune-up.

In the process of transforming our infrastructure and preparing our new OpenStack IaaS to host all our applications, we've been looking for performance wins across the whole stack. We've got some aggressive targets to meet. We've investigated many hardware options and chosen an optimal solution, we've instrumented some of the OpenStack APIs and benchmarked to produce interesting results, and whilst we're not done yet, we do have a "Half-Time Match Report".

Join me as I walk through our learnings so far and propose follow-on areas for investigation and optimization.

Alex Howells

October 17, 2012

Transcript

  1. Bedtime Reading: You can get a copy of these slides after the talk - https://speakerdeck.com/u/nixgeek (Wednesday, 17 October 12)
  2. Performance: It doesn’t need to be rocket science. It does matter though! I promise I’m not trolling you.
  3. “Oh man, that was too fast! It’s so much better now it’s slow!!” -- Average User. In a parallel universe...
  4. YEAH RIGHT: I wish I had users who were that easy to please! But since we live in the real world...
  5. “Why is that dude smiling?! This is too slow! Why can’t it be faster?” -- Average Users. In our universe...
  6. THINGS ARE IMPROVING: Cactus => Diablo => Essex => Folsom. But things can improve faster with focus!
  7. WE’RE A LOT LIKE YOU! Developers. Operators. Engineers. Users. We see potential. We see opportunities.
  8. Airspace LivingSocial PaaS: We care about speed because ...
     * Scaling services up/down needs to happen fast!
     * Needing to maintain huge pools of “slack capacity” to account for sudden spikes in traffic sucks.
     * Upgrading applications should be fast.

     What does fast mean to us? One example? New instances online in under 10 seconds.
  9. Performance Matters: What could your business do if instances came online in under 5 seconds vs. 50 seconds?
     * Makes integration tests leveraging the Cloud complete much faster.
     * Seasonal spikes? React to them faster - happier customers spend more money.
     * Engineers who don’t grumble that “getting servers is a pain in the ass”.
     * Deploy new applications and services more quickly and easily.

     Along with many other things ...
  10. Warning! Picking the right hardware is quite hard. It’s often individual to your users’ needs. What works for us may not rock your world.
  11. Our Servers
      * Supermicro 1027R-WRFT+
      * 2x Intel Xeon E5-2670 (8C/16T 2.60GHz)
      * 16 x 8GB 1600MHz ECC Memory
      * LSI 9266-8i (1-LD RAID-10)
      * 8 x Intel 520-series 240GB SSD
      * Dual-Port Intel X540 10GBASE-T
  12. Benefits
      * ‘Just right’ balance of CPU/RAM for us.
      * Exceptional ephemeral I/O performance
        * Not using eMLC - trade off?
        * We can think about SQL on IaaS
      * A surplus of network bandwidth

      Servers are not a bottleneck!
  13. Our Network
      * Top of Rack - Arista Networks 7050T 48-port 10GBASE-T Switch + 4-port 40GbE (uplinks)
      * Zone Spine - Arista Networks 7050Q 16-port 40GbE Switch
  14. Benefits
      * A network which runs Linux!
      * Ability to automate it via ZTP and Chef
      * Non-blocking communication in a rack.
      * Provision 160Gbps to spine via four cables.
      * Under 2:1 contention for comms in/out of rack.
      * Less need to think about QoS!

      Network is not a bottleneck!
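
The contention figure on slide 14 is simple arithmetic: server-facing bandwidth divided by uplink bandwidth. Here is a rough Python sketch using the port speeds from slides 13 and 14. The slides don't say how many of the 48 ports are actually cabled in a rack, so `ports_in_use` and the port counts in the loop are assumptions for illustration only.

```python
from fractions import Fraction

# Figures taken from the slides: 10GBASE-T server ports, 4 x 40GbE uplinks.
PORT_GBPS = 10
UPLINK_GBPS = 40
UPLINKS = 4


def contention_ratio(ports_in_use: int) -> Fraction:
    """Server-facing bandwidth divided by uplink bandwidth for one switch."""
    uplink_total = UPLINKS * UPLINK_GBPS  # 4 x 40 = 160 Gbps to the spine
    return Fraction(ports_in_use * PORT_GBPS, uplink_total)


if __name__ == "__main__":
    # Hypothetical port counts; the talk does not state the real number.
    for ports in (16, 24, 31, 48):
        print(f"{ports} ports in use -> {float(contention_ratio(ports)):.2f}:1")
```

With every one of the 48 ports in use the ratio would be 3:1, so "under 2:1" suggests fewer than 32 ports per switch carry server traffic. That is an inference from the numbers, not something the slides state.
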
  15. Production
      * Ubuntu 12.04 LTS (‘Precise Pangolin’)
      * Hypervisor -- KVM
      * CloudScaling OCS 1.3 .. based off OpenStack Essex ..
      * Moving to OCS 2.0 in near future... .. that one is OpenStack Folsom ..
  16. Ubuntu 12.04 LTS (‘Precise Pangolin’). Hypervisor -- KVM. Useful for development and testing .. we’re running OpenStack Folsom now .. Most of the data shown later was grabbed with help from DevStack running on similar hardware to our production environment.
  17. WHAT NOW? We’ve picked the hardware stack. It’s awesome. We’ve got our software installed. It’s looking great.
  18. Old School
      * Is my service (API) responding on TCP/8774?
      * Am I able to make a GET and fetch instance info?
      * Is my server running all the processes it should?
      * Are there any errors on my network ports?

      If any of this looks broken, send me alerts saying so!
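
The first "Old School" check on slide 18 (is the API responding on TCP/8774?) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the monitoring the talk actually used. The function name, the timeout and the host name in the usage comment are made up for the example.

```python
import socket


def tcp_port_open(host: str, port: int = 8774, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical usage; "nova-api.example.internal" is a placeholder host:
#   if not tcp_port_open("nova-api.example.internal"):
#       print("ALERT: Nova API is not answering on TCP/8774")
```

A check like this only says the port accepts connections. It says nothing about how long a real API call takes, which is the gap the next slide is about.
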
  19. New Thinking
      * “How long did my website take to show?”
      * Individual performance of each click or API call
      * Inspection of latency within the application

      If lots of users’ interactions are slow, then I want you to alert me. If it’s just an outlier - log it and shut up. “End-User Experience Monitoring”
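
The rule on slide 19 (alert when lots of interactions are slow, just log a lone outlier) could look something like the Python sketch below. The slide gives no numbers, so the 5 second "slow" threshold and the 10% alert fraction are assumptions, as are the function and parameter names.

```python
from typing import Sequence


def classify_window(
    latencies_s: Sequence[float],
    slow_after_s: float = 5.0,     # assumed threshold for a "slow" request
    alert_fraction: float = 0.10,  # assumed share of slow requests that triggers an alert
) -> str:
    """Classify one window of request latencies as 'ok', 'log' or 'alert'."""
    if not latencies_s:
        return "ok"
    slow = sum(1 for latency in latencies_s if latency > slow_after_s)
    if slow == 0:
        return "ok"
    if slow / len(latencies_s) >= alert_fraction:
        return "alert"  # many users are affected: wake somebody up
    return "log"        # isolated outlier: record it and move on
```

For example, one 9 second request among a hundred fast ones falls under the assumed 10% fraction and would only be logged, while a window where half the requests are slow would raise an alert.
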
  20. DEMO TIME! Because pretty pictures are awesome. We’ll call the slowest transactions our “Disaster Porn”.
  21. Boundary “AppViz”
      * Port-to-port throughput/latency
      * How much SQL traffic are you doing?

      Updates in real-time. Look backwards in time. Powered by IPFIX (RFC 5101)
  22. Tracelytics: Lots more cool stuff to help ... We’ll blitz through a few more things next ...

      Latency Trends
      * Over the last 60 minutes
      * Over the last 24 hours
      * Over the last 7 days

      Top Tip: This is bad news.
  23. Tracelytics Patches: If you want to try out OpenStack APM - https://github.com/Afterglow/tracelytics-openstack Any questions? Just open an issue!
  24. “Call to Arms”: Reminder about those patches - https://github.com/Afterglow/tracelytics-openstack
      * Performance regression tests as an OpenStack CI gate?
      * More people talking about “How I fixed those >5 second outliers!”
      * Better ‘shared knowledge’ about what settings to tweak for added oomph
      * Architectural analysis asking about “big picture” (big impact) changes
  25. Credits: Because these folks are awesome. N.B. Not intended as an exhaustive list of all the awesome people in the world/room!
  26. Interested? E-mail Ken - [email protected] Or just find me! Reminder that these slides are over at - https://speakerdeck.com/u/nixgeek