Cloudy with a Chance of Scaling

Keeping your highly scaled application highly available using the cloud.

Lee Atchison

June 22, 2016

Transcript

  1. Cloudy with a Chance of Scaling: A Guide to High Availability in the Cloud

    Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc. ©2008-16 New Relic, Inc. All rights reserved.
  2. Safe Harbor

    This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission. Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import. Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at http://ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.
  3. Who am I?

    Specialize in: cloud computing; services & microservices; scalability, availability
    29 years in industry; 7 in Amazon Retail & AWS (built SW/VG AppStore, AWS Elastic Beanstalk); 4 in New Relic (Architecture Lead, Cloud, Service Migration)
    @leeatchison leeatchison
  4. I want to tell you a story…

    This was a recently overheard conversation… You tell me if this is ok or not…
  5. Is this ok?

    “We were wondering how changing a setting on our MySQL database might impact our performance…
  6. Is this ok?

    “We were wondering how changing a setting on our MySQL database might impact our performance… but we were worried that the change may cause our production database to fail…”
  7. Is this ok?

    “… but we were worried that the change may cause our production database to fail… Since we didn’t want to bring down production, we decided to make the change to our backup (replica) database instead…” (Image: “Under Construction”)
  8. Is this ok?

    “… Since we didn’t want to bring down production, we decided to make the change to our backup (replica, hot standby) database instead… After all, it wasn’t being used for anything at the moment.” (Image: “Under Construction”)
  9. Is this ok?

    Until, of course, the backup was needed… (Image: “Under Construction” X)
  10. Is this ok?

    Until, of course, the backup was needed… This was a true story. (Image: “Under Construction” !!!! X X)
  11. I fly radio controlled model airplanes

    There’s an old adage: “Keep your plane at least two mistakes high.”
  12. “Keep your plane at least two mistakes high.”

    But why?
  13. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude.
  14. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. Now you are lower, and you are trying to recover.
  15. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. Now you are lower, and you are trying to recover. You want to still be high enough so that if you make another mistake, you won’t crash.
  16. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. Now you are lower, and you are trying to recover. You want to still be high enough so that if you make another mistake, you won’t crash. You always want to be high enough to make a mistake, even if you’ve just made a mistake…
  17. Put another way…

    …flying two mistakes high, you can always have a backup plan for recovering from a mistake… even if you are currently recovering from a mistake.
  18. The same applies when building highly available, high-scale applications
  19. How do we keep “Two Mistakes High” in an application?

    Walk through ramifications and recovery plan
  20. How do we keep “Two Mistakes High” in an application?

    Walk through ramifications and recovery plan
    Make sure recovery plan works
    § Has no mistakes
    § Has its own recovery plan
  21. How do we keep “Two Mistakes High” in an application?

    Walk through ramifications and recovery plan
    Make sure recovery plan works
    § Has no mistakes
    § Has its own recovery plan
    If recovery plan doesn’t work… it’s not a good recovery plan
  22. EXAMPLE: How many nodes do we need?

    Building a Service
    § Designed to handle 1,000 req/sec (assume single node = 300 req/sec)
    How many nodes do I need to handle my traffic demands?
  23. EXAMPLE: How many nodes do we need?

    § ceil[1,000 / 300] = 4 nodes
    § With four nodes, we can handle our traffic
    § PLUS we have enough nodes that we can lose one! We have redundancy!
    Right???
  24. EXAMPLE: Well no…

    You think 4 nodes give you redundancy, but they don’t... If you lose one of those nodes:
    § Remaining nodes can only handle 300 * 3 = 900 req/sec
    § Cannot handle the 1,000 req/sec load
  25. EXAMPLE: How many do we need?

    4 nodes ... handles our traffic, but cannot handle a node failure
    5 nodes ... handles a single node failure, but… no upgrading
    6 nodes ... handles a multi-node failure, or… a failure during an upgrade
    or more…
  26. LESSON: Fly Two Mistakes High

    Even if you think you have redundancy…
    § Think through the failure modes
    § … and make sure
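The node arithmetic in this example can be sketched in a few lines of Python. This is a minimal sketch and not part of the original talk: the helper names `nodes_needed` and `surviving_capacity` and the `spare_nodes` parameter are hypothetical, and it assumes identical nodes with traffic spread evenly across them.

```python
import math


def nodes_needed(peak_rps: float, per_node_rps: float, spare_nodes: int = 0) -> int:
    """Nodes required to serve peak_rps while spare_nodes nodes are out of service.

    Hypothetical helper: assumes identical nodes and evenly spread traffic.
    """
    return math.ceil(peak_rps / per_node_rps) + spare_nodes


def surviving_capacity(nodes: int, per_node_rps: float, lost: int) -> float:
    """Request rate the cluster can still serve after losing `lost` nodes."""
    return max(nodes - lost, 0) * per_node_rps


# Figures from the slides: 1,000 req/sec demand, 300 req/sec per node.
bare_minimum = nodes_needed(1000, 300)                            # no headroom
one_failure = nodes_needed(1000, 300, spare_nodes=1)              # survive one loss
failure_during_upgrade = nodes_needed(1000, 300, spare_nodes=2)   # loss while upgrading
```

With the slide's figures, `surviving_capacity(4, 300, lost=1)` falls below the 1,000 req/sec demand, which is why the bare minimum offers no redundancy.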
  27. What is a Rolling Deploy?

    (Diagram: Load Balancer and five Servers)
  28. What is a Rolling Deploy?

    Remove one server from service
  29. What is a Rolling Deploy?

    Deploy new application version to this server
  30. What is a Rolling Deploy?

    Put back into service
  31. What is a Rolling Deploy?

    Repeat 1 by 1 with remaining servers
    § Allows deploying changes to your servers without bringing your entire application down
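The rolling deploy steps above can be sketched as a small simulation. This is an illustrative sketch only: the `Server` class and `rolling_deploy` function are hypothetical stand-ins, and a real deploy would call an actual load balancer and deployment tool at the commented steps.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for a server behind the load balancer."""
    name: str
    version: str
    in_service: bool = True


def rolling_deploy(servers: list, new_version: str) -> int:
    """Upgrade servers one at a time; return the lowest in-service count seen."""
    lowest = sum(s.in_service for s in servers)
    for server in servers:
        server.in_service = False      # 1. remove one server from service
        lowest = min(lowest, sum(s.in_service for s in servers))
        server.version = new_version   # 2. deploy the new application version
        server.in_service = True       # 3. put it back into service
    return lowest                      # 4. loop repeats for the remaining servers


servers = [Server(f"server-{i}", "v1") for i in range(1, 6)]
lowest_in_service = rolling_deploy(servers, "v2")
```

Returning the lowest in-service count makes the cost of the technique visible: the application stays up, but it runs one server short for the whole deploy.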
  32. EXAMPLE: Rolling Deploys

    You need 10 nodes to run your application
    You have 11 nodes, so that you can do a rolling deploy
    § Bring one node down at a time to upgrade…
    § Always at least 10 available...
    Are you safe?
  33. EXAMPLE: Well no…

    § What if a node fails during the upgrade?
    § What if you now have to roll back?
    With the failed server to contend with… you have no room to do an upgrade or rollback, and you are at risk of another failure
  34. LESSON: Fly Two Mistakes High

    Make sure you can handle failures
    § Even during “exceptional” events, such as upgrades
    § Exceptional events can cause failures
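A quick headroom check makes the 11-node problem concrete. The function name `safe_to_upgrade` and its parameters are hypothetical, and the sketch assumes one node is taken out at a time, as in the rolling deploy described earlier.

```python
def safe_to_upgrade(total_nodes: int, required_nodes: int,
                    failed_nodes: int = 0, upgrading: int = 1) -> bool:
    """True if enough nodes stay available while `upgrading` nodes are out.

    Hypothetical check: a failed node and a node being upgraded are both
    unavailable, so they both come out of the same headroom.
    """
    available = total_nodes - failed_nodes - upgrading
    return available >= required_nodes


healthy = safe_to_upgrade(total_nodes=11, required_nodes=10)
one_failed = safe_to_upgrade(total_nodes=11, required_nodes=10, failed_nodes=1)
with_extra_node = safe_to_upgrade(total_nodes=12, required_nodes=10, failed_nodes=1)
```

With 11 nodes the check passes only while nothing else has gone wrong; a twelfth node is what allows a failure and an upgrade (or rollback) at the same time.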
  35. EXAMPLE: Unknown dependencies

    You have your application running on 20 servers…
    § You can run on 15 servers if necessary
    § Plenty of redundancy
    Are you safe?
  36. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack?
  37. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack? Share the same power supply?
  38. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack? Share the same power supply? Share the same power source?
  39. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack? Share the same power supply? Share the same power source? Share the same A/C system?
  40. LESSON: Fly Two Mistakes High

    Redundancy is not redundancy when the resources are not independent
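One way to test for hidden shared dependencies is to group servers by failure domain and ask how many survive if the most crowded domain goes down. The sketch below is an assumption-laden illustration: the `worst_case_survivors` helper, the `rack` key, and the rack layouts are all made up for this example, and the same check could be repeated for power supply, power source, or A/C zone.

```python
from collections import Counter


def worst_case_survivors(servers: list, domain_key: str) -> int:
    """Servers left if the most-populated failure domain of this kind fails."""
    counts = Counter(server[domain_key] for server in servers)
    return len(servers) - max(counts.values())


# Hypothetical layout: 20 servers packed unevenly into three racks.
crowded = [{"name": f"s{i}", "rack": rack}
           for i, rack in enumerate(["A"] * 8 + ["B"] * 6 + ["C"] * 6)]

# Hypothetical layout: the same 20 servers spread over five racks of four.
spread = [{"name": f"s{i}", "rack": f"R{i % 5}"} for i in range(20)]

REQUIRED = 15  # from the slide: the application can run on 15 servers
crowded_ok = worst_case_survivors(crowded, "rack") >= REQUIRED
spread_ok = worst_case_survivors(spread, "rack") >= REQUIRED
```

Both layouts have the same server count, but only the spread layout still has 15 servers after losing a rack.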
  41. EXAMPLE: Failure loop

    Are you safe from power outages? You live in an apartment…
    § The apartment provides an enclosed garage to store things in
    § The power goes out in your place a lot…
    § ... you buy a generator, store it in the garage
  42. EXAMPLE: Failure loop

    Oops… the garage:
    § Has a single door, the big garage door
    § It has a garage door opener
    § That requires electricity to open...
    § The generator is only available... when you already have power…
  43. LESSON: Fly Two Mistakes High

    Make sure your recovery plans actually are operational when you are in a failure mode
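The generator story is a dependency loop, and a loop like that can be caught by walking the dependency graph of the recovery plan. This is a rough sketch under stated assumptions: the `depends_on` function and the resource names in the `dependencies` map are hypothetical labels for the items in the story.

```python
def depends_on(dependencies: dict, resource: str, failed: str) -> bool:
    """True if `resource` needs `failed`, directly or through other resources."""
    stack, seen = [resource], set()
    while stack:
        current = stack.pop()
        if current == failed:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(dependencies.get(current, []))
    return False


# Hypothetical model of the story: what each resource needs in order to work.
dependencies = {
    "generator": ["garage_door_opener"],   # must get it out of the garage
    "garage_door_opener": ["mains_power"],  # opener needs electricity
    "flashlight": ["batteries"],
}

generator_usable_in_outage = not depends_on(dependencies, "generator", "mains_power")
flashlight_usable_in_outage = not depends_on(dependencies, "flashlight", "mains_power")
```

A recovery resource that transitively depends on the thing it is meant to replace fails this check, which is exactly the garage door problem.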
  44. EXAMPLE: A real system…

    Great example: highly independent; multi-level error recovery; highly recoverable system; redundant
  45. EXAMPLE: A real system…

    Great example: highly independent; multi-level error recovery; highly recoverable system; redundant
    In fact, one of the very first large-scale software applications utilizing extreme redundancy and failure management
  46. EXAMPLE: US Space Shuttle Program

    § They had problems… serious mechanical problems...
    § But the software system utilized state of the art:
      • Redundancy techniques
      • Error recovery techniques
  47. EXAMPLE: US Space Shuttle System

    Five onboard computers
    § Four were identical (more on the fifth later)
    § All four:
      – Ran the exact same program during critical periods
      – Given same data
      – Expected to generate the same result
  48. EXAMPLE: Four computers

    Computers voted on the proper outcome
    If any one computer did not generate the same results:
  49. EXAMPLE: Four computers

    Computers voted on the proper outcome
    If any one computer did not generate the same results: those that disagreed with the outcome were turned off for the remainder of the flight
  50. EXAMPLE: Four computers

    Computers voted on the proper outcome
    If any one computer did not generate the same results: those that disagreed with the outcome were turned off for the remainder of the flight
    Ultimate in democratic systems…
  51. EXAMPLE: Four computers

    Could FLY with only THREE computers working
    Could LAND with only TWO computers working
  52. EXAMPLE: Deadlock

    What if the four computers couldn’t decide? (software bug or multiple failures)
  53. EXAMPLE: Deadlock

    What if the four computers couldn’t decide? (software bug or multiple failures)
    Fifth computer was used as a tie breaker
    § Much simpler version of software… only used for key decisions
    § Software written by independent software team, unconnected with rest of software developers
    § (In theory) would not introduce same software errors…
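The voting idea can be illustrated with a simple majority vote. This is a generic sketch of majority voting among redundant results, not the Shuttle's actual algorithm; the `vote` function and the computer names are hypothetical. When no strict majority exists it returns `None`, which is the deadlock case where an independent tie breaker would be needed.

```python
from collections import Counter


def vote(results: dict):
    """Majority vote over {computer_name: result}.

    Returns (agreed_result, dissenters). If no result has a strict
    majority, returns (None, []) to signal a deadlock.
    """
    tally = Counter(results.values())
    value, count = tally.most_common(1)[0]
    if count * 2 <= len(results):
        return None, []
    dissenters = sorted(name for name, r in results.items() if r != value)
    return value, dissenters


# One computer disagrees: the majority wins and the dissenter is identified.
outcome, to_switch_off = vote({"c1": 42, "c2": 42, "c3": 42, "c4": 41})

# Two against two: no strict majority, so the vote deadlocks.
deadlock_outcome, _ = vote({"c1": 1, "c2": 1, "c3": 2, "c4": 2})
```

Requiring a strict majority is a deliberate choice in this sketch: a two-against-two split gives no basis for trusting either side, so the decision is handed off rather than guessed.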
  54. Highly Successful

    30-year operation of Space Shuttle:
    § Never a case where a serious life-threatening problem occurred that was a result of a software problem
    § Even though software was the most complex software ever built for a space program
  55. US Space Shuttle

    This is extreme (not needed by most projects)
    § Shows what is possible...
    § Independence is critical to high availability
  56. LESSON: Fly Two Mistakes High

    Use availability solution consistent with the risk
  57. LESSON: Fly Two Mistakes High

    Use availability solution consistent with the risk
    Higher the risk, higher the focus on availability
  58. LESSON: Fly Two Mistakes High

    Use availability solution consistent with the risk
    Higher the risk, higher the focus on availability
    Don’t over invest, don’t under invest
  59. LESSON: Fly Two Mistakes High

    Use availability solution consistent with the risk
    Higher the risk, higher the focus on availability
    Don’t over invest, don’t under invest
    But think ahead, avoid the surprise
  60. And remember…

    “Keep your plane at least two mistakes high.”
  61. Want to Learn More?

    Architecting for Scale
    By: Lee Atchison
    Published by: O’Reilly Media
    Available: June 2016
    www.architectingforscale.com
    Preview edition available at New Relic booth
    Velocity Events
    § “Static vs Dynamic Cloud”: Thursday 12noon, New Relic Booth
    § Office Hours: Thursday 3pm, O’Reilly Booth
    § Book Signing: Today 2:30pm, O’Reilly Booth
    § Throughout show, New Relic Booth
    @leeatchison leeatchison
  62. Thank you.

    Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc.
    Architecting for Scale
    Published by: O’Reilly Media
    Available: June 2016
    www.architectingforscale.com
    @leeatchison leeatchison