Cloud Expo: Flying Two Mistakes High

A guide to not crashing: building high-availability applications.

Lee Atchison

June 08, 2016

Transcript

  1. Flying Two Mistakes High: A Guide to Not Crashing

    Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc. ©2008-16 New Relic, Inc. All rights reserved.
  2. Safe Harbor

    This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission. Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import. Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at http://ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.
  3. Who am I?

    Specialize in: cloud computing; services and microservices; scalability and availability. 28 years in industry: 7 at Amazon Retail and AWS (built SW/VG AppStore, AWS Elastic Beanstalk), 4 at New Relic (Architecture Lead, Cloud, Service Migration). @leeatchison leeatchison
  4. I want to tell you a story…
  5. I want to tell you a story…

    You tell me if this is ok or not…
  6. I want to tell you a story…

    This was a recently overheard conversation… You tell me if this is ok or not…
  7. Is this ok?

    “We were wondering how changing a setting on our MySQL database might impact our performance…
  8. Is this ok?

    “We were wondering how changing a setting on our MySQL database might impact our performance… but we were worried that the change may cause our production database to fail…”
  9. Is this ok?

    “… but we were worried that the change may cause our production database to fail… Since we didn’t want to bring down production, we decided to make the change to our backup (replica) database instead…”
  10. Is this ok?

    “… Since we didn’t want to bring down production, we decided to make the change to our backup (replica) database instead… After all, it wasn’t being used for anything at the moment.”
  11. Is this ok?

    Until, of course, the backup was needed…
  12. Is this ok?

    Until, of course, the backup was needed… This was a true story!
  13. I fly radio-controlled model airplanes

    There’s an old adage: “Keep your plane at least two mistakes high.”
  14. But why?

    “Keep your plane at least two mistakes high.”
  15. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. You always want to be high enough to make a mistake, even if you’ve just made a mistake…
  16. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. Now you are lower, and you are trying to recover. You always want to be high enough to make a mistake, even if you’ve just made a mistake…
  17. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. Now you are lower, and you are trying to recover. You still want to be high enough that, if you make another mistake, you won’t crash. You always want to be high enough to make a mistake, even if you’ve just made a mistake…
  18. Why Two Mistakes High?

    You perform some stunt, and it fails… You lose altitude. Now you are lower, and you are trying to recover. You still want to be high enough that, if you make another mistake, you won’t crash. You always want to be high enough to make a mistake, even if you’ve just made a mistake…
  19. Put another way…

    …flying two mistakes high, you can always have a backup plan for recovering from a mistake… even if you are currently recovering from a mistake.
  20. The same applies when building highly available, high-scale applications
  21. How do we keep “Two Mistakes High” in an application?

    Walk through ramifications and recovery plan.
  22. How do we keep “Two Mistakes High” in an application?

    Walk through ramifications and recovery plan. Make sure the recovery plan works: § Has no mistakes § Has its own recovery plan
  23. How do we keep “Two Mistakes High” in an application?

    Walk through ramifications and recovery plan. Make sure the recovery plan works: § Has no mistakes § Has its own recovery plan. If the recovery plan doesn’t work… it’s not a good recovery plan.
  24. EXAMPLE: How many nodes do we need?

    Building a service § Designed to handle 1,000 req/sec (assume a single node = 300 req/sec). How many nodes do I need to handle my traffic demands?
  25. EXAMPLE: How many nodes do we need?

    § ceil[1,000 / 300] = 4 nodes § With four nodes, we can handle our traffic § PLUS we have enough nodes that we can lose one! We have redundancy! Right???
  26. EXAMPLE: Well no…

    You think 4 nodes gives you redundancy, but it doesn’t... If you lose one of those nodes: § Remaining nodes can only handle 300 * 3 = 900 req/sec § Cannot handle the 1,000 req/sec load
  27. EXAMPLE: How many do we need?

    4 nodes ... allows handling our traffic, but we cannot handle a node failure. 5 nodes ... allows handling a single node failure, but… no upgrading. 6 nodes ... allows handling a multi-node failure, or… handling a failure during an upgrade.
  28. LESSON: Fly Two Mistakes High

    Even if you think you have redundancy… § Think through the failure modes § … and make sure
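The node arithmetic from this example can be sketched in a few lines of Python. This is only an illustration of the slide's numbers (1,000 req/sec demand, 300 req/sec per node); the function names `nodes_required` and `surviving_capacity` are made up for this sketch and do not come from the talk.

```python
import math


def nodes_required(peak_rps: float, per_node_rps: float, tolerated_losses: int = 0) -> int:
    """Nodes needed to carry peak_rps while `tolerated_losses` nodes are out of service."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    # Nodes needed just to carry the load, plus one spare per loss we want to absorb.
    return math.ceil(peak_rps / per_node_rps) + tolerated_losses


def surviving_capacity(nodes: int, per_node_rps: float, lost: int) -> float:
    """Requests per second the cluster can still serve after `lost` nodes are gone."""
    return max(nodes - lost, 0) * per_node_rps


if __name__ == "__main__":
    demand, per_node = 1000, 300
    for losses in range(3):
        n = nodes_required(demand, per_node, losses)
        worst = surviving_capacity(n, per_node, losses)
        print(f"tolerate {losses} lost node(s): {n} nodes, {worst:.0f} req/sec after the losses")
```

With zero tolerated losses the answer is the "obvious" 4 nodes, which drop to 900 req/sec as soon as one is lost; tolerating one or two simultaneous losses gives the 5 and 6 nodes from the slide.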
  29. EXAMPLE: Rolling upgrades

    You need 10 nodes to run your application. You have 11 nodes, so that you can do rolling upgrades: § Bring one node down at a time to upgrade… § Always at least 10 available... Are you safe?
  30. EXAMPLE: Well no…

    § What if that node fails during upgrade? § What if you now have to roll back? With the failed server to contend with… you have no room to do an upgrade or rollback, and you are at risk for another failure.
  31. LESSON: Fly Two Mistakes High

    Make sure you can handle failures § Even during “exceptional” events, such as upgrades § Exceptional events can cause failures
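A rough way to check the rolling-upgrade example is to count the spare nodes left once failed and upgrading nodes are subtracted. The sketch below assumes every node is interchangeable and that an upgrade takes exactly one node out of service at a time; `spare_nodes` and `can_take_node_down` are hypothetical helper names.

```python
def spare_nodes(total: int, required: int, failed: int = 0, upgrading: int = 0) -> int:
    """Nodes left over after subtracting failed, upgrading, and required nodes."""
    return total - failed - upgrading - required


def can_take_node_down(total: int, required: int, failed: int = 0) -> bool:
    """True if one more node can be pulled for an upgrade without dropping below `required`."""
    return spare_nodes(total, required, failed=failed, upgrading=1) >= 0


if __name__ == "__main__":
    print(can_take_node_down(11, 10))            # healthy fleet of 11
    print(can_take_node_down(11, 10, failed=1))  # one node has already failed
    print(can_take_node_down(12, 10, failed=1))  # a second spare restores the margin
```

Eleven nodes are enough only while nothing else goes wrong; a single failure removes the room needed to upgrade or roll back, which is the second mistake the talk warns about.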
  32. EXAMPLE: Unknown dependencies

    You have your application running on 20 servers… § You can run on 15 servers if necessary § Plenty of redundancy. Are you safe?
  33. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack?
  34. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack? Share the same power supply?
  35. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack? Share the same power supply? Share the same power source?
  36. EXAMPLE: Well, depends…

    Are any of the 20 servers in the same rack? Share the same power supply? Share the same power source? Share the same A/C system?
  37. LESSON: Fly Two Mistakes High

    Redundancy is not redundancy when the resources are not independent
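One way to make the independence question concrete is to group servers by a shared resource (rack, power supply, A/C system) and ask how many survive if the most crowded group is lost. The placements below are invented for illustration; only the 20-server fleet and the 15-server minimum come from the slides, and `worst_case_survivors` is a hypothetical helper.

```python
from collections import Counter
from typing import Mapping


def worst_case_survivors(placement: Mapping[str, str]) -> int:
    """Servers still running after the most heavily loaded shared resource fails.

    `placement` maps a server name to the rack (or power feed, A/C zone, ...) it depends on.
    """
    if not placement:
        return 0
    per_domain = Counter(placement.values())
    return len(placement) - max(per_domain.values())


def tolerates_domain_loss(placement: Mapping[str, str], minimum: int) -> bool:
    """True if losing any single shared resource still leaves `minimum` servers."""
    return worst_case_survivors(placement) >= minimum


if __name__ == "__main__":
    # Hypothetical layout: 8 servers in rack-a, 6 each in rack-b and rack-c.
    uneven = {f"srv{i}": ("rack-a" if i < 8 else "rack-b" if i < 14 else "rack-c")
              for i in range(20)}
    # Hypothetical layout: 5 servers in each of four racks.
    even = {f"srv{i}": f"rack-{i % 4}" for i in range(20)}
    print(worst_case_survivors(uneven), tolerates_domain_loss(uneven, 15))
    print(worst_case_survivors(even), tolerates_domain_loss(even, 15))
```

The same check can be repeated once per kind of shared resource, since servers spread across racks may still share a power source or cooling system.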
  38. EXAMPLE: Failure loop

    Are you safe from power outages? You live in an apartment… § The apartment provides an enclosed garage to store things in § The power goes out in your place a lot… § ... you buy a generator, store it in the garage
  39. EXAMPLE: Failure loop

    Oops… the garage: § Has a single door, the big garage door § It has a garage door opener § That requires electricity to open... § The generator is only available... when you already have power…
  40. LESSON: Fly Two Mistakes High

    Make sure your recovery plans actually are operational when you are in a failure mode
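The garage example is a dependency loop: the recovery resource depends, indirectly, on the very thing whose failure it is meant to cover. A minimal sketch of a check for that, assuming dependencies are recorded as a simple mapping (the resource names and the `depends_on` helper are illustrative, not from the talk):

```python
from typing import Iterable, Mapping


def depends_on(resource: str, target: str, deps: Mapping[str, Iterable[str]]) -> bool:
    """True if `resource` needs `target`, directly or through a chain of dependencies."""
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue  # already explored; also guards against cycles in `deps`
        seen.add(current)
        stack.extend(deps.get(current, ()))
    return False


if __name__ == "__main__":
    # The apartment from the example: the generator sits behind an electric garage door.
    garage = {
        "generator": ["garage_door"],
        "garage_door": ["garage_door_opener"],
        "garage_door_opener": ["mains_power"],
    }
    # Hypothetical fix: the door can also be opened with a manual release.
    garage_with_release = {
        "generator": ["garage_door"],
        "garage_door": ["manual_release"],
    }
    print(depends_on("generator", "mains_power", garage))
    print(depends_on("generator", "mains_power", garage_with_release))
```

If the check returns True for a recovery resource and the failure it is supposed to handle, the plan is not operational in exactly the situation where it is needed.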
  41. EXAMPLE: A real system…

    § Highly independent § Multi-level error recovery § Highly recoverable system § Redundant
  42. EXAMPLE: A real system…

    In fact, one of the very first large-scale software applications utilizing extreme redundancy and failure management. § Highly independent § Multi-level error recovery § Highly recoverable system § Redundant
  43. EXAMPLE: US Space Shuttle Program

    § They had problems… serious mechanical problems... § But the software system utilized state-of-the-art: • Redundancy techniques • Error recovery techniques
  44. EXAMPLE: US Space Shuttle System

    Five onboard computers § Four were identical (more on the fifth later) § All four: – Ran the exact same program during critical periods – Were given the same data – Were expected to generate the same result
  45. EXAMPLE: Four computers

    Computers voted on the proper outcome. If any one computer did not generate the same results:
  46. EXAMPLE: Four computers

    Computers voted on the proper outcome. If any one computer did not generate the same results: those that disagreed with the outcome were turned off for the remainder of the flight.
  47. EXAMPLE: Four computers

    Computers voted on the proper outcome. If any one computer did not generate the same results: those that disagreed with the outcome were turned off for the remainder of the flight. Ultimate in democratic systems…
  48. EXAMPLE: Four computers

    Could FLY with only THREE computers working. Could LAND with only TWO computers working.
  49. EXAMPLE: Ties

    What if the four computers couldn’t decide? (software bug or multiple failures)
  50. EXAMPLE: Ties

    What if the four computers couldn’t decide? (software bug or multiple failures) The fifth computer was used as a tie breaker § Much simpler version of the software… only used for key decisions § Software written by an independent software team, unconnected with the rest of the software developers § (In theory) would not introduce the same software errors…
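The voting scheme described on these slides can be illustrated with a small majority-vote function. This is a toy sketch of the idea only, not a reconstruction of the Shuttle's flight software: the `vote` function, the computer names, and the convention of returning `None` on a tie (to signal "escalate to an independent tie breaker") are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, List, Mapping, Optional, Tuple


def vote(results: Mapping[str, Hashable]) -> Tuple[Optional[Hashable], List[str]]:
    """Return (agreed_value, dissenters) for a set of redundant computations.

    A value wins only with a strict majority. On a tie the result is
    (None, []), meaning the primaries could not decide on their own.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results.values()).most_common(1)[0]
    if count * 2 <= len(results):
        return None, []
    dissenters = sorted(name for name, result in results.items() if result != value)
    return value, dissenters


if __name__ == "__main__":
    # Three computers agree; the fourth would be taken out of the voting set.
    print(vote({"c1": 42, "c2": 42, "c3": 42, "c4": 41}))
    # Two against two: no majority, so an independent tie breaker is needed.
    print(vote({"c1": 1, "c2": 1, "c3": 2, "c4": 2}))
```

The tie case is where independence matters: a tie breaker running the same code as the primaries would be likely to share the same bug.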
  51. Highly Successful

    30-year operation of the Space Shuttle: § Never a case where a serious life-threatening problem occurred as a result of a software problem § Even though the software was the most complex ever built for a space program
  52. US Space Shuttle

    This is extreme (not needed by most projects) § Shows what is possible... § Independence is critical to high availability
  53. LESSON: Fly Two Mistakes High

    Use an availability solution consistent with the risk.
  54. LESSON: Fly Two Mistakes High

    Use an availability solution consistent with the risk. The higher the risk, the higher the focus on availability.
  55. LESSON: Fly Two Mistakes High

    Use an availability solution consistent with the risk. The higher the risk, the higher the focus on availability. Don’t over-invest, don’t under-invest.
  56. LESSON: Fly Two Mistakes High

    Use an availability solution consistent with the risk. The higher the risk, the higher the focus on availability. Don’t over-invest, don’t under-invest. But think ahead, avoid the surprise.
  57. And remember…

    “Keep your plane at least two mistakes high.”
  58. Want to Learn More?

    Architecting for Scale, by Lee Atchison. Published by O’Reilly Media. Available June 2016. www.architectingforscale.com
  59. Thank you.

    Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc. Architecting for Scale, published by O’Reilly Media, available June 2016. www.architectingforscale.com @leeatchison leeatchison