Cloudy with a Chance of Scaling

A Guide to High Availability in the Cloud
Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc.
©2008-16 New Relic, Inc. All rights reserved.

Safe Harbor
This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission. Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import. Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at http://ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.

Who am I?
Specialize in:
§ Cloud computing
§ Services & Microservices
§ Scalability, Availability
29 years in industry
§ 7 in Amazon Retail & AWS (Built SW/VG AppStore, AWS Elastic Beanstalk)
§ 4 in New Relic (Architecture Lead, Cloud, Service Migration)
@leeatchison leeatchison

Is this ok?

“We were wondering how changing a setting on our MySQL database might impact our performance…
… but we were worried that the change may cause our production database to fail…
… Since we didn’t want to bring down production, we decided to make the change to our backup (replica, hot standby) database instead…
… After all, it wasn’t being used for anything at the moment.”

Until, of course, the backup was needed…

This was a true story!

I fly radio controlled model airplanes
There’s an old adage: “Keep your plane at least two mistakes high.”

Why Two Mistakes High?
You perform some stunt, and it fails … You lose altitude
Now, you are lower, and you are trying to recover
You want to still be high enough, so that if you make another mistake, you won’t crash
You always want to be high enough to make a mistake, even if you’ve just made a mistake…

Put another way…
… flying two mistakes high, you can always have a backup plan for recovering from a mistake
… even if you are currently recovering from a mistake

Don’t screw up...
… while you are screwing up

The same applies when building highly available, high-scale applications

How do we keep “Two Mistakes High” in an application?
Walk through ramifications and recovery plan
Make sure recovery plan works
§ Has no mistakes
§ Has its own recovery plan
If recovery plan doesn’t work… it’s not a good recovery plan

EXAMPLE How many nodes do we need?
Building a Service
§ Designed to handle 1,000 req/sec (assume single node = 300 req/sec)
How many nodes do I need to handle my traffic demands?

Right???
§ ceil[1,000 / 300] = 4 nodes
§ With four nodes, we can handle our traffic
§ PLUS we have enough nodes that we can lose one! We have redundancy!

EXAMPLE Well no…
You think 4 nodes gives you redundancy, but it doesn’t...
If you lose one of those nodes:
§ Remaining nodes can only handle 300 * 3 = 900 req/sec
§ Cannot handle the 1,000 req/sec load

EXAMPLE How many do we need?
§ 4 nodes ... allows handling our traffic, but we cannot handle a node failure
§ 5 nodes ... allows handling a single node failure. But… no upgrading
§ 6 nodes ... allows handling a multi-node failure. Or… a failure during an upgrade
§ or more…
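
The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the numbers from the example (1,000 req/sec, 300 req/sec per node); the function names and the `spares` parameter are hypothetical, not from the talk.

```python
import math


def nodes_needed(peak_rps, per_node_rps, spares=0):
    """Nodes required to carry peak_rps, plus `spares` nodes of headroom.

    `spares` is the number of "mistakes" (node failures, nodes out for
    upgrade) you want to survive while still carrying full load.
    """
    base = math.ceil(peak_rps / per_node_rps)
    return base + spares


def capacity_after_loss(nodes, per_node_rps, lost):
    """Requests/sec the fleet can still serve after `lost` nodes are gone."""
    return max(nodes - lost, 0) * per_node_rps


base = nodes_needed(1000, 300)             # 4 nodes carry the load
print(capacity_after_loss(base, 300, 1))   # 900: below the 1,000 req/sec target
two_high = nodes_needed(1000, 300, spares=2)
print(two_high, capacity_after_loss(two_high, 300, 2))  # 6 nodes, 1200 req/sec
```

The point of the sketch is that redundancy has to be counted on top of the nodes needed for the load, not inside them.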

LESSON Fly Two Mistakes High
Even if you think you have redundancy…
§ Think through the failure modes
§ … and make sure

EXAMPLE Rolling Deploys
Are you safe?
You need 10 nodes to run your application
You have 11 nodes, so that you can do rolling deploy
§ Bring one node down at a time to upgrade…
§ Always at least 10 available...

EXAMPLE Well no…
§ What if that node fails during upgrade?
§ What if you now have to roll back?
With the failed server to contend with… you have no room to do an upgrade or rollback, and you are at risk for another failure

Make sure you can handle failures
§ Even during “exceptional” events, such as upgrades
§ Exceptional events can cause failures
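
A rough way to check the rolling-deploy example before starting an upgrade is sketched below. The function and its parameters are hypothetical names for illustration; the numbers (10 required, 11 available) come from the example.

```python
def safe_to_roll(total_nodes, required_nodes, batch_size=1, failures_to_survive=1):
    """True if a rolling deploy still leaves enough nodes in service.

    Nodes out for upgrade (batch_size) and nodes you expect might fail
    at the same time (failures_to_survive) both come out of the pool.
    """
    in_service = total_nodes - batch_size - failures_to_survive
    return in_service >= required_nodes


print(safe_to_roll(11, 10, failures_to_survive=0))  # True: only if nothing fails
print(safe_to_roll(11, 10))                         # False: one failure mid-deploy
print(safe_to_roll(12, 10))                         # True: room for one mistake
```

With 11 nodes the deploy only works if nothing else goes wrong while a node is down for upgrade.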

EXAMPLE Unknown dependencies
Are you safe?
You have your application running on 20 servers…
§ You can run on 15 servers if necessary
§ Plenty of redundancy

EXAMPLE Well, depends…
Are any of the 20 servers in the same rack?
Share the same power supply?
Share the same power source?
Share the same A/C system?

The Cloud is not immune!

Redundancy is not redundancy when the resources are not independent
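
One way to make the independence question concrete is to group servers by whatever they share (rack, power supply, A/C) and ask what is left when the largest group goes away. The sketch below assumes you have such a mapping; the server and rack names are made up for illustration.

```python
from collections import Counter


def survives_single_domain_failure(server_domains, required):
    """True if losing any one shared domain still leaves `required` servers.

    server_domains maps server name -> the shared resource it depends on
    (for example a rack id). The worst case is losing the largest group.
    """
    counts = Counter(server_domains.values())
    worst_loss = max(counts.values(), default=0)
    return len(server_domains) - worst_loss >= required


# 20 servers spread evenly over 4 racks: losing a rack leaves 15.
spread = {f"srv{i}": f"rack{i % 4}" for i in range(20)}
# 20 servers with 8 of them packed into one rack: losing it leaves 12.
packed = {f"srv{i}": ("rack0" if i < 8 else f"rack{i}") for i in range(20)}

print(survives_single_domain_failure(spread, 15))  # True
print(survives_single_domain_failure(packed, 15))  # False
```

Both fleets have 20 servers and need 15, yet only the evenly spread one has the redundancy it appears to have.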

EXAMPLE Failure loop
Are you safe from power outages?
You live in an apartment…
§ The apartment provides an enclosed garage to store things in
§ The power goes out in your place a lot…
§ ... you buy a generator, store it in the garage

Oops… the garage:
§ Has a single door, the big garage door
§ It has a garage door opener
§ That requires electricity to open...
§ The generator is only available... when you already have power…

Make sure your recovery plans actually are operational when you are in a failure mode
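
A failure loop like the garage one can be caught by writing down what each recovery resource depends on and checking whether it leads back to the thing that failed. The sketch below is a minimal illustration; the dependency map and names are invented to mirror the story.

```python
def depends_on(deps, start, target):
    """True if `start` transitively depends on `target`.

    deps maps a resource to the list of resources it needs in order to work.
    """
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in deps.get(node, ()):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


deps = {
    "generator": ["garage door"],
    "garage door": ["garage door opener"],
    "garage door opener": ["mains power"],
}

# The recovery plan for a mains power outage needs mains power.
print(depends_on(deps, "generator", "mains power"))  # True
```

If the check comes back true, the recovery plan is only available when it is not needed.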

EXAMPLE High redundancy in action
A real system…
Great example:
§ Highly independent
§ Multi-level error recovery
§ Highly recoverable system
§ Redundant
In fact, one of the very first large scale software applications utilizing extreme redundancy and failure management

EXAMPLE US Space Shuttle Program
§ They had problems… serious mechanical problems...
§ But the software system utilized state of the art:
  • Redundancy techniques
  • Error recovery techniques

EXAMPLE US Space Shuttle System
Five onboard computers
§ Four were identical (we’ll talk about the fifth later)
§ All four:
  – Ran the exact same program during critical periods
  – Given same data
  – Expected to generate the same result

If any one computer did not generate the same results:
§ Computers voted on the proper outcome
§ Those that disagreed with the outcome were turned off for remainder of the flight
Ultimate in democratic systems…

Could FLY with only THREE computers working
Could LAND with only TWO computers working

EXAMPLE Deadlock
What if the four computers couldn’t decide? (software bug or multiple failures)
Fifth computer was used as a tie breaker
§ Much simpler version of software… only used for key decisions
§ Software written by independent software team, unconnected with rest of software developers
§ (In theory) would not introduce same software errors…
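
The voting scheme described in these slides can be sketched roughly as follows. This is a toy illustration of majority voting with a tie breaker, not the actual Shuttle flight software; the function name, computer ids, and the `tie_breaker` argument are all assumptions.

```python
from collections import Counter


def vote(results, tie_breaker=None):
    """Pick the majority result and list the computers that disagreed.

    results maps a computer id to the value it produced. If no value has a
    strict majority, the independently produced tie_breaker value decides.
    Returns (winning_value, dissenters); dissenters would be switched off.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results.values()).most_common(1)[0]
    if count * 2 <= len(results):  # no strict majority: deadlock
        if tie_breaker is None:
            raise RuntimeError("deadlock: no majority and no tie breaker")
        value = tie_breaker
    dissenters = sorted(c for c, v in results.items() if v != value)
    return value, dissenters


# One computer disagrees and is voted out.
print(vote({"c1": "A", "c2": "A", "c3": "A", "c4": "B"}))
# Two against two: the fifth computer's answer breaks the tie.
print(vote({"c1": "A", "c2": "A", "c3": "B", "c4": "B"}, tie_breaker="B"))
```

The tie breaker only helps because it is produced independently; if it shared the same bug as the other four, it would add a vote but no information.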

Highly Successful
30-year operation of Space Shuttle:
§ Never a case where a serious life threatening problem occurred that was a result of a software problem
§ Even though software was the most complex software ever built for a space program

Use availability solution consistent with the risk
Higher the risk, higher the focus on availability
Don’t over invest, don’t under invest
But think ahead, avoid the surprise

Want to Learn More?
Architecting for Scale
By: Lee Atchison
Published by: O’Reilly Media
Available: June 2016
www.architectingforscale.com
Preview edition available at New Relic booth
Velocity Events
§ “Static vs Dynamic Cloud”: Thursday 12noon, New Relic Booth
§ Office Hours: Thursday 3pm, O’Reilly Booth
§ Book Signing: Today 2:30pm, O’Reilly Booth
§ Throughout show, New Relic Booth
@leeatchison leeatchison

Thank you.
Lee Atchison
Principal Cloud Architect and Advocate at New Relic, Inc.
Architecting for Scale
Published by: O’Reilly Media
Available: June 2016
www.architectingforscale.com
@leeatchison leeatchison
