stateRunnerJavaSummit.pdf

rauluka7
August 02, 2020

In Blade Runner, based on the novel by P. K. Dick, trained hunters had to retire problematic androids. We developers are similar to those hunters: our job is to solve problems. State brings complexity and trouble, and getting rid of it is not always possible. So how do we make a stateful distributed system highly available?

This is a story based on the experience I gained while working on stateful distributed systems deployed in cloud environments (Azure, AWS). It covers what went well and, more importantly, what went wrong. I’ll start by defining state and explaining the differences between stateful and stateless apps (it’s not so obvious!).

Then I’ll discuss the strategies we can use in cloud environments to ensure high availability of our systems. We’ll go through scaling, multi-region deployments, and why we sometimes need to care where our machines are located.

In the third part of this talk, I’ll focus on tools that help us deal with state and on the high availability features that cloud providers offer for them. I’ll show you a live demo of Azure SQL failover and compare it to Cosmos DB. I’ll also discuss Storage and Queues. Understanding the limitations of the tools we use is as important as being aware of what happens under the hood; both are needed to build a reliable architecture.

I’ll sum up the talk by explaining what an SLA is and how to calculate it for your system (yes, there will be some math). So, are we problem hunters, or are we haunted by problems? Join my presentation, make your system highly available, and dream peaceful dreams.

Transcript

  1. Do Developers Dream of Stateless Apps?

    Java Global Summit, August 2nd, by Łukasz Gebel (@rauluka7)
  2. None
  3. None
  4. Do We Dream of Stateless Apps?

  5. None
  6. None
  7. None
  8. None
  9. None
  10. None
  11. State “A condition or way of being that exists at

    a particular time.” Cambridge Dictionary
  12. State - computer science “(...) a program is described as

    stateful if it is designed to remember preceding events or user interactions.” Wikipedia
  13. State - computer science “The remembered information is called the

    state of the system.” Wikipedia
  14. State, state everywhere

  15. State, state everywhere

  16. State, state everywhere

  17. State, state everywhere

  18. Stateful vs Stateless

  19. Fully Stateless

  20. Stateful or Stateless?

  21. Fully Stateful

  22. Awfully Stateful

  23. High availability - strategies

  24. Scaling • Vertical • Horizontal

  25. Multi-region

  26. Multi-region - basic units

  27. Multi-region Azure vs AWS Azure • 60 Regions...

  28. Multi-region Azure vs AWS Azure • 60 Regions... no Azure

    you have 53 ;)
  29. Multi-region Azure vs AWS Azure • 60 Regions... no Azure

    you have 53 ;) • 10 regions with Availability Zones
  30. Multi-region Azure vs AWS Azure • 60 Regions... no Azure

    you have 53 ;) • 10 regions with Availability Zones • Minimum 3xAZ per region
  31. Multi-region Azure vs AWS Azure • 60 Regions... no Azure

    you have 53 ;) • 10 regions with Availability Zones • Minimum 3xAZ per region AWS • 23 Regions
  32. Multi-region Azure vs AWS Azure • 60 Regions... no Azure

    you have 53 ;) • 10 regions with Availability Zones • Minimum 3xAZ per region AWS • 23 Regions • 70 Availability Zones
  33. Multi-region Azure vs AWS Azure • 60 Regions... no Azure

    you have 53 ;) • 10 regions with Availability Zones • Minimum 3xAZ per region AWS • 23 Regions • 70 Availability Zones • 2 or more AZ (except Osaka region)
  34. VM Placement Strategies • Fault domains • Update domains

  35. VM Placement Strategies

  36. Why should we care? • Distributed systems need coordination:

    • distribute configuration • synchronize state • 2n + 1 machines • Minimal HA setup = 3 instances
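The 2n + 1 rule on this slide can be checked with a little arithmetic. Below is a minimal Java sketch, assuming a coordination service that stays available only while a strict majority of its machines is up (the model ZooKeeper, mentioned a few slides later, uses by default). The class and method names are illustrative and do not come from the talk.

```java
// Illustrative helper; names are hypothetical and not taken from the talk.
final class QuorumMath {
    private QuorumMath() {}

    // Machines needed so that a strict majority survives n simultaneous failures: 2n + 1.
    static int ensembleSize(int toleratedFailures) {
        if (toleratedFailures < 0) {
            throw new IllegalArgumentException("toleratedFailures must be >= 0");
        }
        return 2 * toleratedFailures + 1;
    }

    // Smallest number of machines that forms a strict majority of the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Failures an ensemble of the given size can lose while still keeping a majority.
    static int toleratedFailures(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be >= 1");
        }
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 2; n++) {
            int size = ensembleSize(n);
            System.out.println("n=" + n + " -> " + size + " machines, majority " + majority(size));
        }
    }
}
```

Under this assumption, an ensemble of 4 tolerates the same single failure as an ensemble of 3, which is why odd sizes are the usual choice.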
  37. None
  38. None
  39. I DON'T WANT TO LIVE ON THIS PLANET ANYMORE...

  40. None
  41. So let's scale Zookeepers!

    N = 2, so 2N + 1 = 5
  42. None
  43. NOOOOOOOOOOOO!!!!!!!

  44. Highly Available Cloud Services

  45. Database

  46. Azure SQL • DB as a managed service • Microsoft

    SQL Server Database Engine • Scalability & High availability features
  47. Scalability Dynamic scaling • not equal to autoscaling! • manual

    process
  48. Scalability • Automatic scaling • Use elastic pool - databases

    share assigned resources • Implement custom solution based on DB metrics
  49. High Availability Features • Active Geo-Replication • Failover Groups

  50. Active Geo-Replication Primary Database

  51. Active Geo-Replication Secondary Database in the same or different region

  52. Active Geo-Replication Asynchronous data replication

  53. Active Geo-Replication Manual failover to secondary database

  54. Failover groups Secondary DB by default in other region West

    Europe North Europe
  55. Failover groups Automatic failover West Europe North Europe

  56. Failover groups Single connection string directing to current primary db

    West Europe North Europe YourPrimaryDB.com
  57. Paired regions • Physical isolation - if possible at least

    300 miles • Region recovery order - if multiple regions fail, one region of each pair is prioritized for recovery • Sequential updates - minimize the impact of bugs or breaking changes West Europe North Europe
  58. Demo time!

  59. Azure SQL - my experience

  60. But there is hope... Azure SQL gained new features

    lately: • Business Critical Tier • Zone Redundant deployments
  61. Amazon RDS • Amazon Aurora, MySQL, PostgreSQL ... • Multi-AZ

    deployment • Synchronous replication to a standby instance in a different AZ
  62. None
  63. Cosmos DB

  64. Cosmos DB • Multi-model DB • API: SQL, MongoDB, Cassandra,

    Gremlin and more • Document-based, table-row, graph, key-value
  65. High Availability Features • Single master - multiple readers replication

    • Multi master replication • Add and remove regions on the go
  66. Cosmos DB - High availability • Multi-homing api - interact

    with replica that is closest to you • Regional failovers
  67. HA - behind the scenes • Partitions are regionally redundant

    • Within a region every partition is replicated • Replicas have 10 - 20 fault domains 4 replicas X REGIONS
  68. Demo time!

  69. Cosmos DB - limitations • Document size is max 2 MB

    • Only 5 geospatial functions • Remember - it's a document based DB • Consistency levels - choose and handle
  70. Amazon DynamoDB • key-value, document based DB • multiregion •

    multimaster
  71. Storage

  72. Storage • Distributed resources, assets • Share resources by HTTPS

    • Backups, logs • Long living resources
  73. Locally-redundant storage 3x within datacentre

  74. Geo-redundant storage 3x primary region, then secondary region

  75. Read-access GRS You can read from secondary region ;)

  76. Azure Storage V2

  77. Azure Storage V2 • Zone-redundant storage (ZRS) - 3 clusters

    in different AZ • Geo-zone-redundant storage (GZRS) - ZRS + secondary region • Still, reading requires RA-GZRS ;)
  78. Azure queue • Integrate different parts of a system or different

    systems • Infinite time to live (TTL) • 500 TB • Order is not guaranteed (Service Bus message sessions provide it)
  79. Azure queue • Unlimited number of concurrent clients

  80. And suddenly • 503 Server busy ...
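The slides do not say how the 503 responses were handled. One common way to cope with a throttled service is to retry with exponential backoff, sketched below in Java. Everything here is an assumption made for illustration: `StatusCall` is a hypothetical stand-in for a queue operation that reports an HTTP status code, and it is not part of any Azure SDK.

```java
// Generic retry sketch; not taken from the talk and not tied to a real SDK.
final class BackoffRetry {
    private BackoffRetry() {}

    // Hypothetical stand-in for a queue call that returns an HTTP status code.
    interface StatusCall {
        int invoke();
    }

    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // The exponent is clamped to 0..20 to keep the shift well-defined and the value small.
        int shift = Math.max(0, Math.min(attempt, 20));
        return Math.min(baseMillis << shift, maxMillis);
    }

    // Invokes the call up to maxAttempts times in total, sleeping between attempts
    // while the status is 503. Returns the last status code seen.
    static int callWithRetry(StatusCall call, int maxAttempts, long baseMillis, long maxMillis) {
        int status = call.invoke();
        for (int attempt = 0; status == 503 && attempt < maxAttempts - 1; attempt++) {
            try {
                Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = call.invoke();
        }
        return status;
    }
}
```

A caller would wrap its real queue operation in a `StatusCall`. Production code usually also adds random jitter to the delay so that many clients do not retry in lockstep.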

  81. AWS • S3 • SQS • More mature Java SDKs

  82. Storage - experience • Use latest APIs, SDKs (Java ones

    are not as good as the .NET ones, but they are improving!) • Take care of storage life cycle • Be ready for migration
  83. Proxy

  84. Dynamic approach

  85. SLA - Service Level Agreement

  86. SLA “An SLA is a contractual agreement between a service

    provider and a customer buying a service.” What Are The Chances An Availability SLA Will Be Violated? [1]
  87. SLA “The agreement stipulates some minimum Quality of Service (QOS)

    requirement.” What Are The Chances An Availability SLA Will Be Violated? [1]
  88. SLA • How to understand it?

  89. SLA Example Let's assume that we need both services at

    the same time: • DB 99.99% • Storage 99.99% What's our SLA? • Use uptime approach
  90. Probability When two events, A and B, are independent, the

    probability of both occurring is: P(A and B) = P(A) * P(B) MathGoodies.com
  91. Uptime approach DB↑ - DB is up S↑ - Storage

    is up P(DB↑) = 0.9999 P(S↑) = 0.9999 P(DB↑) * P(S↑) = 0.9999 * 0.9999
  92. Uptime approach 99.980001%
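The uptime approach from the previous slides fits in a few lines of Java. This is a minimal sketch, assuming the services fail independently (the same assumption the probability slide makes). The class and method names are illustrative. The conversion to minutes of downtime per year is an addition for intuition and is not shown on the slides.

```java
// Illustrative sketch of the uptime approach; names are hypothetical.
final class SerialSla {
    private SerialSla() {}

    // Availability of a system that needs every listed service at the same time,
    // assuming the services fail independently: the product of the availabilities.
    static double allRequired(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }

    // Downtime per 365-day year, in minutes, that the given availability still allows.
    static double downtimeMinutesPerYear(double availability) {
        return (1.0 - availability) * 365 * 24 * 60;
    }

    public static void main(String[] args) {
        double sla = allRequired(0.9999, 0.9999); // DB and Storage, as on the slide
        System.out.printf("Composite SLA: %.6f%%%n", sla * 100);
        System.out.printf("Allowed downtime: %.1f minutes/year%n", downtimeMinutesPerYear(sla));
    }
}
```

With the slide's numbers, the composite comes to 99.980001%, which corresponds to roughly 105 minutes of allowed downtime per year, about twice what a single 99.99% service allows.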

  93. So let's go multiregion!

  94. Multiregion SLA Traffic Manager 99.99% Europe↑ - up US↑ -

    up Europe↓ - down US↓ - down P(Europe↑ or US↑) = 1 - P(Europe↓ and US↓)
  95. Multiregion SLA Traffic Manager 99.99% Europe↑ - up US↑ -

    up Europe↓ - down US↓ - down P(Europe↑ or US↑) = 1 - P(Europe↓) * P(US↓)
  96. Multiregion SLA P(Europe↓) = P(US↓) = 1 - 0.99980001 =

    0.00019999
  97. Multiregion SLA P(Europe↑ or US↑) = 1 - 0.00019999 *

    0.00019999
  98. Multiregion SLA 99.999996%

  99. Multiregion SLA TM↑ - Traffic Manager is up TM= 99.99%

    P(TM↑) * P(Europe↑ or US↑) = 0.9999 * 0.99999996
  100. Multiregion SLA 99.989996%
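The whole multi-region calculation can be reproduced as a short Java sketch. It assumes independent failures throughout, as the slides do. The class and method names are illustrative and not taken from the talk's code.

```java
// Illustrative sketch of the multi-region SLA calculation; names are hypothetical.
final class MultiRegionSla {
    private MultiRegionSla() {}

    // Serial composition: every component must be up (independent failures assumed).
    static double allOf(double... availabilities) {
        double up = 1.0;
        for (double a : availabilities) {
            up *= a;
        }
        return up;
    }

    // Parallel composition: at least one redundant deployment must be up,
    // i.e. 1 minus the probability that all of them are down together.
    static double anyOf(double... availabilities) {
        double allDown = 1.0;
        for (double a : availabilities) {
            allDown *= (1.0 - a);
        }
        return 1.0 - allDown;
    }

    public static void main(String[] args) {
        double region = allOf(0.9999, 0.9999);       // DB and Storage in one region
        double eitherRegion = anyOf(region, region); // Europe or US is up
        double total = allOf(0.9999, eitherRegion);  // Traffic Manager in front of both
        System.out.printf("Single region:        %.6f%%%n", region * 100);
        System.out.printf("Europe or US:         %.6f%%%n", eitherRegion * 100);
        System.out.printf("With Traffic Manager: %.6f%%%n", total * 100);
    }
}
```

The result shows the point of these slides: the two regions together are far more available than one, but the component placed in front of them puts a ceiling on the total, so the system ends up slightly below the Traffic Manager's own 99.99%.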

  101. Complexity is deceptively simple

  102. None
  103. None
  104. None
  105. None
  106. None
  107. None
  108. Call to action! 1. Calculate SLA for your system.

  109. Call to action! 1. Calculate SLA for your system. 2.

    Play with HA features and be ready for failure.
  110. Call to action! 1. Calculate SLA for your system. 2.

    Play with HA features and be ready for failure. 3. Check whether you are using the cloud geo-infrastructure in a way that fits your HA needs.
  111. Do We Dream of Stateless Apps?

  112. None
  113. Q & A

  114. Bibliography

    • https://nofluffjuststuff.com/magazine/2017/10/cloud_native_apps_must_be_stateless_myth_or_fact_
    • https://www.researchgate.net/publication/221056071_What_Are_the_Chances_an_Availability_SLA_will_be_Violated
    • https://vincentlauzon.com/2018/01/22/solution-slas-in-azure/
    • http://tuxlabs.com/?p=267
    • https://www.bizety.com/2018/08/21/stateful-vs-stateless-architecture-overview/
    • https://www.xenonstack.com/insights/stateful-and-stateless-applications/
    • https://docs.microsoft.com/en-us/azure/architecture/aws-professional/
    • https://cloudacademy.com/blog/aws-regions-and-availability-zones-the-simplest-explanation-you-will-ever-find-around/
  115. Bibliography

    • https://aws.amazon.com/about-aws/global-infrastructure/regions_az/
    • https://docs.aws.amazon.com/en_pv/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html
    • https://docs.microsoft.com/en-us/azure/best-practices-availability-paired-regions
    • https://docs.microsoft.com/en-us/azure/sql-database/sql-database-disaster-recovery-strategies-for-applications-with-elastic-pool
    • https://azure.microsoft.com/en-us/blog/azure-sql-databases-disaster-recovery-101/
    • https://docs.microsoft.com/en-us/azure/sql-database/sql-database-high-availability
    • https://docs.microsoft.com/en-us/azure/sql-database/sql-database-auto-failover-group

    Graphics attributions:

    • https://www.behance.net/gallery/64391071/Tinkoff-systems?tracking_source=search%7Ccyber%20map
    • https://www.studiodaily.com/2018/01/designing-retro-tech-look-future-blade-runner-2049/
    • https://www.behance.net/gallery/42879183/URBS?tracking_source=search%7Cfuturistic%20warehouse
  116. Slides and code • Code: https://github.com/rauluka/state-runner-code