Facing Vulnerabilities in Spring Boot Architectures

YURY NIÑO ROA Site Reliability Engineer Chaos Engineering Advocate @yurynino
https://www.yurynino.dev/

Agenda
• What could go wrong?
• War Stories on PROD.
• Stability: Antipatterns.
• Resilience: Patterns.
• Framework for Chaos GameDays.
• Demo: Chaos Monkey for Spring Boot.

TITANIC QUEBEC BRIDGE

What could go wrong?
Our bodies face all kinds of adversities: genetic mutations, toxic substances, attacks by (corona)viruses and bacteria, and a whole lot of diseases. In this dangerous world, how can they still be alive?

What could go wrong?
Our systems face all kinds of adversities: hard disks fail, networks go down, customer traffic overloads them, and cyberattacks happen. In this chaotic world, how can they still be alive?

War Stories on PROD

Netflix, Twitter
The infrastructure required by a software system can be as complex as the software itself. Every production failure is unique. No two incidents will share the precise chain of failure!

Stability: Antipatterns

Integration Points
Every integration point will eventually fail in some way, and you need to be prepared for that failure.
Defend with:
• Circuit Breaker.
• Timeouts.
• Decoupling Middleware.
• Handshaking.
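To make the Circuit Breaker defense concrete, here is a minimal sketch in plain Java. It is not taken from the talk or from any library: the class name SimpleCircuitBreaker, the failure threshold, and the cool-down period are all assumptions chosen for illustration. After a number of consecutive failures the breaker opens and callers receive a fallback immediately; once the cool-down has elapsed, one trial call is allowed through again.

```java
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker; names and thresholds are illustrative.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit has never tripped or is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Runs the call through the breaker and returns the fallback when the
    // circuit is open or the call throws a runtime exception.
    synchronized <T> T call(Supplier<T> remoteCall, T fallback) {
        if (isOpen()) {
            return fallback; // fail fast: do not touch the integration point
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0; // a success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip (or re-trip) the breaker
            }
            return fallback;
        }
    }

    // Open while the cool-down is running; afterwards a trial call is allowed.
    synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        return System.currentTimeMillis() - openedAt < openMillis;
    }
}
```

For simplicity this sketch holds a lock during the remote call, which serializes callers. A production system would more likely rely on an established library than on a hand-written breaker.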

Chain Reactions
A chain reaction happens because the death of one server makes the others pick up the slack.
Defend with:
• Defensive programming.
• Circuit Breakers.
• Bulkheads.
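The Bulkheads defense can be sketched with a semaphore that caps how many calls may run against one dependency at the same time, so a struggling dependency cannot absorb every thread. The class name Bulkhead and the rejectedValue parameter are hypothetical, not something shown in the talk.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore-based bulkhead; names are illustrative.
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the task only if a permit is free; otherwise rejects immediately
    // instead of queueing more work onto an already saturated compartment.
    <T> T call(Supplier<T> task, T rejectedValue) {
        if (!permits.tryAcquire()) {
            return rejectedValue;
        }
        try {
            return task.get();
        } finally {
            permits.release(); // always give the permit back
        }
    }
}
```

Giving each downstream dependency its own Bulkhead instance keeps a failure in one compartment from draining capacity that the others need.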

Cascading Failures
A cascading failure occurs when cracks jump from one system to another, until the calling threads end up blocked forever.
Defend with:
• Defensive programming.
• Circuit breakers.
• Timeouts.

Users
Users are a terrible thing. Users consume memory. Systems would be much better off with no users :) Users do weird things!
• It is hard to predict what users will do.
• Use AI techniques.
• Run special stress tests to hammer deep links.

Blocked Threads
Blocked threads are related to most types of failure, including gradual slowdowns and hung servers.
Defend with:
• Always use timeouts, even though they require more error-handling code.
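A minimal sketch of the timeout advice, using only the JDK. The helper name callWithTimeout and the fallback parameter are assumptions made for illustration. The three catch blocks are the extra error-handling code that timeouts bring with them.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper; names are illustrative.
class TimeoutCalls {
    // Waits at most timeoutMillis for the task and returns the fallback otherwise.
    static <T> T callWithTimeout(Supplier<T> task, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled. The worker thread is not
            // interrupted, but the calling thread is no longer blocked.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        }
    }
}
```

The caller's thread is freed after the timeout, but the work may keep running in the background, so the underlying client (socket, HTTP, database driver) should also have its own timeout configured.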

Self-Denial Attacks
Self-denial attacks originate inside your own organization, when people cause self-inflicted wounds.
Defend with:
• Protect shared resources.
• Avoid unexpected scaling effects.
• Avoid letting front-end load drive up back-end processing.

Scaling Effects
If you have a many-to-few relationship, you can be hit by scaling effects when one side increases.
Defend with:
• Examine PROD vs QA environments to spot scaling effects.
• Avoid point-to-point communication.
• Avoid shared resources.

Unbalanced Capacities
Mismatched ratios between layers let one tier flood another with requests beyond its capacity.
Defend with:
• Examine server and thread counts.
• Virtualize QA and scale it up.
• Stress both sides of the interface.
• Watch for the closely related Scaling Effects and Users antipatterns.

Dogpile
A dogpile concentrates demand and forces you to spend too much to handle peak demand.
Defend with:
• Use random clock slews.
• Don’t set all cron jobs for midnight.
• Use increasing backoff times to avoid pulsing.
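The clock slew and backoff advice can be sketched as two small helpers. The class name Backoff, both method names, and the choice of keeping half of each delay fixed and randomizing the other half are assumptions for illustration, not something prescribed by the talk.

```java
import java.util.Random;

// Hypothetical helpers for spreading demand over time; names are illustrative.
class Backoff {
    // Delay before retry number `attempt` (0-based): grows exponentially from
    // baseMillis, is capped at maxMillis, and half of it is randomized so that
    // many clients retrying together do not pulse in lockstep.
    // Assumes baseMillis is small enough that baseMillis << 20 does not overflow.
    static long delayMillis(int attempt, long baseMillis, long maxMillis, Random random) {
        long exponential = baseMillis << Math.min(attempt, 20);
        long capped = Math.min(exponential, maxMillis);
        long half = capped / 2;
        return half + (long) (random.nextDouble() * (capped - half));
    }

    // Start time of a scheduled job shifted by a random slew, so that not
    // every job fires at exactly the same instant (for example, midnight).
    static long slewedStartMillis(long scheduledMillis, long maxSlewMillis, Random random) {
        return scheduledMillis + (long) (random.nextDouble() * maxSlewMillis);
    }
}
```

A caller would sleep for delayMillis(attempt, ...) between retries and use slewedStartMillis when registering periodic jobs.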

Slow Responses
Slow responses happen when response times exceed the system's own timeouts.
Defend with:
• Fail fast.
• Send an immediate error response.
• Probe your timeouts in many scenarios.
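One way to read the fail-fast advice is that a service tracks its own responsiveness and answers with an immediate error once it knows it is too slow. The sketch below does this with an exponentially weighted moving average. The class name ResponseTimeGuard, the smoothing factor alpha, and the allowed time are all hypothetical.

```java
// Hypothetical guard that decides when to fail fast; names are illustrative.
class ResponseTimeGuard {
    private final long allowedMillis;
    private final double alpha; // weight of the newest sample, between 0 and 1
    private double averageMillis = 0.0;
    private boolean hasSample = false;

    ResponseTimeGuard(long allowedMillis, double alpha) {
        this.allowedMillis = allowedMillis;
        this.alpha = alpha;
    }

    // Folds one observed response time into the moving average.
    synchronized void record(long elapsedMillis) {
        if (!hasSample) {
            averageMillis = elapsedMillis;
            hasSample = true;
        } else {
            averageMillis = alpha * elapsedMillis + (1 - alpha) * averageMillis;
        }
    }

    // True when the average exceeds the allowed time, meaning new requests
    // should get an immediate error instead of a slow answer.
    synchronized boolean shouldFailFast() {
        return hasSample && averageMillis > allowedMillis;
    }
}
```

A limitation of this sketch: if every request is rejected, no new samples arrive and the average never recovers, so a real service would still let occasional probe requests through.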

Unbounded Results
What happens when the database suddenly returns five million rows instead of the usual hundred or so?
Defend with:
• Use realistic data volumes.
• Paginate at the front-end.
• Don’t rely on the data producers.
• Put limits in your application-level protocols.
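Not relying on the data producer can be as simple as having the consumer enforce its own limit. The sketch below reads at most maxRows items from any iterator; the class and method names are hypothetical. With JDBC, Statement.setMaxRows or a LIMIT clause in the query plays the same role at the query level.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical consumer-side limit; names are illustrative.
class BoundedResults {
    // Reads at most maxRows items, no matter how many the producer offers.
    static <T> List<T> readAtMost(Iterator<T> rows, int maxRows) {
        List<T> page = new ArrayList<>();
        while (page.size() < maxRows && rows.hasNext()) {
            page.add(rows.next());
        }
        return page;
    }
}
```

Because the limit is checked before asking for the next row, the caller never pulls more than maxRows items into memory.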

Resilience: Patterns

Face them with Resilience

Fail Fast Bulkhead Fail Fast

Circuit Breaker (taken from Release It!)

How to verify that we are facing these antipatterns properly?

Chaos Engineering
It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience. https://principlesofchaos.org/

Security Chaos Engineering
It is the identification of security control failures through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production. (Security Chaos Engineering book)

Chaos Principles
• Hypothesize about Steady State
• Run Experiments
• Vary Real-World Events
• Automate Experiments
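The principles above can be read as a small loop: confirm the steady state, inject a real-world event, check whether the steady state still holds, and roll back. The sketch below is one hypothetical way to automate that loop in plain Java. The class name ChaosExperiment, the Outcome values, and the three callbacks are assumptions, not the API of Chaos Monkey for Spring Boot or of any other tool.

```java
import java.util.function.BooleanSupplier;

// Hypothetical experiment runner; names are illustrative.
class ChaosExperiment {
    enum Outcome { ABORTED_NO_STEADY_STATE, HYPOTHESIS_HELD, WEAKNESS_FOUND }

    // steadyState: probe that returns true while the system behaves normally.
    // injectFault: introduces the real-world event (latency, exception, kill).
    // rollback:    removes the fault again, even if the probe throws.
    static Outcome run(BooleanSupplier steadyState, Runnable injectFault, Runnable rollback) {
        if (!steadyState.getAsBoolean()) {
            return Outcome.ABORTED_NO_STEADY_STATE; // never experiment on a sick system
        }
        injectFault.run();
        try {
            return steadyState.getAsBoolean()
                    ? Outcome.HYPOTHESIS_HELD
                    : Outcome.WEAKNESS_FOUND;
        } finally {
            rollback.run();
        }
    }
}
```

In practice the steady-state probe would query a metric or health endpoint, and the fault injection would be delegated to a chaos tool.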

Chaos History
2008: Chaos Engineering was born at Netflix
2010: Chaos Monkey & Simian Army were launched
2016: Gremlin was born
2019: Chaos Massification
2017: SRE USenix, Chaos IQ, ChaosConf
2018: Book Chaos Eng
2020: Book Chaos Eng

Chaos Tools
• Chaos Monkey
• Chaos Toolkit
• Gremlin
• Chaos Mesh
• Chaos Monkey for Spring Boot

Practicing Chaos: GameDays
GameDays are interactive, real-world learning exercises. They are designed to give players a chance to put their skills with a technology to the test. GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter.
Our Journey

GameDays Framework
Before
• Pick a hypothesis.
• Pick a style.
• Decide who.
• Decide where.
• Decide when.
• Document.
• Get approval!
During
• Detect the situation.
• Take a deep breath.
• Communicate.
• Visit dashboards.
• Analyze data.
• Propose solutions.
• Apply and solve!
After
• Write a postmortem.
• What Happened
• Impact
• Duration
• Resolution Time
• Resolution
• Timeline
• Action Items
Russ Miles

GameDays Framework: Evolve
• Improve your method.
• Integrate in pipelines.
• Adjust metrics.
• Validate CMM position.
• Adapt next GameDay.
• Continuous Verification.

GameDays Framework: Automate
• Spring Boot
• Chaos Monkey
• Azure
• Pulumi

Demo Time

Chaos Maturity Model

Our Solutions

How to Begin?
https://www.gremlin.com
https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaosengineering
https://www.infoq.com/chaos-engineering

Take a look at these books! They were useful for this talk!

Thanks for coming! @yurynino
