Slide 1

Slide 1 text

@debe & @tboeghk 🔪 How we cut our AWS costs in half

Slide 2

Slide 2 text

@debe software engineer, architect and performance enthusiast #jvm #high performance #rust
@tboeghk Freelance Search & Operations Engineer #solr #observability #ops #java

Slide 3

Slide 3 text

📚 A holistic approach
• Project profile
• Architectural challenges
• Infrastructure challenges
• Software challenges
• Lessons learned
Photo by Agnieszka Kowalczyk on Unsplash

Slide 4

Slide 4 text

(1) Architectural challenges

Slide 5

Slide 5 text

🛒 Project setting
Photo by Mike Petrucci on Unsplash

Slide 6

Slide 6 text

🏑 Platform architecture

Slide 7

Slide 7 text

🐳 Team system architecture

Slide 8

Slide 8 text

🐳 Solr cluster topology Split API & shop traffic API cluster absorbes request peaks Medium cluster utilization πŸ’° 1k per instance/month

Slide 9

Slide 9 text

🐳 Solr Cluster topology Goal: get rid of redundant infrastructure How to enable Solr to handle request spikes?

Slide 10

Slide 10 text

(2) Operational challenges
Photo by Chris Lawton on Unsplash

Slide 11

Slide 11 text

🦾 ARM our lord and savior Docker buildx to build 
 multi-arch builds Custom arm64 AMI Multi-Arch ASGs via MixedInstances in LaunchTemplate Graviton AMD Intel

Slide 12

Slide 12 text

💰 ARM vs AMD
[Chart: Linux load1 (less is better), Graviton2 vs AMD. Annotation: AMD power management ⚡]

Slide 13

Slide 13 text

🏎 Graviton2 vs Graviton3 Linux load1 (less is better) Graviton3 Graviton2

Slide 14

Slide 14 text

🏝 Our instance type journey

Slide 15

Slide 15 text

🌱 monitor ecosystem innovation

Slide 16

Slide 16 text

(3) Software challenges
Photo by Alexander Hafemann on Unsplash

Slide 17

Slide 17 text

🙌 Why software challenges?
• Response time is tied to CPU utilization 😱
Photo by Jeremy Lapak on Unsplash

Slide 18

Slide 18 text

👨‍✈️ Java Flight Recorder
• Event-based tracing framework built into the JVM
• Very low overhead (< 1%)
• Designed for production use
• Free to use
Photo by Richard Cartmell on Unsplash

Slide 19

Slide 19 text

👩‍✈️ JDK Mission Control
• Java object locking problems

Slide 20

Slide 20 text

👩‍✈️ JDK Mission Control: Locking
• Monitor classes
• Thread wait time
• Call stack
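The three columns named on this slide (monitor class, thread wait time, call stack) can also be pulled out of a recording without the Mission Control UI. The following is a minimal sketch, not the speakers' tooling: it assumes the recording contains the standard `jdk.JavaMonitorEnter` events and sums blocked time per monitor class. The class name `LockContentionSummary` and its method are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Hypothetical helper: aggregates monitor contention from a .jfr file.
class LockContentionSummary {

    /** Sums the time threads spent waiting to enter monitors, grouped by monitor class. */
    static Map<String, Duration> blockedTimeByMonitorClass(Path jfrFile) {
        Map<String, Duration> totals = new TreeMap<>();
        try (RecordingFile file = new RecordingFile(jfrFile)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                // Only contended monitor enters are of interest here.
                if (!"jdk.JavaMonitorEnter".equals(event.getEventType().getName())) {
                    continue;
                }
                RecordedClass monitor = event.getClass("monitorClass");
                String name = monitor == null ? "<unknown>" : monitor.getName();
                totals.merge(name, event.getDuration(), Duration::plus);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totals;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java LockContentionSummary <recording.jfr>");
            return;
        }
        blockedTimeByMonitorClass(Path.of(args[0]))
                .forEach((monitor, blocked) ->
                        System.out.println(monitor + " blocked for " + blocked.toMillis() + " ms"));
    }
}
```

The call stack for each event is available through `event.getStackTrace()` if a per-frame breakdown is needed.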

Slide 21

Slide 21 text

🛬 Flight recorder always on
-XX:StartFlightRecording=disk=true,maxsize=512M,dumponexit=true,name=continuous,settings=default,filename=/tmp/jfr/search_REDACTED.jfr
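For applications where the JVM flags cannot be changed, roughly the same continuous recording can be started from code through the `jdk.jfr` API. This is a sketch under assumptions, not the setup from the talk: the class name `ContinuousRecording` is hypothetical and the destination is a temporary file chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

// Hypothetical helper mirroring the options of the command-line flag.
class ContinuousRecording {

    /** Starts a recording whose destination directory must already exist. */
    static Recording start(Path destination) {
        try {
            // "default" is the low-overhead settings profile shipped with the JDK.
            Configuration settings = Configuration.getConfiguration("default");
            Recording recording = new Recording(settings);
            recording.setName("continuous");           // name=continuous
            recording.setToDisk(true);                 // disk=true
            recording.setMaxSize(512L * 1024 * 1024);  // maxsize=512M
            recording.setDumpOnExit(true);             // dumponexit=true
            recording.setDestination(destination);     // filename=...
            recording.start();
            return recording;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParseException e) {
            throw new IllegalStateException("Could not parse JFR settings", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative destination only; the path used in the talk is redacted.
        Path destination = Files.createTempFile("continuous", ".jfr");
        Recording recording = start(destination);
        System.out.println(recording.getName() + " is " + recording.getState());
        recording.stop();   // writes the data to the destination file
        recording.close();
        System.out.println("Wrote " + Files.size(destination) + " bytes to " + destination);
    }
}
```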

Slide 22

Slide 22 text

🛬 Flight recorder always on

Slide 23

Slide 23 text

🧑‍💻 Analysis process

Slide 24

Slide 24 text

Flight Recorder findings using 60s on-demand recordings
Photo by Elias Maurer on Unsplash

Slide 25

Slide 25 text

RSA certificates

Slide 26

Slide 26 text

🪵 Async logging
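The idea behind asynchronous logging is that request threads only enqueue a log record, while a single background thread does the slow I/O. The sketch below illustrates that pattern with `java.util.logging`. It is an assumption-laden illustration, not the configuration used in the talk and not how Solr's own logging framework implements it. `AsyncHandler` is a hypothetical class, and the queue size of 10,000 is arbitrary.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical handler: hands records to a delegate on a background thread.
class AsyncHandler extends Handler {
    private final Handler delegate;
    private final BlockingQueue<LogRecord> queue = new LinkedBlockingQueue<>(10_000);
    private final Thread worker;
    private volatile boolean running = true;

    AsyncHandler(Handler delegate) {
        this.delegate = delegate;
        this.worker = new Thread(this::drain, "async-log-writer");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    @Override
    public void publish(LogRecord record) {
        if (isLoggable(record)) {
            // Never block the caller: the record is dropped if the queue is full.
            queue.offer(record);
        }
    }

    private void drain() {
        try {
            // Keep writing until close() was called and the queue is empty.
            while (running || !queue.isEmpty()) {
                LogRecord record = queue.poll(100, TimeUnit.MILLISECONDS);
                if (record != null) {
                    delegate.publish(record);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void flush() {
        delegate.flush();
    }

    @Override
    public void close() {
        running = false;
        try {
            worker.join(1000); // give the writer a moment to drain the queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        delegate.close();
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);
        AsyncHandler async = new AsyncHandler(new ConsoleHandler());
        logger.addHandler(async);
        logger.info("written by the background thread");
        async.close();
    }
}
```

Dropping records when the queue is full trades completeness for latency; a production setup would have to decide that trade-off explicitly.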

Slide 27

Slide 27 text

🀦 Locking in Solr

Slide 28

Slide 28 text

🧞 Deep dive into Solr code

Slide 29

Slide 29 text

👌 Locking in Solr

Slide 30

Slide 30 text

🛫 Achievement unlocked
• Scale out at 50% CPU usage possible 💪
[Chart: Linux load1 (ideal @ 32)]
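Assuming "ideal @ 32" refers to an instance with 32 vCPUs, the chart can be read as load1 relative to core count: a ratio of 1.0 means the run queue matches the available cores. A small sketch of that arithmetic using the standard JVM management beans follows. The class name `LoadPerCore` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: expresses load1 as a fraction of the available cores.
class LoadPerCore {

    /** Returns load1 divided by the core count; 1.0 means one runnable task per core. */
    static double normalizedLoad(double load1, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return load1 / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load1 = os.getSystemLoadAverage(); // negative when the platform cannot report it
        int cores = os.getAvailableProcessors();
        if (load1 < 0) {
            System.out.println("Load average is not available on this platform");
        } else {
            System.out.printf("load1=%.2f cores=%d ratio=%.2f%n",
                    load1, cores, normalizedLoad(load1, cores));
        }
    }
}
```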

Slide 31

Slide 31 text

📉 Overall reductions: 1/3rd 😱

Slide 32

Slide 32 text

(4) Lessons learned
Photo by Kelli Tungay on Unsplash

Slide 33

Slide 33 text

🤓 Failures
• Spot instances
• Java Z garbage collector (for now)

Slide 34

Slide 34 text

👴 G1 Garbage Collector
• Minimum pause collector
• 200ms max collection time
• Uses 90-105ms for a 32g heap

Slide 35

Slide 35 text

♻ Z Garbage Collector
• Reduces collection time to 600µs for a 32g heap 😳
• Uses Linux huge pages
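Collection times like the ones quoted for G1 and ZGC can be roughly cross-checked from inside a running JVM. The sketch below averages the accumulated collection time per collector using the standard management beans. It is an illustration only, with a hypothetical class name `GcPauseReport`. The beans report whole milliseconds, so sub-millisecond pauses such as the ZGC figure are better taken from GC logs or Flight Recorder data.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the average collection time per collector.
class GcPauseReport {

    /** Average time per collection in milliseconds; 0.0 when nothing was collected yet. */
    static double averageMillis(long totalMillis, long count) {
        return count <= 0 ? 0.0 : (double) totalMillis / count;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            long time = gc.getCollectionTime();   // accumulated, approximate, in milliseconds
            System.out.printf("%s: %d collections, %.2f ms on average%n",
                    gc.getName(), count, averageMillis(time, count));
        }
    }
}
```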

Slide 36

Slide 36 text

🐾 Lessons learned β€’ Have strong observability in place β€’ Involve yourself in the OSS software you use – yes maybe you are the first one having this exact problem β€’ Test in prod or live a lie

Slide 37

Slide 37 text

Dennis Berger (@debe)
Torsten Köster (@tboeghk)
Questions?