Slide 1

Project “Instance Lottery” (Instance Gacha)
Hiroshi Hayakawa

Slide 2

Hiroshi HAYAKAWA
● AWS Community Builders (Security & Identity)
● AWS Ambassadors
● Japan AWS Top Engineers (Services)
● Japan AWS All Certifications Engineers
● GameDay enthusiast: 🥇x2 🥈x1 🥉x3
● Favorites: GuardDuty, Step Functions
(Photo from AWS Blog)

Slide 3

Agenda
1. Introduction
2. Measurement Methods
3. Lottery Results
4. Takeaways

Slide 4

1. Introduction

Slide 5

1.1 Motivations
● Trigger = an X post
  ○ Someone implemented proactive instance replacement in response to noticeable differences in network latency to an RDS endpoint.
● First impressions:
  ○ Distance-induced latency should be around 500 microseconds, which is unlikely to impact performance for most applications.
  ○ Other factors in the AWS infrastructure could be causing the latency.
Distance between two DCs: < 100 km?
cf. https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html

Slide 6

1.2 Goals and approach
● Goals
  ○ Diving deep into instance “hit or miss”
● Approach
  ○ Launch as many EC2 instances as possible
  ○ Measure latency and look for instances with significantly higher latency than the others

Slide 7

2. Measurement Methods

Slide 8

2.1 Measurement method
● Selected method:
  ○ Develop a simple tool, “DBPing” (Psycopg 3), that connects to and disconnects from RDS and records the TAT
  ○ Collect tcpdump logs and calculate TAT & RTT at the TCP level (RTT from the three-way handshake sequence)
[Sequence diagram: Connect (SYN) → SYN+ACK → Connected (ACK) … Disconnect (FIN+ACK) → Disconnected (ACK); RTT spans SYN to SYN+ACK, TAT spans Connect to Disconnected]
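The DBPing tool itself is not public; under the assumption that it simply times one full connect/disconnect cycle with Psycopg 3, a minimal sketch might look like this (`timed_call`, `dbping`, and the `conninfo` string are illustrative names, not the talk's actual code):

```python
import time

def timed_call(fn) -> float:
    """Return the turnaround time (TAT) of one call to fn(), in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def dbping(conninfo: str) -> float:
    """One DBPing trial: TAT of a full connect + disconnect against the
    RDS endpoint, measured at the application level with Psycopg 3."""
    import psycopg  # Psycopg 3

    def cycle():
        conn = psycopg.connect(conninfo, connect_timeout=5)
        conn.close()

    return timed_call(cycle)
```

Note that TCP-level RTT is invisible from this vantage point, which is why the method pairs DBPing with tcpdump captures of the three-way handshake.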

Slide 9

2.2 Validation of the selected method
● Objectives:
  ○ Check the reliability of the measurement method
  ○ Check whether the metrics are suitable for identifying “hit-or-miss” instances
  ○ Determine the configuration of the large-scale measurement (# of trials per instance, instance types, …)
● Configuration:
  ○ EC2 instance types: m7g.large & t4g.large (2 vCPUs)
  ○ RDS instance type: db.m7g.large, PostgreSQL (15.8-R1)
  ○ Place the EC2 instances and RDS in the same AZ (usw2-az2)
  ○ Run DBPing every second for 3 hours on each instance
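The cadence above (one DBPing per second for 3 hours, i.e. about 10,800 trials per instance) could be driven by a fixed-interval loop like the following sketch; scheduling against an absolute tick avoids drift from each trial's own duration. Function and parameter names are illustrative:

```python
import time

def run_series(ping, interval_s: float = 1.0, count: int = 3 * 60 * 60):
    """Call ping() once per interval and return the list of measured
    TATs in milliseconds. Default: one trial per second for 3 hours."""
    tats_ms = []
    next_tick = time.perf_counter()
    for _ in range(count):
        start = time.perf_counter()
        ping()
        tats_ms.append((time.perf_counter() - start) * 1000.0)
        next_tick += interval_s
        # Sleep to the next absolute tick so slow trials don't delay later ones.
        time.sleep(max(0.0, next_tick - time.perf_counter()))
    return tats_ms
```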

Slide 10

2.3 Validation results #1
● RTT is stable enough, and its outliers will have little impact on app performance.
  ○ Narrow box range, small deviation; the remaining spikes are sudden and not consistent.
● The first measurement of each instance should be excluded.
  ○ The largest outlier consistently appears at the 1st trial.
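The first-trial exclusion can be baked directly into the per-instance summary statistics. A sketch, with the caveat that the talk does not state exactly which statistics it computed, so median/IQR/max here are assumptions:

```python
from statistics import quantiles

def summarize(samples_ms, drop_first=True):
    """Box-plot-style summary of one instance's latency series.
    The first trial is dropped by default, since the validation runs
    showed the largest outlier consistently at the 1st measurement."""
    data = samples_ms[1:] if drop_first else list(samples_ms)
    q1, q2, q3 = quantiles(data, n=4)  # quartiles
    return {"median": q2, "iqr": q3 - q1, "max": max(data)}
```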

Slide 11

2.4 Validation results #2
● TAT or RTT as the KPI?
  ○ Little correlation is seen between RTT and TAT.
  → TAT is not a suitable metric for deciding whether an instance needs replacement due to network latency.
● Instance type difference:
  ○ m7g.large shows ~10% faster TAT than t4g.large.
  ○ RTT shows no difference, as expected.
  → Burstable instances are applicable.
[Charts: scatter plot showing little correlation between RTT and app-level TAT; TAT distributions for m7g.large vs t4g.large]

Slide 12

3. Lottery Results

Slide 13

3.1 Lottery Time
● Configuration:
  ○ EC2
    - Auto Scaling group across four AZs in Oregon with Spot Instances
    - Instance type: any (2+ vCPUs, 1 GB+ memory) except the A1 family
    - Tested 12,239 instances
  ○ RDS
    - Instance type: db.m7g.large, PostgreSQL (15.8-R1)
    - Placement: us-west-2a
  ○ DBPing
    - Run every second, 60 times per instance (1st measurement excluded)
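The instance-type constraint above (any type with 2+ vCPUs and 1 GB+ memory, excluding A1) maps naturally onto EC2 Auto Scaling's attribute-based instance type selection. A sketch of what the MixedInstancesPolicy could look like; key names follow the EC2 Auto Scaling API, but the launch template name is hypothetical and this is not necessarily the talk's exact setup:

```python
# Attribute-based instance type selection matching the slide's constraints:
# 2+ vCPUs, 1 GiB+ memory, A1 family excluded, Spot only.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "dbping-lottery",  # hypothetical name
            "Version": "$Latest",
        },
        "Overrides": [{
            "InstanceRequirements": {
                "VCpuCount": {"Min": 2},
                "MemoryMiB": {"Min": 1024},
                "ExcludedInstanceTypes": ["a1.*"],
            },
        }],
    },
    "InstancesDistribution": {
        "OnDemandPercentageAboveBaseCapacity": 0,  # Spot Instances only
        "SpotAllocationStrategy": "price-capacity-optimized",
    },
}
```

This dict would be passed as the `MixedInstancesPolicy` argument when creating the Auto Scaling group.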

Slide 14

3.2 Trends by AZs
● Cross-AZ communication tends to have larger latency and more spikes.
● Any instance may experience latency spikes, regardless of AZ.

Slide 15

3.3 Trends by AZs #2
● RTT trends:
  ○ Communication cost: Same-AZ < 376 microseconds; Cross-AZ < 0.96 ms
  ○ Spike size: Same-AZ < 567 microseconds; Cross-AZ < 3.2 ms
  ○ Spikes per instance: P99 = 2.0, MAX = 4.0
● No consistent latency was observed that would affect most applications.
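Per-instance spike statistics like the P99/MAX figures above could be derived along these lines. The spike threshold and the nearest-rank percentile convention are assumptions, since the talk does not state its exact definitions:

```python
import math

def count_spikes(rtt_ms, baseline_ms, factor=2.0):
    """Count samples exceeding factor x the instance's baseline RTT
    (a hypothetical spike definition)."""
    return sum(1 for r in rtt_ms if r > factor * baseline_ms)

def p99(values):
    """P99 via the nearest-rank method (one common convention)."""
    s = sorted(values)
    return s[math.ceil(0.99 * len(s)) - 1]
```

With per-instance spike counts in hand, `p99(counts)` gives the "Spikes per instance, P99" figure and `max(counts)` the MAX.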

Slide 16

3.4 Trends by Instance types
● Regardless of instance type…
  ○ The occurrence rate of RTT spikes appears almost the same.
  ○ Spikes occur approximately once per 60 measurements.

Slide 17

3.5 Trends by Instance types #2
● t4g.micro instances seem to show larger differences in spike size among instances of the same type.
● Further investigation is required before drawing a conclusion, to rule out the possibility that the characteristics of the population were not well captured.

Slide 18

4. Summary

Slide 19

4. Takeaways
● From the measurement results:
  ○ RTT is small and stable enough for most applications.
  ○ There is little justification for proactively replacing instances.
● Be aware of spikes when designing latency-sensitive systems.
● [Tips] Config is costly…
● Recap of RTT trends:
  ○ Communication cost: Same-AZ < 376 microseconds; Cross-AZ < 0.96 ms
  ○ Spike size: Same-AZ < 567 microseconds; Cross-AZ < 3.2 ms
  ○ Spikes per instance: P99 = 2.0, MAX = 4.0

Slide 20

Thank you!