AWS Ambassadors
Japan AWS Top Engineers (Services)
Japan AWS All Certifications Engineers
GameDay enthusiast: 🥇x2 🥈x1 🥉x3
Favorites: GuardDuty, Step Functions
Photo from AWS Blog
proactive instance replacement against noticeable difference in network latency to an RDS endpoint
• First impression:
  ◦ Distance-induced latency will be around 500 microseconds and is unlikely to impact the performance of most applications.
  ◦ Possibility of other factors in AWS infrastructure causing latency
c.f. https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html
Distance between two DCs < 100 km?
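The slide does not show how the 500-microsecond figure was obtained. One plausible back-of-the-envelope route, sketched below, is the one-way propagation delay over 100 km of optical fiber. The refractive index of 1.5 and the function name `one_way_delay_us` are assumptions for illustration, not values from the talk.

```python
# Rough estimate of distance-induced latency over optical fiber.
# Assumption: light in fiber travels at about c / 1.5 (refractive index ~1.5).
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.5  # assumed typical value, not from the slides


def one_way_delay_us(distance_km: float) -> float:
    """One-way propagation delay in microseconds over a straight fiber run."""
    speed_km_per_s = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX
    return distance_km / speed_km_per_s * 1_000_000


# 100 km is the upper bound on AZ separation quoted on the slide.
print(f"{one_way_delay_us(100):.0f} us one-way over 100 km")
```

Under these assumptions, 100 km comes out at roughly 500 microseconds one way; real fiber paths are rarely straight, so this is only an order-of-magnitude check.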
Instance "hit or miss"
• Approach
  ◦ Launch as many EC2 instances as possible
  ◦ Measure latency and look for instances whose latency differs significantly from the others
3) to connect & disconnect with RDS and record TAT
  ◦ Collect tcpdump logs and calculate TAT & RTT (in the three-way handshake sequence) at the TCP level
2.1 Measurement method
[Figure: TCP sequence diagram marking RTT and TAT across Connect (SYN), SYN+ACK, Connected (ACK), Disconnect (FIN+ACK), Disconnected (ACK)]
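As a rough illustration of the TCP-level calculation, the sketch below derives RTT and TAT from a list of timestamped packets for one connect/disconnect cycle. The event format, the function name `rtt_and_tat`, the sample timestamps, and the exact start and end points of each interval are assumptions; the slides only state that RTT is taken from the three-way handshake.

```python
from typing import List, Tuple

# Each event: (timestamp in seconds, TCP flags as seen in the capture).
# This format is hypothetical; real tcpdump output would need parsing first.
Event = Tuple[float, str]


def rtt_and_tat(events: List[Event]) -> Tuple[float, float]:
    """Return (RTT, TAT) in seconds for one connect/disconnect cycle.

    Assumed definitions (not confirmed by the slides):
      RTT = first SYN -> first SYN+ACK
      TAT = first SYN -> last packet in the cycle (the closing ACK)
    """
    syn = next(t for t, flags in events if flags == "SYN")
    syn_ack = next(t for t, flags in events if flags == "SYN+ACK")
    last = events[-1][0]
    return syn_ack - syn, last - syn


# Hypothetical timestamps for illustration only.
sample = [
    (0.000000, "SYN"),
    (0.000250, "SYN+ACK"),
    (0.000260, "ACK"),
    (0.004000, "FIN+ACK"),
    (0.004300, "ACK"),
]
rtt, tat = rtt_and_tat(sample)
print(f"RTT={rtt * 1e6:.0f} us, TAT={tat * 1e6:.0f} us")
```

Keeping the two metrics separate matters because TAT also includes the time the database spends handling the connection, while RTT reflects only the network path.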
the reliability of the measurement method
  ◦ Check whether the metrics are suitable to identify the “hit-or-miss” of instances
  ◦ Determine the configuration of large-scale measurements (# of trials/instance, instance types, …)
• Configuration:
  ◦ EC2 instance type: m7g.large & t4g.large (2 vCPUs)
  ◦ RDS instance type: db.m7g.large, PostgreSQL (15.8-R1)
  ◦ Place EC2 instances and RDS in the same AZ (usw2-az2)
  ◦ Run DBPing every second for 3 hours for each instance
and outliers will have little impact on app performance.
• The first measurement of each instance should be excluded.
[Figure annotations: Narrow box-range; Small deviation; Largest one appears at 1st trial; Sudden & not consistent]
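A minimal sketch of how the first trial could be dropped before summarizing one instance's measurements as a box plot. The 1.5 x IQR whisker rule, the function name `summarize`, and the sample values are assumptions for illustration; the slides do not state how outliers were identified.

```python
import statistics
from typing import Dict, List


def summarize(samples_us: List[float]) -> Dict[str, object]:
    """Drop the first trial, then report box-plot statistics in microseconds."""
    kept = samples_us[1:]  # the first measurement of each instance is excluded
    q1, median, q3 = statistics.quantiles(kept, n=4)
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr  # conventional whisker rule (assumed here)
    outliers = [x for x in kept if x > upper_fence]
    return {"median": median, "iqr": iqr, "outliers": outliers}


# Hypothetical series: a large first trial and one sudden spike later on.
series = [900, 200, 210, 205, 195, 200, 600, 205, 198, 202]
print(summarize(series))
```

With the first value removed, the large initial measurement no longer inflates the spread, and only the later sudden spike is reported as an outlier.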
  ◦ Little correlation seen between RTT and TAT
    -> TAT is not suitable for deciding whether an instance needs replacement due to network latency
• Instance type difference:
  ◦ m7g.large shows 10% faster TAT than t4g.large.
  ◦ RTT shows no difference, as expected.
    -> Burstable instances are applicable.
[Figure: Little correlation between RTT and app-level TAT; panels: m7g.large, t4g.large]
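The slides do not say which correlation measure was used. As one possibility, the sketch below computes the Pearson correlation coefficient between paired RTT and TAT samples; the function name `pearson` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired samples in microseconds, for illustration only.
rtt_us = [250, 255, 248, 260, 252, 251]
tat_us = [4300, 4100, 4500, 4250, 4400, 4150]
print(f"r = {pearson(rtt_us, tat_us):.2f}")
```

A coefficient near zero would support the slide's reading that app-level TAT says little about network RTT, while values near 1 or -1 would indicate a strong linear relationship.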
group across four AZs in Oregon with spot instances
    - Instance type: any (2+ vCPUs, 1GB+ Mem) except for A1
    - Tested 12,239 instances
  ◦ RDS
    - Instance type: db.m7g.large, PostgreSQL (15.8-R1)
  ◦ DBPing
    - Run once per second, 60 times for each instance (1st measurement to be excluded)
    - Placement: us-west-2a
to have larger differences in spike size among instances of the same type.
• Further investigation is required before drawing a conclusion, to rule out the possibility that the sample did not capture the characteristics of the population well.
small enough and stable for most applications.
  ◦ There is little justification for proactively replacing instances.
• Be aware of spikes when designing latency-sensitive systems
• [Tips] Config is costly…

                      Same-AZ               Cross-AZ
Communication cost    < 376 microseconds    < 0.96 ms
Spike size            < 567 microseconds    < 3.2 ms

Spikes / Instance: P99 2.0, MAX 4.0
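The slides do not define what counts as a spike. The sketch below assumes a spike is any sample exceeding the instance's own median by more than a chosen threshold, then reports the P99 and maximum spike count across instances. The names `count_spikes` and `nearest_rank_percentile`, the 300-microsecond threshold, and the fleet data are all hypothetical.

```python
import math
import statistics
from typing import List


def count_spikes(samples_us: List[float], threshold_us: float) -> int:
    """Count samples exceeding the instance's median by more than threshold_us."""
    baseline = statistics.median(samples_us)
    return sum(1 for x in samples_us if x - baseline > threshold_us)


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-instance latency series in microseconds.
fleet = {
    "i-aaa": [200, 205, 198, 900, 202],
    "i-bbb": [210, 208, 212, 209, 211],
    "i-ccc": [190, 800, 195, 850, 192],
}
counts = [count_spikes(s, threshold_us=300) for s in fleet.values()]
print("spikes per instance:", counts)
print("P99:", nearest_rank_percentile(counts, 99), "MAX:", max(counts))
```

Measuring each instance against its own median keeps a uniformly slower instance from being counted as spiking; a fleet-wide baseline would be an equally defensible choice.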