Karol Nowak - Monitoring clock drift in Amazon EC2 environment

Monitoring clock drift in Amazon EC2 environment Karol Nowak

Why? To answer questions about our system:
• what customer data looked like at a given point of time
  ◦ what data was used to make a decision (reports)
  ◦ what happened first (total order of operations)
• when things happened, correlation (monitoring)

Where?
• Heterogeneous SOA (effectively a distributed DB)
  ◦ ~70 services in production
  ◦ objects with relations spanning across services
  ◦ multiple workers on many machines
  ◦ Ruby (MRI/JRuby), Python; more coming
• Amazon EC2
  ◦ VMs (Xen)
  ◦ 7 instance families, different underlying hardware
  ◦ Ubuntu Linux (12.04.* LTS)

What’s the problem?
How to build an (eventually) consistent system using independent data stores?
How far ahead should we look at the data stream to be sure we’ve seen all changes that happened “now”?
Perhaps timestamped changes can be reconciled using additional information about the logic behind them.
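
One way to frame the "how far ahead" question, as a minimal sketch and not the approach taken in the talk: if every writer's clock is assumed to be within some bound `max_skew` of the reader's clock, then only changes stamped earlier than `now - max_skew` can be treated as complete. The function names, the tuple layout, and the skew bound are hypothetical, and delivery delay is ignored.

```python
from datetime import datetime, timedelta, timezone


def safe_horizon(now: datetime, max_skew: timedelta) -> datetime:
    """Timestamp before which no new change should appear, ASSUMING no
    writer's clock differs from ours by more than max_skew."""
    return now - max_skew


def settled(changes, now: datetime, max_skew: timedelta):
    """Return (timestamp, payload) changes strictly older than the safe
    horizon, ordered by timestamp. Newer changes must wait."""
    horizon = safe_horizon(now, max_skew)
    return sorted((c for c in changes if c[0] < horizon), key=lambda c: c[0])


if __name__ == "__main__":
    now = datetime(2014, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    changes = [
        (now - timedelta(seconds=1), "a"),
        (now - timedelta(milliseconds=50), "b"),  # inside the skew window
        (now - timedelta(seconds=2), "c"),
    ]
    for ts, payload in settled(changes, now, timedelta(milliseconds=100)):
        print(ts.isoformat(), payload)
```

The smaller the skew bound that can be trusted, the shorter the wait before a slice of the stream is considered final, which is why measuring the actual drift matters.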

How? No go: distributed transactions, singular datastores, massive rewrites, etc.
Let’s assume relying on clocks is Good Enough™. But how do we know? Let’s talk NTP.

How well can we do?
“NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions. Asymmetric routes and network congestion can cause errors of 100 ms or more.” -- Wikipedia
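
To illustrate why asymmetric routes hurt, here is a small sketch of the standard NTP on-wire offset and delay calculation from the four exchange timestamps. The simulation helper and the delay values are made up for illustration. The estimate is off by half the difference between the forward and return delays, so a 200 ms asymmetry produces a 100 ms error.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP estimate from one request/response exchange.
    t0: client send, t3: client receive (client clock)
    t1: server receive, t2: server send (server clock)"""
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def simulate_exchange(true_offset, forward_delay, return_delay, t0=0.0):
    """Hypothetical helper: build the four timestamps for a server whose
    clock is true_offset seconds ahead of the client (zero processing time)."""
    t1 = t0 + forward_delay + true_offset
    t2 = t1
    t3 = (t2 - true_offset) + return_delay
    return t0, t1, t2, t3


if __name__ == "__main__":
    true_offset = 0.010  # server is 10 ms ahead (made-up value)
    for fwd, ret in [(0.020, 0.020), (0.220, 0.020)]:
        est, delay = ntp_offset_and_delay(*simulate_exchange(true_offset, fwd, ret))
        error_ms = (est - true_offset) * 1000
        print(f"fwd={fwd:.3f}s ret={ret:.3f}s error={error_ms:.1f} ms")
```

With symmetric delays the estimate matches the true offset; the client cannot tell the two cases apart from the timestamps alone, because only the round-trip delay is observable.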

What the Internet says...
“My clock was 18.5 seconds off on an EC2 micro instance with 5 days uptime.”
“After shutting down and starting again, it was back to normal, so there seems to be some kind of drift.”
“Clocks on virtual servers are especially prone to a whole class of these problems. 12 seconds a day is pretty bad until you come across virtual boxes with clocks that run at 180–200% speed!”

Our Data
Based on a sample of 15 monitored machines vs. a single monitoring “timekeeper”.
Out-of-the-box NTP configuration using {0,1,2}.amazon.pool.ntp.org yields sets of 3 different time servers for each NTP daemon.
Usually one stratum 1 source in the mix (GPS, CDMA, ...).

70 milliseconds / hour ≈ 50 seconds / month!
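
The unit conversion can be checked directly. This sketch assumes a constant drift rate with no NTP correction and a 30-day month; the function name is hypothetical. Under those assumptions 70 ms per hour accumulates to roughly 50 seconds per month.

```python
def accumulated_drift_seconds(drift_ms_per_hour: float, days: float) -> float:
    """Total drift in seconds after `days` days at a constant rate,
    assuming nothing corrects the clock in the meantime."""
    hours = days * 24
    return drift_ms_per_hour * hours / 1000.0


if __name__ == "__main__":
    per_day = accumulated_drift_seconds(70, 1)
    per_month = accumulated_drift_seconds(70, 30)
    print(f"{per_day:.2f} s/day, {per_month:.1f} s per 30-day month")
```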

Why 30ms?

Thank you! Questions? Let’s talk afterwards!
