

Base Lab
August 06, 2014

Karol Nowak - Monitoring clock drift in Amazon EC2 environment


Transcript

  1. Why?

    To answer questions about our system:
    • what customer data looked like at a given point in time
      ◦ what data was used to make a decision (reports)
      ◦ what happened first (total order of operations)
    • when things happened, correlation (monitoring)
  2. Where?

    • Heterogeneous SOA (effectively a distributed DB)
      ◦ ~70 services in production
      ◦ objects with relations spanning across services
      ◦ multiple workers on many machines
      ◦ Ruby (MRI/JRuby), Python; more coming
    • Amazon EC2
      ◦ VMs (Xen)
      ◦ 7 instance families, different underlying hardware
      ◦ Ubuntu Linux (12.04.* LTS)
  3. What’s the problem?

    How to build an (eventually) consistent system using independent data stores?
    How far ahead should we look at the data stream to be sure we’ve seen all changes that happened “now”?
    Perhaps timestamped changes can be reconciled using additional information about the logic behind them.
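The "how far ahead" question can be made concrete with a small hold-back buffer. The following is only a minimal sketch, not the approach used in the deck: it assumes that no change ever arrives with a timestamp more than `max_skew` seconds older than the newest timestamp already seen. `Change`, `release_in_order`, and `max_skew` are hypothetical names introduced for illustration.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Change(NamedTuple):
    timestamp: float  # seconds, as stamped by the originating service's clock
    service: str
    payload: str


def release_in_order(arrivals: Iterable[Change], max_skew: float) -> Iterator[Change]:
    """Yield changes in timestamp order, holding each one back until the
    newest timestamp seen so far is at least `max_skew` seconds ahead of it."""
    pending = []  # min-heap ordered by timestamp
    newest = float("-inf")
    for change in arrivals:
        heapq.heappush(pending, change)
        newest = max(newest, change.timestamp)
        # Assumption: nothing older than the watermark can still arrive.
        watermark = newest - max_skew
        while pending and pending[0].timestamp <= watermark:
            yield heapq.heappop(pending)
    # End of stream: flush whatever is still buffered.
    while pending:
        yield heapq.heappop(pending)


arrivals = [
    Change(10.0, "a", "x"),
    Change(9.8, "b", "y"),   # arrives late, but within the skew bound
    Change(10.5, "a", "z"),
    Change(11.0, "b", "w"),
]
ordered = [c.timestamp for c in release_in_order(arrivals, max_skew=0.5)]
print(ordered)
```

The whole scheme stands or falls with the skew bound: if clocks can be further apart than `max_skew`, a late change is released out of order, which is why the bound has to be measured rather than guessed.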
  4. How?

    No go: distributed transactions, singular datastores, massive rewrites, etc.
    Let’s assume relying on clocks is Good Enough™. But how do we know? Let’s talk NTP.
  5. How well can we do?

    “NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions. Asymmetric routes and network congestion can cause errors of 100 ms or more.” -- Wikipedia
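To see why asymmetric routes matter, it helps to look at the standard NTP offset calculation, which uses four timestamps per exchange and assumes the outbound and return delays are equal. The sketch below simulates a single exchange; `simulate_exchange` and its parameters are hypothetical helpers for illustration, not part of any NTP library.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire calculation.

    t0: client send time   (client clock)
    t1: server receive time (server clock)
    t2: server send time    (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def simulate_exchange(true_offset, forward_delay, return_delay, t0=0.0):
    """Simulate one exchange; true_offset = server clock minus client clock.
    Server processing time is assumed to be zero for simplicity."""
    t1 = t0 + forward_delay + true_offset   # arrival, read on the server clock
    t2 = t1                                 # reply sent immediately
    t3 = t0 + forward_delay + return_delay  # arrival, read on the client clock
    return ntp_offset_and_delay(t0, t1, t2, t3)


# Symmetric path: the 5 ms true offset is recovered.
symmetric_offset, symmetric_delay = simulate_exchange(0.005, 0.020, 0.020)

# Asymmetric path, clocks perfectly in sync: the estimate is wrong by
# half the difference between the two one-way delays.
asymmetric_offset, asymmetric_delay = simulate_exchange(0.0, 0.010, 0.210)

print(f"symmetric:  offset={symmetric_offset * 1000:.1f} ms")
print(f"asymmetric: offset={asymmetric_offset * 1000:.1f} ms")
```

In the asymmetric case the clocks agree exactly, yet the computed offset is off by 100 ms, because a 200 ms difference between the two directions is split evenly by the formula.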
  6. What the Internet says...

    “My clock was 18.5 seconds off on an EC2 micro instance with 5 days uptime.”
    “After shutting down and starting again, it was back to normal, so there seems to be some kind of drift.”
    “Clocks on virtual servers are especially prone to a whole class of these problems. 12 seconds a day is pretty bad until you come across virtual boxes with clocks that run at 180–200% speed!”
  7. What the Internet says...

    “My clock was 18.5 seconds off on an EC2 micro instance with 5 days uptime.”
    “After shutting down and starting again, it was back to normal, so there seems to be some kind of drift.”
    “Clocks on virtual servers are especially prone to a whole class of these problems. 12 seconds a day is pretty bad until you come across virtual boxes with clocks that run at 180–200% speed!”
    (ಠ_ಠ)
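The quoted figures are easier to compare once converted to a drift rate in parts per million (ppm). This is a back-of-the-envelope sketch that assumes the offset accumulated linearly from zero over the stated period, which the quotes themselves do not confirm; `drift_ppm` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def drift_ppm(offset_seconds, elapsed_seconds):
    """Average drift rate in parts per million, assuming linear accumulation."""
    return offset_seconds / elapsed_seconds * 1e6


# "18.5 seconds off ... with 5 days uptime"
micro_instance_ppm = drift_ppm(18.5, 5 * SECONDS_PER_DAY)

# "12 seconds a day"
twelve_per_day_ppm = drift_ppm(12.0, SECONDS_PER_DAY)

# A clock running at 180% speed gains 0.8 s for every real second.
fast_clock_ppm = drift_ppm(0.8, 1.0)

print(f"micro instance: {micro_instance_ppm:.1f} ppm")
print(f"12 s/day:       {twelve_per_day_ppm:.1f} ppm")
print(f"180% speed:     {fast_clock_ppm:.0f} ppm")
```

Under that assumption the micro instance drifted at roughly 43 ppm and the "12 seconds a day" case at roughly 139 ppm, while a clock at 180% speed is off by 800,000 ppm.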
  8. Our Data

    Based on a sample of 15 monitored machines vs. a single monitoring “timekeeper”.
    Out-of-the-box NTP configuration using {0,1,2}.amazon.pool.ntp.org yields sets of 3 different time servers for each NTP daemon.
    Usually one stratum 1 source in the mix (GPS, CDMA, ...).