August 18, 2016

# Vivaldi: Decentralized Network Coordinates

Large-scale distributed systems can use round-trip time estimates between peers to make intelligent decisions about request routing, data replication, and failure handling. Vivaldi is a distributed algorithm for efficiently computing network coordinates for a large set of peers. In this talk, we motivate the need for network coordinates and introduce the Vivaldi algorithm. We briefly survey interesting extensions and related work, both to understand how to use Vivaldi in the wild and to understand the sources of error in its modeling. Lastly, we talk about how Vivaldi is used in the Serf and Consul tools to solve user problems.


## Transcript

1. HASHICORP
Vivaldi
A Decentralized Network Coordinate System

2. HASHICORP
@armon

3. HASHICORP

4. HASHICORP

5. HASHICORP
Network Coordinates

6. HASHICORP
Euclidean Coordinates
p1 = {x: 1, y: 2, z: 3}
p2 = {x: 4, y: 5, z: 6}
dist(p1, p2) = sqrt((p2.x-p1.x)^2
+ (p2.y-p1.y)^2
+ (p2.z-p1.z)^2)
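
The distance formula on this slide can be sketched in a few lines of Python. The helper name `euclidean_dist` is an illustrative choice, and points are represented as plain tuples rather than the `{x, y, z}` records shown on the slide.

```python
import math

def euclidean_dist(p1, p2):
    """Straight-line distance between two points given as equal-length tuples."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))

p1 = (1.0, 2.0, 3.0)
p2 = (4.0, 5.0, 6.0)
print(euclidean_dist(p1, p2))  # sqrt(27), roughly 5.196
```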

7. HASHICORP
Euclidean Space
Euclidean Distance defined in Euclidean Space
Cartesian Coordinates {x, y, z} are Euclidean

8. HASHICORP
Network Space
p1 = ipv4(1.2.3.4)
p2 = ipv4(5.6.7.8)
dist(p1, p2) = ?

9. HASHICORP
Network Space
p1 = ipv4(1.2.3.4)
p2 = ipv4(5.6.7.8)
dist(p1, p2) = rtt(p1, p2)

10. HASHICORP
Network Distance?
Peer
Peer
Seed
Seed
Peer
Seed
P2P Application

11. HASHICORP
Network Distance?
Nearest Neighbor Routing
Web Server
API Server
API Server
API Server

12. HASHICORP
Network Distance?
Datacenter Failover

13. HASHICORP
Network Space
p1 = ipv4(1.2.3.4)
p2 = ipv4(5.6.7.8)
dist(p1, p2) = rtt(p1, p2)
ping?

14. HASHICORP
Ping Problem
Suppose you have 20K+ peers (BitTorrent)
Pair-wise distance from {PeerN, PeerM} requires N^2 Probes
Samples = 3, Probes = 1.2B, Storage = 9.6GB (double)
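
The figures on this slide follow from simple arithmetic, reproduced here as a small Python sketch. The variable names are illustrative, and the count uses N^2 pairs as the slide does (so it includes both directions of each pair).

```python
peers = 20_000
samples_per_pair = 3
bytes_per_sample = 8  # one 64-bit double per RTT sample

pairs = peers ** 2                          # N^2 pairs, as counted on the slide
probes = pairs * samples_per_pair           # 1,200,000,000 probes
storage_bytes = probes * bytes_per_sample   # 9,600,000,000 bytes

print(f"{probes / 1e9:.1f}B probes, {storage_bytes / 1e9:.1f} GB")
```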

15. HASHICORP
Ping Representation
Ping creates a matrix of pairwise latency
dist(p1, p2) = rtt(p1, p2)
rtt(p1, p2) = pairwise[p1][p2]

16. HASHICORP
Cartesian Representation
Cartesian Coordinates allow us to exploit Pythagorean Theorem
a^2 + b^2 = c^2

17. HASHICORP
Vivaldi
Decentralized Network Coordinates
Frank Dabek, Russ Cox, Frans Kaashoek, Robert Morris

18. HASHICORP
Vivaldi
Pairwise connect peers with a spring
Spring’s natural length is the RTT
Compress down all peers to the origin and then relax

19. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer
Peer

20. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer

21. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer

22. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer

23. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer

24. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer
Peer

25. HASHICORP
Vivaldi
Coordinates provide predictive model
Communication between nodes updates the model
Coordinates converge over time

26. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer
const sensitivity = 0.25
var local = {x: 0, y: 0, z: 0}
var remote = {x: 0, y: 0, z: 0}
def update(rtt=500msec, remote):
    estimate = euclidean_dist(local,remote)
    err = rtt - estimate
    direction_of_err = unitVector(local - remote)
    scaled_direction = direction_of_err * err
    local = local + scaled_direction * sensitivity

27. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer
const sensitivity = 0.25
var local = {x: 0, y: 0, z: 0}
var remote = {x: 0, y: 0, z: 0}
def update(rtt=500msec, remote):
    estimate = 0msec
    err = rtt - estimate
    direction_of_err = unitVector(local - remote)
    scaled_direction = direction_of_err * err
    local = local + scaled_direction * sensitivity

28. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer
const sensitivity = 0.25
var local = {x: 0, y: 0, z: 0}
var remote = {x: 0, y: 0, z: 0}
def update(rtt=500msec, remote):
    estimate = 0msec
    err = 500msec
    direction_of_err = unitVector(local - remote)
    scaled_direction = direction_of_err * err
    local = local + scaled_direction * sensitivity

29. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer
const sensitivity = 0.25
var local = {x: 0, y: 0, z: 0}
var remote = {x: 0, y: 0, z: 0}
def update(rtt=500msec, remote):
    estimate = 0msec
    err = 500msec
    direction_of_err = {x: -0.1, y: 0.6, z: 0.8}
    scaled_direction = direction_of_err * err
    local = local + scaled_direction * sensitivity

30. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer
const sensitivity = 0.25
var local = {x: 0, y: 0, z: 0}
var remote = {x: 0, y: 0, z: 0}
def update(rtt=500msec, remote):
    estimate = 0msec
    err = 500msec
    direction_of_err = {x: -0.1, y: 0.6, z: 0.8}
    scaled_direction = {x: -50, y: 300, z: 400}
    local = local + scaled_direction * sensitivity

31. HASHICORP
Vivaldi
Peer
Peer
Peer
Peer Peer
const sensitivity = 0.25
var local = {x: -12.5, y: 75, z: 100}
var remote = {x: 0, y: 0, z: 0}
def update(rtt=500msec, remote):
    estimate = 0msec
    err = 500msec
    direction_of_err = {x: -0.1, y: 0.6, z: 0.8}
    scaled_direction = {x: -50, y: 300, z: 400}
    local = {x: -12.5, y: 75, z: 100}
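
The walkthrough on the preceding slides can be reproduced with a short runnable sketch. This is an illustration under stated assumptions, not the talk's or Serf's code: coordinates are tuples, the optional `direction` argument exists only so the slide's example direction can be reused, and the random-direction fallback for coincident nodes is an assumption about how the undefined unit vector is resolved.

```python
import math
import random

SENSITIVITY = 0.25  # constant step size from the slides

def euclidean_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit_vector(v):
    """Scale v to length 1; pick a random direction if v has zero length."""
    mag = math.sqrt(sum(x * x for x in v))
    while mag == 0.0:
        # Two nodes at the same point have no defined direction, so choose one at random.
        v = tuple(random.uniform(-1.0, 1.0) for _ in v)
        mag = math.sqrt(sum(x * x for x in v))
    return tuple(x / mag for x in v)

def update(local, remote, rtt, direction=None):
    """Return the new local coordinate after observing rtt (msec) to remote."""
    estimate = euclidean_dist(local, remote)
    err = rtt - estimate  # positive: the model places us too close, so push apart
    if direction is None:
        direction = unit_vector(tuple(l - r for l, r in zip(local, remote)))
    return tuple(l + d * err * SENSITIVITY for l, d in zip(local, direction))

origin = (0.0, 0.0, 0.0)
# Reuse the direction shown on the slides so the numbers line up.
moved = update(origin, origin, rtt=500.0, direction=(-0.1, 0.6, 0.8))
print(moved)  # approximately (-12.5, 75.0, 100.0)
```

Repeating `update` with fresh RTT samples moves the estimate toward the measured value a quarter of the remaining error at a time.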

32. HASHICORP
Vivaldi
const sensitivity changes how rapidly we adjust
Large value = fast to update, but unstable
Small value = slow to converge, but stable
Dynamic value?

33. HASHICORP
Vivaldi
var local_err = 1000msec
def update(rtt, remote, remote_err):

    balance_err = local_err / (local_err + remote_err)
    rel_err = (estimate - rtt) / rtt
    local_err = rel_err * error_sensitivity_adj * balance_err
    local = local + scaled_direction * sensitivity

34. HASHICORP
Vivaldi
var local_err = 1000msec
def update(rtt, remote, remote_err):

    balance_err = local_err / (local_err + remote_err)
    rel_err = (estimate - rtt) / rtt
    local_err = rel_err * error_sensitivity_adj * balance_err
    local = local + scaled_direction * sensitivity
High Remote Error =>
Low Sensitivity

35. HASHICORP
Vivaldi
var local_err = 1000msec
def update(rtt, remote, remote_err):

    balance_err = local_err / (local_err + remote_err)
    rel_err = (estimate - rtt) / rtt
    local_err = rel_err * error_sensitivity_adj * balance_err
    local = local + scaled_direction * sensitivity
High Local Error =>
High Sensitivity
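
A runnable sketch of the error-weighted update may help here. It follows the structure of the adaptive timestep in the Vivaldi paper as best understood: the relative error is taken as an absolute value, the local error is a weighted moving average, and the position sensitivity scales with `balance_err`. The constants `CE` and `CC`, their values, and the unitless error estimates are assumptions made for illustration, and the slides abbreviate some of these steps.

```python
import math
import random

CE = 0.25  # error_sensitivity_adj on the slides; value assumed
CC = 0.25  # upper bound on position sensitivity; value assumed

def euclidean_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit_vector(v):
    mag = math.sqrt(sum(x * x for x in v))
    while mag == 0.0:
        # Coincident nodes: pick a random direction (an assumption of this sketch).
        v = tuple(random.uniform(-1.0, 1.0) for _ in v)
        mag = math.sqrt(sum(x * x for x in v))
    return tuple(x / mag for x in v)

def update(local, local_err, remote, remote_err, rtt):
    """Return (new_local, new_local_err) after one RTT observation."""
    estimate = euclidean_dist(local, remote)
    # Share of the blame assigned to us: near 1 when we are the less certain node.
    balance_err = local_err / (local_err + remote_err)
    rel_err = abs(estimate - rtt) / rtt
    # Weighted moving average of the relative error.
    new_err = rel_err * CE * balance_err + local_err * (1.0 - CE * balance_err)
    sensitivity = CC * balance_err
    direction = unit_vector(tuple(l - r for l, r in zip(local, remote)))
    step = (rtt - estimate) * sensitivity
    return tuple(l + d * step for l, d in zip(local, direction)), new_err

remote = (100.0, 0.0, 0.0)
start = (0.0, 0.0, 0.0)
confident, _ = update(start, 0.1, remote, 1.0, rtt=200.0)           # high remote error
uncertain, uncertain_err = update(start, 1.0, remote, 0.1, rtt=200.0)  # high local error
print(confident[0], uncertain[0])  # the uncertain node moves about ten times further
```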

36. HASHICORP
Vivaldi
Each node tracks position and error estimate
Coordinate converges over time
Local error goes down as estimates become accurate
Several tuning parameters, including dimensionality

37. HASHICORP
Dimensionality
Coordinates can be in any Euclidean Space
2D, 3D, or N Dimensions?
Principal Component Analysis (PCA) to reduce dimensions

38. HASHICORP
Dimensionality Reduction
| Time of Day | Brightness | Angle of Sun |
| --- | --- | --- |
| 12PM | Very Bright | 90 degrees |
| 3PM | Very Bright | 80 degrees |
| 9PM | Very Dark | 0 degrees |
| 12AM | Very Dark | 0 degrees |

39. HASHICORP
Dimensionality Reduction
| Time of Day | Brightness | Angle of Sun |
| --- | --- | --- |
| 12PM | Very Bright | 90 degrees |
| 3PM | Very Bright | 80 degrees |
| 9PM | Very Dark | 0 degrees |
| 12AM | Very Dark | 0 degrees |

40. HASHICORP
Dimensionality
Performance dramatically reduced below 2D
Marginal improvement past 5D
Depends on the complexity of the underlying topology

41. HASHICORP
Height / Fixed Costs
Application
Userspace Runtime
Operating System
Hypervisor
Network Card
Fixed Cost
0.5 msec

42. HASHICORP
Coordinate + Height
Allows coordinates to model non-fixed latency
Improves the predictive power of the coordinates
Reduces the dimensionality required
RTT = dist(p1, p2) + p1.Height + p2.Height
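
The height formula above can be sketched as follows. The `Coordinate` dataclass and its field names are hypothetical, chosen only to make the example self-contained; all values are in milliseconds.

```python
import math
from dataclasses import dataclass

@dataclass
class Coordinate:
    vec: tuple     # position in the Euclidean part of the space (msec)
    height: float  # fixed per-node cost paid on every round trip (msec)

def estimated_rtt(p1, p2):
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1.vec, p2.vec)))
    return dist + p1.height + p2.height

n1 = Coordinate(vec=(0.0, 0.0), height=0.5)
n2 = Coordinate(vec=(3.0, 4.0), height=0.5)
print(estimated_rtt(n1, n2))  # 5.0 + 0.5 + 0.5 = 6.0
```

Because the heights are added regardless of direction, two nodes at the same Euclidean position still have a non-zero estimated RTT.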

43. HASHICORP
Extensions to Vivaldi

44. HASHICORP
Network Coordinates in the Wild
Azureus BitTorrent Client (10K+ clients)
Dimensionality Analysis in the Wild
Latency and Update Filters
Churn, Drift, Intrinsic Error, Latency Variation
Ledlie, Gardner, and Seltzer

45. HASHICORP
Drift
Peer
Peer
Peer
Peer
Peer

46. HASHICORP
Drift
Peer
Peer
Peer
Peer
Peer

47. HASHICORP
Gravity
Applying small “gravity” toward origin
Prevents runaway coordinates
Cluster can still “rotate” about the origin
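
One way to picture the gravity term is the sketch below. It is an assumption-laden illustration rather than the formula used by any particular implementation: the pull grows with the square of the distance from the origin, and `rho` is a hypothetical tuning constant that controls how weak the pull is.

```python
import math

def apply_gravity(coord, rho=150.0):
    """Pull coord slightly toward the origin; the pull grows with distance."""
    dist = math.sqrt(sum(x * x for x in coord))
    if dist == 0.0:
        return coord  # already at the origin, nothing to do
    pull = min((dist / rho) ** 2, dist)  # never overshoot the origin
    scale = (dist - pull) / dist
    return tuple(x * scale for x in coord)

print(apply_gravity((3.0, 4.0), rho=10.0))  # approximately (2.85, 3.8)
```

The pull depends only on distance from the origin, not on direction, which is why it limits drift but does nothing to stop the whole cluster rotating.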

48. HASHICORP
On Suitability of Euclidean Embedding for
Host-based Network Coordinate Systems
Lee, Zhang, Sahu, Saha
Analysis of Triangle Inequality Violations (TIV) - Intrinsic Error
Understanding source of TIV

49. HASHICORP
Triangle Inequality Violation
Server 1
Server 2
Server 3
Core Router
Top of Rack Switch Top of Rack Switch
c < a + b
Server 1 -> Server 2 : 0.1 msec
Server 2 -> Server 3 : 0.3 msec
Server 1 -> Server 3 : 0.3 msec
Packet Processing Time > Transit Time
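
A triangle inequality violation can be detected directly from three measured RTTs. The sketch below uses hypothetical latencies, not the ones on this slide; in any Euclidean embedding the longest side can never exceed the sum of the other two, so a measured triple that does cannot be represented without error.

```python
def violates_triangle_inequality(a, b, c):
    """True if the longest of three pairwise RTTs exceeds the sum of the other two."""
    longest = max(a, b, c)
    return longest > (a + b + c) - longest

print(violates_triangle_inequality(10.0, 10.0, 50.0))  # True
print(violates_triangle_inequality(3.0, 4.0, 5.0))     # False
```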

50. HASHICORP
Track the estimation error from measurement
Adjustment is the average over a sample window
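
One reading of this slide is a per-node correction term that averages recent prediction errors. The sketch below is a guess at that idea, not a description of a specific implementation: the class name, the window size, and the sign convention (measured minus predicted) are all assumptions.

```python
from collections import deque

class AdjustmentWindow:
    """Keeps the last `size` prediction errors and exposes their average."""

    def __init__(self, size=20):
        self.errors = deque(maxlen=size)  # old samples fall off automatically

    def record(self, rtt, raw_estimate):
        # Positive error: the coordinates under-predicted the measured RTT.
        self.errors.append(rtt - raw_estimate)

    def value(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

window = AdjustmentWindow(size=3)
for rtt in (110.0, 120.0, 130.0, 140.0):
    window.record(rtt, raw_estimate=100.0)
print(window.value())  # average of the last three errors: 30.0
```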

51. HASHICORP
Serf Implementation

52. HASHICORP
Serf
Serf is a decentralized solution for cluster
membership, failure detection, and
orchestration.
Built on gossip protocol (SWIM)
Runs at 10K+ node scale
https://serf.io

53. HASHICORP
Serf
Assign a coordinate to each node?
Applications can leverage for intelligent routing,
peer selection, etc
Gossip is doing background communication

54. HASHICORP
Failure Detection
Peer Peer
Ping

55. HASHICORP
Failure Detection
Peer Peer
Ack

56. HASHICORP
Serf
Attach Coordinate to Ack messages
RTT computed from the send time of Ping
Coordinates of peers cached
Random peers avoid selection bias

57. HASHICORP
Serf
Implementation uses 8D + Height
3 Sample Latency Filter
Small Gravity
Coordinate Snapshotting
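
A latency filter over three samples can be sketched as a per-peer moving median, which discards a single outlier before it reaches the coordinate update. The class below is hypothetical, and using `statistics.median` (which averages the two middle values when the window holds an even number of samples) is an assumption of this sketch.

```python
import statistics
from collections import defaultdict, deque

class LatencyFilter:
    """Per-peer moving median over the most recent RTT samples."""

    def __init__(self, size=3):
        self.samples = defaultdict(lambda: deque(maxlen=size))

    def add(self, peer, rtt):
        window = self.samples[peer]
        window.append(rtt)
        return statistics.median(window)

f = LatencyFilter(size=3)
f.add("n2", 10.0)
f.add("n2", 500.0)         # a one-off spike
print(f.add("n2", 12.0))   # 12.0: the spike is filtered out
```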

58. HASHICORP
\$ serf rtt n1 n2
Estimated n1 <-> n2 rtt: 0.610 ms
\$ serf rtt n2 # Running from n1
Estimated n1 <-> n2 rtt: 0.610 ms

59. HASHICORP
Consul Usage

60. HASHICORP
Consul
Consul is a solution for service discovery,
monitoring, configuration and
orchestration.
Built on Serf + Raft (Paxos)
Runs at 50K+ node scale
https://consul.io

61. HASHICORP
Consul
Coordinates are periodically pushed to central servers
Servers expose the coordinates over APIs
Nearest neighbor routing, datacenter failover, etc.
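
Given coordinates for every node, nearest neighbor routing reduces to a sort by estimated RTT. The sketch below uses made-up node names and two-dimensional coordinates purely for illustration; it does not call any Consul API.

```python
import math

def estimated_rtt(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_first(source, candidates, coords):
    """Order candidate nodes by estimated RTT from the source node."""
    return sorted(candidates, key=lambda name: estimated_rtt(coords[source], coords[name]))

# Hypothetical coordinates (msec) for one web server and three API servers.
coords = {
    "web": (0.0, 0.0),
    "api-1": (5.0, 5.0),
    "api-2": (1.0, 0.0),
    "api-3": (0.0, 3.0),
}
print(nearest_first("web", ["api-1", "api-2", "api-3"], coords))
# ['api-2', 'api-3', 'api-1']
```

No probes are sent at query time: the ordering comes entirely from coordinates that were already gathered in the background.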

62. Terminal
HASHICORP
\$ consul rtt node-10-0-1-8
Estimated node-10-0-1-8 <-> node-10-0-1-6 rtt: 0.781 ms (using LAN coordinates)
\$ sleep 30
\$ consul rtt node-10-0-1-8
Estimated node-10-0-1-8 <-> node-10-0-1-6 rtt: 0.719 ms (using LAN coordinates)

63. Terminal
HASHICORP
\$ curl localhost:8500/v1/catalog/nodes?near=node-78r16zb3q | jq '.[].Node'
"node-78r16zb3q"
"node-10-0-4-190"
"node-10-0-1-7"
"node-10-0-4-240"
\$ curl localhost:8500/v1/catalog/service/vault?near=node-78r16zb3q | jq '.[].Node'
"node-10-0-1-71"
"node-10-0-3-119"
"node-10-0-3-249"

64. HASHICORP
Conclusion
Vivaldi provides a decentralized algorithm for coordinates
Networks not Euclidean, leads to TIV
Interesting uses in distributed systems
Serf and Consul expose via APIs

65. HASHICORP
Thanks!
Q/A