The art of service discovery at scale at Strangeloop 2015

Nitesh Kant
September 25, 2015

Whether it is a simple DNS lookup or a complex dedicated solution, service discovery is the backbone of any microservices architecture, and an immature solution can soon turn into an Achilles' heel.

In this talk, Nitesh Kant introduces the concept of service discovery and the use cases it solves in a complex service-based architecture. He then introduces Netflix's Eureka (https://github.com/Netflix/eureka), a highly available, multi-datacenter-aware service discovery solution built from scratch, and covers its architecture and how it is unique in this space by favoring Availability over Consistency in the wake of network partitions.

Presented at Strangeloop 2015: http://www.thestrangeloop.com/2015/the-art-of-service-discovery-at-scale.html.

Video: https://www.youtube.com/watch?v=27ynM2tbNXM

Transcript

  1. The art of service discovery at scale Nitesh Kant, Software

    Engineer, Netflix Edge Engineering. @NiteshKant
  2. None
  3. None
  4. None
  5. Nitesh Kant (@NiteshKant): Who Am I?

    ❖ Engineer, Edge Engineering, Netflix.
    ❖ Core contributor, RxNetty*
    ❖ Contributor, Zuul**
    ❖ Ran Netflix’s Service Discovery for a year.
    ❖ Conceptualized and designed Eureka v2***
    * https://github.com/ReactiveX/RxNetty
    ** https://github.com/Netflix/zuul
    *** https://github.com/Netflix/eureka
  6. What is Service Discovery?

  7. Delivering Netflix What goes behind the scenes?

  8. Pipe Dream Always available, infinitely scalable machine.

  9. Reality Thousands of machines with hundreds of microservices

  10. Microservices A typical relationship between different microservices.

  11. None
  12. None
  13. User Service Metadata service Recommendations

  14. Service Discovery!

  15. None
  16. Instance variations We auto-scale up/down all the time.

  17. VMs crash

  18. User Service Metadata Service Metadata Service User Service

  19. It’s an ever-changing ecosystem; static mappings don’t work.

  20. Service discovery is a problem of cloud environments.

  21. X

  22. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2, 10.10.1.3, 10.10.1.4
    Metadata Service | 10.10.2.1
    Recommendations Service | 10.10.3.1, 10.10.3.2, 10.10.3.3
  23. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2
    Metadata Service | 10.10.2.1, 10.10.2.2
    Recommendations Service | 10.10.3.1, 10.10.3.2
  24. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2
    Metadata Service | 10.10.2.1, 10.10.2.2
    Recommendations | 10.10.3.1, 10.10.3.2
    GET/PUT/DELETE GET/PUT/DELETE GET/PUT/DELETE
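The table with GET/PUT/DELETE operations on the slide above can be read as a mutable mapping from service name to node addresses. The following is a minimal sketch of that idea, not Eureka's API; the `Registry` class, its method names, and the service keys are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Registry:
    """Hypothetical in-memory registry: service name -> set of node addresses."""

    def __init__(self):
        self._nodes = defaultdict(set)

    def put(self, service, address):
        # Register a node for a service (idempotent).
        self._nodes[service].add(address)

    def delete(self, service, address):
        # Remove a node; removing an unknown node is a no-op.
        if service in self._nodes:
            self._nodes[service].discard(address)

    def get(self, service):
        # Return a sorted snapshot of the nodes known for a service.
        return sorted(self._nodes.get(service, ()))


registry = Registry()
registry.put("metadata-service", "10.10.2.1")
registry.put("metadata-service", "10.10.2.2")
registry.delete("metadata-service", "10.10.2.1")
print(registry.get("metadata-service"))  # ['10.10.2.2']
```

A static table like this is exactly what the earlier slides warn about if nobody keeps it current, which is why the following slides have instances register and unregister themselves.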
  25. How is service discovery used?

  26. Delivering Netflix Edge Service Ratings Service Video Metadata Service Bookmarks

    Service Recommendations service Disclaimer: This is an example and not an exact representation of the processing
  27. Delivering Netflix Edge Service Disclaimer: This is an example and

    not an exact representation of the processing Recommendations service
  28. Delivering Netflix Edge Service Disclaimer: This is an example and

    not an exact representation of the processing Recommendations service Service Discovery Which instances of recommendations service are available now?
  29. Which instances of recommendations service are available now?

  30. Which of the machines in the datacenter have the recommendations

    service software deployed?
  31. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2
    Metadata Service | 10.10.2.1, 10.10.2.2
    Recommendations Service | 10.10.3.1, 10.10.3.2
  32. Recs Service Instance Service discovery @Startup I am a recommendations

    service instance. @Shutdown I am not a recommendations service instance.
  33. Which instances of recommendations service are available now?

  34. available now?

  35. Recs Service Instance Service discovery @Startup I am a recommendations

    service instance. @Shutdown I am not a recommendations service instance.
  36. Recs Service Instance Service discovery @Startup I am a recommendations

    service instance. @Shutdown I am not a recommendations service instance. I am alive.
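The three messages on the slide above (register at startup, unregister at shutdown, and a periodic "I am alive") can be sketched as a lease table. This is an illustrative sketch only: the `LeaseTable` class and its parameters are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
class LeaseTable:
    """Hypothetical lease table: an instance stays listed while it heartbeats."""

    def __init__(self, heartbeat_interval, tolerated_misses):
        # An instance is dropped after this many seconds of silence.
        self.ttl = heartbeat_interval * tolerated_misses
        self._last_seen = {}  # instance id -> time of last register/heartbeat

    def register(self, instance_id, now):
        # "@Startup: I am a recommendations service instance."
        self._last_seen[instance_id] = now

    def heartbeat(self, instance_id, now):
        # "I am alive." Heartbeats from unknown instances are ignored.
        if instance_id in self._last_seen:
            self._last_seen[instance_id] = now

    def unregister(self, instance_id):
        # "@Shutdown: I am not a recommendations service instance."
        self._last_seen.pop(instance_id, None)

    def alive(self, now):
        # Instances whose last sign of life is within the lease.
        return sorted(i for i, t in self._last_seen.items() if now - t <= self.ttl)


table = LeaseTable(heartbeat_interval=30, tolerated_misses=3)
table.register("recs-1", now=0)
table.register("recs-2", now=0)
table.heartbeat("recs-1", now=60)
print(table.alive(now=100))  # ['recs-1']
```

The heartbeat matters because a crashed VM never sends the shutdown message; without it, the registry would list dead instances forever.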
  37. available now?

  38. Network Partitions Ratings Service Edge Service (Instance 1) Edge Service

    (Instance 2) X
  39. Network Partitions Being available in a distributed system is very

    subjective.
  40. Network Partitions Being available in a distributed system is very

    subjective. What is available to one can be unavailable to another.
  41. Network Partitions Being available in a distributed system is very

    subjective. What is available to one can be unavailable to another. A node’s availability decision is best when it is local.
  42. Service discovery does not guarantee node availability.

  43. Why should Service discovery store node status?

  44. Why should Service discovery store node status? Availability is doubtful

    but Unavailability can be trusted.
  45. Unavailability can be trusted

  46. Unavailability can be trusted Node status override

  47. Node status override Ratings Service (Instance 1) Edge Service Ratings

    Service (Instance 2)
  48. Node status override Ratings Service (Instance 1) Edge Service Ratings

    Service (Instance 2) Service Discovery Take Instance 2 out of service
  49. Node status override Ratings Service (Instance 1) Edge Service Ratings

    Service (Instance 2) Service Discovery Take Instance 2 out of service X
  50. Nodes can be isolated for debugging.

  51. Nodes can be started in isolation.

  52. Service discovery controls visibility of nodes but does not guarantee

    availability.
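Node status override, as described in the slides above, can be sketched as an operator-set status that wins over whatever the instance reports about itself. The `Instance` class, the `visible` helper, and the `OUT_OF_SERVICE` status name are assumptions made for this example; the slides only say "take Instance 2 out of service".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    address: str
    status: str = "UP"              # status reported by the instance itself
    override: Optional[str] = None  # status forced by an operator, if any

    def effective_status(self) -> str:
        # The override, when present, takes precedence over the reported status.
        return self.override or self.status


def visible(instances: List[Instance]) -> List[str]:
    # Only instances whose effective status is UP are handed out to callers.
    return [i.address for i in instances if i.effective_status() == "UP"]


ratings = [Instance("10.10.4.1"), Instance("10.10.4.2")]
ratings[1].override = "OUT_OF_SERVICE"  # isolate instance 2, e.g. for debugging
print(visible(ratings))  # ['10.10.4.1']
```

The same mechanism covers both use cases on the slides: isolating a running node for debugging, and starting a node in isolation by setting the override before it reports UP.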
  53. Failures What if?

  54. What if … Service discovery is unavailable?

  55. Service discovery controls visibility of nodes but does not guarantee

    availability.
  56. All you lose is the visibility of new nodes.
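"All you lose is the visibility of new nodes" holds when clients keep the last list they fetched. The sketch below shows one way a client could do that; `CachingLookup` and the `fetch` callable are hypothetical names, and a real client would also refresh in the background.

```python
class CachingLookup:
    """Hypothetical client-side lookup that survives discovery outages."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: service name -> iterable of addresses
        self._cache = {}     # last successfully fetched list per service

    def instances(self, service):
        try:
            self._cache[service] = list(self._fetch(service))
        except ConnectionError:
            # Discovery is unreachable: keep serving the last known list.
            pass
        return self._cache.get(service, [])


responses = iter([["10.10.3.1", "10.10.3.2"]])


def flaky_fetch(service):
    # First call succeeds, every later call simulates an outage.
    try:
        return next(responses)
    except StopIteration:
        raise ConnectionError("service discovery unavailable")


lookup = CachingLookup(flaky_fetch)
print(lookup.instances("recommendations"))  # ['10.10.3.1', '10.10.3.2']
print(lookup.instances("recommendations"))  # same list, served from cache
```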

  57. User Service Metadata Service Metadata Service User Service

  58. Instance variations

  59. None
  60. None
  61. High availability is a strong requirement.

  62. CAP theorem

  63. Consistency Visibility Availability

  64. Consistency Visibility A

  65. Worst time

  66. Consistency Visibility Availability

  67. C Visibility Availability

  68. Worst case User Service Metadata Service Metadata Service User Service

  69. A persistent, verified, connection-based interaction solves this issue.

  70. Edge Service Recommendations Service Connect

  71. Edge Service Recommendations Service Connect Verify: Are you a Recs service instance?
  72. Edge Service Recommendations Service Connect Verify: Are you a Recs service instance? Requests . . .
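The connect, verify, then send requests sequence above guards against stale registry data, for example an address that has since been reused by another service. Here is a minimal sketch of the idea with a simulated connection; `FakeConnection`, `connect_verified`, and the `dial` callable are all hypothetical and stand in for a real network client.

```python
class FakeConnection:
    """Stand-in for a real connection; it answers the identity question."""

    def __init__(self, service_name):
        self.service_name = service_name

    def ask_identity(self):
        # "Are you a Recs service instance?" answered with the actual service name.
        return self.service_name


def connect_verified(candidates, expected_service, dial):
    """Return (address, connection) for the first candidate that verifies."""
    for address in candidates:
        try:
            conn = dial(address)
        except ConnectionError:
            continue  # unreachable: the registry entry was stale
        if conn.ask_identity() == expected_service:
            return address, conn  # keep this connection for later requests
    return None


# Simulated world: 10.10.3.1 was reused by another service, 10.10.3.9 is gone.
hosts = {"10.10.3.1": "user-service", "10.10.3.2": "recommendations-service"}


def dial(address):
    if address not in hosts:
        raise ConnectionError(address)
    return FakeConnection(hosts[address])


found = connect_verified(
    ["10.10.3.9", "10.10.3.1", "10.10.3.2"], "recommendations-service", dial
)
print(found[0])  # 10.10.3.2
```

Because the check happens on the caller's own connection, the decision is local, which matches the earlier point that availability decisions are best made locally.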
  73. In most cases choosing availability over consistency for Service Discovery

    is the correct thing to do.
  74. Durability

  75. Service discovery data is ephemeral and re-generatable.

  76. Ephemeral Recs Service Instance Service discovery @Startup I am a

    recommendations service. @Shutdown I am not a recommendations service. In dynamic environments, this duration is on the order of hours or days.
  77. Service Discovery data is the instance information. It is always

    available to the instance. Re-generatable data
  78. None
  79. Can we create a co-ordination free service discovery?

  80. Stateful Client-Server interaction Recs Service Instance Service discovery @Startup I

    am a recommendations service instance. @Shutdown I am not a recommendations service instance. I am alive.
  81. Stateful Client-Server interaction Needs a stateful protocol.

  82. Recs Service Instance Service discovery @Startup I am a recommendations

    service instance. @Shutdown I am not a recommendations service instance. I am alive. A connection oriented, ordered and reliable protocol.
  83. Nice Benefits Causal ordering

  84. Nice Benefits Causal ordering Natural lifecycle (connection)

  85. Nice Benefits Causal ordering Natural lifecycle (connection) Ability to send

    data “diff”.
  86. Recs Service Instance Service Discovery Node 1 Register Heartbeats Shutdown

    Heartbeat
  87. Bigger Picture Service Discovery Node 1 Service Discovery Node 2

    Service Discovery Node 3
  88. Replication Node 1 Node 2 Node 3

  89. Replication Node 3 Node 1 Node 2

  90. Recs Service Instance Service Discovery Node 1 Register Heartbeats Acks

  91. Recs Service Instance Service Discovery Node 1 Register Heartbeat Acks

    Recs Service Instance Service Discovery Node 2 Register Heartbeat Acks
  92. Recs Service Instance Service Discovery Node 1 Register Heartbeat Acks

    Recs Service Instance Service Discovery Node 2 Register Heartbeat Acks
  93. Data conflicts

  94. Recs Service Instance Service Discovery IP: 10.10.2.1 Port: 7001 Status:

    Starting Status: UP Conflicts
  95. Recs Service Instance Service Discovery Node 1 Recs Service Instance

    Service Discovery Node 2 IP: 10.10.2.1 Port: 7001 Status: Starting IP: 10.10.2.1 Port: 7001 Status: UP Conflicts
  96. Service Discovery Node 3 IP: 10.10.2.1 Port: 7001 Status: Starting

    IP: 10.10.2.1 Port: 7001 Status: UP From Service Discovery Node 1 From Service Discovery Node 2 Conflicts
  97. Conflict resolution

  98. Recs Service Instance Service Discovery Node 1 What should be

    the state of Recs Service Instance data now? IP: 10.10.2.1 Port: 7001 Status: Starting
  99. Recs Service Instance Service Discovery Instance 1 What should be

    the state of Recs Service Instance data now? Show tolerance towards broken connections, evicting an instance too early causes churn. IP: 10.10.2.1 Port: 7001 Status: Starting
  100. Recs Service Instance Service Discovery Instance 1 What should be

    the state of Recs Service Instance data now? Show tolerance towards broken connections, evicting an instance too early causes churn. Typically, wait for a while for a reconnect before eviction. IP: 10.10.2.1 Port: 7001 Status: Starting
  101. Recs Service Instance Service Discovery Instance 1 What should be

    the state of Recs Service Instance data now? Show tolerance towards broken connections, evicting an instance too early causes churn. Typically, wait for a while for a reconnect before eviction. Evict in absence of reconnects after a while. IP: 10.10.2.1 Port: 7001 Status: Starting
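The eviction policy built up over the last few slides (tolerate a broken connection, wait for a reconnect, evict only if none arrives) can be sketched as follows. `ConnectionTracker` and `grace_period` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
class ConnectionTracker:
    """Hypothetical tracker that delays eviction after a broken connection."""

    def __init__(self, grace_period):
        self.grace_period = grace_period
        self._registered = set()
        self._disconnected_at = {}  # instance id -> time the connection broke

    def connected(self, instance_id):
        # A (re)connect registers the instance and cancels any pending eviction.
        self._registered.add(instance_id)
        self._disconnected_at.pop(instance_id, None)

    def disconnected(self, instance_id, now):
        # Remember only the first disconnect time; do not evict yet.
        if instance_id in self._registered:
            self._disconnected_at.setdefault(instance_id, now)

    def evict_expired(self, now):
        # Evict instances that stayed disconnected for the whole grace period.
        expired = [
            i for i, t in self._disconnected_at.items()
            if now - t >= self.grace_period
        ]
        for i in expired:
            self._registered.discard(i)
            del self._disconnected_at[i]
        return sorted(expired)

    def registered(self):
        return sorted(self._registered)


tracker = ConnectionTracker(grace_period=90)
tracker.connected("recs-1")
tracker.disconnected("recs-1", now=10)
print(tracker.evict_expired(now=50))   # [] (still within the grace period)
print(tracker.evict_expired(now=100))  # ['recs-1']
```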
  102. Well-behaved clients at steady state connect to a single

    Service Discovery node.
  103. Service Discovery Node 3 IP: 10.10.2.1 Port: 7001 Status: Starting

    IP: 10.10.2.1 Port: 7001 Status: UP From Service Discovery Node 1 From Service Discovery Node 2 Conflicts X
  104. Data conflicts are resolved naturally for service discovery.

  105. Tolerate temporary conflicts Node 1 Data Node 2 Data Node

    3 Data Service Discovery Node 1 Service Discovery Node 2 Service Discovery Node 3 Updates Updates Updates IP: 10.10.2.1 Port: 7001 Status: UP IP: 10.10.2.1 Port: 7001 Status: DOWN IP: 10.10.2.1 Port: 7001 Status: STARTING
  106. Read from a version (till it is gone) Service Discovery

    Node 1 Service Discovery Node 2 Service Discovery Node 3 Updates Updates Updates Read end Node 1 Data Node 2 Data Node 3 Data IP: 10.10.2.1 Port: 7001 Status: UP IP: 10.10.2.1 Port: 7001 Status: DOWN IP: 10.10.2.1 Port: 7001 Status: STARTING
  107. Read from a version (till it is gone) Service Discovery

    Node 1 Service Discovery Node 2 Service Discovery Node 3 Updates Updates Updates Read end X Node 1 Data Node 2 Data Node 3 Data IP: 10.10.2.1 Port: 7001 Status: UP IP: 10.10.2.1 Port: 7001 Status: DOWN IP: 10.10.2.1 Port: 7001 Status: STARTING
  108. Read from a version (till it is gone) Service Discovery

    Node 2 Service Discovery Node 3 Updates Updates Read end Node 2 Data Node 3 Data IP: 10.10.2.1 Port: 7001 Status: UP IP: 10.10.2.1 Port: 7001 Status: DOWN X
  109. Steady State Service Discovery Node 2 Updates Read end Node

    2 Data IP: 10.10.2.1 Port: 7001 Status: DOWN
  110. Time to converge (worst case) Service Discovery Node 1 Service

    Discovery Node 2 Service Discovery Node 3 Updates Updates Updates Read end Latest Oldest Node 1 Data Node 2 Data Node 3 Data IP: 10.10.2.1 Port: 7001 Status: UP IP: 10.10.2.1 Port: 7001 Status: DOWN IP: 10.10.2.1 Port: 7001 Status: STARTING
  111. Time to converge (worst case) = Time to evict stale copies (constant)

    + Time to replicate from the owner node.
  112. Time to evict stale copy Heartbeat interval * Number of

    tolerated missing heartbeats
  113. Time to replicate from the owner node Theoretically unbounded

  114. Adding bounds Instance 1 Instance 2 Persistent connection => Replication

    channel
  115. Adding bounds Instance 1 Instance 2 Persistent connection => Replication

    channel Heartbeats Heartbeats
  116. Time to replicate from the owner node Somewhat bounded Heartbeat

    interval * Number of tolerated missing heartbeats
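Putting the last few slides together, the worst-case convergence time can be written out as below. This is a sketch of the slides' reasoning rather than a stated result: the symbols $h$, $m$ (client heartbeat interval and tolerated missing heartbeats) and $h_r$, $m_r$ (the same quantities on the replication channel) are names introduced here, and "somewhat bounded" is rendered as $\lesssim$.

```latex
\begin{align*}
T_{\text{evict}} &= h \cdot m \\
T_{\text{replicate}} &\lesssim h_r \cdot m_r \\
T_{\text{converge}} &\lesssim T_{\text{evict}} + T_{\text{replicate}}
                     = h \cdot m + h_r \cdot m_r
\end{align*}
```

As a purely illustrative calculation with assumed values $h = h_r = 30\,\text{s}$ and $m = m_r = 3$, each term is $90\,\text{s}$, so the worst case is roughly $180\,\text{s}$. These numbers are not from the talk.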
  117. Time to converge (worst case) Cost of divergence?

  118. Cost of divergence? Instance data is hardly changing!

  119. In most cases, there isn’t a divergence and when it

    happens, the impact is low.
  120. Cost of divergence? Service discovery controls visibility of nodes but

    does not guarantee availability.
  121. Reads How to implement reads on service discovery data?

  122. Service discovery data is an ordered stream. IP: 10.10.2.1 Port:

    7001 Status: Starting Status: UP Status: DOWN
  123. IP: 10.10.2.1 Port: 7001 Status: Starting Status: UP Status: DOWN

    IP: 10.10.2. Port: 7001 Status: Starting Status: UP Status: DOWN
    IP: 10.10.2.4 Port: 7001 Status: Starting Status: UP Status: DOWN
    IP: 10.10.2.3 Port: 7001 Status: Starting Status: UP Status: DOWN
    IP: 10.10.2.1 Port: 7001 Status: Starting Status: UP Status: DOWN
    IP: 10.10.2.1 Port: 7001 Status: Starting Status: UP Status: DOWN
    Service discovery data is a merged ordered stream.
  124. Data as a stream (Lookup)

  125. Data as a stream (Lookup) Edge Service Instance Service Discovery

    Node 1 Give me all recs service instances ID: 1 IP: 10.10.2.1 Port: 7001 Status: UP ID: 2 IP: 10.10.2.2 Port: 7001 Status: Starting
  126. Edge Service Instance Give me all recs service instances ID:

    1 IP: 10.10.2.1 Port: 7001 Status: UP ID: 2 IP: 10.10.2.2 Port: 7001 Status: Starting Data as a stream (Lookup) ID: 1 Status: DOWN ID: 2 Status: UP Service Discovery Node 1
  127. Edge Service Instance Give me all recs service instances ID:

    1 IP: 10.10.2.1 Port: 7001 Status: UP ID: 2 IP: 10.10.2.2 Port: 7001 Status: Starting ID: 1 Status: DOWN ID: 2 Status: UP Data Diffs Data as a stream (Lookup) Service Discovery Node 1
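The lookup slides above show a subscriber first receiving full instance records and then only "data diffs" keyed by instance ID. A minimal sketch of how a client might fold such a stream into a local view follows; the dictionary layout and the `apply_update` helper are assumptions for illustration, not Eureka's wire format.

```python
def apply_update(view, update):
    """Merge one stream element into the local view, keyed by instance id.

    A full record creates the entry; a diff overwrites only the fields it carries.
    """
    merged = dict(view.get(update["id"], {}))
    merged.update(update)
    view[update["id"]] = merged
    return view


# The stream from the slides: two full records, then two status diffs.
stream = [
    {"id": 1, "ip": "10.10.2.1", "port": 7001, "status": "UP"},
    {"id": 2, "ip": "10.10.2.2", "port": 7001, "status": "STARTING"},
    {"id": 1, "status": "DOWN"},
    {"id": 2, "status": "UP"},
]

view = {}
for update in stream:
    apply_update(view, update)

print(view[1]["status"], view[2]["status"])  # DOWN UP
```

Applying diffs this way relies on the stream being ordered per instance, which is what the connection-oriented, ordered, reliable protocol from the earlier slides provides.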
  128. Data as a stream (Replication)

  129. Data as a stream (Replication) Service Discovery Node 1 Service

    Discovery Node 2 Give me all “non-replicated” instances ID: 1 IP: 10.10.2.1 Port: 7001 Status: UP ID: 2 IP: 10.10.2.2 Port: 7001 Status: Starting
  130. Data as a stream (Replication) Service Discovery Node 1 Service

    Discovery Node 2 ID: 1 IP: 10.10.2.1 Port: 7001 Status: UP ID: 2 IP: 10.10.2.2 Port: 7001 Status: Starting Service Discovery Node 3 Service Discovery Node 4 ID: 3 IP: 10.10.2.3 Port: 7001 Status: UP ID: 4 IP: 10.10.2.4 Port: 7001 Status: Starting Give me all “non-replicated” instances ID: 5 IP: 10.10.2.5 Port: 7001 Status: UP ID: 6 IP: 10.10.2.6 Port: 7001 Status: UP ID: 4 IP: 10.10.2.4 Port: 7001 Status: UP
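One way to read "give me all non-replicated instances" is that each node forwards only the instances registered directly with it, so replicated copies are never re-replicated. The sketch below follows that reading, which is an interpretation of the slides; `DiscoveryNode` and its methods are hypothetical.

```python
class DiscoveryNode:
    """Hypothetical discovery node that tags every entry with its owner node."""

    def __init__(self, name):
        self.name = name
        self.entries = {}  # instance id -> (owner node name, instance data)

    def register(self, instance_id, data):
        # A direct registration makes this node the owner of the entry.
        self.entries[instance_id] = (self.name, data)

    def non_replicated(self):
        # Only entries this node owns, i.e. not received via replication.
        return {i: d for i, (owner, d) in self.entries.items() if owner == self.name}

    def replicate_from(self, peer):
        # Pull the peer's own entries and remember who owns them.
        for instance_id, data in peer.non_replicated().items():
            self.entries[instance_id] = (peer.name, data)


node1, node2, node3 = DiscoveryNode("node1"), DiscoveryNode("node2"), DiscoveryNode("node3")
node1.register(1, {"ip": "10.10.2.1", "port": 7001, "status": "UP"})
node2.register(2, {"ip": "10.10.2.2", "port": 7001, "status": "STARTING"})

node2.replicate_from(node1)  # node2 now also knows instance 1
node3.replicate_from(node2)  # node3 gets only instance 2, which node2 owns
print(sorted(node3.entries))  # [2]
```

Under this reading, node3 learns about instance 1 only by replicating from node1 directly, which keeps a single owner responsible for each entry.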
  131. None
  132. Learnt the hard way

  133. Network partitions gone wild!

  134. Service Discovery Node 1 Service Discovery Node 2 Service Discovery

    Node 3
  135. Service Discovery Node 1 X X X X Service Discovery

    Node 2 Service Discovery Node 3 Service Discovery Nodes can talk to each other.
  136. Service Discovery Node 1 X X X X Service Discovery

    Node 2 Service Discovery Node 3 Service Discovery Nodes can talk to each other. No outside instance can talk to a node.
  137. Service Discovery Node 1 X X X X Service Discovery

    Node 2 Service Discovery Node 3 X
  138. Service Discovery Node 1 X X X X Service Discovery

    Node 2 Service Discovery Node 3 X X X
  139. One fine day … you lost most of your

    service instances …
  140. One fine day … you lost most of your

    service instances … because … one node of service discovery was partitioned
  141. Taming evictions a.k.a. Self preservation

  142. All instances “generally” do not vanish

  143. All instances “generally” do not vanish If they do, let

    local decisions prevail
  144. Do NOT evict if more than X% of instances vanish
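The self-preservation rule above can be sketched as a guard in front of eviction: if too large a share of instances appears to have vanished at once, assume the discovery node itself is partitioned and keep the data. The function name and the `max_evict_fraction` parameter (the slide's X, as a fraction) are hypothetical.

```python
def instances_to_evict(registered, expired, max_evict_fraction):
    """Return the instances that are safe to evict under self-preservation.

    registered: all instance ids currently in the registry
    expired: instance ids whose heartbeats have stopped
    max_evict_fraction: the slide's X, expressed as a fraction (assumed name)
    """
    if not registered:
        return set()
    vanished = len(set(expired) & set(registered)) / len(set(registered))
    if vanished > max_evict_fraction:
        # Too many vanished at once: more likely a partition than mass failure.
        return set()
    return set(expired) & set(registered)


registered = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
print(instances_to_evict(registered, {"a"}, 0.15))                 # {'a'}
print(instances_to_evict(registered, {"a", "b", "c", "d"}, 0.15))  # set()
```

Keeping possibly dead entries is acceptable because, as the next slides note, clients talking to an in-doubt instance detect the failure themselves.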

  145. Service Discovery Node 1 X X X X Service Discovery

    Node 2 Service Discovery Node 3 X
  146. Service Discovery Node 1 X X X X Service Discovery

    Node 2 Service Discovery Node 3 Preserve the data.
  147. Clients talking to the “in-doubt” instance will detect failure, if

    any.
  148. Total unavailability a.k.a. Apocalypse

  149. Impact Fatal New nodes cannot be started.

  150. Impact Fatal Degraded New nodes cannot be started. Existing

    nodes use cached data.
  151. Netflix Eureka V2 https://github.com/Netflix/eureka

  152. Don’t make Service Discovery your Achilles heel. Service discovery controls

    visibility of nodes but does not guarantee availability.
  153. Nitesh Kant, Engineer, Netflix Edge Gateway @NiteshKant