Asynchronous Service Oriented Design


John Pignata

June 08, 2013

Transcript

  1. Asynchronous Service-Oriented Design: Patterns for Building and Operating Flexible Services

  2. @jpignata

  5. Problem Statement

  6. “Our Ruby on Rails application is difficult to change.”

  9. monolith

  11. monorail

  12. “Amazon.com started [in 1996] as a monolithic application, running on a Web server, talking to a database on the back end. This application evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for. This went on until 2001 when it became clear that the front-end application couldn’t scale anymore.”
    Source: A Conversation with Werner Vogels, http://queue.acm.org/detail.cfm?id=1142065
  14. Asynchronous Service-Oriented Design

  15. message passing

  16. a medium is used to exchange messages

  17. Patterns

  18. Service Mailbox Pattern

  19. clients write messages to a queue

  20. a service processes messages from the queue

  21. FIFO Queue

  22. first in, first out

  23. queue.enqueue(item)

  25. queue.dequeue()

  27. Application -> enqueue -> Service Mailbox <- dequeue <- Service
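
The mailbox pattern on the slides above can be sketched in a few lines of Ruby. This is only an in-process illustration using Ruby's built-in thread-safe `Queue`, not the Redis-backed setup the talk goes on to describe; the `:shutdown` sentinel and the `delivered` array are assumptions added to make the example terminate and be observable.

```ruby
# Ruby's core Queue class is a thread-safe FIFO queue.
mailbox = Queue.new
delivered = []

# The service: blocks on the mailbox and processes messages in arrival order.
service = Thread.new do
  while (message = mailbox.pop) != :shutdown
    delivered << "delivered: #{message}"
  end
end

# The application (client): enqueues messages and moves on without waiting.
mailbox.push("What is up?")
mailbox.push("Heading to drinks.")
mailbox.push(:shutdown) # hypothetical sentinel so the sketch can exit

service.join
puts delivered
```

The client never calls the service directly; the only thing the two sides share is the queue.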

  28. Feature: Message Delivery

  29. Best Friends Group

  30. "What is up?" "Steve: What is up?"

  31. "Heading to drinks." "Tara: Heading to drinks."

  32. iOS (APNS), Android (C2DM / GCM), Windows (WPNS), SMS (SIP / SMPP / HTTP)
  33. messages should be delivered exactly once

  34. messages should be delivered as soon as possible

  35. drop messages after 2 hours if undeliverable

  36. GroupMe Twilio

  37. GroupMe Resque Twilio Redis

  38. GroupMe Resque Twilio Redis APNS C2DM

  39. throughput choppy and low

  40. large number of processes

  41. different rates of change

  42. Transport Service

  43. high-speed eventmachine resque worker

  45. GroupMe Redis Transport APNS C2DM GCM SMS WPNS

  46. uses a redis list as a queue

  47. resque serialization format and key names

  48. GroupMe -> RPUSH resque:queue:apns [serialized item] -> Redis <- BLPOP resque:queue:apns <- Transport

  49. redis lists provide the exactly-once semantics required

  50. higher, more even throughput

  51. easy migration path

  52. same redis dependency as resque
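
As a rough illustration of why the migration path is easy, the sketch below writes and reads jobs in Resque's job format (a JSON object with `class` and `args`, pushed onto the list `resque:queue:<name>`). To stay runnable without a Redis server it uses a small in-memory stand-in, `FakeRedis`, which is an assumption of this example; the job class name and arguments are hypothetical too. A real consumer would use the blocking `BLPOP` rather than the non-blocking `lpop` shown here.

```ruby
require "json"

# Minimal in-memory stand-in for the two Redis list operations used below.
# A real deployment would talk to Redis itself.
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Appends to the tail of the list and returns the new length, like RPUSH.
  def rpush(key, value)
    @lists[key].push(value)
    @lists[key].length
  end

  # Removes and returns the head of the list (nil when empty), like LPOP.
  def lpop(key)
    @lists[key].shift
  end
end

# Producer side: serialize a job the way Resque does and push it.
def enqueue(redis, queue, job_class, *args)
  payload = JSON.generate("class" => job_class, "args" => args)
  redis.rpush("resque:queue:#{queue}", payload)
end

# Consumer side: pop the oldest job and decode it.
def dequeue(redis, queue)
  payload = redis.lpop("resque:queue:#{queue}")
  payload && JSON.parse(payload)
end

redis = FakeRedis.new
enqueue(redis, "apns", "PushNotification", "device-token-1", "Steve: What is up?")
enqueue(redis, "apns", "PushNotification", "device-token-2", "Tara: Heading to drinks.")

first = dequeue(redis, "apns")
puts first["class"]
puts first["args"].inspect
```

Because the producer side is unchanged, the application can keep enqueuing through Resque while a different consumer reads the same list.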

  53. Event Stream Pattern

  54. File

  55. log.append(item)

  57. log.read(offset, total) [

  58. Service Service Service Event Stream Service Service Service Subscribe Publish
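
The `log.append(item)` and `log.read(offset, total)` operations above could look roughly like this in Ruby. It is a minimal in-memory sketch: `EventLog` and the event names are hypothetical, and offsets here are item indexes rather than the byte offsets a real log file would use.

```ruby
# An append-only log: items are only ever added at the end, and reading
# never removes anything.
class EventLog
  def initialize
    @entries = []
  end

  # Appends an item and returns the offset it was stored at.
  def append(item)
    @entries << item
    @entries.length - 1
  end

  # Returns up to `total` items starting at `offset`, leaving the log intact.
  def read(offset, total)
    @entries[offset, total] || []
  end
end

log = EventLog.new
log.append("message.created")
log.append("message.liked")
log.append("member.joined")

p log.read(0, 2)
p log.read(2, 10)
p log.read(5, 1)
```

Unlike a queue, reading does not consume: any number of readers can ask for the same range.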

  59. Feature: News Feed

  60. Best Friends Group

  61. Pat: Something pithy and funny about the current sporting event

    2 Pat: Rails is Benihana.
  62. Steve and Tara liked your message: Something pithy and funny

    about the current sporting event Steve and Tara liked your message: Rails is Benihana.
  63. class NewsFeedController < ApplicationController
        def index
          render likes: Like.find_all_by_recipient_id(current_user_id),
                 # other locals ...
        end
      end
  64. slow database queries

  65. low-value news feed reads competed for database time

  66. unnecessarily real-time

  67. News Feed Service

  68. Kafka

  69. GroupMe Message Service Kafka News Feed Service Relationship Service Cache

  70. Apache Kafka is a distributed, high-throughput pub/sub messaging system

  71. messages are appended to a topic by a producer

  72. consumers iterate through the topic starting at a given byte

    offset
  73. picture a distributed log file that you can write to

    and tail from other services across the network
  74. components publish user events to the event stream

  75. the news feed service creates content from these events where

    necessary
  76. multi-consumer

  77. rewindable
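
To make "multi-consumer" and "rewindable" concrete, here is a small sketch in which two consumers read the same stream, each tracking its own offset. The stream is modeled as a plain array and the event hashes, `Consumer` class, and leaderboard counter are all hypothetical; a real deployment would read from Kafka instead.

```ruby
# Hypothetical user events, in the order they were published.
stream = [
  { type: "like", liker: "Steve", message: "Rails is Benihana." },
  { type: "join", member: "Pat" },
  { type: "like", liker: "Tara", message: "Rails is Benihana." }
]

# Each consumer owns its offset, so consumers never interfere with each other.
class Consumer
  attr_reader :offset

  def initialize(stream, offset = 0)
    @stream = stream
    @offset = offset
  end

  # Yields every event from the current offset onward, then advances.
  def poll
    events = @stream.drop(@offset)
    events.each { |event| yield event }
    @offset += events.length
  end

  # Moves the offset back so earlier events can be processed again.
  def rewind(offset = 0)
    @offset = offset
  end
end

# Consumer 1: the news feed groups likes by message.
likers = Hash.new { |hash, key| hash[key] = [] }
news_feed = Consumer.new(stream)
news_feed.poll { |event| likers[event[:message]] << event[:liker] if event[:type] == "like" }
feed_items = likers.map { |message, names| "#{names.join(' and ')} liked your message: #{message}" }

# Consumer 2: an independent reader of the very same events.
like_count = 0
leaderboard = Consumer.new(stream)
leaderboard.poll { |event| like_count += 1 if event[:type] == "like" }

# Rewinding replays history, for example after fixing a bug in a consumer.
leaderboard.rewind
replayed = 0
leaderboard.poll { |_event| replayed += 1 }

puts feed_items
```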

  78. Benefits

  79. scalability

  80. each process consumes a uniform number of resources

  81. infinite horizontal scale is usually a lie

  82. services allow us to add targeted capacity

  83. the services scale individually

  84. agility

  85. updates can be deployed independently

  86. deploys are low-impact because our messaging layer acts as a

    buffer
  87. reusability

  88. any component that needs to send messages can use the

    same interface
  89. cheap experimentation

  90. data ownership silos naturally form in larger organizations

  91. conway’s law

  92. “organizations which design systems ... are constrained to produce designs

    which are copies of the communication structures of these organizations”
  93. publishing to event streams can mitigate data silos

  94. all notable events within a system are published to an

    event stream
  95. experimental projects can read from the event stream and use

    the data in unanticipated ways
  96. GroupMe Message Service Kafka Experimental Likes Leaderboard Relationship Service Experimental

    Email Newsletter Experimental Photo Gallery
  97. publishing to an event stream reduces the coupling between teams

  98. data availability removes barriers to exploration

  99. back pressure

  100. some partners mandate our maximum delivery rate

  101. message-based service boundaries are natural buffers

  102. groupme / pace

  103. worker = Pace::Worker.new(queue: "sms", jobs_per_second: 100)
       worker.pause
       worker.resume
       worker.pause(0.5)

  104. automatic backoff on failure

  105. we pause if we exceed a threshold of remote failures

    to prevent compounding the problem
  106. messages are buffered in the queue and flush when the

    system stabilizes
  107. since redis’ working set must fit in memory, this must

    be monitored closely
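
The automatic backoff described above can be sketched as follows. This is not Pace's implementation or API; `BackoffWorker`, its `failure_threshold:` option, and the in-memory array standing in for the queue are assumptions made for illustration, and rate limiting (`jobs_per_second`) is left out to keep the example short.

```ruby
# A worker that pauses itself after too many consecutive delivery failures,
# leaving undelivered messages buffered in the queue.
class BackoffWorker
  attr_reader :delivered

  def initialize(queue, failure_threshold:)
    @queue = queue
    @failure_threshold = failure_threshold
    @consecutive_failures = 0
    @paused = false
    @delivered = []
  end

  def paused?
    @paused
  end

  def resume
    @consecutive_failures = 0
    @paused = false
  end

  # Processes messages until the queue is empty or the worker pauses itself.
  # The block performs the delivery and returns true on success.
  def run
    until @paused || @queue.empty?
      message = @queue.first
      if yield(message)
        @queue.shift # remove the message only after a successful delivery
        @delivered << message
        @consecutive_failures = 0
      else
        @consecutive_failures += 1
        @paused = true if @consecutive_failures >= @failure_threshold
      end
    end
  end
end

queue = ["sms-1", "sms-2", "sms-3"]
provider_up = false
worker = BackoffWorker.new(queue, failure_threshold: 3)

worker.run { |_message| provider_up } # provider down: three failures, then pause
puts worker.paused?
puts queue.length                     # nothing was lost; messages are still queued

provider_up = true
worker.resume
worker.run { |_message| provider_up } # provider back: the queue flushes
puts worker.delivered.inspect
```

Pausing stops the worker from hammering a struggling provider, while the queue absorbs the backlog until delivery can resume.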
  108. interoperability

  109. SMS Provider SIP Receiver SIP Redis Mobile Originated Consumer SMS


  111. Concerns

  112. monitoring

  113. QoS

  114. QoS = [msg exited timestamp] - [msg entered timestamp]

  115. nginx GroupMe Transport Messaging Provider Message Entered Timestamp Captured Message

    Exited Timestamp Captured
  116. QoS = 1370098106.90963 - 1370098106.493368

  117. QoS = ~416ms
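
The QoS arithmetic above, written out in Ruby. The helper name `qos_ms` is made up for this example; the two timestamps are the ones from the slide, as Unix epoch seconds.

```ruby
# Time a message spent in the system, in whole milliseconds.
def qos_ms(entered_at, exited_at)
  ((exited_at - entered_at) * 1000).round
end

puts qos_ms(1370098106.493368, 1370098106.90963) # prints 416
```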

  119. Queue Size

  120. LLEN [queue name]

  121. often zero unless failures or throttling requirements have caused the

    senders to apply back pressure
  122. delivery guarantees

  123. exactly once / at least once / at most once

  124. business requirements drive what delivery guarantees must be maintained

  125. interface synchronization

  126. your client and service must agree on the format of

    the messages
  127. ruby objects are used to represent messages and serialize themselves

    for transport
  128. consumers and producers work with the same objects which enforce

    a consistent interface
  129. these clients are shared via a gem which is versioned

    along with the service
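
One way message objects that "serialize themselves" might look is sketched below. The `PushMessage` class, its fields, and the explicit `version` field are assumptions for illustration, not the actual GroupMe client gem; the point is that producer and consumer share one class, so the wire format is defined in exactly one place.

```ruby
require "json"

# A message class shared by producers and consumers (for example via a gem).
class PushMessage
  VERSION = 1

  attr_reader :recipient, :text

  def initialize(recipient:, text:)
    @recipient = recipient
    @text = text
  end

  # Producer side: turn the object into its wire format.
  def serialize
    JSON.generate("version" => VERSION, "recipient" => recipient, "text" => text)
  end

  # Consumer side: rebuild the object, rejecting formats it does not know.
  def self.deserialize(payload)
    data = JSON.parse(payload)
    unless data["version"] == VERSION
      raise ArgumentError, "unsupported message version: #{data['version'].inspect}"
    end
    new(recipient: data.fetch("recipient"), text: data.fetch("text"))
  end
end

wire = PushMessage.new(recipient: "tara", text: "Steve: What is up?").serialize
message = PushMessage.deserialize(wire)
puts message.recipient
puts message.text
```

Failing loudly on an unknown version surfaces a client and service mismatch at the boundary instead of deep inside the consumer.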
  130. failure modes

  131. the behavior of your system during failure

  132. avoid cascading failure

  133. what happens when redis is unavailable?

  134. or memory fills up?

  135. durability

  136. (and performance)

  137. redis by default isn’t very durable

  138. even with AOF and replication, message loss is still possible

  139. kafka doesn’t acknowledge produced messages

  140. Learnings

  141. you can go a long way with a monolith

  142. messaging is a tool for service-oriented systems building

  143. there are alternatives to HTTP when designing cooperating systems

  144. backgrounded work is a prime candidate for service extraction

  145. service extraction provides many of the promised benefits

  146. as with all engineering decisions: there are trade-offs to make

    to realize these benefits
  147. you must explicitly plan for failure cases

  148. failure is another mode of operation

  149. the system must bend but not break

  150. you won’t know how broken something is until you measure

    it
  151. look at percentile slices rather than averages
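
A small sketch of why percentiles are more telling than averages. The latency numbers are invented for illustration, and `percentile` uses the simple nearest-rank method, which is one of several common definitions.

```ruby
# Nearest-rank percentile: the smallest value such that at least pct percent
# of the samples are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical delivery latencies in milliseconds: nine fast, one very slow.
latencies_ms = [40, 42, 45, 41, 43, 44, 40, 46, 42, 4000]

mean = latencies_ms.sum / latencies_ms.length.to_f
puts "mean: #{mean}"
puts "p50:  #{percentile(latencies_ms, 50)}"
puts "p90:  #{percentile(latencies_ms, 90)}"
puts "p99:  #{percentile(latencies_ms, 99)}"
```

The mean (438.3 ms) describes none of the ten deliveries: the median shows that a typical message takes about 42 ms, and the 99th percentile exposes the 4-second outlier that the average blurs.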

  153. you will have to experience system failure to build resilience

    into the system and your team
  154. “18) Failure free operations require experience with failure. Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure.”
    Source: How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  155. we prefer smaller building blocks like redis and kafka over

    message brokers like rabbitmq, qpid, etc
  156. focus on the interfaces between your systems

  157. ad-hoc JSON serialization formats are fraught

  158. there’s more than one way to do it

  159. your mileage will vary

  160. http://tx.pignata.com Thank You! @jpignata

  161. http://tx.pignata.com Thank You! @jpignata ps we’re hiring!