Asynchronous Service Oriented Design

Asynchronous Service-Oriented Design Patterns for Building and Operating Flexible Services

@jpignata

Problem Statement

“Our Ruby on Rails application is difficult to change.”

monolith

monorail

“Amazon.com started [in 1996] as a monolithic application, running on a Web server, talking to a database on the back end. This application evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for. This went on until 2001 when it became clear that the front-end application couldn’t scale anymore.”

A Conversation with Werner Vogels, http://queue.acm.org/detail.cfm?id=1142065

Asynchronous Service-Oriented Design

message passing

a medium is used to exchange messages

Patterns

Service Mailbox Pattern

clients write messages to a queue

a service processes messages from the queue

FIFO Queue

first in, first out

queue.enqueue(item)

queue.dequeue()

[Diagram: Application -> enqueue -> Service Mailbox -> dequeue -> Service]
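The mailbox idea can be tried in a single process before any infrastructure is involved. This is a minimal sketch, assuming Ruby's built-in thread-safe `Queue` as a stand-in for the real queue; the item names are made up for illustration.

```ruby
# A minimal in-process sketch of the Service Mailbox pattern.
# Ruby's thread-safe Queue stands in for the real queue (an assumption).
mailbox = Queue.new

# Client side: the application only writes messages to the mailbox.
%w[first second third].each { |item| mailbox.enq(item) }
mailbox.close # no more messages; lets the service loop below finish

# Service side: a separate worker drains the mailbox in FIFO order.
processed = []
service = Thread.new do
  # deq returns nil once the queue is closed and empty
  while (item = mailbox.deq)
    processed << item.upcase
  end
end
service.join

puts processed.inspect
```

The client never calls the service directly; it only knows about the mailbox, which is what lets the two sides run and deploy at different rates.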

Feature: Message Delivery

Best Friends Group

"What is up?" "Steve: What is up?"

"Heading to drinks." "Tara: Heading to drinks."

iOS (APNS), Android (C2DM / GCM), Windows (WPNS), SMS (SIP / SMPP / HTTP)

messages should be delivered exactly once

messages should be delivered as soon as possible

drop messages after 2 hours if undeliverable
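The two-hour rule can be sketched as an age check at dequeue time. This is only an illustration: the `Envelope` struct, its field names, and the `deliverable` helper are assumptions, not the production code.

```ruby
MAX_AGE_SECONDS = 2 * 60 * 60 # "drop messages after 2 hours"

# Hypothetical envelope: the message body plus the time it entered the system.
Envelope = Struct.new(:body, :enqueued_at) do
  # True when the message has waited longer than the delivery window.
  def expired?(now = Time.now)
    now - enqueued_at > MAX_AGE_SECONDS
  end
end

# Keeps only the messages that are still worth delivering.
def deliverable(envelopes, now = Time.now)
  envelopes.reject { |envelope| envelope.expired?(now) }
end

now   = Time.at(1_370_098_106)
fresh = Envelope.new("Heading to drinks.", now - 60)
stale = Envelope.new("What is up?", now - (3 * 60 * 60))

puts deliverable([fresh, stale], now).map(&:body).inspect
```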

GroupMe Twilio

GroupMe Resque Twilio Redis

GroupMe Resque Twilio Redis APNS C2DM

throughput choppy and low

large number of processes

different rates of change

Transport Service

high-speed eventmachine resque worker

GroupMe Redis Transport APNS C2DM GCM SMS WPNS

uses a redis list as a queue

resque serialization format and key names

[Diagram: GroupMe -> RPUSH resque:queue:apns [serialized item] -> Redis -> BLPOP resque:queue:apns -> Transport]

redis lists provide the exactly-once semantics required
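A rough sketch of both sides of the list, assuming Resque's usual JSON payload with `class` and `args` keys. `FakeRedis` is an in-memory stand-in written for this example so it runs without a server; a `Redis.new` client from the redis gem answers the same `rpush` and `blpop` calls. The job class name `ApnsDelivery` is hypothetical.

```ruby
require "json"

# In-memory stand-in for the two list commands used here (an assumption made
# so the sketch runs without a Redis server).
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Appends to the tail of the list and returns the new length.
  def rpush(key, value)
    @lists[key].push(value)
    @lists[key].length
  end

  # Returns [key, value] like redis-rb's blpop, or nil when the list is empty.
  # A real blpop blocks up to `timeout` seconds; this fake returns immediately.
  def blpop(key, timeout: 0)
    value = @lists[key].shift
    value && [key, value]
  end
end

QUEUE_KEY = "resque:queue:apns"

# Producer: push a Resque-style JSON payload onto the tail of the list.
def enqueue(redis, job_class, *args)
  redis.rpush(QUEUE_KEY, JSON.generate("class" => job_class, "args" => args))
end

# Consumer: pop one item from the head; each item goes to a single consumer.
def dequeue(redis)
  _key, payload = redis.blpop(QUEUE_KEY, timeout: 1)
  payload && JSON.parse(payload)
end

redis = FakeRedis.new
enqueue(redis, "ApnsDelivery", "device-token", "Steve: What is up?")
job = dequeue(redis)

puts job["class"]
puts job["args"].inspect
```

Because the producer keeps writing the same key and format, existing Resque enqueue calls would not need to change when the consumer is swapped out.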

higher, more even throughput

easy migration path

same redis dependency as resque

Event Stream Pattern

log.append(item)

log.read(offset, total) [

[Diagram: Services -> Publish -> Event Stream -> Subscribe -> Services]
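The `append` / `read` interface above can be sketched with an in-memory array. This is an illustration under assumptions: the `EventLog` class and the event names are invented here, and offsets are array indexes rather than anything a real log would use.

```ruby
# Append-only log sketch: appending returns an offset, reading never removes.
class EventLog
  def initialize
    @entries = []
  end

  # Appends an item and returns the offset it was stored at.
  def append(item)
    @entries << item
    @entries.length - 1
  end

  # Returns up to `total` items starting at `offset` (empty past the end).
  def read(offset, total)
    @entries[offset, total] || []
  end
end

log = EventLog.new
log.append("message.created")
log.append("message.liked")
log.append("group.joined")

puts log.read(0, 2).inspect
puts log.read(1, 10).inspect
puts log.read(5, 1).inspect
```

Unlike a queue's `dequeue`, `read` leaves the data in place, so any number of readers can ask for the same range.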

Feature: News Feed

Best Friends Group

Pat: Something pithy and funny about the current sporting event
2 Pat: Rails is Benihana.

Steve and Tara liked your message: Something pithy and funny about the current sporting event

Steve and Tara liked your message: Rails is Benihana.

class NewsFeedController < ApplicationController
  def index
    render likes: Like.find_all_by_recipient_id(current_user_id), # other locals ...
  end
end

slow database queries

low-value news feed reads competed for database time

unnecessarily real-time

News Feed Service

GroupMe Message Service Kafka News Feed Service Relationship Service Cache

Apache Kafka is a distributed, high-throughput pub/sub messaging system

messages are appended to a topic by a producer

consumers iterate through the topic starting at a given byte offset

picture a distributed log file that you can write to and tail from other services across the network

components publish user events to the event stream

the news feed service creates content from these events where necessary

multi-consumer

rewindable
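Both properties follow from letting each consumer keep its own position in the log. The sketch below is an assumption-heavy illustration: a plain Array stands in for a topic, offsets are array indexes rather than byte offsets, and the `Consumer` class is hypothetical.

```ruby
# Each consumer tracks its own position in a shared, append-only log.
class Consumer
  attr_reader :offset

  def initialize(log)
    @log = log
    @offset = 0
  end

  # Returns every entry this consumer has not seen yet and advances its offset.
  def poll
    batch = @log.drop(@offset)
    @offset += batch.length
    batch
  end

  # Moves back to an earlier offset so old events can be replayed.
  def rewind(to = 0)
    @offset = to
  end
end

log = ["message.created", "message.liked"] # plain Array standing in for a topic

news_feed   = Consumer.new(log)
leaderboard = Consumer.new(log)

first_batch = news_feed.poll          # news feed reads both events
log << "message.liked"                # a producer appends a third event
second_batch = news_feed.poll         # news feed only sees the new one
late_batch = leaderboard.poll         # leaderboard starts from the beginning

leaderboard.rewind                    # replay everything, e.g. after a bug fix
replayed = leaderboard.poll

puts second_batch.inspect
puts replayed.length
```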

Benefits

scalability

each process consumes a uniform number of resources

infinite horizontal scale is usually a lie

services allow us to add targeted capacity

the services scale individually

agility

updates can be deployed independently

deploys are low-impact because our messaging layer acts as a buffer

reusability

any component that needs to send messages can use the same interface

cheap experimentation

data ownership silos naturally form in larger organizations

conway’s law

“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”

publishing to event streams can mitigate data silos

all notable events within a system are published to an event stream

experimental projects can read from the event stream and use the data in unanticipated ways

[Diagram: GroupMe, Message Service, Relationship Service, Kafka, Experimental Likes Leaderboard, Experimental Email Newsletter, Experimental Photo Gallery]

publishing to an event stream reduces the coupling between teams

data availability removes barriers to exploration

back pressure

some partners mandate our maximum delivery rate

message-based service boundaries are natural buffers

groupme / pace

worker = Pace::Worker.new(queue: "sms", jobs_per_second: 100)
worker.pause
worker.resume
worker.pause(0.5)

automatic backoff on failure

we pause if we exceed a threshold of remote failures to prevent compounding the problem

messages are buffered in the queue and flush when the system stabilizes
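The pause-on-failure behaviour can be sketched without any particular worker library. Everything named here (`FailureBackoff`, `drain`, the threshold and pause values) is hypothetical and is not Pace's API; time is passed in explicitly so the example runs instantly.

```ruby
# Hypothetical breaker: pauses after too many consecutive remote failures.
class FailureBackoff
  def initialize(threshold:, pause_seconds:)
    @threshold = threshold
    @pause_seconds = pause_seconds
    @failures = 0
    @paused_until = nil
  end

  def paused?(now = Time.now)
    !@paused_until.nil? && now < @paused_until
  end

  def record_success
    @failures = 0
  end

  # Starts a pause once the consecutive failure count reaches the threshold.
  def record_failure(now = Time.now)
    @failures += 1
    return unless @failures >= @threshold

    @paused_until = now + @pause_seconds
    @failures = 0
  end
end

# Sends as much of the queue as it can; anything left stays buffered.
# The block performs the delivery and returns true on success.
def drain(queue, backoff, now)
  sent = []
  until queue.empty? || backoff.paused?(now)
    message = queue.first
    if yield(message)
      sent << queue.shift
      backoff.record_success
    else
      backoff.record_failure(now)
    end
  end
  sent
end

queue   = %w[a b c]
backoff = FailureBackoff.new(threshold: 3, pause_seconds: 30)
start   = Time.at(0)

sent_during_outage = drain(queue, backoff, start) { |_message| false }
sent_after_pause   = drain(queue, backoff, start + 31) { |_message| true }

puts sent_during_outage.inspect
puts sent_after_pause.inspect
```

During the outage nothing is lost: the three messages stay in the queue and go out once the pause has elapsed and the provider answers again.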

since redis’ working set must fit in memory, queue growth must be monitored closely

interoperability

SMS Provider SIP Receiver SIP Redis Mobile Originated Consumer SMS

Concerns

monitoring

QoS = [msg exited timestamp] - [msg entered timestamp]

[Diagram: nginx, GroupMe, Transport, Messaging Provider; labels: "Message Entered Timestamp Captured", "Message Exited Timestamp Captured"]

QoS = 1370098106.90963 - 1370098106.493368

QoS = ~416ms
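The same arithmetic as a small helper, using the two timestamps from the slide. The function name `qos_ms` is an assumption.

```ruby
# QoS latency in milliseconds from the two captured epoch timestamps.
def qos_ms(entered_at, exited_at)
  ((exited_at - entered_at) * 1000).round
end

latency = qos_ms(1370098106.493368, 1370098106.90963)
puts latency # => 416
```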

Queue Size

LLEN [queue name]

often zero unless failures or throttling requirements have caused the senders to apply back pressure

delivery guarantees

exactly once, at least once, at most once

business requirements drive what delivery guarantees must be maintained
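One common way to live with at-least-once delivery is to make the consumer idempotent, so a redelivered message is processed only once. This is a generic sketch of that technique, not something the talk describes; the `IdempotentConsumer` class and the message ids are hypothetical, and a real system would keep the seen-id set somewhere durable and bounded.

```ruby
require "set"

# Consumer that ignores messages whose id it has already handled.
class IdempotentConsumer
  attr_reader :handled

  def initialize
    @seen = Set.new
    @handled = []
  end

  # Returns true if the message was processed, false if it was a duplicate.
  def receive(id, body)
    return false unless @seen.add?(id) # add? returns nil when id is already present

    @handled << body
    true
  end
end

consumer = IdempotentConsumer.new
results = [
  consumer.receive("msg-1", "Steve: What is up?"),
  consumer.receive("msg-1", "Steve: What is up?"), # redelivery of the same message
  consumer.receive("msg-2", "Tara: Heading to drinks.")
]

puts results.inspect
puts consumer.handled.length
```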

interface synchronization

your client and service must agree on the format of the messages

ruby objects are used to represent messages and serialize themselves for transport

consumers and producers work with the same objects which enforce a consistent interface

these clients are shared via a gem which is versioned along with the service
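A message object that serializes itself might look roughly like this. The class name `LikeCreated`, its fields, the JSON wire format, and the version check are all assumptions made for illustration; the talk does not show the actual classes.

```ruby
require "json"

# Hypothetical message class that both producer and consumer load from a shared gem.
class LikeCreated
  VERSION = 1

  attr_reader :message_id, :user_id

  def initialize(message_id:, user_id:)
    @message_id = message_id
    @user_id = user_id
  end

  # Producer side: the object serializes itself for transport.
  def to_wire
    JSON.generate("version" => VERSION, "message_id" => message_id, "user_id" => user_id)
  end

  # Consumer side: rebuild the object and reject payloads it does not understand.
  def self.from_wire(payload)
    data = JSON.parse(payload)
    unless data["version"] == VERSION
      raise ArgumentError, "unsupported version: #{data['version'].inspect}"
    end

    new(message_id: data.fetch("message_id"), user_id: data.fetch("user_id"))
  end
end

wire = LikeCreated.new(message_id: 42, user_id: 7).to_wire
like = LikeCreated.from_wire(wire)

puts like.message_id
puts like.user_id
```

Keeping the encode and decode logic in one class means a format change is made in one place and released as a new gem version for both sides.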

failure modes

the behavior of your system during failure

avoid cascading failure

what happens when redis is unavailable?

or memory fills up?

durability

(and performance)

redis by default isn’t very durable

even with AOF and replication, message loss is still possible

kafka doesn’t acknowledge produced messages

Learnings

you can go a long way with a monolith

messaging is a tool for building service-oriented systems

there are alternatives to HTTP when designing cooperating systems

backgrounded work is a prime candidate for service extraction

service extraction provides many of the promised benefits

as with all engineering decisions: there are trade-offs to make to realize these benefits

you must explicitly plan for failure cases

failure is another mode of operation

the system must bend but not break

you won’t know how broken something is until you measure it

look at percentile slices rather than averages
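A small illustration of why the average hides problems, using the nearest-rank percentile method. The latency samples are invented for the example and the helper name `percentile` is an assumption.

```ruby
# Nearest-rank percentile: the smallest sample with at least pct% of all
# samples at or below it.
def percentile(samples, pct)
  raise ArgumentError, "samples must not be empty" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up latencies: nine ordinary deliveries and one very slow one.
latencies_ms = [120, 130, 110, 125, 115, 140, 135, 118, 122, 4000]

mean = latencies_ms.sum / latencies_ms.length.to_f
p50  = percentile(latencies_ms, 50)
p90  = percentile(latencies_ms, 90)
p99  = percentile(latencies_ms, 99)

puts "mean=#{mean} p50=#{p50} p90=#{p90} p99=#{p99}"
```

Here the mean lands far from what any user experienced, while the median shows the typical delivery and the 99th percentile exposes the outlier.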

you will have to experience system failure to build resilience into the system and your team

“18) Failure free operations require experience with failure. Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure.”

How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

we prefer smaller building blocks like redis and kafka over message brokers like rabbitmq, qpid, etc.

focus on the interfaces between your systems

ad-hoc JSON serialization formats are fraught

there’s more than one way to do it

your mileage will vary

http://tx.pignata.com Thank You! @jpignata

ps we’re hiring!
