Simple Solutions for Complex Problems

Bay Area NATS Meetup 3/22/2016

Tyler Treat

March 22, 2016

Transcript

  1. Simple Solutions

    for Complex Problems
    Tyler Treat / Workiva
    Bay Area NATS Meetup 3/22/2016


  2. • Messaging tech lead at Workiva
    • Platform infrastructure
    • Distributed systems
    • bravenewgeek.com
    @tyler_treat

    [email protected]
    ABOUT THE SPEAKER


  3. • Embracing the reality of complex systems
    • Using simplicity to your advantage
    • Why NATS?
    • How Workiva uses NATS
    ABOUT THIS TALK


  4. There are a lot of parallels between real-world
    systems and distributed software systems.


  5. The world is eventually consistent…


  6. …and the database is just
    an optimization.[1]
    [1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html


  7. “There will be no further print editions
    [of the Merck Manual]. Publishing a
    printed book every five years and
    sending reams of paper around the
    world on trucks, planes, and boats is
    no longer the optimal way to provide
    medical information.”
    Dr. Robert S. Porter, Editor-in-Chief, The Merck Manuals


  8. Programmers find asynchrony hard
    to reason about, but the truth is…


  9. Life is mostly asynchronous.


  10. What does this mean for us as
    programmers?


  11. [Timeline plotted against time/complexity: timesharing → monoliths → SOA →
    virtualization → microservices → ???]
    Complicated made complex…


  12. Distributed!


  13. Distributed computation is inherently asynchronous
    and the network is inherently unreliable[2]…
    [2] http://queue.acm.org/detail.cfm?id=2655736


  14. …but the natural tendency is to build distributed
    systems as if they aren’t distributed at all
    because it’s easy to reason about.
    strong consistency - reliable messaging - predictability


  15. • Complicated algorithms
    • Transaction managers
    • Coordination services
    • Distributed locking
    What’s in a guarantee?


  16. • Message handed to the transport layer?
    • Enqueued in the recipient’s mailbox?
    • Recipient started processing it?
    • Recipient finished processing it?
    What’s a delivery guarantee?


  17. Each of these has a very different set of
    conditions, constraints, and costs.


  18. Guaranteed, ordered, exactly-once delivery
    is expensive (if not impossible[3]).
    [3] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/


  19. Over-engineered


  20. Difficult to deploy & operate


  21. At large scale, guarantees will give out.


  22. 0.1% failure at scale is huge.


  23. Replayable > Guaranteed


  24. Replayable > Guaranteed
    Idempotent > Exactly-once


  25. Replayable > Guaranteed
    Idempotent > Exactly-once
    Commutative > Ordered

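    A concrete way to apply the consumer side of these trade-offs is to accept
    at-least-once delivery and deduplicate on a message ID before applying side
    effects. A minimal Go sketch against the NATS client; the Event shape, the
    subject name, and the in-memory "seen" set are illustrative assumptions,
    not anything NATS itself provides:

    package main

    import (
        "encoding/json"
        "log"
        "sync"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    // Event is a hypothetical message body; only the ID matters for dedup.
    type Event struct {
        ID      string `json:"id"`
        Payload string `json:"payload"`
    }

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        var (
            mu   sync.Mutex
            seen = map[string]bool{} // a production system would use a durable store
        )

        // Replays are harmless because the handler is idempotent: an event
        // whose ID has already been seen is simply skipped.
        nc.Subscribe("events.created", func(m *nats.Msg) {
            var e Event
            if err := json.Unmarshal(m.Data, &e); err != nil {
                return
            }
            mu.Lock()
            dup := seen[e.ID]
            seen[e.ID] = true
            mu.Unlock()
            if dup {
                return
            }
            // ...apply the side effect once per ID...
        })

        select {} // block forever (demo only)
    }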

  26. But delivery != processing


  27. Also, what does it even mean to
    “process” a message?


  28. It depends on the business context!


  29. If you need business-level guarantees,
    build them into the business layer.

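    For instance, a business-level "processed" acknowledgement can be layered on
    plain NATS request/reply: the consumer replies only after it has actually done
    the work, and the producer retries until it sees that reply. A rough Go sketch;
    the subject, payload, timeout, and retry policy are illustrative assumptions:

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Consumer: reply only after the work has actually been done.
        nc.Subscribe("orders.create", func(m *nats.Msg) {
            // ...durably process the order here...
            nc.Publish(m.Reply, []byte("ok")) // business-level ack
        })

        // Producer: retry until the business-level ack arrives. Combined with an
        // idempotent consumer, this gives effectively-once processing without
        // exactly-once delivery.
        for attempt := 0; attempt < 5; attempt++ {
            if _, err := nc.Request("orders.create", []byte(`{"id":"42"}`), 2*time.Second); err == nil {
                return // processed
            }
            log.Println("no ack yet, retrying")
        }
        log.Println("giving up; park the request for later replay")
    }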

  30. We can always build stronger guarantees on top,
    but we can’t always remove them from below.


  31. End-to-end system semantics matter much more
    than the semantics of an individual building block[4].
    [4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf


  32. Embrace the chaos!


  33. “Simplicity is the ultimate sophistication.”


  34. EMBRACING THE CHAOS MEANS LOOKING AT THE NEGATIVE SPACE.


  35. A simple technology in a sea of complexity.


  36. Simple doesn’t mean easy.
    [5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development


  37. “Simple can be harder than complex.
    You have to work hard to get your thinking
    clean to make it simple. But it’s worth it in
    the end because once you get there, you
    can move mountains.”


  38. • Wdesk: platform for enterprises to collect, manage,
    and report critical business data in real time
    • Increasing amounts of data and complexity of formats
    • Cloud solution:
    - Data accuracy
    - Secure
    - Highly available
    - Scalable
    - Mobile-enabled
    About Workiva


  39. • First solution built on Google App Engine
    • Scaling new solutions requires a service-oriented approach
    • Scaling new services requires a low-latency communication backplane
    About Workiva


  40. Availability over everything.


  41. • Always on, always available
    • Protects itself at all costs: no compromises on performance
    • Disconnects slow consumers and lazy listeners
    • Clients have automatic failover and reconnect logic (see the sketch below)
    • Clients buffer messages while temporarily partitioned
    Availability over Everything

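    Most of this failover behaviour is client-side configuration. A hedged Go
    sketch using the Go client's connection options; the server URLs and handler
    bodies are placeholders, and the option names follow the Go client (older
    versions expose the same settings through an Options struct, and other
    language clients differ):

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(
            "nats://nats-1:4222,nats://nats-2:4222",  // cluster seed URLs (placeholders)
            nats.MaxReconnects(-1),                   // never stop trying to reconnect
            nats.ReconnectWait(500*time.Millisecond),
            nats.ReconnectBufSize(8*1024*1024),       // buffer outbound msgs while partitioned
            nats.DisconnectHandler(func(_ *nats.Conn) { log.Println("disconnected") }),
            nats.ReconnectHandler(func(c *nats.Conn) { log.Println("reconnected to", c.ConnectedUrl()) }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Publishes issued while the client is reconnecting are buffered (up to
        // the reconnect buffer size) and flushed once the connection returns.
        nc.Publish("health.ping", []byte("hello"))
        nc.Flush()
    }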

  42. Simplicity as a feature.


  43. • Single, lightweight binary
    • Embraces the “negative space”:
    - Simplicity → high performance
    - No complicated configuration or external dependencies (e.g. ZooKeeper)
    - No fragile guarantees → face complexity head-on, encourage async
    • Simple pub/sub semantics provide a versatile primitive (see the sketch below):
    - Fan-in
    - Fan-out
    - Request/response
    - Distributed queueing
    • Simple text-based wire protocol
    Simplicity as a Feature

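    The same tiny API covers all four patterns listed above. A minimal Go sketch;
    the subject names and the queue group name are arbitrary examples:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Fan-out: every subscriber on a subject gets every message.
        nc.Subscribe("updates", func(m *nats.Msg) { fmt.Println("got:", string(m.Data)) })

        // Distributed queueing: subscribers sharing a queue group split the work,
        // so running N instances load-balances automatically.
        nc.QueueSubscribe("work", "workers", func(m *nats.Msg) { /* handle job */ })

        // Request/response: reply on the inbox subject carried in msg.Reply.
        nc.Subscribe("time.now", func(m *nats.Msg) {
            nc.Publish(m.Reply, []byte(time.Now().Format(time.RFC3339)))
        })
        if resp, err := nc.Request("time.now", nil, time.Second); err == nil {
            fmt.Println("server time:", string(resp.Data))
        }

        // Fan-in is just many publishers sharing one subject.
        nc.Publish("updates", []byte("hello"))
        nc.Flush()
        time.Sleep(100 * time.Millisecond) // let async handlers run (demo only)
    }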

  44. Fast as hell.


  45. [Message queue latency benchmark charts][6]
    [6] http://bravenewgeek.com/benchmarking-message-queue-latency/


  46. • Fast, predictable performance at scale and at tail
    • ~8 million messages per second
    • Auto-pruning of interest graph allows efficient
    routing
    • When SLAs matter, it’s hard to beat NATS
    Fast as Hell


  47. • Low-latency service bus
    • Pub/Sub
    • RPC
    How We Use NATS


  48.–53. [Architecture diagrams: Web Clients connect through a Service Gateway,
    which talks to a growing set of Services over the NATS backplane.]

  54. [Architecture diagram: Services communicating with each other over NATS.]

  55. “Just send this thing containing these fields
    serialized in this way using that encoding to
    this topic!”


  56. “Just subscribe to this topic and decode using that
    encoding then deserialize in this way and extract
    these fields from this thing!”


  57. Pub/Sub is meant to decouple services
    but often ends up coupling the teams
    developing them.


  58. How do we evolve services in isolation
    and reduce development overhead?


  59. • Extension of Apache Thrift
    • IDL and cross-language, code-generated pub/sub APIs
    • Allows developers to think in terms of services and APIs
    rather than opaque messages and topics
    • Allows APIs to evolve while maintaining compatibility
    • Transports are pluggable (we use NATS)
    Frugal RPC


  60. struct Event {
      1: i64 id,
      2: string message,
      3: i64 timestamp,
    }

    scope Events prefix {user} {
      EventCreated: Event
      EventUpdated: Event
      EventDeleted: Event
    }

    // generated pub/sub API
    subscriber.SubscribeEventCreated(
      "user-1", func(e *event.Event) {
        fmt.Println(e)
      },
    )
    . . .
    publisher.PublishEventCreated(
      "user-1", event.NewEvent())


  61. • Service instances form a queue group
    • Client “connects” to an instance by publishing a message to the service queue group
    • Serving instance sets up an inbox for the client and sends it back in the response
    • Client sends requests to the inbox (see the sketch below)
    • Connecting is cheap: no service discovery and no sockets to create, just a request/response
    • Heartbeats used to check the health of server and client
    • Very early prototype code: https://github.com/workiva/thrift-nats
    RPC over NATS

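    The "connect" step described above amounts to a single request/reply that
    hands the client a private inbox. A very rough Go sketch of that idea; this
    is not the thrift-nats code, the subject names are made up, and heartbeating
    and session cleanup are omitted:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Server side: instances join a queue group on the service subject, so
        // exactly one instance handles each "connect".
        nc.QueueSubscribe("svc.accounts.connect", "accounts", func(m *nats.Msg) {
            inbox := nats.NewInbox() // per-client session subject
            nc.Subscribe(inbox, func(req *nats.Msg) {
                // ...dispatch the RPC and reply on req.Reply...
                nc.Publish(req.Reply, []byte("pong"))
            })
            nc.Publish(m.Reply, []byte(inbox)) // hand the inbox back to the client
        })

        // Client side: "connecting" is just one request/reply, with no service
        // discovery and no new sockets.
        resp, err := nc.Request("svc.accounts.connect", nil, time.Second)
        if err != nil {
            log.Fatal(err)
        }
        session := string(resp.Data)

        // Subsequent requests go straight to the session inbox.
        reply, err := nc.Request(session, []byte("ping"), time.Second)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(reply.Data))
    }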

  62. • Store JSON containing cluster membership in S3
    • Container reads the JSON on startup and creates routes with the correct credentials (see the sketch below)
    • Services only talk to the NATS daemon on their VM via localhost
    • Don’t have to worry about encryption between services and NATS, only between NATS peers
    NATS per VM

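    The startup step is essentially: fetch a membership document, then template
    it into the local server's cluster routes. A hedged Go sketch of just that
    transformation; the JSON shape, field names, and credentials are assumptions
    about such a setup rather than anything NATS prescribes, and the S3 fetch is
    omitted:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
    )

    // membership mirrors a hypothetical JSON document kept in S3.
    type membership struct {
        User  string   `json:"user"`
        Pass  string   `json:"pass"`
        Hosts []string `json:"hosts"` // NATS peers, one per VM
    }

    func main() {
        // In the real setup this blob would be read from S3 at container startup.
        raw := []byte(`{"user":"route_user","pass":"s3cr3t","hosts":["10.0.0.1","10.0.0.2","10.0.0.3"]}`)

        var m membership
        if err := json.Unmarshal(raw, &m); err != nil {
            log.Fatal(err)
        }

        // Each VM's NATS server gets a route to every peer; services themselves
        // only ever talk to localhost, so only these peer links need encryption.
        fmt.Println("routes = [")
        for _, h := range m.Hosts {
            fmt.Printf("  nats-route://%s:%s@%s:6222\n", m.User, m.Pass, h)
        }
        fmt.Println("]")
    }

    The emitted nats-route:// URLs would then be dropped into the local server's
    cluster configuration before it starts.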

  63. • Only messages intended for a process on another host go over
    the network, since the NATS cluster maintains the interest graph
    • Greatly reduces network hops (usually 0 vs. 2-3)
    • If the local NATS daemon goes down, restart it automatically
    NATS per VM


  64. • Doesn’t scale to a large number of VMs
    • Fairly easy to transition to a floating NATS cluster or to running on a subset of machines per AZ
    • NATS communication is abstracted from the service
    • Send messages to services without thinking about routing or service discovery
    • Queue groups provide service load balancing
    NATS per VM


  65. • We’re a SaaS company, not an infrastructure company
    • High availability
    • Operational simplicity
    • Performance
    • First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js
    NATS as a Messaging Backplane


  66. “Every solution to every problem is simple…
    It's the distance between the two where the mystery lies.”
    –Derek Landy, Skulduggery Pleasant


  67. @tyler_treat
    github.com/tylertreat
    bravenewgeek.com
    Thanks!
