Slide 1

Slide 1 text

Simple Solutions for Complex Problems
Tyler Treat / Workiva
Bay Area NATS Meetup, 3/22/2016

Slide 2

Slide 2 text

ABOUT THE SPEAKER
• Messaging tech lead at Workiva
• Platform infrastructure
• Distributed systems
• bravenewgeek.com
• @tyler_treat

Slide 3

Slide 3 text

ABOUT THIS TALK
• Embracing the reality of complex systems
• Using simplicity to your advantage
• Why NATS?
• How Workiva uses NATS

Slide 4

Slide 4 text

There are a lot of parallels between real-world systems and
 distributed software systems.

Slide 5

Slide 5 text

The world is eventually consistent…

Slide 6

Slide 6 text

…and the database is just an optimization.[1]
[1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html

Slide 7

Slide 7 text

“There will be no further print editions [of the Merck Manual]. Publishing a printed book every five years and sending reams of paper around the world on trucks, planes, and boats is no longer the optimal way to provide medical information.”
Dr. Robert S. Porter, Editor-in-Chief, The Merck Manuals

Slide 8

Slide 8 text

Programmers find asynchrony hard to reason about, but the truth is…

Slide 9

Slide 9 text

Life is mostly asynchronous.

Slide 10

Slide 10 text

What does this mean for us as programmers?

Slide 11

Slide 11 text

Complicated made complex…
[Chart, axes time and complexity: timesharing, monoliths, SOA, virtualization, microservices, ???]

Slide 12

Slide 12 text

Distributed!

Slide 13

Slide 13 text

Distributed computation is inherently asynchronous and the network is inherently unreliable[2]…
[2] http://queue.acm.org/detail.cfm?id=2655736

Slide 14

Slide 14 text

…but the natural tendency is to build distributed systems as if they aren’t distributed at all because it’s easy to reason about.
strong consistency - reliable messaging - predictability

Slide 15

Slide 15 text

What’s in a guarantee?
• Complicated algorithms
• Transaction managers
• Coordination services
• Distributed locking

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

What’s a delivery guarantee?
• Message handed to the transport layer?
• Enqueued in the recipient’s mailbox?
• Recipient started processing it?
• Recipient finished processing it?

Slide 18

Slide 18 text

Each of these has a very different set of conditions, constraints, and costs.

Slide 19

Slide 19 text

Guaranteed, ordered, exactly-once delivery is expensive (if not impossible[3]).
[3] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

Slide 20

Slide 20 text

Over-engineered

Slide 21

Slide 21 text

Complex

Slide 22

Slide 22 text

Difficult to deploy & operate

Slide 23

Slide 23 text

Fragile

Slide 24

Slide 24 text

Slow

Slide 25

Slide 25 text

At large scale, guarantees will give out.

Slide 26

Slide 26 text

0.1% failure at scale is huge.
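
A rough worked example of why a 0.1% failure rate is large at scale. The message rates below are hypothetical illustration values, not figures from the talk, and `failedPerDay` is a made-up helper name.

```go
package main

import "fmt"

// failedPerDay returns how many messages fail in one day, given a
// sustained rate in messages per second and a failure rate expressed
// in tenths of a percent (1 means 0.1%). Both inputs are hypothetical.
func failedPerDay(msgsPerSec, failureTenthsOfPercent int64) int64 {
	const secondsPerDay = 86400
	return msgsPerSec * secondsPerDay * failureTenthsOfPercent / 1000
}

func main() {
	// Assumed traffic levels, chosen only to show the order of magnitude.
	for _, rate := range []int64{1000, 100000} {
		fmt.Printf("%d msg/s at 0.1%% failure: %d failed messages per day\n",
			rate, failedPerDay(rate, 1))
	}
}
```

Even the smaller assumed rate leaves tens of thousands of failed messages per day to detect and replay.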

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Replayable > Guaranteed

Slide 30

Slide 30 text

Replayable > Guaranteed
Idempotent > Exactly-once

Slide 31

Slide 31 text

Replayable > Guaranteed
Idempotent > Exactly-once
Commutative > Ordered
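
A minimal sketch of what these trade-offs can look like in a consumer, assuming each message carries a unique ID. `Deposit` and `Account` are hypothetical names for illustration; the talk does not show an implementation. Duplicates from a replay are ignored (idempotent), and because addition is commutative, arrival order does not change the result.

```go
package main

import "fmt"

// Deposit is a hypothetical message; ID is assumed to be unique per message.
type Deposit struct {
	ID     string
	Amount int64
}

// Account applies deposits idempotently by remembering which IDs it has seen.
type Account struct {
	balance int64
	seen    map[string]bool
}

func NewAccount() *Account {
	return &Account{seen: make(map[string]bool)}
}

// Apply processes a deposit and reports whether it changed state.
// A redelivered message (same ID) is a no-op.
func (a *Account) Apply(d Deposit) bool {
	if a.seen[d.ID] {
		return false
	}
	a.seen[d.ID] = true
	a.balance += d.Amount // addition commutes, so order does not matter
	return true
}

func (a *Account) Balance() int64 { return a.balance }

func main() {
	a := NewAccount()
	// Out of order and with a duplicate, as a replayed stream might arrive.
	for _, d := range []Deposit{{"m3", 30}, {"m1", 10}, {"m3", 30}, {"m2", 20}} {
		a.Apply(d)
	}
	fmt.Println(a.Balance())
}
```

In a real service the set of seen IDs would need to be bounded or persisted; that is left out of this sketch.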

Slide 32

Slide 32 text

But delivery != processing

Slide 33

Slide 33 text

Also, what does it even mean to “process” a message?

Slide 34

Slide 34 text

It depends on the
 business context!

Slide 35

Slide 35 text

If you need business-level guarantees, build them into
 the business layer.

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

We can always build
 stronger guarantees on top,
 but we can’t always remove
 them from below.

Slide 38

Slide 38 text

End-to-end system semantics matter much more than the semantics of an individual building block[4].
[4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf

Slide 39

Slide 39 text

Embrace the chaos!

Slide 40

Slide 40 text

“Simplicity is the ultimate sophistication.”

Slide 41

Slide 41 text

EMBRACING THE CHAOS MEANS
 LOOKING AT THE NEGATIVE SPACE.

Slide 42

Slide 42 text

A simple technology
 in a sea of complexity.

Slide 43

Slide 43 text

Simple doesn’t mean easy.[5]
[5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development

Slide 44

Slide 44 text

“Simple can be harder than complex. You have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because once you get there, you can move mountains.”

Slide 45

Slide 45 text

About Workiva
• Wdesk: platform for enterprises to collect, manage, and report critical business data in real time
• Increasing amounts of data and complexity of formats
• Cloud solution:
 - Data accuracy
 - Secure
 - Highly available
 - Scalable
 - Mobile-enabled

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

About Workiva
• First solution built on Google App Engine
• Scaling new solutions requires service-oriented approach
• Scaling new services requires a low-latency communication backplane

Slide 49

Slide 49 text

Why NATS?

Slide 50

Slide 50 text

Availability
 over
 everything.

Slide 51

Slide 51 text

Availability over Everything
• Always on, always available
• Protects itself at all costs: no compromises on performance
• Disconnects slow consumers and lazy listeners
• Clients have automatic failover and reconnect logic
• Clients buffer messages while temporarily partitioned

Slide 52

Slide 52 text

Simplicity as a feature.

Slide 53

Slide 53 text

Simplicity as a Feature
• Single, lightweight binary
• Embraces the “negative space”:
 - Simplicity -> high performance
 - No complicated configuration or external dependencies (e.g. ZooKeeper)
 - No fragile guarantees -> face complexity head-on, encourage async
• Simple pub/sub semantics provide a versatile primitive:
 - Fan-in
 - Fan-out
 - Request/response
 - Distributed queueing
• Simple text-based wire protocol

Slide 54

Slide 54 text

Fast as hell.

Slide 55

Slide 55 text

[6] http://bravenewgeek.com/benchmarking-message-queue-latency/

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

Fast as Hell
• Fast, predictable performance at scale and at tail
• ~8 million messages per second
• Auto-pruning of interest graph allows efficient routing
• When SLAs matter, it’s hard to beat NATS

Slide 58

Slide 58 text

How We Use NATS
• Low-latency service bus
• Pub/Sub
• RPC

Slide 59

Slide 59 text

[Diagram: three Web Clients, a Service Gateway, NATS, and three Services]

Slide 63

Slide 63 text

[Diagram: three Web Clients, a Service Gateway, NATS, and five Services]

Slide 64

Slide 64 text

[Diagram: three Web Clients, a Service Gateway, NATS, and three Services]

Slide 65

Slide 65 text

[Diagram: three Services and NATS]

Slide 66

Slide 66 text

Pub/Sub

Slide 67

Slide 67 text

“Just send this thing containing these fields serialized in this way using that encoding to this topic!”

Slide 68

Slide 68 text

“Just subscribe to this topic and decode using that encoding then deserialize in
 this way and extract these fields from
 this thing!”

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

Pub/Sub is meant to decouple services but often ends up coupling the teams developing them.

Slide 71

Slide 71 text

How do we evolve services in isolation and reduce development overhead?

Slide 72

Slide 72 text

Frugal RPC
• Extension of Apache Thrift
• IDL and cross-language, code-generated pub/sub APIs
• Allows developers to think in terms of services and APIs rather than opaque messages and topics
• Allows APIs to evolve while maintaining compatibility
• Transports are pluggable (we use NATS)

Slide 73

Slide 73 text

struct Event {
    1: i64 id,
    2: string message,
    3: i64 timestamp,
}

scope Events prefix {user} {
    EventCreated: Event
    EventUpdated: Event
    EventDeleted: Event
}

generated:

subscriber.SubscribeEventCreated(
    "user-1", func(e *event.Event) {
        fmt.Println(e)
    },
)
. . .
publisher.PublishEventCreated(
    "user-1", event.NewEvent())

Slide 74

Slide 74 text

RPC over NATS
• Service instances form a queue group
• Client “connects” to instance by publishing a message to the service queue group
• Serving instance sets up an inbox for the client and sends it back in the response
• Client sends requests to the inbox
• Connecting is cheap: no service discovery and no sockets to create, just a request/response
• Heartbeats used to check health of server and client
• Very early prototype code: https://github.com/workiva/thrift-nats
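
The connect handshake above can be sketched with a toy in-memory bus. This is not the thrift-nats code and not the NATS client API: `Bus`, `Serve`, and the subject names are hypothetical stand-ins, delivery is synchronous, the queue group is approximated with round-robin (NATS itself picks a member at random), and heartbeats are left out.

```go
package main

import "fmt"

// Handler receives a request payload and returns a reply.
type Handler func(data string) string

// Bus is a hypothetical in-memory stand-in for a NATS connection.
type Bus struct {
	subs   map[string]Handler   // plain subscriptions, used for inboxes
	queues map[string][]Handler // queue-group members per subject
	next   map[string]int       // round-robin cursor per subject
	inbox  int                  // counter for unique inbox names
}

func NewBus() *Bus {
	return &Bus{
		subs:   make(map[string]Handler),
		queues: make(map[string][]Handler),
		next:   make(map[string]int),
	}
}

func (b *Bus) Subscribe(subject string, h Handler) { b.subs[subject] = h }

func (b *Bus) QueueSubscribe(subject string, h Handler) {
	b.queues[subject] = append(b.queues[subject], h)
}

// NewInbox returns a unique subject name.
func (b *Bus) NewInbox() string {
	b.inbox++
	return fmt.Sprintf("_INBOX.%d", b.inbox)
}

// Request delivers to exactly one handler and returns its reply.
func (b *Bus) Request(subject, data string) (string, error) {
	if members := b.queues[subject]; len(members) > 0 {
		i := b.next[subject] % len(members)
		b.next[subject]++
		return members[i](data), nil
	}
	if h, ok := b.subs[subject]; ok {
		return h(data), nil
	}
	return "", fmt.Errorf("no responders on %q", subject)
}

// Serve joins the service's queue group. On a connect message the
// instance sets up a dedicated inbox for the client and replies with it.
func Serve(b *Bus, service, instance string) {
	b.QueueSubscribe(service, func(string) string {
		inbox := b.NewInbox()
		b.Subscribe(inbox, func(req string) string {
			return instance + " handled " + req
		})
		return inbox
	})
}

func main() {
	b := NewBus()
	Serve(b, "svc.echo", "instance-a")
	Serve(b, "svc.echo", "instance-b")

	// "Connect": one request/response to the queue group yields an inbox.
	inbox, err := b.Request("svc.echo", "connect")
	if err != nil {
		panic(err)
	}
	// Later requests go straight to the serving instance's inbox.
	reply, err := b.Request(inbox, "ping")
	if err != nil {
		panic(err)
	}
	fmt.Println(inbox, reply)
}
```

The client never learns an address or opens a socket to the instance; it only holds a subject name, which is why connecting costs a single round trip.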

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

NATS per VM
• Store JSON containing cluster membership in S3
• Container reads JSON on startup and creates routes with correct credentials
• Services only talk to the NATS daemon on their VM via localhost
• Don’t have to worry about encryption between services and NATS, only between NATS peers
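
A sketch of the startup step, assuming a membership document shaped like the one below. The JSON schema, field names, credentials, and the `RouteURLs` helper are all hypothetical; the talk does not show the real format. Fetching the document from S3 is omitted, so the bytes are passed in directly. The `nats-route://` scheme and port 6222 follow the common NATS server clustering convention.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Member and Membership describe an assumed JSON layout.
type Member struct {
	Host string `json:"host"`
	Port int    `json:"port"`
}

type Membership struct {
	User     string   `json:"user"`
	Password string   `json:"password"`
	Members  []Member `json:"members"`
}

// RouteURLs parses the membership document and returns one route URL
// per peer, skipping the local host so a node does not route to itself.
func RouteURLs(doc []byte, self string) ([]string, error) {
	var m Membership
	if err := json.Unmarshal(doc, &m); err != nil {
		return nil, fmt.Errorf("parse membership: %w", err)
	}
	var routes []string
	for _, peer := range m.Members {
		if peer.Host == self {
			continue
		}
		u := url.URL{
			Scheme: "nats-route",
			User:   url.UserPassword(m.User, m.Password),
			Host:   fmt.Sprintf("%s:%d", peer.Host, peer.Port),
		}
		routes = append(routes, u.String())
	}
	return routes, nil
}

func main() {
	// Example document with made-up hosts and credentials.
	doc := []byte(`{"user":"route","password":"secret","members":[
		{"host":"10.0.0.1","port":6222},
		{"host":"10.0.0.2","port":6222}]}`)
	routes, err := RouteURLs(doc, "10.0.0.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(routes)
}
```

The resulting list could then be handed to the local NATS daemon as its route configuration.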

Slide 77

Slide 77 text

NATS per VM
• Only messages intended for a process on another host go over the network since NATS cluster maintains interest graph
• Greatly reduces network hops (usually 0 vs. 2-3)
• If local NATS daemon goes down, restart it automatically

Slide 78

Slide 78 text

NATS per VM
• Doesn’t scale to large number of VMs
• Fairly easy to transition to floating NATS cluster or running on a subset of machines per AZ
• NATS communication abstracted from service
• Send messages to services without thinking about routing or service discovery
• Queue groups provide service load balancing

Slide 79

Slide 79 text

NATS as a Messaging Backplane
• We’re a SaaS company, not an infrastructure company
• High availability
• Operational simplicity
• Performance
• First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js

Slide 80

Slide 80 text

“Every solution to every problem is simple… It's the distance between the two where the mystery lies.”
Derek Landy, Skulduggery Pleasant

Slide 81

Slide 81 text

Thanks!
@tyler_treat
github.com/tylertreat
bravenewgeek.com