
Simple Solutions for Complex Problems

Bay Area NATS Meetup 3/22/2016

Tyler Treat

March 22, 2016

Transcript

  1. Simple Solutions for Complex Problems
     Tyler Treat / Workiva
     Bay Area NATS Meetup 3/22/2016
  2. ABOUT THE SPEAKER
     • Messaging tech lead at Workiva
     • Platform infrastructure
     • Distributed systems
     • bravenewgeek.com, @tyler_treat, tyler.treat@workiva.com
  3. ABOUT THIS TALK
     • Embracing the reality of complex systems
     • Using simplicity to your advantage
     • Why NATS?
     • How Workiva uses NATS
  4. There are a lot of parallels between real-world systems and distributed software systems.
  5. The world is eventually consistent…

  6. …and the database is just an optimization.[1] [1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html

  7. “There will be no further print editions [of the Merck Manual]. Publishing a printed book every five years and sending reams of paper around the world on trucks, planes, and boats is no longer the optimal way to provide medical information.”
     Dr. Robert S. Porter, Editor-in-Chief, The Merck Manuals
  8. Programmers find asynchrony hard to reason about, but the truth is…
  9. Life is mostly asynchronous.

  10. What does this mean for us as programmers?

  11. Complicated made complex…
     [Chart: complexity over time: timesharing, monoliths, SOA, virtualization, microservices, ???]
  12. Distributed!

  13. Distributed computation is inherently asynchronous and the network is inherently unreliable[2]… [2] http://queue.acm.org/detail.cfm?id=2655736
  14. …but the natural tendency is to build distributed systems as if they aren’t distributed at all because it’s easy to reason about.
     strong consistency - reliable messaging - predictability
  15. What’s in a guarantee?
     • Complicated algorithms
     • Transaction managers
     • Coordination services
     • Distributed locking
  16. None
  17. What’s a delivery guarantee?
     • Message handed to the transport layer?
     • Enqueued in the recipient’s mailbox?
     • Recipient started processing it?
     • Recipient finished processing it?
  18. Each of these has a very different set of conditions, constraints, and costs.
  19. Guaranteed, ordered, exactly-once delivery is expensive (if not impossible[3]). [3] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
  20. Over-engineered

  21. Complex

  22. Difficult to deploy & operate

  23. Fragile

  24. Slow

  25. At large scale, guarantees will give out.

  26. 0.1% failure at scale is huge.

  27. None
  28. None
  29. Replayable > Guaranteed

  30. Replayable > Guaranteed Idempotent > Exactly-once

  31. Replayable > Guaranteed Idempotent > Exactly-once Commutative > Ordered
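
The three trade-offs above can be illustrated with a small Go sketch. It is not from the talk: the `Event` and `Ledger` types and their fields are hypothetical, and the in-memory `seen` set stands in for whatever durable deduplication store a real service would use. The point is only that a handler which is idempotent (duplicates are ignored) and commutative (addition does not care about order) tolerates replayed, reordered, at-least-once delivery.

```go
package main

import "fmt"

// Event is a hypothetical message; ID and Amount are illustrative fields.
type Event struct {
	ID     string
	Amount int
}

// Ledger applies events idempotently: replaying an event it has already
// seen is a no-op, so redelivery and replay are safe.
type Ledger struct {
	seen  map[string]bool
	total int
}

func NewLedger() *Ledger {
	return &Ledger{seen: make(map[string]bool)}
}

// Apply reports whether the event changed state (false for a duplicate).
// Addition is commutative, so arrival order does not affect the total.
func (l *Ledger) Apply(e Event) bool {
	if l.seen[e.ID] {
		return false
	}
	l.seen[e.ID] = true
	l.total += e.Amount
	return true
}

func (l *Ledger) Total() int { return l.total }

func main() {
	// Event "a" is delivered twice and the events arrive out of order.
	events := []Event{{"c", 1}, {"a", 5}, {"b", 7}, {"a", 5}}
	l := NewLedger()
	for _, e := range events {
		l.Apply(e)
	}
	fmt.Println(l.Total()) // 13
}
```

With a handler like this, the transport only needs to make messages replayable; the stronger properties live in the business layer.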

  32. But delivery != processing

  33. Also, what does it even mean to “process” a message?

  34. It depends on the business context!

  35. If you need business-level guarantees, build them into the business layer.
  36. None
  37. We can always build stronger guarantees on top, but we can’t always remove them from below.
  38. End-to-end system semantics matter much more than the semantics of an individual building block[4]. [4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf
  39. Embrace the chaos!

  40. “Simplicity is the ultimate sophistication.”

  41. EMBRACING THE CHAOS MEANS LOOKING AT THE NEGATIVE SPACE.

  42. A simple technology in a sea of complexity.

  43. Simple doesn’t mean easy.[5] [5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development

  44. “Simple can be harder than complex. You have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because once you get there, you can move mountains.”
  45. About Workiva
     • Wdesk: platform for enterprises to collect, manage, and report critical business data in real time
     • Increasing amounts of data and complexity of formats
     • Cloud solution:
       - Data accuracy
       - Secure
       - Highly available
       - Scalable
       - Mobile-enabled
  46. None
  47. None
  48. About Workiva
     • First solution built on Google App Engine
     • Scaling new solutions requires service-oriented approach
     • Scaling new services requires a low-latency communication backplane
  49. Why NATS?

  50. Availability over everything.

  51. Availability over Everything
     • Always on, always available
     • Protects itself at all costs: no compromises on performance
     • Disconnects slow consumers and lazy listeners
     • Clients have automatic failover and reconnect logic
     • Clients buffer messages while temporarily partitioned
  52. Simplicity as a feature.

  53. Simplicity as a Feature
     • Single, lightweight binary
     • Embraces the “negative space”:
       - Simplicity -> high performance
       - No complicated configuration or external dependencies (e.g. ZooKeeper)
       - No fragile guarantees -> face complexity head-on, encourage async
     • Simple pub/sub semantics provide a versatile primitive:
       - Fan-in
       - Fan-out
       - Request/response
       - Distributed queueing
     • Simple text-based wire protocol
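
To give a feel for how small the text-based wire protocol is, here is a sketch in Go that builds two client frames by hand, following the `PUB <subject> [reply-to] <#bytes>` and `SUB <subject> [queue group] <sid>` forms from the NATS client protocol. The helper names `pubFrame` and `subFrame` and the subjects are made up for illustration; a real application would use a client library rather than format frames itself.

```go
package main

import (
	"fmt"
	"strings"
)

// pubFrame builds a PUB frame:
//   PUB <subject> [reply-to] <#bytes>\r\n<payload>\r\n
func pubFrame(subject, reply string, payload []byte) string {
	var b strings.Builder
	b.WriteString("PUB " + subject)
	if reply != "" {
		b.WriteString(" " + reply)
	}
	fmt.Fprintf(&b, " %d\r\n", len(payload))
	b.Write(payload)
	b.WriteString("\r\n")
	return b.String()
}

// subFrame builds a SUB frame:
//   SUB <subject> [queue group] <sid>\r\n
func subFrame(subject, queue, sid string) string {
	if queue != "" {
		return fmt.Sprintf("SUB %s %s %s\r\n", subject, queue, sid)
	}
	return fmt.Sprintf("SUB %s %s\r\n", subject, sid)
}

func main() {
	// %q makes the CRLF terminators visible.
	fmt.Printf("%q\n", pubFrame("events.user-1", "", []byte("hello")))
	fmt.Printf("%q\n", subFrame("events.*", "workers", "1"))
}
```

Fan-out is several plain `SUB`s on one subject, distributed queueing is the same `SUB` with a queue group, and request/response is a `PUB` that carries a reply-to subject.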
  54. Fast as hell.

  55. [6] http://bravenewgeek.com/benchmarking-message-queue-latency/

  56. None
  57. Fast as Hell
     • Fast, predictable performance at scale and at tail
     • ~8 million messages per second
     • Auto-pruning of interest graph allows efficient routing
     • When SLAs matter, it’s hard to beat NATS
  58. How We Use NATS
     • Low-latency service bus
     • Pub/Sub
     • RPC
  59. [Diagram: three Services, NATS, Service Gateway, three Web Clients]

  60. [Diagram: three Services, NATS, Service Gateway, three Web Clients]

  61. [Diagram: three Services, NATS, Service Gateway, three Web Clients]

  62. [Diagram: three Services, NATS, Service Gateway, three Web Clients]

  63. [Diagram: five Services, NATS, Service Gateway, three Web Clients]

  64. [Diagram: three Web Clients, Service Gateway, NATS, three Services]

  65. [Diagram: three Services, NATS]

  66. Pub/Sub

  67. “Just send this thing containing these fields serialized in this way using that encoding to this topic!”

  68. “Just subscribe to this topic and decode using that encoding then deserialize in this way and extract these fields from this thing!”
  69. None
  70. Pub/Sub is meant to decouple services but often ends up coupling the teams developing them.

  71. How do we evolve services in isolation and reduce development overhead?
  72. Frugal RPC
     • Extension of Apache Thrift
     • IDL and cross-language, code-generated pub/sub APIs
     • Allows developers to think in terms of services and APIs rather than opaque messages and topics
     • Allows APIs to evolve while maintaining compatibility
     • Transports are pluggable (we use NATS)
  73. Frugal IDL:

         struct Event {
             1: i64 id,
             2: string message,
             3: i64 timestamp,
         }

         scope Events prefix {user} {
             EventCreated: Event
             EventUpdated: Event
             EventDeleted: Event
         }

     Generated pub/sub API (Go):

         subscriber.SubscribeEventCreated(
             "user-1", func(e *event.Event) {
                 fmt.Println(e)
             },
         )
         . . .
         publisher.PublishEventCreated(
             "user-1", event.NewEvent())
  74. RPC over NATS
     • Service instances form a queue group
     • Client “connects” to instance by publishing a message to the service queue group
     • Serving instance sets up an inbox for the client and sends it back in the response
     • Client sends requests to the inbox
     • Connecting is cheap: no service discovery and no sockets to create, just a request/response
     • Heartbeats used to check health of server and client
     • Very early prototype code: https://github.com/workiva/thrift-nats
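
The connect handshake described above can be traced with a toy, in-process broker. This is a sketch under loud assumptions: `Bus`, `StartInstance`, the `svc.connect` subject, and the `svc` queue name are all invented for illustration, nothing here is the thrift-nats code or the NATS client API, heartbeats are left out, and queue members are chosen round-robin only to keep the run deterministic (a real NATS server makes its own choice among queue group members).

```go
package main

import "fmt"

// Msg mirrors the shape of a NATS message: subject, optional reply subject, payload.
type Msg struct {
	Subject, Reply, Data string
}

type subscription struct {
	queue   string // empty means a plain (non-queue) subscriber
	handler func(Msg)
}

// Bus is a toy in-process broker used only to trace the handshake.
type Bus struct {
	subs    map[string][]subscription
	cursor  map[string]int // per-queue-group round-robin position
	inboxes int
}

func NewBus() *Bus {
	return &Bus{subs: map[string][]subscription{}, cursor: map[string]int{}}
}

func (b *Bus) Subscribe(subject, queue string, h func(Msg)) {
	b.subs[subject] = append(b.subs[subject], subscription{queue, h})
}

func (b *Bus) newInbox() string {
	b.inboxes++
	return fmt.Sprintf("_INBOX.%d", b.inboxes)
}

// Publish delivers to every plain subscriber and to exactly one member of
// each queue group subscribed to the subject.
func (b *Bus) Publish(m Msg) {
	groups := map[string][]func(Msg){}
	var order []string
	for _, s := range b.subs[m.Subject] {
		if s.queue == "" {
			s.handler(m)
			continue
		}
		if _, ok := groups[s.queue]; !ok {
			order = append(order, s.queue)
		}
		groups[s.queue] = append(groups[s.queue], s.handler)
	}
	for _, q := range order {
		members := groups[q]
		key := m.Subject + "|" + q
		members[b.cursor[key]%len(members)](m)
		b.cursor[key]++
	}
}

// Request publishes with a fresh reply inbox and returns the reply, if any.
func (b *Bus) Request(subject, data string) (string, bool) {
	inbox := b.newInbox()
	var reply string
	var ok bool
	b.Subscribe(inbox, "", func(m Msg) { reply, ok = m.Data, true })
	b.Publish(Msg{Subject: subject, Reply: inbox, Data: data})
	delete(b.subs, inbox)
	return reply, ok
}

// StartInstance joins the service queue group. On "connect" it creates a
// per-client inbox, serves requests on it, and returns the inbox name.
func StartInstance(b *Bus, name string) {
	b.Subscribe("svc.connect", "svc", func(m Msg) {
		session := b.newInbox()
		b.Subscribe(session, "", func(req Msg) {
			b.Publish(Msg{Subject: req.Reply, Data: name + " handled " + req.Data})
		})
		b.Publish(Msg{Subject: m.Reply, Data: session})
	})
}

func main() {
	b := NewBus()
	StartInstance(b, "instance-1")
	StartInstance(b, "instance-2")

	session, _ := b.Request("svc.connect", "") // the "connect" step
	resp, _ := b.Request(session, "ping")      // later requests go to the inbox
	fmt.Println(resp)
}
```

The client never learns an address: it only ever knows the service subject and, after connecting, the inbox it was handed.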
  75. None
  76. NATS per VM
     • Store JSON containing cluster membership in S3
     • Container reads JSON on startup and creates routes w/ correct credentials
     • Services only talk to the NATS daemon on their VM via localhost
     • Don’t have to worry about encryption between services and NATS, only between NATS peers
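
The startup step above might look roughly like the following Go sketch. The slides do not show the JSON layout, so the `Membership` schema, its field names, the addresses, and port 6222 are all assumptions; fetching the document from S3 is left out, and the function only turns an already-downloaded document into `nats-route://` URLs that could then be handed to the server's routes configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Membership is a hypothetical schema for the cluster document kept in S3.
type Membership struct {
	User     string `json:"user"`
	Password string `json:"password"`
	Peers    []Peer `json:"peers"`
}

// Peer is one NATS daemon in the cluster (hypothetical fields).
type Peer struct {
	Host string `json:"host"`
	Port int    `json:"port"`
}

// routeURLs returns one route URL per peer, skipping this VM's own host.
func routeURLs(doc []byte, self string) ([]string, error) {
	var m Membership
	if err := json.Unmarshal(doc, &m); err != nil {
		return nil, fmt.Errorf("parse membership: %w", err)
	}
	var routes []string
	for _, p := range m.Peers {
		if p.Host == self {
			continue
		}
		u := url.URL{
			Scheme: "nats-route",
			User:   url.UserPassword(m.User, m.Password),
			Host:   fmt.Sprintf("%s:%d", p.Host, p.Port),
		}
		routes = append(routes, u.String())
	}
	return routes, nil
}

func main() {
	// Stand-in for the document a container would read from S3 on startup.
	doc := []byte(`{
		"user": "route_user",
		"password": "s3cret",
		"peers": [
			{"host": "10.0.0.1", "port": 6222},
			{"host": "10.0.0.2", "port": 6222}
		]
	}`)
	routes, err := routeURLs(doc, "10.0.0.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(routes)
}
```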
  77. NATS per VM
     • Only messages intended for a process on another host go over the network since NATS cluster maintains interest graph
     • Greatly reduces network hops (usually 0 vs. 2-3)
     • If local NATS daemon goes down, restart it automatically
  78. NATS per VM
     • Doesn’t scale to large number of VMs
     • Fairly easy to transition to floating NATS cluster or running on a subset of machines per AZ
     • NATS communication abstracted from service
     • Send messages to services without thinking about routing or service discovery
     • Queue groups provide service load balancing
  79. NATS as a Messaging Backplane
     • We’re a SaaS company, not an infrastructure company
     • High availability
     • Operational simplicity
     • Performance
     • First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js
  80. “Every solution to every problem is simple… It's the distance between the two where the mystery lies.”
     –Derek Landy, Skulduggery Pleasant
  81. Thanks!
     @tyler_treat, github.com/tylertreat, bravenewgeek.com