Simple Solutions for Complex Problems

Bay Area NATS Meetup 3/22/2016

Tyler Treat

March 22, 2016

Transcript

  1. Simple Solutions

    for Complex Problems
    Tyler Treat / Workiva
    Bay Area NATS Meetup 3/22/2016


  2. • Messaging tech lead at Workiva
    • Platform infrastructure
    • Distributed systems
    • bravenewgeek.com
    @tyler_treat

    [email protected]
    ABOUT THE SPEAKER


  3. • Embracing the reality of complex systems
    • Using simplicity to your advantage
    • Why NATS?
    • How Workiva uses NATS
    ABOUT THIS TALK


  4. There are a lot of parallels between real-world
    systems and distributed software systems.


  5. The world is eventually consistent…


  6. …and the database is just
    an optimization.[1]
    [1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html


  7. “There will be no further print editions
    [of the Merck Manual]. Publishing a
    printed book every five years and
    sending reams of paper around the
    world on trucks, planes, and boats is
    no longer the optimal way to provide
    medical information.”
    Dr. Robert S. Porter, Editor-in-Chief, The Merck Manuals


  8. Programmers find asynchrony hard
    to reason about, but the truth is…


  9. Life is mostly asynchronous.


  10. What does this mean for us as
    programmers?


  11. [Timeline plotted against time/complexity: timesharing → monoliths → SOA →
    virtualization → microservices → ???]
    Complicated made complex…


  12. Distributed!


  13. Distributed computation is inherently asynchronous
    and the network is inherently unreliable[2]…
    [2] http://queue.acm.org/detail.cfm?id=2655736


  14. …but the natural tendency is to build distributed
    systems as if they aren’t distributed at all
    because it’s easy to reason about.
    strong consistency - reliable messaging - predictability


  15. • Complicated algorithms
    • Transaction managers
    • Coordination services
    • Distributed locking
    What’s in a guarantee?


  16. • Message handed to the transport layer?
    • Enqueued in the recipient’s mailbox?
    • Recipient started processing it?
    • Recipient finished processing it?
    What’s a delivery guarantee?


  17. Each of these has a very different set of
    conditions, constraints, and costs.


  18. Guaranteed, ordered, exactly-once delivery
    is expensive (if not impossible[3]).
    [3] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/


  19. Over-engineered


  20. Difficult to deploy & operate


  21. At large scale, guarantees will give out.


  22. 0.1% failure at scale is huge.


  23. Replayable > Guaranteed


  24. Replayable > Guaranteed
    Idempotent > Exactly-once


  25. Replayable > Guaranteed
    Idempotent > Exactly-once
    Commutative > Ordered

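    A concrete way to apply the consumer side of these trade-offs is to accept
    at-least-once delivery and deduplicate on a message ID before applying side
    effects. A minimal Go sketch against the NATS client; the Event shape, the
    subject name, and the in-memory "seen" set are illustrative assumptions,
    not anything NATS itself provides:

    package main

    import (
        "encoding/json"
        "log"
        "sync"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    // Event is a hypothetical message body; only the ID matters for dedup.
    type Event struct {
        ID      string `json:"id"`
        Payload string `json:"payload"`
    }

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        var (
            mu   sync.Mutex
            seen = map[string]bool{} // a production system would use a durable store
        )

        // Replays are harmless because the handler is idempotent: an event
        // whose ID has already been seen is simply skipped.
        nc.Subscribe("events.created", func(m *nats.Msg) {
            var e Event
            if err := json.Unmarshal(m.Data, &e); err != nil {
                return
            }
            mu.Lock()
            dup := seen[e.ID]
            seen[e.ID] = true
            mu.Unlock()
            if dup {
                return
            }
            // ...apply the side effect once per ID...
        })

        select {} // block forever (demo only)
    }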

  26. But delivery != processing


  27. Also, what does it even mean to
    “process” a message?


  28. It depends on the business context!


  29. If you need business-level guarantees,
    build them into the business layer.

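    For instance, a business-level "processed" acknowledgement can be layered on
    plain NATS request/reply: the consumer replies only after it has actually done
    the work, and the producer retries until it sees that reply. A rough Go sketch;
    the subject, payload, timeout, and retry policy are illustrative assumptions:

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Consumer: reply only after the work has actually been done.
        nc.Subscribe("orders.create", func(m *nats.Msg) {
            // ...durably process the order here...
            nc.Publish(m.Reply, []byte("ok")) // business-level ack
        })

        // Producer: retry until the business-level ack arrives. Combined with an
        // idempotent consumer, this gives effectively-once processing without
        // exactly-once delivery.
        for attempt := 0; attempt < 5; attempt++ {
            if _, err := nc.Request("orders.create", []byte(`{"id":"42"}`), 2*time.Second); err == nil {
                return // processed
            }
            log.Println("no ack yet, retrying")
        }
        log.Println("giving up; park the request for later replay")
    }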

  30. We can always build stronger guarantees on top,
    but we can’t always remove them from below.


  31. End-to-end system semantics matter much more
    than the semantics of an individual building block[4].
    [4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf


  32. Embrace the chaos!


  33. “Simplicity is the ultimate sophistication.”


  34. EMBRACING THE CHAOS MEANS LOOKING AT THE NEGATIVE SPACE.


  35. A simple technology in a sea of complexity.


  36. Simple doesn’t mean easy.
    [5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development


  37. “Simple can be harder than complex.
    You have to work hard to get your thinking
    clean to make it simple. But it’s worth it in
    the end because once you get there, you
    can move mountains.”


  38. • Wdesk: platform for enterprises to collect, manage,
    and report critical business data in real time
    • Increasing amounts of data and complexity of formats
    • Cloud solution:
    - Data accuracy
    - Secure
    - Highly available
    - Scalable
    - Mobile-enabled
    About Workiva


  39. • First solution built on Google App Engine
    • Scaling new solutions requires a service-oriented approach
    • Scaling new services requires a low-latency communication backplane
    About Workiva


  40. Availability over everything.


  41. • Always on, always available
    • Protects itself at all costs: no compromises on performance
    • Disconnects slow consumers and lazy listeners
    • Clients have automatic failover and reconnect logic (see the sketch below)
    • Clients buffer messages while temporarily partitioned
    Availability over Everything

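    Most of this failover behaviour is client-side configuration. A hedged Go
    sketch using the Go client's connection options; the server URLs and handler
    bodies are placeholders, and the option names follow the Go client (older
    versions expose the same settings through an Options struct, and other
    language clients differ):

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(
            "nats://nats-1:4222,nats://nats-2:4222",  // cluster seed URLs (placeholders)
            nats.MaxReconnects(-1),                   // never stop trying to reconnect
            nats.ReconnectWait(500*time.Millisecond),
            nats.ReconnectBufSize(8*1024*1024),       // buffer outbound msgs while partitioned
            nats.DisconnectHandler(func(_ *nats.Conn) { log.Println("disconnected") }),
            nats.ReconnectHandler(func(c *nats.Conn) { log.Println("reconnected to", c.ConnectedUrl()) }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Publishes issued while the client is reconnecting are buffered (up to
        // the reconnect buffer size) and flushed once the connection returns.
        nc.Publish("health.ping", []byte("hello"))
        nc.Flush()
    }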

  42. Simplicity as a feature.


  43. • Single, lightweight binary
    • Embraces the “negative space”:
    - Simplicity → high performance
    - No complicated configuration or external dependencies (e.g. ZooKeeper)
    - No fragile guarantees → face complexity head-on, encourage async
    • Simple pub/sub semantics provide a versatile primitive (see the sketch below):
    - Fan-in
    - Fan-out
    - Request/response
    - Distributed queueing
    • Simple text-based wire protocol
    Simplicity as a Feature

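    The same tiny API covers all four patterns listed above. A minimal Go sketch;
    the subject names and the queue group name are arbitrary examples:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Fan-out: every subscriber on a subject gets every message.
        nc.Subscribe("updates", func(m *nats.Msg) { fmt.Println("got:", string(m.Data)) })

        // Distributed queueing: subscribers sharing a queue group split the work,
        // so running N instances load-balances automatically.
        nc.QueueSubscribe("work", "workers", func(m *nats.Msg) { /* handle job */ })

        // Request/response: reply on the inbox subject carried in msg.Reply.
        nc.Subscribe("time.now", func(m *nats.Msg) {
            nc.Publish(m.Reply, []byte(time.Now().Format(time.RFC3339)))
        })
        if resp, err := nc.Request("time.now", nil, time.Second); err == nil {
            fmt.Println("server time:", string(resp.Data))
        }

        // Fan-in is just many publishers sharing one subject.
        nc.Publish("updates", []byte("hello"))
        nc.Flush()
        time.Sleep(100 * time.Millisecond) // let async handlers run (demo only)
    }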

  44. Fast as hell.


  45. [Message queue latency benchmark charts][6]
    [6] http://bravenewgeek.com/benchmarking-message-queue-latency/


  46. • Fast, predictable performance at scale and at tail
    • ~8 million messages per second
    • Auto-pruning of interest graph allows efficient
    routing
    • When SLAs matter, it’s hard to beat NATS
    Fast as Hell


  47. • Low-latency service bus
    • Pub/Sub
    • RPC
    How We Use NATS


  48.–53. [Architecture diagrams: Web Clients connect through a Service Gateway,
    which talks to a growing set of Services over the NATS backplane.]

  54. [Architecture diagram: Services communicating with each other over NATS.]

  55. “Just send this thing containing these fields
    serialized in this way using that encoding to
    this topic!”


  56. “Just subscribe to this topic and decode using that
    encoding then deserialize in this way and extract
    these fields from this thing!”


  57. Pub/Sub is meant to decouple services
    but often ends up coupling the teams
    developing them.


  58. How do we evolve services in isolation
    and reduce development overhead?


  59. • Extension of Apache Thrift
    • IDL and cross-language, code-generated pub/sub APIs
    • Allows developers to think in terms of services and APIs
    rather than opaque messages and topics
    • Allows APIs to evolve while maintaining compatibility
    • Transports are pluggable (we use NATS)
    Frugal RPC


  60. struct Event {
      1: i64 id,
      2: string message,
      3: i64 timestamp,
    }

    scope Events prefix {user} {
      EventCreated: Event
      EventUpdated: Event
      EventDeleted: Event
    }

    // generated pub/sub API
    subscriber.SubscribeEventCreated(
      "user-1", func(e *event.Event) {
        fmt.Println(e)
      },
    )
    . . .
    publisher.PublishEventCreated(
      "user-1", event.NewEvent())


  61. • Service instances form a queue group
    • Client “connects” to an instance by publishing a message to the service queue group
    • Serving instance sets up an inbox for the client and sends it back in the response
    • Client sends requests to the inbox (see the sketch below)
    • Connecting is cheap: no service discovery and no sockets to create, just a request/response
    • Heartbeats used to check the health of server and client
    • Very early prototype code: https://github.com/workiva/thrift-nats
    RPC over NATS

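    The "connect" step described above amounts to a single request/reply that
    hands the client a private inbox. A very rough Go sketch of that idea; this
    is not the thrift-nats code, the subject names are made up, and heartbeating
    and session cleanup are omitted:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats" // later releases live at github.com/nats-io/nats.go
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        // Server side: instances join a queue group on the service subject, so
        // exactly one instance handles each "connect".
        nc.QueueSubscribe("svc.accounts.connect", "accounts", func(m *nats.Msg) {
            inbox := nats.NewInbox() // per-client session subject
            nc.Subscribe(inbox, func(req *nats.Msg) {
                // ...dispatch the RPC and reply on req.Reply...
                nc.Publish(req.Reply, []byte("pong"))
            })
            nc.Publish(m.Reply, []byte(inbox)) // hand the inbox back to the client
        })

        // Client side: "connecting" is just one request/reply, with no service
        // discovery and no new sockets.
        resp, err := nc.Request("svc.accounts.connect", nil, time.Second)
        if err != nil {
            log.Fatal(err)
        }
        session := string(resp.Data)

        // Subsequent requests go straight to the session inbox.
        reply, err := nc.Request(session, []byte("ping"), time.Second)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(reply.Data))
    }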

  62. • Store JSON containing cluster membership in S3
    • Container reads the JSON on startup and creates routes with the correct credentials (see the sketch below)
    • Services only talk to the NATS daemon on their VM via localhost
    • Don’t have to worry about encryption between services and NATS, only between NATS peers
    NATS per VM

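    The startup step is essentially: fetch a membership document, then template
    it into the local server's cluster routes. A hedged Go sketch of just that
    transformation; the JSON shape, field names, and credentials are assumptions
    about such a setup rather than anything NATS prescribes, and the S3 fetch is
    omitted:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
    )

    // membership mirrors a hypothetical JSON document kept in S3.
    type membership struct {
        User  string   `json:"user"`
        Pass  string   `json:"pass"`
        Hosts []string `json:"hosts"` // NATS peers, one per VM
    }

    func main() {
        // In the real setup this blob would be read from S3 at container startup.
        raw := []byte(`{"user":"route_user","pass":"s3cr3t","hosts":["10.0.0.1","10.0.0.2","10.0.0.3"]}`)

        var m membership
        if err := json.Unmarshal(raw, &m); err != nil {
            log.Fatal(err)
        }

        // Each VM's NATS server gets a route to every peer; services themselves
        // only ever talk to localhost, so only these peer links need encryption.
        fmt.Println("routes = [")
        for _, h := range m.Hosts {
            fmt.Printf("  nats-route://%s:%s@%s:6222\n", m.User, m.Pass, h)
        }
        fmt.Println("]")
    }

    The emitted nats-route:// URLs would then be dropped into the local server's
    cluster configuration before it starts.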

  63. • Only messages intended for a process on another host go over
    the network, since the NATS cluster maintains the interest graph
    • Greatly reduces network hops (usually 0 vs. 2-3)
    • If the local NATS daemon goes down, restart it automatically
    NATS per VM


  64. • Doesn’t scale to a large number of VMs
    • Fairly easy to transition to a floating NATS cluster or to running on a subset of machines per AZ
    • NATS communication is abstracted from the service
    • Send messages to services without thinking about routing or service discovery
    • Queue groups provide service load balancing
    NATS per VM


  65. • We’re a SaaS company, not an infrastructure company
    • High availability
    • Operational simplicity
    • Performance
    • First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js
    NATS as a Messaging Backplane


  66. “Every solution to every problem is simple…
    It's the distance between the two where the mystery lies.”
    –Derek Landy, Skulduggery Pleasant


  67. @tyler_treat
    github.com/tylertreat
    bravenewgeek.com
    Thanks!
