NSQ - NYC Data Engineering Meetup

Our talk at eBay on 2013-05-15 for the NYC Data Engineering Meetup

Matt Reiferson

May 15, 2013

Transcript

  1. NSQ
    realtime distributed message processing at scale
    https://github.com/bitly/nsq
    May 15th 2013 - NYC Data Engineering Meetup
    @imsnakes & @jehiah (infrastructure @bitly)
    Thursday, May 16, 13

  2. THE WAY OF THE BITLY

  3. PHILOSOPHY
    •service-oriented
    •avoid SPOFs
    •perform work asynchronously
    •de-couple services and use messaging in between
    •dependencies suck (make it easy to deploy)
    •use HTTP and JSON

  4. App
    DATA FLOW
    incoming request

  5. App
    DATA FLOW
    incoming request
    sync persist data

  6. App
    DATA FLOW
    incoming request
    sync persist data
    send response

  7. App
    DATA FLOW
    incoming request
    sync persist data
    send response
    async queue message
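The four steps above can be sketched roughly as follows. This is only an illustration, not bitly's code: `handle_request`, the dict standing in for the data store, the `"clicks"` topic name, and the in-process `queue.Queue` standing in for the message queue are all hypothetical.

```python
import queue


def handle_request(request, store, outbox):
    """Persist synchronously, then hand follow-up work to a queue."""
    # sync persist data: the write completes before the caller gets an answer
    store[request["id"]] = request["body"]
    # the response depends only on the synchronous write
    response = {"status": "ok", "id": request["id"]}
    # async queue message: put_nowait does not wait for any consumer
    outbox.put_nowait({"topic": "clicks", "id": request["id"]})
    return response
```

The point of the split is that anything a consumer does with the queued message (metrics, archiving, analysis) stays off the request path.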

  8. App
    DATA FLOW
    async queue message
    NSQ responsibilities

  9. MESSAGING PATTERNS

  10. PUBSUB / MULTICAST
    messages duplicated to multiple consumers
    de-couple independent stream operations
    [diagram: Producer publishes m1 through PS; ConsumerA and ConsumerB each receive m1]

  11. DISTRIBUTION
    messages load balanced among a homogeneous group of consumers
    horizontal scalability
    [diagram: Producer, Q, two ConsumerA instances; messages m1 and m2]

  12. DISTRIBUTION
    fault tolerance
    in the face of consumer failure, other consumers (try to) pick up the slack
    [diagram: Producer, Q, two ConsumerA instances; messages m1 and m2]

  13. QUEUEING
    if consumers cannot keep up with producers, the queue is able to hold onto messages so they can be processed later
    [diagram: Producer, Q, two ConsumerA instances, two X marks; messages m1, m2, m3]

  14. TYPICAL (TERRIBLE) ARCHITECTURE
    Host A
    API
    queue
    queuereader
    •no delivery guarantee
    •SPOFs
    •inefficient
    •complicated setup
    •hard-coded config

  15. TYPICAL (TERRIBLE) ARCHITECTURE
    Host A
    API
    queue
    queuereader
    Host B
    pubsub / multicast
    •no delivery guarantee
    •SPOFs
    •inefficient
    •complicated setup
    •hard-coded config

  16. TYPICAL (TERRIBLE) ARCHITECTURE
    Host A
    API
    queue
    queuereader
    Host B
    pubsub / multicast
    Host C
    queue queuereader
    relay
    •no delivery guarantee
    •SPOFs
    •inefficient
    •complicated setup
    •hard-coded config

  17. TYPICAL (TERRIBLE) ARCHITECTURE
    Host A
    API
    queue
    queuereader
    Host B
    pubsub / multicast
    Host C
    queue queuereader
    relay
    SPOF
    SPOF
    COMPLEX
    •no delivery guarantee
    •SPOFs
    •inefficient
    •complicated setup
    •hard-coded config

  18. DESIGNING A SOLUTION

  19. GOALS
    •provide a straightforward upgrade path
    •greatly simplify configuration requirements
    •promote topologies that enable high-availability and eliminate SPOFs
    •address the need for stronger message delivery guarantees
    •bound the memory footprint of a single process
    •improve efficiency
    •data format and programming language agnostic

  20. I WANT IT ALL

  23. TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “clicks”
    Topics
    combine multicast, distribution, and queueing

  24. TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “metrics”
    Channels
    “clicks”
    Topics
    combine multicast, distribution, and queueing

  25. TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “metrics”
    Channels
    “clicks”
    Topics
    “spam_analysis”
    combine multicast, distribution, and queueing

  26. TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “metrics”
    Channels
    “clicks”
    Topics
    “spam_analysis”
    “archive”
    combine multicast, distribution, and queueing

  27. separate hosts
    TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “metrics”
    Channels
    “clicks”
    Topics
    “spam_analysis”
    “archive”
    Consumers
    combine multicast, distribution, and queueing

  30. separate hosts
    TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “metrics”
    Channels
    “clicks”
    Topics
    “spam_analysis”
    “archive”
    Consumers
    A
    A
    A
    combine multicast, distribution, and queueing

  34. separate hosts
    TOPICS AND CHANNELS
    • a topic is a distinct stream of messages (a
    single nsqd instance can have multiple
    topics)
    • a channel is an independent queue for a
    topic (a topic can have multiple channels)
    • consumers discover producers by
    querying nsqlookupd (a discovery service
    for topics)
    • topics and channels are created at
    runtime (just start publishing/subscribing)
    nsqd
    “metrics”
    Channels
    “clicks”
    Topics
    “spam_analysis”
    “archive”
    Consumers
    A
    A
    A
    B
    B
    B
    combine multicast, distribution, and queueing
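How topics and channels combine multicast and distribution can be sketched with a toy in-memory model. This is not nsqd code: the `Topic` and `Channel` classes, the consumer names, and the round-robin hand-off are assumptions made for illustration (the slides do not say how nsqd picks a consumer within a channel).

```python
from collections import deque


class Channel:
    """An independent queue for a topic; its consumers share the load."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.consumers = []

    def subscribe(self, consumer_name):
        self.consumers.append(consumer_name)

    def drain(self):
        """Give each queued message to exactly one consumer (round-robin)."""
        deliveries = []
        if not self.consumers:
            return deliveries  # nothing is lost; messages stay queued
        i = 0
        while self.queue:
            msg = self.queue.popleft()
            deliveries.append((self.consumers[i % len(self.consumers)], msg))
            i += 1
        return deliveries


class Topic:
    """A distinct stream of messages that fans out to every channel."""

    def __init__(self, name):
        self.name = name
        self.channels = {}

    def channel(self, name):
        # channels are created at runtime, on first use
        if name not in self.channels:
            self.channels[name] = Channel(name)
        return self.channels[name]

    def publish(self, msg):
        # multicast: every channel gets its own copy
        for ch in self.channels.values():
            ch.queue.append(msg)
```

With a "clicks" topic, a "metrics" channel with two consumers, and an "archive" channel with one, each message reaches both channels, but only one consumer within each channel.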

  38. QUEUES
    •topics and channels are independent queues
    •queues have arbitrary high water marks (after
    which messages transparently read/write to
    disk, bounding memory footprint)
    •supports channel-independent degradation and
    recovery
    buffer this
    channel
    high water mark
    persisted
    messages
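The high water mark idea can be sketched as a queue that keeps a bounded number of messages in memory and spills the rest to a file. This is a simplified illustration, not nsqd's implementation: `OverflowQueue`, its length-prefixed file format, and the "memory first, then disk" read order are assumptions, and the sketch does not promise strict FIFO ordering.

```python
import os
import tempfile
from collections import deque


class OverflowQueue:
    """Hold up to high_water_mark messages in memory; overflow goes to disk."""

    def __init__(self, high_water_mark, path):
        self.high_water_mark = high_water_mark
        self.memory = deque()
        self.path = path
        self.read_offset = 0  # where the next unread on-disk message starts
        self.on_disk = 0      # number of unread on-disk messages
        open(path, "ab").close()  # make sure the file exists

    def put(self, msg: bytes):
        if len(self.memory) < self.high_water_mark:
            self.memory.append(msg)
            return
        # past the high water mark: append a length-prefixed record to disk
        with open(self.path, "ab") as f:
            f.write(len(msg).to_bytes(4, "big") + msg)
        self.on_disk += 1

    def get(self):
        if self.memory:
            return self.memory.popleft()
        if self.on_disk == 0:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.read_offset)
            size = int.from_bytes(f.read(4), "big")
            msg = f.read(size)
            self.read_offset = f.tell()
        self.on_disk -= 1
        return msg
```

Because each channel would own its own queue of this kind, one slow channel can back up to disk without affecting the memory footprint of the others.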

  39. DISCOVERY
    remove the need for publishers and consumers to know about each other
    nsqlookupd
    nsqd
    producer
    nsqlookupd

  40. DISCOVERY
    remove the need for publishers and consumers to know about each other
    nsqlookupd
    nsqd
    ❶ publish msg (specifying topic)
    producer
    nsqlookupd

  41. DISCOVERY
    remove the need for publishers and consumers to know about each other
    nsqlookupd
    nsqd
    ❶ publish msg (specifying topic)
    producer
    ➋ IDENTIFY
    persistent TCP connections
    nsqlookupd

  42. DISCOVERY
    remove the need for publishers and consumers to know about each other
    nsqlookupd
    nsqd
    ❶ publish msg (specifying topic)
    producer
    ➋ IDENTIFY
    persistent TCP connections
    nsqlookupd
    ➌ REGISTER (topic/channel)

  43. DISCOVERY (CLIENT)
    remove the need for publishers and consumers to know about each other
    nsqlookupd nsqlookupd
    consumer

  44. DISCOVERY (CLIENT)
    remove the need for publishers and consumers to know about each other
    nsqlookupd nsqlookupd
    consumer
    ➊ regularly poll for topic producers
    HTTP requests

  45. DISCOVERY (CLIENT)
    remove the need for publishers and consumers to know about each other
    nsqlookupd nsqlookupd
    consumer
    ➊ regularly poll for topic producers
    ➋ connect to all producers
    HTTP requests
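The client side of discovery amounts to asking every nsqlookupd for a topic's producers and connecting to the union of the answers. The sketch below covers only the merge step and works on already-parsed responses. The response shape (a `producers` list whose entries carry `broadcast_address` and `tcp_port`) and the function name `merge_producers` are assumptions for illustration; check the nsqlookupd documentation for the version in use.

```python
def merge_producers(lookup_responses):
    """Union the producers reported by several independent nsqlookupd instances.

    Each item is a parsed JSON response (a dict), or None when that
    nsqlookupd could not be reached during this polling round.
    """
    seen = set()
    for response in lookup_responses:
        if response is None:
            continue  # one unreachable nsqlookupd should not block discovery
        for producer in response.get("producers", []):
            seen.add((producer["broadcast_address"], producer["tcp_port"]))
    # sorted only to make the result deterministic
    return sorted(seen)
```

Since the nsqlookupd instances do not coordinate, their answers may differ at any moment; taking the union means a producer known to any one of them is still found.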

  46. ELIMINATE ALL THE SPOF
    •easily enable distributed and
    decentralized topologies
    •no brokers
    •consumers connect to all producers
    •messages are pushed to consumers
    •nsqlookupd instances are independent
    and require no coordination (run a
    few for HA)

  47. ELIMINATE ALL THE SPOF
    nsqd nsqd
    nsqd
    •easily enable distributed and
    decentralized topologies
    •no brokers
    •consumers connect to all producers
    •messages are pushed to consumers
    •nsqlookupd instances are independent
    and require no coordination (run a
    few for HA)

  48. ELIMINATE ALL THE SPOF
    nsqd nsqd
    nsqd
    consumer
    •easily enable distributed and
    decentralized topologies
    •no brokers
    •consumers connect to all producers
    •messages are pushed to consumers
    •nsqlookupd instances are independent
    and require no coordination (run a
    few for HA)

  50. ELIMINATE ALL THE SPOF
    nsqd nsqd
    nsqd
    consumer consumer
    •easily enable distributed and
    decentralized topologies
    •no brokers
    •consumers connect to all producers
    •messages are pushed to consumers
    •nsqlookupd instances are independent
    and require no coordination (run a
    few for HA)

  52. NSQ
    NSQD
    API
    consumer
    NSQ
    NSQD
    API
    NSQ
    NSQD
    API
    consumer
    nsqlookupd
    nsqlookupd

  53. NSQ
    NSQD
    API
    consumer
    NSQ
    NSQD
    API
    NSQ
    NSQD
    API
    consumer
    nsqlookupd
    nsqlookupd
    PUBLISH

  54. NSQ
    NSQD
    API
    consumer
    NSQ
    NSQD
    API
    NSQ
    NSQD
    API
    consumer
    nsqlookupd
    nsqlookupd
    PUBLISH
    REGISTER

  55. NSQ
    NSQD
    API
    consumer
    NSQ
    NSQD
    API
    NSQ
    NSQD
    API
    consumer
    nsqlookupd
    nsqlookupd
    PUBLISH
    REGISTER
    DISCOVER

  56. NSQ
    NSQD
    API
    consumer
    NSQ
    NSQD
    API
    NSQ
    NSQD
    API
    consumer
    nsqlookupd
    nsqlookupd
    PUBLISH
    REGISTER
    DISCOVER
    SUBSCRIBE

  57. MESSAGE GUARANTEES
    •messages are delivered at least once
    •handling is guaranteed by the protocol:
        •nsqd sends a message and stores it temporarily
        •client replies FIN (finish) or REQ (re-queue)
        •if client does not reply, message is automatically re-queued
    •any single nsqd instance failure can result in message loss (can be mitigated)
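The at-least-once handling described above can be sketched as a small in-flight table. This is a toy model rather than nsqd's code: `InFlightChannel`, the explicit `now` arguments, and the timeout value are assumptions chosen to keep the example deterministic.

```python
from collections import deque


class InFlightChannel:
    """Track sent-but-unacknowledged messages and re-queue them on REQ or timeout."""

    def __init__(self, msg_timeout):
        self.msg_timeout = msg_timeout
        self.queue = deque()
        self.in_flight = {}  # msg_id -> (body, deadline)

    def put(self, msg_id, body):
        self.queue.append((msg_id, body))

    def send(self, now):
        """Send the next message and hold on to it until the client answers."""
        if not self.queue:
            return None
        msg_id, body = self.queue.popleft()
        self.in_flight[msg_id] = (body, now + self.msg_timeout)
        return msg_id, body

    def fin(self, msg_id):
        """Client finished the message: drop the stored copy."""
        self.in_flight.pop(msg_id, None)

    def req(self, msg_id):
        """Client asked for a re-queue: put the stored copy back on the queue."""
        entry = self.in_flight.pop(msg_id, None)
        if entry is not None:
            self.queue.append((msg_id, entry[0]))

    def expire(self, now):
        """Re-queue every message whose client did not reply in time."""
        timed_out = [m for m, (_, deadline) in self.in_flight.items()
                     if deadline <= now]
        for msg_id in timed_out:
            self.req(msg_id)
        return timed_out
```

A message that times out is sent again, which is why delivery is at least once rather than exactly once, and why handlers are better off being idempotent.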

  58. CLIENT BEHAVIOR
    •messages are pushed to clients (no polling!)
    •clients manage flow via “RDY state”
    •clients can perform 3 actions on a message:
        •finish
        •re-queue (optionally defer by a duration of time)
        •touch
    •back off, i.e. slow down the rate of processing
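Flow control via "RDY state" can be sketched as follows: the server pushes messages only while the client's in-flight count is below the RDY count the client last announced. This is a loose model, not the NSQ protocol implementation: `Client`, `push`, and the exact accounting are assumptions, and nsqd's real bookkeeping may differ.

```python
class Client:
    """Client-side flow-control state as seen by the server."""

    def __init__(self):
        self.rdy = 0        # how many messages the client says it can take
        self.in_flight = 0  # messages sent but not yet finished

    def set_rdy(self, count):
        self.rdy = count

    def finish(self):
        self.in_flight -= 1

    def can_receive(self):
        return self.in_flight < self.rdy


def push(pending, client):
    """Push messages from pending (a list) until the client stops being ready."""
    sent = []
    while pending and client.can_receive():
        sent.append(pending.pop(0))
        client.in_flight += 1
    return sent
```

Setting RDY to 0 is one way to model backing off: the client stops receiving new messages without disconnecting, and raises RDY again once it has caught up.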

  59. ONE MORE THING
    •#ephemeral channels - runtime introspection
        •no backup beyond channel high water mark
        •automatically go away when last client disconnects
    •server side channel pausing
        •administratively stop the flow of messages from a channel to its clients
        •no message loss (queue backs up)
        •really $#%^ing awesome for operations

  60. OTHER SOLUTIONS
    •ZeroMQ - it’s a library, not a platform
    •RabbitMQ, ActiveMQ - promotes brokered topology (and AMQP’s original authors
    abandoned it to build ZeroMQ)
    •kafka - heavyweight, complex, designed for different use case
    •beanstalk, kestrel - just a better queue
    we knew you were going to ask about this...

  61. IN PRODUCTION

  62. TOOLING
    • nsqadmin provides a web interface to
    administrate and introspect an NSQ cluster at
    runtime (and empty, pause, or delete topics/
    channels)
    • nsq_to_http - utility that helps transport an
    aggregate stream over HTTP
    • nsq_to_file - utility that safely persists an
    aggregated stream to disk
    • nsq_stat - iostat-like utility for a topic/channel
    • nsq_tail - tail-like utility for a topic/channel

  63. EXAMPLE CLIENTS
    •Go Client - https://gist.github.com/4039222
    •Synchronous Python Client - https://gist.github.com/3925081
    •Async Python Client - https://gist.github.com/3925092

  64. DEMO

  65. Thanks
    @imsnakes & @jehiah
    https://github.com/bitly/nsq
    shoutouts to @danielhfrank, @ploxiln, and @mccutchen