Stream Processing: Philosophy, Concepts, and Technologies

Given at PhillyETE 2013: Stream processing has emerged in recent years as a fast-growing paradigm in data science infrastructure. Its rise can be partly attributed to factors external to system design, such as business demands for near-realtime data or the inability of hardware to manage an ever-growing data set. However, the paradigm also has many inherent strengths, and there is good reason to embrace it rather than simply tolerate it. In this talk I’ll discuss some high-level advantages of processing data in streams, such as fault tolerance, horizontal scalability, and composability. I’ll then introduce NSQ, Bitly’s open source queueing system, and discuss how it provides these advantages and how it approaches the tradeoffs inherent in designing distributed systems. I’ll also discuss some of the burdens that NSQ places on developers, such as the need for idempotent operations, and why they are necessary. Finally, I’ll discuss some new technologies that aim to abstract away the mechanism of communication between streaming programs, and talk about the powerful opportunities and risks that they offer.

Dan Frank

April 03, 2013

Transcript

  1. STREAM PROCESSING PHILOSOPHY, CONCEPTS, AND TECHNOLOGIES Dan Frank df@bit.ly @danielhfrank

  7. What did I just sign up for?

    • Stream processing as a tool for decomposition and modularity
    • Stream processing composition building blocks
    • Stream processing in your distributed web application
    • NSQ, Bitly’s distributed messaging framework
    • The future now: stream processing within your programs, and technologies to do it
  8. STREAM PROCESSING? Let’s say: “Near-realtime processing of sequential messages / events”
  9. A QUICK NOTE ON

    • Hadoop is a dominant framework for doing batch tasks: tasks that operate on a fully populated dataset and just need to be done “later”. Offline
    • Stream processing is basically the opposite of this: operating as new data comes in, computation happens online. No concept of “complete” dataset
    • BUT, using the two as complementary data analysis components is very effective
  10. Career Topology

  13. Why Stream Processing? REALTIME ANALYTICS! There are better reasons!

  14. CASE STUDY: PROCESSING LINES IN A FILE

  17. NAÏVE “ARCHITECTURE”

        for line in lines:
            new_line = do_something(line)
            newer_line = do_something_else(new_line)
            # ...
            outputs.append(newest_line)

    Composition of our functions is static, built into our program
    Error handling? Uhh
  20. Unix Solution: Pipes

        < lines do_something | do_something_else | ...

    Composition happens outside the application code
    Errors are printed to stderr, execution continues. It’ll do...
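
To make the pipe idea concrete, here is a minimal sketch of what one stage such as do_something might look like as a standalone filter. The upper-casing logic and the run_filter helper are assumptions made for illustration; the slides do not show the stage's implementation.

```python
import io


def do_something(line):
    """Hypothetical stage: upper-case a line, rejecting empty input."""
    if not line.strip():
        raise ValueError("empty line")
    return line.upper()


def run_filter(stage, infile, outfile, errfile):
    """Apply `stage` to each input line; report failures and keep going."""
    for line in infile:
        try:
            outfile.write(stage(line.rstrip("\n")) + "\n")
        except Exception as exc:  # one bad line must not stop the stream
            errfile.write(f"skipped {line!r}: {exc}\n")


# In-memory streams stand in for stdin, stdout and stderr.
out, err = io.StringIO(), io.StringIO()
run_filter(do_something, io.StringIO("a\n\nb\n"), out, err)
print(out.getvalue(), end="")
print(err.getvalue(), end="")
```

Calling run_filter(do_something, sys.stdin, sys.stdout, sys.stderr) instead would turn the same function into a stage that can sit in a shell pipeline, which keeps the composition outside the application code.
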
  23. ASIDE ON MODULARITY

    • Modularity in code
        • Logically simpler functions, more easily grokked + tested
        • Smaller functions more easily reused throughout program, DRY
    • Modularity in architecture
        • Fine grained scaling of individual components
        • Isolate failures
        • All of the above
  24. BIG LEAGUES: TRENDRR STACK VERSION

        def process_tweet(tweet):
            get_sentiment()
            get_location()
            ...

    vs

        SentimentProcessor
        LocationProcessor
  28. “QUEUEREADER” applications consume messages generated as outlined above

    • May modify messages and send further downstream
    • May update some sort of database
    • Probably a good idea to do some archival as well
  33. ARCHIVAL GOODIES

    • Backfill new systems
    • Repair busted systems
    • Ripe for batch processing
    • Include timestamps in your messages!
  34. COMPOSITION BUILDING BLOCKS

  35. Pubsub / Multicast Model

    [Diagram: Producer → PS → ConsumerA and ConsumerB, each receiving msg]
    • Messages duplicated to multiple consumers
    • Decouple independent stream operations
  36. Distribution Model

    [Diagram: Producer → Q → ConsumerA, ConsumerA; messages m1, m2]
    • Messages distributed among consumers
    • Horizontally scale workers to achieve desired throughput
  37. Distribution Model

    [Diagram: Producer → Q → Consumer, Consumer; messages m1, m2]
    • Fault Tolerance: In face of consumer failure, other consumers (try to) pick up the slack
  38. Buffered Model

    [Diagram: Producer → Q → Consumer, Consumer; messages m1, m2, m3]
    • Buffering: If consumers cannot keep up with producers, the queue is able to hold onto messages so they can be processed later
  41. MAKE IT WEBSCALE!!! what does this have to do with my webapp?

    • Web requests are serialized as event messages
    • Messages make up a stream that can be processed elsewhere in your distributed application
  46. ASYNC DATA FLOW

    [Diagram: App]
    ❶ incoming request
    ❷ sync persist data
    ❸ send response
    ❹ async queue message
    Downstream processing decoupled from request / response
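
The four steps can be sketched in a few lines. Everything here is a stand-in chosen for illustration: a dict plays the data store, a queue.Queue plays the message queue, and handle_request and downstream_worker are hypothetical names, not part of any framework mentioned in the talk.

```python
import json
import queue
import threading

database = {}             # hypothetical stand-in for the data store
outbound = queue.Queue()  # hypothetical stand-in for the message queue


def handle_request(request):
    """Persist synchronously, build the response, enqueue an event."""
    database[request["id"]] = request["body"]  # step 2: sync persist data
    response = {"status": "ok"}                # step 3: response to send
    outbound.put(json.dumps(request))          # step 4: returns immediately
    return response


def downstream_worker(results):
    """Runs apart from the request path, draining queued events."""
    while True:
        raw = outbound.get()
        if raw is None:  # sentinel: shut down
            break
        results.append(json.loads(raw)["id"])


results = []
worker = threading.Thread(target=downstream_worker, args=(results,))
worker.start()
print(handle_request({"id": 1, "body": "hello"}))
outbound.put(None)
worker.join()
print(results)
```

The request handler never waits on the worker, so a slow or failed downstream step does not delay the response.
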
  47. IT’S NICE BUT

    • Stringing together queues and pubsubs to implement these models is a pain
    • A single conduit for messages is a SPOF
    • A single queue leads to rigid dependencies between services
  52. TYPICAL (OLD) ARCHITECTURE

    [Diagram]
    • Host A: API, simplequeue, queuereader
    • Host B: pubsub
    • Host C: simplequeue, queuereader, ps_to_http
    Annotations: SPOF, SPOF, COMPLEX, ANARCHY
  53. I WANT IT ALL

  58. NSQ Core Features

    • Queue daemon facilitates multicast, distribution, and buffering
    • Lookup service simplifies configuration and allows topology to change dynamically
    • Fully distributed and decentralized
  73. MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW

    • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics)
    • a channel is an independent queue for a topic (a topic can have multiple channels)
    • consumers discover producers by querying nsqlookupd (a discovery service for topics)
    • topics and channels are created at runtime (just start publishing/subscribing)
    [Diagram: nsqd with topic “clicks” and channels “metrics”, “spam_analysis”, “archive”; consumers A, A, A and B, B, B on separate hosts]
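
A toy, in-process model may help show how topics and channels combine the multicast and distribution models. It is a sketch only: the Topic and Channel classes are hypothetical, and the strict round-robin delivery is a simplification, not a description of how nsqd actually schedules messages.

```python
from collections import deque
from itertools import cycle


class Channel:
    """Independent queue for a topic; its consumers share the messages."""

    def __init__(self, name):
        self.name = name
        self.buffer = deque()  # holds messages until they are delivered
        self.consumers = []    # callables that each take one message

    def deliver(self):
        """Hand each buffered message to exactly one consumer."""
        if not self.consumers:
            return  # nothing subscribed yet: keep buffering
        for consumer in cycle(self.consumers):
            if not self.buffer:
                break
            consumer(self.buffer.popleft())


class Topic:
    """Distinct stream of messages, copied into every channel."""

    def __init__(self, name):
        self.name = name
        self.channels = {}

    def channel(self, name):
        """Channels come into being on first use."""
        if name not in self.channels:
            self.channels[name] = Channel(name)
        return self.channels[name]

    def publish(self, message):
        for ch in self.channels.values():  # multicast across channels
            ch.buffer.append(message)


clicks = Topic("clicks")
metrics_a, metrics_b, archive = [], [], []
clicks.channel("metrics").consumers += [metrics_a.append, metrics_b.append]
clicks.channel("archive").consumers.append(archive.append)
for n in range(4):
    clicks.publish(f"click-{n}")
for ch in clicks.channels.values():
    ch.deliver()
print(metrics_a, metrics_b, archive)
```

Each channel sees every message, while the two “metrics” consumers split their channel's copy between them.
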
  77. DISCOVERY

    remove the need for publishers and consumers to know about each other
    [Diagram: producer, nsqd, two nsqlookupd instances]
    ❶ publish msg (specifying topic)
    ❷ IDENTIFY (persistent TCP connections)
    ❸ REGISTER (topic/channel)
  80. DISCOVERY (CLIENT)

    remove the need for publishers and consumers to know about each other
    [Diagram: consumer, two nsqlookupd instances]
    ❶ regularly poll for topic producers (HTTP requests)
    ❷ connect to all producers
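
The polling step can be sketched as follows. This assumes nsqlookupd's HTTP /lookup endpoint and the producers, broadcast_address, and tcp_port fields as they appear in current NSQ documentation; the version in use at the time of the talk may have differed, and the two replies below are made-up sample data. No network call is made here.

```python
import json
from urllib.parse import urlencode


def lookup_url(lookupd_http_address, topic):
    """Build the polling URL for one nsqlookupd instance."""
    return f"http://{lookupd_http_address}/lookup?" + urlencode({"topic": topic})


def parse_producers(body):
    """Extract (address, tcp_port) pairs for the nsqd instances in a reply."""
    payload = json.loads(body)
    payload = payload.get("data", payload)  # assumed: some releases nest under "data"
    return {(p["broadcast_address"], p["tcp_port"])
            for p in payload.get("producers", [])}


def discover(bodies):
    """Merge replies from every nsqlookupd; the instances do not coordinate."""
    producers = set()
    for body in bodies:
        producers |= parse_producers(body)
    return sorted(producers)


# Hypothetical replies from two lookup daemons with overlapping knowledge.
reply_1 = json.dumps({"producers": [
    {"broadcast_address": "host-a", "tcp_port": 4150},
    {"broadcast_address": "host-b", "tcp_port": 4150}]})
reply_2 = json.dumps({"data": {"producers": [
    {"broadcast_address": "host-b", "tcp_port": 4150},
    {"broadcast_address": "host-c", "tcp_port": 4150}]}})

print(lookup_url("127.0.0.1:4161", "clicks"))
print(discover([reply_1, reply_2]))
```

A consumer would fetch each URL on a timer (for example with urllib.request.urlopen), then open a connection to every address in the merged result.
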
  86. ELIMINATE ALL THE SPOF

    [Diagram: three nsqd instances, two consumers]
    • easily enable distributed and decentralized topologies
    • no brokers
    • consumers connect to all producers
    • messages are pushed to consumers
    • nsqlookupd instances are independent and require no coordination (run a few for HA)
  91. EXAMPLE NSQ ARCHITECTURE

    [Diagram: three API + NSQD pairs, two consumers, two nsqlookupd instances]
    Flow labels: PUBLISH, REGISTER, DISCOVER, SUBSCRIBE
  92. A WORD ON ERRORS

    • If a reader does not reply to confirm completion of a message within a timeout, the message is requeued.
    • A message is abandoned after a configurable number of requeues
    • Allows for recovery in face of transient problems without getting hung up on bad messages
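
The requeue-then-abandon policy can be sketched with a plain in-memory queue. This is a simplification for illustration: a raised exception stands in for a missed timeout, and drain, handler, and max_attempts are hypothetical names rather than NSQ settings.

```python
from collections import deque


def drain(messages, handler, max_attempts=3):
    """Process messages, requeueing failures until the attempt limit."""
    pending = deque((msg, 1) for msg in messages)
    done, abandoned = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            done.append(msg)
        except Exception:
            if attempt >= max_attempts:
                abandoned.append(msg)               # give up on a bad message
            else:
                pending.append((msg, attempt + 1))  # requeue for another try
    return done, abandoned


failed_once = set()


def handler(msg):
    """Hypothetical handler: 'bad' always fails, 'flaky' fails only once."""
    if msg == "bad":
        raise ValueError("cannot parse")
    if msg == "flaky" and msg not in failed_once:
        failed_once.add(msg)
        raise TimeoutError("transient problem")


print(drain(["ok", "flaky", "bad"], handler))
```

The flaky message succeeds on its second attempt, while the bad one is dropped after three tries instead of blocking the queue.
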
  93. OTHER NSQ NICETIES

    • Admin interface: server-side channel pausing, admin action notifications
    • Configurable high-water mark on memory usage
    • Ephemeral channels for stream sampling
  94. github.com/bitly/nsq

  99. DISTRIBUTED MESSAGING CAVEATS

    • Messages in order? Fuggedaboudit!*
    • NSQ protocol guarantees delivery at least once - idempotence is a must! (_ids help)
    • Try not to be shocked by effortless recovery from node failure
    *See http://bit.ly/life_beyond_transactions
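
Because a message may arrive more than once, a handler has to tolerate duplicates. One way the _id hint might be used is sketched below; the ClickCounter class and the message fields other than _id are assumptions, and a real system would need a durable, bounded store rather than an in-memory set.

```python
class ClickCounter:
    """Counts clicks per URL, ignoring redelivered messages."""

    def __init__(self):
        self.counts = {}
        self.seen_ids = set()  # assumption: in-memory only, never expires

    def handle(self, message):
        """Apply the message once; return False for a duplicate."""
        if message["_id"] in self.seen_ids:
            return False
        self.seen_ids.add(message["_id"])
        url = message["url"]
        self.counts[url] = self.counts.get(url, 0) + 1
        return True


counter = ClickCounter()
deliveries = [
    {"_id": "a1", "url": "bit.ly/x"},
    {"_id": "a1", "url": "bit.ly/x"},  # same message, delivered twice
    {"_id": "b2", "url": "bit.ly/x"},
]
for message in deliveries:
    counter.handle(message)
print(counter.counts)
```

Without the _id check the count would be 3; with it, the redelivery has no effect.
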
  103. STREAM PROCESSING: WHY NOW?

    • Cheap node distribution: EC2 etc
    • Moore’s law, Amdahl’s law, battered deceased equines...
    • Taking advantage of CPU parallelism is the way forward for program efficiency - good thing we just went over a paradigm for distributing tasks among parallel workers!
  104. INTRA-PROGRAM STREAM PROCESSING IN THE WILD

  105. EXAMPLE 1: GOLANG

  106. EXAMPLE 1: GOLANG

    • Channels allow synchronized passage of messages between two goroutines
    • Goroutine independence (through synchronization) allows stream-like architecture:
        • “Don’t communicate by sharing memory, share memory by communicating”
    • Golang scheduler can parallelize between cores (GOMAXPROCS)
    • Channels act like queues. Multicast not really an option
    • Queuereader applications are a particularly good fit for goroutine concurrency
  107. [Diagram: Q feeds messages m1, m2, m3, m... to two ConsumerA instances; within a consumer, a Golang Channel passes m1, m2, m3 to Goroutine 1, Goroutine 2, Goroutine 3, which run on CPU 1 and CPU 2]

    • Within each consumer, messages distributed among goroutines
    • Goroutines, when possible, parallelized across CPUs
    • OK to have more goroutines than CPUs - golang scheduler will give them CPU time when another goroutine is idle (e.g. waiting on network)
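
The same shape can be approximated in Python, the language of the earlier slides, with threads reading from a shared queue in place of goroutines reading from a channel. This is an analogy only: run_workers is a hypothetical helper, and standard CPython threads overlap while waiting on I/O but do not run Python code on several cores at once.

```python
import queue
import threading


def run_workers(messages, handler, num_workers=3):
    """Distribute messages among worker threads via one shared queue."""
    inbox = queue.Queue()  # plays the role of the channel
    results = []

    def worker():
        while True:
            msg = inbox.get()
            if msg is None:  # sentinel: no more work for this worker
                break
            results.append(handler(msg))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for msg in messages:
        inbox.put(msg)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)  # completion order varies, so sort for display


print(run_workers([1, 2, 3, 4], lambda m: m * m))
```

Having more workers than CPUs is harmless here for the same reason as in the slide: a worker blocked on the queue or the network costs nothing while it waits.
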
  108. EXAMPLE 2 WHAT’S THE DEAL WITH ZEROMQ?

  109. ZMQ FEATURES

    • Networking library that provides building blocks discussed earlier
    • Unlike golang channels, does support many more complex patterns
    • Transport layer abstracted out: same application can connect multiple threads or multiple machines
    • Can start by distributing among processes, and scale up to several boxes. Application code doesn’t need to know about it!
    • All the rage among the webscale set, but unclear what the hell is going on in the community
  110. Change transport by changing one string

        zmq.bind("inproc://example_socket")
        zmq.bind("tcp://1.2.3.4:5678")

  111. ALMOST DONE I PROMISE

  112. WHAT HAVE WE SEEN HERE?

    • Stream processing paradigm is a great tool for writing composed, modular applications
    • Fault tolerance and horizontal scalability come in the box
    • Your web application is probably better suited to this design than you think
    • NSQ is the tool we use to write distributed stream processing applications and it kicks ass at it
    • These same paradigms can aid in writing performant applications making use of multicore computer architecture, so you should plan on seeing a lot more of this stuff in the near future, whether you like it or not
  113. THANKS! Dan Frank df@bit.ly @danielhfrank