Slide 1

Slide 1 text

STREAM PROCESSING PHILOSOPHY, CONCEPTS, AND TECHNOLOGIES
Dan Frank @danielhfrank

Slide 2

Slide 2 text

What did I just sign up for?

Slide 3

Slide 3 text

• Stream processing as a tool for decomposition and modularity What did I just sign up for?

Slide 4

Slide 4 text

• Stream processing as a tool for decomposition and modularity • Stream processing composition building blocks What did I just sign up for?

Slide 5

Slide 5 text

• Stream processing as a tool for decomposition and modularity • Stream processing composition building blocks • Stream processing in your distributed web application What did I just sign up for?

Slide 6

Slide 6 text

• Stream processing as a tool for decomposition and modularity • Stream processing composition building blocks • Stream processing in your distributed web application • NSQ, Bitly’s distributed messaging framework What did I just sign up for?

Slide 7

Slide 7 text

What did I just sign up for?
• Stream processing as a tool for decomposition and modularity
• Stream processing composition building blocks
• Stream processing in your distributed web application
• NSQ, Bitly’s distributed messaging framework
• The future now: stream processing within your programs, and technologies to do it

Slide 8

Slide 8 text

STREAM PROCESSING? Let’s say: “Near-realtime processing of sequential messages / events”

Slide 9

Slide 9 text

A QUICK NOTE ON HADOOP
• Hadoop is a dominant framework for batch tasks: tasks that operate offline on a fully populated dataset and just need to be done “later”.
• Stream processing is basically the opposite: computation happens online, as new data comes in. There is no concept of a “complete” dataset.
• BUT using the two as complementary data analysis components is very effective.

Slide 10

Slide 10 text

Career Topology

Slide 11

Slide 11 text

Why Stream Processing?

Slide 12

Slide 12 text

Why Stream Processing? REALTIME ANALYTICS!

Slide 13

Slide 13 text

Why Stream Processing? REALTIME ANALYTICS! There are better reasons!

Slide 14

Slide 14 text

CASE STUDY: PROCESSING LINES IN A FILE

Slide 15

Slide 15 text

NAÏVE “ARCHITECTURE”
for line in lines:
    new_line = do_something(line)
    newer_line = do_something_else(new_line)
    # ...
    outputs.append(newest_line)

Slide 16

Slide 16 text

NAÏVE “ARCHITECTURE”
for line in lines:
    new_line = do_something(line)
    newer_line = do_something_else(new_line)
    # ...
    outputs.append(newest_line)
Composition of our functions is static, built into our program

Slide 17

Slide 17 text

NAÏVE “ARCHITECTURE”
for line in lines:
    new_line = do_something(line)
    newer_line = do_something_else(new_line)
    # ...
    outputs.append(newest_line)
Composition of our functions is static, built into our program
Error handling? Uhh

Slide 18

Slide 18 text

Unix Solution: Pipes
< lines do_something | do_something_else | ...

Slide 19

Slide 19 text

Unix Solution: Pipes
< lines do_something | do_something_else | ...
Composition happens outside the application code

Slide 20

Slide 20 text

Unix Solution: Pipes
< lines do_something | do_something_else | ...
Composition happens outside the application code
Errors are printed to stderr, execution continues. It’ll do...
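
The same idea can be sketched inside Python with generators: each stage is an independent function, the composition is a list passed in from outside, and a failing line is reported on stderr while the rest keep flowing. This is only an illustration; the bodies of do_something and do_something_else, and the helper names stage and pipeline, are assumptions and do not come from the talk.

```python
import sys
from functools import reduce


def do_something(line):
    # Hypothetical stage: strip surrounding whitespace.
    return line.strip()


def do_something_else(line):
    # Hypothetical stage: parse the line as an integer and double it.
    return str(int(line) * 2)


def stage(func, lines):
    """Apply func to each line; report failures on stderr and keep going."""
    for line in lines:
        try:
            yield func(line)
        except Exception as exc:
            print(f"skipping {line!r}: {exc}", file=sys.stderr)


def pipeline(lines, funcs):
    """Chain stages in the order given, like `a | b | c` in a shell."""
    return reduce(lambda stream, func: stage(func, stream), funcs, lines)


# The list of stages is data, so it can be changed without touching the stages.
outputs = list(pipeline([" 1 ", "oops", "3"], [do_something, do_something_else]))
print(outputs)  # ['2', '6'], with a note about 'oops' on stderr
```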

Slide 21

Slide 21 text

ASIDE ON MODULARITY

Slide 22

Slide 22 text

ASIDE ON MODULARITY • Modularity in code • Logically simpler functions, more easily grokked + tested • Smaller functions more easily reused throughout program, DRY

Slide 23

Slide 23 text

ASIDE ON MODULARITY
• Modularity in code
  • Logically simpler functions, more easily grokked + tested
  • Smaller functions more easily reused throughout program, DRY
• Modularity in architecture
  • Fine grained scaling of individual components
  • Isolate failures
  • All of the above

Slide 24

Slide 24 text

BIG LEAGUES: TRENDRR STACK VERSION
def process_tweet(tweet):
    get_sentiment()
    get_location()
    ...
vs
SentimentProcessor
LocationProcessor

Slide 25

Slide 25 text

“QUEUEREADER” applications consume messages generated as outlined above

Slide 26

Slide 26 text

“QUEUEREADER” applications consume messages generated as outlined above • May modify messages and send further downstream

Slide 27

Slide 27 text

“QUEUEREADER” applications consume messages generated as outlined above • May modify messages and send further downstream • May update some sort of database

Slide 28

Slide 28 text

“QUEUEREADER”
applications consume messages generated as outlined above
• May modify messages and send further downstream
• May update some sort of database
• Probably a good idea to do some archival as well
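
A minimal sketch of such a queuereader, using only the Python standard library. The in-memory queues, the dict used as a "database", the list used as an archive, and the message fields (url, timestamp, processed_at) are all stand-ins chosen for illustration, not the tools described in the talk.

```python
import json
import queue
import time


def run_queuereader(inbox, downstream, database, archive):
    """Drain inbox, handling each message in the three ways listed above."""
    while True:
        try:
            raw = inbox.get_nowait()
        except queue.Empty:
            return
        archive.append(raw)  # archival: keep the raw message as received
        msg = json.loads(raw)
        # Update "some sort of database": here, a per-URL click count.
        database[msg["url"]] = database.get(msg["url"], 0) + 1
        # Modify the message and send it further downstream.
        msg["processed_at"] = time.time()
        downstream.put(json.dumps(msg))


inbox, downstream = queue.Queue(), queue.Queue()
database, archive = {}, []
for _ in range(2):
    inbox.put(json.dumps({"url": "http://example.com", "timestamp": time.time()}))
run_queuereader(inbox, downstream, database, archive)
```

Archiving the raw message before doing anything else means a later bug in the processing step cannot corrupt the archive.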

Slide 29

Slide 29 text

ARCHIVAL GOODIES

Slide 30

Slide 30 text

ARCHIVAL GOODIES •Backfill new systems

Slide 31

Slide 31 text

ARCHIVAL GOODIES •Backfill new systems •Repair busted systems

Slide 32

Slide 32 text

ARCHIVAL GOODIES •Backfill new systems •Repair busted systems •Ripe for batch processing

Slide 33

Slide 33 text

ARCHIVAL GOODIES
• Backfill new systems
• Repair busted systems
• Ripe for batch processing
• Include timestamps in your messages!

Slide 34

Slide 34 text

COMPOSITION BUILDING BLOCKS

Slide 35

Slide 35 text

Pubsub / Multicast Model
[Diagram: Producer -> PS -> ConsumerA and ConsumerB; each consumer receives a copy of msg]
Messages duplicated to multiple consumers
Decouple independent stream operations

Slide 36

Slide 36 text

Distribution Model
[Diagram: Producer -> Q -> two ConsumerA instances; messages m1 and m2 are split between them]
Messages distributed among consumers
Horizontally scale workers to achieve desired throughput

Slide 37

Slide 37 text

Distribution Model
[Diagram: Producer -> Q -> two Consumer instances; messages m1 and m2 are split between them]
Fault Tolerance: In face of consumer failure, other consumers (try to) pick up the slack

Slide 38

Slide 38 text

Buffered Model
[Diagram: Producer -> Q -> two Consumer instances; m1, m2, m3 wait in the queue]
Buffering: If consumers cannot keep up with producers, the queue is able to hold onto messages so they can be processed later
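
The distribution and buffered models can be sketched together with a thread-safe queue: the producer runs ahead, the queue holds the backlog, and each message is handed to exactly one of several workers. This is a standard-library illustration only; the worker count, the None sentinel, and the squaring "work" are arbitrary assumptions.

```python
import queue
import threading

work = queue.Queue()      # unbounded: buffers messages until a worker is free
results = queue.Queue()


def consumer(name):
    while True:
        msg = work.get()
        if msg is None:   # sentinel: no more messages for this worker
            return
        results.put((name, msg * msg))


# The producer finishes before any consumer starts; nothing is lost.
for n in range(10):
    work.put(n)

workers = [threading.Thread(target=consumer, args=(f"consumer-{i}",)) for i in range(3)]
for w in workers:
    w.start()
for _ in workers:
    work.put(None)        # one sentinel per worker
for w in workers:
    w.join()

# Each message was processed exactly once, by whichever worker got to it first.
processed = sorted(results.get()[1] for _ in range(results.qsize()))
```

Adding throughput is a matter of starting more workers; neither the producer nor the existing workers need to change.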

Slide 39

Slide 39 text

MAKE IT WEBSCALE!!! what does this have to do with my webapp?

Slide 40

Slide 40 text

MAKE IT WEBSCALE!!! what does this have to do with my webapp? Web requests are serialized as event messages

Slide 41

Slide 41 text

MAKE IT WEBSCALE!!! what does this have to do with my webapp? Web requests are serialized as event messages Messages make up a stream that can be processed elsewhere in your distributed application

Slide 42

Slide 42 text

App ❶ ASYNC DATA FLOW incoming request

Slide 43

Slide 43 text

App ❶ ❷ ASYNC DATA FLOW incoming request sync persist data

Slide 44

Slide 44 text

App ❶ ❸ ❷ ASYNC DATA FLOW incoming request sync persist data send response

Slide 45

Slide 45 text

App ❶ ❹ ❸ ❷ ASYNC DATA FLOW incoming request sync persist data send response async queue message

Slide 46

Slide 46 text

ASYNC DATA FLOW
[Diagram: App]
❶ incoming request
❷ sync persist data
❸ send response
❹ async queue message
Downstream processing decoupled from request / response
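
A rough sketch of that flow, assuming an in-memory dict as the datastore and a queue.Queue as the message queue (both are stand-ins, as are the handler and field names). The request handler does only the synchronous work, and the downstream worker runs on its own thread.

```python
import json
import queue
import threading
import time

db = {}                   # stand-in for the synchronous datastore
events = queue.Queue()    # stand-in for the message queue
stats = {"clicks": 0}


def handle_request(request):
    db[request["id"]] = request["url"]            # 2. persist data synchronously
    response = {"status": 200}                    # 3. the response to send back
    event = dict(request, timestamp=time.time())
    events.put(json.dumps(event))                 # 4. queue a message; do not wait on consumers
    return response


def downstream_worker():
    # Runs outside the request / response cycle.
    while True:
        raw = events.get()
        if raw is None:   # sentinel used to stop the sketch
            return
        stats["clicks"] += 1


worker = threading.Thread(target=downstream_worker)
worker.start()
response = handle_request({"id": 1, "url": "http://example.com"})
events.put(None)
worker.join()
```

If the downstream worker is slow or down, handle_request still returns at the same speed; the cost is that stats lags behind db.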

Slide 47

Slide 47 text

IT’S NICE BUT
• Stringing together queues and pubsubs to implement these models is a pain
• A single conduit for messages is a SPOF
• A single queue leads to rigid dependencies between services

Slide 48

Slide 48 text

TYPICAL (OLD) ARCHITECTURE Host A API simplequeue queuereader

Slide 49

Slide 49 text

TYPICAL (OLD) ARCHITECTURE Host A API simplequeue queuereader Host B pubsub

Slide 50

Slide 50 text

TYPICAL (OLD) ARCHITECTURE Host A API simplequeue queuereader Host B pubsub Host C simplequeue queuereader ps_to_http

Slide 51

Slide 51 text

TYPICAL (OLD) ARCHITECTURE Host A API simplequeue queuereader Host B pubsub Host C simplequeue queuereader ps_to_http SPOF SPOF COMPLEX

Slide 52

Slide 52 text

TYPICAL (OLD) ARCHITECTURE
[Diagram: Host A runs API, simplequeue, queuereader; Host B runs pubsub; Host C runs ps_to_http, simplequeue, queuereader. Annotations: SPOF, SPOF, COMPLEX, ANARCHY]

Slide 53

Slide 53 text

I WANT IT ALL

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

NSQ Core Features

Slide 56

Slide 56 text

NSQ Core Features Queue daemon facilitates multicast, distribution, and buffering

Slide 57

Slide 57 text

NSQ Core Features Queue daemon facilitates multicast, distribution, and buffering Fully distributed and decentralized

Slide 58

Slide 58 text

NSQ Core Features
Queue daemon facilitates multicast, distribution, and buffering
Lookup service simplifies configuration and allows topology to change dynamically
Fully distributed and decentralized

Slide 59

Slide 59 text

MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics) • a channel is an independent queue for a topic (a topic can have multiple channels) • consumers discover producers by querying nsqlookupd (a discovery service for topics) • topics and channels are created at runtime (just start publishing/subscribing) nsqd “clicks” Topics

Slide 60

Slide 60 text

MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics) • a channel is an independent queue for a topic (a topic can have multiple channels) • consumers discover producers by querying nsqlookupd (a discovery service for topics) • topics and channels are created at runtime (just start publishing/subscribing) nsqd “metrics” Channels “clicks” Topics

Slide 61

Slide 61 text

MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics) • a channel is an independent queue for a topic (a topic can have multiple channels) • consumers discover producers by querying nsqlookupd (a discovery service for topics) • topics and channels are created at runtime (just start publishing/subscribing) nsqd “metrics” Channels “clicks” Topics “spam_analysis”

Slide 62

Slide 62 text

MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics) • a channel is an independent queue for a topic (a topic can have multiple channels) • consumers discover producers by querying nsqlookupd (a discovery service for topics) • topics and channels are created at runtime (just start publishing/subscribing) nsqd “metrics” Channels “clicks” Topics “spam_analysis” “archive”

Slide 63

Slide 63 text

separate hosts MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics) • a channel is an independent queue for a topic (a topic can have multiple channels) • consumers discover producers by querying nsqlookupd (a discovery service for topics) • topics and channels are created at runtime (just start publishing/subscribing) nsqd “metrics” Channels “clicks” Topics “spam_analysis” “archive” Consumers

Slide 66

Slide 66 text

separate hosts MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics) • a channel is an independent queue for a topic (a topic can have multiple channels) • consumers discover producers by querying nsqlookupd (a discovery service for topics) • topics and channels are created at runtime (just start publishing/subscribing) nsqd “metrics” Channels “clicks” Topics “spam_analysis” “archive” Consumers A A A

Slide 70

Slide 70 text

MULTICAST AND BUFFERING, YOU SAY?
NSQ CONCEPTS AND MESSAGE FLOW
• a topic is a distinct stream of messages (a single nsqd instance can have multiple topics)
• a channel is an independent queue for a topic (a topic can have multiple channels)
• consumers discover producers by querying nsqlookupd (a discovery service for topics)
• topics and channels are created at runtime (just start publishing/subscribing)
[Diagram: inside nsqd, topic “clicks” feeds channels “metrics”, “spam_analysis”, and “archive”; consumers on separate hosts read from the channels; messages A and B each appear once per channel]
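
The topic / channel relationship can be modelled in a few lines. This is a toy in-memory model, not NSQ code: the class and method names are invented for illustration, and the round-robin choice of consumer is an assumption (the slides only say each channel is an independent queue). A topic copies every message to each of its channels, and each channel buffers messages and hands each one to a single consumer.

```python
from collections import deque


class Channel:
    """Independent queue for a topic; each message goes to one consumer."""

    def __init__(self):
        self.backlog = deque()   # holds messages while no consumer is attached
        self.consumers = []
        self._next = 0

    def subscribe(self, consumer):
        self.consumers.append(consumer)
        self.flush()

    def put(self, msg):
        self.backlog.append(msg)
        self.flush()

    def flush(self):
        while self.backlog and self.consumers:
            consumer = self.consumers[self._next % len(self.consumers)]
            self._next += 1
            consumer(self.backlog.popleft())


class Topic:
    """Distinct stream of messages; multicasts to every channel."""

    def __init__(self):
        self.channels = {}

    def channel(self, name):
        # Channels are created at runtime, the first time they are named.
        return self.channels.setdefault(name, Channel())

    def publish(self, msg):
        for ch in self.channels.values():
            ch.put(msg)


clicks = Topic()
metrics_a, metrics_b, archive = [], [], []
clicks.channel("metrics").subscribe(metrics_a.append)
clicks.channel("metrics").subscribe(metrics_b.append)
clicks.channel("archive")                     # exists, but nobody is reading yet
for n in range(4):
    clicks.publish(n)
clicks.channel("archive").subscribe(archive.append)   # late consumer gets the backlog
```

Here the two "metrics" consumers split the four messages between them, while "archive" receives its own full copy once a consumer attaches.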

Slide 74

Slide 74 text

DISCOVERY remove the need for publishers and consumers to know about each other nsqlookupd nsqd producer nsqlookupd

Slide 75

Slide 75 text

DISCOVERY remove the need for publishers and consumers to know about each other nsqlookupd nsqd ❶ publish msg (specifying topic) producer nsqlookupd

Slide 76

Slide 76 text

DISCOVERY remove the need for publishers and consumers to know about each other nsqlookupd nsqd ❶ publish msg (specifying topic) producer ➋ IDENTIFY persistent TCP connections nsqlookupd

Slide 77

Slide 77 text

DISCOVERY
remove the need for publishers and consumers to know about each other
[Diagram: producer, nsqd, two nsqlookupd instances; persistent TCP connections]
❶ publish msg (specifying topic)
➋ IDENTIFY
➌ REGISTER (topic/channel)

Slide 78

Slide 78 text

DISCOVERY (CLIENT) remove the need for publishers and consumers to know about each other nsqlookupd nsqlookupd consumer

Slide 79

Slide 79 text

DISCOVERY (CLIENT) remove the need for publishers and consumers to know about each other nsqlookupd nsqlookupd consumer ➊ regularly poll for topic producers HTTP requests

Slide 80

Slide 80 text

DISCOVERY (CLIENT)
remove the need for publishers and consumers to know about each other
[Diagram: consumer, two nsqlookupd instances; HTTP requests]
➊ regularly poll for topic producers
➋ connect to all producers
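
The two discovery slides can be summarized with a toy model. Nothing here is the real nsqlookupd protocol or API: the Lookupd class, its methods, and the addresses are invented to show the shape of the idea. Each nsqd registers its topics with every lookup instance it can reach, and a consumer takes the union of what all lookup instances report, so the instances never need to talk to each other.

```python
class Lookupd:
    """Toy stand-in for one nsqlookupd: remembers who produces each topic."""

    def __init__(self):
        self.producers = {}

    def register(self, topic, address):
        self.producers.setdefault(topic, set()).add(address)

    def lookup(self, topic):
        return set(self.producers.get(topic, ()))


def discover(lookupds, topic):
    """Union the answers, so one stale or incomplete instance does not matter."""
    found = set()
    for lookupd in lookupds:
        found |= lookupd.lookup(topic)
    return found


lookup_1, lookup_2 = Lookupd(), Lookupd()

# nsqd on host A reaches both lookup instances.
for lookupd in (lookup_1, lookup_2):
    lookupd.register("clicks", "host-a:4150")

# nsqd on host B only managed to reach the second one.
lookup_2.register("clicks", "host-b:4150")

# A consumer polling both still finds every producer, and would connect to all of them.
producers = discover([lookup_1, lookup_2], "clicks")
```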

Slide 81

Slide 81 text

ELIMINATE ALL THE SPOF •easily enable distributed and decentralized topologies •no brokers •consumers connect to all producers •messages are pushed to consumers •nsqlookupd instances are independent and require no coordination (run a few for HA)

Slide 82

Slide 82 text

ELIMINATE ALL THE SPOF nsqd nsqd nsqd •easily enable distributed and decentralized topologies •no brokers •consumers connect to all producers •messages are pushed to consumers •nsqlookupd instances are independent and require no coordination (run a few for HA)

Slide 83

Slide 83 text

ELIMINATE ALL THE SPOF nsqd nsqd nsqd consumer •easily enable distributed and decentralized topologies •no brokers •consumers connect to all producers •messages are pushed to consumers •nsqlookupd instances are independent and require no coordination (run a few for HA)

Slide 85

Slide 85 text

ELIMINATE ALL THE SPOF
[Diagram: three nsqd instances, two consumers]
• easily enable distributed and decentralized topologies
• no brokers
• consumers connect to all producers
• messages are pushed to consumers
• nsqlookupd instances are independent and require no coordination (run a few for HA)

Slide 87

Slide 87 text

EXAMPLE NSQ ARCHITECTURE NSQ NSQD API consumer NSQ NSQD API NSQ NSQD API consumer nsqlookupd nsqlookupd

Slide 88

Slide 88 text

EXAMPLE NSQ ARCHITECTURE NSQ NSQD API consumer NSQ NSQD API NSQ NSQD API consumer nsqlookupd nsqlookupd PUBLISH

Slide 89

Slide 89 text

EXAMPLE NSQ ARCHITECTURE NSQ NSQD API consumer NSQ NSQD API NSQ NSQD API consumer nsqlookupd nsqlookupd PUBLISH REGISTER

Slide 90

Slide 90 text

EXAMPLE NSQ ARCHITECTURE NSQ NSQD API consumer NSQ NSQD API NSQ NSQD API consumer nsqlookupd nsqlookupd PUBLISH REGISTER DISCOVER

Slide 91

Slide 91 text

EXAMPLE NSQ ARCHITECTURE
[Diagram: three hosts, each running an API alongside nsqd; two consumers; two nsqlookupd instances. Arrows labeled PUBLISH, REGISTER, DISCOVER, SUBSCRIBE]

Slide 92

Slide 92 text

A WORD ON ERRORS
• If a reader does not reply to confirm completion of a message within a timeout, the message is requeued.
• Abandoned after configurable number of requeues
• Allows for recovery in face of transient problems without getting hung up on bad messages
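
The requeue-then-abandon policy can be sketched as a loop over a pending queue. This is a simplified illustration, not NSQ's implementation: a raised exception stands in for a missed timeout, and MAX_ATTEMPTS, drain, and the sample handler are hypothetical names.

```python
from collections import deque

MAX_ATTEMPTS = 3   # hypothetical, configurable limit


def drain(messages, handler):
    """Process messages, requeueing failures until MAX_ATTEMPTS is reached."""
    pending = deque((msg, 1) for msg in messages)
    done, abandoned = [], []
    while pending:
        msg, attempts = pending.popleft()
        try:
            handler(msg)
            done.append(msg)
        except Exception:
            if attempts >= MAX_ATTEMPTS:
                abandoned.append(msg)                  # give up on a bad message
            else:
                pending.append((msg, attempts + 1))    # requeue behind the others
    return done, abandoned


failed_once = set()


def handler(msg):
    if msg == "bad":
        raise ValueError("unprocessable")
    if msg == "flaky" and msg not in failed_once:
        failed_once.add(msg)
        raise TimeoutError("transient problem")


done, abandoned = drain(["ok", "flaky", "bad"], handler)
```

The transient failure recovers on its second attempt, while the permanently bad message is dropped after three tries instead of blocking everything behind it.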

Slide 93

Slide 93 text

OTHER NSQ NICETIES
• Admin interface: server-side channel pausing, admin action notifications
• Configurable high-water mark on memory usage
• Ephemeral channels for stream sampling

Slide 94

Slide 94 text

github.com/bitly/nsq

Slide 95

Slide 95 text

DISTRIBUTED MESSAGING CAVEATS

Slide 96

Slide 96 text

DISTRIBUTED MESSAGING CAVEATS •Messages in order? Fuggedaboudit!*

Slide 97

Slide 97 text

DISTRIBUTED MESSAGING CAVEATS •Messages in order? Fuggedaboudit!* •NSQ protocol guarantees delivery at least once - idempotence is a must! (_ids help)

Slide 98

Slide 98 text

DISTRIBUTED MESSAGING CAVEATS •Messages in order? Fuggedaboudit!* •NSQ protocol guarantees delivery at least once - idempotence is a must! (_ids help) •Try not to be shocked by effortless recovery from node failure

Slide 99

Slide 99 text

DISTRIBUTED MESSAGING CAVEATS
• Messages in order? Fuggedaboudit!*
• NSQ protocol guarantees delivery at least once - idempotence is a must! (_ids help)
• Try not to be shocked by effortless recovery from node failure
*See http://bit.ly/life_beyond_transactions
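
One common way to get idempotence under at-least-once delivery is to remember which message ids have already been applied. The sketch below assumes every message carries a unique "_id" field (the field name and the class are illustrative) and keeps the seen ids in memory, which a real system would replace with something bounded or persistent.

```python
class IdempotentCounter:
    """Counts clicks, ignoring any message id it has already applied."""

    def __init__(self):
        self.seen_ids = set()
        self.clicks = 0

    def handle(self, msg):
        if msg["_id"] in self.seen_ids:
            return False          # duplicate delivery: safe to acknowledge and skip
        self.seen_ids.add(msg["_id"])
        self.clicks += 1
        return True


counter = IdempotentCounter()
deliveries = [{"_id": "a"}, {"_id": "b"}, {"_id": "a"}]   # "a" is redelivered
applied = [counter.handle(msg) for msg in deliveries]
```

Because the check uses the id and not arrival order, it also tolerates the out-of-order delivery mentioned in the first caveat.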

Slide 100

Slide 100 text

STREAM PROCESSING: WHY NOW?

Slide 101

Slide 101 text

STREAM PROCESSING: WHY NOW? •Cheap node distribution: EC2 etc

Slide 102

Slide 102 text

STREAM PROCESSING: WHY NOW? •Cheap node distribution: EC2 etc •Moore’s law, Amdahl’s law, battered deceased equines...

Slide 103

Slide 103 text

STREAM PROCESSING: WHY NOW?
• Cheap node distribution: EC2 etc
• Moore’s law, Amdahl’s law, battered deceased equines...
• Taking advantage of CPU parallelism is the way forward for program efficiency - good thing we just went over a paradigm for distributing tasks among parallel workers!

Slide 104

Slide 104 text

INTRA-PROGRAM STREAM PROCESSING IN THE WILD

Slide 105

Slide 105 text

EXAMPLE 1: GOLANG

Slide 106

Slide 106 text

• Channels allow synchronized passage of messages between two goroutines
• Goroutine independence (through synchronization) allows stream-like architecture:
  • “Don’t communicate by sharing memory, share memory by communicating”
• Golang scheduler can parallelize between cores (GOMAXPROCS)
• Channels act like queues. Multicast not really an option
• Queuereader applications are a particularly good fit for goroutine concurrency

Slide 107

Slide 107 text

[Diagram: Q feeds two ConsumerA instances; inside a consumer, a Golang channel passes m1, m2, m3 to Goroutines 1 to 3, spread across CPU 1 and CPU 2]
• Within each consumer, messages distributed among goroutines
• Goroutines, when possible, parallelized across CPUs
• OK to have more goroutines than CPUs - golang scheduler will give them CPU time when another goroutine is idle (e.g. waiting on network)

Slide 108

Slide 108 text

EXAMPLE 2 WHAT’S THE DEAL WITH ZEROMQ?

Slide 109

Slide 109 text

ZMQ FEATURES
• Networking library that provides building blocks discussed earlier
• Unlike golang channels, does support many more complex patterns
• Transport layer abstracted out: same application can connect multiple threads or multiple machines
• Can start by distributing among processes, and scale up to several boxes. Application code doesn’t need to know about it!
• All the rage among the webscale set, but unclear what the hell is going on in the community

Slide 110

Slide 110 text

zmq.bind("inproc://example_socket")
zmq.bind("tcp://1.2.3.4:5678")
Change transport by changing one string

Slide 111

Slide 111 text

ALMOST DONE I PROMISE

Slide 112

Slide 112 text

WHAT HAVE WE SEEN HERE?
• Stream processing paradigm is a great tool for writing composed, modular applications
• Fault tolerance and horizontal scalability come in the box
• Your web application is probably better suited to this design than you think
• NSQ is the tool we use to write distributed stream processing applications and it kicks ass at it
• These same paradigms can aid in writing performant applications making use of multicore computer architecture, so you should plan on seeing a lot more of this stuff in the near future, whether you like it or not

Slide 113

Slide 113 text

THANKS!
Dan Frank @danielhfrank