
Real-time stream processing with Python


Discussion on real-time stream processing and how to build a Python application to analyse data streams.


Kristin Nguyen

June 18, 2015


Transcript

  1. Main topics
     • 8 rules for stream processing
     • Open-source software stack
     • Apache Storm
     • ELK stack: Logstash, Elasticsearch, Kibana
     • Python packages: Pyleus, elasticsearch, TextBlob (NLTK + pattern)
  2. Typical use cases
     • Trading: 122,000 messages per second in 2005, doubling every year, gives ~125 million messages per second in 2015
     • Finance & telco: fraud detection
     • Online advertising: personalized ads
     • Computer networks: monitoring security attacks, e.g. DoS
     • Manufacturing: process control and automation
     • Sensor networks: monitoring applications, e.g. congestion-based tolling on highways, real-time traffic routing
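The trading figure above follows from straightforward compounding, which a one-liner can verify:

```python
# Check the slide's growth claim: 122,000 msg/s in 2005,
# doubling every year, projected ten years out to 2015.
rate_2005 = 122_000
rate_2015 = rate_2005 * 2 ** (2015 - 2005)
print(rate_2015)  # 124928000, i.e. ~125 million messages per second
```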
  3. 8 rules for stream processing
     1. Keep the data moving
     2. Query on streams: windowed operations
     3. Handle stream imperfections: delayed, missing, out-of-order messages
     4. Generate predictable outcomes
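Rule 2's windowed operations can be illustrated with a small sliding-window counter. This is a stand-alone sketch in plain Python, not tied to any of the frameworks discussed later:

```python
from collections import deque

class SlidingWindowCount:
    """Count events seen in the last `window_seconds` of a stream:
    a minimal example of a windowed query over moving data."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, timestamp):
        self.events.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

counter = SlidingWindowCount(window_seconds=60)
for t in (0, 10, 30, 70):
    counter.add(t)
print(counter.count(now=70))  # 2 -- only the events at t=30 and t=70 remain
```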
  4. 8 rules for stream processing
     5. Integrate stored and streaming data
     6. Guarantee data safety and availability
     7. Partition and scale automatically
     8. Process and respond instantaneously
     Paper: "The 8 Requirements of Real-Time Stream Processing" by Stonebraker, Cetintemel, Zdonik, ACM SIGMOD Record, 2005
  5. Current stream processing solutions
     • Main-memory DBMSs
     • Rule engines
     • Stream processing engines, e.g. Aurora, STREAM
  6. Characteristics of current solutions

     Requirement                              Main-memory DBMSs  Rule engines  Stream processing engines
     Keep the data moving                     No                 Yes           Yes
     SQL-style processing on streams          No                 No            Yes
     Handle stream imperfections              Difficult          Possible      Possible
     Generate predictable outcomes            Difficult          Possible      Possible
     Integrate stored and streaming data      No                 No            Yes
     Guarantee data safety and availability   Possible           Possible      Possible
     Partition and scale applications         Possible           Possible      Possible
     Process and respond instantaneously      Possible           Possible      Possible
  7. Stream processing frameworks
     • Apache Storm, by Twitter
     • Spark Streaming, by Berkeley
     • Apache Kafka, by LinkedIn
     • Apache Samza, by LinkedIn: relies on Kafka and Hadoop YARN
  8. Technologies
     • Apache Storm: stream processing engine
     • Pyleus: API to implement Storm topologies in Python
     • ELK stack: store, query and visualize
  9. 8 rules for stream processing applied to Apache Storm
     1. Keep the data moving: yes (no storage involved; push, not poll)
     2. Query on streams: not really
     3. Handle stream imperfections: yes (reliability guaranteed via tuple acknowledgment)
     4. Generate predictable outcomes: yes
  10. 8 rules for stream processing applied to Apache Storm
     5. Integrate stored and streaming data: not really
     6. Guarantee data safety and availability: yes
     7. Partition and scale automatically: yes
     8. Process and respond instantaneously: yes
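The tuple acknowledgment mentioned under rule 3 amounts to replaying a tuple until its consumer acknowledges it. A minimal stand-alone sketch of that at-least-once pattern (a hypothetical helper, not Storm's actual implementation):

```python
def process_with_acks(tuples, process, max_retries=3):
    """Replay each tuple until processing succeeds (the 'ack'),
    up to max_retries; a failed tuple is then dropped, where
    Storm would instead report the failure to the spout."""
    results = []
    for tup in tuples:
        for attempt in range(max_retries):
            try:
                results.append(process(tup))
                break  # acked: move on to the next tuple
            except Exception:
                continue  # failed: replay the tuple
    return results

# A consumer that fails transiently on its first sight of 2:
attempts = {}
def flaky_double(x):
    attempts[x] = attempts.get(x, 0) + 1
    if x == 2 and attempts[x] == 1:
        raise RuntimeError("transient failure")
    return x * 2

print(process_with_acks([1, 2, 3], flaky_double))  # [2, 4, 6]
```

Note that a tuple is only lost if it fails all `max_retries` replays, which mirrors how acknowledgment gives reliability without any storage.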
  11. Abstractions in Storm
     • Tuple: immutable list of (key, value) pairs
     • Stream: unbounded sequence of tuples
     • Spout: source of a stream
     • Bolt: processing unit; consumes tuples and produces a new output stream
     • Topology: a network of spouts and bolts
     • Task: a running instance of a spout or a bolt
     • Stream grouping: how tuples are sent between tasks, e.g. shuffle, fields
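The flow of tuples through these abstractions can be mimicked in plain Python. The classes below are an illustrative word-count simulation, not the Storm or Pyleus API:

```python
class WordSpout:
    """Spout: source of the stream (a finite list here, for the sketch)."""
    def __init__(self, sentences):
        self.sentences = sentences
    def next_tuple(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    """Bolt: consumes sentence tuples, emits a new stream of words."""
    def process_tuple(self, sentence):
        for word in sentence.split():
            yield word

class CountBolt:
    """Bolt: keeps running word counts, emits (word, count) tuples."""
    def __init__(self):
        self.counts = {}
    def process_tuple(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])

# Topology: wire spout -> split -> count and drive the stream through it.
spout, split, count = WordSpout(["to be or not to be"]), SplitBolt(), CountBolt()
for sentence in spout.next_tuple():
    for word in split.process_tuple(sentence):
        for pair in count.process_tuple(word):
            pass  # a further bolt would consume (word, count) here
print(count.counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```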
  12. Develop a Storm project with Pyleus
     • How to set up Storm from scratch?
       • Storm + ZooKeeper + Supervisord
     • How to develop and test a Storm topology locally?
       • Spout: initialize(), next_tuple()
       • Bolt: initialize(), process_tuple()
       • Topology: define connections among spouts and bolts in a .yaml config file
       • Compile: pyleus build
       • Run in local mode: pyleus local
     • How to deploy and monitor a Storm topology?
       • Run pyleus submit (also pyleus list, pyleus kill, storm activate, storm deactivate, storm rebalance)
       • Monitor with the Storm UI
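The .yaml config mentioned above declares the topology's spouts, bolts, and groupings. A minimal fragment in the style of the Pyleus examples; the topology name and module paths are hypothetical:

```yaml
# pyleus_topology.yaml -- illustrative sketch, not from the talk's demo
name: word_count

topology:
    - spout:
        name: line-spout
        module: word_count.line_spout

    - bolt:
        name: split-words
        module: word_count.split_words
        groupings:
            # shuffle grouping: tuples from line-spout are distributed
            # randomly across the bolt's tasks
            - shuffle_grouping: line-spout
```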
  13. Development guidelines
     • Each component performs a light computation
     • Error handling is critical
     • Test, tune & scale individual components
  14. Performance tuning
     • Parallelism settings: #machines, #workers, #executors, #tasks
     • Reliability settings: #acker tasks
     • Stream grouping type
     • Topology design & logic
  15. Demo
     • Real-time Twitter sentiment analysis
     • ~500 million tweets per day, i.e. ~5,800 per second
     • Twitter streaming API: ~1% sample, i.e. ~58 per second
     • Preprocess: remove unwanted content
     • Compute sentiment score
     • Write the result to Elasticsearch