
Real-time stream processing with Python


Discussion on real-time stream processing and how to build a Python application to analyse data streams.


Kristin Nguyen

June 18, 2015


Transcript

  1. Main topics
     • 8 rules for stream processing
     • Open-source software stack
     • Apache Storm
     • ELK stack: Logstash, Elasticsearch, Kibana
     • Python packages: Pyleus, elasticsearch, TextBlob (NLTK + pattern)
  2. Typical use cases
     • Trading: 122,000 messages per second in 2005, doubling every year, gives ~125 million messages per second in 2015
     • Finance & telco: fraud detection
     • Online advertising: personalized ads
     • Computer networks: monitoring security attacks, e.g. DoS
     • Manufacturing: process control and automation
     • Sensor networks: monitoring applications, e.g. congestion-based tolling on highways, real-time traffic routing
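The trading figure above follows from straightforward compounding, which a one-liner can verify:

```python
# Check the slide's growth claim: 122,000 msg/s in 2005,
# doubling every year, projected ten years out to 2015.
rate_2005 = 122_000
rate_2015 = rate_2005 * 2 ** (2015 - 2005)
print(rate_2015)  # 124928000, i.e. ~125 million messages per second
```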
  3. 8 rules for stream processing
     1. Keep the data moving
     2. Query on streams: windowed operations
     3. Handle stream imperfections: delayed, missing, out-of-order messages
     4. Generate predictable outcomes
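Rule 2's windowed operations can be illustrated with a small sliding-window counter. This is a stand-alone sketch in plain Python, not tied to any of the frameworks discussed later:

```python
from collections import deque

class SlidingWindowCount:
    """Count events seen in the last `window_seconds` of a stream:
    a minimal example of a windowed query over moving data."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, timestamp):
        self.events.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

counter = SlidingWindowCount(window_seconds=60)
for t in (0, 10, 30, 70):
    counter.add(t)
print(counter.count(now=70))  # 2 -- only the events at t=30 and t=70 remain
```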
  4. 8 rules for stream processing
     5. Integrate stored and streaming data
     6. Guarantee data safety and availability
     7. Partition and scale automatically
     8. Process and respond instantaneously
     Paper: "The 8 Requirements of Real-Time Stream Processing" by Stonebraker, Cetintemel, Zdonik, ACM SIGMOD Record, 2005
  5. Current stream processing solutions
     • Main-memory DBMSs
     • Rule engines
     • Stream processing engines, e.g. Aurora, STREAM
  6. Characteristics of current solutions

     Requirement                              Main-memory DBMSs  Rule engines  Stream processing engines
     Keep the data moving                     No                 Yes           Yes
     SQL-style processing on streams          No                 No            Yes
     Handle stream imperfections              Difficult          Possible      Possible
     Generate predictable outcomes            Difficult          Possible      Possible
     Integrate stored and streaming data      No                 No            Yes
     Guarantee data safety and availability   Possible           Possible      Possible
     Partition and scale applications         Possible           Possible      Possible
     Process and respond instantaneously      Possible           Possible      Possible
  7. Stream processing frameworks
     • Apache Storm, by Twitter
     • Spark Streaming, by Berkeley
     • Apache Kafka, by LinkedIn
     • Apache Samza, by LinkedIn: relies on Kafka and Hadoop YARN
  8. Technologies
     • Apache Storm: stream processing engine
     • Pyleus: API to implement Storm topologies in Python
     • ELK stack: store, query and visualize
  9. 8 rules for stream processing applied to Apache Storm
     1. Keep the data moving: yes (no storage involved; push, not poll)
     2. Query on streams: not really
     3. Handle stream imperfections: yes (reliability guaranteed via tuple acknowledgment)
     4. Generate predictable outcomes: yes
  10. 8 rules for stream processing applied to Apache Storm
     5. Integrate stored and streaming data: not really
     6. Guarantee data safety and availability: yes
     7. Partition and scale automatically: yes
     8. Process and respond instantaneously: yes
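The tuple acknowledgment mentioned under rule 3 amounts to replaying a tuple until its consumer acknowledges it. A minimal stand-alone sketch of that at-least-once pattern (a hypothetical helper, not Storm's actual implementation):

```python
def process_with_acks(tuples, process, max_retries=3):
    """Replay each tuple until processing succeeds (the 'ack'),
    up to max_retries; a failed tuple is then dropped, where
    Storm would instead report the failure to the spout."""
    results = []
    for tup in tuples:
        for attempt in range(max_retries):
            try:
                results.append(process(tup))
                break  # acked: move on to the next tuple
            except Exception:
                continue  # failed: replay the tuple
    return results

# A consumer that fails transiently on its first sight of 2:
attempts = {}
def flaky_double(x):
    attempts[x] = attempts.get(x, 0) + 1
    if x == 2 and attempts[x] == 1:
        raise RuntimeError("transient failure")
    return x * 2

print(process_with_acks([1, 2, 3], flaky_double))  # [2, 4, 6]
```

Note that a tuple is only lost if it fails all `max_retries` replays, which mirrors how acknowledgment gives reliability without any storage.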
  11. Abstractions in Storm
     • Tuple: immutable list of (key, value) pairs
     • Stream: unbounded sequence of tuples
     • Spout: source of a stream
     • Bolt: processing unit; consumes tuples and produces a new output stream
     • Topology: a network of spouts and bolts
     • Task: a running instance of a spout or a bolt
     • Stream grouping: how tuples are sent between tasks, e.g. shuffle, fields
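The flow of tuples through these abstractions can be mimicked in plain Python. The classes below are an illustrative word-count simulation, not the Storm or Pyleus API:

```python
class WordSpout:
    """Spout: source of the stream (a finite list here, for the sketch)."""
    def __init__(self, sentences):
        self.sentences = sentences
    def next_tuple(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    """Bolt: consumes sentence tuples, emits a new stream of words."""
    def process_tuple(self, sentence):
        for word in sentence.split():
            yield word

class CountBolt:
    """Bolt: keeps running word counts, emits (word, count) tuples."""
    def __init__(self):
        self.counts = {}
    def process_tuple(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])

# Topology: wire spout -> split -> count and drive the stream through it.
spout, split, count = WordSpout(["to be or not to be"]), SplitBolt(), CountBolt()
for sentence in spout.next_tuple():
    for word in split.process_tuple(sentence):
        for pair in count.process_tuple(word):
            pass  # a further bolt would consume (word, count) here
print(count.counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```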
  12. Develop a Storm project with Pyleus
     • How to set up Storm from scratch?
       • Storm + ZooKeeper + Supervisord
     • How to develop and test a Storm topology locally?
       • Spout: initialize(), next_tuple()
       • Bolt: initialize(), process_tuple()
       • Topology: define connections among spouts and bolts in a .yaml config file
       • Compile: pyleus build
       • Run in local mode: pyleus local
     • How to deploy and monitor a Storm topology?
       • Run pyleus submit (also pyleus list, pyleus kill, storm activate, storm deactivate, storm rebalance)
       • Monitor with the Storm UI
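The .yaml config mentioned above declares the topology's spouts, bolts, and groupings. A minimal fragment in the style of the Pyleus examples; the topology name and module paths are hypothetical:

```yaml
# pyleus_topology.yaml -- illustrative sketch, not from the talk's demo
name: word_count

topology:
    - spout:
        name: line-spout
        module: word_count.line_spout

    - bolt:
        name: split-words
        module: word_count.split_words
        groupings:
            # shuffle grouping: tuples from line-spout are distributed
            # randomly across the bolt's tasks
            - shuffle_grouping: line-spout
```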
  13. Development guidelines
     • Each component performs a light computation
     • Error handling is critical
     • Test, tune & scale individual components
  14. Performance tuning
     • Parallelism settings: #machines, #workers, #executors, #tasks
     • Reliability settings: #acker tasks
     • Stream grouping type
     • Topology design & logic
  15. Demo
     • Real-time Twitter sentiment analysis
     • ~500 million tweets per day, i.e. ~5,800 per second
     • Twitter streaming API: ~1% sample, i.e. ~58 per second
     • Preprocess: remove unwanted content
     • Compute sentiment score
     • Write the result to Elasticsearch