Handling streaming data with Streamz

WQD7007 Guest Lecture

Faiz Zaki

April 27, 2021

Transcript

  1. Guest Lecture Name WQD 7007- Big Data Management

    Handling streaming data with Streamz
    Faiz Zaki, Network Analytics Lab, Universiti Malaya
    27 April 2021
  2. Agenda

    • Recap
    • Introduction
    • Use case and demo

    Acknowledgement: Some portions of the content in this slide deck are adapted from Matthew Rocklin’s talk, Streaming Processing with Dask, at PyData 2017. Thank you ☺
  3. Introduction

    Discussion (20 April 2021): Share your real-world use cases for databases, data warehouses and data lakes. How do you or your company use them to manage data? What problems have you or your company faced along the way?
  4. Introduction

    • Streaming data is one of the common forms of big data (volume, velocity)
    • Streaming data is:
        • Unbounded/infinite: we might receive data continuously, forever
        • Timely: we care about responding quickly (near real-time)
    • Used in:
        • Web server logs
        • Financial time series (trading)
        • Network data
        • IoT sensors
  5. Introduction

    • When dealing with streaming data, we often need only a small subset of the data for analysis.
    • It is also common not to store the data at all, i.e. to analyse it and output the results on the fly.
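The two points above (work on a small subset, and analyse on the fly without storing) can be sketched with plain Python generators. This is only an illustration: the `sensor_readings` source, the sensor names and the choice of a running mean are assumptions made for the example, not something taken from the lecture.

```python
import itertools
import random


def sensor_readings(seed=0):
    """Hypothetical unbounded source: yields (sensor_id, value) pairs forever."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(["a", "b", "c"]), rng.uniform(0.0, 100.0)


def running_mean(values):
    """Yield the mean of everything seen so far, keeping only two numbers of state."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


# Keep only the subset we care about (sensor "a") and analyse it on the fly.
subset = (value for sensor_id, value in sensor_readings() if sensor_id == "a")
means = running_mean(subset)

# The source never ends, so the demo takes just the first five results.
first_five = list(itertools.islice(means, 5))
print(first_five)
```

Nothing is written to disk and no history is kept: each reading updates the two state variables and is then discarded.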
  6. Introduction

    • There are various available solutions
    • Big Data solutions:
        • Apache Spark Streaming
        • Apache Storm
        • Apache Flink
        • Azure Stream Analytics
    • Complex solutions:
        • Message queueing (ZeroMQ, RabbitMQ)
        • Custom code and protocols
  7. Introduction

    • Streamz: a Python library for dealing with streaming data
        • Pythonic
        • Simple in simple cases
        • Flexible enough for complex cases
        • Integrates well with Python libraries (Jupyter, Pandas etc.)
    • Other Python libraries for streaming data? Not many.
        • scikit-multiflow: Python machine learning library for streaming data (incremental learning)
  8. Use Case

    Problems
    • Network traffic analysis is critical for network administrators to manage their networks efficiently.
    • Network traffic is continuous in nature.
    • Network traffic payloads are large, but the headers are much smaller.
    • Can we use the traffic stream to produce near real-time analysis?
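The observation that headers are much smaller than payloads suggests analysing only the header of each packet. The deck does not show how the demo parsed traffic, so the following is just one possible sketch using the standard library: it reads the fixed 20-byte part of an IPv4 header and never touches the payload. The function name, the sample addresses and the hand-built packet are assumptions for illustration.

```python
import socket
import struct

# Fixed 20-byte part of an IPv4 header, in network byte order.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")


def parse_ipv4_header(packet: bytes) -> dict:
    """Read only the first 20 bytes of a raw IPv4 packet; the payload is ignored."""
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = IPV4_HEADER.unpack_from(packet)
    return {
        "version": version_ihl >> 4,
        "header_bytes": (version_ihl & 0x0F) * 4,
        "total_bytes": total_length,
        "ttl": ttl,
        "protocol": protocol,  # e.g. 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Hypothetical packet: a hand-built header followed by 1480 bytes of dummy payload.
header = IPV4_HEADER.pack(0x45, 0, 1500, 1, 0, 64, 6, 0,
                          socket.inet_aton("10.0.0.1"),
                          socket.inet_aton("10.0.0.2"))
packet = header + b"\x00" * 1480

info = parse_ipv4_header(packet)
print(info)
```

In this example only 20 of the 1500 bytes are inspected, which is what makes per-packet analysis cheap enough to attempt in near real time.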
  9. Use Case

    Outcomes
    • Saved (a lot of) money (again).
    • Possible enhancements:
        • Include ML
        • More specific classification
        • User profile
        • Anomaly/Attack
  10. Take-home messages

    • Streaming data is valuable and abundant.
    • Where storing data is expensive (for some), computing on the fly is preferable.
    • Near real-time analysis assists rapid decision making.
    • Python offers an accessible way to customize stream processing with manageable complexity.
  11. Thank you

    Slide deck is available at speakerdeck.com/mfaizmzaki