Slide 1

Slide 1 text

Guest Lecture Name
WQD 7007 - Big Data Management
Handling streaming data with Streamz
Faiz Zaki
Network Analytics Lab
Universiti Malaya
27 April 2021

Slide 2

Slide 2 text

Agenda
• Recap
• Introduction
• Use case and demo

Acknowledgement: Some of the content in this slide deck is adapted from Matthew Rocklin’s talk, Streaming Processing with Dask, at PyData 2017. Thank you ☺

Slide 3

Slide 3 text

Introduction
Discussion (20 April 2021)
Share your real-world use cases for databases, data warehouses and data lakes.
How do you or your company use them to manage data?
What problems have you or your company faced along the way?

Slide 4

Slide 4 text

Introduction
• Streaming data is one of the common forms of big data (volume, velocity)
• Streaming data is:
  • Unbounded/infinite: we might receive data continuously, forever
  • Timely: we care about responding quickly (near real-time)
• Used in:
  • Web server logs
  • Financial time series (trading)
  • Network data
  • IoT sensors
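To make "unbounded" concrete, here is a minimal sketch in plain Python. The generator `sensor_readings` is a hypothetical stand-in for a real source such as an IoT sensor; it is not part of any library. It never finishes, so the consumer has to decide how much to take.

```python
import itertools
import random
import time


def sensor_readings(seed=0):
    """Hypothetical unbounded source: yields (timestamp, value) forever."""
    rng = random.Random(seed)
    while True:
        yield time.time(), rng.gauss(20.0, 2.0)


# An unbounded stream can never be collected into a list in full,
# so we take a bounded slice of it instead.
first_five = list(itertools.islice(sensor_readings(), 5))

for timestamp, value in first_five:
    print(f"{timestamp:.3f} {value:.2f}")
```

Calling `list(sensor_readings())` would never return, which is the practical meaning of "we might receive data continuously, forever".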

Slide 5

Slide 5 text

Introduction
• With streaming data, we often need only a small subset of the data for analysis.
• It is also common not to store the data at all, i.e. to analyse it and output results on the fly.
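Both ideas can be sketched in a few lines of plain Python: keep only the subset of interest, and update a summary as each item arrives instead of storing the items. The names `RunningMean` and `large_values`, the threshold, and the sample numbers are illustrative assumptions, not part of the lecture material.

```python
class RunningMean:
    """Keeps only a count and a mean; no individual item is stored."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean update: constant memory per stream.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def large_values(stream, threshold):
    """Lazily keep only the subset of the stream above the threshold."""
    return (x for x in stream if x > threshold)


stats = RunningMean()
incoming = iter([3, 12, 7, 20, 1, 16])  # stand-in for a live stream
for value in large_values(incoming, threshold=10):
    stats.update(value)

print(stats.count, stats.mean)
```

Only the three values above the threshold are ever processed, and after each one the summary is already up to date.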

Slide 6

Slide 6 text

Introduction
• There are various available solutions
• Big Data solutions:
  • Apache Spark Streaming
  • Apache Storm
  • Apache Flink
  • Azure Stream Analytics
• Complex solutions
  • Message queueing (ZeroMQ, RabbitMQ)
  • Custom code and protocols

Slide 7

Slide 7 text

Introduction
• Streamz: a Python library for dealing with streaming data
  • Pythonic
  • Simple in simple cases
  • Flexible enough for complex cases
  • Integrates well with Python libraries (Jupyter, Pandas etc.)
• Other Python libraries for streaming data? Not many.
  • Scikit-multiflow: Python machine learning library for streaming data (incremental learning)

Slide 8

Slide 8 text

Use Case
Problems
• Network traffic analysis is critical for network administrators to manage their networks efficiently.
• Network traffic is continuous in nature.
• Network traffic payloads are large, but the headers are much smaller.
• Can we use the traffic stream to produce near real-time analysis?
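To show why headers are attractive for on-the-fly analysis, here is a minimal sketch that parses only the fixed 20-byte IPv4 header of a packet and ignores the payload. The function `parse_ipv4_header`, the sample addresses, and the 1000-byte payload are assumptions made for illustration; IP options and checksum validation are not handled.

```python
import ipaddress
import struct

# Fixed 20-byte IPv4 header layout (no options), network byte order.
IPV4_HEADER_FORMAT = "!BBHHHBBH4s4s"


def parse_ipv4_header(packet: bytes) -> dict:
    """Read the first 20 bytes only; the payload is never inspected."""
    fields = struct.unpack(IPV4_HEADER_FORMAT, packet[:20])
    (version_ihl, _tos, total_length, _ident, _frag,
     ttl, protocol, _checksum, src, dst) = fields
    return {
        "version": version_ihl >> 4,
        "header_bytes": (version_ihl & 0x0F) * 4,
        "total_bytes": total_length,
        "ttl": ttl,
        "protocol": protocol,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


# Hand-built sample packet: 20-byte header plus a 1000-byte dummy payload.
# The checksum field is left at 0 because this sketch does not verify it.
sample_header = struct.pack(
    IPV4_HEADER_FORMAT, 0x45, 0, 1020, 0, 0, 64, 6, 0,
    ipaddress.IPv4Address("192.0.2.1").packed,
    ipaddress.IPv4Address("198.51.100.7").packed,
)
sample_packet = sample_header + bytes(1000)

summary = parse_ipv4_header(sample_packet)
print(summary)
```

In this sample, 20 of the 1020 bytes are enough to know who is talking to whom and with which protocol, which is the kind of record a streaming pipeline could aggregate in near real-time.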

Slide 9

Slide 9 text

Use Case
Solution

Slide 10

Slide 10 text

Use Case
Let’s code ☺

Slide 11

Slide 11 text

Use Case
Outcomes
• Saved (a lot of) money (again).
• Possible enhancements:
  • Include ML
  • More specific classification
  • User profile
  • Anomaly/Attack

Slide 12

Slide 12 text

Take-home messages
• Streaming data is valuable and abundant.
• Because storing data is expensive (for some), computing on the fly is preferable.
• Near real-time analysis assists in rapid decision making.
• Python offers an accessible way to customize stream processing with manageable complexity.

Slide 13

Slide 13 text

Thank you
Slide deck is available at speakerdeck.com/mfaizmzaki