Slide 1

Slide 1 text

Guest Lecture
WQD 7007 - Big Data Management
Managing your big data: examples and use cases
Faiz Zaki
Network Analytics Lab, Universiti Malaya
20 April 2021

Slide 2

Slide 2 text

Agenda
• Introduction
• Tools
• Use cases

Slide 3

Slide 3 text

Introduction
• Big Data is... big.
• The 3Vs: volume, velocity, variety.
• The term "Big Data" grew in popularity over the last decade (from around 2012 onwards).
• Doug Cutting and Mike Cafarella created Hadoop a few years earlier, in 2006.
• Do we really have or need big data? Why?

Slide 4

Slide 4 text

Introduction
• According to IDC, as much as 90% of digital data is unstructured.
• Structured data fits neatly in a table: names, addresses, etc.
• Unstructured data exists in its raw form: text, social media posts, emails, videos, etc.
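To make the structured/unstructured distinction concrete, here is a minimal sketch that turns one raw social media post into a table-like record. The post text, the handle and the field names (`mentions`, `hashtags`, `n_words`) are made up for illustration; real pipelines would extract far richer features.

```python
import re


def post_to_row(post: str) -> dict:
    """Turn one raw (unstructured) post into a structured record."""
    return {
        "mentions": re.findall(r"@(\w+)", post),   # handles such as @someone
        "hashtags": re.findall(r"#(\w+)", post),   # tags such as #topic
        "n_words": len(post.split()),              # whitespace-separated tokens
    }


# Hypothetical post, invented for this example.
raw_post = "Thanks @um_ptm for the quick fix! #network #security"
row = post_to_row(raw_post)
print(row)
```

The raw string has no schema; the returned dictionary has fixed columns and could be inserted into a table.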

Slide 5

Slide 5 text

Introduction
• All of this data leads to another issue.
• How do you store and use it?
• Databases vs. data warehouses (ETL: extract, transform, load) vs. data lakes (ELT: extract, load, transform)
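The difference between ETL and ELT is only the order of the steps. The sketch below uses an in-memory SQLite database as a stand-in for both a warehouse and a lake; the table names, the sample events and the `clean` helper are assumptions made for illustration, not part of any particular product.

```python
import sqlite3

# Hypothetical raw events with inconsistent casing and stray whitespace.
raw_events = [
    ("alice", " 10.0.0.5 ", "2021-04-20T08:00:00"),
    ("BOB", "10.0.0.7", "2021-04-20T08:05:00"),
]


def clean(event):
    """Transform step: normalise the username and the IP address."""
    username, ip, ts = event
    return (username.lower(), ip.strip(), ts)


conn = sqlite3.connect(":memory:")

# ETL (warehouse style): transform in application code, then load clean rows.
conn.execute("CREATE TABLE warehouse (username TEXT, ip TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?)", [clean(e) for e in raw_events]
)

# ELT (lake style): load the raw rows untouched, transform later in the store.
conn.execute("CREATE TABLE lake_raw (username TEXT, ip TEXT, ts TEXT)")
conn.executemany("INSERT INTO lake_raw VALUES (?, ?, ?)", raw_events)
conn.execute(
    "CREATE VIEW lake_clean AS "
    "SELECT lower(username) AS username, trim(ip) AS ip, ts FROM lake_raw"
)

etl_rows = conn.execute("SELECT * FROM warehouse ORDER BY ts").fetchall()
elt_rows = conn.execute("SELECT * FROM lake_clean ORDER BY ts").fetchall()
print(etl_rows == elt_rows)
```

Both routes end with the same clean rows. With ELT the raw data is still available in `lake_raw`, so a different transformation can be applied later without re-ingesting.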

Slide 6

Slide 6 text

Tools
• Hierarchical Data Format (HDF5)
  • File-directory-like structure, i.e. groups and datasets
  • Stored in a single file
• Vaex
  • Lazy, out-of-core processing
• Dask
  • Parallel, multi-core processing
  • Dask DataFrame is built on top of pandas
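"Lazy, out-of-core" means that rows are only read when they are needed and the full dataset never sits in memory at once. The sketch below shows that idea with the Python standard library only; it is not the Vaex or Dask API, and the column names and the tiny in-memory "file" are stand-ins for a large log file on disk.

```python
import csv
import io


def read_rows(handle):
    """Lazily yield one parsed row at a time; nothing is read until iterated."""
    for record in csv.DictReader(handle):
        yield record


def streaming_mean(rows, column):
    """Compute a mean while holding only one row in memory at a time."""
    total = 0.0
    count = 0
    for record in rows:
        total += float(record[column])
        count += 1
    return total / count if count else float("nan")


# Stand-in for a file too large to load at once (hypothetical columns).
handle = io.StringIO("src_ip,bytes\n10.0.0.5,100\n10.0.0.7,300\n10.0.0.5,200\n")

rows = read_rows(handle)                    # lazy: no data consumed yet
mean_bytes = streaming_mean(rows, "bytes")  # rows are pulled one by one here
print(mean_bytes)  # 200.0
```

Libraries such as Vaex and Dask apply the same principle at scale, adding memory mapping, chunked dataframes and parallel scheduling on top.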

Slide 7

Slide 7 text

Tools
• Hadoop
  • MapReduce
  • HDFS
    • NameNode
    • DataNode
  • YARN
• Hortonworks, Cloudera, Azure HDInsight and Amazon EMR simplify the deployment of Hadoop clusters.
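The MapReduce programming model can be illustrated on a single machine with a word count. This is a plain-Python sketch of the three phases (map, shuffle, reduce), not Hadoop's actual Java API; on a real cluster the map and reduce calls would run in parallel across nodes, with the input split into HDFS blocks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data lakes store raw data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["data"], counts["big"])  # 3 2
```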

Slide 8

Slide 8 text

Tools
Cloud Data Warehouses
• Google BigQuery
• Snowflake
Data Mining and ETL
• Xplenty
• RapidMiner

Slide 9

Slide 9 text

Use Case: Problems
• Pusat Teknologi Maklumat (PTM), UM, has been facing constant cyber attacks. MyCERT often identified UM as a source of botnet attacks. Because PTM allocates IP addresses dynamically to all authenticated users, it became difficult to trace the origin of any attack.
• PTM's Palo Alto firewalls are underutilized.
• Traffic logs amount to approximately 50 GB daily, even without any payload.
• There is no centralized system to monitor the network (SIEM).

Slide 10

Slide 10 text

Use Case: Solution
Photo: Google

Slide 11

Slide 11 text

Use Case: Solution
• Palo Alto firewalls, Active Directory (AD) servers and other sources send their logs to the SIEM, which runs the ELK stack and acts as the data lake.
• Logstash ingests the logs and Elasticsearch indexes them for rapid retrieval.
• The SIEM maps users to IP addresses at a fixed interval for accurate identification.
• Kibana visualizes the logs.
• Search Guard manages authentication to ELK.
• Skedler sends automated reports to PTM.
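With dynamic IP allocation, the same address belongs to different users at different times, so a firewall log entry can only be attributed by looking up who held the IP at that moment. The sketch below shows one way such a lookup could work. It is not PTM's implementation: the record layout `(timestamp, ip, user)`, the sample values and the function names are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical mapping records: (timestamp_in_seconds, ip, user).
auth_events = [
    (100, "10.0.0.5", "alice"),
    (400, "10.0.0.5", "bob"),
    (150, "10.0.0.7", "carol"),
]


def build_index(events):
    """Group mapping records per IP, ordered by timestamp."""
    by_ip = defaultdict(list)
    for ts, ip, user in sorted(events):
        by_ip[ip].append((ts, user))
    return by_ip


def user_at(index, ip, ts):
    """Return the user who most recently held `ip` at time `ts`, if any."""
    history = index.get(ip, [])
    times = [t for t, _ in history]
    pos = bisect_right(times, ts)  # number of records at or before ts
    return history[pos - 1][1] if pos else None


index = build_index(auth_events)
print(user_at(index, "10.0.0.5", 250))  # alice
print(user_at(index, "10.0.0.5", 500))  # bob
print(user_at(index, "10.0.0.5", 50))   # None
```

The shorter the mapping interval, the smaller the window in which an IP could have changed hands unnoticed, which is why the interval matters for accurate identification.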

Slide 12

Slide 12 text

Use Case

Slide 13

Slide 13 text

Use Case

Slide 14

Slide 14 text

Use Case: Outcomes
• Saved (a lot of) money.
• Able to map IP addresses to specific users.
• Identified the sources of attacks.
• Built network profiles.

Slide 15

Slide 15 text

Take-home messages
• You might have big data if you constantly deal with both structured and unstructured data.
• More often than not, standalone tools such as Python and its libraries are capable of handling big data.
• Various point-and-click tools are available.
• Big data technologies have abstracted away so much of the complexity of handling big data that we are beginning to view big data as just normal data.