Managing your Big Data: Examples and Use Cases

WQD 7007 Guest Lecture.

Faiz Zaki

April 20, 2021

Transcript

  1. Guest Lecture Name WQD 7007 - Big Data Management
     Managing your big data: examples and use cases
     Faiz Zaki, Network Analytics Lab, Universiti Malaya, 20 April 2021
  2. Introduction
     • Big Data is... big.
     • The 3Vs: volume, velocity, variety.
     • The term "Big Data" grew in popularity over the last decade (after 2012).
     • Doug Cutting created Hadoop at around the same period.
     • Do we really have or need big data? Why?
  3. Introduction
     • According to IDC, as much as 90% of digital data is unstructured.
     • Structured data fits nicely in a table: names, addresses, etc.
     • Unstructured data exists in its raw form: text, social media posts, emails, videos, etc.
  4. Introduction
     • All of this data leads to another issue: how do you store and use it?
     • Options: databases, data warehouses (ETL) vs. data lakes (ELT).
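
The ETL/ELT contrast on this slide can be sketched in a few lines. This is a minimal illustration, not how any particular warehouse or lake product works: it uses Python's built-in `sqlite3` as a stand-in store, and the table names, columns and sample rows are all made up. In ETL the rows are cleaned before loading; in ELT the raw rows are loaded as-is and cleaned later inside the store.

```python
import sqlite3

# Hypothetical raw records; the values are invented for illustration.
RAW_ROWS = [
    (" Alice ", "12.50"),
    ("BOB", "7.25"),
    ("alice", "3.00"),
]


def etl(rows):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return conn


def elt(rows):
    """ELT: load the raw rows untouched, then transform inside the store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT lower(trim(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount FROM raw_sales"
    )
    return conn


def totals(conn):
    """Total amount per customer from the cleaned sales table."""
    query = (
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()
```

Both paths end with the same cleaned `sales` table, but only the ELT path still holds the raw rows (`raw_sales`), so they can be re-transformed later if the cleaning rules change.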
  5. Tools
     • Hierarchical Data Format (HDF5)
       • File-directory structure, i.e. groups and datasets
       • A single file
     • Vaex
       • Lazy, out-of-core processing
     • Dask
       • Parallel multi-core processing
       • Built on top of Pandas
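
To make "lazy, out-of-core processing" concrete, here is a rough sketch of the idea using only the Python standard library. It is not the Vaex or Dask API; the function names, the `bytes` column and the sample data are assumptions for illustration. The point is that rows are pulled from the file a chunk at a time, so only one chunk is in memory at any moment.

```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunk_size):
    """Lazily yield lists of at most chunk_size CSV rows (as dicts)."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_bytes(handle, chunk_size=2):
    """Mean of the 'bytes' column, holding only one chunk in memory."""
    total = 0
    count = 0
    for chunk in read_in_chunks(handle, chunk_size):
        total += sum(int(row["bytes"]) for row in chunk)
        count += len(chunk)
    return total / count if count else 0.0


# A tiny in-memory stand-in for a file too large to load at once.
sample = io.StringIO("src,bytes\na,100\nb,300\nc,200\nd,400\ne,500\n")
result = mean_bytes(sample)
```

Libraries such as Vaex and Dask apply the same principle at much larger scale and add query planning and parallelism on top of it.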
  6. Tools
     • Hadoop
       • MapReduce
       • HDFS
         • NameNode
         • DataNode
       • YARN
     • Hortonworks, Cloudera, Azure HDInsight and Amazon EMR simplify the deployment of Hadoop clusters.
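
The MapReduce model listed above can be illustrated with a single-process word count. This is a sketch of the programming model only, not Hadoop's Java API: on a real cluster the map and reduce phases run in parallel across nodes and the shuffle moves data over the network. The function names and input strings are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce over a list of input splits."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


# Two hypothetical input splits.
splits = ["big data is big", "data lakes store raw data"]
counts = word_count(splits)
```

Because each split is mapped independently and each key is reduced independently, both phases can be distributed without changing the logic.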
  7. Tools
     • Cloud Data Warehouses
       • Google BigQuery
       • Snowflake
     • Data Mining (ETL)
       • Xplenty
       • RapidMiner
  8. Use Case: Problems
     • Pusat Teknologi Maklumat (PTM), UM has been facing constant cyber attacks. MyCERT often identified UM as a source of botnet attacks. Because PTM allocates IP addresses dynamically to all authenticated users, it became difficult to trace the origin of any attack.
     • PTM’s Palo Alto firewalls were underutilized.
     • Traffic logs amounted to approximately 50 GB daily, even without any payload.
     • There was no centralized system (SIEM) to monitor the network.
  9. Use Case: Solution
     • Palo Alto firewalls, Active Directory (AD) servers and other sources send their logs to a SIEM built on the ELK stack, which acts as the data lake.
     • Elasticsearch and Logstash index all the logs for rapid retrieval.
     • Users are mapped to IP addresses at a fixed interval for accurate identification.
     • Kibana visualizes the logs.
     • Search Guard manages authentication to ELK.
     • Skedler sends automated reports to PTM.
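
The slide does not say how the user-to-IP mapping was implemented. One way to picture it, purely as an assumed sketch, is to keep a snapshot of the IP-to-user table taken at each interval and resolve every log entry against the latest snapshot at or before its timestamp. The snapshot data, user names and function names below are hypothetical.

```python
from bisect import bisect_right

# Hypothetical snapshots taken at a fixed interval: (unix_time, {ip: user}).
SNAPSHOTS = [
    (1000, {"10.0.0.5": "alice", "10.0.0.6": "bob"}),
    (1300, {"10.0.0.5": "carol", "10.0.0.6": "bob"}),
    (1600, {"10.0.0.6": "alice"}),
]
SNAPSHOT_TIMES = [taken_at for taken_at, _ in SNAPSHOTS]


def user_for(ip, timestamp):
    """Return the user holding ip in the latest snapshot at or before timestamp."""
    index = bisect_right(SNAPSHOT_TIMES, timestamp) - 1
    if index < 0:
        return None  # the log entry predates the first snapshot
    return SNAPSHOTS[index][1].get(ip)


def enrich(log_entries):
    """Attach a user to each (timestamp, src_ip) firewall log entry."""
    return [(ts, ip, user_for(ip, ts)) for ts, ip in log_entries]


logs = [(1100, "10.0.0.5"), (1450, "10.0.0.5"), (1700, "10.0.0.5"), (900, "10.0.0.6")]
enriched = enrich(logs)
```

With dynamic allocation the same address resolves to different users at different times, which is why the timestamp is part of the lookup. A shorter snapshot interval narrows the window in which a reassigned address could be attributed to the wrong user.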
  10. Use Case: Outcomes
     • Saved (a lot of) money.
     • Able to map IP addresses to specific users.
     • Identified the source of attacks.
     • Built network profiles.
  11. Take-home messages
     • You might have big data if you constantly deal with both structured and unstructured data.
     • More often than not, standalone tools such as Python and its libraries are capable of handling big data.
     • Various point-and-click tools are available.
     • Big data technologies have abstracted away so much of the complexity of handling big data that we are beginning to view big data as just normal data.