Big Data is... big.
• 3Vs: volume, velocity, variety
• The term "Big Data" started growing in popularity in the last decade (around 2012 onwards).
• Doug Cutting created Hadoop a few years earlier (around 2006), which helped drive that growth.
• Do we really have, or need, big data? Why?
According to IDC, as much as 90% of digital data is unstructured.
• Structured data fits neatly into a table: names, addresses, etc.
• Unstructured data exists in its raw form: text, social media posts, emails, videos, etc.
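A minimal sketch of the contrast, assuming pandas is available; the records, names, and posts below are made up for illustration. Structured data drops straight into a tabular form, while unstructured text has no fixed schema and must be processed before it fits one.

```python
import pandas as pd

# Structured: every record shares the same named fields, so it fits a table.
customers = pd.DataFrame(
    [
        {"name": "Aisha", "city": "Kuala Lumpur", "age": 34},
        {"name": "Ben", "city": "Penang", "age": 41},
    ]
)

# Unstructured: raw text with no schema; any structure must be extracted.
posts = [
    "Great lecture on big data today at UM! #hadoop",
    "Anyone else seeing slow wifi on campus this morning?",
]
word_counts = pd.Series(" ".join(posts).lower().split()).value_counts()

print(customers.dtypes)
print(word_counts.head())
```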
Hierarchical Data Format (HDF5)
• Organized like a file directory: groups and datasets
• Stored as a single file
Vaex
• Lazy, out-of-core processing
Dask
• Parallel, multi-core processing
• Its DataFrame mirrors the pandas API (a Dask DataFrame is composed of many pandas DataFrames); see the sketch below
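A minimal sketch of the two ideas, assuming h5py and dask are installed; the file names, group names, and column names are hypothetical.

```python
import numpy as np
import h5py
import dask.dataframe as dd

# HDF5: one file that behaves like a directory tree of groups and datasets.
with h5py.File("readings.h5", "w") as f:           # hypothetical file name
    grp = f.create_group("sensors")                 # the "directory"
    grp.create_dataset("temperature", data=np.random.rand(1_000_000))

with h5py.File("readings.h5", "r") as f:
    temps = f["sensors/temperature"][:1000]         # read only a slice, not the whole file

# Dask: lazy, parallel DataFrame that mirrors the pandas API.
logs = dd.read_csv("traffic-logs-*.csv")            # hypothetical files; nothing is loaded yet
top_talkers = (
    logs.groupby("src_ip")["bytes"].sum()           # builds a task graph, still lazy
        .nlargest(10)
        .compute()                                  # executes across the available cores
)
print(top_talkers)
```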
Problems
• Pusat Teknologi Maklumat (PTM), UM has been facing constant cyber attacks; MyCERT has often identified UM as a source of botnet attacks. Because PTM allocates IP addresses dynamically to all authenticated users, tracing the origin of any attack is difficult.
• PTM's Palo Alto firewalls are underutilized.
• Traffic logs amount to approximately 50 GB daily, excluding payloads.
• There is no centralized system (SIEM) to monitor the network.
Solution
• Palo Alto firewalls, Active Directory servers and other sources ship their logs to the SIEM, which runs the ELK stack and acts as the data lake.
• Logstash ingests the logs and Elasticsearch indexes them for rapid retrieval.
• User-to-IP-address mappings are refreshed at a fixed interval for accurate identification.
• Kibana visualizes the logs.
• Search Guard manages authentication to the ELK stack.
• Skedler sends automated reports to PTM.
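A minimal sketch of the index-then-search flow at the heart of this pipeline, assuming the official elasticsearch Python client (8.x-style API) and made-up endpoint, index, and field names; in the actual deployment Logstash does the ingestion in bulk rather than a script.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")        # hypothetical SIEM endpoint

# Index one firewall traffic event (Logstash normally does this in bulk).
es.index(
    index="fw-traffic",                             # hypothetical index name
    document={
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "src_ip": "10.20.30.40",
        "user": "student01",                        # from the user-to-IP mapping
        "dst_port": 443,
        "bytes": 5120,
    },
)

# Retrieve recent events attributed to one user.
hits = es.search(
    index="fw-traffic",
    query={"match": {"user": "student01"}},
    sort=[{"@timestamp": {"order": "desc"}}],
    size=10,
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["src_ip"], hit["_source"]["dst_port"])
```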
• You might have big data if you constantly deal with large volumes of both structured and unstructured data.
• More often than not, standalone tools such as Python and its libraries are capable of handling "big" data.
• Various point-and-click tools are also available.
• Big data technologies have abstracted away so much of the complexity that we are beginning to view big data as just normal data.