Machine Learning-based Malicious Adversaries Detection in an Enterprise Environment by Using Open Source Tools
Muhammad Najmi Ahmad Zabidi
International Islamic University Malaysia
MOSC 2012, Berjaya Times Square, Kuala Lumpur
9th July 2012

About
• I am a research graduate student at Universiti Teknologi Malaysia, Skudai, Johor Bahru, Malaysia
• My current employer is International Islamic University Malaysia, Kuala Lumpur
• Research area: malware detection, focusing on Windows executables
• Since 2003 I have been a Subversion (SVN) committer for the KDE localization project for the Malay language (though I rarely commit now; a new intern is needed to take over :) )

The computing world as we know it
• Interconnected machines
• Previously less connected, now "socialized" machines
• This has brought real problems to the cyberworld

Spam
• Annoying
• Productivity is wasted on unnecessary message deletion
• In extreme cases, important email becomes difficult to find

Scam
• Preys on naive victims
• Sounds too good to be true, but some people still believe it
• Organized crime/syndicates, with mules cooperating

Phishing
• Similar to a scam, but uses a different tactic
• More sophisticated, and does not need a mule or a physical meetup
• Main purpose is to obtain important details (online banking login name, password) and hence access to the victim's account
• Safer for the criminal

Malware
• Broadly covers trojans, viruses, dialers, rabbits, worms and rootkits (often bundled together nowadays)
• Has been infecting computers since the 1980s; the threat became more obvious with the arrival of the Internet
• Attacks any operating system: Linux, Windows, Mac... even Android phones

Problems with adversary detection
• Some attacks are manually crafted, some automated
• Adversaries react relatively fast and are difficult to trace
• Too many samples (for example, spam), hence too time consuming for manual work

In-house analysis
• Given enough expertise, in-house analysis can be useful
• Maintains reputation: the organization has its own group of analysts to handle incidents
• Try to minimize costs; use open source tools whenever possible

Machine Learning
• Associated with Artificial Intelligence
• Mimics human (brain) learning
• Learns through experience
• Deals with known and unknown patterns
• Overlaps with (or to some extent originated from) Data Mining and Pattern Recognition

Table 1: Differences between clustering and classification

                      Classification                 Clustering
  Data                Known (labelled) data          Unknown (unlabelled) data
  Learning            Supervised                     Unsupervised
  Popular algorithms  Random Forest                  K-means
                      Neural Networks                Fuzzy C-means
                      k-Nearest Neighbor             Gaussian mixture
                      Decision Trees
  Nature              Predictive [Tan et al., 2005]  Descriptive [Tan et al., 2005]
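
The contrast in Table 1 can be sketched with two tiny routines in plain Python. This is only an illustration under stated assumptions: the two-number feature vectors (suspicious keywords, links) and the labels "spam"/"ham" are hypothetical, nearest-neighbour stands in for the classification column and k-means for the clustering column.

```python
import math

def nearest_neighbor_label(sample, labelled):
    """Classification (supervised): return the label of the closest labelled point."""
    closest = min(labelled, key=lambda item: math.dist(sample, item[0]))
    return closest[1]

def k_means(points, centroids, rounds=10):
    """Clustering (unsupervised): group unlabelled points around the given centroids."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical features per email: (suspicious keywords, links)
labelled = [((8, 5), "spam"), ((7, 6), "spam"), ((0, 1), "ham"), ((1, 0), "ham")]
unlabelled = [(8, 5), (7, 6), (0, 1), (1, 0)]

predicted = nearest_neighbor_label((6, 4), labelled)          # needs labels
centroids, clusters = k_means(unlabelled, [(8, 5), (0, 1)])   # needs no labels
print(predicted, centroids)
```

The classifier can only answer because someone labelled the training mails beforehand; the clustering routine finds the two groups by itself but cannot say which group is spam.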

What to look for?
• We look for patterns
• In some cases, a corpus of spam or phishing mails is already available
• We call these patterns "features"

Spam/scam
• The language being used
• For example, notifications such as "You have won GBP100,000,000" sent through email
• Spam floods mailboxes; some may come from genuine businesses, but the volume is too much to handle
• Scams ask people to bank in money for untruthful reasons
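
A minimal sketch of turning the language of a message into features, assuming a small hand-picked keyword list. The list `KEYWORDS` and the sample message are hypothetical; in practice the keywords would be derived from a labelled corpus.

```python
import re
from collections import Counter

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["won", "prize", "claim", "transfer", "urgent"]

def keyword_features(text):
    """Return how often each keyword occurs in the text (case-insensitive)."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return [words[k] for k in KEYWORDS]

message = "You have WON GBP100,000,000! Claim your prize now, claim it today."
features = keyword_features(message)
print(dict(zip(KEYWORDS, features)))
```

Each message becomes a fixed-length vector of counts, which is the form most classification and clustering algorithms expect.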

Phishing mails
• Look at the URLs
• Current efforts, for example by PhishTank, rely on public submission and (I believe) manual verification
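
As a sketch of what "looking at the URL" could mean in practice, the function below derives a few simple features from a URL using only the Python standard library. The choice of features and the example URLs are assumptions made for illustration; they are not PhishTank's method.

```python
import ipaddress
from urllib.parse import urlparse

def url_features(url):
    """Extract a few simple, hypothetical features from a URL."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)   # raises ValueError if host is not an IP address
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "length": len(url),
        "host_is_ip": host_is_ip,          # raw IP address instead of a domain name
        "has_at_sign": "@" in url,         # "@" can hide the real host
        "host_dots": host.count("."),      # many dots may indicate nested subdomains
        "uses_https": url.lower().startswith("https://"),
    }

# Hypothetical URL that uses a documentation-range IP address as host.
features = url_features("http://192.0.2.10/login/bank.example.com/index.html")
print(features)
```

Such a feature dictionary can be flattened into a numeric vector and fed to any of the tools listed later in the deck.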

Malware
• Researchers tend to look at Application Programming Interface (API) calls, and some at opcodes
• Analysis is done using either static or dynamic analysis
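
One common way to turn a sequence of API calls into features is to count short runs of consecutive calls (n-grams). The sketch below assumes the call trace has already been extracted by a static or dynamic analysis tool; the trace itself is made up for illustration.

```python
from collections import Counter

def call_ngrams(calls, n=2):
    """Count consecutive runs of n API calls in a trace."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))

# Hypothetical trace; a real one would come from static or dynamic analysis.
trace = ["CreateFileA", "WriteFile", "CloseHandle", "CreateFileA", "WriteFile"]
bigrams = call_ngrams(trace)
for pair, count in bigrams.items():
    print(pair, count)
```

The same function works on opcode sequences, since it only assumes a list of strings.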

The datasets
• Spam email research has been around for quite some time compared with the others (e.g. phishing)
• Sample datasets:
  • http://csmining.org/index.php/spam-email-datasets-.html
  • http://archive.ics.uci.edu/ml/datasets/Spambase
• Scam email is closely associated with spam, since it is also unwanted email; it might as well be categorized as "sub-spam"
• Phishing email samples:
  • Sample dataset:
    • http://phishtank.com

Feature Selection/Extraction
• When analyzing, we are interested in features
• What kind of features?
• Important keywords, strong features
• Unimportant features will be phased out as unnecessary
• Some features might be redundant
• There are algorithms meant for this:
  • Information Gain
  • Support Vector Machine (SVM)
  • Others... some may be hybrid algorithms (combining several algorithms together), also known as ensembles
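
Information Gain can be sketched in a few lines: it is the entropy of the class labels minus the weighted entropy that remains after splitting the samples by a feature's value. The toy labels and the two binary features below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: four mails and two binary keyword features.
labels = ["spam", "spam", "ham", "ham"]
contains_won = [1, 1, 0, 0]   # separates the classes perfectly
contains_the = [1, 0, 1, 0]   # tells us nothing about the class

gain_won = information_gain(contains_won, labels)
gain_the = information_gain(contains_the, labels)
print(gain_won, gain_the)
```

Features with a gain near zero are candidates for being phased out; ranking features by gain is one simple selection strategy.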

List of tools
• Weka
• R language
• Octave (as a replacement for Matlab)
• Python SciPy with Matplotlib

Weka
• Results are presented as numbers and visualizations
• Some reading is needed to learn how to interpret them
• Test with different algorithms to get the best results

R language
• Not merely a tool, but a language in itself
• Commonly used by data analysts

Octave
• Octave is an open source alternative to Matlab (MATrix LABoratory)
• Works almost the same way as Matlab

Python SciPy

```python
#!/usr/bin/env python
"""
Example: simple line plot.
Show how to make and save a simple line plot with labels, title and grid.
"""
import numpy as np
import matplotlib.pyplot as plt

# Time axis from 0 to about 1 second in 0.01 s steps
t = np.arange(0.0, 1.0 + 0.01, 0.01)
# Cosine wave with a frequency of 2 Hz
s = np.cos(2 * 2 * np.pi * t)

plt.plot(t, s)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.title('About as simple as it gets, folks')
plt.grid(True)
plt.savefig('simple_plot')  # file extension is added from the default format
plt.show()
```

The flow
[Flowchart. Stages shown: Feature Selection, Feature Categorization (Clustering, Classification), Visualization. Tool labels shown next to the stages: "Weka, Octave, R" and "SciPy, Octave, R".]

Conclusion
• Handling malicious/unwanted threats from spam, scams, phishing and malware is not easy
• A single sample can perhaps be analyzed by hand, but handling thousands per day is tedious
• Machine learning assists with automation
• Open source provides an alternative (free as in minimal cost) for the analysis
• In-house analysis helps protect the reputation of an organization/enterprise

Get in touch!
najmi.zabidi @ gmail.com
http://mypacketstream.blogspot.com
These slides were created with LaTeX Beamer

Bibliography
Rieck, K., Trinius, P., Willems, C., and Holz, T. (2009). Automatic analysis of malware behavior using machine learning. TU, Professoren der Fak. IV.
Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.