
Clustering and Classification for Cyber Crime

Kyrus
October 01, 2012


A high-level discussion on what clustering and classification are, similarities and differences between them, and applications to cyber crime investigations.


Transcript

  1. Thank You • ICCyber Organizing Committee • Marcos Vinícius and

    the other officers of the Brazilian Federal Police • Sandro Suffert of Apura • You for attending! 2
  2. 3 Outline • Introduction • Similarity • Fuzzy Hashing •

    Features • Distance Measures • Feature Selection • Clustering • Classification • Questions
  3. 4 Introduction • U.S. Air Force Office of Special Investigations

    • U.S. Department of Justice • Now work for Kyrus • Applied Computer Forensics Research • Tools I've written – Memory forensics – md5deep/hashdeep – fuzzy hashing (ssdeep) – Foremost
  4. Introduction • Analyzing an infinite number of programs/documents – Only

    five minutes per sample • Which of them are similar to each other? • Which of them fit into existing categories? – Variant of Zeus – Written by $person? – Relevant to this investigation 5
  5. Assumptions • There are too many programs to manually review

    – You’re not going to look at them all • Computers are good at computing – Humans are not • Humans are good at categorizing – Computers are not 6
  6. Clustering • Group similar items together • For example: –

    Variants of Zeus are all similar to each other – Documents written by the same person – Documents related to the same topic 9
  7. Similarity • Depends on: – The kind of things being

    compared – How they’re being compared 11
  8. Example • Both live in Washington DC • Both like

    a good hamburger • Both like dogs • Conclusion: Similar • President Obama is 7 cm taller • Jesse does not have gray hair • Work in different career fields • Conclusion: Not similar 13
  9. Current Tools • Cryptographic Hashing – Exact match – e.g.

    SHA-256 • Fuzzy Hashing – Similar blocks of ones and zeros • Manual analysis 14
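
    To make the "exact match" limitation concrete, here is a minimal Python sketch (not from the deck; the byte strings are made up) showing that changing a single byte produces a completely unrelated SHA-256 digest, which is why near-duplicates need fuzzy hashing:

        import hashlib

        original = b"MZ\x90\x00" + b"A" * 1000          # a made-up program image
        variant  = b"MZ\x90\x00" + b"A" * 999 + b"B"    # the same image with one byte changed

        print(hashlib.sha256(original).hexdigest())
        print(hashlib.sha256(variant).hexdigest())
        # The two digests share nothing, even though the inputs differ
        # in only one byte out of 1004.
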
  10. Fuzzy Hashing 15 [Figure: two block-hash signatures, A 7 b F d r t 8 5 d N 4 o P and

    A 7 b F X r t 8 5 d N 4 o P, which differ in a single position]
  11. Fuzzy Hashing • Compare signatures: – A 7 b F

    d r t 8 5 d N 4 o P – A 7 b F X r t 8 5 d N 4 o P • Compute the edit distance between the signatures – Edit distance is number of changes necessary to turn one string into the other • Small edit distance means more similar • In this example, the edit distance is one 16
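
    A small Python sketch of that comparison, assuming the signatures are written as plain strings; this is the textbook Levenshtein edit distance, not ssdeep's exact scoring:

        def edit_distance(a, b):
            """Classic dynamic-programming Levenshtein distance."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        print(edit_distance("A7bFdrt85dN4oP", "A7bFXrt85dN4oP"))  # 1
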
  12. Fuzzy Hashing 17 [Figure: two block-hash signatures, A 7 b F d r t 8 5 d N 4 o P and

    5 k A j b F 9 2 b 5 @ N q o P Y k, which differ in most positions]
  13. General Approach • Extract simple features from each input •

    Compare the features mathematically • The result of the comparison is a similarity score • With fuzzy hashing, the features are the hashes of the blocks of raw bytes 18
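
    A deliberately simplified sketch of that idea in Python (real ssdeep picks block boundaries with a rolling hash so inserted bytes do not shift every later block; here the blocks are simply fixed-size):

        import hashlib

        ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/"

        def block_signature(data: bytes, block_size: int = 64) -> str:
            """Hash each fixed-size block down to one character of the signature."""
            sig = []
            for i in range(0, len(data), block_size):
                digest = hashlib.sha256(data[i:i + block_size]).digest()
                sig.append(ALPHABET[digest[0] % len(ALPHABET)])
            return "".join(sig)

        # Two inputs that differ in one block produce signatures that differ
        # in at most one character, so their edit distance stays small.
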
  14. Similar Programs • Similarity depends on: – The kind of

    things being compared – How they’re being compared 19
  15. Similar Programs • Do the same thing • Have the

    same look and feel • Connect to the same servers • Written by the same person • Used in the same intrusion • Run on the same platform 20
  16. Features • Signed code? • Which APIs are called •

    How often APIs are called • Order in which APIs are called • Entropy • DLLs used • Percentage of code coverage • Magic strings • N-grams of instructions • Control-flow graph • IP addresses accessed • … 22 Image courtesy of Flickr user doctor_keats and used under a Creative Commons license.
  17. Distance Measures • Clustering – Group of inputs which are

    close to each other • Closeness depends on distance measure • What it sounds like – How far apart are the input programs? – As measured by our features • Alternatively, how similar are they? 23
  18. Distance Measures 29 [Scatter plot of programs with axes: code coverage vs. entropy] We can measure the

    Euclidean distance between these points
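
    With only two features per program, this is just the straight-line distance between the points; a sketch with made-up (code coverage, entropy) values:

        import math

        a = (0.82, 6.1)   # program A: (code coverage, entropy), made-up values
        b = (0.79, 6.4)   # program B

        print(math.dist(a, b))   # Euclidean distance; smaller means closer
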
  19. Cosine Similarity 30 [Same plot, with the angle θ drawn between the two feature vectors] Or we could

    measure the angle between the vectors. The smaller the angle, the more similar the programs.
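
    The same two points, compared by angle instead of straight-line distance; a minimal sketch:

        import math

        def cosine_similarity(a, b):
            """cos(theta) between two feature vectors; 1.0 means identical direction."""
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        print(cosine_similarity((0.82, 6.1), (0.79, 6.4)))   # close to 1.0 -> very similar
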
  20. Feature Selection • The Curse of Dimensionality – So many

    dimensions (features) that comparisons become too time consuming or too complex • No problem • Select the “important” features – (Insert mathy stuff here) • Example: – Presence of crypto constants – Depends on context 31
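
    As one stand-in for the "mathy stuff" (purely illustrative, not what the talk used), the simplest selection rule is to drop features whose values barely vary across the corpus, since they cannot separate anything:

        def low_variance_features(samples, threshold=1e-3):
            """Return indices of features whose variance across all samples is tiny."""
            n = len(samples)
            dropped = []
            for idx in range(len(samples[0])):
                column = [s[idx] for s in samples]
                mean = sum(column) / n
                variance = sum((v - mean) ** 2 for v in column) / n
                if variance < threshold:
                    dropped.append(idx)
            return dropped
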
  21. Comparisons • Can find programs similar to any query •

    Similar to a kind of fuzzy hashing – “Signature” is the set of selected features 32
  22. Computing Clusters • Extract features from all inputs • Compute

    distance metric for all pairs of inputs For all inputs a and b: if distance(a,b) < threshold add_cluster(a,b) • Exclusive vs. Non-Exclusive clustering – Assume A~B and B~C – Exclusive: {A,B,C} – Non-Exclusive: {A,B} {B,C} 33
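
    A minimal runnable version of the loop above, done in the exclusive style (A~B and B~C ends up as one cluster {A,B,C}); distance() is assumed to be whichever measure was chosen earlier:

        def cluster(inputs, distance, threshold):
            """Exclusive single-link clustering: merge any two clusters that
            contain a pair of inputs closer than the threshold."""
            clusters = [[x] for x in inputs]
            merged = True
            while merged:
                merged = False
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        if any(distance(a, b) < threshold
                               for a in clusters[i] for b in clusters[j]):
                            clusters[i].extend(clusters.pop(j))
                            merged = True
                            break
                    if merged:
                        break
            return clusters
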
  23. Not Just for Programs • eDiscovery is all over this

    • Commercially available now • Uses phrases of text as features – File format independent • Distance metrics are statistical or linguistic or … 34
  24. Classification • Also known as: – Predictive Coding – Assisted

    Machine Learning • Put inputs into a category – Zeus variant or Not Zeus variant – Written by Microsoft, $person, Other – Relevant to this case or not 36
  25. Classification • Artificial intelligence is just math • There are

    many algorithms: – Naïve Bayesian classifier – K-Nearest Neighbor – Locality Sensitive Hashing – Decision Trees – Neural Networks – Hidden Markov Models 37
  26. Classification • User must create a set of training data

    • Must identify some documents for each possible outcome • Train the algorithm on this data • "Knowledge" is stored by the algorithm • Which can then be applied to new inputs 38 Image courtesy Flickr user dhillan and used under a Creative Commons license, http://www.flickr.com/photos/dhillan/3848315549/
  27. Naïve Bayes Classification • Let's go through an example –

    How your spam detector works • Determine which is greater – Probability message is spam? Or ham? – P(spam) or P(ham) – P(spam given features) or P(ham given features) 39
  28. Training Set • Get a set of emails • Human

    labels which are spam, which are ham 40 From: [email protected] To: [email protected] Subject: V1agra!! T0p quailikty V1aagra delieverd direct to you! http://sales.v1agara.biz/ From: [email protected] To: [email protected] Subject: Wear a jacket It will be cold while you are in Brazil. Please wear a jacket so that I do not worry about you. Love, Mom SPAM HAM
  29. Naïve Bayesian Classifier • Based on Bayes Theorem • Probabilities

    are based on what’s in the training set • P(spam | features) = P(spam) × P(features | spam) / P(features) • P(ham | features) = P(ham) × P(features | ham) / P(features) • In other words, count things in the training set and do math on them 41
  30. Naïve Bayesian Classifier • P(spam | features) = P(spam) × P(features | spam) / P(features)

    • P(ham | features) = P(ham) × P(features | ham) / P(features) • Which probability is greater? 42
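
    A toy word-count version of the whole pipeline in Python, assuming the features are simply the words in the message (the training messages below are made up to echo the earlier slide):

        import math
        from collections import Counter

        training = [
            ("spam", "top quality v1agra delivered direct to you"),
            ("spam", "cheap meds sale sale sale"),
            ("ham",  "it will be cold in brazil please wear a jacket"),
            ("ham",  "see you at the conference next week"),
        ]

        counts = {"spam": Counter(), "ham": Counter()}
        totals = {"spam": 0, "ham": 0}
        for label, text in training:
            counts[label].update(text.split())
            totals[label] += 1

        def log_posterior(label, message):
            """log P(label) + sum of log P(word | label), with add-one smoothing."""
            vocab = set(counts["spam"]) | set(counts["ham"])
            score = math.log(totals[label] / len(training))
            denom = sum(counts[label].values()) + len(vocab)
            for word in message.split():
                score += math.log((counts[label][word] + 1) / denom)
            return score

        def classify(message):
            return max(("spam", "ham"), key=lambda lbl: log_posterior(lbl, message))

        print(classify("v1agra sale direct to you"))   # expected: spam
        print(classify("wear a jacket in brazil"))     # expected: ham
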
  31. Decision Tree • Build a flowchart of questions on the

    features • Each question should divide the data equally • Blackjack example: 43 [Flowchart with decision nodes “Is your total < 11?”, “Have pair?”, and “Dealer have < 11?” leading to the actions Split hands, Hit, and Stay]
  32. Decision Tree • Quick to classify, but slow to construct

    • What questions are best at which point in the tree? • [Insert mathy stuff here] • You could make a career out of efficient decision tree generation – And people do 44
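
    A minimal sketch using scikit-learn's decision tree (one possible library; the talk itself points to the packages on the next slide), with made-up feature vectors of [entropy, code coverage, signed?]:

        from sklearn.tree import DecisionTreeClassifier

        X = [[7.9, 0.12, 0],   # packed, low coverage, unsigned
             [7.8, 0.15, 0],
             [5.1, 0.82, 1],   # plain, high coverage, signed
             [5.3, 0.77, 1]]
        y = ["zeus", "zeus", "benign", "benign"]

        clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
        print(clf.predict([[7.7, 0.10, 0]]))   # expected: ['zeus']
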
  33. Classification Packages • All of these are Free and Open

    Source: – Weka – Apache Mahout – Malheur – LibSVM • Which is the best? 45
  34. Classification Systems • Academia – “Solved problem” • For you?

    – Some assembly required – Your Agency puts it together 46
  35. Measuring Classifier Performance • There are several measures • Look

    at false positives and false negatives – Run the classifier on the training set • When building, reserve some known values for a test set – Not used in training the classifier • There is a problem of over-fitting – Classifier "knows" the training data too well 47
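
    A sketch of the reserve-a-test-set idea: shuffle the labelled samples, hold some back, train only on the rest, then count the two error types on the held-out portion (classify() stands in for whatever classifier was trained):

        import random

        def split_train_test(samples, test_fraction=0.2, seed=42):
            """Reserve a random test set that is never shown to the classifier."""
            shuffled = samples[:]
            random.Random(seed).shuffle(shuffled)
            cut = int(len(shuffled) * test_fraction)
            return shuffled[cut:], shuffled[:cut]       # (training set, test set)

        def count_errors(test_set, classify, positive="zeus"):
            fp = fn = 0
            for features, true_label in test_set:
                predicted = classify(features)
                fp += (predicted == positive and true_label != positive)   # false positive
                fn += (predicted != positive and true_label == positive)   # false negative
            return fp, fn
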
  36. Classification • Training set must be balanced • Roughly equal

    numbers of inputs for each category 48 Image courtesy Flickr user andyg and used under a Creative Commons license, http://www.flickr.com/photos/andyg/2642257588/
  37. Application to Child Pornography • Difficult to extract relevant features

    • Lots of features in images – File format, size – GPS location – Camera serial number • But difficult to extract features related to the content 49
  38. Conclusion • Analyzing an infinite number of programs – Only

    five minutes per sample – Computer time is cheap • Which of them are similar to each other? – Build clusters of programs and documents • Which of them fit into existing categories? – Variant of Zeus – Written by $person? – Relevant to this investigation – Build classifiers for these categories 50
  39. 51 Outline • Introduction • Similarity • Features • Distance

    Measures • Feature Selection • Clustering • Classification • Questions