JGS594 Lecture 20

Software Engineering for Machine Learning
Spam Detection
(202204)

Javier Gonzalez

April 14, 2022
Transcript

  1. jgs SER 594 Software Engineering for Machine Learning, Lecture 20: Supervised Learning (spam detection)

    Dr. Javier Gonzalez-Sanchez | javiergs@asu.edu | javiergs.engineering.asu.edu | javiergs.com | PERALTA 230U | Office Hours: By appointment
  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 2

    jgs Machine Learning
  3. Definition

    § Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
  4. Text Mining Supervised Learning

  5. Text Mining

    Extracting information from text documents (natural language):
    § generate a search index;
    § text categorization into domains;
    § text clustering to organize a set of documents;
    § sentiment analysis to identify subjective information;
    § concept or entity extraction (people, places, organizations);
    § document summarization (identify important points);
    § learning relations between named entities; and
    § spam detection in email messages.
  6. Previously …

    § Text mining as an unsupervised technique that looks for patterns in a corpus of text: identify words (topics) that appear in a statistically meaningful way.
    § Analyze a large archive of text documents (blog posts, emails, tweets, book chapters, etc.) and understand what the archive contains.
    § The best-known algorithm is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, & Jordan in 2003 (available on Canvas).
  7. Question

    Can we make classification of documents a supervised effort?
  8. Tools Document Classification

  9. Library

    § Mallet (MAchine Learning for Language Toolkit) is a Java-based package for statistical natural-language processing, document classification, clustering, topic modeling, information extraction, and other machine-learning applications to text.
    § Andrew McCallum, University of Massachusetts Amherst (2002)
  10. Mallet

    § Download: http://mallet.cs.umass.edu/download.php
  11. Mallet API

    § mallet.jar
    § mallet-deps.jar
  12. Data | Email Spam Dataset

    § Androutsopoulos et al., 2000 (reorganized by Ng, 2010)
    § Pre-processed emails
    § Training set: 350 spam and 350 non-spam messages
    § Test set: 130 spam and 130 non-spam messages
    § http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html
    § Download ex6DataEmails.zip (the version that is NOT preprocessed)
  13. Import Data for Training

    File iterator: cc.mallet.pipe.iterator.FileIterator

    FileIterator iterator = new FileIterator(
    § a list of File[] directories with text files,
    § a file filter that specifies which files to select within a directory,
    § a pattern that is applied to a filename to produce a class label
    )
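
The third argument above, a pattern applied to a filename to produce a class label, can be illustrated without Mallet. The following is a minimal sketch in plain Java, not Mallet's own implementation: the class name `LabelFromPath`, the regular expression, and the folder names `spam` and `nonspam` are all assumptions made for illustration. It takes the name of the directory that directly contains a file and uses it as the label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Mallet): derives a class label from a file path.
class LabelFromPath {
    // Captures the name of the directory that directly contains the file.
    static final Pattern LAST_DIR = Pattern.compile("(?:.*/)?([^/]+)/[^/]+");

    // Returns the parent directory name, or null if the path has no parent directory.
    static String labelOf(String path) {
        String normalized = path.replace('\\', '/'); // accept Windows separators too
        Matcher m = LAST_DIR.matcher(normalized);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Folder names are assumed; adjust them to the layout of the unzipped dataset.
        System.out.println(labelOf("data/train/spam/msg1.txt"));
        System.out.println(labelOf("data/train/nonspam/msg2.txt"));
    }
}
```

With this layout, every file under a `spam` folder receives the label `spam`, so the folder structure itself carries the supervision.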
  14. We did This

  15. New Code

  16. We did This

  17. New Code

  18. Features + Label

    This is now a classification problem!
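
To make "features + label" concrete, here is a minimal sketch in plain Java of one common way to turn an email into features: a bag of words, where each lowercase token maps to the number of times it appears. This is an illustration only, not the Mallet pipe code shown on the slides; the class name `BagOfWords` and the tokenization rule (split on anything that is not a letter a to z) are assumptions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Mallet): word-count features for one document.
class BagOfWords {
    // Lowercases the text, splits on non-letters, and counts each token.
    static Map<String, Integer> featurize(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable printing
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The feature map plus a label such as "spam" forms one training example.
        System.out.println(featurize("Free money! Claim your FREE prize"));
        // prints {claim=1, free=2, money=1, prize=1, your=1}
    }
}
```

Pairing each feature map with the label taken from its folder gives the input-output pairs that supervised learning needs.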
  19. Machine Learning

  20. Classification Naive Bayes

  21. Data and Pipeline

    // yes, we are still using Mallet.
    // Nothing else.
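
As a companion to the Naive Bayes slides, the sketch below shows the idea behind the classifier in plain Java. It is a hand-written multinomial Naive Bayes with add-one (Laplace) smoothing, offered as an illustration under stated assumptions and not as Mallet's implementation: the class name `TinyNaiveBayes`, the tokenizer, and the toy training sentences are all invented for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical teaching class (not part of Mallet): multinomial Naive Bayes.
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    // Updates the per-label document and word counts with one labeled example.
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Returns the label with the highest log prior plus summed log word likelihoods.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String w : tokenize(text)) {
                if (w.isEmpty()) continue;
                int c = counts.getOrDefault(w, 0);
                // Add-one smoothing keeps unseen words from zeroing out a class.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("spam", "win free money now");
        nb.train("spam", "free prize claim now");
        nb.train("nonspam", "meeting agenda for monday");
        nb.train("nonspam", "lunch with the project team");
        System.out.println(nb.classify("claim your free money"));
        System.out.println(nb.classify("project meeting on monday"));
    }
}
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, which matters for full-length emails.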
  22. Evaluation

  23. Testing

  24. What is this?

  25. Evaluation (1000)
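
Evaluation on the held-out test messages (130 spam and 130 non-spam in this dataset) usually comes down to comparing predicted labels against true labels. The sketch below, in plain Java, counts a confusion matrix and derives accuracy, precision, and recall from it. It is an illustration only and does not reproduce the output shown on the slides; the class name `SpamEvaluation`, the label strings, and the five-message toy data are assumptions.

```java
// Hypothetical helper (not part of Mallet): binary classification metrics.
class SpamEvaluation {
    static final int TP = 0, FP = 1, FN = 2, TN = 3;

    // Counts true/false positives and negatives, treating `positive` as the target label.
    static int[] confusion(String[] actual, String[] predicted, String positive) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int[] c = new int[4];
        for (int i = 0; i < actual.length; i++) {
            boolean isPos = positive.equals(actual[i]);
            boolean saidPos = positive.equals(predicted[i]);
            if (isPos && saidPos) c[TP]++;
            else if (!isPos && saidPos) c[FP]++;
            else if (isPos) c[FN]++;
            else c[TN]++;
        }
        return c;
    }

    // Returns 0.0 instead of dividing by zero when a denominator is empty.
    static double ratio(int num, int den) {
        return den == 0 ? 0.0 : (double) num / den;
    }

    static double accuracy(int[] c) {
        return ratio(c[TP] + c[TN], c[TP] + c[FP] + c[FN] + c[TN]);
    }

    static double precision(int[] c) {
        return ratio(c[TP], c[TP] + c[FP]);
    }

    static double recall(int[] c) {
        return ratio(c[TP], c[TP] + c[FN]);
    }

    public static void main(String[] args) {
        // Toy data: five messages with their true and predicted labels.
        String[] actual = {"spam", "spam", "nonspam", "nonspam", "spam"};
        String[] predicted = {"spam", "nonspam", "nonspam", "spam", "spam"};
        int[] c = confusion(actual, predicted, "spam");
        System.out.println("accuracy  = " + accuracy(c));
        System.out.println("precision = " + precision(c));
        System.out.println("recall    = " + recall(c));
    }
}
```

For spam filtering, precision and recall are worth reading alongside accuracy, because flagging a legitimate email as spam and missing a spam email are different kinds of mistake.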
  26. Summary Text Mining

  27. jgs

  28. But, How does a Classifier work?

  29. Questions
  30. SER 594 Software Engineering for Machine Learning

    Javier Gonzalez-Sanchez, Ph.D. | javiergs@asu.edu | Spring 2022

    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.