JGS594 Lecture 20

Software Engineering for Machine Learning
Spam Detection
(202204)

Javier Gonzalez-Sanchez PRO

April 14, 2022

Transcript

  1. jgs SER 594 Software Engineering for Machine Learning Lecture 20:

    Supervised Learning (spam detection) Dr. Javier Gonzalez-Sanchez [email protected] javiergs.engineering.asu.edu | javiergs.com PERALTA 230U Office Hours: By appointment
  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3

    jgs Definition § Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5

    jgs Text Mining Extracting information from text documents (natural language). § generate a search index; § text categorization into domains; § text clustering to organize a set of documents; § sentiment analysis to identify subjective information; § concept or entity extraction (people, places, organizations); § document summarization (identify important points); § learning relations between named entities; and § spam detection in email messages.
  4. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 6

    jgs Previously … § Text mining as an unsupervised technique that looks for patterns in a corpus of text – identify words (topics) that appear in a statistically meaningful way § analyze a large archive of text documents (blog posts, emails, tweets, book chapters, etc.) and understand what the archive contains § The most well-known algorithm is Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) (available on Canvas)
  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 7

    jgs Question Can we make classification of documents a supervised effort?
  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 9

    jgs Library § Mallet (MAchine Learning for LanguagE Toolkit), a Java-based package for statistical natural-language processing, document classification, clustering, topic modeling, information extraction, and other machine-learning applications to text. § (Andrew McCallum, University of Massachusetts Amherst, 2002)
  7. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 10

    jgs Mallet § Download http://mallet.cs.umass.edu/download.php
  8. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 11

    jgs Mallet API mallet.jar mallet-deps.jar
  9. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 12

    jgs Data | Email Spam Dataset § 2000, Androutsopoulos et al. (reorganized - 2010, Ng) § Pre-processed emails § Consists of 350 spam and 350 non-spam (for training) § Consists of 130 spam and 130 non-spam (for testing) § http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html § Download ex6DataEmails.zip (the version that is NOT preprocessed)
  10. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13

    jgs Import Data for Training § File iterator: cc.mallet.pipe.iterator.FileIterator § FileIterator iterator = new FileIterator( … ) takes three arguments: § an array (File[]) of directories that contain text files; § a file filter that specifies which files to select within a directory; and § a pattern that is applied to a filename to produce a class label.
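    The idea behind the third argument (a pattern applied to a filename to produce a class label) can be sketched in plain Java without Mallet. This is only an illustration: the directory layout (`data/train/spam/...`), the class name `LabelFromPath`, and the regular expression are assumptions made for the example, not taken from the slides or from Mallet itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives a class label from a file path.
class LabelFromPath {
    // Captures the name of the directory that directly contains the file,
    // e.g. "spam" in "data/train/spam/msg1.txt" (assumed layout).
    static final Pattern LAST_DIR = Pattern.compile("(?:.*/)?([^/]+)/[^/]+");

    public static String labelOf(String path) {
        // Normalize Windows separators so one pattern covers both styles.
        String normalized = path.replace('\\', '/');
        Matcher m = LAST_DIR.matcher(normalized);
        if (!m.matches()) {
            throw new IllegalArgumentException("No parent directory in: " + path);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(labelOf("data/train/spam/msg1.txt"));    // prints "spam"
        System.out.println(labelOf("data/train/nonspam/msg2.txt")); // prints "nonspam"
    }
}
```

    With one directory per class, the label comes for free from where each file is stored, so no separate label file is needed.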
  11. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 18

    jgs Features + Label This is now a classification problem!
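    As a rough illustration of "features + label", the sketch below turns one email into word counts (a bag of words) and pairs them with a class label. The tokenization rule, the class names `BagOfWords` and `LabeledExample`, and the sample text are assumptions for this example; Mallet's own pipeline builds its feature vectors differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a document becomes word counts plus a label.
class BagOfWords {
    /** One training example: word counts (features) plus a class label. */
    static final class LabeledExample {
        final Map<String, Integer> features;
        final String label;

        LabeledExample(Map<String, Integer> features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Lowercases the text, splits on non-letters, and counts each word. */
    public static Map<String, Integer> featuresOf(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        LabeledExample ex = new LabeledExample(featuresOf("Win money, win NOW"), "spam");
        System.out.println(ex.features + " -> " + ex.label);
        // prints: {money=1, now=1, win=2} -> spam
    }
}
```

    Once every email is represented this way, any supervised classifier can be trained on the (features, label) pairs.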
  12. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 21

    jgs Data and Pipeline // yes, we are still using Mallet. // Nothing else.
  14. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 28

    jgs But, how does a classifier work?
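    The transcript leaves this question open. As one possible illustration (not necessarily the classifier used in the lecture), here is a minimal multinomial Naive Bayes sketch with add-one smoothing: it counts words per label during training and picks the label with the highest log-probability at prediction time. The class name `TinyNaiveBayes`, the tokenization, and the training sentences are all assumptions for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a multinomial Naive Bayes text classifier.
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z]+");
    }

    /** Updates the per-label document and word counts with one example. */
    public void train(String text, String label) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** Returns the label maximizing log P(label) + sum of log P(word | label). */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String w : tokenize(text)) {
                if (w.isEmpty()) continue;
                int c = counts.getOrDefault(w, 0);
                // Add-one (Laplace) smoothing avoids log(0) for unseen words.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("win free money now", "spam");
        nb.train("free prize claim now", "spam");
        nb.train("meeting agenda for monday", "nonspam");
        nb.train("lunch on monday with the team", "nonspam");
        System.out.println(nb.classify("free money"));     // prints "spam"
        System.out.println(nb.classify("monday meeting")); // prints "nonspam"
    }
}
```

    The sketch works in log space so that multiplying many small probabilities does not underflow.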
  15. jgs SER 594 Software Engineering for Machine Learning Javier Gonzalez-Sanchez,

    Ph.D. [email protected] Spring 2022 Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.