
Information Retrieval and Text Mining 2021 - Text Classification

University of Stavanger, DAT640, 2021 fall

Krisztian Balog

August 24, 2021

Transcript

  1. Text Classification [DAT640] Information Retrieval and Text Mining

     Krisztian Balog, University of Stavanger, August 24, 2021 (CC BY 4.0)
  2. Text classification

     • Classification is the problem of assigning objects to one of several predefined categories
       ◦ One of the fundamental problems in machine learning, where it is performed on the basis of a training dataset (instances whose category membership is known)
     • In text classification (or text categorization) the objects are text documents
     • Binary classification (two classes, 0/1 or -/+)
       ◦ E.g., deciding whether an email is spam or not
     • Multiclass classification (n classes)
       ◦ E.g., categorizing news stories into topics (finance, weather, politics, sports, etc.)
  3. General approach

     [Figure: a model is learned from the training data (documents with known category labels) and then applied to the test data (documents without category labels).]
  4. Formally

     • Given a training sample (X, y), where X is a set of documents with corresponding labels y from a set Y of possible labels, the task is to learn a function f(·) that can predict the class y = f(x) for an unseen document x.
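     Below is a minimal sketch of this learn/apply workflow using scikit-learn. It is not part of the slides: the toy documents, the labels, and the choice of a Naive Bayes classifier are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training sample (X, y): documents with known category labels.
X_train = ["win a free prize now", "meeting agenda for tomorrow",
           "free money, claim your prize", "project status update"]
y_train = ["spam", "ham", "spam", "ham"]

# Learn the model f(.) from the training sample.
vectorizer = CountVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Apply the model to an unseen document x to predict y = f(x).
print(clf.predict(vectorizer.transform(["claim your free prize"])))  # expected: ['spam']
```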
  5. Features for text classification

     • Use words as features (bag-of-words)
       ◦ Words will be referred to as terms
     • Values can be, e.g., binary (term presence/absence) or integers (term counts)
     • Documents are represented by their term vector
     • The document-term matrix is huge, but most of the values are zeros; it is stored as a sparse matrix

            t1  t2  t3  ...  tm
       d1    1   0   2  ...   0
       d2    0   1   0  ...   2
       d3    0   0   1  ...   0
       ...
       dn    0   1   0  ...   0

       Document-term matrix
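     A small sketch of building such a sparse document-term matrix with scikit-learn's CountVectorizer (the example documents are made up; get_feature_names_out requires scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barks", "the cat meows", "dog and cat play"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)   # document-term matrix, stored as a SciPy sparse matrix

print(vectorizer.get_feature_names_out())  # the terms t1..tm
print(dtm)            # only the nonzero entries are stored
print(dtm.toarray())  # dense view; feasible only for tiny examples
```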
  6. English language

     • Most frequent words
       ◦ the (7%)
       ◦ of (3.5%)
       ◦ and (2.8%)
     • The top 135 most frequent words account for half of all the words used
  7. Zipf's law

     • Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table
       ◦ The word with rank n has a frequency proportional to 1/n
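     A quick way to see this empirically: under Zipf's law, frequency × rank should stay roughly constant across the top-ranked words. The sketch below assumes a plain-text corpus file; "corpus.txt" is a placeholder, not a file from the slides.

```python
from collections import Counter

# "corpus.txt" is a placeholder for any large plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)

# Under Zipf's law, freq * rank should stay roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>2}  {word:<12} freq={freq:<8} freq*rank={freq * rank}")
```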
  8. Term weighting

     • Intuition #1: terms that appear often in a document should get high weights
       ◦ E.g., the more often a document contains the term "dog," the more likely it is that the document is "about" dogs
     • Intuition #2: terms that appear in many documents should get low weights
       ◦ E.g., stopwords, like "a," "the," "this," etc.
     • How do we capture this mathematically?
       ◦ Term frequency
       ◦ Inverse document frequency
  9. Term frequency (TF)

     • We write $c_{t,d}$ for the raw count of a term in a document
     • Term frequency $tf_{t,d}$ reflects the importance of a term (t) in a document (d)
     • Variants
       ◦ Binary: $tf_{t,d} \in \{0, 1\}$
       ◦ Raw count: $tf_{t,d} = c_{t,d}$
       ◦ L1-normalized: $tf_{t,d} = \frac{c_{t,d}}{|d|}$, where $|d| = \sum_{t \in d} c_{t,d}$ is the length of the document, i.e., the sum of all term counts in d
       ◦ L2-normalized: $tf_{t,d} = \frac{c_{t,d}}{\|d\|}$, where $\|d\| = \sqrt{\sum_{t \in d} (c_{t,d})^2}$
       ◦ Log-normalized: $tf_{t,d} = 1 + \log c_{t,d}$
       ◦ ...
     • By default, when we refer to TF we will mean the L1-normalized version
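     A small sketch implementing these TF variants directly from the formulas above (the function name and example sentence are illustrative):

```python
import math
from collections import Counter

def tf_variants(doc_tokens, term):
    """Compute the TF variants listed above for one term in one document."""
    counts = Counter(doc_tokens)
    c = counts[term]                                      # raw count c_{t,d}
    l1 = sum(counts.values())                             # |d|, sum of term counts
    l2 = math.sqrt(sum(v * v for v in counts.values()))   # ||d||, Euclidean length
    return {
        "binary": int(c > 0),
        "raw": c,
        "l1_normalized": c / l1,
        "l2_normalized": c / l2,
        "log_normalized": 1 + math.log(c) if c > 0 else 0,
    }

print(tf_variants("the dog chased the other dog".split(), "dog"))
```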
  10. Inverse document frequency (IDF)

      • Inverse document frequency $idf_t$ reflects the importance of a term (t) in a collection of documents
        ◦ The more documents a term occurs in, the less discriminating the term is between documents, and consequently the less "useful" it is

        $idf_t = \log \frac{N}{n_t}$

        ◦ where N is the total number of documents in the collection and $n_t$ is the number of documents that contain t
        ◦ The log is used to "dampen" the effect of IDF
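      A direct implementation of this formula (the toy collection is made up; a term occurring in every document gets an IDF of 0):

```python
import math

def idf(term, collection):
    """idf_t = log(N / n_t) over a collection of tokenized documents."""
    n_t = sum(1 for d in collection if term in d)
    return math.log(len(collection) / n_t) if n_t else 0.0

collection = [d.split() for d in
              ["the dog barks", "the cat meows", "the dog and cat play"]]
print(idf("the", collection))    # occurs everywhere -> idf = 0
print(idf("barks", collection))  # rare term -> higher idf
```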
  11. IDF illustration

      [Figure: Illustration of the IDF function as the document frequency varies. Taken from (Zhai & Massung, 2016), Fig. 6.10.]

      Note that the textbook uses a slightly different IDF formula, with +1 in the numerator.
  12. Term weighting (TF-IDF)

      • Combine TF and IDF weights by multiplying them: $tfidf_{t,d} = tf_{t,d} \cdot idf_t$
        ◦ Term frequency weight measures importance in the document
        ◦ Inverse document frequency measures importance in the collection
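      Combining the two pieces above gives a sketch of the slide's TF-IDF formula, using the L1-normalized TF as stated earlier (the toy collection is again made up):

```python
import math
from collections import Counter

def tfidf(term, doc, collection):
    """tfidf_{t,d} = tf_{t,d} * idf_t, with L1-normalized TF."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())
    n_t = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / n_t) if n_t else 0.0
    return tf * idf

collection = [d.split() for d in
              ["the dog barks", "the cat meows", "the dog and cat play"]]
print(tfidf("dog", collection[0], collection))  # discriminative term -> higher weight
print(tfidf("the", collection[0], collection))  # appears everywhere -> weight 0
```

      Note that off-the-shelf implementations such as scikit-learn's TfidfVectorizer use slightly different variants by default (smoothed IDF, L2 normalization), so their numbers will not match this formula exactly.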
  13. Additional features for text classification

      • Descriptive statistics (avg. sentence length, length of various document fields, like title, abstract, body, ...)
      • Document source
      • Document quality indicators (e.g., readability level)
      • Presence of images/attachments/JavaScript/...
      • Publication date
      • Language
      • ...
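      A few of these descriptive statistics can be computed directly from the raw text; the sketch below is purely illustrative (the feature names and the naive sentence split are assumptions, not from the slides):

```python
def extra_features(text):
    """Illustrative descriptive statistics; naive sentence split on '.'."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

print(extra_features("The dog barks. The cat meows back."))
```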