Slide 12
Term frequency (TF)
• We write $c_{t,d}$ for the raw count of term $t$ in document $d$
• Term frequency $\mathrm{tf}_{t,d}$ reflects the importance of term $t$ in document $d$
• Variants
◦ Binary: $\mathrm{tf}_{t,d} \in \{0, 1\}$
◦ Raw count: $\mathrm{tf}_{t,d} = c_{t,d}$
◦ L1-normalized: $\mathrm{tf}_{t,d} = \frac{c_{t,d}}{|d|}$
• where $|d|$ is the length of the document, i.e., the sum of all term counts in $d$: $|d| = \sum_{t \in d} c_{t,d}$
◦ L2-normalized: $\mathrm{tf}_{t,d} = \frac{c_{t,d}}{\|d\|}$
• where $\|d\| = \sqrt{\sum_{t \in d} (c_{t,d})^2}$
◦ Log-normalized: $\mathrm{tf}_{t,d} = 1 + \log c_{t,d}$
◦ ...
• By default, when we refer to TF we will mean the L1-normalized version
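The variants above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `term_frequencies`, the `variant` strings, and the whitespace tokenization are all hypothetical choices. It also assumes the natural logarithm for the log-normalized variant and only reports terms that occur in the document, so every reported count is at least 1.

```python
import math
from collections import Counter


def term_frequencies(tokens, variant="l1"):
    """Return {term: tf} for the terms that occur in one document.

    `tokens` is the document as a list of terms. `variant` is one of
    "binary", "raw", "l1", "l2", "log" (names chosen for this sketch).
    Terms absent from the document are simply not in the result.
    """
    counts = Counter(tokens)  # c_{t,d} for every term t in d
    if not counts:
        return {}  # empty document: nothing to normalize

    if variant == "binary":
        return {t: 1 for t in counts}
    if variant == "raw":
        return dict(counts)
    if variant == "l1":
        length = sum(counts.values())  # |d| = sum of all term counts
        return {t: c / length for t, c in counts.items()}
    if variant == "l2":
        norm = math.sqrt(sum(c * c for c in counts.values()))  # ||d||
        return {t: c / norm for t, c in counts.items()}
    if variant == "log":
        # Natural log is an assumption; c >= 1 here, so log is defined.
        return {t: 1 + math.log(c) for t, c in counts.items()}
    raise ValueError(f"unknown variant: {variant!r}")


doc = "to be or not to be".split()
for name in ("binary", "raw", "l1", "l2", "log"):
    print(name, term_frequencies(doc, name))
```

With the L1-normalized variant the values sum to 1 over the document, and with the L2-normalized variant their squares sum to 1, which is a quick sanity check on either normalization.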