Slide 1

Slide 1 text

Text Similarity
[DAT640] Information Retrieval and Text Mining
Krisztian Balog
University of Stavanger
August 31, 2021
CC BY 4.0

Slide 2

Slide 2 text

Text similarity
• Core ingredient in many text mining and information retrieval problems
• Need to express the similarity between two pieces of text (referred to as documents, for simplicity)
• The choice of the similarity measure is closely tied to how documents are represented

Slide 3

Slide 3 text

Jaccard similarity
• Jaccard similarity: only the presence/absence of terms in documents is considered, with no regard to magnitude
• Defined as the ratio of the number of shared terms to the total number of distinct terms in the two documents:
  $$\mathrm{sim}_{\mathrm{Jaccard}}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
  ◦ where $X$ and $Y$ represent the sets of terms that appear in documents $d_1$ and $d_2$, respectively
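The set-based definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `jaccard_sets`, the whitespace tokenization, and the two sample sentences are assumptions, as is the choice to return 0 when both documents are empty (the formula is undefined in that case).

```python
def jaccard_sets(x_terms, y_terms):
    """Jaccard similarity between two collections of terms."""
    x, y = set(x_terms), set(y_terms)
    union = x | y
    if not union:
        # Assumption: two empty documents get similarity 0 (0/0 is undefined).
        return 0.0
    return len(x & y) / len(union)


# Hypothetical documents, tokenized by whitespace for illustration only.
d1 = "the cat sat on the mat".split()
d2 = "the dog sat on the log".split()

# Shared terms: {the, sat, on}; distinct terms overall: 7.
print(jaccard_sets(d1, d2))
```

Repeated terms ("the" occurs twice in each sentence) have no effect on the result, which is what "no regard to magnitude" means here.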

Slide 4

Slide 4 text

Jaccard similarity
• Jaccard similarity for term vector-based representations:
  $$\mathrm{sim}_{\mathrm{Jaccard}}(x, y) = \frac{\sum_i \mathbb{1}(x_i) \times \mathbb{1}(y_i)}{\sum_i \mathbb{1}(x_i + y_i)}$$
  ◦ here $\mathbb{1}(x)$ is an indicator function (1 if $x > 0$ and 0 otherwise)

Example

         term 1   term 2   term 3   term 4   term 5
doc x    1        0        1        0        3
doc y    0        2        4        0        1

Table: Document-term vectors with term frequencies.

$x = (1, 0, 1, 0, 3)$
$y = (0, 2, 4, 0, 1)$

$$\mathrm{sim}_{\mathrm{Jaccard}}(x, y) = \frac{0 + 0 + 1 + 0 + 1}{1 + 1 + 1 + 0 + 1} = \frac{2}{4}$$
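The vector-based formula can be checked against the example with a short sketch. The function name `jaccard_vectors` is hypothetical, and the code assumes non-negative term weights (as with term frequencies), since the indicator on $x_i + y_i$ only detects "present in at least one document" under that assumption.

```python
def jaccard_vectors(x, y):
    """Jaccard similarity for two term vectors of equal length.

    Assumes non-negative weights, e.g. raw term frequencies.
    """
    if len(x) != len(y):
        raise ValueError("term vectors must have the same length")
    # Numerator: terms present in both documents.
    shared = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    # Denominator: terms present in at least one document.
    total = sum(1 for xi, yi in zip(x, y) if xi + yi > 0)
    # Assumption: two all-zero vectors get similarity 0.
    return shared / total if total else 0.0


x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]
print(jaccard_vectors(x, y))  # 2 shared terms out of 4 present terms: 0.5
```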

Slide 5

Slide 5 text

Cosine similarity
• Cosine similarity: the cosine of the angle between the two document vectors plotted in their high-dimensional space; the larger the angle, the more dissimilar the documents are:
  $$\mathrm{sim}_{\cos}(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
  ◦ where $x$ and $y$ are the term vectors corresponding to documents $d_1$ and $d_2$, respectively
  ◦ Term weights ($x_i$ and $y_i$) may be raw term counts or TF-IDF-weighted frequencies
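A plain-Python sketch of the formula, using only the standard library, might look as follows. The name `cosine_similarity` and the sample vectors are illustrative assumptions; returning 0 for an all-zero vector is also an assumption, since the formula is undefined when either norm is zero.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("term vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: similarity with an empty document is taken to be 0.
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical term-count vectors over a three-term vocabulary.
print(cosine_similarity([1, 1, 0], [1, 0, 1]))  # approximately 0.5
```

The same function applies unchanged to TF-IDF weights, because it only relies on the vectors being numeric and of equal length.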

Slide 6

Slide 6 text

Cosine similarity - Geometric interpretation

         term 1   term 2
doc x    1        2
doc y    2        4

$\mathrm{sim}_{\cos}(x, y) = 1$

Slide 7

Slide 7 text

Cosine similarity - Geometric interpretation

         term 1   term 2
doc x    1        0
doc y    0        2

$\mathrm{sim}_{\cos}(x, y) = 0$

Slide 8

Slide 8 text

Cosine similarity - Geometric interpretation

         term 1   term 2
doc x    4        2
doc y    1        3

$\mathrm{sim}_{\cos}(x, y) \approx 0.7$
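The three two-dimensional examples correspond to angles of 0°, 90°, and 45° between the document vectors. The sketch below recovers both the cosine and the angle; the helper name `cosine_and_angle` is hypothetical, and the code assumes non-zero vectors.

```python
import math


def cosine_and_angle(x, y):
    """Return (cosine similarity, angle in degrees) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos = dot / (norm_x * norm_y)
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos = max(-1.0, min(1.0, cos))
    return cos, math.degrees(math.acos(cos))


examples = [
    ([1, 2], [2, 4]),  # same direction
    ([1, 0], [0, 2]),  # orthogonal, no shared terms
    ([4, 2], [1, 3]),  # partially overlapping
]
for x, y in examples:
    cos, angle = cosine_and_angle(x, y)
    print(f"x={x} y={y} cos={cos:.2f} angle={angle:.0f} degrees")
```

In the first example, doc y is doc x scaled by 2, so the direction is identical and the similarity is 1: cosine similarity ignores vector length.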

Slide 9

Slide 9 text

Cosine similarity

Example

         term 1   term 2   term 3   term 4   term 5
doc x    1        0        1        0        3
doc y    0        2        4        0        1

Table: Document-term vectors with term frequencies.

$x = (1, 0, 1, 0, 3)$
$y = (0, 2, 4, 0, 1)$

$$\mathrm{sim}_{\cos}(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$

$$= \frac{1 \times 0 + 0 \times 2 + 1 \times 4 + 0 \times 0 + 3 \times 1}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 3^2} \, \sqrt{0^2 + 2^2 + 4^2 + 0^2 + 1^2}} = \frac{7}{\sqrt{11} \, \sqrt{21}}$$
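The intermediate quantities of this example (dot product 7, squared norms 11 and 21) can be reproduced step by step. The variable names below are illustrative, not taken from the slides.

```python
import math

x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]

dot = sum(xi * yi for xi, yi in zip(x, y))  # numerator of the formula
sq_norm_x = sum(xi ** 2 for xi in x)        # value under the first square root
sq_norm_y = sum(yi ** 2 for yi in y)        # value under the second square root
sim = dot / (math.sqrt(sq_norm_x) * math.sqrt(sq_norm_y))

print(dot, sq_norm_x, sq_norm_y, round(sim, 4))  # 7 11 21 0.4606
```

For the same pair of documents, the Jaccard similarity is 2/4 = 0.5, while the cosine similarity is about 0.46: the two measures use different information (presence only versus term frequencies), so their values are not directly comparable.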

Slide 10

Slide 10 text

Reading
• Text Data Management and Analysis (Zhai & Massung)
  ◦ Chapter 14: Section 14.2