Information Retrieval and Text Mining 2021 - Text Similarity

University of Stavanger, DAT640, 2021 fall

Krisztian Balog

August 31, 2021
Transcript

  1. Text Similarity

    [DAT640] Information Retrieval and Text Mining
    Krisztian Balog, University of Stavanger
    August 31, 2021
    CC BY 4.0
  2. Text similarity

    • Core ingredient in many text mining and information retrieval problems
    • Need to express the similarity between two pieces of text (referred to as documents, for simplicity)
    • The choice of the similarity measure is closely tied to how documents are represented
  3. Jaccard similarity

    • Jaccard similarity: only the presence/absence of terms in documents is considered, with no regard to magnitude
    • Defined as the ratio of shared terms to total terms in two documents:

      $$\mathrm{sim}_{\mathrm{Jaccard}}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|},$$

      ◦ where $X$ and $Y$ represent the terms that appear in documents $d_1$ and $d_2$, respectively
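A minimal sketch of the set-based definition above. The tokenization (lowercasing and splitting on whitespace), the function name `jaccard_sets`, the example sentences, and the choice to return 0.0 for two empty documents are all assumptions made for illustration; they are not part of the slides.

```python
def jaccard_sets(doc1: str, doc2: str) -> float:
    """Jaccard similarity over the sets of distinct terms in two documents."""
    # Assumption: terms are lowercased, whitespace-separated tokens.
    x = set(doc1.lower().split())
    y = set(doc2.lower().split())
    union = x | y
    if not union:
        # Both documents are empty; returning 0.0 is a convention chosen here.
        return 0.0
    return len(x & y) / len(union)


# Hypothetical example documents: 3 shared terms out of 7 distinct terms.
sim = jaccard_sets("the cat sat on the mat", "the dog sat on the log")
print(sim)
```

Repeated terms ("the" occurs twice in each sentence) do not change the result, which reflects the "no regard to magnitude" property.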
  4. Jaccard similarity

    • Jaccard similarity for term vector-based representations:

      $$\mathrm{sim}_{\mathrm{Jaccard}}(x, y) = \frac{\sum_i \mathbb{1}(x_i) \times \mathbb{1}(y_i)}{\sum_i \mathbb{1}(x_i + y_i)},$$

      ◦ here $\mathbb{1}(x)$ is an indicator function (1 if $x > 0$ and 0 otherwise).

    Example

               term 1  term 2  term 3  term 4  term 5
      doc x       1       0       1       0       3
      doc y       0       2       4       0       1

    Table: Document-term vectors with term frequencies.

      $$x = \langle 1, 0, 1, 0, 3 \rangle, \qquad y = \langle 0, 2, 4, 0, 1 \rangle$$

      $$\mathrm{sim}_{\mathrm{Jaccard}}(x, y) = \frac{0 + 0 + 1 + 0 + 1}{1 + 1 + 1 + 0 + 1} = \frac{2}{4}$$
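The vector-based formula and the example on this slide can be checked with a short sketch. The function names `indicator` and `jaccard_vectors` are made up for illustration, and the code assumes non-negative term weights, as in the term-frequency table.

```python
def indicator(value: float) -> int:
    """1 if value > 0, else 0."""
    return 1 if value > 0 else 0


def jaccard_vectors(x, y) -> float:
    """Jaccard similarity for two term vectors of equal length.

    Assumption: weights are non-negative, so x_i + y_i > 0 exactly when
    the term occurs in at least one of the two documents.
    """
    if len(x) != len(y):
        raise ValueError("term vectors must have the same length")
    shared = sum(indicator(xi) * indicator(yi) for xi, yi in zip(x, y))
    total = sum(indicator(xi + yi) for xi, yi in zip(x, y))
    # Two all-zero vectors: returning 0.0 is a convention chosen here.
    return shared / total if total else 0.0


x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]
result = jaccard_vectors(x, y)
print(result)  # 0.5, i.e. 2/4 as on the slide
```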
  5. Cosine similarity

    • Cosine similarity: the cosine of the angle between the two document vectors plotted in their high-dimensional space; the larger the angle, the more dissimilar the documents are:

      $$\mathrm{sim}_{\cos}(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}},$$

      ◦ where $x$ and $y$ are the term vectors corresponding to documents $d_1$ and $d_2$, respectively
      ◦ Term weights ($x_i$ and $y_i$) may be raw term counts or TF-IDF-weighted frequencies
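One possible plain-Python rendering of the formula above, kept dependency-free. The name `cosine_similarity`, the example vectors `u` and `w`, and the decision to return 0.0 when either vector has zero length (where the formula is undefined) are assumptions for this sketch.

```python
import math


def cosine_similarity(x, y) -> float:
    """Cosine of the angle between two term vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("term vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        # Undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical term vectors, not taken from the slides.
u = [2, 1, 0]
w = [1, 1, 1]
base = cosine_similarity(u, w)
scaled = cosine_similarity(u, [3 * wi for wi in w])
print(base, scaled)
```

Scaling one vector leaves the score unchanged up to floating-point rounding, because only the direction of the vectors matters.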
  6. Cosine similarity - Geometric interpretation

               term 1  term 2
      doc x       1       2
      doc y       2       4

      $$\mathrm{sim}_{\cos}(x, y) = 1$$
  7. Cosine similarity - Geometric interpretation

               term 1  term 2
      doc x       1       0
      doc y       0       2

      $$\mathrm{sim}_{\cos}(x, y) = 0$$
  8. Cosine similarity - Geometric interpretation

               term 1  term 2
      doc x       4       2
      doc y       1       3

      $$\mathrm{sim}_{\cos}(x, y) \approx 0.7$$
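The three two-term examples can be read as angles between vectors in the plane. The sketch below converts each cosine value back to an angle; the helper name `angle_degrees` and the case labels are invented for illustration.

```python
import math


def angle_degrees(x, y) -> float:
    """Angle between two non-zero 2-D term vectors, in degrees."""
    cos = (x[0] * y[0] + x[1] * y[1]) / (math.hypot(*x) * math.hypot(*y))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))


# The (doc x, doc y) pairs from the three geometric-interpretation slides.
cases = {
    "same direction": ((1, 2), (2, 4)),
    "orthogonal": ((1, 0), (0, 2)),
    "in between": ((4, 2), (1, 3)),
}
angles = {name: angle_degrees(x, y) for name, (x, y) in cases.items()}
for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

The similarities 1, 0, and roughly 0.7 correspond to angles of about 0, 90, and 45 degrees.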
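  9. Cosine similarity

    Example

               term 1  term 2  term 3  term 4  term 5
      doc x       1       0       1       0       3
      doc y       0       2       4       0       1

    Table: Document-term vectors with term frequencies.

      $$x = \langle 1, 0, 1, 0, 3 \rangle, \qquad y = \langle 0, 2, 4, 0, 1 \rangle$$

      $$\mathrm{sim}_{\cos}(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$

      $$= \frac{1 \times 0 + 0 \times 2 + 1 \times 4 + 0 \times 0 + 3 \times 1}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 3^2} \, \sqrt{0^2 + 2^2 + 4^2 + 0^2 + 1^2}} = \frac{7}{\sqrt{11} \, \sqrt{21}}$$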
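For reference, the final fraction can be evaluated numerically. The decimal value is computed here and does not appear on the slide.

```latex
\mathrm{sim}_{\cos}(x, y) = \frac{7}{\sqrt{11}\,\sqrt{21}} = \frac{7}{\sqrt{231}} \approx 0.46
```

For the same two vectors, the Jaccard similarity from the earlier example is 2/4 = 0.5, so the two measures give close but not identical scores for this pair.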