Information Retrieval and Text Mining 2021 - Text Similarity

University of Stavanger, DAT640, 2021 fall

Krisztian Balog

August 31, 2021

Transcript

  1. Text Similarity

    [DAT640] Information Retrieval and Text Mining
    Krisztian Balog, University of Stavanger
    August 31, 2021, CC BY 4.0
  2. Text similarity

    • Core ingredient in many text mining and information retrieval problems
    • Need to express the similarity between two pieces of text (referred to as documents, for simplicity)
    • The choice of the similarity measure is closely tied with how documents are represented
  3. Jaccard similarity

    • Jaccard similarity: only the presence/absence of terms in documents is considered, with no regard to magnitude
    • Defined as the ratio of shared terms to total terms in two documents:
      $$\mathrm{sim}_{\mathrm{Jaccard}}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
      ◦ where $X$ and $Y$ represent the sets of terms that appear in documents $d_1$ and $d_2$, respectively
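The set-based definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `jaccard_similarity`, the lowercase-and-split tokenization, and the choice to return 0.0 for two empty documents are all assumptions made for the example.

```python
def jaccard_similarity(doc1: str, doc2: str) -> float:
    """Jaccard similarity between the term sets of two documents."""
    # Assumption: terms are obtained by lowercasing and splitting on whitespace.
    x = set(doc1.lower().split())
    y = set(doc2.lower().split())
    union = x | y
    if not union:
        # Assumption: two empty documents get similarity 0.0
        # (the ratio is undefined when the union is empty).
        return 0.0
    return len(x & y) / len(union)


# Shared terms: {the, sat, on}; all terms: 7 distinct words, so 3/7.
print(jaccard_similarity("the cat sat on the mat", "the dog sat on the log"))
```

Note that the repeated word "the" counts only once, since only presence or absence matters.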
  4. Jaccard similarity

    • Jaccard similarity for term vector-based representations:
      $$\mathrm{sim}_{\mathrm{Jaccard}}(\mathbf{x}, \mathbf{y}) = \frac{\sum_i \mathbf{1}(x_i) \times \mathbf{1}(y_i)}{\sum_i \mathbf{1}(x_i + y_i)}$$
      ◦ here $\mathbf{1}(x)$ is an indicator function (1 if $x > 0$ and 0 otherwise).

    Example

               term 1  term 2  term 3  term 4  term 5
      doc x         1       0       1       0       3
      doc y         0       2       4       0       1

    Table: Document-term vectors with term frequencies.

    $$\mathbf{x} = (1, 0, 1, 0, 3), \quad \mathbf{y} = (0, 2, 4, 0, 1)$$
    $$\mathrm{sim}_{\mathrm{Jaccard}}(\mathbf{x}, \mathbf{y}) = \frac{0 + 0 + 1 + 0 + 1}{1 + 1 + 1 + 0 + 1} = \frac{2}{4}$$
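The vector-based formula and the worked example can be checked with a short Python sketch. The names `indicator` and `jaccard_vectors` are hypothetical, and returning 0.0 when both vectors are all zeros is an assumption, since the formula leaves that case undefined.

```python
def indicator(value: float) -> int:
    """Indicator function: 1 if value > 0, and 0 otherwise."""
    return 1 if value > 0 else 0


def jaccard_vectors(x: list[float], y: list[float]) -> float:
    """Jaccard similarity for two term-frequency vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Numerator: terms present in both documents.
    shared = sum(indicator(a) * indicator(b) for a, b in zip(x, y))
    # Denominator: terms present in at least one document.
    total = sum(indicator(a + b) for a, b in zip(x, y))
    # Assumption: all-zero vectors get similarity 0.0.
    return shared / total if total else 0.0


x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]
print(jaccard_vectors(x, y))  # 0.5, i.e. 2/4
```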
  5. Cosine similarity

    • Cosine similarity: the cosine of the angle between the two document vectors plotted in their high-dimensional space; the larger the angle, the more dissimilar the documents are:
      $$\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
      ◦ where $\mathbf{x}$ and $\mathbf{y}$ are the term vectors corresponding to documents $d_1$ and $d_2$, respectively
      ◦ Term weights ($x_i$ and $y_i$) may be raw term counts or TF-IDF-weighted frequencies
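As a rough sketch, the cosine formula translates to Python as follows. The function name `cosine_similarity` is hypothetical, raw term counts are used as weights, and returning 0.0 for a zero-length vector is an assumption, because the formula is undefined when either norm is zero.

```python
import math


def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between two term vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: a document with no terms has similarity 0.0 to anything.
        return 0.0
    return dot / (norm_x * norm_y)


# Raw term counts for two documents over a five-term vocabulary.
x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]
print(round(cosine_similarity(x, y), 4))  # 0.4606, i.e. 7 / (sqrt(11) * sqrt(21))
```

Unlike Jaccard similarity, the result depends on the term weights and not only on which terms are present.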
  6. Cosine similarity - Geometric interpretation

               term 1  term 2
      doc x         1       2
      doc y         2       4

    $$\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = 1$$
  7. Cosine similarity - Geometric interpretation

               term 1  term 2
      doc x         1       0
      doc y         0       2

    $$\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = 0$$
  8. Cosine similarity - Geometric interpretation

               term 1  term 2
      doc x         4       2
      doc y         1       3

    $$\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) \approx 0.7$$
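To make the geometric reading concrete, the sketch below converts the cosine value into the angle between two 2-dimensional document vectors. The helper names `cosine` and `angle_degrees` are hypothetical, and clamping to [-1, 1] is a defensive assumption against floating-point rounding.

```python
import math


def cosine(x: list[float], y: list[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero and equally long."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def angle_degrees(x: list[float], y: list[float]) -> float:
    """Angle between two vectors, in degrees."""
    # Clamp so that rounding error cannot push the value outside acos's domain.
    cos = max(-1.0, min(1.0, cosine(x, y)))
    return math.degrees(math.acos(cos))


pairs = [
    ([1, 2], [2, 4]),  # same direction: similarity 1, angle about 0 degrees
    ([1, 0], [0, 2]),  # orthogonal: similarity 0, angle 90 degrees
    ([4, 2], [1, 3]),  # in between: similarity about 0.7, angle about 45 degrees
]
for x, y in pairs:
    print(x, y, round(cosine(x, y), 2), round(angle_degrees(x, y), 1))
```

Vector length does not matter here: doc y in the first pair is twice as long as doc x, yet the similarity is 1 because the two vectors point in the same direction.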
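  9. Cosine similarity

    Example

               term 1  term 2  term 3  term 4  term 5
      doc x         1       0       1       0       3
      doc y         0       2       4       0       1

    Table: Document-term vectors with term frequencies.

    $$\mathbf{x} = (1, 0, 1, 0, 3), \quad \mathbf{y} = (0, 2, 4, 0, 1)$$
    $$\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
    $$= \frac{1 \times 0 + 0 \times 2 + 1 \times 4 + 0 \times 0 + 3 \times 1}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 3^2} \, \sqrt{0^2 + 2^2 + 4^2 + 0^2 + 1^2}} = \frac{7}{\sqrt{11} \, \sqrt{21}}$$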