Unsupervised Context Retrieval for Long-tail Entities

Date: October 5, 2019
Venue: Santa Clara, CA, USA. The 2019 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR '19)
Corresponding article: https://arxiv.org/abs/1908.01798
Video: https://www.youtube.com/watch?v=aCi0y_WY_Gk

Please cite, link to or credit this presentation when using it or part of it in your work.

#InformationRetrieval #IR #EntityOrientedSearch #EntityLinking #LongTailEntities #InformationExtraction #SmallData #LimitedData #DataLabeling

Darío Garigliotti

October 05, 2019

Transcript

  1. Unsupervised Context Retrieval for Long-tail Entities
    Darío Garigliotti, Dyaa Albakour, Miguel Martinez, and Krisztian Balog
    IAI, University of Stavanger, Norway + Signal AI, UK
    The 2019 ACM SIGIR International Conference on the Theory of Information Retrieval, Santa Clara, CA
  2. Motivation
    • Monitoring entities in media streams often relies on rich entity representations, like structured information available in a knowledge base.
    • Long-tail entities are hard to monitor, due to their limited, if not entirely missing, representation in the reference knowledge base.
    Unsupervised Context Retrieval for Long-tail Entities - ICTIR 2019
  3. Problem Statement
    • Given a (long-tail) entity e
    • And a set of contexts (here, sentences)
    • Context retrieval problem: rank each context c in the set according to how likely it is that e is actually mentioned in c
  4. Problem Statement
    • Example: context retrieval for the entity Isai (an investment fund)
      - "Capital firm Isai just raised a new $175 million fund." → Isai (Investment fund)
      - "S.J. Surya's Isai begins with a curious disclaimer." → Isai (Movie)
  5. Problem Statement
    - Isai (Investment fund): "Capital firm Isai just raised a new $175 million fund."
    - Isai (Movie): "S.J. Surya's Isai begins with a curious disclaimer."
  6. Problem Statement
    Context Retrieval
    e = Isai
    C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }
      - "Capital firm Isai just raised a new $175 million fund." → Isai (Investment fund)
      - "S.J. Surya's Isai begins with a curious disclaimer." → Isai (Movie)
  7. Approach
    Context Retrieval
    e = Isai
    C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }
  8. Approach
    Support Entity Ranking (SER): importance of a support entity e~ for e
    Context Retrieval
    e = Isai
    C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }
    SER: e~_1 = Dorm Room Fund, e~_2 = Fundica, e~_3 = Venture Partners, …
  9. Approach
    Support Context Ranking (SCR): importance of a support context c~ for a support entity e~
    Context Retrieval
    e = Isai
    C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }
    SER: e~_1 = Dorm Room Fund, e~_2 = Fundica, e~_3 = Venture Partners, …
    SCR:
      - C~_1 = { c~_{1,1}, c~_{1,2}, … }
      - C~_2 = { "Fundica held the finals of their Roadshow.", … }
      - C~_3 = { c~_{3,1}, c~_{3,2}, … }
      - …
  10. Approach
    Context-to-Context Ranking (CCR): importance of a support context c~ for c, given that an alias of e is mentioned in c
    Context Retrieval
    e = Isai
    C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }
    SER: e~_1 = Dorm Room Fund, e~_2 = Fundica, e~_3 = Venture Partners, …
    SCR:
      - C~_1 = { c~_{1,1}, c~_{1,2}, … }
      - C~_2 = { "Fundica held the finals of their Roadshow.", … }
      - C~_3 = { c~_{3,1}, c~_{3,2}, … }
      - …
    CCR
  11. Approach
    Context Retrieval
    e = Isai
    C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }
    SER: e~_1 = Dorm Room Fund, e~_2 = Fundica, e~_3 = Venture Partners, …
    SCR:
      - C~_1 = { c~_{1,1}, c~_{1,2}, … }
      - C~_2 = { "Fundica held the finals of their Roadshow.", … }
      - C~_3 = { c~_{3,1}, c~_{3,2}, … }
      - …
    CCR
  12. Approach
    • Our framework makes it possible to estimate P(c|e), the probability that the alias mentioned in a context c refers to the long-tail entity e
    $$P(c \mid e) = \sum_{\tilde{e}} \sum_{\tilde{c}} \underbrace{P(c \mid e, \tilde{c})}_{\text{CCR}} \, \underbrace{P(\tilde{c} \mid \tilde{e})}_{\text{SCR}} \, \underbrace{P(\tilde{e} \mid e)}_{\text{SER}}$$
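
The sum over support entities and support contexts on slide 12 can be sketched in Python as follows. This is a minimal illustration, assuming the three component estimates are already available as plain dictionaries. The function name `context_score`, the dictionary layout, the context identifiers, and all probability values are hypothetical and are not taken from the paper.

```python
from typing import Dict


def context_score(
    p_support_entity: Dict[str, float],               # SER: P(e~ | e)
    p_support_context: Dict[str, Dict[str, float]],   # SCR: P(c~ | e~), keyed by e~ then c~
    p_ccr: Dict[str, float],                          # CCR: P(c | e, c~), keyed by c~
) -> float:
    """Sum CCR * SCR * SER over all support entities and their support contexts."""
    total = 0.0
    for entity, p_entity in p_support_entity.items():
        for support_ctx, p_ctx in p_support_context.get(entity, {}).items():
            total += p_ccr.get(support_ctx, 0.0) * p_ctx * p_entity
    return total


# Toy numbers (hypothetical) for two support entities of e = Isai.
ser = {"Fundica": 0.6, "Dorm Room Fund": 0.4}
scr = {
    "Fundica": {"s1": 1.0},
    "Dorm Room Fund": {"s2": 0.5, "s3": 0.5},
}
# CCR estimates for two candidate contexts c, one per sense of "Isai".
ccr_fund_context = {"s1": 0.8, "s2": 0.6, "s3": 0.4}
ccr_movie_context = {"s1": 0.1, "s2": 0.2, "s3": 0.0}

fund_score = context_score(ser, scr, ccr_fund_context)
movie_score = context_score(ser, scr, ccr_movie_context)
print(fund_score, movie_score)
```

With these toy values the investment-fund context receives the higher score, which is the ranking behaviour the framework aims for.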
  13. Component Estimators
    • Component: Support Entity Ranking
      - Basic: BM25 (k1=1.2, b=0.8), using the description of e as a query over the opening_text field of each Wikipedia article in an index
      - Pop: the basic score is multiplied by the popularity of the support entity
      - Types: starting from the basic ranking, entities that share no type with e are removed
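
The three Support Entity Ranking settings can be sketched as below. This is only an illustration under stated assumptions: the scorer is a generic Okapi BM25 with a Lucene-style IDF and whitespace tokenization, not the indexing setup used in the paper. The toy documents, the query text, the popularity values, and the type labels are all hypothetical. Only k1=1.2 and b=0.8 come from the slide.

```python
import math
from collections import Counter

K1, B = 1.2, 0.8  # BM25 parameters stated on the slide


def bm25_scores(query, docs, k1=K1, b=B):
    """Score every document in `docs` (name -> text) against `query`."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency of each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(query.lower().split())  # assumption: query terms deduplicated
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Lucene-style IDF (an assumption; other BM25 variants differ here).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def pop_variant(basic, popularity):
    """Pop: multiply the basic score by the support entity's popularity."""
    return {name: s * popularity.get(name, 0.0) for name, s in basic.items()}


def types_variant(basic, entity_types, support_types):
    """Types: drop support entities that share no type with e."""
    return {
        name: s
        for name, s in basic.items()
        if entity_types & support_types.get(name, set())
    }


# Hypothetical opening texts standing in for Wikipedia articles.
docs = {
    "Fundica": "fundica is a funding search platform for startups",
    "Venture Partners": "venture partners is a venture capital firm investing in startups",
    "Isai (film)": "isai is a tamil film directed by s j surya",
}
description = "venture capital fund investing in technology startups"  # hypothetical description of e

basic = bm25_scores(description, docs)
popped = pop_variant(basic, {"Fundica": 0.2, "Venture Partners": 0.5, "Isai (film)": 0.9})
typed = types_variant(
    basic,
    {"Company"},
    {"Fundica": {"Company"}, "Venture Partners": {"Company"}, "Isai (film)": {"Film"}},
)
print(basic, popped, typed)
```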
  14. Component Estimators
    • Component: Context-to-Context Ranking
      - Retrieval score (BM25) of c with c~ as a query over a context index
      - Semantic (cosine) similarity between the term-averaged word2vec vectors for c and c~
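
The semantic Context-to-Context Ranking estimator can be illustrated with a small pure-Python sketch. The three-dimensional vectors below are made-up stand-ins for word2vec embeddings, and the tokenizer and the helper names (`average_vector`, `cosine`, `semantic_ccr`) are assumptions chosen for illustration. Terms without a vector are simply skipped, which is also an assumption.

```python
import math
import re


def average_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_ccr(context, support_context, embeddings):
    """Cosine similarity between the term-averaged vectors of c and c~."""
    u = average_vector(re.findall(r"\w+", context.lower()), embeddings)
    v = average_vector(re.findall(r"\w+", support_context.lower()), embeddings)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


# Toy embeddings (hypothetical values, not real word2vec vectors).
toy_embeddings = {
    "fund": [0.9, 0.1, 0.0],
    "capital": [0.8, 0.2, 0.0],
    "roadshow": [0.7, 0.3, 0.1],
    "finals": [0.5, 0.3, 0.2],
    "disclaimer": [0.0, 0.2, 0.9],
    "curious": [0.1, 0.1, 0.8],
}

support = "Fundica held the finals of their Roadshow."
fund_sim = semantic_ccr(
    "Capital firm Isai just raised a new $175 million fund.", support, toy_embeddings
)
movie_sim = semantic_ccr(
    "S.J. Surya's Isai begins with a curious disclaimer.", support, toy_embeddings
)
print(fund_sim, movie_sim)
```

With these toy vectors the investment-fund sentence is closer to the support context than the movie sentence.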
  15. Test Collection
    • 165 long-tail entities, 73 of which did not have a corresponding Wikipedia article by 2018-10-01
    • For each entity e, 5k- contexts (from a proprietary collection of news articles) with an alias of e are ranked with each combination of estimators
    • The top 20 contexts per ranking are pooled, leading to 4,536 contexts annotated with binary relevance
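
The pooling step can be sketched as taking the union of the top-k contexts from every ranking produced for one entity. The function name `pool_top_k`, the run names, and the context identifiers below are hypothetical. The slide states a depth of 20, while the toy call uses k=2 to keep the example short.

```python
def pool_top_k(rankings, k=20):
    """Union of the top-k items of each ranking (run name -> ranked list)."""
    pooled = set()
    for ranking in rankings.values():
        pooled.update(ranking[:k])
    return pooled


# Two hypothetical rankings of the same contexts for one entity.
toy_rankings = {
    "basic+bm25": ["c1", "c2", "c3"],
    "basic+semantic": ["c2", "c4", "c1"],
}
pool = pool_top_k(toy_rankings, k=2)
print(sorted(pool))
```

Only the pooled contexts would then be sent for binary relevance annotation.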
  16. Experimental Results
    • RQ: What is the best way to estimate each component in our approach?
      - Basic SER setting outperforms its pop and types variants, with high significance, in terms of MAP and MRR, for each CCR setting
      - The semantic definition leads to a more robust CCR
    • RQ: How does our approach perform for context retrieval?
      - Our method outperforms the baseline [1] with high significance
    • RQ: How does it perform for entities with and without a corresponding representation in Wikipedia?
      - It outperforms the baseline [1] in both subsets, and in particular is robust for the long-tail entities
    [1] Roi Blanco and Hugo Zaragoza. 2010. Finding Support Sentences for Entities. In Proc. of SIGIR. 339–346.
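
For readers unfamiliar with the two measures named in the results, the sketch below computes MAP and MRR from binary relevance judgments using their standard definitions. The rankings and judgments are toy data invented for illustration and do not reproduce any number from the paper.

```python
def average_precision(ranking, relevant):
    """Mean of precision@rank over the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Toy rankings and binary judgments for two hypothetical entities.
runs = {
    "entity_a": (["c1", "c2", "c3", "c4"], {"c1", "c3"}),
    "entity_b": (["c5", "c6"], {"c6"}),
}
map_score = sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
mrr_score = sum(reciprocal_rank(r, rel) for r, rel in runs.values()) / len(runs)
print(map_score, mrr_score)
```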