Unsupervised Context Retrieval for Long-tail Entities

Date: October 5, 2019
Venue: ICTIR '19, the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, Santa Clara, CA, USA
Corresponding article: https://arxiv.org/abs/1908.01798
Video: https://www.youtube.com/watch?v=aCi0y_WY_Gk

Please cite the paper, and link to or credit this presentation when using it, in whole or in part, in your work.

#InformationRetrieval #IR #EntityOrientedSearch #EntityLinking #LongTailEntities #InformationExtraction #SmallData #LimitedData #DataLabeling

Darío Garigliotti

October 04, 2019

Transcript

  1. UNSUPERVISED CONTEXT RETRIEVAL FOR LONG-TAIL ENTITIES

     Darío Garigliotti¹·², Dyaa Albakour², Miguel Martinez², and Krisztian Balog¹
     ¹ IAI, University of Stavanger, Norway
     ² Signal AI, UK
     (Work done while Darío Garigliotti was a visiting researcher at Signal AI.)

     MOTIVATION AND PROBLEM STATEMENT
     • Monitoring entities in media streams often relies on rich entity representations, such as the information available in a knowledge base (KB) or in a knowledge repository, e.g., Wikipedia.
     • Long-tail entities are hard to monitor due to their limited, if not entirely missing, representation in such knowledge catalogs.
     • Given a long-tail entity e and a set of contexts (here, sentences), the context retrieval problem consists of ranking each context c according to how likely it is that e is actually mentioned in c.

     GENERATIVE MODELING FRAMEWORK
     • Our framework enables us to estimate P(c|e), the probability that the alias mentioned in a context c refers to the long-tail entity e:

       P(c \mid e) = \sum_{\tilde{e}} \sum_{\tilde{c}} \underbrace{P(c \mid e, \tilde{c})}_{\text{CCR}} \, \underbrace{P(\tilde{c} \mid \tilde{e})}_{\text{SCR}} \, \underbrace{P(\tilde{e} \mid e)}_{\text{SER}}

     - Support Entity Ranking (SER): the importance of a support entity ẽ for e.
     - Support Context Ranking (SCR): the importance of a support context c̃ for a support entity ẽ.
     - Context-to-Context Ranking (CCR): the importance of a support context c̃ for c, given that an alias of e is mentioned in c.

     [Figure: the framework illustrated for the long-tail entity e = Isai. SER retrieves support entities ẽ₁ = Dorm Room Fund, ẽ₂ = Fundica, ẽ₃ = Venture Partners, …; SCR retrieves a set of support contexts per support entity, e.g., C̃₂ = { "Fundica held the finals of their Roadshow.", … }; CCR scores the candidate contexts C = { "Capital firm Isai just raised a new $175 million fund.", "S.J. Surya's Isai begins with a curious disclaimer.", … }.]

     COMPONENT ESTIMATORS
     • Support Entity Ranking:
     - Basic: BM25 (k1 = 1.2, b = 0.8), using the description of e as a query over the opening_text field of each Wikipedia article in an index.
     - Pop: the Basic score is multiplied by the popularity of the support entity.
     - Types: starting from Basic, entities that have no types in common with e are removed.
     • Context-to-Context Ranking:
     - Retrieval score (BM25) of c, with c̃ as a query over a context index.
     - Semantic (cosine) similarity between the term-averaged word2vec vectors of c and c̃.
     • (We refer to the paper for experiments on the best way to estimate each component.)

     TEST COLLECTION
     • 165 long-tail entities, 73 of which were not in Wikipedia as of 2018-10-01.
     • For each entity e, up to 5k contexts (from a proprietary collection of news articles) containing an alias of e are ranked with each combination of estimators.
     • The top 20 contexts per ranking are pooled, resulting in 4,536 contexts annotated with binary relevance.

     EXPERIMENTAL RESULTS
     • RQ1: How does our approach perform for context retrieval?
     - Our method outperforms the baseline with high significance, in terms of MAP and MRR.
     • RQ2: How does it perform for entities with and without representation in Wikipedia?
     - It outperforms the baseline in both subsets and, in particular, is robust for the long-tail entities.

     Table. Context retrieval performance. ‡ denotes statistical significance against the baseline, tested using a two-tailed paired t-test at p < 0.001.

                              All                In Wikipedia       Not in Wikipedia
       Method                 MAP      MRR       MAP      MRR       MAP      MRR
       Spotlight, θ = 0.6     0.0406   0.0826    0.0726   0.1475    0.0004   0.0008
       Spotlight, θ = 0.9     0.0393   0.0811    0.0701   0.1447    0.0004   0.0008
       Baseline [1]           0.2248   0.3357    0.1732   0.2787    0.2899   0.4077
       Our method             0.5195‡  0.6133‡   0.4900‡  0.6032‡   0.5565‡  0.6261‡

     References:
     [1] Roi Blanco and Hugo Zaragoza. 2010. Finding Support Sentences for Entities. In Proc. of SIGIR. 339–346.

     Get the paper: https://arxiv.org/abs/1908.01798
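To make the generative decomposition above concrete, here is a minimal sketch, not the authors' code, of how the three component estimators could be combined into a P(c|e) score; every name below (ser, scr, ccr, score_context) is hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical estimator interfaces; names are illustrative, not from the paper.
Ser = Callable[[str], Dict[str, float]]  # e  -> {e~: P(e~|e)}
Scr = Callable[[str], Dict[str, float]]  # e~ -> {c~: P(c~|e~)}
Ccr = Callable[[str, str], float]        # (c, c~) -> P(c|e, c~)

def score_context(c: str, e: str, ser: Ser, scr: Scr, ccr: Ccr) -> float:
    """P(c|e) = sum over e~ and c~ of P(c|e, c~) * P(c~|e~) * P(e~|e)."""
    total = 0.0
    for e_sup, p_e_sup in ser(e).items():          # SER: support entities for e
        for c_sup, p_c_sup in scr(e_sup).items():  # SCR: support contexts for e~
            total += ccr(c, c_sup) * p_c_sup * p_e_sup  # CCR ties c~ back to c
    return total

def rank_contexts(contexts: List[str], e: str, ser: Ser, scr: Scr, ccr: Ccr) -> List[str]:
    """Rank the candidate contexts of entity e by the estimated P(c|e), best first."""
    return sorted(contexts, key=lambda c: score_context(c, e, ser, scr, ccr), reverse=True)
```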
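A sketch of the SER estimators under stated assumptions: it substitutes the rank_bm25 package for whatever retrieval stack the authors used, and keeps a toy in-memory list in place of a real Wikipedia index; only the BM25 parameters (k1 = 1.2, b = 0.8) and the use of e's description as the query come from the poster.

```python
from rank_bm25 import BM25Okapi  # third-party package: pip install rank-bm25

# Toy stand-in for an index over the opening_text field of Wikipedia articles;
# the titles and texts below are placeholders, not real data.
wiki_titles = ["Dorm Room Fund", "Fundica", "Venture Partners"]
opening_texts = [
    "Dorm Room Fund is a student-run venture fund ...",
    "Fundica is a funding search platform for entrepreneurs ...",
    "Venture Partners is a venture capital firm ...",
]
bm25 = BM25Okapi([t.lower().split() for t in opening_texts], k1=1.2, b=0.8)

def ser_basic(description: str) -> dict:
    """Basic: BM25 score of each candidate support entity's opening_text,
    using the long-tail entity's description as the query."""
    return dict(zip(wiki_titles, bm25.get_scores(description.lower().split())))

def ser_pop(description: str, popularity: dict) -> dict:
    """Pop: the Basic score multiplied by the support entity's popularity."""
    return {t: s * popularity.get(t, 0.0) for t, s in ser_basic(description).items()}

def ser_types(description: str, e_types: set, types_of: dict) -> dict:
    """Types: drop support entities that share no type with e."""
    return {t: s for t, s in ser_basic(description).items()
            if e_types & types_of.get(t, set())}
```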
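The semantic CCR estimator can be sketched with plain numpy, assuming word2vec vectors are available as a term-to-vector mapping; the embeddings dict (and its 300-dimensional default) is an assumption, not something fixed by the poster.

```python
import numpy as np

def avg_vector(text: str, embeddings: dict, dim: int = 300) -> np.ndarray:
    """Term-averaged word2vec vector of a context; out-of-vocabulary terms are skipped."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def ccr_cosine(c: str, c_sup: str, embeddings: dict) -> float:
    """Semantic CCR: cosine similarity between the averaged vectors of c and c~."""
    u, v = avg_vector(c, embeddings), avg_vector(c_sup, embeddings)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0
```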
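Finally, since the results table reports MAP and MRR over binary relevance, here is a small self-contained sketch of both metrics as conventionally defined; the toy judgments are invented purely for illustration.

```python
def average_precision(rels):
    """AP of one ranking, given binary relevance labels in rank order.
    (Divides by the relevant items found in the ranking; this equals standard AP
    when the ranking contains every judged context for the entity.)"""
    hits, ap = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / i
    return ap / hits if hits else 0.0

def reciprocal_rank(rels):
    """RR: inverse rank of the first relevant context, 0 if none is found."""
    return next((1.0 / i for i, rel in enumerate(rels, start=1) if rel), 0.0)

# Toy rankings for three entities (1 = relevant context, 0 = not relevant).
runs = [[1, 0, 1, 0], [0, 0, 1], [1, 1, 0]]
print("MAP:", sum(map(average_precision, runs)) / len(runs))
print("MRR:", sum(map(reciprocal_rank, runs)) / len(runs))
```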