Information Retrieval and Text Mining 2020 - Table Retrieval

University of Stavanger, DAT640, 2020 fall
Invited lecture by Shuo Zhang

Krisztian Balog

November 01, 2020
Transcript

  1. AD HOC TABLE RETRIEVAL USING SEMANTIC SIMILARITY
     Shuo Zhang, Research Scientist, AI Group @Bloomberg (@imsure318)
     October 2020
     © 2020 Bloomberg Finance L.P. All rights reserved.
  2. STATISTICS ON TABLES
     • Web Tables: the WebTables system extracts 14.1 billion HTML tables and finds 154M of them to be high-quality [1]
     • Web Tables: Lehmberg et al. (2016) extract 233M content tables from Common Crawl 2015 [2]
     • Wikipedia Tables: the current snapshot of Wikipedia contains more than 3.23M tables from 520k articles
     • Spreadsheets: the number of worldwide spreadsheet users is estimated to exceed 400M, and about 50 to 80% of businesses use spreadsheets
     • …
     [1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     [2] Lehmberg et al. A Large Public Corpus of Web Tables Containing Time and Context Metadata. WWW Companion (2016)
  3. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
     Formula 1 constructors’ statistics 2016
     Constructor  | Engine   | Country | Base
     Ferrari      | Ferrari  | Italy   | Italy
     Force India  | Mercedes | India   | UK
     Haas         | Ferrari  | US      | US & UK
     Manor        | Mercedes | UK      | UK
     …            | …        | …       | …
  4. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
     [Same Formula 1 constructors’ statistics 2016 table as above, highlighting the table caption]
  5. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
     [Same table as above, highlighting the core column (subject column)]
     We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base
  6. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
     [Same table as above, highlighting the heading column labels (table schema)]
  7. TASK
     • Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus
     [Screenshot: searching for "Singapore" returns a ranked list of Wikipedia tables, e.g., "Singapore - Wikipedia, Economy Statistics (Recent Years)" (GDP/GNI figures for 2011-2013, 5 rows total) and "Singapore - Wikipedia, Language used most frequently at home" (language shares, 6 rows total), both from https://en.wikipedia.org/wiki/Singapore]
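To make the input and output of the task concrete, here is a minimal Python sketch; the Table fields and the rank_tables signature are illustrative choices, not part of the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Table:
    """A relational table together with the context of its embedding page."""
    table_id: str
    page_title: str
    section_title: str
    caption: str
    headings: List[str]      # heading column labels (table schema)
    rows: List[List[str]]    # table body, row by row


def rank_tables(query: str,
                corpus: List[Table],
                score: Callable[[str, Table], float]) -> List[Tuple[Table, float]]:
    """Ad hoc table retrieval: score every table in the corpus against the
    keyword query and return the tables sorted by descending score."""
    scored = [(table, score(query, table)) for table in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```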
  8. APPROACHES
     • Unsupervised methods
       • Build a document-based representation for each table, then employ conventional document retrieval methods
     • Supervised methods
       • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")
  9. UNSUPERVISED METHODS
     • Single-field document representation
       • All table content, no structure
     • Multi-field document representation
       • Separate document fields for embedding document’s title, section title, table caption, table body, and table headings
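A sketch of the two document-based representations described on this slide, reusing the Table class from the sketch above; the plain term-frequency scorer and the field weights are stand-ins of my own for a proper retrieval model (e.g., BM25 or a language model).

```python
import re
from collections import Counter

def tokenize(text: str):
    """Lowercase word tokenizer used throughout these sketches."""
    return re.findall(r"\w+", text.lower())

def single_field_doc(table: Table) -> Counter:
    """Single-field representation: all table content and page context
    flattened into one bag of words, ignoring structure."""
    parts = [table.page_title, table.section_title, table.caption,
             " ".join(table.headings)] + [" ".join(row) for row in table.rows]
    return Counter(tokenize(" ".join(parts)))

def multi_field_doc(table: Table) -> dict:
    """Multi-field representation: separate fields for the embedding
    document's title, section title, table caption, headings, and body."""
    return {
        "page_title": Counter(tokenize(table.page_title)),
        "section_title": Counter(tokenize(table.section_title)),
        "caption": Counter(tokenize(table.caption)),
        "headings": Counter(tokenize(" ".join(table.headings))),
        "body": Counter(tokenize(" ".join(c for row in table.rows for c in row))),
    }

# Illustrative field weights; a real system would use a field-weighted
# retrieval model (e.g., BM25F or a mixture of language models).
FIELD_WEIGHTS = {"page_title": 0.2, "section_title": 0.1, "caption": 0.3,
                 "headings": 0.2, "body": 0.2}

def multi_field_score(query: str, table: Table) -> float:
    """Toy scorer: weighted sum of query-term frequencies per field."""
    q_terms = tokenize(query)
    fields = multi_field_doc(table)
    return sum(weight * sum(fields[name][t] for t in q_terms)
               for name, weight in FIELD_WEIGHTS.items())
```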
  10. SUPERVISED METHODS
      • Three groups of features
        • Query features
          • #query terms, query IDF scores
        • Table features
          • Table properties: #rows, #cols, #empty cells, etc.
          • Embedding document: link structure, number of tables, etc.
        • Query-table features
          • Query terms found in different table elements, LM score, etc.
          • Our novel semantic matching features
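A sketch of how such a feature vector could be assembled for one query-table pair; the actual feature set used in the work is richer. The tokenize helper and Table class come from the sketches above, and the IDF table is assumed to be precomputed.

```python
def extract_features(query: str, table: Table, idf: dict) -> list:
    """Assemble query, table, and query-table features for one query-table
    pair; the resulting vectors feed a learning-to-rank model."""
    q_terms = tokenize(query)
    body_terms = set(tokenize(" ".join(c for row in table.rows for c in row)))
    heading_terms = set(tokenize(" ".join(table.headings)))

    # Query features.
    num_query_terms = len(q_terms)
    sum_query_idf = sum(idf.get(t, 0.0) for t in q_terms)

    # Table features.
    num_rows = len(table.rows)
    num_cols = len(table.headings)
    num_empty_cells = sum(1 for row in table.rows for c in row if not c.strip())

    # Query-table features.
    q_terms_in_body = sum(1 for t in q_terms if t in body_terms)
    q_terms_in_headings = sum(1 for t in q_terms if t in heading_terms)

    return [num_query_terms, sum_query_idf,
            num_rows, num_cols, num_empty_cells,
            q_terms_in_body, q_terms_in_headings]
```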
  11. Can we go beyond lexical matching and improve keyword table search performance by incorporating semantic matching?
  12. SEMANTIC MATCHING
      • Main objective: go beyond term-based matching
      • Three components:
        1. Content extraction
        2. Semantic representations
        3. Similarity measures
  13. SEMANTIC MATCHING: 1. CONTENT EXTRACTION
      • The “raw” content of a query/table is represented as a set of terms, which can be words or entities
      [Diagram: query terms q1 … qn and table terms t1 … tm]
  14. SEMANTIC MATCHING: 1. CONTENT EXTRACTION
      • The “raw” content of a query/table is represented as a set of terms, which can be words or entities
      • Entity-based:
        - Top-k ranked entities from a knowledge base
        - Entities in the core table column
        - Top-k ranked entities using the embedding document/section title as a query
      [Diagram: query terms q1 … qn and table terms t1 … tm]
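A sketch of the content-extraction step from this slide; entity linking of the core column and entity retrieval against the knowledge base are treated as black boxes, and the function and parameter names are placeholders rather than a real API.

```python
from typing import Callable, List, Set

def word_based_content(text: str) -> Set[str]:
    """Word-based raw content: the set of word tokens in the text."""
    return set(tokenize(text))

def entity_based_table_content(table: Table,
                               core_column_entities: List[str],
                               retrieve_entities: Callable[[str], List[str]],
                               k: int = 10) -> Set[str]:
    """Entity-based raw content of a table, combining:
      - entities linked in the core (subject) column, and
      - the top-k entities retrieved from the KB using the embedding
        document / section title as a query.
    `retrieve_entities` is a placeholder for an entity retrieval system."""
    entities = set(core_column_entities)
    entities.update(retrieve_entities(table.page_title + " " + table.section_title)[:k])
    return entities
```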
  15. SEMANTIC MATCHING: 2. SEMANTIC REPRESENTATIONS
      • Each of the raw terms is mapped to a semantic vector representation
      [Diagram: raw terms q1 … qn and t1 … tm mapped to semantic vectors q̃1 … q̃n and t̃1 … t̃m]
  16. SEMANTIC REPRESENTATIONS
      • Bag-of-concepts (sparse discrete vectors)
        • Bag-of-entities
          • Each vector element corresponds to an entity
          • t̃i[j] is 1 if there exists a link between entities i and j in the KB
        • Bag-of-categories
          • Each vector element corresponds to a Wikipedia category
          • t̃i[j] is 1 if entity i is assigned to Wikipedia category j
      • Embeddings (dense continuous vectors)
        • Word embeddings
          • Word2Vec (300 dimensions, trained on Google News)
        • Graph embeddings
          • RDF2vec (200 dimensions, trained on DBpedia)
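A sketch of how the sparse bag-of-entities and bag-of-categories vectors and the dense embedding lookup might be realized; the KB link set, category assignments, and pre-trained Word2Vec/RDF2vec models are assumed to be given.

```python
import numpy as np

def bag_of_entities(entity: str, kb_entities: list, kb_links: set) -> np.ndarray:
    """Sparse vector over all KB entities: element j is 1 if `entity` is
    linked to the j-th KB entity in the knowledge base."""
    return np.array([1.0 if (entity, e) in kb_links or (e, entity) in kb_links
                     else 0.0 for e in kb_entities])

def bag_of_categories(entity: str, categories: list, entity_categories: dict) -> np.ndarray:
    """Sparse vector over Wikipedia categories: element j is 1 if the entity
    is assigned to the j-th category."""
    assigned = entity_categories.get(entity, set())
    return np.array([1.0 if c in assigned else 0.0 for c in categories])

def embedding_vector(term: str, model: dict, dim: int) -> np.ndarray:
    """Dense vector: look up a pre-trained word (Word2Vec) or graph (RDF2vec)
    embedding; unseen terms fall back to a zero vector."""
    return np.asarray(model.get(term, np.zeros(dim)))
```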
  17. SEMANTIC MATCHING: 3. SIMILARITY MEASURES
      [Diagram: semantic matching between query vectors q̃1 … q̃n and table vectors t̃1 … t̃m]
  18. SEMANTIC MATCHING: EARLY FUSION MATCHING STRATEGY
      • Early: take the centroid of the query semantic vectors and the centroid of the table semantic vectors, then compute their cosine similarity
      [Diagram: query and table semantic vectors collapsed into centroids before matching]
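A minimal sketch of the early-fusion strategy, assuming the term-level semantic vectors of the query and the table are stacked into NumPy matrices (one row per term).

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, with a zero fallback for all-zero vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def early_fusion(query_vecs: np.ndarray, table_vecs: np.ndarray) -> float:
    """Early fusion: represent query and table by the centroids of their
    term-level semantic vectors and take the cosine of the two centroids."""
    return cosine(query_vecs.mean(axis=0), table_vecs.mean(axis=0))
```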
  19. SEMANTIC MATCHING: LATE FUSION MATCHING STRATEGY
      • Late: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max)
      [Diagram: pairwise matching between query and table semantic vectors, followed by aggregation]
  20. EXPERIMENTAL SETUP
      • Table corpus
        • WikiTables corpus [1]: 1.6M tables extracted from Wikipedia
      • Knowledge base
        • DBpedia (2015-10): 4.6M entities with an English abstract
      • Queries
        • Sampled from two sources [2, 3], e.g.:
          QS-1: video games | us cities | kings of africa | economy gdp
          QS-2: asian countries currency | laptops cpu | food calories | guitars manufacturer
      • Rank-based evaluation
        • NDCG@5, 10, 15, 20
      [1] Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC ’15.
      [2] Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
      [3] Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)
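For reference, a minimal sketch of NDCG@k as used for rank-based evaluation here, with the graded relevance labels (0/1/2) described on the next slide; the linear-gain formulation below is a common choice and an assumption on my part.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_labels, k):
    """NDCG@k: DCG of the system ranking normalized by the ideal DCG."""
    ideal_dcg = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: graded labels (0/1/2) of the top-5 retrieved tables, in ranked order.
print(round(ndcg_at_k([2, 0, 1, 2, 0], k=5), 4))
```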
  21. RELEVANCE ASSESSMENTS
      • Collected via crowdsourcing
      • Pooling to depth 20, 3120 query-table pairs in total
      • Assessors are presented with the following scenario
        • "Imagine that your task is to create a new table on the query topic"
      • A table is …
        • Non-relevant (0): if it is unclear what it is about or it is about a different topic
        • Relevant (1): if some cells or values could be used from it
        • Highly relevant (2): if large blocks or several values could be used from it
  22. RESEARCH QUESTIONS
      • RQ1: Can semantic matching improve retrieval performance?
      • RQ2: Which of the semantic representations is the most effective?
      • RQ3: Which of the similarity measures performs best?
  23. RESULTS: RQ1
      • Can semantic matching improve retrieval performance?
        • Yes. STR achieves substantial and significant improvements over LTR.

      Method                        | NDCG@10 | NDCG@20
      Single-field document ranking | 0.4344  | 0.5254
      Multi-field document ranking  | 0.4860  | 0.5473
      WebTable [1]                  | 0.2992  | 0.3726
      WikiTable [2]                 | 0.4766  | 0.5206
      LTR baseline                  | 0.5456  | 0.6031
      STR (LTR + semantic matching) | 0.6293  | 0.6825

      [1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
      [2] Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA ’13.
  24. RESULTS: RQ3
      • Which of the similarity measures performs best?
        • Late-sum and Late-avg (but it also depends on the representation)
  25. SUMMARY
      • Introduce and address the problem of ad hoc table retrieval
      • Perform semantic matching between queries and tables
      • Evaluate the methods using a purpose-built test collection based on Wikipedia tables
  26. READING
      • Deng et al. Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval, SIGIR 2019
        • https://arxiv.org/pdf/1906.00041.pdf
      • Trabelsi et al. Improved Table Retrieval Using Multiple Context Embeddings for Attributes, IEEE Big Data 2019
        • http://www.cse.lehigh.edu/~brian/pubs/2019/BigData/Improved_Table_Retrieval.pdf
      • Bagheri et al. A Latent Model for Ad Hoc Table Retrieval, ECIR 2020
        • https://link.springer.com/chapter/10.1007%2F978-3-030-45442-5_11
      • Chen et al. Table Search Using a Deep Contextualized Language Model, SIGIR 2020
        • https://arxiv.org/pdf/2005.09207.pdf
      • Shraga et al. Web Table Retrieval Using Multimodal Deep Learning, SIGIR 2020
        • https://dl.acm.org/doi/abs/10.1145/3397271.3401120