
Information Retrieval and Text Mining 2020 - Table Retrieval

University of Stavanger, DAT640, 2020 fall
Invited lecture by Shuo Zhang


Krisztian Balog

November 01, 2020

Transcript

  1. AD HOC TABLE RETRIEVAL USING SEMANTIC SIMILARITY

     Oct 2020. Shuo Zhang, Research Scientist, AI Group @Bloomberg, @imsure318
     © 2020 Bloomberg Finance L.P. All rights reserved.
  2. STATISTICS ON TABLES

     • Web tables: the WebTables system extracts 14.1 billion HTML tables and finds 154M of them to be high-quality [1]
     • Web tables: Lehmberg et al. (2016) extract 233M content tables from Common Crawl 2015 [2]
     • Wikipedia tables: the current snapshot of Wikipedia contains more than 3.23M tables from 520k articles
     • Spreadsheets: the number of worldwide spreadsheet users is estimated to exceed 400M, and about 50 to 80% of businesses use spreadsheets
     • …

     [1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     [2] Lehmberg et al. A Large Public Corpus of Web Tables Containing Time and Context Metadata. WWW Companion (2016)
  3. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     Formula 1 constructors' statistics 2016

     Constructor   Engine     Country   Base
     Ferrari       Ferrari    Italy     Italy
     Force India   Mercedes   India     UK
     Haas          Ferrari    US        US & UK
     Manor         Mercedes   UK        UK
     …             …          …         …

  4. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     Table caption: "Formula 1 constructors' statistics 2016"

  5. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     Core column (subject column): the Constructor column. We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.

  6. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     Heading column labels (table schema): Constructor, Engine, Country, Base
  7. TASK

     • Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus

     [Example search result for the query "Singapore": the ranked list contains Wikipedia tables such as "Economy Statistics (Recent Years)" (columns: Year, GDP Nominal (Billion), GDP Nominal Per Capita, GDP Real (Billion), GNI Nominal (Billion), GNI Nominal Per Capita) and "Language used most frequently at home" (columns: Language, Color in Figure, Percent), both from https://en.wikipedia.org/wiki/Singapore]
  8. APPROACHES

     • Unsupervised methods
       • Build a document-based representation for each table, then employ conventional document retrieval methods
     • Supervised methods
       • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")
  9. UNSUPERVISED METHODS

     • Single-field document representation
       • All table content, no structure
     • Multi-field document representation
       • Separate document fields for the embedding document's title, section title, table caption, table body, and table headings
     (a sketch of both representations follows below)
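To make the two representations concrete, here is a minimal Python sketch (my own illustration, not the lecture's implementation). The field names and the toy table are assumptions; the resulting pseudo documents would then be scored with a conventional retrieval model such as BM25 or a query-likelihood language model.

```python
# Minimal sketch (assumed field names, toy data): building document-based
# representations of a table for conventional document retrieval.

def single_field_doc(table):
    """Single-field: flatten all table content into one text, discarding structure."""
    parts = [table["page_title"], table["section_title"], table["caption"]]
    parts += table["headings"]
    parts += [cell for row in table["rows"] for cell in row]
    return " ".join(parts)

def multi_field_doc(table):
    """Multi-field: keep separate fields so each table element can be weighted individually."""
    return {
        "page_title": table["page_title"],
        "section_title": table["section_title"],
        "caption": table["caption"],
        "headings": " ".join(table["headings"]),
        "body": " ".join(cell for row in table["rows"] for cell in row),
    }

toy_table = {
    "page_title": "Formula One",
    "section_title": "Constructors",
    "caption": "Formula 1 constructors' statistics 2016",
    "headings": ["Constructor", "Engine", "Country", "Base"],
    "rows": [["Ferrari", "Ferrari", "Italy", "Italy"],
             ["Force India", "Mercedes", "India", "UK"]],
}
print(single_field_doc(toy_table))
print(multi_field_doc(toy_table))
```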
  10. SUPERVISED METHODS

     • Three groups of features (a feature-extraction sketch follows below)
       • Query features
         • #query terms, query IDF scores
       • Table features
         • Table properties: #rows, #cols, #empty cells, etc.
         • Embedding document: link structure, number of tables, etc.
       • Query-table features
         • Query terms found in different table elements, LM score, etc.
         • Our novel semantic matching features
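A hedged sketch of what feature extraction for the learning-to-rank setup could look like; the feature names and this small selection are mine for illustration, not the paper's full feature set.

```python
# Illustrative sketch of the three feature groups (query, table, query-table).
# Feature names and selection are assumptions for this example.

def extract_features(query, table):
    q_terms = query.lower().split()
    body_cells = [cell for row in table["rows"] for cell in row]
    body_text = " ".join(body_cells).lower()
    caption = table["caption"].lower()
    headings = " ".join(table["headings"]).lower()

    return {
        # Query features
        "num_query_terms": len(q_terms),
        # Table features
        "num_rows": len(table["rows"]),
        "num_cols": len(table["headings"]),
        "num_empty_cells": sum(1 for cell in body_cells if not cell.strip()),
        # Query-table features
        "q_terms_in_caption": sum(term in caption for term in q_terms),
        "q_terms_in_headings": sum(term in headings for term in q_terms),
        "q_terms_in_body": sum(term in body_text for term in q_terms),
    }

# The resulting feature vectors, paired with relevance labels, are fed to a
# standard pointwise or pairwise learning-to-rank model.
```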
  11. Can we go beyond lexical matching and improve keyword table search performance by incorporating semantic matching?
  12. SEMANTIC MATCHING

     • Main objective: go beyond term-based matching
     • Three components:
       1. Content extraction
       2. Semantic representations
       3. Similarity measures
  13. SEMANTIC MATCHING 1. CONTENT EXTRACTION

     • The "raw" content of a query/table is represented as a set of terms, which can be words or entities
     [Diagram: query terms q_1 … q_n and table terms t_1 … t_m]
  14. SEMANTIC MATCHING 1. CONTENT EXTRACTION

     • The "raw" content of a query/table is represented as a set of terms, which can be words or entities
     • Entity-based:
       • Top-k ranked entities from a knowledge base
       • Entities in the core table column
       • Top-k ranked entities using the embedding document/section title as a query
  15. SEMANTIC MATCHING 2. SEMANTIC REPRESENTATIONS

     • Each of the raw terms is mapped to a semantic vector representation
     [Diagram: query terms q_1 … q_n are mapped to vectors ~q_1 … ~q_n, and table terms t_1 … t_m to vectors ~t_1 … ~t_m]
  16. SEMANTIC REPRESENTATIONS

     • Bag-of-concepts (sparse discrete vectors)
       • Bag-of-entities
         • Each vector element corresponds to an entity
         • ~t_i[j] is 1 if there exists a link between entities i and j in the KB
       • Bag-of-categories
         • Each vector element corresponds to a Wikipedia category
         • ~t_i[j] is 1 if entity i is assigned to Wikipedia category j
     • Embeddings (dense continuous vectors)
       • Word embeddings
         • Word2vec (300 dimensions, trained on Google News)
       • Graph embeddings
         • RDF2vec (200 dimensions, trained on DBpedia)
     (a small sketch of the sparse representations follows below)
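A small sketch of how the sparse bag-of-entities and bag-of-categories vectors could be built; the toy knowledge-base data (entity vocabulary, links, category assignments) is invented for illustration. The dense word2vec/RDF2vec representations would instead be simple lookups into pre-trained embedding matrices.

```python
import numpy as np

# Toy knowledge-base data (assumed for illustration only).
ENTITIES = ["Ferrari", "Scuderia_Ferrari", "Italy", "Mercedes"]
KB_LINKS = {("Scuderia_Ferrari", "Ferrari"), ("Scuderia_Ferrari", "Italy")}
CATEGORY_VOCAB = ["Formula_One_constructors", "European_countries"]
CATEGORIES = {"Scuderia_Ferrari": {"Formula_One_constructors"},
              "Italy": {"European_countries"}}

def bag_of_entities(entity):
    """Sparse vector: element j is 1 if there is a KB link between entity and ENTITIES[j]."""
    return np.array([1.0 if (entity, e) in KB_LINKS or (e, entity) in KB_LINKS else 0.0
                     for e in ENTITIES])

def bag_of_categories(entity):
    """Sparse vector: element j is 1 if the entity is assigned to Wikipedia category j."""
    cats = CATEGORIES.get(entity, set())
    return np.array([1.0 if c in cats else 0.0 for c in CATEGORY_VOCAB])

print(bag_of_entities("Scuderia_Ferrari"))    # [1. 0. 1. 0.]
print(bag_of_categories("Scuderia_Ferrari"))  # [1. 0.]
```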
  17. SEMANTIC MATCHING 3. SIMILARITY MEASURES

     [Diagram: semantic matching between the query vectors ~q_1 … ~q_n and the table vectors ~t_1 … ~t_m]
  18. SEMANTIC MATCHING EARLY FUSION MATCHING STRATEGY

     • Early: take the centroid of the query semantic vectors and the centroid of the table semantic vectors, then compute the cosine similarity of the two centroids
  19. SEMANTIC MATCHING LATE FUSION MATCHING STRATEGY

     • Late: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max)
     (a sketch of both fusion strategies follows below)
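The two fusion strategies are straightforward to express in a few lines. The sketch below is mine and assumes the query and table terms have already been mapped to numpy vectors using one of the representations from slide 16.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def early_fusion(query_vecs, table_vecs):
    """Early: cosine similarity between the centroids of the two sets of semantic vectors."""
    return cosine(np.mean(query_vecs, axis=0), np.mean(table_vecs, axis=0))

def late_fusion(query_vecs, table_vecs, aggregator=np.mean):
    """Late: aggregate (sum, avg, or max) all pairwise query-table similarities."""
    sims = [cosine(q, t) for q in query_vecs for t in table_vecs]
    return float(aggregator(sims))

q_vecs = [np.array([1.0, 0.0]), np.array([0.7, 0.3])]
t_vecs = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]
print(early_fusion(q_vecs, t_vecs))
print(late_fusion(q_vecs, t_vecs, np.max))
```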
  20. EXPERIMENTAL SETUP

     • Table corpus
       • WikiTables corpus [1]: 1.6M tables extracted from Wikipedia
     • Knowledge base
       • DBpedia (2015-10): 4.6M entities with an English abstract
     • Queries, sampled from two sources [2,3]

       QS-1              QS-2
       video games       asian countries currency
       us cities         laptops cpu
       kings of africa   food calories
       economy gdp       guitars manufacturer

     • Rank-based evaluation
       • NDCG@5, 10, 15, 20 (a minimal NDCG sketch follows below)

     [1] Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
     [2] Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
     [3] Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)
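For reference, a minimal sketch of the NDCG@k metric used for the rank-based evaluation, using one common DCG formulation (gain / log2(rank+1)) and the graded 0/1/2 relevance labels described on the next slide.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(ranked_gains, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0

# e.g., relevance labels of the tables ranked for one query (0/1/2 scale):
print(round(ndcg([2, 0, 1, 2, 0], k=5), 4))
```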
  21. RELEVANCE ASSESSMENTS

     • Collected via crowdsourcing
     • Pooling to depth 20, 3120 query-table pairs in total
     • Assessors are presented with the following scenario
       • "Imagine that your task is to create a new table on the query topic"
     • A table is …
       • Non-relevant (0): if it is unclear what it is about or it is about a different topic
       • Relevant (1): if some cells or values could be used from it
       • Highly relevant (2): if large blocks or several values could be used from it
  22. RESEARCH QUESTIONS

     • RQ1: Can semantic matching improve retrieval performance?
     • RQ2: Which of the semantic representations is the most effective?
     • RQ3: Which of the similarity measures performs best?
  23. RESULTS: RQ1

     • Can semantic matching improve retrieval performance?
       • Yes. STR achieves substantial and significant improvements over LTR.

       Method                           NDCG@10   NDCG@20
       Single-field document ranking    0.4344    0.5254
       Multi-field document ranking     0.4860    0.5473
       WebTable [1]                     0.2992    0.3726
       WikiTable [2]                    0.4766    0.5206
       LTR baseline                     0.5456    0.6031
       STR (LTR + semantic matching)    0.6293    0.6825

     [1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     [2] Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.
  24. RESULTS: RQ3

     • Which of the similarity measures performs best?
       • Late-sum and Late-avg (but it also depends on the representation)
  25. SUMMARY

     • Introduce and address the problem of ad hoc table retrieval
     • Perform semantic matching between queries and tables
     • Evaluate the methods using a purpose-built test collection based on Wikipedia tables
  26. READING

     • Deng et al. Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval, SIGIR 2019
       • https://arxiv.org/pdf/1906.00041.pdf
     • Trabelsi et al. Improved Table Retrieval Using Multiple Context Embeddings for Attributes, IEEE Big Data 2019
       • http://www.cse.lehigh.edu/~brian/pubs/2019/BigData/Improved_Table_Retrieval.pdf
     • Bagheri et al. A Latent Model for Ad Hoc Table Retrieval, ECIR 2020
       • https://link.springer.com/chapter/10.1007%2F978-3-030-45442-5_11
     • Chen et al. Table Search Using a Deep Contextualized Language Model, SIGIR 2020
       • https://arxiv.org/pdf/2005.09207.pdf
     • Shraga et al. Web Table Retrieval Using Multimodal Deep Learning, SIGIR 2020
       • https://dl.acm.org/doi/abs/10.1145/3397271.3401120