Slide 1

Slide 1 text

AD HOC TABLE RETRIEVAL USING SEMANTIC SIMILARITY
October 2020
Shuo Zhang, Research Scientist, AI Group @Bloomberg @imsure318
© 2020 Bloomberg Finance L.P. All rights reserved.

Slide 2

Slide 2 text

TABLES ARE EVERYWHERE

Slide 3

Slide 3 text

STATISTICS ON TABLES
• Web Tables: The WebTables system extracted 14.1 billion HTML tables and found 154M of them to be high-quality [1]
• Web Tables: Lehmberg et al. (2016) extracted 233M content tables from Common Crawl 2015 [2]
• Wikipedia Tables: The current snapshot of Wikipedia contains more than 3.23M tables from 520k articles
• Spreadsheets: The number of worldwide spreadsheet users is estimated to exceed 400M, and about 50 to 80% of businesses use spreadsheets
• …
[1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
[2] Lehmberg et al. A Large Public Corpus of Web Tables Containing Time and Context Metadata. WWW Companion (2016)

Slide 4

Slide 4 text

TYPES OF TABLES
• Relational tables
• Entity tables
• Other

Slide 5

Slide 5 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

Formula 1 constructors' statistics 2016

| Constructor | Engine   | Country | Base    |
|-------------|----------|---------|---------|
| Ferrari     | Ferrari  | Italy   | Italy   |
| Force India | Mercedes | India   | UK      |
| Haas        | Ferrari  | US      | US & UK |
| Manor       | Mercedes | UK      | UK      |
| …           | …        | …       | …       |

Slide 6

Slide 6 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

Table caption: "Formula 1 constructors' statistics 2016"

| Constructor | Engine   | Country | Base    |
|-------------|----------|---------|---------|
| Ferrari     | Ferrari  | Italy   | Italy   |
| Force India | Mercedes | India   | UK      |
| Haas        | Ferrari  | US      | US & UK |
| Manor       | Mercedes | UK      | UK      |
| …           | …        | …       | …       |

Slide 7

Slide 7 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

Formula 1 constructors' statistics 2016

| Constructor | Engine   | Country | Base    |
|-------------|----------|---------|---------|
| Ferrari     | Ferrari  | Italy   | Italy   |
| Force India | Mercedes | India   | UK      |
| Haas        | Ferrari  | US      | US & UK |
| Manor       | Mercedes | UK      | UK      |
| …           | …        | …       | …       |

Core column (subject column): Constructor
We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.

Slide 8

Slide 8 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

Formula 1 constructors' statistics 2016

| Constructor | Engine   | Country | Base    |
|-------------|----------|---------|---------|
| Ferrari     | Ferrari  | Italy   | Italy   |
| Force India | Mercedes | India   | UK      |
| Haas        | Ferrari  | US      | US & UK |
| Manor       | Mercedes | UK      | UK      |
| …           | …        | …       | …       |

Heading column labels (table schema): Constructor, Engine, Country, Base

Slide 9

Slide 9 text

AD HOC TABLE SEARCH Return a table in response to a keyword query

Slide 10

Slide 10 text

TASK
• Ad hoc table retrieval:
  • Given a keyword query as input, return a ranked list of tables from a table corpus

Example: searching for "Singapore" returns, among others:

1. Singapore - Wikipedia, Economy Statistics (Recent Years)
   https://en.wikipedia.org/wiki/Singapore

   | Year | GDP Nominal (Billion) | GDP Nominal Per Capita | GDP Real (Billion) | GNI Nominal (Billion) | GNI Nominal Per Capita |
   | 2011 | S$346.353 | S$66,816 | S$342.371 | S$338.452 | S$65,292 |
   | 2012 | S$362.332 | S$68,205 | S$354.061 | S$351.765 | S$66,216 |
   | 2013 | S$378.200 | S$70,047 | S$324.592 | S$366.618 | S$67,902 |
   Show more (5 rows total)

2. Singapore - Wikipedia, Language used most frequently at home
   https://en.wikipedia.org/wiki/Singapore

   | Language | Color in Figure | Percent |
   | English  | Blue            | 36.9%   |
   | Mandarin | Yellow          | 34.9%   |
   | Malay    | Red             | 10.7%   |
   Show more (6 rows total)

Slide 11

Slide 11 text

APPROACHES
• Unsupervised methods
  • Build a document-based representation for each table, then employ conventional document retrieval methods
• Supervised methods
  • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")

Slide 12

Slide 12 text

UNSUPERVISED METHODS
• Single-field document representation
  • All table content, no structure
• Multi-field document representation
  • Separate document fields for the embedding document's title, section title, table caption, table body, and table headings
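The two representations above can be sketched as follows. This is not the authors' code; the dict keys ("page_title", "caption", etc.) are hypothetical field names chosen to match the table elements listed in the slide.

```python
def single_field_doc(table):
    """Single-field: all table content concatenated; structure discarded."""
    parts = [table["page_title"], table["section_title"], table["caption"]]
    parts += table["headings"]
    parts += [cell for row in table["body"] for cell in row]
    return " ".join(parts)

def multi_field_doc(table):
    """Multi-field: separate fields, so a fielded retrieval model
    (e.g. BM25F) can weight title, caption, headings, and body differently."""
    return {
        "title": table["page_title"] + " " + table["section_title"],
        "caption": table["caption"],
        "headings": " ".join(table["headings"]),
        "body": " ".join(c for row in table["body"] for c in row),
    }

# Toy table following the Formula 1 example from the earlier slides.
table = {
    "page_title": "2016 Formula One season",
    "section_title": "Constructors",
    "caption": "Formula 1 constructors' statistics 2016",
    "headings": ["Constructor", "Engine", "Country", "Base"],
    "body": [["Ferrari", "Ferrari", "Italy", "Italy"],
             ["Force India", "Mercedes", "India", "UK"]],
}
print(single_field_doc(table))
print(multi_field_doc(table)["headings"])
```

Either output can then be indexed and scored with a standard document retrieval model.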

Slide 13

Slide 13 text

SUPERVISED METHODS
• Three groups of features
  • Query features
    • #query terms, query IDF scores
  • Table features
    • Table properties: #rows, #cols, #empty cells, etc.
    • Embedding document: link structure, number of tables, etc.
  • Query-table features
    • Query terms found in different table elements, LM score, etc.
    • Our novel semantic matching features
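A minimal sketch of assembling such a feature vector for one query-table pair. The three groups follow the slide; the concrete features here are an illustrative subset, not the paper's full feature set.

```python
def extract_features(query, table, idf):
    """Build one query-table feature vector (illustrative subset only)."""
    q_terms = query.lower().split()
    cells = [c for row in table["body"] for c in row]
    caption = table["caption"].lower()
    body_text = " ".join(cells).lower()
    return [
        # --- query features ---
        len(q_terms),                                   # #query terms
        sum(idf.get(t, 0.0) for t in q_terms),          # summed IDF
        # --- table features ---
        len(table["body"]),                             # #rows
        len(table["headings"]),                         # #cols
        sum(1 for c in cells if not c.strip()),         # #empty cells
        # --- query-table features ---
        sum(1 for t in q_terms if t in caption),        # term hits in caption
        sum(1 for t in q_terms if t in body_text),      # term hits in body
    ]

idf = {"formula": 3.2, "constructors": 5.1}  # toy IDF values
table = {
    "caption": "Formula 1 constructors' statistics 2016",
    "headings": ["Constructor", "Engine", "Country", "Base"],
    "body": [["Ferrari", "Ferrari", "Italy", "Italy"],
             ["Manor", "Mercedes", "UK", "UK"]],
}
print(extract_features("formula 1 constructors", table, idf))
```

Vectors like this, one per query-table pair, are what a learning-to-rank model is trained on.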

Slide 14

Slide 14 text

FEATURES

Slide 15

Slide 15 text

Can we go beyond lexical matching and improve keyword table search performance by incorporating semantic matching?

Slide 16

Slide 16 text

SEMANTIC MATCHING
• Main objective: go beyond term-based matching
• Three components:
  1. Content extraction
  2. Semantic representations
  3. Similarity measures

Slide 17

Slide 17 text

SEMANTIC MATCHING 1. CONTENT EXTRACTION
• The "raw" content of a query/table is represented as a set of terms, which can be words or entities
• Query terms: q1 … qn; table terms: t1 … tm

Slide 18

Slide 18 text

SEMANTIC MATCHING 1. CONTENT EXTRACTION
• The "raw" content of a query/table is represented as a set of terms (q1 … qn for the query, t1 … tm for the table), which can be words or entities
• Entity-based options:
  • Top-k ranked entities from a knowledge base
  • Entities in the core table column
  • Top-k ranked entities using the embedding document/section title as a query

Slide 19

Slide 19 text

SEMANTIC MATCHING 2. SEMANTIC REPRESENTATIONS
• Each of the raw terms is mapped to a semantic vector representation: q1 … qn → q̃1 … q̃n, and t1 … tm → t̃1 … t̃m

Slide 20

Slide 20 text

SEMANTIC REPRESENTATIONS
• Bag-of-concepts (sparse discrete vectors)
  • Bag-of-entities
    • Each vector element t̃i[j] corresponds to an entity
    • t̃i[j] is 1 if there exists a link between entities i and j in the KB
  • Bag-of-categories
    • Each vector element t̃i[j] corresponds to a Wikipedia category
    • t̃i[j] is 1 if entity i is assigned to Wikipedia category j
• Embeddings (dense continuous vectors)
  • Word embeddings
    • Word2Vec (300 dimensions, trained on Google News)
  • Graph embeddings
    • RDF2vec (200 dimensions, trained on DBpedia)
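A toy sketch of the bag-of-entities representation. The entity vocabulary and the KB link set below are made up for illustration; in the paper the links come from DBpedia.

```python
# Hypothetical entity vocabulary: one vector dimension per entity.
vocabulary = ["Ferrari", "Italy", "Mercedes-Benz", "Formula_One"]

# Hypothetical KB link set (pairs of linked entities).
kb_links = {("Ferrari", "Italy"), ("Ferrari", "Formula_One"),
            ("Mercedes-Benz", "Formula_One")}

def bag_of_entities(entity):
    """Sparse 0/1 vector: element j is 1 if the KB links `entity` and
    vocabulary[j] (in either direction)."""
    return [1 if (entity, e) in kb_links or (e, entity) in kb_links else 0
            for e in vocabulary]

print(bag_of_entities("Ferrari"))  # [0, 1, 0, 1]
```

Bag-of-categories works the same way, except element j is 1 when the entity is assigned to Wikipedia category j.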

Slide 21

Slide 21 text

SEMANTIC MATCHING 3. SIMILARITY MEASURES
• Semantic matching compares the query's semantic vectors q̃1 … q̃n against the table's semantic vectors t̃1 … t̃m

Slide 22

Slide 22 text

SEMANTIC MATCHING: EARLY FUSION MATCHING STRATEGY
• Early: take the centroid of the query vectors q̃1 … q̃n and the centroid of the table vectors t̃1 … t̃m, then compute the cosine similarity of the two centroids

Slide 23

Slide 23 text

SEMANTIC MATCHING: LATE FUSION MATCHING STRATEGY
• Late: compute all pairwise similarities between the query vectors q̃1 … q̃n and the table vectors t̃1 … t̃m, then aggregate those pairwise similarity scores (sum, avg, or max)
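The early and late fusion strategies can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the toy 2-d vectors stand in for real word, entity, or graph embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def early_fusion(q_vecs, t_vecs):
    """Early: cosine similarity of the query and table centroids."""
    return cosine(centroid(q_vecs), centroid(t_vecs))

def late_fusion(q_vecs, t_vecs, aggr):
    """Late: aggregate (sum, avg, or max) all pairwise cosine similarities."""
    sims = [cosine(q, t) for q in q_vecs for t in t_vecs]
    return aggr(sims)

avg = lambda xs: sum(xs) / len(xs)

q_vecs = [[1.0, 0.0], [0.0, 1.0]]   # toy semantic vectors for query terms
t_vecs = [[1.0, 0.0], [1.0, 1.0]]   # toy semantic vectors for table terms

print(early_fusion(q_vecs, t_vecs))
print(late_fusion(q_vecs, t_vecs, max))   # 1.0 (q1 and t1 are identical)
```

Each (representation, strategy) combination yields one semantic matching feature for the learning-to-rank model.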

Slide 24

Slide 24 text

EXPERIMENTAL EVALUATION

Slide 25

Slide 25 text

EXPERIMENTAL SETUP
• Table corpus
  • WikiTables corpus [1]: 1.6M tables extracted from Wikipedia
• Knowledge base
  • DBpedia (2015-10): 4.6M entities with an English abstract
• Queries: sampled from two sources [2,3]

  | QS-1           | QS-2                      |
  | video games    | asian countries currency  |
  | us cities      | laptops cpu               |
  | kings of africa| food calories             |
  | economy gdp    | guitars manufacturer      |

• Rank-based evaluation
  • NDCG@5, 10, 15, 20
[1] Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
[2] Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
[3] Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)
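For reference, NDCG@k can be computed as below. This sketch uses the linear-gain form of DCG; some evaluation setups use the exponential form (2**rel - 1) instead.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """NDCG@k for a ranked list; `gains` are the relevance labels (0/1/2)
    of the returned tables, in ranked order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of one ranked list, top result first.
ranking = [2, 0, 1, 0, 2]
print(round(ndcg_at_k(ranking, 5), 4))
```

A perfectly ordered list (all 2s before 1s before 0s) scores exactly 1.0.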

Slide 26

Slide 26 text

RELEVANCE ASSESSMENTS
• Collected via crowdsourcing
• Pooling to depth 20; 3,120 query-table pairs in total
• Assessors are presented with the following scenario:
  • "Imagine that your task is to create a new table on the query topic"
• A table is …
  • Non-relevant (0): if it is unclear what it is about, or it is about a different topic
  • Relevant (1): if some cells or values could be used from it
  • Highly relevant (2): if large blocks or several values could be used from it

Slide 27

Slide 27 text

RESEARCH QUESTIONS
• RQ1: Can semantic matching improve retrieval performance?
• RQ2: Which of the semantic representations is the most effective?
• RQ3: Which of the similarity measures performs best?

Slide 28

Slide 28 text

RESULTS: RQ1

| Method                        | NDCG@10 | NDCG@20 |
|-------------------------------|---------|---------|
| Single-field document ranking | 0.4344  | 0.5254  |
| Multi-field document ranking  | 0.4860  | 0.5473  |
| WebTable [1]                  | 0.2992  | 0.3726  |
| WikiTable [2]                 | 0.4766  | 0.5206  |
| LTR baseline                  | 0.5456  | 0.6031  |
| STR (LTR + semantic matching) | 0.6293  | 0.6825  |

• Can semantic matching improve retrieval performance?
• Yes. STR achieves substantial and significant improvements over LTR.

[1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
[2] Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.

Slide 29

Slide 29 text

RESULTS: RQ2
• Which of the semantic representations is the most effective?
• Bag-of-entities.

Slide 30

Slide 30 text

RESULTS: RQ3
• Which of the similarity measures performs best?
• Late-sum and Late-avg (but it also depends on the representation)

Slide 31

Slide 31 text

FEATURE ANALYSIS

Slide 32

Slide 32 text

SUMMARY
• Introduce and address the problem of ad hoc table retrieval
• Perform semantic matching between queries and tables
• Evaluate the methods using a purpose-built test collection based on Wikipedia tables

Slide 33

Slide 33 text

READING
• Deng et al. Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval, SIGIR 2019
  • https://arxiv.org/pdf/1906.00041.pdf
• Trabelsi et al. Improved Table Retrieval Using Multiple Context Embeddings for Attributes, IEEE Big Data 2019
  • http://www.cse.lehigh.edu/~brian/pubs/2019/BigData/Improved_Table_Retrieval.pdf
• Bagheri et al. A Latent Model for Ad Hoc Table Retrieval, ECIR 2020
  • https://link.springer.com/chapter/10.1007%2F978-3-030-45442-5_11
• Chen et al. Table Search Using a Deep Contextualized Language Model, SIGIR 2020
  • https://arxiv.org/pdf/2005.09207.pdf
• Shraga et al. Web Table Retrieval Using Multimodal Deep Learning, SIGIR 2020
  • https://dl.acm.org/doi/abs/10.1145/3397271.3401120