
Information Retrieval and Text Mining - Table Search, Generation and Completion

University of Stavanger, DAT640, 2019 fall

Krisztian Balog

October 22, 2019

Transcript

  1. STATISTICS ON TABLES

     • Web Tables: the WebTables system extracted 14.1 billion HTML tables and found 154M to be high-quality [1]
     • Web Tables: Lehmberg et al. (2016) extracted 233M content tables from the 2015 Common Crawl [2]
     • Wikipedia Tables: the current snapshot of Wikipedia contains more than 3.23M tables from 520k articles
     • Spreadsheets: the number of worldwide spreadsheet users is estimated to exceed 400M, and about 50 to 80% of businesses use spreadsheets
     • …
     [1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     [2] Lehmberg et al. A Large Public Corpus of Web Tables Containing Time and Context Metadata. WWW Companion (2016)
  2. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     Formula 1 constructors' statistics 2016
     | Constructor | Engine   | Country | Base    |
     | Ferrari     | Ferrari  | Italy   | Italy   |
     | Force India | Mercedes | India   | UK      |
     | Haas        | Ferrari  | US      | US & UK |
     | Manor       | Mercedes | UK      | UK      |
     | …           | …        | …       | …       |
  3. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     • Table caption: "Formula 1 constructors' statistics 2016"
  4. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     • Core column (subject column): the Constructor column. We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.
  5. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     • Heading column labels (table schema): Constructor, Engine, Country, Base
  6. #3 TABLE COMPLETION

     • Generating a ranked list of suggestions for the next row, column, and cell
     • Example: the Oscar "Actor in a Leading Role" table from oscar.go.com/winners (columns Year, Actor, Film; rows for 2013-2016)
       A. Add entity (next row): 1. 2017  2. 2018
       B. Add column: 1. Role(s)  2. Director(s)
  7. EXPERIMENTAL SETTING

     • Data sources
       • Table corpus: 1.6M tables extracted from Wikipedia
       • Knowledge base: DBpedia 2015-10 (4.6M entities)
     • Evaluation measures
       • Standard IR measures (MAP, MRR, NDCG)
  8. TABLE SEARCH

     • Two retrieval tasks, with tables as results
       • Ad hoc table retrieval
       • Query-by-table
  9. TASK

     • Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus
     • Example: the query "Singapore" returns tables such as "Singapore - Wikipedia, Economy Statistics (Recent Years)" (GDP and GNI figures per year) and "Singapore - Wikipedia, Language used most frequently at home" (language, color in figure, percent), both from https://en.wikipedia.org/wiki/Singapore
  10. APPROACHES • Unsupervised methods • Build a document-based representation for

    each table, then employ conventional document retrieval methods • Supervised methods • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")
  11. UNSUPERVISED METHODS

     • Single-field document representation: all table content, no structure
     • Multi-field document representation: separate document fields for the embedding document's title, section title, table caption, table body, and table headings (see the sketch below)
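A minimal sketch of the two document representations, assuming tables are available as simple Python dicts; the field names follow the slide, but the exact indexing scheme used in the underlying work may differ.

```python
# Hypothetical helper: flatten a parsed table into the two document
# representations described above, so that a standard document retrieval
# model (BM25, language models) can index them.

def table_to_single_field_doc(table):
    """Single-field: all table content, no structure."""
    parts = [table["page_title"], table["section_title"], table["caption"]]
    parts += table["headings"]
    parts += [cell for row in table["rows"] for cell in row]
    return " ".join(p for p in parts if p)

def table_to_multi_field_doc(table):
    """Multi-field: separate fields, so per-field weights can be tuned."""
    return {
        "page_title": table["page_title"],
        "section_title": table["section_title"],
        "caption": table["caption"],
        "headings": " ".join(table["headings"]),
        "body": " ".join(cell for row in table["rows"] for cell in row),
    }

example = {
    "page_title": "Formula One",
    "section_title": "Constructors",
    "caption": "Formula 1 constructors' statistics 2016",
    "headings": ["Constructor", "Engine", "Country", "Base"],
    "rows": [["Ferrari", "Ferrari", "Italy", "Italy"],
             ["Force India", "Mercedes", "India", "UK"]],
}
print(table_to_multi_field_doc(example)["caption"])
```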
  12. SUPERVISED METHODS

     • Three groups of features (a feature-extraction sketch follows below)
       • Query features: #query terms, query IDF scores
       • Table features
         • Table properties: #rows, #cols, #empty cells, etc.
         • Embedding document: link structure, number of tables, etc.
       • Query-table features: query terms found in different table elements, LM score, etc.
     • Our novel semantic matching features
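A hedged sketch of the learning-to-rank setup: each query-table pair is turned into a feature vector and a pointwise regressor is trained on graded relevance labels. The placeholder feature extractor and the choice of scikit-learn's RandomForestRegressor are illustrative; the actual feature set and learner in the underlying work may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def extract_features(query, table):
    # Placeholder features: #query terms, #rows, #headings, and term overlap.
    # The real feature set also includes IDF scores, #empty cells, page link
    # structure, language-model scores, and the semantic matching features.
    q_terms = query.lower().split()
    text = " ".join(table["headings"] + [c for r in table["rows"] for c in r]).lower()
    overlap = sum(t in text for t in q_terms)
    return [len(q_terms), len(table["rows"]), len(table["headings"]), overlap]

# Toy training data: (query, table, graded relevance label in {0, 1, 2}).
t1 = {"headings": ["Constructor", "Engine"], "rows": [["Ferrari", "Ferrari"], ["Haas", "Ferrari"]]}
t2 = {"headings": ["Language", "Percent"], "rows": [["English", "36.9%"]]}
training_pairs = [("formula 1 constructors", t1, 2), ("formula 1 constructors", t2, 0)]

X = np.array([extract_features(q, t) for q, t, _ in training_pairs])
y = np.array([label for _, _, label in training_pairs])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# At query time: score candidate tables, rank by descending predicted relevance.
query, candidates = "formula 1 constructors", [t2, t1]
scores = model.predict(np.array([extract_features(query, t) for t in candidates]))
ranking = [t for _, t in sorted(zip(scores, candidates), key=lambda p: -p[0])]
print([c["headings"][0] for c in ranking])
```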
  13. Can we go beyond lexical matching and improve keyword table

    search performance by incorporating semantic matching?
  14. SEMANTIC MATCHING • Main objective: go beyond term-based matching •

    Three components: 1. Content extraction 2. Semantic representations 3. Similarity measures
  15. SEMANTIC MATCHING 1. CONTENT EXTRACTION

     • The "raw" content of a query/table is represented as a set of terms, which can be words or entities
     • Notation: query terms q1, …, qn; table terms t1, …, tm
  16. SEMANTIC MATCHING 1. CONTENT EXTRACTION

     • The "raw" content of a query/table is represented as a set of terms, which can be words or entities
     • Entity-based extraction:
       • Top-k ranked entities from a knowledge base
       • Entities in the core table column
       • Top-k ranked entities using the embedding document/section title as a query
  17. SEMANTIC MATCHING 2. SEMANTIC REPRESENTATIONS

     • Each of the raw terms q1, …, qn and t1, …, tm is mapped to a semantic vector representation q̃1, …, q̃n and t̃1, …, t̃m
  18. SEMANTIC REPRESENTATIONS

     • Bag-of-concepts (sparse discrete vectors); a toy sketch follows below
       • Bag-of-entities: each vector element t̃i[j] corresponds to an entity; t̃i[j] is 1 if there exists a link between entities i and j in the KB
       • Bag-of-categories: each vector element t̃i[j] corresponds to a Wikipedia category; t̃i[j] is 1 if entity i is assigned to Wikipedia category j
     • Embeddings (dense continuous vectors)
       • Word embeddings: word2vec (300 dimensions, trained on Google News)
       • Graph embeddings: RDF2vec (200 dimensions, trained on DBpedia)
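A toy sketch of the bag-of-entities and bag-of-categories representations; the mini knowledge base (entity links and category assignments) below is made up purely for illustration.

```python
KB_ENTITIES = ["Ferrari", "Scuderia_Ferrari", "Maranello", "Mercedes"]
KB_CATEGORIES = ["Formula_One_constructors", "Italian_racecar_constructors"]

# Made-up KB link and category-assignment data (for illustration only).
LINKS = {("Scuderia_Ferrari", "Ferrari"), ("Scuderia_Ferrari", "Maranello")}
CATEGORIES = {"Scuderia_Ferrari": {"Formula_One_constructors",
                                   "Italian_racecar_constructors"}}

def bag_of_entities(entity):
    # Element j is 1 iff there is a KB link between `entity` and entity j.
    return [1 if (entity, e) in LINKS or (e, entity) in LINKS else 0
            for e in KB_ENTITIES]

def bag_of_categories(entity):
    # Element j is 1 iff `entity` is assigned to Wikipedia category j.
    assigned = CATEGORIES.get(entity, set())
    return [1 if c in assigned else 0 for c in KB_CATEGORIES]

print(bag_of_entities("Scuderia_Ferrari"))    # [1, 0, 1, 0]
print(bag_of_categories("Scuderia_Ferrari"))  # [1, 1]
```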
  19. SEMANTIC MATCHING 3. SIMILARITY MEASURES

     • Semantic matching is performed between the query's semantic vectors q̃1, …, q̃n and the table's semantic vectors t̃1, …, t̃m
  20. SEMANTIC MATCHING: EARLY FUSION MATCHING STRATEGY

     • Early: take the centroid of the query's and of the table's semantic vectors and compute their cosine similarity
  21. SEMANTIC MATCHING: LATE FUSION MATCHING STRATEGY

     • Late: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max); a sketch of both strategies follows below
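A small sketch of the two matching strategies, assuming the semantic vectors (e.g., word2vec or RDF2vec embeddings) have already been obtained for the query and table terms; the random vectors stand in for real embeddings.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def early_fusion(query_vecs, table_vecs):
    # Early: centroid of each side's semantic vectors, then cosine similarity.
    return cos(np.mean(query_vecs, axis=0), np.mean(table_vecs, axis=0))

def late_fusion(query_vecs, table_vecs, aggr=np.mean):
    # Late: all pairwise query-term / table-term cosine similarities,
    # aggregated with sum, avg, or max (np.mean plays the role of "avg").
    sims = [cos(q, t) for q in query_vecs for t in table_vecs]
    return float(aggr(sims))

# Random vectors stand in for real word2vec / RDF2vec embeddings.
rng = np.random.default_rng(0)
q_vecs = rng.normal(size=(3, 300))   # semantic vectors of 3 query terms
t_vecs = rng.normal(size=(5, 300))   # semantic vectors of 5 table terms
print(early_fusion(q_vecs, t_vecs), late_fusion(q_vecs, t_vecs, np.max))
```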
  22. EXPERIMENTAL SETUP

     • Table corpus: WikiTables corpus1, 1.6M tables extracted from Wikipedia
     • Knowledge base: DBpedia (2015-10), 4.6M entities with an English abstract
     • Queries: sampled from two sources2,3
       | QS-1            | QS-2                      |
       | video games     | asian countries currency  |
       | us cities       | laptops cpu               |
       | kings of africa | food calories             |
       | economy gdp     | guitars manufacturer      |
     • Rank-based evaluation: NDCG@5, 10, 15, 20
     1 Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
     2 Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
     3 Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)
  23. RELEVANCE ASSESSMENTS

     • Collected via crowdsourcing
     • Pooling to depth 20; 3120 query-table pairs in total
     • Assessors are presented with the following scenario: "Imagine that your task is to create a new table on the query topic"
     • A table is …
       • Non-relevant (0): if it is unclear what it is about, or it is about a different topic
       • Relevant (1): if some cells or values could be used from it
       • Highly relevant (2): if large blocks or several values could be used from it
  24. RESEARCH QUESTIONS • RQ1: Can semantic matching improve retrieval performance?

    • RQ2: Which of the semantic representations is the most effective? • RQ3: Which of the similarity measures performs best?
  25. RESULTS: RQ1

     | Method                         | NDCG@10 | NDCG@20 |
     | Single-field document ranking  | 0.4344  | 0.5254  |
     | Multi-field document ranking   | 0.4860  | 0.5473  |
     | WebTable1                      | 0.2992  | 0.3726  |
     | WikiTable2                     | 0.4766  | 0.5206  |
     | LTR baseline                   | 0.5456  | 0.6031  |
     | STR (LTR + semantic matching)  | 0.6293  | 0.6825  |
     1 Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     2 Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.
     • Can semantic matching improve retrieval performance?
       • Yes. STR achieves substantial and significant improvements over LTR.
  26. RESULTS: RQ3 • Which of the similarity measures performs best?

    • Late-sum and Late-avg (but it also depends on the representation)
  27. TASK

     • Query-by-table: given an input table, return a ranked list of relevant tables
     • Boils down to table matching: computing the similarity between a pair of tables
     • Example: the input table "MotoGP World Standing 2017" (Pos., Rider, Bike, Points) matches tables such as "Grand Prix motorcycle racing World champions" and "MotoGP 2016 Championship Final Standing"
  28. PREVIOUS APPROACHES

     • Extracting a keyword query (from various table elements) and scoring tables against that query
     • Splitting tables into various elements and performing element-wise matching
       • Ad hoc similarity measures, tailor-made for each table element
       • Lacking a principled way of combining element-level similarities
     • Matching elements of different types has not been explored
  29. Can we develop an effective and theoretically sound table matching

    framework for measuring and combining table element level similarity, without resorting to hand-crafted features?
  30. APPROACH

     1. Represent table elements in multiple semantic spaces
     2. Measure element-level similarity in each of the semantic spaces y:
        \phi_i(\tilde{T}, T) = \mathrm{sim}(\tilde{T}^y_{x_1}, T^y_{x_2})
        • x1 = x2: element-wise matching
        • x1 != x2: cross-element matching
     3. Combine the element-level similarities in a discriminative learning framework (see the sketch below)
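A rough sketch of how the element-level similarities could be assembled into a feature vector for the discriminative learner; ELEMENTS, SPACES, and element_similarity are illustrative placeholders, not the exact feature set of the thesis.

```python
# Illustrative only: one feature per (input-table element, candidate-table
# element, semantic space) combination. With x1 == x2 only, this yields the
# element-wise features; allowing x1 != x2 adds the cross-element features.
import numpy as np

ELEMENTS = ["topic", "entities", "headings", "data"]   # Tt, TE, Th, TD
SPACES = ["word", "graph", "entity"]                   # embedding spaces

def element_similarity(input_table, cand_table, elem_a, elem_b, space):
    # Placeholder: in the real system this would be the early- or late-fusion
    # similarity of the two elements' semantic vectors in the given space.
    return 0.0

def matching_features(input_table, cand_table, cross_element=True):
    feats = []
    for space in SPACES:
        for a in ELEMENTS:
            for b in ELEMENTS:
                if a == b or cross_element:
                    feats.append(element_similarity(input_table, cand_table, a, b, space))
    return np.array(feats)

# The feature vector is then fed to a supervised ranker (e.g., a random forest).
print(matching_features({}, {}).shape)          # 3 spaces x 4 x 4 = 48 features
print(matching_features({}, {}, False).shape)   # element-wise only: 12 features
```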
  31. TABLE ELEMENTS

     Illustrated on the "Formula 1 constructors' statistics 2016" table:
     • Table topic Tt (caption + page title)
     • Table entities TE (core column)
     • Table headings Th
     • Table data TD
  32. 1. REPRESENTING TABLE ELEMENTS IN SEMANTIC SPACES

     • Word embeddings • Graph embeddings • Entity embeddings
     • A table element Tx = [t1, ..., tN] in the term space is mapped to a matrix of semantic vectors [t^y_1, ..., t^y_N] in semantic space y
     • Term weights: words use TF-IDF weights; entities use presence/absence (0/1)
  33. 2. MEASURING ELEMENT-LEVEL SIMILARITY

     • Early fusion: take the weighted centroid of the term-level semantic vectors of each element and compute the cosine similarity of the centroid vectors:
       sim_{early}(\tilde{T}_{x_1}, T_{x_2}) = \cos(\tilde{C}^y_{x_1}, C^y_{x_2})
     • Late fusion: compute the cosine similarities between all pairs of semantic vectors, then aggregate (sum, avg, or max):
       sim_{late}(\tilde{T}_{x_1}, T_{x_2}) = \mathrm{aggr}(\{\cos(\tilde{t}, t) : \tilde{t} \in \tilde{T}^y_{x_1}, t \in T^y_{x_2}\})
  34. EXPERIMENTAL SETUP • WikiTables corpus, DBpedia as knowledge base •

    50 tables sampled • Diverse topics (sports, music, films, food, celebrities, geography, politics, etc.) • Relevance assessments • (2) highly relevant: it is about the same topic as the input table, but contains additional novel content that is not present in the input table • (1) relevant: on-topic, but it contains limited novel content • (0) non-relevant • Fleiss Kappa = 0.6703 (substantial agreement)
  35. RESULTS

     (The slide also marks, for each method, which feature groups are used: element-wise matching, cross-element matching, and table features.)
     | Method | NDCG@5 | NDCG@10 |
     | HCF-1  | 0.5382 | 0.5542  |
     | HCF-2  | 0.5895 | 0.6050  |
     | CRAB-1 | 0.5578 | 0.5672  |
     | CRAB-2 | 0.6172 | 0.6267  |
     | CRAB-3 | 0.5140 | 0.5282  |
     | CRAB-4 | 0.5804 | 0.6027  |
  36. RESULTS

     (Same results table as on slide 35.)
  37. RESULTS

     (Same results table as on slide 35.)
  38. RESEARCH QUESTIONS

     • RQ1: Which of the semantic representations (word-based, graph-based, or entity-based) is the most effective for modeling table elements?
     • RQ2: Which of the two element-level matching strategies performs better, element-wise or cross-element?
     • RQ3: How much do different table elements contribute to retrieval performance?
  39. RESULTS: RQ1

     • Which of the semantic representations is the most effective?
       • Entity-based. Also, they are complementary
  40. RESULTS: RQ2

     • Which of the two element-level matching strategies performs better, element-wise or cross-element?
       • Element-wise (cf. the results table on slide 35)
  41. RESULTS: RQ2 • There are several cases where cross-element matching

    yields higher scores than element-wise matching
  42. ANALYSIS: INPUT TABLE SIZE

     (Figure: performance under a horizontal split of the input table (x%) and a vertical split of the input table (x%).)
  43. TASK

     • On-the-fly table generation: answer a free text query with a relational table, where
       • the core column lists all relevant entities;
       • columns correspond to attributes of those entities;
       • cells contain the values of the corresponding entity attributes.
     • Example: the query "Video albums of Taylor Swift" is answered by a table with columns Title, Released date, Label, Formats (rows such as "Journey to Fearless", "Speak Now World Tour-Live", "The 1989 World Tour Live")
  44. APPROACH

     • Three components, driven by the query q: core column entity ranking (E), schema determination (S), and value lookup (V)
     • Core column entity ranking and schema determination could potentially mutually reinforce each other
  45. KNOWLEDGE BASE ENTRY

     (Figure: anatomy of a knowledge base entry: entity name, entity type, description, and property-value pairs; the parts are referred to as e_d, e_p, and e_a.)
  46. VALUE LOOKUP

     • A catalog of possible entity attribute-value pairs
     • Stored as ⟨entity, schema label, value, provenance⟩ quadruples ⟨e, s, v, p⟩, with values collected both from the knowledge base (KB) and from the table corpus (TC)
  47. VALUE LOOKUP

     • Finding a cell's value is a lookup in that catalog (a minimal sketch follows below):
       score(v, e, s, q) = \max_{\langle s', v, p \rangle \in e_V} match(s, s') \times conf(p, q)
     • match(s, s'): soft string matching, e.g., "birthday" vs. "date of birth", "country" vs. "nationality"
     • conf(p, q): matching confidence; KB takes priority over TC, and for TC it is based on the corresponding table's relevance to the query
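A minimal sketch of the lookup, assuming the catalog stores ⟨schema label, value, provenance⟩ triples per entity; match() and conf() below are crude stand-ins for the soft string matching and provenance-based confidence described on the slide.

```python
def match(s, s_prime):
    # Soft label matching (e.g., "birthday" vs. "date of birth"); here just a
    # crude token-overlap score for illustration.
    a, b = set(s.lower().split()), set(s_prime.lower().split())
    return len(a & b) / max(len(a | b), 1)

def conf(provenance, query):
    # Matching confidence: KB entries take priority over table-corpus (TC)
    # entries; for TC entries this would depend on the source table's
    # relevance to the query (ignored in this toy stand-in).
    return 1.0 if provenance.startswith("kb:") else 0.5

def score_value(value, catalog_entries, s, query):
    # score(v, e, s, q) = max over <s', v, p> entries of match(s, s') * conf(p, q)
    return max((match(s, sp) * conf(p, query)
                for sp, v, p in catalog_entries if v == value), default=0.0)

# Toy catalog entries for one entity: <schema label, value, provenance>.
entries = [("date of birth", "1981-07-29", "kb:dbpedia"),
           ("birthday", "1981-07-29", "tc:table#123")]
print(score_value("1981-07-29", entries, "birthday", "example query"))  # 0.5
```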
  48. EXPERIMENTAL SETUP • Table corpus • WikiTables corpus: 1.6M tables

    extracted from Wikipedia • Knowledge base • DBpedia (2015-10): 4.6M entities with an English abstract • Two query sets • Rank-based metrics • NDCG for core column entity ranking and schema determination • MAP/MRR for value lookup
  49. QUERY SET 1 (QS-1)

     • List queries from the DBpedia-Entity v2 collection1 (119)
       • "all cars that are produced in Germany"
       • "permanent members of the UN Security Council"
       • "Airlines that currently use Boeing 747 planes"
     • Core column entity ranking: highly relevant entities from the collection
     • Schema determination: crowdsourcing, 3-point relevance scale, 7k query-label pairs
     • Value lookup: crowdsourcing, a sample of 25 queries, 14k cell values
     1 Hasibi et al. DBpedia-Entity v2: A Test Collection for Entity Search. In: SIGIR '17.
  50. QUERY SET 2 (QS-2)

     • Entity-relationship queries from the RELink Query Collection1 (600)
     • Queries are answered by entity tuples (pairs or triplets); that is, each query is answered by a table with 2 or 3 columns (including the core entity column)
     • Queries and relevance judgments are obtained automatically from Wikipedia lists that contain relational tables
     • Human annotators were asked to formulate the corresponding information need as a natural language query
       • "Find peaks above 6000m in the mountains of Peru"
       • "Which countries and cities have accredited Armenian ambassadors?"
       • "Which anti-aircraft guns were used in ships during war periods and what country produced them?"
     1 Saleiro et al. RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval. In: SIGIR '17.
  51. CORE COLUMN ENTITY RANKING (QUERY-BASED)

     |                | QS-1 NDCG@5 | QS-1 NDCG@10 | QS-2 NDCG@5 | QS-2 NDCG@10 |
     | LM             | 0.2419      | 0.2591       | 0.0708      | 0.0823       |
     | DRRM_TKS (e_d) | 0.2015      | 0.2028       | 0.0501      | 0.0540       |
     | DRRM_TKS (e_p) | 0.1780      | 0.1808       | 0.1089      | 0.1083       |
     | Combined       | 0.2821      | 0.2834       | 0.0852      | 0.0920       |
  52. CORE COLUMN ENTITY RANKING (SCHEMA-ASSISTED) QS-1 QS-2 • R #0:

    without schema information (query only) • R #1-#3: with automatic schema determination (top 10) • Oracle: with ground truth schema
  53. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion
  54. TASK

     • Row population: the task of generating a ranked list of entities to be added to the core column of a given seed table
     • Example: for the "Formula 1 constructors' statistics 2016" seed table, the "Add entity" suggestions are 1. McLaren  2. Mercedes  3. Red Bull
  55. APPROACH

     1. Candidate selection
       • From the knowledge base: entities that are of the same type(s) or belong to the same categories as the seed entities
       • From the table corpus: entities from related tables (tables that contain any seed entities, or have a similar table caption or headings)
     2. Entity ranking: a ranked list of suggestions (top-K entities)
  56. APPROACH

     2. Entity ranking: based on the similarity between the candidate entity and various table elements
        P(e|E, L, c) \propto P(e|E) P(L|e) P(c|e)
        where e is the candidate entity, P(e|E) captures entity similarity, P(L|e) column label similarity, and P(c|e) caption similarity
  57. APPROACH

     2. Entity ranking components (a scoring sketch follows below):
        • Entity similarity: P(e|E) = \lambda_E P_{KB}(e|E) + (1 - \lambda_E) P_{TC}(e|E)
        • Column labels likelihood: P(L|e) = \sum_{l \in L} \Big( \lambda_L \prod_{t \in l} P_{LM}(t|\theta_e) + \frac{1 - \lambda_L}{|L|} P_{EM}(l|e) \Big)
        • Caption likelihood: P(c|e) = \prod_{t \in c} \big( \lambda_c P_{KB}(t|\theta_e) + (1 - \lambda_c) P_{TC}(t|e) \big)
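A rough sketch of how the three components could be combined into a single ranking score; the component estimators are placeholder lookups (the real model interpolates knowledge-base and table-corpus estimates as in the formulas above).

```python
# Illustrative scoring of a candidate entity e for row population:
# P(e|E, L, c) ∝ P(e|E) P(L|e) P(c|e), computed in log space to avoid underflow.
import math

def p_entity(candidate, related_score):
    # Placeholder for P(e|E): e.g., interpolated KB/table-corpus relatedness.
    return related_score.get(candidate, 0.01)

def p_labels(candidate, labels, label_model):
    # Placeholder for P(L|e): smoothed likelihood of the seed column labels.
    return math.prod(label_model.get((candidate, l), 0.01) for l in labels)

def p_caption(candidate, caption, caption_model):
    # Placeholder for P(c|e): smoothed likelihood of the caption terms.
    return math.prod(caption_model.get((candidate, t), 0.01) for t in caption.split())

def row_population_score(candidate, labels, caption, related_score, label_model, caption_model):
    return (math.log(p_entity(candidate, related_score))
            + math.log(p_labels(candidate, labels, label_model))
            + math.log(p_caption(candidate, caption, caption_model)))

# Toy data: "McLaren" shares more evidence with the seed table than "Oslo".
related = {"McLaren": 0.4, "Oslo": 0.01}
labels_m = {("McLaren", "Engine"): 0.3, ("McLaren", "Base"): 0.2}
caption_m = {("McLaren", "constructors"): 0.5}
for e in ["McLaren", "Oslo"]:
    print(e, row_population_score(e, ["Engine", "Base"], "formula 1 constructors",
                                  related, labels_m, caption_m))
```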
  58. EXPERIMENTAL DESIGN (ROW POPULATION)

     • Idea: take existing tables and simulate the user at an intermediate step during table completion (see the sketch below)
     • Select a set of (1000) tables randomly that contain at least 6 rows and at least 4 columns
     • For any intermediate step (i rows completed): the first i (1 <= i <= 5) rows are taken as the seed table, and the entities in the remaining rows are the ground truth
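A short sketch of how evaluation instances might be generated from an existing table; the dict layout is assumed, with the core-column entity as the first cell of each row.

```python
def make_row_population_instances(table, max_seed_rows=5):
    """Yield (seed_rows, ground_truth_entities) pairs for i = 1..max_seed_rows."""
    rows = table["rows"]
    for i in range(1, max_seed_rows + 1):
        seed = rows[:i]
        ground_truth = [row[0] for row in rows[i:]]  # core-column entity per row
        yield seed, ground_truth

table = {"rows": [["Ferrari", "Italy"], ["Force India", "India"], ["Haas", "US"],
                  ["Manor", "UK"], ["McLaren", "UK"], ["Mercedes", "Germany"]]}
for seed, truth in make_row_population_instances(table, max_seed_rows=2):
    print(len(seed), "seed rows ->", truth)
```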
  59. TAKE-AWAY POINTS • Both tables and KBs are useful for

    row population • Candidate selection • Category > Type • Entity > Caption > Headings • All complement each other • Entity ranking • Entity > Headings > Caption • All complement each other • Highly relevant to candidate selection
  60. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion: 1. Row population  2. Column population  3. Value finding
  61. TASK

     • Column population: the task of generating a ranked list of column labels to be added to the column headings of a given seed table
     • Example: for the "Formula 1 constructors' statistics 2016" seed table, the "Add column" suggestions are 1. Seasons  2. Races Entered
  62. APPROACH

     1. Candidate selection
       • From the table corpus: column labels from related tables (tables that contain any seed entities, or have a similar table caption or headings)
     2. Label ranking: a ranked list of suggestions (top-K labels)
  63. APPROACH

     2. Label ranking: based on the similarity between the candidate column labels and the table elements (a scoring sketch follows below)
        P(l|E, c, L) = \sum_T P(l|T) P(T|E, c, L)
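A rough sketch of the label-ranking formula; both component estimates are simplified placeholders for the estimators used in the thesis.

```python
# Illustrative scoring of a candidate column label l:
# P(l|E, c, L) = sum over related tables T of P(l|T) * P(T|E, c, L).

def p_label_given_table(label, table):
    # Placeholder for P(l|T): uniform over the table's headings if present.
    return 1.0 / len(table["headings"]) if label in table["headings"] else 0.0

def p_table_relevance(table, seed_entities, seed_labels):
    # Placeholder for P(T|E, c, L): entity and label overlap with the seed
    # table (caption similarity omitted for brevity); not normalized here.
    return (len(set(table["entities"]) & set(seed_entities))
            + len(set(table["headings"]) & set(seed_labels)))

def column_population_score(label, related_tables, seed_entities, seed_labels):
    return sum(p_label_given_table(label, T) * p_table_relevance(T, seed_entities, seed_labels)
               for T in related_tables)

T1 = {"headings": ["Constructor", "Seasons", "Races Entered"], "entities": ["Ferrari", "Haas"]}
T2 = {"headings": ["Driver", "Points"], "entities": ["Ferrari"]}
print(column_population_score("Seasons", [T1, T2], ["Ferrari", "Manor"], ["Constructor", "Engine"]))
```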
  64. EXPERIMENTAL DESIGN (COLUMN POPULATION)

     • Idea: take existing tables and simulate the user at an intermediate step during table completion
     • Select a set of (1000) tables randomly that contain at least 6 rows and at least 4 columns
     • For any intermediate step (j columns completed): the first j (1 <= j <= 3) column labels are taken as the seed table, and the labels of the remaining columns are the ground truth
  65. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion: 1. Row population  2. Column population  3. Value finding
  66. TASK

     • Cell value finding: given an input relational table, find the value of a specific cell (identified by the entity in the core column and the column heading label), or (optionally) determine if the cell should be left empty
     • Example: in the "Oscar Best Actor" table (Year, Actor, Film, Role(s)), the system suggests Role(s) values for Casey Affleck (1. Lee Chandler, 2. Ray Sybert) and Film values for Gary Oldman (1. Darkest Hour, 2. Tinker Tailor Soldier Spy, 3. Nil by Mouth), each with supporting source URLs
  67. Novel aspects:

     1. Enabling a cell to have multiple, possibly conflicting values
     2. Supplementing the predicted values with supporting evidence
     3. Combining evidence from multiple sources
     4. Handling the case where the cell should be left empty
  68. APPROACH

     1. Candidate value finding
       • From the knowledge base: heading-to-predicate matching, e.g., "location" vs. <dbp:location>, <dbp:city>, <dbp:country>
       • From the table corpus: heading-to-heading matching, i.e., identify other table columns that have the same meaning, e.g., nation vs. country, or "Last Won" vs. "Most recent win"
       • Value normalization, e.g., <dbr:Legion_(Red_Dwarf)> <dbp:airdate> "1993-10-14" vs. the table cell "Original air date: 14 October 1993"
     2. Value ranking: a ranked list of suggestions (top-K values)
  69. (System overview figure.)

     • Input: (e, h, T), an entity, a heading label, and the input table
     • Candidate finding (Sect. 4): heading-to-heading matching against the table corpus and heading-to-predicate matching against the knowledge base, yielding candidates {(v; e, h', T')} and {(v; e, p)}
     • Value ranking (Sect. 5): TC value ranking and KB value ranking are combined into KB+TC value ranking
     • Output: a ranked list of values ordered by score(v; e, h, T)
  70. APPROACH

     2. Value ranking: a ranked list of suggestions (top-K values), combining evidence in a feature-based approach
       • Feature I: degree of support for the given value across the different evidence sources
       • Feature II: empty value prediction
       • Feature III: semantic relatedness between the input table and candidate tables (where the value originates from)
  71. EXPERIMENTAL DESIGN

     • Idea: conceal cell values from existing tables
       • Randomly select an existing table
       • Pick a table column
       • Remove n cells randomly from this column
     • Evaluate using crowdsourcing: given the input table, the value, and a source document, does this appear to be the correct value for the missing cell?
  72. EXPERIMENTAL RESULTS

     Value finding performance in terms of NDCG@5:
     | Method           | Empty values excluded | Empty values included |
     | Baseline         | 0.585                 | 0.518                 |
     | Feature I        | 0.664                 | 0.576                 |
     | Feature II       | 0.684                 | 0.590                 |
     | Feature I+II+III | 0.757                 | 0.671                 |
  73. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion: 1. Row population  2. Column population  3. Value finding
  74. CONTRIBUTIONS OF THIS THESIS

     • Tasks: a collection of new tasks, including query-by-table, table generation, row and column population, and cell value finding with evidence
     • Methods: novel methods for table search, table generation, and table completion
     • Resources: a series of data resources made publicly available for reproducibility, including code, run files, and high-quality human annotations obtained via annotators or crowdsourcing

     | Chapter    | Description                                                                      | Link                                            |
     | 3          | Test collection, feature file, and run files related to keyword table search    | https://github.com/iai-group/www2018-table      |
     | 5          | Code and test collections related to table completion                           | https://github.com/iai-group/sigir2017-table    |
     | 6          | Test collections, feature files, and run files related to table generation      | https://github.com/iai-group/sigir2018-table    |
     | 7          | Test collections, feature files, and run files related to table cell completion | https://github.com/iai-group/cikm2019-table     |
     | Appendix A | Code of the SmartTable demo                                                      | https://github.com/iai-group/SmartTable, http://smarttable.cc/ |