Slide 1

Slide 1 text

TABLE SEARCH, GENERATION AND COMPLETION
Shuo Zhang, October 22, 2019

Slide 2

Slide 2 text

TABLES ARE EVERYWHERE

Slide 3

Slide 3 text

STATISTICS ON TABLES
• Web tables: the WebTables system extracted 14.1 billion HTML tables and found 154M of them to be high-quality relational tables [1]
• Web tables: Lehmberg et al. (2016) extracted 233M content tables from the 2015 Common Crawl [2]
• Wikipedia tables: the current snapshot of Wikipedia contains more than 3.23M tables from 520k articles
• Spreadsheets: the number of spreadsheet users worldwide is estimated to exceed 400M, and roughly 50 to 80% of businesses use spreadsheets
• …
[1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
[2] Lehmberg et al. A Large Public Corpus of Web Tables Containing Time and Context Metadata. WWW Companion (2016)

Slide 4

Slide 4 text

TYPES OF TABLES Relational tables Entity tables Other

Slide 5

Slide 5 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

Formula 1 constructors’ statistics 2016
| Constructor | Engine   | Country | Base    |
| Ferrari     | Ferrari  | Italy   | Italy   |
| Force India | Mercedes | India   | UK      |
| Haas        | Ferrari  | US      | US & UK |
| Manor       | Mercedes | UK      | UK      |
| …           | …        | …       | …       |

Slide 6

Slide 6 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
[Same Formula 1 constructors’ statistics 2016 table as on Slide 5, with the table caption highlighted: "Formula 1 constructors’ statistics 2016"]

Slide 7

Slide 7 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
[Same table, with the core column (subject column), i.e., the Constructor column, highlighted]
We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.

Slide 8

Slide 8 text

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
[Same table, with the heading column labels (table schema), i.e., Constructor, Engine, Country, Base, highlighted]

Slide 9

Slide 9 text

WHAT INTELLIGENT ASSISTANCE FUNCTIONALITIES CAN WE PROVIDE FOR PEOPLE WORKING WITH TABLES?

Slide 10

Slide 10 text

OVERVIEW Table Search Table Generation Table Completion

Slide 11

Slide 11 text

#1: TABLE SEARCH Return a table in response to a keyword query

Slide 12

Slide 12 text

#2: TABLE GENERATION Automatically generating an entire table in response to a natural language query

Slide 13

Slide 13 text

#3: TABLE COMPLETION
Generating a ranked list of suggestions for the next row, column, and cell.
[Example: an "Oscar Best Actor" table (Year, Actor, Film) from oscar.go.com/winners is completed with (A) suggested entities for the next row, e.g. 1. 2017, 2. 2018, (B) suggested labels for a new column, e.g. 1. Role(s), 2. Director(s), and (C) a suggested cell value, e.g. "Manchester by the Sea" for Casey Affleck]

Slide 14

Slide 14 text

EXPERIMENTAL SETTING
• Data sources
  • Table corpus: 1.6M tables extracted from Wikipedia
  • Knowledge base: DBpedia 2015-10 (4.6M entities)
• Evaluation measures
  • Standard IR measures (MAP, MRR, NDCG)
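
Since the slides only name the evaluation measures, here is a minimal sketch of how one of them, NDCG@k, could be computed for a single query under the graded (0/1/2) relevance labels used later; the function names and data layout are illustrative, not taken from the thesis code.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_labels, k):
    """NDCG@k for one query: ranked_labels are the graded relevance
    labels (e.g. 0, 1, 2) of the returned tables, in ranked order."""
    ideal_dcg = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: labels of the top-5 retrieved tables for one query
print(ndcg_at_k([2, 0, 1, 2, 0], k=5))
```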

Slide 15

Slide 15 text

• Two retrieval tasks, with tables as results • Ad hoc table retrieval • Query-by-table TABLE SEARCH

Slide 16

Slide 16 text

TASK
• Ad hoc table retrieval:
  • Given a keyword query as input, return a ranked list of tables from a table corpus
[Example: for the query "Singapore", the results include the "Economy Statistics (Recent Years)" table (Year, GDP Nominal (Billion), GDP Nominal Per Capita, GDP Real (Billion), GNI Nominal (Billion), GNI Nominal Per Capita) and the "Language used most frequently at home" table (Language, Color in Figure, Percent), both from https://en.wikipedia.org/wiki/Singapore]

Slide 17

Slide 17 text

APPROACHES • Unsupervised methods • Build a document-based representation for each table, then employ conventional document retrieval methods • Supervised methods • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")

Slide 18

Slide 18 text

UNSUPERVISED METHODS
• Single-field document representation
  • All table content, no structure
• Multi-field document representation
  • Separate document fields for the embedding document's title, section title, table caption, table body, and table headings
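
As a rough illustration of the two representations, the sketch below builds a single-field and a multi-field document from a table given as a plain Python dict; the field names and the `table` layout are assumptions for this example, not the thesis's actual data model.

```python
def single_field_doc(table):
    """Flatten all table content into one unstructured text field."""
    parts = [table.get("page_title", ""), table.get("section_title", ""),
             table.get("caption", ""), " ".join(table.get("headings", []))]
    parts += [cell for row in table.get("rows", []) for cell in row]
    return " ".join(p for p in parts if p)

def multi_field_doc(table):
    """Keep separate fields so a fielded retrieval model (e.g. BM25F or a
    mixture of field language models) can weight them differently."""
    return {
        "page_title": table.get("page_title", ""),
        "section_title": table.get("section_title", ""),
        "caption": table.get("caption", ""),
        "headings": " ".join(table.get("headings", [])),
        "body": " ".join(cell for row in table.get("rows", []) for cell in row),
    }

table = {"page_title": "Formula One",
         "caption": "Formula 1 constructors' statistics 2016",
         "headings": ["Constructor", "Engine", "Country", "Base"],
         "rows": [["Ferrari", "Ferrari", "Italy", "Italy"]]}
print(multi_field_doc(table)["headings"])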

Slide 19

Slide 19 text

SUPERVISED METHODS • Three groups of features • Query features • #query terms, query IDF scores • Table features • Table properties: #rows, #cols, #empty cells, etc. • Embedding document: link structure, number of tables, etc. • Query-table features • Query terms found in different table elements, LM score, etc. • Our novel semantic matching features

Slide 20

Slide 20 text

Can we go beyond lexical matching and improve keyword table search performance by incorporating semantic matching?

Slide 21

Slide 21 text

SEMANTIC MATCHING • Main objective: go beyond term-based matching • Three components: 1. Content extraction 2. Semantic representations 3. Similarity measures

Slide 22

Slide 22 text

SEMANTIC MATCHING, 1. CONTENT EXTRACTION
• The "raw" content of a query/table is represented as a set of terms, which can be words or entities
[Diagram: query terms q1 … qn and table terms t1 … tm]

Slide 23

Slide 23 text

SEMANTIC MATCHING, 1. CONTENT EXTRACTION
• The "raw" content of a query/table is represented as a set of terms, which can be words or entities
• Entity-based:
  - Top-k ranked entities from a knowledge base
  - Entities in the core table column
  - Top-k ranked entities using the embedding document/section title as a query
[Diagram: query terms q1 … qn and table terms t1 … tm]

Slide 24

Slide 24 text

SEMANTIC MATCHING, 2. SEMANTIC REPRESENTATIONS
• Each of the raw terms is mapped to a semantic vector representation
[Diagram: query terms q1 … qn and table terms t1 … tm are mapped to semantic vectors q̃1 … q̃n and t̃1 … t̃m]

Slide 25

Slide 25 text

SEMANTIC REPRESENTATIONS
• Bag-of-concepts (sparse discrete vectors t̃_i)
  • Bag-of-entities
    • Each vector element t̃_i[j] corresponds to an entity
    • t̃_i[j] is 1 if there exists a link between entities i and j in the KB
  • Bag-of-categories
    • Each vector element t̃_i[j] corresponds to a Wikipedia category
    • t̃_i[j] is 1 if entity i is assigned to Wikipedia category j
• Embeddings (dense continuous vectors)
  • Word embeddings
    • Word2Vec (300 dimensions, trained on Google News)
  • Graph embeddings
    • RDF2vec (200 dimensions, trained on DBpedia)
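
A minimal sketch of how the two bag-of-concepts representations could be built, assuming the KB link graph and category assignments are available as Python dicts; all names (`kb_links`, `entity_index`, etc.) are illustrative rather than taken from the thesis code.

```python
def bag_of_entities(entity, kb_links, entity_index):
    """Sparse 0/1 vector: element j is 1 if `entity` links to entity j in the KB."""
    vec = [0] * len(entity_index)
    for linked in kb_links.get(entity, set()):
        if linked in entity_index:
            vec[entity_index[linked]] = 1
    return vec

def bag_of_categories(entity, kb_categories, category_index):
    """Sparse 0/1 vector: element j is 1 if `entity` belongs to Wikipedia category j."""
    vec = [0] * len(category_index)
    for cat in kb_categories.get(entity, set()):
        if cat in category_index:
            vec[category_index[cat]] = 1
    return vec

# Toy example with hypothetical KB fragments
kb_links = {"Ferrari": {"Italy", "Formula_One"}}
kb_categories = {"Ferrari": {"Formula_One_constructors"}}
entity_index = {"Italy": 0, "Formula_One": 1, "Mercedes": 2}
category_index = {"Formula_One_constructors": 0, "Car_manufacturers": 1}
print(bag_of_entities("Ferrari", kb_links, entity_index))            # [1, 1, 0]
print(bag_of_categories("Ferrari", kb_categories, category_index))   # [1, 0]
```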

Slide 26

Slide 26 text

SEMANTIC MATCHING, 3. SIMILARITY MEASURES
[Diagram: semantic matching between the query vectors q̃1 … q̃n and the table vectors t̃1 … t̃m]

Slide 27

Slide 27 text

SEMANTIC MATCHING, EARLY FUSION MATCHING STRATEGY
• Early: take the centroids of the query and table semantic vectors (q̃1 … q̃n and t̃1 … t̃m) and compute their cosine similarity

Slide 28

Slide 28 text

SEMANTIC MATCHING, LATE FUSION MATCHING STRATEGY
• Late: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max)
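
The two fusion strategies map directly onto a few lines of NumPy; the sketch below illustrates the centroid-based (early) and pairwise-aggregation (late) scoring, with toy 3-dimensional vectors standing in for the real word, entity, or graph embeddings.

```python
import numpy as np

def early_fusion(query_vecs, table_vecs):
    """Cosine similarity between the centroids of the two sets of semantic vectors."""
    cq, ct = np.mean(query_vecs, axis=0), np.mean(table_vecs, axis=0)
    return float(cq @ ct / (np.linalg.norm(cq) * np.linalg.norm(ct)))

def late_fusion(query_vecs, table_vecs, aggr="avg"):
    """Aggregate (sum / avg / max) over all pairwise cosine similarities."""
    sims = [float(q @ t / (np.linalg.norm(q) * np.linalg.norm(t)))
            for q in query_vecs for t in table_vecs]
    return {"sum": sum, "avg": np.mean, "max": max}[aggr](sims)

# Toy 3-dimensional vectors standing in for word/entity/graph embeddings
q_vecs = np.array([[0.2, 0.1, 0.7], [0.5, 0.4, 0.1]])
t_vecs = np.array([[0.3, 0.2, 0.5], [0.1, 0.8, 0.1]])
print(early_fusion(q_vecs, t_vecs), late_fusion(q_vecs, t_vecs, "max"))
```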

Slide 29

Slide 29 text

EXPERIMENTAL EVALUATION

Slide 30

Slide 30 text

EXPERIMENTAL SETUP
• Table corpus
  • WikiTables corpus [1]: 1.6M tables extracted from Wikipedia
• Knowledge base
  • DBpedia (2015-10): 4.6M entities with an English abstract
• Queries
  • Sampled from two sources [2, 3]
• Rank-based evaluation
  • NDCG@5, 10, 15, 20

Example queries:
| QS-1            | QS-2                      |
| video games     | asian countries currency  |
| us cities       | laptops cpu               |
| kings of africa | food calories             |
| economy gdp     | guitars manufacturer      |

[1] Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
[2] Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
[3] Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)

Slide 31

Slide 31 text

RELEVANCE ASSESSMENTS
• Collected via crowdsourcing
  • Pooling to depth 20; 3120 query-table pairs in total
• Assessors are presented with the following scenario:
  • "Imagine that your task is to create a new table on the query topic"
• A table is …
  • Non-relevant (0): if it is unclear what it is about, or it is about a different topic
  • Relevant (1): if some cells or values could be used from it
  • Highly relevant (2): if large blocks or several values could be used from it

Slide 32

Slide 32 text

RESEARCH QUESTIONS • RQ1: Can semantic matching improve retrieval performance? • RQ2: Which of the semantic representations is the most effective? • RQ3: Which of the similarity measures performs best?

Slide 33

Slide 33 text

RESULTS: RQ1
• Can semantic matching improve retrieval performance?
  • Yes. STR achieves substantial and significant improvements over LTR.

| Method                         | NDCG@10 | NDCG@20 |
| Single-field document ranking  | 0.4344  | 0.5254  |
| Multi-field document ranking   | 0.4860  | 0.5473  |
| WebTable [1]                   | 0.2992  | 0.3726  |
| WikiTable [2]                  | 0.4766  | 0.5206  |
| LTR baseline                   | 0.5456  | 0.6031  |
| STR (LTR + semantic matching)  | 0.6293  | 0.6825  |

[1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
[2] Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.

Slide 34

Slide 34 text

RESULTS: RQ2 • Which of the semantic representations is the most effective? • Bag-of-entities.

Slide 35

Slide 35 text

RESULTS: RQ3 • Which of the similarity measures performs best? • Late-sum and Late-avg (but it also depends on the representation)

Slide 36

Slide 36 text

FEATURE ANALYSIS

Slide 37

Slide 37 text

OVERVIEW Table Search Table Generation Table Completion 1. Ad hoc table retrieval 2. Query-by-table

Slide 38

Slide 38 text

TASK
• Query-by-table:
  • Given an input table, return a ranked list of relevant tables
• Boils down to table matching:
  • Computing the similarity between a pair of tables
[Example: the input table "MotoGP World Standing 2017" (Pos., Rider, Bike, Points) matches related tables such as "Grand Prix motorcycle racing World champions" (Rank, Rider, Country, Period, Total) and "MotoGP 2016 Championship Final Standing" (Pos., Rider, Bike, Nation, Points)]

Slide 39

Slide 39 text

PREVIOUS APPROACHES
• Extracting a keyword query (from various table elements) and scoring tables against that query
• Splitting tables into various elements and performing element-wise matching
  • Ad hoc similarity measures, tailor-made for each table element
  • Lacking a principled way of combining element-level similarities
  • Matching elements of different types has not been explored

Slide 40

Slide 40 text

Can we develop an effective and theoretically sound table matching framework for measuring and combining table element level similarity, without resorting to hand-crafted features?

Slide 41

Slide 41 text

APPROACH
1. Represent table elements in multiple semantic spaces
2. Measure element-level similarity in each semantic space y:
   \phi_i(\tilde{T}, T) = \mathrm{sim}(\tilde{T}^y_{x_1}, T^y_{x_2})
   • x1 = x2: element-wise matching
   • x1 != x2: cross-element matching
3. Combine the element-level similarities in a discriminative learning framework

Slide 42

Slide 42 text

TABLE ELEMENTS
[Formula 1 constructors’ statistics 2016 table from Slide 5, annotated with its elements:]
• Table topic T_t (caption + page title)
• Table entities T_E (core column)
• Table headings T_h
• Table data T_D

Slide 43

Slide 43 text

1. REPRESENTING TABLE ELEMENTS IN SEMANTIC SPACES
• A table element in the term space, T_x = [t_1, ..., t_N], is mapped to a matrix of semantic vectors [t^y_1; ...; t^y_N] in semantic space y
• Semantic spaces
  • Word embeddings
  • Graph embeddings
  • Entity embeddings
• Term weighting
  • Words: TF-IDF weight
  • Entities: presence/absence (0/1)

Slide 44

Slide 44 text

2. MEASURING ELEMENT-LEVEL SIMILARITY
• Early fusion: take the weighted centroid of the term-level semantic vectors of each element, then compute the cosine similarity of the centroid vectors:
  sim_{early}(\tilde{T}_{x_1}, T_{x_2}) = \cos(\tilde{C}^y_{x_1}, C^y_{x_2})
• Late fusion: compute the cosine similarities between all pairs of semantic vectors, then aggregate (sum, avg, or max):
  sim_{late}(\tilde{T}_{x_1}, T_{x_2}) = \mathrm{aggr}(\{\cos(\tilde{t}_1, t_2) : \tilde{t}_1 \in \tilde{T}^y_{x_1}, t_2 \in T^y_{x_2}\})

Slide 45

Slide 45 text

3. ELEMENT-LEVEL SIMILARITIES AS FEATURES
[Feature table: element-wise feature groups and cross-element feature groups]
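
As a hedged sketch of how such a feature vector might be assembled, the code below pairs up the four table elements of two tables and computes one early-fusion-style similarity per pair; the element names, the similarity choice, and the random toy vectors are illustrative, and the actual feature set in the thesis is richer.

```python
import itertools
import numpy as np

ELEMENTS = ["topic", "entities", "headings", "data"]

def centroid_cosine(vecs_a, vecs_b):
    """Early-fusion style similarity between two bags of semantic vectors."""
    ca, cb = np.mean(vecs_a, axis=0), np.mean(vecs_b, axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

def similarity_features(input_table, candidate_table):
    """One feature per (input element, candidate element) pair.
    Pairs with x1 == x2 are element-wise features; x1 != x2 are cross-element."""
    feats = {}
    for x1, x2 in itertools.product(ELEMENTS, ELEMENTS):
        feats[f"{x1}__{x2}"] = centroid_cosine(input_table[x1], candidate_table[x2])
    return feats  # fed to a discriminative learner

# Toy tables: each element is a small bag of 3-d semantic vectors
def toy_table():
    return {e: np.random.rand(2, 3) for e in ELEMENTS}

print(list(similarity_features(toy_table(), toy_table()).items())[:3])
```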

Slide 46

Slide 46 text

EXPERIMENTAL EVALUATION

Slide 47

Slide 47 text

EXPERIMENTAL SETUP • WikiTables corpus, DBpedia as knowledge base • 50 tables sampled • Diverse topics (sports, music, films, food, celebrities, geography, politics, etc.) • Relevance assessments • (2) highly relevant: it is about the same topic as the input table, but contains additional novel content that is not present in the input table • (1) relevant: on-topic, but it contains limited novel content • (0) non-relevant • Fleiss Kappa = 0.6703 (substantial agreement)

Slide 48

Slide 48 text

SETTING A BASELINE (1)

Slide 49

Slide 49 text

SETTING A BASELINE (2)

Slide 50

Slide 50 text

SETTING A BASELINE (2) Table 2

Slide 51

Slide 51 text

SETTING A BASELINE (2) Table 2 Table 3

Slide 52

Slide 52 text

RESULTS

| Method | element-wise | cross-element | table features | NDCG@5 | NDCG@10 |
| HCF-1  | +            |               |                | 0.5382 | 0.5542  |
| HCF-2  | +            |               | +              | 0.5895 | 0.6050  |
| CRAB-1 | +            |               |                | 0.5578 | 0.5672  |
| CRAB-2 | +            |               | +              | 0.6172 | 0.6267  |
| CRAB-3 |              | +             | +              | 0.5140 | 0.5282  |
| CRAB-4 | +            | +             | +              | 0.5804 | 0.6027  |

Slide 53

Slide 53 text

RESULTS
[Results table repeated from Slide 52]

Slide 54

Slide 54 text

RESULTS
[Results table repeated from Slide 52]

Slide 55

Slide 55 text

RESEARCH QUESTIONS • RQ1: Which of the semantic representations (word- based, graph-based, or entity-based) is the most effective for modeling table elements? • RQ2: Which of the two element-level matching strategies performs better, element-wise or cross- element? • RQ3: How much do different table elements contribute to retrieval performance?

Slide 56

Slide 56 text

RESULTS: RQ1
• Which of the semantic representations is the most effective?
  • Entity-based. Also, they are complementary.

Slide 57

Slide 57 text

RESULTS: RQ2
• Which of the two element-level matching strategies performs better, element-wise or cross-element?
  • Element-wise
[Results table repeated from Slide 52]

Slide 58

Slide 58 text

RESULTS: RQ2 • There are several cases where cross-element matching yields higher scores than element-wise matching

Slide 59

Slide 59 text

ANALYSIS: INPUT TABLE SIZE
[Diagrams: the input table is split horizontally (keeping x% of the rows) or vertically (keeping x% of the columns) to simulate smaller input tables]

Slide 60

Slide 60 text

OVERVIEW Table Search Table Generation Table Completion 1. Ad hoc table retrieval 2. Query-by-table

Slide 61

Slide 61 text

TASK
• On-the-fly table generation:
  • Answer a free-text query with a relational table, where
    • the core column (E) lists all relevant entities;
    • the columns correspond to the schema (S), i.e., attributes of those entities;
    • the cells contain the values (V) of the corresponding entity attributes.
[Example: for the query "Video albums of Taylor Swift", the generated table has the columns Title, Released date, Label, Formats, with rows such as "CMT Crossroads: Taylor Swift and …", "Journey to Fearless", "Speak Now World Tour-Live", "The 1989 World Tour Live"]

Slide 62

Slide 62 text

APPROACH
• Core column entity ranking and schema determination could potentially mutually reinforce each other.
[Diagram: the query q feeds into core column entity ranking (producing E) and schema determination (producing S); E and S feed into value lookup (producing V)]

Slide 63

Slide 63 text

ALGORITHM
[Diagram: an iterative algorithm; in each round, core column entity ranking uses the schema from the previous round and schema determination uses the entities from the previous round; value lookup runs once E and S are determined]
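
A possible reading of this loop as code is sketched below: the entity and schema scorers are passed in as functions (stand-ins for the feature-based rankers on the next slides), and the number of rounds and cut-offs are illustrative.

```python
def generate_table(query, rank_entities, rank_labels, lookup_value,
                   rounds=3, k_entities=10, k_labels=5):
    """Sketch of the iterative table-generation loop: entity ranking and
    schema determination feed each other for a few rounds, then values
    are looked up for every (entity, label) cell."""
    entities, labels = [], []
    for _ in range(rounds):
        # Each scorer conditions on the other component's previous output
        entities = rank_entities(query, labels)[:k_entities]
        labels = rank_labels(query, entities)[:k_labels]
    rows = [[lookup_value(e, l, query) for l in labels] for e in entities]
    return {"core_column": entities, "schema": labels, "rows": rows}

# Toy usage with stub scorers, just to show the control flow
dummy = generate_table(
    "video albums of taylor swift",
    rank_entities=lambda q, S: ["Journey_to_Fearless", "Speak_Now_World_Tour_Live"],
    rank_labels=lambda q, E: ["Released date", "Label"],
    lookup_value=lambda e, l, q: None,
)
print(dummy["schema"])
```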

Slide 64

Slide 64 text

KNOWLEDGE BASE ENTRY
[Diagram: a knowledge base entry with its entity name, entity type, and description (e_d), and its property-value pairs (e_p, e_a)]

Slide 65

Slide 65 text

CORE COLUMN ENTITY RANKING

\mathrm{score}_t(e, q) = \sum_i w_i \, \phi_i(e, q, S_{t-1})

Slide 66

Slide 66 text

SCHEMA DETERMINATION

\mathrm{score}_t(s, q) = \sum_i w_i \, \phi_i(s, q, E_{t-1})

Slide 67

Slide 67 text

VALUE LOOKUP
• A catalog of possible entity attribute-value pairs
• Entity, schema label, value, provenance quadruples ⟨e, s, v, p⟩
• Values are collected both from the knowledge base and from the table corpus

Slide 68

Slide 68 text

VALUE LOOKUP
• Finding a cell's value is a lookup in that catalog:

  \mathrm{score}(v, e, s, q) = \max_{\langle s', v, p \rangle \in e_V} \mathrm{match}(s, s') \times \mathrm{conf}(p, q)

• match(s, s'): soft string matching between schema labels
  • e.g. "birthday" vs. "date of birth", "country" vs. "nationality"
• conf(p, q): matching confidence of the provenance
  • the KB takes priority over the table corpus
  • for table corpus values, it is based on the corresponding table's relevance to the query
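
The lookup itself is a small maximisation over the catalog; the sketch below assumes the catalog is a dict from entity to (schema label, value, provenance) triples, and the `match` and `conf` lambdas are crude stand-ins for the soft string matching and provenance confidence described above.

```python
def score_value(value, entity, schema_label, query, catalog, match, conf):
    """score(v, e, s, q) = max over catalog entries <s', v, p> of match(s, s') * conf(p, q)."""
    candidates = [(s2, v, p) for (s2, v, p) in catalog.get(entity, []) if v == value]
    if not candidates:
        return 0.0
    return max(match(schema_label, s2) * conf(p, query) for (s2, v, p) in candidates)

# Toy catalog and simplified match/conf functions (illustrative only)
catalog = {"Casey_Affleck": [("Role(s)", "Lee Chandler", "wikipedia:Academy_Award_for_Best_Actor"),
                             ("Role(s)", "Lee Chandler", "dbpedia:Casey_Affleck")]}
match = lambda s, s2: 1.0 if s.lower() == s2.lower() else 0.5   # crude soft matching
conf = lambda p, q: 1.0 if p.startswith("dbpedia") else 0.8     # KB beats table corpus
print(score_value("Lee Chandler", "Casey_Affleck", "Role(s)", "oscar best actor",
                  catalog, match, conf))
```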

Slide 69

Slide 69 text

EXPERIMENTAL EVALUATION

Slide 70

Slide 70 text

EXPERIMENTAL SETUP • Table corpus • WikiTables corpus: 1.6M tables extracted from Wikipedia • Knowledge base • DBpedia (2015-10): 4.6M entities with an English abstract • Two query sets • Rank-based metrics • NDCG for core column entity ranking and schema determination • MAP/MRR for value lookup

Slide 71

Slide 71 text

QUERY SET 1 (QS-1)
• List queries from the DBpedia-Entity v2 collection [1] (119 queries)
  • "all cars that are produced in Germany"
  • "permanent members of the UN Security Council"
  • "Airlines that currently use Boeing 747 planes"
• Core column entity ranking
  • Highly relevant entities from the collection
• Schema determination
  • Crowdsourcing, 3-point relevance scale, 7k query-label pairs
• Value lookup
  • Crowdsourcing, a sample of 25 queries, 14k cell values
[1] Hasibi et al. DBpedia-Entity v2: A Test Collection for Entity Search. In: SIGIR '17.

Slide 72

Slide 72 text

QUERY SET 2 (QS-2)
• Entity-relationship queries from the RELink Query Collection [1] (600 queries)
  • Queries are answered by entity tuples (pairs or triples)
  • That is, each query is answered by a table with 2 or 3 columns (including the core entity column)
• Queries and relevance judgments are obtained automatically from Wikipedia lists that contain relational tables
  • Human annotators were asked to formulate the corresponding information need as a natural language query
  • "Find peaks above 6000m in the mountains of Peru"
  • "Which countries and cities have accredited armenian embassadors?"
  • "Which anti-aircraft guns were used in ships during war periods and what country produced them?"
[1] Saleiro et al. RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval. In: SIGIR '17.

Slide 73

Slide 73 text

CORE COLUMN ENTITY RANKING (QUERY-BASED)

|                | QS-1 NDCG@5 | QS-1 NDCG@10 | QS-2 NDCG@5 | QS-2 NDCG@10 |
| LM             | 0.2419      | 0.2591       | 0.0708      | 0.0823       |
| DRRM_TKS (e_d) | 0.2015      | 0.2028       | 0.0501      | 0.0540       |
| DRRM_TKS (e_p) | 0.1780      | 0.1808       | 0.1089      | 0.1083       |
| Combined       | 0.2821      | 0.2834       | 0.0852      | 0.0920       |

Slide 74

Slide 74 text

CORE COLUMN ENTITY RANKING (SCHEMA-ASSISTED)
[Charts for QS-1 and QS-2]
• R #0: without schema information (query only)
• R #1-#3: with automatic schema determination (top 10)
• Oracle: with ground truth schema

Slide 75

Slide 75 text

EXAMPLE "Towns in the Republic of Ireland in 2006 Census Records"

Slide 76

Slide 76 text

OVERVIEW Table Search Table Generation Table Completion 1. Ad hoc table retrieval 2. Query-by-table 1. On-the-fly table generation

Slide 77

Slide 77 text

TABLE COMPLETION
• Three completion tasks
  • Row population
  • Column population
  • Value finding

Slide 78

Slide 78 text

TASK
• Row population:
  • The task of generating a ranked list of entities to be added to the core column of a given seed table
[Example: for the Formula 1 constructors’ statistics 2016 seed table, the suggested entities for the next row are 1. McLaren, 2. Mercedes, 3. Red Bull]

Slide 79

Slide 79 text

APPROACH
[Pipeline: 1. Candidate Selection → 2. Entity Ranking → ranked list of suggestions (top-K entities)]

Slide 80

Slide 80 text

APPROACH, 1. CANDIDATE SELECTION
• From the knowledge base
  • Entities that have the same type(s) or belong to the same categories as the seed entities
• From the table corpus
  • Entities from related tables (tables that contain any of the seed entities, or have a similar caption or headings)

Slide 81

Slide 81 text

APPROACH, 2. ENTITY RANKING
• Based on the similarity between the candidate entity e and various table elements:

  P(e \mid E, L, c) \propto P(e|E) \, P(L|e) \, P(c|e)

  • P(e|E): entity similarity (E: seed entities)
  • P(L|e): column label similarity (L: column labels)
  • P(c|e): caption similarity (c: table caption)

Slide 82

Slide 82 text

APPROACH, 2. ENTITY RANKING
• Entity similarity:
  P(e|E) = \lambda_E P_{KB}(e|E) + (1 - \lambda_E) P_{TC}(e|E)
• Column labels likelihood:
  P(L|e) = \sum_{l \in L} \Big( \lambda_L \prod_{t \in l} P_{LM}(t|\theta_e) + \frac{1 - \lambda_L}{|L|} P_{EM}(l|e) \Big)
• Caption likelihood:
  P(c|e) = \prod_{t \in c} \big( \lambda_c P_{KB}(t|\theta_e) + (1 - \lambda_c) P_{TC}(t|e) \big)
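
A simplified sketch of how the three components could be combined (in log space) to rank candidate entities; the component estimators are passed in as stub functions, since the actual P_KB, P_TC, P_LM, and P_EM estimators are not spelled out here, and all names are illustrative.

```python
import math

def score_candidate(e, seed_entities, labels, caption,
                    p_entity_sim, p_label_lik, p_caption_lik):
    """Rank score for candidate e: log P(e|E) + log P(L|e) + log P(c|e)."""
    return (math.log(p_entity_sim(e, seed_entities))
            + math.log(p_label_lik(labels, e))
            + math.log(p_caption_lik(caption, e)))

def rank_candidates(candidates, seed_entities, labels, caption, estimators, k=10):
    scored = [(score_candidate(e, seed_entities, labels, caption, *estimators), e)
              for e in candidates]
    return [e for _, e in sorted(scored, reverse=True)][:k]

# Toy estimators just to show the plumbing (not real probability models)
estimators = (lambda e, E: 0.3 if e.startswith("M") else 0.1,
              lambda L, e: 0.2,
              lambda c, e: 0.2)
print(rank_candidates(["McLaren", "Red_Bull", "Mercedes"], ["Ferrari", "Haas"],
                      ["Constructor", "Engine"],
                      "Formula 1 constructors' statistics 2016", estimators, k=3))
```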

Slide 83

Slide 83 text

EXPERIMENTAL DESIGN: ROW POPULATION
• Idea: take existing tables and simulate the user at an intermediate step of table completion
• Select a set of 1000 tables randomly
  • Each contains at least 6 rows and at least 4 columns
• For an intermediate step with i rows completed:
  • The first i (1 <= i <= 5) rows are taken as the seed table
  • The entities in the remaining rows are the ground truth
[Diagram: seed table (first i rows, entities E, labels L) and ground-truth entities E_gt in the remaining rows]

Slide 84

Slide 84 text

RESULTS FOR CANDIDATE SELECTION

Slide 85

Slide 85 text

RESULTS FOR ENTITY RANKING

Slide 86

Slide 86 text

TAKE-AWAY POINTS
• Both tables and KBs are useful for row population
• Candidate selection
  • Category > Type
  • Entity > Caption > Headings
  • All complement each other
• Entity ranking
  • Entity > Headings > Caption
  • All complement each other
  • Performance depends strongly on the quality of candidate selection

Slide 87

Slide 87 text

OVERVIEW Table Search Table Generation Table Completion 1. Ad hoc table retrieval 2. Query-by-table 1. On-the-fly table generation 1. Row population 2. Column population 3. Value finding

Slide 88

Slide 88 text

TASK
• Column population:
  • The task of generating a ranked list of column labels to be added to the column headings of a given seed table
[Example: for the Formula 1 constructors’ statistics 2016 seed table, the suggested labels for the next column are 1. Seasons, 2. Races Entered]

Slide 89

Slide 89 text

APPROACH
[Pipeline: 1. Candidate Selection → 2. Label Ranking → ranked list of suggestions (top-K labels)]

Slide 90

Slide 90 text

APPROACH, 1. CANDIDATE SELECTION
• From the table corpus
  • Column labels from related tables (tables that contain any of the seed entities, or have a similar caption or headings)

Slide 91

Slide 91 text

APPROACH, 2. LABEL RANKING
• Based on the similarity between the candidate column label and the table elements:

  P(l \mid E, c, L) = \sum_T P(l|T) \, P(T \mid E, c, L)

  (summing over related tables T: the probability that T contains label l, weighted by T's relevance to the seed table's entities E, caption c, and labels L)
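
A minimal sketch of this aggregation, assuming related tables have already been retrieved with a relevance score playing the role of P(T|E,c,L), and using the common simplification that P(l|T) is 1 if T contains the label; all names are illustrative.

```python
from collections import defaultdict

def rank_labels(seed_labels, related_tables, k=10):
    """P(l|E,c,L) = sum over related tables T of P(l|T) * P(T|E,c,L).
    Each related table is a (table, relevance_score) pair."""
    scores = defaultdict(float)
    for table, relevance in related_tables:
        for label in table["headings"]:
            if label not in seed_labels:           # only suggest new labels
                scores[label] += 1.0 * relevance   # P(l|T) * P(T|E,c,L)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy related tables with relevance scores (illustrative)
related = [({"headings": ["Constructor", "Seasons", "Races Entered"]}, 0.9),
           ({"headings": ["Constructor", "Seasons", "Wins"]}, 0.6)]
print(rank_labels(["Constructor", "Engine", "Country", "Base"], related, k=3))
```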

Slide 92

Slide 92 text

EXPERIMENTAL DESIGN: COLUMN POPULATION
• Idea: take existing tables and simulate the user at an intermediate step of table completion
• Select a set of 1000 tables randomly
  • Each contains at least 6 rows and at least 4 columns
• For an intermediate step with j columns completed:
  • The first j (1 <= j <= 3) columns are taken as the seed table
  • The labels of the remaining columns are the ground truth
[Diagram: seed table (entities E, first j labels L) and ground-truth labels L_gt in the remaining columns]

Slide 93

Slide 93 text

RESULTS FOR CANDIDATE SELECTION

Slide 94

Slide 94 text

RESULTS FOR LABEL RANKING

Slide 95

Slide 95 text

OVERVIEW Table Search Table Generation Table Completion 1. Ad hoc table retrieval 2. Query-by-table 1. On-the-fly table generation 1. Row population 2. Column population 3. Value finding

Slide 96

Slide 96 text

TASK
• Cell value finding:
  • Given an input relational table, find the value of a specific cell (identified by the entity in the core column and the column heading label), or (optionally) determine that the cell should be left empty
[Example: an Oscar Best Actor table (Year, Actor, Film, Role(s)); for the cell (2017, Gary Oldman, Film) the suggested values are 1. Darkest Hour, 2. Tinker Tailor Soldier Spy, 3. Nil by Mouth, each with supporting sources (Wikipedia, DBpedia); for the cell (2016, Casey Affleck, Role(s)) the suggestions are 1. Lee Chandler, 2. Ray Sybert, with sources]

Slide 97

Slide 97 text

Novel aspects:
1. Enabling a cell to have multiple, possibly conflicting values
2. Supplementing the predicted values with supporting evidence
3. Combining evidence from multiple sources
4. Handling the case where the cell should be left empty

Slide 98

Slide 98 text

APPROACH
[Pipeline: 1. Candidate Value Finding → 2. Value Ranking → ranked list of suggestions (top-K values)]

Slide 99

Slide 99 text

APPROACH, 1. CANDIDATE VALUE FINDING
• From the knowledge base
  • Heading-to-predicate matching: match the heading label h against the DBpedia predicates p of entity e to obtain values v
  • E.g. "location" vs. …
• From the table corpus
  • Heading-to-heading matching: identify other table columns that have the same meaning, e.g., "nation" vs. "country"
  • E.g. for the entity Cleveland Indians, the heading "Last Won" matches "Most recent win" (✔, value 1948 in "World Series record by team or franchise, 1903-2013") but not "Season" (✖)
• Value normalization
  • E.g. "Original air date: 14 October 1993" is normalized to "1993-10-14"

Slide 100

Slide 100 text

[Pipeline diagram. Input: a cell identified by (e, h, T). Candidate finding (Sect. 4): heading-to-heading matching against the table corpus (via table matching and value extraction), yielding candidates {(v; e, h', T')}, and heading-to-predicate matching against the knowledge base, yielding candidates {(v; e, p)}. Value ranking (Sect. 5): TC value ranking and KB value ranking are combined into KB+TC value ranking. Output: a ranked list of values ordered by score(v; e, h, T)]

Slide 101

Slide 101 text

APPROACH, 2. VALUE RANKING
• Combine evidence in a feature-based approach
  • Feature I: degree of support for the given value across the different evidence sources
  • Feature II: empty value prediction
  • Feature III: semantic relatedness between the input table and the candidate tables (where the value originates from)
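
Purely as an illustration of the feature combination (the thesis learns the combination rather than hand-setting weights), a toy linear scorer over the three feature groups might look like this; the candidate layout and the weights are assumptions for the example.

```python
def value_ranking_score(candidate, weights=(0.5, 0.2, 0.3)):
    """Toy linear combination of the three feature groups for one candidate value.
    `candidate` bundles the value, its evidence sources, and the candidate tables
    it came from (with relevance to the input table assumed precomputed)."""
    f_support = min(len(candidate["sources"]) / 3.0, 1.0)   # Feature I: evidence support, capped at 1
    f_empty = 0.0 if candidate["sources"] else 1.0           # Feature II: empty-value signal
    f_related = max((t["relevance"] for t in candidate["tables"]), default=0.0)  # Feature III
    w1, w2, w3 = weights
    return w1 * f_support - w2 * f_empty + w3 * f_related

candidate = {"value": "Darkest Hour",
             "sources": ["wikipedia:Academy_Award_for_Best_Actor", "dbpedia:Gary_Oldman"],
             "tables": [{"relevance": 0.8}]}
print(value_ranking_score(candidate))
```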

Slide 102

Slide 102 text

EXPERIMENTAL DESIGN
• Idea: conceal cell values from existing tables
  • Randomly select an existing table
  • Pick a table column
  • Remove n cells randomly from this column
• Evaluate using crowdsourcing
  • Given the input table, the value, and a source document: does this appear to be the correct value for the missing cell?
[Diagram: a sampled column with the picked (concealed) values]

Slide 103

Slide 103 text

EXPERIMENTAL RESULTS

Value finding performance in terms of NDCG@5:
| Method           | Empty values excluded | Empty values included |
| Baseline         | 0.585                 | 0.518                 |
| Feature I        | 0.664                 | 0.576                 |
| Feature II       | 0.684                 | 0.590                 |
| Feature I+II+III | 0.757                 | 0.671                 |

Slide 104

Slide 104 text

OVERVIEW Table Search Table Generation Table Completion 1. Ad hoc table retrieval 2. Query-by-table 1. On-the-fly table generation 1. Row population 2. Column population 3. Value finding

Slide 105

Slide 105 text

WHAT INTELLIGENT ASSISTANCE FUNCTIONALITIES CAN WE PROVIDE FOR PEOPLE WORKING WITH TABLES?

Slide 106

Slide 106 text

CONTRIBUTIONS OF THIS THESIS
• Tasks: a collection of new tasks, including query-by-table, table generation, row and column population, and cell value finding with supporting evidence
• Methods: novel methods for table search, table generation, and table completion
• Resources: a series of data resources made publicly available for reproducibility, including code, run files, and high-quality human annotations obtained via annotators or crowdsourcing

| Chapter    | Description                                                                      | Link                                          |
| 3          | Test collection, feature file, and run files related to keyword table search    | https://github.com/iai-group/www2018-table    |
| 5          | Code and test collections related to table completion                           | https://github.com/iai-group/sigir2017-table  |
| 6          | Test collections, feature files, and run files related to table generation      | https://github.com/iai-group/sigir2018-table  |
| 7          | Test collections, feature files, and run files related to table cell completion | https://github.com/iai-group/cikm2019-table   |
| Appendix A | Code of the SmartTable demo                                                      | https://github.com/iai-group/SmartTable, http://smarttable.cc/ |

Slide 107

Slide 107 text

FUTURE
• …

Slide 108

Slide 108 text

THANKS!