
Information Retrieval and Text Mining - Table Search, Generation and Completion

University of Stavanger, DAT640, 2019 fall

Krisztian Balog

October 22, 2019

Transcript

  1. STATISTICS ON TABLES

     • Web Tables: the WebTables system extracted 14.1 billion HTML tables and found 154M to be high-quality [1]
     • Web Tables: Lehmberg et al. (2016) extracted 233M content tables from the 2015 Common Crawl [2]
     • Wikipedia Tables: the current snapshot of Wikipedia contains more than 3.23M tables from 520k articles
     • Spreadsheets: the number of worldwide spreadsheet users is estimated to exceed 400M, and about 50 to 80% of businesses use spreadsheets
     • …
     [1] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     [2] Lehmberg et al. A Large Public Corpus of Web Tables Containing Time and Context Metadata. WWW Companion (2016)
  2. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     Formula 1 constructors' statistics 2016
     | Constructor | Engine   | Country | Base    |
     | Ferrari     | Ferrari  | Italy   | Italy   |
     | Force India | Mercedes | India   | UK      |
     | Haas        | Ferrari  | US      | US & UK |
     | Manor       | Mercedes | UK      | UK      |
     | …           | …        | …       | …       |
  3. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     • Table caption: "Formula 1 constructors' statistics 2016"
  4. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     • Core column (subject column): the Constructor column. We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.
  5. THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

     • Heading column labels (table schema): Constructor, Engine, Country, Base
  6. #3 TABLE COMPLETION

     • Generating a ranked list of suggestions for the next row, column, and cell
     • Example: the Oscar "Actor in a Leading Role" table from oscar.go.com/winners (columns Year, Actor, Film; rows for 2013-2016)
       A. Add entity (next row): 1. 2017  2. 2018
       B. Add column: 1. Role(s)  2. Director(s)
  7. EXPERIMENTAL SETTING

     • Data sources
       • Table corpus: 1.6M tables extracted from Wikipedia
       • Knowledge base: DBpedia 2015-10 (4.6M entities)
     • Evaluation measures
       • Standard IR measures (MAP, MRR, NDCG)
  8. TABLE SEARCH

     • Two retrieval tasks, with tables as results
       • Ad hoc table retrieval
       • Query-by-table
  9. TASK

     • Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus
     • Example: the query "Singapore" returns tables such as "Singapore - Wikipedia, Economy Statistics (Recent Years)" (GDP and GNI figures per year) and "Singapore - Wikipedia, Language used most frequently at home" (language, color in figure, percent), both from https://en.wikipedia.org/wiki/Singapore
  10. APPROACHES • Unsupervised methods • Build a document-based representation for

    each table, then employ conventional document retrieval methods • Supervised methods • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")
  11. UNSUPERVISED METHODS

     • Single-field document representation: all table content, no structure
     • Multi-field document representation: separate document fields for the embedding document's title, section title, table caption, table body, and table headings (see the sketch below)
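A minimal sketch of the two document representations, assuming tables are available as simple Python dicts; the field names follow the slide, but the exact indexing scheme used in the underlying work may differ.

```python
# Hypothetical helper: flatten a parsed table into the two document
# representations described above, so that a standard document retrieval
# model (BM25, language models) can index them.

def table_to_single_field_doc(table):
    """Single-field: all table content, no structure."""
    parts = [table["page_title"], table["section_title"], table["caption"]]
    parts += table["headings"]
    parts += [cell for row in table["rows"] for cell in row]
    return " ".join(p for p in parts if p)

def table_to_multi_field_doc(table):
    """Multi-field: separate fields, so per-field weights can be tuned."""
    return {
        "page_title": table["page_title"],
        "section_title": table["section_title"],
        "caption": table["caption"],
        "headings": " ".join(table["headings"]),
        "body": " ".join(cell for row in table["rows"] for cell in row),
    }

example = {
    "page_title": "Formula One",
    "section_title": "Constructors",
    "caption": "Formula 1 constructors' statistics 2016",
    "headings": ["Constructor", "Engine", "Country", "Base"],
    "rows": [["Ferrari", "Ferrari", "Italy", "Italy"],
             ["Force India", "Mercedes", "India", "UK"]],
}
print(table_to_multi_field_doc(example)["caption"])
```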
  12. SUPERVISED METHODS

     • Three groups of features (a feature-extraction sketch follows below)
       • Query features: #query terms, query IDF scores
       • Table features
         • Table properties: #rows, #cols, #empty cells, etc.
         • Embedding document: link structure, number of tables, etc.
       • Query-table features: query terms found in different table elements, LM score, etc.
     • Our novel semantic matching features
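A hedged sketch of the learning-to-rank setup: each query-table pair is turned into a feature vector and a pointwise regressor is trained on graded relevance labels. The placeholder feature extractor and the choice of scikit-learn's RandomForestRegressor are illustrative; the actual feature set and learner in the underlying work may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def extract_features(query, table):
    # Placeholder features: #query terms, #rows, #headings, and term overlap.
    # The real feature set also includes IDF scores, #empty cells, page link
    # structure, language-model scores, and the semantic matching features.
    q_terms = query.lower().split()
    text = " ".join(table["headings"] + [c for r in table["rows"] for c in r]).lower()
    overlap = sum(t in text for t in q_terms)
    return [len(q_terms), len(table["rows"]), len(table["headings"]), overlap]

# Toy training data: (query, table, graded relevance label in {0, 1, 2}).
t1 = {"headings": ["Constructor", "Engine"], "rows": [["Ferrari", "Ferrari"], ["Haas", "Ferrari"]]}
t2 = {"headings": ["Language", "Percent"], "rows": [["English", "36.9%"]]}
training_pairs = [("formula 1 constructors", t1, 2), ("formula 1 constructors", t2, 0)]

X = np.array([extract_features(q, t) for q, t, _ in training_pairs])
y = np.array([label for _, _, label in training_pairs])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# At query time: score candidate tables, rank by descending predicted relevance.
query, candidates = "formula 1 constructors", [t2, t1]
scores = model.predict(np.array([extract_features(query, t) for t in candidates]))
ranking = [t for _, t in sorted(zip(scores, candidates), key=lambda p: -p[0])]
print([c["headings"][0] for c in ranking])
```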
  13. Can we go beyond lexical matching and improve keyword table

    search performance by incorporating semantic matching?
  14. SEMANTIC MATCHING • Main objective: go beyond term-based matching •

    Three components: 1. Content extraction 2. Semantic representations 3. Similarity measures
  15. SEMANTIC MATCHING 1. CONTENT EXTRACTION

     • The "raw" content of a query/table is represented as a set of terms, which can be words or entities
     • Notation: query terms q1, …, qn; table terms t1, …, tm
  16. SEMANTIC MATCHING 1. CONTENT EXTRACTION

     • The "raw" content of a query/table is represented as a set of terms, which can be words or entities
     • Entity-based extraction:
       • Top-k ranked entities from a knowledge base
       • Entities in the core table column
       • Top-k ranked entities using the embedding document/section title as a query
  17. SEMANTIC MATCHING 2. SEMANTIC REPRESENTATIONS

     • Each of the raw terms q1, …, qn and t1, …, tm is mapped to a semantic vector representation q̃1, …, q̃n and t̃1, …, t̃m
  18. SEMANTIC REPRESENTATIONS

     • Bag-of-concepts (sparse discrete vectors); a toy sketch follows below
       • Bag-of-entities: each vector element t̃i[j] corresponds to an entity; t̃i[j] is 1 if there exists a link between entities i and j in the KB
       • Bag-of-categories: each vector element t̃i[j] corresponds to a Wikipedia category; t̃i[j] is 1 if entity i is assigned to Wikipedia category j
     • Embeddings (dense continuous vectors)
       • Word embeddings: word2vec (300 dimensions, trained on Google News)
       • Graph embeddings: RDF2vec (200 dimensions, trained on DBpedia)
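A toy sketch of the bag-of-entities and bag-of-categories representations; the mini knowledge base (entity links and category assignments) below is made up purely for illustration.

```python
KB_ENTITIES = ["Ferrari", "Scuderia_Ferrari", "Maranello", "Mercedes"]
KB_CATEGORIES = ["Formula_One_constructors", "Italian_racecar_constructors"]

# Made-up KB link and category-assignment data (for illustration only).
LINKS = {("Scuderia_Ferrari", "Ferrari"), ("Scuderia_Ferrari", "Maranello")}
CATEGORIES = {"Scuderia_Ferrari": {"Formula_One_constructors",
                                   "Italian_racecar_constructors"}}

def bag_of_entities(entity):
    # Element j is 1 iff there is a KB link between `entity` and entity j.
    return [1 if (entity, e) in LINKS or (e, entity) in LINKS else 0
            for e in KB_ENTITIES]

def bag_of_categories(entity):
    # Element j is 1 iff `entity` is assigned to Wikipedia category j.
    assigned = CATEGORIES.get(entity, set())
    return [1 if c in assigned else 0 for c in KB_CATEGORIES]

print(bag_of_entities("Scuderia_Ferrari"))    # [1, 0, 1, 0]
print(bag_of_categories("Scuderia_Ferrari"))  # [1, 1]
```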
  19. SEMANTIC MATCHING 3. SIMILARITY MEASURES

     • Semantic matching is performed between the query's semantic vectors q̃1, …, q̃n and the table's semantic vectors t̃1, …, t̃m
  20. SEMANTIC MATCHING: EARLY FUSION MATCHING STRATEGY

     • Early: take the centroid of the query's and of the table's semantic vectors and compute their cosine similarity
  21. SEMANTIC MATCHING: LATE FUSION MATCHING STRATEGY

     • Late: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max); a sketch of both strategies follows below
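A small sketch of the two matching strategies, assuming the semantic vectors (e.g., word2vec or RDF2vec embeddings) have already been obtained for the query and table terms; the random vectors stand in for real embeddings.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def early_fusion(query_vecs, table_vecs):
    # Early: centroid of each side's semantic vectors, then cosine similarity.
    return cos(np.mean(query_vecs, axis=0), np.mean(table_vecs, axis=0))

def late_fusion(query_vecs, table_vecs, aggr=np.mean):
    # Late: all pairwise query-term / table-term cosine similarities,
    # aggregated with sum, avg, or max (np.mean plays the role of "avg").
    sims = [cos(q, t) for q in query_vecs for t in table_vecs]
    return float(aggr(sims))

# Random vectors stand in for real word2vec / RDF2vec embeddings.
rng = np.random.default_rng(0)
q_vecs = rng.normal(size=(3, 300))   # semantic vectors of 3 query terms
t_vecs = rng.normal(size=(5, 300))   # semantic vectors of 5 table terms
print(early_fusion(q_vecs, t_vecs), late_fusion(q_vecs, t_vecs, np.max))
```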
  22. EXPERIMENTAL SETUP

     • Table corpus: WikiTables corpus1, 1.6M tables extracted from Wikipedia
     • Knowledge base: DBpedia (2015-10), 4.6M entities with an English abstract
     • Queries: sampled from two sources2,3
       | QS-1            | QS-2                      |
       | video games     | asian countries currency  |
       | us cities       | laptops cpu               |
       | kings of africa | food calories             |
       | economy gdp     | guitars manufacturer      |
     • Rank-based evaluation: NDCG@5, 10, 15, 20
     1 Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
     2 Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
     3 Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)
  23. RELEVANCE ASSESSMENTS

     • Collected via crowdsourcing
     • Pooling to depth 20; 3120 query-table pairs in total
     • Assessors are presented with the following scenario: "Imagine that your task is to create a new table on the query topic"
     • A table is …
       • Non-relevant (0): if it is unclear what it is about, or it is about a different topic
       • Relevant (1): if some cells or values could be used from it
       • Highly relevant (2): if large blocks or several values could be used from it
  24. RESEARCH QUESTIONS • RQ1: Can semantic matching improve retrieval performance?

    • RQ2: Which of the semantic representations is the most effective? • RQ3: Which of the similarity measures performs best?
  25. RESULTS: RQ1

     | Method                         | NDCG@10 | NDCG@20 |
     | Single-field document ranking  | 0.4344  | 0.5254  |
     | Multi-field document ranking   | 0.4860  | 0.5473  |
     | WebTable1                      | 0.2992  | 0.3726  |
     | WikiTable2                     | 0.4766  | 0.5206  |
     | LTR baseline                   | 0.5456  | 0.6031  |
     | STR (LTR + semantic matching)  | 0.6293  | 0.6825  |
     1 Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
     2 Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.
     • Can semantic matching improve retrieval performance?
       • Yes. STR achieves substantial and significant improvements over LTR.
  26. RESULTS: RQ3 • Which of the similarity measures performs best?

    • Late-sum and Late-avg (but it also depends on the representation)
  27. TASK

     • Query-by-table: given an input table, return a ranked list of relevant tables
     • Boils down to table matching: computing the similarity between a pair of tables
     • Example: the input table "MotoGP World Standing 2017" (Pos., Rider, Bike, Points) matches tables such as "Grand Prix motorcycle racing World champions" and "MotoGP 2016 Championship Final Standing"
  28. PREVIOUS APPROACHES

     • Extracting a keyword query (from various table elements) and scoring tables against that query
     • Splitting tables into various elements and performing element-wise matching
       • Ad hoc similarity measures, tailor-made for each table element
       • Lacking a principled way of combining element-level similarities
     • Matching elements of different types has not been explored
  29. Can we develop an effective and theoretically sound table matching

    framework for measuring and combining table element level similarity, without resorting to hand-crafted features?
  30. APPROACH

     1. Represent table elements in multiple semantic spaces
     2. Measure element-level similarity in each of the semantic spaces y:
        \phi_i(\tilde{T}, T) = \mathrm{sim}(\tilde{T}^y_{x_1}, T^y_{x_2})
        • x1 = x2: element-wise matching
        • x1 != x2: cross-element matching
     3. Combine the element-level similarities in a discriminative learning framework (see the sketch below)
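A rough sketch of how the element-level similarities could be assembled into a feature vector for the discriminative learner; ELEMENTS, SPACES, and element_similarity are illustrative placeholders, not the exact feature set of the thesis.

```python
# Illustrative only: one feature per (input-table element, candidate-table
# element, semantic space) combination. With x1 == x2 only, this yields the
# element-wise features; allowing x1 != x2 adds the cross-element features.
import numpy as np

ELEMENTS = ["topic", "entities", "headings", "data"]   # Tt, TE, Th, TD
SPACES = ["word", "graph", "entity"]                   # embedding spaces

def element_similarity(input_table, cand_table, elem_a, elem_b, space):
    # Placeholder: in the real system this would be the early- or late-fusion
    # similarity of the two elements' semantic vectors in the given space.
    return 0.0

def matching_features(input_table, cand_table, cross_element=True):
    feats = []
    for space in SPACES:
        for a in ELEMENTS:
            for b in ELEMENTS:
                if a == b or cross_element:
                    feats.append(element_similarity(input_table, cand_table, a, b, space))
    return np.array(feats)

# The feature vector is then fed to a supervised ranker (e.g., a random forest).
print(matching_features({}, {}).shape)          # 3 spaces x 4 x 4 = 48 features
print(matching_features({}, {}, False).shape)   # element-wise only: 12 features
```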
  31. TABLE ELEMENTS

     Illustrated on the "Formula 1 constructors' statistics 2016" table:
     • Table topic Tt (caption + page title)
     • Table entities TE (core column)
     • Table headings Th
     • Table data TD
  32. 1. REPRESENTING TABLE ELEMENTS IN SEMANTIC SPACES

     • Word embeddings • Graph embeddings • Entity embeddings
     • A table element Tx = [t1, ..., tN] in the term space is mapped to a matrix of semantic vectors [t^y_1, ..., t^y_N] in semantic space y
     • Term weights: words use TF-IDF weights; entities use presence/absence (0/1)
  33. 2. MEASURING ELEMENT-LEVEL SIMILARITY

     • Early fusion: take the weighted centroid of the term-level semantic vectors of each element and compute the cosine similarity of the centroid vectors:
       sim_{early}(\tilde{T}_{x_1}, T_{x_2}) = \cos(\tilde{C}^y_{x_1}, C^y_{x_2})
     • Late fusion: compute the cosine similarities between all pairs of semantic vectors, then aggregate (sum, avg, or max):
       sim_{late}(\tilde{T}_{x_1}, T_{x_2}) = \mathrm{aggr}(\{\cos(\tilde{t}, t) : \tilde{t} \in \tilde{T}^y_{x_1}, t \in T^y_{x_2}\})
  34. EXPERIMENTAL SETUP • WikiTables corpus, DBpedia as knowledge base •

    50 tables sampled • Diverse topics (sports, music, films, food, celebrities, geography, politics, etc.) • Relevance assessments • (2) highly relevant: it is about the same topic as the input table, but contains additional novel content that is not present in the input table • (1) relevant: on-topic, but it contains limited novel content • (0) non-relevant • Fleiss Kappa = 0.6703 (substantial agreement)
  35. RESULTS

     (The slide also marks, for each method, which feature groups are used: element-wise matching, cross-element matching, and table features.)
     | Method | NDCG@5 | NDCG@10 |
     | HCF-1  | 0.5382 | 0.5542  |
     | HCF-2  | 0.5895 | 0.6050  |
     | CRAB-1 | 0.5578 | 0.5672  |
     | CRAB-2 | 0.6172 | 0.6267  |
     | CRAB-3 | 0.5140 | 0.5282  |
     | CRAB-4 | 0.5804 | 0.6027  |
  36. RESULTS

     (Same results table as on slide 35.)
  37. RESULTS

     (Same results table as on slide 35.)
  38. RESEARCH QUESTIONS

     • RQ1: Which of the semantic representations (word-based, graph-based, or entity-based) is the most effective for modeling table elements?
     • RQ2: Which of the two element-level matching strategies performs better, element-wise or cross-element?
     • RQ3: How much do different table elements contribute to retrieval performance?
  39. RESULTS: RQ1

     • Which of the semantic representations is the most effective?
       • Entity-based. Also, they are complementary
  40. RESULTS: RQ2

     • Which of the two element-level matching strategies performs better, element-wise or cross-element?
       • Element-wise (cf. the results table on slide 35)
  41. RESULTS: RQ2 • There are several cases where cross-element matching

    yields higher scores than element-wise matching
  42. ANALYSIS: INPUT TABLE SIZE

     (Figure: performance under a horizontal split of the input table (x%) and a vertical split of the input table (x%).)
  43. TASK

     • On-the-fly table generation: answer a free text query with a relational table, where
       • the core column lists all relevant entities;
       • columns correspond to attributes of those entities;
       • cells contain the values of the corresponding entity attributes.
     • Example: the query "Video albums of Taylor Swift" is answered by a table with columns Title, Released date, Label, Formats (rows such as "Journey to Fearless", "Speak Now World Tour-Live", "The 1989 World Tour Live")
  44. APPROACH

     • Three components, driven by the query q: core column entity ranking (E), schema determination (S), and value lookup (V)
     • Core column entity ranking and schema determination could potentially mutually reinforce each other
  45. KNOWLEDGE BASE ENTRY

     (Figure: anatomy of a knowledge base entry: entity name, entity type, description, and property-value pairs; the parts are referred to as e_d, e_p, and e_a.)
  46. VALUE LOOKUP

     • A catalog of possible entity attribute-value pairs
     • Stored as ⟨entity, schema label, value, provenance⟩ quadruples ⟨e, s, v, p⟩, with values collected both from the knowledge base (KB) and from the table corpus (TC)
  47. VALUE LOOKUP

     • Finding a cell's value is a lookup in that catalog (a minimal sketch follows below):
       score(v, e, s, q) = \max_{\langle s', v, p \rangle \in e_V} match(s, s') \times conf(p, q)
     • match(s, s'): soft string matching, e.g., "birthday" vs. "date of birth", "country" vs. "nationality"
     • conf(p, q): matching confidence; KB takes priority over TC, and for TC it is based on the corresponding table's relevance to the query
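A minimal sketch of the lookup, assuming the catalog stores ⟨schema label, value, provenance⟩ triples per entity; match() and conf() below are crude stand-ins for the soft string matching and provenance-based confidence described on the slide.

```python
def match(s, s_prime):
    # Soft label matching (e.g., "birthday" vs. "date of birth"); here just a
    # crude token-overlap score for illustration.
    a, b = set(s.lower().split()), set(s_prime.lower().split())
    return len(a & b) / max(len(a | b), 1)

def conf(provenance, query):
    # Matching confidence: KB entries take priority over table-corpus (TC)
    # entries; for TC entries this would depend on the source table's
    # relevance to the query (ignored in this toy stand-in).
    return 1.0 if provenance.startswith("kb:") else 0.5

def score_value(value, catalog_entries, s, query):
    # score(v, e, s, q) = max over <s', v, p> entries of match(s, s') * conf(p, q)
    return max((match(s, sp) * conf(p, query)
                for sp, v, p in catalog_entries if v == value), default=0.0)

# Toy catalog entries for one entity: <schema label, value, provenance>.
entries = [("date of birth", "1981-07-29", "kb:dbpedia"),
           ("birthday", "1981-07-29", "tc:table#123")]
print(score_value("1981-07-29", entries, "birthday", "example query"))  # 0.5
```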
  48. EXPERIMENTAL SETUP • Table corpus • WikiTables corpus: 1.6M tables

    extracted from Wikipedia • Knowledge base • DBpedia (2015-10): 4.6M entities with an English abstract • Two query sets • Rank-based metrics • NDCG for core column entity ranking and schema determination • MAP/MRR for value lookup
  49. QUERY SET 1 (QS-1)

     • List queries from the DBpedia-Entity v2 collection1 (119)
       • "all cars that are produced in Germany"
       • "permanent members of the UN Security Council"
       • "Airlines that currently use Boeing 747 planes"
     • Core column entity ranking: highly relevant entities from the collection
     • Schema determination: crowdsourcing, 3-point relevance scale, 7k query-label pairs
     • Value lookup: crowdsourcing, a sample of 25 queries, 14k cell values
     1 Hasibi et al. DBpedia-Entity v2: A Test Collection for Entity Search. In: SIGIR '17.
  50. QUERY SET 2 (QS-2)

     • Entity-relationship queries from the RELink Query Collection1 (600)
     • Queries are answered by entity tuples (pairs or triplets); that is, each query is answered by a table with 2 or 3 columns (including the core entity column)
     • Queries and relevance judgments are obtained automatically from Wikipedia lists that contain relational tables
     • Human annotators were asked to formulate the corresponding information need as a natural language query
       • "Find peaks above 6000m in the mountains of Peru"
       • "Which countries and cities have accredited Armenian ambassadors?"
       • "Which anti-aircraft guns were used in ships during war periods and what country produced them?"
     1 Saleiro et al. RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval. In: SIGIR '17.
  51. CORE COLUMN ENTITY RANKING (QUERY-BASED)

     |                | QS-1 NDCG@5 | QS-1 NDCG@10 | QS-2 NDCG@5 | QS-2 NDCG@10 |
     | LM             | 0.2419      | 0.2591       | 0.0708      | 0.0823       |
     | DRRM_TKS (e_d) | 0.2015      | 0.2028       | 0.0501      | 0.0540       |
     | DRRM_TKS (e_p) | 0.1780      | 0.1808       | 0.1089      | 0.1083       |
     | Combined       | 0.2821      | 0.2834       | 0.0852      | 0.0920       |
  52. CORE COLUMN ENTITY RANKING (SCHEMA-ASSISTED) QS-1 QS-2 • R #0:

    without schema information (query only) • R #1-#3: with automatic schema determination (top 10) • Oracle: with ground truth schema
  53. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion
  54. TASK

     • Row population: the task of generating a ranked list of entities to be added to the core column of a given seed table
     • Example: for the "Formula 1 constructors' statistics 2016" seed table, the "Add entity" suggestions are 1. McLaren  2. Mercedes  3. Red Bull
  55. APPROACH

     1. Candidate selection
       • From the knowledge base: entities that are of the same type(s) or belong to the same categories as the seed entities
       • From the table corpus: entities from related tables (tables that contain any seed entities, or have a similar table caption or headings)
     2. Entity ranking: a ranked list of suggestions (top-K entities)
  56. APPROACH

     2. Entity ranking: based on the similarity between the candidate entity and various table elements
        P(e|E, L, c) \propto P(e|E) P(L|e) P(c|e)
        where e is the candidate entity, P(e|E) captures entity similarity, P(L|e) column label similarity, and P(c|e) caption similarity
  57. APPROACH

     2. Entity ranking components (a scoring sketch follows below):
        • Entity similarity: P(e|E) = \lambda_E P_{KB}(e|E) + (1 - \lambda_E) P_{TC}(e|E)
        • Column labels likelihood: P(L|e) = \sum_{l \in L} \Big( \lambda_L \prod_{t \in l} P_{LM}(t|\theta_e) + \frac{1 - \lambda_L}{|L|} P_{EM}(l|e) \Big)
        • Caption likelihood: P(c|e) = \prod_{t \in c} \big( \lambda_c P_{KB}(t|\theta_e) + (1 - \lambda_c) P_{TC}(t|e) \big)
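A rough sketch of how the three components could be combined into a single ranking score; the component estimators are placeholder lookups (the real model interpolates knowledge-base and table-corpus estimates as in the formulas above).

```python
# Illustrative scoring of a candidate entity e for row population:
# P(e|E, L, c) ∝ P(e|E) P(L|e) P(c|e), computed in log space to avoid underflow.
import math

def p_entity(candidate, related_score):
    # Placeholder for P(e|E): e.g., interpolated KB/table-corpus relatedness.
    return related_score.get(candidate, 0.01)

def p_labels(candidate, labels, label_model):
    # Placeholder for P(L|e): smoothed likelihood of the seed column labels.
    return math.prod(label_model.get((candidate, l), 0.01) for l in labels)

def p_caption(candidate, caption, caption_model):
    # Placeholder for P(c|e): smoothed likelihood of the caption terms.
    return math.prod(caption_model.get((candidate, t), 0.01) for t in caption.split())

def row_population_score(candidate, labels, caption, related_score, label_model, caption_model):
    return (math.log(p_entity(candidate, related_score))
            + math.log(p_labels(candidate, labels, label_model))
            + math.log(p_caption(candidate, caption, caption_model)))

# Toy data: "McLaren" shares more evidence with the seed table than "Oslo".
related = {"McLaren": 0.4, "Oslo": 0.01}
labels_m = {("McLaren", "Engine"): 0.3, ("McLaren", "Base"): 0.2}
caption_m = {("McLaren", "constructors"): 0.5}
for e in ["McLaren", "Oslo"]:
    print(e, row_population_score(e, ["Engine", "Base"], "formula 1 constructors",
                                  related, labels_m, caption_m))
```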
  58. EXPERIMENTAL DESIGN (ROW POPULATION)

     • Idea: take existing tables and simulate the user at an intermediate step during table completion (see the sketch below)
     • Select a set of (1000) tables randomly that contain at least 6 rows and at least 4 columns
     • For any intermediate step (i rows completed): the first i (1 <= i <= 5) rows are taken as the seed table, and the entities in the remaining rows are the ground truth
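A short sketch of how evaluation instances might be generated from an existing table; the dict layout is assumed, with the core-column entity as the first cell of each row.

```python
def make_row_population_instances(table, max_seed_rows=5):
    """Yield (seed_rows, ground_truth_entities) pairs for i = 1..max_seed_rows."""
    rows = table["rows"]
    for i in range(1, max_seed_rows + 1):
        seed = rows[:i]
        ground_truth = [row[0] for row in rows[i:]]  # core-column entity per row
        yield seed, ground_truth

table = {"rows": [["Ferrari", "Italy"], ["Force India", "India"], ["Haas", "US"],
                  ["Manor", "UK"], ["McLaren", "UK"], ["Mercedes", "Germany"]]}
for seed, truth in make_row_population_instances(table, max_seed_rows=2):
    print(len(seed), "seed rows ->", truth)
```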
  59. TAKE-AWAY POINTS • Both tables and KBs are useful for

    row population • Candidate selection • Category > Type • Entity > Caption > Headings • All complement each other • Entity ranking • Entity > Headings > Caption • All complement each other • Highly relevant to candidate selection
  60. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion: 1. Row population  2. Column population  3. Value finding
  61. TASK

     • Column population: the task of generating a ranked list of column labels to be added to the column headings of a given seed table
     • Example: for the "Formula 1 constructors' statistics 2016" seed table, the "Add column" suggestions are 1. Seasons  2. Races Entered
  62. APPROACH

     1. Candidate selection
       • From the table corpus: column labels from related tables (tables that contain any seed entities, or have a similar table caption or headings)
     2. Label ranking: a ranked list of suggestions (top-K labels)
  63. APPROACH

     2. Label ranking: based on the similarity between the candidate column labels and the table elements (a scoring sketch follows below)
        P(l|E, c, L) = \sum_T P(l|T) P(T|E, c, L)
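A rough sketch of the label-ranking formula; both component estimates are simplified placeholders for the estimators used in the thesis.

```python
# Illustrative scoring of a candidate column label l:
# P(l|E, c, L) = sum over related tables T of P(l|T) * P(T|E, c, L).

def p_label_given_table(label, table):
    # Placeholder for P(l|T): uniform over the table's headings if present.
    return 1.0 / len(table["headings"]) if label in table["headings"] else 0.0

def p_table_relevance(table, seed_entities, seed_labels):
    # Placeholder for P(T|E, c, L): entity and label overlap with the seed
    # table (caption similarity omitted for brevity); not normalized here.
    return (len(set(table["entities"]) & set(seed_entities))
            + len(set(table["headings"]) & set(seed_labels)))

def column_population_score(label, related_tables, seed_entities, seed_labels):
    return sum(p_label_given_table(label, T) * p_table_relevance(T, seed_entities, seed_labels)
               for T in related_tables)

T1 = {"headings": ["Constructor", "Seasons", "Races Entered"], "entities": ["Ferrari", "Haas"]}
T2 = {"headings": ["Driver", "Points"], "entities": ["Ferrari"]}
print(column_population_score("Seasons", [T1, T2], ["Ferrari", "Manor"], ["Constructor", "Engine"]))
```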
  64. EXPERIMENTAL DESIGN (COLUMN POPULATION)

     • Idea: take existing tables and simulate the user at an intermediate step during table completion
     • Select a set of (1000) tables randomly that contain at least 6 rows and at least 4 columns
     • For any intermediate step (j columns completed): the first j (1 <= j <= 3) column labels are taken as the seed table, and the labels of the remaining columns are the ground truth
  65. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion: 1. Row population  2. Column population  3. Value finding
  66. TASK

     • Cell value finding: given an input relational table, find the value of a specific cell (identified by the entity in the core column and the column heading label), or (optionally) determine if the cell should be left empty
     • Example: in the "Oscar Best Actor" table (Year, Actor, Film, Role(s)), the system suggests Role(s) values for Casey Affleck (1. Lee Chandler, 2. Ray Sybert) and Film values for Gary Oldman (1. Darkest Hour, 2. Tinker Tailor Soldier Spy, 3. Nil by Mouth), each with supporting source URLs
  67. Novel aspects:

     1. Enabling a cell to have multiple, possibly conflicting values
     2. Supplementing the predicted values with supporting evidence
     3. Combining evidence from multiple sources
     4. Handling the case where the cell should be left empty
  68. APPROACH

     1. Candidate value finding
       • From the knowledge base: heading-to-predicate matching, e.g., "location" vs. <dbp:location>, <dbp:city>, <dbp:country>
       • From the table corpus: heading-to-heading matching, i.e., identify other table columns that have the same meaning, e.g., nation vs. country, or "Last Won" vs. "Most recent win"
       • Value normalization, e.g., <dbr:Legion_(Red_Dwarf)> <dbp:airdate> "1993-10-14" vs. the table cell "Original air date: 14 October 1993"
     2. Value ranking: a ranked list of suggestions (top-K values)
  69. (System overview figure.)

     • Input: (e, h, T), an entity, a heading label, and the input table
     • Candidate finding (Sect. 4): heading-to-heading matching against the table corpus and heading-to-predicate matching against the knowledge base, yielding candidates {(v; e, h', T')} and {(v; e, p)}
     • Value ranking (Sect. 5): TC value ranking and KB value ranking are combined into KB+TC value ranking
     • Output: a ranked list of values ordered by score(v; e, h, T)
  70. APPROACH

     2. Value ranking: a ranked list of suggestions (top-K values), combining evidence in a feature-based approach
       • Feature I: degree of support for the given value across the different evidence sources
       • Feature II: empty value prediction
       • Feature III: semantic relatedness between the input table and candidate tables (where the value originates from)
  71. EXPERIMENTAL DESIGN

     • Idea: conceal cell values from existing tables
       • Randomly select an existing table
       • Pick a table column
       • Remove n cells randomly from this column
     • Evaluate using crowdsourcing: given the input table, the value, and a source document, does this appear to be the correct value for the missing cell?
  72. EXPERIMENTAL RESULTS

     Value finding performance in terms of NDCG@5:
     | Method           | Empty values excluded | Empty values included |
     | Baseline         | 0.585                 | 0.518                 |
     | Feature I        | 0.664                 | 0.576                 |
     | Feature II       | 0.684                 | 0.590                 |
     | Feature I+II+III | 0.757                 | 0.671                 |
  73. OVERVIEW

     • Table Search: 1. Ad hoc table retrieval  2. Query-by-table
     • Table Generation: 1. On-the-fly table generation
     • Table Completion: 1. Row population  2. Column population  3. Value finding
  74. CONTRIBUTIONS OF THIS THESIS

     • Tasks: a collection of new tasks, including query-by-table, table generation, row and column population, and cell value finding with evidence
     • Methods: novel methods for table search, table generation, and table completion
     • Resources: a series of data resources made publicly available for reproducibility, including code, run files, and high-quality human annotations obtained via annotators or crowdsourcing

     | Chapter    | Description                                                                      | Link                                            |
     | 3          | Test collection, feature file, and run files related to keyword table search    | https://github.com/iai-group/www2018-table      |
     | 5          | Code and test collections related to table completion                           | https://github.com/iai-group/sigir2017-table    |
     | 6          | Test collections, feature files, and run files related to table generation      | https://github.com/iai-group/sigir2018-table    |
     | 7          | Test collections, feature files, and run files related to table cell completion | https://github.com/iai-group/cikm2019-table     |
     | Appendix A | Code of the SmartTable demo                                                      | https://github.com/iai-group/SmartTable, http://smarttable.cc/ |