
Computational Semantics and Evaluation Benchmark
for Interrogative Sentences via Combinatory Categorial Grammar

hfunakura
December 04, 2023


Hayate Funakura and Koji Mineshima, Computational Semantics and Evaluation Benchmark
for Interrogative Sentences via Combinatory Categorial Grammar, PACLIC 37, 2023.

Conference website: https://paclic2023.github.io/


Transcript

  1. December 4, 2023, PACLIC 37. Computational Semantics and Evaluation Benchmark for Interrogative Sentences via Combinatory Categorial Grammar. Hayate Funakura (Kyoto University), Koji Mineshima (Keio University). This work is partially supported by JST CREST grant number JPMJCR2114.
  2. Agenda • Background • Proposals • Benchmark • Software • Evaluation and annotation • Future prospects
  3. Background • Formal semantics has been developed since the 1970s. • Various analyses have been proposed for individual phenomena. • Argument wh-questions (Hirsch 2019, Xiang 2020, etc.) • Adjective wh-questions (Nelken+ 1998, Tellings 2019, etc.) • Polar/alternative questions (Biezma 2011, Roelofsen+ 2015, etc.) • Embedded questions (Lahiri 2002, Ciardelli+ 2013, etc.)
  4. Background • Formal semantics has been developed since the 1970s. • Various analyses have been proposed for individual phenomena. • Argument wh-questions (Hirsch 2019, Xiang 2020, etc.) • Adjective wh-questions (Nelken+ 1998, Tellings 2019, etc.) • Polar/alternative questions (Biezma 2011, Roelofsen+ 2015, etc.) • Embedded questions (Lahiri 2002, Ciardelli+ 2013, etc.) → Data and a system are needed for the unified evaluation of various analyses.
  5. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  6. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  7. • Creating benchmarks for evaluating theoretical linguistics began with the FraCaS test suite (Cooper+ 1996) • A collection of inference problems in the format of recognizing textual entailment (RTE) • Covers a wide range of linguistic phenomena, e.g. GQs, plurals, anaphora, etc. Evaluation benchmark — Background
  8. • Subsequent benchmarks have been proposed, but questions are less addressed • FraCaS covers only polar questions • Watanabe+ (2019) provide a benchmark for question semantics, but the dataset is limited in variation: it does not include wh-words other than who, and there are no instances where the object is a wh-word. Evaluation benchmark — Background
  9. • We have created a benchmark, QSEM, for evaluating the syntax-semantics interface for various types of questions. • QSEM tests understanding of the following: • Quantificational expressions • Multiple wh-questions • Scope ambiguity • Near-real text wh-questions Evaluation benchmark — Proposal
  10. • Quantificational expressions: Pairs of polar questions and responses were extracted from sections 1.1 and 1.2 of FraCaS as samples for quantification. • Multiple wh-questions: The QA pairs on multiple wh-questions are taken from Dayal (2016). • Scope ambiguity: Samples for scope ambiguity were taken from Chierchia (1993) and Krifka (2003). • Near-real text wh-questions: We randomly sampled questions from the SQuAD dataset as more real-text-like data and created sentential answers based on the non-sentential answers in the original dataset. Evaluation benchmark — Proposal
  11. • QSEM contains P(remises), Q(uestion), and a label (yes/no/unk); one item is sketched in code below.
P1: Bill made every dish. P2: Bill is a boy. Q: Which boy made every dish? Label: yes (ID: 053, created based on examples from Krifka 2003)
yes: the premises directly answer the question / no: the premises negate the presupposition of the question / unknown: none of the above
Evaluation benchmark — Proposal
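For concreteness, here is a minimal sketch of the QSEM item above as a Python data structure; the field names are our own illustrative assumptions, not the released file format.

```python
# A minimal sketch of one QSEM item (ID 053 above) as a Python dict.
# Field names are illustrative assumptions, not the released QSEM format.
qsem_item = {
    "id": "053",
    "premises": ["Bill made every dish.", "Bill is a boy."],
    "question": "Which boy made every dish?",
    "label": "yes",  # one of: yes / no / unknown
    "source": "created based on examples from Krifka 2003",
}

def gloss_label(label: str) -> str:
    """Return the informal definition of a QSEM label, as given on the slide."""
    return {
        "yes": "The premises directly answer the question.",
        "no": "The premises negate the presupposition of the question.",
        "unknown": "None of the above.",
    }[label]
```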
  12. • Scope ambiguity: Ambiguity arises in wh-questions when a certain quantificational expression is the subject.
P: Bill likes Smith and Sue likes Jones. Q: Who does everyone like? Label: yes (ID: 048, created based on examples from Krifka 2003)
P: Everyone likes Smith. Q: Who does everyone like? Label: yes (ID: 049, created based on examples from Krifka 2003)
Evaluation benchmark — Proposal
  13. • Scope ambiguity: Ambiguity arises in wh-questions when a certain quantificational expression is the subject.
P: Bill likes Smith and Sue likes Jones. Q: Who does everyone like? Label: yes (ID: 048, created based on examples from Krifka 2003) [∀ > wh]
P: Everyone likes Smith. Q: Who does everyone like? Label: yes (ID: 049, created based on examples from Krifka 2003) [wh > ∀]
Evaluation benchmark — Proposal (the two readings are glossed below)
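As a gloss (our own reconstruction, not notation from the paper): under ∀ > wh the question asks, for each person, whom that person likes, which the pair-list premise of ID 048 answers; under wh > ∀ it asks for a single individual liked by everyone, which the premise of ID 049 answers. The underlying quantifier scopes in the answer conditions:

```latex
% Scope gloss for "Who does everyone like?" (our reconstruction):
% forall > wh (pair-list reading, cf. ID 048):
%   each person y likes some x, possibly a different x per person.
% wh > forall (individual reading, cf. ID 049):
%   there is a single x that every person likes.
\forall y.\, \exists x.\, \mathrm{Like}(y, x)
\qquad\text{vs.}\qquad
\exists x.\, \forall y.\, \mathrm{Like}(y, x)
```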
  14. Table: Comparison of existing benchmarks and ours
Benchmark | Types of questions | Size
FraCaS | polar | 346
Watanabe+ 2019 | who, polar, alternative | 49
Ours | who, which, what, when, where, polar | 138
Evaluation benchmark — Proposal
  15. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  16. • Several implementations of semantic parsing have been proposed. • Each of these systems depends on a specific semantic representation. • There is still no system that supports the diverse analyses of questions.
System | Semantic representation
NatLog (MacCartney+ 2007) | Natural logic (MacCartney+ 2007)
Boxer (Bos 2008) | DRT
LangPro (Abzianidze 2015) | Natural logic (Muskens 2010)
ccg2lambda (Mineshima+ 2015) | Higher-order logic
Software — Background
  17. • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface.
Pipeline: Benchmark (P1, P2, …, H) → CCG parser (e.g. depccg) → CCG tree → Semantic composition (using a semantic template) → MR for P1, MR for P2, MR for Q → Prover (e.g. Coq) → yes/no/unk
Software — Background
  18. • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface.
Same pipeline; the semantic composition step draws on lexical meaning representations and semantic rules via the semantic template.
Software — Background
  19. • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface.
Same pipeline; the meaning representations (MRs) are formulas of higher-order logic.
Software — Background
  20. • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface.
Same pipeline; the output labels are: yes (the Ps entail the H), no (the Ps contradict the H), unk (none of the above).
Software — Background
  21. • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface.
Same pipeline; its high customizability has supported FraCaS (RTE, Mineshima+ 2015), SICK (Yanaka+ 2018), etc. A schematic code sketch of the pipeline follows below.
Software — Background
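To make the data flow concrete, here is a minimal sketch of the pipeline in Python; all function names and signatures are our own illustrative stubs, not the actual ccg2lambda or ccg2hol API.

```python
# A schematic sketch of the ccg2lambda-style pipeline (slides 17-21).
# All names below are illustrative stubs, not the real ccg2lambda/ccg2hol API.
from typing import List

def parse_ccg(sentence: str):
    """Run a CCG parser (e.g. depccg) and return a CCG derivation tree."""
    ...

def compose_semantics(ccg_tree, semantic_template):
    """Traverse the CCG tree, assigning lexical meaning representations and
    applying semantic rules from the template; return a higher-order logic
    meaning representation (MR)."""
    ...

def prove(premise_mrs: List[object], hypothesis_mr: object) -> str:
    """Call a prover (e.g. Coq): 'yes' if the Ps entail H, 'no' if the Ps
    contradict H, 'unk' otherwise."""
    ...

def judge(premises: List[str], hypothesis: str, semantic_template) -> str:
    """End to end: sentences -> CCG trees -> MRs -> yes/no/unk."""
    trees = [parse_ccg(s) for s in premises + [hypothesis]]
    mrs = [compose_semantics(t, semantic_template) for t in trees]
    return prove(mrs[:-1], mrs[-1])
```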
  22. • We have proposed an extended version of ccg2lambda, and call this system ccg2hol. • The new features: • Utilization of semantic tags (Abzianidze+ 2017) as a resource for judging lexical meaning representations • Support for various analyses of question semantics • Our system allows us to design more flexible lexical meaning representations, and gives a unified evaluation platform for question semantics. Software — Proposal
  23. • We have proposed an extended version of ccg2lambda, and call this system ccg2hol. • The new features: • Utilization of semantic tags (Abzianidze+ 2017) as a resource for judging lexical meaning representations • Support for various analyses of question semantics • Our system allows for: • more flexible design for lexical meaning • unified evaluation for question semantics Software — Proposal
  24. • Note: • ccg2hol is not a question-specific system • It supports a wide range of questions as well as other constructions. Software — Proposal
  25. • System output for Who does John like? (see our paper for details of the analysis) Shown: CCG category, HOL (as we defined it), standard logical expression, DRS (not displayed here) Software — Proposal
  26. • System output for Who does John like? (see our paper for details of the analysis) ∃x. ∃e. [Like(e) ∧ Subj(e) = John ∧ Obj(e) = x] Software — Proposal
  27. • Our system can realize other analyses (e.g. Karttunen semantics, inquisitive semantics) λp. ∃x. [p(w_a) ∧ p = λw. […]] Software — Proposal
  28. Software — Proposal • Our theoretical assumptions • Various analyses can be performed depending on the definition of the operators 𝖰 and ?.
• Who does John like? ⟶ 𝖰(λx. ∃e. [Like(e) ∧ Subj(e) = John ∧ Obj(e) = x])
• Does John smoke? ⟶ ?(∃e. [Smoke(e) ∧ Subj(e) = John])
One possible instantiation of 𝖰 is sketched below.
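For concreteness, here is one Karttunen-style instantiation of 𝖰, reconstructed by us to match the output shape on slide 27; it is a sketch, not necessarily the paper's official definition.

```latex
% A Karttunen-style definition of Q (our reconstruction): a question denotes
% the set of its true propositional answers, with w_a the actual world.
\mathsf{Q} \;:=\; \lambda P.\, \lambda p.\, \exists x.\,
  \bigl[\, p(w_a) \,\wedge\, p = \lambda w.\, P(x)(w) \,\bigr]
```

Applying this 𝖰 to the intensionalized property λx. λw. ∃e. [Like(e) ∧ Subj(e) = John ∧ Obj(e) = x] yields a term of the form λp. ∃x. [p(w_a) ∧ p = λw. […]], matching the Karttunen-style output on slide 27.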
  29. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  30. • By the following procedure, we conducted the evaluation of our analysis and the annotation of QSEM in parallel (a schematic loop follows below): QSEM → ccg2hol → manual correction of semtags.
68/138: ✅ CCG tree ✅ MR ✅ inference. 70/138: not annotated, primarily due to parsing errors.
* For error analysis, please refer to the paper. Evaluation and annotation
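Schematically, the parallel evaluation-and-annotation pass might look as follows; the helper names and the error class are hypothetical stand-ins for ccg2hol components.

```python
# A schematic sketch of the parallel evaluation/annotation pass (slide 30).
# Helper names and the error class are hypothetical stand-ins.
class ParsingError(Exception):
    """Raised when CCG parsing fails for an item."""

def correct_semtags(item) -> dict:
    """Manually corrected semantic tags (Abzianidze+ 2017) for the item."""
    ...

def ccg2hol_judge(item, semtags) -> str:
    """CCG tree -> MR -> inference; returns yes/no/unk or raises ParsingError."""
    ...

def run(qsem_items):
    annotated, skipped = [], []
    for item in qsem_items:
        try:
            label = ccg2hol_judge(item, correct_semtags(item))
            annotated.append((item["id"], label))  # 68/138 ended up here
        except ParsingError:
            skipped.append(item["id"])  # 70/138, mostly parsing errors
    return annotated, skipped
```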
  31. • Currently, our system derives each representation separately. CCG parsing → semantic composition → LF / HOL / DRS Future prospects
  32. • In the future, we aim to implement a feature that will convert from HOL to other representations. Sentence → CCG parsing → semantic composition → HOL → LF / DRS Future prospects
  33. • By replacing syntactic parsing and semantic composition with a seq2seq model, a universal semantic parser can be realized without having a model for each representation (a sketch of such an interface follows below). Seq2seq model: Sentence → LF / HOL / DRS Future prospects
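As a sketch of what such a universal parser interface might look like, assuming a seq2seq checkpoint fine-tuned to emit each representation under a task prefix; the prefix scheme is hypothetical, and "t5-small" is only a stand-in base model.

```python
# A sketch of a prefix-controlled universal semantic parser (slide 33).
# The prefix scheme is hypothetical; "t5-small" is only a stand-in base model
# and would need fine-tuning on (sentence, representation) pairs to be useful.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def parse(sentence: str, target: str) -> str:
    """target: 'LF', 'HOL', or 'DRS'; the prefix tells the model what to emit."""
    inputs = tokenizer(f"parse to {target}: {sentence}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(parse("Who does John like?", "HOL"))
```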
  34. Summary • We have proposed an extended version of ccg2lambda, and call this system ccg2hol. • The new features: • Utilization of semantic tags (Abzianidze+ 2017) as a resource for judging lexical meaning representations • Support for various analyses of question semantics • Our system allows for: • more flexible design for lexical meaning • unified evaluation for question semantics Software — Proposal
Table: Comparison of existing benchmarks and ours
Benchmark | Types of questions | Size
FraCaS | polar | 346
Watanabe+ 2019 | who, polar, alternative | 49
Ours | who, which, what, when, where, polar | 138
Evaluation benchmark — Proposal
• By the following procedure, we conducted the evaluation of our analysis and the annotation of QSEM in parallel: QSEM → ccg2hol → manual correction of semtags. 68/138: ✅ CCG tree ✅ MR ✅ inference. 70/138: not annotated, primarily due to parsing errors. * For error analysis, please refer to the paper. Evaluation and annotation
• By replacing syntactic parsing and semantic composition with a seq2seq model, a universal semantic parser can be realized without having a model for each representation. Seq2seq model: Sentence → LF / HOL / DRS Future prospects
  35. • Each problem in FraCaS contains P(remises), Q(uestion), H(ypothesis), and a label (yes/no/unk).
P1: An Italian became the world's greatest tenor. Q: Was there an Italian who became the world's greatest tenor? H: There was an Italian who became the world's greatest tenor. Label: yes (P1 entails H, and P1 provides a positive answer to Q) (ID: 001)
yes: the Ps entail H / the Ps provide a positive answer to Q; no: the Ps contradict H / the Ps provide a negative answer to Q; unknown: none of the above
Evaluation benchmark — Background
  36. • The dataset created by Watanabe+ (2019) contains A(nswer), Q(uestion), and a label (yes/no/unk) • It does not include wh-words other than who, and there are no instances where the object is a wh-word.
A: John ran. Q: Who ran? Label: yes (A provides a positive answer to Q) (ID: 001)
Evaluation benchmark — Background