
FriendsQA: Open-Domain Question Answering on TV Show Transcripts

Emory NLP

July 09, 2021

Transcript

  1. FriendsQA: Open-Domain Question Answering on TV Show Transcripts Zhengzhe Yang

    Advisor: Dr. Jinho D. Choi Emory University, Department of Computer Science
  2. Introduction • What is Question Answering? • A task that challenges a machine's ability to understand a document • The learned knowledge is then applied to answer queries • Completing a blank: Cloze-style • Selecting from a pool of answer candidates: Multiple choice • Selecting an answer span from the document: Span-based
  3. Introduction • Motivation • Remarkable results have been reported on numerous datasets, but… • No multiparty dialogue! • Wiki articles and news articles • (Non-)fictional stories • Children's books • Multiparty dialogue is the most natural means of communication, yet none of these datasets cover it
  4. Introduction • FriendsQA: an open-domain question answering dataset • Given a context, the task is to select the answer span, as in the example on the right
  5. Background: Cloze-style Datasets • CNN/Daily Mail • Predict PERSON entities in the summaries of news articles • Children's Book Test • Expands the prediction to all entity types using children's books • BookTest • 60 times larger than CBT • Who-did-What • Description sentence and evidence passage from the English Gigaword corpus
  6. Background: MC Datasets • MCTest: comprising short fictional stories • RACE: compiled from English assessments for students aged 12-18 • TQA: compiled from middle-school science lessons and textbooks • SciQ: passages from science exams collected via crowdsourcing • DREAM: multiparty dialogue passages from English-as-a-foreign-language exams
  7. Background: Span-based Datasets • bAbI: infer event descriptions • WikiQA and SQuAD: Wikipedia articles • NewsQA: CNN articles • MS MARCO: web documents (Bing) • TriviaQA: questions from trivia enthusiasts • CoQA: conversational flow between a questioner and an answerer
  8. Background: QA Systems • R-Net • ReasoNet • Attention Over

    Attention Reader • Reinforced Mnemonic Reader • Transformer • MEMEN • FusionNet • Stochastic Answer Network • QANet • ELMo • BERT
  9. Background: Character Mining • The first 4 seasons are annotated for the character identification task • The annotations are further extended to plural mentions • The first 4 seasons are also annotated with fine-grained emotion detection • All 10 seasons are processed for a cloze-style RC task
  10. Background: FriendsQA vs. Other Dialogue QA • FriendsQA vs. CoQA • CoQA aims to answer questions in a one-to-one conversation between a questioner and an answerer • The evidence passages are still wiki articles • FriendsQA vs. the cloze-style RC task • Cloze-style reasoning is less complex compared to span-based QA • The predictions are limited to PERSON entities • FriendsQA vs. DREAM • Multiple-choice questions are not ideal for practical QA applications
  11. The Corpus: FriendsQA Dataset • 1,222 scenes (83 are pruned for having fewer than 5 utterances) • All utterances in a scene are concatenated to form an evidence passage (sketched below) • The task is to find a contiguous answer span in the evidence passage
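A minimal sketch of how such an evidence passage might be assembled, assuming a simple per-scene layout with speaker and text fields; the field names and the uNNN utterance-ID format are illustrative, not the official release format:

```python
def build_passage(scene):
    # Scenes with fewer than 5 utterances are pruned from the corpus.
    if len(scene["utterances"]) < 5:
        return None
    lines = []
    for i, utt in enumerate(scene["utterances"]):
        speaker = utt["speaker"]
        if isinstance(speaker, list):  # an utterance may have several speakers
            speaker = ", ".join(speaker)
        # Keep an utterance ID and the speaker name so both can serve as answer spans.
        lines.append(f"u{i:03d} {speaker}: {utt['text']}")
    return "\n".join(lines)
```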
  12. The Corpus: Challenges with entity resolution • Utterances are spoken

    by several people and context switching happens more frequently • The ubiquitous and interchangeable use of pronouns
  13. The Corpus: Challenges with metaphors • Homophone confusion • Humor that can be understood by human readers • Requires outside knowledge, in this case knowledge of the human body
  14. The Corpus: Challenges with sarcasm • The use of sarcasm is dominant in Friends to create humorous effects • The intended meaning is the exact opposite of the literal reading
  15. The Corpus: Crowdsourcing • All annotation tasks are conducted on Amazon Mechanical Turk • Left panel: the dialogue • Right panel: text inputs for question generation • Prior to the actual tasks: a quiz to ensure annotators' understanding of the task and the web interface
  16. The Corpus: Phase 1 → Question-Answer Generation • Clear annotation guidelines • 4 question types out of six: {what, when, where, who, why, how} • Questions must be answerable • Multiple answers are allowed, but every selected answer must be relevant to the question • Speaker names and utterance IDs can also be selected as answers
  17. The Corpus: Quality Assurance • A task can only be submitted after passing all rules (see the sketch below) • Are at least 4 types of questions annotated? • Does each question have at least one answer span associated with it? • Does any question have too much string overlap with the original text in the dialogue?
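A hedged sketch of these submission checks; the question-type heuristic, the answer-field names, and the overlap threshold are assumptions rather than the actual crowdsourcing validation code:

```python
import re

QTYPES = {"what", "when", "where", "who", "why", "how"}

def can_submit(questions, dialogue_text, max_overlap=0.8):
    # Rule 1: at least 4 of the 6 question types are present.
    types = {q["text"].split()[0].lower() for q in questions if q["text"].strip()}
    if len(types & QTYPES) < 4:
        return False
    dialogue_lower = dialogue_text.lower()
    for q in questions:
        # Rule 2: every question has at least one answer span.
        if not q["answers"]:
            return False
        # Rule 3: the question must not overlap too heavily with the dialogue
        # (a rough token-level heuristic here).
        tokens = re.findall(r"\w+", q["text"].lower())
        overlap = sum(tok in dialogue_lower for tok in tokens) / max(len(tokens), 1)
        if overlap > max_overlap:
            return False
    return True
```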
  18. The Corpus: Phase 2 → Verification and Paraphrasing • Questions generated in Phase 1 are published again without answers • Annotators are asked to revise the questions if unanswerable or ambiguous • Annotators are asked to answer the questions • Annotators are asked to paraphrase the questions • Additional checking for quality assurance: • Check whether the paraphrased question is an exact copy of the original
  19. The Corpus: Four Rounds of Annotation • Four rounds of annotation are conducted before the official annotation tasks • The F1 score is adopted to evaluate inter-annotator agreement (ITA)
  20. The Corpus: R1 • Observed ambiguous questions that led to bad answers • The guidelines are updated to make the questions as explicit as possible
  21. The Corpus: R2 • A 6.27% improvement is observed on ITA • More examples of questions and answer spans are added to the guidelines
  22. The Corpus: R3 • Another 2.48% improvement on ITA • No update is made to the guidelines
  23. The Corpus: R4 • A marginal ITA improvement of 0.67% is observed • Implies that the annotation guidelines have stabilized
  24. The Corpus: Question / Answer Pruning • If a question is revised dramatically in Phase 2, the original question is pruned (21.8% are revised) • If the answers do not agree, the question and its answers are pruned (13.5% are pruned)
  25. The Corpus: Inter-annotator Agreement • After pruning: • 10,610 questions • 21,262 answer spans • ITA: 81.82% / 53.55%
  26. The Corpus: Question Types vs. Answer Categories • 250 questions are randomly sampled and analyzed • The analysis demonstrates the diversity of FriendsQA
  27. Approach • Three SOTA systems are selected to represent common approaches • R-Net: recurrent neural networks with attention mechanisms • QANet: convolutional neural networks with self-attention • BERT: deep bidirectional Transformers
  28. Approach: BERT • Pushed all current state-of-the-art scores to another level • Based on Transformers (attention only)
  29. Experiments: Model Development • All dialogues are randomly shuffled and redistributed into training (80%), development (10%), and test (10%) sets • Each training instance consists of a dialogue, questions, and a single answer to each question • Utterance IDs are replaced with the actual utterances

    Set          Dialogues  Questions  Answers
    Training     977        8,535      17,074
    Development  122        1,010      2,057
    Test         123        1,065      2,131
  30. Experiments: Model Development • Recall that each question can have multiple answers • Three strategies to generate training instances with a single answer (sketched below): • Select the shortest answer and discard the rest • Select the longest answer and discard the rest • If a question Q1 has multiple answers A1 and A2, generate two training instances (Q1, A1) and (Q1, A2) and train on them independently
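A small sketch of the three strategies, assuming a (dialogue, question, answer) tuple layout for training instances:

```python
def make_instances(dialogue, question, answers, strategy="multi"):
    if strategy == "shortest":  # keep only the shortest gold answer
        return [(dialogue, question, min(answers, key=len))]
    if strategy == "longest":   # keep only the longest gold answer
        return [(dialogue, question, max(answers, key=len))]
    # "multi": one independent training instance per gold answer
    return [(dialogue, question, answer) for answer in answers]
```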
  31. Experiments: Span-based Match • Each answer is treated as a bag of words • Compute the macro-averaged F1 score • P: precision • R: recall
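A sketch of the span match score for one prediction, treating both spans as bags of words; scores are macro-averaged over all questions, and with multiple gold answers the best-matching one would typically be taken:

```python
from collections import Counter

def span_f1(prediction, gold):
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)  # precision
    r = overlap / len(gold_tokens)  # recall
    return 2 * p * r / (p + r)
```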
  32. Experiments: Exact Match • Check if the prediction and gold

    answer are the same • Score is either 1 or 0
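For comparison, a one-line sketch of exact match; whether any string normalization (case, whitespace) is applied is an assumption here:

```python
def exact_match(prediction, gold):
    return float(prediction.strip().lower() == gold.strip().lower())
```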
  33. Experiments: Utterance Match • Given the nature of multiparty dialogue QA, utterance match is introduced • Models are considered powerful if they consistently look for answers in the correct utterance • UM mainly checks whether the prediction resides within the same utterance as the gold answer span
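A sketch of utterance match, assuming character offsets for the predicted and gold answer starts plus a list of (begin, end) character boundaries for each utterance in the passage:

```python
def utterance_match(pred_start, gold_start, utterance_boundaries):
    def utterance_index(offset):
        for i, (begin, end) in enumerate(utterance_boundaries):
            if begin <= offset < end:
                return i
        return -1
    return float(utterance_index(pred_start) == utterance_index(gold_start))
```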
  34. Experiments: Results • All experiments are run three times • Average scores with standard deviations are reported • BERT and QANet perform better with the multiple-answer strategy • R-Net performs better with the other strategies
  35. Experiments: Results with replacement • Take advantage of the Character Mining project • Keep an entity mapping and replace all PERSON entities in both the dialogue and the questions • Plural mentions are handled naively (e.g., "we" → ent0 ent1 ent2)
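A hedged sketch of the replacement step, assuming Character Mining-style mention annotations given as (surface form, character names) pairs and a dictionary mapping character names to integer IDs; this is not the exact preprocessing code:

```python
import re

def replace_entities(text, mentions, entity_ids):
    for surface, characters in mentions:
        # Plural mentions are handled naively, e.g. "we" -> "ent0 ent1 ent2".
        replacement = " ".join(f"ent{entity_ids[name]}" for name in characters)
        text = re.sub(rf"\b{re.escape(surface)}\b", replacement, text)
    return text
```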
  36. Experiments: Results based on Q-Type • where and when questions are mostly factoid and show the highest performance with UM • why and how questions require cross-utterance reasoning, leading to worse performance • who and what questions give a good mixture of proper and common nouns and show moderate performance

    Type   Dist.    UM     SM     EM
    What   19.70%   77.42  69.39  55.04
    Where  18.28%   84.35  78.86  65.93
    Who    17.17%   74.12  64.34  55.29
    Why    15.76%   60.47  50.03  27.14
    How    14.65%   65.52  52.04  32.64
    When   14.44%   80.65  65.81  51.98
  37. Experiments: Results on Start of Utterance • Predict only the start of the utterance • Only 1 output layer is needed: simply report accuracy • Demonstrates the power of the neural networks

    Run   SoU Acc.
    1     57.23
    2     57.62
    3     55.25
    Avg.  56.70
  38. Experiments: Results with top-k answers • [Figure: Utterance Match, Span Match, and Exact Match scores as the number of top-k answer candidates increases from 1 to 20]
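A sketch of how top-k results can be scored: a question is credited with the best match among the model's k highest-scoring spans, using a span-level metric such as the span match or exact match functions sketched earlier; the ranking interface is an assumption:

```python
def top_k_score(ranked_predictions, gold_answers, metric, k):
    best = 0.0
    for pred in ranked_predictions[:k]:
        for gold in gold_answers:
            best = max(best, metric(pred, gold))
    return best
```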
  39. Error Analysis • 100 questions whose predictions completely mismatch the gold answers are randomly sampled • Through the analysis, 6 types of errors become evident
  40. Error Analysis • Entity Resolution (28%) • Paraphrase and Partial Match (20%) • Cross-Utterance Reasoning (18%) • Question Bias (17%) • Noise in Annotation (4%) • Miscellaneous (13%)
  41. Paraphrase and Partial Match (20%) • The answer is paraphrased, abstracted, referred to by a nickname, etc. somewhere else in the conversation • Predictions are often partially correct, especially for why and how questions, which could be acceptable in practice • This motivates evaluating with Utterance Match
  42. Cross-Utterance Reasoning (18%) • This type reveals a universal challenge in understanding human-to-human conversation • The model must reason back and forth across multiple utterances, especially when a story or an event unfolds gradually, is scattered across different places, and is told by different speakers
  43. Question Bias (17%) • This type occurs when the answer predictions rely too heavily on the question type. Q: Why is Chandler against marriage? A: …because Joey built this chair on his own • A span starting with "because" is not necessarily the correct answer!
  44. Noise in Annotation (4%) • Although FriendsQA gives high inter-annotator agreement, it still includes noise caused by wrong spans, ambiguous or unanswerable questions, and typos
  45. Miscellaneous (13%) • Errors in this category have no apparent cause; it is unclear why the model predicts these answers • The predictions often seem irrelevant to the questions and need more investigation
  46. Conclusion: Contributions • FriendsQA: an open-domain question answering dataset • An extensive and comprehensive analysis of its validity, difficulty, and diversity • Three state-of-the-art models are run and compared, showing the dataset's potential • The error analysis offers an insightful retrospective and suggests directions for deeper future study
  47. Conclusion: Future Work • The Q-type and error analyses can serve as guidelines to further enhance QA model performance • Why and how questions should be studied more attentively • Speaker information could be encoded into the utterance • Top-k answering: another challenging but tangible task • Answer existence prediction and an utterance-based model to select utterance candidates