
FriendsQA: Open-Domain Question Answering on TV Show Transcripts

Emory NLP

July 09, 2021

Transcript

  1. FriendsQA: Open-Domain Question Answering on TV Show Transcripts Zhengzhe Yang

    Advisor: Dr. Jinho D. Choi Emory University, Department of Computer Science
  2. Introduction • What is Question Answering? • A task that challenges a machine's ability to understand a document • The learned knowledge is then applied to answer queries • Completing a blank: Cloze-style • Selecting from a pool of answer candidates: Multiple choice • Selecting an answer span from the document: Span-based
  3. Introduction • Motivation • Remarkable results have been reported on numerous datasets, but… • No multiparty dialogue! • Wiki articles and news articles • (Non-)fictional stories • Children's books • Multiparty dialogue is the most natural means of communication, yet none of these datasets cover it
  4. Introduction • FriendsQA: an open-domain question answering dataset • Given a context, the task is to select the answer span, as in the example on the right
  5. Background: Cloze-style Datasets • CNN/Daily Mail • Predict PERSON entities in the summaries of news articles • Children's Book Test • Expands the prediction to all entity types using children's books • BookTest • 60 times larger than CBT • Who-did-What • Description sentence and evidence passage from the English Gigaword corpus
  6. Background: MC Datasets • MCTest: comprising short fictional stories • RACE: compiled from English assessments for students aged 12-18 • TQA: compiled from middle-school science lessons and textbooks • SciQ: passages from science exams collected via crowdsourcing • DREAM: multiparty dialogue passages from English-as-a-foreign-language exams
  7. Background: Span-based Datasets • bAbI: infer event descriptions • WikiQA and SQuAD: Wikipedia articles • NewsQA: CNN articles • MS MARCO: web documents (Bing) • TriviaQA: questions from trivia enthusiasts • CoQA: conversational flow between a questioner and an answerer
  8. Background: QA Systems • R-Net • ReasoNet • Attention Over

    Attention Reader • Reinforced Mnemonic Reader • Transformer • MEMEN • FusionNet • Stochastic Answer Network • QANet • ELMo • BERT
  9. Background: Character Mining • The first 4 seasons are annotated for the character identification task • The annotations are further extended to plural mentions • The first 4 seasons are also annotated with fine-grained emotion detection • All 10 seasons are processed for a cloze-style RC task
  10. Background: FriendsQA vs. Other Dialogue QA • FriendsQA vs. CoQA • CoQA aims to answer questions in a one-to-one conversation between a questioner and an answerer • The evidence passages are still wiki articles • FriendsQA vs. the cloze-style RC task • Cloze-style reasoning is less complex compared to span-based QA • The predictions are limited to PERSON entities • FriendsQA vs. DREAM • Multiple-choice questions are not ideal for practical QA applications
  11. The Corpus: FriendsQA Dataset • 1,222 scenes (83 are pruned for having fewer than 5 utterances) • All utterances in a scene are concatenated to form an evidence passage (sketched below) • The task is to find a contiguous answer span in the evidence passage
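A minimal sketch of how such an evidence passage might be assembled, assuming a simple per-scene layout with speaker and text fields; the field names and the uNNN utterance-ID format are illustrative, not the official release format:

```python
def build_passage(scene):
    # Scenes with fewer than 5 utterances are pruned from the corpus.
    if len(scene["utterances"]) < 5:
        return None
    lines = []
    for i, utt in enumerate(scene["utterances"]):
        speaker = utt["speaker"]
        if isinstance(speaker, list):  # an utterance may have several speakers
            speaker = ", ".join(speaker)
        # Keep an utterance ID and the speaker name so both can serve as answer spans.
        lines.append(f"u{i:03d} {speaker}: {utt['text']}")
    return "\n".join(lines)
```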
  12. The Corpus: Challenges with entity resolution • Utterances are spoken

    by several people and context switching happens more frequently • The ubiquitous and interchangeable use of pronouns
  13. The Corpus: Challenges with metaphors • Homophone confusion • Humor that can be understood by human readers • Requires outside knowledge, in this case knowledge of the human body
  14. The Corpus: Challenges with sarcasm • The use of sarcasm is dominant in Friends to create humorous effects • The intended meaning is the exact opposite of the literal reading
  15. The Corpus: Crowdsourcing • All annotation tasks are conducted on Amazon Mechanical Turk • Left panel: the dialogue • Right panel: text inputs for question generation • Prior to the actual tasks: a quiz to ensure annotators' understanding of the task and the web interface
  16. The Corpus: Phase 1 → Question-Answer Generation • Clear annotation guidelines • 4 question types out of six: {what, when, where, who, why, how} • Questions must be answerable • Multiple answers are allowed, but every selected answer must be relevant to the question • Speaker names and utterance IDs can also be selected as answers
  17. The Corpus: Quality Assurance • A task can only be submitted after passing all rules (see the sketch below) • Are at least 4 types of questions annotated? • Does each question have at least one answer span associated with it? • Does any question have too much string overlap with the original text in the dialogue?
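A hedged sketch of these submission checks; the question-type heuristic, the answer-field names, and the overlap threshold are assumptions rather than the actual crowdsourcing validation code:

```python
import re

QTYPES = {"what", "when", "where", "who", "why", "how"}

def can_submit(questions, dialogue_text, max_overlap=0.8):
    # Rule 1: at least 4 of the 6 question types are present.
    types = {q["text"].split()[0].lower() for q in questions if q["text"].strip()}
    if len(types & QTYPES) < 4:
        return False
    dialogue_lower = dialogue_text.lower()
    for q in questions:
        # Rule 2: every question has at least one answer span.
        if not q["answers"]:
            return False
        # Rule 3: the question must not overlap too heavily with the dialogue
        # (a rough token-level heuristic here).
        tokens = re.findall(r"\w+", q["text"].lower())
        overlap = sum(tok in dialogue_lower for tok in tokens) / max(len(tokens), 1)
        if overlap > max_overlap:
            return False
    return True
```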
  18. The Corpus: Phase 2 → Verification and Paraphrasing • Questions generated in Phase 1 are published again without answers • Annotators are asked to revise the questions if unanswerable or ambiguous • Annotators are asked to answer the questions • Annotators are asked to paraphrase the questions • Additional checking for quality assurance: • Check whether the paraphrased question is an exact copy of the original
  19. The Corpus: Four Rounds of Annotation • Four rounds of annotation are conducted before the official annotation tasks • The F1 score is adopted to evaluate inter-annotator agreement (ITA)
  20. The Corpus: R1 • Observed ambiguous questions that led to bad answers • The guidelines are updated to make the questions as explicit as possible
  21. The Corpus: R2 • A 6.27% improvement is observed on ITA • More examples of questions and answer spans are added to the guidelines
  22. The Corpus: R3 • Another 2.48% improvement on ITA • No update is made to the guidelines
  23. The Corpus: R4 • A marginal ITA improvement of 0.67% is observed • Implies that the annotation guidelines have stabilized
  24. The Corpus: Question / Answer Pruning • If a question is revised dramatically in Phase 2, the original question is pruned (21.8% are revised) • If the answers do not agree, the question and its answers are pruned (13.5% are pruned)
  25. The Corpus: Inter-annotator Agreement • After pruning: • 10,610 questions • 21,262 answer spans • ITA: 81.82% / 53.55%
  26. The Corpus: Question Types vs. Answer Categories • 250 questions are randomly sampled and analyzed • The analysis demonstrates the diversity of FriendsQA
  27. Approach • Three SOTA systems are selected to represent common approaches • R-Net: recurrent neural networks with attention mechanisms • QANet: convolutional neural networks with self-attention • BERT: deep bidirectional Transformers
  28. Approach: BERT • Pushed all current state-of-the-art scores to another level • Based on Transformers (attention only)
  29. Experiments: Model Development • All dialogues are randomly shuffled and redistributed into training (80%), development (10%), and test (10%) sets • Each training instance consists of a dialogue, questions, and a single answer to each question • Utterance IDs are replaced with the actual utterances

    Set          Dialogues  Questions  Answers
    Training     977        8,535      17,074
    Development  122        1,010      2,057
    Test         123        1,065      2,131
  30. Experiments: Model Development • Recall that each question can have multiple answers • Three strategies to generate training instances with a single answer (sketched below): • Select the shortest answer and discard the rest • Select the longest answer and discard the rest • If a question Q1 has multiple answers A1 and A2, generate two training instances (Q1, A1) and (Q1, A2) and train on them independently
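A small sketch of the three strategies, assuming a (dialogue, question, answer) tuple layout for training instances:

```python
def make_instances(dialogue, question, answers, strategy="multi"):
    if strategy == "shortest":  # keep only the shortest gold answer
        return [(dialogue, question, min(answers, key=len))]
    if strategy == "longest":   # keep only the longest gold answer
        return [(dialogue, question, max(answers, key=len))]
    # "multi": one independent training instance per gold answer
    return [(dialogue, question, answer) for answer in answers]
```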
  31. Experiments: Span-based Match • Each answer is treated as a bag of words • Compute the macro-averaged F1 score • P: precision • R: recall
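A sketch of the span match score for one prediction, treating both spans as bags of words; scores are macro-averaged over all questions, and with multiple gold answers the best-matching one would typically be taken:

```python
from collections import Counter

def span_f1(prediction, gold):
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)  # precision
    r = overlap / len(gold_tokens)  # recall
    return 2 * p * r / (p + r)
```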
  32. Experiments: Exact Match • Check if the prediction and gold

    answer are the same • Score is either 1 or 0
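For comparison, a one-line sketch of exact match; whether any string normalization (case, whitespace) is applied is an assumption here:

```python
def exact_match(prediction, gold):
    return float(prediction.strip().lower() == gold.strip().lower())
```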
  33. Experiments: Utterance Match • Given the nature of multiparty dialogue QA, utterance match is introduced • Models are considered powerful if they consistently look for answers in the correct utterance • UM mainly checks whether the prediction resides within the same utterance as the gold answer span
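A sketch of utterance match, assuming character offsets for the predicted and gold answer starts plus a list of (begin, end) character boundaries for each utterance in the passage:

```python
def utterance_match(pred_start, gold_start, utterance_boundaries):
    def utterance_index(offset):
        for i, (begin, end) in enumerate(utterance_boundaries):
            if begin <= offset < end:
                return i
        return -1
    return float(utterance_index(pred_start) == utterance_index(gold_start))
```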
  34. Experiments: Results • All experiments are run three times • Average scores with standard deviations are reported • BERT and QANet perform better with the multiple-answer strategy • R-Net performs better with the other strategies
  35. Experiments: Results with replacement • Take advantage of the Character Mining project • Keep an entity mapping and replace all PERSON entities in both the dialogue and the questions • Plural mentions are handled naively (e.g., "we" → ent0 ent1 ent2)
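A hedged sketch of the replacement step, assuming Character Mining-style mention annotations given as (surface form, character names) pairs and a dictionary mapping character names to integer IDs; this is not the exact preprocessing code:

```python
import re

def replace_entities(text, mentions, entity_ids):
    for surface, characters in mentions:
        # Plural mentions are handled naively, e.g. "we" -> "ent0 ent1 ent2".
        replacement = " ".join(f"ent{entity_ids[name]}" for name in characters)
        text = re.sub(rf"\b{re.escape(surface)}\b", replacement, text)
    return text
```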
  36. Experiments: Results based on Q-Type • where and when questions are mostly factoid and show the highest performance with UM • why and how questions require cross-utterance reasoning, leading to worse performance • who and what questions give a good mixture of proper and common nouns and show moderate performance

    Type   Dist.    UM     SM     EM
    What   19.70%   77.42  69.39  55.04
    Where  18.28%   84.35  78.86  65.93
    Who    17.17%   74.12  64.34  55.29
    Why    15.76%   60.47  50.03  27.14
    How    14.65%   65.52  52.04  32.64
    When   14.44%   80.65  65.81  51.98
  37. Experiments: Results on Start of Utterance • Predict only the start of the utterance • Only 1 output layer is needed: simply report accuracy • Demonstrates the power of the neural networks

    Run   SoU Acc.
    1     57.23
    2     57.62
    3     55.25
    Avg.  56.70
  38. Experiments: Results with top-k answers • [Figure: Utterance Match, Span Match, and Exact Match scores as the number of top-k answer candidates increases from 1 to 20]
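A sketch of how top-k results can be scored: a question is credited with the best match among the model's k highest-scoring spans, using a span-level metric such as the span match or exact match functions sketched earlier; the ranking interface is an assumption:

```python
def top_k_score(ranked_predictions, gold_answers, metric, k):
    best = 0.0
    for pred in ranked_predictions[:k]:
        for gold in gold_answers:
            best = max(best, metric(pred, gold))
    return best
```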
  39. Error Analysis • 100 questions whose predictions completely mismatch the gold answers are randomly sampled • Through the analysis, 6 types of errors become evident
  40. Error Analysis • Entity Resolution (28%) • Paraphrase and Partial Match (20%) • Cross-Utterance Reasoning (18%) • Question Bias (17%) • Noise in Annotation (4%) • Miscellaneous (13%)
  41. Paraphrase and Partial Match (20%) • The answer is paraphrased, abstracted, referred to by a nickname, etc. somewhere else in the conversation • Predictions are often partially correct, especially for why and how questions, which could be acceptable in practice • This motivates evaluating with Utterance Match
  42. Cross-Utterance Reasoning (18%) • This type reveals a universal challenge in understanding human-to-human conversation • The model must reason back and forth across multiple utterances, especially when a story or an event unfolds gradually, is scattered across different places, and is told by different speakers
  43. Question Bias (17%) • This type occurs when the answer predictions rely too heavily on the question type. Q: Why is Chandler against marriage? A: …because Joey built this chair on his own • A span starting with "because" is not necessarily the correct answer!
  44. Noise in Annotation (4%) • Although FriendsQA gives high inter-annotator agreement, it still includes noise caused by wrong spans, ambiguous or unanswerable questions, and typos
  45. Miscellaneous (13%) • Errors in this category have no apparent cause; it is unclear why the model predicts these answers • The predictions often seem irrelevant to the questions and need more investigation
  46. Conclusion: Contributions • FriendsQA: an open-domain question answering dataset • An extensive and comprehensive analysis of its validity, difficulty, and diversity • Three state-of-the-art models are run and compared, showing the dataset's potential • The error analysis offers an insightful retrospective and suggests directions for deeper future study
  47. Conclusion: Future Work • The Q-type and error analyses can serve as guidelines to further enhance QA model performance • Why and how questions should be studied more attentively • Speaker information could be encoded into the utterance • Top-k answering: another challenging but tangible task • Answer existence prediction and an utterance-based model to select utterance candidates