Sushant Hiray
October 23, 2017

Agree To Disagree: Improving disagreement detection using dual GRUs

Presentation of our work on disagreement detection at ESSEM 2017. In this work, we show that by using a Siamese-inspired architecture to encode the discussions, we no longer need to rely on hand-crafted features to exploit the meta-thread structure. The research paper can be found at https://arxiv.org/abs/1708.05582

Transcript

  1. AGREE TO DISAGREE: IMPROVING DISAGREEMENT DETECTION WITH DUAL GRUs

    Sushant Hiray, Venkatesh Duppada
    firstname.lastname@seernet.io
    Seernet Technologies
    ESSEM 2017
  2. AGREEMENT AND DISAGREEMENT

    • Global: Overall stance towards the topic
    • Local: Between sentences, posts in forums
    Image: "Duty Calls", https://xkcd.com/386/
  3. WHY?

    • Detect presence of disputes
    • Understand ideological stance of participants
    • Detect sub-groups
  4. RELATED WORK: SPEECH

    • Galley et al 2014, Hillard et al 2003, Hahn et al 2006
    • Datasets: ICSI, AMI Meeting Corpus
    • Features*: sentiments, n-grams, prosodic features

    * The models use hand-transcribed or ASR-based annotations as well
  5. RELATED WORK: TEXT

    • Yin et al 2012, Abbott et al 2011, Misra and Walker 2013, Mukherjee and Liu 2012, Rosenthal and McKeown 2015
    • 2/3/5-way (dis)agreement detection
    • Datasets: ABCD, AWTP, IAC, US Message Board, AAWD, Political Forums
    • Features: thread structure, lexical, style, sentiment, polarity, durational features, etc.
  6. RELATED DOMAINS

    • Stance Detection: Task of identifying whether the author of the text is in favor of, against, or neutral towards a target
    • Argument Mining: Umbrella category which primarily focuses on tasks like:
        • automatic extraction of arguments from text
        • argument proposition classification
        • argumentative parsing
  7. DATA DEFINITION

    Quote-Response pairs: Consider only those tuples where the quote (Q) functions as a "dialogic parent" to the response (R).
  8. DATASETS

    • Internet Argument Corpus (IAC)
    • Agreement in Wikipedia Talk Pages (AWTP)
    • Agreement by Create Debaters (ABCD)
  9. DATASETS

    DATASET  THREAD COUNT  POST COUNT  AGREE  DISAGREE  NONE
    ABCD     9981          185479      38195  60991     86293
    IAC      1220          5940        428    1236      4276
    AWTP     50            822         38     148       636
  10. DATASETS

    DATASET  THREAD COUNT  POST COUNT  AGREE  DISAGREE  NONE
    ABCD     9981          185479      38195  60991     86293
    IAC      1220          5940        428    1236      4276
    AWTP     50            822         38     148       636

    30X LARGER DATA
  11. DATA ANNOTATION: ABCD

    The side information is provided by the participant on the website createdebate.com
    • AGREEMENT: Quote and Response support the same side and the authors are different
    • DISAGREEMENT: Quote and Response support different sides and the authors are different
    • NONE: Quote is Root, however the Quote and Response have the same author
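The three ABCD rules above can be read as a small labeling function. The following is a minimal sketch of one literal reading of the slide, not the authors' code: the function name, argument names, and the choice to return `None` when no rule applies are all assumptions made for illustration.

```python
from typing import Optional


def abcd_label(quote_side: str, quote_author: str,
               response_side: str, response_author: str,
               quote_is_root: bool) -> Optional[str]:
    """Label a Quote-Response pair from self-reported sides (hypothetical helper).

    Returns "agree", "disagree", "none", or None when no rule on the slide applies.
    """
    if quote_author != response_author:
        # Different authors: the label depends only on the sides they picked.
        return "agree" if quote_side == response_side else "disagree"
    # Same author: the slide defines NONE only for a root quote.
    if quote_is_root:
        return "none"
    return None


print(abcd_label("yes", "alice", "no", "bob", quote_is_root=False))
```

Because the labels come from sides the participants chose themselves, no manual annotation is needed, which is what makes a corpus of this size feasible.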
  12. DATA ANNOTATION: IAC

    • Data from 4forums
    • Roughly 6000 Q-R pairs annotated for (dis)agreement on a scale of [-5, 5]
    • [-5, -1) disagreement, [-1, 1] none, (1, 5] agreement
    • In case of multiple annotators, average annotation scores are computed after filtering out the none annotations.
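The IAC thresholds above map a score in [-5, 5] to one of three labels. The sketch below shows that mapping together with one possible reading of the averaging rule: individual annotations that fall in the none band are dropped before the mean is taken. That reading, the function names, and the fallback to "none" when every annotation is filtered out are assumptions, not details confirmed by the slides.

```python
from statistics import mean
from typing import Sequence


def score_to_label(score: float) -> str:
    """Map one score to a label: [-5, -1) disagree, [-1, 1] none, (1, 5] agree."""
    if score < -1:
        return "disagree"
    if score > 1:
        return "agree"
    return "none"


def iac_label(scores: Sequence[float]) -> str:
    """Average the non-none annotations, then threshold the mean (assumed reading)."""
    kept = [s for s in scores if score_to_label(s) != "none"]
    if not kept:
        # Every annotator scored the pair in the none band.
        return "none"
    return score_to_label(mean(kept))


print(iac_label([-3, -2, 0.5]))
```

Note that the boundary values -1 and 1 both fall in the none band, following the closed interval on the slide.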
  13. DATA ANNOTATION: AWTP

    • Data from Wikipedia and LiveJournal
    • Cohen’s κ = 0.73 for 3-way classification: agreement, disagreement, none
    • Cohen’s κ = 0.66 for 5-way classification: agreement-response, agreement-paraphrase, disagreement-response and disagreement-paraphrase
  14. DATA EXPLORATION

    LABEL: NONE
    QUOTE: Is Scientology a real religeon? Or is it a fake money making gimmick?
    RESPONSE: All religions are fake, there’s an argument to be made the vast majority are money making gimmicks. Scientology is no more outlandish than any of the more widespread religions.

    LABEL: AGREE
    QUOTE: I am against suicide because you are basically not only harming yourself, but everyone else around you. Let’s not mention it is a cowards way out. I also have a religious but I CAN explain that reason.
    RESPONSE: So true man people only harm the people they love by dying. I am not religious and religiously and non-religiously suicide is wrong.

    LABEL: DISAGREE
    QUOTE: The majority of the information learned in school is irrelevant to real world skills. Besides, in a voluntary setting, most children would go to school via parents demands where school choice would be much more abundant.
    RESPONSE: Children learn math which is relevant, children learn history which is relvant, children learn the releveant languge to their country, children learn foreign languages which imporoves economic opportunites. My one friend grew up in Baghdad, Iraq, and they don’t play when it comes to education. He started learning English in the 3rd grade I think through graduation which helped his economic ooportunities, and he is an artchitect so the math helped. Please excuse my typos. I have a learning disability.
  15. FEATURE EXTRACTION
  16. FEATURES

    • Word Vectors: GloVe embeddings of 300 dimensions
    • Lexicons [1]
        • AFINN: Valence ratings between -5 and 5
        • Bing Liu: Opinion Lexicons
        • NRC Affect Intensity: Real-valued affect intensity
        • NRC Word Emotion Lexicon: 8 sense-level associations and 2 sentiment-level associations
        • NRC Hashtag Lexicon: Word-emotion associations computed on hashtags
        • LIWC: Various categorizations of words according to thoughts, feelings, motivations

    [1] Open source implementation: https://github.com/SEERNET/EmoInt
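To make the idea of a lexicon feature concrete, here is a minimal sketch of turning a post into a few valence features with an AFINN-style word list (integer ratings between -5 and 5). It does not use the EmoInt API, and the tiny lexicon, its ratings, the tokenizer, and the feature names are all invented for illustration.

```python
import re
from typing import Dict

# Hypothetical toy lexicon; these are NOT real AFINN entries or ratings.
TOY_VALENCE: Dict[str, int] = {"agree": 2, "true": 2, "wrong": -2, "fake": -3}


def valence_features(text: str, lexicon: Dict[str, int] = TOY_VALENCE) -> Dict[str, int]:
    """Aggregate per-word valence ratings into fixed-size features for one post."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[token] for token in tokens if token in lexicon]
    return {
        "valence_sum": sum(scores),
        "positive_count": sum(1 for s in scores if s > 0),
        "negative_count": sum(1 for s in scores if s < 0),
    }


print(valence_features("I agree, that is true but fake"))
```

Features of this kind have a fixed size regardless of post length, so they can be concatenated with a learned encoding of the text.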
  17. SYSTEM DESCRIPTION
  18. SYSTEM DESCRIPTION

    • Siamese-inspired architecture
    • Using lexical features as well as word embeddings
    • 3-way classification
    Figure: Model Architecture
  19. SYSTEM PARAMETERS

    • Sequence length: 64
    • Word embedding dimension: 300
    • Dropout of 0.5
    • ReLU layer
    • Batch Normalization
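A fixed sequence length of 64 means every quote and response has to be cut or padded to exactly 64 tokens before it reaches the recurrent encoder. The sketch below shows one common way to do that. The slides do not say which end is truncated or padded, so keeping the first tokens, padding on the right, and using 0 as the padding id are assumptions.

```python
from typing import List, Sequence


def pad_or_truncate(token_ids: Sequence[int], max_len: int = 64, pad_id: int = 0) -> List[int]:
    """Return exactly max_len ids: keep the first max_len, then right-pad with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


short_post = [12, 7, 431]
long_post = list(range(1, 101))
print(len(pad_or_truncate(short_post)), len(pad_or_truncate(long_post)))
```

Both calls return 64 ids, so posts of any length can be stacked into one batch.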
  20. EXPERIMENTS: FEATURES

    FEATURES USED   PRECISION  RECALL  WEIGHTED F1 SCORE
    LEXICONS        0.7880     0.798   0.789
    GRU             0.792      0.798   0.794
    GRU + LEXICONS  0.812      0.815   0.804

    Analyzing impact of varying features on the model performance on ABCD dataset
  21. EXPERIMENTS: FEATURES

    FEATURES USED   PRECISION  RECALL  WEIGHTED F1 SCORE
    LEXICONS        0.7880     0.798   0.789
    GRU             0.792      0.798   0.794
    GRU + LEXICONS  0.812      0.815   0.804

    Analyzing impact of varying features on the model performance on ABCD dataset

    SOTA: 0.776
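The tables report a weighted F1 score. As a reminder of what that metric computes, here is a minimal standard-library sketch of the usual definition: per-class F1 averaged with weights proportional to each class's support. Whether the authors computed it exactly this way is an assumption, and the function name and toy labels are illustrative only.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Support-weighted mean of per-class F1 over the classes present in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (count / total) * f1
    return score


gold = ["agree", "agree", "disagree", "none"]
pred = ["agree", "disagree", "disagree", "none"]
print(weighted_f1(gold, pred))
```

Because this is an average of per-class F1 values rather than the harmonic mean of the averaged precision and recall, it does not have to lie between those two numbers, which is consistent with the GRU + LEXICONS row.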
  22. EXPERIMENTS: SEQUENCE LENGTH

    SEQUENCE LENGTHS  PRECISION  RECALL  WEIGHTED F1 SCORE
    32                0.810      0.813   0.806
    64                0.812      0.815   0.804
    128               0.805      0.808   0.796

    Investigating impact of varying maximum sequence length on model performance. Results are on ABCD dataset
  23. EXPERIMENTS: SEQUENCE LENGTH

    Figure: Distribution of sequence length in Q-R pairs. The graph shows the number of posts vs. sequence length.
  24. EXPERIMENTS: TRANSFER LEARNING

    MODEL                  PRECISION  RECALL  WEIGHTED F1 SCORE
    SOTA                   -          -       0.578
    DIRECT                 0.473      0.364   0.285
    TUNING                 0.530      0.572   0.450
    TRANSFER               0.508      0.559   0.465
    RETRAIN LAST 2 LAYERS  0.531      0.572   0.460
    RETRAIN LAST 3 LAYERS  0.523      0.570   0.428

    Tuning Model trained on ABCD for IAC
  25. EXPERIMENTS: TRANSFER LEARNING

    MODEL                  PRECISION  RECALL  WEIGHTED F1 SCORE
    SOTA                   -          -       0.444
    DIRECT                 0.434      0.445   0.389
    TUNING                 0.515      0.470   0.464
    TRANSFER               0.525      0.437   0.447
    RETRAIN LAST 2 LAYERS  0.534      0.457   0.447
    RETRAIN LAST 3 LAYERS  0.477      0.546   0.486

    Tuning Model trained on ABCD for AWTP
  26. EXPERIMENTS: TRANSFER LEARNING

    MODEL                  PRECISION  RECALL  WEIGHTED F1 SCORE
    SOTA                   -          -       0.444
    DIRECT                 0.434      0.445   0.389
    TUNING                 0.515      0.470   0.464
    TRANSFER               0.525      0.437   0.447
    RETRAIN LAST 2 LAYERS  0.534      0.457   0.447
    RETRAIN LAST 3 LAYERS  0.477      0.546   0.486

    Tuning Model trained on ABCD for AWTP

    COMPETITIVE PERFORMANCE
  27. IS THE MODEL DOMAIN INDEPENDENT?
  28. DOMAIN INDEPENDENT?

    • Topics in the train, development, and test sets are disjoint.
    • This ensures the model is topic-independent and learns the nuances of (dis)agreement rather than the specifics of the domain.
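A topic-disjoint split can be built by assigning whole topics, rather than individual Q-R pairs, to each partition. The sketch below illustrates the idea under stated assumptions: each pair is a dict with a `topic` key, and the split fractions, seed, and function name are hypothetical choices, not the authors' settings.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pair = Dict[str, object]


def split_by_topic(pairs: Sequence[Pair], dev_frac: float = 0.1, test_frac: float = 0.1,
                   seed: int = 0) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split pairs into train/dev/test so that no topic appears in two partitions."""
    by_topic: Dict[object, List[Pair]] = defaultdict(list)
    for pair in pairs:
        by_topic[pair["topic"]].append(pair)

    # Shuffle topic names (not pairs) so each topic lands in exactly one partition.
    topics = sorted(by_topic, key=str)
    random.Random(seed).shuffle(topics)
    n_test = max(1, round(len(topics) * test_frac))
    n_dev = max(1, round(len(topics) * dev_frac))

    def collect(selected: Sequence[object]) -> List[Pair]:
        return [pair for topic in selected for pair in by_topic[topic]]

    test = collect(topics[:n_test])
    dev = collect(topics[n_test:n_test + n_dev])
    train = collect(topics[n_test + n_dev:])
    return train, dev, test


pairs = [{"topic": f"topic-{i % 10}", "pair_id": i} for i in range(50)]
train, dev, test = split_by_topic(pairs)
print(len(train), len(dev), len(test))
```

Splitting at the topic level means partition sizes only approximate the requested fractions, since topics differ in how many pairs they contain.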
  29. DOMAIN INDEPENDENT?

    TRAIN
    TOPIC NAME: Are guns fun
    QUOTE (Berno): Guns are weapons that can kill people. If an adult can find a gun to be fun, then a child can easily find a gun to be fun. We all know what happens when children get their hands on guns.
    RESPONSE (dtrimble): Thats like saying we all know what happens when kids get their hands on the car keys! The debate is not about if they are safe its about weather they are fun. yes guns are fun

    VALIDATION
    TOPIC NAME: 18 years old to date someone over 18
    QUOTE (Safiya): Thats illegal for a 16 year old to date a 19 year old the law is you have to be 18 to date someone that's 18 years old or older
    RESPONSE (Cuaroc): Not in Britain.

    TEST
    TOPIC NAME: Are lessons learnt from the past relevant to today's world of unprecedented change
    QUOTE (sommonsonata): Hasn't germany became a freed country due to the spread of democracy which you have just stated?
    RESPONSE (gyliu15): The example that Germany has become a freed country because of the spread of democracy is not valid. Nevertheless, the lessons learnt from the past are still relevant to today's world of unprecedented change in the political arena.
  30. DETECTING (DIS)AGREEMENT IS HARD
  31. HARD EXAMPLES

    LABEL: AGREE
    QUOTE (JessHall01): my parents do not deserve it, they treat me like shit and that’s unecessary. NO parents should even romotely lay their hands on a child. EVER.
    RESPONSE (kamranw): I agree with you there. NO parent should hit their child. I completely feel your pain if that is the case. I would be curious to know how they treat you like shit though. Not sure how that is relevent to minors being sexual active either.
    DESCRIPTION: Although the participants are on the same side of the debate, they are agreeing on some sentences, which makes it difficult to give a single label to the whole post

    LABEL: DISAGREE
    QUOTE (joecavalry): I tell my kids to hurry up and eat their breakfast so that we can get to school on time. My little one is so slow that I tell her she eats breakslow ;)
    RESPONSE (Morgie7171): AWWW thats so cute............... haha- hahahahahahahahhaaha hhha
    DESCRIPTION: Due to self labeling, sometimes off topic chit-chat on debate forums is erroneously tagged
  32. HARD EXAMPLES

    LABEL: DISAGREE
    QUOTE: Of course guns kill people, but that doesn't mean that they should be banned. Knives kill people, cars kill people, Big Macs kill people, but should they be banned as well?
    RESPONSE: Do yourself a favor and ignore anyone who tells you to be yourself. Bad idea in your case. Guns don't kill people, but people uses guns to kill people.
    DESCRIPTION: Sarcastic responses are difficult for the model to figure out.
  33. CONCLUSION

    • Model doesn’t need to rely on handcrafted features for exploiting meta-thread structures
    • Model pre-trained on large corpora (ABCD) performs competitively on much smaller hand-annotated datasets
    • Model is available as an API at: developers.deepaffects.com
  34. • Applying recent developments in computational sarcasm to improve the model’s ability to detect sarcastic responses

    • Use attention-based models to identify the specific contexts the model is using to predict. This gives more visibility into the model output when debugging
    • Instead of solely relying on self-labelled datasets, use them as weakly labelled datasets for semi-supervised learning and combine them with hand-labelled datasets to correct mislabels.