Slide 1

Payam Karisani, Jinho D. Choi, Li Xiong
Emory University
SMM4H @ NAACL 2021

Slide 2

Adverse Drug Effect Classification Task

- ADRs are the unintended effects of drugs used for prevention, diagnosis, or treatment.
- The ADR classification task is to detect ADR reports in user-generated data [Yates and Goharian, 2013].
  - Positive: "this prozac is hitting me"
  - Negative: "prozac is the best drug"
- It has been shown that social media users report the negative impact of drugs 11 months earlier than regular patients [Duh et al., 2016].

Slide 3

Challenges in the ADR Classification Task

- Highly imbalanced class distribution
  - On average, fewer than 9% of the documents are positive.
- Highly sparse language
  - There are virtually unlimited inventive ways of using drug names in language.
- User-generated data is noisy
  - It is informal, lacks capitalization and punctuation, and typos are common.
- On average, the F1 measure of state-of-the-art models is about 0.65.

Slide 4

Related Studies

- The ADR classification task has a long history.
  - Leaman et al. published the first work in 2010; they crawled drug discussion forums and used a sliding window to detect positive reports.
- Current state-of-the-art models rely on pretrained transformers.
  - In the SMM4H 2019 shared task, Chen et al. used pretrained BERT and were ranked 1st (F1 = 0.645).
  - In the SMM4H 2020 shared task, Wang et al. used pretrained RoBERTa and were ranked 1st (F1 = 0.640).
  - In the SMM4H 2021 shared task, Ramesh et al. used pretrained RoBERTa and were ranked 1st (F1 = 0.610).

Slide 5

View Distillation with Unlabeled Data

Our model consists of four steps (a runnable toy sketch of the flow follows below):
1) Extracting two views from the documents
2) Training one classifier on the data in each view and generating pseudo-labels
3) Using the pseudo-labels in each view to initialize a classifier in the other view
4) Further training the classifier in each view using labeled data
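To make the data flow concrete, here is a minimal runnable toy of the four steps. Scikit-learn logistic regressions stand in for the BERT-based classifiers, synthetic vectors stand in for the two views, and warm-started refitting stands in for finetuning; only the data flow mirrors the slide, not the authors' actual model or configuration.

```python
# Toy sketch of the four-step view-distillation flow (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lab, n_unl, dim = 200, 2000, 32
y = rng.integers(0, 2, n_lab)        # gold ADR / non-ADR labels
y_unl = rng.integers(0, 2, n_unl)    # hidden labels, used only to synthesize data

def make_view(labels):
    # Toy "view": noisy features whose mean depends on the (hidden) label.
    return rng.normal(size=(labels.size, dim)) + 0.5 * labels[:, None]

# 1) Two views per document (random stand-ins for the real representations).
lab_v1, lab_v2 = make_view(y), make_view(y)
unl_v1, unl_v2 = make_view(y_unl), make_view(y_unl)

# 2) Train one classifier per view on the labeled data and
#    pseudo-label the unlabeled pool.
c1 = LogisticRegression().fit(lab_v1, y)
c2 = LogisticRegression().fit(lab_v2, y)
pseudo1, pseudo2 = c1.predict(unl_v1), c2.predict(unl_v2)

# 3) Pseudo-labels from each view initialize a classifier in the *other* view.
c1_new = LogisticRegression(warm_start=True).fit(unl_v1, pseudo2)
c2_new = LogisticRegression(warm_start=True).fit(unl_v2, pseudo1)

# 4) Further train the same classifiers on the original labeled data;
#    warm_start keeps the coefficients from step 3 as the initialization.
c1_new.fit(lab_v1, y)
c2_new.fit(lab_v2, y)
```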

Slide 6

Extracting Two Views and Generating Pseudo-Labels

1) We use the document and keyword representations as two views.
   - The views are not conditionally independent, but they represent different aspects of the data [Balcan et al., 2004].
2) Each classifier labels the representation in one view, but the labels can be assigned to the documents.
   - The labels in one view can therefore be transferred to the other view.
   - Using the two classifiers, we label a large set of unlabeled documents.

[Figure: the example tweet "[CLS] this prozac is hitting me soooo hard [SEP]" is encoded by BERT, producing a document view fed to the document classifier and a keyword view fed to the keyword classifier.]
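Below is a small sketch of how two such representations could be read off a BERT encoder with the Hugging Face transformers library. The specific choices here are assumptions for illustration, not taken from the slide: the [CLS] embedding serves as the document view, the averaged drug-mention embedding serves as the keyword view, and bert-base-uncased stands in for the domain-pretrained BERT used in the paper.

```python
# Sketch only: [CLS] embedding as the document view, drug-mention embedding
# as the keyword view (assumed constructions, not the slide's exact ones).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

tweet, drug = "this prozac is hitting me soooo hard", "prozac"
inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)

# Document view: the [CLS] token sits at position 0.
doc_view = hidden[0, 0]

# Keyword view: locate the drug mention's wordpieces and average them.
drug_ids = tokenizer(drug, add_special_tokens=False)["input_ids"]
ids = inputs["input_ids"][0].tolist()
start = next(i for i in range(len(ids)) if ids[i:i + len(drug_ids)] == drug_ids)
kw_view = hidden[0, start:start + len(drug_ids)].mean(dim=0)

# Each view is then fed to its own classifier head (see the next slides).
```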

Slide 7

Cross-View Pretraining with Pseudo-Labels

3) We use the pseudo-labels in each view to pretrain a classifier in the other view.

[Figure: the drug classifier (C_g) and the document classifier (C_d) are trained on the labeled documents and then label the unlabeled documents; the resulting pseudo-labels are distilled into new classifiers Ĉ_g and Ĉ_d.]
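A minimal PyTorch sketch of this step, assuming the two views have already been extracted as 768-dimensional vectors (previous slide). The linear heads, random stand-in features, and single Adam loop are illustrative, not the authors' training setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Classifiers already trained on the labeled data (stand-in weights here).
doc_head = nn.Linear(768, 2)     # document classifier C_d
drug_head = nn.Linear(768, 2)    # drug/keyword classifier C_g

# Unlabeled pool, already encoded in both views (random stand-ins).
unl_doc = torch.randn(1000, 768)
unl_drug = torch.randn(1000, 768)

# Pseudo-label the unlabeled documents with each classifier.
with torch.no_grad():
    pseudo_from_doc = doc_head(unl_doc).argmax(dim=1)
    pseudo_from_drug = drug_head(unl_drug).argmax(dim=1)

# Cross-view distillation: each *new* head is pretrained on the
# pseudo-labels coming from the other view.
new_doc_head, new_drug_head = nn.Linear(768, 2), nn.Linear(768, 2)
params = list(new_doc_head.parameters()) + list(new_drug_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):                            # a few pretraining passes
    opt.zero_grad()
    loss = (loss_fn(new_doc_head(unl_doc), pseudo_from_drug) +
            loss_fn(new_drug_head(unl_drug), pseudo_from_doc))
    loss.backward()
    opt.step()
```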

Slide 8

Finetuning with Labeled Data

4) Further training the classifier in each view using the initial labeled documents.

- To label unseen documents, the two classifiers are aggregated.
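Continuing the sketch from the previous slide: the cross-view pretrained heads are finetuned on the gold labels, and at inference time their predictions are combined. Averaging the two class-probability distributions is an assumption for illustration; the slide only says the classifiers are aggregated.

```python
import torch
import torch.nn as nn

# In practice these would be the cross-view pretrained heads from the
# previous step; fresh layers are used only to keep this sketch self-contained.
doc_head, drug_head = nn.Linear(768, 2), nn.Linear(768, 2)

# Original labeled documents, encoded in both views (random stand-ins).
lab_doc, lab_drug = torch.randn(200, 768), torch.randn(200, 768)
labels = torch.randint(0, 2, (200,))              # gold ADR / non-ADR labels

params = list(doc_head.parameters()) + list(drug_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):                                # finetuning passes
    opt.zero_grad()
    loss = (loss_fn(doc_head(lab_doc), labels) +
            loss_fn(drug_head(lab_drug), labels))
    loss.backward()
    opt.step()

def predict(doc_view, drug_view):
    """Label unseen documents by averaging the two classifiers' probabilities
    (this aggregation rule is an assumption, not necessarily the paper's)."""
    with torch.no_grad():
        probs = (doc_head(doc_view).softmax(-1) +
                 drug_head(drug_view).softmax(-1)) / 2
    return probs.argmax(dim=-1)
```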

Slide 9

Experiments

- We used the SMM4H ADR dataset to evaluate our model.
  - This dataset is the largest benchmark on this topic.
  - It consists of 30,174 tweets, of which 25,616 are in the training set (with a 9.2% positive rate).
  - Evaluation is done via a CodaLab webpage.
- We used BERT pretrained on 800K unlabeled drug-related tweets as the feature extractor, and used the same tweet set to generate pseudo-labels.
- We compare with two sets of baseline models:
  1. Models that we implemented with our own pretrained BERT model
  2. Models available on the CodaLab webpage

Slide 10

Results and Analysis

- Main result
- Comparison with single-view algorithms
- Comparison with other variations of pretraining/finetuning

Slide 11

Thank You!