
View Distillation with Unlabeled Data for Extracting Adverse Drug Effects from User-Generated Data

Emory NLP

July 08, 2021

Transcript

  1. Adverse Drug Effect Classification Task
     • ADRs are the unintended effects of drugs used for prevention, diagnosis, or treatment
     • The ADR classification task is detecting ADR reports in user-generated data [Yates and Goharian, 2013]
       • Positive: "this prozac is hitting me"
       • Negative: "prozac is the best drug"
     • It has been shown that social media users report the negative impact of drugs 11 months earlier than regular patients [Duh et al., 2016]
  2. Challenges in the ADR Classification Task
     • Highly imbalanced class distributions: on average, less than 9% of documents are positive
     • Highly sparse language: there are virtually unlimited inventive ways of using drug names in language
     • User-generated data is noisy: it is informal, lacks capitalization and punctuation, and typos are common
     • On average, the F1 measure of state-of-the-art models is about 0.65
  3. Related Studies
     • The ADR classification task has a long history
       • Leaman et al. published the first work in 2010: they crawled drug discussion forums and used a sliding window to detect positive reports
     • Current state-of-the-art models rely on pretrained transformers
       • In the SMM4H 2019 shared task, Chen et al. used pretrained BERT and were ranked 1st (F1 = 0.645)
       • In the SMM4H 2020 shared task, Wang et al. used pretrained RoBERTa and were ranked 1st (F1 = 0.640)
       • In the SMM4H 2021 shared task, Ramesh et al. used pretrained RoBERTa and were ranked 1st (F1 = 0.610)
  4. View Distillation with Unlabeled Data
     Our model consists of four steps (a minimal end-to-end sketch follows this list):
     1) Extracting two views from documents
     2) Training one classifier on the data in each view and generating pseudo-labels
     3) Using the pseudo-labels in each view to initialize a classifier in the other view
     4) Further training the classifier in each view using labeled data
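To make the control flow of the four steps concrete, here is a minimal, self-contained stand-in for the whole procedure. It substitutes TF-IDF features and linear SGD classifiers for the BERT-based classifiers, and the toy documents, drug list, and keyword_window helper are illustrative assumptions rather than the authors' setup.

```python
# Minimal stand-in for the four-step view-distillation loop.
# TF-IDF + SGD classifiers replace the BERT-based classifiers; documents,
# drug names, and hyperparameters are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

labeled_docs = ["this prozac is hitting me", "prozac is the best drug"]
labels = np.array([1, 0])                    # 1 = ADR report, 0 = no ADR
unlabeled_docs = ["seroquel made me so dizzy", "just refilled my seroquel"]

def keyword_window(doc, size=2):
    """Hypothetical keyword view: a small window around a drug mention."""
    tokens = doc.split()
    hits = [i for i, t in enumerate(tokens) if t in {"prozac", "seroquel"}]
    i = hits[0] if hits else 0
    return " ".join(tokens[max(0, i - size):i + size + 1])

# 1) Extract the two views from every document.
views = {
    "document": (labeled_docs, unlabeled_docs),
    "keyword": ([keyword_window(d) for d in labeled_docs],
                [keyword_window(d) for d in unlabeled_docs]),
}

# 2) Train one classifier per view and pseudo-label the unlabeled documents.
vectorizers, pseudo = {}, {}
for name, (lab, unlab) in views.items():
    vec = TfidfVectorizer().fit(lab + unlab)
    clf = SGDClassifier().fit(vec.transform(lab), labels)
    vectorizers[name] = vec
    pseudo[name] = clf.predict(vec.transform(unlab))

# 3) Pretrain a fresh classifier in each view on the *other* view's
#    pseudo-labels, then 4) further train it on the labeled documents.
classifiers = {}
for name, (lab, unlab) in views.items():
    other = "keyword" if name == "document" else "document"
    vec = vectorizers[name]
    clf = SGDClassifier()
    clf.partial_fit(vec.transform(unlab), pseudo[other], classes=[0, 1])  # step 3
    clf.partial_fit(vec.transform(lab), labels)                           # step 4
    classifiers[name] = clf
```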
  5. Extracting Two Views and Generating Pseudo-Labels
     1) We use the document and keyword representations as two views (a code sketch of the view extraction follows this slide)
        • The views are not conditionally independent, but represent different aspects [Balcan et al., 2004]
     2) Each classifier labels the representation in one view, but the labels can be assigned to documents
        • The labels in one view can be transferred to the other view
        • Using the two classifiers, we label a large set of unlabeled documents
     [Figure: a tweet such as "[CLS] this prozac is hitting me soooo hard [SEP]" is encoded by BERT; its document representation feeds the document classifier and its keyword (drug) representation feeds the keyword classifier, and the resulting labels are assigned to the document]
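A minimal sketch of extracting the two views from one BERT pass, assuming the Hugging Face transformers library: the [CLS] vector serves as the document view and the drug-mention token vector as the keyword view. The generic bert-base-uncased checkpoint and the exact construction of the keyword view are assumptions; the authors use their own domain-pretrained BERT.

```python
# Sketch of extracting the document view and the keyword view from BERT.
# Checkpoint and keyword-view construction are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

tweet = "this prozac is hitting me soooo hard"
inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state        # (1, seq_len, 768)

# Locate the first subword of the drug mention ("prozac" may be split by WordPiece).
drug_ids = tokenizer("prozac", add_special_tokens=False)["input_ids"]
ids = inputs["input_ids"][0].tolist()
drug_position = ids.index(drug_ids[0])

doc_view = hidden[:, 0, :]                # [CLS] vector = document view
kw_view = hidden[:, drug_position, :]     # drug-token vector = keyword view
```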
  6. Cross-View Pretraining with Pseudo-Labels
     3) We use the pseudo-labels in each view to pretrain a classifier in the other view (see the sketch after this slide)
     [Figure: the drug classifier (C_g) and the document classifier (C_d) are trained on the labeled documents and then label the unlabeled documents; each set of pseudo-labels is distilled into the classifier of the other view, yielding the resulting classifiers Ĉ_g and Ĉ_d]
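A sketch of the cross-view pretraining step, assuming simple linear heads over fixed [CLS]-style features. The head architecture, the random stand-in features, and the hyperparameters are assumptions made only to show how pseudo-labels from the keyword (drug) view pretrain the document-view classifier; the symmetric direction works the same way.

```python
# Cross-view pretraining: the document-view head is trained on pseudo-labels
# produced by the keyword-view classifier. Features and labels below are
# random stand-ins for the unlabeled-corpus representations.
import torch
import torch.nn as nn

hidden = 768                                    # BERT hidden size
doc_head = nn.Linear(hidden, 2)                 # document-view classifier head
optimizer = torch.optim.Adam(doc_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

doc_view_unlabeled = torch.randn(1000, hidden)     # document-view vectors (stand-in)
kw_pseudo_labels = torch.randint(0, 2, (1000,))    # pseudo-labels from the keyword view

for epoch in range(3):
    optimizer.zero_grad()
    loss = loss_fn(doc_head(doc_view_unlabeled), kw_pseudo_labels)
    loss.backward()
    optimizer.step()
```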
  7. Finetuning with Labeled Data
     4) Further training the classifier in each view using the initial labeled documents
     • To label unseen documents, the two classifiers are aggregated (a sketch of the finetuning and aggregation follows this slide)
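A sketch of the final step, continuing from the heads above: each pretrained head is further trained on the labeled documents, and the predictions of the two views are combined. Averaging the softmax probabilities is an assumption; the slide only states that the two classifiers are aggregated.

```python
# Finetuning on labeled data, then aggregating the two view classifiers.
# Probability averaging is an assumed aggregation strategy.
import torch
import torch.nn.functional as F

def finetune(head, features, labels, epochs=3, lr=1e-4):
    """Further train a pretrained classifier head on the labeled documents."""
    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(head(features), labels)
        loss.backward()
        optimizer.step()
    return head

def predict(doc_head, kw_head, doc_features, kw_features):
    """Label unseen documents by averaging the two views' class probabilities."""
    with torch.no_grad():
        probs = (F.softmax(doc_head(doc_features), dim=-1)
                 + F.softmax(kw_head(kw_features), dim=-1)) / 2
    return probs.argmax(dim=-1)        # 1 = ADR report, 0 = no ADR
```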
  8. Experiments
     • We used the SMM4H ADR dataset to evaluate our model; it is the largest benchmark on this topic
       • It consists of 30,174 tweets, of which 25,616 are in the training set (with a 9.2% positive rate)
       • Evaluation is done via a CodaLab webpage
     • We used BERT pretrained on 800K unlabeled drug-related tweets as the feature extractor, and used the same tweet set to generate pseudo-labels
     • We compare with two sets of baseline models:
       1. Models that we implemented with our own pretrained BERT model
       2. Models available on the CodaLab webpage
  9. Results and Analysis
     [Tables: main result; comparison with single-view algorithms; comparison with other variations of pretraining/finetuning]