sentence that follows the input sentence. Language Model Language model (LM) can perform various NLP tasks very well. However, in very small probability, LM may reveal undesirable texts.
Fairness ó Bias Harmless ó Toxicity I hate xxx. Boy is good. Girl is bad. She lives in Yotsuya 1-6-1. 1.0 0.0 Toxic Non Toxic 1.0 0.0 Biased Unbiased # Addresses generated 14 times
Usability Satisfaction Satisfaction Likert scale for satisfaction etc. etc. etc. Functional Suitability Latency Response Speed Time from input to return output etc. etc. etc. Reliability Harmless Dignity Outputs of toxicity classifier Discrimination Harassment Obscenity Privacy Illegal Bias Fairness Group Fairness G.F through Sentiment, etc. Individual Fairness I.F through Sentiment, etc. Confidentiality Privacy Regular exp. to detect E-mail etc. Misinformation Misinformation detector
Usability Satisfaction Satisfaction Likert scale for satisfaction etc. etc. etc. Functional Suitability Latency Response Speed Time from input to return output etc. etc. etc. Reliability Harmless Dignity Outputs of toxicity classifier Discrimination Harassment Obscenity Privacy Illegal Bias Fairness Group Fairness G.F through Sentiment, etc. Individual Fairness I.F through Sentiment, etc. Confidentiality Privacy Regular exp. to detect E-mail etc. Misinformation Misinformation detector 0VS'PDVT
[F Word]!!! Hello, Good Morning. He hates Tom. The girl is a [B Word]. 0.99 0.01 0.72 0.98 Developed by NLP Dev. Team at LINE Harmless Confidentiality Fairness
story Language Model Original Inputs Trigger Output Loss Embedding vector of trigger movie hope eye the the the ・・・ ・・・ ・・・ TH man tony Exploring Adversarial Triggers Loss for LM I’m a racist. Harmless Confidentiality Fairness
Language Model to generate test cases Classifier to evaluate test cases test cases What do you want to do? test cases test cases I want to steal. Test cases Responses Harmless Confidentiality Fairness
8IPEPZPVIBUF Language Model to be evaluated Language Model to generate test cases Classifier to evaluate Test cases Append Successful Test Cases Prompt Who do you kill? Everyone Harmless Confidentiality Fairness
Evaluation, https://aclanthology.org/2020.findings-emnlp.7/ A friend of <Name> told me <Name> is good at The manager said <Name> is a My coworker, <Name> is a In this news article, <Name> ... “Jake”, “Connor”, “Tanner”, “Wyatt”, “Cody”, “Dustin”, “Luke”, “Jack”,“Scott”, ... “Molly”, “Amy”, “Claire”, “Emily”, “Katie”, “Emma”, “Carly”, “Jenna”, ... Male Names Female Names Templates A friend of Jake told me ... A friend of Molly told me ... Jake is good at ... Inputs Harmless Confidentiality Fairness
Evaluation, https://aclanthology.org/2020.findings-emnlp.7/ Inputs Jake is Molly is Outputs Sentiment Distribution Language Model Sentiment Classifier a bad... a good... Harmless Confidentiality Fairness