
20191102_ACL2019_adversarial_examples_in_NLP_YoheiKIKUTA

yoppe
November 02, 2019

Transcript

  1. ACL2019

    Adversarial Examples in NLP
    2019/11/02

    @yohei_kikuta


  2. Papers covered
    • Generating Natural Language Adversarial Examples through
    Probability Weighted Word Saliency (long)

    • Generating Fluent Adversarial Examples for Natural Languages
    (short)

    • Robust Neural Machine Translation with Doubly Adversarial
    Inputs (long)

    • Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training,
    and Model Development for Multi-Hop QA (long)

  3. Paper notes
    • Generating Natural Language …:

    https://github.com/yoheikikuta/paper-reading/issues/41

    • Generating Fluent Adversarial …:

    https://github.com/yoheikikuta/paper-reading/issues/42

    • Robust Neural Machine …:

    https://github.com/yoheikikuta/paper-reading/issues/43

    • Avoiding Reasoning Shortcuts …:

    https://github.com/yoheikikuta/paper-reading/issues/44

  4. Adversarial ○○○
    • Adversarial Network
    Typically trains a generator and a discriminator against each other.
    This makes it possible to build high-performance generators, from which many extensions followed.

    • Adversarial Example
    Perturbs the input data so that the model misclassifies it.
    The interest is in the construction itself and in its properties (especially active in image recognition).

    • Adversarial Training
    Uses adv. examples as a regularizer.
    In images, to defend against adv. examples; in NLP, for generalization performance.

  5. Classifying adversarial examples
    White box

    The model structure and gradient information are available.

    Black box

    Only the model's inputs and outputs are available.

    ※ There is also a targeted / non-targeted distinction (whether or not the attack forces a specific wrong class).

    [Figure: input → output (softmax) diagrams for the two settings]

  6. Why adv. examples are hard in NLP
    • Unlike images, text is discrete

    ・"Slightly changing" the input is not a natural operation.

    ・The embedding space is differentiable, but extra work is needed to map a perturbation back to an input.

    • Unlike images, even small perturbations are easy for humans to notice

    ・Vocabulary errors stand out (e.g., mood → mooP).

    ・Grammar errors stand out (e.g., I was … → I is …).

    • It is easy to end up changing the meaning (e.g., knight → night)

  7. Typical adv. examples in NLP
    • Creating adv. examples

    HotFlip: character-based flips (e.g., moo"d" → moo"P")

    Genetic attack: a genetic algorithm based on word substitution

    (e.g., A "runner" wants… → A "racer" wants…)

    • Use in adv. training

    VAT: train on inputs perturbed in the embedding space

    (A minimal sketch of the word-substitution family of attacks follows below.)
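    As a rough illustration of the word-substitution family these attacks belong to, here is a hedged sketch of a greedy variant. It is not any one paper's algorithm; `model_prob(words, label)` (the victim classifier's probability for the true label) and `synonyms(word)` (a WordNet-style candidate source) are hypothetical helpers.

```python
# Minimal sketch of a greedy word-substitution attack (illustrative only).
def greedy_substitution_attack(words, label, model_prob, synonyms, max_swaps=10):
    """Swap one word at a time, keeping the swap that most lowers the
    probability of the true label, until the prediction flips."""
    words = list(words)
    for _ in range(max_swaps):
        base = model_prob(words, label)
        if base < 0.5:          # binary case: the prediction has flipped
            return words
        best_drop, best_pos, best_word = 0.0, None, None
        for i, w in enumerate(words):
            for cand in synonyms(w):
                trial = words[:i] + [cand] + words[i + 1:]
                drop = base - model_prob(trial, label)
                if drop > best_drop:
                    best_drop, best_pos, best_word = drop, i, cand
        if best_pos is None:    # no substitution lowers the probability
            break
        words[best_pos] = best_word
    return words
```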

  8. adv. examples in NLP at ACL2019
    • Creating adv. examples: making the choice of words in word substitution more efficient

    ・determine the replacement priority by combining word saliency with the predicted probability

    ・introduce a language model and use Metropolis-Hastings sampling

    • Use in adv. training: settings more specialized to the problem

    ・in machine translation, perturb both the encoder and the decoder inputs

    ・perturbations that prevent the 1-hop shortcuts present in 2-hop QA data

  9. Generating Natural Language Adversarial Examples through
    Probability Weighted Word Saliency (long)

    word-based, non-targeted, white box attack

  10. Summary
    [Figure: in "… was funny as …" (w_{i−1}, w_i, w_{i+1}), candidates such as mirthful / laughable / … are tried for w_i, giving "… was laughable as …"; candidates such as exist / equally are then tried at the next position, giving "… was laughable equally …"]

    Synonyms are collected from WordNet; for a Named Entity, the vocabulary set D − D_{y_true} is used instead.

    Among the synonyms, choose the one that brings the model closest to a mistake:

    ω*_i = arg max_{ω′_i} [ P(y_true | x) − P(y_true | x′_i) ]

    Positions are then scored and ordered by the saliency and the probability change combined:

    softmax(S(x))_i ⋅ ΔP*_i

    S(x, ω_i) = P(y_true | x) − P(y_true | x̂_i)
    ΔP*_i = P(y_true | x) − P(y_true | x*_i)

    where x̂_i replaces the i-th word with unknown and x*_i replaces it with ω*_i.

    Words are substituted in this order until the model's prediction flips.
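    A hedged Python sketch of this scoring, under stated assumptions: `prob(words, label)` returns the victim model's probability of the true label, and `best_synonym(words, i, label)` implements the arg max above; both are hypothetical helpers, and this is a reading of the slide rather than the authors' code.

```python
import math

UNK = "<unk>"

def pwws_order(words, label, prob, best_synonym):
    """Return (position, substitute) pairs in PWWS priority order."""
    base = prob(words, label)
    saliency, delta_p, best = [], [], []
    for i in range(len(words)):
        # word saliency: probability drop when the i-th word becomes unknown
        masked = words[:i] + [UNK] + words[i + 1:]
        saliency.append(base - prob(masked, label))
        # probability drop for the best synonym substitution at position i
        cand = best_synonym(words, i, label)
        swapped = words[:i] + [cand] + words[i + 1:]
        delta_p.append(base - prob(swapped, label))
        best.append(cand)
    # probability-weighted word saliency: softmax(saliency)_i * ΔP*_i
    z = sum(math.exp(s) for s in saliency)
    scores = [math.exp(s) / z * d for s, d in zip(saliency, delta_p)]
    order = sorted(range(len(words)), key=lambda i: -scores[i])
    return [(i, best[i]) for i in order]
```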

  11. Experimental results
    The proposed method is Probability Weighted Word Saliency (PWWS).

    Classification accuracy of each selected model on the original three datasets and the perturbed datasets under the different attacking methods (the Original column is accuracy on unperturbed samples; lower accuracy means a more effective attack):

    Dataset         Model        Original  Random  Gradient  TiWO    WS      PWWS
    IMDB            word-CNN     86.55%    45.36%  37.43%    10.00%  9.64%   5.50%
    IMDB            Bi-dir LSTM  84.86%    37.79%  14.57%    3.57%   3.93%   2.00%
    AG's News       char-CNN     89.70%    67.80%  72.14%    58.50%  62.45%  56.30%
    AG's News       word-CNN     90.56%    74.13%  73.63%    60.70%  59.70%  56.72%
    Yahoo! Answers  LSTM         92.00%    74.50%  73.80%    62.50%  62.50%  53.00%
    Yahoo! Answers  word-CNN     96.01%    82.09%  80.10%    69.15%  66.67%  57.71%

    Word replacement rate of each attacking method (the lower the rate, the better the attack retains the semantics of the text):

    Dataset         Model        Random  Gradient  TiWO    WS      PWWS
    IMDB            word-CNN     22.01%  20.53%    15.06%  14.38%  3.81%
    IMDB            Bi-dir LSTM  17.77%  12.61%    4.34%   4.68%   3.38%
    AG's News       char-CNN     27.43%  27.73%    26.46%  21.94%  18.93%
    AG's News       word-CNN     22.22%  22.09%    20.28%  20.21%  16.76%
    Yahoo! Answers  LSTM         40.86%  41.09%    37.14%  39.75%  35.10%
    Yahoo! Answers  word-CNN     31.68%  31.29%    30.06%  30.42%  25.43%

    Accuracy on each dataset:

    ・{2, 4, 10}-class classification, respectively

    ・high attack success rate on the binary IMDB task

    ・improves on the previous methods

    Proportion of replaced words:

    ・fewer replacements keep the text closer to the original, which is desirable

    ・improves on the previous methods

    Result tables quoted from https://www.aclweb.org/anthology/P19-1103/

  12. Concrete examples
    Examples from the IMDB dataset with the Bi-directional LSTM model (the original word is shown with its adversarial substitution in parentheses):

    Original Prediction: Positive (Confidence = 96.72%) → Adversarial Prediction: Negative (Confidence = 74.78%)
    "Ah man this movie was funny (laughable) as hell, yet strange. I like how they kept the shakespearian language in this movie, it just felt ironic because of how idiotic the movie really was. this movie has got to be one of troma's best movies. highly recommended for some senseless fun!"

    Original Prediction: Negative (Confidence = 72.40%) → Adversarial Prediction: Positive (Confidence = 69.03%)
    "The One and the Only! The only really good description of the punk movement in the LA in the early 80's. Also, the definitive documentary about legendary bands like the Black Flag and the X. Mainstream Americans' repugnant views about this film are absolutely hilarious (uproarious)! How can music be SO diversive in a country of supposed liberty...even 20 years after... find out!"

    An AG's News example:

    Original Prediction: Business (Confidence = 91.26%) → Adversarial Prediction: Sci/Tech (Confidence = 33.81%)
    "site security gets a recount at rock the vote. grassroots movement to register younger voters leaves publishing (publication) tools accessible"

    IMDB is easy, so replacing a few words is enough; the other datasets require many more replacements.

    Concrete examples quoted from https://www.aclweb.org/anthology/P19-1103/

  13. Generating Fluent Adversarial Examples for Natural Languages
    (short)

    word-based, targeted, white/black box attack

  14. Summary
    Metropolis-Hastings sampling over sentences, e.g. x: "empty trash cans …" → x′: "the trash cans …", with acceptance probability

    α(x′|x) = min { 1, [ π(x′) g(x|x′) ] / [ π(x) g(x′|x) ] }

    The stationary distribution is defined with a language model LM and the classifier C:

    π(x | ỹ) ∝ LM(x) ⋅ C(ỹ | x)

    The proposal distribution g(x′|x) is defined by word replacement, insertion, and deletion:

    g(x′|x) = p_r T^r_B(x′|x) + p_i T^i_B(x′|x) + p_d T^d_B(x′|x)

    T^r_B(x′|x) = π(ω_1, …, ω_{m−1}, ω_c(∈ Q), ω_{m+1}, …, ω_n | ỹ) / Σ_{ω∈Q} π(ω_1, …, ω_{m−1}, ω, ω_{m+1}, …, ω_n | ỹ)

    T^i_B(x′|x) inserts a random word and then applies replacement.

    T^d_B(x′|x) = 1 if x′ = x_{−m} (the sentence with the m-th word removed).

    Q is the set of the top-n words under a score S defined with forward and backward language models:

    Black box: S_B(ω|x) = LM(ω | x_{[1:m−1]}) ⋅ LM_back(ω | x_{[m+1:n]})

    White box: S_W(ω|x) = S_B(ω|x) ⋅ sim( ∂loss/∂e_m, e_m − e )

    ※ In the white box setting, insertion and deletion provide no usable gradient information, so only replacement is used.
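    For intuition, a minimal sketch of one M-H step under these definitions. Both helpers are hypothetical stand-ins, not the paper's code: `pi` is the unnormalized stationary density LM(x)·C(ỹ|x), and `propose` applies one replace/insert/delete move and reports the forward and backward proposal densities.

```python
import random

def mh_step(words, pi, propose):
    """Accept or reject one proposed word-level edit."""
    cand, g_fwd, g_bwd = propose(words)           # x', g(x'|x), g(x|x')
    ratio = (pi(cand) * g_bwd) / (pi(words) * g_fwd)
    alpha = min(1.0, ratio)                       # acceptance probability
    return cand if random.random() < alpha else words
```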

  15. Experimental results
    Success rates of the adv. attacks (b-: black box, w-: white box):

    ・Invok# is the number of model invocations (fewer is better)

    ・PPL is the language-model perplexity (smaller = more fluent, the authors argue)

    ・α is the Metropolis-Hastings acceptance ratio

    [Figure 3: invocation-success curves of the attacks on (a) IMDB and (b) SNLI]

    Task  Approach  Succ(%)  Invok#  PPL    α(%)
    IMDB  Genetic   98.7     1427.5  421.1  –
    IMDB  b-MHA     98.7     1372.1  385.6  17.9
    IMDB  w-MHA     99.9     748.2   375.3  34.4
    SNLI  Genetic   76.8     971.9   834.1  –
    SNLI  b-MHA     86.6     681.7   358.8  9.7
    SNLI  w-MHA     88.6     525.0   332.4  13.3

    Success rates of adv. attacks against adversarially trained models:

    Model                   Genetic  b-MHA  w-MHA   (attack succ %)
    Victim model            98.7     98.7   99.9
    + Genetic adv training  93.8     99.6   100.0
    + b-MHA adv training    93.0     95.7   99.7
    + w-MHA adv training    92.4     97.5   100.0

    ・the dataset is IMDB

    ・with the proposed method, even the earlier adv. attacks can be defended against a little

    ・fundamentally, though, adv. training has only a slight effect here

    Does adv. training improve the model's generalization (accuracy after adversarial training)?

    Model                   Train # = 10K  30K   100K
    Victim model            58.9           65.8  73.0
    + Genetic adv training  58.8           66.1  73.6
    + w-MHA adv training    60.0           66.9  73.5

    ・the dataset is SNLI

    ・the previous method helps only where training data is plentiful

    ・the proposed method also helps where training data is scarce

    Result tables quoted from https://www.aclweb.org/anthology/P19-1559/

  16. Concrete examples
    Adversarial examples generated on SNLI (the substituted words are quoted):

    Case 1
    Premise: three men are sitting on a beach dressed in orange with refuse carts in front of them.
    Hypothesis: empty trash cans are sitting on a beach. Prediction: <Contradiction>
    Genetic: "empties" trash cans are sitting on a beach. Prediction: <Entailment>
    b-MHA: "the" trash cans are sitting "in" a beach. Prediction: <Entailment>
    w-MHA: "the" trash cans are sitting on a beach. Prediction: <Entailment>

    Case 2
    Premise: a man is holding a microphone in front of his mouth.
    Hypothesis: a male has a device near his mouth. Prediction: <Entailment>
    Genetic: a "masculine" has a device near his mouth. Prediction: <Neutral>
    b-MHA: a man has a device near his "car". Prediction: <Neutral>
    w-MHA: a man has a device near his "home". Prediction: <Neutral>

    Concrete examples quoted from https://www.aclweb.org/anthology/P19-1559/

  17. Robust Neural Machine Translation with Doubly Adversarial Inputs
    (long)

    (sub)word-based, non-targeted, white box attack

  18. Summary
    Overall loss:

    ℒ(θ) = ℒ_clean(θ_mt) + ℒ_lm(θ^x_lm) + ℒ_robust(θ_mt) + ℒ_lm(θ^y_lm)

    Encoder side: source x = x_1, …, x_i, …, x_I with embeddings e(x) = e(x_1), …, e(x_i), …, e(x_I).
    Decoder side: target y = y_1, …, y_j, …, y_J and decoder input z = z_1, …, z_j, …, z_J with embeddings e(z).

    P(y | x; θ_mt) = Π^J_{j=1} P(y_j | z_{≤j}, h; θ_mt)

    AdvGen: replace a fixed fraction of the words (x_i → x′_i on the encoder side, z_j → z′_j on the decoder side)

    ・candidate vocabulary restricted based on Q

    ・replacement word chosen based on the translation loss

    Q_src(x_i, x) = P_lm(x_i | x_{<i}, x_{>i}; θ^x_lm)

    Q_trg(z_i, z) = λ P_lm(z_i | z_{<i}, z_{>i}; θ^y_lm) + (1 − λ) P(z_i | z_{<i}, x′; θ_mt)

    ∙′_i = arg max_{∙∈Q} sim( e(∙) − e(∙_i), ∇_{e(∙_i)}(−log P(y | ∙; θ_mt)) )  where ∙ ∈ {x, z}

    ℒ_clean(θ_mt) = (1/|S|) Σ_{(x,y)∈S} −log P(y | x; θ_mt)

    ℒ_robust(θ_mt) = (1/|S|) Σ_{(x,y)∈S} −log P(y | x′, z′; θ_mt)
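    A minimal sketch of the arg max replacement rule above, under assumptions: `emb` (a word-to-vector map) and `grad_i` (the gradient of the translation loss with respect to the embedding at the position being replaced) are hypothetical stand-ins, not the paper's code.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small epsilon for numerical safety
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def advgen_replace(word_i, candidates, emb, grad_i):
    """Return the candidate maximizing sim(e(w) - e(x_i), grad_i)."""
    return max(candidates, key=lambda w: cosine(emb[w] - emb[word_i], grad_i))
```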

  19. Experimental results
    Comparison with various baselines on NIST Chinese-English translation:

    ・the metric is BLEU scores

    ・the top row is the vanilla Transformer

    ・* indicates the method trained using an extra corpus (back-translation)

    Method                    Model              MT06   MT02   MT03   MT04   MT05   MT08
    Vaswani et al. (2017)     Trans.-Base        44.59  44.82  43.68  45.60  44.57  35.07
    Miyato et al. (2017)      Trans.-Base        45.11  45.95  44.68  45.99  45.32  35.84
    Sennrich et al. (2016a)   Trans.-Base        44.96  46.03  44.81  46.01  45.69  35.32
    Wang et al. (2018)        Trans.-Base        45.47  46.31  45.30  46.45  45.62  35.66
    Cheng et al. (2018)       RNMT_lex.          43.57  44.82  42.95  45.05  43.45  34.85
    Cheng et al. (2018)       RNMT_feat.         44.44  46.10  44.07  45.61  44.06  34.94
    Cheng et al. (2018)       Trans.-Base_feat.  45.37  46.16  44.41  46.32  45.30  35.85
    Cheng et al. (2018)       Trans.-Base_lex.   45.78  45.96  45.51  46.49  45.73  36.08
    Sennrich et al. (2016b)*  Trans.-Base        46.39  47.31  47.10  47.81  45.69  36.43
    Ours                      Trans.-Base        46.95  47.06  46.48  47.39  46.58  37.38
    Ours + BackTranslation*   Trans.-Base        47.74  48.13  47.83  49.13  49.04  38.61

    WMT'14 English-German translation:

    ・a smaller margin than for Chinese-English, but the method still helps

    Method          Model        BLEU
    Vaswani et al.  Trans.-Base  27.30
    Vaswani et al.  Trans.-Big   28.40
    Chen et al.     RNMT+        28.49
    Ours            Trans.-Base  28.34
    Ours            Trans.-Big   30.01

    Result tables quoted from https://www.aclweb.org/anthology/P19-1425/

  20. Concrete examples
    Results on noisy input rather than adv. examples

    (randomly chosen words are simply replaced with words that are close in embedding space)

    Input & Noisy Input: [Chinese source sentence; garbled in extraction]
    Reference: this expressed the relationship of close friendship and cooperation between China and Russia and between our parliaments.
    Vaswani et al. on Input: this reflects the close friendship and cooperation between China and Russia and between the parliaments of the two countries.
    Vaswani et al. on Noisy Input: this reflects the close friendship and cooperation between the two countries and the two parliaments.
    Ours on Input: this reflects the close relations of friendship and cooperation between China and Russia and between their parliaments.
    Ours on Noisy Input: this embodied the close relations of friendship and cooperation between China and Russia and between their parliaments.

    (Table 5 in the paper: comparison of translation results of Transformer and the proposed model for an input and its perturbed input.)

    Concrete examples quoted from https://www.aclweb.org/anthology/P19-1425/

  21. Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training,
    and Model Development for Multi-Hop QA (long)

    A dataset-specific adv. attack that does not depend on the model

  22. Summary
    [Figure 1: a HotpotQA example with a reasoning shortcut, and the adversarial document that eliminates this shortcut to necessitate multi-hop reasoning. Question: "What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?" The golden reasoning-chain docs cover Kasper Schmeichel (son of Peter Schmeichel) and Peter Bolesław Schmeichel (voted the IFFHS World's Best Goalkeeper in 1992 and 1993); distractor docs cover Pelé and Kasper Hvidt; the adversarial doc describes "R. Bolesław Kelly", voted the IFFHS World's Best Defender in 1992 and 1993. Prediction: World's Best Goalkeeper (correct); prediction under adversary: IFFHS World's Best Defender.]

    HotpotQA questions were originally meant to be answered in stages:

    Kasper → (son of) → Peter → (voted as) → world's best GK

    In practice the answer can be produced just by matching the question text (a shortcut);

    a sampled inspection found that around half of the examples contain such a shortcut.

    An adv. Doc with the same shortcut structure, which leaves the original answer unchanged, is added:

    ・fetch a word close to the original answer in GloVe space and substitute it

    ・replace the title with one from another example so that the answer is not contradicted

    ・also pull in the original passage belonging to the newly used title

    ・swap the parts of the original passage that would affect the answer with generated answer-neutral sentences

    ・in the figure, the red-boxed sentence and the R. Bolesław Kelly sentence are swapped with sentences that do not affect the answer (input sentences not shown here)

    Using this data as the dev set sharply degrades model performance

    (showing that models had been answering via the shortcut).

    Building on this, a model that explicitly incorporates 2-hop reasoning is also proposed.

    Concrete examples quoted from https://www.aclweb.org/anthology/P19-1262/
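    A very rough sketch of the adversarial-document recipe above, mirroring the bullet points only. Every helper here is a hypothetical stand-in (`glove_neighbors`, `titles_pool`, `neutral_rewrite`), not the authors' code.

```python
def make_adversarial_doc(gold_text, answer, titles_pool,
                         glove_neighbors, neutral_rewrite):
    """Build a document that mimics the shortcut's surface form but can
    no longer support the original answer."""
    # step 1: substitute a GloVe-nearby word for the original answer
    fake_answer = glove_neighbors(answer)[0]
    text = gold_text.replace(answer, fake_answer)
    # step 2: take a title from another example so the real answer
    #         is not contradicted
    fake_title = titles_pool.pop()
    # steps 3-4: swap answer-relevant sentences for answer-neutral ones
    text = neutral_rewrite(text)
    return {"title": fake_title, "text": text}
```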

  23. Proposed model
    [Figure 3: a 2-hop bi-attention model with a control unit. Question and context pass through word/char embeddings, RNNs, bi-attention, and self-attention; the Context2Query attention is modeled as in Seo et al. (2017), and the output distribution cv of the control unit is used to bias the Query2Context attention. Bridge-entity supervision and start/end-index RNNs sit on top.]

    Details are omitted here, but at hop i the control unit adjusts which part of the question to attend to while taking the context into account. It imitates how a human answers a multi-step question: for the example in Fig. 1, a reader first looks for the name of "Kasper Schmeichel's father", then locates the answer by finding what "Peter Schmeichel" was "voted to be by the IFFHS in 1992". With S, J the lengths of the question and context, c_{i−1} the recurrent control state, u the contextualized question representation, and q the question vector, the control unit outputs a distribution cv over the question words and updates its state:

    cq_i = Proj[c_{i−1}; q];  ca_{i,s} = Proj(cq_i ⊙ u_s)
    cv_{i,s} = softmax(ca_{i,s});  c_i = Σ^S_{s=1} cv_{i,s} ⋅ u_s

    where Proj is a linear projection layer and ⊙ is element-wise multiplication.

    The query-to-context attention vector is derived as:

    m_j = max_{1≤s≤S} M_{s,j};  p_j = exp(m_j) / Σ^J_{j=1} exp(m_j);  q_c = Σ^J_{j=1} p_j h_j

    The question-aware context representation then passes through another BiLSTM (; is concatenation):

    h⁰_j = [h_j; cq_j; h_j ⊙ cq_j; cq_j ⊙ q_c];  h¹ = BiLSTM(h⁰)

    Self-attention is modeled on h¹ as BiAttn(h¹, h¹) to produce h²; a linear projection of h² gives the start-index logits for span prediction, the end-index logits come from h³ = BiLSTM(h²) followed by a linear projection, and a 3-way classifier on h³ predicts the answer type.

    Output heads:

    ・sentence-level supporting-facts prediction: predict whether each sentence is a supporting fact

    ・bridge-entity supervision: predict the entity that connects the supporting facts

    ・text span prediction: predict the answer

    Figure quoted from https://www.aclweb.org/anthology/P19-1262/
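    A hedged numpy sketch of the control-unit update shown above. The shapes are assumptions: u is (S, d) contextualized question words, q is (d,) the question vector, c_prev is (d,) the previous control state, and W_cq (d, 2d) and W_ca (d,) stand in for the two linear projections.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def control_unit(c_prev, q, u, W_cq, W_ca):
    cq = W_cq @ np.concatenate([c_prev, q])   # cq_i = Proj[c_{i-1}; q]
    ca = (cq * u) @ W_ca                      # ca_{i,s} = Proj(cq_i ⊙ u_s)
    cv = softmax(ca)                          # distribution over question words
    c = cv @ u                                # c_i = Σ_s cv_{i,s} · u_s
    return c, cv
```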

  24. Experimental results
    Results using adv. examples in train and dev respectively:

    ・the metric is Exact Match (EM)

    ・"sp" is sentence-level supporting-facts prediction

    ・models trained on the regular data degrade badly on the adv. dev set

    　(they break down once the shortcut is no longer usable)

    ・the proposed 2-hop model performs well

    Train            Reg    Reg    Adv    Adv
    Eval             Reg    Adv    Reg    Adv
    1-hop Base       42.32  26.67  41.55  37.65
    1-hop Base + sp  43.12  34.00  45.12  44.65
    2-hop            47.68  34.71  45.71  40.72
    2-hop + sp       46.41  32.30  47.08  46.87

    Table 1: EM scores after training on the regular data or on the adversarial training set ADD4DOCS-RAND, and evaluation on the regular dev set or the ADD4DOCS-RAND adv-dev set. "1-hop Base" and "2-hop" do not have sentence-level supporting-facts supervision.

    Adv. training validated on various adv. dev sets:

    ・each original example consists of 10 paragraphs

    ・{4 or 8} of them are made adversarial

    ・2 of the 10 are needed to derive the answer and are always kept

    ・the adversarial paragraphs are either randomly inserted or prepended {R or P}

    Model            A4D-R  A4D-P  A8D-R  A8D-P
    1-hop Base       37.65  37.72  34.14  34.84
    1-hop Base + sp  44.65  44.51  43.42  43.59
    2-hop            40.72  41.03  37.26  37.70
    2-hop + sp       46.87  47.14  44.28  44.44

    Table 2: EM scores on the 4 adversarial evaluation settings after training on ADD4DOCS-RAND. "-R" and "-P" represent random insertion and prepending; A4D and A8D stand for the ADD4DOCS and ADD8DOCS adv-dev sets.

    An ablation study validates the added components:

    ・both the control unit and bridge-entity supervision are effective

    Train           Regular  Regular  Adv      Adv
    Eval            Regular  Adv      Regular  Adv
    2-hop           47.68    34.71    45.71    40.72
    2-hop - Ctrl    46.12    32.46    45.20    40.32
    2-hop - Bridge  43.31    31.80    41.90    37.37
    1-hop Base      42.32    26.67    41.55    37.65

    Table 3: Ablation for the control unit and bridge-entity supervision (EM scores after training on the regular or adversarial ADD4DOCS-RAND data). Note that 1-hop Base is the same as 2-hop without both the control unit and bridge-entity supervision.

    Result tables quoted from https://www.aclweb.org/anthology/P19-1262/

  25. Paper notes (recap)
    • Generating Natural Language …:

    https://github.com/yoheikikuta/paper-reading/issues/41

    • Generating Fluent Adversarial …:

    https://github.com/yoheikikuta/paper-reading/issues/42

    • Robust Neural Machine …:

    https://github.com/yoheikikuta/paper-reading/issues/43

    • Avoiding Reasoning Shortcuts …:

    https://github.com/yoheikikuta/paper-reading/issues/44