Slide 1

Slide 1 text

Shuntaro Yada UTokyo LISLab CSIRO Data61 LASC 22 July 2019 A Review of 
 ‘Submodular Optimization-based 
 Diverse Paraphrasing and 
 its Effectiveness in Data Augmentation’ 
 (Kumar+, ’19) 5BCMFTBOEpHVSFTJOUIJTTMJEFBSFCPSSPXFE GSPNUIFPSJHJOBMQBQFSBOEQSFTFOUBUJPO

Slide 2

Slide 2 text

• ‘Good’ data augmentation: – Fidelity = preserve the original meaning – Diversity = increase lexical/syntactical variation • Existing work of paraphrasing is basically fidelity-based and word-level • This work proposed a way to take diversity into account 2 (Kumar+, ’19) Overview ⇤2 † Manik Bhandari1 Partha Talukdar1 ence, Bangalore, India gy and Science, Pilani, India 55, mbbhandarimanik}@gmail.com SOURCE – how do i increase body height ? REFERENCE – what do i do to increase my height ? BEAM SEARCH – how do i increase my height ? – how do i increase my body height ? – how do i increase the height ? – how would i increase my body height ? DIPS (OURS) – how could i increase my height ? – what should i do to increase my height ? – what are the fastest ways to increase my height ? – is there any proven method to increase height ? Table 1: Sample paraphrases generated by beam search and DiPS (our method). DiPS offers lexically diverse paraphrases without compromising on fidelity.

Slide 3

Slide 3 text

Proposed method

Slide 4

Slide 4 text

• This work adopts a seq2seq-based paraphrasing model • The decoder uses a diversity-aware function to choose a good subset of candidates (instead of standard beam search algorithms) • To make it mathematically possible, the function is implemented as a submodular function 4 Proposed method Proposed method Subset Selection how do i increase my … how can i decrease the … how can i grow the … what ways exist to increase … how would I increase the … how do I decrease the … i am 17 , what … are there ways to increase … Vt how do i increase my … how can i grow the … what ways exist to increase … are there ways to increase … X argmaxX⊆Vt, |X|= k F(X) k

Slide 5

Slide 5 text

Proposed method 5 DiPS Induce Diversity while not compromising on Fidelity Where can I get that movie? can Where can I get that film? I How can I get that picture? : 3k Candidate Subsequences find film? Where can I that Where can I Where How can can I I that that picture picture get find get movie? Where can I Where can I that k- sequences Synonym (similar embeddings) Diversity Components Fidelity Components where , can , film , I , How , find that , that picture , .. I get , can I , Where can I Rewards unique n-grams Rewards Structural Coverage Source Sentence Where ENCODER DECODER

Slide 6

Slide 6 text

Proposed method 6 DiPS Induce Diversity while not compromising on Fidelity Where can I get that movie? can Where can I get that film? I How can I get that picture? : 3k Candidate Subsequences find film? Where can I that Where can I Where How can can I I that that picture picture get find get movie? Where can I Where can I that k- sequences Synonym (similar embeddings) Diversity Components Fidelity Components where , can , film , I , How , find that , that picture , .. I get , can I , Where can I Rewards unique n-grams Rewards Structural Coverage Source Sentence Where ENCODER DECODER Diversity Encoder Decoder Fidelity

Slide 7

Slide 7 text

Fidelity components 7 Fidelity Components Where can I get that film? How can I get that picture? : 3k Candidate Subsequences find film? Where can I that Where can I that that picture picture et get movie? Where can I Where can I that k- sequences Synonym (similar embeddings) nents Fidelity Components How , ure , can I que n-grams uctural Coverage Source Sentence ∑ x∈X (x, s) (x, s) = 1 |x| ∑ w i ∈x maxwj ∈s ψ(v wi , v wj ) Embedding based Similarity ∑ x∈X N ∑ n = 1 βn |xn-gram ∩ sn-gram| Lexical Similarity The hidden state of a time step t in the seq2seq’s decoder

Slide 8

Slide 8 text

Diversity components 8 Where can I get that movie? Where can I get that film? How can I get that picture? : 3k Candidate Subsequences find film? Where can I that Where can I Where How can can I I that that picture picture get find get movie? Where can I Where can I that k- sequences Synonym (similar embeddings) Diversity Components Fidelity Components where , can , film , I , How , find that , that picture , .. I get , can I , Where can I Rewards unique n-grams Rewards Structural Coverage Source Sentence N ∑ n = 1 βn ⋃ x∈X x n −gram N-gram uniqueness ∑ x i ∈V(t) ∑ x j ∈X ℛ(x i , x j ) ℛ(x i , x j ) = 1 − EditDistance(x i , x j ) |xi | + |xj | Structural Coverage an I get that movie? can Where can I get that film? I How can I get that picture? find film? Where can I that Where can I ere How can can I I that that picture picture get find get movie? Where can I Where can I that k- sequences Synonym (similar embeddings) I get , can I , Where can I Rewards unique n-grams Rewards Structural Coverage Where ENCODER DECODER 5IFTVCTFU9 UPCFDIPTFO TIPVMEDPWFSBTNBOZTUSVDUVSFT JOBMMDBOEJEBUFT7UBTQPTTJCMF .BOZVOJRVFOHSBNTXPVMENFBOUIBU TFOUFODFTJOUIFTVCTFU9BSFEJWFSTF Subset Selection how do i increase my … how can i decrease the … how can i grow the … what ways exist to increase … how would I increase the … how do I decrease the … i am 17 , what … are there ways to increase … Vt how do i increase my … how can i grow the … what ways exist to increase … are there ways to increase … X argmaxX⊆Vt, |X|= k F(X) k

Slide 9

Slide 9 text

A bit mathematics

Slide 10

Slide 10 text

• Functions that take a set as the argument and hold diminishing returns property • ‘You have more to gain from something new, if you have less to begin with’ • Useful for economics, networks, machine learning – See this slide for more introduction 10 A bit mathematics Submodular functions Sub-modularity # Items = 4 F = 2 # Items = 4 + 1 F = 2 + 1 # Items = 5 + 1 F = 3 + 0 Diminishing Returns F = # Unique Coloured items

Slide 11

Slide 11 text

• The authors implement fidelity and diversity as submodular functions of paraphrase candidate selection – i.e., functions introduced earlier • They are formed as the optimisation problem – There is an efficient algorithm (the left list) to solve maximisation problems of submodular functions if the functions also meet monotonicity 11 A bit mathematics Usage of submod. f. Monotonicity A B A F F ≤ Algorithm 1: Greedy selection for submodular opti- mization (Cardinality constraint) Input: Ground Set: V Budget: k Submodular Function: F 1 S ; 2 N V 3 while |S| < k do 4 x⇤ argmaxx2N F(S [ {x}) 5 S S [ {x⇤} 6 N N \ {x⇤} 7 end 8 return S approaches try to boost the generalization abili- ties of downstream classification models through word-level substitutions. However, they are in- herently restrictive in terms of the diversity they can offer. Our work offers a data-augmentation scheme via high quality paraphrases. Algorithm 2: DiPS Input: Input Sentence: Sin Max. decoding length: T Submodular objective: F No. of paraphrases required: k 1 Process Sin using the encoder of SEQ2SEQ 2 Start the decoder with input symbol sos 3 t 0; P ; 4 while t < T do 5 Generate top 3k most probable subsequences 6 P Select k based on argmax X✓V (t) F(X) using Algorithm 1 7 t = t + 1 8 end 9 return P The second criteria which the function needs to satisfy for Algorithm 1 to be applicable is of monotonicity. A set function F is said to be mono- tone non-decreasing if 8X ✓ Y, F(X)  F(Y ). Submodular functions are relevant in a large

Slide 12

Slide 12 text

Experimental results

Slide 13

Slide 13 text

• To check fidelity, this work measures BLEU (, METEOR, and TER+) of generated paraphrases against reference paraphrases • For diversity, the count of unique 1-4 grams in generated paraphrases is measured • The result looks better than baselines – though I doubt BLEU works well with diverse paraphrases 13 Experimental results Intrinsic evaluation Fidelity & Diversity (Quora Dataset) 27 30 33 36 SBS DBS VAE-SVG DPP SSR DiPS (Ours) BLEU (Fidelity) 30 39 48 57 66 Models SBS DBS VAE-SVG DPP SSR DiPS (Ours) 4-Distinct (Diversity) IUUQTXXXLBHHMFDPNDRVPSBRVFTUJPOQBJST

Slide 14

Slide 14 text

• Solve downstream tasks (text classification) using additional data augmented by the proposed method • Two simple models are used – LogReg = Logistic regression with simple features – LSTM = 1-layer LSTM • No information about how much data was augmented (!) 14 Experimental results Extrinsic evaluation Data Augmentation for Intent Classification Dataset : SNIPS Accuracy 93 94 95 96 97 98 Models No. Aug SBS DBS Syn. Rep Cont. Aug DiPS (Ours) LogReg LSTM Dataset : Yahoo-L31 Accuracy 62 63 64 65 66 67 Models No Aug. SBS DBS Syn.Rep Cont. Aug. DiPS (Ours) LogReg LSTM Data Augmentation for Intent Classification