Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs - Claude Coulombe

Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs.
Claude COULOMBE, PhD candidate, TÉLUQ / UQAM. Consultant, Lingua Technologies Inc. Scientific advisor, DataFranca. Montréal Machine Learning Tuesday, May 21, 2019.

A well-kept secret of Deep Learning There is a price to pay for Deep Learning that we do not talk about enough. Deep Learning requires huge amounts of data, access to large computing infrastructures with graphics processors, and a wealth of know-how to prepare data, build architectures, tune hyperparameters, and train deep neural networks.

The Big Data wall It is not uncommon to find oneself with far too little data to train a deep model. This "Big Data wall" is a challenge for minority-language communities on the Web and for organizations, laboratories and companies that compete with the GAFAM giants. The need for large amounts of data is not specific to Deep Learning; it is related to the complexity of the task to be solved.

The Big Data wall (2) Some rules of thumb about the amount of data for deep learning.

Data Augmentation The idea is to create new data from existing data. By habit we say data augmentation, but it is really an amplification, because we use existing data to create new data while preserving the meaning, which must remain the same. The idea of a "semantically invariant transformation" is at the heart of the data augmentation process in natural language. Augmented data are also called synthetic data, generated data, artificial data or sometimes fake data.

Previous works In computer vision, it is common practice to create new images by geometric transformations that preserve similarity [Simard, Steinkraus & Platt, 2003], [Ha & Bunke, 1997]. This type of data augmentation was used to win the ImageNet challenge in 2012 [Krizhevsky, Sutskever & Hinton, 2012], which marked a turning point for Deep Learning. In speech recognition, data augmentation is achieved by slowing down or speeding up the signal [Ko et al., 2015], and by noise injection and alteration of the spectrogram [Jaitly & Hinton, 2013].

Data augmentation for NLP Until recently, the only widespread technique for amplifying text data was lexical substitution, which consists of replacing a word with a synonym taken from a thesaurus [Zhang & LeCun, 2015]. But it is no secret that hand-coded rules, including noise injection, are also used. A concrete example is the NoiseMix library [Bittlingmayer, 2018]. Data augmentation for NLP is still uncommon.

Goal The present work aims to study the feasibility of text data augmentation (TDA) techniques, applied at the data pre-processing stage, that are practical, robust and simple to implement. Some existing TDA techniques have been tested for comparison, such as noise injection or the use of regular expressions. Others have been modified or improved, like lexical substitution. Finally, more innovative TDA techniques have been tested, such as the generation of paraphrases through back-translation and syntactic tree transformations, using practical, robust and easy-to-use online services.

Some general guidelines Rule of respect for the statistical distribution: the augmented data must follow a statistical distribution similar to that of the original data. Golden rule of plausibility: a human being should not be able to distinguish between the amplified data and the original data [Géron, 2017b]. Semantic invariance rule: data augmentation involves semantically invariant transformations.

Textual noise injection Textual noise injection involves the random removal, replacement or addition of an alphabetic character or of a sequence drawn from a pre-established look-up table. The algorithm proceeds in 3 steps. 1) Random choice of a character to be replaced in the text. 2) Random draw of an operator (removal, replacement, addition, equivalence). 3) Random choice of the replacement character or sequence [Roquette, 2018]. Light textual noise injection is a semantically invariant transformation. Strong textual noise injection is not a semantically invariant transformation.
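
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of [Roquette, 2018]: the `EQUIVALENTS` look-up table, the function name `inject_noise` and the `n_edits` parameter are hypothetical. A small `n_edits` corresponds to light noise.

```python
import random
import string

# Hypothetical look-up table of "equivalent" sequences (assumption, not from the talk).
EQUIVALENTS = {"a": "@", "e": "3", "o": "0", "s": "$"}
OPERATORS = ("removal", "replacement", "addition", "equivalence")


def inject_noise(text, n_edits=1, rng=None):
    """Apply n_edits random character-level edits to text."""
    rng = rng or random.Random()
    chars = list(text)
    for _ in range(n_edits):
        # Only alphabetic characters are candidates for an edit.
        positions = [i for i, c in enumerate(chars) if c.isalpha()]
        if not positions:
            break
        i = rng.choice(positions)    # step 1: pick a character
        op = rng.choice(OPERATORS)   # step 2: draw an operator
        # step 3: pick the replacement character or sequence
        if op == "removal":
            del chars[i]
        elif op == "replacement":
            chars[i] = rng.choice(string.ascii_lowercase)
        elif op == "addition":
            chars.insert(i, rng.choice(string.ascii_lowercase))
        else:
            # equivalence: keep the character unchanged if the table has no entry
            chars[i] = EQUIVALENTS.get(chars[i].lower(), chars[i])
    return "".join(chars)
```

Passing a seeded generator, for example `inject_noise(review, n_edits=2, rng=random.Random(42))`, makes the amplified data reproducible from one run to the next.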

Spelling errors injection The idea is to generate texts containing common misspellings in order to train our models, which will thus become more robust to this particular type of textual noise. The spelling error injection algorithm is based on a list of the most common misspellings in English. This list has been compiled by the publisher of Oxford Dictionaries [Oxford Dictionaries, 2018]. Spelling errors injection is a semantically invariant transformation.
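
A minimal sketch of the idea, assuming a small hand-written table in place of the Oxford list. The `MISSPELLINGS` entries, the function name and the `rate` parameter are illustrative assumptions.

```python
import random
import re

# Tiny illustrative table (correct spelling -> common misspelling).
# It is a stand-in for the Oxford Dictionaries list, not a copy of it.
MISSPELLINGS = {
    "definitely": "definately",
    "separate": "seperate",
    "receive": "recieve",
    "until": "untill",
    "which": "wich",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, MISSPELLINGS)) + r")\b", re.IGNORECASE
)


def inject_spelling_errors(text, rate=1.0, rng=None):
    """Replace each known word by its misspelling with probability `rate`."""
    rng = rng or random.Random()

    def swap(match):
        word = match.group(0)
        if rng.random() >= rate:
            return word  # leave this occurrence untouched
        wrong = MISSPELLINGS[word.lower()]
        # Keep the capitalization of the first letter.
        return wrong.capitalize() if word[0].isupper() else wrong

    return _PATTERN.sub(swap, text)
```

With `rate=1.0` every listed word is misspelled; a lower rate leaves some occurrences intact, which keeps the amplified text closer to the original distribution.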

Lexical substitution using a thesaurus Lexical substitution consists in proposing words that can replace a given word [Zhang & LeCun, 2015]. These words are typically synonyms coming from a thesaurus like WordNet [Miller et al., 1990]. Lexical replacement rules of thumb: Replacing a word by a real synonym is a semantically invariant transformation. Replacing a word by a hypernym (more general word) is a semantically invariant transformation. Replacing a word by a hyponym (more specific word) is usually not a semantically invariant transformation. Replacing a word by an antonym is not a semantically invariant transformation.
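
The substitution step could look like the sketch below. The `THESAURUS` dictionary is a hypothetical toy resource containing only true synonyms, in line with the rules of thumb; a real system would query a thesaurus such as WordNet and would also have to disambiguate word senses, which this sketch does not attempt.

```python
import random
import re

# Hypothetical toy thesaurus holding true synonyms only (assumption for illustration).
THESAURUS = {
    "movie": ["film"],
    "great": ["excellent", "superb"],
    "boring": ["dull", "tedious"],
}


def substitute_synonyms(text, rng=None):
    """Replace every word found in the thesaurus by a randomly chosen synonym."""
    rng = rng or random.Random()

    def swap(match):
        word = match.group(0)
        synonyms = THESAURUS.get(word.lower())
        if not synonyms:
            return word  # unknown word: keep it as is
        choice = rng.choice(synonyms)
        # Keep the capitalization of the first letter.
        return choice.capitalize() if word[0].isupper() else choice

    return re.sub(r"[A-Za-z]+", swap, text)
```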

Paraphrase generation Paraphrase generation was considered unrealistic and costly: "Therefore, the best way to do data augmentation would have been using human rephrases of sentences, but this is unrealistic and expensive due the large volume of samples in our datasets. As a result, the most natural choice in data augmentation for us is to replace words or phrases with their synonyms." [Zhang & LeCun, 2015] Definition of the perfect paraphrase: "In addition to being meaning-preserving, an ideal paraphrase must also diverge as sharply as possible in form from the original while still sounding natural and fluent." [Chen & Dolan, 2011]

Paraphrase generation using regular expressions Firstly, prefer surface transformations that can be produced with regular expressions, because they are simple and computationally very efficient. Example of a text surface transformation: the transition to a contracted verbal form and its inverse is a semantically invariant transformation provided that the ambiguities are preserved. Contraction: I am => I'm, you are => you're, he is => he's, it is => it's, she is => she's, we are => we're, they are => they're, I have => I've, you have => you've, we have => we've, they have => they've, he has => he's, it has => it's, she has => she's, I will => I'll, you will => you'll, he will => he'll, are not => aren't, is not => isn't, was not => wasn't, ... Expansion: I'm => I am, I'll => I will, you'll => you will, he'll => he will, aren't => are not, isn't => is not, wasn't => was not, weren't => were not, couldn't => could not, don't => do not, doesn't => does not, didn't => did not, mustn't => must not, shouldn't => should not, can't => can not, can't => cannot, won't => will not, shan't => shall not

Paraphrase generation using regex (2) The "respect for ambiguity" rule of thumb: A transformation that creates ambiguity or imprecision is often considered semantically invariant. A transformation that resolves an ambiguity by specifying information cannot be considered semantically invariant, unless the specified information is motivated by the context. Examples of transformations to avoid because they resolve an ambiguity without justification: she's => she is, she's => she has
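
A short sketch of both directions with Python's `re` module. The rule tables cover only part of the lists given in the slides, and the names `CONTRACT`, `EXPAND`, `contract` and `expand` are illustrative. Contraction is allowed to create the ambiguous "'s" form, while expansion deliberately leaves "'s" alone, since choosing between "is" and "has" would resolve an ambiguity without justification.

```python
import re

# Contracting may create ambiguity ("he's"), which the rule of thumb tolerates.
CONTRACT = [
    (r"\b(I) am\b", r"\1'm"),
    (r"\b(you|we|they) are\b", r"\1're"),
    (r"\b(he|she|it) is\b", r"\1's"),
    (r"\b(I|you|we|they) have\b", r"\1've"),
    (r"\b(I|you|he|she|it|we|they) will\b", r"\1'll"),
    (r"\b(is|are|was|were|do|does|did) not\b", r"\1n't"),
]

# Expanding is limited to forms with a single reading; "'s" is deliberately absent.
EXPAND = [
    (r"\b(I)'m\b", r"\1 am"),
    (r"\b(you|we|they)'re\b", r"\1 are"),
    (r"\b(I|you|we|they)'ve\b", r"\1 have"),
    (r"\b(I|you|he|she|it|we|they)'ll\b", r"\1 will"),
    # Sketch limitation: a sentence-initial "Can't" or "Won't" comes out lowercased.
    (r"\bcan't\b", "cannot"),
    (r"\bwon't\b", "will not"),
    (r"\b(is|are|was|were|do|does|did|could|should|must)n't\b", r"\1 not"),
]


def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order, ignoring case."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


def contract(text):
    return apply_rules(text, CONTRACT)


def expand(text):
    return apply_rules(text, EXPAND)
```

For instance, `expand("She's sure we don't know.")` leaves "She's" untouched and only rewrites "don't".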

Paraphrase generation using syntax tree transformations Paraphrase generation by transforming syntactic trees is inspired by the work of Michel Gagnon, Polytechnique Montréal [Gagnon & Da Sylva, 2005], [Zouaq, Gagnon & Ozell, 2010]. The analysis of a sentence according to a dependency grammar formalism gives a tree whose nodes are the words of the sentence and whose edges (links) are the syntactic dependencies between the words.

Paraphrase generation using syntax tree transformations (2) Each sentence is submitted to the SyntaxNet analyzer [Petrov, 2016], [Kong et al., 2017], which is based on deep learning techniques and the TensorFlow library. The code is executed in Google's cloud infrastructure via the Cloud Natural Language API [Google, 2018a]. Figure: paraphrase generator based on syntax tree transformations.

Paraphrase generation using syntax tree transformations (3) The transformation rules were built manually according to the paraphrase typology of [Vila, Martí & Rodríguez, 2014], following the Pareto 80/20 engineering principle. Marie Bourdon, computational linguist at Coginov Montréal, provided valuable assistance to this work. Examples of semantically invariant syntactic tree transformations: The transition from the passive verb form to the active verb form and vice versa is a semantically invariant transformation. The replacement of a noun or a nominal group by a pronoun is a semantically invariant transformation. The removal of an adjective, an adverb, an adjectival group or an adverbial group is a semantically invariant transformation.
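
As a rough illustration of the last rule, the sketch below drops adjectives and adverbs from a list of part-of-speech tagged tokens. The tags are written by hand here and the names `TAGGED` and `drop_modifiers` are hypothetical; in the approach described above the tags and dependencies would come from the parser, and a real rule would consult the tree so that only optional modifiers are removed.

```python
# Hypothetical hand-tagged tokens; in practice the tags would come from a parser.
TAGGED = [
    ("The", "DET"), ("very", "ADV"), ("old", "ADJ"), ("man", "NOUN"),
    ("slowly", "ADV"), ("eats", "VERB"), ("a", "DET"), ("ripe", "ADJ"),
    ("pear", "NOUN"), (".", "PUNCT"),
]

PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def drop_modifiers(tagged, drop=("ADJ", "ADV")):
    """Rebuild the sentence without the tokens whose tag is in `drop`."""
    kept = [word for word, pos in tagged if pos not in drop]
    out = ""
    for word in kept:
        # No space before punctuation or at the start of the sentence.
        if out and word not in PUNCTUATION:
            out += " "
        out += word
    return out
```

This naive version would damage a sentence whose adjective is the predicate, as in "The movie is great", which is one reason the rules operate on the syntactic tree rather than on a flat tag sequence.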

Paraphrase generation using syntax tree transformations (4) Diagram of the transformation from the active to the passive voice (diagram drawn with the help of spaCy [Honnibal & Montani, 2017]). Transformation from the active voice to the passive voice of the sentence "A man eats an apple in the kitchen." The head of the dependency structure is the verb "eat". The transformation rule starts by exchanging the "man" subject group (in red) and the "apple" object group (in blue). Then the verb "eat" is modified (in green) to give a new dependency structure which, once flattened, generates the sentence "An apple is eaten by a man in the kitchen."
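
The rule can be mimicked on a simplified structure. In this sketch the dependency tree is reduced to a `Clause` record holding the subject group, the verb lemma, the object group and the remaining dependents; the record, the `PAST_PARTICIPLES` table and the function name are assumptions made for illustration, and the rule only handles a singular object in the present tense.

```python
from dataclasses import dataclass

# Hypothetical lemma -> past participle table; a real system needs a full lexicon.
PAST_PARTICIPLES = {"eat": "eaten", "write": "written", "see": "seen"}


@dataclass
class Clause:
    subject: str       # subject group, e.g. "a man"
    verb_lemma: str    # head of the dependency structure, e.g. "eat"
    obj: str           # object group, e.g. "an apple"
    adjunct: str = ""  # remaining dependents, e.g. "in the kitchen"


def to_passive(clause):
    """Swap subject and object groups, then rewrite the verb as 'is' + participle."""
    participle = PAST_PARTICIPLES[clause.verb_lemma]
    words = [clause.obj, "is", participle, "by", clause.subject]
    if clause.adjunct:
        words.append(clause.adjunct)
    # Flatten the structure back into a sentence.
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."
```

Calling `to_passive(Clause("a man", "eat", "an apple", "in the kitchen"))` reproduces the example sentence of the diagram.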

Paraphrase generation using back-translation The idea of generating paraphrases by back-translation was suggested during a hallway discussion by my old friend, the physicist Antoine Saucier, at Polytechnique Montréal at the beginning of 2015. Back-translation is an old trick used to test the quality of machine translation, which consists in translating an already translated text back into the original language. In fact, for a given sentence there are generally many equivalent correct translations because of the huge combinatorial productivity of natural language. By definition, all these equivalent translations are paraphrases. Historically, the first mention of the use of back-translation to introduce variants into textual data, under the term "round trip machine translation", can be found in an article by a team from King's College London presented at the ISCOL 2015 conference [Lau, Clark & Lappin, 2015].

Paraphrases generation using back-translation (2)

Paraphrase generation using back-translation (3) To measure similarity, we opted for a simple difference in length between the back-translation and the original text [Wieting, Mallinson & Gimpel, 2017], the BLEU metric and a logistic model trained on manually labelled data. Google's online translation service was used via the Google Translate API. All the required code is contained in a small IPython notebook. Text data augmentation using back-translation: Good quality back-translation is a semantically invariant transformation. Poor quality back-translation is not a semantically invariant transformation.
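
The round trip and the length-based filter can be sketched as follows. To stay independent of any particular client library, the translation service is passed in as a callable; the `translate(text, source, target)` signature, the pivot languages and the `max_length_gap` threshold are assumptions for illustration and do not reproduce the notebook mentioned above. Only the length criterion is shown, not BLEU or the logistic model.

```python
from typing import Callable, List, Sequence

# Assumed signature of the translation service: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def back_translate(
    text: str,
    translate: Translator,
    pivots: Sequence[str] = ("fr", "de"),
    source: str = "en",
    max_length_gap: float = 0.3,
) -> List[str]:
    """Return the back-translations of `text` that pass a simple length filter."""
    paraphrases: List[str] = []
    for pivot in pivots:
        forward = translate(text, source, pivot)
        candidate = translate(forward, pivot, source)
        # Similarity proxy: relative difference in length with the original.
        length_gap = abs(len(candidate) - len(text)) / max(len(text), 1)
        is_new = candidate != text and candidate not in paraphrases
        if is_new and length_gap <= max_length_gap:
            paraphrases.append(candidate)
    return paraphrases
```

Any function with the assumed signature can be plugged in, for example a thin wrapper around an online translation client, or a dictionary-backed stub when testing offline.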

Experiment - Task and dataset selection To validate and compare the different TDA techniques, we chose a simple problem that involves a standardized dataset and common deep neural network architectures. This makes it easier to isolate the effect of amplifying text data. We opted for the task of predicting the positive or negative polarity of movie reviews contained in the IMDB database [Pang, Lee & Vaithyanathan, 2002]. For this task, performance ranges from 70% for some traditional learning algorithms to over 90% for fine-tuned deep neural networks.

Experiment - Setup (2) The experiment is divided into two phases: 1) a pre-processing phase where the text data augmentation (TDA) is performed using different techniques; 2) a training phase where different deep neural network architectures are trained on the original data and on the amplified data. The models are then compared to see whether prediction performance (accuracy) improves or degrades. Training error and F1 score were also measured. Given scarce computing resources, we have limited our experiments to a proof of concept. Much work remains to be done to explore each method of text augmentation in detail.
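
The pre-processing phase can be pictured as the small loop below, which applies a list of transformations to every labelled example and keeps the original label, as semantic invariance requires. The function name, the `(text, label)` representation and the removal of duplicates are assumptions made for this sketch rather than details taken from the experiment.

```python
def augment_dataset(examples, transforms):
    """Return the original examples followed by their amplified variants.

    examples:   list of (text, label) pairs
    transforms: list of callables mapping a text to a transformed text
    """
    amplified = list(examples)
    seen = {text for text, _ in examples}
    for text, label in examples:
        for transform in transforms:
            new_text = transform(text)
            if new_text in seen:
                continue  # skip unchanged or duplicate texts
            seen.add(new_text)
            # The label is carried over because the transformation
            # is assumed to be semantically invariant.
            amplified.append((new_text, label))
    return amplified
```

The same model can then be trained once on `examples` and once on the output of `augment_dataset` to compare the two runs.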

Experiment - TDA Preprocessing phase (3)

Experiment - Training deep models (4) Multilayer Perceptron (MLP), 1D Convolutional Network (CNN 1D), LSTM, BiLSTM

Results - Multilayer Perceptron

Results - CNN 1D

Results - LSTM

Results - BiLSTM

Discussion - Observations All the proposed TDA techniques increased the accuracy of the results on a standard task of predicting the polarity of movie reviews, for different neural network architectures. An important observation about deep neural networks is the great variability of their results. The same network trained with the same data can give quite different results from one training run to another. This randomness of neural networks is necessary for their proper functioning. Deep learning systems in production base their results on ensembles of models to provide more consistent predictions.

Discussion - Advantages (2) The main advantages of the proposed text data augmentation techniques are practical ones, from a software engineering point of view. Leveraging the online NLP services of established suppliers offers many concrete and immediate benefits: availability, robustness, reliability, scalability. In addition, these are inexpensive, ready-to-use solutions available in a large number of languages. These TDA techniques are also easy to implement and use. Often a few lines of code are enough to call an online service and get the results.

Discussion - Drawbacks (3) The main drawback of some of the proposed text data augmentation (TDA) techniques is the large amount of processing involved, which requires the use of cloud computing infrastructure. In general, TDA methods based on syntactic tree transformation and back-translation degrade as the length of the processed sentences increases. Augmented data may mask or dilute some of the weak signals present in the original data. Some of the proposed TDA techniques rely on commercial online NLP services, such as translation and parsing services, that come from private providers.

Discussion - Comparison with other approaches (4) The proposed TDA techniques have the advantage over emerging approaches, like Textual GAN, that they do not seek to generate sentences from scratch, but rather to leverage existing, already meaningful sentences by using semantically invariant transformations. Other techniques are difficult to implement and require mastery of the "cooking" of neural networks. They also involve very long and tedious calculations. Finally, the TDA techniques proposed in this study are similar to those already used in computer vision, which consist in amplifying the data by invariant transformations at the pre-processing stage.

Discussion - Limits of this work The main limit and criticism of this work is that the experiment was carried out on a single task, and a very simple one at that. To get a better idea of the effective contribution of the proposed TDA techniques, we would need to repeat our experiments with other datasets and for other tasks by varying the parameters. Much work remains to be done to explore each TDA technique in more detail, to vary the parameters and to combine them. But this would require a lot of experimentation and significant computing resources.

Conclusion This empirical work, conducted with limited computing resources, has shown different text data augmentation (TDA) techniques that are simple, practical and easy to implement. Much work remains to be done to explore each TDA technique in more detail, to vary the amplification parameters and to combine them. The continuation of this work will require access to a computing infrastructure equipped with graphics processors. Finally, dissemination work is needed to introduce these techniques to practitioners, engineers and researchers who are seeking concrete and practical solutions to overcome the "Big Data wall" in the application of deep learning to NLP. Paper on arXiv: https://arxiv.org/abs/1812.04718

Acknowledgements I would like to take this opportunity to thank TÉLUQ and UQAM, where I have benefited from great freedom and flexibility in the "Doctorat en informatique cognitive" program. A special thank you to my research director, Mr. Gilbert Paquette (AI and Education), and my two co-directors, Ms. Neila Mezghani (Machine Learning) and Mr. Michel Gagnon (NLP) from Polytechnique Montréal. A nod to Antoine Saucier of Polytechnique Montréal and to Marie Bourdon, computational linguist at Coginov Montréal. Thanks also to MITACS and Coginov for my internship. Claude COULOMBE, data scientist / consultant, PhD candidate
