Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs - Claude Coulombe

Presented by Claude Coulombe at the May 21, 2019 event of montrealml.dev

Title: Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

Summary: In natural language processing, it is common to find oneself with far too little data to train a deep model. This “Big Data wall” represents a challenge for minority language communities, organizations, laboratories and companies that compete with the GAFAM web giants. In this presentation we discuss various simple, practical and robust text data augmentation techniques, based on NLP and machine learning, to overcome the lack of text data when training large statistical models, particularly deep learning models.

Bio: Claude Coulombe has evolved from the budding young scientist who participated in 15 science fairs and earned a B.Sc. in physics and a master's degree in AI at Université de Montréal (Homo Scientificus) into a passionate Québec high-tech entrepreneur and co-founder of an AI startup called Machina Sapiens, where he participated in the creation of a new generation of grammar checking tools (Homo Québecensis). Following the bursting of the technology bubble, Claude took a new evolutionary path: starting a family, launching Lingua Technologies, which combines machine translation and web technologies, and undertaking a PhD in machine learning at MILA under the supervision of Yoshua Bengio (Homo FamilIA). In 2008, as resources became scarce, Claude transformed into a Java tech lead specializing in the creation of rich web applications with Ajax, HTML5, JavaScript, GWT, REST architectures, cloud and mobile applications (Java Man). In 2013, Claude started a new PhD in cognitive science, participated in the development of two massive open online courses (MOOCs) at TÉLUQ, and learned Python and deep learning (Python Man, not to be confused with the Piltdown Man). In short, Claude is an old fossil that has evolved, reproduced, created tools and adapted to the rhythm of his passions.

Transcript

  1. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

    Claude COULOMBE, PhD candidate - TÉLUQ / UQAM; consultant - Lingua Technologies Inc.; scientific advisor - DataFranca. Montréal Machine Learning Tuesday - May 21, 2019.
  2. A well-kept secret of Deep Learning

    There is a price to pay for Deep Learning that we do not talk about enough: Deep Learning requires huge amounts of data, access to large computing infrastructures with graphics processors, and a wealth of know-how to prepare data, build architectures, tune hyperparameters, and train deep neural networks.
  3. The Big Data wall

    It is not uncommon to find oneself with far too little data to train a deep model. This "Big Data wall" represents a challenge for minority language communities on the Web, and for organizations, laboratories and companies that compete with the GAFAM web giants. The need for large amounts of data is not specific to Deep Learning; it is related to the complexity of the task to be solved.
  4. The Big Data wall (2)

    Some rules of thumb about the amount of data needed for deep learning.
  5. Data augmentation

    The idea is to create new data from existing data. Out of habit we say data augmentation, but it is really an amplification: we use existing data to create new examples while preserving the meaning, which must remain the same. The idea of a "semantically invariant transformation" is at the heart of the data augmentation process in natural language. One also speaks of synthetic data, generated data, artificial data or sometimes fake data.
  6. Previous works

    In computer vision, it is common practice to create new images by geometric transformations that preserve similarity [Simard, Steinkraus & Platt, 2003], [Ha & Bunke, 1997]. This type of data augmentation was used to win the ImageNet challenge in 2012 [Krizhevsky, Sutskever & Hinton, 2012], which marked a turning point for Deep Learning. In speech recognition, data augmentation is achieved by manipulating the signal, slowing it down or speeding it up [Ko et al., 2015], or by noise injection and alteration of the spectrogram [Jaitly & Hinton, 2013].
  7. Data augmentation for NLP

    Data augmentation for NLP is still uncommon. Until recently, the only widespread technique for amplifying text data was lexical substitution, which consists of replacing a word with a synonym from a thesaurus [Zhang & LeCun, 2015]. It is also no secret that hand-coded rules, including noise injection, are used; a concrete example is the NoiseMix library [Bittlingmayer, 2018].
  8. Goal

    The present work studies the feasibility of text data augmentation (TDA) techniques, applied as data pre-processing, that are practical, robust and simple to implement. Some existing TDA techniques, such as noise injection and the use of regular expressions, were tested for comparison. Others, like lexical substitution, were modified or improved. Finally, more innovative TDA techniques were tested, such as the generation of paraphrases through back-translation and syntactic tree transformations, using practical, robust and easy-to-use online services.
  9. Some general guideline rules

    Rule of statistical distribution respect: the augmented data must follow a statistical distribution similar to that of the original data. Golden rule of plausibility: a human being should not be able to distinguish the amplified data from the original data [Géron, 2017b]. Semantic invariance rule: data augmentation involves semantically invariant transformations.
  10. Textual noise injection

    Text noise injection involves the random removal, replacement or addition of an alphabetic character or of a sequence drawn from a pre-established look-up table. The algorithm proceeds in 3 steps: 1) random choice of a character to be replaced in the text; 2) random draw of an operator (removal, replacement, addition, equivalence); 3) random choice of the replacement character or sequence [Roquette, 2018]. Light textual noise injection is a semantically invariant transformation; strong textual noise injection is not.
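
    As a rough illustration, here is a minimal Python sketch of the 3-step algorithm above, assuming a hypothetical look-up table of replacement sequences and omitting the "equivalence" operator; it is a sketch, not the implementation of [Roquette, 2018].

        import random

        # Hypothetical look-up table of replacement sequences (placeholder values).
        LOOKUP_TABLE = ["a", "e", "s", "th", "er"]

        def inject_noise(text, n_errors=1):
            """Light textual noise injection: removal, replacement or addition."""
            chars = list(text)
            for _ in range(n_errors):
                i = random.randrange(len(chars))                  # 1) pick a character position
                op = random.choice(["remove", "replace", "add"])  # 2) draw an operator
                seq = random.choice(LOOKUP_TABLE)                 # 3) pick a replacement sequence
                if op == "remove":
                    del chars[i]
                elif op == "replace":
                    chars[i] = seq
                else:  # add
                    chars.insert(i, seq)
            return "".join(chars)

        print(inject_noise("The movie was surprisingly good."))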
  11. Spelling errors injection

    The idea is to generate texts containing common misspellings in order to train models that thereby become more robust to this particular type of textual noise. The spelling error injection algorithm is based on a list of the most common misspellings in English, compiled by the publisher of Oxford Dictionaries [Oxford Dictionaries, 2018]. Spelling error injection is a semantically invariant transformation.
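
    A minimal sketch of the idea, assuming a tiny hand-made dictionary of misspellings in place of the full Oxford Dictionaries list:

        import random
        import re

        # Tiny illustrative sample; the actual work used the Oxford Dictionaries
        # list of the most common misspellings in English.
        COMMON_MISSPELLINGS = {
            "definitely": "definately",
            "separate": "seperate",
            "occurred": "occured",
            "until": "untill",
        }

        def inject_spelling_errors(text, p=0.5):
            """With probability p, replace a word with a common misspelling of it."""
            def maybe_misspell(match):
                word = match.group(0)
                wrong = COMMON_MISSPELLINGS.get(word.lower())
                return wrong if wrong and random.random() < p else word
            return re.sub(r"[A-Za-z]+", maybe_misspell, text)

        print(inject_spelling_errors("I will definitely keep these files separate until Monday."))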
  12. Lexical substitution using a thesaurus

    Lexical substitution consists in proposing words that can replace a given word [Zhang & LeCun, 2015]. These words are typically synonyms coming from a thesaurus like WordNet [Miller et al., 1990]. Lexical replacement rules of thumb: replacing a word by a true synonym is a semantically invariant transformation; replacing a word by a hypernym (more general word) is a semantically invariant transformation; replacing a word by a hyponym (more specific word) is usually not a semantically invariant transformation; replacing a word by an antonym is not a semantically invariant transformation.
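
    A minimal sketch of thesaurus-based substitution using NLTK's WordNet interface; this naive version ignores word-sense disambiguation and part of speech, which a real system would have to handle:

        import random
        from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

        def synonym_substitution(sentence, p=0.3):
            """With probability p, replace a word with a synonym from its first WordNet synset."""
            new_words = []
            for word in sentence.split():
                synsets = wordnet.synsets(word)
                if synsets and random.random() < p:
                    # Lemmas of the first synset are rough synonyms of the word.
                    lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
                    candidates = [l for l in lemmas if l.lower() != word.lower()]
                    if candidates:
                        word = random.choice(candidates)
                new_words.append(word)
            return " ".join(new_words)

        print(synonym_substitution("The film was a great and memorable experience."))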
  13. Paraphrase generation

    Paraphrase generation was considered unrealistic and costly: "Therefore, the best way to do data augmentation would have been using human rephrases of sentences, but this is unrealistic and expensive due the large volume of samples in our datasets. As a result, the most natural choice in data augmentation for us is to replace words or phrases with their synonyms." [Zhang & LeCun, 2015] Definition of the perfect paraphrase: "In addition to being meaning-preserving, an ideal paraphrase must also diverge as sharply as possible in form from the original while still sounding natural and fluent." [Chen & Dolan, 2011]
  14. Paraphrase generation using regular expressions

    First, prefer surface transformations that can be produced with regular expressions, because they are simple and very efficient in terms of computation. Example of a text surface transformation: the transition to a contracted verbal form, and its inverse, is a semantically invariant transformation provided that ambiguities are preserved. I am => I'm, you are => you're, he is => he's, it is => it's, she is => she's, we are => we're, they are => they're, I have => I've, you have => you've, we have => we've, they have => they've, he has => he's, it has => it's, she has => she's, I will => I'll, you will => you'll, he will => he'll, are not => aren't, is not => isn't, was not => wasn't, ..., I'm => I am, I'll => I will, you'll => you will, he'll => he will, aren't => are not, isn't => is not, wasn't => was not, weren't => were not, couldn't => could not, don't => do not, doesn't => does not, didn't => did not, mustn't => must not, shouldn't => should not, can't => can not, can't => cannot, won't => will not, shan't => shall not
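
    A minimal Python sketch of such a surface transformation; it applies only unambiguous expansions, in line with the « respect for ambiguity » rule on the next slide (so "he's", "she's" and "it's" are deliberately left alone):

        import re

        # Only unambiguous expansions; "he's" / "she's" / "it's" are skipped because
        # expanding them (to "is" or "has") would resolve an ambiguity.
        EXPANSIONS = [
            (r"\bI'm\b", "I am"),
            (r"\bI'll\b", "I will"),
            (r"\byou'll\b", "you will"),
            (r"\baren't\b", "are not"),
            (r"\bisn't\b", "is not"),
            (r"\bwasn't\b", "was not"),
            (r"\bdon't\b", "do not"),
            (r"\bwon't\b", "will not"),
        ]

        def expand_contractions(text):
            """Produce a surface paraphrase by expanding contracted verbal forms."""
            for pattern, replacement in EXPANSIONS:
                text = re.sub(pattern, replacement, text)
            return text

        print(expand_contractions("I'm sure you'll agree it isn't bad."))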
  15. Paraphrase generation using regex (2)

    The « respect for ambiguity » rule of thumb: a transformation that creates an ambiguity or an imprecision is often considered semantically invariant; a transformation that resolves an ambiguity, by making information more specific, cannot be considered semantically invariant, unless the added precision is motivated by the context. Examples of transformations to avoid, because they resolve an ambiguity without justification: she's => she is, she's => she has.
  16. Paraphrase generation using syntax tree transformations

    Paraphrase generation by transforming syntactic trees is inspired by the work of Michel Gagnon, Polytechnique Montréal [Gagnon & Da Sylva, 2005], [Zouaq, Gagnon & Ozell, 2010]. Analyzing a sentence according to a dependency grammar formalism gives a tree whose nodes are the words of the sentence and whose edges (links) are the syntactic dependencies between the words.
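
    Such a dependency tree can be inspected in a few lines with spaCy (the library used for the diagrams later in this deck); this sketch simply prints each word with its dependency label and syntactic head:

        import spacy

        # Requires: python -m spacy download en_core_web_sm
        nlp = spacy.load("en_core_web_sm")
        doc = nlp("A man eats an apple in the kitchen.")

        # Each token is a node of the tree; (token, dep_, head) encodes one edge.
        for token in doc:
            print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")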
  17. Paraphrase generation using syntax tree transformations (2)

    Each sentence is submitted to the SyntaxNet analyzer [Petrov, 2016], [Kong et al., 2017], which is based on deep learning techniques and the TensorFlow library. The code is executed in Google's cloud infrastructure via the Cloud Natural Language API [Google, 2018a]. The result is a paraphrase generator based on syntax tree transformations.
  18. Paraphrase generation using syntax tree transformations (3)

    The transformation rules were built manually according to the paraphrase typology of [Vila, Martí & Rodríguez, 2014], following the 20/80 Pareto engineering principle. Marie Bourdon, computational linguist at Coginov Montréal, provided valuable assistance with this work. Examples of semantically invariant syntactic tree transformations: the transition from the passive verb form to the active verb form, and vice versa, is a semantically invariant transformation; the replacement of a noun or a nominal group by a pronoun is a semantically invariant transformation; the removal of an adjective, an adverb, an adjectival group or an adverbial group is a semantically invariant transformation.
  19. Paraphrase generation using syntax tree transformations (4)

    Diagram of the transformation from the active to the passive voice, drawn with the help of spaCy [Honnibal & Montani, 2017]. Transformation from the active voice to the passive voice of the sentence "A man eats an apple in the kitchen.": the head of the dependency structure is the verb "eats". The transformation rule starts by exchanging the subject group "man" (in red) and the object group "apple" (in blue). Then the verb "eats" is modified (in green) to give a new dependency structure which, once flattened, generates the sentence "An apple is eaten by a man in the kitchen."
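
    Below is a deliberately simplified Python sketch of this rule using spaCy in place of SyntaxNet; it only handles a single subject-verb-object clause, and the mini-lexicon of past participles is a hypothetical placeholder, so it is far from the full rule set of this work.

        import spacy

        nlp = spacy.load("en_core_web_sm")

        # Hypothetical mini-lexicon of past participles; a real rule set would rely
        # on a full morphological resource.
        PARTICIPLES = {"eat": "eaten", "see": "seen", "write": "written"}

        def active_to_passive(sentence):
            """Toy active-to-passive transformation for a simple SVO sentence."""
            doc = nlp(sentence)
            root = next(t for t in doc if t.dep_ == "ROOT")
            subjects = [t for t in root.children if t.dep_ == "nsubj"]
            objects = [t for t in root.children if t.dep_ == "dobj"]
            if not (root.pos_ == "VERB" and subjects and objects):
                return sentence  # the rule does not apply
            subj = list(subjects[0].subtree)   # e.g. "A man"
            obj = list(objects[0].subtree)     # e.g. "an apple"
            covered = {root} | set(subj) | set(obj)
            rest = [t.text for t in doc if t not in covered and not t.is_punct]
            participle = PARTICIPLES.get(root.lemma_, root.lemma_ + "ed")
            # Exchange the subject and object groups, then rewrite the verb group.
            passive = "{} is {} by {} {}".format(
                " ".join(t.text for t in obj).capitalize(),
                participle,
                " ".join(t.text.lower() for t in subj),
                " ".join(rest),
            )
            return passive.strip() + "."

        print(active_to_passive("A man eats an apple in the kitchen."))
        # -> An apple is eaten by a man in the kitchen.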
  20. Paraphrase generation using back-translation

    The idea of generating paraphrases using back-translation was suggested during a discussion in a hallway by my old friend, the physicist Antoine Saucier, at Polytechnique Montréal at the beginning of 2015. Back-translation is an old trick used to test the quality of machine translation; it consists in translating a text that has already been translated back to the original language. In fact, for a given sentence there are generally many equivalent correct translations, because of the huge combinatorial productivity of natural language. By definition, all these equivalent translations are paraphrases. Historically, the first mention of the use of back-translation to introduce variants into textual data, under the term "round trip machine translation", can be found in an article by a team from King's College London presented at the ISCOL 2015 conference [Lau, Clark & Lappin, 2015].
  21. Paraphrase generation using back-translation (2)
  22. Paraphrase generation using back-translation (3)

    For similarity filtering we opted for a simple difference in length between the back-translation and the original text [Wieting, Mallinson & Gimpel, 2017], the BLEU metric, and a logistic model trained on manually labelled data. Google's online translation service was used via the Google Translate API. All the required code is contained in a small iPython notebook. Good-quality back-translation is a semantically invariant transformation; poor-quality back-translation is not.
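
    A minimal sketch of the back-translation loop with the simple length-difference filter, assuming the google-cloud-translate v2 client library and credentials are set up (the BLEU metric and the logistic model are omitted here):

        # pip install google-cloud-translate  (requires Google Cloud credentials)
        from google.cloud import translate_v2 as translate

        client = translate.Client()

        def back_translate(text, pivot="fr", source="en"):
            """Translate to a pivot language and back to get a candidate paraphrase."""
            forward = client.translate(text, source_language=source, target_language=pivot)
            backward = client.translate(forward["translatedText"],
                                        source_language=pivot, target_language=source)
            return backward["translatedText"]

        def keep_paraphrase(original, candidate, max_ratio=0.3):
            """Length-difference filter: reject candidates identical to the original
            or whose length diverges too much from it."""
            diff = abs(len(candidate) - len(original))
            return candidate.lower() != original.lower() and diff <= max_ratio * len(original)

        original = "The movie was not bad at all, I would even watch it again."
        candidate = back_translate(original)
        if keep_paraphrase(original, candidate):
            print(candidate)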
  23. Experiment - task and dataset selection

    To validate and compare the different TDA techniques, we chose a simple problem that involves a standardized dataset and common deep neural network architectures; this makes it easier to isolate the effect of amplifying the text data. We opted for the task of predicting the positive or negative polarity of the movie reviews contained in the IMDB dataset [Pang, Lee & Vaithyanathan, 2002]. On this task, performance ranges from 70% for some traditional learning algorithms to over 90% for fine-tuned deep neural networks.
  24. Experiment - setup (2)

    The experiment is divided into two phases: 1) a pre-processing phase, where text data augmentation (TDA) is performed using the different techniques; 2) a training phase, where different deep neural network architectures are trained on the original data and on the amplified data. The models are then compared to see whether there is any improvement or degradation in prediction performance (accuracy). Training errors and F1 measurements were also recorded. Given scarce computing resources, we limited our experiments to a proof of concept; much work remains to be done to explore each text augmentation method in detail.
  25. Experiment - TDA pre-processing phase (3)
  26. Experiment - training deep models (4)

    Four architectures were trained: Multilayer Perceptron (MLP), 1D Convolutional Network, LSTM and BiLSTM.
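
    As an illustration of the training phase, here is a minimal Keras sketch of one of the four architectures (the BiLSTM) on the IMDB polarity task; the hyperparameters are placeholder values, not those of the study, and the amplified reviews would simply be appended to the training set before fitting:

        from tensorflow.keras.datasets import imdb
        from tensorflow.keras.preprocessing.sequence import pad_sequences
        from tensorflow.keras.models import Sequential
        from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

        VOCAB_SIZE, MAX_LEN = 20000, 200

        # IMDB movie reviews with positive/negative polarity labels.
        (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)
        x_train = pad_sequences(x_train, maxlen=MAX_LEN)
        x_test = pad_sequences(x_test, maxlen=MAX_LEN)

        # One of the four architectures compared in the experiment: a BiLSTM.
        model = Sequential([
            Embedding(VOCAB_SIZE, 128),
            Bidirectional(LSTM(64)),
            Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])

        # Augmented (amplified) reviews would be appended to x_train / y_train here.
        model.fit(x_train, y_train, batch_size=32, epochs=2,
                  validation_data=(x_test, y_test))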
  27. Discussion - observations

    All the proposed TDA techniques increased the accuracy of the results on a standard task of predicting the polarity of movie reviews, for different neural network architectures. An important observation about deep neural networks is the great variability of the results they give: the same network trained with the same data can give quite different results from one training run to another. This randomness of neural networks is necessary for their proper functioning. Deep learning systems in production base their results on ensembles of models to provide more consistent predictions.
  28. Discussion - advantages (2)

    The main advantages of the proposed text data augmentation techniques are practical and software-engineering ones. Leveraging the online NLP services of established suppliers offers many concrete and immediate benefits: availability, robustness, reliability and scalability. In addition, these are inexpensive, ready-to-use solutions available in a large number of languages. These TDA techniques are also easy to implement and use: often a few lines of code are enough to call an online service and get the results.
  29. Discussion - drawbacks (3)

    The main drawback of some of the proposed text data augmentation (TDA) techniques remains the amount of processing they require and their massive nature, which calls for a cloud computing infrastructure. In general, the TDA methods based on syntactic tree transformation and on back-translation degrade with the length of the sentences processed. Augmented data may also mask or dilute some of the weak signals present in the original data. Finally, some of the proposed TDA techniques rely on commercial NLP online services, such as translation and parsing services, that come from private providers.
  30. Discussion - comparison with other approaches (4)

    The proposed TDA techniques have an advantage over emerging approaches like textual GANs: they do not seek to generate sentences from scratch, but rather leverage existing, already meaningful sentences by applying semantically invariant transformations. These other techniques are difficult to implement and require mastering the « cooking » of neural networks; they also involve very long and tedious computations. Finally, the TDA techniques proposed in this study are similar to those already used in computer vision, which amplify the data by invariant transformations at the pre-processing stage.
  31. Discussion - limits of this work

    The main limitation of this work, and the main criticism of it, is that the experiment was carried out on only a single task, and a very simple one at that. To get a better idea of the effective contribution of the proposed TDA techniques, we would need to repeat our experiments on other datasets and other tasks while varying the parameters. Much work remains to be done to explore each TDA technique in more detail, to vary the parameters and to combine the techniques; but this would require many experiments and significant computing resources.
  32. Conclusion

    This empirical work, conducted with limited computing resources, has demonstrated different text data augmentation (TDA) techniques that are simple, practical and easy to implement. Much work remains to be done to explore each TDA technique in more detail, to vary the amplification parameters and to combine the techniques. The continuation of this work will require access to a computing infrastructure equipped with graphics processors. Finally, dissemination work is needed to introduce these techniques to practitioners, engineers and researchers who are seeking concrete and practical solutions to overcome the « Big Data wall » in the application of deep learning to NLP. Paper on arXiv: https://arxiv.org/abs/1812.04718
  33. Acknowledgements

    I would like to take this opportunity to thank TÉLUQ and UQAM, where I have benefited from great freedom and flexibility in the "Doctorat en informatique cognitive" program. A special thank you to my research director, Mr. Gilbert Paquette (AI and education), and to my two co-directors, Ms. Neila Mezghani (machine learning) and Mr. Michel Gagnon (NLP) of Polytechnique Montréal. A nod to Antoine Saucier of Polytechnique Montréal and to Marie Bourdon, computational linguist at Coginov Montréal. Thanks also to MITACS and Coginov for my internship. Claude COULOMBE, data scientist / consultant, PhD candidate.