Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs - Claude Coulombe

Claude Coulombe at the May 21, 2019 event of montrealml.dev

Title: Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

Summary: In natural language processing, it is common to find oneself with far too little data to train a deep model. This “Big Data wall” represents a challenge for minority language communities, organizations, laboratories and companies that compete with the GAFAM web giants. In this presentation we will discuss various simple, practical and robust text data augmentation techniques based on NLP and machine learning to overcome the lack of text data for training large statistical models, particularly for deep learning.

Bio: Claude Coulombe began as a budding young scientist who participated in 15 science fairs and earned a B.Sc. in physics and a master’s degree in AI at Université de Montréal (Homo Scientificus). He then evolved into a passionate Québec high-tech entrepreneur, co-founding an AI startup called Machina Sapiens, where he participated in the creation of a new generation of grammar checking tools (Homo Québecensis). Following the bursting of the technology bubble, Claude took a new evolutionary path to start a family, launch Lingua Technologies, which combines machine translation and web technologies, and undertake a PhD in machine learning at MILA under the supervision of Yoshua Bengio (Homo FamilIA). In 2008, resources becoming scarce, Claude transformed into a Java tech lead, specializing in the creation of rich web applications with Ajax, HTML5, JavaScript, GWT, REST architectures, cloud and mobile applications (Java Man). In 2013, Claude started a new PhD in cognitive science, participated in the development of two massive open online courses (MOOCs) at TÉLUQ, and learned Python and deep learning (Python Man, not to be confused with the Piltdown Man). In short, Claude is an old fossil that has evolved, reproduced, created tools and adapted to the rhythm of his passions.

PatternedScience

May 21, 2019

Transcript

  1. Text Data Augmentation Made Simple By
    Leveraging NLP Cloud APIs.
    Claude COULOMBE
    PhD candidate - TÉLUQ / UQAM
    Consultant - Lingua Technologies Inc.
    Scientific advisor DataFranca
    Montréal Machine Learning
    Tuesday - May 21 2019
    Sponsor

  2. A well-kept secret of Deep Learning
    There is a price to pay for Deep Learning that we do not talk about
    enough.
    Deep Learning requires huge amounts of data, access to large computing
    infrastructures with graphics processors, and a wealth of know-how to
    prepare data, build architectures, tune hyperparameters, and train deep
    neural networks.

  3. The Big Data wall
    It is not uncommon to find oneself with far too little data to train a deep
    model.
    This "Big Data wall" represents a challenge for minority language
    communities on the Web, organizations, laboratories and companies
    that compete with the giants of GAFAM.
    The need for a large amount of data is not specific to Deep Learning; it
    is related to the complexity of the task to be solved.

  4. The Big Data wall (2)
    Some rules of thumb about the amount of data for deep learning.

  5. Data Augmentation
    The idea is to create new data from existing data.
    We habitually say data augmentation, but it is really an amplification,
    because we use existing data to create new data while preserving
    its meaning. The idea of a "semantically invariant transformation"
    is at the heart of the data augmentation process in natural language.
    The result is also called synthetic data, generated data, artificial data or
    sometimes fake data.

  6. Previous works
    In computer vision, it is common practice to create new images by
    geometric transformations that preserve similarity [Simard,
    Steinkraus & Platt, 2003], [Ha & Bunke, 1997]. This type of data
    augmentation was used to win the ImageNet challenge in 2012
    [Krizhevsky, Sutskever & Hinton, 2012] which marks a turning point
    for Deep Learning.
    In speech recognition, data augmentation is achieved by
    manipulating the signal by slowing or accelerating it [Ko et al, 2015],
    by noise injection and alteration of the spectrogram [Jaitly & Hinton,
    2013].

  7. Data augmentation for NLP
    Until recently, the only widespread technique for amplifying text
    data was lexical substitution, which consists of replacing a word with
    its synonym using a thesaurus [Zhang & LeCun, 2015].
    But it is no secret that hand-coded rules, including noise injection,
    are used. A concrete example is the NoiseMix library [Bittlingmayer,
    2018].
    Data augmentation for NLP is still uncommon

  8. Goal
    The present work aims to study the feasibility of text data
    augmentation (TDA) techniques, applied at the data pre-processing stage,
    that are practical, robust and simple to implement.
    Some existing TDA techniques have been tested for comparison,
    such as noise injection or the use of regular expressions. Others
    have been modified or improved like lexical substitution. Finally,
    more innovative TDA techniques have been tested, such as the
    generation of paraphrases through back-translation and syntactic
    tree transformations, using practical, robust and easy-to-use online
    services.

  9. Some general guidelines
    Rule of Respect for the Statistical Distribution
    The augmented data must follow a statistical distribution similar to that of the
    original data.
    Golden Rule of Plausibility
    A human being should not be able to distinguish between the amplified data
    and the original data.
    [Géron, 2017b]
    Semantic Invariance Rule
    Data augmentation involves semantically invariant transformations.

  10. Textual noise injection
    Text noise injection involves the random removal, replacement or addition
    of an alphabetic character or a sequence drawn from a pre-established
    look-up table. The algorithm proceeds in 3 steps. 1) Random choice of a
    character to be replaced in the text. 2) Random draw of an operator
    (removal, replacement, addition, equivalence). 3) Random choice of
    character or replacement sequence [Roquette, 2018].
    Light textual noise injection is a semantically invariant transformation.
    Strong textual noise injection is not a semantically invariant transformation.
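The three steps of the noise injection algorithm can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the `EQUIVALENTS` look-up table, the `rate` parameter and the function name `inject_noise` are assumptions made for the example.

```python
import random
import string

# Hypothetical look-up table of "equivalent" characters (an assumption,
# not the table used in the original work).
EQUIVALENTS = {"a": "@", "e": "3", "o": "0", "s": "$"}
OPERATORS = ("removal", "replacement", "addition", "equivalence")


def inject_noise(text, rate=0.05, rng=None):
    """Apply one random character-level edit to roughly `rate` of the letters."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        # Step 1: randomly choose whether this alphabetic character is altered.
        if not ch.isalpha() or rng.random() >= rate:
            out.append(ch)
            continue
        # Step 2: randomly draw an operator.
        op = rng.choice(OPERATORS)
        # Step 3: randomly choose the replacement character or sequence.
        if op == "removal":
            continue
        if op == "replacement":
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "addition":
            out.append(ch + rng.choice(string.ascii_lowercase))
        else:  # equivalence: keep the character when the table has no entry
            out.append(EQUIVALENTS.get(ch.lower(), ch))
    return "".join(out)


if __name__ == "__main__":
    print(inject_noise("This movie was surprisingly good.", rate=0.1))
```

A low `rate` corresponds to the "light" noise described above; raising it moves toward "strong" noise, where the meaning is no longer preserved.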

  11. Spelling Errors Injection
    The idea is to generate texts containing common misspellings in
    order to train our models, which will thus become more robust to
    this particular type of textual noise.
    The spelling error injection algorithm is based on a list of the most
    common misspellings in English. This list has been compiled by the
    publisher of Oxford Dictionaries [Oxford Dictionaries, 2018].
    Spelling error injection is a semantically invariant transformation.
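A spelling error injector of this kind can be sketched with a dictionary and one regular expression. The four word pairs below are a tiny illustrative sample chosen for this example (an assumption); the original work relied on the Oxford Dictionaries list, which is not reproduced here.

```python
import re

# Tiny illustrative sample (assumption): correct spelling -> common misspelling.
COMMON_MISSPELLINGS = {
    "accommodate": "accomodate",
    "definitely": "definately",
    "receive": "recieve",
    "separate": "seperate",
}

# One alternation over all known words, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, COMMON_MISSPELLINGS)) + r")\b",
    re.IGNORECASE,
)


def inject_spelling_errors(text):
    """Replace every known word in `text` with its common misspelling."""
    def swap(match):
        word = match.group(0)
        wrong = COMMON_MISSPELLINGS[word.lower()]
        # Preserve an initial capital letter, e.g. at the start of a sentence.
        return wrong.capitalize() if word[0].isupper() else wrong
    return _PATTERN.sub(swap, text)


if __name__ == "__main__":
    print(inject_spelling_errors("I will definitely receive it."))
```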

  12. Lexical substitution using a thesaurus
    Lexical substitution consists in proposing words that can replace a
    given word [Zhang & LeCun, 2015]. These words are typically
    synonyms coming from a thesaurus like WordNet [Miller & al, 1990].
    Lexical Replacement Rules of Thumb
    Replacing a word by a real synonym is a semantically invariant transformation.
    Replacing a word by a hyperonym (more general word)
    is a semantically invariant transformation.
    Replacing a word by a hyponym (more specific word)
    is usually not a semantically invariant transformation.
    Replacing a word by an antonym is not a semantically invariant transformation.
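The rules of thumb above can be sketched as a filter on the relation type. The miniature `THESAURUS` below is hypothetical and only stands in for a real resource such as WordNet; the point is that candidates are drawn from synonyms and hyperonyms only, never from hyponyms or antonyms.

```python
import random

# Hypothetical miniature thesaurus (assumption); a real system would query WordNet.
THESAURUS = {
    "film": {"synonym": ["movie"], "hypernym": ["show"], "antonym": []},
    "good": {"synonym": ["fine"], "hypernym": [], "antonym": ["bad"]},
    "dog": {"synonym": [], "hypernym": ["animal"], "hyponym": ["poodle"]},
}
# Only these relations are treated as semantically invariant.
SAFE_RELATIONS = ("synonym", "hypernym")


def substitute(tokens, rng=None):
    """Replace each known token by a randomly chosen safe substitute."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        entry = THESAURUS.get(tok.lower(), {})
        candidates = [w for rel in SAFE_RELATIONS for w in entry.get(rel, [])]
        out.append(rng.choice(candidates) if candidates else tok)
    return out


if __name__ == "__main__":
    print(" ".join(substitute("a good film about a dog".split())))
```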

  13. Paraphrases generation
    Paraphrases generation considered unrealistic and costly
    Therefore, the best way to do data augmentation would have been using human
    rephrases of sentences, but this is unrealistic and expensive due the large volume
    of samples in our datasets. As a result, the most natural choice in data
    augmentation for us is to replace words or phrases with their synonyms.
    [Zhang & LeCun, 2015]
    Definition of the perfect paraphrase
    In addition to being meaning-preserving, an ideal paraphrase must also diverge as
    sharply as possible in form from the original while still sounding natural and fluent.
    [Chen & Dolan, 2011]

  14. Paraphrases generation using regular expressions
    Firstly, prefer surface transformations that can be produced with regular
    expressions because they are simple and very efficient in terms of computation.
    Examples of a text surface transformation
    The transition to a contracted verbal form and its inverse is a semantically
    invariant transformation provided that the ambiguities are preserved.
    I am => I'm, you are => you're, he is => he's, it is => it's, she is => she's, we are =>
    we're, they are => they're, I have => I've, you have => you've, we have => we've, they
    have => they've, he has => he's, it has => it's, she has => she's, I will => I'll, you
    will => you'll, he will => he'll, are not => aren't, is not => isn't, was not => wasn't,
    ..., I'm => I am, I'll => I will, you'll => you will, he'll => he will, aren't => are
    not, isn't => is not, wasn't => was not, weren't => were not, couldn't => could not,
    don't => do not, doesn't => does not, didn't => did not, mustn't => must not, shouldn't
    => should not, can't => can not, can't => cannot, won't => will not, shan't => shall not
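A surface transformation of this kind fits in a few lines of Python. The sketch below uses a subset of the pairs listed above; the function names `contract` and `expand` are made up for the example, and matching is case-sensitive to keep it short.

```python
import re

# A subset of the pairs listed on the slide (full form -> contracted form).
CONTRACTIONS = {
    "I am": "I'm",
    "you are": "you're",
    "we are": "we're",
    "they are": "they're",
    "I have": "I've",
    "I will": "I'll",
    "are not": "aren't",
    "is not": "isn't",
    "was not": "wasn't",
    "do not": "don't",
    "does not": "doesn't",
    "did not": "didn't",
}
# The inverse table is safe here because every contraction has a single full form.
EXPANSIONS = {short: full for full, short in CONTRACTIONS.items()}


def _rewrite(text, table):
    # Build one alternation over the table keys, matched on word boundaries.
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in table) + r")\b")
    return pattern.sub(lambda m: table[m.group(0)], text)


def contract(text):
    return _rewrite(text, CONTRACTIONS)


def expand(text):
    return _rewrite(text, EXPANSIONS)


if __name__ == "__main__":
    print(contract("I am sure it was not late"))
    print(expand("They don't know"))
```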

  15. Paraphrases generation using regex (2)
    The « respect for ambiguity » rule of thumb
    A transformation that creates ambiguity or imprecision is
    often considered semantically invariant.
    A transformation that resolves an ambiguity by specifying information cannot be
    considered a semantically invariant transformation, unless the information specified is
    motivated by the context.
    Examples of transformations to avoid because they resolve
    an ambiguity without justification
    she's => she is
    she's => she has
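One way to apply this rule mechanically is to derive the expansion table from the contraction rules and drop every contraction that has more than one possible full form. The sketch below is an illustration under that assumption; `safe_expansions` is a made-up helper name.

```python
from collections import defaultdict

# Contraction rules (full form, contracted form), taken from the regex slide.
RULES = [
    ("she is", "she's"),
    ("she has", "she's"),
    ("he is", "he's"),
    ("he has", "he's"),
    ("I am", "I'm"),
    ("we are", "we're"),
]


def safe_expansions(rules):
    """Return the contraction -> full form rules that resolve no ambiguity."""
    full_forms = defaultdict(set)
    for full, short in rules:
        full_forms[short].add(full)
    # A contraction with several full forms cannot be expanded without context.
    return {
        short: next(iter(fulls))
        for short, fulls in full_forms.items()
        if len(fulls) == 1
    }


if __name__ == "__main__":
    print(safe_expansions(RULES))
```

Contracting "she is" or "she has" to "she's" remains allowed, since it only creates ambiguity; only the reverse direction is filtered out.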

  16. Paraphrases generation using syntax tree transformations
    Paraphrase generation by transforming syntactic trees is inspired by the work
    of Michel Gagnon, Polytechnique Montréal [Gagnon & Da Sylva, 2005], [Zouaq,
    Gagnon & Ozell, 2010].
    The analysis of a sentence according to a dependency grammar formalism gives a
    tree whose nodes are the words of the sentence and whose edges (links) are the
    syntactic dependencies between the words.
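A dependency tree of this kind can be held in a very small data structure. The sketch below hand-writes a parse of "A man eats an apple in the kitchen" (the labels are an assumption following common dependency conventions, not parser output) and shows how to recover the word group under any node.

```python
from collections import defaultdict

# (index, word, head index, dependency label); head -1 marks the root.
# Hand-written parse for illustration, not the output of a real parser.
PARSE = [
    (0, "A", 1, "det"),
    (1, "man", 2, "nsubj"),
    (2, "eats", -1, "root"),
    (3, "an", 4, "det"),
    (4, "apple", 2, "dobj"),
    (5, "in", 2, "prep"),
    (6, "the", 7, "det"),
    (7, "kitchen", 5, "pobj"),
]


def build_children(parse):
    """Map each head index to the list of its dependents (the tree edges)."""
    children = defaultdict(list)
    for index, _word, head, _label in parse:
        children[head].append(index)
    return children


def subtree(parse, children, index):
    """Return the words of the subtree rooted at `index`, in sentence order."""
    indices = []
    stack = [index]
    while stack:
        node = stack.pop()
        indices.append(node)
        stack.extend(children.get(node, []))
    return [parse[i][1] for i in sorted(indices)]


if __name__ == "__main__":
    tree = build_children(PARSE)
    print(subtree(PARSE, tree, 1))  # the subject group
    print(subtree(PARSE, tree, 5))  # the prepositional group
```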

  17. Paraphrases generation using syntax tree transform. (2)
    Each sentence is submitted to the SyntaxNet analyzer [Petrov, 2016],[Kong et al,
    2017] which is based on deep learning techniques and the TensorFlow library. The
    code is executed in Google's cloud infrastructure via the Cloud Natural Language
    API [Google, 2018a].
    Paraphrases generator based on syntax tree transformations

  18. Paraphrases generation using syntax tree transform. (3)
    The transformation rules have been built manually according to the paraphrase
    typology of [Vila, Martí & Rodríguez, 2014], following the 80/20 Pareto
    principle. Marie Bourdon, computational linguist at Coginov Montréal, provided
    valuable assistance to this work.
    Examples of semantically invariant syntactic tree transformations
    The transition from the passive verb form to the active verb form
    and vice versa is a semantically invariant transformation.
    The replacement of a noun or a nominal group by a pronoun
    is a semantically invariant transformation.
    The removal of an adjective, an adverb, an adjectival group
    or an adverbial group is a semantically invariant transformation.

  19. Paraphrases generation using syntax tree transform. (4)
    Diagram of the transformation from the active to the passive voice
    Diagram drawn with the help of spaCy [Honnibal & Montani, 2017]
    Transformation from the active voice to the passive voice of the phrase "A man eats an apple in the kitchen." The head of the dependency
    structure is the verb "eat". The transformation rule starts by exchanging the "man" subject group (in red) and the "apple" object group (in blue). Then
    the "eat" verb is modified (in green) to give a new dependency structure which once flattened generates the phrase "An apple is eaten by a man in
    the kitchen."
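The rule described above can be sketched on a hand-written parse of the example sentence. This is only an illustration of the idea, not the rule engine used in the work: the parse and its labels are written by hand, the `PARTICIPLES` table is a hypothetical one-entry stand-in for real morphology, and the code assumes a singular object, the present tense and a subject that may be lower-cased.

```python
# Hand-written parse of the example: (word, head index, label); -1 is the root.
PARSE = [
    ("A", 1, "det"), ("man", 2, "nsubj"), ("eats", -1, "root"),
    ("an", 4, "det"), ("apple", 2, "dobj"), ("in", 2, "prep"),
    ("the", 7, "det"), ("kitchen", 5, "pobj"),
]
# Hypothetical one-entry table: present-tense form -> past participle.
PARTICIPLES = {"eats": "eaten"}


def span(parse, root):
    """Indices of the subtree rooted at `root`, in sentence order."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for i, (_word, head, _label) in enumerate(parse):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)


def to_passive(parse):
    verb = next(i for i, t in enumerate(parse) if t[2] == "root")
    subj = next(i for i, t in enumerate(parse) if t[1] == verb and t[2] == "nsubj")
    obj = next(i for i, t in enumerate(parse) if t[1] == verb and t[2] == "dobj")
    subj_words = [parse[i][0] for i in span(parse, subj)]
    obj_words = [parse[i][0] for i in span(parse, obj)]
    used = set(span(parse, subj)) | set(span(parse, obj)) | {verb}
    rest = [parse[i][0] for i in range(len(parse)) if i not in used]
    # Exchange the subject and object groups, then rewrite the verb.
    words = obj_words + ["is", PARTICIPLES[parse[verb][0]], "by"] + subj_words + rest
    # Flatten: fix capitalisation of the new first word and the old one.
    words[0] = words[0].capitalize()
    first_subj = len(obj_words) + 3
    words[first_subj] = words[first_subj].lower()
    return " ".join(words) + "."


if __name__ == "__main__":
    print(to_passive(PARSE))
```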

  20. Paraphrases generation using back-translation
    The idea of paraphrases generation using back-translation was suggested
    during a hallway discussion by my old friend, the physicist Antoine Saucier,
    at Polytechnique Montréal at the beginning of 2015.
    Back-translation is an old trick used to test the quality of machine translation, which
    consists in translating an already translated text back to the original language. In
    fact, for a given sentence there are generally a lot of equivalent correct translations
    because of the huge combinatorial productivity of natural language. By
    definition, all these equivalent translations are paraphrases.
    Historically, the first mention of the use of back-translation to introduce variants
    into textual data, under the term "round trip machine translation", can be found in
    an article by a team from King's College London presented at the ISCOL 2015
    conference [Lau, Clark & Lappin, 2015].

  21. Paraphrases generation using back-translation (2)

  22. Paraphrases generation using back-translation (3)
    To measure similarity, we opted for a simple difference in length between the
    back-translation and the original text [Wieting, Mallinson & Gimpel, 2017], the BLEU
    metric and a logistic model trained on manually labelled data.
    Google's online translation service was used via the Google Translate API. All the
    required code is contained in a small IPython notebook.
    Text data augmentation using back-translation
    Good quality back-translation is a semantically invariant transformation.
    Poor quality back-translation is not a semantically invariant transformation.
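The back-translation loop with a length-based filter can be sketched as below. The translation call is passed in as a plain function so the example runs offline: `fake_translate` is a stand-in written for this sketch, not a real client, and would be replaced by a call to an online translation service. The length ratio and the `min_similarity` threshold of 0.7 are assumptions chosen for illustration; the BLEU metric and the logistic model mentioned above are not shown.

```python
def back_translate(text, translate, pivot="fr", source="en"):
    """Round-trip `text` through a pivot language using the supplied callable."""
    return translate(translate(text, source, pivot), pivot, source)


def length_similarity(original, candidate):
    """Ratio of the shorter length to the longer one, between 0 and 1."""
    longest = max(len(original), len(candidate))
    return 1.0 if longest == 0 else min(len(original), len(candidate)) / longest


def augment(texts, translate, min_similarity=0.7):
    """Keep back-translations that differ from the source but stay close in length."""
    kept = []
    for text in texts:
        paraphrase = back_translate(text, translate)
        if paraphrase != text and length_similarity(text, paraphrase) >= min_similarity:
            kept.append(paraphrase)
    return kept


def fake_translate(text, source, target):
    """Stand-in translator (assumption) so the sketch runs without a network."""
    table = {
        ("en", "fr", "The film was great."): "Le film était génial.",
        ("fr", "en", "Le film était génial."): "The movie was awesome.",
    }
    return table.get((source, target, text), text)


if __name__ == "__main__":
    print(augment(["The film was great."], fake_translate))
```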

  23. Experiment - Task and dataset selection
    To validate and compare the different TDA techniques, we chose a simple problem
    that involves a standardized dataset and common deep neural network
    architectures. This will make it easier to isolate the effect of amplifying text data.
    We opted for the task of predicting the positive or negative polarity of movie
    reviews contained in the IMDB database [Pang, Lee & Vaithyanathan, 2002].
    For this task, performance ranges from 70% for some traditional learning algorithms
    to over 90% for fine-tuned deep neural networks.

  24. Experiment - Setup (2)
    The experiment is divided into two phases: 1) a pre-processing phase where the text
    data augmentation (TDA) is performed using different techniques, and 2) the training
    phase of different deep neural network architectures on the original data and the
    amplified data.
    The models are then compared to see if there is any improvement or degradation in
    the performance of the models in terms of prediction (accuracy). Training error
    and F1 score were also measured.
    Given scarce computing resources, we have limited our experiments to proof of
    concept only. Much work remains to be done to explore each method of text
    augmentation in detail.
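For reference, the two evaluation measures named above can be computed as follows for a binary polarity task. This is a plain-Python sketch written for this transcript; the label convention (1 for positive, 0 for negative) is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [1, 0, 1, 1, 0, 0]
    preds = [1, 0, 0, 1, 1, 0]
    print(accuracy(truth, preds), f1_score(truth, preds))
```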

  25. Experiment - TDA Preprocessing phase (3)

  26. Experiment - Training deep models (4)
    Multilayer Perceptron (MLP)
    Convolutional Network 1D
    LSTM
    BiLSTM

  27. Results - Multilayer Perceptron

  28. Results - CNN 1D

  29. Results - LSTM

  30. Results - BiLSTM

  31. Discussion - Observations
    All the proposed TDA techniques have increased the accuracy of the results on a
    standard task of predicting the polarity of movie reviews for different neural
    network architectures.
    An important observation about deep neural networks is the great variability of the
    results they gave. The same network trained with the same data can give quite
    different results from one training to another.
    This randomness of neural networks is necessary for their proper functioning. Deep
    learning systems in production base their results on ensembles of models to provide more
    consistent predictions.
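Combining several trained models can be as simple as a majority vote over their predicted labels. The sketch below illustrates the general idea only; it is not taken from the talk, and an odd number of models is assumed so that binary votes cannot tie.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per example.

    `predictions` holds one list of labels per model, all of the same length.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


if __name__ == "__main__":
    # Three runs of the same network disagree on some examples.
    runs = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
    print(majority_vote(runs))
```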

  32. Discussion - Advantages (2)
    The main advantages of the proposed text data augmentation techniques are
    practical ones, from a software engineering point of view.
    Leveraging the online NLP services of established suppliers offers many concrete
    and immediate benefits: availability, robustness, reliability, scalability. In addition,
    there are inexpensive, ready-to-use solutions available in a large number of
    languages.
    These TDA techniques are also easy to implement and use. Often a few lines of code
    are enough to call an online service and get the results.

  33. Discussion - Drawbacks (3)
    The main drawback of some of the proposed text data augmentation (TDA)
    techniques remains the amount of processing and their massive nature, which
    requires the use of cloud computing infrastructure.
    In general, TDA methods based on syntactic tree transformation and
    back-translation degrade with the length of the sentences processed.
    Augmented data may mask or dilute some of the weak signals present in the
    original data.
    Some of the proposed TDA techniques rely on commercial NLP online services such
    as translation and parsing services that come from private providers.

  34. Discussion - Comparison with other approaches (4)
    The proposed TDA techniques have the advantage over emerging approaches, like
    Textual GAN, that they do not seek to generate sentences from scratch, but rather
    to leverage existing already meaningful sentences by using semantically invariant
    transformations.
    Other techniques are difficult to implement and require mastery of the «cooking» of
    neural networks. They also involve very long and tedious calculations.
    Finally, the TDA techniques proposed in this study are similar to those already used
    in computer vision, which consist in amplifying the data by invariant
    transformations at the pre-processing stage.

  35. Discussion - Limits of this work
    The main limit and criticism of this work is that the experiment was carried out on
    only a single task, and a very simple one at that.
    To get a better idea of the effective contribution of the proposed TDA techniques,
    we would need to repeat our experiments with other datasets and for other tasks
    by varying the parameters.
    Much work remains to be done to explore each TDA technique in more detail, to
    vary the parameters and to combine them. But this would require a lot of
    experimentation and significant computing resources.

  36. Conclusion
    This empirical work, conducted with limited computer resources, has shown
    different text data augmentation (TDA) techniques that are simple, practical and
    easy to implement.
    Much work remains to be done to explore each TDA technique in more detail, to
    vary the amplification parameters and to combine them. The continuation of this
    work will require access to a computing infrastructure equipped with graphics
    processors.
    Finally, dissemination work is needed to introduce these techniques to practitioners,
    engineers and researchers who are seeking concrete and practical solutions to
    overcome the « Big data wall » in the application of deep learning to NLP.
    Paper on arXiv: https://arxiv.org/abs/1812.04718

  37. Acknowledgements
    I would like to take this opportunity to thank TÉLUQ and UQAM, where I have
    benefited from great freedom and flexibility in the “Doctorat en informatique
    cognitive” program.
    A special thank you to my research director, Mr. Gilbert Paquette (AI and Education),
    and my two co-directors, Ms. Neila Mezghani (Machine Learning) and Mr. Michel
    Gagnon (NLP) from Polytechnique Montréal.
    A nod to Antoine Saucier, Polytechnique Montréal, and Marie Bourdon,
    computational linguist at Coginov Montréal.
    Thanks also to MITACS and Coginov for my internship.
    Claude COULOMBE
    data scientist / consultant
    PhD candidate
