Introduction to NLP in Machine Learning: Word2Vec and FastText

Kajal Puri
October 26, 2018


  1. DIY guide to NLP in ML
     Kajal Puri (@Agirlhasnofame, @kajal-puri, kajal-puri.github.io, @kajalp, @kajalpuri, @KajalP)
     PyCon DE, 26th October 2018
  2. Finding a midway
     • Programming is one way to instruct machines; NLP is another, more recent technique for working with human language.
  3. Natural Language Processing
     • NLP is a way for machines to interpret human language.
     • NLP originated in the 1960s, but due to the lack of digital data and fast computation it only started producing good results in the early 2000s.
  4. Step I : Break the “Paragraph”
     • Separate each sentence from the paragraph (see the sketch below).
     • Generally, a sentence is broken at a punctuation mark such as a full stop.
     • More sophisticated techniques are available in case a document isn’t clearly structured and formatted.
     Note : If you want to explore all the *existing* possibilities for text pre-processing and every model, look into NLTK (Python); but if you’re a developer looking for state-of-the-art results, use spaCy. It implements the best algorithms/models, a ready-to-eat recipe for most tasks.
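     A minimal sentence-splitting sketch with NLTK (the paragraph is a made-up example; spaCy's doc.sents would do the same job):

     import nltk
     from nltk.tokenize import sent_tokenize

     nltk.download('punkt')  # one-time download of the sentence-splitting model

     paragraph = ("NLP is a way for machines to interpret human language. "
                  "It starts by splitting a paragraph into sentences. "
                  "Each sentence is then processed on its own.")
     print(sent_tokenize(paragraph))
     # ['NLP is a way for machines to interpret human language.',
     #  'It starts by splitting a paragraph into sentences.',
     #  'Each sentence is then processed on its own.']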
  5. Step II : Break the “Sentence”
     • Time to break the sentence into “words” - word tokenization.
     • E.g. “I love PyCon DE community” becomes [“I”, “love”, “PyCon”, “DE”, “community”].
     • Skip/remove the punctuation.
     • In code, words are usually split on whitespace (see the runnable sketch below).
     • Easy to do in languages like English, German, French etc.
     • Python -
       from nltk.tokenize import sent_tokenize, word_tokenize
       data = "All work and no play makes jack a dull boy. All work and no play makes jack a dull boy."
       words = word_tokenize(data)
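     The tokenization snippet above, completed into a runnable form (the printed output is indicative, and the punctuation filter matches the "skip punctuation" bullet):

     import nltk
     from nltk.tokenize import word_tokenize

     nltk.download('punkt')  # one-time download of the tokenizer model

     data = "All work and no play makes jack a dull boy."
     words = word_tokenize(data)
     print(words)
     # ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']

     # drop punctuation tokens, as suggested on the slide
     words = [w for w in words if w.isalpha()]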
  6. Step III : “Parts of Speech”
     • Each word token has to be categorised as “noun”, “verb”, “adjective” etc.
     • Use a pre-trained part-of-speech classification model - “POS tagging” (a runnable alternative is sketched below).
     Quick Note : This classification model has been trained on millions of tagged English documents, so the chances are the predictions will be mostly on point.
     • The output will look like this :
       I (pronoun) love (verb) PyCon (proper noun) DE (proper noun) community (noun)
     • Python -
       from nltk.tag import CRFTagger
       ct = CRFTagger()
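     The CRFTagger above needs a trained model file before it can tag anything; a minimal runnable alternative uses NLTK's bundled averaged-perceptron tagger (tags shown are indicative):

     import nltk
     from nltk.tokenize import word_tokenize

     nltk.download('punkt')
     nltk.download('averaged_perceptron_tagger')  # pre-trained English POS model

     tokens = word_tokenize("I love PyCon DE community")
     print(nltk.pos_tag(tokens))
     # [('I', 'PRP'), ('love', 'VBP'), ('PyCon', 'NNP'), ('DE', 'NNP'), ('community', 'NN')]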
  7. Step IV : Token Stemming/Lemmatization
     What?
     • The process of getting a word back to its original form : “coder”, “coding” can be simplified to “code”.
     Why?
     • For a machine, “coder”, “coders”, “coding”, “code” are all different words, each with its own meaning, which might not be the best interpretation.
     How?
     • A look-up table of word lemmas based on their part-of-speech form and some custom rules of English (see the sketch below).
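     A minimal lemmatization sketch with NLTK's WordNet lemmatizer; note that the part of speech has to be supplied, which is why POS tagging comes first (the outputs in the comments are what I would expect, not guaranteed):

     import nltk
     from nltk.stem import WordNetLemmatizer

     nltk.download('wordnet')  # the lemma look-up table

     lemmatizer = WordNetLemmatizer()
     print(lemmatizer.lemmatize("coders", pos="n"))   # expected: 'coder'
     print(lemmatizer.lemmatize("coding", pos="v"))   # expected: 'code'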
  8. Step V : Remove Stop Words
     What?
     • “Stop words” are words that don’t add real value to a sentence but are used because of grammatical rules or to keep a sentence structured.
     Why?
     • They don’t help in statistical analysis and add redundancy, ambiguity and noise.
     How?
     • Python library (see the sketch below) -
       from nltk.corpus import stopwords
       stopWords = set(stopwords.words('english'))
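     A minimal stop-word-removal sketch, combining the tokenizer with the stop-word list above (output is indicative):

     import nltk
     from nltk.corpus import stopwords
     from nltk.tokenize import word_tokenize

     nltk.download('punkt')
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tokens = word_tokenize("All work and no play makes jack a dull boy")
     filtered = [w for w in tokens if w.lower() not in stop_words]
     print(filtered)
     # ['work', 'play', 'makes', 'jack', 'dull', 'boy']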
  9. Step VI : Parsing
     • Simplify the words by constructing a tree structure, keeping the most important verb as the root and the rest of the words as its leaves.
     • Predict the type of relationship that exists between the root and each particular word (see the sketch below).
     • An active area of research that is constantly changing.
     • Since 2015, Google has released a new parser each year, beating the accuracy of its previous version.
     • English sentences are difficult to parse because of context.
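     A minimal dependency-parsing sketch with spaCy (the small English model en_core_web_sm is an assumption; any English model exposes the same attributes):

     import spacy

     nlp = spacy.load('en_core_web_sm')
     doc = nlp("I love PyCon DE community")

     for token in doc:
         # each token points to its syntactic head and the type of relationship
         print(token.text, token.dep_, token.head.text)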
  10. Step VII : Named Entity Recognition
     • Map filtered tokens to real-life entities : “Germany” should be mapped to a geographical location or a country.
     • NER systems can typically recognise things like : people’s names, product names, dates and times, place names, amounts of money etc. (see the sketch below).
     • NER takes help from context as well.
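     A minimal NER sketch with spaCy (the sentence and model name are placeholders; which labels come out depends on the model):

     import spacy

     nlp = spacy.load('en_core_web_sm')
     doc = nlp("Kajal gave a talk at PyCon DE in Germany on 26th October 2018.")

     for ent in doc.ents:
         # entity text with its predicted type, e.g. PERSON, GPE (place), DATE
         print(ent.text, ent.label_)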
  11. Final Step
     “My name is Kajal. I love PyCon DE community.” But wait, who is “I”?
     Humans : Well, easy-peasy, “I” is Kajal.
     Machine : Uuumm.. I don’t know..??
     • The last step is to map pronouns like I, he, she, we, them to their original named entity - coreference resolution (a hedged sketch follows).
     • Combine parse trees and NER.
     • Complicated.
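     One hedged sketch of this step, using the third-party neuralcoref extension for spaCy 2.x; the library is my example, not something shown in the talk, and it may or may not resolve this particular pronoun:

     import spacy
     import neuralcoref

     nlp = spacy.load('en_core_web_lg')
     neuralcoref.add_to_pipe(nlp)  # add a coreference-resolution component to the pipeline

     doc = nlp("My name is Kajal. I love PyCon DE community.")
     print(doc._.coref_clusters)   # clusters of mentions that refer to the same entity
     print(doc._.coref_resolved)   # the text with pronouns replaced by their antecedents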
  12. One piece of code to rule them all
     import spacy  # Load the NLP library

     nlp = spacy.load('en_core_web_lg')  # Load the English NLP model

     def replace_name_with_placeholder(token):
         if token.ent_iob != 0 and token.ent_type_ == "PERSON":
             return "[REDACTED] "
         else:
             return token.string

     def scrub(text):
         doc = nlp(text)
         # Loop through all the entities in the document and merge multi-token
         # entities, so that each name becomes a single token
         for ent in doc.ents:
             ent.merge()
         tokens = map(replace_name_with_placeholder, doc)
         return "".join(tokens)

     s = """In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky’s Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures. """

     print(scrub(s))
     # The output is the original text with person names replaced by "[REDACTED]"

     Code is inspired by this Medium blog.
  13. Word2Vec vs GloVe
     • GloVe - Global Vectors for Word Representation.
     • Word2Vec focuses on the co-occurrence of words within a context window, whereas GloVe captures the overall count statistics, i.e. the frequency of appearances across the corpus.
     • Word2Vec uses a densely connected neural network to which the words are fed as input and which then generates their word vectors. GloVe constructs a co-occurrence matrix of the full corpus and then uses dimensionality reduction to obtain a matrix of word vectors.
  14. Word2Vec vs GloVe
     • Word2Vec can find syntactic relationships like “better”-“good” or “bad”-“worse”, whereas GloVe can’t generate that linear relationship (see the sketch below).
     • Word2Vec tends to give more accurate results.
     • To reduce computation, Word2Vec employs negative sampling with a sigmoid function on the real data.
     • GloVe :
       from glove import Corpus, Glove
       glove = Glove(no_components=100, learning_rate=0.05)
     • Word2Vec :
       import gensim
       model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
       model.train(documents, total_examples=len(documents), epochs=10)
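     A hedged sketch of that linear relationship with gensim's similarity API; documents is assumed to be a tokenized corpus, and whether "worse" actually comes out on top depends entirely on the training data:

     import gensim

     # documents: a list of tokenized sentences, as on the slide
     model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)

     # "good" is to "better" as "bad" is to ... ?
     print(model.wv.most_similar(positive=['better', 'bad'], negative=['good'], topn=3))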
  15. CBOW vs Skip-gram
     • CBOW predicts a word with the help of its context, whereas the skip-gram model predicts the context from the word.
     • Skip-gram takes a long time to process, whereas CBOW is faster because it averages the whole context into a single training example.
     • For example :
       I had a wonderful time at the …… -- CBOW
       I had a ……………………….. time at the event -- Skip-gram
     • Skip-gram works well for predicting rare and unique words, whereas CBOW works well for frequent words in the text (see the sketch below).
     Disclaimer : All the predictions I have talked about are probabilistic; the machine assigns a probability to each word, and the word with the maximum probability is the one predicted by the code. Computers *still* don’t understand human language. All of this is statistical analysis, pattern matching and machine/deep learning, which is again just *curve fitting*. There, I said it.
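     In gensim both architectures come out of the same class; the sg flag switches between them (a minimal sketch, with sentences assumed to be a tokenized corpus):

     from gensim.models import Word2Vec

     # sg=0 -> CBOW (the default), sg=1 -> skip-gram
     # (in gensim 4.x the `size` argument is called `vector_size`)
     cbow_model = Word2Vec(sentences, size=150, window=5, min_count=2, sg=0)
     skipgram_model = Word2Vec(sentences, size=150, window=5, min_count=2, sg=1)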
  16. FastText vs Word2Vec
     • FastText uses character n-grams, while Word2Vec treats each word as the smallest, indivisible entity.
     • It learns embeddings/vectors of sub-words within words.
     • Each token/word is expressed as the sum/average of its n-gram components, e.g.
       Kingdom = [“k”, “ki”, “kin”, “king”, “kingd”, “kingdo”, “kingdom”]
     • Word vectors therefore contain extra information about their sub-words.
     • Increased initial computation time during training.
     • More accurate.
     • Beneficial for morphologically rich languages.
     • Generates better word embeddings for different/rare words.
     • Can generate word embeddings for out-of-vocabulary words, whereas Word2Vec and GloVe can’t (see the sketch below).
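     A minimal gensim FastText sketch of the out-of-vocabulary behaviour described above (sentences is an assumed tokenized corpus):

     from gensim.models import FastText

     model = FastText(sentences, size=100, window=5, min_count=2, min_n=3, max_n=6)

     # Unlike Word2Vec, FastText can still build a vector for an unseen word
     # by combining the vectors of its character n-grams.
     vector = model.wv['kingdoms']   # works even if 'kingdoms' never appeared in training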
  17. Hyperparameter Tuning for FastText
     The following arguments are mandatory:
       -input               training file path
       -output              output file path
     The following arguments are optional:
       -verbose             verbosity level [2]
       -lr                  learning rate [0.1]
       -lrUpdateRate        change the rate of updates for the learning rate [100]
       -dim                 size of word vectors [100]
       -ws                  size of the context window [5]
       -epoch               number of epochs [5]
       -neg                 number of negatives sampled [5]
       -loss                loss function {ns, hs, softmax} [softmax]
       -thread              number of threads [12]
       -pretrainedVectors   pretrained word vectors for supervised learning []
       -saveOutput          whether output params should be saved [0]
     (A Python training call using these hyperparameters is sketched below.)
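     Roughly the same hyperparameters passed through the official fastText Python bindings (a hedged sketch; data.txt and the chosen values are placeholders):

     import fasttext

     # unsupervised skip-gram training, mirroring the CLI flags listed above
     model = fasttext.train_unsupervised('data.txt', model='skipgram',
                                         lr=0.1, dim=100, ws=5, epoch=5,
                                         neg=5, loss='ns', thread=12)
     model.save_model('model.bin')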
  18. How to train FastText
     • Training takes more time than Word2Vec, but it also depends on the n-gram size. If the minimum n-gram length is very small, chances are training will be slow, but if the size is large/average then it’ll be pretty fast.
     • There is a trade-off between generating accurate word vectors for millions of unique words and the minimum size of the n-gram (see the sketch below). Say we want accurate word vectors for 50 million unique words: with a minimum n-gram length of 5, the model will not even fit in 256 GB of RAM. So we need to increase it to min = 10 or min = 15, depending on the available memory.
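     The same trade-off expressed with gensim's FastText n-gram parameters (a sketch; sentences and the exact memory behaviour are assumptions):

     from gensim.models import FastText

     # A short minimum n-gram generates many more sub-words per token: better coverage
     # of rare words, but far more memory and training time.
     many_subwords = FastText(sentences, size=100, min_n=3, max_n=6)

     # Raising the minimum n-gram length cuts the number of sub-words, trading some
     # quality on rare words for a model that fits in the available memory.
     fewer_subwords = FastText(sentences, size=100, min_n=10, max_n=15)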
  19. References
     • FastText official documentation tutorial
     • NLTK Tutorial
     • NLP Pipeline Tutorial with Code
     • Sebastian Ruder’s blog
     • CS224D : Deep Learning for NLP
     • Fast.ai NLP blogs/forums
     • WildML NLP Blog
     • Subscribe to NLP Newsletter - Medium
     • NLP PyTorch
     • NLP TensorFlow
     • Quora NLP