Manuel Ebert - Putting 1 million new words into the dictionary

2015 was the year of spocking, amabots, dadbods, and smol. Like half of all English words used every day, these words are not in the dictionary. Until we put them there. In this talk, I’ll describe how we found definitions for 1 million words that were missing from dictionaries, what it takes to do natural language processing at that scale, and how to be the least popular Scrabble winner.

https://us.pycon.org/2016/schedule/presentation/2049/

PyCon 2016

May 29, 2016

Transcript

  1. 1 @maebert #pycon2016

  2. + @maebert #pycon2016

  3. NERD TRIVIA Round 1 @maebert #pycon2016

  4. Proposed withdrawal of Greece from the eurozone: GREXIT

  5. Hackers who implant electronics into their bodies: GRINDERS

  6. Gender-neutral form of Latina & Latino: LATINX

  7. Small fat deposits on otherwise athletically built males: DAD BODS

  8. Combination of blanket and scarf: BLARF

  9. I don’t know why. It’s a perfectly cromulent word. @maebert

    #pycon2016
  10. @maebert #pycon2016

  11. None
  12. Says UrbanDictionary.com: @maebert #pycon2016

  13. Free Range Definitions

    “Her gown was of white satin worked with gold, and had long open pendent sleeves, while from her slender and marble neck hung a cordeliere — a species of necklace imitated from the cord worn by Franciscan friars, and formed of crimson silk twisted with threads of Venetian gold.” (W. H. Ainsworth, Windsor Castle)
  14. How many words are missing? @maebert #pycon2016

  15. How many words are missing?

    One million words
  16. The Plan: What could possibly go wrong?

  17. Pipeline diagram

    Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD (free range definition) → Save to DB. Total: 12 years 8 months 6 days 15 hours
  18. NERD TRIVIA Round 2 @maebert #pycon2016

  19. @maebert #pycon2016

  20. S3 Elasticsearch EC2 Lambda @maebert #pycon2016

  21. Architecture diagram

    Local box: Missing Words, Preprocess. S3: new file. Lambda: Search, Parse HTML, Detect Lang, Detect FRDs. ES: Index.
  22. Pipeline diagram (Preprocessing)

    Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB
  23. adeiladu bialya conquistar DL 3 Aminoisobutyric eu oi oa ou

    frappul galen's bondage etymologies h’ors d’oeuvres i collect bizarre animal confectionery dioramas janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says the dogs of slavery, misgovernment, and ostracism okcupid uranium-thorium dating outsnark vésigniéite woggle xenofeminism yebo zyxnoid
  25. adeiladu bialya conquistar eu oi oa ou frappul galen's bondage

    etymologies h’ors d’oeuvres i collect bizarre animal confectionery dioramas janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says the dogs of slavery, misgovernment, and ostracism okcupid uranium-thorium dating outsnark vésigniéite woggle xenofeminism yebo zyxnoid DL 3 Aminoisobutyric
  26. adeiladu bialya DL 3 Aminoisobutyric eu oi oa ou frappul

    galen's bondage etymologies h’ors d’oeuvres i collect bizarre animal confectionery dioramas janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says the dogs of slavery, misgovernment, and ostracism okcupid uranium-thorium dating outsnark vésigniéite woggle xenofeminism yebo zyxnoid conquistar
  28. adeiladu bialya conquistar DL 3 Aminoisobutyric eu oi oa ou

    frappul galen's bondage etymologies h’ors d’oeuvres janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says i collect bizarre animal confectionery dioramas the dogs of slavery, misgovernment, and ostracism okcupid uranium thorium dating outsnark vésigniéite woggle zyxnoid

    def valid_term(term):
        words = term.split()
        exclusion_rules = (
            any(len(word) > 15 for word in words),
            len(words) > 5,
            any(c in term for c in "█,!?:1234567890"),
            sum(ord(c) > 255 for c in term) > 2,
            all(len(word) < 3 for word in words)
        )
        return not any(exclusion_rules)
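To see which rule removes which term from the list on slide 28, the same five checks can be given names. This is a minimal sketch: the function `failed_rules` and the rule labels are illustrative and do not come from the talk; only the checks themselves are taken from `valid_term`.

```python
def failed_rules(term):
    """Return the names of the exclusion rules that a term trips.

    The checks mirror valid_term from slide 28; the labels are
    hypothetical and only there to make the output readable.
    """
    words = term.split()
    rules = {
        "word longer than 15 characters": any(len(w) > 15 for w in words),
        "more than 5 words": len(words) > 5,
        "punctuation or digit": any(c in term for c in "█,!?:1234567890"),
        "more than 2 characters above U+00FF": sum(ord(c) > 255 for c in term) > 2,
        "only words shorter than 3 characters": all(len(w) < 3 for w in words),
    }
    return [name for name, tripped in rules.items() if tripped]


for term in ["janky", "eu oi oa ou", "DL 3 Aminoisobutyric",
             "i collect bizarre animal confectionery dioramas"]:
    print(term, "->", failed_rules(term) or "kept")
```

A term is kept exactly when the returned list is empty, which matches `valid_term` returning `True`.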
  29. @maebert #pycon2016

  30. Pipeline diagram (Detect Language)

    Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB
  31. Detecting a language

    from collections import defaultdict

    def trigram_freq(text):
        trigrams = [text[k:k+3] for k in range(len(text)-2)]
        freq = defaultdict(float)
        for trigram in trigrams:
            freq[trigram] += 1.0 / len(trigrams)
        return freq
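A small worked example may help show what `trigram_freq` returns. The input word "banana" is an arbitrary choice for illustration, not taken from the talk. It has four overlapping trigrams (ban, ana, nan, ana), so "ana" accounts for half of them.

```python
from collections import defaultdict


def trigram_freq(text):
    # Every run of three consecutive characters, overlapping.
    trigrams = [text[k:k + 3] for k in range(len(text) - 2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        # Each occurrence adds an equal share, so the values sum to 1.
        freq[trigram] += 1.0 / len(trigrams)
    return freq


freq = trigram_freq("banana")
print(dict(freq))  # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
```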
  32. Detecting a language

    languages = {
        "english": trigram_freq("a quick brown fox jumps…"),
        "italian": trigram_freq("ma la volpe col suo balzo…"),
        "klingon": trigram_freq("SoH 'ej SenwI' rIlwI' je …")
    }

    def detect_language(text):
        scores = defaultdict(float)
        for trigram, text_freq in trigram_freq(text).items():
            for lang, lang_freq in languages.items():
                scores[lang] += lang_freq[trigram] * text_freq
        return max(scores, key=scores.get)
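The slide elides the reference texts, so the following is only a sketch of the same scoring idea with two tiny made-up samples. The `SAMPLES` sentences, the `PROFILES` name, and the use of `Counter` and `dict.get` (instead of `defaultdict`) are assumptions for illustration; real profiles would be built from much larger texts.

```python
from collections import Counter


def trigram_freq(text):
    # Relative frequency of each overlapping three-character run.
    counts = Counter(text[k:k + 3] for k in range(len(text) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()}


# Hypothetical reference samples; far too short for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the sun",
    "italian": "ma la volpe col suo balzo ha raggiunto il quieto fido "
               "e il gatto dorme al sole",
}
PROFILES = {lang: trigram_freq(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    # Score = dot product of the text's trigram frequencies with
    # each language profile; the highest score wins.
    text_freq = trigram_freq(text)
    scores = {
        lang: sum(profile.get(tri, 0.0) * f for tri, f in text_freq.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)


print(detect_language("the fox sleeps over the dog"))
print(detect_language("il gatto dorme"))
```

Using `profile.get(tri, 0.0)` avoids adding unseen trigrams to the profiles as a side effect of looking them up.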
  33. Detecting a language (ghetto remix)

    stopwords = """all just being over both through its before herself had should to only under ours has do them his very they not now him nor did""".split()

    def is_english(text):
        words = text.split()
        if not words:
            # Empty text has no stopwords; this also avoids dividing by zero.
            return False
        n_stopwords = sum(w in stopwords for w in words)
        return float(n_stopwords) / len(words) >= 0.12
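A short usage sketch of the stopword heuristic from slide 33. The stopword list and the 0.12 threshold are from the slide; lowercasing the input, the `stopword_ratio` helper, and the test sentences are assumptions added here so that capitalised words still match.

```python
STOPWORDS = set("""all just being over both through its before herself had
should to only under ours has do them his very they not now him nor
did""".split())


def stopword_ratio(text):
    # Share of whitespace-separated words that are English stopwords.
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def is_english(text, threshold=0.12):
    return stopword_ratio(text) >= threshold


print(is_english("They did not do it before him"))  # True
print(is_english("ma la volpe col suo balzo"))      # False
```

The heuristic needs running text to work with: a single rare word contains no stopwords and would be rejected, so it fits whole passages rather than isolated terms.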
  34. @maebert #pycon2016

  35. Pipeline diagram (Detect FRD)

    Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB
  36. Text classification

    Romney is trying to prevent a stampede to Trump of Vichy Republicans, collaborationists coming to terms with the occupation of their party.

    vs.

    There are a lot of elected Vichy Republicans who don't know how to do anything but lose, or kowtow to an authority figure.
  37. What people think machine learning is about

    “Big data” 10%, alluring brain-inspired algorithms 90%

    What machine learning is actually about

    Choosing the right algorithm 5%, picking the right features 25%, having clean, strong data 70%
  38. Text classification (diagram)

    Text → Vectoriser → (.32, .14, .78). Training set: (.32, .14, .78, 1), (.2, .72, .03, 0), (.44, .05, .91, 1) → Classification model. New vector: (.15, .7, .65, ?) → .91
  39. Text Classification in Python

    training_data = [
        ("Horror vacui is a latin expression that means 'fear of emptiness'", 1),
        ("She put the cordeliere down next to a cap of black velvet faced with white satin", 0),
        ("Abusing anyone, I was told, violated Islamic tenets against zulm, or cruelty.", 1)
    ]
    sentences, classes = zip(*training_data)
  40. Text Classification in Python

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    tfidf = TfidfVectorizer()
    vectors = tfidf.fit_transform(sentences)
    classifier = MultinomialNB()
    classifier.fit(vectors, classes)

    # Predict something
    s = "A rose is a rose"
    vectorised_s = tfidf.transform([s])
    classifier.predict(vectorised_s)
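To make the classifier step less of a black box, here is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing, trained on the three sentences from slide 39. It is not scikit-learn's implementation and it is not the talk's code: it works on raw word counts rather than TF-IDF weights, and the helper names `tokenize`, `train`, and `predict` are made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    class_docs = Counter()              # documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    for text, label in examples:
        class_docs[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def predict(model, text):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word in vocab:  # unseen words are ignored
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training_data = [
    ("Horror vacui is a latin expression that means 'fear of emptiness'", 1),
    ("She put the cordeliere down next to a cap of black velvet "
     "faced with white satin", 0),
    ("Abusing anyone, I was told, violated Islamic tenets against zulm, "
     "or cruelty.", 1),
]
model = train(training_data)
print(predict(model, "Zulm is a latin expression"))  # 1
print(predict(model, "a cap of black velvet"))       # 0
```

With only three training sentences the predictions merely reflect word overlap, which is in line with slide 37's point that clean, strong data matters more than the algorithm.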
  41. Architecture diagram

    EC2: Compile, Deploy. Local box: Missing Words, Preprocess. S3: new file. Lambda: Search, Parse HTML, Detect Lang, Detect FRDs. ES: Index.
  42. None
  43. Setting up EC2

    $ yum -y install blas lapack atlas-sse3-devel
    $ dd if=/dev/zero of=/swapfile bs=1024 count=1500000
    $ mkswap /swapfile; chmod 0600 /swapfile; swapon /swapfile
    $ virtualenv ~/stack; source ~/stack/bin/activate
    $ pip install numpy scipy pandas sklearn
    $ strip `find ~/stack/lib/python2.7/ -name "*.so"`
    $ pushd ~/stack/lib/python2.7/site-packages/
    $ zip -r9q ~/lambda.zip * ; popd
    $ aws s3 cp ~/lambda.zip s3://my_bucket/lambda.zip
    $ aws lambda update-function-code --s3-bucket my_bucket \
        --s3-key lambda.zip --function-name lambda_function
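The slides show how the package is built and uploaded but not the function inside it. The following is a hedged sketch of what a handler triggered by a new file on S3 could look like. The handler name `lambda_handler` and the placeholder `process_object` are assumptions; only the shape of the S3 event (`Records`, `s3.bucket.name`, `s3.object.key`) follows the documented AWS notification format. It is written for Python 3, while the slide targets Python 2.7.

```python
from urllib.parse import unquote_plus


def process_object(bucket, key):
    """Hypothetical placeholder for the per-file work
    (search, parse HTML, detect language, detect FRDs)."""
    return {"bucket": bucket, "key": key}


def lambda_handler(event, context):
    # An S3 notification can carry several records, one per object.
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_object(bucket, key))
    return results


# Local smoke test with a hand-written event.
fake_event = {"Records": [{"s3": {"bucket": {"name": "my_bucket"},
                                  "object": {"key": "words/new+file.json"}}}]}
print(lambda_handler(fake_event, None))
```

Keeping the handler free of heavy imports at this level makes it easy to exercise locally before zipping it together with the compiled libraries.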
  44. Full AWS Lambda Walkthrough: bit.ly/ml_aws_lambda. All the Code: github.com/summer.ai/serapis

  45. Manuel Ebert @maebert