Manuel Ebert - Putting 1 million new words into the dictionary

2015 was the year of spocking, amabots, dadbods, and smol. Like half of all English words used every day, these words are not in the dictionary. Until we put them there. In this talk, I’ll describe how we found definitions for 1 million words that were missing from dictionaries, what it takes to do Natural Language Processing at that scale, and how to be the least popular Scrabble winner.

https://us.pycon.org/2016/schedule/presentation/2049/

PyCon 2016

May 29, 2016

Transcript

  1. 1
    @maebert
    #pycon2016

  2. +

  3. NERD TRIVIA
    Round 1

  4. Proposed withdrawal of Greece from the eurozone
    GREXIT

  5. Hackers who implant electronics into their bodies
    GRINDERS

  6. Gender-neutral form of Latina & Latino
    LATINX

  7. Small fat deposits on otherwise athletically built males
    DAD BODS

  8. Combination of blanket and scarf
    BLARF

  9. I don’t know why. It’s a
    perfectly cromulent word.

  10.

  11.

  12. Says UrbanDictionary.com:

  13. Free Range Definitions
    “Her gown was of white satin worked with gold,
    and had long open pendent sleeves, while from
    her slender and marble neck hung a cordeliere —
    a species of necklace imitated from the cord
    worn by Franciscan friars, and formed of crimson
    silk twisted with threads of Venetian gold.”
    - WH Ainsworth, Windsor Castle

  14. How many words are missing?

  15. How many words are missing?
    ONE MILLION WORDS

  16. What could possibly go wrong?
    The Plan

  17. Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB
    12 years, 8 months, 6 days, 15 hours
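
The arithmetic behind a sequential run of this plan can be sketched as follows. The 1.8M terms and 50 results per term are read off the slide; `seconds_per_page` is a made-up placeholder, not a figure from the talk, so the printed duration is illustrative and is not meant to reproduce the slide's estimate.

```python
def sequential_runtime_seconds(n_terms, pages_per_term, seconds_per_page):
    """Total seconds if every page is processed one after another."""
    return n_terms * pages_per_term * seconds_per_page


def humanize(seconds):
    """Break a duration into (years, days, hours), using 365-day years."""
    hours_total = int(seconds // 3600)
    days_total, hours = divmod(hours_total, 24)
    years, days = divmod(days_total, 365)
    return years, days, hours


# 1.8M terms and 50 results per term come from the slide;
# 1.0 second per page is a hypothetical cost, not a measured value.
total = sequential_runtime_seconds(1_800_000, 50, 1.0)
print(humanize(total))
```

Every extra second of per-page work adds the same amount again, which is why a pipeline of this shape is a natural candidate for running in parallel.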

  18. NERD TRIVIA
    Round 2

  19.

  20. S3 Elasticsearch
    EC2
    Lambda

  21. Missing Words
    Preprocess (local box)
    New file (S3)
    Index (ES)
    Lambda: search, parse HTML, detect lang, detect FRDs
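
The hand-off from a new S3 file to Lambda in this diagram might look roughly like the handler below. It is a sketch rather than the speaker's code: `process_file` is a hypothetical placeholder for the search, parsing, and detection steps, the sample event is trimmed to the fields the handler reads, and the code is written for Python 3.

```python
from urllib.parse import unquote_plus


def process_file(bucket, key):
    """Hypothetical placeholder for the search / parse / detect steps."""
    return {"bucket": bucket, "key": key, "status": "queued"}


def handler(event, context):
    """Entry point invoked by Lambda for an S3 'object created' notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_file(bucket, key))
    return results


# A trimmed-down example event, containing only the fields read above.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my_bucket"},
                "object": {"key": "batches/missing+words+001.json"}}}
    ]
}
print(handler(sample_event, None))
```

Because each uploaded file triggers its own invocation, splitting the word list into many small files is one way to spread the work across many concurrent Lambda executions.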

  22. PREPROCESSING (1.8M)
    [pipeline diagram as on slide 17]

  23. adeiladu bialya conquistar DL 3 Aminoisobutyric
    eu oi oa ou frappul galen's bondage etymologies
    h’ors d’oeuvres i collect bizarre animal confectionery dioramas
    janky kryogenkryokonitekryoscopy list of unusual deaths
    macãÆnaaaaa
    paenismus quaestuary revoltingly viviparous shit my pt says
    the dogs of slavery, misgovernment, and ostracism
    okcupid uranium-thorium dating
    outsnark
    vésigniéite woggle xenofeminism yebo zyxnoid

  25. adeiladu bialya conquistar
    eu oi oa ou frappul galen's bondage etymologies
    h’ors d’oeuvres i collect bizarre animal confectionery dioramas
    janky kryogenkryokonitekryoscopy list of unusual deaths
    macãÆnaaaaa
    paenismus quaestuary revoltingly viviparous shit my pt says
    the dogs of slavery, misgovernment, and ostracism
    okcupid uranium-thorium dating
    outsnark
    vésigniéite woggle xenofeminism yebo zyxnoid
    DL 3 Aminoisobutyric

  26. adeiladu bialya DL 3 Aminoisobutyric
    eu oi oa ou frappul galen's bondage etymologies
    h’ors d’oeuvres i collect bizarre animal confectionery dioramas
    janky kryogenkryokonitekryoscopy list of unusual deaths
    macãÆnaaaaa
    paenismus quaestuary revoltingly viviparous shit my pt says
    the dogs of slavery, misgovernment, and ostracism
    okcupid uranium-thorium dating
    outsnark
    vésigniéite woggle xenofeminism yebo zyxnoid
    conquistar

  28. adeiladu bialya conquistar DL 3 Aminoisobutyric eu oi oa ou frappul
    galen's bondage etymologies h’ors d’oeuvres
    janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa
    paenismus quaestuary revoltingly viviparous shit my pt says
    i collect bizarre animal confectionery dioramas
    the dogs of slavery, misgovernment, and ostracism
    okcupid uranium thorium dating
    outsnark vésigniéite woggle zyxnoid

    def valid_term(term):
        words = term.split()
        # A term is rejected as soon as any one of these rules is true.
        exclusion_rules = (
            any(len(word) > 15 for word in words),
            len(words) > 5,
            any(c in term for c in "█,!?:1234567890"),
            sum(ord(c) > 255 for c in term) > 2,
            all(len(word) < 3 for word in words)
        )
        return not any(exclusion_rules)
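
To see the filter in action, the sketch below restates `valid_term` from the slide and runs it over a handful of terms taken from the word list. The choice of candidate terms and the inline rule descriptions are editorial assumptions added for illustration.

```python
def valid_term(term):
    """Same rules as on the slide: reject a term if any rule fires."""
    words = term.split()
    exclusion_rules = (
        any(len(word) > 15 for word in words),      # an overly long word
        len(words) > 5,                             # too many words
        any(c in term for c in "█,!?:1234567890"),  # listed symbols or digits
        sum(ord(c) > 255 for c in term) > 2,        # several characters beyond Latin-1
        all(len(word) < 3 for word in words),       # only very short words
    )
    return not any(exclusion_rules)


candidates = [
    "janky",
    "woggle",
    "eu oi oa ou",
    "DL 3 Aminoisobutyric",
    "kryogenkryokonitekryoscopy",
    "i collect bizarre animal confectionery dioramas",
]
kept = [term for term in candidates if valid_term(term)]
print(kept)
```

Each rule is cheap and independent, so a rejected term can be traced back to the specific rule that fired, which helps when tuning the thresholds.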

  29.

  30. Detect Language
    [pipeline diagram as on slide 17]

  31. Detecting a language

    from collections import defaultdict

    def trigram_freq(text):
        # Relative frequency of every 3-character window in the text.
        trigrams = [text[k:k+3] for k in range(len(text)-2)]
        freq = defaultdict(float)
        for trigram in trigrams:
            freq[trigram] += 1.0 / len(trigrams)
        return freq

  32. Detecting a language

    languages = {
        "english": trigram_freq("a quick brown fox jumps…"),
        "italian": trigram_freq("ma la volpe col suo balzo…"),
        "klingon": trigram_freq("SoH 'ej SenwI' rIlwI' je …")
    }

    def detect_language(text):
        # Score each language by how much its trigram profile overlaps the text's.
        scores = defaultdict(float)
        for trigram, text_freq in trigram_freq(text).items():
            for lang, lang_freq in languages.items():
                scores[lang] += lang_freq[trigram] * text_freq
        return max(scores, key=scores.get)
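
For a version that runs end to end, the sketch below combines the two slides with complete reference texts. The two reference sentences and the test phrases are invented for illustration (the slide's own samples are elided), and real language profiles would need far more text than this. The only change to the logic is `lang_freq.get(...)`, which avoids adding empty entries to the reference profiles.

```python
from collections import defaultdict


def trigram_freq(text):
    """Relative frequency of every 3-character window in the text."""
    trigrams = [text[k:k + 3] for k in range(len(text) - 2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        freq[trigram] += 1.0 / len(trigrams)
    return freq


# Tiny, hand-written reference texts (assumptions, not from the talk).
languages = {
    "english": trigram_freq("the quick brown fox jumps over the lazy dog and the cat"),
    "italian": trigram_freq("la volpe veloce salta sopra il cane pigro e il gatto"),
}


def detect_language(text):
    """Pick the language whose trigram profile overlaps the text's the most."""
    scores = defaultdict(float)
    for trigram, text_freq in trigram_freq(text).items():
        for lang, lang_freq in languages.items():
            scores[lang] += lang_freq.get(trigram, 0.0) * text_freq
    return max(scores, key=scores.get)


print(detect_language("the dog jumps over the fox"))
print(detect_language("il gatto salta sopra la volpe"))
```

The score is a dot product between two sparse frequency vectors, so texts shorter than three characters produce no trigrams and cannot be scored.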

  33. Detecting a language (ghetto remix)

    stopwords = """all just being over both through its
    before herself had should to only under ours has do
    them his very they not now him nor did""".split()

    def is_english(text):
        words = text.split()
        if not words:  # avoid dividing by zero on empty input
            return False
        n_stopwords = sum(w in stopwords for w in words)
        return float(n_stopwords) / len(words) >= 0.12
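
The stopword check compares raw whitespace-separated tokens, so "They" or "now." would not match the list. One possible variation, shown below, lowercases the text and drops punctuation before counting. The normalisation step, the `stopword_ratio` helper, and the example sentences are assumptions added here and are not part of the talk; the stopword list and the 0.12 threshold are taken from the slide.

```python
import re

STOPWORDS = frozenset("""all just being over both through its
before herself had should to only under ours has do
them his very they not now him nor did""".split())


def stopword_ratio(text):
    """Share of tokens that are stopwords, after lowercasing and dropping punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in STOPWORDS for token in tokens) / len(tokens)


def is_english(text, threshold=0.12):
    """Treat a text as English when enough of its tokens are stopwords."""
    return stopword_ratio(text) >= threshold


print(is_english("They had to do it all over again, just not now."))
print(is_english("La volpe salta sopra il cane pigro."))
```

Using a `frozenset` also makes each membership test constant-time, which matters when the check runs over millions of pages.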

  34.

  35. Detect FRD
    [pipeline diagram as on slide 17]

  36. Text classification
    Romney is trying to prevent a stampede to Trump
    of Vichy Republicans, collaborationists coming to
    terms with the occupation of their party.
    vs.
    There are a lot of elected Vichy Republicans
    who don't know how to do anything but lose,
    or kowtow to an authority figure.

  37. What people think machine learning is about
    “Big data”: 10%
    Alluring brain-inspired algorithms: 90%
    What machine learning is actually about
    Choosing the right algorithm: 5%
    Picking the right features: 25%
    Having clean, strong data: 70%

  38. Text classification
    Training set: (.32, .14, .78, 1), (.2, .72, .03, 0), (.44, .05, .91, 1)
    Text → text vectoriser → (.32, .14, .78)
    (.15, .7, .65, ?) → classification model → .91

  39. Text Classification in Python

    training_data = [
        ("Horror vacui is a latin expression "
         "that means 'fear of emptiness'", 1),
        ("She put the cordeliere down next to a cap "
         "of black velvet faced with white satin", 0),
        ("Abusing anyone, I was told, violated Islamic "
         "tenets against zulm, or cruelty.", 1)
    ]
    sentences, classes = zip(*training_data)

  40. Text Classification in Python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    tfidf = TfidfVectorizer()
    vectors = tfidf.fit_transform(sentences)
    classifier = MultinomialNB()
    classifier.fit(vectors, classes)
    # Predict something
    s = "A rose is a rose"
    vectorised_s = tfidf.transform([s])
    classifier.predict(vectorised_s)
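
What the vectorising step does can be illustrated without any library. The sketch below is a deliberately simplified TF-IDF (term frequency times `log(N / df) + 1`), not scikit-learn's exact formula, and the three example documents and all function names are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real tokenisers do more."""
    return text.lower().split()


def fit_vocabulary(documents):
    """Return (sorted vocabulary, idf weight per word) for a list of texts."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    vocab = sorted(doc_freq)
    idf = {word: math.log(n_docs / doc_freq[word]) + 1.0 for word in vocab}
    return vocab, idf


def vectorise(text, vocab, idf):
    """Map a text to one number per vocabulary word (term frequency * idf)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[word] / total * idf[word] for word in vocab]


docs = ["a rose is a rose", "a word is a word", "the rose garden"]
vocab, idf = fit_vocabulary(docs)
print(vocab)
print(vectorise("a rose", vocab, idf))
```

Every text becomes a list of the same length, one slot per vocabulary word, which is the fixed-size numeric input a classifier such as naive Bayes expects.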

  41. Missing Words
    Preprocess (local box)
    New file (S3)
    Index (ES)
    Lambda: search, parse HTML, detect lang, detect FRDs
    Compile (EC2)
    Deploy

  42.

  43. Setting up EC2
    $ yum -y install blas lapack atlas-sse3-devel
    $ dd if=/dev/zero of=/swapfile bs=1024 count=1500000
    $ mkswap /swapfile; chmod 0600 /swapfile; swapon /swapfile
    $ virtualenv ~/stack; source ~/stack/bin/activate
    $ pip install numpy scipy pandas scikit-learn
    $ strip `find ~/stack/lib/python2.7/ -name "*.so"`
    $ pushd ~/stack/lib/python2.7/site-packages/
    $ zip -r9q ~/lambda.zip * ; popd
    $ aws s3 cp ~/lambda.zip s3://my_bucket/lambda.zip
    $ aws lambda update-function-code --s3-bucket my_bucket \
        --s3-key lambda.zip --function-name lambda_function

  44. Full AWS Lambda Walkthrough:
    bit.ly/ml_aws_lambda
    All the Code:
    github.com/summer.ai/serapis

  45. Manuel Ebert @maebert