Slide 1

Slide 1 text

1 @maebert #pycon2016

Slide 2

Slide 2 text

+

Slide 3

Slide 3 text

NERD TRIVIA Round 1

Slide 4

Slide 4 text

Proposed withdrawal of Greece from the eurozone: GREXIT

Slide 5

Slide 5 text

Hackers who implant electronics into their bodies: GRINDERS

Slide 6

Slide 6 text

Gender-neutral form of Latina & Latino: LATINX

Slide 7

Slide 7 text

Small fat deposits on otherwise athletically built males: DAD BODS

Slide 8

Slide 8 text

Combination of blanket and scarf: BLARF

Slide 9

Slide 9 text

I don’t know why. It’s a perfectly cromulent word.

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Says UrbanDictionary.com:

Slide 13

Slide 13 text

Free Range Definitions
“Her gown was of white satin worked with gold, and had long open pendent sleeves, while from her slender and marble neck hung a cordeliere — a species of necklace imitated from the cord worn by Franciscan friars, and formed of crimson silk twisted with threads of Venetian gold.”
WH Ainsworth, Windsor Castle

Slide 14

Slide 14 text

How many words are missing?

Slide 15

Slide 15 text

How many words are missing? ONE MILLION WORDS

Slide 16

Slide 16 text

The Plan: What could possibly go wrong?

Slide 17

Slide 17 text

Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB
12 years 8 months 6 days 15 hours

Slide 18

Slide 18 text

NERD TRIVIA Round 2

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

S3, Elasticsearch, EC2, Lambda

Slide 21

Slide 21 text

Architecture diagram labels: Missing Words, Preprocess, Local box, New file, S3, Index, ES, Search, Lambda, Parse HTML, Detect lang, Detect FRDs
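
The diagram appears to show a new file in S3 triggering the Lambda step. A minimal sketch of such a handler follows, in Python 3. The names `handler` and `process_page` and the returned dictionaries are assumptions for illustration, not code from the talk. Only the `Records` / `s3` / `bucket` / `object` layout follows the standard S3 event notification format.

```python
from urllib.parse import unquote_plus


def process_page(bucket, key):
    # Hypothetical placeholder: the stages named in the diagram
    # (parse HTML, detect language, detect FRDs) would run here.
    return {"bucket": bucket, "key": key}


def handler(event, context):
    """Entry point that Lambda invokes with an S3 notification event."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_page(bucket, key))
    return results
```

Keeping the per-file work in its own function makes it possible to test the pipeline stages locally without any AWS account, by passing in a hand-built event dictionary.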

Slide 22

Slide 22 text

PREPROCESSING
Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB

Slide 23

Slide 23 text

adeiladu bialya conquistar DL 3 Aminoisobutyric eu oi oa ou frappul galen's bondage etymologies h’ors d’oeuvres i collect bizarre animal confectionery dioramas janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says the dogs of slavery, misgovernment, and ostracism okcupid uranium-thorium dating outsnark vésigniéite woggle xenofeminism yebo zyxnoid

Slide 25

Slide 25 text

adeiladu bialya conquistar eu oi oa ou frappul galen's bondage etymologies h’ors d’oeuvres i collect bizarre animal confectionery dioramas janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says the dogs of slavery, misgovernment, and ostracism okcupid uranium-thorium dating outsnark vésigniéite woggle xenofeminism yebo zyxnoid DL 3 Aminoisobutyric

Slide 26

Slide 26 text

adeiladu bialya DL 3 Aminoisobutyric eu oi oa ou frappul galen's bondage etymologies h’ors d’oeuvres i collect bizarre animal confectionery dioramas janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says the dogs of slavery, misgovernment, and ostracism okcupid uranium-thorium dating outsnark vésigniéite woggle xenofeminism yebo zyxnoid conquistar

Slide 28

Slide 28 text

adeiladu bialya conquistar DL 3 Aminoisobutyric eu oi oa ou frappul galen's bondage etymologies h’ors d’oeuvres janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa paenismus quaestuary revoltingly viviparous shit my pt says i collect bizarre animal confectionery dioramas the dogs of slavery, misgovernment, and ostracism okcupid uranium thorium dating outsnark vésigniéite woggle zyxnoid

def valid_term(term):
    words = term.split()
    exclusion_rules = (
        any(len(word) > 15 for word in words),
        len(words) > 5,
        any(c in term for c in "█,!?:1234567890"),
        sum(ord(c) > 255 for c in term) > 2,
        all(len(word) < 3 for word in words)
    )
    return not any(exclusion_rules)
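
To see which terms the filter keeps, it can be run against a few of the sample terms from the slide. This is a usage sketch in Python 3 that repeats `valid_term` so it runs on its own. The choice of sample terms and the inline comments are additions, not part of the talk.

```python
def valid_term(term):
    words = term.split()
    exclusion_rules = (
        any(len(word) > 15 for word in words),      # an overly long word
        len(words) > 5,                             # too many words
        any(c in term for c in "█,!?:1234567890"),  # punctuation or digits
        sum(ord(c) > 255 for c in term) > 2,        # more than two characters beyond Latin-1
        all(len(word) < 3 for word in words),       # only very short words
    )
    return not any(exclusion_rules)


samples = [
    "janky",
    "conquistar",
    "xenofeminism",
    "DL 3 Aminoisobutyric",
    "eu oi oa ou",
    "i collect bizarre animal confectionery dioramas",
    "kryogenkryokonitekryoscopy",
    "the dogs of slavery, misgovernment, and ostracism",
]

results = {term: valid_term(term) for term in samples}
for term, keep in results.items():
    print("keep" if keep else "drop", term)
```

Note that "conquistar" passes these rules. The filter only looks at the shape of a term, so non-English words have to be caught by a later step.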

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

DETECT LANGUAGE
Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB

Slide 31

Slide 31 text

Detecting a language

from collections import defaultdict

def trigram_freq(text):
    trigrams = [text[k:k+3] for k in range(len(text)-2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        freq[trigram] += 1.0 / len(trigrams)
    return freq

Slide 32

Slide 32 text

Detecting a language

languages = {
    "english": trigram_freq("a quick brown fox jumps…"),
    "italian": trigram_freq("ma la volpe col suo balzo…"),
    "klingon": trigram_freq("SoH 'ej SenwI' rIlwI' je …")
}

def detect_language(text):
    scores = defaultdict(float)
    for trigram, text_freq in trigram_freq(text).items():
        for lang, lang_freq in languages.items():
            scores[lang] += lang_freq[trigram] * text_freq
    return max(scores, key=scores.get)
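
The two slides above can be combined into one runnable sketch in Python 3. The reference sentences below are made-up stand-ins for the elided texts on the slide and are far too short for real use. The Klingon profile is left out, and the `.get` lookup and the `None` return for very short input are additions.

```python
from collections import defaultdict


def trigram_freq(text):
    """Relative frequency of every 3-character window in text."""
    trigrams = [text[k:k + 3] for k in range(len(text) - 2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        freq[trigram] += 1.0 / len(trigrams)
    return freq


# Assumption: tiny made-up reference texts, only large enough for a demo.
languages = {
    "english": trigram_freq(
        "the quick brown fox jumps over the lazy dog and runs through the forest"
    ),
    "italian": trigram_freq(
        "la volpe veloce salta sopra il cane pigro e corre attraverso il bosco"
    ),
}


def detect_language(text):
    """Return the language whose trigram profile overlaps most with text."""
    scores = defaultdict(float)
    for trigram, text_freq in trigram_freq(text).items():
        for lang, lang_freq in languages.items():
            # .get avoids inserting unseen trigrams into the reference profiles
            scores[lang] += lang_freq.get(trigram, 0.0) * text_freq
    if not scores:  # text shorter than three characters has no trigrams
        return None
    return max(scores, key=scores.get)


print(detect_language("the lazy dog"))
print(detect_language("il cane pigro"))
```

With reference texts this small the scores are only meaningful for input that shares trigrams with one of them. A realistic profile would be built from a much larger corpus per language.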

Slide 33

Slide 33 text

Detecting a language (ghetto remix)

stopwords = """all just being over both through its before herself had should to only under ours has do them his very they not now him nor did""".split()

def is_english(text):
    words = text.split()
    n_stopwords = sum(w in stopwords for w in words)
    return float(n_stopwords) / len(words) >= 0.12
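
A short usage sketch of the stopword heuristic, in Python 3. The stopword list and the 0.12 threshold come from the slide. The lower-casing, the guard against empty input, the constant names and the sample sentences are additions for illustration.

```python
STOPWORDS = set("""all just being over both through its before herself had
should to only under ours has do them his very they not now him nor
did""".split())

THRESHOLD = 0.12  # minimum share of stopwords, taken from the slide


def is_english(text):
    words = text.lower().split()  # lower-casing is an addition to the slide's version
    if not words:                 # so is this guard against empty input
        return False
    n_stopwords = sum(w in STOPWORDS for w in words)
    return n_stopwords / len(words) >= THRESHOLD


print(is_english("She put it down before they did"))
print(is_english("ma la volpe col suo balzo"))
```

The heuristic needs no reference corpus, but it depends on the text being long enough to contain a few function words, so very short snippets tend to be rejected.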

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

DETECT FRD
Missing Words (x3M) → Preprocessing (1.8M) → Bing Search (x50) → HTML Parsing → Detect Language → Detect FRD → Save to DB

Slide 36

Slide 36 text

Text classification
Romney is trying to prevent a stampede to Trump of Vichy Republicans, collaborationists coming to terms with the occupation of their party.
vs.
There are a lot of elected Vichy Republicans who don't know how to do anything but lose, or kowtow to an authority figure.

Slide 37

Slide 37 text

What people think machine learning is about
“Big data”: 10%
Alluring brain-inspired algorithms: 90%

What machine learning is actually about
Choosing the right algorithm: 5%
Picking the right features: 25%
Having clean, strong data: 70%

Slide 38

Slide 38 text

Text classification
Text → Vectoriser → (.32, .14, .78)
Training set: (.32, .14, .78, 1), (.2, .72, .03, 0), (.44, .05, .91, 1)
(.15, .7, .65, ?) → Classification model → .91
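
The vectorising step in this diagram can be sketched in plain Python 3. This is a simple word-count vectoriser, so it produces integer counts rather than the fractional values shown on the slide. The function names, the toy sentences and their labels are all assumptions for illustration.

```python
def build_vocabulary(sentences):
    """Map every distinct lower-cased word to a column index."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(words)}


def vectorise(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in vocabulary:  # words unseen during training are ignored
            vector[vocabulary[word]] += 1
    return vector


# Hypothetical toy training set: (sentence, label), 1 = definition, 0 = not.
training_data = [
    ("a blarf is a blanket scarf", 1),
    ("she wore a scarf", 0),
]
sentences, labels = zip(*training_data)
vocabulary = build_vocabulary(sentences)

# Each training row is the feature vector followed by its label.
rows = [vectorise(s, vocabulary) + [label] for s, label in training_data]
for row in rows:
    print(row)
```

Every sentence ends up with the same number of columns, which is what lets a classification model compare an unlabelled vector with the labelled training rows.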

Slide 39

Slide 39 text

Text Classification in Python

training_data = [
    ("Horror vacui is a latin expression that means 'fear of emptiness'", 1),
    ("She put the cordeliere down next to a cap of black velvet faced with white satin", 0),
    ("Abusing anyone, I was told, violated Islamic tenets against zulm, or cruelty.", 1)
]
sentences, classes = zip(*training_data)

Slide 40

Slide 40 text

Text Classification in Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sentences)
classifier = MultinomialNB()
classifier.fit(vectors, classes)

# Predict something
s = "A rose is a rose"
vectorised_s = tfidf.transform([s])
classifier.predict(vectorised_s)
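
To make the TF-IDF step less of a black box, here is a plain Python 3 sketch of the weighting. It approximates what `TfidfVectorizer` does with its default settings as far as I understand them (smoothed inverse document frequency, L2-normalised rows, tokens of two or more word characters). It is not the library's implementation, and the helper names and sample sentences are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lower-case and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(sentences):
    """Return (vocabulary, one L2-normalised TF-IDF vector per sentence)."""
    docs = [Counter(tokenize(s)) for s in sentences]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
        for term in vocabulary
    }
    vectors = []
    for doc in docs:
        raw = [doc[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors


vocab, vecs = tfidf_vectors(["a rose is a rose", "a cordeliere is a necklace"])
print(vocab)
print(vecs[0])
```

In the first sentence "rose" ends up with a larger weight than "is", because "is" occurs in both sentences and therefore says less about either of them.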

Slide 41

Slide 41 text

Architecture diagram labels: Compile, EC2, Deploy, Missing Words, Preprocess, Local box, New file, S3, Index, ES, Search, Lambda, Parse HTML, Detect lang, Detect FRDs

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Setting up EC2

$ yum -y install blas lapack atlas-sse3-devel
$ dd if=/dev/zero of=/swapfile bs=1024 count=1500000
$ mkswap /swapfile; chmod 0600 /swapfile; swapon /swapfile
$ virtualenv ~/stack; source ~/stack/bin/activate
$ pip install numpy scipy pandas scikit-learn
$ strip `find ~/stack/lib/python2.7/ -name "*.so"`
$ pushd ~/stack/lib/python2.7/site-packages/
$ zip -r9q ~/lambda.zip * ; popd
$ aws s3 cp ~/lambda.zip s3://my_bucket/lambda.zip
$ aws lambda update-function-code --s3-bucket my_bucket \
    --s3-key lambda.zip --function-name lambda_function

Slide 44

Slide 44 text

Full AWS Lambda Walkthrough: bit.ly/ml_aws_lambda
All the Code: github.com/summer.ai/serapis

Slide 45

Slide 45 text

Manuel Ebert @maebert