Proposed withdrawal of Greece from the eurozone
GREXIT
Slide 5
Slide 5 text
Hackers who implant electronics into their bodies
GRINDERS
Slide 6
Slide 6 text
Gender-neutral form of Latina & Latino
LATINX
Slide 7
Slide 7 text
Small fat deposits on otherwise athletically built males
DAD BODS
Slide 8
Slide 8 text
Combination of blanket and scarf
BLARF
Slide 9
Slide 9 text
I don’t know why. It’s a
perfectly cromulent word.
@maebert
#pycon2016
Slide 10
Slide 10 text
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
Says UrbanDictionary.com:
Slide 13
Slide 13 text
Free Range Definitions
“Her gown was of white satin worked with gold,
and had long open pendent sleeves, while from
her slender and marble neck hung a cordeliere —
a species of necklace imitated from the cord
worn by Franciscan friars, and formed of crimson
silk twisted with threads of Venetian gold.”
W. H. Ainsworth, Windsor Castle
Slide 14
Slide 14 text
How many words are missing?
Slide 15
Slide 15 text
How many words are missing?
ONE MILLION WORDS
Slide 16
Slide 16 text
What could possibly go wrong?
The Plan
Slide 17
Slide 17 text
Missing Words (x3M) → PREPROCESSING (1.8M) → Bing Search (x50) → HTML PARSING → Detect Language → Detect FRD → Save to DB
12 years 8 months 6 days 15 hours
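The numbers on this slide can be related with a little arithmetic. The sketch below assumes (the slide does not say so) that the duration is the time to process every search result one after another, that each of the 1.8M terms yields 50 pages, and that a year has 365 days and a month 30 days. The helper `duration_seconds` is a hypothetical name introduced for illustration.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def duration_seconds(years, months, days, hours):
    """Convert a duration to seconds, assuming 365-day years and 30-day months."""
    total_days = years * 365 + months * 30 + days
    return total_days * SECONDS_PER_DAY + hours * SECONDS_PER_HOUR


terms = 1_800_000          # terms left after preprocessing (from the slide)
results_per_term = 50      # search results per term (from the slide)
pages = terms * results_per_term

total_seconds = duration_seconds(years=12, months=8, days=6, hours=15)
seconds_per_page = total_seconds / pages

print(f"{pages:,} pages, about {seconds_per_page:.2f} s per page if processed serially")
```

Under these assumptions the duration corresponds to a few seconds per page, which is why running the steps in parallel matters.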
Slide 18
Slide 18 text
NERD TRIVIA
Round 2
Slide 19
Slide 19 text
Slide 20
Slide 20 text
S3 Elasticsearch
EC2
Lambda
Slide 21
Slide 21 text
Missing Words
Preprocess
LOCAL BOX
NEW FILE
S3
INDEX
ES
SEARCH
Lambda
PARSE HTML
DETECT LANG
DETECT FRDs
Slide 22
Slide 22 text
Missing Words (x3M) → PREPROCESSING (1.8M) → Bing Search (x50) → HTML PARSING → Detect Language → Detect FRD → Save to DB
Slide 23
Slide 23 text
adeiladu bialya conquistar DL 3 Aminoisobutyric
eu oi oa ou frappul galen's bondage etymologies
h’ors d’oeuvres i collect bizarre animal confectionery dioramas
janky kryogenkryokonitekryoscopy list of unusual deaths
macãÆnaaaaa
paenismus quaestuary revoltingly viviparous shit my pt says
the dogs of slavery, misgovernment, and ostracism
okcupid uranium-thorium dating
outsnark
vésigniéite woggle xenofeminism yebo zyxnoid
Slide 25
Slide 25 text
adeiladu bialya conquistar
eu oi oa ou frappul galen's bondage etymologies
h’ors d’oeuvres i collect bizarre animal confectionery dioramas
janky kryogenkryokonitekryoscopy list of unusual deaths
macãÆnaaaaa
paenismus quaestuary revoltingly viviparous shit my pt says
the dogs of slavery, misgovernment, and ostracism
okcupid uranium-thorium dating
outsnark
vésigniéite woggle xenofeminism yebo zyxnoid
DL 3 Aminoisobutyric
Slide 26
Slide 26 text
adeiladu bialya DL 3 Aminoisobutyric
eu oi oa ou frappul galen's bondage etymologies
h’ors d’oeuvres i collect bizarre animal confectionery dioramas
janky kryogenkryokonitekryoscopy list of unusual deaths
macãÆnaaaaa
paenismus quaestuary revoltingly viviparous shit my pt says
the dogs of slavery, misgovernment, and ostracism
okcupid uranium-thorium dating
outsnark
vésigniéite woggle xenofeminism yebo zyxnoid
conquistar
Slide 28
Slide 28 text
adeiladu bialya conquistar DL 3 Aminoisobutyric eu oi oa ou frappul
galen's bondage etymologies h’ors d’oeuvres
janky kryogenkryokonitekryoscopy list of unusual deaths macãÆnaaaaa
paenismus quaestuary revoltingly viviparous shit my pt says
i collect bizarre animal confectionery dioramas
the dogs of slavery, misgovernment, and ostracism
okcupid uranium thorium dating
outsnark vésigniéite woggle zyxnoid
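def valid_term(term):
    """Return True if a candidate term is worth searching for."""
    words = term.split()
    exclusion_rules = (
        any(len(word) > 15 for word in words),      # implausibly long word
        len(words) > 5,                             # too many words
        any(c in term for c in "█,!?:1234567890"),  # block characters, punctuation, digits
        sum(ord(c) > 255 for c in term) > 2,        # more than two characters beyond Latin-1
        all(len(word) < 3 for word in words),       # nothing but very short words
    )
    return not any(exclusion_rules)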
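To see which rule rejects what, the filter can be run over a few of the terms shown on the preceding slides. This is a minimal sketch: the function body repeats the slide's code, and the `candidates` list is a hand-picked sample, not the full data set.

```python
def valid_term(term):
    words = term.split()
    exclusion_rules = (
        any(len(word) > 15 for word in words),
        len(words) > 5,
        any(c in term for c in "█,!?:1234567890"),
        sum(ord(c) > 255 for c in term) > 2,
        all(len(word) < 3 for word in words),
    )
    return not any(exclusion_rules)


candidates = [
    "outsnark",
    "woggle",
    "vésigniéite",
    "eu oi oa ou",                                        # only short words
    "DL 3 Aminoisobutyric",                               # contains a digit
    "kryogenkryokonitekryoscopy",                         # word longer than 15 characters
    "i collect bizarre animal confectionery dioramas",    # more than 5 words
    "the dogs of slavery, misgovernment, and ostracism",  # contains commas
]

kept = [term for term in candidates if valid_term(term)]
print(kept)  # ['outsnark', 'woggle', 'vésigniéite']
```

Accented Latin letters such as "é" have code points below 256, so "vésigniéite" passes the non-Latin-1 rule.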
Slide 29
Slide 29 text
Slide 30
Slide 30 text
Missing Words (x3M) → PREPROCESSING (1.8M) → Bing Search (x50) → HTML PARSING → Detect Language → Detect FRD → Save to DB
Slide 31
Slide 31 text
Detecting a language
from collections import defaultdict

def trigram_freq(text):
    trigrams = [text[k:k+3] for k in range(len(text)-2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        freq[trigram] += 1.0 / len(trigrams)
    return freq
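A small worked example makes the relative frequencies concrete. The function is repeated from the slide; the input string "banana" is an arbitrary choice for illustration.

```python
from collections import defaultdict


def trigram_freq(text):
    # Slide a window of three characters over the text
    trigrams = [text[k:k + 3] for k in range(len(text) - 2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        # Each occurrence contributes an equal share, so the values sum to 1
        freq[trigram] += 1.0 / len(trigrams)
    return freq


profile = trigram_freq("banana")
# "banana" has four trigrams: ban, ana, nan, ana
print(dict(profile))  # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
```

Texts shorter than three characters produce an empty profile.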
Slide 32
Slide 32 text
Detecting a language
languages = {
    "english": trigram_freq("a quick brown fox jumps…"),
    "italian": trigram_freq("ma la volpe col suo balzo…"),
    "klingon": trigram_freq("SoH 'ej SenwI' rIlwI' je …")
}

def detect_language(text):
    scores = defaultdict(float)
    for trigram, text_freq in trigram_freq(text).items():
        for lang, lang_freq in languages.items():
            # .get() keeps unseen trigrams from being added to the language profile
            scores[lang] += lang_freq.get(trigram, 0.0) * text_freq
    return max(scores, key=scores.get)
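The scoring step is a dot product between the trigram profile of the text and each language profile. Below is a runnable sketch under stated assumptions: the two sample sentences are stand-ins chosen for this example (not the talk's training texts), the `profiles` argument replaces the global dictionary, and returning `None` when nothing matches is an added design choice.

```python
from collections import defaultdict


def trigram_freq(text):
    trigrams = [text[k:k + 3] for k in range(len(text) - 2)]
    freq = defaultdict(float)
    for trigram in trigrams:
        freq[trigram] += 1.0 / len(trigrams)
    return freq


# Illustrative sample texts (assumption); real profiles need far more text
profiles = {
    "english": trigram_freq("the quick brown fox jumps over the lazy dog"),
    "italian": trigram_freq("ma la volpe col suo balzo ha raggiunto il quieto fido"),
}


def detect_language(text, profiles):
    scores = {lang: 0.0 for lang in profiles}
    for trigram, text_freq in trigram_freq(text).items():
        for lang, lang_freq in profiles.items():
            # Dot product of the two frequency vectors, one trigram at a time
            scores[lang] += lang_freq.get(trigram, 0.0) * text_freq
    best = max(scores, key=scores.get)
    # No shared trigram with any profile: report "unknown" rather than guess
    return best if scores[best] > 0 else None


print(detect_language("the lazy dog", profiles))
print(detect_language("la volpe", profiles))
```

With profiles this small, only text that shares trigrams with the samples can be recognised.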
Slide 33
Slide 33 text
Detecting a language (ghetto remix)
stopwords = set("""all just being over both through its
before herself had should to only under ours has do
them his very they not now him nor did""".split())

def is_english(text):
    words = text.split()
    if not words:
        # Empty text: nothing to count, and avoids dividing by zero
        return False
    n_stopwords = sum(w in stopwords for w in words)
    return float(n_stopwords) / len(words) >= 0.12
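A short usage sketch of the stopword heuristic follows. The stopword list and the 0.12 threshold come from the slide; lower-casing the words and the empty-text guard are additions made here as assumptions, and the two test sentences are made up for illustration.

```python
STOPWORDS = set("""all just being over both through its
before herself had should to only under ours has do
them his very they not now him nor did""".split())


def is_english(text, threshold=0.12):
    # Lower-casing is an added assumption so that "They" matches "they"
    words = text.lower().split()
    if not words:
        return False
    n_stopwords = sum(word in STOPWORDS for word in words)
    return n_stopwords / len(words) >= threshold


print(is_english("They had to do it all over"))  # True: 6 of 7 words are stopwords
print(is_english("ma la volpe col suo balzo"))   # False: no stopwords at all
```

The check is cheap but coarse: short English phrases without any stopword are rejected.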
Slide 34
Slide 34 text
Slide 35
Slide 35 text
Missing Words (x3M) → PREPROCESSING (1.8M) → Bing Search (x50) → HTML PARSING → Detect Language → Detect FRD → Save to DB
Slide 36
Slide 36 text
Text classification
Romney is trying to prevent a stampede to Trump
of Vichy Republicans, collaborationists coming to
terms with the occupation of their party.
vs.
There are a lot of elected Vichy Republicans
who don't know how to do anything but lose,
or kowtow to an authority figure.
Slide 37
Slide 37 text
What people think machine learning is about
“Big data”: 10%
Alluring brain-inspired algorithms: 90%
What machine learning is actually about
Choosing the right algorithm: 5%
Picking the right features: 25%
Having clean, strong data: 70%
Slide 38
Slide 38 text
Text classification
TEXT VECTORISER: text → (.32, .14, .78)
Training Set: (.32, .14, .78, 1), (.2, .72, .03, 0), (.44, .05, .91, 1)
Classification model: (.15, .7, .65, ?) → .91
Slide 39
Slide 39 text
Text Classification in Python
training_data = [
    ("Horror vacui is a latin expression "
     "that means 'fear of emptiness'", 1),
    ("She put the cordeliere down next to a cap "
     "of black velvet faced with white satin", 0),
    ("Abusing anyone, I was told, violated Islamic "
     "tenets against zulm, or cruelty.", 1)
]
sentences, classes = zip(*training_data)
Slide 40
Slide 40 text
Text Classification in Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sentences)
classifier = MultinomialNB()
classifier.fit(vectors, classes)
# Predict something
s = "A rose is a rose"
vectorised_s = tfidf.transform([s])
classifier.predict(vectorised_s)
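The vectoriser and the classifier can also be chained so that raw sentences go in and class probabilities come out. This is a minimal sketch using scikit-learn's `make_pipeline`, with the three training sentences from the earlier slide; three examples are far too few for a meaningful model, so the printed probabilities only show the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

training_data = [
    ("Horror vacui is a latin expression "
     "that means 'fear of emptiness'", 1),
    ("She put the cordeliere down next to a cap "
     "of black velvet faced with white satin", 0),
    ("Abusing anyone, I was told, violated Islamic "
     "tenets against zulm, or cruelty.", 1),
]
sentences, classes = zip(*training_data)

# The pipeline applies TF-IDF, then feeds the vectors to naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(sentences, classes)

# One row per input sentence, one column per class (ordered as in classes_)
probabilities = model.predict_proba(["A rose is a rose"])
class_labels = list(model.named_steps["multinomialnb"].classes_)
print(class_labels, probabilities[0])
```

Keeping both steps in one object ensures new text is vectorised with the vocabulary learned during training.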
Slide 41
Slide 41 text
Compile
EC2
Missing Words
Preprocess
LOCAL BOX
NEW FILE
S3
INDEX
ES
SEARCH
Lambda
PARSE HTML
DETECT LANG
DETECT FRDs
Deploy