Slide 1

Slide 1 text

Building a Language Identifier

Slide 2

Slide 2 text

• [Chinese sentence; characters garbled in extraction]
• Беда́ (никогда́) не прихо́дит одна́.
• A buen entendedor, pocas palabras bastan

Slide 3

Slide 3 text

Training Examples

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

• Wikinews
• EuroParl corpus
• UN Corpora
• Hand-crawled web pages

Slide 6

Slide 6 text

Extracting Features

Slide 7

Slide 7 text

n-grams

“On ne peut désirer ce qu'on ne connaît pas.”
→ on, ne, peut, désirer, ce, qu, on, ne, connaît, pas
→ on ne, ne peut, peut désirer, désirer ce, ce qu, qu on, on ne, ne connaît, connaît pas
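The unigram and bigram lists on this slide can be reproduced with a short sketch. The tokenizer here is an assumption (lowercase, split on anything that is not a letter), chosen because it yields the same tokens as the slide; the talk does not say which tokenizer was actually used, and `word_ngrams` is a hypothetical helper name.

```python
import re

def word_ngrams(text, n):
    """Return the word n-grams in text as space-joined strings."""
    # Assumed tokenizer: lowercase, then keep runs of letters only, so the
    # apostrophe in "qu'on" splits it into "qu" and "on".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "On ne peut désirer ce qu'on ne connaît pas."
unigrams = word_ngrams(sentence, 1)  # single words
bigrams = word_ngrams(sentence, 2)   # adjacent word pairs
print(unigrams)
print(bigrams)
```

A sentence of 10 tokens gives 10 unigrams and 9 bigrams, which matches the two lists above.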

Slide 8

Slide 8 text

the   pourquoi   a   [garbled term]   antidisestablishmentarianism   …
10    0          7   0                2                              …
0     8          9   0                0                              …
0     0          0   1                0                              …
6     0          3   0                0                              …
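A count table like this one (one row per document, one column per distinct term) could be built roughly as follows. This is a minimal sketch on made-up documents; `count_matrix` and the sample data are illustrative, not taken from the talk.

```python
from collections import Counter

def count_matrix(documents):
    """Build a dense document-by-term count table from tokenized documents."""
    # One column per distinct term seen anywhere in the corpus.
    vocabulary = sorted({term for doc in documents for term in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[column[term]] = count
        rows.append(row)
    return vocabulary, rows

# Hypothetical, already-tokenized example documents.
docs = [["the", "a", "the"], ["pourquoi", "a"]]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
print(rows)
```

Most entries in each row are zero, because any one document uses only a small part of the corpus vocabulary.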

Slide 9

Slide 9 text

N distinct features (n-grams) in corpus
M example documents

Slide 10

Slide 10 text

M × N × sizeof(double)

N distinct features (n-grams) in corpus
M example documents
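To get a feel for the size of M × N × sizeof(double), here is a small arithmetic sketch. The values of M and N are hypothetical, picked only for illustration; the talk does not give the real corpus size.

```python
import struct

def dense_matrix_bytes(m_documents, n_features):
    """Bytes needed to store an M x N matrix of doubles densely."""
    return m_documents * n_features * struct.calcsize("d")  # 8 bytes per double

M = 100_000      # hypothetical number of example documents
N = 1_000_000    # hypothetical number of distinct n-grams
total = dense_matrix_bytes(M, N)
print(f"{total / 1e9:.0f} GB")  # prints "800 GB"
```

At that scale a dense matrix does not fit in memory on an ordinary machine, which is the motivation for storing only the non-zero entries.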

Slide 11

Slide 11 text

Original shape: M × N

Slide 12

Slide 12 text

Original shape: M × N

value: 4 0 … 0 5 0 … 0 3 0 … 6 0 …

Slide 13

Slide 13 text

Original shape: M × N

index   value
1       4
2       0
…       …
666     0
667     5
668     0
…       …
986     0
987     3
988     0
…       …
1037    6
1038    0
…       …

Slide 14

Slide 14 text

Original shape: M × N

Slide 15

Slide 15 text

Original shape: M × N

Index   Value
1       4
667     5
987     3
1037    6
1408    10
2867    2
5680    1
7896    1
11763   4
15879   9
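The idea on these slides, keeping only the (index, value) pairs for non-zero entries plus the original shape, can be sketched as below. The helper names are hypothetical, and indices are 1-based only to match the slide.

```python
def to_sparse(dense_row):
    """Keep only non-zero entries as (index, value) pairs, 1-based like the slide."""
    return [(i, v) for i, v in enumerate(dense_row, start=1) if v != 0]

def to_dense(pairs, n_features):
    """Rebuild the full row; the original length must be stored separately."""
    row = [0] * n_features
    for index, value in pairs:
        row[index - 1] = value
    return row

dense = [4, 0, 0, 5, 0, 3, 0, 0]   # made-up row, mostly zeros
sparse = to_sparse(dense)
print(sparse)
print(to_dense(sparse, len(dense)) == dense)
```

Storage now grows with the number of non-zero entries rather than with N, at the cost of remembering the original shape.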

Slide 16

Slide 16 text

the, pourquoi, a, [garbled term], antidisestablishmentarianism → Hasher → 0 0 2 10 7
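One common way to implement a hasher like this is the hashing trick: each term is hashed straight to a column index, so no vocabulary has to be kept in memory. The sketch below is an assumption about the approach, not the talk's code. The choice of MD5 and the bucket count are illustrative; MD5 is used only because Python's built-in `hash()` for strings changes between runs.

```python
import hashlib

def hashed_counts(tokens, n_buckets):
    """Count tokens into a fixed-length vector using a stable hash of each token."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        vector[bucket] += 1  # distinct tokens may collide in one bucket
    return vector

tokens = ["the", "a", "the", "antidisestablishmentarianism"]
vector = hashed_counts(tokens, n_buckets=16)
print(vector)
```

The vector length is fixed up front, so unseen terms at prediction time still map to a valid column; the trade-off is occasional collisions between unrelated terms.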

Slide 17

Slide 17 text

Learning

Slide 18

Slide 18 text

Multinomial Naive Bayes

Work from home today?
Any meetings today?
Is it raining?
Am I out of coffee?
What’s the temperature outside?

% yes > % no
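For readers who want to see the mechanics, here is a minimal from-scratch sketch of multinomial Naive Bayes over token counts. It assumes add-one (Laplace) smoothing and ignores tokens never seen in training; the class name, the smoothing value, and the toy data are illustrative, and the talk may well have used a library implementation instead.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed add-one)

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for c in self.term_counts.values() for t in c}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocabulary:  # skip tokens unseen in training
                    score += math.log((counts[t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

docs = [["the", "cat", "sat"], ["the", "dog"], ["le", "chat"], ["le", "chien", "est"]]
labels = ["en", "en", "fr", "fr"]
model = TinyMultinomialNB().fit(docs, labels)
print(model.predict(["the", "dog"]), model.predict(["le", "chien"]))
```

As in the yes/no example on the slide, the prediction is simply whichever class ends up with the larger score.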

Slide 19

Slide 19 text

label + features → Classifier

labels: en, zh, ko, fr, en

Slide 20

Slide 20 text

Cross Validating

Slide 21

Slide 21 text

features → Classifier → predictions

predictions   known
en            en
ja            zh
ko            ko
fr            fr
nl            en

60% accurate
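The 60% figure follows directly from comparing the two columns: three of the five predictions match the known labels. A small sketch of that calculation, plus one simple way to split examples into folds, is shown below; `k_fold_indices` is a hypothetical helper with an assumed round-robin split, not the talk's procedure.

```python
def accuracy(predictions, known):
    """Fraction of predictions that equal the known label."""
    correct = sum(p == k for p, k in zip(predictions, known))
    return correct / len(known)

def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) pairs, assigning examples round-robin."""
    for fold in range(k):
        test = [i for i in range(n_examples) if i % k == fold]
        train = [i for i in range(n_examples) if i % k != fold]
        yield train, test

predictions = ["en", "ja", "ko", "fr", "nl"]
known = ["en", "zh", "ko", "fr", "en"]
score = accuracy(predictions, known)
print(f"{score:.0%} accurate")  # prints "60% accurate"

folds = list(k_fold_indices(5, 2))
```

In each fold the classifier is trained on the `train` indices and scored on the held-out `test` indices, so accuracy is always measured on documents it has not seen.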

Slide 22

Slide 22 text

The Pipeline

Slide 23

Slide 23 text

Document → Feature Extractor → Classifier → Prediction
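The pipeline idea is that feature extraction and classification are chained behind one fit/predict interface. The sketch below is only an illustration of that composition: `TinyPipeline`, `extract_words`, and `OverlapClassifier` are hypothetical names, and the classifier is a toy stand-in, not the one from the talk.

```python
from collections import Counter, defaultdict

def extract_words(document):
    """Stand-in feature extractor: lowercase word tokens."""
    return document.lower().split()

class OverlapClassifier:
    """Toy stand-in: picks the label whose training tokens overlap most."""
    def fit(self, token_lists, labels):
        self.seen = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.seen[label].update(tokens)
        return self

    def predict(self, tokens):
        return max(self.seen, key=lambda label: sum(self.seen[label][t] for t in tokens))

class TinyPipeline:
    """Runs the extractor, then the classifier, for both training and prediction."""
    def __init__(self, extract, classifier):
        self.extract = extract
        self.classifier = classifier

    def fit(self, documents, labels):
        self.classifier.fit([self.extract(d) for d in documents], labels)
        return self

    def predict(self, document):
        return self.classifier.predict(self.extract(document))

pipeline = TinyPipeline(extract_words, OverlapClassifier())
pipeline.fit(["the cat sat", "le chat dort"], ["en", "fr"])
print(pipeline.predict("the dog sat"), pipeline.predict("le chien dort"))
```

Because the same extractor runs at training and prediction time, raw documents can be passed in directly and the two stages cannot drift apart.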

Slide 24

Slide 24 text

Document → Feature Extractor + Feature Extractor → Classifier → Prediction

Slide 25

Slide 25 text

Document → Feature Extractor → Feature Transformer → Classifier

Slide 26

Slide 26 text

Making Predictions

Slide 27

Slide 27 text

[Chinese sentence; characters garbled in extraction] → Pipeline → Chinese

http://langue.herokuapp.com

Slide 28

Slide 28 text

Questions? (ask me to repeat them)

@zacstewart