Into the Wild - with Natural Language Processing and Text Classification - Data Natives Conference 2015

Talk from Data Natives Conference 2015 about an experimental project for Natural Language Processing.

Peter Grosskopf

December 02, 2015

Transcript

  1. Hey, I’m Peter. Developer (mostly Ruby), Founder (of Zweitag), Chief Development Officer @ HitFox Group, Department „Tech & Development“ (TechDev)
  2. How do we select the best people out of more than 1,000 applications every month in a consistent way? Machine Learning?
  3. Action Steps: 1. Prepare the textual data 2. Build a model to classify the data 3. Run it! 4. Display and interpret the results
  4. 1. Prepare: load data, kick out outliers, clean out stopwords (language detection + stemming with NLTK), define classes for workflow states, link data
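
A minimal sketch of how this preparation step could look, assuming NLTK's stopword lists and SnowballStemmer plus the langdetect package for language detection; the function name and the two-language mapping are illustrative, not taken from the talk.

    # Sketch of the "Prepare" step: detect language, drop stopwords, stem tokens.
    # Requires: pip install nltk langdetect, plus nltk.download("stopwords")
    # and nltk.download("punkt") for the tokenizer.
    from langdetect import detect
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize

    LANGUAGES = {"en": "english", "de": "german"}   # assumed language set

    def clean_document(text):
        lang = LANGUAGES.get(detect(text), "english")
        stops = set(stopwords.words(lang))
        stemmer = SnowballStemmer(lang)
        tokens = word_tokenize(text.lower())
        # keep only alphabetic, non-stopword tokens, reduced to their stems
        return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

    print(clean_document("I am a nice little text about machine learning"))
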
  5. 2. Build a model: tf-idf / bag of words. tf: term frequency, idf: inverse document frequency
  6. Transform / quantization from a textual shape to a numerical vector form: I am a nice little text -> v(i, am, a, nice, little, text) -> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
  7. Term frequency (tf): count occurrences in the document. I am a nice little text -> v(i, am, a, nice, little, text) -> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
  8. Inverse document frequency (idf): count how often a term occurs in the whole document set and invert with the logarithm. d1(I play a fun game) -> v1(i, play, a, fun, game); d2(I am a nice little text) -> v2(i, am, a, nice, little, text) -> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …) -> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
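
A small hand-rolled sketch that reproduces the idf example above (log base 10, raw term counts); a real pipeline would more likely use scikit-learn's TfidfVectorizer, whose default smoothing and natural logarithm give slightly different numbers.

    # tf-idf exactly as on the slide: tf = raw count in the document,
    # idf = log10(number of documents / documents containing the term).
    import math

    docs = [["i", "play", "a", "fun", "game"],            # d1
            ["i", "am", "a", "nice", "little", "text"]]   # d2

    def tfidf_vector(doc, corpus):
        n = len(corpus)
        vec = {}
        for term in doc:
            tf = doc.count(term)                          # term frequency
            df = sum(1 for d in corpus if term in d)      # document frequency
            vec[term] = tf * math.log10(n / df)           # tf * idf
        return vec

    print(tfidf_vector(docs[1], docs))
    # -> i and a score 0; am, nice, little, text score about 0.3, as on the slide
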
  9. Bag of words: simple approach to calculate the frequency of relevant terms. Ignores contextual information; better: n-grams
  10. n-grams: generate new tokens by concatenating neighbouring tokens. Example (1- and 2-grams): (nice, little, text) -> (nice, nice_little, little, little_text, text) -> from three tokens we just generated five tokens. Example 2 (1- and 2-grams): (new, york, is, a, nice, city) -> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
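
A tiny sketch of the 1- and 2-gram generation from the slide; in a scikit-learn pipeline the same effect comes from CountVectorizer or TfidfVectorizer with ngram_range=(1, 2), which joins neighbours with a space instead of an underscore.

    # Build 1- and 2-grams by concatenating neighbouring tokens.
    def one_and_two_grams(tokens):
        grams = []
        for i, tok in enumerate(tokens):
            grams.append(tok)                             # unigram
            if i + 1 < len(tokens):
                grams.append(tok + "_" + tokens[i + 1])   # bigram of neighbours
        return grams

    print(one_and_two_grams(["nice", "little", "text"]))
    # -> ['nice', 'nice_little', 'little', 'little_text', 'text']
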
  11. Define the runtime: train-test split by date (80/20). Approach: pick CVs randomly out of the test group and count how many CVs have to be screened to find all the good CVs
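
One way the date-based 80/20 split could be written, assuming the applications live in a pandas DataFrame with an application_date column; both names are assumptions for illustration.

    # Split by date: the oldest 80 % of applications train the model,
    # the newest 20 % are held out as the test group.
    import pandas as pd

    def split_by_date(df: pd.DataFrame, date_col="application_date", train_frac=0.8):
        df = df.sort_values(date_col)
        cutoff = int(len(df) * train_frac)
        return df.iloc[:cutoff], df.iloc[cutoff:]         # (train, test)

    # train, test = split_by_date(applications)
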
  12. 3. Run it! After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial-naive-bayes, stochastic-gradient-descent-classifier, logistic-regression and random-forest)
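
A scikit-learn sketch of this step with the four models named on the slide; the tf-idf vectorization over 1- to 4-grams follows the wrap-up slide, while the specific parameters are assumptions.

    # Vectorize the resume texts and fit the classical models from the slide.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    models = {
        "multinomial_nb": MultinomialNB(),
        "sgd": SGDClassifier(loss="log_loss"),            # logistic loss, trained with SGD
        "logreg": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100),
    }

    pipelines = {name: make_pipeline(TfidfVectorizer(ngram_range=(1, 4)), model)
                 for name, model in models.items()}

    # for name, pipeline in pipelines.items():
    #     pipeline.fit(train_texts, train_labels)
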
  13. 4. Results: generated with a combination of stochastic-gradient-descent-classifier and logistic-regression with the Python machine-learning library scikit-learn. AUC: 73.0615 %
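
The slide does not say how the two classifiers were combined; one plausible reading, shown here, is averaging their predicted probabilities and scoring the blend with scikit-learn's roc_auc_score (area under the ROC curve).

    # Blend two fitted pipelines by averaging their positive-class probabilities,
    # then report the area under the ROC curve on the held-out test set.
    from sklearn.metrics import roc_auc_score

    def combined_auc(sgd_pipeline, logreg_pipeline, test_texts, test_labels):
        p_sgd = sgd_pipeline.predict_proba(test_texts)[:, 1]
        p_lr = logreg_pipeline.predict_proba(test_texts)[:, 1]
        return roc_auc_score(test_labels, (p_sgd + p_lr) / 2.0)

    # print("AUC:", combined_auc(pipelines["sgd"], pipelines["logreg"], test_texts, test_labels))
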
  14. Wrap Up: 1. Prepare (import data, clean data) 2. Build Model (vectorize the CVs with 1 to 4 n-grams, define train-test split) 3. Run (choose Machine Learning model, run it!) 4. Interpret (visualize results, area under curve (AUC))
  15. Conclusion: after trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words was still the best. Explanation: CV documents do not contain much semantic structure
  16. Outlook: build a better database, experiment with new approaches and tune models, build a continuous learning model