Into the Wild - with Natural Language Processing and Text Classification - Data Natives Conference 2015

Talk from Data Natives Conference 2015 about an experimental project for Natural Language Processing.

Peter Grosskopf

December 02, 2015

Transcript

  1. Hey, I’m Peter. Developer (mostly Ruby), Founder (of Zweitag), Chief Development Officer @ HitFox Group, Department „Tech & Development“ (TechDev)
  2. How do we select the best people out of more than 1,000 applications every month in a consistent way? Machine Learning?
  3. Action Steps: 1. Prepare the textual data 2. Build a model to classify the data 3. Run it! 4. Display and interpret the results
  4. 1. Prepare: load data, kick out outliers, clean out stopwords (language detection + stemming with NLTK), define classes for workflow states, link data
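
A minimal sketch of how this preparation step could look, assuming NLTK's stopword lists and SnowballStemmer plus the langdetect package for language detection; the function name and the two-language mapping are illustrative, not taken from the talk.

    # Sketch of the "Prepare" step: detect language, drop stopwords, stem tokens.
    # Requires: pip install nltk langdetect, plus nltk.download("stopwords")
    # and nltk.download("punkt") for the tokenizer.
    from langdetect import detect
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize

    LANGUAGES = {"en": "english", "de": "german"}   # assumed language set

    def clean_document(text):
        lang = LANGUAGES.get(detect(text), "english")
        stops = set(stopwords.words(lang))
        stemmer = SnowballStemmer(lang)
        tokens = word_tokenize(text.lower())
        # keep only alphabetic, non-stopword tokens, reduced to their stems
        return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

    print(clean_document("I am a nice little text about machine learning"))
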
  5. 2. Build a model: tf-idf / bag of words. tf: term frequency, idf: inverse document frequency
  6. Transform / quantization from a textual shape to a numerical vector form: I am a nice little text -> v(i, am, a, nice, little, text) -> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
  7. Term frequency (tf): count occurrences in the document. I am a nice little text -> v(i, am, a, nice, little, text) -> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
  8. Inverse document frequency (idf): count how often a term occurs in the whole document set and invert with the logarithm. d1(I play a fun game) -> v1(i, play, a, fun, game); d2(I am a nice little text) -> v2(i, am, a, nice, little, text) -> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …) -> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
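
A small hand-rolled sketch that reproduces the idf example above (log base 10, raw term counts); a real pipeline would more likely use scikit-learn's TfidfVectorizer, whose default smoothing and natural logarithm give slightly different numbers.

    # tf-idf exactly as on the slide: tf = raw count in the document,
    # idf = log10(number of documents / documents containing the term).
    import math

    docs = [["i", "play", "a", "fun", "game"],            # d1
            ["i", "am", "a", "nice", "little", "text"]]   # d2

    def tfidf_vector(doc, corpus):
        n = len(corpus)
        vec = {}
        for term in doc:
            tf = doc.count(term)                          # term frequency
            df = sum(1 for d in corpus if term in d)      # document frequency
            vec[term] = tf * math.log10(n / df)           # tf * idf
        return vec

    print(tfidf_vector(docs[1], docs))
    # -> i and a score 0; am, nice, little, text score about 0.3, as on the slide
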
  9. Bag of words: simple approach to calculate the frequency of relevant terms. Ignores contextual information; better: n-grams
  10. n-grams: generate new tokens by concatenating neighbouring tokens. Example (1- and 2-grams): (nice, little, text) -> (nice, nice_little, little, little_text, text) -> from three tokens we just generated five tokens. Example 2 (1- and 2-grams): (new, york, is, a, nice, city) -> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
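
A tiny sketch of the 1- and 2-gram generation from the slide; in a scikit-learn pipeline the same effect comes from CountVectorizer or TfidfVectorizer with ngram_range=(1, 2), which joins neighbours with a space instead of an underscore.

    # Build 1- and 2-grams by concatenating neighbouring tokens.
    def one_and_two_grams(tokens):
        grams = []
        for i, tok in enumerate(tokens):
            grams.append(tok)                             # unigram
            if i + 1 < len(tokens):
                grams.append(tok + "_" + tokens[i + 1])   # bigram of neighbours
        return grams

    print(one_and_two_grams(["nice", "little", "text"]))
    # -> ['nice', 'nice_little', 'little', 'little_text', 'text']
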
  11. Define the runtime: train-test split by date (80/20). Approach: pick CVs randomly out of the test group and count how many CVs have to be screened to find all the good CVs
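
One way the date-based 80/20 split could be written, assuming the applications live in a pandas DataFrame with an application_date column; both names are assumptions for illustration.

    # Split by date: the oldest 80 % of applications train the model,
    # the newest 20 % are held out as the test group.
    import pandas as pd

    def split_by_date(df: pd.DataFrame, date_col="application_date", train_frac=0.8):
        df = df.sort_values(date_col)
        cutoff = int(len(df) * train_frac)
        return df.iloc[:cutoff], df.iloc[cutoff:]         # (train, test)

    # train, test = split_by_date(applications)
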
  12. 3. Run it! After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial-naive-bayes, stochastic-gradient-descent-classifier, logistic-regression and random-forest)
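
A scikit-learn sketch of this step with the four models named on the slide; the tf-idf vectorization over 1- to 4-grams follows the wrap-up slide, while the specific parameters are assumptions.

    # Vectorize the resume texts and fit the classical models from the slide.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    models = {
        "multinomial_nb": MultinomialNB(),
        "sgd": SGDClassifier(loss="log_loss"),            # logistic loss, trained with SGD
        "logreg": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100),
    }

    pipelines = {name: make_pipeline(TfidfVectorizer(ngram_range=(1, 4)), model)
                 for name, model in models.items()}

    # for name, pipeline in pipelines.items():
    #     pipeline.fit(train_texts, train_labels)
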
  13. 4. Results: generated with a combination of stochastic-gradient-descent-classifier and logistic-regression with the Python machine-learning library scikit-learn. AUC: 73.0615 %
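
The slide does not say how the two classifiers were combined; one plausible reading, shown here, is averaging their predicted probabilities and scoring the blend with scikit-learn's roc_auc_score (area under the ROC curve).

    # Blend two fitted pipelines by averaging their positive-class probabilities,
    # then report the area under the ROC curve on the held-out test set.
    from sklearn.metrics import roc_auc_score

    def combined_auc(sgd_pipeline, logreg_pipeline, test_texts, test_labels):
        p_sgd = sgd_pipeline.predict_proba(test_texts)[:, 1]
        p_lr = logreg_pipeline.predict_proba(test_texts)[:, 1]
        return roc_auc_score(test_labels, (p_sgd + p_lr) / 2.0)

    # print("AUC:", combined_auc(pipelines["sgd"], pipelines["logreg"], test_texts, test_labels))
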
  14. Wrap Up: 1. Prepare (import data, clean data) 2. Build Model (vectorize the CVs with 1 to 4 n-grams, define train-test split) 3. Run (choose Machine Learning model, run it!) 4. Interpret (visualize results, area under curve (AUC))
  15. Conclusion: after trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words was still the best. Explanation: CV documents do not contain much semantic structure
  16. Outlook: build a better database, experiment with new approaches and tune models, build a continuous learning model