
The CV - A Data Scientist's View


Talk given at PyData London 2016

Rui Miguel Forte

May 07, 2016

Transcript

  1. The CV: A Data Scientist’s View
     Rui Miguel Forte, Lead Data Scientist @ Workable
  2. About Me
     •  Previously worked on marketing language optimization and educational gaming
     •  Founder of Data Science Athens Meetup www.meetup.com/Data-Science-Athens/
     •  Author, Mastering Predictive Analytics with R (I’m signing a few free copies tomorrow at lunch)
     •  Instructor at MSc. Business Analytics at the Athens University of Economics and Business (AUEB) analytics.dmst.aueb.gr/
  3. Workable
     •  Formed in 2012
     •  In a nutshell, offers a Software-as-a-Service (SaaS) platform for managing the hiring process of small and medium businesses.
     Create a job posting => Publish on job boards => Receive Candidates => Interact with Candidates => Hire
  4. Data Science at Workable
     •  We have data on millions of candidates who have applied to hundreds of thousands of jobs across a wide array of industries in many different countries
     •  Massive opportunities for data science:
        –  Identifying duplicate candidates applying for the same job
        –  Matching candidate profiles to jobs
        –  Matching job postings to relevant job boards
        –  Detecting spam (job ads, candidates etc.)
        –  Creating structured candidate profiles from unstructured data such as CVs
        –  Finding and merging data from sources such as social profiles
        –  Classifying documents (CVs, cover letters etc.)
  5. Why Bother with CVs?
     •  You need a structured profile to power features from better search to recommendations
     •  Most users can fill in an online form or apply with their LinkedIn account, but the CV is not dead
        –  Many recruiters operate with CVs with obfuscated candidate details
        –  Clients often have a databank of CVs from candidates who applied in the past
        –  Some people simply prefer to submit their CV
  6. The CV Parsing Task
     •  We are interested in these fields:
        –  Name (First, Last, Middle)
        –  Contact Details (Phone, Email, Address)
        –  Educational Background: Degree, Dates, University, Subject
        –  Work Experiences: Job Title, Dates, Company
        –  Skills
        –  Social Network Profiles
     First Name: John, Last Name: …
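The field list on this slide maps naturally onto a small schema. A minimal Python sketch of such a target structure (the class and field names are my own naming, not Workable’s actual data model):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Education:
    degree: Optional[str] = None
    dates: Optional[str] = None
    university: Optional[str] = None
    subject: Optional[str] = None

@dataclass
class Experience:
    job_title: Optional[str] = None
    dates: Optional[str] = None
    company: Optional[str] = None

@dataclass
class Profile:
    # Name and contact details
    first_name: Optional[str] = None
    middle_name: Optional[str] = None
    last_name: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None
    address: Optional[str] = None
    # Repeated sections
    education: List[Education] = field(default_factory=list)
    experience: List[Experience] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
    social_profiles: List[str] = field(default_factory=list)
```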
  7. Breaking Down the Task
     Content Analysis => Text Pre-Processing => Section Identification => Sentence Classification => Named Entity Recognition (NER) => Profile Construction => Profile Post-Processing
  8. Content Analysis
     •  Content Analysis allows us to find and extract:
        –  The textual content of a CV
        –  Metadata, such as author information
        –  Embedded Images
     •  We use Apache Tika
  9. (image-only slide)
  10. (image-only slide)
  11. Text Extraction Issues
     •  Information may appear out of order:
        –  Tables often cause this, as do unusual layouts and word art
        –  This is especially bad if sentences change section or get split up
     •  Headers and footers may appear interspersed in the text
     •  Layout/markup information is lost
     •  Characters may be altered (accents), repeated (strangely enough), spaced out (due to weird fonts or markup), lost (due to encoding issues), or inserted (usually due to encoding issues)
  12. Mitigating Text Extraction Errors
     •  Try to detect specific instances of badly extracted text, e.g. R u i M i g u e l F o r t e => Rui Miguel Forte
     •  Currently investigating ways to detect other problems, such as badly extracted tables
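The spaced-out-letters repair on this slide can be sketched with a regular expression. This sketch assumes that word boundaries survive extraction as wider gaps than the single spaces injected between letters, which is often but not always the case:

```python
import re

# Assumption: letters of one word were spaced out as "R u i" while word
# boundaries survived as wider gaps ("R u i  M i g u e l").
SPACED_RUN = re.compile(r'\b(?:\w ){2,}\w\b')

def collapse_spaced_letters(text: str) -> str:
    # Join each run of single spaced-out letters back into one word
    text = SPACED_RUN.sub(lambda m: m.group(0).replace(' ', ''), text)
    return re.sub(r' {2,}', ' ', text)  # tidy the leftover wide gaps
```

Normal prose passes through unchanged, since genuine words rarely consist of runs of single letters separated by single spaces.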
  13. Text Pre-Processing
     •  We process text using a typical NLP pipeline
     •  We mostly use open source implementations such as OpenNLP and Stanford CoreNLP
     •  Chunking (as well as dependency parsing) generally has particularly low accuracy on our data
     Cleaning => Sentence Splitting => Tokenizing => Part of Speech Tagging => Chunking
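The talk hands the heavier stages of this pipeline to OpenNLP and Stanford CoreNLP; the first few steps might look like the following naive Python sketch (the real cleaner, splitter and tokenizer are far more careful, as the next slide explains):

```python
import re

def clean(text: str) -> str:
    # Cleaning: collapse the whitespace runs left over from text extraction
    return re.sub(r'\s+', ' ', text).strip()

def split_sentences(text: str):
    # Naive splitter on sentence-final punctuation; the talk notes that
    # real CV text needs a smarter, whitespace-aware splitter
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def tokenize(sentence: str):
    # Words (keeping simple apostrophes) and punctuation as separate tokens
    return re.findall(r"\w+(?:['’]\w+)?|[^\w\s]", sentence)
```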
  14. Text Pre-Processing Issues
     •  Existing models for NLP tasks are usually trained on grammatical sentences from curated corpora.
     •  CVs:
        –  Are ungrammatical and often have many typos
        –  Have many headlines / fragments (incomplete sentences)
        –  Have a large proportion of proper nouns
     •  Sentence splitting is tricky: we want to keep related entities in the same sentence, but they may be separated by a lot of whitespace.
     •  Training new models is difficult and expensive
  15. Section Identification
     •  The most regular aspect of a CV is that it is generally organized in a series of fairly recognizable sections.
  16. Section Identification
     Cersei Lannister
     Public Relations and Management Consultant with extensive experience in managerial restructuring.
     Education
     BA Human Rights, University of King’s Landing
     Experience
     Queen Regent, King’s Landing
     -  Managed a large team of guards and knights for keeping the King’s peace
     -  Taught leadership skills to King Geoffrey, First of his Name, King of the Andals, the Rhoynar and the First Men, Lord of the Seven Kingdoms and Protector of the Realm (contract expired)
  17. Section Identification
     •  Key Idea: To find the boundaries of the few sections we are interested in, we need to know about many different sections
        –  A section ends where another begins!
     •  Second Key Idea: This could be a great feature on its own!
        –  Computer-assisted data entry
        –  Better search (weighting terms and phrases in relevant sections)
     •  We generally identify headers using regular expressions, e.g. ((Tertiary|Higher)\s+)?Education|(Academic|Educational)\s+Qualifications
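The header regex quoted on the slide could be applied per extracted line roughly like this; the anchoring and case-insensitivity are my assumptions about how it is used:

```python
import re

# Header pattern quoted from the slide, anchored so it only matches a line
# that is the header by itself (an assumption about how it is applied).
EDUCATION_HEADER = re.compile(
    r'^\s*(?:(?:Tertiary|Higher)\s+)?Education\s*$'
    r'|^\s*(?:Academic|Educational)\s+Qualifications\s*$',
    re.IGNORECASE,
)

def find_section_headers(lines):
    # Return the indices of lines that look like education-section headers
    return [i for i, line in enumerate(lines) if EDUCATION_HEADER.match(line)]
```

A section then runs from one detected header to the next, which is exactly the "a section ends where another begins" idea.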
  18. Sentence Classification
     Sentence Labels: Work Experience Header, Education Header, Contact Details, Skills, Other
  19. Sentence Classification
     •  Most of the structured information that we want is concentrated in a small number of sentences
     •  We build models (currently SVM based) using annotated CVs that classify each sentence in a CV into one of a set of labels
        –  The features we use include things like capitalization, spacing, POS features, presence of particular words (Bag of Words features), as well as section information
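A sketch of the kind of per-sentence feature extraction described here, whose output would feed a downstream classifier such as an SVM. The exact feature set is not public, so these features are illustrative:

```python
import re

# Label set from the previous slide
LABELS = ['work_experience_header', 'education_header', 'contact_details', 'skills', 'other']

def sentence_features(sentence: str, section: str) -> dict:
    # Hand-crafted features of the kind mentioned in the talk; the actual
    # features used at Workable are not public, so these are illustrative.
    tokens = sentence.split()
    feats = {
        'n_tokens': len(tokens),
        'all_caps': sentence.isupper(),                                # capitalization
        'title_case_ratio': sum(t.istitle() for t in tokens) / max(len(tokens), 1),
        'has_email': bool(re.search(r'\S+@\S+', sentence)),
        'has_digits': bool(re.search(r'\d', sentence)),
        'section': section,  # output of the section-identification step
    }
    for tok in tokens:           # bag-of-words indicator features
        feats[f'bow={tok.lower()}'] = True
    return feats
```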
  20. Named Entity Recognition
     •  Named Entity Recognition (NER) is the task of finding particular entities, such as the names of people or organizations, in text
     Example: Team Lead – Customer Relations Oct 2008 – Dec 2009 (Azdecca Co.)
     Key: Job Title, Date, Organization
  21. Named Entity Recognition
     •  NER is a difficult task, as you need to get both entity boundaries and labels correct.
     •  We have had little success with existing tools such as AlchemyAPI, OpenNLP etc.
        –  They are trained on high quality text from a different domain
        –  They sometimes don’t even have some of the labels we want
  22. A Basic NER
     •  One can build a basic rule-based NER by hand using a mixture of regular expressions and lookup lists
        –  These then become features of an ML-based approach
        –  Companies (Organizations) turn out to be by far the hardest entities to distinguish properly
        –  Depending on how strict your rules are, you can balance precision and recall
     •  In our case, we can also use information about what the label is for the current sentence
        –  Of course, the sentence classification task could benefit if we knew what entities a sentence incorporated, so this is a chicken-and-egg problem
        –  We are considering a joint model for these two tasks
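A toy version of such a rule-based NER, combining a lookup list with a date regex on the example sentence from slide 20. The lists and patterns here are illustrative stand-ins, not Workable’s:

```python
import re

# Illustrative lookup list; a production system would use much larger gazetteers.
JOB_TITLES = {'team lead', 'software engineer', 'consultant'}

# Matches dates like "Oct 2008"
DATE = re.compile(r'\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}\b')

def tag_entities(sentence: str):
    entities = []
    for m in DATE.finditer(sentence):          # regex-based entities
        entities.append(('DATE', m.group(0)))
    lowered = sentence.lower()
    for title in JOB_TITLES:                   # lookup-list entities
        if title in lowered:
            entities.append(('JOB_TITLE', title))
    return entities
```

Loosening the date pattern or growing the lookup lists raises recall at the expense of precision, which is the trade-off the slide mentions.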
  23. We’re Building Our Own NER Model
     •  I won’t go into the details of Conditional Random Fields and the other sequence prediction models we compare when training on our data
        –  What is more interesting is the data we are collecting
     •  Feature design is more critical, and in our case it is especially tricky because NER is so far down the pipeline
        –  Very noisy input features (badly extracted text, wrong tokenization, wrong POS tagging, wrong parsing)
        –  Rely on coarser features that are more reliable
     •  Obtaining annotated data is resource intensive and very error prone, but can be well worth the effort
  24. Preparing an Annotated Data Set
     •  Make sure annotators are given detailed guidelines with plenty of examples
        –  We have a forty-page manual …
        –  … which we constantly revise and maintain
     •  Split training data into batches
        –  This allows changes to the annotation methodology
     •  Have a lot of redundancy
        –  We use three-way annotation for all our data
     •  Start building models early! Do not wait till you get what you feel will be enough data
        –  This will guide decisions on the data that we need to get
        –  We are considering active learning
  25. Skill Extraction from CVs
     •  Defining the concept of a skill is very tricky
        –  Few domains (IT in particular) have well-defined skills
     •  Industries, products, processes, areas of knowledge, foreign languages, character attributes, even job titles appear as skills
     •  A journalist might say “Newspapers”, a musician “Clarinet”, an HR professional “CVs”, a project manager “out-of-the-box thinking”
     •  Building a taxonomy of skills is unfortunately mostly a manual process
        –  e.g. we took the unique skills from all our candidates who applied from LinkedIn. On a random sample, < 1% were normalized and unambiguously identifiable as skills
  26. Profile Construction
     •  Once we have identified our important sentences and discovered the relevant named entities, we construct the profile
     •  We’ve trained models for picking out the best candidates for information such as name and email, for which there is only one correct answer
     •  We use simple rules to identify related entities (e.g. job titles with companies) in order to form complex elements
        –  We can use a simple slot-filling approach if our sentence classification works well
        –  Entity relation models are a future option
  27. Profile Post-Processing
     •  Post-Processing involves normalization and filtering
     •  Normalization involves:
        –  Fixing appearance, e.g. mr. p k chang => P. K. Chang
        –  Finding the standard form of known entities: Imperial => Imperial College London; Python Programming => Python
     •  Normalization is very important
        –  Improved visualization
        –  Improved searchability
        –  Better profile comparison for de-duplication and recommendations
     •  Filtering prunes out components that seem too sparse
        –  e.g. work experiences with just a job title in them
        –  Also acts as an error correction step
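The mr. p k chang => P. K. Chang example could be implemented along these lines; the honorifics list and the exact rules are illustrative, not Workable’s:

```python
import re

HONORIFICS = {'mr', 'mrs', 'ms', 'dr', 'prof'}  # illustrative list

def normalize_name(raw: str) -> str:
    # Strip honorifics, capitalise words, and render single letters as
    # initials, e.g. "mr. p k chang" -> "P. K. Chang".
    parts = [p for p in re.split(r'[\s.]+', raw.lower()) if p]
    parts = [p for p in parts if p not in HONORIFICS]
    out = []
    for p in parts:
        out.append(p.upper() + '.' if len(p) == 1 else p.capitalize())
    return ' '.join(out)
```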
  28. Managing Candidate Profiles
     •  Forming the final candidate profile involves an extra step in which we check whether this person has applied to our client before
  29. Finding Duplicate Candidates

  30. It’s Not Trivial…
     •  There is no universal identifier for people
     •  Fields can change (address, phone, name) or have many valid options (email, phone)
     •  Data can be presented in different ways: Rui Miguel Forte; R. M. Forte; Forte, Rui Miguel; RUI M. FORTE; Ρούι Μιγκέλ Φόρτε (my name in Greek)
  31. Our Approach
     Normalize Fields => Compare field pairs with a suitable comparator (exact, Jaccard, Geo etc.) => Aggregate pairwise scores into a single score => Identify high-scoring pairs
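The comparison and aggregation steps above can be sketched as a weighted average of per-field comparator scores; the fields, comparators and weights below are illustrative stand-ins for whatever the production system uses:

```python
def jaccard(a: set, b: set) -> float:
    # Set-overlap comparator: |intersection| / |union|
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def profile_score(p1: dict, p2: dict, weights=None) -> float:
    # One comparator per field, then a weighted average of pairwise scores.
    weights = weights or {'email': 0.5, 'name': 0.3, 'skills': 0.2}
    scores = {
        'email': 1.0 if p1['email'] == p2['email'] else 0.0,      # exact match
        'name': jaccard(set(p1['name'].lower().split()),
                        set(p2['name'].lower().split())),         # token Jaccard
        'skills': jaccard(set(p1['skills']), set(p2['skills'])),  # set Jaccard
    }
    return sum(w * scores[f] for f, w in weights.items())
```

Pairs whose aggregate score clears a threshold are then flagged as likely duplicates.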
  32. Tips and Tricks
     •  Prune bag-of-words features intelligently, for both performance and overfitting
        –  e.g. we discovered certain company names were frequent when looking for fraudulent job descriptions
     •  Watch the memory footprint
        –  e.g. in the sentence classification problem, we first had features computed for every sentence before classifying all of them
     •  Do not go with ML if you can get high accuracy with hand-crafted systems (rules, regexes etc.)
        –  e.g. often for a data fusion problem you know exactly the rules you want
        –  e.g. language detection in CVs can be done well with just a lookup of key terms we expect to find in CVs
  33. Tips and Tricks
     •  Text Mantra: Normalize, normalize, normalize
        –  Removing accents from names makes it easy to look them up, e.g. François => Francois
        –  Look for field-specific transformations, which could be rules or lookups against a standard entity such as the name of a company
        –  It is *far* better to have a simple ML model that compares normalized text than a more complex one on un-normalized text
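Accent removal of the François => Francois kind is a short operation with Unicode normalization in Python:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFKD) and drop the combining marks,
    # e.g. "François" -> "Francois"
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```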
  34. Tips and Tricks
     •  Always incorporate performance in your model design decisions.
        –  We had marginally better performance with Random Forests when we did sentence classification, but they require a lot more memory, so we went with SVMs instead.
        –  We had a deduplication model that used name variants (Nikos / Nikolaos, Pavlos / Pavel, Jim / James) but ended up making too many comparisons for very little gain
     •  When building models, reproducibility is essential but is often overlooked / not properly planned. Keep:
        –  A copy of the data you used in the experiment
        –  The source code you used to compute particular features
        –  The actual results of the experiment, for verification and cross-checking
  35. Tips and Tricks
     •  Benchmark your work
        –  Maintain several benchmark data sets for all the tasks you work on
        –  We use several data sets with thousands of CVs
        –  Benchmark suites can be used as part of continuous integration
     •  Maintain a product perspective at all times
        –  Build things that are useful to the business
        –  Deliver features incrementally, e.g. CV parsing has several useful intermediate steps such as contact details extraction and section identification
  36. Tips and Tricks
     •  Carefully craft your regular expressions
        –  In Python these are compiled and cached automatically; make sure you compile them yourself in languages like Java
        –  Take care to avoid issues such as catastrophic backtracking
     •  Test. Extensively. Aggressively. And profusely.
        –  Feature computations, accuracy metrics etc.
        –  Can even write smoke tests for trained models using easily predicted examples
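In Python, the idiomatic counterpart of Java’s `Pattern.compile` is a module-level `re.compile`, which also gives the pattern a reviewable name. The phone pattern below is illustrative:

```python
import re

# Compile once at import time, not inside the function that uses it.
# (Java: java.util.regex.Pattern.compile serves the same purpose.)
# Avoid nested quantifiers like (a+)+ that can backtrack catastrophically.
PHONE = re.compile(r'\+?\d[\d\s().-]{6,}\d')  # illustrative phone pattern

def find_phones(text: str):
    return PHONE.findall(text)
```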
  37. We are Hiring!