Encoding categorical variables with categorical encoders

Who am I
• Will McGinnis
• Write code at Predikto (we’re hiring)
• www.predikto.com
• twitter.com/willmcginnis
• github.com/wdm0006

What is a categorical variable?
• Usually letters or words
• Sometimes integers
• “a variable that can take on one of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or ‘category.’”
• e.g.: color (red, green, blue), state, sex

Goal
• Convert these categorical variables into numbers in a consistent way so we can do math on them
• Math might be statistical modeling, ML, plotting, or something else
• Keep dimensionality low
• Keep distance fair (the encoding should not make some categories look closer together than others)

Types of encoding
• Ordinal (integer)
  • one column, each category gets an integer
  • low dimensionality, poor distance preservation
• One-hot (dummy)
  • one column per category, boolean for presence of category
  • high dimensionality, good distance preservation
• Hashing
  • tunable number of columns, signed integer per column
  • whatever dimensionality you want, questionable distance preservation
• Binary, Helmert coding, polynomial coding, sum coding, backward difference coding, etc.
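
The dimensionality and distance trade-off between ordinal and one-hot encoding can be seen in a small sketch. This is plain Python written for illustration, not the category_encoders implementation; the helper names (`ordinal_encode`, `one_hot_encode`, `pairwise_distances`) and the toy `colors` data are assumptions made up for this example.

```python
from itertools import combinations
from math import dist

# Toy data, invented for this illustration.
colors = ["red", "green", "blue", "green", "red"]


def ordinal_encode(values):
    """One column: each category gets an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [[mapping[value]] for value in values], mapping


def one_hot_encode(values):
    """One column per category: 1 if the row has that category, else 0."""
    categories = list(dict.fromkeys(values))  # unique values, first-seen order
    rows = [[int(value == c) for c in categories] for value in values]
    return rows, categories


def pairwise_distances(vector_by_category):
    """Euclidean distance between the encoded vectors of every pair of categories."""
    return {
        (a, b): dist(vector_by_category[a], vector_by_category[b])
        for a, b in combinations(vector_by_category, 2)
    }


ordinal_rows, mapping = ordinal_encode(colors)
one_hot_rows, categories = one_hot_encode(colors)

# Ordinal: red and blue end up twice as far apart as red and green.
ordinal_distances = pairwise_distances({c: [code] for c, code in mapping.items()})

# One-hot: every pair of categories is the same distance apart.
one_hot_distances = pairwise_distances(
    {c: [int(c == k) for k in categories] for c in categories}
)

print(ordinal_distances)
print(one_hot_distances)
```

With three categories the ordinal encoding uses one column but imposes an arbitrary ordering, while one-hot uses three columns and keeps all categories equidistant.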

(overly) simple benchmark

Rules of Thumb
• If there are only 2 categories per variable, just use ordinal
• If there is only one categorical variable, use one-hot or binary
• If there are multiple variables with multiple categories, try hashing with different numbers of columns
• Try different encoding techniques for different problems
• Use category_encoders and let me know how it works for you.
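
To make "try hashing with different numbers of columns" concrete, here is a minimal sketch of a signed hashing encoder for a single categorical variable. It is an assumption-laden illustration of the general hashing trick, not the category_encoders code: the function name `hashing_encode`, the choice of MD5, and the way the sign bit is picked are all made up for this example.

```python
import hashlib


def hashing_encode(values, n_columns=4):
    """Hash each category into one of n_columns, storing a signed 1 there.

    Hypothetical helper for illustration; a real encoder may hash differently.
    """
    rows = []
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        # Use the first byte for the sign and the remaining bytes for the column,
        # so the two choices come from different parts of the hash.
        sign = 1 if (digest[0] & 1) == 0 else -1
        column = int.from_bytes(digest[1:], "big") % n_columns
        row = [0] * n_columns
        row[column] = sign
        rows.append(row)
    return rows


# Toy data, invented for this illustration.
states = ["GA", "NY", "CA", "GA", "TX"]

# The output width is whatever you ask for, regardless of how many categories exist.
for n_columns in (2, 4, 8):
    print(n_columns, hashing_encode(states, n_columns))
```

With few columns, distinct categories can land in the same column, which is one reason distance preservation is questionable; widening the output makes such collisions less likely at the cost of dimensionality.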

https://github.com/wdm0006/categorical_encoding
