• Sometimes integers
• “a variable that can take on one of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or ‘category.’”
• e.g., color (red, green, blue), state, sex
consistent way so we can do math on them
• Math might be statistical modeling, ML, plotting, or something else
• Keep dimensionality low
• Keep distance fair
category gets an integer
  • low dimensionality, poor distance preservation
• One-hot (dummy)
  • one column per category, boolean for presence of category
  • high dimensionality, good distance preservation
• Hashing
  • tunable number of columns, signed integer per column
  • whatever dimensionality you want, questionable distance preservation
• Binary, Helmert coding, polynomial coding, sum coding, backward difference coding, etc.
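The dimensionality and distance trade-off between integer-per-category and one-hot encoding can be seen on the color example. This is a minimal hand-rolled sketch using only the standard library, not the implementation of any particular package; the helper names `ordinal_encode`, `one_hot_encode`, and `euclidean` are hypothetical and chosen for illustration.

```python
from itertools import combinations

colors = ["red", "green", "blue"]


def ordinal_encode(categories):
    """Map each category to an integer, in order of first appearance."""
    mapping = {}
    for c in categories:
        mapping.setdefault(c, len(mapping) + 1)
    return mapping


def one_hot_encode(categories):
    """Map each category to a 0/1 vector with one column per category."""
    levels = list(dict.fromkeys(categories))  # unique values, order preserved
    return {c: [int(c == level) for level in levels] for c in levels}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


ordinal = ordinal_encode(colors)   # one column in total
one_hot = one_hot_encode(colors)   # one column per category

for a, b in combinations(colors, 2):
    d_ord = euclidean([ordinal[a]], [ordinal[b]])
    d_hot = euclidean(one_hot[a], one_hot[b])
    print(f"{a}-{b}: ordinal distance {d_ord:.2f}, one-hot distance {d_hot:.2f}")
```

With the integer encoding, red and blue end up twice as far apart as red and green, even though the colors have no natural order. With one-hot, every pair of categories is the same distance apart, at the cost of one column per category.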
per variable, just use ordinal
• If there is only one categorical variable, use one-hot or binary
• If there are multiple variables with multiple categories, try hashing with different numbers of columns
• Try different encoding techniques for different problems
• Use category_encoders and let me know how it works for you.
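To get a feel for "hashing with different numbers of columns", here is a rough sketch of the hashing idea: each category is hashed to one of `n_columns` positions and given a sign of +1 or -1. It is written from scratch with `hashlib` and is not how category_encoders implements its hashing encoder; the function `hash_encode`, the use of MD5, the sign rule, and the sample state names are all assumptions made for illustration.

```python
import hashlib


def hash_encode(value, n_columns):
    """Hash one category into a row of n_columns signed integers."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    h = int.from_bytes(digest, "big")
    row = [0] * n_columns
    index = h % n_columns                           # which column gets the entry
    sign = 1 if (h // n_columns) % 2 == 0 else -1   # +1 or -1, derived from the hash
    row[index] += sign
    return row


# Hypothetical sample categories for a "state" variable.
states = ["Georgia", "Ohio", "Texas", "Utah", "Maine"]

for n_columns in (2, 4, 8):
    encoded = {s: hash_encode(s, n_columns) for s in states}
    distinct = len({tuple(row) for row in encoded.values()})
    print(f"{n_columns} columns: {distinct} distinct rows for {len(states)} categories")
```

Fewer columns means lower dimensionality but more collisions, where different categories share the same row. That is why distance preservation is questionable and why the number of columns is worth tuning per problem.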