
Encoding categorical variables with categorical encoders

Slides from a lightning talk at the July PyData Atlanta meetup.

Will McGinnis

July 20, 2016

Transcript

  1. Who am I
     • Will McGinnis
     • Write code at Predikto (we’re hiring)
     • www.predikto.com
     • twitter.com/willmcginnis
     • github.com/wdm0006
  2. What is a categorical variable?
     • Usually letters or words
     • Sometimes integers
     • “a variable that can take on one of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or ‘category.’”
     • e.g. color (red, green, blue), state, sex
  3. Goal
     • Convert these categorical variables into numbers in a consistent way so we can do math on them
     • Math might be statistical modeling, ML, plotting, or something else
     • Keep dimensionality low
     • Keep distance fair
  4. Types of encoding
     • Ordinal (integer)
       • one column, each category gets an integer
       • low dimensionality, poor distance preservation
     • One-hot (dummy)
       • one column per category, boolean for presence of category
       • high dimensionality, good distance preservation
     • Hashing
       • tunable number of columns, signed integer per column
       • whatever dimensionality you want, questionable distance preservation
     • Binary, Helmert coding, polynomial coding, sum coding, backward difference coding, etc.
  5. Rules of Thumb
     • If there are only 2 categories per variable, just use ordinal
     • If there is only one categorical variable, use one-hot or binary
     • If there are multiple variables with multiple categories, try hashing with different numbers of columns
     • Try different encoding techniques for different problems
     • Use category_encoders and let me know how it works for you.
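
The three main encodings from slide 4 can be sketched in plain Python to make the dimensionality and distance trade-offs concrete. This is a minimal illustration under stated assumptions, not the category_encoders implementation: the function names (`ordinal_encode`, `one_hot_encode`, `hash_encode`, `squared_distance`), the sorted category order, and the use of MD5 bytes to pick a column and a sign are all choices made for this example.

```python
import hashlib

# Example data, following the color example from slide 2.
colors = ["red", "green", "blue", "green"]


def ordinal_encode(values):
    """One column: each category gets an integer (assigned in sorted order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [[mapping[v]] for v in values]


def one_hot_encode(values):
    """One column per category: 1 where the row has that category, else 0."""
    cats = sorted(set(values))
    return [[int(v == c) for c in cats] for v in values]


def hash_encode(values, n_columns=4):
    """Fixed number of columns: each value puts +1 or -1 in one hashed column.

    Assumption for this sketch: the first MD5 byte picks the column and the
    second picks the sign. Different categories may collide.
    """
    rows = []
    for v in values:
        digest = hashlib.md5(v.encode("utf-8")).digest()
        row = [0] * n_columns
        row[digest[0] % n_columns] = 1 if digest[1] % 2 == 0 else -1
        rows.append(row)
    return rows


def squared_distance(a, b):
    """Squared Euclidean distance between two encoded rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


ordinal = ordinal_encode(colors)    # 1 column per row
one_hot = one_hot_encode(colors)    # 3 columns per row (one per category)
hashed = hash_encode(colors, 4)     # 4 columns per row, whatever the category count
```

With this data, the ordinal encoding puts red and blue at squared distance 4 but red and green at squared distance 1, even though no color is "closer" to another. The one-hot encoding gives every pair of distinct colors the same squared distance of 2, at the cost of one column per category. The hashed rows always have `n_columns` entries, but two categories can land in the same column, which is why the slide calls its distance preservation questionable.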