Encoding categorical variables with categorical encoders

Slides from a lightning talk at the July PyData Atlanta meetup.

Will McGinnis

July 20, 2016

Transcript

  1. Who am I • Will McGinnis • Write code at Predikto (we’re hiring) • www.predikto.com • twitter.com/willmcginnis • github.com/wdm0006
  2. What is a categorical variable? • Usually letters or words • Sometimes integers • “a variable that can take on one of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or ‘category.’” • e.g. color (red, green, blue), state, sex
  3. Goal • Convert these categorical variables into numbers in a consistent way so we can do math on them • Math might be statistical modeling, ML, plotting, or something else • Keep dimensionality low • Keep distance fair
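
One way to read "keep distance fair" is that no pair of categories should end up artificially closer together than any other pair. The sketch below is an illustration under that assumption, using plain Python rather than any particular library; the `colors` list and helper names are made up for the example. It compares Euclidean distances between categories under an ordinal encoding and a one-hot encoding.

```python
import math

colors = ["red", "green", "blue"]  # illustrative categories

# Ordinal: one column, each category mapped to its position in the list.
ordinal = {c: [float(i)] for i, c in enumerate(colors)}

# One-hot: one column per category, 1.0 where the category is present.
one_hot = {
    c: [1.0 if i == j else 0.0 for j in range(len(colors))]
    for i, c in enumerate(colors)
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(encoding):
    """Distance between every unordered pair of categories."""
    return {
        (a, b): euclidean(encoding[a], encoding[b])
        for i, a in enumerate(colors)
        for b in colors[i + 1:]
    }


ordinal_d = pairwise(ordinal)
one_hot_d = pairwise(one_hot)

for pair in ordinal_d:
    print(pair, "ordinal:", ordinal_d[pair], "one-hot:", round(one_hot_d[pair], 3))
```

Under the ordinal encoding, red and blue come out twice as far apart as red and green, purely because of the order in which integers were assigned. Under one-hot, every pair is the same distance apart, at the cost of one column per category.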
  4. Types of encoding • Ordinal (integer): one column, each category gets an integer; low dimensionality, poor distance preservation • One-hot (dummy): one column per category, boolean for presence of category; high dimensionality, good distance preservation • Hashing: tunable number of columns, signed integer per column; whatever dimensionality you want, questionable distance preservation • Others: binary, Helmert coding, polynomial coding, sum coding, backward difference coding, etc.
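
The three main encodings on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the category_encoders implementation: the function names, the choice of MD5 as the hash, and the way the sign is derived from the digest are all assumptions made for the example.

```python
import hashlib


def ordinal_encode(values):
    """One column: each category gets an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [[mapping[v]] for v in values]


def one_hot_encode(values):
    """One column per category: 1 where the row has that category, else 0."""
    categories = list(dict.fromkeys(values))  # unique, first-appearance order
    return [[1 if v == c else 0 for c in categories] for v in values]


def hashing_encode(values, n_components=4):
    """Fixed number of columns: a hash picks the column and the sign.

    Assumption: MD5 is used only to get a deterministic hash; distinct
    categories may collide in the same column.
    """
    rows = []
    for v in values:
        digest = hashlib.md5(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest, "big")
        column = h % n_components
        sign = 1 if (h >> 64) & 1 == 0 else -1  # sign taken from a separate bit
        row = [0] * n_components
        row[column] += sign
        rows.append(row)
    return rows


states = ["GA", "NY", "GA", "CA"]  # illustrative data
print(ordinal_encode(states))
print(one_hot_encode(states))
print(hashing_encode(states, n_components=4))
```

The output width shows the trade-off from the slide: ordinal always produces one column, one-hot produces as many columns as there are categories, and hashing produces exactly `n_components` columns however many categories appear.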
  5. Rules of Thumb • If there are only 2 categories per variable, just use ordinal • If there is only one categorical variable, use one-hot or binary • If there are multiple variables with multiple categories, try hashing with different numbers of columns • Try different encoding techniques for different problems • Use category_encoders and let me know how it works for you.
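
The rules of thumb above can be written down as a small helper. This is a hypothetical function, not part of category_encoders, and the order in which the rules are checked when more than one applies is an assumption. It only returns a suggested starting point; the slide's advice to try different techniques for different problems still stands.

```python
def suggest_encoding(columns):
    """Suggest a starting encoding for each categorical variable.

    columns: dict mapping variable name -> list of observed values.
    Returns a dict mapping variable name -> suggestion string.
    """
    suggestions = {}
    for name, values in columns.items():
        n_categories = len(set(values))
        if n_categories <= 2:
            # Only 2 categories in this variable: ordinal is enough.
            suggestions[name] = "ordinal"
        elif len(columns) == 1:
            # A single categorical variable: one-hot or binary.
            suggestions[name] = "one-hot or binary"
        else:
            # Several variables with several categories each.
            suggestions[name] = "hashing (try several column counts)"
    return suggestions


example = {
    "sex": ["m", "f", "m", "f"],
    "color": ["red", "green", "blue", "red"],
    "state": ["GA", "NY", "CA", "GA"],
}
print(suggest_encoding(example))
```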