OML ML Concepts - Encoding of Categorical Attributes: OneHot vs Mean vs WoE and when to use them

Join us for this weekly Office Hours session for Oracle Machine Learning on Autonomous Database, where Jie Liu, Data Scientist for Oracle Machine Learning, will cover different methods of encoding categorical attributes, such as One-Hot Encoding, Mean Encoding and Weight-of-Evidence (WoE), and review when each is best used. He will also present a demo running in a notebook with OML4Py.

The Oracle Machine Learning product family helps data scientists, analysts, developers, and IT achieve data science project goals faster while taking full advantage of the Oracle platform.

Oracle Machine Learning Notebooks offers an easy-to-use, interactive, multi-user, collaborative interface based on Apache Zeppelin notebook technology, and supports SQL, PL/SQL, Python and Markdown interpreters. It is available on all Autonomous Database versions and tiers, including the always-free editions.

OML includes AutoML, which provides automated machine learning algorithm features for algorithm selection, feature selection and model tuning, in addition to a specialized AutoML UI exclusive to the Autonomous Database.

OML Services is also included in Autonomous Database, where you can deploy and manage native in-database OML models as well as ONNX ML models (for classification and regression) built using third-party engines, and can also invoke cognitive text analytics.

Marcos Arancibia

June 22, 2021

Transcript

  1. Oracle Machine Learning Office Hours: Machine Learning Concepts

    Encoding of Categorical Attributes: OneHot vs Mean vs WoE and when to use them, with Jie Liu, supported by Marcos Arancibia, Sherry LaMonica & Mark Hornick, Oracle Machine Learning Product Management. oracle.com/machine-learning. Move the Algorithms; Not the Data! Copyright © 2021, Oracle and/or its affiliates.
  2. Outline

    • Categorical Variable Encoding
    • Review of popular Categorical Encoding techniques
    • Introduction to Weight of Evidence and Information Value
    • Implementing WOE and IV using OML4Py
  3. Categorical Variable Encoding

    • Categorical variables
      • Variables with one or more categories/levels
      • Often without order (compare to ‘High, Medium, Low’, which is ordered)
    • Categorical variables need to be encoded as numerical variables
      • Most ML algorithms need numerical input, so encoding is required by most of them
      • Some algorithms, such as random forest, limit the number of levels a categorical variable can have
    • Example: Marital Status, 4 levels
      • Married
      • Single
      • Divorced
      • Widowed
  4. Popular techniques for encoding categorical variables: One-hot encoding / Explosion

    • Convert each level to a column; if the current row belongs to that level, assign 1, otherwise assign 0
    • No information loss
    • Automatically handled in OML in-DB algorithms
    • Drawback
      • Generates high-dimensional features, especially for categorical variables with a high number of levels (high cardinality)
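
The one-hot rule on the slide above can be sketched in a few lines. This is a minimal illustration in plain Python, not the OML in-database implementation; the function name `one_hot_encode` and the sample data are hypothetical.

```python
def one_hot_encode(values):
    """Return (levels, rows); each row has a 1 in the column of its level, 0 elsewhere."""
    levels = sorted(set(values))                      # one output column per distinct level
    column = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[column[value]] = 1                        # the row belongs to this level
        rows.append(row)
    return levels, rows


# Hypothetical sample of the slide's Marital Status example (4 levels).
marital_status = ["Married", "Single", "Divorced", "Widowed", "Single"]
levels, encoded = one_hot_encode(marital_status)
```

The number of output columns equals the number of distinct levels, which is why high-cardinality attributes produce very wide feature sets.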
  5. Popular techniques for encoding categorical variables: Mean encoding

    • Definition: P(Y = 1 | X_i = x_j)
    • Interpretation
      • Among people who are married, what proportion bought insurance?
    • Directly encodes each level to a unique numerical value
    • Keeps the same feature dimension: low computation cost
    • May lose information: different levels may have close values
    • Example, with target Y: BUY_INSURANCE and feature X: MARITAL_STATUS
      P(Y = 1 | X_i = married) = (# of customers who are married and bought insurance) / (# of customers who are married)
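
As a rough sketch of the definition above, mean encoding replaces each level with the share of positive targets observed in that level. The code is plain Python rather than OML4Py, and the function name and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def mean_encode(levels, targets):
    """Map each level x to the observed estimate of P(Y = 1 | X = x)."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for x, y in zip(levels, targets):
        totals[x] += 1
        positives[x] += y          # y is 0 or 1
    return {x: positives[x] / totals[x] for x in totals}


# Hypothetical data: MARITAL_STATUS as the feature, BUY_INSURANCE as the target.
marital_status = ["Married", "Married", "Married", "Married",
                  "Single", "Single", "Single", "Single"]
buy_insurance = [1, 1, 1, 0, 1, 0, 0, 0]

encoding = mean_encode(marital_status, buy_insurance)
# Married: 3 of 4 bought insurance; Single: 1 of 4 bought insurance.
```

Each level becomes a single number, so the feature keeps one column regardless of how many levels it has.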
  6. Mean Encoding: Target distribution impact on mean encoding

    • The data range of the mean encoding is affected by the target distribution
    • If the positive cases (BUY_INSURANCE = 1) are relatively rare, then every encoded value is low
    • [Figures: original target class distribution; positive cases reduced to 10%]
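
The effect described above can be reproduced with made-up counts. The sketch below assumes hypothetical (bought, did not buy) counts per level and keeps 10% of the positive cases in every level; none of the numbers come from the talk.

```python
def mean_encoding_from_counts(counts):
    """counts maps level -> (positives, negatives); returns level -> P(Y = 1 | X = level)."""
    return {level: pos / (pos + neg) for level, (pos, neg) in counts.items()}


# Hypothetical counts of (bought insurance, did not buy) per MARITAL_STATUS level.
original = {"Married": (300, 200), "Single": (100, 400)}

# Keep only 10% of the positive cases in every level; negatives are unchanged.
reduced = {level: (pos // 10, neg) for level, (pos, neg) in original.items()}

enc_original = mean_encoding_from_counts(original)   # Married 0.6, Single 0.2
enc_reduced = mean_encoding_from_counts(reduced)     # both values shrink toward 0
```

With rare positives, every encoded value is squeezed into a narrow band near zero, even though the relative ordering of the levels is unchanged.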
  7. Weight of Evidence: Definition and Benefit

    • Definition (interpretation)
      1. Among people who bought insurance, how likely are they to be married?
      2. Among people who did not buy insurance, how likely are they to be married?
      3. Compute the ratio and take the log
    • Benefit
      • The log ratio provides better contrast between the positive and negative target in the WOE values
      • A positive value means the feature level leans toward the positive case, while a negative value means the feature level leans toward the negative case
    • [Figures: Weight of Evidence; Mean Encoding]
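
Following the three interpretation steps above, WoE for a level x can be written as ln(P(X = x | Y = 1) / P(X = x | Y = 0)). The sketch below is plain Python under that reading, not the OML4Py code from the talk; the function name and counts are hypothetical, and it assumes every level has at least one positive and one negative case (otherwise the ratio is undefined).

```python
import math


def weight_of_evidence(counts):
    """counts maps level -> (positives, negatives); returns level -> WoE."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    woe = {}
    for level, (pos, neg) in counts.items():
        p_level_given_pos = pos / total_pos    # step 1: P(X = level | Y = 1)
        p_level_given_neg = neg / total_neg    # step 2: P(X = level | Y = 0)
        woe[level] = math.log(p_level_given_pos / p_level_given_neg)  # step 3
    return woe


# Hypothetical counts of (bought insurance, did not buy) per level.
counts = {"Married": (300, 200), "Single": (100, 400)}
woe = weight_of_evidence(counts)
```

Here the "Married" level gets a positive value and "Single" a negative one, matching the sign interpretation on the slide.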
  8. Weight of Evidence: Dive into formula

    • What if we also compute the log ratio of the mean encoding?
    • The log ratio of the mean encoding can be written as the sum of
      • the log ratio of the target distribution
      • the Weight of Evidence
    • Log ratio of mean encoding = Log ratio of target distribution + Weight of Evidence
    • Weight of Evidence is the part with the target distribution removed
    • Weight of Evidence is insensitive to class imbalance
    • [Figures: WOE values with only 10% positive cases; WOE values of the original data]
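
The decomposition above follows from Bayes' rule: ln(P(Y=1|x) / P(Y=0|x)) = ln(P(Y=1) / P(Y=0)) + ln(P(x|Y=1) / P(x|Y=0)). A small numerical check, using hypothetical counts and assuming the positives are reduced by the same factor in every level, might look like this:

```python
import math


def log_odds_decomposition(counts):
    """counts maps level -> (positives, negatives).

    Returns level -> (log ratio of mean encoding, log ratio of target, WoE).
    """
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    target_log_ratio = math.log(total_pos / total_neg)       # ln(P(Y=1) / P(Y=0))
    result = {}
    for level, (pos, neg) in counts.items():
        # ln(P(Y=1|x) / P(Y=0|x)); the level total cancels in the ratio.
        mean_log_ratio = math.log(pos / neg)
        woe = math.log((pos / total_pos) / (neg / total_neg))
        result[level] = (mean_log_ratio, target_log_ratio, woe)
    return result


# Hypothetical counts; "reduced" keeps 10% of the positives in every level.
original = {"Married": (300, 200), "Single": (100, 400)}
reduced = {level: (pos // 10, neg) for level, (pos, neg) in original.items()}

decomp_original = log_odds_decomposition(original)
decomp_reduced = log_odds_decomposition(reduced)
```

In both data sets the first term equals the sum of the other two, and only the target term changes between them; the WoE values are identical, which is the insensitivity to class imbalance the slide refers to.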
  9. Weight of Evidence: Limitation

    • Only works for binary classification
    • Multi-label classification
      • One vs. other in order to use WoE
      • One-hot encoding
      • Mean encoding
  10. Information Value: Byproduct of Weight of Evidence

    • Definition
    • Measure of feature importance
    • Rule of thumb from credit score analysts

    Value            Importance level
    Less than 0.02   generally non-predictive
    0.02 to 0.1      weak
    0.1 to 0.3       medium
    0.3+             strong
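
The slide's formula did not survive in the transcript. The sketch below assumes the commonly used definition, IV = sum over levels of (P(x | Y=1) - P(x | Y=0)) * WoE(x), and applies the rule-of-thumb table above. The function names and counts are hypothetical, and assigning the boundary values 0.02, 0.1 and 0.3 to the higher band is an assumption, since the table leaves them ambiguous.

```python
import math


def information_value(counts):
    """counts maps level -> (positives, negatives); returns the feature's IV."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    iv = 0.0
    for pos, neg in counts.values():
        p_pos = pos / total_pos                      # P(X = level | Y = 1)
        p_neg = neg / total_neg                      # P(X = level | Y = 0)
        iv += (p_pos - p_neg) * math.log(p_pos / p_neg)
    return iv


def importance_level(iv):
    """Rule-of-thumb bands from the table; boundary handling is an assumption."""
    if iv < 0.02:
        return "generally non-predictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"


# Hypothetical counts of (bought insurance, did not buy) per level.
counts = {"Married": (300, 200), "Single": (100, 400)}
iv = information_value(counts)
level = importance_level(iv)
```

Every term of the sum is non-negative, because the difference and the log ratio always share the same sign, so IV grows as the levels separate the two classes more sharply.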
  11. Weight of Evidence: Implementation in OML4Py

    • Use the OML4Py transparency layer to compute WOE in a scalable way
    • Code can be found in the blog:
      • https://blogs.oracle.com/machinelearning/weight-of-evidence-woe-implementation-using-oml4py
    • Go through notebook
  12. Where to go from here?
  13. Helpful Links

    • ORACLE MACHINE LEARNING ON O.COM: https://www.oracle.com/machine-learning
    • OML TUTORIALS
      • OML LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?p180_id=560
      • OML4Py LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?wid=786
      • Interactive tour: https://docs.oracle.com/en/cloud/paas/autonomous-database/oml-tour
    • OML OFFICE HOURS: https://asktom.oracle.com/pls/apex/asktom.search?office=6801#sessionss
    • ORACLE ANALYTICS CLOUD: https://www.oracle.com/solutions/business-analytics/data-visualization/examples.html
    • OML4PY
      • OML4Py (2m video)
      • OML4Py Introduction (17m video)
      • OML4Py Technical Brief
      • OML4Py User’s Guide
      • Blog: Introducing OML4Py
      • GitHub Repository with Python notebooks
    • ORACLE AUTOML UI
      • Oracle Machine Learning AutoML UI (2m video)
      • Oracle Machine Learning Demonstration (6m video)
      • OML AutoML UI Technical Brief
      • Blog: Introducing Oracle Machine Learning AutoML UI
    • OML SERVICES
      • Oracle Machine Learning Services (2m video)
      • OML Services Technical Brief
      • Oracle Machine Learning Services Documentation
      • Blog: Introducing Oracle Machine Learning Services
      • GitHub Repository with OML Services examples
  14. On our GitHub, you can find: github.com/oracle/oracle-db-examples/tree/master/machine-learning

    • Example notebooks in OML4SQL and OML4Python
    • SQL code examples for DB 18c, 19c and 21c
    • Labs folder with OML4Py HOL Labs
    • OML Services demos, including Cognitive Text demos, in Postman collections
  15. Q & A
  16. Thank you! jie.jl.liu@oracle.com