Slide 1

Slide 1 text

Oracle Machine Learning Office Hours
Machine Learning Concepts
Encoding of Categorical Attributes: OneHot vs Mean vs WoE and when to use them
with Jie Liu, supported by Marcos Arancibia, Sherry LaMonica & Mark Hornick
Oracle Machine Learning Product Management
oracle.com/machine-learning
Move the Algorithms; Not the Data!
Copyright © 2021, Oracle and/or its affiliates.

Slide 2

Slide 2 text

Outline
• Categorical Variable Encoding
• Review of popular Categorical Encoding techniques
• Introduction to Weight of Evidence and Information Value
• Implementing WoE and IV using OML4Py

Slide 3

Slide 3 text

Categorical Variable Encoding
Required by most machine learning algorithms
• Categorical variables
  • Variables that take one of a limited set of categories (levels)
  • Often unordered, in contrast to ordered levels such as 'High, Medium, Low'
• Categorical variables need to be encoded as numerical variables
  • Most ML algorithms need numerical input
  • Some algorithms, such as random forest, limit the number of levels a categorical variable can have
• Example: Marital Status, 4 levels
  • Married
  • Single
  • Divorced
  • Widowed

Slide 4

Slide 4 text

Popular techniques for encoding categorical variables
One-hot encoding / Explosion
• Convert each level to a column: assign 1 if the current row belongs to that level, otherwise 0
• No information loss
• Handled automatically by OML in-database algorithms
• Drawback
  • Generates high-dimensional features, especially for categorical variables with a large number of levels (high cardinality)
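
The one-hot rule above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `one_hot_encode` and the sample rows are assumptions made for the example, and it is not how the OML in-database algorithms implement the step.

```python
def one_hot_encode(values):
    """Return (levels, rows): one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))  # column order: alphabetical by level name
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical sample of the Marital Status feature
marital_status = ["Married", "Single", "Divorced", "Widowed", "Single"]
levels, encoded = one_hot_encode(marital_status)

for status, row in zip(marital_status, encoded):
    print(status, row)
```

Four levels produce four columns, and each row has exactly one 1. A feature with thousands of levels would produce thousands of columns, which is the high-cardinality drawback noted above.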

Slide 5

Slide 5 text

Popular techniques for encoding categorical variables
Mean encoding
Target Y: BUY_INSURANCE
Feature X: MARITAL_STATUS
• Definition: P(Y = 1 | X_i = x_j)
  • Example: P(Y = 1 | X_i = married) = (# of married customers who bought insurance) / (# of married customers)
• Interpretation
  • Among customers who are married, what proportion bought insurance?
• Directly encodes each level as a numerical value
• Keeps the same feature dimension, so the computation cost is low
• May lose information: different levels may receive nearly identical values
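
A minimal sketch of mean encoding, assuming a 0/1 target and a small made-up sample; the function name `mean_encode` and the data are hypothetical and serve only to make the definition P(Y = 1 | X_i = x_j) concrete.

```python
from collections import defaultdict


def mean_encode(levels, targets):
    """Map each level to the share of its rows whose target equals 1."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for level, y in zip(levels, targets):
        totals[level] += 1
        positives[level] += y
    return {level: positives[level] / totals[level] for level in totals}


# Hypothetical rows: MARITAL_STATUS and BUY_INSURANCE (1 = bought)
marital_status = ["Married"] * 4 + ["Single"] * 4
buy_insurance = [1, 1, 1, 0, 1, 0, 0, 0]

encoding = mean_encode(marital_status, buy_insurance)
print(encoding)
```

Each level is replaced by one number, so the feature stays a single column regardless of how many levels it has.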

Slide 6

Slide 6 text

Mean Encoding
Target distribution impact on mean encoding
• The range of the mean-encoded values is affected by the target distribution
• If the positive case BUY_INSURANCE = 1 is relatively rare, then every encoded value is low
[Figure panels: Original target class distribution; Reduced positive cases to 10%]
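
The effect can be reproduced with made-up counts. In the sketch below the per-level counts are hypothetical, and the downsampling rule (keep one in every five positive rows) is an assumption chosen so that positives end up at 10% of the data.

```python
def mean_encoding(counts):
    """counts maps level -> (positives, negatives); returns level -> P(Y=1 | level)."""
    return {level: pos / (pos + neg) for level, (pos, neg) in counts.items()}


def downsample_positives(counts, keep_one_in):
    """Keep one in every `keep_one_in` positive rows; negatives are untouched."""
    return {level: (pos // keep_one_in, neg) for level, (pos, neg) in counts.items()}


def positive_rate(counts):
    """Overall share of rows with target 1."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    return total_pos / (total_pos + total_neg)


# Hypothetical (positives, negatives) per MARITAL_STATUS level
original = {"Married": (80, 60), "Single": (20, 120)}
rare = downsample_positives(original, keep_one_in=5)

encoded_original = mean_encoding(original)
encoded_rare = mean_encoding(rare)

print(positive_rate(original), encoded_original)
print(positive_rate(rare), encoded_rare)
```

Every encoded value drops once the positive class becomes rare, even though the relationship between marital status and the target has not changed.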

Slide 7

Slide 7 text

Weight of Evidence
Definition and Benefit
• Definition
  • Weight of Evidence: WoE(x_j) = log( P(X_i = x_j | Y = 1) / P(X_i = x_j | Y = 0) )
  • Mean Encoding: P(Y = 1 | X_i = x_j)
• Interpretation
  1. Among people who bought insurance, how likely are they to be married?
  2. Among people who did not buy insurance, how likely are they to be married?
  3. Compute the ratio of the two and take the log
• Benefit
  • The log ratio gives better contrast between the positive and negative target classes in the WoE values
  • A positive value means the feature level leans toward the positive case, while a negative value means it leans toward the negative case
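
The three interpretation steps translate directly into code. This is a minimal sketch with hypothetical counts; it assumes the natural log and that every level has at least one positive and one negative row, since a zero count would make the ratio undefined. The scalable OML4Py version is in the blog referenced later in the deck.

```python
import math


def weight_of_evidence(counts):
    """counts maps level -> (positives, negatives); returns level -> WoE."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    woe = {}
    for level, (pos, neg) in counts.items():
        share_of_buyers = pos / total_pos        # step 1: P(X = level | Y = 1)
        share_of_non_buyers = neg / total_neg    # step 2: P(X = level | Y = 0)
        woe[level] = math.log(share_of_buyers / share_of_non_buyers)  # step 3
    return woe


# Hypothetical (bought, did not buy) counts per MARITAL_STATUS level
counts = {"Married": (80, 60), "Single": (20, 120)}
woe = weight_of_evidence(counts)
print(woe)
```

With these counts "Married" gets a positive WoE and "Single" a negative one, matching the sign interpretation above.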

Slide 8

Slide 8 text

Weight of Evidence
Dive into formula
• What if we also compute the log ratio of mean encoding?
• The log ratio of mean encoding can be written as the sum of
  • the log ratio of the target distribution
  • the Weight of Evidence
• log( P(Y = 1 | X_i = x_j) / P(Y = 0 | X_i = x_j) ) = log( P(Y = 1) / P(Y = 0) ) + log( P(X_i = x_j | Y = 1) / P(X_i = x_j | Y = 0) )
  (log ratio of mean encoding = log ratio of target distribution + Weight of Evidence)
• Weight of Evidence is the part with the target distribution removed
• Weight of Evidence is insensitive to class imbalance
[Figure panels: WoE values with only 10% positive cases; WoE values of the original data]
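
The decomposition follows from Bayes' rule and can be checked numerically. The sketch below uses hypothetical counts, plus a second table in which the positives are cut to one fifth (an assumption made so that positives form 10% of the data); the helper name `log_ratio_terms` is illustrative.

```python
import math


def log_ratio_terms(counts, level):
    """Return (log ratio of mean encoding, log ratio of target distribution, WoE)."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    pos, neg = counts[level]
    n = pos + neg
    mean_term = math.log((pos / n) / (neg / n))          # log P(Y=1|x) / P(Y=0|x)
    target_term = math.log(total_pos / total_neg)        # log P(Y=1) / P(Y=0)
    woe_term = math.log((pos / total_pos) / (neg / total_neg))
    return mean_term, target_term, woe_term


# Hypothetical (positives, negatives) per level, before and after downsampling positives
original = {"Married": (80, 60), "Single": (20, 120)}
rare = {"Married": (16, 60), "Single": (4, 120)}

for level in original:
    mean_o, target_o, woe_o = log_ratio_terms(original, level)
    mean_r, target_r, woe_r = log_ratio_terms(rare, level)
    assert math.isclose(mean_o, target_o + woe_o)  # the sum holds on the original data
    assert math.isclose(mean_r, target_r + woe_r)  # and on the imbalanced data
    assert math.isclose(woe_o, woe_r)              # WoE is unchanged by the imbalance
    print(level, round(woe_o, 4), round(target_o, 4), round(target_r, 4))
```

Only the target-distribution term moves when positives become rare, which is why the WoE values stay the same in both cases.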

Slide 9

Slide 9 text

Weight of Evidence
Limitation
• Only works for binary classification
• For multi-class classification, the options are
  • One vs other, in order to use WoE
  • One-hot encoding
  • Mean encoding
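
One way to read the "one vs other" option is to treat each target class in turn as the positive case and all remaining classes as the negative case, which yields one WoE column per class. The sketch below follows that reading; the product classes and rows are hypothetical, and it assumes every level occurs both with and without each class so that no ratio is zero or undefined.

```python
import math
from collections import Counter


def one_vs_other_woe(levels, targets):
    """Return {target_class: {level: WoE}}, treating each class as positive in turn."""
    result = {}
    distinct_levels = sorted(set(levels))
    for positive_class in sorted(set(targets)):
        pos = Counter(l for l, t in zip(levels, targets) if t == positive_class)
        neg = Counter(l for l, t in zip(levels, targets) if t != positive_class)
        total_pos, total_neg = sum(pos.values()), sum(neg.values())
        result[positive_class] = {
            level: math.log((pos[level] / total_pos) / (neg[level] / total_neg))
            for level in distinct_levels
        }
    return result


# Hypothetical rows: (MARITAL_STATUS, product bought)
rows = (
    [("Married", "auto")] * 2 + [("Married", "home")] + [("Married", "life")]
    + [("Single", "auto")] + [("Single", "home")] * 2 + [("Single", "life")]
)
levels = [level for level, _ in rows]
targets = [target for _, target in rows]

woe_by_class = one_vs_other_woe(levels, targets)
for target_class, values in woe_by_class.items():
    print(target_class, values)
```

The cost of this approach is one encoded column per target class instead of a single column.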

Slide 10

Slide 10 text

Information Value
Byproduct of Weight of Evidence
• Definition: IV = sum over levels x_j of ( P(X_i = x_j | Y = 1) - P(X_i = x_j | Y = 0) ) × WoE(x_j)
• Measure of feature importance
• Rule of thumb from credit scoring analysts

Value            Importance level
Less than 0.02   generally non-predictive
0.02 to 0.1      weak
0.1 to 0.3       medium
0.3+             strong
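
A minimal sketch of Information Value, assuming the commonly used definition: for each level, the difference between its share among positives and its share among negatives, multiplied by that level's WoE, summed over all levels. The counts are hypothetical, and how the table's boundary values (exactly 0.02, 0.1, 0.3) are assigned is an assumption of this example.

```python
import math


def information_value(counts):
    """counts maps level -> (positives, negatives); returns the feature's IV."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    iv = 0.0
    for pos, neg in counts.values():
        share_pos = pos / total_pos
        share_neg = neg / total_neg
        iv += (share_pos - share_neg) * math.log(share_pos / share_neg)
    return iv


def importance_level(iv):
    """Apply the rule-of-thumb table; boundary values go to the higher band."""
    if iv < 0.02:
        return "generally non-predictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"


# Hypothetical (positives, negatives) per MARITAL_STATUS level
counts = {"Married": (80, 60), "Single": (20, 120)}
iv = information_value(counts)
print(round(iv, 4), importance_level(iv))
```

Each term of the sum is non-negative, because the share difference and the log ratio always have the same sign, so IV grows as the positive and negative distributions over the levels move apart.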

Slide 11

Slide 11 text

Weight of Evidence
Implementation in OML4Py
• Use the OML4Py transparency layer to compute WoE in a scalable way
• Code can be found in the blog:
  • https://blogs.oracle.com/machinelearning/weight-of-evidence-woe-implementation-using-oml4py
• Walk through the notebook

Slide 12

Slide 12 text

Where to go from here?

Slide 13

Slide 13 text

Helpful Links

ORACLE MACHINE LEARNING ON O.COM
• https://www.oracle.com/machine-learning

OML TUTORIALS
• OML LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?p180_id=560
• OML4Py LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?wid=786
• Interactive tour: https://docs.oracle.com/en/cloud/paas/autonomous-database/oml-tour

OML OFFICE HOURS
• https://asktom.oracle.com/pls/apex/asktom.search?office=6801#sessionss

ORACLE ANALYTICS CLOUD
• https://www.oracle.com/solutions/business-analytics/data-visualization/examples.html

OML4PY
• OML4Py (2m video)
• OML4Py Introduction (17m video)
• OML4Py Technical Brief
• OML4Py User’s Guide
• Blog: Introducing OML4Py
• GitHub Repository with Python notebooks

ORACLE AUTOML UI
• Oracle Machine Learning AutoML UI (2m video)
• Oracle Machine Learning Demonstration (6m video)
• OML AutoML UI Technical Brief
• Blog: Introducing Oracle Machine Learning AutoML UI

OML SERVICES
• Oracle Machine Learning Services (2m video)
• OML Services Technical Brief
• Oracle Machine Learning Services Documentation
• Blog: Introducing Oracle Machine Learning Services
• GitHub Repository with OML Services examples

Slide 14

Slide 14 text

On our GitHub, you can find:
github.com/oracle/oracle-db-examples/tree/master/machine-learning
• Example notebooks in OML4SQL and OML4Py
• SQL code examples for DB 18c, 19c and 21c
• Labs folder with OML4Py hands-on labs (HOL)
• OML Services demos, including Cognitive Text demos, as Postman collections

Slide 15

Slide 15 text

Q & A

Slide 16

Slide 16 text

Thank you!
jie.jl.liu@oracle.com
