Slide 1

Slide 1 text

ML Concepts - Using Cross-Validation with OML in-Database and with Embedded Python Execution
OML Office Hours
Jie Liu, Data Scientist, Oracle Machine Learning
Supported by Marcos Arancibia, Sherry LaMonica & Mark Hornick, Product Management, Oracle Machine Learning
Move the Algorithms; Not the Data!
Copyright © 2021, Oracle and/or its affiliates.
This Session will be Recorded

Slide 2

Slide 2 text

Topics for today
• Italian language now available in OML Services
• Cross Validation in OML4Py
• Q&A

Slide 3

Slide 3 text

Outline for Cross-Validation
• Overview of model validation
• Cross Validation in-database solution
• Cross Validation using Embedded Python Execution
• Conclusion

Slide 4

Slide 4 text

Motivation
Provide better understanding of the performance of a machine learning algorithm
• Need to evaluate the performance of the model before going online
• Need to understand if a new method works, or works better
  • ML algorithm selection
  • Feature engineering
  • Parameter tuning
If a data scientist reports recent work such as:
• Use random search method to select parameters for Random Forest. Precision increased from 0.76 to 0.78
• Tried a new boosted tree library. AUC increased from 0.71 to 0.75
• Created 10 features by joining 3 tables. Recall increased from 0.73 to 0.82
Are those improvements really significant or reliable?

Slide 5

Slide 5 text

Techniques for validation
Train/Test split validation
• Need to set aside part of the dataset for validation
  • We cannot train the model on a dataset and validate it on the same dataset. If we do, we cannot guarantee the model performance on new data
• Split the dataset into training and test sets, usually 0.8/0.2
  • Train the model on the training set
  • Test the model on the test set
• Pay attention to the label distribution
  • Ratio of positive labels should be roughly the same for both train and test split
• Unable to know the variation of the performance
  • Only one pair of train/test datasets, so only one set of validation results is obtained
  • The model may behave well on this particular pair of train/test sets but not on others
[Figure: dataset divided into Training Data and Testing Data]
How to get a sense of the variation of model performance using the dataset available?
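The point about label distribution can be illustrated with a stratified split. The following is a minimal sketch in plain Python, not the OML4Py API; the function name `stratified_split`, the 30/70 label mix, and the seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each label separately
    so both splits keep roughly the same label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_label.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 30 positives and 70 negatives.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels, test_ratio=0.2, seed=42)

def positive_rate(idx):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)

print(len(train_idx), len(test_idx))
print(positive_rate(train_idx), positive_rate(test_idx))
```

Because each label is sampled separately, the positive ratio is the same in both splits. The split still yields only one score, which is the limitation the slide raises.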

Slide 6

Slide 6 text

Techniques for validation
Leave one out validation
• Leave 1 sample as the test point
• Train the model on n-1 data points
• Pro: Low bias on Prediction Error estimation
• Con: High cost as n can be huge
[Figure: n splits; in each split one sample is used for test and the remaining samples are training data]
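The leave-one-out procedure can be sketched as follows. This is an illustrative example in plain Python with a toy 1-nearest-neighbor classifier; the data, the helper names `loo_splits` and `predict_1nn`, and the choice of model are assumptions, not part of the original session.

```python
def loo_splits(n):
    """Yield (train_indices, test_indices) with exactly one test sample per split."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def predict_1nn(train_x, train_y, x):
    """Predict the label of the closest training point (1-nearest neighbor)."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]

# Hypothetical one-dimensional toy data with two well-separated classes.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

n_models = 0
correct = 0
for train, test in loo_splits(len(xs)):
    i = test[0]
    pred = predict_1nn([xs[j] for j in train], [ys[j] for j in train], xs[i])
    correct += int(pred == ys[i])
    n_models += 1  # one model is fit per sample

loo_accuracy = correct / len(xs)
print(n_models, loo_accuracy)
```

The loop fits one model per sample, which is why the cost grows with n.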

Slide 7

Slide 7 text

Techniques for validation
K-Fold Cross Validation
• Divide the dataset into K parts
• Use K-1 parts as training set
• Use 1 part as test set
• K = 10 is usually recommended
• Pro: reduces the computation cost compared to leave-one-out validation
• Con: slight overestimation of the prediction error due to the reduced training sizes
• Caution: data with timestamps (random folds can train on later records and test on earlier ones)
[Figure: K = 5 fold; in each split a different part is used as Test Data and the remaining parts as Training Data]
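A minimal sketch of K-fold cross-validation in plain Python follows. It shows how K scores give both an average and a spread of model performance. The synthetic data, the nearest-centroid model, and the helper names are assumptions for illustration only.

```python
import random
import statistics

def kfold_indices(n, k, seed=0):
    """Shuffle row indices, cut them into k folds, and yield (train, test) pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def fit_centroids(xs, ys):
    """'Train' a nearest-centroid model: the mean of x for each label."""
    means = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict(means, x):
    """Predict the label whose centroid is closest to x."""
    return min(means, key=lambda label: abs(means[label] - x))

# Hypothetical synthetic data: two overlapping one-dimensional classes.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

k = 5
scores = []
for train, test in kfold_indices(len(xs), k):
    means = fit_centroids([xs[j] for j in train], [ys[j] for j in train])
    hits = sum(predict(means, xs[j]) == ys[j] for j in test)
    scores.append(hits / len(test))

print("mean accuracy:", statistics.mean(scores))
print("std of accuracy:", statistics.stdev(scores))
```

The standard deviation across folds gives a rough yardstick for judging whether a reported gain, such as precision moving from 0.76 to 0.78, is larger than the fold-to-fold variation. The random shuffle used here would not be appropriate for time-ordered data.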

Slide 8

Slide 8 text

K-Fold Cross Validation in OML4Py
Implementation in OML4Py
• OML4Py API to split the data into K-folds
• Parallelize the K-Fold CV using Embedded Python Execution
Go through OML notebook
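The notebook itself is not reproduced in these slides. As a local stand-in, the sketch below shows the general shape of parallel K-fold evaluation: each fold is an independent task identified by its index, so the K tasks can be dispatched concurrently and their scores collected afterwards. It uses only the Python standard library and does not call OML4Py. OML4Py documents a `KFold` method on `oml.DataFrame` and an `oml.index_apply` function for Embedded Python Execution; consult the OML4Py User's Guide for their exact usage. All data and helper names below are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def make_folds(n, k, seed=0):
    """Shuffle row indices and cut them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def evaluate_fold(fold_id, folds, xs, ys):
    """Train on all folds except fold_id, test on fold_id, return (fold_id, accuracy)."""
    test = folds[fold_id]
    train = [j for f_id, f in enumerate(folds) if f_id != fold_id for j in f]
    pos = [xs[j] for j in train if ys[j] == 1]
    neg = [xs[j] for j in train if ys[j] == 0]
    # Toy model: threshold at the midpoint between the two class means
    # (assumes the positive class has the larger mean, as in the data below).
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    hits = sum((xs[j] > threshold) == (ys[j] == 1) for j in test)
    return fold_id, hits / len(test)

# Hypothetical synthetic data: two overlapping one-dimensional classes.
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(60)] + [rng.gauss(2, 1) for _ in range(60)]
ys = [0] * 60 + [1] * 60

k = 4
folds = make_folds(len(xs), k)

# Each fold index is an independent task; run the k tasks concurrently.
with ThreadPoolExecutor(max_workers=k) as pool:
    results = list(pool.map(lambda i: evaluate_fold(i, folds, xs, ys), range(k)))

scores = dict(results)
for fold_id in sorted(scores):
    print(fold_id, scores[fold_id])
```

Threads in CPython do not speed up CPU-bound training; the sketch only shows the task structure. In an in-database setting the same per-index function would be run by separate Python engines instead.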

Slide 9

Slide 9 text

Helpful Links
ORACLE MACHINE LEARNING ON O.COM
https://www.oracle.com/machine-learning
OML TUTORIALS
OML LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?p180_id=560
OML4Py LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?wid=786
Interactive tour: https://docs.oracle.com/en/cloud/paas/autonomous-database/oml-tour
OML OFFICE HOURS
https://asktom.oracle.com/pls/apex/asktom.search?office=6801#sessionss
ORACLE ANALYTICS CLOUD
https://www.oracle.com/solutions/business-analytics/data-visualization/examples.html
OML4PY
• OML4Py (2m video)
• OML4Py Introduction (17m video)
• OML4Py Technical Brief
• OML4Py User’s Guide
• Blog: Introducing OML4Py
• GitHub Repository with Python notebooks
ORACLE AUTOML UI
• Oracle Machine Learning AutoML UI (2m video)
• Oracle Machine Learning Demonstration (6m video)
• OML AutoML UI Technical Brief
• Blog: Introducing Oracle Machine Learning AutoML UI
OML SERVICES
• Oracle Machine Learning Services (2m video)
• OML Services Technical Brief
• Oracle Machine Learning Services Documentation
• Blog: Introducing Oracle Machine Learning Services
• GitHub Repository with OML Services examples

Slide 10

Slide 10 text

On our GitHub, you can find:
github.com/oracle/oracle-db-examples/tree/master/machine-learning
• Example Notebooks in OML4SQL and OML4Python
• SQL code examples for DB 18c, 19c and 21c
• Labs folder with OML4Py HOL Labs
• OML Services demos including Cognitive Text Demos, in Postman collections

Slide 11

Slide 11 text

Q & A

Slide 12

Slide 12 text

Thank you
jie.jl.liu@oracle.com