ML Concepts - Using Cross-Validation with OML in-Database and with Embedded Python Execution

In this weekly Office Hours session for Oracle Machine Learning on Autonomous Database, Jie Liu, Data Scientist for Oracle Machine Learning, covered cross-validation methods: why they are useful, and how to run them using in-database OML methods as well as open-source methods through Embedded Python Execution. He also presented a demo running in a notebook with OML4Py.

The Oracle Machine Learning product family helps data scientists, analysts, developers, and IT achieve data science project goals faster while taking full advantage of the Oracle platform.

Oracle Machine Learning Notebooks offers an easy-to-use, interactive, multi-user, collaborative interface based on Apache Zeppelin notebook technology, and supports SQL, PL/SQL, Python and Markdown interpreters. It is available on all Autonomous Database versions and tiers, including the always-free editions.

OML includes AutoML, which provides automated machine learning algorithm features for algorithm selection, feature selection and model tuning, in addition to a specialized AutoML UI exclusive to the Autonomous Database.

OML Services is also included in Autonomous Database, where you can deploy and manage native in-database OML models as well as ONNX ML models (for classification and regression) built using third-party engines, and can also invoke cognitive text analytics.

Marcos Arancibia

August 03, 2021

Transcript

  1. ML Concepts - Using Cross-Validation with OML in-Database and with Embedded Python Execution

    OML Office Hours
    Jie Liu, Data Scientist, Oracle Machine Learning
    Supported by Marcos Arancibia, Sherry LaMonica & Mark Hornick, Product Management, Oracle Machine Learning
    Move the Algorithms; Not the Data!
    Copyright © 2021, Oracle and/or its affiliates.
    This Session will be Recorded
  2. Topics for today

    • Italian language now available in OML Services
    • Cross Validation in OML4Py
    • Q&A
  3. Outline for Cross-Validation

    • Overview of model validation
    • Cross Validation in-database solution
    • Cross Validation using Embedded Python Execution
    • Conclusion
  4. Motivation

    Provide better understanding of the performance of a machine learning algorithm
    • Need to evaluate the performance of the model before going online
    • Need to understand if a new method works, or works better
      • ML algorithm selection
      • Feature engineering
      • Parameter tuning
    If a data scientist reports recent work such as:
    • Used random search method to select parameters for Random Forest. Precision increased from 0.76 to 0.78
    • Tried a new boosted tree library. AUC increased from 0.71 to 0.75
    • Created 10 features by joining 3 tables. Recall increased from 0.73 to 0.82
    Are those improvements really significant or reliable?
  5. Techniques for validation: Train/Test split validation

    • Need to set aside part of the dataset for validation
      • We cannot train the model on a dataset and validate the model on the same dataset. If so, we cannot guarantee the model performance on new data
    • Split the dataset into training and test sets, usually 0.8/0.2
      • Train the model on train dataset
      • Test the model on test dataset
    • Pay attention to the label distribution
      • Ratio of positive labels should be roughly the same for both train and test split
    • Unable to know the variation of the performance
      • Only one pair of train/test datasets. So only one set of validation result is obtained
      • The model may behave well on this particular pair of train/test set but not others
    [Diagram: Training Data | Testing Data]
    How to get a sense of the variation of model performance using the dataset available?
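The 0.8/0.2 split with matching label ratios described on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `stratified_split`, the seed, and the toy label list are hypothetical and are not taken from the session or from the OML4Py API.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) with similar label ratios in both parts.

    Hypothetical helper for illustration; not part of OML4Py.
    """
    rng = random.Random(seed)
    # Group row indices by label so each class is split separately.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumed data): 30 positives and 70 negatives.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)

# Ratio of positive labels in each part.
pos_train = sum(labels[i] for i in train_idx) / len(train_idx)
pos_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), pos_train, pos_test)
```

Splitting each class separately is what keeps the positive ratio the same on both sides. A single split like this still yields only one validation result, which is the limitation the slide ends on.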
  6. Techniques for validation: Leave one out validation

    • Leave 1 sample as the test point
    • Train the model on n-1 data points
    • Pro: Low bias on Prediction Error estimation
    • Con: High cost as n can be huge
    [Diagram: n splits, each using one sample for Test and the rest as Training Data]
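As a rough sketch of the leave-one-out procedure on this slide, the loop below fits one model per sample. The "model" is deliberately trivial (it predicts the mean of the training targets) and the four data points are made up; both are assumptions chosen so the cost of n separate fits is easy to see.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, test_index) for each of the n samples."""
    for i in range(n):
        train_idx = [j for j in range(n) if j != i]
        yield train_idx, i


def fit_mean(values):
    """Trivial stand-in model: always predict the training mean."""
    return sum(values) / len(values)


# Toy regression targets (assumed data).
y = [2.0, 4.0, 6.0, 8.0]

squared_errors = []
for train_idx, test_i in leave_one_out_splits(len(y)):
    prediction = fit_mean([y[j] for j in train_idx])  # one fit per sample
    squared_errors.append((y[test_i] - prediction) ** 2)

n_fits = len(squared_errors)
loo_mse = sum(squared_errors) / n_fits
print(n_fits, loo_mse)
```

The number of fits equals the number of rows, which is why the slide lists cost as the main drawback when n is large.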
  7. Techniques for validation: K-Fold Cross Validation

    • Divide the dataset into K parts
    • Use K-1 parts as training set
    • Use 1 part as test set
    • K = 10 is usually recommended
    • Pro: reduce the computation cost compared to leave-one-out validation
    • Con: slight overestimation of the prediction error due to the reduced training sizes
    • Caution: data with timestamps
    [Diagram: K = 5 fold; each fold uses a different part as Test Data and the rest as Training Data]
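The K-fold procedure above can be sketched without any library as follows. Everything here is an assumption for illustration: the synthetic one-dimensional data, the nearest-centroid classifier, and the helper names are not from the session. The point is that K scores give a mean and a spread, which is what lets you judge whether a reported gain (say 0.76 to 0.78) is larger than fold-to-fold variation.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) for each of k folds over n shuffled rows."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_centroids(xs, ys):
    """Stand-in model: store the mean feature value of each class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Synthetic two-class data (assumed): class 0 around 0.0, class 1 around 2.0.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(2.0, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

scores = []
for train_idx, test_idx in k_fold_indices(len(xs), k=5):
    model = fit_centroids([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
    scores.append(correct / len(test_idx))

mean_acc = statistics.mean(scores)
std_acc = statistics.stdev(scores)
print(f"accuracy per fold: {scores}")
print(f"mean = {mean_acc:.3f}, std = {std_acc:.3f}")
```

Note that the folds here are shuffled at random. For data with timestamps, which the slide flags as a caution, random folds can place future rows in the training set, so an ordered split is usually the safer choice.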
  8. Implementation in OML4Py: K-Fold Cross Validation in OML4Py

    • OML4Py API to split the data into K-folds
    • Parallelize the K-Fold CV using Embedded Python Execution
    Go through OML notebook
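The notebook from the demo is not reproduced in this transcript, so the sketch below does not use OML4Py at all. It is a local stand-in, using Python's standard `concurrent.futures`, that shows only the shape of the idea on this slide: each fold is an independent task identified by its fold number, so the K evaluations can be dispatched side by side and collected afterwards. The data, the `evaluate_fold` function, and the least-squares model are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

K = 5
# Toy data (assumed): points exactly on the line y = 2x + 1.
data = [(float(i), 2.0 * i + 1.0) for i in range(20)]


def evaluate_fold(fold_id):
    """Train on every fold except fold_id; return (fold_id, MAE on fold_id)."""
    test = [row for i, row in enumerate(data) if i % K == fold_id]
    train = [row for i, row in enumerate(data) if i % K != fold_id]

    # Ordinary least squares fit of y = a * x + b on the training rows.
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    a = sxy / sxx
    b = mean_y - a * mean_x

    mae = sum(abs(y - (a * x + b)) for x, y in test) / len(test)
    return fold_id, mae


# One task per fold; a thread pool stands in for the database-side engines.
with ThreadPoolExecutor(max_workers=K) as pool:
    results = dict(pool.map(evaluate_fold, range(K)))

for fold_id in sorted(results):
    print(fold_id, results[fold_id])
```

Because each task needs only its fold number and access to the data, the same function shape suits any executor that runs a function once per index. Threads in CPython will not speed up CPU-bound training; they are used here only to keep the sketch runnable anywhere.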
  9. Helpful Links

    ORACLE MACHINE LEARNING ON O.COM
      https://www.oracle.com/machine-learning
    OML TUTORIALS
      OML LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?p180_id=560
      OML4Py LiveLab: https://apexapps.oracle.com/pls/apex/dbpm/r/livelabs/view-workshop?wid=786
      Interactive tour: https://docs.oracle.com/en/cloud/paas/autonomous-database/oml-tour
    OML OFFICE HOURS
      https://asktom.oracle.com/pls/apex/asktom.search?office=6801#sessionss
    ORACLE ANALYTICS CLOUD
      https://www.oracle.com/solutions/business-analytics/data-visualization/examples.html
    OML4PY
      • OML4Py (2m video)
      • OML4Py Introduction (17m video)
      • OML4Py Technical Brief
      • OML4Py User’s Guide
      • Blog: Introducing OML4Py
      • GitHub Repository with Python notebooks
    ORACLE AUTOML UI
      • Oracle Machine Learning AutoML UI (2m video)
      • Oracle Machine Learning Demonstration (6m video)
      • OML AutoML UI Technical Brief
      • Blog: Introducing Oracle Machine Learning AutoML UI
    OML SERVICES
      • Oracle Machine Learning Services (2m video)
      • OML Services Technical Brief
      • Oracle Machine Learning Services Documentation
      • Blog: Introducing Oracle Machine Learning Services
      • GitHub Repository with OML Services examples
  10. On our GitHub, you can find:

    github.com/oracle/oracle-db-examples/tree/master/machine-learning
    • Example Notebooks in OML4SQL and OML4Python
    • SQL code examples for DB 18c, 19c and 21c
    • Labs folder with OML4Py HOL Labs
    • OML Services demos including Cognitive Text Demos, in PostMan collections