Oracle Machine Learning for R: an Introduction

We learned about Oracle Machine Learning for R (OML4R) – the R interface to in-database machine learning and R script deployment from Oracle. R is the top statistical programming language for data science and computational statistics.

OML4R enables you to work with database tables and views using familiar R syntax and functions. For scalable and performant data exploration, data preparation, and machine learning, leverage Oracle Database as a high-performance compute engine. Build machine learning models with parallelized in-database algorithms through an R formula-based specification.

Invoke user-defined R functions from SQL for deployment in applications and dashboards, where R engines are dynamically spawned and controlled by Oracle Database. You can also run your R functions in a data-parallel and task-parallel manner.

We included a demonstration of OML4R that took us through the transparency layer, in-database machine learning, and embedded R execution functionality.

Marcos Arancibia

October 28, 2020

Transcript

  1. Oracle Machine Learning Office Hours – Oracle Machine Learning for R: an Introduction

    With Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning (@MarkHornick), and Marcos Arancibia, Product Manager, Data Science and Big Data (@MarcosArancibia). oracle.com/machine-learning. Copyright © 2020, Oracle and/or its affiliates. All rights reserved.
  2. Today’s Agenda

    • Upcoming session
    • Today’s speaker: Mark Hornick – Oracle Machine Learning for R: an Introduction
    • Q&A
  3. Next Session: November 5, 2020, 8AM US Pacific – Oracle Machine Learning Office Hours

    Machine Learning 102 – Clustering. Join us for this special series, “Oracle Machine Learning Office Hours – Machine Learning 102”, where we will go through the main steps of solving a business problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python. This sixth session in the series will cover Clustering 102: we will learn more about the methods on multiple dimensions, how to compare clustering techniques, and how to use dimensionality reduction to extract only the most meaningful attributes from datasets with many attributes (or derived attributes).
  4. Today’s Session: Oracle Machine Learning for R: an Introduction

    Oracle Machine Learning for R (OML4R) enables you to work with database tables and views using familiar R syntax and functions. For scalable and performant data exploration, data preparation, and machine learning, leverage Oracle Database as a high-performance compute engine:
    • Build machine learning models with parallelized in-database algorithms through an R formula-based specification.
    • Invoke user-defined R functions from SQL for deployment in applications and dashboards, where R engines are dynamically spawned and controlled by Oracle Database.
    • Take advantage of running your R functions in a data-parallel and task-parallel manner.
  5. Agenda

    • ML pain points
    • Oracle Machine Learning
    • Introducing OML4R
    • Demo
    • Q&A
  6. Sample of common enterprise machine learning pain points

    • “It takes too long to get my data or to get the ‘right’ data”
    • “I can’t analyze or mine all of my data – it has to be sampled”
    • “Putting open source models and results into production takes too long and is ad hoc and complex”
    • “Our company is concerned about data security, backup and recovery”
    • “We need to build and score with 100s or 1000s of models fast to meet business objectives”
  7. Oracle Machine Learning

    • OML4SQL: SQL API
    • OML4R: R API
    • OML4Py*: Python API
    • OML4Spark: R API on Big Data
    • OML Notebooks: with Apache Zeppelin on Autonomous Database
    • OML AutoML UI*: Code-free AutoML interface on Autonomous Database
    • OML Services*: Model Deployment and Management, Cognitive Text
    • Oracle Data Miner: Oracle SQL Developer extension
    * Coming soon
  8. Oracle Machine Learning interfaces to Oracle Database

    [Table: Data Management Platform (Autonomous Database; Oracle Database; Database Cloud Service) | Oracle Machine Learning Component (OML Notebooks with OML4SQL, OML4Py*, OML4R*; OML4SQL; Oracle Data Miner; OML4Py*; OML4R) | Tool (Apache Zeppelin; SQL Developer; SQL*Plus; R client, RStudio; Python client, Jupyter Notebooks). *Coming soon]
  9. Oracle Machine Learning for SQL

    Empower SQL users with immediate access to ML included with Oracle Database and Oracle Autonomous Database
    • In-database, parallelized, distributed algorithms: no extracting data to a separate ML engine; fast and scalable; batch and real-time scoring; explanatory prediction details
    • ML models as first-class database objects: access control via permissions; audit user actions; export / import models across databases
    • Supports R and Python interfaces
    • Leverage ML across the Oracle stack
    [Diagram labels: SQL Interfaces (SQL*Plus, SQLDeveloper, …); Oracle Autonomous Database; OML Notebooks; Oracle Database with OML]
  10. Traditional analytics and data source interaction

    • Access latency
    • Paradigm shift: R/Python → Data Access Language → R/Python
    • Memory limitation – data size, in-memory processing
    • Single threaded
    • Issues for backup, recovery, security
    • Ad hoc production deployment
    [Diagram labels: Data Source; Flat Files; extract / export; read; export; load; Data source connectivity packages; Read/Write files using built-in tool capabilities; Deployment: ad hoc cron job]
  11. Oracle Machine Learning for R: Empower data scientists with open source environments

    • Oracle Database as HPC environment
    • In-database parallelized and distributed machine learning algorithms
    • Manage scripts and objects in Oracle Database
    • Integrate results into applications and dashboards via SQL
    • Use Oracle R Distribution or open source R
    [Diagram labels: OML4R; SQL Interface; Database Server Machine]
  12. Oracle Machine Learning for R: Empower data scientists with open source environments

    Transparency layer
    • Leverage proxy objects so data remain in database
    • Overload native functions translating functionality to SQL
    • Use familiar R syntax on database data
    Parallel, distributed algorithms
    • Scalability and performance
    • Exposes in-database algorithms available from OML4SQL
    Embedded execution
    • Manage and invoke R scripts in Oracle Database
    • Data-parallel, task-parallel, and non-parallel execution
    • Use open source packages to augment functionality
    [Diagram labels: OML4R; SQL Interface; Database Server Machine]
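The transparency-layer idea on this slide (a proxy object that holds no data, plus overloaded R functions that translate to SQL) can be illustrated with a toy sketch in base R. Everything here is an assumption made for illustration: the `toy_proxy` class, `expr_to_sql()`, `to_sql()`, and the `CUSTOMERS` table are hypothetical and are not part of OML4R, which is far more complete.

```r
# Toy illustration only: a hypothetical "toy_proxy" class, not the OML4R implementation.
# The object stores a table name and pending filter conditions; it holds no rows.
toy_proxy <- function(table, where = NULL) {
  structure(list(table = table, where = where), class = "toy_proxy")
}

# Translate a simple R comparison (for example AGE > 30) into SQL text.
expr_to_sql <- function(e) {
  op <- as.character(e[[1]])
  sql_op <- switch(op, "==" = "=", "!=" = "<>", op)
  rhs <- e[[3]]
  rhs_sql <- if (is.character(rhs)) paste0("'", rhs, "'") else format(rhs)
  paste(as.character(e[[2]]), sql_op, rhs_sql)
}

# Overload subset() for the proxy: filtering accumulates SQL conditions
# instead of pulling data into R.
subset.toy_proxy <- function(x, subset, ...) {
  cond <- expr_to_sql(substitute(subset))
  toy_proxy(x$table, where = c(x$where, cond))
}

# Render the query a database would run on behalf of the proxy.
to_sql <- function(x) {
  sql <- paste("SELECT * FROM", x$table)
  if (length(x$where) > 0) {
    sql <- paste(sql, "WHERE", paste(x$where, collapse = " AND "))
  }
  sql
}

customers <- toy_proxy("CUSTOMERS")
older <- subset(customers, AGE > 30)
query <- to_sql(subset(older, STATE == "CA"))
print(query)
```

The user writes ordinary `subset()` calls, and the work is expressed as SQL that could run where the data lives. A real transparency layer overloads many more functions and handles far richer expressions.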
  13. Example using OML4R: Proxy objects

    [Diagram: the proxy data.frame inherits from data.frame]
  14. Mapping between R and Oracle Database Data Types

    SQL (ROracle read) | R | SQL (ROracle write)
    varchar2, char, clob, rowid | character | varchar2(4000)
    number, float, binary_float, binary_double | numeric | if(ora.number==T) number else binary_double
    integer | integer | integer
    | logical | integer
    date, timestamp | POSIXct | timestamp
    | Date | timestamp
    interval day to second | difftime | interval day to second
    raw, blob, bfile | ‘list’ of ‘raw’ vectors | raw(2000)
    | factor (and other types) | character
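The write direction of this mapping can be read as a small lookup on the R side. The sketch below encodes that column as a base R function; `oracle_write_type()` is a hypothetical helper written only for illustration (it is not an ROracle or OML4R function), and the sample data frame is made up. Factors and other types fall through to character, which the slide maps to varchar2(4000).

```r
# Sketch of the "R -> SQL (ROracle write)" column from the slide.
# oracle_write_type() is a hypothetical helper, not part of ROracle or OML4R.
oracle_write_type <- function(x, ora.number = TRUE) {
  if (inherits(x, "POSIXct") || inherits(x, "Date")) {
    "timestamp"
  } else if (inherits(x, "difftime")) {
    "interval day to second"
  } else if (is.logical(x) || is.integer(x)) {
    "integer"
  } else if (is.numeric(x)) {
    if (ora.number) "number" else "binary_double"
  } else if (is.list(x) && length(x) > 0 && all(vapply(x, is.raw, logical(1)))) {
    "raw(2000)"
  } else {
    # character, plus factor and other types (written as character)
    "varchar2(4000)"
  }
}

# Made-up sample data covering several R column types.
df <- data.frame(
  id     = 1:3,
  amount = c(1.5, 2, 3.25),
  active = c(TRUE, FALSE, TRUE),
  name   = c("a", "b", "c"),
  joined = as.Date("2020-10-28") + 0:2,
  stringsAsFactors = FALSE
)

types <- vapply(df, oracle_write_type, character(1))
print(types)
```

The class checks for dates and intervals come first because those objects are stored as numbers internally.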
  15. OML4R Algorithms

    Classification
    • Decision Tree • Logistic Regression • Naïve Bayes • Support Vector Machine • RandomForest
    Regression
    • Linear Model • Generalized Linear Model • Multi-Layer Neural Networks • Stepwise Linear Regression • Support Vector Machine
    Attribute Importance
    • Minimum Description Length
    Clustering
    • Hierarchical k-Means • Orthogonal Partitioning • Expectation Maximization
    Feature Extraction
    • Nonnegative Matrix Factorization • Principal Component Analysis • Singular Value Decomposition • Explicit Semantic Analysis
    Market Basket Analysis
    • Apriori – Association Rules
    Anomaly Detection
    • 1 Class Support Vector Machine
    Time Series
    • Single Exponential Smoothing • Double Exponential Smoothing
    …plus open source R packages for algorithms in combination with embedded R data- and task-parallel execution
    Supports automatic data preparation, partitioned model ensembles, integrated text mining
  16. Scalable Data Analysis – Model Building: Smart meter scenario

    [Diagram: data in Oracle Database is split into partitions c1, c2, …, ci, …, cn; the R script f(dat,args,…) { build model } from the R Script Repository is invoked on each partition, producing Model c1, Model c2, …, Model ci, …, Model cn, which are kept in the R Datastore]
  17. Build models and store in database, partition on CUST_ID

    ore.groupApply(
      CUST_USAGE_DATA,
      CUST_USAGE_DATA$CUST_ID,            # one partition per customer
      function(dat, ds.name) {
        cust_id <- dat$CUST_ID[1]
        mod <- lm(Consumption ~ . -CUST_ID, dat)
        # drop large components to keep the stored model small
        mod$effects <- mod$residuals <- mod$fitted.values <- NULL
        name <- paste("mod", cust_id, sep="")
        assign(name, mod)
        # save each model in its own datastore, named by customer
        ds.name1 <- paste(ds.name, ".", cust_id, sep="")
        ore.save(list=paste("mod", cust_id, sep=""), name=ds.name1, overwrite=TRUE)
        TRUE
      },
      ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE
    )
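For readers without a database at hand, the same partition-and-model pattern can be approximated locally in base R. This is only an in-memory analogue under stated assumptions: the `usage` data is synthetic, `build_model()` is a hypothetical name, `split()` plus `lapply()` stands in for the per-partition invocation (sequentially, not in parallel), and an `.rds` file stands in for the datastore. No OML4R function is called.

```r
# Local, in-memory analogue of the slide's pattern; no database is involved.
# The data is synthetic and exists only to make the sketch runnable.
set.seed(42)
usage <- data.frame(
  CUST_ID     = rep(c(101, 102, 103), each = 20),
  Temperature = runif(60, min = 0, max = 35),
  Hour        = rep(0:19, times = 3)
)
usage$Consumption <- 2 + 0.3 * usage$Temperature + 0.1 * usage$Hour +
  rnorm(60, sd = 0.5)

# Fit one linear model on a single customer's rows.
build_model <- function(dat) {
  mod <- lm(Consumption ~ . - CUST_ID, data = dat)
  # Drop bulky components before storing, as the slide's function does.
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}

# split() partitions the rows by customer; lapply() applies the function
# to each partition in turn.
models <- lapply(split(usage, usage$CUST_ID), build_model)
names(models) <- paste0("mod", names(models))

# Stand-in for the datastore: persist the named models to a temporary file.
store <- tempfile(fileext = ".rds")
saveRDS(models, store)
restored <- readRDS(store)
print(names(restored))
```

In the database version, the partitions never leave Oracle Database and the function can run in parallel across R engines; the local sketch only shows the shape of the computation.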
  18. Mission and Vision

    Promote the R language and lead initiatives in support of the R community. The R Consortium is committed to helping evolve the R language by identifying, developing, and implementing infrastructure projects. The R Consortium works with and provides support to the R Foundation and key organizations developing, maintaining, distributing, and using R software. https://r-consortium.org
  19. Membership

    “We see R as the future of statistical analysis because of its flexibility and the strong active community behind it.” – Alun Bedding, Director of Biostatistics at Genentech. Become a member: [email protected]
  20. Why Oracle for Machine Learning with R?

    Oracle Database and Oracle Autonomous Database: that’s where most enterprise data lives – bring the algorithms to the data!
    • Empower data scientists and R users with powerful in-database ML from R
    • Eliminate costly data movement and latency
    • Scale R for data exploration, data preparation, and ML algorithms
    • In-database algorithms supporting: regression, classification, time series, association rules, attribute importance, clustering, feature extraction, anomaly detection
    • Automatic algorithm-specific data preparation, partition models, integrated text mining
    • Ease of ML model and R script deployment with data-parallel and task-parallel support
    • Leverage existing backup, recovery, and security mechanisms and protocols of Oracle Database
    Oracle integrates ML across the Oracle Stack and the Enterprise