
Oracle Machine Learning for Spark

Oracle Machine Learning for Spark offers interfaces for running Machine Learning algorithms on top of Data Lakes, using Spark to distribute computation across Nodes. It integrates with the Big Data ecosystem, allowing users to manipulate tables in HIVE and Impala and to work with HDFS and Oracle Database, with the R language as the front-end.

It makes the open-source R scripting language and environment ready for the enterprise and big data. Designed for problems involving both large and small volumes of data, Oracle Machine Learning for Spark integrates R with Data Lakes, allowing users to execute R commands and scripts for data processing, statistical analysis and machine learning on HIVE, Impala and Spark DataFrame tables and views, using R and Spark SQL syntax. Many familiar R functions are overloaded so that R operations are translated into SQL and executed inside the Data Lake.

Oracle Machine Learning consists of complementary components supporting scalable machine learning algorithms for in-database and big data environments (including Cloud and on-premises), notebook technology, SQL, Python and R APIs, and Hadoop/Spark environments.

Marcos Arancibia

February 12, 2020

Transcript

  1. With Marcos Arancibia, Product Mgr. Data Science and Big Data, @MarcosArancibia
    Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning, @MarkHornick
    oracle.com/goto/machinelearning
    Oracle Machine Learning Office Hours
    Oracle Machine Learning for Spark
    Copyright © 2020, Oracle and/or its affiliates. All rights reserved
  2. Topics
    • Upcoming session
    • Web Questions
    • Speaker Marcos Arancibia – Oracle Machine Learning for Spark
    • Q&A
  3. Next Session to look for:
    • March 11th, 2020: Oracle Machine Learning Office Hours, 9AM US Pacific
    • Oracle Machine Learning Notebooks: Take a deeper dive into Oracle Machine Learning Notebooks, both the Zeppelin-based interface and additional OML4SQL API functionality, like partitioned models and text mining.
    Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning
    Marcos Arancibia, Product Manager, Oracle Data Science and Big Data
  4. Web Questions
    • Does OML support creation of PMML-based models?
    • Will OML4Spark be available with the new Big Data Service?
  5. Today’s Session: Oracle Machine Learning for Spark, with Demos with Marcos Arancibia
    • Oracle Machine Learning for Spark offers interfaces to run Machine Learning algorithms on top of Data Lakes, using Spark to distribute computation across Nodes, and brings integration with the Big Data ecosystem that allows for manipulating tables in HIVE, Impala and Spark, as well as integration with HDFS and Oracle Database, using the R language
  6. Oracle Machine Learning for Spark (OML4Spark)
    Marcos Arancibia, Product Manager, Data Science and Big Data
    Oracle Machine Learning Office Hours
  7. Safe harbor statement
    The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.
  8. Oracle Machine Learning Key Attributes
    • Automated: Get better results faster with less effort – even non-expert users
    • Scalable: Handle big data volumes using parallel, distributed algorithms – no data movement
    • Production-ready: Deploy and update data science solutions faster with integrated ML platform
    Increase productivity, Achieve enterprise goals, Innovate more
  9. Oracle data platform – products
    Diagram labels: Any Data Source, Any Data Consumer, Data Management, Data Access, Data Ingestion
    Products shown: Data Catalog Service, Autonomous Data Warehouse, NoSQL Service, Autonomous Transaction Processing, Big Data Service, Spark Data Flow Service, Data Lake Analytics (Sparkline SNAP), Object Storage, Data Integration Service, OCI Streaming, Oracle Analytics Cloud, Fusion Analytics Cloud, Data Science Service
  10. Oracle data platform – products: Components with built-in Oracle Machine Learning
    Diagram labels: Any Data Source, Any Data Consumer, Data Management, Data Access, Data Ingestion
    Products shown: Data Catalog Service, Autonomous Data Warehouse, NoSQL Service, Autonomous Transaction Processing, Big Data Service, Spark Data Flow Service, Data Lake Analytics (Sparkline SNAP), Object Storage, Data Integration Service, OCI Streaming, Oracle Analytics Cloud, Fusion Analytics Cloud, Data Science Service
  11. Oracle Machine Learning
    • OML Services*: Supporting Oracle Applications; Image, Text, Scoring, Deployment, Model Management
    • OML4SQL: Oracle Advanced Analytics, SQL API
    • OML4Py*: Python API
    • OML4R: Oracle R Enterprise, R API
    • OML Notebooks: with Apache Zeppelin on Autonomous Database
    • OML4Spark: Oracle R Advanced Analytics for Hadoop
    • Oracle Data Miner: Oracle SQL Developer extension
    * Coming soon
  12. Manage and Analyze Cross-Platform Data with Oracle Machine Learning
    Select User Interface, e.g.: Popular R IDEs, Popular Python IDEs, SQL Developer, OML Notebooks
    API Options: OML4R, OML4Python, OML4SQL, REST
    Cloud or On-premises
    Reach broader Data Sources: Oracle Object Storage, Big Data Service (HDFS), NoSQL Databases, Kafka Streams, Amazon S3, Azure Blob Storage
    Also shown: Oracle Cloud SQL, Oracle Database, Data Lake, OML4Spark, Oracle Big Data SQL, Data Science Service
  13. Oracle Machine Learning for Spark: Integration
    Hadoop Cluster: Hadoop, HIVE, Impala and Spark DF Integration plus R Engine and Open-Source R Packages
    R Client: R Analytics, OML4Spark Client libraries
    • OML4Spark algorithms: Deep Neural, GLM, LM, ELM, H-ELM, SVD, PCA
    • Spark MLlib algorithms: LM, GLM, LASSO, Ridge Regression, Decision Trees, Random Forests, SVM, k-Means, PCA, Gradient Boosted Trees
    • Open-source R packages distributed via Map-Red function in R
    SQL and HQL: Basic Stats, Data Prep, Joins and View creation
    Oracle Database & SQL: JDBC or OCI Native
  14. Oracle Machine Learning for Spark (OML4Spark)
    • supported by Oracle R Advanced Analytics for Hadoop
  15. Oracle Machine Learning for Spark: R Language API Component to Oracle Big Data Connectors
    • Leverage Spark environment for powerful data preparation and machine learning
    • Use data across range of Data Lake sources
    • Achieve scalability and performance using full Hadoop cluster power
    • Parallel and distributed ML algorithms from native and Spark MLlib implementations
    • Use expressive R Formula specification
    Diagram labels: OML4Spark R Client; Java API; HDFS | Hive | Spark DF | Impala | JDBC Sources; BDA, BDS, DIY
  16. Oracle Machine Learning for Spark: R Language API Component to Oracle Big Data Connectors
    Transparency layer
    • Proxy objects reference data from file system, HDFS, Hive, Impala, Spark DataFrame and JDBC sources
    • Overloaded R functions translate functionality to the native language, e.g., HiveQL for HIVE and Impala
    • Users manipulate data via standard R syntax
    Parallel, distributed machine learning algorithms
    • Scalability and performance leveraging full Hadoop cluster
    • Spark-based custom LM, GLM, NN, K-Means plus Spark MLlib
    • Use expressive R Formula specification
    Compute framework with custom R mappers/reducers
    • Data-parallel and task-parallel execution
    • Allows open-source CRAN packages to run on Cluster Nodes
    Diagram labels: OML4Spark R Client; Java API; HDFS | Hive | Spark DF | Impala | JDBC Sources; BDA, BDS, DIY
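To illustrate the transparency-layer idea from slide 16, here is a minimal base-R sketch of a proxy object whose overloaded functions build a HiveQL string instead of touching any data. It is not OML4Spark's implementation: the `hive_proxy` class, the `to_sql` helper and the `flights` table name are hypothetical, and the translation only handles `==` and `&`.

```r
# Toy proxy: stores a table name plus pending clauses; it never holds data.
# hive_proxy, to_sql and the "flights" table are hypothetical names.
hive_proxy <- function(table, where = NULL, limit = NULL) {
  structure(list(table = table, where = where, limit = limit),
            class = "hive_proxy")
}

# Overload subset(): capture the R condition and rewrite it as SQL text.
subset.hive_proxy <- function(x, condition, ...) {
  cond <- paste(deparse(substitute(condition)), collapse = " ")
  cond <- gsub("==", "=", cond, fixed = TRUE)   # R equality -> SQL equality
  cond <- gsub("&", "AND", cond, fixed = TRUE)  # R conjunction -> SQL AND
  hive_proxy(x$table, where = cond, limit = x$limit)
}

# Overload head(): record a LIMIT instead of fetching rows.
head.hive_proxy <- function(x, n = 6L, ...) {
  hive_proxy(x$table, where = x$where, limit = as.integer(n))
}

# Render the accumulated operations as a single query string.
to_sql <- function(x) {
  sql <- paste0("SELECT * FROM ", x$table)
  if (!is.null(x$where)) sql <- paste0(sql, " WHERE ", x$where)
  if (!is.null(x$limit)) sql <- paste0(sql, " LIMIT ", x$limit)
  sql
}

flights <- hive_proxy("flights")
query <- to_sql(head(subset(flights, distance > 500 & month == 1), 10))
print(query)
```

The user writes ordinary `subset()` and `head()` calls; only the final query string would be sent to the SQL engine, which is the sense in which the proxy "references" remote data rather than copying it into R.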
  17. OML4Spark Performance: Logistic Regression (GLM)
    Data fits in memory
    • Up to 7x faster than Spark MLlib
    Data does not fit in memory
    • Able to solve a 10B row model
    Benchmark environment
    • OML4Spark 2.8.0
    • Big Data Appliance X7-2
    • 6 Nodes, 256GB of RAM per Node
    Formula: cancelled ~ distance + origin + dest + as.factor(month) + as.factor(year) + as.factor(dayofmonth) + as.factor(dayofweek) + as.factor(flightnum)
    [Chart: OML4Spark vs. Spark MLlib for GLM Logistic Regression. X axis: Dataset Size (# rows), ticks at 100K, 1M, 10M, 100M, 1B, 10B. Y axis: Execution Time (seconds), ticks at 1, 10, 100, 1,000, 10,000. Data labels for OML4Spark: 1.8, 2.4, 5.5, 26, 64.8, 1530. Data labels for MLlib: 7.4, 16.2, 37.9, 131, 278, WNR.]
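The "up to 7x faster" claim on slide 17 can be checked against the chart's data labels with a few lines of base R. This is a sketch under one assumption that the flattened transcript cannot confirm: that the i-th timing in each series corresponds to the i-th dataset size on the axis. The 10B-row point is left out because MLlib has no timing there (labelled WNR on the slide).

```r
# Timings in seconds as they appear on the slide, in reading order.
# Assumption: the i-th value of each series belongs to the i-th dataset size.
rows      <- c("100K", "1M", "10M", "100M", "1B")
oml4spark <- c(1.8, 2.4, 5.5, 26, 64.8)
mllib     <- c(7.4, 16.2, 37.9, 131, 278)

# Speedup = MLlib time divided by OML4Spark time for the same dataset size.
speedup <- setNames(mllib / oml4spark, rows)
print(round(speedup, 1))

# The largest ratio is the one to compare with the "up to 7x" claim.
max_speedup <- max(speedup)
print(max_speedup)
```

Under that pairing, the largest ratio falls at 10M rows and is just under 7, which is consistent with the slide's headline figure.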
  18. OML4Spark Benefits: Making Spark MLlib better for R users
    OML4Spark Formula Parser can handle the full set of open-source R formula transformations, so it can be used with any Spark MLlib algorithm supported by OML4Spark. Even the release 2.4.x of SparkR (Q1 ’20) fails to process a simple interaction between attributes.

    Using Spark MLlib Logistic Regression model in SparkR
        R> model <- glm( Kyphosis ~ (Age + Number)^2, df, family = "binomial")
        ERROR RBackendHandler: fitRModelFormula on org.apache.spark.ml.api.r.SparkRWrappers failed
        Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
          java.lang.IllegalArgumentException: Could not parse formula: Kyphosis ~ (Age + Number)^2

    Using Spark MLlib Logistic Regression model in OML4Spark
        R> model <- orch.ml.logistic( Kyphosis ~ (Age + Number)^2, data = data)
        OBX Model Matrix: processed 1 factor variables, 0.050 sec
        OBX Model Matrix: created MLlib LabeledPoint RDD (81 rows) 0.008 sec
        OBX Machine Learning: MLlib Logistic Regression elapsed time 0.858 sec
        R> model$coefficients
        [1] -6.568918 0.027176503 1.022537535 -0.004490547

    Using open-source R with the same complex formula to ensure OML4Spark’s model coefficients are correct
        glm( Kyphosis ~ (Age + Number)^2, data = kyphosis, family = "binomial")$coefficients
        (Intercept)          Age       Number   Age:Number
        -6.568917860  0.027176503  1.022537536 -0.004490547
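For readers unfamiliar with the formula on slide 18, the following base-R sketch shows what `(Age + Number)^2` expands to and why the fitted models report four coefficients. It uses only the standard `terms()` and `model.matrix()` functions and needs neither Spark nor OML4Spark; the small `toy` data frame is made up for illustration and is not the kyphosis dataset.

```r
# Base R only; no Spark or OML4Spark required.
f <- Kyphosis ~ (Age + Number)^2

# terms() expands the formula; "term.labels" lists the resulting predictors.
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# model.matrix() builds the design matrix. The three rows below are
# made-up values for illustration, not the kyphosis dataset.
toy <- data.frame(Age = c(10, 20, 30), Number = c(2, 3, 4))
mm <- model.matrix(~ (Age + Number)^2, data = toy)
print(colnames(mm))

# The interaction column is the elementwise product of the two predictors.
stopifnot(all(mm[, "Age:Number"] == toy$Age * toy$Number))
```

The expansion yields the two main effects plus the `Age:Number` interaction, which together with the intercept gives the four coefficients shown on the slide.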
  19. OML4Spark and Python: Building a Spark MLlib Random Forest from a HIVE table
    Python user steps – 47 lines
    • Load Libraries
    • Process Formula
    • Establish Spark Session
    • Copy data from HIVE
    • Create 3rd copy of Data for vectors
    • Build Model
    • Single Vector of Predictions
    http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
    https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_example.py
    https://github.com/apache/spark/blob/master/examples/src/main/python/ml/rformula_example.py
    OML4Spark user steps – 14 lines
    • Load Libraries
    • Establish HIVE and Spark Session
    • Build Model directly against HIVE data with full formula support
    • Predictions on HIVE data exported with desired columns
    http://www.oracle.com/technetwork/database/database-technologies/bdc/r-advanalytics-for-hadoop/documentation/index.html
  20. OML4Spark on the Oracle Machine Learning Roadmap
  21. Roadmap: OML Services
    Model Management Services
    • Building and deploying OML models
    • Model Monitoring of accuracy and prediction/predictor drift
    Model repository
    • Store, version, compare ML models
    Cognitive Services
    • Feature Extraction, Image and Text
    User-defined scripts deployment
    • Python and R user-defined functions invoked via REST API
    REST APIs for application integration
    Currently integrated with internal Oracle Applications
  22. Roadmap: OML4Spark
    Support advanced machine learning activities on Big Data
    • Model management and cognitive image and text processing
    • Model deployment and monitoring on Big Data (including Database models)
    Cloud-oriented packaging (containers, REST APIs)
    Enable OML4Py and OML4R for uniform experience across platforms
    Algorithms
    • Neural Network gradient descent enhancements to avoid over-fitting
    • New native Support Vector Machine with linear and non-linear kernels
    • New native k-Means and k-Mode clustering algorithms
    New cloud-based architecture with powerful Spark analytics
  23. Roadmap: Enabling OML on GPUs
    Leverage GPUs for user-defined R and Python functions
    • Include 3rd party packages leveraging GPUs, e.g., TensorFlow, Keras
    • Support state-of-the-art ML processing, e.g., deep learning
    Augment OML Services for GPU processing – key for images
  24. Oracle Machine Learning Key Attributes
    • Automated
    • Scalable
    • Production-ready
    Increase productivity, Achieve enterprise goals, Innovate more
  25. Thank You
    Marcos Arancibia, Product Manager, Data Science and Big Data