Slide 1

Slide 1 text

OML feature highlight: Using third-party Python packages from Python, SQL, and REST OML Office Hours Sherry LaMonica Principal Member of Technical Staff, Oracle Machine Learning Supported by Marcos Arancibia and Mark Hornick Product Management, Oracle Machine Learning Move the Algorithms; Not the Data! Copyright © 2021, Oracle and/or its affiliates. This Session will be Recorded

Slide 2

Slide 2 text

• Third-party packages • Introduction to Embedded Python Execution • Benefits and motivation • Embedded Python Execution interfaces • Demo • Q&A Goals for today Copyright © 2021, Oracle and/or its affiliates 2

Slide 3

Slide 3 text

Third-party Python packages 3 Copyright © 2021, Oracle and/or its affiliates

Slide 4

Slide 4 text

• A good source for state-of-the-art or new algorithms or functionality • Supports a wide range of functionality – Data transformation, statistical analysis, visualization, machine learning, etc. – Medical imaging, geospatial analysis, reproducible research • If a package is widely used it will quickly evolve into a robust offering through collaborative efforts • Not typically implemented for scalability and performance – Often focused on specialized topics and functionality rather than memory management or parallelism – Some packages quickly consume available memory Third-party packages 4 Copyright © 2021, Oracle and/or its affiliates.

Slide 5

Slide 5 text

• How big is the data? – Will data fit into Python memory? – Will the package work on my data volumes? • Is working directly in Python more performant? – For small data, working in open-source Python may provide adequate performance • Can the data be partitioned for processing in parallel? – Partition by group or row (data-parallel) – Run n times in parallel (task-parallel), e.g., simulations • Is there equivalent in-database functionality? – In-database machine learning are scalable and parallelized Should I use a third-party Python package on database data? 5 Copyright © 2021, Oracle and/or its affiliates.

Slide 6

Slide 6 text

There is no magic! • Performance gains are the result of running in data-parallel or task-parallel Python engines in the database environment • Using embedded Python does not make a Python package automatically scale – No automatic modification to third-party package code – No automation of scalable or parallel behavior for Python packages • Database environment may have more memory, CPU, or faster CPUs to increase scalability from a typical client machine Performance characteristics for open-source Python packages with Embedded Python Execution Copyright © 2021, Oracle and/or its affiliates 6

Slide 7

Slide 7 text

Oracle Autonomous Database User-installed packages currently not available OML4Py Dependencies cx_Oracle 8.1.0 matplotlib 3.1.2 numpy 1.18.1 pandas 0.25.3 scikit-learn 0.23.1 scipy 1.6.1 cycler 0.10.0 joblib 0.14.0 kiwisolver 1.1.0 pyparsing 2.4.0 python-dateutil 2.8.1 pytz 2019.3 six 1.13.0 threadpoolctl 2.1.0 Oracle Database User-installed and configured packages supported Which third-party packages can be used with OML4Py? Copyright © 2021, Oracle and/or its affiliates 7

Slide 8

Slide 8 text

Using pip • Python’s internal package manager • Installs Python packages from PyPI repository along with their dependencies • Often used with configuration file requirements.txt to install specified packages and versions $ pip install pandas $ pip install -r requirements.txt Package Management Copyright © 2021, Oracle and/or its affiliates. 8

Slide 9

Slide 9 text

Using virtualenv • Tool for creating isolated Python environments • Use different versions of a package for distinct projects • Organize groups of packages into different isolated environments $ pip install virtualenv Package Management Copyright © 2021, Oracle and/or its affiliates. 9 Python 3.5 Python 3.9 Pandas 0.23.1 Pandas 0.25.3 virtualenv

Slide 10

Slide 10 text

Embedded Python Execution 10 Copyright © 2021, Oracle and/or its affiliates

Slide 11

Slide 11 text

• Develop user-defined Python functions and manage these in the database for production deployment – Invoked using REST endpoints on Oracle Autonomous Database – Invoked using a SQL interface on Oracle Database • Python objects can be stored in the database using the OML4Py Datastore – Eliminates the need to manage objects using flat files • Datastores work with embedded Python execution for passing objects to user-defined functions – Pass non-scalar objects (e.g., a model) as arguments to user-defined functions in SQL, REST, and Python embedded execution Interfaces – Enables sharing data science work products from data scientists with application developers for deployment • Run user-defined Python functions in a data-parallel and task-parallel manner – Supports “embarrassingly parallel” solutions – Enables scoring third-party Python models at scale • User-defined function results can contain both DataFrames and images • Supports SQL, Python, and REST Interfaces Embedded Python Execution: Introduction Copyright © 2021, Oracle and/or its affiliates 11

Slide 12

Slide 12 text

• Supplements OML4Py functionality through use of third-party packages in user-defined functions • Facilitates application use of Python script results – Develop/test Python scripts interactively with Python interface – Invoke Python scripts directly from SQL or REST for production applications o Python scripts stored in Oracle Database o Schedule Python scripts to run automatically via SQL • Potential for improved performance and throughput – Data-parallel and task-level parallelism – Leverage compute and memory resources of the database environment – Avoids the overhead of stacked SQL • Supports structured and image results • Retrieve results in various forms depending on application requirements: table, XML, PNG, JSON Embedded Python Execution: Benefits Copyright © 2021, Oracle and/or its affiliates 12

Slide 13

Slide 13 text

OML4Py Interfaces for Embedded Python Execution Autonomous Database • Python Interface (via OML Notebooks) • REST Interface Oracle Database • Python Interface • SQL Interface Copyright © 2021 Oracle and/or its affiliates. OML4Py REST Interface Oracle Autonomous Database Oracle Database SQL Interface

Slide 14

Slide 14 text

Python and SQL Interfaces Embedded Python Execution functions Copyright © 2021, Oracle and/or its affiliates 14 Python Interface Function SQL Interface Function Purpose oml.do_eval pyqEval Invoke stand-alone Python script oml.table_apply pyqTableEval Invoke Python script with full table input oml.row_apply pyqRowEval Invoke Python script one row at a time, or multiple rows in “chunks” oml.group_apply pyqGroupEval Invoke Python script on data indexed by grouping column oml.index_apply N/A (use pyqGroupEval) Invoke Python script N times

Slide 15

Slide 15 text

REST Interface Embedded Python Execution functions Copyright © 2021, Oracle and/or its affiliates 15 REST Endpoint Purpose /api/py-scripts/v1/do-eval/{scriptName} Invoke stand-alone Python script /api/py-scripts/v1/table-apply/{scriptName} Invoke Python script with full table input /api/py-scripts/v1/row-apply/{scriptName} Invoke Python script one row at a time, or multiple rows in “chunks” /api/py-scripts/v1/group-apply/{scriptName} Invoke Python script on data indexed by grouping column /api/py-scripts/v1/index-apply/{scriptName} Invoke Python script N times

Slide 16

Slide 16 text

Dependent on skill set and application requirements Python Interface • Develop user defined-functions • Create reproducible, shareable functions SQL and REST Interfaces • Invoke user-defined Python functions for deployment in applications – enables deployment to other platforms – increases project scalability • Schedule invocation using DBMS_SCHEDULER (SQL only) Which OML4Py Interface should I use for Embedded Python Execution? Copyright © 2021, Oracle and/or its affiliates 16

Slide 17

Slide 17 text

Script repository and datastore 17 Copyright © 2021, Oracle and/or its affiliates

Slide 18

Slide 18 text

• Table of Python scripts created by OML4Py user that can be saved and managed for use within Embedded Python Execution – Leverages security provided by Oracle Database – Python scripts are created securely • Creating Scripts – On premises, the PYQADMIN role is required for users who create and drop Python scripts – In ADB, OML_Developer role is required for users to create and drop Python scripts • Sharing Scripts – Scripts can be global or private (oml.script.create(..., is_global=True) – Functions oml.grant and oml.revoke can be used to grant or revoke the read privilege for a private Python script to other users – Only the user who created the Python script can modify and drop a private script – Every user has read access to a global script OML4Py Script Repository Copyright © 2021, Oracle and/or its affiliates 18

Slide 19

Slide 19 text

Python, SQL, and REST interfaces Functions for managing scripts Copyright © 2021, Oracle and/or its affiliates 19 Python Interface SQL Interface REST Endpoint Purpose oml.script.create pyqScriptCreate N/A Registers a single user- defined Python function in the script repository oml.script.dir USER_PYQ_SCRIPTS view /api/py-scripts/v1/scripts Lists the user-defined Python functions present in the script repository oml.script.drop pyqScriptDrop N/A Drops a user-defined Python function from the script repository oml.script.load N/A Scripts are automatically loaded in the REST API when the function is called Loads a user-defined Python function from the script repository into a Python session

Slide 20

Slide 20 text

• Repository in user schema allows storage and management of Python and OML4Py objects • Used for passing arguments to Python functions with embedded Python execution, especially non-scalar values for REST and SQL invocation, such as native Python ML models • Serialize and unserialize Python and OML4Py objects in a database table • Metadata stored in OML4Py's SYS schema, PYQSYS, and objects stored in user schema • Share objects with other data analysts and data scientists – Functions oml.grant and oml.revoke can be used to grant or revoke the read privilege to datastore instances – Objects can be shared as read-only with specific users or all users OML4Py Datastore Copyright © 2021, Oracle and/or its affiliates 20

Slide 21

Slide 21 text

Python and SQL interfaces Functions for managing datastore objects Copyright © 2021, Oracle and/or its affiliates 21 Python Interface SQL Interface Purpose oml.ds.save N/A Saves Python objects to a named datastore in the user’s database schema oml.ds.dir USER_PYQ_DATASTORES view Lists the datastores or datastores contents available to the current user oml.ds.describe ALL_PYQ_DATASTORE_CONTENTS view Shows the objects in the specified datastore oml.ds.delete N/A Deletes one or more datastores or Python objects from a datastore oml.ds.load N/A Loads Python objects from a datastore into the user’s session

Slide 22

Slide 22 text

Parallelism 22 Copyright © 2021, Oracle and/or its affiliates

Slide 23

Slide 23 text

Autonomous Database • Extends parallelism by enabling different service levels to manage the load on the system • Service levels control the degree of parallelism for jobs • Parallelism for service levels is LOW(2), MEDIUM(4), HIGH(8) – parallel=True corresponds to service level – parallel=x is limited by service level • For SQL, auto-scaling adds compute resources on demand - up to 3X for CPU and memory Blog: Machine Learning Performance on Autonomous Database https://blogs.oracle.com/machinelearning/post/machine-learning-performance-on-autonomous-database Parallelism - Embedded Python Execution Copyright © 2021, Oracle and/or its affiliates 23

Slide 24

Slide 24 text

Oracle Database Parallelism - Embedded Python Execution Python interface Copyright © 2021, Oracle and/or its affiliates 24 OML4Py embedded Python function “parallel” argument • Maps to DB parallel query hint • Supported by oml.group_apply, oml.row_apply, oml.index_apply • Values – positive integer >= 2 for a specific degree of parallelism – FALSE or 1 for no parallelism – TRUE takes on the ‘data’ argument's default parallelism – NULL for the database default for the operation; serial execution by default Table Parallelism Embedded Python “parallel” argument

Slide 25

Slide 25 text

Oracle Database Parallelism - Embedded Python Execution Python SQL interface Copyright © 2021, Oracle and/or its affiliates 25 Table Parallelism Parallel Query Hint Parallel query execution is prioritized by following DOP settings in order: • hint • SELECT /*+ parallel(4) */ COUNT(*) FROM IRIS; • table • ALTER TABLE IRIS PARALLEL 4;

Slide 26

Slide 26 text

Interfaces for Embedded Python Execution 26 Copyright © 2021, Oracle and/or its affiliates

Slide 27

Slide 27 text

spawns Python interface for Embedded Python Execution Example of parallel partitioned data flow using third party package # user-defined function using sklearn def score_mod(dat): import pandas as pd import oml obj_dict = oml.ds.load(name="ds_regr",to_globals=False) regr = obj_dict["regr"] pred = regr.predict(dat[['SEPAL_LENGTH]]) return pd.concat([dat[['SPECIES','PETAL_WIDTH']], pd.DataFrame(pred, columns=['Pred_PETAL_WIDTH'])], axis=1) # invoke function in parallel on IRIS table pred = oml.row_apply(IRIS, score_mod, rows=10, parallel=True, func_value=pd.DataFrame([('a', 1, 1)], columns=['SPECIES', 'PETAL_LENGTH','PRED_PETAL_LENGTH'])) OML4Py Python Engine OML4Py Python Engine OML4Py OML Notebooks Copyright © 2021 Oracle and/or its affiliates. REST Interface Oracle Autonomous Database User tables

Slide 28

Slide 28 text

spawns SQL interface for Embedded Python Execution Example of parallel partitioned data flow using third party package OML4Py Python Engine OML4Py Python Engine OML4Py OML Notebooks Copyright © 2021 Oracle and/or its affiliates. REST Interface Oracle Autonomous Database User tables SELECT * FROM table(pyqRowEval( IRIS, # table or view '{"oml_connect":1, # arguments "oml_input_type":"pandas.DataFrame", '{"SPECIES":"varchar2(10)", # return format "PETAL_LENGTH":"number", "Pred_PETAL_LENGTH":"number"}', 10, # rows per invocation 'score_mod')); # function name

Slide 29

Slide 29 text

REST Interface for Embedded Python Execution Copyright © 2021, Oracle and/or its affiliates 29 /oml/tenants//databases//api/py-scripts/v1/// py_scripts for executing user-defined functions (Python “scripts”) Cloud service URL Customer tenant name Name of pluggable database within ADB Name of script in repository do-eval table-apply group-apply index-apply row-apply Example of synchronous invocation from cURL $ curl -X POST --header "Authorization: Bearer ${token}" --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"input":"select * from IRIS", "parameters":"{\"oml_input_type\":\"pandas.DataFrame\"}", "rows":10, "parallelFlag":true, "service":“MEDIUM"}’ “$

Slide 30

Slide 30 text

Demo 30 Copyright © 2021, Oracle and/or its affiliates

Slide 31

Slide 31 text

Q & A Copyright © 2021, Oracle and/or its affiliates 31

Slide 32

Slide 32 text

Thank you [email protected] Copyright © 2021, Oracle and/or its affiliates. 32