Upgrade to Pro — share decks privately, control downloads, hide ads and more …

OML4Py features: Using third-party Python packages from Python, SQL and REST

OML4Py features: Using third-party Python packages from Python, SQL and REST

On this Office Hours for Oracle Machine Learning on Autonomous Database, Sherry LaMonica, Principal Member of Technical Staff at Oracle Machine Learning team explained how to use third-party Python packages from Python, SQL and REST via OML4Py.

So you want to deploy your Python scripts to production in the database environment? With Oracle Machine Learning for Python, you can invoke user-defined functions on database-spawned and controlled Python engines using embedded Python execution. An added benefit is the ability to easily leverage data-parallel and task-parallel invocation aided by the database environment. Join this session to learn about the options for working with third-party packages with OML4Py, both on premises and in Autonomous Database, from the Python, SQL and REST APIs.

The Oracle Machine Learning product family supports data scientists, analysts, developers, and IT to achieve data science project goals faster while taking full advantage of the Oracle platform.

Marcos Arancibia

October 05, 2021
Tweet

More Decks by Marcos Arancibia

Other Decks in Technology

Transcript

  1. OML feature highlight:
    Using third-party Python packages
    from Python, SQL, and REST
    OML Office Hours
    Sherry LaMonica
    Principal Member of Technical Staff, Oracle Machine Learning
    Supported by Marcos Arancibia and Mark Hornick
    Product Management, Oracle Machine Learning
    Move the Algorithms; Not the Data!
    Copyright © 2021, Oracle and/or its affiliates.
    This Session will
    be Recorded

    View Slide

  2. • Third-party packages
    • Introduction to Embedded Python Execution
    • Benefits and motivation
    • Embedded Python Execution interfaces
    • Demo
    • Q&A
    Goals for today
    Copyright © 2021, Oracle and/or its affiliates
    2

    View Slide

  3. Third-party Python packages
    3 Copyright © 2021, Oracle and/or its affiliates

    View Slide

  4. • A good source for state-of-the-art or new algorithms or functionality
    • Supports a wide range of functionality
    – Data transformation, statistical analysis, visualization, machine learning, etc.
    – Medical imaging, geospatial analysis, reproducible research
    • If a package is widely used it will quickly evolve into a robust offering through
    collaborative efforts
    • Not typically implemented for scalability and performance
    – Often focused on specialized topics and functionality rather than memory
    management or parallelism
    – Some packages quickly consume available memory
    Third-party packages
    4 Copyright © 2021, Oracle and/or its affiliates.

    View Slide

  5. • How big is the data?
    – Will data fit into Python memory?
    – Will the package work on my data volumes?
    • Is working directly in Python more performant?
    – For small data, working in open-source Python may provide adequate performance
    • Can the data be partitioned for processing in parallel?
    – Partition by group or row (data-parallel)
    – Run n times in parallel (task-parallel), e.g., simulations
    • Is there equivalent in-database functionality?
    – In-database machine learning are scalable and parallelized
    Should I use a third-party Python package on database data?
    5 Copyright © 2021, Oracle and/or its affiliates.

    View Slide

  6. There is no magic!
    • Performance gains are the result of running in data-parallel or task-parallel Python
    engines in the database environment
    • Using embedded Python does not make a Python package automatically scale
    – No automatic modification to third-party package code
    – No automation of scalable or parallel behavior for Python packages
    • Database environment may have more memory, CPU, or faster CPUs
    to increase scalability from a typical client machine
    Performance characteristics for open-source Python packages with
    Embedded Python Execution
    Copyright © 2021, Oracle and/or its affiliates
    6

    View Slide

  7. Oracle Autonomous Database
    User-installed packages
    currently not available
    OML4Py Dependencies
    cx_Oracle 8.1.0
    matplotlib 3.1.2
    numpy 1.18.1
    pandas 0.25.3
    scikit-learn 0.23.1
    scipy 1.6.1
    cycler 0.10.0
    joblib 0.14.0
    kiwisolver 1.1.0
    pyparsing 2.4.0
    python-dateutil 2.8.1
    pytz 2019.3
    six 1.13.0
    threadpoolctl 2.1.0
    Oracle Database
    User-installed and configured
    packages supported
    Which third-party packages can be used with OML4Py?
    Copyright © 2021, Oracle and/or its affiliates
    7

    View Slide

  8. Using pip
    • Python’s internal
    package manager
    • Installs Python
    packages from PyPI
    repository along with
    their dependencies
    • Often used with
    configuration file
    requirements.txt to
    install specified
    packages and versions
    $ pip install pandas
    $ pip install -r requirements.txt
    Package Management
    Copyright © 2021, Oracle and/or its affiliates.
    8

    View Slide

  9. Using virtualenv
    • Tool for creating
    isolated Python
    environments
    • Use different versions
    of a package for
    distinct projects
    • Organize groups of
    packages into different
    isolated environments
    $ pip install virtualenv
    Package Management
    Copyright © 2021, Oracle and/or its affiliates.
    9
    Python
    3.5
    Python
    3.9
    Pandas
    0.23.1
    Pandas
    0.25.3
    virtualenv

    View Slide

  10. Embedded Python Execution
    10 Copyright © 2021, Oracle and/or its affiliates

    View Slide

  11. • Develop user-defined Python functions and manage these in the database for production deployment
    – Invoked using REST endpoints on Oracle Autonomous Database
    – Invoked using a SQL interface on Oracle Database
    • Python objects can be stored in the database using the OML4Py Datastore
    – Eliminates the need to manage objects using flat files
    • Datastores work with embedded Python execution for passing objects to user-defined functions
    – Pass non-scalar objects (e.g., a model) as arguments to user-defined functions
    in SQL, REST, and Python embedded execution Interfaces
    – Enables sharing data science work products from data scientists
    with application developers for deployment
    • Run user-defined Python functions in a data-parallel and task-parallel manner
    – Supports “embarrassingly parallel” solutions
    – Enables scoring third-party Python models at scale
    • User-defined function results can contain both DataFrames and images
    • Supports SQL, Python, and REST Interfaces
    Embedded Python Execution: Introduction
    Copyright © 2021, Oracle and/or its affiliates
    11

    View Slide

  12. • Supplements OML4Py functionality through use of third-party packages in user-defined functions
    • Facilitates application use of Python script results
    – Develop/test Python scripts interactively with Python interface
    – Invoke Python scripts directly from SQL or REST for production applications
    o Python scripts stored in Oracle Database
    o Schedule Python scripts to run automatically via SQL
    • Potential for improved performance and throughput
    – Data-parallel and task-level parallelism
    – Leverage compute and memory resources of the database environment
    – Avoids the overhead of stacked SQL
    • Supports structured and image results
    • Retrieve results in various forms depending on application requirements: table, XML, PNG, JSON
    Embedded Python Execution: Benefits
    Copyright © 2021, Oracle and/or its affiliates
    12

    View Slide

  13. OML4Py Interfaces for Embedded Python Execution
    Autonomous Database
    • Python Interface (via OML Notebooks)
    • REST Interface
    Oracle Database
    • Python Interface
    • SQL Interface
    Copyright © 2021 Oracle and/or its affiliates.
    OML4Py
    REST Interface
    Oracle
    Autonomous
    Database
    Oracle
    Database
    SQL Interface

    View Slide

  14. Python and SQL Interfaces
    Embedded Python Execution functions
    Copyright © 2021, Oracle and/or its affiliates
    14
    Python Interface Function SQL Interface Function Purpose
    oml.do_eval pyqEval
    Invoke stand-alone Python script
    oml.table_apply pyqTableEval
    Invoke Python script with full table input
    oml.row_apply pyqRowEval
    Invoke Python script one row at a time,
    or multiple rows in “chunks”
    oml.group_apply pyqGroupEval
    Invoke Python script on data indexed
    by grouping column
    oml.index_apply N/A
    (use pyqGroupEval)
    Invoke Python script N times

    View Slide

  15. REST Interface
    Embedded Python Execution functions
    Copyright © 2021, Oracle and/or its affiliates
    15
    REST Endpoint Purpose
    /api/py-scripts/v1/do-eval/{scriptName}
    Invoke stand-alone Python script
    /api/py-scripts/v1/table-apply/{scriptName}
    Invoke Python script with full table input
    /api/py-scripts/v1/row-apply/{scriptName}
    Invoke Python script one row at a time,
    or multiple rows in “chunks”
    /api/py-scripts/v1/group-apply/{scriptName}
    Invoke Python script on data indexed
    by grouping column
    /api/py-scripts/v1/index-apply/{scriptName} Invoke Python script N times

    View Slide

  16. Dependent on skill set and application requirements
    Python Interface
    • Develop user defined-functions
    • Create reproducible, shareable functions
    SQL and REST Interfaces
    • Invoke user-defined Python functions for deployment in applications
    – enables deployment to other platforms
    – increases project scalability
    • Schedule invocation using DBMS_SCHEDULER (SQL only)
    Which OML4Py Interface should I use for Embedded Python Execution?
    Copyright © 2021, Oracle and/or its affiliates
    16

    View Slide

  17. Script repository and datastore
    17 Copyright © 2021, Oracle and/or its affiliates

    View Slide

  18. • Table of Python scripts created by OML4Py user that can be saved and managed
    for use within Embedded Python Execution
    – Leverages security provided by Oracle Database
    – Python scripts are created securely
    • Creating Scripts
    – On premises, the PYQADMIN role is required for users who create and drop Python scripts
    – In ADB, OML_Developer role is required for users to create and drop Python scripts
    • Sharing Scripts
    – Scripts can be global or private (oml.script.create(..., is_global=True)
    – Functions oml.grant and oml.revoke can be used to grant or revoke the read privilege
    for a private Python script to other users
    – Only the user who created the Python script can modify and drop a private script
    – Every user has read access to a global script
    OML4Py Script Repository
    Copyright © 2021, Oracle and/or its affiliates
    18

    View Slide

  19. Python, SQL, and REST interfaces
    Functions for managing scripts
    Copyright © 2021, Oracle and/or its affiliates
    19
    Python Interface SQL Interface REST Endpoint Purpose
    oml.script.create pyqScriptCreate N/A
    Registers a single user-
    defined Python function in
    the script repository
    oml.script.dir
    USER_PYQ_SCRIPTS
    view
    /api/py-scripts/v1/scripts
    Lists the user-defined
    Python functions present in
    the script repository
    oml.script.drop
    pyqScriptDrop
    N/A
    Drops a user-defined
    Python function from the
    script repository
    oml.script.load N/A
    Scripts are automatically
    loaded in the REST API
    when the function is called
    Loads a user-defined
    Python function from the
    script repository into a
    Python session

    View Slide

  20. • Repository in user schema allows storage and management of Python and OML4Py objects
    • Used for passing arguments to Python functions with embedded Python execution, especially
    non-scalar values for REST and SQL invocation, such as native Python ML models
    • Serialize and unserialize Python and OML4Py objects in a database table
    • Metadata stored in OML4Py's SYS schema, PYQSYS, and objects stored in user schema
    • Share objects with other data analysts and data scientists
    – Functions oml.grant and oml.revoke can be used to grant or revoke the read privilege
    to datastore instances
    – Objects can be shared as read-only with specific users or all users
    OML4Py Datastore
    Copyright © 2021, Oracle and/or its affiliates
    20

    View Slide

  21. Python and SQL interfaces
    Functions for managing datastore objects
    Copyright © 2021, Oracle and/or its affiliates
    21
    Python Interface SQL Interface Purpose
    oml.ds.save N/A
    Saves Python objects to a named datastore
    in the user’s database schema
    oml.ds.dir
    USER_PYQ_DATASTORES
    view
    Lists the datastores or datastores contents
    available to the current user
    oml.ds.describe
    ALL_PYQ_DATASTORE_CONTENTS
    view Shows the objects in the specified datastore
    oml.ds.delete
    N/A
    Deletes one or more datastores or Python
    objects from a datastore
    oml.ds.load
    N/A Loads Python objects from a datastore into
    the user’s session

    View Slide

  22. Parallelism
    22 Copyright © 2021, Oracle and/or its affiliates

    View Slide

  23. Autonomous Database
    • Extends parallelism by enabling different service levels to manage the load on the system
    • Service levels control the degree of parallelism for jobs
    • Parallelism for service levels is LOW(2), MEDIUM(4), HIGH(8)
    – parallel=True corresponds to service level
    – parallel=x is limited by service level
    • For SQL, auto-scaling adds compute resources on demand - up to 3X for CPU and memory
    Blog: Machine Learning Performance on Autonomous Database
    https://blogs.oracle.com/machinelearning/post/machine-learning-performance-on-autonomous-database
    Parallelism - Embedded Python Execution
    Copyright © 2021, Oracle and/or its affiliates
    23

    View Slide

  24. Oracle Database
    Parallelism - Embedded Python Execution Python interface
    Copyright © 2021, Oracle and/or its affiliates
    24
    OML4Py embedded Python function “parallel” argument
    • Maps to DB parallel query hint
    • Supported by oml.group_apply, oml.row_apply, oml.index_apply
    • Values
    – positive integer >= 2 for a specific degree of parallelism
    – FALSE or 1 for no parallelism
    – TRUE takes on the ‘data’ argument's default parallelism
    – NULL for the database default for the operation; serial execution by default
    Table Parallelism
    Embedded Python
    “parallel” argument

    View Slide

  25. Oracle Database
    Parallelism - Embedded Python Execution Python SQL interface
    Copyright © 2021, Oracle and/or its affiliates
    25
    Table Parallelism
    Parallel Query
    Hint
    Parallel query execution is prioritized by following DOP settings in order:
    • hint
    • SELECT /*+ parallel(4) */ COUNT(*) FROM IRIS;
    • table
    • ALTER TABLE IRIS PARALLEL 4;

    View Slide

  26. Interfaces for
    Embedded Python Execution
    26 Copyright © 2021, Oracle and/or its affiliates

    View Slide

  27. spawns
    Python interface for Embedded Python Execution
    Example of parallel partitioned data flow using third party package
    # user-defined function using sklearn
    def score_mod(dat):
    import pandas as pd
    import oml
    obj_dict = oml.ds.load(name="ds_regr",to_globals=False)
    regr = obj_dict["regr"]
    pred = regr.predict(dat[['SEPAL_LENGTH]])
    return
    pd.concat([dat[['SPECIES','PETAL_WIDTH']],
    pd.DataFrame(pred,
    columns=['Pred_PETAL_WIDTH'])],
    axis=1)
    # invoke function in parallel on IRIS table
    pred = oml.row_apply(IRIS,
    score_mod,
    rows=10,
    parallel=True,
    func_value=pd.DataFrame([('a', 1, 1)],
    columns=['SPECIES', 'PETAL_LENGTH','PRED_PETAL_LENGTH']))
    OML4Py
    Python Engine
    OML4Py
    Python Engine
    OML4Py
    OML Notebooks
    Copyright © 2021 Oracle and/or its affiliates.
    REST Interface
    Oracle Autonomous
    Database
    User tables

    View Slide

  28. spawns
    SQL interface for Embedded Python Execution
    Example of parallel partitioned data flow using third party package
    OML4Py
    Python Engine
    OML4Py
    Python Engine
    OML4Py
    OML Notebooks
    Copyright © 2021 Oracle and/or its affiliates.
    REST Interface
    Oracle Autonomous
    Database
    User tables
    SELECT *
    FROM table(pyqRowEval(
    IRIS, # table or view
    '{"oml_connect":1, # arguments
    "oml_input_type":"pandas.DataFrame",
    '{"SPECIES":"varchar2(10)", # return format
    "PETAL_LENGTH":"number",
    "Pred_PETAL_LENGTH":"number"}',
    10, # rows per invocation
    'score_mod')); # function name

    View Slide

  29. REST Interface for Embedded Python Execution
    Copyright © 2021, Oracle and/or its affiliates
    29
    /oml/tenants//databases//api/py-scripts/v1///
    py_scripts for executing user-defined functions (Python “scripts”)
    Cloud service
    URL
    Customer
    tenant
    name
    Name of
    pluggable
    database
    within ADB
    Name of
    script in
    repository
    do-eval
    table-apply
    group-apply
    index-apply
    row-apply
    Example of synchronous invocation from cURL
    $ curl -X POST --header "Authorization: Bearer ${token}" --header 'Content-Type: application/json' --header 'Accept: application/json' -d
    '{"input":"select * from IRIS", "parameters":"{\"oml_input_type\":\"pandas.DataFrame\"}", "rows":10, "parallelFlag":true, "service":“MEDIUM"}’
    “$Asynchronous invocation also available

    View Slide

  30. Demo
    30 Copyright © 2021, Oracle and/or its affiliates

    View Slide

  31. Q & A
    Copyright © 2021, Oracle and/or its affiliates
    31

    View Slide

  32. Thank you
    [email protected]
    Copyright © 2021, Oracle and/or its affiliates.
    32

    View Slide