How to Transform Research Oriented Code into Machine Learning APIs with Python

tetsuya0617
September 20, 2019

This is my talk at PyCon Taiwan 2019 🇹🇼 (https://tw.pycon.org/2019/en-us/events/schedule/)

Transcript

  1. How to Transform Research Oriented Code
    into Machine Learning APIs with Python
    Tetsuya (Jesse) Hirata
    @JesseTetsuya
    ————————————————————————————————————————————————————————————————————————————————
    Software Engineer at Classi, an EdTech company.
    I mostly work in both data science and engineering.

  2. Background and Purpose
    - Recently, Python Engineers have more opportunities to
    work with data scientists and researchers than before.
    - Understanding the processes to develop ML APIs can
    help make AI/ML projects work more smoothly

  3. Steps to transform Research Oriented Code into ML APIs
    Research Oriented Code → ML APIs
    1. Understand
    2. Modularize
    3. Refactor
    4. Check

  4. Steps to transform Research Oriented Code into ML APIs
    Research Oriented Code → ML APIs
    Understand
    - What is Research Oriented Code?
    - What are ML APIs?
    - How should engineers handle research oriented code?

  5. Definition
    Research oriented code in AI/ML projects is
    the code written mainly by data scientists or
    researchers for figuring out the most
    efficient and suitable machine learning
    model.

  6. Machine Learning APIs are composed of three elements
    1. Preparation code for accessing data
    2. Pre-processing code
    3. Machine learning (ML) code
    Production code (Engineers)
    Research oriented code (Data Scientists/Researchers)
    Research oriented code is developed through an iterative process and integrated into production code.

  7. Data Pre-Processing code
    - Can be traced visually from top to bottom
    - Easy and quick to write

  8. ML code (part of the whole code)
    Input data is easy to handle and output data is easy to trace with DataFrames

  9. Refactor both kinds of code in a Pythonic way
    This code builds the model in a much faster and simpler way

  10. Three Differences between Research Oriented Code and Production Code
    Scopes
    - Research oriented code: preparation code, preprocessing code, ML code
    - Production code: preparation code, preprocessing code, ML code
    Characteristics of Coding Style
    - Research oriented code: easily handled, visually traceable
    - Production code: high calculation speed, high readability, testable and modular
    Objectives of Coding Style
    - Research oriented code: finding the most efficient and suitable machine learning model
    - Production code: making the code work on the server correctly and reliably

  11. What are Python Engineers supposed
    to do for Research Oriented Code?

  12. Three Differences between Research Oriented Code and Production Code
    Scopes
    - Research oriented code: preprocessing code, machine learning code
    - Production code: preparation code, preprocessing code, ML code
    Characteristics of Coding Style
    - Research oriented code: easily handled, visually traceable
    - Production code: high calculation speed, high readability, testable and modular
    Objectives of Coding Style
    - Research oriented code: finding the most efficient and suitable machine learning model
    - Production code: making the code work on the server correctly and reliably
    → Modularize, Refactor, Check

  13. Steps to Transform Research Oriented Code into ML APIs
    Research Oriented Code → ML APIs
    Modularize
    1. Categorize research oriented code into preparation code, preprocessing code, and ML code
    2. Break them out into functions and make them testable
    3. Clarify input and output of the code, and define URI

  14. 2.1. Categorize research oriented code into preparation code, preprocessing code, ML code
    This is a page of research oriented code written in a Jupyter notebook.
    The code is procedural, and parts of it are not yet categorized.
    The research oriented code seems to be tightly coupled.

  15. 2.1. Categorize research oriented code into preparation code, preprocessing code, ML code
    - Find the code that loads input data or accesses a database → preparation code
    - Find the code that makes, replaces, filters, or deletes input data → preprocessing code
    - Find the code that runs calculations or trains on data → ML code

  16. 2.2. Break them out into functions and make them testable
    Module name and functions:
    Preparation code (preparation.py)
    - Access BigQuery, execute the query, and load input data
    - Rename columns
    Preprocessing code (preprocessing.py)
    - Replace categorical data with discrete numbers
    - Filter input data
    ML code (prediction.py)
    - Calculate ICC parameters, logistic regression, and item response theory (IRT)
    The research oriented code became loosely coupled
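
A rough sketch of what this split can look like once each category is a small function. The function names, the sample rows, and the two-parameter logistic curve are assumptions made for illustration; the talk's real modules read from BigQuery and use their own model.

```python
import math

# Hypothetical stand-ins for preparation.py, preprocessing.py and prediction.py.
# An in-memory list replaces the BigQuery load so the sketch runs anywhere.


def load_answers():
    """Preparation: return raw rows as [student, item, answer]."""
    return [
        ["s1", "q1", "correct"],
        ["s1", "q2", "wrong"],
        ["s2", "q1", None],
        ["s2", "q2", "correct"],
    ]


ANSWER_CODES = {"correct": 1, "wrong": 0}


def encode_answers(rows):
    """Preprocessing: drop rows without an answer, map categories to numbers."""
    return [
        [student, item, ANSWER_CODES[answer]]
        for student, item, answer in rows
        if answer is not None
    ]


def correct_probability(ability, difficulty, discrimination=1.0):
    """ML code: a two-parameter logistic curve, used here only as an example."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


encoded = encode_answers(load_answers())
```

Because each function takes plain values and returns plain values, each one can be tested without the others.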

  17. 2.3. Clarify input and output of the whole code and define URI
    INPUT: results of student answers
    OUTPUT: probabilities to answer questions correctly
    (*item means a question)
    app.py
    @app.route("/v1/probabilities", methods=['GET'])  # URI: noun
    def probabilities():  # the same name as the endpoint
        # helper name: verb (+ noun), e.g. get_probs() or calc_results()
        return get_probs(), 200
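
A minimal, runnable sketch of this naming convention, assuming Flask is installed. `get_probs` and the values it returns are placeholders, not the talk's implementation.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def get_probs():
    """Hypothetical helper (verb + noun); returns fixed values for illustration."""
    return {"q1": 0.82, "q2": 0.47}


# The URI is a noun, and the view function shares the endpoint name.
@app.route("/v1/probabilities", methods=["GET"])
def probabilities():
    return jsonify(get_probs()), 200


# Exercise the endpoint without starting a server.
response = app.test_client().get("/v1/probabilities")
```

Using the test client keeps the input and output of the endpoint visible in one place, which is handy when agreeing on them with data scientists.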

  18. Steps to transform Research Oriented Code into ML APIs
    Research Oriented Code → ML APIs
    Refactor
    1. Prepare for refactoring
    2. Simplify I/O in preparation code
    3. Pandas → Python in preprocessing code

  19. 3.1 Prepare for refactoring
    Narrow down the requirements of each piece of code by writing test code, and record those requirements in comments or docstrings before refactoring (or ask the data scientist to write the comments in advance).
    .
    ├── ml_api
    │   ├── api
    │   │   ├── app.py
    │   │   ├── config
    │   │   ├── prediction.py
    │   │   ├── preparation.py
    │   │   ├── preprocessing.py
    │   ├── requirements.txt
    │   ├── run.py
    │   └── tests
    │       ├── test_app.py
    │       ├── test_prediction.py
    │       ├── test_preparation.py
    │       └── test_preprocessing.py
    └── setup.py
    Comments or docstrings (reStructuredText style / NumPy style / Google style)
    ex) Google style
    def func(arg1, arg2):
        """Summary line.
        Extended description of function.
        Args:
            arg1 (int): Description of arg1
            arg2 (str): Description of arg2
        Returns:
            bool: Description of return value
        """
        return True
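
As an illustration of pinning requirements down with tests, here is a small sketch in pytest style. `filter_answered` and its row layout are hypothetical; the point is that each test records one requirement before any refactoring starts.

```python
def filter_answered(rows):
    """Keep only rows that contain an answer.

    Args:
        rows (list[list]): Rows of [student_id, item_id, answer].

    Returns:
        list[list]: Rows whose answer is not None, in the original order.
    """
    return [row for row in rows if row[2] is not None]


# pytest-style tests: each one records a requirement agreed with the data scientist.
def test_drops_rows_without_answer():
    rows = [["s1", "q1", 1], ["s2", "q1", None]]
    assert filter_answered(rows) == [["s1", "q1", 1]]


def test_keeps_zero_as_a_valid_answer():
    # 0 means "wrong", not "missing", so it must survive the filter.
    assert filter_answered([["s1", "q2", 0]]) == [["s1", "q2", 0]]


def test_empty_input_gives_empty_output():
    assert filter_answered([]) == []
```

With these in `tests/test_preprocessing.py`, the function body can be rewritten freely as long as the tests keep passing.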

  20. 3.2 Simplify I/O in preparation code
    CASE STUDY:
    Refactoring the code that accesses BigQuery and GCS by using the Google Cloud client libraries for Python

  21. 3.2. Simplify I/O in preparation code
    ex) BigQuery with Python
    Code A
    from google.cloud import bigquery
    client = bigquery.Client()
    query = "SELECT * FROM `data set name`"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]
    OUTPUT: Two Dimensional Arrays
    Code B
    from google.cloud import bigquery
    client = bigquery.Client()
    query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]
    OUTPUT: Two Dimensional Arrays + Filter Values + Drop Null
    → Preprocess the data in the query as much as possible
    → It is faster and cheaper than preprocessing the data in Python
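
The same idea can be tried locally. In this sketch `sqlite3` stands in for BigQuery (an assumption made so the code runs without credentials), and the table and column names are invented. It shows that filtering in the query gives the same rows as filtering afterwards in Python.

```python
import sqlite3

# sqlite3 stands in for BigQuery so the sketch runs locally; the table and
# column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (column_1 TEXT, column_2 INTEGER)")
conn.executemany(
    "INSERT INTO answers VALUES (?, ?)",
    [("a", 1), (None, 2), ("c", 3)],
)

# Code A style: fetch everything, then drop nulls in Python.
all_rows = [
    list(row) for row in conn.execute("SELECT * FROM answers ORDER BY column_2")
]
filtered_in_python = [row for row in all_rows if row[0] is not None]

# Code B style: let the query select the columns and drop the nulls.
filtered_in_query = [
    list(row)
    for row in conn.execute(
        "SELECT column_1, column_2 FROM answers "
        "WHERE column_1 IS NOT NULL ORDER BY column_2"
    )
]
conn.close()
```

The Code B style also transfers fewer rows, which is where the speed and cost difference on a real warehouse comes from.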

  22. 3.2 Simplify I/O in preparation code
    ex) Google Cloud Storage with Python
    Make a bytes object and upload it from memory to GCS with Python
    import csv
    import gzip
    import io
    from google.cloud import storage
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('bucket name')
    # Write the rows to an in-memory CSV string
    with io.StringIO() as csv_obj:
        writer = csv.writer(csv_obj, quotechar='"',
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(two_dimensional_arrays)
        result = csv_obj.getvalue()
    # Compress the CSV in memory, then upload the buffer while it is still open
    with io.BytesIO() as gzip_obj:
        with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
            bytes_f = result.encode()
            gzip_file.write(bytes_f)
        blob = bucket.blob('storage_path')
        blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')
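
The in-memory part of this step can be checked without a bucket. The sketch below is an assumption-level simplification: it uses `gzip.compress` in place of `GzipFile`, skips the upload entirely, and `rows_to_csv_gzip` is a hypothetical helper name.

```python
import csv
import gzip
import io


def rows_to_csv_gzip(rows):
    """Serialize rows to gzip-compressed CSV bytes without touching disk."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        text = csv_obj.getvalue()
    return gzip.compress(text.encode("utf-8"))


payload = rows_to_csv_gzip([["s1", "q1", 1], ["s2", "q2", 0]])

# Round trip: decompress to confirm the bytes hold the expected CSV.
restored = gzip.decompress(payload).decode("utf-8")
```

Separating serialization from upload means the serialization can be unit tested, and only a thin upload call needs a real or mocked client.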

  23. 3.2. Simplify I/O more by using a wrapper
    gcp-accessor (wrapper library)
    (https://pypi.org/project/gcp-accessor/)
    import gcp_accessor
    bq = gcp_accessor.BigQueryAccessor()
    query = "SELECT * FROM `data set name`"
    bq.execute_query(query)
    gcs = gcp_accessor.GoogleCloudStorageAccessor()
    gcs.upload_csv_gzip(
        'bucket name',
        'full path on gcs',
        'input data')
    google-cloud-bigquery
    from google.cloud import bigquery
    client = bigquery.Client()
    query = "SELECT * FROM `data set name`"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]
    google-cloud-storage
    import csv
    import gzip
    import io
    from google.cloud import storage
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('bucket name')
    with io.StringIO() as csv_obj:
        writer = csv.writer(csv_obj, quotechar='"',
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(two_dimensional_arrays)
        result = csv_obj.getvalue()
    with io.BytesIO() as gzip_obj:
        with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
            bytes_f = result.encode()
            gzip_file.write(bytes_f)
        blob = bucket.blob(storage_path)
        blob.upload_from_file(gzip_obj, rewind=True,
                              content_type='application/gzip')

  24. 3.3. Pandas → Python in preprocessing code
    All data in the API is processed using the same data type.
    This prioritizes readability and maintainability over processing speed.

  25. 3.3. Pandas → Python in preprocessing code
    One day, I wondered why I was struggling so much to refactor preprocessing code in research oriented code that I had written just the week before.

  26. 3.3. Pandas → Python in preprocessing code
    Preprocessing functions by code style (Pandas / Python)
    Filter
    - Pandas: dataframe.where() / dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
    - Python: if - else + for + .append(), [[v1, v2, v3] for value in values]
    Replace
    - Pandas: dataframe.fillna(); dic = {"key1": value1, "key2": value, …}; dataframe['column1'].replace(dic, inplace=True)
    - Python: dic = {"key1": value1, "key2": value, …}; [[dic.get(v, v) for v in value] for value in values]
    De-duplicate / Be unique
    - Pandas: duplicated() / drop_duplicates(), dataframe['column1'].unique() (output: array([v1, v2, v3]))
    - Python: set(list), list({v1, v2, v2, …}), list({value[0] for value in values})
    Delete/Drop
    - Pandas: dataframe.dropna(), dataframe.drop(), dataframe.drop(index=index list)
    - Python: if - else + for + .append(), [[v1, v2, v3] for value in values]
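
To make the Python column concrete, here is a small sketch over a two-dimensional list. The sample rows and the mapping are invented for illustration.

```python
values = [
    ["s1", "q1", "correct"],
    ["s1", "q2", "wrong"],
    ["s2", "q1", None],
    ["s2", "q2", "correct"],
]

# Delete/Drop: the counterpart of dataframe.dropna().
dropped = [row for row in values if None not in row]

# Filter: keep only the rows for one student.
filtered = [row for row in dropped if row[0] == "s1"]

# Replace: map categories to numbers, leaving unknown values untouched.
dic = {"correct": 1, "wrong": 0}
replaced = [[dic.get(v, v) for v in row] for row in dropped]

# De-duplicate: sorted() gives a stable order, which a bare set does not.
students = sorted({row[0] for row in dropped})
```

Every step takes a list of lists and returns a list of lists (or a flat list), so the data type stays the same through the whole preprocessing chain.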

  27. Steps to transform Research Oriented Code into ML APIs
    Research Oriented Code → ML APIs
    Check
    1. Write decorators to check parameters
    2. Set up production-like environments

  28. 4.1. Write decorators to check parameters
    Image of Decorators in APIs
    Client → Request → decorators → URIs → preparation, preprocessing, calculation
    Decorators:
    - Error handling
    - Request parameter check
    - Access token check
  30. 4.1. Write decorators to check parameters
    ex) Request parameter check with JSON Schema
    make_name_grade.json
    {
      "$schema": "http://json-schema.org/draft-04/schema#",
      "type": "object",
      "properties": {
        "student_name": {"type": "string"},
        "student_grade": {"type": "string", "minLength": 1, "maxLength": 120}
      },
      "required": ["student_name", "student_grade"]
    }
    request curl command
    curl http://localhost:5000/ -X POST -H "Content-Type: application/json" -d '{"student_name": "test_name", "student_grade": "fourth-grade"}'

  31. 4.1. Write decorators to check parameters
    ex) Request parameter check with JSON Schema
    json_validate.py
    from functools import wraps
    from flask import current_app, jsonify, request
    from jsonschema import ValidationError, validate
    from werkzeug.exceptions import BadRequest
    def validate_json(f):
        @wraps(f)
        def wrapper(*args, **kw):
            try:
                request.json  # raises BadRequest if the body is not valid JSON
            except BadRequest:
                msg = "This is an invalid json"
                return jsonify({"error": msg}), 400
            return f(*args, **kw)
        return wrapper
    def validate_schema(schema_name):
        def decorator(f):
            @wraps(f)
            def wrapper(*args, **kw):
                try:
                    # The schema is looked up in the Flask app config by name
                    validate(request.json,
                             current_app.config[schema_name])
                except ValidationError as e:
                    return jsonify({"error": e.message}), 400
                return f(*args, **kw)
            return wrapper
        return decorator
    app.py
    import json
    @app.route('/', methods=['POST'])
    @validate_json
    @validate_schema('make_name_grade')
    def index():
        if request.method == 'POST':
            data = json.loads(request.data)
            print(data["student_name"])
            print(data["student_grade"])
            return "Hi! " + data["student_name"]
        else:
            return "Hi!"
    The code of json_validate.py is cited from:
    https://stackoverflow.com/questions/24238743/flask-decorator-to-verify-json-and-json-schema

  32. 4.2. Set up production-like environments with Flask
    - Automate Continuous Integration
    - Visualize data (Load Test)
    - Deploy on GCP
    Tools shown: Flask App Builder, Dash

  33. Resources
    Python Tools that I mentioned in this talk
    Locust: https://www.youtube.com/watch?v=XQ4hrbgVysk (PyCon Korea 2015)
    Refactoring: https://www.youtube.com/watch?v=D_6ybDcU5gc (PyCon US 2016)
    Pytest: https://www.youtube.com/watch?v=G-MAMrJ-CSA (PyCon US 2019)
    Flask workshop: https://www.youtube.com/watch?v=DIcpEg77gdE (PyCon US 2015)
    Dash: https://www.youtube.com/watch?v=WLbQYFZc-YY (PyCon JP 2019)
    Python Packages that I mentioned in this talk
    google-cloud-bigquery: https://pypi.org/project/google-cloud-bigquery/
    google-cloud-storage: https://pypi.org/project/google-cloud-storage/
    gcp-accessor: https://pypi.org/project/gcp-accessor/0.0.1/
    Flask-AppBuilder: https://flask-appbuilder.readthedocs.io/en/latest/

  34. Summary
    Research Oriented Code → ML APIs
    Understand
    - What is Research Oriented Code?
    - What are ML APIs?
    - How should engineers handle research oriented code?
    Modularize
    - Categorize research oriented code into preparation code, preprocessing code, ML code
    - Break them out into functions and make them testable
    - Clarify input and output of the code, and define URI
    Refactor
    - Prepare for refactoring
    - Simplify I/O in preparation code
    - Pandas → Python in preprocessing code
    Check
    - Write decorators to check parameters
    - Set up production-like environments

  35. Tetsuya (Jesse) Hirata
    @JesseTetsuya
    ————————————————————————————————————————————————————————————————————————————————
    Software Engineer at Classi, an EdTech company.
    I mostly work in both data science and engineering.
    If you are interested in how I refactor, in the EdTech domain, or in what our team is doing, feel free to talk to me later!
