How to Transform Research Oriented Code into Machine Learning APIs with Python

Slide 1

Slide 1 text

How to Transform Research Oriented Code into Machine Learning APIs with Python

@JesseTetsuya
Software Engineer at an IT company specializing in education technology, based in Tokyo. I mostly work in both data science and engineering.

PyCon 2020 talk

Slide 2

Slide 2 text

Background and Purpose

- Recently, Python engineers have had more opportunities to work with data scientists and researchers than before.
- Understanding the process of developing ML APIs can help AI / ML projects run more smoothly.

Slide 3

Slide 3 text

Premise 1: Educational Technologies

Learning Management System (LMS) / Online Learning Platform

Learning Contents / Functions
- Online Quiz
- Video Lesson
- Discussion Forum
- Contents Box
- Information View / Notifications
- Peer and self assessment
- Integrated Badges
- Personalized Dashboard
- Recommender System
- Intelligent Tutoring System
- Authoring Tools
etc.

Slide 4

Slide 4 text

Premise 2: A Development Cycle in AI / ML Projects

Learning log data from the database of the LMS / online learning platform → Research Oriented Code → ML APIs → LMS / online learning platform

- Data Scientists: analyze learning log data and find optimal ML models / algorithms. Period: 2 - 3 months.
- Engineers: transform research oriented code into ML APIs and integrate the ML APIs into the LMS / online learning platform. Period: between 2 weeks and 2 months.

Slide 5

Slide 5 text

Steps to transform Research Oriented Code into ML APIs

Research Oriented Code → ML APIs

1. Understand
2. Modularize
3. Refactor
4. Check

Slide 6

Slide 6 text

Steps to transform Research Oriented Code into ML APIs

Research Oriented Code → ML APIs

1. Understand
- What is Research Oriented Code?
- What are ML APIs?
- How should engineers handle research oriented code?

Slide 7

Slide 7 text

Definition

Research oriented code in AI/ML projects is code written mainly by data scientists or researchers to find the most efficient and suitable machine learning model.

Slide 8

Slide 8 text

Machine Learning APIs are composed of three elements

1. Preparation code for accessing data
2. Pre-processing code
3. Machine learning (ML) code

Production code (Engineers)
Research oriented code (Data Scientists / Researchers)

Research oriented code is developed through an iterative process and integrated into production code.

Slide 9

Slide 9 text

Data pre-processing code

- Can be traced visually from top to bottom
- Easy and quick to write

Slide 10

Slide 10 text

ML code (part of the whole code)

- Input data is easy to handle and output data is easy to trace with a data frame

Slide 11

Slide 11 text

Refactor both pieces of code in a Pythonic way

- The refactored code builds the model in a much faster and simpler way

Slide 12

Slide 12 text

Three Differences between Research Oriented Code and Production Code

Scopes
- Research oriented code: preparation code, preprocessing code, ML code
- Production code: preparation code, preprocessing code, ML code

Characteristics of coding style
- Research oriented code: easily handled, visually traceable
- Production code: high calculation speed, high readability, testable and modular

Objectives of coding style
- Research oriented code: finding the most efficient and suitable machine learning model
- Production code: making the code work on the server correctly and reliably

Slide 13

Slide 13 text

What are Python Engineers supposed to do for Research Oriented Code?

Slide 14

Slide 14 text

Three Differences between Research Oriented Code and Production Code

Scopes
- Research oriented code: preprocessing code, machine learning code
- Production code: preparation code, preprocessing code, ML code

Characteristics of coding style
- Research oriented code: easily handled, visually traceable
- Production code: high calculation speed, high readability, testable and modular

Objectives of coding style
- Research oriented code: finding the most efficient and suitable machine learning model
- Production code: making the code work on the server correctly and reliably

Steps for engineers: Modularize, Refactor, Check

Slide 15

Slide 15 text

Steps to Transform Research Oriented Code into ML APIs

Research Oriented Code → ML APIs

2. Modularize
- 2.1. Categorize research oriented code into preparation code, preprocessing code, and ML code
- 2.2. Break them out into functions and make them testable
- 2.3. Clarify input and output of the code, and define URI

Slide 16

Slide 16 text

2.1. Categorize research oriented code into preparation code, preprocessing code, ML code

This is a page of research oriented code written in a Jupyter notebook. The code is procedural, and some parts of it are not categorized. The research oriented code appears to be tightly coupled.

Slide 17

Slide 17 text

2.1. Categorize research oriented code into preparation code, preprocessing code, ML code

- Find the code that loads input data or accesses a database → preparation code
- Find the code that makes, replaces, filters, or deletes input data → preprocessing code
- Find the code that executes calculations or trains on data → ML code

Slide 18

Slide 18 text

2.2. Break them out into functions and make them testable

Preparation code
- Module name: preparation.py
- Functions: access BigQuery, execute query, and load input data; rename columns

Preprocessing code
- Module name: preprocessing.py
- Functions: replace categorical data with discrete numbers; filter input data

ML code
- Module name: prediction.py
- Functions: calculate ICC parameters, logistic regression, and item response theory (IRT)

The research oriented code became loosely coupled.
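The split into three modules can be sketched as three small functions. This is a minimal illustration, not the speaker's code: the function names, the sample rows, and the smoothed-average "model" are all assumptions standing in for the BigQuery access and the IRT / logistic regression calculation named on the slide.

```python
def load_answers():
    """Preparation: hypothetical stand-in for the query that loads input data."""
    return [
        ["s1", "q1", "correct"],
        ["s1", "q2", "wrong"],
        ["s2", "q1", "correct"],
        ["s2", "q2", None],
    ]


def preprocess(rows):
    """Preprocessing: drop rows with missing answers, map labels to 0/1."""
    mapping = {"correct": 1, "wrong": 0}
    return [[s, q, mapping[a]] for s, q, a in rows if a is not None]


def predict(rows):
    """ML code: placeholder model, a smoothed share of correct answers per item."""
    totals = {}
    for _, item, score in rows:
        correct, count = totals.get(item, (0, 0))
        totals[item] = (correct + score, count + 1)
    return {item: (c + 1) / (n + 2) for item, (c, n) in totals.items()}


# Each stage takes and returns plain lists, so each can be tested on its own.
probabilities = predict(preprocess(load_answers()))
```

Because every stage has an explicit input and output, a test can call `preprocess` or `predict` directly with a few hand-written rows instead of running the whole notebook.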

Slide 19

Slide 19 text

2.3. Clarify input and output of the whole code and define URI

INPUT: results of student answers
OUTPUT: probabilities of answering questions correctly
*item means a question

app.py

    @app.route("/v1/probabilities", methods=['GET'])  # ← noun
    def probabilities():                               # ← the same endpoint name
        return calc_results(), 200                     # ← verb (+ noun), e.g. calc_results() or get_probs()

Slide 20

Slide 20 text

Steps to transform Research Oriented Code into ML APIs

Research Oriented Code → ML APIs

3. Refactor
- 3.1. Prepare for refactoring
- 3.2. Simplify I/O in preparation code
- 3.3. Pandas → Python in preprocessing code

Slide 21

Slide 21 text

The code to prepare and preprocess data tends to be a 'Big Ball of Mud' [1]

- Redundant
- Repetitive

[1] Foote, Brian; Yoder, Joseph (26 June 1999). "Big Ball of Mud". laputan.org. Retrieved 14 April 2019.

Slide 22

Slide 22 text

3.1 Prepare for refactoring

Narrow down the requirements of each piece of code by writing test code, and note those requirements in comments before refactoring (or ask the data scientist to write the comments in advance).

    .
    ├── ml_api
    │   ├── api
    │   │   ├── app.py
    │   │   ├── config
    │   │   ├── prediction.py
    │   │   ├── preparation.py
    │   │   └── preprocessing.py
    │   ├── requirements.txt
    │   ├── run.py
    │   └── tests
    │       ├── test_app.py
    │       ├── test_prediction.py
    │       ├── test_preparation.py
    │       └── test_preprocessing.py
    └── setup.py

Use # comments or docstrings (reStructuredText style / NumPy style / Google style).

ex) Google Style

    def func(arg1, arg2):
        """Summary line.

        Extended description of function.

        Args:
            arg1 (int): Description of arg1
            arg2 (str): Description of arg2

        Returns:
            bool: Description of return value
        """
        return True
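As a sketch of what "narrowing down requirements with test code" might look like, the example below pairs a small preprocessing function with a test that pins down two requirements. The function name `replace_categories` and its behavior are hypothetical, chosen only to illustrate the idea; in the layout above such a test would live in something like tests/test_preprocessing.py.

```python
def replace_categories(rows, mapping):
    """Replace categorical values with discrete numbers.

    Args:
        rows (list[list]): Two-dimensional array of input values.
        mapping (dict): Category label to number.

    Returns:
        list[list]: New rows with mapped values; unknown values are kept.
    """
    return [[mapping.get(value, value) for value in row] for row in rows]


def test_replace_categories():
    rows = [["a", "yes"], ["b", "no"]]
    # Requirement 1: known labels are replaced, other values pass through.
    assert replace_categories(rows, {"yes": 1, "no": 0}) == [["a", 1], ["b", 0]]
    # Requirement 2: the input list is not modified in place.
    assert rows == [["a", "yes"], ["b", "no"]]


test_replace_categories()
```

With pytest the final call is unnecessary, since the runner collects `test_*` functions itself; it is included here so the sketch runs as a plain script.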

Slide 23

Slide 23 text

3.2 Simplify I/O in redundant preparation code

CASE STUDY: Refactoring the redundant code that accesses BigQuery and GCS by using the Google Cloud client libraries for Python

Slide 24

Slide 24 text

3.2. Simplify I/O in preparation code
ex) BigQuery with Python

Code A
OUTPUT: Two Dimensional Arrays

    from google.cloud import bigquery

    client = bigquery.Client()
    query = "SELECT * FROM `data set name`"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]

Code B
OUTPUT: Two Dimensional Arrays + Filter Values + Drop Null

    from google.cloud import bigquery

    client = bigquery.Client()
    # Select only the needed columns and drop NULLs inside the query.
    query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]

→ Preprocess the data with the query as much as possible.
→ It is faster and lower-cost than preprocessing the data with Python.
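The same idea can be tried locally without a BigQuery project. The sketch below swaps in the standard library's sqlite3 with an in-memory table (the table name `answers` and its rows are made up for illustration) and compares filtering after a SELECT * with filtering inside the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (column_1 TEXT, column_2 INTEGER, column_3 INTEGER)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [("s1", 1, 0), (None, 1, 1), ("s2", 0, 1)],
)

# SELECT * style: fetch every row, then drop NULLs in Python.
all_rows = [list(row) for row in conn.execute("SELECT * FROM answers")]
filtered_in_python = [row for row in all_rows if row[0] is not None]

# Filtered-query style: NULL rows never leave the database.
filtered_in_sql = [
    list(row)
    for row in conn.execute(
        "SELECT column_1, column_2, column_3 FROM answers "
        "WHERE column_1 IS NOT NULL"
    )
]
conn.close()
```

Both approaches end with the same two rows, but the second transfers less data and leaves less preprocessing code to maintain on the Python side.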

Slide 25

Slide 25 text

3.2 Simplify I/O in preparation code
ex) Google Cloud Storage with Python

Make a bytes object and upload it from memory to GCS with Python.

    import csv
    import gzip
    import io

    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket('bucket name')

    # Write the rows to an in-memory CSV string.
    with io.StringIO() as csv_obj:
        writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(two_dimensional_arrays)
        result = csv_obj.getvalue()

    # Gzip the CSV in memory, then upload the buffer while it is still open.
    with io.BytesIO() as gzip_obj:
        with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
            bytes_f = result.encode()
            gzip_file.write(bytes_f)
        blob = bucket.blob('storage_path')
        blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')
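The in-memory part of this step can be checked without any cloud access. The sketch below is an assumption-labeled variant: the helper name `rows_to_csv_gzip` is hypothetical, and it uses `gzip.compress` as a shorter stand-in for the `GzipFile` plus `BytesIO` pair. It stops before the upload and instead decompresses the bytes to confirm the round trip.

```python
import csv
import gzip
import io


def rows_to_csv_gzip(rows):
    """Serialize rows to CSV text and gzip it, all in memory."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        text = csv_obj.getvalue()
    return gzip.compress(text.encode("utf-8"))


payload = rows_to_csv_gzip([["s1", "q1", 1], ["s2", "q2", 0]])

# Round trip: decompress and inspect what would be uploaded.
restored = gzip.decompress(payload).decode("utf-8")
```

Keeping the serialization in its own function means the test suite can verify the file contents without mocking the storage client; only the thin upload call then depends on GCS.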

Slide 26

Slide 26 text

3.2. Simplify I/O more by using a wrapper

Before: google-cloud-bigquery and google-cloud-storage, using the BigQuery and GCS code from the previous two slides.

After: gcp-accessor (wrapper library) (https://pypi.org/project/gcp-accessor/)

    import gcp_accessor

    bq = gcp_accessor.BigQueryAccessor()
    query = "SELECT * FROM `data set name`"
    bq.execute_query(query)

    gcs = gcp_accessor.GoogleCloudStorageAccessor()
    gcs.upload_csv_gzip('bucket name', 'full path on gcs', 'input data')

Slide 27

Slide 27 text

3.3. Pandas → Python in preprocessing code

All data in the API is processed using the same data type. This favors readability and maintainability over the speed of writing code.

Slide 28

Slide 28 text

3.3. Pandas → Python in preprocessing code

One day, I wondered why I was struggling so much to refactor repetitive preprocessing code in research oriented code that I had written only the week before.

Slide 29

Slide 29 text

3.3. Pandas → Python in preprocessing code

Preprocessing functions by code style

Filter
- Pandas: dataframe.where(), dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
- Python: if - else + for + .append(); [[v1, v2, v3] for value in values]

Replace
- Pandas: dataframe.fillna(); dic = {"key1": value1, "key2": value, …}; dataframe['column1'].replace(dic, inplace=True)
- Python: dic = {"key1": value1, "key2": value, …}; [[dic.get(v, v) for v in value] for value in values]

De-duplicate / Be unique
- Pandas: duplicated() / drop_duplicates(); dataframe['column1'].unique() (output: array([v1, v2, v3]))
- Python: set(list); list({v1, v2, v2, …}); list({value[0] for value in values})

Delete / Drop
- Pandas: dataframe.dropna(); dataframe.drop(); dataframe.drop(index=index list)
- Python: if - else + for + .append(); [[v1, v2, v3] for value in values]
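The Python column of this comparison can be made concrete with a small two-dimensional list. The sample rows and variable names below are invented for illustration; each step mirrors one row of the comparison using only built-in types.

```python
values = [
    ["s1", "q1", "correct"],
    ["s2", "q1", None],
    ["s1", "q2", "wrong"],
    ["s1", "q1", "correct"],
]

# Delete/Drop: keep only rows without missing values (the role of dropna()).
complete = [row for row in values if None not in row]

# Filter: keep the rows of one student (the role of a boolean mask or query).
student_rows = [row for row in complete if row[0] == "s1"]

# Replace: map labels to numbers; values missing from the dict pass through.
labels = {"correct": 1, "wrong": 0}
encoded = [[labels.get(v, v) for v in row] for row in student_rows]

# De-duplicate: a set removes repeats; sorted() makes the order predictable.
items = sorted({row[1] for row in encoded})
```

Every intermediate result is a plain list of lists, so the same data type flows through the whole API and each step can be asserted on directly in a test.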

Slide 30

Slide 30 text

Steps to transform Research Oriented Code into ML APIs

Research Oriented Code → ML APIs

4. Check
- 4.1. Write decorators to check parameters
- 4.2. Set up production-like environments

Slide 31

Slide 31 text

4.1. Write decorators to check parameters

Image of Decorators in APIs

Client → Request → decorators (access token check, request parameter check, error handling) → URIs → preparation → preprocessing → calculation
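The layering in this picture can be sketched without a web framework. In the example below a plain dict stands in for the HTTP request, and the decorator names, the token store, and the response shape are all hypothetical; the point is only how the three checks wrap the endpoint function.

```python
from functools import wraps

VALID_TOKENS = {"secret-token"}  # hypothetical token store for the sketch


def require_token(f):
    """Access token check: reject requests without a known token."""
    @wraps(f)
    def wrapper(request):
        if request.get("token") not in VALID_TOKENS:
            return {"error": "invalid access token"}, 401
        return f(request)
    return wrapper


def require_params(*names):
    """Request parameter check: reject requests missing required parameters."""
    def decorator(f):
        @wraps(f)
        def wrapper(request):
            missing = [n for n in names if n not in request.get("params", {})]
            if missing:
                return {"error": "missing parameters: " + ", ".join(missing)}, 400
            return f(request)
        return wrapper
    return decorator


def handle_errors(f):
    """Error handling: turn unexpected exceptions into a 500 response."""
    @wraps(f)
    def wrapper(request):
        try:
            return f(request)
        except Exception as exc:
            return {"error": str(exc)}, 500
    return wrapper


@handle_errors
@require_token
@require_params("student_id")
def probabilities(request):
    # preparation -> preprocessing -> calculation would run here
    return {"student_id": request["params"]["student_id"], "probability": 0.5}, 200
```

Decorators apply from the bottom up, so the outermost one (`handle_errors`) runs first and also catches failures raised by the inner checks. The endpoint body stays free of validation code.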

Slide 33

Slide 33 text

4.1. Write decorators to check parameters
ex) Request parameter check with JSON Schema

make_name_grade.json

    {
      "$schema": "http://json-schema.org/draft-04/schema#",
      "type": "object",
      "properties": {
        "student_name": {"type": "string"},
        "student_grade": {"type": "string", "minLength": 1, "maxLength": 120}
      },
      "required": ["student_name", "student_grade"]
    }

request curl command

    curl http://localhost:5000/ -X POST -H "Content-Type: application/json" -d '{"student_name": "test_name", "student_grade": "fourth-grade"}'

Slide 34

Slide 34 text

4.1. Write decorators to check parameters
ex) Request parameter check with JSON Schema

json_validate.py

    from functools import wraps

    from flask import current_app, jsonify, request
    from jsonschema import ValidationError, validate
    from werkzeug.exceptions import BadRequest


    def validate_json(f):
        @wraps(f)
        def wrapper(*args, **kw):
            try:
                request.json  # raises BadRequest if the body is not valid JSON
            except BadRequest:
                msg = "This is an invalid json"
                return jsonify({"error": msg}), 400
            return f(*args, **kw)
        return wrapper


    def validate_schema(schema_name):
        def decorator(f):
            @wraps(f)
            def wrapper(*args, **kw):
                try:
                    # The schema is looked up in the Flask app config by name.
                    validate(request.json, current_app.config[schema_name])
                except ValidationError as e:
                    return jsonify({"error": e.message}), 400
                return f(*args, **kw)
            return wrapper
        return decorator

app.py

    import json

    from flask import Flask, request

    from json_validate import validate_json, validate_schema

    app = Flask(__name__)


    @app.route('/', methods=['POST'])
    @validate_json
    @validate_schema('make_name_grade')
    def index():
        if request.method == 'POST':
            data = json.loads(request.data)
            print(data["student_name"])
            print(data["student_grade"])
            return "Hi! " + data["student_name"]
        else:
            return "Hi!"

Slide 35

Slide 35 text

4.2. Set up production-like environments with Flask

- Automate continuous integration
- Visualize data (load test)
- Deploy on GCP
- Monitor the accuracy of the ML model

Tools shown: Flask App Builder, Dash, Cloud Monitoring (Stackdriver)

Slide 36

Slide 36 text

Resources

Python Tools that I mentioned in this talk
- Locust: https://www.youtube.com/watch?v=XQ4hrbgVysk (PyCon Korea 2015)
- Refactoring: https://www.youtube.com/watch?v=D_6ybDcU5gc (PyCon US 2016)
- Pytest: https://www.youtube.com/watch?v=G-MAMrJ-CSA (PyCon US 2019)
- Flask workshop: https://www.youtube.com/watch?v=DIcpEg77gdE (PyCon US 2015)
- Dash: https://www.youtube.com/watch?v=WLbQYFZc-YY (PyCon JP 2019)

Python Packages that I mentioned in this talk
- google-cloud-bigquery: https://pypi.org/project/google-cloud-bigquery/
- google-cloud-storage: https://pypi.org/project/google-cloud-storage/
- gcp-accessor: https://pypi.org/project/gcp-accessor/0.0.1/
- Flask-AppBuilder: https://flask-appbuilder.readthedocs.io/en/latest/

Slide 37

Slide 37 text

Summary

Research Oriented Code → ML APIs

1. Understand
- What is Research Oriented Code?
- What are ML APIs?
- How should engineers handle research oriented code?

2. Modularize
- Categorize research oriented code into preparation code, preprocessing code, ML code
- Break them out into functions and make them testable
- Clarify input and output of the code, and define URI

3. Refactor
- Prepare for refactoring
- Simplify I/O in preparation code
- Pandas → Python in preprocessing code

4. Check
- Write decorators to check parameters
- Set up production-like environments

Slide 38

Slide 38 text

@JesseTetsuya
Software Engineer at an IT company specializing in the education industry, based in Tokyo. I mostly work in both data science and engineering.

If you are interested in the education and technology domain, feel free to contact me!
