How to develop ML APIs with Python by using online learning dataset

How to develop ML APIs with Python from Online Learning Dataset
@JesseTetsuya
Software Engineer at an IT company specializing in the education industry, based in Tokyo. I mostly work in both data science and engineering.
PyCon Taiwan 2020 tutorial

Agenda
- Introduction
- Main Talk
  - First Step Transformation / Demonstration 1 (- 15m)
  - Second Step Transformation / Demonstration 2 (- 15m)
  - (10 minutes break)
  - Third Step Transformation / Demonstration 3 (- 15m)
  - Fourth Step Transformation / Demonstration 4 (- 15m)
- Summarize this tutorial

Background and Purpose
- Recently, Python engineers have more opportunities to work with data scientists and researchers than before.
- There are fewer business use cases for implementing AI / ML applications than for developing ML models.

Goals
- Learning generalized methods that you can apply to the tasks of your AI / ML projects
- Understanding the whole process from analysis to API implementation by using Python

Not Goals
- Avoid math and stats and focus on Python
- Avoid detailed tutorials of individual tools and focus on the general use cases of Python-based tools

A Development Cycle in AI / ML projects
Log Data From Database → Research Oriented Code → ML APIs → Application
- Data Scientists: Analyze log data and find optimal ML models / algorithms. (Period: 2 - 3 months)
- Engineers: Transform research oriented code into ML APIs and integrate ML APIs into the application. (Period: between 2 weeks and 2 months)

What is the Research Oriented Code?

Definition: Research oriented code in AI/ML projects is the code written mainly by data scientists or researchers for figuring out the most efficient and suitable machine learning model.

What are the ML APIs?

Machine Learning APIs are composed of three elements
1. Preparation code for accessing data
2. Pre-processing code
3. Machine learning (ML) code
Production code (Engineers) / Research oriented code (Data Scientists/Researchers)
Research oriented code is developed through an iterative process and integrated into production code.

Data Pre-Processing code
- The code can be traced visually from the top to the bottom
- It is easy and quick to write

ML code (a part of the whole code)
- Input data is easy to handle and output data is easy to trace with a data frame

Refactor both pieces of code in a Pythonic way
- The refactored code builds the model in a much faster and simpler way

What is the gap between Research Oriented Code and Production Code?
- 0 points: More of a research project than a productionized system.
- 1-2 points: Not totally untested, but it is worth considering the possibility of serious holes in reliability.
- 3-4 points: There has been a first pass at basic productionization, but additional investment may be needed.
- 5-6 points: Reasonably tested, but it is possible that more of those tests and procedures may be automated.
- 7-10 points: Strong levels of automated testing and monitoring, appropriate for mission-critical systems.
- 12+ points: Exceptional levels of automated testing and monitoring.
Source: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. "What's your ML test score? A rubric for ML production systems." Reliable Machine Learning in the Wild, NIPS Workshop. https://research.google/pubs/pub45742/

Three Differences between Research Oriented Code and Production Code
Scopes
- Research oriented code: preparation code, preprocessing code, ML code
- Production code: preparation code, preprocessing code, ML code
Characteristics of Coding Style
- Research oriented code: easily handled, visually traceable
- Production code: high calculation speed, high readability, testable and modular
Objectives of Coding Style
- Research oriented code: finding the most efficient and suitable machine learning model
- Production code: making the code work on the server correctly and reliably

What are Python Engineers supposed to do for Research Oriented
Code?

Three Differences between Research Oriented Code and Production Code
Scopes
- Research oriented code: preprocessing code, machine learning code
- Production code: preparation code, preprocessing code, ML code
Characteristics of Coding Style
- Research oriented code: easily handled, visually traceable
- Production code: high calculation speed, high readability, testable and modular
Objectives of Coding Style
- Research oriented code: finding the most efficient and suitable machine learning model
- Production code: making the code work on the server correctly and reliably
What engineers do to close the gap: Modularize, Refactor, Check

A Development Cycle in AI / ML projects
Learning Log Data From Database → Research Oriented Code → ML APIs → LMS / Online Learning Platform
- Data Scientists: Analyze learning log data and find optimal ML models / algorithms. (Period: 2 - 3 months)
- Engineers: Transform research oriented code into ML APIs and integrate ML APIs into the LMS / online learning platform. (Period: between 2 weeks and 2 months)

Steps to transform Research Oriented Code into ML APIs
Research Oriented Code → Understand Data → Modularize → Refactor → Check → ML APIs

Steps to transform Research Oriented Code into ML APIs
1. Understand Data
- What data can be stored and created?
- What does the data look like?
- What is the nature of the data?
- What algorithm is applied to this data? What for?

What data can be stored and created in Education Technology?
Learning Management System (LMS) / Online Learning Platform
Learning Contents / Functions
- Online Quiz
- Video Lesson
- Discussion Forum
- Contents Box
- Information View / Notifications
- Peer and self assessment
- Integrated Badges
- Personalized Dashboard
- Recommend System
- Intelligent Tutoring System
- Authoring Tools
- etc.

What does the data look like?

What is the nature of the data?
Do data wrangling with pandas
- Filter
- Replace
- De-duplicate / Be unique
- Delete/Drop

What algorithm is applied to this data? What for?
Item Response Theory: Two-Parameter Logistic Regression
Input
- theta: ability
- a: discrimination parameter
- b: difficulty parameter
Output
- Probability that a student answers an item correctly
https://www.publichealth.columbia.edu/research/population-health-methods/item-response-theory
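
The slide names the model but does not spell out its formula. As a minimal sketch, assuming the standard two-parameter logistic form P(correct) = 1 / (1 + exp(-a * (theta - b))) without the optional scaling constant, the probability could be computed as below. The function name `prob_correct` is hypothetical and is not taken from the talk's code.

```python
import math


def prob_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic item response function.

    theta: student ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Ability equal to difficulty gives exactly 0.5, whatever the discrimination.
p_equal = prob_correct(theta=0.0, a=1.2, b=0.0)
# A stronger student has a higher probability on the same item.
p_strong = prob_correct(theta=1.5, a=1.2, b=0.0)
```

A larger `a` makes the curve steeper around `b`, so the item separates students just below and just above its difficulty more sharply.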

Demonstration 1
Let's look at the real code (- 15m)

Steps to transform Research Oriented Code into ML APIs
2. Modularize
2.1. Categorize research oriented code into preparation code, preprocessing code, and ML code
2.2. Break them out into functions and make them testable
2.3. Clarify input and output of the code, and define URI

2.1. Categorize research oriented code into preparation code, preprocessing code, ML code
This is a page of research oriented code written in a Jupyter notebook. The code is procedural, and some parts of it are not categorized. The research oriented code is tightly coupled.

2.1. Categorize research oriented code into preparation code, preprocessing code, ML code
- Find the code to load input data or access the database → preparation code
- Find the code to make, replace, filter, or delete input data → preprocessing code
- Find the code to execute calculation or train data → ML code

2.2. Break them out into functions and make them testable
Module name and functions
- Preparation code (preparation.py): access BigQuery, execute query, and load input data; rename columns
- Preprocessing code (preprocessing.py): replace categorical data with discrete numbers; filter input data
- ML code (prediction.py): calculate ICC parameters, logistic regression, and item response theory (IRT)
The research oriented code became loosely coupled.
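
A rough sketch of what the split can look like once each category becomes a small function. Everything here is hypothetical (the function names, the row layout, the hard-coded rows standing in for a BigQuery result, and the toy calculation standing in for the ML code). The point is only that each stage takes plain inputs, returns plain outputs, and can be tested on its own.

```python
# --- preparation.py (assumption: hard-coded rows stand in for a BigQuery result) ---
def load_answers():
    """Return raw rows shaped as [student_id, item_id, result]."""
    return [
        ["s1", "q1", "correct"],
        ["s1", "q2", "incorrect"],
        ["s2", "q1", None],  # missing answer
        ["s2", "q2", "correct"],
    ]


# --- preprocessing.py ---
RESULT_CODES = {"correct": 1, "incorrect": 0}


def encode_results(rows):
    """Drop rows without a known result and replace categories with 0/1."""
    return [
        [student, item, RESULT_CODES[result]]
        for student, item, result in rows
        if result in RESULT_CODES
    ]


# --- prediction.py (toy stand-in for the ML code) ---
def correct_rate_by_item(rows):
    """Share of correct answers per item."""
    totals = {}
    for _student, item, score in rows:
        seen, correct = totals.get(item, (0, 0))
        totals[item] = (seen + 1, correct + score)
    return {item: correct / seen for item, (seen, correct) in totals.items()}


# Wiring the three stages together, as an endpoint would.
rates = correct_rate_by_item(encode_results(load_answers()))
```

Because the stages only exchange lists and dicts, each one can get its own test module without a database or a notebook.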

2.3. Clarify input and output of the whole code and define URI
app.py
    @app.route("/v1/probabilities", methods=["GET"])  # URI: noun
    def probabilities():                                # function: the same name as the endpoint
        return get_probs(), 200                         # helper: verb (+ noun), e.g. get_probs() or calc_results()
INPUT: results of student answers
OUTPUT: probabilities of answering each item correctly
*item means a question

Demonstration 2
Let's look at the real code (- 15m)

10 minutes break
My Twitter account: @JesseTetsuya

Steps to transform Research Oriented Code into ML APIs
3. Refactor
3.1. Prepare for refactoring
3.2. Simplify I/O in preparation code
3.3. Pandas → Python in preprocessing code

The code to prepare and preprocess data tends to be a "Big Ball of Mud" [1]
- Redundant
- Repetitive
[1] Foote, Brian; Yoder, Joseph (26 June 1999). "Big Ball of Mud". laputan.org. Retrieved 14 April 2019.

3.1 Prepare for refactoring
Narrow down the requirements of each piece of code by writing test code, and note the requirements in comments for refactoring (or you can ask the data scientist to write the comments in advance).

.
├── ml_api
│   ├── api
│   │   ├── app.py
│   │   ├── config
│   │   ├── prediction.py
│   │   ├── preparation.py
│   │   └── preprocessing.py
│   ├── requirements.txt
│   ├── run.py
│   └── tests
│       ├── test_app.py
│       ├── test_prediction.py
│       ├── test_preparation.py
│       └── test_preprocessing.py
└── setup.py

Comments or docstrings (reStructuredText style / NumPy style / Google style)
ex) Google Style
    def func(arg1, arg2):
        """Summary line.

        Extended description of function.

        Args:
            arg1 (int): Description of arg1
            arg2 (str): Description of arg2

        Returns:
            bool: Description of return value
        """
        return True
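
As a small illustration of pinning a requirement down with a test and recording it in a Google-style docstring, the sketch below uses a hypothetical preprocessing helper named `filter_answered`. Neither the function nor its test comes from the talk's repository.

```python
def filter_answered(rows):
    """Keep only rows whose answer field is filled in.

    Args:
        rows (list[list]): Rows shaped as [student_id, item_id, answer].

    Returns:
        list[list]: The rows whose answer is neither None nor an empty string.
    """
    return [row for row in rows if row[2] not in (None, "")]


def test_filter_answered():
    """pytest-style test that fixes the expected behavior before refactoring."""
    rows = [["s1", "q1", "A"], ["s1", "q2", None], ["s2", "q1", ""]]
    assert filter_answered(rows) == [["s1", "q1", "A"]]
```

With the test in place, the body of the function can be rewritten freely as long as the test keeps passing.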

3.1 Tips: Prepare for refactoring

3.2 Simplify I/O in redundant preparation code
CASE STUDY: Refactoring the redundant code that accesses BigQuery and GCS by using the Google Cloud client libraries with Python

3.2. Simplify I/O in preparation code
ex) BigQuery with Python

Code A (OUTPUT: two-dimensional array)
    from google.cloud import bigquery

    client = bigquery.Client()
    query = "SELECT * FROM `data set name`"
    query_job = client.query(query)
    # one inner list per row
    results = [list(row.values()) for row in query_job.result()]

Code B (OUTPUT: two-dimensional array + filter values + drop null)
    from google.cloud import bigquery

    client = bigquery.Client()
    # select only the needed columns and drop nulls inside the query
    query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]

→ Preprocess the data with the query as much as possible
→ It is faster and lower-cost than preprocessing the data with Python
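
The idea of pushing column selection and null filtering into the query can be tried locally without a BigQuery project. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for BigQuery (an assumption made so that the example runs anywhere; the `answers` table and its rows are invented) and compares filtering in Python with filtering in the query.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the BigQuery data set.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (column_1 TEXT, column_2 TEXT, column_3 INTEGER, extra TEXT)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?)",
    [("s1", "q1", 1, "x"), (None, "q2", 0, "y"), ("s2", "q1", 0, "z")],
)

# Code A style: fetch everything, then select columns and drop nulls in Python.
all_rows = [list(row) for row in conn.execute("SELECT * FROM answers ORDER BY rowid")]
filtered_in_python = [row[:3] for row in all_rows if row[0] is not None]

# Code B style: let the query select columns and drop nulls.
filtered_in_sql = [
    list(row)
    for row in conn.execute(
        "SELECT column_1, column_2, column_3 FROM answers "
        "WHERE column_1 IS NOT NULL ORDER BY rowid"
    )
]
conn.close()
```

Both variants produce the same two-dimensional array, but the second one transfers fewer rows and columns and leaves less preprocessing code to maintain.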

3.2 Simplify I/O in preparation code
ex) Google Cloud Storage with Python
Make a bytes object and upload it from memory to GCS with Python
    import csv
    import gzip
    import io

    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket("bucket name")

    # write the rows to an in-memory CSV string
    with io.StringIO() as csv_obj:
        writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(two_dimensional_arrays)
        result = csv_obj.getvalue()

    # compress in memory, then upload the buffer without touching the disk
    with io.BytesIO() as gzip_obj:
        with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
            gzip_file.write(result.encode())
        blob = bucket.blob("storage_path")
        blob.upload_from_file(gzip_obj, rewind=True, content_type="application/gzip")
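
The in-memory part of this pattern can be checked without touching GCS. This sketch builds the gzip-compressed CSV bytes with the standard library and reads them back; the helper names `rows_to_gzip_csv` and `gzip_csv_to_rows` are hypothetical, and the upload call is deliberately left out.

```python
import csv
import gzip
import io


def rows_to_gzip_csv(rows):
    """Serialize a two-dimensional array to gzip-compressed CSV bytes in memory."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        text = csv_obj.getvalue()
    return gzip.compress(text.encode("utf-8"))


def gzip_csv_to_rows(payload):
    """Inverse of rows_to_gzip_csv; every value comes back as a string."""
    text = gzip.decompress(payload).decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


payload = rows_to_gzip_csv([["s1", "q1", 1], ["s2", "q2", 0]])
restored = gzip_csv_to_rows(payload)
```

A round trip like this makes a convenient unit test for the serialization step, since only the final upload needs a real or mocked storage client.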

3.2. Simplify I/O more by using wrapper
    import gcp_accessor

    bq = gcp_accessor.BigQueryAccessor()
    query = "SELECT * FROM `data set name`"
    bq.execute_query(query)

    gcs = gcp_accessor.GoogleCloudStorageAccessor()
    gcs.upload_csv_gzip("bucket name", "full path on gcs", "input data")
These few lines replace the google-cloud-bigquery and google-cloud-storage code from the previous two slides.
google-cloud-bigquery + google-cloud-storage → gcp-accessor (wrapper library) (https://pypi.org/project/gcp-accessor/)

3.3. Pandas → Python in preprocessing code
All data in the API is processed using the same data type. This prioritizes readability and maintainability over coding speed.

3.3. Pandas → Python in preprocessing code
One day, I wondered why I was struggling so much to refactor repetitive preprocessing code in research oriented code that I had written only the previous week.

3.3. Pandas → Python in preprocessing code
Code styles by preprocessing function

Filter
- Pandas: dataframe.where() / dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
- Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]

Replace
- Pandas: dataframe.fillna(), or dic = {"key1": value1, "key2": value2, …} with dataframe['column1'].replace(dic, inplace=True)
- Python: dic = {"key1": value1, "key2": value2, …} with [[dic.get(v, v) for v in value] for value in values]

De-duplicate / Be unique
- Pandas: duplicated() / drop_duplicates(), dataframe['column1'].unique() (output: array([v1, v2, v3]))
- Python: set(list), list({v1, v2, v2, …}), list({value[0] for value in values})

Delete/Drop
- Pandas: dataframe.dropna(), dataframe.drop(), dataframe.drop(index=index list)
- Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
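
To make the plain-Python side concrete, here is a small sketch of the four operations on a list of lists. The sample rows and the replacement dictionary are invented for illustration.

```python
values = [
    ["s1", "q1", "correct"],
    ["s1", "q2", "incorrect"],
    ["s2", "q1", None],
    ["s1", "q1", "correct"],  # exact duplicate of the first row
]

# Filter: keep only the rows for question q1.
q1_rows = [value for value in values if value[1] == "q1"]

# Replace: map categorical results to numbers, leaving other cells untouched.
dic = {"correct": 1, "incorrect": 0}
replaced = [[dic.get(v, v) for v in value] for value in values]

# De-duplicate / be unique: unique student ids (sorted, because sets are unordered).
students = sorted({value[0] for value in values})

# De-duplicate whole rows while keeping first-seen order.
unique_rows = [list(t) for t in dict.fromkeys(tuple(value) for value in values)]

# Delete/Drop: remove rows that contain a missing value.
complete_rows = [value for value in values if None not in value]
```

Every result is still a list (or list of lists), so the output of one step can be fed straight into the next and compared in tests with a plain `assert`.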

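3.3. Tips: The ways to write test code with pytest

Compare DataFrames with pandas.testing:
    import pandas as pd
    from pandas.testing import assert_frame_equal

    # illustrative values
    rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    index = ["Row1", "Row2", "Row3"]
    columns = ["Col1", "Col2", "Col3"]
    a = pd.DataFrame(data=rows, index=index, columns=columns)
    b = pd.DataFrame(data=rows, index=index, columns=columns)
    assert_frame_equal(a, b)

Convert DataFrames to lists and compare with a plain assert:
    import pandas as pd
    import pytest

    # illustrative values
    rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    index = ["Row1", "Row2", "Row3"]
    columns = ["Col1", "Col2", "Col3"]
    a = pd.DataFrame(data=rows, index=index, columns=columns)
    b = pd.DataFrame(data=rows, index=index, columns=columns)
    c = a.values.tolist()
    d = b.values.tolist()
    assert c == d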

Refactor the code following test code
1. Prepare docstrings and test functions > decide what to refactor > preprocessing.py
2. Make sure of the input and output data > decide how to refactor > remove dataframe

Demonstration 3
Let's look at the real code (- 15m)

Steps to transform Research Oriented Code into ML APIs
4. Check
4.1. Write decorators to check parameters
4.2. Set up production-like environments

4.1. Write decorators to check parameters
Image of decorators in APIs
Client → Request → decorators (access token check, request parameter check, error handling) → URIs → preparation → preprocessing → calculation

4.1. Write decorators to check parameters
ex) Request parameter check with JSON Schema

make_name_grade.json
    {
      "$schema": "http://json-schema.org/draft-04/schema#",
      "type": "object",
      "properties": {
        "student_name": { "type": "string" },
        "student_grade": { "type": "string", "maximum": 120, "minimum": 1 }
      },
      "required": ["student_name", "student_grade"]
    }

request curl command
    curl http://localhost:5000/v1/check_schema -X POST -H "Content-Type: application/json" -d '{"student_name": "test_name", "student_grade": "forth-grade"}'

4.1. Write decorators to check parameters
ex) Request parameter check with JSON Schema

json_validate.py
    from functools import wraps

    from flask import current_app, jsonify, request
    from jsonschema import ValidationError, validate
    from werkzeug.exceptions import BadRequest

    def validate_json(f):
        @wraps(f)
        def wrapper(*args, **kw):
            try:
                request.json  # raises BadRequest when the body is not valid JSON
            except BadRequest:
                msg = "This is an invalid json"
                return jsonify({"error": msg}), 400
            return f(*args, **kw)
        return wrapper

    def validate_schema(schema_name):
        def decorator(f):
            @wraps(f)
            def wrapper(*args, **kw):
                try:
                    # the schema is looked up in the app config under schema_name
                    validate(request.json, current_app.config[schema_name])
                except ValidationError as e:
                    return jsonify({"error": e.message}), 400
                return f(*args, **kw)
            return wrapper
        return decorator

app.py
    import json

    from flask import Flask, request
    from json_validate import validate_json, validate_schema

    app = Flask(__name__)

    @app.route("/v1/check_schema", methods=["POST"])
    @validate_json
    @validate_schema("make_name_grade")
    def index():
        if request.is_json:
            data = json.loads(request.data)
            print(data["student_name"])
            print(data["student_grade"])
            return "Hi! " + data["student_name"]
        else:
            return "Hi!"
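
The decorator pattern itself does not depend on Flask. As a framework-free sketch, the hypothetical `require_fields` decorator below checks a plain dict payload before the handler runs and returns a `(body, status)` tuple in the same spirit as the Flask views. It does not use JSON Schema and is not the talk's implementation.

```python
from functools import wraps


def require_fields(**expected_types):
    """Reject payloads that miss a field or carry a value of the wrong type."""
    def decorator(f):
        @wraps(f)
        def wrapper(payload, *args, **kw):
            for name, expected in expected_types.items():
                if name not in payload:
                    return {"error": f"'{name}' is a required property"}, 400
                if not isinstance(payload[name], expected):
                    return {"error": f"'{name}' must be {expected.__name__}"}, 400
            # All checks passed: hand over to the wrapped handler.
            return f(payload, *args, **kw)
        return wrapper
    return decorator


@require_fields(student_name=str, student_grade=str)
def greet(payload):
    return {"message": "Hi! " + payload["student_name"]}, 200


ok = greet({"student_name": "test_name", "student_grade": "forth-grade"})
missing = greet({"student_name": "test_name"})
```

Keeping the check in a decorator means the handler body only deals with valid input, and the same check can be reused on every endpoint that accepts the same payload.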

4.2. Set up production-like environments with Flask
- Automate continuous integration
- Visualize data (load test)
- Deploy on GCP
- Monitor the accuracy of the ML model
Tools: Flask-AppBuilder, Dash, Cloud Monitoring (Stackdriver)

Demonstration 4
Let's look at the real code (- 15m)

Resources: video materials
Python Tools that I mentioned in this talk
- LOCUST: https://www.youtube.com/watch?v=XQ4hrbgVysk (PyCon Korea 2015)
- Refactoring: https://www.youtube.com/watch?v=D_6ybDcU5gc (PyCon US 2016)
- Pytest: https://www.youtube.com/watch?v=G-MAMrJ-CSA (PyCon US 2019)
- Flask workshop: https://www.youtube.com/watch?v=DIcpEg77gdE (PyCon US 2015)
- Dash: https://www.youtube.com/watch?v=WLbQYFZc-YY (PyCon JP 2019)
Python Packages that I mentioned in this talk
- google-cloud-bigquery: https://pypi.org/project/google-cloud-bigquery/
- google-cloud-storage: https://pypi.org/project/google-cloud-storage/
- gcp-accessor: https://pypi.org/project/gcp-accessor/0.0.1/
- Flask-AppBuilder: https://flask-appbuilder.readthedocs.io/en/latest/

Resources: text based materials
The resources that I mentioned in this tutorial
- Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Pandas Tutorial: https://pandas.pydata.org/docs/user_guide/index.html
- What's your ML test score? A rubric for ML production systems: https://research.google/pubs/pub45742/
- Item Response Theory Tutorial: https://www.publichealth.columbia.edu/research/population-health-methods/item-response-theory
- Research Oriented Code: https://towardsdatascience.com/research-oriented-code-in-ai-ml-projects-f0dde4f9e1ac
- Why is Educational Data Mining important in the research?: https://towardsdatascience.com/why-is-educational-data-mining-important-in-the-research-e78ed1a17908

Summary
Research Oriented Code → Understand Data → Modularize → Refactor → Check → ML APIs
Understand Data
- What data can be stored and created?
- What does the data look like? What is the nature of the data?
- What product can be generated from this data?
Modularize
- Categorize research oriented code into preparation code, preprocessing code, ML code
- Break them out into functions and make them testable
- Clarify input and output of the code, and define URI
Refactor
- Prepare for refactoring
- Simplify I/O in preparation code
- Pandas → Python in preprocessing code
Check
- Write decorators to check parameters
- Set up production-like environments

@JesseTetsuya
If you have an interest in the education and technology domain, feel free to contact me!

Q & A
