Productionize Research Oriented Code By Python

Tetsuya Jesse Hirata (@JesseTetsuya)
PyCon US 2022, Salt Lake City

Background
• Python engineers are often assigned to AI/ML projects and frequently face research oriented code.
• Understanding the process of productionizing research oriented code helps AI/ML projects run more smoothly.

Research Oriented Code in AI/ML projects [Lightning Talk] - PyCon
US May 2019: https://www.youtube.com/watch?v=yFcCuinRVnU

4 Step Transformation from Research Oriented Code into Products
1. Understand
2. Modularize
3. Refactor
4. Make them a product

Out of the scope of this talk
Item Response Theory: Two-Parameter Logistic Regression
Input:
• theta: ability
• a: discrimination parameter
• b: difficulty parameter
Output:
• Probability that an item is answered correctly
https://www.publichealth.columbia.edu/research/population-health-methods/item-response-theory
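For reference, the two-parameter logistic model is commonly written as P(correct) = 1 / (1 + exp(-a * (theta - b))). The following is a minimal sketch of that formula; the function name `prob_correct` is illustrative and does not come from the talk.

```python
import math


def prob_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) item response function.

    theta: ability, a: discrimination parameter, b: difficulty parameter.
    Returns the probability of answering the item correctly.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# When ability equals difficulty, the probability is exactly one half.
print(prob_correct(theta=0.0, a=1.2, b=0.0))  # 0.5
```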

Understand the research oriented code

What is Research Oriented Code?

Definition
Research oriented code in AI/ML projects is code written mainly by data scientists or researchers to discover new knowledge.

1. Prepare data
2. Pre-process data
3. Train on, or calculate with, the pre-processed data
Then write a paper with the results.

Example of preprocessing code
• The code can be traced visually from top to bottom.
• It is easy and quick to write.
• It is not clean code, but it is enough to get results quickly.

Example of calculation code
• Input data is easy to handle, and output data is easy to trace with a data frame.

What is production-level code?

Quality definition of production code
• 0 points: More of a research project than a productionized system.
• 1-2 points: Not totally untested, but it is worth considering the possibility of serious holes in reliability.
• 3-4 points: There's been a first pass at basic productionization, but additional investment may be needed.
• 5-6 points: Reasonably tested, but it's possible that more of those tests and procedures may be automated.
• 7-10 points: Strong levels of automated testing and monitoring, appropriate for mission-critical systems.
• 12+ points: Exceptional levels of automated testing and monitoring.
Source: "What's your ML test score? A rubric for ML production systems", Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016). https://research.google/pubs/pub45742/

1. Architect
2. Implement new features or fix bugs
3. Test and review
Then release the product and receive feedback all the time.

Example covering both previous pieces of code
• This code builds the model in a faster and simpler way.

3 differences between research oriented code and production code
Scopes
• Research oriented code: preparation code, preprocessing code, ML code / calculation code
• Production code: preparation code, preprocessing code, ML code / calculation code
Characteristics of coding style
• Research oriented code: easily handled, visually traceable
• Production code: high calculation speed, high readability, testable and modular
Objectives of coding style
• Research oriented code: finding the most efficient and suitable machine learning model
• Production code: making the code work on the server correctly and reliably

What are Python engineers supposed to do with research oriented code at first?

Read the code before writing it.

Code Reading / Code Documentation
(Slide image: a page of research oriented code written in a Jupyter notebook)
Three strategies for taking notes to prepare for modularization:
1. Write comments using "#"
2. Add "TODO" comments
3. Mob documentation
VS Live Share: https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare

Code Reading / Code Documentation
(Slide image: part of a page of research oriented code with comments)
Code documentation when reading:
    # Get file name
    # Create new database "images.db"
    # TODO: set it in config file
    # Create new connection to database object
    # TODO: Use OR mapper
    # Create cursor object to operate sqlite
    # Initialize database
    …

Modularize the code by using labels

Categorize research oriented code from the code documentation
Use category labels such as preparation code, pre/post processing code, and calculation code.
Code documentation:
    # Get file name
    # Create new database "images.db"
    # TODO: set it in config file
    # Create new connection to database object
    # TODO: Use OR mapper
    # Create cursor object to operate sqlite
    # Initialize database
    …
Labeled code documentation:
1. Preparation code
    # Get file name
    # Create new database "images.db"
    # TODO: set it in config file
    # Create new connection to database object
    # TODO: Use OR mapper
    # Create cursor object to operate sqlite
    # Initialize database
    …
2. Pre/Post processing code
    …
3. Calculation code
    …

Break them into functions and make them testable
Find duplicated code and delete or unify it, and fix small bugs.
Labeled code documentation:
1. Preparation code
    # Get file name
    # Create new database "images.db"
    # TODO: set it in config file
    # Create new connection to database object
    # TODO: Use OR mapper
    # Create cursor object to operate sqlite
    # Initialize database
    …
2. Pre/Post processing code
    …
3. Calculation code
    …
Functions:
1. Preparation code
    1. init_db()
    2. get_filename()
    3. load_config()
2. Pre/Post processing code
    …
3. Calculation code
    …
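The step from labeled comments to functions could look roughly like the sketch below. Only the three function names come from the slide; the signatures, the `images` table layout, and the contents of the config dictionary are assumptions made for illustration.

```python
import sqlite3
from pathlib import Path


def load_config() -> dict:
    """Return settings that the notebook hard-coded (addresses the config TODO)."""
    # Assumption: a plain dict stands in for a real config file.
    return {"db_path": "images.db"}


def get_filename(path: str) -> str:
    """Get the file name without its directory part."""
    return Path(path).name


def init_db(db_path: str) -> sqlite3.Connection:
    """Create the database if needed and return an open connection."""
    conn = sqlite3.connect(db_path)
    # Assumption: a hypothetical two-column table; the real schema is not shown.
    conn.execute("CREATE TABLE IF NOT EXISTS images (file_id TEXT, filename TEXT)")
    conn.commit()
    return conn
```

Because the database path is a parameter, a test can call `init_db(":memory:")` and never touch the disk, which is what makes the function testable in isolation.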

Modularization outcome
Preparation code (module name: preparation.py)
- Access database
- Execute query
- Load input data
Preprocessing code (module name: preprocessing.py)
- Replace categorical data with discrete numbers
- Filter input data
- Rename columns
Calculation / execution code (module name: prediction.py)
- Calculate logistic regression
- Output results
…

Modularization outcome
Preparation code (module name: preparation.py)
- Access database, execute query, and load input data
- Rename columns
- …
Preprocessing code (module name: preprocessing.py)
- Replace categorical data with discrete numbers
- Filter input data
- …
Calculation / execution code (module name: prediction.py)
- Calculate logistic regression and output results
- …
The research oriented code became loosely coupled.

Mapping each module into a directory

Flask directory for a small team in an internal API development:
    ├── Dockerfile
    ├── README.md
    ├── api
    │   ├── __init__.py  (routing list)
    │   ├── config
    │   │   ├── __init__.py
    │   │   └── base.py
    │   └── urls
    │       ├── __init__.py
    │       ├── preparation.py
    │       ├── preprocessing.py
    │       └── prediction.py
    ├── docker-compose.yaml
    ├── models
    │   └── models.py  (database schema)
    ├── requirements.txt
    ├── run.py
    ├── tests
    └── venv

Flask directory for a big team in an internal API development:
    ├── Dockerfile
    ├── README.md
    ├── apis
    │   ├── __init__.py  (routing list)
    │   ├── config
    │   │   ├── __init__.py
    │   │   ├── base.py
    │   │   ├── staging.py
    │   │   ├── production.py
    │   │   └── local.py
    │   ├── v1
    │   │   ├── __init__.py
    │   │   ├── preparation.py
    │   │   ├── preprocessing.py
    │   │   └── prediction.py
    │   └── v2
    │       ├── __init__.py
    │       └── prediction.py
    ├── docker-compose.yaml
    ├── requirements.txt
    ├── models
    │   └── models.py  (database schema)
    ├── tests
    └── run.py

Refactor preparation code and preprocessing code

Before refactoring the code
1. Write test code.
2. Add docstrings in a style such as reStructuredText, NumPy, or Google.
3. Run a code formatter, then check that the code works correctly and follows the coding style.
Code formatter:
• black/autopep8
• isort
Code checker:
• pytest
• flake8

    ├── Dockerfile
    ├── README.md
    ├── api
    │   ├── __init__.py
    │   ├── config
    │   │   ├── __init__.py
    │   │   └── base.py
    │   └── urls
    │       ├── __init__.py
    │       ├── endpoint1.py
    │       └── endpoint2.py
    ├── docker-compose.yaml
    ├── models
    │   └── models.py
    ├── requirements.txt
    ├── run.py
    ├── tests
    │   ├── test_app.py
    │   ├── test_prediction.py
    │   ├── test_preparation.py
    │   └── test_preprocessing.py
    └── venv

This is a docstring in Google style:
    def func(arg1, arg2):
        """Summary line.

        Extended description of function.

        Args:
            arg1 (int): Description of arg1
            arg2 (str): Description of arg2

        Returns:
            bool: Description of return value
        """
        return True
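As a sketch of steps 1 and 2 applied to one preprocessing function: the function `replace_categories` and its mapping are hypothetical, chosen to match the "replace categorical data with discrete numbers" item from the modularization outcome. The test is written in the plain style that pytest collects (a function named `test_*` with bare `assert` statements).

```python
def replace_categories(rows, mapping):
    """Replace categorical values with discrete numbers.

    Args:
        rows (list[list]): Two-dimensional input data.
        mapping (dict): Category value to number, e.g. {"low": 0}.

    Returns:
        list[list]: A new table; values missing from mapping are kept as-is.
    """
    return [[mapping.get(value, value) for value in row] for row in rows]


# Would live in tests/test_preprocessing.py; pytest collects functions named test_*.
def test_replace_categories():
    rows = [["low", 10], ["high", 20]]
    assert replace_categories(rows, {"low": 0, "high": 1}) == [[0, 10], [1, 20]]
```

With tests like this in place first, later refactoring can change how the function is implemented while the test keeps checking that its behavior stays the same.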

Now which part of the code should we refactor?
    .
    ├── src
    │   ├── api
    │   │   ├── app.py
    │   │   ├── config
    │   │   ├── prediction.py
    │   │   ├── preparation.py
    │   │   └── preprocessing.py
    │   ├── requirements.txt
    │   ├── run.py
    │   └── tests
    │       ├── test_app.py
    │       ├── test_prediction.py
    │       ├── test_preparation.py
    │       └── test_preprocessing.py
    └── setup.py
• prediction.py: improve CPU-bound processing, but in this phase there is nothing to do.
• preparation.py: simplify I/O.
• preprocessing.py: remove extra modules, or replace libraries meant for analysis with ones suited to an application.

preparation.py: simplify I/O
The two strategies to simplify I/O:
1. Narrow down the data to extract from the database -> faster and lower cost
2. Wrap the client library

Before:
    from xxxx import yyyy

    client = yyyy.Client()
    query = "SELECT * FROM `data set name`"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]

After:
    from xxxx import yyyy

    client = yyyy.Client()
    # Select only the needed columns and skip rows without column_1.
    query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]

preparation.py: simplify I/O
1. Narrow down the data to extract from the database
2. Wrap the client library -> re-usable and cost-effective

Before:
    import csv
    import gzip
    import io

    from xxx.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket('bucket name')

    # Write the rows as CSV text in memory.
    with io.StringIO() as csv_obj:
        writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(two_dimensional_arrays)
        result = csv_obj.getvalue()

    # Compress the CSV text, then upload while the buffer is still open.
    with io.BytesIO() as gzip_obj:
        with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
            bytes_f = result.encode()
            gzip_file.write(bytes_f)
        blob = bucket.blob('storage_path')
        blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')

After:
    import xxx_accessor

    cs = xxx_accessor.CloudStorageAccessor()
    cs.upload_csv_gzip('bucket name', 'full path on gcs', 'input data')
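One way such a wrapper might be structured is sketched below. The class and method names follow the slide, but everything inside them is an assumption: in particular, the storage client is passed in as a constructor argument (any object offering `get_bucket(name)`), which differs from the no-argument constructor on the slide and is done here so that the wrapper can be exercised with a fake client in tests.

```python
import csv
import gzip
import io


def to_csv_gzip(rows) -> bytes:
    """Serialize a two-dimensional list to gzip-compressed CSV bytes."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        text = csv_obj.getvalue()
    return gzip.compress(text.encode())


class CloudStorageAccessor:
    """Hypothetical wrapper; `client` is any object with get_bucket(name)."""

    def __init__(self, client):
        self._client = client

    def upload_csv_gzip(self, bucket_name, path, rows):
        """Compress `rows` as CSV and upload them to `path` in the bucket."""
        blob = self._client.get_bucket(bucket_name).blob(path)
        blob.upload_from_file(
            io.BytesIO(to_csv_gzip(rows)), content_type="application/gzip"
        )
```

Keeping the serialization in a plain function (`to_csv_gzip`) means the CPU-side logic can be unit tested without any network access, while the class only adds the thin upload step.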

preprocess.py: the three ways to pre-process data
Code styles: Pandas, Python, SQL query

Filter
- Pandas: dataframe.where(), dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
- Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
- SQL query: SELECT * FROM Customers WHERE CustomerID=1;

Replace
- Pandas: dataframe.fillna(), or dic = {"key1": value1, "key2": value, …} with dataframe['column1'].replace(dic, inplace=True)
- Python: dic = {"key1": value1, "key2": value, …} with [[dic.get(v, v) for v in value] for value in values]
- SQL query: SELECT REPLACE("XYZ FGH XYZ", "X", "m");

De-duplicate / Be unique
- Pandas: duplicated() / drop_duplicates(), dataframe['column1'].unique() (output: array([v1, v2, v3]))
- Python: set(list), list({v1, v2, v2, …}), list({value[0] for value in values})
- SQL query: SELECT DISTINCT(column) FROM table1;

Delete / Drop
- Pandas: dataframe.dropna(), dataframe.drop(), dataframe.drop(index=index list)
- Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
- SQL query: DELETE FROM table_name WHERE condition;

preprocess.py: the three ways to pre-process data
Annotations on the comparison: iterative, testable, performance, simple grammar.
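To make the plain-Python column of the comparison concrete, here is a small sketch that applies the four operations to a list of rows using only built-ins. The sample data and the `levels` mapping are made up for illustration.

```python
rows = [
    ["alice", "low", 3],
    ["bob", "high", None],
    ["alice", "low", 3],
    ["carol", "mid", 5],
]

# Filter: keep rows whose third column is present.
filtered = [row for row in rows if row[2] is not None]

# Replace: map categorical values to discrete numbers, leaving other values as-is.
levels = {"low": 0, "mid": 1, "high": 2}
replaced = [[levels.get(v, v) for v in row] for row in filtered]

# De-duplicate: keep the first occurrence of each row and preserve order.
seen = set()
unique = []
for row in replaced:
    key = tuple(row)
    if key not in seen:
        seen.add(key)
        unique.append(row)

# Delete/Drop: remove the first column from every row.
dropped = [row[1:] for row in unique]

print(dropped)  # [[0, 3], [1, 5]]
```

Each step is an ordinary expression over lists, so it can be wrapped in a small function and unit tested without loading pandas in the application.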

Make them a product, which is an API

What products can be generated from Research Oriented Code?

The Flow Chart of Transformation from Research Oriented Code into Products
(Figure: flow chart. Nodes: Research Oriented Code, PoC, Analysis Scripts, WEB API, WEB Application Server, Data Store, Storage. Edge labels: Productionize, Implement API, Integrate API, Integration, Deploy, "Save artifacts such as trained models or features", "Read trained models or features".)

The Flow Chart of Transformation from Research Oriented Code into WEB API
(Figure labels: Client, URIs, Request Routing, Request parameter check, Error check, preparation, preprocessing, calculation)

Request routing: clarify input and output, and define the URI from the data
(Figure labels: Input data, Output data)
    @app.route("/v1/probabilities", methods=["GET"])  # URI: noun
    def probabilities():  # function: the same name as the endpoint
        # called function: verb (+ noun), e.g. get_probs() or calc_results()
        return get_probs(), 200

Request parameter check: write decorators with JSON Schema

Request curl command:
    curl http://localhost:5000/ -X POST -H "Content-Type: application/json" -d '{"student_name": "test_name", "student_grade": "fourth-grade"}'

JSON Schema file (make_name_grade.json):
    {
      "$schema": "http://json-schema.org/draft-04/schema#",
      "type": "object",
      "properties": {
        "student_name": {"type": "string"},
        "student_grade": {"type": "string", "minLength": 1, "maxLength": 120}
      },
      "required": ["student_name", "student_grade"]
    }

Request parameter check: write decorators with JSON Schema
• Validate the request body based on the schema file.
• Add a schema file to each endpoint.

    from functools import wraps

    from flask import Flask, current_app, jsonify, request
    from jsonschema import ValidationError, validate
    from werkzeug.exceptions import BadRequest

    app = Flask(__name__)


    def validate_json(f):
        """Reject requests whose body is not valid JSON."""
        @wraps(f)
        def wrapper(*args, **kw):
            try:
                request.json
            except BadRequest:
                msg = "This is an invalid json"
                return jsonify({"error": msg}), 400
            return f(*args, **kw)
        return wrapper


    def validate_schema(schema_name):
        """Validate the request body against the schema stored in the app config."""
        def decorator(f):
            @wraps(f)
            def wrapper(*args, **kw):
                try:
                    # The schema is expected under app.config[schema_name].
                    validate(request.json, current_app.config[schema_name])
                except ValidationError as e:
                    return jsonify({"error": e.message}), 400
                return f(*args, **kw)
            return wrapper
        return decorator


    @app.route('/', methods=['POST'])
    @validate_json
    @validate_schema('make_name_grade')
    def index():
        if request.method == "POST":
            data = request.get_json()
            print(data["student_name"])
            print(data["student_grade"])
            return "Hi! " + data["student_name"]
        else:
            return "Hi!"

Error check: think about whether processing should be stopped or continued
Preparation code (module name: preparation.py)
- Access database
- Execute query
- Load input data
-> STOP or continue: depends on prediction.py
Preprocessing code (module name: preprocessing.py)
- Replace categorical data with discrete numbers
- Filter input data
- Rename columns
-> STOP or continue: depends on prediction.py
Calculation / execution code (module name: prediction.py)
- Calculate logistic regression and output results
- …
-> Depends on services

Error check: use Flask error handler functions to detect errors

    from flask import abort, jsonify

    # api, db, and ImageInfo are defined elsewhere in the application.

    @api.errorhandler(400)
    @api.errorhandler(404)
    def error_handler(error):
        response = jsonify(
            {"error_message": error.description["error_message"], "result": error.code}
        )
        return response, error.code


    def extract_filenames(file_id: str) -> list[str]:
        """Get filenames from database"""
        img_obj = db.session.query(ImageInfo).filter(ImageInfo.file_id == file_id)
        filenames = [img.filename for img in img_obj if img.filename]
        if not filenames:
            # Stop processing with abort:
            # abort(404, {"error_message": "filenames are not found in database"})
            # Do not stop processing (no abort); this returns a response tuple, not a list:
            return (
                jsonify({"message": "filenames are not found in database", "result": 400}),
                400,
            )
        return filenames

Summary: 4 Step Transformation from Research Oriented Code into Products
1. Understand the characteristics of the code and figure out how it works by taking notes.
2. Modularize the code based on the code documentation by labeling the code as preparation, pre/post processing, and calculation.
3. Refactor the preparation code by simplifying I/O, and the preprocessing code by changing the coding style.
4. Make them a product, which is an API composed of request routing, request parameter check, and error check.

After deploying the product…
• Check performance, such as speed and stability, by using load testing tools.
• Tune the parameters of the web server (nginx/Apache) and the app server (uWSGI/Gunicorn/Uvicorn).
• Think about asynchronous versus synchronous processing.
• Rethink the architecture of the infrastructure, or refactor the code in a different language.
Vegeta attack:
    echo 'GET http://…' | vegeta attack -rate=… -duration=…s | tee …
Vegeta: https://github.com/tsenart/vegeta
LOCUST: https://locust.io
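The tools named on the slide are Vegeta and Locust. Purely to illustrate what such a check measures, the sketch below is a standard-library stand-in, not a replacement for those tools: it runs a callable many times on a thread pool and summarizes the latencies. The function name `measure_latency` and its parameters are hypothetical.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def measure_latency(call, total=100, workers=10):
    """Run `call` `total` times on `workers` threads; return latency stats in seconds."""

    def timed(_):
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(timed, range(total)))

    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
        "max": latencies[-1],
    }
```

In practice `call` would issue one request against the deployed endpoint, for example a small function that wraps `urllib.request.urlopen`; comparing the mean and the 95th percentile before and after a server parameter change gives a first indication of whether the tuning helped.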

Thank you! @JesseTetsuya
