and frequently face research oriented code. • Understanding the process of productionizing research oriented code can help AI / ML projects run more smoothly.
Quality rubric for a productionized system:
1-2 points: Not totally untested, but it is worth considering the possibility of serious holes in reliability.
3-4 points: There has been a first pass at basic productionization, but additional investment may be needed.
5-6 points: Reasonably tested, but it is possible that more of those tests and procedures could be automated.
7-10 points: Strong levels of automated testing and monitoring, appropriate for mission critical systems.
12+ points: Exceptional levels of automated testing and monitoring.

What's your ML test score? A rubric for ML production systems. Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016). https://research.google/pubs/pub45742/

Quality definition of the production code
3 differences between research oriented code and production code

Characteristics of coding style. ML code: easily handled, visually traceable, high calculation speed. Production code: high readability, testable and modular.
Objectives of coding style. ML code: finding the most efficient and suitable machine learning model. Production code: making the code work on the server correctly and reliably.

What are Python engineers supposed to do for research oriented code at first?
Read the code before writing it.
Code Reading / Code Documentation: in the notebook

Three strategies for taking notes to prepare for modularization:
1. Write comments using "#"
2. Add "TODO" comments
3. Mob documentation (VS Live Share: https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare)
Code Reading / Code Documentation: comments

Code documentation while reading:
# Get file name
# Create new database "images.db"
# TODO: move it to a config file
# Create new connection to database object
# TODO: Use OR mapper
# Create cursor object to operate sqlite
# Initialize database
…
Group the comments under labels such as preparation code, pre/post processing code, and calculation code:

1. Preparation code
# Get file name
# Create new database "images.db"
# TODO: move it to a config file
# Create new connection to database object
# TODO: Use OR mapper
# Create cursor object to operate sqlite
# Initialize database
…
2. Pre/Post processing code
…
3. Calculation code
…
Find duplicated code and delete or unify it, and fix small bugs. Then turn each labeled group into functions:

1. Preparation code
  1. init_db()
  2. get_filename()
  3. load_config()
2. Pre/Post processing code
  …
3. Calculation code
  …
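The labeled comments above can be turned into small functions. This is a minimal sketch, assuming the sqlite database mentioned in the notes; the table schema and the config contents are illustrative assumptions, not from the deck.

```python
import sqlite3

# TODO from the notes: move this into a config file
DB_NAME = "images.db"


def load_config() -> dict:
    """Stand-in for loading settings from a config file (assumed contents)."""
    return {"db_name": DB_NAME}


def init_db(db_name: str) -> sqlite3.Connection:
    """Create the database, open a connection, and initialize the schema."""
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()
    # Hypothetical schema for illustration only
    cur.execute("CREATE TABLE IF NOT EXISTS images (file_id TEXT, filename TEXT)")
    conn.commit()
    return conn


def get_filename(conn: sqlite3.Connection, file_id: str) -> list:
    """Fetch the filenames stored for a given file_id."""
    cur = conn.execute("SELECT filename FROM images WHERE file_id = ?", (file_id,))
    return [row[0] for row in cur.fetchall()]
```

Once the notebook cells are wrapped like this, each piece of preparation code can be tested in isolation.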
The two strategies to simplify I/O:
1. Narrow down the data to extract from the database -> faster and lower cost
2. Wrap the client library

Before:
client = yyyy.Client()
query = "SELECT * FROM `data set name`"
query_job = client.query(query)
results = [list(row.values()) for row in query_job.result()]

After:
from xxxx import yyyy
client = yyyy.Client()
query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
query_job = client.query(query)
results = [list(row.values()) for row in query_job.result()]
1. Narrow down the data to extract from the database
2. Wrap the client library -> re-usable and cost-effective

Before:
import io, csv, gzip
from xxx.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket name')
with io.StringIO() as csv_obj:
    writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(two_dimensional_arrays)
    result = csv_obj.getvalue()
with io.BytesIO() as gzip_obj:
    with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
        gzip_file.write(result.encode())
    blob = bucket.blob('storage_path')
    blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')

After:
import xxx_accessor
cs = xxx_accessor.CloudStorageAccessor()
cs.upload_csv_gzip('bucket name', 'full path on gcs', 'input data')
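The deck shows only how the wrapper is called, not its implementation. A minimal sketch of what `CloudStorageAccessor` might look like follows; the injected `client` and its `upload` method are assumptions standing in for the real cloud SDK, so the CSV + gzip packing logic can run without any cloud dependency.

```python
import csv
import gzip
import io


class CloudStorageAccessor:
    """Hypothetical sketch of the wrapper class from the slide."""

    def __init__(self, client):
        # In production this would be a real storage client; here it is
        # injected so the class can be exercised with a fake.
        self._client = client

    @staticmethod
    def _to_gzipped_csv(rows) -> bytes:
        # Serialize a two-dimensional array to a gzip-compressed CSV payload,
        # mirroring the io.StringIO / csv.writer / gzip steps from the slide.
        with io.StringIO() as csv_obj:
            writer = csv.writer(csv_obj, quotechar='"',
                                quoting=csv.QUOTE_ALL, lineterminator="\n")
            writer.writerows(rows)
            text = csv_obj.getvalue()
        return gzip.compress(text.encode())

    def upload_csv_gzip(self, bucket_name, path, rows):
        # `client.upload(...)` is an assumed interface, not a real SDK call.
        payload = self._to_gzipped_csv(rows)
        self._client.upload(bucket_name, path, payload)
```

Because callers only see `upload_csv_gzip`, the serialization details can change without touching any call site, which is what makes the wrapper re-usable and cost-effective.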
Code styles for preprocessing functions: Pandas vs Python vs SQL

Filter
- Pandas: dataframe.where() / .query(), dataframe.groupby(), dataframe[["col1", "col2"]], dataframe.loc[], dataframe.iloc[]
- Python: if-else + for + .append(), or [[v1, v2, v3] for value in values]
- SQL: SELECT * FROM Customers WHERE CustomerID = 1;

Replace
- Pandas: dataframe.fillna(); dic = {"key1": value1, "key2": value2, …}; dataframe['column1'].replace(dic, inplace=True)
- Python: dic = {"key1": value1, "key2": value2, …}; [[dic.get(v, v) for v in value] for value in values]
- SQL: SELECT REPLACE("XYZ FGH XYZ", "X", "m");

De-duplicate / unique
- Pandas: duplicated() / drop_duplicates(); dataframe['column1'].unique() (output: array([v1, v2, v3]))
- Python: set(list); list({v1, v2, v2, …}); list({value[0] for value in values})
- SQL: SELECT DISTINCT(column) FROM table1;

Delete / drop
- Pandas: dataframe.dropna(); dataframe.drop(); dataframe.drop(index=index_list)
- Python: if-else + for + .append(), or [[v1, v2, v3] for value in values]
- SQL: DELETE FROM table_name WHERE condition;
When choosing among these styles, weigh four qualities: iterative exploration, testability, performance, and simple grammar.
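The "Python" column of the table above can be sketched with built-ins only. The sample `values` array and the category mapping are illustrative assumptions.

```python
# A small 2-D array of categorical rows (assumed sample data).
values = [["cat", "A"], ["dog", "B"], ["cat", "A"]]

# Replace: map categorical labels to discrete numbers with dict.get,
# leaving unknown values unchanged.
dic = {"cat": 0, "dog": 1}
replaced = [[dic.get(v, v) for v in value] for value in values]
# -> [[0, "A"], [1, "B"], [0, "A"]]

# De-duplicate: collect the unique labels of the first column with a set.
unique_labels = sorted({value[0] for value in values})
# -> ["cat", "dog"]

# Filter: keep rows whose label maps to 0, in the if-else + for + .append() style.
filtered = []
for value in values:
    if dic.get(value[0]) == 0:
        filtered.append(value)
# -> [["cat", "A"], ["cat", "A"]]
```

The comprehension forms are shorter and easier to test than the loop form, at the cost of being less familiar to readers coming from SQL.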
From research oriented code to products (diagram):
1. PoC: analysis scripts save artifacts such as trained models or features to storage.
2. Productionize: implement the WEB API, which reads the trained models or features from the data store.
3. Integration: integrate the API into the WEB application server.
4. Deploy.
Name the endpoint as a noun and define the URI from the data (input data -> output data):

@app.route("/v1/probabilities", methods=['GET'])  # <- noun
def probabilities():                              # <- the same endpoint name
    return calc_results(), 200                    # <- verb (+ noun)
    # or: return get_probs(), 200                 # <- verb (+ noun)
Decide per module whether processing should be stopped or continued on error:

Preparation code (preparation.py): access database, execute query, load input data. On error: STOP, or continue depending on prediction.py.
Preprocessing code (preprocessing.py): replace categorical data with discrete numbers, filter input data, rename columns. On error: STOP, or continue depending on prediction.py.
Calculation / execution code (prediction.py): calculate logistic regression and output results, … On error: depends on the services.
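The stop-or-continue decision can be expressed as a per-stage policy. This is a hedged sketch, not the deck's implementation: the stage functions are trivial stand-ins, and the `stop_on_error` flags are illustrative assumptions following the table above.

```python
class PipelineError(Exception):
    """Raised when a stage fails and the service must stop."""


def run_stage(stage, data, stop_on_error: bool):
    """Run one pipeline stage, stopping or continuing based on its policy."""
    try:
        return stage(data)
    except Exception as exc:
        if stop_on_error:
            raise PipelineError(f"{stage.__name__} failed: {exc}") from exc
        return data  # continue with the unmodified data


# Trivial stand-ins for preparation.py / preprocessing.py / prediction.py
def prepare(data):
    return data

def preprocess(data):
    return [x * 2 for x in data]

def calculate(data):
    return sum(data)


def predict(data):
    # Whether preparation and preprocessing may continue on error depends
    # on what prediction.py can tolerate, as the table above notes.
    data = run_stage(prepare, data, stop_on_error=True)
    data = run_stage(preprocess, data, stop_on_error=False)
    return run_stage(calculate, data, stop_on_error=True)
```

Keeping the policy at the call site makes the error behavior of each module explicit and easy to change per service.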
Use Flask to detect errors:

from flask import abort, jsonify

@api.errorhandler(400)
@api.errorhandler(404)
def error_handler(error):
    response = jsonify(
        {"error_message": error.description["error_message"],
         "result": error.code}
    )
    return response, error.code

def extract_filenames(file_id: str) -> list[str]:
    """Get filenames from the database."""
    img_obj = db.session.query(ImageInfo).filter(ImageInfo.file_id == file_id)
    filenames = [img.filename for img in img_obj if img.filename]
    if not filenames:
        # Stop processing with abort:
        # abort(404, {"error_message": "filenames are not found in database"})
        # Or return an error response without stopping:
        return (
            jsonify({"message": "filenames are not found in database",
                     "result": 400}),
            400,
        )
    return filenames
1. Understand the characteristics of the code and figure out how it works by taking notes.
2. Modularize the code based on the code documentation by labeling it as preparation, pre/post processing, and calculation.
3. Refactor the preparation code by simplifying I/O, and the preprocessing code by changing the coding style.
4. Turn them into a product: an API composed of request routing, request parameter checks, and error checks.
Measure non-functional requirements such as speed and stability with load testing tools:
- Tune the parameters of the web server (nginx/Apache) and the app server (uWSGI/Gunicorn/Uvicorn)
- Decide between asynchronous and synchronous processing
- Rethink the infrastructure architecture, or refactor the code in a different language

Vegeta attack (the original target, rate, and duration are left as placeholders):
echo "GET http://<target>" | vegeta attack -rate=<requests/s> -duration=<n>s | tee <results>

Vegeta: https://github.com/tsenart/vegeta
Locust: https://locust.io