
Productionize Research Oriented Code By Python

Python Conference Talk 2022 at Salt Lake City, United States

tetsuya0617

May 01, 2022

Transcript

  1. Productionize Research Oriented Code By Python
    Tetsuya Jesse Hirata (@JesseTetsuya), PyCon US 2022, Salt Lake City
  2. Background
    • Python engineers are assigned to AI/ML projects and frequently face research oriented code.
    • Understanding the process of productionizing research oriented code can help AI/ML projects run more smoothly.
  3. Research Oriented Code in AI/ML projects [Lightning Talk] - PyCon US May 2019
    https://www.youtube.com/watch?v=yFcCuinRVnU
  4. 4 Step Transformation from Research Oriented Code into Products
    1. Understand
    2. Modularize
    3. Refactor
    4. Make them a product
  5. Out of scope for this talk
    Item Response Theory: Two Parameters Logistic Regression
    Input: theta (ability), a (discrimination parameter), b (difficulty parameter)
    Output: probabilities that predict a correct answer for each item
    https://www.publichealth.columbia.edu/research/population-health-methods/item-response-theory
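
    For orientation only, the model named on this slide can be written as one small function. This is a minimal sketch of the standard textbook two-parameter logistic form; the function name `two_pl_probability` is an assumption, and the talk does not show its own implementation.

```python
import math


def two_pl_probability(theta: float, a: float, b: float) -> float:
    """Return the probability of a correct answer under the 2PL model.

    theta: ability of the learner
    a: discrimination parameter of the item
    b: difficulty parameter of the item
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # When ability equals difficulty, the exponent is zero and the
    # probability is exactly one half, whatever the discrimination is.
    print(two_pl_probability(theta=0.0, a=1.2, b=0.0))
```

    A larger `a` makes the curve steeper around `b`, which is why it is called the discrimination parameter.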
  6. Definition
    Research oriented code in AI/ML projects is code written mainly by data scientists or researchers to discover new knowledge.
  7. Workflow behind research oriented code
    1. Prepare data
    2. Pre-process data
    3. Train or calculate on the pre-processed data
    → Write a paper with the results
  8. Example of pre-processing code
    • The code can be traced visually from top to bottom.
    • It is easy and quick to write.
    • It's not clean code, but it's enough to get results quickly.
  9. Quality definition of the production code
    0 points: More of a research project than a productionized system.
    1-2 points: Not totally untested, but it is worth considering the possibility of serious holes in reliability.
    3-4 points: There's been a first pass at basic productionization, but additional investment may be needed.
    5-6 points: Reasonably tested, but it's possible that more of those tests and procedures may be automated.
    7-10 points: Strong levels of automated testing and monitoring, appropriate for mission critical systems.
    12+ points: Exceptional levels of automated testing and monitoring.
    Source: What's your ML test score? A rubric for ML production systems. Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016). https://research.google/pubs/pub45742/
  10. Workflow behind production code
    1. Architecting
    2. Implement new features or fix bugs
    3. Test and review
    → Release a product
    → Receive feedback all the time
  11. Example of both previous code
    This code builds the model in a faster and simpler way.
  12. 3 differences between research oriented code and production code
    Scopes
      Research oriented code: preparation code, preprocessing code, ML code / calculation code
      Production code: preparation code, preprocessing code, ML code / calculation code
    Characteristics of coding style
      Research oriented code: easily handled, visually traceable
      Production code: high calculation speed, high readability, testable and modular
    Objectives of coding style
      Research oriented code: finding the most efficient and suitable machine learning model
      Production code: making the code work on the server correctly and reliably
  13. 3 differences between research oriented code and production code (comparison repeated from slide 12)
    What are Python engineers supposed to do with research oriented code first?
  14. 3 differences between research oriented code and production code (comparison repeated from slide 12)
    Read the code before writing it.
  15. Code Reading / Code Documentation
    A page of research oriented code written in a Jupyter notebook.
    Three strategies for taking notes to prepare for modularization:
    1. Write comments by using “#”
    2. Add “TODO” comments
    3. Mob documentation
    VS Live Share: https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare
  16. Code Reading / Code Documentation
    Part of a page of research oriented code with comments.
    Code documentation when reading:
      # Get file name
      # Create new database "images.db"
      # TODO: set it in config file
      # Create new connection to database object
      # TODO: Use OR mapper
      # Create cursor object to operate sqlite
      # Initialize database
      …
  17. 4 Step Transformation from Research Oriented Code into Products
    1. Understand
    2. Modularize the code by using labels
  18. Categorize research oriented code from the code documentation
    Use category labels such as preparation code, pre/post processing code, and calculation code.
    Before:
      # Get file name
      # Create new database "images.db"
      # TODO: set it in config file
      # Create new connection to database object
      # TODO: Use OR mapper
      # Create cursor object to operate sqlite
      # Initialize database
      …
    After:
      1. Preparation code
        # Get file name
        # Create new database "images.db"
        # TODO: set it in config file
        # Create new connection to database object
        # TODO: Use OR mapper
        # Create cursor object to operate sqlite
        # Initialize database
        …
      2. Pre/Post processing code
        …
      3. Calculation code
        …
  19. Break them into functions and make them testable
    Find duplicated code and delete or unify it, and fix small bugs.
    Before:
      1. Preparation code
        # Get file name
        # Create new database "images.db"
        # TODO: set it in config file
        # Create new connection to database object
        # TODO: Use OR mapper
        # Create cursor object to operate sqlite
        # Initialize database
        …
      2. Pre/Post processing code
        …
      3. Calculation code
        …
    After:
      1. Preparation code
        1. init_db()
        2. get_filename()
        3. load_config()
      2. Pre/Post processing code
        …
      3. Calculation code
        …
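
    The step from labeled comments to small functions could look like the sketch below. Only the three function names come from the slide; the signatures, the JSON config format, and the `images` table schema are assumptions made for illustration, using only the standard library.

```python
import json
import sqlite3
from pathlib import Path


def load_config(path: str) -> dict:
    """Read settings (for example the database path) from a JSON file.

    Covers the note "TODO: set it in config file"; the JSON format is assumed.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def get_filename(path: str) -> str:
    """Return the file name part of a path (note: "Get file name")."""
    return Path(path).name


def init_db(db_path: str) -> sqlite3.Connection:
    """Create the database if needed and return an open connection.

    The table layout is hypothetical; the real schema is not shown in the talk.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images (file_id TEXT, filename TEXT)"
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = init_db(":memory:")  # in-memory database, nothing is written to disk
    conn.execute(
        "INSERT INTO images VALUES (?, ?)", ("1", get_filename("/data/raw/cat.png"))
    )
    print(conn.execute("SELECT filename FROM images").fetchall())
```

    Each function now takes its inputs as arguments instead of reading notebook globals, which is what makes it testable in isolation.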
  20. Modularization outcome
    Preparation code (module: preparation.py)
      - Access database
      - Execute query
      - Load input data
    Preprocessing code (module: preprocessing.py)
      - Replace categorical data with discrete numbers
      - Filter input data
      - Rename columns
    Calculation / execution code (module: prediction.py)
      - Calculate logistic regression
      - Output results
      …
  21. Modularization outcome
    Preparation code (module: preparation.py)
      - Access database, execute query, and load input data
      - Rename columns
      - …
    Preprocessing code (module: preprocessing.py)
      - Replace categorical data with discrete numbers
      - Filter input data
      - …
    Calculation / execution code (module: prediction.py)
      - Calculate logistic regression and output results
      - …
    The research oriented code became loosely coupled.
  22. Mapping each module into a directory
    Flask directory for a small team in an internal API development:
      ├── Dockerfile
      ├── README.md
      ├── api
      │   ├── __init__.py  (routing list)
      │   ├── config
      │   │   ├── __init__.py
      │   │   └── base.py
      │   └── urls
      │       ├── __init__.py
      │       ├── preparation.py
      │       ├── preprocessing.py
      │       └── prediction.py
      ├── docker-compose.yaml
      ├── models
      │   └── models.py  (database schema)
      ├── requirements.txt
      ├── run.py
      ├── tests
      └── venv
    Flask directory for a big team in an internal API development:
      ├── Dockerfile
      ├── README.md
      ├── apis
      │   ├── __init__.py  (routing list)
      │   ├── config
      │   │   ├── __init__.py
      │   │   ├── base.py
      │   │   ├── staging.py
      │   │   ├── production.py
      │   │   └── local.py
      │   ├── v…
      │   │   ├── __init__.py
      │   │   ├── preparation.py
      │   │   ├── preprocessing.py
      │   │   └── prediction.py
      │   └── v…
      │       ├── __init__.py
      │       └── prediction.py
      ├── docker-compose.yaml
      ├── requirements.txt
      ├── models
      │   └── models.py  (database schema)
      ├── tests
      └── run.py
  23. 4 Step Transformation from Research Oriented Code into Products
    1. Understand
    2. Modularize
    3. Refactor preparation code and pre-processing code
  24. Before refactoring the code
    1. Write test code.
    2. Add docstrings in a style such as reStructuredText, NumPy, or Google.
    3. Run a code formatter, then check that the code works correctly and follows the coding style.
       Code formatter: black / autopep8, isort
       Code checker: pytest, flake8
    Directory with test code added:
      ├── Dockerfile
      ├── README.md
      ├── api
      │   ├── __init__.py
      │   ├── config
      │   │   ├── __init__.py
      │   │   └── base.py
      │   └── urls
      │       ├── __init__.py
      │       ├── endpoint….py
      │       └── endpoint….py
      ├── docker-compose.yaml
      ├── models
      │   └── models.py
      ├── requirements.txt
      ├── run.py
      ├── tests
      │   ├── test_app.py
      │   ├── test_prediction.py
      │   ├── test_preparation.py
      │   └── test_preprocessing.py
      └── venv
    This is a docstring in Google style:
      def func(arg1, arg2):
          """Summary line.

          Extended description of function.

          Args:
              arg1 (int): Description of arg1
              arg2 (str): Description of arg2

          Returns:
              bool: Description of return value
          """
          return True
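
    Steps 1 and 2 can be combined in one small file, sketched below. The function `rename_columns` and its record format (a list of dicts) are assumptions chosen to match the "Rename columns" task named earlier; the test functions follow the naming convention that pytest collects, but they are plain functions and also run without pytest.

```python
def rename_columns(records, mapping):
    """Rename the keys of each record.

    Args:
        records (list[dict]): Input records, one dict per row.
        mapping (dict): Old column name to new column name.

    Returns:
        list[dict]: New records; columns missing from mapping keep their name.
    """
    return [{mapping.get(key, key): value for key, value in record.items()}
            for record in records]


def test_rename_columns_renames_mapped_keys():
    records = [{"sid": 1, "ans": 0}]
    expected = [{"student_id": 1, "ans": 0}]
    assert rename_columns(records, {"sid": "student_id"}) == expected


def test_rename_columns_does_not_modify_input():
    records = [{"sid": 1}]
    rename_columns(records, {"sid": "student_id"})
    assert records == [{"sid": 1}]


if __name__ == "__main__":
    # Run the checks directly; with pytest installed, `pytest this_file.py`
    # would discover the same functions.
    test_rename_columns_renames_mapped_keys()
    test_rename_columns_does_not_modify_input()
    print("all checks passed")
```

    Having such tests in place before refactoring is what lets later changes to the implementation be verified against unchanged behavior.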
  25. Now which part of the code should we refactor?
      .
      ├── src
      │   ├── api
      │   │   ├── app.py
      │   │   ├── config
      │   │   ├── prediction.py
      │   │   ├── preparation.py
      │   │   └── preprocessing.py
      │   ├── requirements.txt
      │   ├── run.py
      │   └── tests
      │       ├── test_app.py
      │       ├── test_prediction.py
      │       ├── test_preparation.py
      │       └── test_preprocessing.py
      └── setup.py
    prediction.py: improve CPU-bound processing, but there is nothing to do in this phase.
    preparation.py: simplify I/O.
    preprocessing.py: remove extra modules, or replace libraries meant for analysis with ones suited to an application.
  26. Now which part of the code should we refactor? (directory tree and notes repeated from slide 25)
  27. preparation.py: simplify I/O
    The two strategies to simplify I/O:
    1. Narrow down the data to extract from the database -> faster and lower cost
    2. Wrap the client library
    Before:
      from xxxx import yyyy

      client = yyyy.Client()
      query = "SELECT * FROM `data set name`"
      query_job = client.query(query)
      results = [list(row.values()) for row in query_job.result()]
    After:
      from xxxx import yyyy

      client = yyyy.Client()
      query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
      query_job = client.query(query)
      results = [list(row.values()) for row in query_job.result()]
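
    The effect of the first strategy can be tried locally. The sketch below swaps the unnamed cloud client on the slide for the standard library's `sqlite3` with an in-memory table; the `answers` table, its columns, and both function names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (student_id INTEGER, item_id INTEGER, "
    "correct INTEGER, memo TEXT)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?)",
    [(1, 10, 1, "note a"), (2, 10, None, "note b"), (3, 11, 0, "note c")],
)


def fetch_everything(conn):
    """Research style: pull every column and every row, filter later in Python."""
    cursor = conn.execute("SELECT * FROM answers ORDER BY student_id")
    return [list(row) for row in cursor]


def fetch_needed(conn):
    """Production style: name the needed columns and drop NULL rows in the query."""
    cursor = conn.execute(
        "SELECT student_id, item_id, correct FROM answers "
        "WHERE correct IS NOT NULL ORDER BY student_id"
    )
    return [list(row) for row in cursor]


if __name__ == "__main__":
    print(len(fetch_everything(conn)), "rows with 4 columns each")
    print(len(fetch_needed(conn)), "rows with 3 columns each")
```

    With a metered cloud database the narrower query transfers and scans less data, which is where the "faster and lower cost" claim on the slide comes from.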
  28. preparation.py: simplify I/O
    1. Narrow down the data to extract from the database
    2. Wrap the client library -> re-usable and cost-effective
    Before:
      import io, csv, gzip
      from xxx.cloud import storage

      storage_client = storage.Client()
      bucket = storage_client.get_bucket('bucket name')
      with io.StringIO() as csv_obj:
          writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
          writer.writerows(two_dimentional_arrays)
          result = csv_obj.getvalue()
      with io.BytesIO() as gzip_obj:
          with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
              bytes_f = result.encode()
              gzip_file.write(bytes_f)
          blob = bucket.blob('storage_path')
          blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')
    After:
      import xxx_accessor

      cs = xxx_accessor.CloudStorageAccessor()
      cs.upload_csv_gzip('bucket name', 'full path on gcs', 'input data')
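
    What such a wrapper might contain is sketched below. The slide only shows the call `CloudStorageAccessor().upload_csv_gzip(...)`; the constructor argument `uploader`, the helper `rows_to_csv_gzip`, and the return value are assumptions. The real storage client is replaced by an injected callable so the sketch runs with the standard library alone.

```python
import csv
import gzip
import io


def rows_to_csv_gzip(rows) -> bytes:
    """Serialize rows to fully quoted CSV text and gzip-compress it."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        text = csv_obj.getvalue()
    return gzip.compress(text.encode("utf-8"))


class CloudStorageAccessor:
    """Hypothetical wrapper that hides serialization and upload details.

    `uploader` stands in for a real storage client call and is invoked as
    uploader(bucket_name, path, payload_bytes, content_type).
    """

    def __init__(self, uploader):
        self._uploader = uploader

    def upload_csv_gzip(self, bucket_name, path, rows):
        """Upload rows as a gzip-compressed CSV and return the payload size."""
        payload = rows_to_csv_gzip(rows)
        self._uploader(bucket_name, path, payload, "application/gzip")
        return len(payload)


if __name__ == "__main__":
    uploaded = {}

    def fake_uploader(bucket_name, path, payload, content_type):
        # Test double: keep the bytes in memory instead of sending them anywhere.
        uploaded[(bucket_name, path)] = payload

    cs = CloudStorageAccessor(fake_uploader)
    cs.upload_csv_gzip("bucket name", "full/path.csv.gz", [["a", 1], ["b", 2]])
    print(gzip.decompress(uploaded[("bucket name", "full/path.csv.gz")]).decode())
```

    Injecting the uploader keeps the wrapper testable without network access; callers see one method instead of the CSV, gzip, and blob steps.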
  29. Now which part of the code should we refactor? (directory tree and notes repeated from slide 25)
  30. preprocessing.py: the three ways to pre-process data
    Filter
      Pandas: dataframe.where() / dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
      Python: if-else + for + .append(), or [[v1, v2, v3] for value in values]
      SQL query: SELECT * FROM Customers WHERE CustomerID=1;
    Replace
      Pandas: dataframe.fillna(); dic = {"key1": value1, "key2": value, …}; dataframe['column1'].replace(dic, inplace=True)
      Python: dic = {"key1": value1, "key2": value, …}; [[dic.get(v, v) for v in value] for value in values]
      SQL query: SELECT REPLACE("XYZ FGH XYZ", "X", "m");
    De-duplicate / Be unique
      Pandas: duplicated() / drop_duplicates(); dataframe['column1'].unique() (output: array([v1, v2, v3]))
      Python: set(list); list({v1, v2, v2, …}); list({value[0] for value in values})
      SQL query: SELECT DISTINCT(column) FROM table1;
    Delete / Drop
      Pandas: dataframe.dropna(), dataframe.drop(), dataframe.drop(index=index list)
      Python: if-else + for + .append(), or [[v1, v2, v3] for value in values]
      SQL query: DELETE FROM table_name WHERE condition;
  31. preprocessing.py: the three ways to pre-process data (comparison of Pandas, Python, and SQL query repeated from slide 30)
    Iterative / Testable / Performance / Simple grammar
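
    The plain-Python column of the comparison can be tried without pandas. The sketch below applies filter, replace, and de-duplicate to a small list of rows; the sample data and function names are assumptions. It uses `dict.fromkeys` rather than the `set` shown on the slide so that the original row order is kept.

```python
def filter_rows(values):
    """Filter: keep only rows that contain no missing value (None)."""
    return [row for row in values if None not in row]


def replace_values(values, dic):
    """Replace: map each cell through dic, leaving unknown cells unchanged."""
    return [[dic.get(v, v) for v in row] for row in values]


def unique_rows(values):
    """De-duplicate: drop repeated rows while keeping first-seen order."""
    return [list(key) for key in dict.fromkeys(tuple(row) for row in values)]


if __name__ == "__main__":
    values = [
        ["apple", "red", 3],
        ["banana", "yellow", None],
        ["apple", "red", 3],
        ["cherry", "red", 5],
    ]
    step1 = filter_rows(values)
    step2 = replace_values(step1, {"red": 0, "yellow": 1})
    step3 = unique_rows(step2)
    print(step3)
```

    Each step is a pure function over lists, so it can be unit tested and profiled on its own, which is harder to do with a long chain of dataframe mutations.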
  32. 4 Step Transformation from Research Oriented Code into Products
    1. Understand
    2. Modularize
    3. Refactor
    4. Make them a product, which is an API
  33. The Flow Chart of Transformation from Research Oriented Code into Products
    [Flow chart. Nodes: Research Oriented Code (PoC, analysis scripts), Web API, Web application server, data store / storage. Arrows: productionize, implement API, integrate API, integration, deploy; save artifacts such as trained models or features; read trained models or features. Steps 1 to 4 are marked on the chart.]
  34. The Flow Chart of Transformation from Research Oriented Code into Products (flow chart repeated from slide 33, with step 4 highlighted)
  35. The Flow Chart of Transformation from Research Oriented Code into Web API
    [Flow chart labels: Client, URIs, request routing, request parameter check, error check, preparation, preprocessing, calculation.]
  36. The Flow Chart of Transformation from Research Oriented Code into Web API (flow chart repeated from slide 35)
  37. Request routing: clarify input and output, and define the URI from the data
    Input data → Output data
      @app.route("/v1/probabilities", methods=['GET'])   ← noun
      def probabilities():                                ← the same endpoint name
          return calc_results(), 200                      ← verb (+ noun)
          # or: return get_probs(), 200
  38. The Flow Chart of Transformation from Research Oriented Code into Web API (flow chart repeated from slide 35)
  39. Request parameter check: write decorators with JSON Schema
    Request curl command:
      curl http://localhost:5000/ -X POST -H "Content-Type: application/json" -d '{"student_name": "test_name", "student_grade": "fourth-grade"}'
    JSON Schema file (make_name_grade.json):
      {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "type": "object",
        "properties": {
          "student_name": {"type": "string"},
          "student_grade": {"type": "string", "minLength": 1, "maxLength": 120}
        },
        "required": ["student_name", "student_grade"]
      }
  40. Request parameter check: write decorators with JSON Schema
    Validate the request body based on the schema file:
      import json
      from functools import wraps

      from flask import current_app, jsonify, request
      from jsonschema import ValidationError, validate
      from werkzeug.exceptions import BadRequest

      def validate_json(f):
          @wraps(f)
          def wrapper(*args, **kw):
              try:
                  request.json  # raises BadRequest if the body is not valid JSON
              except BadRequest as e:
                  msg = "This is an invalid json"
                  return jsonify({"error": msg}), 400
              return f(*args, **kw)
          return wrapper

      def validate_schema(schema_name):
          def decorator(f):
              @wraps(f)
              def wrapper(*args, **kw):
                  try:
                      # the schema is expected to be loaded into the app config
                      validate(request.json, current_app.config[schema_name])
                  except ValidationError as e:
                      return jsonify({"error": e.message}), 400
                  return f(*args, **kw)
              return wrapper
          return decorator
    Add the schema file to each endpoint:
      @app.route('/', methods=['POST'])
      @validate_json
      @validate_schema('make_name_grade')
      def index():
          if request.method == "POST":
              data = json.loads(request.data)
              print(data["student_name"])
              print(data["student_grade"])
              return "Hi! " + data["student_name"]
          else:
              return "Hi!"
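
    The decorator-factory pattern used for `validate_schema` can be studied without a web framework. The sketch below is a simplified stand-in, not the slide's implementation: a hand-written required-field and type check replaces `jsonschema.validate`, and the handler receives a plain dict instead of a Flask request. The name `validate_fields` and the `(body, status)` return shape are assumptions.

```python
from functools import wraps


def validate_fields(required):
    """Decorator factory: `required` maps field names to expected types."""

    def decorator(f):
        @wraps(f)  # keep the handler's name and docstring for routing and docs
        def wrapper(payload, *args, **kw):
            for name, expected_type in required.items():
                if name not in payload:
                    return {"error": f"'{name}' is a required property"}, 400
                if not isinstance(payload[name], expected_type):
                    message = f"'{name}' must be of type {expected_type.__name__}"
                    return {"error": message}, 400
            # Only reached when every check passed.
            return f(payload, *args, **kw)

        return wrapper

    return decorator


@validate_fields({"student_name": str, "student_grade": str})
def index(payload):
    """Handler body stays free of validation code."""
    return {"message": "Hi! " + payload["student_name"]}, 200


if __name__ == "__main__":
    print(index({"student_name": "test_name", "student_grade": "fourth-grade"}))
    print(index({"student_name": "test_name"}))
```

    The outer function captures the per-endpoint rule, the middle one captures the handler, and the inner one runs per request; that three-level shape is the same in the Flask version.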
  41. The Flow Chart of Transformation from Research Oriented Code into Web API (flow chart repeated from slide 35)
  42. Error check: decide whether processing should be stopped or continued
    Preparation code (module: preparation.py)
      - Access database
      - Execute query
      - Load input data
      → Stop or continue, depending on prediction.py
    Preprocessing code (module: preprocessing.py)
      - Replace categorical data with discrete numbers
      - Filter input data
      - Rename columns
      → Stop or continue, depending on prediction.py
    Calculation / execution code (module: prediction.py)
      - Calculate logistic regression and output results
      - …
      → Depends on services
  43. Error check: use Flask error handler functions to detect errors
      from flask import abort, jsonify

      @api.errorhandler(400)
      @api.errorhandler(404)
      def error_handler(error):
          response = jsonify(
              {"error_message": error.description["error_message"], "result": error.code}
          )
          return response, error.code

      def extract_filenames(file_id: str) -> list[str]:
          """Get filenames from database"""
          img_obj = db.session.query(ImageInfo).filter(ImageInfo.file_id == file_id)
          filenames = [img.filename for img in img_obj if img.filename]
          if not filenames:
              # Stop processing with abort:
              # abort(404, {"error_message": "filenames are not found in database"})
              # Do not stop processing (no abort):
              return (
                  jsonify({"message": "filenames are not found in database", "result": 400}),
                  400,
              )
          return filenames
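
    The stop-or-continue decision can be illustrated without Flask. In the sketch below, which is a simplified analogy rather than the slide's code, a custom exception `StopProcessing` plays the role of `abort`, and `handle_request` plays the role of the registered error handler. All names, the record format, and the `stop_on_missing` flag are assumptions.

```python
class StopProcessing(Exception):
    """Raised when the remaining steps cannot run meaningfully."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status
        self.message = message


def extract_filenames(records, file_id, stop_on_missing=True):
    """Return non-empty filenames for file_id from a list of dict records."""
    filenames = [
        r["filename"] for r in records if r["file_id"] == file_id and r["filename"]
    ]
    if not filenames and stop_on_missing:
        # Stop: later steps depend on having at least one file.
        raise StopProcessing(404, "filenames are not found in database")
    # Continue: an empty list is passed on and handled downstream.
    return filenames


def handle_request(records, file_id, stop_on_missing=True):
    """Single place that turns a stop signal into an error response."""
    try:
        filenames = extract_filenames(records, file_id, stop_on_missing)
    except StopProcessing as e:
        return {"error_message": e.message, "result": e.status}, e.status
    return {"filenames": filenames}, 200


if __name__ == "__main__":
    records = [{"file_id": "1", "filename": "cat.png"}]
    print(handle_request(records, "1"))
    print(handle_request(records, "2"))
    print(handle_request(records, "2", stop_on_missing=False))
```

    Whether a missing input should stop the request or flow through as an empty result depends on what the downstream calculation can tolerate, which is the point of the table on slide 42.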
  44. Summary: 4 Step Transformation from Research Oriented Code into Products
    1. Understand the characteristics of the code and figure out how it works by taking notes.
    2. Modularize the code based on the code documentation by labeling it as preparation, pre/post processing, and calculation.
    3. Refactor the preparation code by simplifying I/O, and the pre-processing code by changing the coding style.
    4. Make them a product, which is an API composed of request routing, request parameter check, and error check.
  45. After deploying the product…
    • Check performance, such as speed and stability, by using load testing tools.
    • Tune the parameters of the web server (nginx/apache) and the app server (uwsgi/gunicorn/uvicorn).
    • Think about asynchronous versus synchronous processing.
    • Rethink the infrastructure architecture, or refactor the code in a different language.
    Vegeta attack:
      echo "GET http://…" | vegeta attack -rate=… -duration=…s | tee …
    Vegeta: https://github.com/tsenart/vegeta
    LOCUST: https://locust.io
  46. 4 Step Transformation from Research Oriented Code into Products
    1. Understand
    2. Modularize
    3. Refactor
    4. Make them a product