Dataset @JesseTetsuya Software Engineer at an IT company specializing in the education industry, based in Tokyo. I mostly work in both data science and engineering. PyCon Taiwan 2020 tutorial
to work with data scientists and researchers than before. - There are fewer business use cases for implementing AI / ML applications than for developing ML models.
From Database | Research Oriented Code | ML APIs | Application
Data Scientists: Analyze log data and find optimal ML models / algorithms. Period: 2 - 3 months
Engineers: Transform research oriented code into ML APIs; integrate ML APIs into the application. Period: between 2 weeks and 2 months
What is the Research Oriented Code?
What are the ML APIs?
code Production code (Engineers) Research oriented code (Data Scientists/Researchers) Machine Learning APIs are composed of three elements Research oriented code is developed through an iterative process and integrated into production code.
What is the gap between Research Oriented Code and production code?
Code ?
- 0 points: More of a research project than a productionized system.
- 1-2 points: Not totally untested, but it is worth considering the possibility of serious holes in reliability.
- 3-4 points: There has been a first pass at basic productionization, but additional investment may be needed.
- 5-6 points: Reasonably tested, but it is possible that more of those tests and procedures may be automated.
- 7-10 points: Strong levels of automated testing and monitoring, appropriate for mission-critical systems.
- 12+ points: Exceptional levels of automated testing and monitoring.
What's your ML test score? A rubric for ML production systems
Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. Reliable Machine Learning in the Wild, NIPS Workshop. https://research.google/pubs/pub45742/
Characteristics of Coding Style: Easily handled; Visually traceable; High calculation speed; High readability; Testable and modular
Objectives of Coding Style: Finding the most efficient and suitable machine learning model; Making the code work on the server correctly and reliably
Three Differences between Research Oriented Code and Production Code
Modularize / Refactor / Check
Data From Database | LMS / Online Learning Platform | Research Oriented Code | ML APIs
Data Scientists: Analyze learning log data and find optimal ML models / algorithms. Period: 2 - 3 months
Engineers: Transform research oriented code into ML APIs; integrate ML APIs into the LMS / online learning platform. Period: between 2 weeks and 2 months
Jupyter notebook. The code is procedural, and parts of it are not categorized. The research oriented code appears to be tightly coupled.
2.1. Categorize research oriented code into preparation code, preprocessing code, and ML code
→ preparation code
Find the code to make, replace, filter, or delete input data → preprocessing code
Find the code to execute calculations or train on data → ML code
execute query, and load input data
- Rename columns
Preprocessing code: preprocessing.py
- Replace categorical data with discrete numbers
- Filter input data
ML code: prediction.py
- Calculate ICC parameters, logistic regression, and item response theory (IRT)
The research oriented code became loosely coupled.
2.2. Break them out into functions and make them testable
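As a rough illustration of step 2.2, the sketch below breaks the listed tasks (rename columns, filter input data, replace categorical data with discrete numbers) into small functions that can each be tested alone. The function names, column names, and sample rows are hypothetical and are not taken from the talk's repository.

```python
"""Hypothetical sketch: one small, testable function per step."""


def rename_columns(rows, mapping):
    """Preparation step: rename the keys of each row (a dict)."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def filter_rows(rows, column, allowed):
    """Preprocessing step: keep only rows whose value is in `allowed`."""
    return [row for row in rows if row[column] in allowed]


def replace_categories(rows, column, codes):
    """Preprocessing step: replace categorical values with discrete numbers."""
    return [{**row, column: codes[row[column]]} for row in rows]


# Made-up sample data standing in for learning log records.
raw = [
    {"uid": "s1", "ans": "correct"},
    {"uid": "s2", "ans": "incorrect"},
    {"uid": "s3", "ans": "skipped"},
]

prepared = rename_columns(raw, {"uid": "student_id", "ans": "result"})
filtered = filter_rows(prepared, "result", {"correct", "incorrect"})
encoded = replace_categories(filtered, "result", {"correct": 1, "incorrect": 0})
print(encoded)
```

Because each function takes plain data in and returns new data, none of them depends on notebook state, which is what makes them straightforward to cover in files such as test_preparation.py and test_preprocessing.py.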
200 ← noun ← the same endpoint name ← verb (+ noun)
INPUT: results of student answers
OUTPUT: probabilities of answering questions correctly
*item means a question
2.3. Clarify input and output of the whole code and define URI
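One way to pin down the input and output before choosing a web framework is to write the contract as a plain function. The sketch below is framework-agnostic and makes several assumptions: the URI `/predictions`, the JSON field names, and the per-item share of correct answers used as a stand-in for the real IRT model are all hypothetical.

```python
import json

# Hypothetical URI; the actual endpoint name from the slide is not reproduced here.
PREDICT_URI = "/predictions"


def estimate_probabilities(results):
    """Stand-in for the ML code: share of correct answers per item."""
    totals, correct = {}, {}
    for record in results:
        item = record["item_id"]
        totals[item] = totals.get(item, 0) + 1
        correct[item] = correct.get(item, 0) + record["correct"]
    return {item: correct[item] / totals[item] for item in totals}


def handle_post(body):
    """Return (status_code, response_json) for a POST to PREDICT_URI."""
    try:
        results = json.loads(body)["results"]
        probabilities = estimate_probabilities(results)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing fields, or wrong types are rejected.
        return 400, json.dumps({"error": "invalid input"})
    return 200, json.dumps({"probabilities": probabilities})


request_body = json.dumps({"results": [
    {"item_id": "q1", "correct": 1},
    {"item_id": "q1", "correct": 0},
    {"item_id": "q2", "correct": 1},
]})
status, response = handle_post(request_body)
print(status, response)
```

Keeping the handler free of framework imports means the same function could later be wrapped by a Flask route or any other server without changing the contract.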
│   │   ├── config
│   │   ├── prediction.py
│   │   ├── preparation.py
│   │   ├── preprocessing.py
│   ├── requirements.txt
│   ├── run.py
│   └── tests
│       ├── test_app.py
│       ├── test_prediction.py
│       ├── test_preparation.py
│       └── test_preprocessing.py
└── setup.py
3.1. Prepare for refactoring
Narrow down the requirements of each piece of code by writing test code, and record those requirements in comments before refactoring (or ask the data scientists to write the comments in advance).
Use # comments or docstrings (reStructuredText style / NumPy style / Google style).
ex) Google Style
    def func(arg1, arg2):
        """Summary line.

        Extended description of function.

        Args:
            arg1 (int): Description of arg1
            arg2 (str): Description of arg2

        Returns:
            bool: Description of return value
        """
        return True
3.2. Simplify I/O in preparation code
ex) BigQuery with Python
Code A
    from google.cloud import bigquery

    client = bigquery.Client()
    query = "SELECT * FROM `data set name`"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]
OUTPUT: Two Dimensional Arrays
Code B
    # Same import and client as Code A; the query selects columns and drops nulls.
    query = "SELECT column_1, column_2, column_3 FROM `data set name` WHERE column_1 IS NOT NULL"
    query_job = client.query(query)
    results = [list(row.values()) for row in query_job.result()]
OUTPUT: Two Dimensional Arrays + Filter Values + Drop Null
→ Preprocess the data in the query as much as possible
→ It is faster and lower-cost than preprocessing the data with Python
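To try the Code A versus Code B comparison without a BigQuery project, the same idea can be sketched with Python's built-in sqlite3 module as a stand-in. This is not the BigQuery client API; the table name `logs`, the extra column, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the real data set.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (column_1 TEXT, column_2 INTEGER, column_3 INTEGER, extra TEXT)"
)
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?, ?)",
    [("a", 1, 10, "x"), (None, 2, 20, "y"), ("c", 3, 30, "z")],
)

# Code A style: fetch every column and row, then filter and drop nulls in Python.
all_rows = [list(row) for row in conn.execute("SELECT * FROM logs ORDER BY column_2")]
in_python = [row[:3] for row in all_rows if row[0] is not None]

# Code B style: let the query select the columns and drop the nulls.
in_query = [
    list(row)
    for row in conn.execute(
        "SELECT column_1, column_2, column_3 FROM logs "
        "WHERE column_1 IS NOT NULL ORDER BY column_2"
    )
]
conn.close()

print(in_query)
```

Both paths yield the same two-dimensional list, but the second transfers fewer rows and columns and needs no filtering code on the Python side.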
3.2. Simplify I/O in preparation code
ex) Google Cloud Storage with Python
Make a bytes object and upload it from memory to GCS with Python
    import csv
    import gzip
    import io

    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket('bucket name')

    # Write the rows (two_dimentional_arrays, a list of lists) to an in-memory CSV string.
    with io.StringIO() as csv_obj:
        writer = csv.writer(csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(two_dimentional_arrays)
        result = csv_obj.getvalue()

    # Gzip the CSV in memory; upload only after the GzipFile is closed so the stream is complete.
    with io.BytesIO() as gzip_obj:
        with gzip.GzipFile(fileobj=gzip_obj, mode="wb") as gzip_file:
            gzip_file.write(result.encode())
        blob = bucket.blob('storage_path')
        blob.upload_from_file(gzip_obj, rewind=True, content_type='application/gzip')
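The serialization half of this step can be checked without touching GCS. The sketch below isolates it in a hypothetical helper, `rows_to_gzipped_csv`, and uses `gzip.compress` in place of the `GzipFile` wrapper for brevity; the upload call itself is deliberately left out.

```python
import csv
import gzip
import io


def rows_to_gzipped_csv(rows):
    """Serialize a two-dimensional list to gzip-compressed CSV bytes in memory."""
    with io.StringIO() as csv_obj:
        writer = csv.writer(
            csv_obj, quotechar='"', quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerows(rows)
        text = csv_obj.getvalue()
    return gzip.compress(text.encode("utf-8"))


# Round trip: compress, then decompress to confirm nothing was lost.
payload = rows_to_gzipped_csv([["s1", 1], ["s2", 0]])
restored = gzip.decompress(payload).decode("utf-8")
print(restored)
```

The returned bytes could then be wrapped in `io.BytesIO` and handed to the upload step, so only the thin upload call needs a real bucket or a mock in tests.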
Functions: Pandas / Python
Filter
  Pandas: dataframe.where() / dataframe.query(), dataframe.groupby(), dataframe[["", "", ""]], dataframe.loc[], dataframe.iloc[]
  Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
Replace
  Pandas: dataframe.fillna(); dic = {"key1": value1, "key2": value, …}; dataframe['column1'].replace(dic, inplace=True)
  Python: dic = {"key1": value1, "key2": value, …}; [[dic.get(v, v) for v in value] for value in values]
De-duplicate / Be unique
  Pandas: duplicated() / drop_duplicates(); dataframe['column1'].unique() (output: array([v1, v2, v3]))
  Python: set(list); list({v1, v2, v2, …}); list({value[0] for value in values})
Delete / Drop
  Pandas: dataframe.dropna(), dataframe.drop(), dataframe.drop(index=index list)
  Python: if - else + for + .append(), or [[v1, v2, v3] for value in values]
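The Python column of the table can be made concrete with a small worked example on a two-dimensional list. The sample rows and the `codes` mapping are invented for illustration; the comprehensions follow the patterns listed above.

```python
# Made-up two-dimensional list: [student_id, result, score]
values = [
    ["s1", "correct", 3],
    ["s2", "incorrect", None],
    ["s1", "correct", 3],
    ["s3", "skipped", 5],
]

# Delete / Drop: remove rows containing None (compare dataframe.dropna()).
no_nulls = [row for row in values if None not in row]

# Filter: keep only the rows that satisfy a condition.
answered = [row for row in no_nulls if row[1] != "skipped"]

# Replace: map categorical values to numbers, leaving other values unchanged.
codes = {"correct": 1, "incorrect": 0}
replaced = [[codes.get(v, v) for v in row] for row in answered]

# De-duplicate: unique rows, keeping first-seen order (compare drop_duplicates()).
unique_rows = [list(t) for t in dict.fromkeys(tuple(row) for row in replaced)]

# Be unique: distinct values of the first column; a set is unordered, so sort it.
student_ids = sorted({row[0] for row in values})

print(unique_rows, student_ids)
```

Note that `dic.get(v, v)` only works when every cell is hashable, which holds for the strings and numbers used here.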
and test functions > decide what to refactor > preprocessing.py 2. Confirm the input and output data > decide how to refactor > remove the DataFrame
Code into ML APIs
Understand Data / Modularize / Refactor / Check
1. Write decorators to check parameters
2. Set up production-like environments
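A decorator that checks parameters might look like the minimal sketch below. The decorator name `check_params`, the keyword-only calling convention, and the `predict` function are assumptions made for illustration, not the talk's actual implementation.

```python
import functools


def check_params(**expected_types):
    """Reject keyword arguments that are missing or of the wrong type."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            for name, expected in expected_types.items():
                if name not in kwargs:
                    raise ValueError(f"missing parameter: {name}")
                if not isinstance(kwargs[name], expected):
                    raise TypeError(f"{name} must be {expected.__name__}")
            return func(**kwargs)
        return wrapper
    return decorator


@check_params(student_id=str, answers=list)
def predict(student_id, answers):
    """Hypothetical stand-in for an ML API function."""
    return {"student_id": student_id, "n_answers": len(answers)}


print(predict(student_id="s1", answers=[1, 0, 1]))
```

Putting the checks in a decorator keeps validation out of the ML code itself, so the same checks can be reused across endpoints and tested separately.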
Python Tools that I mentioned in this talk
(PyCon US 2016)
- Pytest: https://www.youtube.com/watch?v=G-MAMrJ-CSA (PyCon US 2019)
- Flask workshop: https://www.youtube.com/watch?v=DIcpEg77gdE (PyCon US 2015)
- Dash: https://www.youtube.com/watch?v=WLbQYFZc-YY (PyCon JP 2019)
Python Packages that I mentioned in this talk
- google-cloud-bigquery: https://pypi.org/project/google-cloud-bigquery/
- google-cloud-storage: https://pypi.org/project/google-cloud-storage/
- gcp-accessor: https://pypi.org/project/gcp-accessor/0.0.1/
- Flask-AppBuilder: https://flask-appbuilder.readthedocs.io/en/latest/
The resources that I mentioned in this tutorial
https://pandas.pydata.org/docs/user_guide/index.html
- What's your ML test score? A rubric for ML production systems: https://research.google/pubs/pub45742/
- Item Response Theory Tutorial: https://www.publichealth.columbia.edu/research/population-health-methods/item-response-theory
- Research Oriented Code: https://towardsdatascience.com/research-oriented-code-in-ai-ml-projects-f0dde4f9e1ac
- Why is Educational Data Mining important in the research?: https://towardsdatascience.com/why-is-educational-data-mining-important-in-the-research-e78ed1a17908
Understand Data
- What data can be stored and created?
- What does the data look like? What is the nature of the data?
- What product can be generated from this data?
Modularize
- Categorize research oriented code into preparation code, preprocessing code, ML code
- Break them out into functions and make them testable
- Clarify input and output of the code, and define URI
Refactor
- Prepare for refactoring
- Simplify I/O in preparation code
- Pandas → Python in preprocessing code
Check
- Write decorators to check parameters
- Set up production-like environments
education industry based in Tokyo. I mostly work in both data science and engineering. If you are interested in the education and technology domain, feel free to contact me!