Slide 1

Slide 1 text

Honours Thesis Defense

Slide 2

Slide 2 text

Code Search in the IDE with Query Reformulation
Shihui Gao, BCS student
Faculty of Computer Science, Dalhousie University, Canada
Supervisor: Dr. Masud Rahman
RAISE Lab: Intelligent Automation in Software EngineeRing

Slide 3

Slide 3 text

Outline of the Talk
Part 1: Motivation
Part 2: Research Methodology
Part 3: Tool Demonstration
Part 4: Empirical Findings
Part 5: Threats to Validity
Part 6: Take-home Messages
Part 7: Q&A

Slide 4

Slide 4 text

Part 1: Motivation

Slide 5

Slide 5 text

Finding Accurate Answers Is Difficult
• Vocabulary mismatch problems
• Many answers might not contain the relevant code examples
Fig. 1: Search result from Google: “How do you parse HTML?”

Slide 6

Slide 6 text

Part 2: Research Methodology

Slide 7

Slide 7 text

Replication of RACK in Python
• Analyze the original implementation of RACK
• Replicate RACK in Python
• Migrate the syntax and style from Java to Python
2 January 2024

Slide 8

Slide 8 text

Construction of the token-API database using Python posts from Stack Overflow
Fig. 2: Construction of the token-API database using Python posts from Stack Overflow [1].
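The idea of a token-API database can be illustrated with a minimal sketch. It assumes, following the general approach of RACK [1], that each post contributes a set of question keywords and a set of API classes found in its answer code. The function name `build_token_api_index` and the sample posts are hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def build_token_api_index(posts):
    """Map each keyword to a Counter of API classes it co-occurs with.

    `posts` is a list of (keywords, api_classes) pairs. This input format
    is an assumption made for illustration only.
    """
    index = defaultdict(Counter)
    for keywords, api_classes in posts:
        # Count each (keyword, API) pair at most once per post.
        for keyword in set(keywords):
            for api in set(api_classes):
                index[keyword][api] += 1
    return index


# Hypothetical sample posts, not real Stack Overflow data.
posts = [
    (["send", "email"], ["smtplib", "MIMEText"]),
    (["send", "email", "attachment"], ["smtplib", "MIMEMultipart"]),
    (["parse", "html"], ["BeautifulSoup"]),
]
index = build_token_api_index(posts)
```

Looking up `index["email"]` then gives the API classes that appeared together with the keyword "email", along with how many posts they co-occurred in.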

Slide 9

Slide 9 text

Natural Language Pre-processing
• Identify individual words
• Determine parts of speech (POS)
• Remove stop words
• Use stemming
Example query: How do you use the argparse module in Python to parse command-line arguments?
Individual words: ['How', 'do', 'you', 'use', 'the', 'argparse', 'module', 'in', 'Python', 'to', 'parse', 'command-line', 'arguments', '?']
Parts of speech: { 'use': 'VB', 'parse': 'VB', 'argparse': 'NN', 'module': 'NN', 'Python': 'NN', 'command-line': 'NN', 'arguments': 'NN' }
Stop words removed: ['use', 'argparse', 'module', 'Python', 'parse', 'arguments']
Stemmed: ['use', 'argpars', 'modul', 'python', 'pars', 'argument']
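The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the thesis implementation: the stop-word list is a tiny hypothetical one, `crude_stem` is a toy suffix stripper rather than a real stemming algorithm such as Porter's, and POS tagging is left out because it needs a trained tagger.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"how", "do", "you", "the", "in", "to", "a", "an", "of"}


def tokenize(text):
    """Split text into word tokens, keeping hyphenated words together."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def crude_stem(token):
    """Toy stemmer: lowercase, then strip one common suffix.

    This is NOT the Porter algorithm; it only mimics the flavour of
    stemming and keeps at least three characters of the word.
    """
    word = token.lower()
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(query):
    """Tokenize, remove stop words, and stem a natural language query."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(query))]


query = ("How do you use the argparse module in Python "
         "to parse command-line arguments?")
keywords = preprocess(query)
```

Unlike the slide example, this sketch keeps the hyphenated token "command-line" as a keyword, because the toy stop-word list does not filter it out.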

Slide 10

Slide 10 text

Construction of the token-API database using Python posts from Stack Overflow
Fig. 2: Construction of the token-API database using Python posts from Stack Overflow [1].

Slide 11

Slide 11 text

Determine the relevance between a natural language query and API classes, and return a list of related API classes
Fig. 3: Determining the relevance between a natural language query and API classes and returning a list of related API classes

Slide 12

Slide 12 text

Heuristics for API Suggestion
• Keyword-API Co-occurrence (KAC)
• Keyword-Keyword Co-occurrence (KKC)
• Keyword Pair API Co-occurrence (KPAC)
• ALL (KAC, KKC and KPAC combined)
Fig. 4: The APIs recommended by KAC, KPAC, KKC and ALL. Query: How to send email?
Data format: API token | KAC | KPAC | KKC | ALL
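Two of the heuristics above can be sketched as follows. The scoring is a simplification chosen for illustration and is not RACK's exact formula: `kac` sums each keyword's normalized co-occurrence counts, and `kpac` rewards APIs that co-occur with both keywords of a pair. The `TOKEN_API` table and its counts are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword -> API co-occurrence counts (not real data).
TOKEN_API = {
    "send": Counter({"smtplib": 5, "socket": 3, "requests": 2}),
    "email": Counter({"smtplib": 6, "MIMEText": 4}),
}


def kac(keywords, top_k=10):
    """Keyword-API co-occurrence: rank APIs by summed normalized counts."""
    scores = Counter()
    for keyword in keywords:
        counts = TOKEN_API.get(keyword, Counter())
        total = sum(counts.values())
        for api, n in counts.items():
            scores[api] += n / total
    return [api for api, _ in scores.most_common(top_k)]


def kpac(keywords, top_k=10):
    """Keyword-pair API co-occurrence: APIs shared by both keywords."""
    scores = Counter()
    for first, second in combinations(keywords, 2):
        a = TOKEN_API.get(first, Counter())
        b = TOKEN_API.get(second, Counter())
        for api in a.keys() & b.keys():
            # Credit the pair with the weaker of the two associations.
            scores[api] += min(a[api], b[api])
    return [api for api, _ in scores.most_common(top_k)]


kac_result = kac(["send", "email"])
kpac_result = kpac(["send", "email"])
```

For the query keywords "send" and "email", both functions rank `smtplib` first in this toy table, since it is the only API associated with both keywords.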

Slide 13

Slide 13 text

Design of a VS Code Plug-in for Code Search
Fig. 5: Code snippet search
Fig. 6: Jaccard similarity between recommended API classes and ground-truth API classes.
Fig. 7: Recommended code snippet
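Jaccard similarity between two sets is the size of their intersection divided by the size of their union. A minimal sketch, with a hypothetical function name and made-up API lists:

```python
def jaccard(recommended, ground_truth):
    """Return |A ∩ B| / |A ∪ B| for two collections of API class names."""
    a, b = set(recommended), set(ground_truth)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets score 0.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical example: two shared APIs out of four distinct ones.
score = jaccard(["smtplib", "socket", "MIMEText"],
                ["smtplib", "MIMEText", "ssl"])
```

Here the two lists share `smtplib` and `MIMEText` out of four distinct API names, so the similarity is 0.5.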

Slide 14

Slide 14 text

Part 3: Demonstration

Slide 15

Slide 15 text

Part 4: Empirical Findings

Slide 16

Slide 16 text

Evaluation
50 examples from four websites: freecodecamp.org, programiz.com, geeksforgeeks.org and realpython.com
Fig. 9: Recommended API classes and ground-truth API classes and methods for the questions

Slide 17

Slide 17 text

Performance Metrics
• Precision (P): the percentage of the retrieved API classes that are relevant. Formula: P = |GT ∩ Ra| / |Ra|
• Recall (R): the percentage of the relevant API classes that are retrieved. Formula: R = |GT ∩ Ra| / |GT|
• F1-score (F): the harmonic mean of precision and recall. Formula: F = (2 * P * R) / (P + R)
GT: the ground truth, i.e., all method and function names in the function defined on the website
Ra: the ranked list of recommended API classes
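The three formulas above translate directly into set operations. The sketch below uses a hypothetical function name and made-up API lists, and returns 0 whenever a denominator would be zero.

```python
def precision_recall_f1(ground_truth, recommended):
    """Compute P, R and F for one query from GT and Ra."""
    gt, ra = set(ground_truth), set(recommended)
    hits = len(gt & ra)  # |GT ∩ Ra|
    p = hits / len(ra) if ra else 0.0
    r = hits / len(gt) if gt else 0.0
    f = (2 * p * r) / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Hypothetical example: 2 of 5 recommendations are in a ground truth of 4.
p, r, f = precision_recall_f1(
    ["smtplib", "MIMEText", "ssl", "email"],
    ["smtplib", "socket", "MIMEText", "requests", "json"],
)
```

With two hits, five recommendations and four ground-truth items, P is 2/5, R is 2/4, and F is their harmonic mean, 4/9.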

Slide 18

Slide 18 text

Research Questions
RQ1: What are the Precision, Recall and F1-score of RACK in Python for all heuristics combined and for the individual heuristics?
RQ2: For which language does RACK perform better, Java or Python?

Slide 19

Slide 19 text

The P, R and F of RACK in Python for the top 10 recommended API classes, for all heuristics combined and for the individual heuristics
Fig. 10: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 10 ranked API classes
Fig. 11: The P, R and F of the KPAC heuristic with the top 10 ranked API classes
Fig. 12: The P, R and F of the KAC heuristic with the top 10 ranked API classes
Fig. 13: The P, R and F of the KKC heuristic with the top 10 ranked API classes
• ALL (the three heuristics combined) is the most effective.
• KPAC is the most effective of the three individual heuristics.

Slide 20

Slide 20 text

The P, R and F of RACK in Python for different top-K recommended API classes, all heuristics combined
Fig. 14: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 10 ranked API classes
Fig. 15: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 5 ranked API classes
Fig. 16: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 3 ranked API classes
• The top 5 ranked API classes have the highest mean precision (P).
• The top 10 ranked API classes have the highest mean recall (R).
• The top 3, 5 and 10 ranked API classes all reach a similar highest mean F1-score.
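How the cutoff K interacts with the means can be illustrated with a small sketch. It assumes that each query's ranked list is truncated to its first K entries before precision and recall are computed, and that the per-query values are then averaged. The function name and the toy data are hypothetical and do not reproduce the thesis results.

```python
def mean_precision_recall_at_k(results, k):
    """Average precision and recall over queries at cutoff k.

    `results` is a list of (ranked_apis, ground_truth) pairs.
    """
    if not results:
        return 0.0, 0.0
    precisions, recalls = [], []
    for ranked, ground_truth in results:
        top_k = set(ranked[:k])  # keep only the first k recommendations
        gt = set(ground_truth)
        hits = len(top_k & gt)
        precisions.append(hits / len(top_k) if top_k else 0.0)
        recalls.append(hits / len(gt) if gt else 0.0)
    n = len(results)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical toy data: two queries with four ranked APIs each.
results = [
    (["a", "b", "c", "d"], ["a", "d"]),
    (["x", "y", "z", "w"], ["y"]),
]
at_1 = mean_precision_recall_at_k(results, 1)
at_4 = mean_precision_recall_at_k(results, 4)
```

In this toy data, widening the cutoff from 1 to 4 raises mean recall from 0.25 to 1.0 while mean precision drops from 0.5 to 0.375, which is the usual trade-off when more recommendations are kept.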

Slide 21

Slide 21 text

Comparing the performance of RACK in Java and Python for different top-K recommended API classes, all heuristics combined
Python:
Performance Metric | Top-3 | Top-5 | Top-10
Mean Precision | 33.3% | 40% | 20%
Mean Recall | 8.3% | 14.3% | 23.6%
Fig. 17: Top-3, 5 and 10 Mean Precision and Mean Recall of all heuristics in Python
Java:
Performance Metric | Top-3 | Top-5 | Top-10
Mean Average Precision | 30.39% | 33.36% | 34.92%
Mean Recall | 23.71% | 33.48% | 45.02%
Fig. 18: Top-3, 5 and 10 Mean Average Precision@K and Mean Recall@K of all heuristics in Java [2]

Slide 22

Slide 22 text

Comparing the performance of RACK in Java and Python for the top 10 recommended API classes, individual heuristics
Python:
Heuristic | Metric | Top-10
Keyword-API Co-occurrence (KAC) | Mean Precision | 11.1%
Keyword-API Co-occurrence (KAC) | Mean Recall | 6.3%
Keyword-Keyword Co-occurrence (KKC) | Mean Precision | 11.1%
Keyword-Keyword Co-occurrence (KKC) | Mean Recall | 0%
Fig. 19: Mean Precision and Mean Recall of KKC, KAC and (KKC+KAC) in Python
Java:
Heuristic | Metric | Top-10
Keyword-API Co-occurrence (KAC) | Mean Average Precision | 35.41%
Keyword-API Co-occurrence (KAC) | Mean Recall | 44.8%
Keyword-Keyword Co-occurrence (KKC) | Mean Average Precision | 24.11%
Keyword-Keyword Co-occurrence (KKC) | Mean Recall | 19.52%
Fig. 20: Top-3, 5 and 10 Mean Average Precision@K and Mean Recall@K of KKC, KAC and (KKC+KAC) in Java [2]

Slide 23

Slide 23 text

Part 5: Threats to Validity

Slide 24

Slide 24 text

Threats to Validity
• Less common Python programming questions in RACK
• The generalizability of RACK

Slide 25

Slide 25 text

Part 6: Take-home Messages

Slide 26

Slide 26 text

Take-Home Message
(1) Summary
• RACK is a good API recommendation tool
• Get the relevant recommended API classes
• Return the relevant code snippet against a query
(2) Future opportunities
• More accuracy in API recommendation
• Make RACK more comprehensive and accurate

Slide 27

Slide 27 text

Part 7: Q&A

Slide 28

Slide 28 text

THANK YOU! QUESTIONS?
Contact: [email protected]

Slide 29

Slide 29 text

Appendix
References:
[1] Mohammad Masudur Rahman, Chanchal K. Roy and David Lo. RACK: Code Search in the IDE using Crowdsourced Knowledge. IEEE, 2017.
[2] Mohammad Masudur Rahman, Chanchal K. Roy and David Lo. RACK: Automatic API Recommendation using Crowdsourced Knowledge. IEEE, 2016.
[3] Parse. MDN Web Docs Glossary: Definitions of Web-related terms | MDN. (n.d.). Retrieved February 5, 2023, from https://developer.mozilla.org/en-US/docs/Glossary/Parse