Code Search in the IDE with Query Reformulation

Shihui
January 02, 2024

Software developers spend about 20% of their programming time searching for relevant code, and much of that time goes into manually choosing search queries. Because of vocabulary mismatch problems, the right answer is not always retrieved, which leads to repeated trial and error. Furthermore, many answers do not contain the code examples that developers are looking for. In this thesis, we extend RACK, an existing solution for code search, to address these problems. First, we replicate RACK in Python from its original Java implementation. Second, we construct a token-API database by analyzing thousands of Python posts from Stack Overflow. Third, we determine the relevance between a natural language query and API classes using three co-occurrence-based heuristics: KKC, KAC and KPAC. We then return a list of relevant API classes for the query. Finally, we integrate our RACK implementation into a VS Code plug-in. The plug-in accepts a natural language query and retrieves relevant code examples from GitHub by combining GitHub's search API with the API classes suggested by RACK. Developers can use these code examples to solve their programming problems faster.

Transcript

  1. Code Search in the IDE with Query Reformulation

    Shihui Gao, BCS student, Faculty of Computer Science, Dalhousie University, Canada. Supervisor: Dr. Masud Rahman. RAISE Lab: Intelligent Automation in Software EngineeRing.
  2. Outline of the Talk

    Part 1: Motivation
    Part 2: Research Methodology
    Part 3: Tool Demonstration
    Part 4: Empirical Findings
    Part 5: Threats to Validity
    Part 6: Take-home Messages
    Part 7: Q&A
  3. Finding Accurate Answers is Difficult

    • Vocabulary mismatch problems
    • Many answers might not contain the relevant code examples

    Fig. 1: Search result from Google: “How do you parse HTML?”
  4. Replication of RACK in Python

    • Analyze the original implementation of RACK
    • Replicate RACK in Python
    • Migrate the syntax and style from Java to Python
  5. Construction of token-API database using Python posts from Stack Overflow

    Fig. 2: Construction of token-API database using Python posts from Stack Overflow [1].
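
The slide gives no implementation details for the database, so the following is only a minimal sketch of one way such a token-API mapping could be built. It assumes each post is already reduced to pre-processed title tokens plus a code snippet, and it uses Python's `ast` module to collect the names that the snippet calls. The function names and the sample post are hypothetical, not taken from the thesis.

```python
import ast
from collections import Counter, defaultdict


def extract_api_names(code: str) -> list[str]:
    """Return the sorted names of functions/classes called in a Python snippet."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Snippets from Q&A posts are often incomplete; skip those that do not parse.
        return []
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):         # e.g. open(...)
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # e.g. smtplib.SMTP(...)
                names.add(func.attr)
    return sorted(names)


def build_token_api_db(posts):
    """Map each title token to a Counter of API names seen in the same post."""
    db = defaultdict(Counter)
    for title_tokens, code in posts:
        apis = extract_api_names(code)
        for token in set(title_tokens):
            db[token].update(apis)
    return db


# Hypothetical posts: (pre-processed title tokens, code from the answer).
posts = [
    (["send", "email"],
     "import smtplib\nserver = smtplib.SMTP('localhost')\nserver.sendmail(a, b, c)"),
    (["pars", "html"], "this is not valid python ("),
]
db = build_token_api_db(posts)
print(dict(db["email"]))
```

Counting per post, rather than per call, keeps one long snippet from dominating the counts; whether the thesis does the same is not stated on the slide.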
  6. Natural Language Pre-processing

    • Identify individual words
    • Determine Parts of Speech (POS)
    • Remove stop words
    • Use stemming

    Example query: How do you use the argparse module in Python to parse command-line arguments?
    Words: ['How', 'do', 'you', 'use', 'the', 'argparse', 'module', 'in', 'Python', 'to', 'parse', 'command-line', 'arguments', '?']
    POS tags: { 'use': 'VB', 'parse': 'VB', 'argparse': 'NN', 'module': 'NN', 'Python': 'NN', 'command-line': 'NN', 'arguments': 'NN' }
    After stop-word removal: ['use', 'argparse', 'module', 'Python', 'parse', 'arguments']
    After stemming: ['use', 'argpars', 'modul', 'python', 'pars', 'argument']
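
The steps on this slide can be sketched with the standard library alone. This is an illustration under stated assumptions, not the thesis pipeline: the stop-word list is a tiny hand-picked set, `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, POS tagging is skipped, and tokens containing non-letter characters are dropped, which is one way to account for `command-line` disappearing in the slide's example.

```python
# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"how", "do", "you", "the", "in", "to", "a", "an", "of", "is"}


def tokenize(query):
    # Split on whitespace and detach the question mark as its own token.
    return query.replace("?", " ?").split()


def crude_stem(word):
    # Toy suffix stripping; a real pipeline would use a proper stemmer.
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    if len(w) > 3 and w.endswith("e"):
        w = w[:-1]
    return w


def preprocess(query):
    tokens = tokenize(query)
    # Keep purely alphabetic tokens that are not stop words.
    kept = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
    return [crude_stem(t) for t in kept]


query = "How do you use the argparse module in Python to parse command-line arguments?"
print(preprocess(query))
```

On the slide's query this toy version yields the same six stems as the slide, but it will diverge from a real stemmer on most other inputs.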
  7. Construction of token-API database using Python posts from Stack Overflow

    Fig. 2: Construction of token-API database using Python posts from Stack Overflow [1].
  8. Determine the relevance between a natural language query and API classes and return a list of related API classes

    Fig. 3: Determining the relevance between a natural language query and API classes and returning a list of related API classes
  9. Heuristics for API Suggestion

    • Keyword-API Co-occurrence (KAC)
    • Keyword-Keyword Co-occurrence (KKC)
    • Keyword Pair API Co-occurrence (KPAC)
    • ALL

    Fig. 4: The APIs recommended by KAC, KPAC, KKC and ALL. Query: How to send email? Data format: API token | KAC | KPAC | KKC | ALL
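
The slide names the heuristics without giving their scoring formulas. To illustrate the general idea behind a keyword-API co-occurrence ranking, here is a deliberately simplified sketch that sums raw co-occurrence counts per API class and returns the top K. The counts are invented toy data and the scoring is an assumption; it is not the scoring used by RACK or by the thesis.

```python
from collections import Counter

# Hypothetical token-API co-occurrence counts: token -> API class -> number of posts.
TOKEN_API_COUNTS = {
    "send": {"SMTP": 2, "EmailMessage": 1, "MIMEMultipart": 1},
    "email": {"SMTP": 2, "EmailMessage": 1, "MIMEMultipart": 1},
    "pars": {"HTMLParser": 3},
}


def kac_rank(keywords, counts, k=10):
    """Rank API classes by how often they co-occur with the query keywords."""
    scores = Counter()
    for kw in keywords:
        for api, n in counts.get(kw, {}).items():
            scores[api] += n
    # Sort by descending score, then by name so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [api for api, _ in ranked[:k]]


print(kac_rank(["send", "email"], TOKEN_API_COUNTS, k=2))
```

A combined strategy such as ALL would merge the candidate lists of several heuristics; how the thesis weights them is not shown on the slide.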
  10. Design of a VS Code Plug-in for Code Search

    Fig. 5: Code snippet search
    Fig. 6: Jaccard Similarity between recommended API classes and ground truth API classes.
    Fig. 7: Recommended code snippet
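
The abstract says the plug-in combines the natural language query with RACK's API classes and sends the result to GitHub's search API. The plug-in's actual request format is not shown, so the following only sketches how such a reformulated query could be turned into a request URL for GitHub's REST code-search endpoint (`/search/code`, which expects authenticated requests). The function name, the `per_page` default and the way terms are joined are assumptions.

```python
from urllib.parse import urlencode

GITHUB_CODE_SEARCH = "https://api.github.com/search/code"


def build_search_url(query, api_classes, language="python", per_page=5):
    """Append the suggested API classes and a language qualifier to the query."""
    terms = query.split() + list(api_classes) + [f"language:{language}"]
    params = {"q": " ".join(terms), "per_page": per_page}
    return GITHUB_CODE_SEARCH + "?" + urlencode(params)


# Hypothetical reformulation of "send email" with two suggested API classes.
print(build_search_url("send email", ["SMTP", "EmailMessage"]))
```

The sketch stops at building the URL; sending it needs an access token and an HTTP client, which are left out here.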
  11. Evaluation

    50 examples from four websites: freecodecamp.org, programiz.com, geeksforgeeks.org and realpython.com

    Fig. 9: Recommended API classes and ground-truth API classes and methods for the questions
  12. Performance Metrics

    • Precision (P): the percentage of the retrieved API classes that are relevant. Formula: P = |GT ∩ Ra| / |Ra|
    • Recall (R): the percentage of relevant API classes that are retrieved. Formula: R = |GT ∩ Ra| / |GT|
    • F1-score (F): the harmonic mean of precision and recall. Formula: F = (2 * P * R) / (P + R)

    GT: shorthand for “ground truth”, i.e., all method and function names in the website's function definition for a question.
    Ra: the ranked list of recommended API classes.
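
The three formulas above translate directly into set operations. The sketch below assumes the ground truth is a set of names and the recommendations are a ranked list cut off at the top K; the sample names are hypothetical.

```python
def precision_recall_f1(ground_truth, ranked_apis, k=10):
    """Compute P, R and F for the top-k recommended API classes."""
    gt = set(ground_truth)
    top_k = set(ranked_apis[:k])
    hits = len(gt & top_k)  # |GT ∩ Ra|
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gt) if gt else 0.0
    # With no hits both P and R are zero, so define F as zero as well.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 2 of the 3 recommended classes are in a ground truth of 4.
gt = {"SMTP", "EmailMessage", "MIMEText", "MIMEMultipart"}
ra = ["SMTP", "EmailMessage", "HTMLParser"]
p, r, f = precision_recall_f1(gt, ra)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here P = 2/3, R = 2/4 and F = 4/7. Averaging these values over all evaluation questions gives the mean precision, recall and F1-score.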
  13. Research Questions

    RQ1: What are the Precision, Recall and F1-Score of RACK in Python for all heuristics and individual heuristics?
    RQ2: Java & Python: for which language does RACK perform better?
  14. The P, R and F of RACK in Python for the top 10 recommended API classes, for all heuristics and individual heuristics

    Fig. 10: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 10 ranked API classes
    Fig. 11: The P, R and F of the KPAC heuristic with the top 10 ranked API classes
    Fig. 12: The P, R and F of the KAC heuristic with the top 10 ranked API classes
    Fig. 13: The P, R and F of the KKC heuristic with the top 10 ranked API classes

    ALL is the most effective. KPAC is the most effective of the three individual heuristics.
  15. P, R and F of RACK in Python for different top K recommended API classes, for all heuristics

    Fig. 14: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 10 ranked API classes
    Fig. 15: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 5 ranked API classes
    Fig. 16: The P, R and F of all heuristics (KAC, KKC and KPAC) with the top 3 ranked API classes

    The top 5 ranked API classes have the highest mean precision (P). The top 10 ranked API classes have the highest mean recall (R). The top 3, 5 and 10 ranked API classes all have similar mean F1-scores.
  16. Compare the performance of RACK for different top K recommended API classes, for all heuristics, between Java and Python

    Python (Fig. 17: Top-3, 5 and 10 Mean Precision and Mean Recall of all heuristics in Python):
    Performance Metric | Top-3 | Top-5 | Top-10
    Mean Precision | 33.3% | 40% | 20%
    Mean Recall | 8.3% | 14.3% | 23.6%

    Java (Fig. 18: Top-3, 5 and 10 Mean Average Precision@K and Mean Recall@K of all heuristics in Java [2]):
    Performance Metric | Top-3 | Top-5 | Top-10
    Mean Average Precision | 30.39% | 33.36% | 34.92%
    Mean Recall | 23.71% | 33.48% | 45.02%
  17. Compare the performance of RACK for the top 10 recommended API classes, for individual heuristics, between Java and Python

    Python (Fig. 19: Mean Precision and Mean Recall of KKC, KAC and (KKC+KAC) in Python):
    Heuristics | Metric | Top-10
    Keyword-API Co-occurrence (KAC) | Mean Precision | 11.1%
    Keyword-API Co-occurrence (KAC) | Mean Recall | 6.3%
    Keyword-Keyword Co-occurrence (KKC) | Mean Precision | 11.1%
    Keyword-Keyword Co-occurrence (KKC) | Mean Recall | 0%

    Java (Fig. 20: Top-3, 5 and 10 Mean Average Precision@K and Mean Recall@K of KKC, KAC and (KKC+KAC) in Java [2]):
    Heuristics | Metric | Top-10
    Keyword-API Co-occurrence (KAC) | Mean Average Precision | 35.41%
    Keyword-API Co-occurrence (KAC) | Mean Recall | 44.8%
    Keyword-Keyword Co-occurrence (KKC) | Mean Average Precision | 24.11%
    Keyword-Keyword Co-occurrence (KKC) | Mean Recall | 19.52%
  18. Threats to Validity

    • Less common Python programming questions in RACK
    • The generalizability of RACK
  19. Take-Home Message

    (1) Summary
    • RACK is a good API recommendation tool
    • Get the relevant recommended API classes
    • Return the relevant code snippet against a query

    (2) Future opportunities
    • More accuracy in API recommendation
    • Make RACK more comprehensive and accurate
  20. THANK YOU! QUESTIONS?

    Contact: [email protected]
  21. Appendix: References

    [1] Mohammad Masudur Rahman, Chanchal K. Roy and David Lo. RACK: Code Search in the IDE using Crowdsourced Knowledge. IEEE, 2017.
    [2] Mohammad Masudur Rahman, Chanchal K. Roy and David Lo. RACK: Automatic API Recommendation using Crowdsourced Knowledge. IEEE, 2016.
    [3] Parse. MDN Web Docs Glossary: Definitions of Web-related terms. MDN. (n.d.). Retrieved February 5, 2023, from https://developer.mozilla.org/en-US/docs/Glossary/Parse