Slide 1


Software Architecture Pattern Detector for Java Applications Using a BERT Model
Module 8; Capstone Project - April 2025
Bruno Tinoco

Slide 2


Project Summary  This project intent to build a model using a pre-trained BERT model called CodeBERT to extract common features of a series of Github repositories with labels to be able to detect which architecture style is applied for any specific Java programming language project using its source code as the model input.  Most software development projects adopt some architecture patterns to build solutions for different types of problems for each industry. There patterns are sometimes combined with others in order to solve common issues and many of them share the same characteristics that can be mapped based on patterns found on source codes; Some of the most known architecture patterns or styles are the following:  Layered  Monolith  Clean  Microservices  DDD  Hexagonal  Event-driven These architectures can be combined into a solution, and it requires a manual inspection to source code to assert it they are being applied to a specific project. This could be an initial step to automate software architecture review or to generate diagrams to visualize its structure.

Slide 3


Project Solution  The first step to build this project was to find and label Github source code repositories applying each of the mapped architecture patterns.  Github provides a search mechanism to find repositories that could be used based on tags, this is the most time-consuming task, with the help of online LLMs we can make the search faster.  The dataset generated from labeled repositories follow the structure below to identify the repository, the architecture style used, and a list of key files and its contents to be used as the input for the embedding's generation.  [ { "repo": "spring-petclinic", "architecture": "layered", "files": [ { "file_path": "src/main/java/org/springframework/Ow nerController.java", "content": "package org.springframework...\npublic class OwnerController {...}" }, ... ] } ]  Some pre-processing was used to normalize the data; using the assumption most of the enterprise java projects adopt some keywords for class naming convention (model, controller, service and repository). we grouped code contents into these naming convention categories. DATASET SELECTION

Slide 4


Project Solution  Find a pre-trained model that could support the feature extraction from the source code; to build a word tokenizer and the embeddings.  Considering the input dataset is a set of java programming code, the model needs to be able to process texts understands its semantics, so this problem falls into the Natural Language Processing category.  BERT was the main model considered for the task but we found that there was a specific version of BERT fine- tunned for programming code, which is CodeBERT created by some researches from Microsoft that is a bimodal pre-trained model fro natural language and programing languages(NL-PL) including Java.  Hugging Face provided an open-source Python library that allows us to load the pre-trainned codebert model using Pytorch.  The model is composed of two main functions;  Code Tokenizer  Embedding generation MODEL DEFINITION

Slide 5


Project Solution  After we normalized the data we loaded a customized tensor dataset to combine the embeddings and labels for the model training using CodeBERT Tokenizer  We defined a custom classifier with 768 inputs, using dropout and ReLU layers down to a 6 possible classes of output for each one of the architecture styles/patterns  The model was trained using 20 epochs with batch size of 8 for each group (train and validation) TRAINNING INPUT LAYER: 768 OUTPUT LAYER: 6

Slide 6


Model Execution Results  The model reach 89% of accuracy as demonstrated by the classification report below The confusion matrix helped to validate how well each class performed

Slide 7


Project Solution  We found that were some miss-classifications that could be improved by increasing the number of sample classes.  We were not considering that these classes could be combined into the same project, so we should adjust the model to consider a multiclass output, for instance many projects adopt DDD pattern combined with other architecture type.  We also could improve the model accuracy and explore the model semantics by getting more relevant code details, because we are truncating, and we may be missing some features. IMPROVEMENTS

Slide 8


References:
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages - https://arxiv.org/abs/2002.08155
- CodeBERT GitHub repository - https://github.com/microsoft/CodeBERT
- Hugging Face CodeBERT-base model - https://huggingface.co/microsoft/codebert-base
- List of software architecture styles and patterns - https://en.wikipedia.org/wiki/List_of_software_architecture_styles_and_patterns
- GitHub source code repositories - https://github.com/

Slide 9


Bruno Tinoco
Business Development Solutions Architect at GFT US
https://www.linkedin.com/in/brunocrt/
[email protected]

Thank you