Smelling Source Code Using Deep Learning

Poor quality code contributes to increasing technical debt and makes the software difficult to extend and maintain. Code smells capture such poor code quality practices. Traditionally, the software engineering community identifies code smells deterministically, using metrics and pre-defined rules/heuristics. Creating a deterministic tool for a specific language is an expensive and arduous task since it requires source code analysis: parsing, symbol resolution, intermediate model preparation, and finally applying rules/heuristics/metrics to the model. It would be valuable if we could leverage the tools available for one programming language and cross-apply them to another language.

In this presentation, I present our work on detecting smells using deep learning models. It covers the tooling aspects, summarizing the preparation that goes on behind the scenes before source code is fed into a deep learning model. The work focuses on two specific aspects: 1. showing that we can detect code smells with minimal pre-processing, without converting source code into a feature set; we compare smell detection performance among different deep learning models (CNN and RNN) in different configurations (i.e., model architectures). 2. exploring the feasibility of applying deep learning models across programming languages, i.e., learning smell detection from samples in one programming language and using the model to detect smells in samples of another language. The presentation brings out insights from this exploration.

Tushar Sharma

February 03, 2019

Transcript

1. What is a smell? "…certain structures in the code that suggest (sometimes they scream for) the possibility of refactoring." - Kent Beck
20 definitions of smells: http://www.tusharma.in/smells/smellDefs.html
Smells' catalog: http://www.tusharma.in/smells/
2. Machine learning-based smell detection
[Diagram: existing examples of code (or another source artifact) and their smells are turned into a source model; a machine learning algorithm learns from these examples to produce a trained model f(x), which then maps new code to detected smells.]
3. Machine learning-based smell detection. Existing academic work uses support vector machines, Bayesian belief networks, logistic regression, and CNNs. These approaches take metrics m as the features/input (i.e., they learn f(m)) and validate on balanced samples.
4. Research questions
RQ1: Would it be possible to use deep learning methods to detect code smells?
RQ2: Is transfer-learning feasible in the context of detecting smells? Transfer-learning refers to the technique where a learning algorithm exploits the commonalities between different learning tasks to enable knowledge transfer across the tasks.
5. Overview
[Pipeline diagram: C# and Java code fragments are produced by CodeSplit, turned into positive and negative samples by the learning data generator, tokenized and preprocessed into integer sequences, and fed to the deep learning models, which output detected smells to answer the research questions.]
6. Repositories download: C# - 1,072 repositories; Java - 100 repositories selected out of 2,528. Selection criteria: architecture, community, CI, documentation, history, license, issues, unit tests, and stars.
7. Splitting code fragments: CodeSplit splits the C# and Java source files into individual code fragments (methods or classes). Java implementation: https://github.com/tushartushar/CodeSplitJava
8. Smell detection: DesigniteJava (https://github.com/tushartushar/DesigniteJava) detects code smells in the Java repositories, and Designite (http://www.designite-tools.com/) in the C# repositories.
9. Generating training and evaluation samples: the sample generator combines the code fragments with the detected code smells to produce positive and negative samples.
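The sample generator is essentially a labeling step: every fragment that the smell detector flags becomes a positive sample for that smell, and the remaining fragments become negative samples. A minimal Python sketch of that idea; the directory layout and function name here are illustrative, not the actual DeepLearningSmells scripts:

import os

def generate_samples(fragments_dir, smelly_fragment_names):
    # fragments_dir: directory of code fragments produced by CodeSplit
    # smelly_fragment_names: set of fragment names reported as smelly by Designite/DesigniteJava
    positive, negative = [], []
    for file_name in sorted(os.listdir(fragments_dir)):
        with open(os.path.join(fragments_dir, file_name), errors="ignore") as f:
            code = f.read()
        if os.path.splitext(file_name)[0] in smelly_fragment_names:
            positive.append(code)   # fragment contains the smell
        else:
            negative.append(code)   # fragment is considered clean
    return positive, negative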
10. Tokenizing learning samples: the tokenizer (https://github.com/dspinellis/tokenizer) converts each code fragment into a tokenized sample, i.e., a sequence of integer token codes (e.g., 23 51 32 200 11 45).
11. Tokenizing learning samples (example)
Input code fragment:
public void InternalCallback(object state) { Callback(State); try { timer.Change(Period, TimeSpan.Zero); } catch (ObjectDisposedException) { } }
2-D tokenized form: 123 2002 40 2003 41 59 474 123 2004 46 2005 40 2006 44 2007 46 2 125 329 40 2009 41 123 125 125
1-D tokenized form: 123 2002 40 2003 41 59 474 123 2004 46 200
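Before the integer sequences above can be batched, they must be padded to a fixed shape: a fixed number of tokens for the 1-D form, and a fixed number of lines times tokens-per-line for the 2-D form. A sketch of that preprocessing, assuming Keras-style padding; the dimensions (500 tokens, 50 lines of 30 tokens) are illustrative choices, not the values used in the study:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def prepare_1d(token_sequences, max_len=500):
    # one integer sequence per fragment, padded/truncated to max_len tokens
    return pad_sequences(token_sequences, maxlen=max_len, padding="post", truncating="post")

def prepare_2d(fragments, max_lines=50, max_line_len=30):
    # fragments: list of fragments, each a list of per-line integer sequences
    batch = np.zeros((len(fragments), max_lines, max_line_len), dtype="int32")
    for i, lines in enumerate(fragments):
        padded = pad_sequences(lines[:max_lines], maxlen=max_line_len,
                               padding="post", truncating="post")
        batch[i, :padded.shape[0], :] = padded
    return batch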
12. Data preparation: 70-30 split into training and evaluation samples. 5,146 samples split into 3,602 training and 1,544 evaluation samples; 311,533 samples split into 218,073 training and 93,460 evaluation samples.
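The 70-30 split itself is a standard training/evaluation split; a sketch using scikit-learn with placeholder data of the size reported above (5,146 samples):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randint(0, 2100, size=(5146, 500))  # placeholder tokenized samples
y = np.random.randint(0, 2, size=5146)            # labels: 1 = smelly, 0 = clean

X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape[0], X_eval.shape[0])          # 3602 training, 1544 evaluation samples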
13. Selection of smells
• Complex method: the method has high cyclomatic complexity
• Magic number: an unexplained numeric literal is used in an expression
• Empty catch block: a catch block of an exception is empty
• Multifaceted abstraction: a class has more than one responsibility assigned to it
14. Architecture - CNN
• Filters = {8, 16, 32, 64}
• Kernel size = {5, 7, 11}
• Pooling window = {2, 3, 4, 5}
• Dynamic batch size = {32, 64, 128, 256}
• Callbacks: early stopping (patience = 5) and model checkpoint
Layer structure: inputs feed a repeated set of hidden units (convolution layer, batch normalization layer, max pooling layer, dropout layer with rate 0.1), followed by a flatten layer, dense layer 1 (32 units, ReLU), and dense layer 2 (1 unit, sigmoid) producing the output; see the sketch below.
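A Keras sketch of the 1-D variant of this architecture. The filter count, kernel size, pooling window, and number of repetitions below are one arbitrary point from the grid above, and the optimizer and loss are assumptions not stated on the slide:

from tensorflow.keras import layers, models, callbacks

def build_cnn_1d(max_len=500, filters=16, kernel_size=5, pool_window=2, repetitions=1):
    model = models.Sequential()
    model.add(layers.Input(shape=(max_len, 1)))
    for _ in range(repetitions):                      # "repeat this set of hidden units"
        model.add(layers.Conv1D(filters, kernel_size, activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling1D(pool_window))
        model.add(layers.Dropout(0.1))
    model.add(layers.Flatten())
    model.add(layers.Dense(32, activation="relu"))    # dense layer 1
    model.add(layers.Dense(1, activation="sigmoid"))  # dense layer 2: smelly or not
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# callbacks from the slide
stop_early = callbacks.EarlyStopping(patience=5)
checkpoint = callbacks.ModelCheckpoint("best_cnn1d.keras", save_best_only=True)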
15. Architecture - RNN
• Dimensionality of embedding layer = {16, 32}
• LSTM units = {32, 64, 128}
• Dynamic batch size = {32, 64, 128, 256}
• Callbacks: early stopping (patience = 2) and model checkpoint
Layer structure: inputs feed an embedding layer, a repeated set of hidden units (LSTM layer, dropout layer with rate 0.2), and a dense layer (1 unit, sigmoid) producing the output; see the sketch below.
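A corresponding Keras sketch of the RNN architecture. The vocabulary size, embedding dimension, LSTM units, and repetition count are again illustrative picks, and the optimizer and loss are assumptions:

from tensorflow.keras import layers, models, callbacks

def build_rnn(vocab_size=2100, embedding_dim=16, lstm_units=32, repetitions=1):
    model = models.Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim))
    for i in range(repetitions):                      # "repeat this set of hidden units"
        last = (i == repetitions - 1)
        model.add(layers.LSTM(lstm_units, return_sequences=not last))
        model.add(layers.Dropout(0.2))
    model.add(layers.Dense(1, activation="sigmoid"))  # smelly or not
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

stop_early = callbacks.EarlyStopping(patience=2)
checkpoint = callbacks.ModelCheckpoint("best_rnn.keras", save_best_only=True)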
16. Running experiments
• Phase 1: grid search for optimal hyper-parameters, using a 20% validation set. Number of configurations: CNN = 144, RNN = 18 (see the sketch below).
• Phase 2: experiments with the optimal hyper-parameters.
Experiments ran on the GRNET supercomputing facility, each using 1 GPU with 64 GB memory.
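The reported configuration counts line up with a plain Cartesian-product grid if the searched dimensions are filters x kernel size x pooling window x depth for the CNN (4 x 3 x 4 x 3 = 144) and embedding dimensionality x LSTM units x depth for the RNN (2 x 3 x 3 = 18). A sketch of phase 1 for the 1-D CNN, reusing the hypothetical build_cnn_1d and the X_train/y_train arrays from the earlier sketches; the selection metric and epoch count are assumptions:

from itertools import product
from tensorflow.keras import callbacks

cnn_grid = product([8, 16, 32, 64],   # filters
                   [5, 7, 11],        # kernel size
                   [2, 3, 4, 5],      # pooling window
                   [1, 2, 3])         # repetitions of the hidden-unit block

X_in = X_train[..., None].astype("float32")  # Conv1D expects a trailing channel axis

best_score, best_config = -1.0, None
for filters, kernel_size, pool_window, reps in cnn_grid:
    model = build_cnn_1d(filters=filters, kernel_size=kernel_size,
                         pool_window=pool_window, repetitions=reps)
    history = model.fit(X_in, y_train,
                        validation_split=0.2,   # 20% validation set (phase 1)
                        epochs=50, batch_size=128,
                        callbacks=[callbacks.EarlyStopping(patience=5)],
                        verbose=0)
    score = max(history.history["val_accuracy"])
    if score > best_score:
        best_score, best_config = score, (filters, kernel_size, pool_window, reps)

print(best_config)   # hyper-parameters carried over to phase 2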
17. RQ1: Would it be possible to use deep learning methods to detect code smells?
[Bar chart: AUC-ROC per smell (CM, ECB, MN, MA) for CNN-1D, CNN-2D, and RNN; y-axis from 0.40 to 0.90.]
F1 per smell:
CNN-1D: CM 0.38, ECB 0.04, MN 0.29, MA 0.09
CNN-2D: CM 0.41, ECB 0.02, MN 0.35, MA 0.06
RNN: CM 0.31, ECB 0.22, MN 0.57, MA 0.02
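Both reported metrics can be computed from the trained model's predictions on the held-out evaluation samples. A sketch assuming scikit-learn, the model and arrays from the earlier sketches, and a 0.5 classification threshold for F1 (the threshold is an assumption):

from sklearn.metrics import roc_auc_score, f1_score

probs = model.predict(X_eval[..., None].astype("float32")).ravel()  # sigmoid outputs in [0, 1]
auc_roc = roc_auc_score(y_eval, probs)                 # threshold-independent ranking quality
f1 = f1_score(y_eval, (probs >= 0.5).astype(int))      # harmonic mean of precision and recall
print(f"AUC-ROC = {auc_roc:.2f}, F1 = {f1:.2f}")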
18. CNN-1D vs CNN-2D (comparing max F1 per smell): 0.40 vs 0.39, 0.05 vs 0.04, 0.18 vs 0.16, and 0.36 vs 0.35. CNN-1D performs marginally better than CNN-2D in each case.
19. CNN vs RNN (difference in percentage, comparing max F1)
        RNN and CNN-1D    RNN and CNN-2D
CM        -22.94            -33.81
ECB        80.23             91.94
MN         48.96             38.58
MA       -349.12           -205.26
20. Are more deep layers always good?
           Layers    CM      ECB     MN      MA
CNN-1D     1         0.36    0.05    0.36    0.08
           2         0.40    0.05    0.36    0.18
           3         0.40    0.05    0.36    0.19
CNN-2D     1         0.39    0.04    0.35    0.07
           2         0.39    0.04    0.34    0.16
           3         0.39    0.05    0.34    0.10
RNN        1         0.34    0.21    0.48    0.28
           2         0.36    0.24    0.48    0.22
           3         0.37    0.23    0.48    0.20
21. RQ2: Is transfer-learning feasible in the context of detecting smells?
Transfer-learning F1:
CNN-1D: CM 0.54, ECB 0.14, MN 0.49, MA 0.03
CNN-2D: CM 0.57, ECB 0.07, MN 0.49, MA 0.06
Transfer-learning vs direct-learning F1:
CNN-1D: CM 0.54 vs 0.38, ECB 0.14 vs 0.04, MN 0.49 vs 0.29, MA 0.03 vs 0.09
CNN-2D: CM 0.57 vs 0.41, ECB 0.07 vs 0.02, MN 0.49 vs 0.35, MA 0.01 vs 0.06
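In this setting, transfer-learning simply means training on tokenized samples from one language and evaluating on samples from the other, which works because the tokenizer maps both C# and Java into the same integer token space. A sketch that contrasts it with direct-learning, reusing the hypothetical helpers from the earlier sketches; the array names and the C#-to-Java direction shown here are illustrative:

from sklearn.metrics import f1_score

def evaluate_f1(model, X, y):
    probs = model.predict(X[..., None].astype("float32")).ravel()
    return f1_score(y, (probs >= 0.5).astype(int))

# Direct-learning: train and evaluate on the same language (Java -> Java)
direct = build_cnn_1d()
direct.fit(X_train_java[..., None].astype("float32"), y_train_java,
           validation_split=0.2, epochs=50, callbacks=[stop_early], verbose=0)
f1_direct = evaluate_f1(direct, X_eval_java, y_eval_java)

# Transfer-learning: train on C# samples, evaluate on Java samples
transfer = build_cnn_1d()
transfer.fit(X_train_csharp[..., None].astype("float32"), y_train_csharp,
             validation_split=0.2, epochs=50, callbacks=[stop_early], verbose=0)
f1_transfer = evaluate_f1(transfer, X_eval_java, y_eval_java)

print(f"direct F1 = {f1_direct:.2f}, transfer F1 = {f1_transfer:.2f}")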
22. Conclusions: It is feasible to make a deep learning model learn to detect smells. Transfer-learning is feasible. Many possibilities for improvement remain: better performance and support for more smells of different kinds.
23. Relevant links
Source code and data: https://github.com/tushartushar/DeepLearningSmells
Smell detection tools: Java - https://github.com/tushartushar/DesigniteJava; C# - http://www.designite-tools.com
CodeSplit: Java - https://github.com/tushartushar/CodeSplitJava; C# - https://github.com/tushartushar/DeepLearningSmells/tree/master/CodeSplit
Tokenizer: https://github.com/dspinellis/tokenizer