Slide 1

Smelling Source Code Using Deep Learning
Tushar Sharma
http://www.tusharma.in

Slide 2

No content

Slide 3

No content

Slide 4

What is a smell? "…certain structures in the code that suggest (sometimes they scream for) the possibility of refactoring." - Kent Beck
20 definitions of smells: http://www.tusharma.in/smells/smellDefs.html
Smells catalog: http://www.tusharma.in/smells/

Slide 5

Implementation smells

Slide 6

Design Smells

Slide 7

Architecture Smells

Slide 8

How do smells get detected?

Slide 9

No content

Slide 10

Metrics-based smell detection
Pipeline: code (or source artifact) → source model → metrics (compared against thresholds) → smells
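The metrics-based pipeline can be illustrated with a minimal sketch. The regex-based complexity proxy and the threshold value below are assumptions for illustration; real tools compute metrics over a parsed source model.

```python
import re

# Illustrative metrics-based detector: compute a metric over a method's
# source and compare it against a threshold. The regex proxy and the
# threshold are assumptions for this sketch, not a tool's implementation.

def cyclomatic_proxy(method_source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of branch points."""
    return 1 + len(re.findall(r"\b(?:if|for|while|case|catch)\b", method_source))

def detect_complex_method(method_source: str, threshold: int = 8) -> bool:
    """Flag the 'complex method' smell when the metric exceeds the threshold."""
    return cyclomatic_proxy(method_source) > threshold
```

A straight-line method yields a proxy of 1 and is not flagged; a method with many branches exceeds the threshold and is.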

Slide 11

Machine learning-based smell detection
Pipeline: code (or source artifact) → source model → machine learning algorithm f(x), trained on existing examples → trained model → smells

Slide 12

Machine learning-based smell detection
Existing academic work uses:
- Support vector machines
- Bayesian belief networks
- Logistic regression
- CNNs
These approaches take metrics m as the features/input (learning f(m)) and are validated on balanced samples.

Slide 13

Research questions
RQ1: Would it be possible to use deep learning methods to detect code smells?
RQ2: Is transfer-learning feasible in the context of detecting smells?
Transfer-learning refers to the technique where a learning algorithm exploits the commonalities between different learning tasks to enable knowledge transfer across the tasks.

Slide 14

Overview
Pipeline: C# and Java code fragments (produced by CodeSplit) → learning data generator (positive and negative samples) → tokenizer and preprocessing (tokenized samples) → deep learning models → detected smells, answering the research questions

Slide 15

Data Curation

Slide 16

Repositories download
Downloaded 1,072 C# and 2,528 Java repositories; selected 100 repositories per language based on quality attributes: architecture, community, CI, documentation, history, license, issues, unit tests, and stars.

Slide 17

Splitting code fragments
CodeSplit splits C# and Java sources into code fragments (methods or classes).
https://github.com/tushartushar/CodeSplitJava

Slide 18

Smell detection
Java: DesigniteJava (https://github.com/tushartushar/DesigniteJava)
C#: Designite (http://www.designite-tools.com/)

Slide 19

Generating training and evaluation samples
The sample generator combines code fragments with the detected code smells to produce positive and negative samples.
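The sample generator's core step can be sketched as follows, assuming fragments and detected smells are keyed by fragment id (the data shapes are illustrative assumptions):

```python
# Sketch of the sample generator: fragments whose ids appear in the
# detected-smell list become positive samples, the rest negative.
# The dict-of-fragments shape is an assumption for illustration.

def generate_samples(fragments: dict, smelly_ids: set):
    """Split code fragments into positive (smelly) and negative samples."""
    positives = {fid: src for fid, src in fragments.items() if fid in smelly_ids}
    negatives = {fid: src for fid, src in fragments.items() if fid not in smelly_ids}
    return positives, negatives
```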

Slide 20

Tokenizing learning samples
The tokenizer (https://github.com/dspinellis/tokenizer) converts code fragments into tokenized samples, i.e., streams of integer token ids.

Slide 21

Tokenizing learning samples: example

public void InternalCallback(object state) {
    Callback(State);
    try {
        timer.Change(Period, TimeSpan.Zero);
    } catch (ObjectDisposedException) { }
}

1-D tokenization: 123 2002 40 2003 41 59 474 123 2004 46 200
2-D tokenization (one row of token ids per source line on the slide): 123 2002 40 2003 41 59 474 123 2004 46 2005 40 2006 44 2007 46 2 125 329 40 2009 41 123 125 125
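The 1-D and 2-D forms can be illustrated with a toy tokenizer. The id assignment below is invented (dspinellis/tokenizer uses its own fixed token codes), so only the shapes match the slide:

```python
import re

# Toy tokenizer: assigns integer ids to lexical tokens, producing a 2-D form
# (one row per source line) and a flattened 1-D form. The id values are
# invented here; dspinellis/tokenizer uses its own fixed token codes.

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(fragment: str, vocab: dict):
    two_d = []
    for line in fragment.splitlines():
        row = [vocab.setdefault(tok, len(vocab) + 1) for tok in TOKEN_RE.findall(line)]
        if row:  # skip blank lines
            two_d.append(row)
    one_d = [tok for row in two_d for tok in row]
    return one_d, two_d
```

The 1-D form is simply the 2-D form flattened, which is why both representations can feed the CNN-1D and CNN-2D models respectively.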

Slide 22

Data preparation
A 70-30 split produces training and evaluation samples:
• 5,146 samples: 3,602 training, 1,544 evaluation
• 311,533 samples: 218,073 training, 93,460 evaluation
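The counts above are consistent with a 70-30 split where the training share is truncated to an integer (an assumption made here to match the numbers):

```python
# Sketch of the 70-30 split behind the slide's counts; integer truncation
# of the training share is an assumption that reproduces both rows.

def split_70_30(total: int):
    train = int(total * 0.7)
    return train, total - train
```

This reproduces both rows: 5,146 splits into 3,602 and 1,544, and 311,533 into 218,073 and 93,460.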

Slide 23

Selection of smells
• Complex method: the method has high cyclomatic complexity
• Magic number: an unexplained numeric literal is used in an expression
• Empty catch block: a catch block of an exception is empty
• Multifaceted abstraction: a class has more than one responsibility assigned to it
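Two of these smells lend themselves to simple rule-of-thumb sketches. The regexes below are illustrative assumptions, not Designite's actual detection logic (which works on a parsed source model):

```python
import re

# Rule-of-thumb sketches for two of the selected smells; the regexes are
# assumptions for illustration only.

def has_empty_catch_block(source: str) -> bool:
    """Empty catch block: a catch clause whose body is empty (or whitespace)."""
    return re.search(r"catch\s*\([^)]*\)\s*\{\s*\}", source) is not None

def magic_numbers(expression: str) -> list:
    """Magic number: numeric literals other than the commonly exempted 0 and 1."""
    return [n for n in re.findall(r"\b\d+\b", expression) if n not in ("0", "1")]
```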

Slide 24

Architecture - CNN
Hyper-parameters:
• Filters = {8, 16, 32, 64}
• Kernel size = {5, 7, 11}
• Pooling window = {2, 3, 4, 5}
• Batch size (dynamic) = {32, 64, 128, 256}
Callbacks: early stopping (patience = 5), model checkpoint
Layers: inputs → [convolution → batch normalization → max pooling → dropout (0.1)], with this set of hidden units repeated → flatten → dense layer 1 (32 units, ReLU) → dense layer 2 (1 unit, sigmoid) → output

Slide 25

Architecture - RNN
Hyper-parameters:
• Dimensionality of embedding layer = {16, 32}
• LSTM units = {32, 64, 128}
• Batch size (dynamic) = {32, 64, 128, 256}
Callbacks: early stopping (patience = 2), model checkpoint
Layers: inputs → [embedding → LSTM → dropout (0.2)], with this set of hidden units repeated → dense layer (1 unit, sigmoid) → output

Slide 26

Running experiments
• Phase 1: grid search for optimal hyper-parameters
  • Validation set: 20%
  • Number of configurations: CNN = 144, RNN = 18
• Phase 2: experiment with the optimal hyper-parameters
Hardware: GRNET supercomputing facility; each experiment used 1 GPU with 64 GB memory
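The configuration counts follow from the hyper-parameter grids on the architecture slides, assuming each model is also tried with 1, 2, or 3 repetitions of its hidden-unit set and that the dynamic batch size is excluded from the grid (both assumptions made here to match the counts):

```python
from itertools import product

# Reproduces the slide's configuration counts by crossing the grids from
# the architecture slides with 1-3 repetitions of the hidden-unit set.

cnn_grid = list(product(
    [8, 16, 32, 64],   # filters
    [5, 7, 11],        # kernel size
    [2, 3, 4, 5],      # pooling window
    [1, 2, 3],         # repetitions of the hidden-unit set
))
rnn_grid = list(product(
    [16, 32],          # embedding dimensionality
    [32, 64, 128],     # LSTM units
    [1, 2, 3],         # repetitions of the hidden-unit set
))
```

That gives 4 × 3 × 4 × 3 = 144 CNN configurations and 2 × 3 × 3 = 18 RNN configurations, matching the slide.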

Slide 27

Results

Slide 28

RQ1. Would it be possible to use deep learning methods to detect code smells?
[Bar charts of AUC-ROC (0.40-0.90) and F1 per smell for CNN-1D, CNN-2D, and RNN]
F1 scores:
Model | CM | ECB | MN | MA
CNN-1D | 0.38 | 0.04 | 0.29 | 0.09
CNN-2D | 0.41 | 0.02 | 0.35 | 0.06
RNN | 0.31 | 0.22 | 0.57 | 0.02
(CM: complex method, ECB: empty catch block, MN: magic number, MA: multifaceted abstraction)

Slide 29

CNN-1D vs CNN-2D (max F1 per smell):
• 0.40 vs 0.39
• 0.05 vs 0.04
• 0.18 vs 0.16
• 0.36 vs 0.35
CNN-1D performs marginally better than CNN-2D in each case.

Slide 30

CNN vs RNN
Difference in percentage, comparing max F1:
Smell | RNN vs CNN-1D | RNN vs CNN-2D
CM | -22.94 | -33.81
ECB | 80.23 | 91.94
MN | 48.96 | 38.58
MA | -349.12 | -205.26
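The table's values are consistent with a percentage difference in max F1 relative to the RNN's score; that formula is an inference from the numbers (the slide's exact values would come from unrounded F1 scores):

```python
# Sketch of the comparison metric inferred from the table: percentage
# difference in max F1 relative to the RNN's score.

def pct_diff(rnn_f1: float, cnn_f1: float) -> float:
    return (rnn_f1 - cnn_f1) / rnn_f1 * 100
```

With the rounded F1s from the RQ1 slide (e.g. MN: RNN 0.57 vs CNN-1D 0.29) this yields about 49, close to the table's 48.96, and reproduces the signs of the other entries.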

Slide 31

Are more deep layers always good?
Max F1 by number of repeated hidden-unit sets:
Model | Layers | CM | ECB | MN | MA
CNN-1D | 1 | 0.36 | 0.05 | 0.36 | 0.08
CNN-1D | 2 | 0.40 | 0.05 | 0.36 | 0.18
CNN-1D | 3 | 0.40 | 0.05 | 0.36 | 0.19
CNN-2D | 1 | 0.39 | 0.04 | 0.35 | 0.07
CNN-2D | 2 | 0.39 | 0.04 | 0.34 | 0.16
CNN-2D | 3 | 0.39 | 0.05 | 0.34 | 0.10
RNN | 1 | 0.34 | 0.21 | 0.48 | 0.28
RNN | 2 | 0.36 | 0.24 | 0.48 | 0.22
RNN | 3 | 0.37 | 0.23 | 0.48 | 0.20

Slide 32

RQ2: Is transfer-learning feasible in the context of detecting smells?
F1 scores, transfer-learning vs direct-learning:
Model | CM | ECB | MN | MA
CNN-1D (transfer) | 0.54 | 0.14 | 0.49 | 0.03
CNN-2D (transfer) | 0.57 | 0.07 | 0.49 | 0.06
CNN-1D (direct) | 0.38 | 0.04 | 0.29 | 0.09
CNN-2D (direct) | 0.41 | 0.02 | 0.35 | 0.06

Slide 33

Conclusions
- It is feasible to train deep learning models to detect smells.
- Transfer-learning is feasible.
- Improvements offer many possibilities: better performance, and adding more (different kinds of) smells.

Slide 34

Relevant links
Source code and data: https://github.com/tushartushar/DeepLearningSmells
Smell detection tools:
- Java: https://github.com/tushartushar/DesigniteJava
- C#: http://www.designite-tools.com
CodeSplit:
- Java: https://github.com/tushartushar/CodeSplitJava
- C#: https://github.com/tushartushar/DeepLearningSmells/tree/master/CodeSplit
Tokenizer: https://github.com/dspinellis/tokenizer

Slide 35

Thank you!! Courtesy: spikedmath.com