CISTI 2026 - Test-Driven Development Versus Test Smells: An Empirical Study With Students From UFABC University

Test-Driven Development versus Test Smells
Matheus Marabesi¹, Alicia García-Holgado¹, Francisco José García-Peñalvo¹, Juliana Cristina Braga², Ismar Frango Silveira³
¹Universidad de Salamanca, Spain · ²UFABC, Brazil · ³Mackenzie, Brazil
CISTI 2026

Outline
1. Introduction
2. Motivation and research question
3. Study design
4. Results
5. Conclusions, limitations and future work

1. Introduction

Test-Driven Development - TDD
Common terms used for this presentation:
• Production code: the code that will be executed by the client
• Test code: the code written to verify the production code
• Chicago school/Classicist TDD was used
• SonarQube: software quality assessment
• Test smells: design issues in the test code
The TDD cycle:
1. Red
2. Green
3. Refactor
TDD shifts the software development process: the test is written before the production code.
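The cycle can be illustrated with a minimal sketch, assuming the FizzBuzz kata and Python's standard `unittest` module. The names `fizzbuzz`, `FizzBuzzTest` and `run_suite` are hypothetical and are not taken from the study materials; the sketch shows the end state after several Red-Green-Refactor rounds, not the students' code.

```python
import unittest


def fizzbuzz(n: int) -> str:
    """Production code: grown one branch at a time, each driven by a failing test."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


class FizzBuzzTest(unittest.TestCase):
    """Test code: in TDD each test is written first and seen failing (Red),
    the smallest change to fizzbuzz() makes it pass (Green), and the code
    is then tidied while the tests stay passing (Refactor)."""

    def test_plain_number_is_returned_as_text(self):
        self.assertEqual(fizzbuzz(1), "1")

    def test_multiple_of_three_is_fizz(self):
        self.assertEqual(fizzbuzz(3), "Fizz")

    def test_multiple_of_five_is_buzz(self):
        self.assertEqual(fizzbuzz(5), "Buzz")

    def test_multiple_of_three_and_five_is_fizzbuzz(self):
        self.assertEqual(fizzbuzz(15), "FizzBuzz")


def run_suite() -> bool:
    """Run the tests programmatically and report whether all of them passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FizzBuzzTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_suite()` executes the four tests; in the classicist style used here the tests exercise the real `fizzbuzz` function directly, without test doubles.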

2. Motivation

• TDD (Test-Driven Development) has been widely promoted for improving code quality and reducing defects
• Yet empirical results on TDD’s effects remain mixed and inconclusive
• Most research focuses on production code quality
• The quality of test code itself is often overlooked
  ◦ Poorly written tests → test smells → maintenance challenges in the test code, one more layer on top of production code maintenance
Research gap: limited research examines whether TDD practice reduces the prevalence of test smells.

2. Motivation - research questions
Study type
• Pilot study
• Controlled intervention
• Undergraduate context (UFABC)
What we measured
1. Production code quality (SonarQube)
2. Test smells (detection tools)
3. Student perceptions (interviews)
Research question: Does TDD reduce the prevalence of test smells?

3. Study design

3. Study design – Overview
Figure: timeframe of the controlled intervention across the course span.
Course context
• Software Engineering course
• 3-month duration, classes twice/week
• Scrum framework

3. Study design - Context and Sample
Course context
• 38 students across 10 groups
Final sample
• 20 out of 31 submitted exercises
• 8 invalid/incomplete: lacking the exercise, or submitted code that was not part of the task requested
• 12 valid student submissions
Student profiles
• Programming skill: 7 intermediate, 3 advanced, 2 basic
• Industry experience: 6 with, 6 without
• TDD knowledge: majority had no prior experience

3. Study design - Test Smell Tools
• SonarQube used for broader code quality (reliability, maintainability, duplication)
• Code coverage measured per language
Test smell detection per language:
Language    Tool      Reference
C#          xNose     Paul et al. (2024)
Java        tsDetect  Peruma et al. (2020)
JavaScript  SNUTS.js  Oliveira et al. (2024)
Python      PyNose    Wang et al. (2021)
TypeScript  Manual    Garousi & Küçük (2018)

4. Results

4. Results – Code Quality Metrics
Figure: left, completeness rate; middle, maintainability issues; right, coverage comparison.
• Reliability: A score in all cases (before and after)
• Maintainability: more issues after TDD (likely due to higher completeness)
• Coverage: improved after TDD across languages

4. Results – Test Smells
Figure: left, test smells in FizzBuzz + String Calculator; right, number of test cases in FizzBuzz + String Calculator.

4. Results - Interview Findings – Adoption Decisions
Takeaway: Adoption was driven by project context assessment, not prior knowledge alone.
Group     TDD Adopted                      Reasoning
Group 1a  Yes                              Automated checks without manual testing
Group 1b  No (had experience)              Split decision within team
Group 2   No                               Wanted to see app working first
Group 3   No                               “It was just a prototype”
Group 4   No (used intermittently before)  One member had experience

4. Results - Interview Findings – Benefits and Challenges
Perceived benefits
• Confidence in the code
• Helped with refactoring
• Better error handling awareness
• Improved problem-solving thinking
• Found hidden bugs
Reported challenges
• Test doubles / mocking complexity
• “No way to run a test without knowing the output”
• Perceived overhead for simple projects
• Inconsistent application of TDD principles
Student quote: “It kept the tests running and helped with refactoring.” – Group 1a

4. Results - Interview Findings – Adoption
Students did not reject TDD; they assessed its applicability based on context:
• Group 1b: Would use TDD for authentication and complex logic, not for simple projects
• Group 2: TDD in backend helped identify errors
• Group 3: “Not worth it for university projects”, but makes sense professionally
• Group 4: Would use TDD with documentation support
Student quote: “Using TDD, it’s easier to find errors that a user might encounter”
Takeaway: Exposure to TDD is necessary but not sufficient for consistent adoption.

5. Conclusions, limitations and future work

5. Conclusions
• Code reliability: consistently high (A score) regardless of TDD
• Code coverage: improved after TDD intervention
• Completeness: higher after TDD (but confounded by prior exposure)
• Maintainability: more issues after TDD (likely due to the increased completeness rate and more code written)
• Adoption decisions were context-driven, not knowledge-driven
• Positive student perceptions did not imply improvement in test code quality
• Core issue: not just writing tests, but designing tests (mocking, test doubles)
• Need for explicit training on test quality alongside TDD instruction
• TDD alone was not sufficient to prevent test smells
• The Magic Number smell persisted across all languages
• Implication for educators: emphasize comprehensive test design alongside TDD instruction, not just the Red-Green-Refactor cycle
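To make the Magic Number smell concrete, here is a small illustrative sketch in Python, assuming a String Calculator kata with a hypothetical `add` function. It follows the common description of the smell (unexplained numeric literals inside assertions) and is not code from the study; whether a given detection tool flags either variant depends on that tool's rules.

```python
import unittest


def add(numbers: str) -> int:
    """Hypothetical String Calculator: sums comma-separated integers."""
    if not numbers:
        return 0
    return sum(int(part) for part in numbers.split(","))


class SmellyStringCalculatorTest(unittest.TestCase):
    def test_add(self):
        # Magic Number smell: the reader has to guess what 3 stands for.
        self.assertEqual(add("1,2"), 3)


class NamedStringCalculatorTest(unittest.TestCase):
    # Named constants state the intent of each value used in the assertions.
    SUM_OF_ONE_AND_TWO = 3
    SUM_OF_EMPTY_INPUT = 0

    def test_sums_two_comma_separated_numbers(self):
        self.assertEqual(add("1,2"), self.SUM_OF_ONE_AND_TWO)

    def test_empty_input_sums_to_zero(self):
        self.assertEqual(add(""), self.SUM_OF_EMPTY_INPUT)
```

Both test classes pass against the same production code, which is the point of the slide: passing tests and good coverage say nothing about whether the test code itself is easy to read and maintain.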

5. Conclusions - Limitations and Future Work
Limitations
• Small sample size (12 valid submissions)
• Single language per student (1 student for C#, Java, TypeScript)
• Prior exposure to kata may confound results
• Different test smell tools per language detect different sets of smells
Future directions
• Replicate with professional developers
• Study test smells in scenarios with databases, external services
• Develop multi-language test smell detection tools
• Investigate teaching strategies that combine TDD with test quality

¡Thank you!
Matheus Marabesi
Universidad de Salamanca · UFABC · Mackenzie

References
• K. Beck, Test Driven Development: By Example, Addison-Wesley, 2002.
• F. Anwer et al., “Agile software development models TDD, FDD, DSDM, and Crystal methods: A survey,” Int. J. Multidiscip. Sci. Eng., vol. 8, no. 2, pp. 1–10, 2017.
• M. Ghafari et al., “Why research on test-driven development is inconclusive?” in Proc. ESEM, pp. 1–10, 2020.
• V. Garousi and B. Küçük, “Smells in software test code: A survey of knowledge in industry and academia,” J. Syst. Softw., vol. 138, pp. 52–81, 2018.
• M. Aniche and M.A. Gerosa, “Does test-driven development improve class design? A qualitative study on developers’ perceptions,” J. Braz. Comput. Soc., vol. 21, no. 1, p. 15, 2015.
• A. Nanthaamornphong and S. Bressan, “The empirical study: Encouraging students’ interest in software development using TDD,” Tehnički glasnik, vol. 13, no. 4, pp. 267–274, 2019.
• A. Peruma et al., “tsDetect: An open source test smells detection tool,” in Proc. ESEC/FSE, pp. 1650–1654, 2020.
• Wang et al., “PyNose: A test smell detector for Python,” 2021.
• J. Oliveira et al., “SNUTS.js: Sniffing nasty unit test smells in JavaScript,” in Proc. SBES, pp. 720–726, 2024.
• P.P. Paul et al., “xNose: A test smell detector for C#,” in Proc. ICSE-Companion, pp. 370–371, 2024.
