Quantitatively Evaluating Test-Driven Development

Slide 1

Slide 1 text

Introduction Test-Driven Development Prior Work Metrics Survey & Experiment Results Conclusions Quantitatively Evaluating Test-Driven Development Rod Hilton April 9, 2013

Slide 2

Slide 2 text

Contents
1. Introduction
2. Test-Driven Development
3. Prior Work
4. Metrics
5. Survey & Experiment
6. Results
7. Conclusions

Slide 3

Slide 3 text

What is Test-Driven Development?
- Proposed by Kent Beck as part of eXtreme Programming
- Originally called Test-First Programming
- Gaining popularity (job post mentions quadrupled in 4 years)

Slide 4

Slide 4 text

The Test-Driven Development Practice
Test-Driven Development requires developers to write the test before the code.
1. Write a test
2. Run the test: FAIL
3. Write code
4. Run the test: PASS
5. Refactor code
6. Repeat
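The cycle above can be sketched in a few lines of Java. This is only an illustration: `PriceCalculator`, `TddCycleDemo`, and `totalSumsAllPrices` are hypothetical names that do not come from the slides, and a plain boolean check stands in for a real test framework so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical production class. In the TDD cycle it is written only after
// the test below exists and has been seen to fail.
class PriceCalculator {
    private final List<Double> prices = new ArrayList<>();

    void add(double price) {
        prices.add(price);
    }

    // Written to make the failing test pass, then tidied in the refactor step.
    double total() {
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return sum;
    }
}

class TddCycleDemo {
    // Written first. Before PriceCalculator.total() existed, this check could
    // not pass (the FAIL step); once the code is written it passes.
    static boolean totalSumsAllPrices() {
        PriceCalculator calc = new PriceCalculator();
        calc.add(2.50);
        calc.add(4.00);
        return Math.abs(calc.total() - 6.50) < 1e-9;
    }

    public static void main(String[] args) {
        System.out.println(totalSumsAllPrices() ? "PASS" : "FAIL");
    }
}
```

In a real project the check would live in a test framework such as JUnit; the ordering (test, fail, code, pass, refactor) is the point of the sketch.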

Slide 5

Slide 5 text

The TDD Cycle
Often referred to as ‘Red-Green-Refactor’

Slide 6

Slide 6 text

Benefits of TDD
Advocates of Test-Driven Development claim many benefits:
- Learning (tests help engineers understand code)
- Reliability (automatic regression test suite)
- Speed (less time debugging)
- Scope Limiting (avoids “scope creep”: no code may be written unless a test drives it)
- Confidence (ideally, 100% test coverage on production code)

Slide 7

Slide 7 text

Effect on Code
The most important benefit of TDD is that it encourages “emergent design”.
- Every line of code is driven by a test.
- Perspective shift: every test is a client for the interface.
- Avoids “big design up front”: accurately design each small change, then code it.
- All code must be easy to test, which encourages modularity.
- The refactoring step encourages high internal quality.
- External quality (software) != internal quality (code).

Slide 8

Slide 8 text

Bottom Line
Test-Driven Development helps produce good code.
Test-Driven Design helps produce good code.

Slide 9

Slide 9 text

Prior Work
Many researchers have attempted to evaluate whether Test-Driven Development delivers on its promises.

Slide 10

Slide 10 text

Academic Studies
Many studies are done in an academic setting, as a controlled experiment with students.
- The University of Karlsruhe Study
- The Bethel College Study
- ...many more

Slide 11

Slide 11 text

Academic Study Issues
Academic studies suffer from two main problems:
- Students do not have to maintain the code, so this is not real-world TDD usage.
- Students typically learn TDD for the study, and newcomers struggle.

Slide 12

Slide 12 text

The IBM Case Study
- Joint effort by IBM and researchers from North Carolina State University.
- 10 years developing device drivers: 7 releases
  - 1st release: No TDD (with manual testing)
  - 7th release: TDD
- Quality determined by functional test system

Slide 13

Slide 13 text

The IBM Case Study Results
Result: TDD increased quality
- TDD group test cases found 1.8x the defects (better test cases)
- TDD group defects per line of code reduced by 40% (fewer defects)
- Productivity the same for both teams

Slide 14

Slide 14 text

The IBM Case Study Issues
- Measures external quality rather than internal quality (software, not code).
- External quality refers to the quality of what users see; internal quality refers to the quality of what developers see.
- TDD promises better CODE, so internal quality must be studied.

Slide 15

Slide 15 text

The Janzen Studies
David Janzen’s PhD thesis performs code quality analysis on code written as part of an industrial experiment.
- 4 industrial studies
  - Training seminar (given a program to write, measured)
  - Real-world code: No-Tests/TDD
  - Real-world code: Test-Last/TDD
  - Real-world code: TDD/Test-Last
- Applied many object-oriented metrics to code produced

Slide 16

Slide 16 text

The Janzen Studies Results
TDD was effective, but not by much.
- TDD decreased code complexity, but only among seasoned developers. Newer developers wrote worse code with TDD.
- Increased dependencies and amount of code, but that counts test code.
- Most results were not statistically significant: for the most part, the code was the same.

Slide 17

Slide 17 text

The Janzen Studies Issues
- In-depth study of TDD on internal quality, but calls for additional work
- Counting test code in metrics artificially inflated dependencies and size
- “John Henry Effect”/Observer Effect
  - Control and experimental groups knew they were control and experimental groups
  - Knew that code would be measured according to code quality metrics
  - Behavior may have been altered to try to increase internal quality beyond what they would do naturally
  - May help explain lack of difference between two groups in industrial setting

Slide 18

Slide 18 text

Summary
Still a great deal to learn about TDD
- Most studies measure external quality, not internal quality. Good to know, but TDD is sold as a way to improve code quality, and should be measured accordingly.
- Many studies are academic, not industrial.
  - Not “real-world”: students don’t have to maintain code. Why write clean code?
  - Doesn’t test how well TDD works after it’s been used for some time
- The studies that both measure internal quality and measure industrial code may suffer from observer effect

Slide 19

Slide 19 text

Next Steps
Need to measure TDD in a manner more consistent with its supposed benefits.
- Problem: measuring defects doesn’t speak to internal quality
  Solution: use object-oriented code metrics.
- Problem: academic setting doesn’t test real-world TDD
  Solution: measure real-world code used in production
- Problem: experimental apparatus may skew results
  Solution: measure code written without the knowledge that it would be measured
Next Steps: apply object-oriented metrics to Open Source projects.

Slide 20

Slide 20 text

Internal Quality
How can we define good object-oriented code?
- High Cohesion
- Low Coupling
- Low Complexity

Slide 21

Slide 21 text

Cohesion
Cohesion: the degree to which the elements of a module are related
- High Cohesion: all code in a class works to support a single responsibility
- Low Cohesion: code in a class supports a random collection of functions (example: utility library)
High cohesion makes it easier to reuse modules, easier to maintain, and isolates faults to a single module

Slide 22

Slide 22 text

Lack of Cohesion of Methods
Measures disconnect between the set of methods and the set of attributes

    public class Rectangle { // LCOM: 0
        private double width;
        private double height;

        public Rectangle(double width, double height) {
            super();
            this.width = width;
            this.height = height;
        }

        public double getArea() {
            return this.width * this.height;
        }

        public double getPerimeter() {
            return this.width * 2 + this.height * 2;
        }
    }

    public class Circle { // LCOM: 0.333
        private double x;
        private double y;
        private double radius;

        public Circle(double x, double y, double radius) {
            this.x = x;
            this.y = y;
            this.radius = radius;
        }

        public double getArea() {
            return Math.PI * this.radius * this.radius;
        }

        public boolean contains(double x, double y) {
            double distance = Math.sqrt(
                (x - this.x) * (x - this.x) + (y - this.y) * (y - this.y));
            return distance <= this.radius;
        }
    }
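The slide does not state which LCOM formula produced the 0 and 0.333 values. As a sketch, the Henderson-Sellers variant, LCOM = (m - mean(m(A))) / (m - 1), reproduces both numbers when the constructor is counted as a method; here m is the number of methods and m(A) the number of methods that access attribute A. Treating this as the formula behind the slide is an assumption, and `LcomSketch` and its method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

class LcomSketch {
    /**
     * Henderson-Sellers LCOM (assumed variant).
     * methodCount: number of methods, constructors included.
     * accessedBy: one entry per attribute, holding the methods that touch it.
     */
    static double lcom(int methodCount, List<Set<String>> accessedBy) {
        if (methodCount <= 1 || accessedBy.isEmpty()) {
            return 0.0; // formula is undefined here; this sketch reports 0
        }
        double totalAccesses = 0.0;
        for (Set<String> methods : accessedBy) {
            totalAccesses += methods.size();
        }
        double meanAccesses = totalAccesses / accessedBy.size();
        return (methodCount - meanAccesses) / (methodCount - 1);
    }

    public static void main(String[] args) {
        // Rectangle: constructor, getArea and getPerimeter all use width and height.
        double rectangle = lcom(3, List.of(
                Set.of("ctor", "getArea", "getPerimeter"),
                Set.of("ctor", "getArea", "getPerimeter")));
        // Circle: x and y are used by ctor and contains; radius by all three methods.
        double circle = lcom(3, List.of(
                Set.of("ctor", "contains"),
                Set.of("ctor", "contains"),
                Set.of("ctor", "getArea", "contains")));
        System.out.println("Rectangle LCOM = " + rectangle);
        System.out.println("Circle LCOM = " + circle);
    }
}
```

`Circle` scores higher because `getArea` ignores `x` and `y`: not every method works on every attribute, which is the disconnect the metric is meant to capture.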

Slide 23

Slide 23 text

Specialization Index
Measures how effectively subclasses add new behavior while reusing existing behavior
Woodpecker has a specialization index of 0, Penguin has 1

Slide 24

Slide 24 text

Number of Method Parameters
- Method arguments represent data that the method does not have access to.
- Disconnect between the operations a class can perform and the data it needs to do the operation: lack of cohesion
- A method that takes 2 parameters is like a method that accesses two attributes that no other methods access. Similar to LCOM.

Slide 25

Slide 25 text

Number of Static Methods
- Static methods belong to classes that do not own the data needed by the methods.
- As with method parameters, this indicates that the data for a responsibility and the implementation of that responsibility are separated

Slide 26

Slide 26 text

Coupling
Coupling is the degree to which modules are dependent upon other modules.
- Low Coupling: modules pass messages to others
- High Coupling: one module depends on the inner workings of another module
High coupling is bad:
- Modifying one module requires modifying another.
- Modules cannot be understood in isolation.
- Reuse is difficult; requires pulling in additional modules.

Slide 27

Slide 27 text

Afferent and Efferent Coupling
Measures dependencies between packages.
- Afferent Coupling (Ca): number of classes outside a package that depend upon classes within it
- Efferent Coupling (Ce): number of classes outside a package that are depended upon by classes within it
[Figure: dependency diagram of Package A, Package B, Package C, and Package D, with classes q, t, r, u, s, v]
Ca for Package C: 3. Ce for Package C: 1
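The two definitions can be sketched as counts over a class-level dependency graph. The edges of the slide's diagram are not recoverable from the text, so the package layout and dependencies below are hypothetical, chosen only so that Package C comes out with Ca = 3 and Ce = 1 as on the slide. `CouplingSketch` and its members are likewise illustrative names.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CouplingSketch {
    // Hypothetical layout: which package each class lives in.
    static final Map<String, String> PACKAGE_OF = Map.of(
            "q", "A", "r", "A", "s", "B", "t", "C", "u", "C", "v", "D");

    // Hypothetical edges: class -> classes it depends on.
    static final Map<String, Set<String>> DEPENDS_ON = Map.of(
            "q", Set.of("t"),
            "r", Set.of("t"),
            "s", Set.of("u"),
            "u", Set.of("v"));

    // Ca: distinct classes outside pkg that depend on a class inside pkg.
    static int afferent(String pkg) {
        Set<String> dependents = new HashSet<>();
        for (Map.Entry<String, Set<String>> edge : DEPENDS_ON.entrySet()) {
            String from = edge.getKey();
            if (PACKAGE_OF.get(from).equals(pkg)) {
                continue; // dependencies from inside the package do not count
            }
            for (String to : edge.getValue()) {
                if (PACKAGE_OF.get(to).equals(pkg)) {
                    dependents.add(from);
                }
            }
        }
        return dependents.size();
    }

    // Ce: distinct classes outside pkg that classes inside pkg depend on.
    static int efferent(String pkg) {
        Set<String> dependencies = new HashSet<>();
        for (Map.Entry<String, Set<String>> edge : DEPENDS_ON.entrySet()) {
            if (!PACKAGE_OF.get(edge.getKey()).equals(pkg)) {
                continue; // only look at classes inside the package
            }
            for (String to : edge.getValue()) {
                if (!PACKAGE_OF.get(to).equals(pkg)) {
                    dependencies.add(to);
                }
            }
        }
        return dependencies.size();
    }

    public static void main(String[] args) {
        System.out.println("Ca(C) = " + afferent("C"));
        System.out.println("Ce(C) = " + efferent("C"));
    }
}
```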

Slide 28

Slide 28 text

Instability and Abstractness
Instability: the amount of work required to make a change (determined by a package’s afferent and efferent coupling)
- Stable package: many packages depend on it but it depends on few (hard to change, but rarely needs to)
- Unstable package: few packages depend on it but it depends on many (easy to change, and often needs to)
Abstractness: how easy a package is to extend
Instability is okay: a package should be as abstract as it is stable.
- Stable packages are difficult to change, so abstractness should make them easy to extend.
- Unstable packages are easy to change, so they should be concrete.

Slide 29

Slide 29 text

Distance From The Main Sequence
Measures how problematic the level of coupling is.
Instability (I) should be inversely related to Abstractness (A).
[Figure: plot of Abstractness (A) and Instability (I), each axis running from 0 to 1, showing the Main Sequence]
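The slides do not give formulas for these three quantities. The sketch below uses Robert Martin's usual definitions, I = Ce / (Ca + Ce), A = abstract classes / total classes, and D = |A + I - 1|, on the assumption that the tooling follows them. The class counts in `main` are hypothetical; the Ca and Ce values are those quoted for Package C.

```java
class MainSequenceSketch {
    // Instability: share of a package's couplings that point outward.
    static double instability(int afferent, int efferent) {
        int total = afferent + efferent;
        return total == 0 ? 0.0 : (double) efferent / total;
    }

    // Abstractness: share of a package's classes that are abstract or interfaces.
    static double abstractness(int abstractClasses, int totalClasses) {
        return totalClasses == 0 ? 0.0 : (double) abstractClasses / totalClasses;
    }

    // Normalized distance from the line A + I = 1 (the Main Sequence).
    static double distance(double abstractness, double instability) {
        return Math.abs(abstractness + instability - 1.0);
    }

    public static void main(String[] args) {
        double i = instability(3, 1);   // Ca = 3, Ce = 1, as for Package C
        double a = abstractness(3, 4);  // hypothetical: 3 of 4 classes abstract
        System.out.println("I = " + i + ", A = " + a + ", D = " + distance(a, i));
    }
}
```

A stable package (low I) that is also abstract (high A) sits on the line and has D near 0; a stable but concrete package sits far from it, which is the problematic case the metric flags.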

Slide 30

Slide 30 text

Depth of Inheritance Tree
- Inheritance is a form of coupling (changes to parent implementations affect children).
- DIT measures how coupled classes are via inheritance.
- HouseCat has a DIT of 4, as it takes 4 jumps to get back to Object.
[Figure: inheritance tree containing java.lang.Object, Animal, Mammal, Cat, Dog, and HouseCat]
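As a rough illustration, DIT can be computed with reflection by counting superclass jumps up to `java.lang.Object`. The class names follow the slide's tree; placing `Dog` under `Mammal` is an assumption, since the figure itself did not survive extraction, and `DitSketch` is a hypothetical helper.

```java
class Animal {}
class Mammal extends Animal {}
class Cat extends Mammal {}
class Dog extends Mammal {}      // assumed position in the tree
class HouseCat extends Cat {}

class DitSketch {
    // Counts how many superclass jumps it takes to reach java.lang.Object.
    static int depthOfInheritance(Class<?> type) {
        int depth = 0;
        for (Class<?> c = type; c.getSuperclass() != null; c = c.getSuperclass()) {
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("DIT(HouseCat) = " + depthOfInheritance(HouseCat.class));
        System.out.println("DIT(Animal) = " + depthOfInheritance(Animal.class));
    }
}
```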

Slide 31

Slide 31 text

Complexity
The effort needed to understand and modify the code.
- Low Complexity: classes are relatively easy to read, understand, and change.
- High Complexity: classes are confusing
Low Complexity is better. Code is easier to understand.

Slide 32

Slide 32 text

Size of Classes and Methods
- Long methods and large classes are difficult to understand.
- Lines of code are easy to measure (but don’t count comments or whitespace).
- Total Lines of Code is not counted, as that is a measure of software complexity, not code complexity (more complex software will always have more code)

Slide 33

Slide 33 text

McCabe Cyclomatic Complexity
Measures the number of linearly independent paths through a method.
Number of decision points plus one

    public class McCabe {
        private Bar bar;

        public void performWork(boolean a, boolean b) {
            if (a) {          // decision point 1
                bar.baz();
            } else {
                bar.quux();
            }
            if (b) {          // decision point 2
                bar.corge();
            } else {
                bar.grault();
            }
        }
    }

MCC = 3

Slide 34

Slide 34 text

Nested Block Depth
Measures the level of nesting within a code block.

    public void complexProcedure(boolean[] conditions) {
        if ( conditions[0] ) {
            if ( conditions[1] && conditions[2] ) {
                if ( conditions[3] || !conditions[4] ) {
                    doSomething();
                } else {
                    doSomethingElse();
                }
            } else {
                if ( !conditions[6] ) {
                    doManyThings();
                } else {
                    doManyOtherThings();
                }
            }
        } else if ( conditions[5] ) {
            doAnotherThing();
        } else {
            if ( conditions[7] && conditions[8] && !conditions[9]) {
                doNothing();
            }
        }
    }

NBD = 4.

Slide 35

Slide 35 text

Creating Groups
- Created two groups of Open Source projects: one using TDD, one not.
- Sent surveys to dozens of projects asking people how often they used TDD and whether they were committers or contributors
- Only results from committers used.

Slide 36

Slide 36 text

Selecting Projects
Selected projects based on size (tried to get both groups approximately equal in lines of code).

Size        LOC      TDD           Non-TDD
Small       10,000   JUnit         Jericho HTML Parser
Medium      30,000   Commons Math  JAMWiki
Large       100,000  FitNesse      JSPWiki
Very Large  300,000  Hudson        XWiki

Slide 37

Slide 37 text

Measuring Internal Quality
- Eclipse Metrics tool used to track metrics previously discussed.
- Excluded from measurement: test code, generated code, example code.

Slide 38

Slide 38 text

Scaling Results
- Eclipse Metrics measures individual projects, e.g. “Project A LCOM = 0.3”. How does one get the LCOM score for the entire TDD codebase?
- Averages for the TDD and Non-TDD groups are calculated by weighting scores by sizes.
- Example: What is TDD’s score for LCOM? Two TDD projects: Project A has an average LCOM score of 0.3 per class, Project B has an average of 0.6 per class. Project A contains 20 classes, Project B contains 50.

$$\mathrm{LCOM} = \frac{(0.3 \cdot 20) + (0.6 \cdot 50)}{20 + 50} \approx 0.5143$$

- LCOM is per-class. Other metrics may be per package or per method, so they are scaled according to each project’s number of packages and methods.
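The size-weighted average in the example can be sketched as follows. `WeightedScoreSketch` and `weightedAverage` are hypothetical names, not part of Eclipse Metrics; the inputs are the Project A and Project B figures from the slide.

```java
class WeightedScoreSketch {
    /**
     * Combines per-project average scores into one group score, weighting each
     * project by its number of measured units (classes, packages or methods).
     */
    static double weightedAverage(double[] scores, int[] unitCounts) {
        if (scores.length != unitCounts.length || scores.length == 0) {
            throw new IllegalArgumentException("need one unit count per score");
        }
        double weightedSum = 0.0;
        long totalUnits = 0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += scores[i] * unitCounts[i];
            totalUnits += unitCounts[i];
        }
        return weightedSum / totalUnits;
    }

    public static void main(String[] args) {
        // Project A: LCOM 0.3 over 20 classes; Project B: LCOM 0.6 over 50 classes.
        double lcom = weightedAverage(new double[] {0.3, 0.6}, new int[] {20, 50});
        System.out.println("Group LCOM = " + lcom);
    }
}
```

Weighting by class count keeps a small project from counting as much as a large one; an unweighted mean of 0.3 and 0.6 would give 0.45 instead of roughly 0.5143.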

Slide 39

Slide 39 text

Results
TDD improved nearly every metric.

Slide 40

Slide 40 text

Lack of Cohesion of Methods Results
[Figure: chart of Lack of Cohesion of Methods by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
LCOM_TDD = 0.1934524. LCOM_NTDD = 0.208. TDD improvement: 6.97%.

Slide 41

Slide 41 text

Specialization Index Results
[Figure: chart of Specialization Index by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
SI_TDD = 0.2128426. SI_NTDD = 0.359. TDD improvement: 40.67%.
JAMWiki: no overrides, almost no inheritance

Slide 42

Slide 42 text

Number of Parameters Results
[Figure: chart of Number of Parameters by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
PAR_TDD = 0.9660960. PAR_NTDD = 1.286. TDD improvement: 24.86%

Slide 43

Slide 43 text

Number of Static Methods Results
[Figure: chart of Number of Static Methods by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
NSM_TDD = 0.7144425. NSM_NTDD = 0.819. TDD improvement: 12.81%
Math did better even though it’s a utility library. Other projects were about the same.

Slide 44

Slide 44 text

Effect on Cohesion

Metric   TDD        Non-TDD  TDD Improvement
LCOM     0.1934524  0.208    6.97%
SI       0.2128426  0.359    40.67%
PAR      0.9660960  1.286    24.86%
NSM      0.7144425  0.819    12.81%
Overall                      21.33%

Slide 45

Slide 45 text

Afferent Coupling Results
[Figure: chart of Afferent Coupling by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
Ca_TDD = 12.7669483. Ca_NTDD = 14.982. Improvement: 14.78%
Jericho’s low score: only two packages, one of which has 110/112 classes.

Slide 46

Slide 46 text

Efferent Coupling Results
[Figure: chart of Efferent Coupling by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
Ce_TDD = 6.7503621. Ce_NTDD = 6.658. Improvement: -1.39%

Slide 47

Slide 47 text

Distance from the Main Sequence Results
[Figure: chart of Distance from the Main Sequence by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
D_TDD = 0.2701667. D_NTDD = 0.332. Improvement: 18.68%
Jericho’s result is again due to having only two packages (very high instability) and a good level of abstraction

Slide 48

Slide 48 text

Depth of Inheritance Tree Results
[Figure: chart of Depth of Inheritance Tree by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
DIT_TDD = 1.9245879. DIT_NTDD = 2.095. Improvement: 8.13%

Slide 49

Slide 49 text

Effect on Coupling

Metric   TDD         Non-TDD  TDD Improvement
Ca       12.7669483  14.982   14.78%
Ce       6.7503621   6.658    -1.39%
D        0.2701667   0.332    18.68%
DIT      1.9245879   2.095    8.13%
Overall                       10.05%
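The slides do not spell out how the improvement column is computed. One reading that reproduces the Ca and Ce rows is the relative reduction (Non-TDD - TDD) / Non-TDD, with the "Overall" row as the unweighted mean of the four percentages. The sketch below works under that assumption; `ImprovementSketch` and its method names are hypothetical. Some other rows differ from this formula in the second decimal place, presumably because the Non-TDD values shown on the slides are rounded.

```java
class ImprovementSketch {
    // Relative reduction of the TDD score versus the Non-TDD score, in percent.
    // Lower raw scores are better for these metrics, so a positive value means
    // the TDD group did better.
    static double improvementPercent(double tdd, double nonTdd) {
        return (nonTdd - tdd) / nonTdd * 100.0;
    }

    // Unweighted mean, used here for the "Overall" row.
    static double mean(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        System.out.println("Ca improvement = " + improvementPercent(12.7669483, 14.982));
        System.out.println("Ce improvement = " + improvementPercent(6.7503621, 6.658));
        System.out.println("Coupling overall = " + mean(14.78, -1.39, 18.68, 8.13));
    }
}
```

The negative Ce value falls out naturally: the TDD group's efferent coupling is slightly higher than the Non-TDD group's, so the "improvement" is a small regression.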

Slide 50

Slide 50 text

Method Lines of Code Results
[Figure: chart of Method Lines of Code by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
MLOC_TDD = 5.7168565. MLOC_NTDD = 8.672. Improvement: 34.08%.
Only Math did worse. Fitnesse scored lower than all 7 other projects (written by Bob Martin, advocate of short methods).

Slide 51

Slide 51 text

Class Lines of Code Results
[Figure: chart of Class Lines of Code by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
CLOC_TDD = 38.1216800. CLOC_NTDD = 76.150. Improvement: 49.94%

Slide 52

Slide 52 text

McCabe Cyclomatic Complexity Results
[Figure: chart of McCabe Cyclomatic Complexity by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
MCC_TDD = 1.7430383. MCC_NTDD = 2.361. Improvement: 26.18%

Slide 53

Slide 53 text

Nested Block Depth Results
[Figure: chart of Nested Block Depth by project size (Small, Medium, Large, Very Large); TDD: JUnit, Math, Fitnesse, Hudson; Non-TDD: Jericho, JAMWiki, JSPWiki, XWiki]
NBD_TDD = 1.3246218. NBD_NTDD = 1.536. Improvement: 13.74%.

Slide 54

Slide 54 text

Effect on Complexity

Metric   TDD         Non-TDD  TDD Improvement
MLOC     5.7168565   8.672    34.08%
CLOC     38.1216800  76.150   49.94%
MCC      1.7430383   2.361    26.18%
NBD      1.3246218   1.536    13.74%
Overall                       30.98%

Worth noting: the survey revealed that TDD adherence was lower on Commons Math than on the other TDD projects. It consistently scored worse than JAMWiki on complexity metrics, and it was the only TDD project to score worse than its Non-TDD counterpart on complexity.

Slide 55

Slide 55 text

Overall Effect of TDD
TDD improved internal quality
- Cohesion: 21.33%
- Coupling: 10.05%
- Complexity: 30.98%
- Overall: 20.79%
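The slide does not say how the overall figure was derived. It is consistent with an unweighted mean of the three category scores, which is offered here as an assumption rather than the author's stated method:

```latex
\text{Overall} = \frac{21.33 + 10.05 + 30.98}{3} = \frac{62.36}{3} \approx 20.79\%
```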

Slide 56

Slide 56 text

Conclusions
TDD helps improve real-world, industrial code.
- Strongest effect on complexity. TDD encourages regular refactoring, which improves complexity.
- Second strongest on cohesion. Writing tests first confronts the programmer with ‘is this the right place for this functionality’ more often before code is written.
- Weakest on coupling. Writing tests for highly coupled code is difficult (must pull in extra dependencies for test). But mocking frameworks make it easier to stub out dependencies, which may explain the weaker effect.

Slide 57

Slide 57 text

Professionalism Revisited
“A professional writes clean, flexible code that works [...] TDD’s disciplines are a huge help in meeting professionalism’s requirements and it would therefore be unprofessional of me not to follow them”.
-Robert C. Martin, 2007

Slide 58

Slide 58 text

Further Reading
- “Test Driven Development: By Example” by Kent Beck
- “Extreme Programming Explained” by Kent Beck
- “Working Effectively with Legacy Code” by Michael Feathers
- “Refactoring” by Martin Fowler
- “Growing Object-Oriented Software, Guided by Tests” by Steve Freeman and Nat Pryce
- “xUnit Test Patterns” by Gerard Meszaros
- “Clean Code” by Robert Martin
- “Agile Software Development, Principles, Patterns, and Practices” by Robert Martin

Slide 59

Slide 59 text

Questions?