Source Code Diff Revolution (JetBrains Open Reading Club)

Source code diff revolution

How much time does a developer spend on coding and code
reviewing per day?

Global Code Time Report (250K+ developers): 41 minutes of code reviewing per day
https://www.software.com/reports/code-time-report

How are developers reviewing code?

AST Diff to the rescue
• GumTree (Falleri et al., 2014)
• First phase: top-down AST matching that iteratively finds the largest identical subtrees (the AST hash value is based on node label and value).
• Second phase: matches previously unmatched AST nodes that have a fair number of their children matched (dice function: the ratio of common descendants between two nodes must be greater than or equal to 0.5).
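The two ingredients named above, a hash built from label and value and a dice ratio over common descendants, can be sketched as follows. This is only an illustration: `DiceSketch`, `Node`, `structuralHash` and `dice` are hypothetical names rather than GumTree's API, and the dice formula used (twice the mapped descendant pairs over the total number of descendants) is an assumption about how the ratio is normalized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class DiceSketch {
    /** Hypothetical minimal AST node (not GumTree's own tree class). */
    static class Node {
        final String label;  // node type, e.g. "SimpleName"
        final String value;  // token text, "" when the node has none
        final List<Node> children = new ArrayList<>();

        Node(String label, String value, Node... kids) {
            this.label = label;
            this.value = value;
            for (Node k : kids) children.add(k);
        }

        /** Hash over label, value and child hashes; equal hashes flag candidate identical subtrees. */
        int structuralHash() {
            int h = Objects.hash(label, value);
            for (Node c : children) h = 31 * h + c.structuralHash();
            return h;
        }

        /** All descendants, excluding the node itself. */
        List<Node> descendants() {
            List<Node> out = new ArrayList<>();
            for (Node c : children) {
                out.add(c);
                out.addAll(c.descendants());
            }
            return out;
        }
    }

    /** Assumed formula: 2 * (descendants of t1 mapped into t2) / (|desc(t1)| + |desc(t2)|). */
    static double dice(Node t1, Node t2, Map<Node, Node> mappings) {
        List<Node> d1 = t1.descendants();
        List<Node> d2 = t2.descendants();
        if (d1.isEmpty() && d2.isEmpty()) return 0.0; // two leaves share no descendants
        int common = 0;
        for (Node n : d1) {
            Node partner = mappings.get(n);
            if (partner != null && d2.contains(partner)) common++;
        }
        return 2.0 * common / (d1.size() + d2.size());
    }

    public static void main(String[] args) {
        Node x1 = new Node("SimpleName", "x");
        Node one = new Node("NumberLiteral", "1");
        Node before = new Node("Assignment", "=", x1, one);

        Node x2 = new Node("SimpleName", "x");
        Node two = new Node("NumberLiteral", "2");
        Node after = new Node("Assignment", "=", x2, two);

        // Stand-in for the first phase: map leaves whose structural hashes agree.
        Map<Node, Node> mappings = new HashMap<>();
        if (x1.structuralHash() == x2.structuralHash()) mappings.put(x1, x2);

        double d = dice(before, after, mappings);
        System.out.println("dice = " + d + ", container match: " + (d >= 0.5));
    }
}
```

In this toy input only the `x` leaves are mapped up front, so whether the two assignments are matched in the second phase depends entirely on the 0.5 threshold.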

GumTree clones
• MTDiff (Dotzler & Philippsen, 2016): introduces 5 optimizations to improve the accuracy of the generated edit script, specifically for Move actions, which make the edit scripts shorter.
• IJM, the Iterative Java Matcher (Frick et al., 2018): partial matching, applying GumTree to selected parts of the source code (import declarations, methods with the same signature).
• All papers focus their evaluation on which tool generates the shortest edit script.

Is shorter (edit script) better?
Fan et al. 2021: “GumTree, MTDiff and IJM generate inaccurate mappings for 20%–29%, 25%–36% and 21%–30% of the file revisions, respectively. Our experimental results show that state-of-the-art AST mapping algorithms still need improvements.”

Limitation #1: Language agnostic

GumTree Simple: renamed new method

RefactoringMiner: call to extracted method

Limitation #2: No support for multi-mappings

GumTree Simple

GumTree Greedy

RefactoringMiner: calls to extracted method

How common are multi-mappings?

Limitation #3: Semantic ignorance

GumTree Greedy:
• Type matched with variable
• Variable matched with method call
• Variable matched with lambda parameter

RefactoringMiner

Limitation #4: Matching only nodes of same AST type

Limitation #5: Refactoring unawareness

GumTree Simple: object matched with itemKey
RefactoringMiner: Rename object to item; Extract variable itemKey

GumTree Greedy: method call matched with type
RefactoringMiner: Extract variable onResource; Extract variable offResource

Limitation #6: No support for commit-level analysis

GumTree Greedy

RefactoringMiner

Approach

Diagram: statement mapping process for a pair of methods (method #1, method #2). Stage labels shown on the diagram: 3.1, 3.2, 3.3, 3.4, 3.5.
• Leaf statement mapping
  Round 1: Identical + same depth (iso-structural control flow)
  Round 2: Identical + different depth
  Round 3: Non-identical
  Candidate sorting
  Output: remaining unmatched leaves
• Composite statement mapping
  Round 1: Identical + different depth + # mapped children ≥ 1
  Round 2: Non-identical + # mapped children ≥ 1
  Candidate sorting
  Output: remaining unmatched composites
• Inexact leaf mappings: unmatched leaves from #1 and #2
• Inexact composite mappings: unmatched composites from #1 and #2
• Added methods: leaf + composite statement mapping; unmatched/inexact leaves + composites from #1 matched with added method (Extract Method detection)
• Removed methods: leaf + composite statement mapping; unmatched/inexact leaves + composites from #2 matched with removed method (Inline Method detection)
• Mapping optimization
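The first two leaf rounds (identical statements at the same depth, then identical statements at a different depth) can be sketched as below. This is a simplified illustration and not RefactoringMiner's implementation: `LeafRoundsSketch`, `Leaf` and `matchIdenticalLeaves` are hypothetical names, "identical" is reduced to plain text equality, and candidate sorting is replaced by taking the first candidate found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeafRoundsSketch {
    /** Hypothetical leaf statement: source text plus nesting depth. */
    static class Leaf {
        final String text;
        final int depth;
        Leaf(String text, int depth) { this.text = text; this.depth = depth; }
        @Override public String toString() { return text + " @depth " + depth; }
    }

    /**
     * Round 1 pairs identical statements at the same depth, round 2 pairs
     * identical statements at different depths. Leaves left unmapped would
     * go on to the non-identical round, which is not modelled here.
     */
    static Map<Leaf, Leaf> matchIdenticalLeaves(List<Leaf> before, List<Leaf> after) {
        Map<Leaf, Leaf> mappings = new LinkedHashMap<>();
        List<Leaf> unmatchedAfter = new ArrayList<>(after);
        for (int round = 1; round <= 2; round++) {
            for (Leaf b : before) {
                if (mappings.containsKey(b)) continue; // already mapped in an earlier round
                Leaf chosen = null;
                for (Leaf a : unmatchedAfter) {
                    boolean sameDepth = b.depth == a.depth;
                    boolean depthFitsRound = (round == 1) ? sameDepth : !sameDepth;
                    if (b.text.equals(a.text) && depthFitsRound) {
                        chosen = a; // first candidate wins; no candidate sorting in this sketch
                        break;
                    }
                }
                if (chosen != null) {
                    mappings.put(b, chosen);
                    unmatchedAfter.remove(chosen);
                }
            }
        }
        return mappings;
    }

    public static void main(String[] args) {
        List<Leaf> before = Arrays.asList(
            new Leaf("int x = 0;", 1), new Leaf("log(x);", 2), new Leaf("return x;", 1));
        List<Leaf> after = Arrays.asList(
            new Leaf("int x = 0;", 1), new Leaf("log(x);", 1), new Leaf("return y;", 1));
        matchIdenticalLeaves(before, after)
            .forEach((b, a) -> System.out.println(b + "  ->  " + a));
    }
}
```

Running the rounds in this order means a statement that stayed at the same nesting level is claimed before an identical copy at another level can compete for it.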

Improvements over RefactoringMiner 2.0
• Sorting criteria for leaf statement mappings
• Sorting criteria for composite statement mappings
• Multi-mapping support for duplicated code moved out of or moved into conditionals
• Statement mapping scope based on call sites

tachyon/worker/block/allocator/MaxFreeAllocator.java (before)

  @Override
  public TempBlockMeta allocateBlock(long userId, long blockId, long blockSize …
      BlockStoreLocation location) throws IOException {
    StorageDir candidateDir = null;
    long maxFreeBytes = blockSize;
    if (location.equals(BlockStoreLocation.anyTier())) {
      for (StorageTier tier : mMetaManager.getTiers()) {
        for (StorageDir dir : tier.getStorageDirs()) {
          if (dir.getAvailableBytes() >= maxFreeBytes) {
            maxFreeBytes = dir.getAvailableBytes();
            candidateDir = dir;
          }
        }
      }
    } else if (location.equals(BlockStoreLocation.anyDirInTier(location.tierAli …
      StorageTier tier = mMetaManager.getTier(location.tierAlias());
      for (StorageDir dir : tier.getStorageDirs()) {
        if (dir.getAvailableBytes() >= maxFreeBytes) {
          maxFreeBytes = dir.getAvailableBytes();
          candidateDir = dir;
        }
      }
    }
    return candidateDir != null ? new TempBlockMeta(userId, blockId, blockSize, candidateDir) : null;
  }

tachyon/worker/block/allocator/MaxFreeAllocator.java (after)

  @Override
  public TempBlockMeta allocateBlock(long userId, long blockId, long blockSize …
      BlockStoreLocation location) throws IOException {
    StorageDir candidateDir = null;
    if (location.equals(BlockStoreLocation.anyTier())) {
      for (StorageTier tier : mMetaManager.getTiers()) {
        candidateDir = getCandidateDirInTier(tier, blockSize);
        if (candidateDir != null) {
          return new TempBlockMeta(userId, blockId, blockSize, candidateDir);
        }
      }
    } else if (location.equals(BlockStoreLocation.anyDirInTier(location.tierAli …
      StorageTier tier = mMetaManager.getTier(location.tierAlias());
      candidateDir = getCandidateDirInTier(tier, blockSize);
    }
    return candidateDir != null ? new TempBlockMeta(userId, blockId, blockSize, candidateDir) : null;
  }

  private StorageDir getCandidateDirInTier(StorageTier tier, long blockSize) {
    StorageDir candidateDir = null;
    long maxFreeBytes = blockSize - 1;
    for (StorageDir dir : tier.getStorageDirs()) {
      if (dir.getAvailableBytes() > maxFreeBytes) {
        maxFreeBytes = dir.getAvailableBytes();
        candidateDir = dir;
      }
    }
    return candidateDir;
  }

Legend: Call to the extracted method; Moved to the extracted method; Parent block mapping

GumTree minHeight hyperparameter
• minHeight: the minimum subtree height, where height is the length of the longest path from a leaf to the root of the subtree
• GumTree default threshold = 2
• Why? It avoids matching remaining leaf expressions with height 1 (e.g., SimpleName nodes) that coincidentally have the same value.
• Since we give a pair of matched statements as input, we configure minHeight = 1
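A small sketch of what the threshold changes, assuming height is counted in nodes (a leaf has height 1, as the SimpleName example implies) and that a subtree is a matching candidate when its height is at least minHeight. `MinHeightSketch`, `Tree`, `height` and `eligibleSubtrees` are hypothetical names, not GumTree's API.

```java
import java.util.ArrayList;
import java.util.List;

class MinHeightSketch {
    /** Hypothetical minimal tree node. */
    static class Tree {
        final String label;
        final List<Tree> children = new ArrayList<>();
        Tree(String label, Tree... kids) {
            this.label = label;
            for (Tree k : kids) children.add(k);
        }
    }

    /** Height counted in nodes: a leaf has height 1. */
    static int height(Tree t) {
        int max = 0;
        for (Tree c : t.children) max = Math.max(max, height(c));
        return max + 1;
    }

    /** Subtrees (including the root) whose height is at least minHeight. */
    static List<Tree> eligibleSubtrees(Tree root, int minHeight) {
        List<Tree> out = new ArrayList<>();
        if (height(root) >= minHeight) out.add(root);
        for (Tree c : root.children) out.addAll(eligibleSubtrees(c, minHeight));
        return out;
    }

    public static void main(String[] args) {
        // foo(x): a MethodInvocation with two SimpleName leaves.
        Tree call = new Tree("MethodInvocation", new Tree("SimpleName"), new Tree("SimpleName"));
        System.out.println("minHeight=2: " + eligibleSubtrees(call, 2).size() + " candidate subtree(s)");
        System.out.println("minHeight=1: " + eligibleSubtrees(call, 1).size() + " candidate subtree(s)");
    }
}
```

With the default of 2 the two SimpleName leaves are never candidates on their own; lowering the threshold to 1 admits them, which is safer when the input is already restricted to one pair of matched statements.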

Before:

  KubernetesListBuilder builder = new KubernetesListBuilder()
      .withLivenessProbe(getLivenessProbe())
      .endContainer()
      .withVolumes(getVolumes())
      .endSpec()
      .endTemplate()
      .endSpec()
      .endReplicationControllerItem();

After:

  KubernetesListBuilder builder = new KubernetesListBuilder()
      .withLivenessProbe(getLivenessProbe())
      .withReadinessProbe(getReadinessProbe())
      .endContainer()
      .withVolumes(getVolumes())
      .endSpec()
      .endTemplate()
      .endSpec()
      .endReplicationControllerItem();

The slide repeats this pair of statements for three tools: RefactoringMiner, GT 3.0 simple, and GT 3.0 greedy.

Evaluation results

AST Diff benchmark
• Process:
  1. Run all AST Diff tools
  2. Manually validate the diffs
  3. Construct the “perfect” diff
• Datasets:
  • 800 bug-fixing commits from Defects4J
  • 187 refactoring commits from Refactoring Oracle

Novel AST Diff Quality Metrics
• Statement (program element) mapping accuracy:
  • True Positive: a mapping given by a tool that exists in the benchmark
  • False Positive: a mapping given by a tool that does not exist in the benchmark
  • False Negative: a mapping that exists in the benchmark but was not reported by a tool
• Semantically incompatible mappings:
  • M = (m1, m2) is returned by a tool
  • m1 and m2 have the same AST type
  • The parents of m1 and m2 have a different AST type
  • M is not included in the benchmark
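The definitions above translate directly into set arithmetic. The sketch below is an illustration under assumptions: mappings are encoded as hypothetical strings such as "12->15", the precision, recall and F-measure formulas are the standard ones, and `MappingMetricsSketch` and its methods are made-up names rather than part of the benchmark's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MappingMetricsSketch {
    /** Counts {TP, FP, FN} for a tool's mappings against the benchmark. */
    static int[] counts(Set<String> tool, Set<String> benchmark) {
        int tp = 0;
        for (String m : tool) if (benchmark.contains(m)) tp++;
        int fp = tool.size() - tp;       // reported by the tool, absent from the benchmark
        int fn = benchmark.size() - tp;  // in the benchmark, missed by the tool
        return new int[] {tp, fp, fn};
    }

    static double precision(int tp, int fp) { return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp); }

    static double recall(int tp, int fn) { return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn); }

    static double fMeasure(double p, double r) { return p + r == 0 ? 0.0 : 2 * p * r / (p + r); }

    /** Same AST type, parents of different AST type, and not in the benchmark. */
    static boolean semanticallyIncompatible(String type1, String type2,
            String parentType1, String parentType2, boolean inBenchmark) {
        return type1.equals(type2) && !parentType1.equals(parentType2) && !inBenchmark;
    }

    public static void main(String[] args) {
        // Hypothetical mapping identifiers, invented for this example.
        Set<String> tool = new HashSet<>(Arrays.asList("1->1", "2->2", "3->3", "4->9"));
        Set<String> benchmark = new HashSet<>(Arrays.asList("1->1", "2->2", "3->3", "4->5", "6->7"));
        int[] c = counts(tool, benchmark);
        double p = precision(c[0], c[1]);
        double r = recall(c[0], c[2]);
        System.out.println("TP=" + c[0] + " FP=" + c[1] + " FN=" + c[2]);
        System.out.println("precision=" + p + " recall=" + r + " F=" + fMeasure(p, r));
    }
}
```

A mapping flagged by `semanticallyIncompatible` is by construction also a false positive, so that count isolates one particular kind of error inside the FP total.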

Accuracy – Defects4J +0.5-1% F-measure +1-3% F-measure

Accuracy – Defects4J +0.2-1.5% F-measure

Accuracy – Refactoring oracle +11-12% F-measure +5-7% F-measure

Execution time – Defects4J
• 3 times slower than GTS on median
• 4 times slower than GTS on average

Execution time – Refactoring oracle
• 5 times slower than GTS on median
• 8 times slower than GTS on average
• None of the other tools supports inter-file mappings

Conclusions
• RefactoringMiner has the best precision and recall in both benchmarks.
• The accuracy improvements are more evident in the Refactoring benchmark.
• GumTree 3.0 (simple) has better precision and recall than GumTree 3.0 (greedy) when considering sub-expression mappings.
• RefactoringMiner and IJM excel in matching program elements (i.e., method and field declarations) accurately.
• GumTree (greedy) and MTDiff generate the largest numbers of semantically incompatible mappings.
• RefactoringMiner’s execution time is in the same order of magnitude as that of the faster tools.

https://github.com/tsantalis/RefactoringMiner
https://github.com/pouryafard75/DiffBenchmark


More Decks by Nikolaos Tsantalis
