Source Code Diff Revolution (JetBrains Open Reading Club)

Invited talk at the JetBrains Open Reading Club
April 21, 2023
https://github.com/JetBrains/reading-club

Nikolaos Tsantalis

December 16, 2023

Transcript

  1. AST Diff to the rescue

    • GumTree (Falleri et al., 2014)
    • First phase: top-down AST matching that iteratively finds the largest identical subtrees (the AST hash value is based on node label and value).
    • Second phase: matching of previously unmatched AST nodes that have a fair amount of their children matched (dice function: the ratio of common descendants between two nodes must be greater than or equal to 0.5).
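The two phases above rely on a subtree hash and on the dice ratio. The following is a minimal sketch of both, not GumTree's actual code: `SketchNode`, `structuralHash`, and `DiceSimilarity` are hypothetical names introduced for illustration, and the dice formula used is the one commonly given for GumTree, 2 × (mapped descendant pairs) / (descendants of t1 + descendants of t2).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical minimal AST node; not GumTree's own tree class. */
final class SketchNode {
    final String label; // node type, e.g. "MethodInvocation"
    final String value; // token text, e.g. an identifier; may be empty
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String label, String value, SketchNode... kids) {
        this.label = label;
        this.value = value;
        for (SketchNode k : kids) children.add(k);
    }

    /** Hash over label, value, and the children's hashes; identical subtrees get equal hashes. */
    int structuralHash() {
        int h = Objects.hash(label, value);
        for (SketchNode c : children) h = 31 * h + c.structuralHash();
        return h;
    }

    /** All nodes strictly below this one. */
    List<SketchNode> descendants() {
        List<SketchNode> out = new ArrayList<>();
        for (SketchNode c : children) {
            out.add(c);
            out.addAll(c.descendants());
        }
        return out;
    }
}

final class DiceSimilarity {
    /** Threshold quoted on the slide. */
    static final double THRESHOLD = 0.5;

    /** dice = 2 * |descendants of t1 mapped to descendants of t2| / (|desc(t1)| + |desc(t2)|). */
    static double dice(SketchNode t1, SketchNode t2, Map<SketchNode, SketchNode> mappings) {
        List<SketchNode> d1 = t1.descendants();
        Set<SketchNode> d2 = new HashSet<>(t2.descendants());
        if (d1.isEmpty() && d2.isEmpty()) return 0.0; // two leaves share no descendants
        int common = 0;
        for (SketchNode n : d1) {
            SketchNode partner = mappings.get(n);
            if (partner != null && d2.contains(partner)) common++;
        }
        return 2.0 * common / (d1.size() + d2.size());
    }

    /** Second-phase candidate test: enough children already matched in the first phase. */
    static boolean isCandidate(SketchNode t1, SketchNode t2, Map<SketchNode, SketchNode> mappings) {
        return dice(t1, t2, mappings) >= THRESHOLD;
    }
}
```

With two blocks of two children each and one child pair already mapped, the ratio is 2 × 1 / 4 = 0.5, so the pair just passes the threshold.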
  2. GumTree clones

    • MTDiff (Dotzler & Philippsen, 2016): introduces 5 optimizations to improve the accuracy of the generated edit script, specifically for the Move actions, which make the edit scripts shorter.
    • IJM - Iterative Java Matcher (Frick et al., 2018): Partial matching, application of GumTree to selected parts of the source code (import declarations, methods with same signature).
    • All papers focus their evaluation on which tool generates the shorter edit script.
  3. Is shorter (edit script) better?

    Fan et al. 2021: “GumTree, MTDiff and IJM generate inaccurate mappings for 20%–29%, 25%–36% and 21%–30% of the file revisions, respectively. Our experimental results show that state-of-the-art AST mapping algorithms still need improvements.”
  4. [Slide figure: statement mapping process between method #1 and method #2]

    • Leaf statement mapping
        • Round 1: Identical + same depth (iso-structural control flow)
        • Round 2: Identical + different depth
        • Round 3: Non-identical
        • Candidate Sorting
        • Result: remaining unmatched leaves
    • Composite statement mapping
        • Round 1: Identical + different depth + # mapped children ≥ 1
        • Round 2: Non-identical + # mapped children ≥ 1
        • Candidate Sorting
        • Result: remaining unmatched composites
    • Inexact leaf mappings: unmatched leaves from #1 and #2
    • Inexact composite mappings: unmatched composites from #1 and #2
    • Added methods: leaf + composite statement mapping; unmatched/inexact leaves + composites from #1 matched with added method
    • Removed methods: leaf + composite statement mapping; unmatched/inexact leaves + composites from #2 matched with removed method
    • Extract Method detection
    • Inline Method detection
    • Mapping optimization
    • Step labels shown in the figure: 3.1 3.2 3.5 3.4 3.4 3.3 3.3
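The three leaf-mapping rounds in this figure can be read as progressively weaker matching criteria, each applied only to the leaves the previous round left unmatched. The sketch below illustrates that idea only and is not RefactoringMiner's implementation. `LeafStatement` and `LeafRoundMatcher` are hypothetical names, round 2 is simplified to "identical text at any depth", and the round 3 criterion (normalized Levenshtein distance below an assumed cut-off of 0.5, picking the closest candidate as a stand-in for candidate sorting) is an assumption, since the slide does not define it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical leaf statement: source text plus nesting depth inside the method body. */
final class LeafStatement {
    final String text;
    final int depth;

    LeafStatement(String text, int depth) {
        this.text = text;
        this.depth = depth;
    }
}

final class LeafRoundMatcher {
    /** Assumed cut-off for round 3; the slides do not state one. */
    static final double MAX_NORMALIZED_DISTANCE = 0.5;

    /** Runs rounds 1 to 3 in order; each round only sees leaves still unmatched. */
    static Map<LeafStatement, LeafStatement> match(List<LeafStatement> method1,
                                                   List<LeafStatement> method2) {
        Map<LeafStatement, LeafStatement> mappings = new LinkedHashMap<>();
        List<LeafStatement> left = new ArrayList<>(method1);
        List<LeafStatement> right = new ArrayList<>(method2);
        for (int round = 1; round <= 3; round++) {
            for (LeafStatement l : new ArrayList<>(left)) {
                LeafStatement best = null;
                int bestDistance = Integer.MAX_VALUE;
                for (LeafStatement r : right) {
                    if (!eligible(round, l, r)) continue;
                    int d = editDistance(l.text, r.text);
                    if (d < bestDistance) { // simplified candidate sorting: closest text wins
                        bestDistance = d;
                        best = r;
                    }
                }
                if (best != null) {
                    mappings.put(l, best);
                    left.remove(l);
                    right.remove(best);
                }
            }
        }
        return mappings;
    }

    static boolean eligible(int round, LeafStatement l, LeafStatement r) {
        boolean identical = l.text.equals(r.text);
        if (round == 1) return identical && l.depth == r.depth;
        if (round == 2) return identical;
        if (identical) return false;
        int longest = Math.max(l.text.length(), r.text.length());
        return longest > 0
            && (double) editDistance(l.text, r.text) / longest < MAX_NORMALIZED_DISTANCE;
    }

    /** Classic Levenshtein distance with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

Running the strict rounds first means an identical statement at the same depth is never taken by a merely similar one, which is the point of ordering the rounds from strongest to weakest evidence.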
  5. Improvements over RefactoringMiner 2.0

    • Sorting criteria for leaf statement mappings
    • Sorting criteria for composite statement mappings
    • Multi-mapping support for duplicated code moved out of or moved into conditionals
    • Statement mapping scope based on call sites
  6. tachyon/worker/block/allocator/MaxFreeAllocator.java (before)

        @Override
        public TempBlockMeta allocateBlock(long userId, long blockId, long blockSize,
            BlockStoreLocation location) throws IOException {
          StorageDir candidateDir = null;
          long maxFreeBytes = blockSize;
          if (location.equals(BlockStoreLocation.anyTier())) {
            for (StorageTier tier : mMetaManager.getTiers()) {
              for (StorageDir dir : tier.getStorageDirs()) {
                if (dir.getAvailableBytes() >= maxFreeBytes) {
                  maxFreeBytes = dir.getAvailableBytes();
                  candidateDir = dir;
                }
              }
            }
          } else if (location.equals(BlockStoreLocation.anyDirInTier(location.tierAli…
            StorageTier tier = mMetaManager.getTier(location.tierAlias());
            for (StorageDir dir : tier.getStorageDirs()) {
              if (dir.getAvailableBytes() >= maxFreeBytes) {
                maxFreeBytes = dir.getAvailableBytes();
                candidateDir = dir;
              }
            }
          }
          return candidateDir != null
              ? new TempBlockMeta(userId, blockId, blockSize, candidateDir) : null;
        }

    tachyon/worker/block/allocator/MaxFreeAllocator.java (after)

        @Override
        public TempBlockMeta allocateBlock(long userId, long blockId, long blockSize,
            BlockStoreLocation location) throws IOException {
          StorageDir candidateDir = null;
          if (location.equals(BlockStoreLocation.anyTier())) {
            for (StorageTier tier : mMetaManager.getTiers()) {
              candidateDir = getCandidateDirInTier(tier, blockSize);
              if (candidateDir != null) {
                return new TempBlockMeta(userId, blockId, blockSize, candidateDir);
              }
            }
          } else if (location.equals(BlockStoreLocation.anyDirInTier(location.tierAli…
            StorageTier tier = mMetaManager.getTier(location.tierAlias());
            candidateDir = getCandidateDirInTier(tier, blockSize);
          }
          return candidateDir != null
              ? new TempBlockMeta(userId, blockId, blockSize, candidateDir) : null;
        }

        private StorageDir getCandidateDirInTier(StorageTier tier, long blockSize) {
          StorageDir candidateDir = null;
          long maxFreeBytes = blockSize - 1;
          for (StorageDir dir : tier.getStorageDirs()) {
            if (dir.getAvailableBytes() > maxFreeBytes) {
              maxFreeBytes = dir.getAvailableBytes();
              candidateDir = dir;
            }
          }
          return candidateDir;
        }

    Slide annotations: Call to the extracted method • Moved to the extracted method • Parent block mapping
  7. GumTree minHeight hyperparameter

    • minHeight: minimum height a subtree must have to be matched, where height is the length of the longest path from one leaf to the root of the subtree
    • GumTree default threshold = 2
    • Why? It avoids matching remaining leaf expressions with height 1 (e.g., SimpleName nodes) that coincidentally have the same value.
    • Since we give as input a pair of matched statements, we configure minHeight = 1
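To make the effect of the threshold concrete, here is a small sketch of how a height filter changes which subtrees are eligible for matching. It is an illustration under assumptions, not GumTree code: `HeightNode` and `MinHeightFilter` are hypothetical names, and a leaf is counted as height 1, following the slide's statement that SimpleName leaves have height 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical AST node that only tracks its label and children. */
final class HeightNode {
    final String label;
    final List<HeightNode> children;

    HeightNode(String label, HeightNode... kids) {
        this.label = label;
        this.children = Arrays.asList(kids);
    }

    /** A leaf has height 1; an inner node is one higher than its tallest child. */
    int height() {
        int tallest = 0;
        for (HeightNode c : children) tallest = Math.max(tallest, c.height());
        return tallest + 1;
    }
}

final class MinHeightFilter {
    /** Collects every subtree of root whose height is at least minHeight. */
    static List<HeightNode> candidates(HeightNode root, int minHeight) {
        List<HeightNode> out = new ArrayList<>();
        collect(root, minHeight, out);
        return out;
    }

    private static void collect(HeightNode n, int minHeight, List<HeightNode> out) {
        if (n.height() >= minHeight) out.add(n);
        for (HeightNode c : n.children) collect(c, minHeight, out);
    }
}
```

For a method invocation with two SimpleName children, a threshold of 2 leaves only the invocation itself as a candidate, while a threshold of 1 also admits both leaves. Admitting leaves is reasonable once the surrounding statements are already known to correspond, which is the situation the last bullet describes.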
  8. [Slide figure: the same pair of statements diffed by three tools (RefactoringMiner, GT 3.0 simple, GT 3.0 greedy); the diff highlighting is not preserved in this transcript]

        KubernetesListBuilder builder = new KubernetesListBuilder()
            .withLivenessProbe(getLivenessProbe())
            .endContainer()
            .withVolumes(getVolumes())
            .endSpec()
            .endTemplate()
            .endSpec()
            .endReplicationControllerItem();

        KubernetesListBuilder builder = new KubernetesListBuilder()
            .withLivenessProbe(getLivenessProbe())
            .withReadinessProbe(getReadinessProbe())
            .endContainer()
            .withVolumes(getVolumes())
            .endSpec()
            .endTemplate()
            .endSpec()
            .endReplicationControllerItem();
  9. AST Diff benchmark

    • Process:
        1. Run all ASTDiff tools
        2. Manually validate the diffs
        3. Construct the “perfect” diff
    • Datasets:
        • 800 bug-fixing commits from Defects4J
        • 187 refactoring commits from Refactoring Oracle
  10. Novel AST Diff Quality Metrics

    • Statement (program element) mapping accuracy:
        • True Positive: a mapping given by a tool that exists in the benchmark
        • False Positive: a mapping given by a tool that does not exist in the benchmark
        • False Negative: a mapping that exists in the benchmark, but was not reported by a tool
    • Semantically incompatible mappings:
        • M = (m1, m2) returned by a tool
        • m1 and m2 have the same AST type
        • The parents of m1 and m2 have a different AST type
        • M is not included in the benchmark
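The definitions above translate directly into set operations over mappings. The sketch below shows one way to compute them, under assumptions: `Element`, `ElementMapping`, and `DiffMetrics` are hypothetical names, a mapping is identified by the ids of its two sides, and precision and recall use the standard formulas TP / (TP + FP) and TP / (TP + FN).

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical description of one side of a mapping. */
final class Element {
    final String id;         // unique identifier of the AST node within its file revision
    final String type;       // AST type of the node
    final String parentType; // AST type of the node's parent

    Element(String id, String type, String parentType) {
        this.id = id;
        this.type = type;
        this.parentType = parentType;
    }
}

/** A mapping M = (m1, m2); two mappings are equal when both sides have the same ids. */
final class ElementMapping {
    final Element m1;
    final Element m2;

    ElementMapping(Element m1, Element m2) {
        this.m1 = m1;
        this.m2 = m2;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ElementMapping)) return false;
        ElementMapping other = (ElementMapping) o;
        return m1.id.equals(other.m1.id) && m2.id.equals(other.m2.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(m1.id, m2.id);
    }
}

final class DiffMetrics {
    /** Mappings reported by the tool that exist in the benchmark. */
    static int truePositives(Set<ElementMapping> tool, Set<ElementMapping> benchmark) {
        Set<ElementMapping> hit = new HashSet<>(tool);
        hit.retainAll(benchmark);
        return hit.size();
    }

    /** Mappings reported by the tool that do not exist in the benchmark. */
    static int falsePositives(Set<ElementMapping> tool, Set<ElementMapping> benchmark) {
        return tool.size() - truePositives(tool, benchmark);
    }

    /** Benchmark mappings the tool did not report. */
    static int falseNegatives(Set<ElementMapping> tool, Set<ElementMapping> benchmark) {
        return benchmark.size() - truePositives(tool, benchmark);
    }

    static double precision(Set<ElementMapping> tool, Set<ElementMapping> benchmark) {
        return tool.isEmpty() ? 0.0 : (double) truePositives(tool, benchmark) / tool.size();
    }

    static double recall(Set<ElementMapping> tool, Set<ElementMapping> benchmark) {
        return benchmark.isEmpty() ? 0.0 : (double) truePositives(tool, benchmark) / benchmark.size();
    }

    /** Same AST type on both sides, different parent types, and absent from the benchmark. */
    static boolean semanticallyIncompatible(ElementMapping m, Set<ElementMapping> benchmark) {
        return m.m1.type.equals(m.m2.type)
            && !m.m1.parentType.equals(m.m2.parentType)
            && !benchmark.contains(m);
    }
}
```

The benchmark check in `semanticallyIncompatible` matters: a mapping between nodes with different parent types is only counted against a tool when the manually validated diff does not contain it.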
  11. Execution time – Defects4J

    • 3 times slower than GTS on median
    • 4 times slower than GTS on average
  12. Execution time – Refactoring oracle

    • 5 times slower than GTS on median
    • 8 times slower than GTS on average
    • None of the other tools supports inter-file mappings
  13. Conclusions

    • RefactoringMiner has the best precision and recall in both benchmarks.
    • The accuracy improvements are more evident in the Refactoring benchmark.
    • GumTree 3.0 (simple) has better precision and recall than GumTree 3.0 (greedy) when considering sub-expression mappings.
    • RefactoringMiner and IJM excel in matching program elements (i.e., method and field declarations) accurately.
    • GumTree (greedy) and MTDiff generate the largest numbers of semantically incompatible mappings.
    • RefactoringMiner’s execution time is of the same order of magnitude as that of the faster tools.