Jo Atlee
September 09, 2025

Analyzing Large Software Bases by Means of an Extracted Model of the Code

Faced with the goal of performing a system-wide analysis on large heterogeneous systems without the benefit of a system-wide model, we sought instead to derive models from code. The result is a suite of tools for (1) extracting from code, and other software artifacts, a lightweight graphical model of the software that is sufficiently detailed to support analyses of control flows, data flows, and software dependencies; (2) expressing diverse analyses of interest; (3) analyzing relatively large software models; and (4) visualizing the analysis results. In this talk, we present the tools as well as our experiences in applying them to open-source software systems and to automotive software components and product lines of components.


Transcript

  1. Analyzing Large Software Bases by Means of an Extracted Model of the Code

    Joanne Atlee
    MODELS 2024, Linz
    September 25, 2024
  2. GOAL: system-wide interaction analysis of code

    • No system-wide model
    • Heterogeneous components
      - legacy, generated, third-party
      - distributed ECUs
      - bus-based communications
    • 100 million lines of code (roughly)
    • High variability (SPL)
    (http://www.flexautomotive.net/EMCFLEXBLOG/post/2015/09/08/can-bus-for-controller-area-network)
  3. modelling and analysis of heterogeneous programs

    [Diagram: fact-extraction pipeline]
    • software artifacts: Source Code, Models, Build scripts / Config files, Object Code
    • extractors: Fact Extractors turn the artifacts into Facts
    • lightweight model: Fact Bases
    • Linker: adds linkage facts to form the Linked Fact Base (Neo4j graph database)
    • software querying: queries over the Linked Fact Base

    Bryan J. Muscedere, et al. “Detecting Feature-Interaction Symptoms in Automotive Software using Lightweight Analysis,” SANER’19
  4. not the only ones working on this

    Tool | Software artifacts | Software size | Factbase schema (element types)
    Wiggle [1] | Java code | 500,000 LOC | 46 entity, 91 relationship
    SCoRE [2,3] | PLC programs | 742,000 LOC | 24 entity, 15 relationship
    eKNOWS [4] | Java code, build, config files, MAVEN POM files, XML | 44 MLOC | 83 entity, 88 relationship
    Frappé [5] | C/C++ code | 11.4 MLOC | 21 entity, 30 relationship
    ProgQuery [6] | Java code | 55,700 LOC | 76 entity, 147 relationship

    Analyses (listed in slide order across the five tools):
    • Program comprehension • Source-code queries • Design compliance • Dependency analyses • Metrics • Locate declarations • Method’s call graph • Dependency analyses • Program comprehension • Source-code queries • Bad smells • Coding practices • Security vulnerabilities

    This work:
    • Software artifacts: heterogeneous languages, variability (SPL)
    • Software size: 1.5 MLOC (goal of 100 MLOC)
    • Analyses: interaction analyses

    Rudolf Ramler, Software Competence Center Hagenberg; Rainer Weinreich, Johannes Kepler University; Heinz Huber, Raiffeisen Software GmbH

    [1] Urma, R-G, Mycroft, A. “Source-code queries with graph databases—with application to programming language usage and evolution,” Science of Computer Programming, 2015.
    [2] Ramler, R. et al. “Benefits and Drawbacks of Representing and Analyzing Source Code and Software Engineering Artifacts with Graph Databases,” SWQD, 2019.
    [3] Prähofer, H. et al. “Static Code Analysis of IEC 61131-3 Programs: Comprehensive Tool Support and Experiences from Large-Scale Industrial Application,” IEEE Trans. Ind. Informatics, 2017.
    [4] Buchgeher, G. “A Platform for the Automated Provisioning of Architecture Information for Large-Scale Service-Oriented Software Systems,” ECSA, 2018.
    [5] Hawes, N., Barham, B., Cifuentes, C. “Frappé: Querying the Linux kernel dependency graph,” GRADES, 2015.
    [6] O. Rodriguez-Prieto, et al. “An Efficient and Scalable Platform for Java Source Code Analysis Using Overlaid Graph Representations,” IEEE Access, 2020.
  5. facts (entities)

    Example function:

        int example(int a) {
          int y = 1;
          int z = 0;
          if (a > 1) {
            y = a;
          } else {
            z = y;
          }
          return z;
        }

    Entities extracted:
    • function: example
    • variable: a, y, z
    • function return: example_ret
  6. facts (relations)

    Same example function and entities (example, a, y, z, example_ret) as on the previous slide, now connected by relationship edges:
    • contain
    • varWrite
  7. facts (attributes)

    Same example function and entities as on the previous slides, now with attributes on the variables:
    • a: type=int, inDecisionCond=true
    • y: type=int, inDecisionCond=false
    • z: type=int, inDecisionCond=false
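The entity, relationship, and attribute facts on these slides can be pictured as a small property graph. What follows is a minimal sketch and not the extractor's actual output format: the endpoints of the `contain` and `varWrite` edges are assumptions inferred from the example code, and the helper names `successors` and `influenced_by` are hypothetical.

```python
# Entities: name -> kind (kinds taken from the slide legend).
entities = {
    "example": "function",
    "a": "variable",
    "y": "variable",
    "z": "variable",
    "example_ret": "function return",
}

# Relationships as (kind, source, target) triples.
# ASSUMPTION: edge endpoints are inferred from the example code, not read off the slide.
relations = [
    ("contain", "example", "a"),
    ("contain", "example", "y"),
    ("contain", "example", "z"),
    ("contain", "example", "example_ret"),
    ("varWrite", "a", "y"),            # y = a;
    ("varWrite", "y", "z"),            # z = y;
    ("varWrite", "z", "example_ret"),  # return z;
]

# Attributes: only `a` appears in the decision condition `a > 1`.
attributes = {
    "a": {"type": "int", "inDecisionCond": True},
    "y": {"type": "int", "inDecisionCond": False},
    "z": {"type": "int", "inDecisionCond": False},
}


def successors(kind, source):
    """Targets of all edges of the given kind that leave `source`."""
    return sorted(t for k, s, t in relations if k == kind and s == source)


def influenced_by(var):
    """Everything reachable from `var` over varWrite edges (transitive dataflow)."""
    seen, stack = set(), [var]
    while stack:
        for nxt in successors("varWrite", stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


print(influenced_by("a"))
```

Under these assumed edges the closure reports that `a` can reach `z` and `example_ret`, even though `y = a` and `z = y` sit in different branches. A purely edge-based closure is flow-insensitive in this way, which is one reason a lightweight model can over-report.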
  8. facts (abstraction)

    Same example function and entities as on the previous slides, now with basic-block nodes: bb:entry, bb:0, bb:1, bb:2, bb:3, bb:4
    • BasicBlock nodes: represent each basic block in the program
    • nextBasicBlock edges: link successor basic blocks together
    • Other edges: associate program facts with their respective basic blocks
  9. facts (projection)

    Entities (~15)
    • component
    • class
    • variable
    • function
    • function return
    • basicBlock
    • ROS publisher
    • ROS subscriber
    • ROS topic

    Relationships (~30)
    • function call from f1 to f2
    • varWrite is a variable assignment from v1 to v2
    • parWrite is a parameter assignment from v1 to p1
    • retWrite is a function return from v1 to the returned value
    • subtype relates a subtype t1 to supertype t2
    • contain relates a container entity to a contained entity
    • nextBasicBlock relates possible consecutive basicBlocks
    • varWriteSource maps the source of a varWrite to its basicBlock
    • …

    Attributes (~10)
    • entity name
    • variable type
    • variable used inDecisionCond
    • function is a callback
  10. facts from other sources (e.g., XML/XMI)

    Fact types per model kind; each list runs from entities through relationships to attributes, in slide order:
    • State-machines: machine, state, region, sub-state, transition, sibling region, contains, name, trigger, guard, action
    • Class diagram: class, class attribute, operation, subclass, contains, association, aggregation, name, abstract, visibility, multiplicity
    • Simulink Block diagram: block, port, sub-block, contains, signal, name, type, signal label
    • Feature model: feature, child feature, mutual exclusion, mutual inclusion, disjunction, conjunction, implication, name, abstract, mandatory

    Robert Hackman et al. “mel- Model Extractor Language for Extracting Facts from Models,” MODELS’20
  11. linking facts

    [Diagram: the same fact-extraction pipeline as on slide 3]
    • software artifacts: Source Code, Models, Build scripts / Config files, Object Code
    • extractors: Fact Extractors turn the artifacts into Facts
    • lightweight model: Fact Bases
    • Linker: adds linkage facts to form the Linked Fact Base (Neo4j graph database)
    • software querying: queries over the Linked Fact Base
  12. neo4j query language

    • Cypher – declarative graph query language
    • APOC – awesome procedures on Cypher (library)
  13. neo4j query language

    Example: where was this variable last changed? (inputs in red text on the slide)

        // find reverse execution paths to previous basic blocks
        MATCH cfgPath = (writingBB:BasicBlock)-[:nextBasicBlock*]->(bbInput:BasicBlock {id: "BBID"})
        // match previous basic blocks that include assignment to V
        MATCH (writingBB)<-[:varWriteDestination | parWriteDestination | retWriteDestination]-(target:Variable {id: "V"})
        // exclude any paths with intermediate blocks that overwrite value of V
        WHERE none(node IN tail(nodes(cfgPath))
                   WHERE EXISTS ((node)<-[:varWriteDestination | parWriteDestination | retWriteDestination]-(target)))
        // return the matching previous basic blocks
        RETURN writingBB
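The logic of this query can also be read as a backward search over the control-flow graph. The sketch below is an in-memory approximation, not the Cypher engine's evaluation strategy: the function `last_writers`, the dictionaries `next_bb` and `writes`, and the block names are all hypothetical. Like the query, it treats the input block as part of the path tail, so it returns nothing when the input block itself overwrites the variable.

```python
from collections import deque


def last_writers(next_bb, writes, bb_input, var):
    """Basic blocks that may hold the most recent write to `var` before `bb_input`.

    next_bb: block -> list of successor blocks (nextBasicBlock edges)
    writes:  block -> set of variables written in that block
    """
    # Invert the nextBasicBlock edges so the CFG can be walked backwards.
    preds = {}
    for src, dsts in next_bb.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)

    def writes_var(bb):
        return var in writes.get(bb, set())

    # Mirrors the query: no block after the writer, including the input
    # block, may overwrite the variable.
    if writes_var(bb_input):
        return set()

    result, seen, queue = set(), {bb_input}, deque([bb_input])
    while queue:
        bb = queue.popleft()
        for pred in preds.get(bb, ()):
            if writes_var(pred):
                result.add(pred)  # nearest write on this reverse path; stop here
            elif pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return result


# Hypothetical CFG for the small example() function; block names are illustrative.
next_bb = {"init": ["then", "else"], "then": ["ret"], "else": ["ret"], "ret": []}
writes = {"init": {"y", "z"}, "then": {"y"}, "else": {"z"}}

print(last_writers(next_bb, writes, "ret", "z"))
```

For `z` at the `ret` block, the search reports both `else` (which assigns `z = y`) and `init` (reached through `then`, which leaves `z` untouched).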
  14. queries of interest

    simple queries (about related information)
    • What are the arguments to this function?
    • What are the parts of this type?
    • What is the file name?

    search-based queries
    • What are the constant variables?
    • Where are instances of this class instantiated?
    • Where is this variable defined?
    • Where is this method being used?
  15. queries of interest

    type hierarchy queries
    • How are these types/objects related?
    • Where is this method defined in the type hierarchy?
    • Does this type have any siblings in the type hierarchy?

    control flow queries
    • How does control progress from here to here?
    • How often does this method get called?
    • What are the unused methods?
    • What gets called when this method is called?
    • Who can call this method?
  16. queries of interest

    information flow queries
    • Where was this variable most recently changed?
    • What code directly or indirectly uses this variable?
    • What code directly or indirectly influences the value of this variable?
    • Where can this global variable be changed?

    interaction queries
    • Behaviour interaction (variable assignment impacts remote control-flow decision)
    • Communication loops (possible delays, nontermination)
    • Race conditions among input messages (possible nondeterminism)
    • Multiple inputs (possible livelock)
  17. subject system (Waterloo automotive software)

    Code Statistics | Autonomoose
    LOC | 18,739
    # of Components | 14
    # Entities (nodes) | 10,369
    # Relationships (edges) | 51,712
  18. subject analyses

    • Inter-component communications – ROS communication paths among components
    • Communication loops – cycles of ROS communication paths (where a component communicates directly/indirectly with itself)
    • Race condition – multiple components can communicate with the same target component, potentially leading to competing assignments to the same variable
    • Behaviour interaction – a variable assignment in one component influences (via dataflow) another component to call a function
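Two of these analyses reduce to simple graph questions once the inter-component communication paths are known. The following is a rough sketch over a hypothetical component graph: the component names, the `graph` dictionary, and the helpers `components_in_loops` and `race_candidates` are illustrative assumptions, and the race check only finds targets with several senders, without the further check for competing assignments to the same variable.

```python
def reachable(graph, start):
    """Components reachable from `start` in one or more communication steps."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def components_in_loops(graph):
    """Components that communicate, directly or indirectly, with themselves."""
    return {c for c in graph if c in reachable(graph, c)}


def race_candidates(graph):
    """Target components that receive messages from more than one sender."""
    senders = {}
    for src, dsts in graph.items():
        for dst in dsts:
            senders.setdefault(dst, set()).add(src)
    return {target: s for target, s in senders.items() if len(s) > 1}


# Hypothetical publisher -> subscriber communication paths between components.
graph = {
    "planner": ["controller"],
    "controller": ["actuator"],
    "actuator": ["planner"],
    "logger": ["controller"],
}

print(components_in_loops(graph))
print(race_candidates(graph))
```

Here `planner`, `controller`, and `actuator` form a communication loop, and `controller` is a race candidate because both `planner` and `logger` send to it.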
  19. precision evaluation (Waterloo automotive software)

    Analysis | #Reported | #Valid
    Inter-component communications | 4 | 2 (50%)
    Communication loops | 15 | 13 (87.0%)
    Race conditions | 181 | 103 (57.0%)
    Behaviour interactions | 364 | 192 (52.7%)
    TOTAL | 564 | 310 (55.0%)
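The precision figures are simply valid results divided by reported results. As a quick check, the sketch below recomputes them from the counts on the slide; the variable and function names are illustrative. Recomputing gives 86.7% for communication loops and 56.9% for race conditions, so the slide's 87.0% and 57.0% appear to be rounded to the nearest whole percent.

```python
# (reported, valid) counts per analysis, copied from the slide.
results = {
    "Inter-component communications": (4, 2),
    "Communication loops": (15, 13),
    "Race conditions": (181, 103),
    "Behaviour interactions": (364, 192),
}


def precision(reported, valid):
    """Share of reported results that were judged valid, in percent."""
    return 100.0 * valid / reported


total_reported = sum(reported for reported, _ in results.values())
total_valid = sum(valid for _, valid in results.values())

for name, (reported, valid) in results.items():
    print(f"{name}: {valid}/{reported} = {precision(reported, valid):.1f}%")
print(f"TOTAL: {total_valid}/{total_reported} = "
      f"{precision(total_reported, total_valid):.1f}%")
```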
  20. sources of imprecision

    • Invalid order of related assignments in different basic blocks of a called function
    • Function return associated with all possible calling contexts
    • Two functions/methods called in the same basic block change the same parameter, object, or global variable
    • Variable assignment and passed parameter involving the same variable in the same basic block
  21. variable (or feature-oriented) software

    Example features: Speed Limit Control, Dynamics Control, Intelligent Brake Assist, Lane Change Control, Stability Control, Hill Hold, Anti-Lock Braking, Air Quality System, Enhanced Traction System, Lane Centering Control, Cruise Control, Road Change Alert, Trailer Stability Assist, Forward Collision Avoidance

    • feature – unit of added value
    • software product line – a family of related products managed by integrating a collection of mandatory and optional features
  22. variable programs

        bool Weighted; // Feature variable
        bool Directed; // Feature variable
        …
        void GraphApp::BFS(int node) {
          visited[node] = true;
          list<int> queue;
          queue.push_back(node);
          int curNode;

          while(!queue.empty()) {
            curNode = queue.front();
            queue.pop_front();

            if ( Weighted ) {
              for (Edge* edge : edges[curNode]) {
                int endNode = edge->getEndNode();

                if ( Directed ) {
                  if (visited[endNode] == false) {
                    visited[endNode] = true;
                    queue.push_back(endNode);
                  }
                …
                if ( !Directed ) {
                  int startNode = edge->getStartNode();
                  if (startNode != node && visited[startNode] == false) {
                    visited[startNode] = true;
                    queue.push_back(startNode);
                  } else if (endNode != node && visited[endNode] == false) {
                    visited[endNode] = true;
                    queue.push_back(endNode);
                …

    The slide shows this listing twice with different highlighting:
    • red code is program variant Weighted && Directed
    • blue code is program variant Weighted && !Directed
  23. variable programs

        bool Weighted; // Feature variable
        bool Directed; // Feature variable
        …
        void GraphApp::BFS(int node) {
          visited[node] = true;
          list<int> queue;
          queue.push_back(node);
          int curNode;

          while(!queue.empty()) {
            curNode = queue.front();
            queue.pop_front();

            if ( Weighted ) {
              for (Edge* edge : edges[curNode]) {                                  // Weighted
                int endNode = edge->getEndNode();                                  // Weighted

                if ( Directed ) {                                                  // Weighted
                  if (visited[endNode] == false) {                                 // Weighted & Directed
                    visited[endNode] = true;                                       // Weighted & Directed
                    queue.push_back(endNode);                                      // Weighted & Directed
                  }
                …
                if ( !Directed ) {                                                 // Weighted
                  int startNode = edge->getStartNode();                            // Weighted & !Directed
                  if (startNode != node && visited[startNode] == false) {          // Weighted & !Directed
                    visited[startNode] = true;                                     // Weighted & !Directed
                    queue.push_back(startNode);                                    // Weighted & !Directed
                  } else if (endNode != node && visited[endNode] == false) {       // Weighted & !Directed
                    visited[endNode] = true;                                       // Weighted & !Directed
                    queue.push_back(endNode);                                      // Weighted & !Directed
                …

    Annotate feature-specific code with a Presence Condition (PC):
    • feature expression (propositional formula over features) that denotes the set of product variants associated with the code fragment

    Variability-aware analysis:
    • analyzes a variable program as a single artifact
    • associates analysis results with their respective program variants
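A presence condition denotes a set of product variants, and for a handful of features that set can be enumerated directly. The sketch below is illustrative only: it models presence conditions as Python predicates over a configuration dictionary, which is an assumption made for brevity (a real tool would typically use a propositional formula representation), and the names `FEATURES`, `pcs`, and `variants` are hypothetical.

```python
from itertools import product

FEATURES = ("Weighted", "Directed")

# Presence conditions for three annotated lines of the BFS listing,
# written as predicates over a configuration (feature name -> bool).
pcs = {
    "for (Edge* edge : edges[curNode])": lambda c: c["Weighted"],
    "if (visited[endNode] == false)": lambda c: c["Weighted"] and c["Directed"],
    "int startNode = edge->getStartNode();": lambda c: c["Weighted"] and not c["Directed"],
}


def variants(pc):
    """All product variants (tuples of enabled features) that satisfy `pc`."""
    result = set()
    for values in product([False, True], repeat=len(FEATURES)):
        config = dict(zip(FEATURES, values))
        if pc(config):
            result.add(tuple(f for f in FEATURES if config[f]))
    return result


for line, pc in pcs.items():
    print(f"{line!r} is present in variants {sorted(variants(pc))}")
```

Enumeration is exponential in the number of features, so it only serves to make the meaning of a presence condition concrete; it is not a scalable way to reason about product lines with hundreds of features.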
  24. annotated factbase

    Same BFS example as on the previous slide, shown next to its extracted factbase:
    • function: BFS
    • variables: endNode, edge, startNode, queue, node, curNode
    • contains edges from BFS to its variables
    • varWrite edges among the variables
    • edges annotated with presence conditions: Weighted; Weighted && !Directed (one label is truncated on the slide as "Weighted && !Direc…")
  25. annotated analysis results

    [Graph: variables endNode, edge, startNode, queue, node, curNode connected by varWrite edges, with presence-condition annotations such as Weighted and Weighted && !Directed]

    Example: dataflow

        MATCH (n)-[t:varWrite*]->(m) RETURN *

    Xiang Chen et al., "Variability-aware Neo4j for Analyzing a Graphical Model of a Software Product Line," MODELS 2023.
  26. visualizing variant-specific results (using filters)

    [Graph: variables endNode, edge, startNode, queue, node, curNode connected by the varWrite edges that remain after filtering, with presence-condition annotations such as Weighted and Weighted && !Directed]

    Example: dataflow

        MATCH (n)-[t:varWrite*]->(m) RETURN *

    Filters: Weighted && Directed

    Ramy Shahin et al., “Applying Declarative Analysis to Industrial Automotive Software Product Line Models,” in EMSE, 2023.
    Rafael Toledo et al., “Visualizing Analysis Results for SPL Models - A User Study,” in VISSOFT 2024.
  27. visualizing variant-specific results (using filters)

    [Graph: variables endNode, edge, startNode, queue, node, curNode connected by the varWrite edges that remain after filtering, with presence-condition annotations such as Weighted and Weighted && !Directed]

    Example: dataflow

        MATCH (n)-[t:varWrite*]->(m) RETURN *

    Filters: Weighted && Directed; Weighted && !Directed; Weighted

    Ramy Shahin et al., “Applying Declarative Analysis to Industrial Automotive Software Product Line Models,” in EMSE, 2023.
    Rafael Toledo et al., “Visualizing Analysis Results for SPL Models - A User Study,” in VISSOFT 2024.
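One way to picture annotated results and filters: a dataflow path exists only in variants that satisfy the presence conditions of all its edges, and a filter keeps the results whose condition holds for a chosen variant. The sketch below is a simplified illustration, not the variability-aware Neo4j of Chen et al.: presence conditions are restricted to conjunctions of feature literals, the edge list is an assumed subset inferred from the BFS code, and all function names are hypothetical.

```python
def conj(pc1, pc2):
    """Conjoin two presence conditions (frozensets of (feature, value) literals).

    Returns None when the conjunction is unsatisfiable, i.e. when some
    feature would have to be both enabled and disabled.
    """
    merged = pc1 | pc2
    features = [feature for feature, _ in merged]
    return None if len(features) != len(set(features)) else merged


def lifted_dataflow(edges, start):
    """Map each node reachable from `start` to the presence conditions of its paths."""
    results, seen, stack = {}, set(), [(start, frozenset())]
    while stack:
        node, pc = stack.pop()
        for src, dst, edge_pc in edges:
            if src != node:
                continue
            new_pc = conj(pc, edge_pc)
            if new_pc is None or (dst, new_pc) in seen:
                continue  # infeasible in every variant, or already explored
            seen.add((dst, new_pc))
            results.setdefault(dst, set()).add(new_pc)
            stack.append((dst, new_pc))
    return results


def holds(pc, config):
    """True when every literal of the presence condition matches the variant."""
    return all(config[feature] == value for feature, value in pc)


def filter_results(results, config):
    """Keep only nodes reached by at least one path that exists in the variant."""
    return {n for n, pcs in results.items() if any(holds(pc, config) for pc in pcs)}


W = frozenset({("Weighted", True)})
W_D = W | {("Directed", True)}
W_ND = W | {("Directed", False)}

# ASSUMPTION: a subset of varWrite edges inferred from the BFS listing.
edges = [
    ("edge", "endNode", W),
    ("edge", "startNode", W_ND),
    ("startNode", "queue", W_ND),
    ("endNode", "queue", W_D),
]

results = lifted_dataflow(edges, "edge")
print(filter_results(results, {"Weighted": True, "Directed": True}))
```

With the filter Weighted && Directed, `startNode` drops out of the results because every path to it requires `!Directed`.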
  28. scalability strategies

    • Return endpoints of path results (cf. full paths)
    • Return shortest path results (cf. all paths), except that function returns need to return to all possible calling contexts
    • Stage queries over sub-factbases
    • Triage results
  29. subject systems (open source subject systems)

    Code Statistics | axTLS [1] | ToyBox [2] | BusyBox [3] | BerkeleyDB [4] | Subversion [5]
    LOC | 23,832 | 56,770 | 253,942 | 258,777 | 1,195,953
    # Components | 5 | 7 | 25 | 25 | 31
    # Features | 40 | 225 | 847 | 264 | 652
    # Facts | 12,972 | 47,849 | 146,441 | 129,009 | 449,189
    # Variable Facts | 7,308 | 12,025 | 37,027 | 20,442 | 9,252
    % Variable Facts | 56% | 25% | 25% | 16% | 2%

    [1] https://axtls.sourceforge.net
    [2] http://landley.net/toybox/
    [3] https://github.com/brgl/busybox
    [4] https://github.com/berkeleydb/libdb
    [5] https://subversion.apache.org/source-code.html
  30. performance evaluation (open source subject systems)

    Each cell gives # results / ms / overhead [1].

    Analysis | axTLS | ToyBox | BusyBox | BerkeleyDB | Subversion
    Inter-component communications | 518 / 708 / 161% | 39 / 52 / 63% | 405 / 201 / 62% | 1,151 / 106 / 18% | 37,321 / 1,895 / 22%
    Communication loops | 0 / 374 / 144% | 0 / 105 / 102% | 264 / 170 / 27% | 0 / 102 / 38% | 30,860 / 1,944 / 38%
    Multiple callers | 246 / 96 / 81% | 11,302 / 412 / 84% | 128,458 / 2,657 / 12% | 1,058 / 202 / 30% | 2,893,290 / 57,577 / 2%
    Race conditions | 0 / 48 / 45% | 2 / 118 / 90% | 0 / 363 / 12% | 0 / 151 / 13% | 336 / 1,016 / 7%
    Behaviour interactions | 4,602 / 1,596 / 456% | 0 / 99 / 39% | 161 / 391 / 69% | 0 / 184 / 31% | 33,886 / 2,293 / 9%
    Call graph | 45,921 / 6,810 / 423% | 3,138 / 70 / 25% | 11,556 / 589 / 34% | 90,784 / 3,197 / 11% | 472,923 / 21,659 / 27%
    Recursion | 9 / 132 / 116% | 0 / 37 / 42% | 212 / 790 / 34% | 561 / 16,223 / 64% | 86 / 9,017 / 42%

    [1] Overhead measures the performance overhead of a lifted analysis (on the variable program) vs an analysis of the largest product variant (a single product including all features)
  31. subject systems (automotive controllers)

    Code Statistics | SPL-A | SPL-B | SPL-C | SPL-D | SPL-E | SPL-F | SPL-G
    LOC | 730,947 | 1,016,063 | 750,000 | 979,466 | 752,669 | 1,088,811 | 1,639,822
    # C Files | 5133 | 6826 | 4300 | 6943 | 4981 | 6464 | 8458
    # Features | ∼400 | ∼500 | ∼900 | ∼600 | ∼600 | ∼500 | ∼1600
    # Facts | 157,303 | 225,538 | 215,120 | 228,185 | 227,241 | 226,640 | 621,714
    # Variable Facts | 698 | 1070 | 2126 | 1148 | 2078 | 955 | 3944
    % Variable Facts | 0.44 | 0.47 | 0.99 | 0.50 | 0.91 | 0.42 | 0.63
  32. performance evaluation (automotive controllers)

    Each cell gives # results / sec.

    Analysis [1] | SPL-A | SPL-B | SPL-C | SPL-D | SPL-E | SPL-F | SPL-G
    Behaviour interactions | 1,189 / 6.49 | 1,525 / 8.29 | 2,757 / 28.68 | 1,724 / 7.79 | 3,881 / 32.63 | 1,398 / 7.62 | 22,822 / 427.60
    Global variable analysis | 89 / 5,102.92 | 117 / 10,032.1 | 108 / 5,090.34 | 106 / 10,225.2 | 160 / 6,280.87 | 97 / 10,381.3 | 459 / 28,618
    Function recursion | 4 / 2.17 | 4 / 2.74 | 0 / 3.39 | 4 / 2.94 | 6 / 3.78 | 4 / 2.57 | 10 / 23.26
    Component recursion | 0 / 2.29 | 0 / 2.95 | 0 / 3.60 | 0 / 3.01 | 0 / 4.35 | 0 / 2.73 | 0 / 22.78

    [1] These experiments used a lifted Datalog query engine V-Soufflé for the analysis. Analysis results are endpoints of path results.

    Ramy Shahin et al., "Applying Declarative Analysis to Software Product Line Models: An Industrial Study,” MODELS 2021.
    Ramy Shahin et al., “Applying Declarative Analysis to Industrial Automotive Software Product Line Models,” in EMSE, 2023.
  33. staged queries (e.g. behaviour interaction)

    [Diagram: a behaviour-interaction path (varInfFunc) from a variable v to a function f, decomposed into prefix, dataflow*, and suffix subqueries that run between component entry and exit points. Each subquery follows (varWrite | parWrite | retWrite)* edges. Legend: variable, function, entry/exit point, variable/return (v/r). All 3 subqueries run on each component; the system-wide query combines their results.]

    1) Decompose query into component-level subqueries
    2) Run each intracomponent subquery on each component’s factbase
    3) Create summary facts for subquery results
    4) Pose system-wide query in terms of summary facts over full factbase
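The four steps can be sketched on a toy example. This is a deliberately simplified illustration, not the talk's actual decomposition into prefix, dataflow, and suffix subqueries: it uses a single reachability subquery per component, and the component layout, node names, and the functions `reach` and `summarize` are all hypothetical.

```python
def reach(edges, start):
    """Nodes reachable from `start` over the given (source, target) edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def summarize(component):
    """Steps 2-3: run the subquery inside one component and emit summary facts.

    A summary fact (b1, b2) records that boundary node b2 is reachable
    from boundary node b1 without leaving the component.
    """
    boundary = component["boundary"]
    return {
        (b1, b2)
        for b1 in boundary
        for b2 in reach(component["edges"], b1) & boundary
        if b1 != b2
    }


# Hypothetical per-component factbases: internal dataflow edges plus the
# boundary nodes (variables of interest, entry/exit points, called functions).
components = {
    "A": {"edges": [("A.v", "A.t"), ("A.t", "A.out")], "boundary": {"A.v", "A.out"}},
    "B": {"edges": [("B.in", "B.u"), ("B.u", "B.f")], "boundary": {"B.in", "B.f"}},
}
# Hypothetical linkage facts connecting component exits to entries.
links = [("A.out", "B.in")]

# Step 3: summary facts per component (each could be computed independently).
summaries = set().union(*(summarize(c) for c in components.values()))

# Step 4: the system-wide query runs over summaries and links only.
system_edges = sorted(summaries) + links
print("B.f" in reach(system_edges, "A.v"))
```

The system-wide query never touches the intermediate nodes `A.t` and `B.u`, which is where the staging is meant to save work.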
  34. stress test (behaviour-interaction analysis on automotive controller)

    #behaviour interactions: 1,092,919 paths (vs. 22,822 pairs of path endpoints)

    Code Statistics | SPL-G
    LOC | 1,639,822
    # Components | 1353
    # C Files | 8458
    # Features | ∼1600
    # Facts | 621,714
    # Variable Facts | 3944
    % Variable Facts | 0.63
  35. stress test (behaviour-interaction analysis on automotive controller)

    #behaviour interactions: 1,092,919 paths (vs. 22,822 pairs of path endpoints)

    performance: ~12 days (if component analyses are performed sequentially)
    - 77% of this time is spent analyzing one component (of 1353)

    Code Statistics | SPL-G
    LOC | 1,639,822
    # Components | 1353
    # C Files | 8458
    # Features | ∼1600
    # Facts | 621,714
    # Variable Facts | 3944
    % Variable Facts | 0.63
  36. stress test (triage results)

    [Chart of path counts; bar values: 639,108; 290,191; 113,723; 47,219; 2,678]
    • Most paths are short
    • Triage: Longer paths may indicate higher chance of error
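The five bar values add up to the 1,092,919 behaviour-interaction paths reported for the stress test, and the triage heuristic amounts to reviewing the longest paths first. The sketch below checks the sum and shows one possible ordering; the `triage` function and the sample paths are hypothetical, and the transcript does not say which path-length range each bar covers.

```python
# Bar values from the slide, in slide order.
bar_values = [639_108, 290_191, 113_723, 47_219, 2_678]
total_paths = sum(bar_values)


def triage(paths):
    """Order reported paths so that the longest ones are reviewed first."""
    return sorted(paths, key=len, reverse=True)


# Hypothetical reported dataflow paths (sequences of visited entities).
paths = [
    ["v1", "f1"],
    ["v2", "a", "b", "c", "f2"],
    ["v3", "a", "f3"],
]

print(total_paths)
print(triage(paths)[0])
```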
  37. summary and future work

    [Diagram: the same fact-extraction pipeline as on slide 3]
    • software artifacts: Source Code, Models, Build scripts / Config files, Object Code
    • extractors: Fact Extractors turn the artifacts into Facts
    • lightweight model: Fact Bases
    • Linker: adds linkage facts to form the Linked Fact Base (Neo4j graph database)
    • software querying: queries over the Linked Fact Base

    Still have work to do to
    • Improve precision
    • Improve scalability
    • Improve user experience
  38. acknowledgements!

    Gary Feng Miranda Liu Ramy Shahin Mike Godfrey Marsha Chechik Rafael Toledo Fa Fa Ke Echo Chen Max Xiong Rob Hackman