
Software Analytics - Data-Driven Improvement of Software Quality

Here is the complete slide deck of the workshop on Software Analytics.

In this workshop we analyze software systems in a data-driven way, along with the surrounding processes and organization, to uncover weak points in development and operation. As our foundation we use best practices and methodologies from the field of data science.

We carry out our analyses with open-source tools, which you can use yourself after the workshop free of charge. Thanks to the large community behind these tools, plenty of tips and further knowledge are at hand.

Markus Harrer

October 27, 2025

Transcript

  1. Note to the readers These are the slides of my

    workshop on “Software Analytics - Data-driven improvement of software systems” in its latest version from 2022. The very first analyses were created during my Master’s studies in 2013, especially as part of my Master’s thesis “Possible Applications of Automated Analysis of Artifacts and Metadata from Software Projects to Support the Maintainability Optimization of Long-Lived Software Systems” [1]. Thanks to my former employer, INNOQ, and many customers with very interesting legacy systems and challenges, I was able to enhance these techniques to analyze large-scale codebases in a data-driven way. With the release of this workshop under a Creative Commons license, I want to help you improve your own software systems in a data-driven way as well. Markus Harrer, October 2025 [1] https://speakerdeck.com/feststelltaste/einsatzmoglichkeiten-der-automatisierten-analyse-von-artefakten-und-metadaten-aus-softwareprojekten-zur-unterstutzung-der-wartbarkeitsoptimierung-langlebiger-softwaresysteme
  2. Legal notice 3 Licensed under Creative Commons BY-SA 4.0 You

    are free to: • Share — copy and redistribute the material in any medium or format for any purpose, even commercially. • Adapt — remix, transform, and build upon the material for any purpose, even commercially. • The licensor cannot revoke these freedoms as long as you follow the license terms. Under the following terms: • Attribution — You must give appropriate credit , provide a link to the license, and indicate if changes were made . You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
  3. Instructor Markus Harrer Senior Consultant / Nuremberg, Germany

    “Tools only find, people have to find out!” • Architecture, design and code reviews • Software modernization and evolution • Data analysis in software development Foundation & IMPROVE https://softwareanalytics.de https://feststelltaste.de/
  4. Questions Please introduce yourself • Name • Company / business

    domain • Job / role • What do you want to analyze?
  5. Schedule • 09:00 am: Start • 12:30 pm - 1:30

    pm: Lunch • 4:30 pm: End of new content • Until 5:00 pm: Ask me anything
  6. Online training recommendations • Webcams on • After 1 hour we take

    a 10-minute break • Please mute when not speaking • We address each other by first name • You are welcome to ask questions at any time (directly or by raising your hand) • Please also initiate discussions around the topics
  7. Contents 10 • Introduction to Software Analytics • Data sources

    for analyses in software development • Challenges while analyzing software data • Introduction to Reproducible Data Science • Data analysis with Jupyter, Python, pandas & Co. • Outlook on graph-based software analysis and machine learning on code Plus: Interactive, hands-on projects and katas
  8. Main idea Analysis of software systems* with techniques, methods and tools

    from (Graph) Data Science to help improve the quality of our software systems. * and their environments
  9. Quality issues Just a few examples Systems at their stress

    limits Workarounds and “temporary” patches Always nice surprises Lack of understanding of the system
  10. Questions Your experience with data analysis Please put a number

    into the chat box 1 The topic is new to me 2 I have performed a few analyses by myself 3 The topic is my daily work 4 I wrote a few books about the topic
  11. Quantifying knowledge loss in the event of developer turnover 20

    https://www.feststelltaste.de/knowledge-islands/
  12. Identification of performance hotspots via call tree analysis 23 https://www.feststelltaste.de/performance-hotspots/

    API calls to the backend Entry point into the application Hot spot metric (e.g. number of DB calls) Starting point to the bottleneck Where are the performance hotspots?
  13. Definition "Software Analytics“ 25 "Software Analytics is analytics on software

    data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions." Tim Menzies and Thomas Zimmermann
  14. Insights on important, specific issues 26 Frequency Issues Importance Use

    standard tools for general issues Option 2: Use Software Analytics to answer your important issues! Option 1: Just ignore all the other issues (not recommended at all!)
  15. Data sources for analyses 30 chronological Community Runtime static +

    combinations code tests docs mappings DB schema call graphs stack traces code coverage build logs heap dumps system logs … … version control data tickets audit data APIs roles & rights instant messages mailing lists internet forum online code platforms …
  16. Where to get the data? Use your favorite command line

    tool! • mvn dependency:analyze-duplicate • List all the duplicate dependencies in your Java application • jdeps -v . • List all dependencies between classes in your Java code • git shortlog -ns -- *ViewModel.java • List all developers that change view models • cloc ./ --by-file --quiet --csv • List the number of lines for each source code file
  17. Getting more from your tools Choose a machine-readable output format •

    cloc ./ --quiet --csv Let the tools do basic preprocessing • git log --no-merges --format=... --since.. Connect to tools directly via APIs • See API docs for Jenkins, SonarQube, jQAssistant/Neo4j, YouTrack, Jira, ...
  18. Example: YouTrack API call: https://youtrack.jetbrains.com/api/issues?$top=1000&fields=created,resolved&query=Type:%20Bug%20Project%3A%20%7BCode%20With%20Me%7D Website: https://youtrack.jetbrains.com/issues/CWM

    [ {"created":1625212572693,"resolved":null,"$type":"Issue"}, {"created":1623332830472,"resolved":null,"$type":"Issue"}, {"created":1616154776583,"resolved":1624642098272,"$type":"Issue"}, {"created":1624357764284,"resolved":1624965797482,"$type":"Issue"}, {"created":1624449874896,"resolved":1625092197222,"$type":"Issue"}, {"created":1625138678733,"resolved":null,"$type":"Issue"}, {"created":1616604945032,"resolved":1625040446461,"$type":"Issue"}, {"created":1624609359186,"resolved":null,"$type":"Issue"}... In most cases, you can reproduce the results shown in the UI by using the API
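A minimal sketch of how such an API response could be loaded into pandas. It uses the first four records shown on the slide; the `created` and `resolved` values are treated as epoch milliseconds, and the column name `days_open` is introduced here for illustration only.

```python
import json

import pandas as pd

# First four records of the response shown on the slide.
raw = """[
  {"created":1625212572693,"resolved":null,"$type":"Issue"},
  {"created":1623332830472,"resolved":null,"$type":"Issue"},
  {"created":1616154776583,"resolved":1624642098272,"$type":"Issue"},
  {"created":1624357764284,"resolved":1624965797482,"$type":"Issue"}
]"""

issues = pd.DataFrame(json.loads(raw))

# Convert epoch milliseconds to timestamps; unresolved issues become NaT.
issues["created"] = pd.to_datetime(issues["created"], unit="ms")
issues["resolved"] = pd.to_datetime(issues["resolved"], unit="ms")

# Whole days between creation and resolution (NaN for open issues).
issues["days_open"] = (issues["resolved"] - issues["created"]).dt.days

open_bugs = int(issues["resolved"].isna().sum())
print(issues[["created", "resolved", "days_open"]])
print("still open:", open_bugs)
```

From here, the usual pandas tools (grouping by month, plotting) can be applied to track bug fixing over time.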
  19. Example An analysis that uses YouTrack data to track bugfixing

    https://www.innoq.com/de/blog/defect-analysis-using-pandas/
  21. Great tools 38 • Production-ready data analysis tools are available

    as open source software • Analysis tools can now also handle highly interconnected and huge datasets Why now?
  22. Questions Your programming experience Please put a number into the

    chat box Programming Experience Python Experience No Experience Pandas Experience 1 2 3 4
  23. 40 Interactive notebook system • Document-centered analysis platform • Executable

    code blocks • Directly visible outputs / visualizations https://www.feststelltaste.de/top5-jupyter
  24. 41 Programming language for Data Science • Simple • Efficient

    • Fast https://www.feststelltaste.de/top5-python
  25. 42 Pragmatic data analysis tool • Like a “programmable Excel

    worksheet” • Really fast • Flexible • Expressive • Very good integration with other libraries https://www.feststelltaste.de/top5-pandas
  26. 43 Visualization library • Enables the programmatic creation of graphics

    • Create bar charts, line charts and more • Good integration with pandas & co. • Direct output to Jupyter Notebooks https://www.feststelltaste.de/top5-matplotlib
  27. Application Server Production Coverage Analysis 45 Which code in which

    packages is not used? User Java Web Application Coverage Tool Coverage per Class delivers uses PACKAGE,CLASS,LINE_MISSED,LINE_COVERED org.springframework.samples.petclinic,PetclinicInitializer,0,24 org.springframework.samples.petclinic.model,NamedEntity,1,4 org.springframework.samples.petclinic.model,Specialty,0,1 org.springframework.samples.petclinic.model,PetType,0,1 org.springframework.samples.petclinic.model,Vets,4,0 org.springframework.samples.petclinic.model,Visit,0,12 ...
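As a sketch of what can be done with this per-class coverage data, the following reads the rows shown on the slide with pandas and derives a coverage ratio per class and per package. The column names `lines` and `ratio` are introduced here for illustration.

```python
import io

import pandas as pd

# Rows copied from the slide (per-class line coverage).
csv_text = """PACKAGE,CLASS,LINE_MISSED,LINE_COVERED
org.springframework.samples.petclinic,PetclinicInitializer,0,24
org.springframework.samples.petclinic.model,NamedEntity,1,4
org.springframework.samples.petclinic.model,Specialty,0,1
org.springframework.samples.petclinic.model,PetType,0,1
org.springframework.samples.petclinic.model,Vets,4,0
org.springframework.samples.petclinic.model,Visit,0,12
"""

coverage = pd.read_csv(io.StringIO(csv_text))
coverage["lines"] = coverage["LINE_MISSED"] + coverage["LINE_COVERED"]
coverage["ratio"] = coverage["LINE_COVERED"] / coverage["lines"]

# Classes without a single covered line are candidates for unused code.
unused = coverage[coverage["LINE_COVERED"] == 0]

# Aggregate to package level: sum the lines first, then compute the ratio.
per_package = coverage.groupby("PACKAGE")[["LINE_COVERED", "lines"]].sum()
per_package["ratio"] = per_package["LINE_COVERED"] / per_package["lines"]

print(unused[["PACKAGE", "CLASS"]])
print(per_package)
```

Summing lines before dividing weights large classes more heavily than averaging the per-class ratios would; which of the two is appropriate depends on the question being asked.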
  28. 46 Data Science Python Distribution • All-inclusive package, free of

    charge! • Brings everything needed to get started • Included packages are optimized for each other and for the operating system in use • Download, install, get started! Python pandas matplotlib Jupyter ...
  29. Python Ecosystem Data Analytics • NumPy • scikit-learn • TensorFlow

    • Dask • Py2neo • Pygments Visualization / Presentation • pygal • Bokeh • python-pptx • RISE Other • Scrapy, Selenium, Flask
  30. Reproducible, open, structured data analysis 50 ✓ Make assumptions explicit

    and simplifications transparent ✓ Explain used data and filtering ✓ Motivate summarizations ✓ Share code, data and results Why now? A B
  31. Sweet Spot for Software Analytics 51 Data Science as foundation:

    Not over- nor under-engineered Lack of methodology Fixed on a specific methodology Enabling flexible analysis, grounded on proven methodology Technological constraints Free technology selection Strong technological foundation for individual analysis Software Quality Dashboards Custom built on strong foundations 100 % custom built analysis Data Science Your analysis
  32. Data Science Venn diagram by Drew Conway 52 Software developers

    are very close to Data Science! Substantive expertise Machine Learning Danger zone! Traditional research Data Science
  33. Data Science and Software Development? 53 "A data scientist is

    someone who is better at statistics than any software engineer and better at software engineering than any statistician."
  34. Why Data Science? 54 Big community • Free online courses,

    videos and tutorials • (e.g. DataCamp with over 8 million members) • Direct help for very individual questions • (e.g. Stack Overflow or blog articles) • Continuous learning and learning from others through online competitions • (e.g. Kaggle or similar challenges).
  35. Reuse of proven methodologies 55 E.g. Roger E. Peng’s „Stages

    of Data Analysis“ I. Stating Question II. Exploratory Data Analysis III. Formal Modeling IV. Interpretation V. Communication
  36. How to approach an analysis? Which modules are no longer used in production?

    • Coverage data during operation in production • Test coverage in staging environment • The measurement of code coverage is representative of the actual usage of the application • Modules can be derived from the coverage measurements • A list of modules and their average coverage of the used code
  37. How to approach an analysis? Collect coverage data in production using

    JaCoCo • Extract modules from coverage data • Calculate the coverage ratio per module • Degree of utilization of the software in production per module • Remove modules from code which are no longer used in production
  38. Too complex? Simple models win! • Joining: combine relevant content from multiple data sources

    • Aggregating: present new findings compactly in other perspectives (is enough in ~80% of the cases)
  39. Heuristics “Heuristics refers to the art of arriving at

    probable statements or workable solutions with limited knowledge (incomplete information) and little time.” G. Gigerenzer and P. M. Todd with the ABC Research Group: Simple heuristics that make us smart. Oxford University Press, New York 1999. [Figure: getting from A to B without heuristics vs. with heuristics]
  40. Heuristics and Notebooks 67 • Provide analysis notebooks and data

    • Start presentation with visualization and key messages (details if needed) • Group description, code & partial result for each mental step 67
  41. Rules of Tidy Data in Notebooks “Separation of Concerns“ for

    better understandability One column per variable / information type One row for each observation of a variable One table for all related variables A linking column for each table of an analysis From Jeff Leek: The Elements of Data Analytic Style
  42. Integration with Machine Learning • pandas uses the NumPy library

    under the hood • Many machine learning libraries use NumPy under the hood as well This allows you to use pandas with most of the machine learning libraries out there! MLonCode
  43. Example: Modularization Check Check the existing modularization of a software

    system and compare it to the change behavior of the development teams Goal: Find out if developers change code more within modules than across module boundaries (the latter would indicate that the existing boundaries need to be refined) → Uses the machine learning library scikit-learn for the job
  44. I. Stating Question 72 "How well do the modules support

    cohesive changes?" Legend A B C A B C change activities A B C components Non-cohesive changes Cohesive changes
  45. I. Stating Question 73 Heuristics "Are changes made within a

    component related?" • Changes => Commits from version control • Components => Part of a file path
  46. II. Exploratory Data Analysis 74 Commit and file path git

    log --numstat --format=... filepath commit_id .../todo/Get.java #59a26 .../todo/New.java #59a26 .../site/Main.java #34af9 ... ...
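A sketch of how such log output could be turned into the commit/file table. The exact `--format` string is elided on the slide, so the sample text below assumes a hypothetical format that prints `#<short hash>` per commit, followed by the usual tab-separated `--numstat` lines; the `src/` path prefix is hypothetical as well.

```python
import pandas as pd

# Hypothetical output of `git log --numstat --format=...`:
# one "#<hash>" line per commit, then "<added>\t<deleted>\t<path>" lines.
log_text = "\n".join([
    "#59a26",
    "10\t2\tsrc/todo/Get.java",
    "5\t0\tsrc/todo/New.java",
    "",
    "#34af9",
    "3\t1\tsrc/site/Main.java",
])

rows = []
commit_id = None
for line in log_text.splitlines():
    if line.startswith("#"):
        commit_id = line  # remember the commit the following files belong to
    elif line.strip():
        added, deleted, filepath = line.split("\t")
        rows.append({"filepath": filepath, "commit_id": commit_id})

changes = pd.DataFrame(rows)

# Heuristic from the slides: the component is a part of the file path,
# here the directory that directly contains the file.
changes["component"] = changes["filepath"].str.split("/").str[-2]
print(changes)
```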
  47. III. Modeling 75 Pivot table with commits for each file

    → one vector per file (from now on: pure mathematics) ... #34af9 #35e25 #59a26 ... 0 1 1 .../todo/Get.java ... 0 1 1 .../todo/New.java ... 1 0 0 .../site/Main.java ... ... ... ... ...
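The pivot step could look like the following sketch, which rebuilds the matrix shown on the slide from a long commit/file table using `pd.crosstab`. The rows of the input table are chosen so that the result matches the slide.

```python
import pandas as pd

# Long table: one row per (file, commit) pair, matching the slide's matrix.
changes = pd.DataFrame({
    "filepath": [
        ".../todo/Get.java", ".../todo/New.java",
        ".../todo/Get.java", ".../todo/New.java",
        ".../site/Main.java",
    ],
    "commit_id": ["#59a26", "#59a26", "#35e25", "#35e25", "#34af9"],
})

# One row per file, one column per commit, 1 if the commit touched the file.
commit_matrix = pd.crosstab(changes["filepath"], changes["commit_id"])
print(commit_matrix)
```

Each row of `commit_matrix` is the vector for one file that the later similarity calculation works on.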
  48. III. Modeling 76 Similarity calculation → Cosine similarity between vectors

    / files ... .../site/Main.java .../todo/New.java .../todo/Get.java ... 0.3 0.8 1 .../todo/Get.java ... 0 1 0.8 .../todo/New.java ... 1 0 0.3 .../site/Main.java ... ... ... ... ...
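The workshop uses scikit-learn for this step; the sketch below computes the same cosine similarity with plain NumPy so that the arithmetic is visible. It uses only the three commit columns shown on the pivot slide, so the numbers differ from the illustrative values in the matrix above, which are presumably based on the full commit history.

```python
import numpy as np

files = [".../todo/Get.java", ".../todo/New.java", ".../site/Main.java"]

# One row per file, one column per commit (#34af9, #35e25, #59a26).
vectors = np.array([
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
], dtype=float)

# Cosine similarity: dot product of the length-normalized vectors.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms
similarity = unit @ unit.T

for name, row in zip(files, similarity):
    print(name, np.round(row, 2))
```

Files that always change in the same commits get a similarity of 1, files that never change together get 0.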
  49. IV. Interpretation Information reduction • Multidimensional scaling reduces n dimensions

    to two dimensions while maintaining spacing • Component can be extracted from file path • Example: .../todo/Get.java => todo y x filepath 0.67 0.14 .../todo/Get.java 0.70 0.13 .../todo/New.java 0.50 0.31 .../site/Main.java ... ... ... comp todo todo site ...
  50. V. Communication 78 Interactive graphics generation • Files of the

    software system => points • Files that are modified together => proximity of the points to each other • Components of the software system => colors of the points
  51. V. Communication 79 Changes across module boundaries Changes within the

    module boundary Nearby points = related modified source code files 1 point = 1 source code file (color = subject module)
  52. Creating different perspectives Sales data Business subdomains Usage data Technical

    modules Patterns Low-level metrics Examples for perspectives Visibility to businesspeople high low Where we are today Where we should be
  53. Data → Perspectives 83 Example: Results from static code analyzers

    com.company.ordersystem.partner.api.rest.OrderExecutor.java 540LOC Pattern Language Technical Aspect Layer Subdomain Subsystem Metric LOC: Lines of code
  54. Data → Perspectives 86 Example: Entries from log files 2012-04-12

    13:54:34.512 POST /api/order/1432 132ms "Order processed" Usage scenario Technical Aspect Metric Usage data
  55. Data → Perspectives 87 Example: Semi-manual* assignment of information TRA1321

    <-> External Legacy Transactions EXE5243 <-> Order Execution Usage scenarios Jobs * There is often documentation that provides information that you can use for linking the technology world and the business world
  56. Strategies & Examples for & of actionable insights “Strategies” adopted

    from Miryung Kim, Tom Zimmermann, Rob DeLine, Andrew Begel: The Emerging Role of Data Scientists on Software Development Teams
  57. Be aware! A data analysis is not free! Investigate

    the problem before analyzing the data! • Listen, inquire, repeat • Search in depth with “Why?” • Search in breadth with “What else?” • Investigate only the hotspot you found! Ask yourself: “Is data analysis possible and helpful?” Root cause analysis with depth search
  58. Right question 91 “There are many more questions to pursue

    than you have time and resources for. Choose questions that enable the stakeholders to achieve their goals.” Strategies for actionable insights
  59. Examples of actionable insights Right question Bad: „Which developer did

    most of the commits?“ Typical problems with these kinds of questions: • Data quality issues / validity? • Behaviour or performance checking • Actionable? • Metric tuning / cheating “Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” Bill Gates Number of commits per developer for the IntelliJ IDEA IDE
  60. Examples of actionable insights Right question Better: „Where do developers

    work alone in the code base?“ • Colored circles show files changed by just a single developer • White circles show files that were changed by many different developers • Size of circles corresponds to source code size • Hierarchical circle packing diagram visualizes directory structure and source code files • Next actions: redocumentation, team reorganization, pair programming, …
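The data behind such a diagram can be approximated by counting distinct authors per file. The following is a sketch with hypothetical commit data; in practice the table would be extracted from version control.

```python
import pandas as pd

# Hypothetical commit data: one row per (file, author) change.
commits = pd.DataFrame({
    "filepath": [
        "src/todo/Get.java", "src/todo/Get.java",
        "src/todo/New.java",
        "src/site/Main.java", "src/site/Main.java",
    ],
    "author": ["alice", "bob", "alice", "carol", "carol"],
})

# Number of distinct developers who ever changed each file.
authors_per_file = commits.groupby("filepath")["author"].nunique()

# Files that only one developer has touched so far.
single_author_files = sorted(authors_per_file[authors_per_file == 1].index)
print(single_author_files)
```

As the later "Iterate" slides point out, this only captures who committed changes; code reviews or mob programming are not visible in this data.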
  61. Examples of actionable insights Right question My favorites: „Which code

    is used by the users of our application?“ • Deep blue circles show source code files with code that wasn’t executed at all • Deep red circles show source code files with code that was heavily used • Size of circles corresponds to source code file size • Hierarchical circle packing diagram visualizes directory structure and source code files • Next actions: insert asserts at the entry points of presumably dead code, delete unused parts of the code
  62. Iterate 95 “Iterate with the stakeholders to interpret the data

    and to identify and refine important questions and scenarios.” Strategies for actionable insights
  63. Iterate 96 Examples of actionable insights The question „Where do

    developers work alone in the code base?“ was answered by analyzing the developers’ changes to the code, but • What about code reviews? • What about mob programming? • What about code reading? Refine the analysis when a more detailed answer becomes important! Hierarchical circle packing diagram with an overview of where developers work alone in the code base
  64. Multiple sources 97 “Triangulate multiple data sources to increase the

    confidence in the analysis results.” Strategies for actionable insights
  65. Multiple sources Examples of actionable insights Hot spot analysis: Where

    is complex code that is changed frequently? Source code Version control + Example on the right taken from Adam Tornhill‘s book “Software Design X-Rays”
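A sketch of how the two data sources could be combined with pandas. All numbers and file names are hypothetical, lines of code serve as a rough stand-in for complexity, and the `score` (lines multiplied by number of changes) is an assumption made here for ranking, not a formula taken from the slides.

```python
import pandas as pd

# Source code: lines per file (hypothetical values).
loc = pd.DataFrame({
    "filepath": [
        "src/order/OrderExecutor.java",
        "src/order/OrderMapper.java",
        "src/site/Main.java",
    ],
    "lines": [540, 120, 300],
})

# Version control: one row per (file, commit) pair (hypothetical values).
commits = pd.DataFrame({
    "filepath": ["src/order/OrderExecutor.java"] * 4
    + ["src/order/OrderMapper.java"] * 5
    + ["src/site/Main.java"],
    "commit_id": ["c1", "c2", "c3", "c4", "c1", "c2", "c3", "c5", "c6", "c7"],
})

# Aggregate: number of distinct commits per file.
changes = (
    commits.groupby("filepath")["commit_id"].nunique().rename("changes").reset_index()
)

# Join both perspectives; files without commits get 0 changes.
hotspots = loc.merge(changes, on="filepath", how="left")
hotspots["changes"] = hotspots["changes"].fillna(0).astype(int)
hotspots["score"] = hotspots["lines"] * hotspots["changes"]
hotspots = hotspots.sort_values("score", ascending=False).reset_index(drop=True)
print(hotspots)
```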
  66. Translate 100 “Translate analysis results to familiar concepts that are

    important for the stakeholders’ decisions.” Strategies for actionable insights
  67. Translate Displaying business subdomains with production usage („utilization“) and code

    changes („investments“) measures on a 2x2 matrix https://www.feststelltaste.de/swot-analysis-for-spotting-worthless-code/ Examples of actionable insights Strategic Redesign
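One way to assign subdomains to the four quadrants is sketched below. The subdomain names and numbers are hypothetical, and splitting each axis at its median is an assumption made for this illustration; the linked article may use a different threshold.

```python
import pandas as pd

# Hypothetical measures per business subdomain.
subdomains = pd.DataFrame({
    "subdomain": ["orders", "partners", "reporting", "legacy-import"],
    "utilization": [0.80, 0.70, 0.15, 0.05],  # share of code executed in production
    "investments": [51, 15, 40, 3],           # number of code changes
}).set_index("subdomain")

# Split each axis at its median (assumption, see above).
high_use = subdomains["utilization"] >= subdomains["utilization"].median()
high_invest = subdomains["investments"] >= subdomains["investments"].median()

subdomains["quadrant"] = (
    high_use.map({True: "high usage", False: "low usage"})
    + " / "
    + high_invest.map({True: "high investment", False: "low investment"})
)
print(subdomains)
```

Subdomains with low usage but high investment are the ones worth discussing with stakeholders first.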
  68. Translate Identifying active communities around open source software by analyzing

    the frequencies of discussion on the internet (e.g. Stack Overflow) Examples of actionable insights Open Source Software Evaluation
  69. Plan for scale 104 “Many stakeholders want to deploy predictive

    models as part of the product. Embrace your role in the entire end-to-end scenario.” Strategies for actionable insights
  70. Plan for scale Examples of actionable insights Improvement Dashboards Visualizing

    the value of the improvement work Days since last failure 143 Today‘s orders 41 Ø Orders 54 Active User 14 Completed tech debt tickets 4/15 Completion DB migration 78% Version control Log files System monitoring Issues tracker Business monitoring Data warehouse Sources of data Current state of the legacy system
  71. Example overview • Descriptive analysis: For every click in the application, there is an average of 250 DB calls and 50 service calls

    • Exploratory analysis: CustomerMapper also reloads all contract data when converting CustomerDBO to CustomerBO → Inductive analysis: Mapping code reloads data via loops and traversing relationships. • Tracing analysis: This list of mappers for the following views already uses the new (D)BOs DBO: DataBase Objects BO: Business Objects
  72. Descriptive Analysis 108 Metrics allow us to measure different aspects

    in our software, its environment and the people involved in it. Metrics Aspect Measurement result System Developers For every click in the application, there is an average of 250 DB calls and 50 service calls
  73. Descriptive Analysis 109 Make sense of fine-granular data by putting

    it into a perspective that non-technical people can reason about. Metrics Measurement result Translate Developers Decision Makers Every time a customer is fetched, there are 150 DB calls and 30 service calls For every click in the application, there is an average of 250 DB calls and 50 service calls
  74. Exploratory → Inductive Analysis 110 • 1 x perform analysis

    by hand • Reflect and abstract • Re-code (= automate analysis) CustomerMapper also reloads all contract data when converting CustomerDBO to CustomerBO Mapping code reloads data via loops and traversing relationships. A C B E D A 1 finding n findings
  75. Identification of performance hotspots via call tree analysis 111 Examples

    from the trenches Raw data (“call trees”) from a Java performance profiler tool (JProfiler) https://www.feststelltaste.de/performance-hotspots/ From single observations…
  76. Identification of performance hotspots via call tree analysis 112 Examples

    from the trenches API calls to the backend Entry point into the application Hot spot metric (e.g. number of DB calls) Starting point to the bottleneck Which parts of the application load data they don’t need? e.g. /petclinic/owners/ e.g. showOwners() https://www.feststelltaste.de/performance-hotspots/ …to all root causes!
  77. Exploratory → Inductive Analysis 113 Result: A list of root

    causes + recipes to fix them → Very effective in many cases!
  78. Addressing architectural problems in larger code bases Examples from the

    trenches Root Causes Recipe *example has been changed because of confidentiality reasons
  79. Tracing Analysis Trace work in progress during cleanup / migrations

    This list of mappers for the following views already uses the new (D)BOs
  80. Tracing improvement work over time 116 Framework migrations AngularJS with

    JavaScript ↓ AngularJS with TypeScript ↓ Angular with TypeScript Number of files Examples from the trenches
  81. Tracing improvement work over time 117 Architecture Refactoring Structureless ↓

    Hexagonal Architecture Examples from the trenches https://www.innoq.com/en/blog/visualizing-progress-of-refactoring-into-hexagonal-architecture-using-jqassistant/
  82. Motivation for Graph Analytics 120 Data from software systems… is

    not always available in tabular form (so-called "panel data") is strongly interconnected within a data source (code dependencies, call hierarchies, commits...) can often be related via different data sources (e.g. code with static code analysis and commits)
  83. Graph-based integration 121 • Different data sources can be related

    to each other • New concepts / perspectives can be projected onto fine granular data • Summaries of many details into a more interpretable overall view possible 2016-09-31 78,5% PetDBO Coverage Commits God Class Code Smells Controller Pattern Patterns Pet Code
  84. How does it work? 122 1. Scan software structures 2.

    Save data to a graph database 3. Execute queries • Investigate relationships • Add perspectives (= concepts) and rules • Validate rule violations (= constraints) 4. Generate reports jQAssistant Core ideas for architecture validation Advanced topic
  85. Concepts 123 Elements are assigned to concepts (= an abstract

    idea like a pattern) Examples • Maven Project = Module • Java Package = Layer • Class annotated with @Entity = JPA* entity jQAssistant Core ideas for architecture validation *JPA: Java Persistence API, standard to use databases in Java Advanced topic
  86. Definition of a Concept

    == JPA Entities

    [[jpa2:Entity]]
    Any type annotated with @javax.persistence.Entity is a JPA entity.

    [source,cypher,role=concept]
    ----
    MATCH (t:Type)-[:ANNOTATED_BY]->()-[:OF_TYPE]->(a:Type)
    WHERE a.fqn = "javax.persistence.Entity"
    SET t:Jpa:Entity
    RETURN t AS Entity
    ----
  87. Constraint 125 Rules are defined based on the concepts •

    Example: "All database entities must be in the persistence layer". Constraints (= deviations from the rules) are checked • Example: "Give me all database entities that are not in the persistence layer." jQAssistant Core ideas for architecture validation Advanced topic
  88. Definition of a Constraint

    [[model:JpaEntityInModelPackage]]
    All JPA entities must be located in the persistence layer (the package with the name "model").

    [source,cypher,role=constraint,requiresConcepts="jpa2:Entity"]
    ----
    MATCH (package:Package)-[:CONTAINS]->(entity:Jpa:Entity)
    WHERE package.name <> "model"
    RETURN entity AS EntityInWrongPackage
    ----
  89. Existing scanners 127 ZIP GZ *.class JAR, WAR, EAR

    MANIFEST.MF *.properties XSD YAML XML application.xml web.xml beans.xml JaCoCo FindBugs CheckStyle Maven Test Reports RDBMS schema M2 Repository CSV … JSON Git Arbitrarily expandable thanks to plug-in architecture
  90. Perspectives with Graphs :Class Method :Field 2 findings 5 changes

    :Entity 100% usage birthDate name @Entity @Table(name = "pets") public class Pet { Technical World Subdomain Pattern
  91. Perspectives with Graphs 129 5 types 39 findings 51 changes

    80% usage Business World 16 types 17 findings 15 changes 70% usage
  92. Typical usage examples 130 • Detection of programming errors •

    Common changeable state • Complex logic in UI • Performance HotSpot Mining • Lazy loading / N+1 query problems • Identification of problematic frameworks • Improvement of legacy code • Production Coverage / Code Decommissioning • Effort estimation Rearchitecting / Modularization
  93. Example Graph 133 Spring PetClinic visibility: public name: Pet fqn:

    org.springframework.samples.petclinic.model.Pet … Class:Type:Java:File
  94. Example Graph 134 Spring PetClinic

    public class Pet {
        private LocalDate birthDate;

        public LocalDate getBirthDate() {
            return this.birthDate;
        }

        public void setBirthDate(LocalDate birthDate) {
            this.birthDate = birthDate;
        }
    }
  95. Graph in Detail Nodes Examples: • Java class • Measure

    • File • Repository • URL • ...
  96. Graph in Detail Edges - Types of a relationship Examples:

    • DECLARES • EXTENDS • IMPLEMENTS • ANNOTATED_BY • DEPENDS_ON • ...
  97. A Graph – with data*! File File Class Class Type

    Type File File value key “entity“ name “Entity.java” file “jpa.thing.Entity” fqn value key 3 weight value key “Pet” name “Pet.java” file “foo.bar.Pet” fqn Labels Properties *therefore also called "Property Graph Model": Data can be stored at nodes and edges
  98. Graph in Detail Labels – „Tags“ of a node Examples:

    • :Class • :Type • :Java • :File • ... File File
  99. Graph in Detail Properties – Data of a node or

    edge Examples: • fqn (fully qualified name) • name • visibility • abstract • static • final • ... value key “Pet” name “Pet.java” file “foo.bar.Pet” fqn
  100. Structure of Cypher Queries Properties and Values (p:Class {name:"Pet"})-[]->(e:Type {name:"Entity"})

    value key “Entity” name … … value key “Pet” name … … Type Type Class Class … … … …
  101. Structure of Cypher Queries Query and return nodes as results

    MATCH (p:Class {name:"Pet"})-[]->(e:Type {name:"Entity"}) RETURN p, e Type Type Class Class … … … … value key “Entity” name … … value key “Pet” name … …
  102. Structure of Cypher Queries Query and returning of properties as

    results MATCH (p:Class {name:"Pet"})-[]->(e:Type {name:"Entity"}) RETURN p.name, e.name Type Type Class Class … … … … value key “Entity” name … … value key “Pet” name … …
  103. Structure of Cypher Queries Query nodes using properties in WHERE

    clause MATCH (p:Class)-[]->(e:Type) WHERE p.name = "Pet" AND e.name = "Entity" RETURN p, e Less performant, but easier to read and learn at the beginning Type Type Class Class … … … …
  104. Structure of Cypher Queries Specification of the relationship type MATCH

    (p:Class)-[:DEPENDS_ON]->(e:Type) WHERE p.name = "Pet" AND e.name = "Entity" RETURN p, e Type Type Class Class … … … … DEPENDS_ON DEPENDS_ON … …
  105. Examples of Cypher Queries 158 List all types with the

    most methods MATCH (t:Type)-[:DECLARES]->(m:Method) RETURN t.fqn as type, COUNT(m) as methods ORDER BY methods DESC
  106. Examples of Cypher Queries 159 List of all writeable static

    variables that are written to MATCH (c:Class)-[:DECLARES]-> (f:Field)<-[w:WRITES]-(m:Method) WHERE EXISTS(f.static) AND NOT EXISTS(f.final) RETURN c.name as InClass, m.name as theMethod, w.lineNumber as writesInLine, f.name as toStaticField
  107. Examples of Cypher Queries 161 Counting of changes, aggregated over

    domain subareas MATCH (t:Type)-[:BELONGS_TO]->(s:Subdomain), (t)-[:HAS_CHANGE]->(ch:Change) RETURN s.name as ASubdomain, COUNT(DISTINCT t) as Types, COUNT(DISTINCT ch) as Changes ORDER BY Types DESC
  108. Examples of Cypher Queries 162 Aggregation of multiple measurement results

    across functional areas MATCH (t:Type)-[:BELONGS_TO]->(s:Subdomain), (t)-[:HAS_CHANGE]->(ch:Change), (t)-[:HAS_MEASURE]->(co:Coverage) OPTIONAL MATCH (t)-[:HAS_BUG]->(b:BugInstance) RETURN s.name as ASubdomain, COUNT(DISTINCT t) as Types, COUNT(DISTINCT ch) as Changes, AVG(co.ratio) as Coverage, COUNT(DISTINCT b) as Bugs, SUM(DISTINCT t.lastMethodLineNumber) as Lines ORDER BY Coverage ASC, Bugs DESC
  109. Jupyter Notebook + py2neo 165 1. Import libraries 2. Connect

    to the running Neo4j instance 3. Submit query in Cypher 4. Convert result to DataFrame 5. Result from the graph database is displayed
  110. Roundtrip pandas ←→ Neo4j

    • pandas DataFrame as input
    • Neo4j output as DataFrame
    • Import code into Neo4j with py2neo
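  111. Example: Strategic Redesign

    Improve source code that is actually used
    [Diagram: users work with the web application on the application server, where coverage measurement yields usage per class; developers build and run the source code, and the version control system yields changes per class; both data sources feed the analysis]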
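Combining the two data sources from this example can be sketched in a few lines of Python. The usage ratios, change counts, thresholds, and category names below are all hypothetical; they only illustrate how usage per class and changes per class might be joined.

```python
# Hypothetical measurements per class.
usage_ratio = {"Owner": 0.90, "Pet": 0.75, "LegacyReport": 0.00, "Visit": 0.10}
changes = {"Owner": 42, "Pet": 5, "LegacyReport": 30, "Visit": 2}

# Thresholds are assumptions; pick values that fit your own data.
USAGE_THRESHOLD = 0.5
CHANGE_THRESHOLD = 10


def classify(class_name):
    """Combine production usage and change frequency into one category."""
    used = usage_ratio.get(class_name, 0.0) >= USAGE_THRESHOLD
    often_changed = changes.get(class_name, 0) >= CHANGE_THRESHOLD
    usage_part = "used" if used else "unused"
    change_part = "often changed" if often_changed else "rarely changed"
    return f"{usage_part}, {change_part}"


report = {name: classify(name) for name in sorted(usage_ratio)}

# Classes that are both used and often changed are the first improvement candidates.
candidates = [name for name, category in report.items()
              if category == "used, often changed"]

for name, category in report.items():
    print(f"{name}: {category}")
print("Improve first:", candidates)
```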
  112. Graph-based analysis: Conclusion

    • jQAssistant provides fast and flexible views for a variety of graph-like software structures
    • Data can be explored interactively via the graph database Neo4j
    • Rules can be defined as concepts and constraints in Cypher / AsciiDoc for continuous checking
  113. Links for jQAssistant

    • jQAssistant: http://jqassistant.org
    • Talk: https://www.youtube.com/watch?v=kQr2c7yWbEA
    • Spring PetClinic sample project
      • Repo: http://github.com/JavaOnAutobahn/spring-petclinic
      • Example output: https://buschmais.github.io/spring-petclinic
    • TOP5 Learning jQAssistant: https://www.feststelltaste.de/top5-jqassistant/
  114. matplotlib: Visualization library for Python

    • Very well suited for simple (intermediate) visualizations
    • Direct generation of diagrams from pandas DataFrame using plot() function
    • However, more elaborate graphics require extensive configuration code
    [Figures: bar chart for knowledge distribution; scatter plot for code hotspots. Graphics from https://matplotlib.org/3.1.1/gallery/index.html]
  115. D3.js: JavaScript-based visualization library

    • Well suited for highly networked data
    • Output of a JSON file with pandas (and possibly post-processing with Python)
    • Use existing D3 visualization as basis for template
    [Figures: force-directed graph layout; hierarchical edge bundling; circle packing]
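The post-processing step mentioned above often means turning a flat result into nested JSON. The sketch below uses only the standard library and hypothetical input rows; the `name`/`children`/`value` layout follows a convention used by many D3 hierarchy examples, but the keys your chosen D3 template expects are an assumption you should check. With pandas, the rows could come from something like `df.itertuples(index=False)`.

```python
import json

# Hypothetical flat result: (file path, lines of code).
rows = [
    ("src/model/Owner.java", 120),
    ("src/model/Pet.java", 80),
    ("src/web/OwnerController.java", 200),
]


def to_hierarchy(rows, root_name="root"):
    """Nest path segments into name/children dicts; leaves carry a value."""
    root = {"name": root_name, "children": []}
    for path, value in rows:
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            # Reuse an existing directory node or create a new one.
            child = next(
                (c for c in node["children"]
                 if c["name"] == part and "children" in c),
                None,
            )
            if child is None:
                child = {"name": part, "children": []}
                node["children"].append(child)
            node = child
        node["children"].append({"name": parts[-1], "value": value})
    return root


hierarchy = to_hierarchy(rows)
json_text = json.dumps(hierarchy, indent=2)
print(json_text)
```

Writing `json_text` to a file gives the D3 visualization a static data source that can be regenerated whenever the analysis is rerun.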
  116. pygal: Library for interactive visualizations in Python

    • Out-of-the-box interaction possibilities with the displayed data (e.g. mouseover)
    • Requires some light preparation of the data from a pandas DataFrame
    [Figures: gauges; treemap; XY plot]
  117. Tips for getting started: Use "picture search" to see what's possible!

    Browsing galleries and Google image search for visualization ideas helps you move from a graphical representation to the code that produces it.
  118. Tips for getting started: Get inspired by example code

    Stack Overflow and blogs are very helpful for details (especially with matplotlib). Thanks to a large community, answers to many questions already exist.
  119. Effective visualizations

    • Present little information in an understandable way
    • Also visualize intermediate results
    • Generate graphics programmatically from results
    Python code example: top10_authors.plot.pie()
    [Figure: the resulting pie chart]
  120. Software Analytics Maturity Model: Are you ready for Software Analytics?

    • Known: awareness of the value of software data is present
    • Used: software data has already been used to show a problem
    • Defined: existing data sources are known and analysis tools are standardized (as far as reasonable)
    • Repeatable: teams use the same approach to analysis and can use it across teams
    • Integrated: the use of data analysis tools in development, as well as the creation of new data sources, is a matter of course
  121. Ethics

    • There are restrictions and laws (and works councils)
    • Measurements can be gamed
    • Beware! Raw measurements have no context
    • Management might see something in there that's not there
    "Tracking individual performance can create a morale issue, which perversely could bring down overall productivity." Ciera Jaspan, Caitlin Sadowski: No Single Metric Captures Productivity
  122. Final Quotes = food for thought

    "Not everything that can be counted counts, and not everything that counts can be counted." William Bruce Cameron
    "All models are wrong, but some are useful." George Box
  123. Summary

    • Methods and tools are here
    • Problems can be communicated
    • Best practices help you get started
    Data-driven improvement is possible!
  124. More information on the topic

    Literature
    • Adam Tornhill: Software Design X-Rays
    • Wes McKinney: Python for Data Analysis
    • Jeff Leek: The Elements of Data Analytic Style
    • Christian Bird, Tim Menzies, Thomas Zimmermann: The Art and Science of Analyzing Software Data
    • Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering
    Software
    • Python Data Science Distribution: anaconda.com
    • GitHub repo: github.com/feststelltaste/software-analytics
    • Mini-Tutorial: github.com/feststelltaste/software-analytics-workshop-guided
  125. Practicing Software Analytics: Self-study Cheatbooks

    More content for you to get started: https://github.com/feststelltaste/software-analytics-workshop-guided
  126. Practicing Software Analytics: Software Analytics Katas

    Small challenges for you to analyze: https://github.com/feststelltaste/software-analytics-katas
    You are given only the analysis need and the dataset. You need to find out the rest by yourself!
  127. Practicing Software Analytics: Local installation guide for your personal data analysis platform

    https://www.feststelltaste.de/ddiosqig/
    Explains what you need to do to work on your own analysis on your own machine
  128. Questions

    What other questions are there? SQUID*!
    *Sequential Question and Insight Diagram
    [Diagram: a topic branching into questions (Q), each followed by answers (A) and follow-up questions]
  129. Alternative formats of this workshop

    • Software Analytics 5h mobshop* (like this one, but 95% coding-focused)
    • Software Analytics 2-day workshop (+ katas & graph analytics)
    *https://www.innoq.com/en/blog/collaborative-learning-with-mobshops/
  130. Other workshops with me

    • iSAQB Foundation Level (basic training for software architects)
    • iSAQB IMPROVE (software evolution and architecture improvements)
  131. Thank you!

    Markus Harrer, markusharrer.de
    innoQ Deutschland GmbH, www.innoq.com
    Krischerstr. 100, 40789 Monheim, +49 2173 3366-0
    Offices: Königstorgraben 11, 90402 Nürnberg • Erftstr. 15-17, 50672 Köln • Hermannstrasse 13, 20095 Hamburg • Kreuzstr. 16, 80331 München • Ludwigstr. 180E, 63067 Offenbach • Ohlauer Str. 43, 10999 Berlin