Software Analytics - Data-Driven Improvement of Software Quality

Markus Harrer, Software Development Analyst. Some rights reserved.

Note to the readers

These are the slides of my workshop on “Software Analytics - Data-driven improvement of software systems” in the last version from 2022. The very first analyses were created during my Master’s studies in 2013, especially as part of my Master’s thesis “Possible Applications of Automated Analysis of Artifacts and Metadata from Software Projects to Support the Maintainability Optimization of Long-Lived Software Systems” [1]. Thanks to my former employer, INNOQ, and many customers with very interesting legacy systems and challenges, I was able to enhance these techniques to analyze large-scale codebases in a data-driven way. With the release of this workshop under a Creative Commons license, I want to help you improve your own software systems in a data-driven way as well.

Markus Harrer, October 2025

[1] https://speakerdeck.com/feststelltaste/einsatzmoglichkeiten-der-automatisierten-analyse-von-artefakten-und-metadaten-aus-softwareprojekten-zur-unterstutzung-der-wartbarkeitsoptimierung-langlebiger-softwaresysteme

Legal notice

Licensed under Creative Commons BY-SA 4.0

You are free to:
• Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
• Adapt: remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:
• Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
• ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
• No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Introduction

Instructor

Markus Harrer, Senior Consultant / Nuremberg, Germany

"Tools only find, people have to find out!"

• Architecture, design and code reviews
• Software modernization and evolution
• Data analysis in software development

Foundation & IMPROVE
https://softwareanalytics.de
https://feststelltaste.de/

Questions Please introduce yourself • Name • Company / business
domain • Job / role • What do you want to analyze?

Schedule • 09:00 am: Start • 12:30 pm - 1:30
pm: Lunch • 4:30 pm: End of new content • Until 5:00 pm: Ask me anything

Online training recommendations
• Webcams on
• After 1 hour we take a 10-minute break
• Please mute when not speaking
• We address each other by our first names
• You are welcome to ask questions at any time (directly or by raising your hand)
• Please also initiate discussions around the topics

About this workshop

Contents
• Introduction to Software Analytics
• Data sources for analyses in software development
• Challenges while analyzing software data
• Introduction to Reproducible Data Science
• Data analysis with Jupyter, Python, pandas & Co.
• Outlook on graph-based software analysis and machine learning on code
Plus: Interactive, hands-on projects and katas

Main idea
Analysis of software systems* with techniques, methods and tools from (Graph) Data Science to help improve the quality of our software systems.
* and their environments

Motivation

Software (de-)evolution 13 Time Money Cost per Feature Quality Issues
Productivity

Quality issues Just a few examples Systems at their stress
limits Workarounds and “temporary” patches Always nice surprises Lack of understanding of the system

Questions What are your quality issues?

management developers communication gap risk appetite quality issues

management developers communication gap risk appetite adequate quality data analysis
quality goals

Questions Your experience with data analysis Please put a number
into the chat box 1 The topic is new to me 2 I have performed a few analyses by myself 3 The topic is my daily work 4 I wrote a few books about the topic

Examples of analyses

Quantifying knowledge loss in the event of developer turnover 20
https://www.feststelltaste.de/knowledge-islands/

Analysis of community activities around software tools 21 How well
is the tool supported by the community?

Tracking improvements over time
https://www.innoq.com/en/blog/visualizing-progress-of-refactoring-into-hexagonal-architecture-using-jqassistant/

Identification of performance hotspots via call tree analysis 23 https://www.feststelltaste.de/performance-hotspots/
API calls to the backend Entry point into the application Hot spot metric (e.g. number of DB calls) Starting point to the bottleneck Where are the performance hotspots?

Software Analytics Key ideas and concepts

Definition “Software Analytics”

"Software Analytics is analytics on software data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions."
Tim Menzies and Thomas Zimmermann

Insights on important, specific issues 26 Frequency Issues Importance Use
standard tools for general issues Option 2: Use Software Analytics to answer your important issues! Option 1: Just ignore all the other issues (not recommended at all!)

Why now? What makes data-driven improvement of quality possible nowadays?

Why now? Plenty of software data at your fingertips

Question Which data can be used for analyzing quality issues
in software systems?

Data sources for analyses 30 chronological Community Runtime static +
combinations code tests docs mappings DB schema call graphs stack traces code coverage build logs heap dumps system logs … … version control data tickets audit data APIs roles & rights instant messages mailing lists internet forum online code platforms …

Where to get the data?
Use your favorite command line tool!
• mvn dependency:analyze-duplicate
  List all the duplicate dependencies in your Java application
• jdeps -v .
  List all dependencies between classes in your Java code
• git shortlog -ns -- *ViewModel.java
  List all developers that changed view models
• cloc ./ --by-file --quiet --csv
  List the number of lines for each source code file
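The CSV output of the cloc call above can be loaded directly into pandas. The following is a minimal sketch, assuming the column layout language, filename, blank, comment, code and a trailing SUM row; depending on the cloc version the header may contain additional columns. The sample rows and file names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample of the text produced by `cloc ./ --by-file --quiet --csv`.
CLOC_CSV = """language,filename,blank,comment,code
Java,./src/todo/Get.java,10,5,120
Java,./src/todo/New.java,8,2,80
Java,./src/site/Main.java,4,1,40
SUM,,22,8,240
"""

# Read the CSV text as if it were a file on disk.
loc = pd.read_csv(io.StringIO(CLOC_CSV))

# Drop the summary row so that only real source files remain.
loc = loc[loc["language"] != "SUM"]

# Largest files first.
largest = loc.sort_values("code", ascending=False)[["filename", "code"]]
print(largest.to_string(index=False))
```

In a real analysis you would pass the path of the file written by cloc to pd.read_csv instead of the embedded sample text.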

Getting more from your tools
• Format outputs correctly
  cloc ./ --quiet --csv
• Let the tools do basic preprocessing
  git log --no-merges --format=... --since..
• Connect to tools directly via APIs
  See API docs for Jenkins, SonarQube, jQAssistant/Neo4j, YouTrack, Jira, ...

Example: YouTrack
API call: https://youtrack.jetbrains.com/api/issues?$top=1000&fields=created,resolved&query=Type:%20Bug%20Project%3A%20%7BCode%20With%20Me%7D
Website: https://youtrack.jetbrains.com/issues/CWM

[
{"created":1625212572693,"resolved":null,"$type":"Issue"},
{"created":1623332830472,"resolved":null,"$type":"Issue"},
{"created":1616154776583,"resolved":1624642098272,"$type":"Issue"},
{"created":1624357764284,"resolved":1624965797482,"$type":"Issue"},
{"created":1624449874896,"resolved":1625092197222,"$type":"Issue"},
{"created":1625138678733,"resolved":null,"$type":"Issue"},
{"created":1616604945032,"resolved":1625040446461,"$type":"Issue"},
{"created":1624609359186,"resolved":null,"$type":"Issue"}...

In most cases, you can reproduce the results shown in the UI by using the API
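The response above can be turned into a small bug-fixing overview with pandas. This sketch works on the eight rows shown on the slide, embedded as text, and assumes that the timestamps are Unix epoch milliseconds. Fetching the data over HTTP is left out on purpose.

```python
import json

import pandas as pd

# The rows from the API response shown above.
RESPONSE = """[
 {"created":1625212572693,"resolved":null,"$type":"Issue"},
 {"created":1623332830472,"resolved":null,"$type":"Issue"},
 {"created":1616154776583,"resolved":1624642098272,"$type":"Issue"},
 {"created":1624357764284,"resolved":1624965797482,"$type":"Issue"},
 {"created":1624449874896,"resolved":1625092197222,"$type":"Issue"},
 {"created":1625138678733,"resolved":null,"$type":"Issue"},
 {"created":1616604945032,"resolved":1625040446461,"$type":"Issue"},
 {"created":1624609359186,"resolved":null,"$type":"Issue"}
]"""

issues = pd.DataFrame(json.loads(RESPONSE))

# Convert epoch milliseconds to timestamps; unresolved issues become NaT.
issues["created"] = pd.to_datetime(issues["created"], unit="ms")
issues["resolved"] = pd.to_datetime(issues["resolved"], unit="ms")

# Number of bugs that are still open.
open_bugs = int(issues["resolved"].isna().sum())

# Whole days between creation and resolution (NaN for open bugs).
days_to_fix = (issues["resolved"] - issues["created"]).dt.days

print("open bugs:", open_bugs)
print(days_to_fix.describe())
```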

Example
An analysis that uses YouTrack data to track bug fixing
https://www.innoq.com/de/blog/defect-analysis-using-pandas/

Why now? Great data analysis tools for software developers

Great tools
• Production-ready data analysis tools are available as open source software
• Analysis tools can now also handle highly interconnected and huge datasets

Questions Your programming experience Please put a number into the
chat box Programming Experience Python Experience No Experience Pandas Experience 1 2 3 4

Jupyter: Interactive notebook system
• Document-centered analysis platform
• Executable code blocks
• Directly visible outputs / visualizations
https://www.feststelltaste.de/top5-jupyter

Python: Programming language for Data Science
• Simple
• Efficient
• Fast
https://www.feststelltaste.de/top5-python

pandas: Pragmatic data analysis tool
• Like a “programmable Excel worksheet”
• Really fast
• Flexible
• Expressive
• Very good integration with other libraries
https://www.feststelltaste.de/top5-pandas

matplotlib: Visualization library
• Enables the programmatic creation of graphics
• Create bar charts, line charts and more
• Good integration with pandas & co.
• Direct output to Jupyter Notebooks
https://www.feststelltaste.de/top5-matplotlib

Production Coverage Analysis

Which code in which packages is not used?

[Figure: a user uses a Java web application running on an application server; a coverage tool delivers the coverage per class]

PACKAGE,CLASS,LINE_MISSED,LINE_COVERED
org.springframework.samples.petclinic,PetclinicInitializer,0,24
org.springframework.samples.petclinic.model,NamedEntity,1,4
org.springframework.samples.petclinic.model,Specialty,0,1
org.springframework.samples.petclinic.model,PetType,0,1
org.springframework.samples.petclinic.model,Vets,4,0
org.springframework.samples.petclinic.model,Visit,0,12
...
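The coverage rows above are enough to sketch the analysis with pandas. The example below uses only the six rows shown on the slide. The derived column names (lines, ratio) are chosen for this illustration and are not part of the coverage tool's output.

```python
import io

import pandas as pd

# The coverage rows shown above.
COVERAGE_CSV = """PACKAGE,CLASS,LINE_MISSED,LINE_COVERED
org.springframework.samples.petclinic,PetclinicInitializer,0,24
org.springframework.samples.petclinic.model,NamedEntity,1,4
org.springframework.samples.petclinic.model,Specialty,0,1
org.springframework.samples.petclinic.model,PetType,0,1
org.springframework.samples.petclinic.model,Vets,4,0
org.springframework.samples.petclinic.model,Visit,0,12
"""

coverage = pd.read_csv(io.StringIO(COVERAGE_CSV))

# Coverage ratio per class: covered lines divided by all lines.
coverage["lines"] = coverage["LINE_MISSED"] + coverage["LINE_COVERED"]
coverage["ratio"] = coverage["LINE_COVERED"] / coverage["lines"]

# Classes that were never executed in production.
unused = coverage[coverage["LINE_COVERED"] == 0]

# Coverage ratio per package, weighted by the number of lines.
per_package = coverage.groupby("PACKAGE")[["LINE_COVERED", "lines"]].sum()
per_package["ratio"] = per_package["LINE_COVERED"] / per_package["lines"]

print(unused[["PACKAGE", "CLASS"]])
print(per_package.sort_values("ratio"))
```

Summing lines before dividing weights large classes more heavily than small ones. Averaging the per-class ratios instead would be an equally simple alternative with a different meaning.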

Data Science Python Distribution
• All-inclusive package, free of charge!
• Brings everything needed to get started
• Included packages are optimized for each other and for the operating system in use
• Download, install, get started!
Includes: Python, pandas, matplotlib, Jupyter, ...

Python Ecosystem Data Analytics • NumPy • scikit-learn • TensorFlow
• Dask • Py2neo • Pygments Visualization / Presentation • pygal • Bokeh • python-pptx • RISE Other • Scrapy, Selenium, Flask

Hands-On Part 1 48 tutorial/00 Jupyter Notebook and Python basics.ipynb

Why now? Reproducible, open, structured data analysis

Reproducible, open, structured data analysis 50 ✓ Make assumptions explicit
and simplifications transparent ✓ Explain used data and filtering ✓ Motivate summarizations ✓ Share code, data and results Why now? A B

Sweet Spot for Software Analytics 51 Data Science as foundation:
Not over- nor under-engineered Lack of methodology Fixed on a specific methodology Enabling flexible analysis, grounded on proven methodology Technological constraints Free technology selection Strong technological foundation for individual analysis Software Quality Dashboards Custom built on strong foundations 100 % custom built analysis Data Science Your analysis

Data Science Venn diagram by Drew Conway 52 Software developers
are very close to Data Science! Substantive expertise Machine Learning Danger zone! Traditional research Data Science

Data Science and Software Development?
"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."

Why Data Science? 54 Big community • Free online courses,
videos and tutorials • (e.g. DataCamp with over 8 million members) • Direct help for very individual questions • (e.g. Stack Overflow or blog articles) • Continuous learning and learning from others through online competitions • (e.g. Kaggle or similar challenges).

Reuse of proven methodologies
E.g. Roger E. Peng’s “Stages of Data Analysis”
I. Stating Question
II. Exploratory Data Analysis
III. Formal Modeling
IV. Interpretation
V. Communication

Software Analytics Canvas 56

How to approach an analysis?
• Which modules are no longer used in production?
• Coverage data during operation in production
• Test coverage in staging environment
• The measurement of code coverage is representative of the actual usage of the application
• Modules can be derived from the coverage measurements
• A list of modules and their average coverage of the used code

How to approach an analysis? (continued)
• Collect coverage data in production using JaCoCo
• Extract modules from coverage data
• Calculate the coverage ratio of the modules used
• Degree of utilization of the software in production per module
• Remove modules from code which are no longer used in production

Too complex? Simple models win!
• Joining: combine the relevant content of multiple data sources
• Aggregating: present new findings compactly in other perspectives
(This is enough in ~80% of the cases)

Exercise 60 Estimation of the knowledge distribution within a modular
system

Why now? Assumptions and shortcomings can be made explicit

Traditional “scripting approach” 62 ...01011010010101 Data Result Analysis Would you
trust this result?

...01011010010101 Data Result Open analysis Notebook Approach 63

Heuristics
“Heuristics refers to the art of arriving at probable statements or workable solutions with limited knowledge (incomplete information) and little time.”
G. Gigerenzer and P. M. Todd with the ABC Research Group: Simple heuristics that make us smart. Oxford University Press, New York 1999.
[Figure: getting from A to B without heuristics (many open questions) and with heuristics]

Disclose heuristics with notebooks

Heuristics and Notebooks
• Provide analysis notebooks and data
• Start presentation with visualization and key messages (details if needed)
• Group description, code & partial result for each mental step

Rules of Tidy Data in Notebooks
“Separation of Concerns” for better understandability
• One column per variable / information type
• One row for each observation of a variable
• One table for all related variables
• A linking column for each table of an analysis
From Jeff Leek: The Elements of Data Analytic Style
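The first two rules can be illustrated with a small pandas sketch. The table, the developer names and the numbers are hypothetical. The point is only the reshaping from a wide table, where the developer is hidden in the column names, to one column per variable and one row per observation.

```python
import pandas as pd

# Hypothetical "wide" table: the variable "developer" is hidden in the column names.
wide = pd.DataFrame({
    "file": ["Get.java", "New.java"],
    "alice": [3, 0],
    "bob": [1, 5],
})

# Tidy form: one column per variable (file, developer, commits),
# one row per observation (commits of one developer to one file).
tidy = wide.melt(id_vars="file", var_name="developer", value_name="commits")

print(tidy)
```

In the tidy form, questions such as "commits per developer" become a plain groupby on the developer column.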

Example: Modularization Check
https://www.feststelltaste.de/checking-the-modularization-of-software-systems-by-analyzing-co-changing-source-code-files/

Integration with Machine Learning
• pandas uses the NumPy library under the hood
• Many machine learning libraries use NumPy under the hood as well
This allows you to use pandas with most of the machine learning libraries out there!
MLonCode

Example: Modularization Check
Check the existing modularization of a software system and compare it to the change behavior of the development teams.
Goal: Find out if developers change code more within modules than across module boundaries (the latter would indicate that the existing boundaries need to be refined)
→ Uses the machine learning library scikit-learn for the job

I. Stating Question 72 "How well do the modules support
cohesive changes?" Legend A B C A B C change activities A B C components Non-cohesive changes Cohesive changes

I. Stating Question
Heuristics: "Are changes made within a component related?"
• Changes => Commits from version control
• Components => Part of a file path
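The second heuristic ("component = part of a file path") fits in one line of pandas. This is a sketch under the assumption that the directory directly containing a file names its component. The path prefix src/main/java is hypothetical, because the slides only show shortened paths.

```python
import pandas as pd

# Hypothetical change log: one row per file touched by a commit.
changes = pd.DataFrame({
    "commit_id": ["59a26", "59a26", "34af9"],
    "filepath": [
        "src/main/java/todo/Get.java",
        "src/main/java/todo/New.java",
        "src/main/java/site/Main.java",
    ],
})

# Heuristic (assumption): the directory that contains the file is the component.
changes["component"] = changes["filepath"].str.split("/").str[-2]

print(changes)
```

Writing the heuristic down as code makes it easy to challenge and to replace, for example by a fixed path depth or by a mapping table.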

II. Exploratory Data Analysis
Commit and file path

git log --numstat --format=...

filepath            commit_id
.../todo/Get.java   #59a26
.../todo/New.java   #59a26
.../site/Main.java  #34af9
...                 ...
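The table above can be built by parsing the text output of git log. The format string used in the workshop is not shown, so this sketch assumes a hypothetical format that prints "#" followed by the abbreviated commit hash on its own line (for example --format="#%h"), followed by the tab-separated numstat lines. The log text is made-up sample data.

```python
import pandas as pd

# Hypothetical output of `git log --numstat --format="#%h"` (format is an assumption).
GIT_LOG = "\n".join([
    "#59a26",
    "",
    "10\t2\tsrc/todo/Get.java",
    "3\t0\tsrc/todo/New.java",
    "#34af9",
    "",
    "7\t7\tsrc/site/Main.java",
])

rows = []
commit_id = None
for line in GIT_LOG.splitlines():
    if line.startswith("#"):
        # A new commit starts; remember its id for the following file lines.
        commit_id = line[1:]
    elif line.strip():
        # numstat line: added lines, deleted lines, file path (tab-separated).
        added, deleted, filepath = line.split("\t", 2)
        rows.append({"filepath": filepath, "commit_id": commit_id})

changes = pd.DataFrame(rows)
print(changes)
```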

III. Modeling
Pivot table with commits for each file → one vector per file (from now on: pure mathematics)

filepath            ...  #34af9  #35e25  #59a26
.../todo/Get.java   ...  0       1       1
.../todo/New.java   ...  0       1       1
.../site/Main.java  ...  1       0       0
...                 ...  ...     ...     ...
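Such a pivot table can be produced with pandas from the commit and file path pairs. The sketch below uses only the three commits and three files that are visible on the slide; pd.crosstab is one possible way to build the table, not necessarily the one used in the original notebook.

```python
import pandas as pd

# Commit / file pairs matching the visible part of the slide.
changes = pd.DataFrame({
    "commit_id": ["#59a26", "#59a26", "#35e25", "#35e25", "#34af9"],
    "filepath": [
        ".../todo/Get.java",
        ".../todo/New.java",
        ".../todo/Get.java",
        ".../todo/New.java",
        ".../site/Main.java",
    ],
})

# One row per file, one column per commit, cell = number of changes.
commit_matrix = pd.crosstab(changes["filepath"], changes["commit_id"])

print(commit_matrix)
```

Each row of commit_matrix is the change vector of one file.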

III. Modeling
Similarity calculation → Cosine similarity between vectors / files

                    ...  .../site/Main.java  .../todo/New.java  .../todo/Get.java
.../todo/Get.java   ...  0.3                 0.8                1
.../todo/New.java   ...  0                   1                  0.8
.../site/Main.java  ...  1                   0                  0.3
...                 ...  ...                 ...                ...
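Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with plain NumPy instead of the scikit-learn function used in the original analysis. The commit matrix is a hypothetical example, so the numbers differ from the ones on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical commit matrix: rows are files, columns are commits.
commit_matrix = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]],
    index=["todo/Get.java", "todo/New.java", "site/Main.java"],
    columns=["c1", "c2", "c3", "c4"],
)

vectors = commit_matrix.to_numpy(dtype=float)

# Scale every row vector to length 1 ...
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# ... so that the pairwise dot products are the cosine similarities.
similarity = pd.DataFrame(
    unit @ unit.T,
    index=commit_matrix.index,
    columns=commit_matrix.index,
)

print(similarity.round(2))
```

A file that was never changed has a zero vector and would cause a division by zero here, so such rows should be filtered out beforehand.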

IV. Interpretation
Information reduction
• Multidimensional scaling reduces n dimensions to two dimensions while preserving the distances between the points as well as possible
• Component can be extracted from file path
• Example: .../todo/Get.java => todo

filepath            x     y     comp
.../todo/Get.java   0.14  0.67  todo
.../todo/New.java   0.13  0.70  todo
.../site/Main.java  0.31  0.50  site
...                 ...   ...   ...
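To show what the dimension reduction does, here is a self-contained sketch of classical (Torgerson) multidimensional scaling in NumPy. This is a swapped-in, simpler variant and not the iterative MDS implementation of scikit-learn that the original analysis relies on. The distance values and the helper name classical_mds are hypothetical.

```python
import numpy as np

files = ["todo/Get.java", "todo/New.java", "site/Main.java"]

# Hypothetical distances (1 - cosine similarity) between the three files.
distance = np.array([
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.8],
    [0.7, 0.8, 0.0],
])


def classical_mds(d, dims=2):
    """Place points in `dims` dimensions so that their distances approximate d."""
    n = d.shape[0]
    # Double centering turns squared distances into a matrix of dot products.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # The largest eigenvalues and their eigenvectors span the best-fitting space.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:dims]
    scale = np.sqrt(np.maximum(eigenvalues[top], 0.0))
    return eigenvectors[:, top] * scale


coords = classical_mds(distance)

# Component = directory that contains the file (same heuristic as before).
components = [path.split("/")[-2] for path in files]

for path, (x, y), comp in zip(files, coords, components):
    print(f"{path:16} x={x:+.2f} y={y:+.2f} comp={comp}")
```

The absolute coordinates carry no meaning, since any rotation or mirroring of the picture is equally valid. Only the distances between the points matter.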

V. Communication 78 Interactive graphics generation • Files of the
software system => points • Files that are modified together => proximity of the points to each other • Components of the software system => colors of the points

V. Communication 79 Changes across module boundaries Changes within the
module boundary Nearby points = related modified source code files 1 point = 1 source code file (color = subject module)

Modularization Check https://feststelltaste.github.io/software-analytics/notebooks/vis/checking_modularization/dropover.html

Why now? Enrich technical problems with domain knowledge

Creating different perspectives
Examples for perspectives: Sales data, Business subdomains, Usage data, Technical modules, Patterns, Low-level metrics
Visibility to businesspeople: high ... low
Where we are today / Where we should be

Data → Perspectives 83 Example: Results from static code analyzers
com.company.ordersystem.partner.api.rest.OrderExecutor.java 540LOC Pattern Language Technical Aspect Layer Subdomain Subsystem Metric LOC: Lines of code

Data → Perspectives Low code level

Data → Perspectives 85 Domain level

Data → Perspectives 86 Example: Entries from log files 2012-04-12
13:54:34.512 POST /api/order/1432 132ms "Order processed" Usage scenario Technical Aspect Metric Usage data

Data → Perspectives 87 Example: Semi-manual* assignment of information TRA1321
<-> External Legacy Transactions EXE5243 <-> Order Execution Usage scenarios Jobs * There is often documentation that provides information that you can use for linking the technology world and the business world

Hands-On Part 2 88 tutorial/10 pandas and matplotlib basics.ipynb

Strategies for and examples of actionable insights
“Strategies” adopted from Miryung Kim, Tom Zimmermann, Rob DeLine, Andrew Begel: The Emerging Role of Data Scientists on Software Development Teams

Be aware!
A data analysis is not free! Investigate the problem before analyzing the data!
• Listen, inquire, repeat
• Depth-search with "Why?"
• Broad-search with "What else?"
• Investigate only the found hotspot!
Ask yourself: “Is data analysis possible and helpful?”
[Figure: Root cause analysis with depth search]

Right question 91 “There are many more questions to pursue
than you have time and resources for. Choose questions that enable the stakeholders to achieve their goals.” Strategies for actionable insights

Examples of actionable insights: Right question
Bad: “Which developer did most of the commits?”
Typical problems with these kinds of questions:
• Data quality issues / validity?
• Behaviour or performance checking
• Actionable?
• Metric tuning / cheating
“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” Bill Gates
[Figure: Number of commits per developer for the IntelliJ IDEA IDE]

Examples of actionable insights: Right question
Better: “Where do developers work alone in the code base?”
[Figure: Hierarchical circle packing diagram that visualizes the directory structure and the source code files]
• Colored circles show files changed by only a single developer
• White circles show files that were changed by many different developers
• Size of circles corresponds to source code size
Next actions: redocumentation, team reorganization, pair programming, …
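The data behind such a diagram boils down to counting distinct authors per file. A minimal sketch with pandas follows; the change log, the file paths and the author names are hypothetical.

```python
import pandas as pd

# Hypothetical change log: one row per file change with its author.
changes = pd.DataFrame({
    "filepath": ["a/Get.java", "a/Get.java", "a/New.java", "b/Main.java", "b/Main.java"],
    "author": ["alice", "bob", "alice", "carol", "carol"],
})

# Number of distinct developers that ever changed each file.
authors_per_file = changes.groupby("filepath")["author"].nunique()

# Files that were changed by exactly one developer.
solo_files = authors_per_file[authors_per_file == 1].index.tolist()

print(solo_files)
```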

Examples of actionable insights: Right question
My favorite: “Which code is used by the users of our application?”
[Figure: Hierarchical circle packing diagram that visualizes the directory structure and the source code files]
• Deep blue circles show source code files with code that wasn’t executed at all
• Deep red circles show source code files with code that was heavily used
• Size of circles corresponds to source code file size
Next actions: insert asserts at the entry points of dead code parts, delete complete parts of code

Iterate 95 “Iterate with the stakeholders to interpret the data
and to identify and refine important questions and scenarios.” Strategies for actionable insights

Iterate 96 Examples of actionable insights The question „Where do
developers work alone in the code base?“ was answered by analyzing the developers’ changes to the code, but • What about code reviews? • What about mob programming? • What about code reading? Refine the analysis when a more detailed answer becomes important! Hierarchical circle packing diagram with an overview of where developers work alone in the code base

Multiple sources 97 “Triangulate multiple data sources to increase the
confidence in the analysis results.” Strategies for actionable insights

Multiple sources
Examples of actionable insights
Hot spot analysis: Where is complex code that is changed frequently?
Source code + version control
[Figure: example taken from Adam Tornhill’s book “Software Design X-Rays”]
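Combining the two sources is a join followed by a ranking. The sketch below assumes lines of code as a rough stand-in for complexity and the number of changes per file as the change frequency. All file names and numbers are hypothetical.

```python
import pandas as pd

# Source 1 (source code): size per file as a rough stand-in for complexity.
sizes = pd.DataFrame({
    "filepath": ["todo/Get.java", "todo/New.java", "site/Main.java"],
    "code": [500, 50, 300],
})

# Source 2 (version control): one row per file change.
changes = pd.DataFrame({
    "filepath": ["todo/Get.java"] * 5 + ["todo/New.java"] * 5,
})

# Count the changes per file.
change_count = changes.groupby("filepath").size().rename("changes").reset_index()

# Join both sources; files without any change get a count of 0.
hotspots = sizes.merge(change_count, on="filepath", how="left")
hotspots["changes"] = hotspots["changes"].fillna(0).astype(int)

# Frequently changed files first; larger files first among ties.
hotspots = hotspots.sort_values(["changes", "code"], ascending=False)

print(hotspots)
```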

IntelliJ IDEA Analysis
https://github.com/feststelltaste/software-analytics/blob/master/demos/IntelliJ%20IDEA%20Analysis.ipynb

Translate 100 “Translate analysis results to familiar concepts that are
important for the stakeholders’ decisions.” Strategies for actionable insights

Translate
Examples of actionable insights: Strategic Redesign
Displaying business subdomains with production usage (“utilization”) and code changes (“investments”) measures on a 2x2 matrix
https://www.feststelltaste.de/swot-analysis-for-spotting-worthless-code/

Translate Identifying active communities around open source software by analyzing
the frequencies of discussion on the internet (e.g. Stack Overflow) Examples of actionable insights Open Source Software Evaluation

Exercise 103 Analysis of community activities for version control systems

Plan for scale 104 “Many stakeholders want to deploy predictive
models as part of the product. Embrace your role in the entire end-to-end scenario.” Strategies for actionable insights

Plan for scale Examples of actionable insights Improvement Dashboards Visualizing
the value of the improvement work Days since last failure 143 Today‘s orders 41 Ø Orders 54 Active User 14 Completed tech debt tickets 4/15 Completion DB migration 78% Version control Log files System monitoring Issues tracker Business monitoring Data warehouse Sources of data Current state of the legacy system

106 Improving quality in a data-driven way using Software Analytics

Example overview
• Descriptive analysis: For every click in the application, there is an average of 250 DB calls and 50 service calls
• Exploratory analysis: CustomerMapper also reloads all contract data when converting CustomerDBO to CustomerBO
• → Inductive analysis: Mapping code reloads data via loops and traversing relationships.
• Tracing analysis: This list of mappers for the following views already uses the new (D)BOs
DBO: DataBase Objects, BO: Business Objects

Descriptive Analysis 108 Metrics allow us to measure different aspects
in our software, its environment and the people involved in it. Metrics Aspect Measurement result System Developers For every click in the application, there is an average of 250 DB calls and 50 service calls

Descriptive Analysis
Make sense of fine-grained data by putting it into a perspective that non-technical people can reason about.
Metrics → Measurement result → Translate
• Developers: For every click in the application, there is an average of 250 DB calls and 50 service calls
• Decision makers: Every time a customer is fetched, there are 150 DB calls and 30 service calls

Exploratory → Inductive Analysis 110 • 1 x perform analysis
by hand • Reflect and abstract • Re-code (= automate analysis) CustomerMapper also reloads all contract data when converting CustomerDBO to CustomerBO Mapping code reloads data via loops and traversing relationships. A C B E D A 1 finding n findings

Identification of performance hotspots via call tree analysis 111 Examples
from the trenches Raw data (“call trees”) from a Java performance profiler tool (JProfiler) https://www.feststelltaste.de/performance-hotspots/ From single observations…

Identification of performance hotspots via call tree analysis 112 Examples
from the trenches API calls to the backend Entry point into the application Hot spot metric (e.g. number of DB calls) Starting point to the bottleneck Which parts of the application load data they don’t need? e.g. /petclinic/owners/ e.g. showOwners() https://www.feststelltaste.de/performance-hotspots/ …to all root causes!

Exploratory → Inductive Analysis
Result: A list of root causes + recipes to fix them
→ Very effective in many cases!

Addressing architectural problems in larger code bases Examples from the
trenches Root Causes Recipe *example has been changed because of confidentiality reasons

Tracing Analysis
Trace work in progress during cleanup / migrations
This list of mappers for the following views already uses the new (D)BOs

Tracing improvement work over time
Examples from the trenches: Framework migrations
AngularJS with JavaScript → AngularJS with TypeScript → Angular with TypeScript
[Chart: number of files per framework variant]

Tracing improvement work over time
Examples from the trenches: Architecture refactoring
Structureless → Hexagonal Architecture
https://www.innoq.com/en/blog/visualizing-progress-of-refactoring-into-hexagonal-architecture-using-jqassistant/

Hands-On Part 3 118 tutorial/20 Time Series and Grouping.ipynb

Graph Data Science Using graph-based analysis to get even deeper
insights into quality issues

Motivation for Graph Analytics
Data from software systems…
• is not always available in tabular form (so-called "panel data")
• is strongly interconnected within a data source (code dependencies, call hierarchies, commits...)
• can often be related via different data sources (e.g. code with static code analysis and commits)

Graph-based integration 121 • Different data sources can be related
to each other • New concepts / perspectives can be projected onto fine granular data • Summaries of many details into a more interpretable overall view possible 2016-09-31 78,5% PetDBO Coverage Commits God Class Code Smells Controller Pattern Patterns Pet Code

How does it work? 122 1. Scan software structures 2.
Save data to a graph database 3. Execute queries • Investigate relationships • Add perspectives (= concepts) and rules • Validate rule violations (= constraints) 4. Generate reports jQAssistant Core ideas for architecture validation Advanced topic

Concepts 123 Elements are assigned to concepts (= an abstract
idea like a pattern) Examples • Maven Project = Module • Java Package = Layer • Class annotated with @Entity = JPA* entity jQAssistant Core ideas for architecture validation *JPA: Java Persistence API, standard to use databases in Java Advanced topic

Definition of a Concept

== JPA Entities

[[jpa2:Entity]]
Any type annotated with @javax.persistence.Entity is a JPA entity.

[source,cypher,role=concept]
----
MATCH
  (t:Type)-[:ANNOTATED_BY]->()-[:OF_TYPE]->(a:Type)
WHERE
  a.fqn = "javax.persistence.Entity"
SET
  t:Jpa:Entity
RETURN
  t AS Entity
----

Constraint 125 Rules are defined based on the concepts •
Example: "All database entities must be in the persistence layer". Constraints (= deviations from the rules) are checked • Example: "Give me all database entities that are not in the persistence layer." jQAssistant Core ideas for architecture validation Advanced topic

Definition of a Constraint

[[model:JpaEntityInModelPackage]]
All JPA entities must be located in the persistence layer with the package name "model".

[source,cypher,role=constraint,requiresConcepts="jpa2:Entity"]
----
MATCH
  (package:Package)-[:CONTAINS]->(entity:Jpa:Entity)
WHERE
  package.name <> "model"
RETURN
  entity AS EntityInWrongPackage
----

Existing scanners
ZIP GZ *.class JAR, WAR, EAR MANIFEST.MF *.properties XSD YAML XML application.xml web.xml beans.xml JaCoCo FindBugs CheckStyle Maven Test Reports RDBMS schema M2 Repository CSV … JSON Git
Arbitrarily expandable thanks to plug-in architecture

Perspectives with Graphs :Class Method :Field 2 findings 5 changes
:Entity 100% usage birthDate name @Entity @Table(name = "pets") public class Pet { Technical World Subdomain Pattern

Perspectives with Graphs 129 5 types 39 findings 51 changes
80% usage Business World 16 types 17 findings 15 changes 70% usage

Typical usage examples 130 • Detection of programming errors •
Common changeable state • Complex logic in UI • Performance HotSpot Mining • Lazy loading / N+1 query problems • Identification of problematic frameworks • Improvement of legacy code • Production Coverage / Code Decommissioning • Effort estimation Rearchitecting / Modularization

131 Example Spring PetClinic Java Application

Example: Spring PetClinic 132 https://github.com/JavaOnAutobahn/spring-petclinic A Java application for managing
a small clinic for pets

Example Graph 133 Spring PetClinic visibility: public name: Pet fqn:
org.springframework.samples.petclinic.model.Pet … Class:Type:Java:File

Example Graph: Spring PetClinic

public class Pet {

    private LocalDate birthDate;

    public LocalDate getBirthDate() {
        return this.birthDate;
    }

    public void setBirthDate(LocalDate birthDate) {
        this.birthDate = birthDate;
    }
}

Hands-On Part 4.1 135 https://github.com/feststelltaste/software-graph-analytics-workshop

Example Graph 136 Spring PetClinic We can query this!

137 Cypher Like SQL, but for Graph Databases

A Graph in Detail

A Graph Node Edge

Graph in Detail Nodes Examples: • Java class • Measure
• File • Repository • URL • ...

Graph in Detail Edges - Types of a relationship Examples:
• DECLARES • EXTENDS • IMPLEMENTS • ANNOTATED_BY • DEPENDS_ON • ...

A Graph – with data*! File File Class Class Type
Type File File value key “entity“ name “Entity.java” file “jpa.thing.Entity” fqn value key 3 weight value key “Pet” name “Pet.java” file “foo.bar.Pet” fqn Labels Properties *therefore also called "Property Graph Model": Data can be stored at nodes and edges

Graph in Detail Labels – „Tags“ of a node Examples:
• :Class • :Type • :Java • :File • ... File File

Graph in Detail Properties – Data of a node or
edge Examples: • fqn (fully qualified name) • name • visibility • abstract • static • final • ... value key “Pet” name “Pet.java” file “foo.bar.Pet” fqn

Structure of Cypher Queries

Structure of Cypher Queries Basic ASCII-art syntax

Structure of Cypher Queries Mapping to Graph Visualization

Structure of Cypher Queries Nodes and Edges ()-[]->()

Structure of Cypher Queries Variables (p)-[]->(e)

Structure of Cypher Queries Labels (p:Class)-[]->(e:Type) Type Type Class Class
… … … …

Structure of Cypher Queries Properties and Values (p:Class {name:"Pet"})-[]->(e:Type {name:"Entity"})
value key “Entity” name … … value key “Pet” name … … Type Type Class Class … … … …

Structure of Cypher Queries
Query and return nodes as results

MATCH (p:Class {name:"Pet"})-[]->(e:Type {name:"Entity"})
RETURN p, e

Structure of Cypher Queries
Query and return properties as results

MATCH (p:Class {name:"Pet"})-[]->(e:Type {name:"Entity"})
RETURN p.name, e.name

Structure of Cypher Queries
Query nodes using properties in WHERE clause

MATCH (p:Class)-[]->(e:Type)
WHERE p.name = "Pet" AND e.name = "Entity"
RETURN p, e

Less performant, but easier to read and learn at the beginning

Structure of Cypher Queries
Specification of the relationship type

MATCH (p:Class)-[:DEPENDS_ON]->(e:Type)
WHERE p.name = "Pet" AND e.name = "Entity"
RETURN p, e

Examples of Cypher Queries

Examples of Cypher Queries 158 List all types with the
most methods MATCH (t:Type)-[:DECLARES]->(m:Method) RETURN t.fqn as type, COUNT(m) as methods ORDER BY methods DESC

Examples of Cypher Queries
List all writable static variables that are written to
MATCH (c:Class)-[:DECLARES]->(f:Field)<-[w:WRITES]-(m:Method)
WHERE EXISTS(f.static) AND NOT EXISTS(f.final)
RETURN c.name as InClass, m.name as theMethod,
       w.lineNumber as writesInLine, f.name as toStaticField

Examples of Cypher Queries
Counting changes, aggregated over domain subareas
MATCH (t:Type)-[:BELONGS_TO]->(s:Subdomain),
      (t)-[:HAS_CHANGE]->(ch:Change)
RETURN s.name as ASubdomain,
       COUNT(DISTINCT t) as Types,
       COUNT(DISTINCT ch) as Changes
ORDER BY Types DESC
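The DISTINCT in this query matters because the two patterns share the variable t: the match yields one row per (subdomain, type, change) combination, so a type with several changes appears several times. A small sketch of the same aggregation in plain Python, with made-up sample data (the type names, subdomain names, and change ids are hypothetical):

```python
from collections import defaultdict

# Hypothetical sample data: the subdomain of each type, and (type, change) pairs.
belongs_to = {"Pet": "clinic", "Visit": "clinic", "Owner": "customers"}
has_change = [("Pet", "c1"), ("Pet", "c2"), ("Visit", "c2"), ("Owner", "c3")]

# One row per (subdomain, type, change) combination, like the MATCH result.
rows = [(belongs_to[t], t, ch) for t, ch in has_change if t in belongs_to]

types = defaultdict(set)
changes = defaultdict(set)
for subdomain, t, ch in rows:
    types[subdomain].add(t)      # COUNT(DISTINCT t)
    changes[subdomain].add(ch)   # COUNT(DISTINCT ch)

# ORDER BY Types DESC
result = sorted(
    ((s, len(types[s]), len(changes[s])) for s in types),
    key=lambda row: row[1],
    reverse=True,
)
for row in result:
    print(row)
```

The subdomain `clinic` has three rows but only two distinct types and two distinct changes; counting rows without DISTINCT would report three for both.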

Examples of Cypher Queries
Aggregation of multiple measurement results across functional areas
MATCH (t:Type)-[:BELONGS_TO]->(s:Subdomain),
      (t)-[:HAS_CHANGE]->(ch:Change),
      (t)-[:HAS_MEASURE]->(co:Coverage)
OPTIONAL MATCH (t)-[:HAS_BUG]->(b:BugInstance)
RETURN s.name as ASubdomain,
       COUNT(DISTINCT t) as Types,
       COUNT(DISTINCT ch) as Changes,
       AVG(co.ratio) as Coverage,
       COUNT(DISTINCT b) as Bugs,
       SUM(DISTINCT t.lastMethodLineNumber) as Lines
ORDER BY Coverage ASC, Bugs DESC

Integration with Jupyter Notebook and pandas

Neo4j, Cypher & Jupyter Notebook
• Cypher extension: https://github.com/versae/ipython-cypher
• Alternative: Cypher kernel, https://github.com/HelgeCPH/cypher_kernel

Jupyter Notebook + py2neo
1. Import libraries
2. Connect to the running Neo4j instance
3. Submit the query in Cypher
4. Convert the result to a DataFrame
5. The result from the graph database is displayed
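The five steps could look roughly like the following sketch. The connection URI, user name, and password are placeholder assumptions for a local Neo4j instance, the function name `query_to_dataframe` is made up for this example, and the query is the "types with the most methods" example from the earlier slide. `Graph`, `run()` and `to_data_frame()` are py2neo calls; the import sits inside the function only so that the snippet loads without py2neo installed.

```python
QUERY = """
MATCH (t:Type)-[:DECLARES]->(m:Method)
RETURN t.fqn AS type, COUNT(m) AS methods
ORDER BY methods DESC
"""


def query_to_dataframe(uri="bolt://localhost:7687", user="neo4j", password="neo4j"):
    """Run QUERY against a running Neo4j instance and return a pandas DataFrame."""
    # 1. Import libraries (lazily, so this module loads without py2neo installed)
    from py2neo import Graph

    # 2. Connect to the running Neo4j instance (URI and credentials are assumptions)
    graph = Graph(uri, auth=(user, password))
    # 3. Submit the query in Cypher
    cursor = graph.run(QUERY)
    # 4. Convert the result to a DataFrame (requires pandas)
    return cursor.to_data_frame()


# 5. In a notebook cell, the returned DataFrame is displayed as the cell result:
# query_to_dataframe().head()
```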

Roundtrip pandas ←→ Neo4j
• pandas DataFrame as input
• Neo4j output as DataFrame
• Import code into Neo4j with py2neo

Example: Strategic Redesign
[Figure: a user works with the web application on the application server, which yields usage per class (coverage). A developer builds and runs the source code from the version control system, which yields changes per class. Both feed into the analysis.]
Improve source code that is actually used

Strategic Redesign
https://www.feststelltaste.de/swot-analysis-for-spotting-worthless-code/

Result of Strategic Redesign
Improve source code that is actually used
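One way to combine the two measurements, usage per class and changes per class, is a simple quadrant classification. The following is a sketch under stated assumptions, not the analysis from the linked article: the sample values, both thresholds, and the four category names are hypothetical and would need tuning per project.

```python
# Hypothetical measurements per class.
usage = {"Pet": 0.9, "Owner": 0.7, "LegacyReport": 0.0, "Vet": 0.05}  # share of code executed in production
changes = {"Pet": 42, "Owner": 3, "LegacyReport": 17, "Vet": 1}       # commits touching the class

USAGE_THRESHOLD = 0.5    # assumed cut-off
CHANGE_THRESHOLD = 10    # assumed cut-off


def classify(cls):
    """Place a class into one of four quadrants by usage and change frequency."""
    used = usage.get(cls, 0.0) >= USAGE_THRESHOLD
    changed = changes.get(cls, 0) >= CHANGE_THRESHOLD
    if used and changed:
        return "improve"           # used and often changed: improvement pays off
    if used:
        return "keep stable"       # used but rarely changed
    if changed:
        return "question effort"   # often changed but hardly used
    return "removal candidate"     # neither used nor changed


for cls in sorted(usage):
    print(cls, "->", classify(cls))
```

Classes in the "improve" quadrant correspond to the slide's message: invest in source code that is actually used.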

Graph-based analysis: conclusion
• jQAssistant provides fast and flexible views for a variety of graph-like software structures
• Data can be explored interactively via the graph database Neo4j
• Rules can be defined as concepts and constraints in Cypher / AsciiDoc for continuous checking

More details about jQAssistant/Neo4j
https://easychair.org/publications/preprint/893N

Links for jQAssistant
• jQAssistant: http://jqassistant.org
• Talk: https://www.youtube.com/watch?v=kQr2c7yWbEA
• Spring PetClinic sample project
  • Repo: http://github.com/JavaOnAutobahn/spring-petclinic
  • Example output: https://buschmais.github.io/spring-petclinic
• TOP5 Learning jQAssistant: https://www.feststelltaste.de/top5-jqassistant/

Visualization
Make problems visible for decision-makers

Python Visualization Landscape
Source: https://pyviz.org/overviews/index.html

matplotlib
Visualization library for Python
• Very well suited for simple (intermediate) visualizations
• Direct generation of diagrams from a pandas DataFrame using the plot() function
• However, more elaborate graphics require extensive configuration code
[Figures: bar chart for knowledge distribution; scatter plot for code hotspots. Graphics from https://matplotlib.org/3.1.1/gallery/index.html]

D3.js
JavaScript-based visualization library
• Well suited for highly networked data
• Output of a JSON file with pandas (and possibly post-processing with Python)
• Use an existing D3 visualization as the basis for a template
[Figures: force-directed graph layout; hierarchical edge bundling; circle packing]
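The "post-processing with Python" step often means turning a flat result table into the nested structure that hierarchical D3 layouts such as circle packing read. A minimal sketch, assuming the name/children/value shape used by many D3 hierarchy examples; the sample records and the helper name `to_hierarchy` are hypothetical:

```python
import json

# Hypothetical flat analysis result: one record per source file.
records = [
    {"path": "src/main/Pet.java", "lines": 120},
    {"path": "src/main/Owner.java", "lines": 80},
    {"path": "src/test/PetTest.java", "lines": 40},
]


def to_hierarchy(records, root_name="root"):
    """Nest flat path records into a name/children/value tree."""
    root = {"name": root_name, "children": []}
    for record in records:
        node = root
        *directories, filename = record["path"].split("/")
        for directory in directories:
            # Reuse the directory node if it already exists, otherwise create it.
            child = next((c for c in node["children"] if c["name"] == directory), None)
            if child is None:
                child = {"name": directory, "children": []}
                node["children"].append(child)
            node = child
        # Files become leaves that carry the measured value.
        node["children"].append({"name": filename, "value": record["lines"]})
    return root


tree = to_hierarchy(records)
print(json.dumps(tree, indent=2))
```

With pandas, the input list could come from `df.to_dict("records")`, and the JSON string would be written to the file that the D3 template loads.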

pygal
Library for interactive visualizations in Python
• Out-of-the-box interaction possibilities with the displayed data (e.g. mouseover)
• Requires some simple preparation of the data from a pandas DataFrame
[Figures: gauges; treemap; XY plot]

Tips for the start
Use "picture search" to see what's possible!
Galleries and Google image search provide visualization ideas and help you get from a graphical representation to the code that produces it

Tips for the start
Get inspired by example code
Thanks to a big community, Stack Overflow and blogs already have answers for many details (especially for matplotlib)

Effective visualizations
• Present little information in an understandable way
• Also visualize intermediate results
• Generate graphics programmatically from results
Python code example: top10_authors.plot.pie()
[Figure: the resulting pie chart]
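A sketch of where something like `top10_authors` might come from. The commit list is made-up sample data and the helper name `plot_top_authors` is hypothetical; the counting uses only the standard library, while the plotting function needs pandas and matplotlib and imports pandas lazily so that the counting part runs without them.

```python
from collections import Counter

# Hypothetical commit log: one author name per commit.
commit_authors = ["alice", "bob", "alice", "carol", "alice", "bob"]

# Count commits per author and keep the ten most active ones.
top10 = Counter(commit_authors).most_common(10)
print(top10)


def plot_top_authors(top_authors):
    """Draw the pie chart from the slide; needs pandas and matplotlib."""
    import pandas as pd  # imported lazily so the counting above runs without pandas

    top10_authors = pd.Series(dict(top_authors), name="commits")
    return top10_authors.plot.pie()
```

When the commit log is already a DataFrame, `log["author"].value_counts().head(10)` gives a comparable Series directly (the column name `author` is an assumption).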

Software Analytics Kata

Closing Words

Software Analytics Maturity Model
Are you ready for Software Analytics?
• Known: awareness of the value of software data is present
• Used: software data has already been used to show a problem
• Defined: existing data sources are known and analysis tools are standardized (as far as reasonable)
• Repeatable: teams use the same approach to analysis and can use it across teams
• Integrated: the use of data analysis tools in development, as well as the creation of new data sources, is a matter of course

Ethics
• There are restrictions and laws (and workers' councils)
• Measurements can be cheated
• Beware! Raw measurements have no context
• Management might see something in there that's not there
"Tracking individual performance can create a morale issue, which perversely could bring down overall productivity." (Ciera Jaspan, Caitlin Sadowski: No Single Metric Captures Productivity)

Final quotes = food for thought
"Not everything that can be counted counts, and not everything that counts can be counted." (William Bruce Cameron)
"All models are wrong, but some are useful." (George Box)

Summary
• Methods and tools are here
• Problems can be communicated
• Best practices help you get started
Data-driven improvement is possible!

More information on the topic
Literature
• Adam Tornhill: Software Design X-Rays
• Wes McKinney: Python for Data Analysis
• Jeff Leek: The Elements of Data Analytic Style
• Christian Bird, Tim Menzies, Thomas Zimmermann: The Art and Science of Analyzing Software Data
• Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering
Software
• Python Data Science Distribution: anaconda.com
• GitHub repo: github.com/feststelltaste/software-analytics
• Mini-Tutorial: github.com/feststelltaste/software-analytics-workshop-guided

My microsite about Software Analytics
softwareanalytics.de

Practicing Software Analytics
Self-study cheatbooks: more content for you to get started
https://github.com/feststelltaste/software-analytics-workshop-guided

Practicing Software Analytics
Software Analytics Katas: small challenges for you to analyze
https://github.com/feststelltaste/software-analytics-katas
You only get the analysis question and the dataset. You need to find out the rest by yourself!

Practicing Software Analytics
Local installation guide for your personal data analysis platform
https://www.feststelltaste.de/ddiosqig/
Explains what you need to do to work on your own analysis on your own machine

Questions
What other questions are there? SQUID*!
[Figure: SQUID diagram that branches from a topic into questions (Q), each followed by answers (A) and further questions]
* Sequential Question and Insight Diagram

Alternative formats of this workshop
• Software Analytics 5h mobshop* (like this one, but 95% coding-focused)
• Software Analytics 2-day workshop (+ katas & graph analytics)
* https://www.innoq.com/en/blog/collaborative-learning-with-mobshops/

Other workshops with me
• iSAQB Foundation Level (basic training for software architects)
• iSAQB IMPROVE (software evolution and architecture improvements)

Thank you!
Markus Harrer
markusharrer.de

innoQ Deutschland GmbH
www.innoq.com
Krischerstr. 100, 40789 Monheim, +49 2173 3366-0
Ohlauer Str. 43, 10999 Berlin
Ludwigstr. 180E, 63067 Offenbach
Kreuzstr. 16, 80331 München
Hermannstrasse 13, 20095 Hamburg
Erftstr. 15-17, 50672 Köln
Königstorgraben 11, 90402 Nürnberg
