Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j (Neo4j Online Meetup)

Let’s tackle problems in software development in an automated, data-driven and reproducible way!

As developers, we often feel that there might be something wrong with the way we develop software. Unfortunately, a gut feeling alone isn’t sufficient for the complex, interconnected problems in software systems.

We need solid, understandable arguments to win budgets for improvement projects or to defend ourselves against political decisions. But we can help ourselves: every step in the development or use of software leaves valuable digital traces. With clever analysis, this data can show us the root causes of problems in our software and deliver new insights that everybody can understand.

If concrete problems and their impact are known, developers and managers can create solutions and take sustainable actions aligned to existing business goals.

In this meetup, I talk about the analysis of software data by using a digital notebook approach. This allows you to express your gut feelings explicitly with the help of hypotheses, explorations and visualizations step by step.

I show the collaboration of open source analysis tools (Jupyter, Pandas, jQAssistant and, of course, Neo4j) to inspect problems in Java applications and their environment. We have a look at performance hotspots, knowledge loss and worthless code parts – completely automated from raw data up to visualizations for management.

Participants learn how to use data analysis to turn vague gut feelings into solid evidence for obtaining budgets for dedicated improvement projects.

Markus Harrer

November 23, 2017

Transcript

  1. Software Analytics with Jupyter, Pandas, jQAssistant and Neo4j Identifying Problems

    in Software Development with Data Analysis Markus Harrer @feststelltaste Neo4j Online Meetup 23rd November 2017
  2. Markus Harrer Software Development Analyst Key Activities Java Development, Data

    Analysis in Software Development Areas of Interest Clean Code, Agile, Software Archeology, Software Revival, Epistemology, Cognitive Psychology @feststelltaste feststelltaste.de About me
  3. Agenda 1. Motivation 2. Software Analytics 3. My impl of

    Software Analytics 4. Examples & Demos 5. Summary 6. Q&A
  4. WALL OF IGNORANCE Janelle Klein: IDEAFLOW - How to Measure

    the PAIN in Software Development. Leanpub
  5. WALL OF IGNORANCE RISK VISIBILITY Janelle Klein: IDEAFLOW - How

    to Measure the PAIN in Software Development. Leanpub
  6. Software Analytics is... “... analytics on software data for managers

    and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions.” Tim Menzies, Thomas Zimmermann: Software Analytics - So What?. IEEE Software Magazine
  7. Frequency Questions Use standard tools for everyday questions Use Software

    Analytics to tackle high-risk problems Risk / Value Right Insights for better Decisions Adapted from Tim Menzies, Thomas Zimmermann: Software Analytics - So What?. IEEE Software Magazine
  8. Types of Software Data Community chrono- logical Runtime static =>

    Problems are interconnected, so should be the data sources!
  9. Why does it work now? • Domain-Driven Design brings business

    language into code • Data Science enables problem analysis for developers • New Tools can create high-level concepts Code Problems Business Language abstract detailed Problems can be connected to concepts in business terms!
  10. My impl of Software Analytics How can Developers use the

    Power of Data Analysis in their Daily Work?
  11. What can you do today? • Visualize developer contributions over

    time • Identify unused, error-prone or abandoned code • Create a code and problem inventory for legacy systems • Find performance bottlenecks by analyzing call trees • Visualize unwanted dependencies between modules Make specific problems in your software system visible! e.g. Race Conditions, Architecture Smells, Build Breakers, Programming Errors
  12. Choose known tools or tools for plan B* Python Neo4j,

    Pandas, Spark * want to learn / profit from in near future on a suitable platform. Jupyter, Zeppelin => Tools shouldn't stand in the way!
  13. Notebook an open dialog with data Context Idea Analysis Conclusion

    Problem Context documented Ideas, assumptions and heuristics communicated Preprocessing justified Calculations understandable Summaries conclusive Everything automated
  14. Python Data Scientist's Best Friend: Easy, effective, fast programming language

    Pandas Pragmatic Data Analysis Framework: Great data structures & integrations with machine learning libraries D3 Visualization Library for Data-Driven Documents: Just beautiful, interactive graphics! Jupyter Interactive Notebook: Central hub for data analysis and documentation Basic Tooling
  15. Advanced Tooling: jQAssistant & Neo4j Main Ideas • Scan software

    structures • Store data in Neo4j database • Execute queries • Examine relationships • Add high-level concepts • Validate rules via constraints • Generate reports
  16. jQAssistant – Use Cases Java Class Business' Subdomain Living, self-validating

    architecture documentation + Find design & code smells + Add business perspectives
  17. Neo4j Schema for Software Data Node Labels File Class Method

    Commit Relationship Types CONTAINS DEPENDS_ON INVOKES CONTAINS_CHANGE Properties name fqn signature message File Java key value name “Pet” fileName “Pet.java” fqn “foo.bar.Pet” Type File
  18. Cypher Query Example Spring PetClinic “Give me all database objects”

    MATCH (t:Type)-[:ANNOTATED_BY]->()-[:OF_TYPE]->(a:Type)
    WHERE a.fqn="javax.persistence.Entity"
    RETURN t AS JpaEntity
  19. Toolchain Python, Jupyter XML/Graph Tables Text Data Pandas jQAssistant Input

    Pandas, Neo4j Analysis matplotlib xlsx E pptx P Output D3
  20. Example JaCoCo → Pandas → D3 Production Coverage 1. Measure

    code coverage in production 2. Calculate ratio of covered lines to all lines 3. Visualize “usage hotspots” with hierarchical bubble chart https://www.feststelltaste.de/visualizing-production-coverage-with-jacoco-pandas-and-d3/
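Step 2 of this slide can be sketched in Pandas. This is only a minimal illustration, not the code from the linked blog post: the sample rows are invented, and the column names `PACKAGE`, `CLASS`, `LINE_MISSED` and `LINE_COVERED` assume the layout of a JaCoCo CSV report.

```python
import pandas as pd

# Hypothetical sample rows, shaped like a JaCoCo CSV report
# (only the columns needed for the ratio are included).
coverage = pd.DataFrame({
    "PACKAGE": ["foo.bar", "foo.bar", "foo.util"],
    "CLASS": ["Pet", "Owner", "Helper"],
    "LINE_MISSED": [2, 10, 30],
    "LINE_COVERED": [18, 10, 0],
})

# Total number of measured lines per class.
coverage["lines"] = coverage["LINE_MISSED"] + coverage["LINE_COVERED"]

# Ratio of covered lines to all lines; a ratio near 0 marks code that
# was rarely or never executed while coverage was being recorded.
coverage["ratio"] = coverage["LINE_COVERED"] / coverage["lines"]

# Candidates for "unused code": classes with no covered lines at all.
unused = coverage[coverage["ratio"] == 0]["CLASS"].tolist()
```

With real data, `coverage` would presumably be loaded via `pd.read_csv` from the report file; the ratio column is then what a bubble chart could use for coloring.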
  21. Example Git → Pandas → D3 Knowledge Island* 1. Take

    Git log with numstats 2. Calculate proportional contributions for each source code file per author 3. Visualize “ownership” with hierarchical bubble chart * heavily inspired by Adam Tornhill https://www.feststelltaste.de/knowledge-islands/
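The proportional contribution from step 2 could look roughly like this in Pandas. It is a sketch under assumptions: the rows stand in for an already parsed `git log --numstat` output, the author and file names are invented, and using added lines as the measure of contribution is one possible choice rather than necessarily the one used in the talk.

```python
import pandas as pd

# Hypothetical, already parsed numstat rows: one row per changed file per commit.
changes = pd.DataFrame({
    "author": ["alice", "bob", "alice", "bob", "alice"],
    "file": ["Pet.java", "Pet.java", "Owner.java", "Owner.java", "Pet.java"],
    "additions": [60, 10, 10, 30, 30],
})

# Sum of added lines per file and author.
per_author = changes.groupby(["file", "author"])["additions"].sum()

# Divide by the total added lines of each file to get each author's share.
share = per_author / per_author.groupby(level="file").transform("sum")

# The author with the largest share per file is the likely "knowledge owner".
owners = share.groupby(level="file").idxmax().map(lambda key: key[1])
```

Files where a single author holds nearly the whole share are the "knowledge islands" the slide refers to.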
  22. Example jQAssistant → Neo4j → Pandas → D3 Dependency Analysis

    between Bounded Contexts https://www.feststelltaste.de/a-graphical-approach-towards-bounded-contexts/
  23. Example jQAssistant → Neo4j → Pandas → D3 Dependency Analysis

    between Bounded Contexts
    MATCH (s1:Subdomain)<-[:BELONGS_TO]-(type:Type)-[r:DEPENDS_ON*0..1]->(dependency:Type)-[:BELONGS_TO]->(s2:Subdomain)
    RETURN s1.name as type, s2.name as dep, COUNT(r) as number
    https://www.feststelltaste.de/a-graphical-approach-towards-bounded-contexts/
    Subdomains => Bounded Contexts that have meaning to business!
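Once the query result is available as a table, Pandas can turn it into a dependency matrix. The following is a hedged sketch: the rows are invented and only mimic the three returned columns (`type`, `dep`, `number`); how the result is fetched from Neo4j is left out on purpose.

```python
import pandas as pd

# Hypothetical result rows with the columns returned by the query above.
deps = pd.DataFrame({
    "type": ["owner", "owner", "visit", "vet"],
    "dep": ["owner", "pet", "pet", "vet"],
    "number": [12, 5, 3, 7],
})

# Subdomain-by-subdomain matrix: rows depend on columns, 0 where no dependency exists.
matrix = deps.pivot_table(
    index="type", columns="dep", values="number", aggfunc="sum", fill_value=0
)

# Dependencies that cross a subdomain boundary are the interesting ones
# when checking whether bounded contexts are cleanly separated.
cross = deps[deps["type"] != deps["dep"]]
```

A matrix like this is a convenient input for a chord or matrix diagram in D3, which is the kind of visualization the toolchain slide lists as output.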
  24. Example JProfiler → jQAssistant → Neo4j → Pandas Mining performance

    hotspots 1. Record Call Trees 2. Identify which parts of the application code are responsible for most of the DB operations 3. Trace problems back to the root causes https://www.feststelltaste.de/mining-performance-hotspots-with-jprofiler-jqassistant-neo4j-and-pandas-part-1-the-call-graph/ Requests Incoming Outgoing SQL Calls
  25. Example jQAssistant → Neo4j → Pandas Recursive Method Calls to

    Database
    MATCH (m:Method)-[:INVOKES*]->(m)-[:INVOKES]->(dbMethod:Method)<-[:DECLARES]-(dbClass:Class)
    WHERE dbClass.name = "Database"
    RETURN m, dbMethod, dbClass
  26. Example jQAssistant → Neo4j → Pandas Identify possible Race Conditions

    public class OwnerController { ... private static int ownersIndexes;
    MATCH (c:Class)-[:DECLARES]->(f:Field)<-[w:WRITES]-(m:Method)
    WHERE EXISTS(f.static) AND NOT EXISTS(f.final)
    RETURN c.name, f.name, w.lineNumber, m.name
    static = same field for all instances of that class
  27. Summary • Tooling for data analysis in software development is

    here! • First analyses are easy to do using tools you already know • Specific in-depth analyses are powerful and worthwhile • Connection between business and developers is possible! • Problems can be attached to code that is business-related • Making the impact of risk-taking visible is a must-have to improve! • Jupyter/Pandas & jQAssistant/Neo4j are my favorites • Provide many ways for identifying problems • Help to figure out solutions as well!
  28. Links Markus Harrer • Blog: https://feststelltaste.de • Twitter: https://twitter.com/feststelltaste •

    SlideShare: https://www.slideshare.net/feststelltaste • Consulting: http://markusharrer.de jQAssistant/Neo4j • Demos: https://jqassistant.org/get-started/ • Guide: http://buschmais.github.io/jqassistant/doc/1.3.0/ • Talk by Dirk Mahler: https://vimeo.com/170797227