
Data Science on Software Data (WeAreDevelopers World Congress 2021)

Data Science extracts new insights from business data. As software developers, why don't we use it to analyze the data from our own software systems, too?

In this session, I will talk about approaches to mine software data based on the many ideas from the Data Science field. We'll also look at the standard tools used in this area to analyze and communicate software development problems easily. With tools such as computational notebooks, data analysis frameworks, visualization, and machine learning libraries, we make hidden issues visible in a data-driven way.

Attendees will learn how to leverage scientific thinking, manage the analysis process, and apply literate statistical programming to analyze software data in an understandable way.

The main part will be hands-on live coding with Open Source tools like Jupyter notebook, Python, pandas, jQAssistant, and Neo4j. I'll show which new insights we can gain from data sources such as Git repositories, performance measurements, or directly from source code.
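As a rough illustration of mining a Git repository, the sketch below counts how often each file was changed. It is not the code from the live demo: it uses only the Python standard library, and `SAMPLE_LOG` is hypothetical text shaped like the output of `git log --numstat --date=short --pretty=format:"--%h--%ad--%an"`. In a notebook, the parsed rows could just as well be loaded into a pandas DataFrame.

```python
from collections import Counter

# Hypothetical sample shaped like the output of:
#   git log --numstat --date=short --pretty=format:"--%h--%ad--%an"
SAMPLE_LOG = """\
--abc123--2021-06-01--Alice
10\t2\tsrc/main/java/Pet.java
3\t0\tsrc/main/java/Owner.java

--def456--2021-06-02--Bob
1\t1\tsrc/main/java/Pet.java
-\t-\tdocs/logo.png
"""


def parse_numstat_log(text):
    """Yield one dict per changed file, enriched with its commit's metadata."""
    commit = None
    for line in text.splitlines():
        if line.startswith("--"):
            # Commit header line: --<sha>--<date>--<author>
            _, sha, day, author = line.split("--", 3)
            commit = {"sha": sha, "date": day, "author": author}
        elif line.strip() and commit is not None:
            # numstat line: <added>\t<deleted>\t<path>; binary files report "-"
            added, deleted, path = line.split("\t", 2)
            yield {
                **commit,
                "path": path,
                "added": 0 if added == "-" else int(added),
                "deleted": 0 if deleted == "-" else int(deleted),
            }


rows = list(parse_numstat_log(SAMPLE_LOG))

# Number of commits that touched each file
change_counts = Counter(row["path"] for row in rows)

# Added plus deleted lines per file (churn)
churn = Counter()
for row in rows:
    churn[row["path"]] += row["added"] + row["deleted"]

for path, count in change_counts.most_common():
    print(f"{count:3d} changes, {churn[path]:4d} lines churned: {path}")
```

Files that change often and churn heavily are usually a reasonable starting point for a closer look.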


Markus Harrer

June 30, 2021


  1. Data Science on Software Data WEAREDEVELOPERS WORLD CONFERENCE 2021 Markus Harrer, Software Development Analyst @ INNOQ Twitter: @feststelltaste Website: softwareanalytics.de Slides: speakerdeck.com/feststelltaste/
  2. The original horror show IMAGE: sethJreid/Pixabay LEGACY systemS

  3. According to https://pkruchten.files.wordpress.com/2013/12/kruchten-colours-yow-sydney.pdf

  5. Adapted from Ray Koopa https://commons.wikimedia.org/wiki/File:Lcars_wallpaper.svg (CC BY-SA 4.0) THE ULTIMATE


  10. IV == VII V == VIII VI == IX

  11. A NEW HOPE

  12. Answering important SPECIFIC questions (chart labels: Frequency, Questions, Importance for us)
    • Use standard tools for general questions
    • Option 1: Just ignore the other questions
    • Option 2: Use Software Analytics to answer your very specific questions!
  13. Substantial expertise Data Science Data Science Venn diagram by Drew Conway (simplified)

    PRODUCIBLE DATA SCIENCE: open, comprehensible, systematic, automated

  16. JUDGMENT DAY Typical ISSUES TO TERMINATE
    • Spotting parts in the source code no one knows of anymore
    • Finding root causes of our performance bottlenecks
    • Identifying alternative modularizations of software systems
    • Showing the progress of long-time refactorings
    • Measuring the community activity around open source software
    • <your very specific analysis in your very specific situations>
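To make the first issue on this slide concrete, here is a minimal sketch of one possible heuristic for spotting code that nobody knows anymore: flag files whose most recent change is older than a chosen cutoff. The records in `CHANGES`, the author names, and the cutoff date are all hypothetical, and the talk may well use a different approach.

```python
from datetime import date

# Hypothetical change records: (path, author, ISO date of the change)
CHANGES = [
    ("billing/Invoice.java", "alice", "2015-03-02"),
    ("billing/Invoice.java", "bob", "2016-01-15"),
    ("web/PetController.java", "carol", "2021-05-20"),
    ("legacy/TaxRules.java", "dave", "2012-11-30"),
]


def last_touched(changes):
    """Map each path to the date of its most recent change."""
    latest = {}
    for path, _author, iso_day in changes:
        day = date.fromisoformat(iso_day)
        if path not in latest or day > latest[path]:
            latest[path] = day
    return latest


def stale_files(changes, cutoff):
    """Return, sorted by path, the files not changed since before the cutoff."""
    return sorted(
        path for path, day in last_touched(changes).items() if day < cutoff
    )


stale = stale_files(CHANGES, cutoff=date(2018, 1, 1))
print(stale)
```

Combining such a list with the set of authors who are still on the team would narrow it further to code without any remaining expert.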
  18. Python 3 3 Python 1 ... and matplotlib, numpy, scikit-learn, NLTK, Pygments, py2neo, requests, BeautifulSoup, Pygal ...
  19. code and data in love Computational Notebook

  20. Computational Notebook COMPLETELY AUTOMATED
    • Context documented
    • Ideas, assumptions and simplifications explicit
    • Calculations presented in an understandable way
    • Summaries / What’s next?
    Jupyter Notebook: Context, Idea, Analysis, Conclusion. Data-driven Software Analysis
  21. Literate Programming with Jupyter Notebook

  22. Attribution: Tobias ToMar Maier, https://commons.wikimedia.org/wiki/File:VHS_tape_with_time_scale.jpg Demo I Miami Cops Police

  23. jQAssistant Neo4j Graph Analytics

  24. :Class :Method :Field https://github.com/buschmais/spring-petclinic
    public class Pet {
        private LocalDate birthDate;
        public LocalDate getBirthDate() { return this.birthDate; }
        public void setBirthDate(LocalDate birthDate) { this.birthDate = birthDate; }
    }
  25. :Class :Method :Field :Entity https://github.com/buschmais/spring-petclinic
    @Entity @Table(name = "pets") public class Pet {
  26. :Class Business Subdomain :Method :Field findings 2 changes 5 :Entity usage 100% name birthDate
  27. types 16, findings 17, changes 15, usage 70% | types 5, findings 39, changes 51, usage 80%. A perspective that managers can reason about, too!
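The per-subdomain numbers on these slides (types, findings, changes, usage) can be seen as a roll-up of per-class facts. The sketch below shows that aggregation in plain Python as a stand-in; in the talk this information lives in a Neo4j graph populated by jQAssistant, and the actual queries are not reproduced here. The class records and subdomain names in `CLASSES` are hypothetical and do not match the figures on the slide.

```python
from collections import defaultdict

# Hypothetical per-class facts, e.g. gathered from static analysis findings,
# Git history and production usage measurements
CLASSES = [
    {"name": "Pet", "subdomain": "pet", "findings": 2, "changes": 5, "used": True},
    {"name": "PetType", "subdomain": "pet", "findings": 0, "changes": 1, "used": True},
    {"name": "Visit", "subdomain": "visit", "findings": 1, "changes": 3, "used": False},
    {"name": "VisitRepository", "subdomain": "visit", "findings": 3, "changes": 4, "used": True},
]


def summarize_by_subdomain(classes):
    """Aggregate class-level metrics into one summary per business subdomain."""
    totals = defaultdict(lambda: {"types": 0, "findings": 0, "changes": 0, "used": 0})
    for cls in classes:
        entry = totals[cls["subdomain"]]
        entry["types"] += 1
        entry["findings"] += cls["findings"]
        entry["changes"] += cls["changes"]
        entry["used"] += int(cls["used"])
    # Turn the count of used types into a percentage of all types
    return {
        subdomain: {
            "types": entry["types"],
            "findings": entry["findings"],
            "changes": entry["changes"],
            "usage_pct": round(100 * entry["used"] / entry["types"]),
        }
        for subdomain, entry in totals.items()
    }


summary = summarize_by_subdomain(CLASSES)
for subdomain, metrics in summary.items():
    print(subdomain, metrics)
```

Reporting at the subdomain level rather than per class keeps the result readable for people who do not work in the code every day.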
  28. Attribution: Tobias ToMar Maier, https://commons.wikimedia.org/wiki/File:VHS_tape_with_time_scale.jpg Demo II Terminator Flash Gordon

  29. The return of reason The Two Tips The fellow feeling Automation Meta Metric Number of solved problems NO Tool To Rule Them All Openness become the Lord of the Things by analyzing software in a data-driven way
  30. ASK 'EM ALL @feststelltaste

  31. Appendix

  32. More on Software Analytics softwareanalytics.de

  33. Demos
    Jupyter notebook, Python, pandas, matplotlib
    • Repo: https://github.com/feststelltaste/software-analytics/tree/master/demos/20210630_WeAreDevelopersWorldCongress
    • Interactive online version: https://mybinder.org/v2/gh/feststelltaste/software-analytics/HEAD?filepath=demos%2F20210630_WeAreDevelopersWorldCongress
    jQAssistant & Neo4j
    • Repo Spring PetClinic: https://github.com/javaonautobahn/spring-petclinic
    • Repo DesignSmells: https://github.com/feststelltaste/designsmells
    Run notebook with this
  34. More on Software Analytics
    • Adam Tornhill: Software Design X-Rays
    • Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering
    • Christian Bird, Tim Menzies, Thomas Zimmermann: The Art and Science of Analyzing Software Data
  35. More on Data Science
    • Jeff Leek: The Elements of Data Analytic Style
    • Roger D. Peng: Report Writing for Data Science in R
    • Wes McKinney: Python for Data Analysis
  36. More on Graph Analytics
    • Mark Needham & Amy Hodler: Graph Algorithms
    • https://neo4j.com/product/graph-data-science-library/
  37. Paper about jQAssistant/Neo4j https://easychair.org/publications/preprint/893N

  38. Thank you very much! Markus Harrer, markus.harrer@innoq.com, @feststelltaste
    innoQ Germany GmbH: Krischerstr. 100, 40789 Monheim on the Rhine, Germany, +49 2173 3366-0; Ohlauer Str. 43, 10999 Berlin, Germany; Ludwigstr. 180E, 63067 Offenbach, Germany; Kreuzstr. 16, 80331 Munich, Germany
    innoQ Switzerland GmbH: Gewerbestr. 11, CH-6330 Cham, Switzerland, +41 41 743 01 11; Albulastr. 55, 8048 Zurich, Switzerland