Data Science on Software Data (WeAreDevelopers World Congress 2021)

Data Science gains new insights from business data. As software developers, why don't we use Data Science to analyze the data from our own software systems, too?

In this session, I will talk about approaches to mine software data based on the many ideas from the Data Science field. We'll also look at the standard tools used in this area to analyze and communicate software development problems easily. With tools such as computational notebooks, data analysis frameworks, visualization, and machine learning libraries, we make hidden issues visible in a data-driven way.

Attendees will learn how to leverage scientific thinking, manage the analysis process, and apply literate statistical programming to analyze software data in an understandable way.

The main part will be hands-on live coding with Open Source tools like Jupyter notebook, Python, pandas, jQAssistant, and Neo4j. I'll show which new insights we can gain from data sources such as Git repositories, performance measurements, or directly from source code.
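As a rough illustration of what mining a Git repository with pandas can look like, here is a minimal sketch. It is not taken from the talk's notebooks: the log format string, the made-up sample data, and the helper names `parse_git_log` and `most_changed_files` are assumptions chosen for this example.

```python
import pandas as pd

# Made-up sample in the shape produced by (assumed command):
#   git log --numstat --pretty=format:"#%h,%ad,%an" --date=short
SAMPLE_LOG = """\
#a1b2c3d,2021-06-01,Alice
10\t2\tsrc/main/java/Pet.java
3\t1\tsrc/main/java/Owner.java

#d4e5f6a,2021-06-02,Bob
5\t5\tsrc/main/java/Pet.java

#0f9e8d7,2021-06-03,Alice
1\t0\tREADME.md
-\t-\tdocs/logo.png
"""


def parse_git_log(text):
    """Turn the log text into one DataFrame row per changed file per commit."""
    rows = []
    commit = None
    for line in text.splitlines():
        if line.startswith("#"):
            # Commit header line: short hash, date, author name
            sha, date, author = line[1:].split(",", 2)
            commit = {"sha": sha, "date": date, "author": author}
        elif line.strip():
            # numstat line: added<TAB>deleted<TAB>path
            added, deleted, path = line.split("\t", 2)
            rows.append({**commit, "added": added, "deleted": deleted, "file": path})
    df = pd.DataFrame(rows)
    # Binary files are reported as "-"; count them as 0 changed lines
    for col in ("added", "deleted"):
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
    df["date"] = pd.to_datetime(df["date"])
    return df


def most_changed_files(df):
    """Rank files by number of commits, then by churn (added + deleted lines)."""
    return (
        df.assign(churn=df["added"] + df["deleted"])
        .groupby("file")
        .agg(
            commits=("sha", "nunique"),
            churn=("churn", "sum"),
            authors=("author", "nunique"),
        )
        .sort_values(["commits", "churn"], ascending=False)
    )


if __name__ == "__main__":
    print(most_changed_files(parse_git_log(SAMPLE_LOG)))
```

In a notebook, the sample text would be replaced by the real output of the `git log` command, and the resulting table could be plotted or joined with other data sources such as performance measurements.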

Markus Harrer

June 30, 2021

Transcript

  1. Data Science on Software Data. WEAREDEVELOPERS WORLD CONFERENCE 2021. Markus Harrer, Software Development Analyst @ INNOQ. Twitter: @feststelltaste. Website: softwareanalytics.de. Slides: speakerdeck.com/feststelltaste/
  2. Answering important SPECIFIC questions (chart labels: Frequency, Questions, Importance for us). Use standard tools for general questions. Option 1: Just ignore the other questions. Option 2: Use Software Analytics to answer your very specific questions!
  3. REPRODUCIBLE DATA SCIENCE (open, comprehensible, systematic, automated) = A WAY TO IMPLEMENT SOLID SOFTWARE ANALYTICS
  4. JUDGMENT DAY: Typical ISSUES TO TERMINATE • Spotting parts in the source code no one knows of anymore • Finding root causes of our performance bottlenecks • Identifying alternative modularizations of software systems • Showing the progress of long-time refactorings • Measuring the community activity around open source software • <your very specific analysis in your very specific situations>
  5. Python 3 ... and matplotlib, numpy, scikit-learn, NLTK, Pygments, py2neo, requests, BeautifulSoup, Pygal ...
  6. Computational Notebook: COMPLETELY AUTOMATED • Context documented • Ideas, assumptions and simplifications explicit • Calculations presented in an understandable way • Summaries / What’s next? Jupyter Notebook: Context, Idea, Analysis, Conclusion. Data-driven Software Analysis
  7. :Class :Method :Field https://github.com/buschmais/spring-petclinic public class Pet { private LocalDate birthDate; public LocalDate getBirthDate(){ return this.birthDate; } public void setBirthDate(LocalDate birthDate){ this.birthDate = birthDate; } }
  8. types 16, findings 17, changes 15, usage 70% | types 5, findings 39, changes 51, usage 80%. A perspective that managers can reason about, too!
  9. The return of reason. The Two Tips. The fellow feeling. Automation. Meta Metric: Number of solved problems. NO Tool To Rule Them All. Openness. Become the Lord of the Things by analyzing software in a data-driven way
  10. Demos. Jupyter notebook, Python, pandas, matplotlib: Repo https://github.com/feststelltaste/software-analytics/tree/master/demos/20210630_WeAreDevelopersWorldCongress Interactive online version (run notebook with this): https://mybinder.org/v2/gh/feststelltaste/software-analytics/HEAD?filepath=demos%2F20210630_WeAreDevelopersWorldCongress jQAssistant & Neo4j: Repo Spring PetClinic https://github.com/javaonautobahn/spring-petclinic Repo DesignSmells https://github.com/feststelltaste/designsmells
  11. More on Software Analytics. Adam Tornhill: Software X-Ray. Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering. Christian Bird, Tim Menzies, Thomas Zimmermann: The Art and Science of Analyzing Software Data
  12. More on Data Science. Jeff Leek: The Elements of Data Analytic Style. Roger D. Peng: Report Writing for Data Science in R. Wes McKinney: Python for Data Analysis
  13. More on Graph Analytics. Mark Needham & Amy Hodler: Graph Algorithms. https://neo4j.com/product/graph-data-science-library/
  14. Thank you very much! Markus Harrer [email protected] @feststelltaste. innoQ Germany GmbH: Krischerstr. 100, 40789 Monheim on the Rhine, Germany, +49 2173 3366-0; Ohlauer Str. 43, 10999 Berlin, Germany; Ludwigstr. 180E, 63067 Offenbach, Germany; Kreuzstr. 16, 80331 Munich, Germany. innoQ Switzerland GmbH: Gewerbestr. 11, CH-6330 Cham, Switzerland, +41 41 743 01 11; Albulastr. 55, 8048 Zurich, Switzerland