How to Do Your Own Analysis of the Kernel Development

Bitergia
October 23, 2013

Talk at LinuxCon Europe 2013.

The development of the Linux kernel is a complex process performed by a large community of contributors. Fortunately, many details about both the process and the development community are captured in the information stored in the git repository. This presentation will show how the MetricsGrimoire and vizGrimoire toolsets can be used to perform targeted analysis and visualizations of this information. In addition, the results of some specific studies performed with the tools will be shown to illustrate their capabilities and ease of use.

Transcript

  1. How to Do Your Own Analysis of the Kernel Development

    Jesus M. Gonzalez-Barahona (@jgbarah), Bitergia and GSyC/LibreSoft, Universidad Rey Juan Carlos. Edinburgh, UK, October 23rd 2013 (draft, work in progress). Jesus Gonzalez-Barahona (Bitergia/URJC), Your Own Analysis of the Kernel Development, LinuxCon Europe 2013.
  2. © 2012-2013 Bitergia

    Some rights reserved. This presentation is distributed under the “Attribution-ShareAlike 3.0” license, by Creative Commons, available at http://creativecommons.org/licenses/by-sa/3.0/
  3. Bitergia: analytics for your peace of mind

    Started operations in July 2012. Builds on the experience of the LibreSoft R&D group. Offers professional products and services. Focused on: metrics dashboards about software development (including community metrics); specific studies and reports (based on metrics and facts collection). http://bitergia.com http://blog.bitergia.com
  4. Free software is (in many cases) special

    Source code available. Open development model (usually): many details about the internals of the development process; intense use of tools for coordination; lots of information is tracked, and available. Developer and user communities are important: sustainability, pooling of resources, innovation.
  5. Measuring, measuring, measuring

    Information about code, community and development can be retrieved, organized and analyzed.
  6. Metrics-tracked processes

    Quantitative, objective data: facts, not opinions. Specific questions can be answered. Several areas of interest. Developers: understanding and improving development processes; early detection of potential problems, bad smells. Community: general activity, contributions; long-term sustainability, evolution, reaction to issues.
  7. But data has to be extracted, mined...

    Data lives in repositories not always designed to release all their data easily: tools are needed to retrieve and extract it. Data includes many complexities and details: tools are needed to filter and organize it.
  8. But data has to be analyzed, visualized...

    Casual observation is not enough, analysis is needed: tools are needed for statistical and other kinds of analysis. Analysis is not enough, visualization may help: tools are needed for interactive visualization.
  9. The MetricsGrimoire approach

    Set of tools specialized in retrieving information from different kinds of repositories. Among them: CVSAnalY, for source code management (CVS, Subversion, git, etc.); Bicho, for issue tracking systems (Bugzilla, Jira, SourceForge, Allura, Launchpad, etc.); MLStats, for mailing lists (mbox files, Mailman archives, etc.). They store all the information in SQL databases with similar structure. http://metricsgrimoire.github.io
  10. The vizGrimoire approach

    Set of tools for analyzing and visualizing data produced by MetricsGrimoire. vizGrimoireR: R package for analysis and producing JSON files. vizGrimoireJS: JavaScript library for visualizing JSON files. Several dashboards, based on vizGrimoireJS. From SQL to JSON to visualization. http://vizgrimoire.github.io
  11. The *Grimoire combination

    [Diagram: a three-stage pipeline. Retrieval process (data mining tools): from repositories (SCM, BTS, others) to databases. Parsing actions (R and Python scripts): from databases to JSON files. Visualization (HTML, CSS and JS): from JSON files to HTML pages.]
  12. MetricsGrimoire: CVSAnalY

    Browses an SCM repository producing a database with: all metainformation (commit records, etc.); metrics for each release of each file. Also produces some tables suitable for specific analysis. Multiple SCMs: CVS, svn, git (Bazaar and Mercurial through git). Whole history in the database: it is possible to rebuild the file tree for any revision. Tags and branches support. Option to save the log to a file while parsing. Extensions system, incremental capabilities. Multiple database system support (MySQL and SQLite).
  13. MetricsGrimoire: CVSAnalY extensions

    Extension: a “plugin” for CVSAnalY. Adds information to the database, based on the information already in the database and maybe in the repository. Usually: new tables for specific studies. Simple example: commits per month per committer. Extensions add one or more tables to the database, but they never modify the existing ones.
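The “commits per month per committer” example can be sketched directly as a query. This is only an illustration, not the code of any CVSAnalY extension: it uses an in-memory SQLite database with a cut-down, hypothetical `scmlog` table (`committer_id` and `date` columns only), and the function names are made up for this sketch.

```python
import sqlite3


def commits_per_month(conn):
    """Return (month, committer_id, commits) rows, ordered by month and committer."""
    query = """
        SELECT strftime('%Y-%m', date) AS month,
               committer_id,
               COUNT(*) AS commits
        FROM scmlog
        GROUP BY month, committer_id
        ORDER BY month, committer_id
    """
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a tiny in-memory database (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scmlog (id INTEGER PRIMARY KEY, committer_id INTEGER, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO scmlog (committer_id, date) VALUES (?, ?)",
        [
            (1, "2013-08-03 10:00:00"),
            (1, "2013-08-17 12:30:00"),
            (2, "2013-08-20 09:15:00"),
            (1, "2013-09-01 08:00:00"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in commits_per_month(demo_connection()):
        print(row)
```

An extension would typically store such a result in a new table of its own rather than print it, leaving the existing tables untouched.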
  14. MetricsGrimoire: CVSAnalY extensions

    Some examples. FileTypes: adds a table containing information about the type of every file in the database (code, documentation, i18n, etc.). Metrics: analyzes every revision of every file, calculating metrics like SLOC and complexity metrics (McCabe, Halstead); it currently supports metrics for C/C++, Python, Java and Ada. CommitsLOC: adds a new table with information about the total lines added/removed for every commit.
  15. MetricsGrimoire: Bicho

    Parses issue tracking systems. Results stored in a MySQL database. Information about each issue (ticket), and its modifications. Currently it supports: SourceForge (HTML parsing); Bugzilla (GNOME, KDE, others); Jira, Allura, Launchpad, GitHub, RedMine (API). Incremental. Supports Gerrit as well (code review).
  16. MetricsGrimoire: MailingListStats

    Parses mbox information (RFC 822). Deals with Mailman archives. Stores results (headers, body) in a MySQL database: sender, CCs, etc.; time / date; subject; ... Incremental. Can store multiple projects in a single database.
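To give an idea of what parsing an mbox file involves, here is a minimal sketch using the `mailbox` module from the Python standard library. It is not how MLStats itself is implemented; the sample messages, addresses and function names are invented for illustration, and only three headers are extracted.

```python
import mailbox
import os
import tempfile

# Two made-up messages in mbox format (each starts with a "From " separator line).
SAMPLE_MBOX = (
    "From alice@example.org Wed Oct 23 10:00:00 2013\n"
    "From: Alice <alice@example.org>\n"
    "Date: Wed, 23 Oct 2013 10:00:00 +0200\n"
    "Subject: [PATCH] fix typo\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.org Wed Oct 23 11:00:00 2013\n"
    "From: Bob <bob@example.org>\n"
    "Date: Wed, 23 Oct 2013 02:00:00 -0700\n"
    "Subject: Re: [PATCH] fix typo\n"
    "\n"
    "Body of the second message.\n"
)


def read_headers(path):
    """Return a (sender, date, subject) tuple for every message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        return [(msg["From"], msg["Date"], msg["Subject"]) for msg in box]
    finally:
        box.close()


def demo():
    """Write the sample mbox to a temporary file and read its headers back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE_MBOX)
        return read_headers(path)


if __name__ == "__main__":
    for sender, date, subject in demo():
        print(sender, "|", date, "|", subject)
```

A real tool would then insert these tuples into a database table, one row per message.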
  17. vizGrimoireR & vizGrimoireJS

    Once information is retrieved, and is in a format suitable for querying: it can be queried directly in the database; it can be analyzed from R; it can be filtered, manually inspected, improved; it can be combined, cross-analyzed; it can be visualized. Milking the databases, producing dashboards.
  18. Now...

    ...howto for the Linux kernel
  19. Data sources used

    Git snapshot with changes since 1992, by Yoann Padioleau (reliable about individual changes since 2002), updated (git pull) up to September 2013: http://archive.org/details/git-history-of-linux and git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. Gmane linux-kernel mailing list: http://dir.gmane.org/gmane.linux.kernel
  20. Running the tools: getting MetricsGrimoire

    Get source code from GitHub repositories and follow instructions ...or... run metricsgrimoire-setup.py: https://github.com/VizGrimoire/VizGrimoireR/blob/master/misc/metricsgrimoire-setup.py
  21. Running the tools: CVSAnalY (git)

    Get the git repo with historic information. Run CVSAnalY to populate the cvsanalydb database.

        wget http://archive.org/download/git-history-of-linux/full-history-linux.git.tar
        tar xvf full-history-linux.git.tar
        cd full-history-linux
        git pull
        [Create database cvsanalydb]
        cvsanaly2 -u user -p XXX -d cvsanalydb \
            --extensions=FileTypes,CommitsLOC .
        [From vizGrimoireR/misc, >5,000 dup ids found]
        unifypeople.py -u user -p XXX -d cvsanalydb
  22. Running the tools: MLStats (mailing list)

        [From vizGrimoireR/examples/linux]
        get-gmane-archive.sh
        mlstats --db-user=user --db-password=XXX \
            --db-admin-user=adminuser --db-admin-password=XXX \
            --db-name=mlstatsdb \
            mail_dir
  23. Running the tools: Producing a basic dashboard

    Installation of vizGrimoireR as a package for R: https://github.com/VizGrimoire/VizGrimoireR/wiki. Produce all the files for the dashboard:

        [ in vizGrimoireR directory, run R scripts to produce JSON ]
        vizGrimoireJS/run_scripts-linuxkernel.sh
        [ get HTML (JavaScript, HTML, CSS) files for dashboard ]
        git clone git@github.com:VizGrimoire/VizGrimoireJS.git \
            linux-dashboard
        cd linux-dashboard
        git checkout linux
        [ copy JSON files produced above ]
        cp ../json/* ./data/json

    Export the directory via HTTP. Access it with your favorite web browser.
  24. Basic dashboard

    Limited functionality, work in progress. http://bitergia.com/public/previews/2013_10_linux/browser/
  25. Basic dashboard (git)
  26. Basic dashboard (mailing list)
  27. A more complete dashboard: MediaWiki

    http://korma.wmflabs.org/
  28. Case: Activity per directory in git repository

    Three CVSAnalY tables involved. scmlog: metadata for each commit (author, date, ...). actions: metadata for each action on a file (file, commit, ...). file_links: metadata for each file (name, path, ...). Group all actions involving files under a certain subdirectory, then calculate the number of actions and of (distinct) authors involved.
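The grouping described in this case can be sketched as follows. This is an illustrative approximation, not the script used for the talk: the three tables are reduced to the few columns needed (`scmlog.author_id`, `actions.file_id`, `actions.commit_id`, `file_links.file_path`), which is an assumption about the schema, and the demo data is invented. It also assumes one `file_links` row per file; a real database may hold several rows for a file that was moved.

```python
import sqlite3
from collections import defaultdict


def directory_of(path, depth=1):
    """Return the first `depth` directory components of a file path."""
    dirs = path.split("/")[:-1]  # drop the file name
    return "/".join(dirs[:depth]) if dirs else "(root)"


def activity_per_directory(conn, depth=1):
    """Map each directory to (number of actions, number of distinct authors)."""
    rows = conn.execute(
        """
        SELECT fl.file_path, s.author_id
        FROM actions a
        JOIN scmlog s ON a.commit_id = s.id
        JOIN file_links fl ON a.file_id = fl.file_id
        """
    )
    actions = defaultdict(int)
    authors = defaultdict(set)
    for path, author_id in rows:
        directory = directory_of(path, depth)
        actions[directory] += 1
        authors[directory].add(author_id)
    return {d: (actions[d], len(authors[d])) for d in actions}


def demo_connection():
    """Build a tiny in-memory database (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE scmlog (id INTEGER PRIMARY KEY, author_id INTEGER);
        CREATE TABLE file_links (file_id INTEGER, file_path TEXT);
        CREATE TABLE actions (id INTEGER PRIMARY KEY, file_id INTEGER, commit_id INTEGER);
        INSERT INTO scmlog (id, author_id) VALUES (1, 10), (2, 11), (3, 10);
        INSERT INTO file_links VALUES (1, 'drivers/net/e1000.c'),
                                      (2, 'drivers/usb/core.c'),
                                      (3, 'kernel/sched.c');
        INSERT INTO actions (file_id, commit_id) VALUES (1, 1), (2, 1), (1, 2), (3, 3);
        """
    )
    return conn


if __name__ == "__main__":
    print(activity_per_directory(demo_connection(), depth=1))
    print(activity_per_directory(demo_connection(), depth=2))
```

Dividing the number of authors by the number of actions for each directory gives an "authors per actions" density.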
  29. Size (LOC) per subdirectory

    [Figure: size of code per subdirectory, for the top-level directories arch, block, crypto, drivers, fs, include, ipc, kernel, lib, mm, net, scripts, security, sound and tools, and their main subdirectories.] Data obtained with: cloc --csv --skip-uniqueness --by-file .
  30. Changes per directory

    [Figure: actions per top-level directory: arch, block, Documentation, drivers, fs, include, init, ipc, kernel, lib, mm, net, scripts, security, sound, tools, virt.]
  31. Changes per directory: more detail

    [Figure: actions per directory, with top-level directories split into their main subdirectories (arch/arm, arch/x86, drivers/net, drivers/staging, fs/xfs, net/ipv4, ...).]
  32. Changes per directory: color is number of authors

    [Figure: actions per directory and subdirectory, colored by number of authors; color scale from 0 to 4000.]
  33. Changes per directory: color is density of authors

    [Figure: actions per directory and subdirectory, colored by authors per actions; color scale from 0.0 to 1.0.]
  34. Changes per directory: 2002

    [Figure: actions per directory and subdirectory for 2002, colored by authors per actions; color scale from 0.0 to 1.0.]
  35. Changes per directory: 2013

    [Figure: actions per directory and subdirectory for 2013, colored by authors per actions; color scale from 0.0 to 1.0.]
  36. Case: experience of developers

    Different ids for the same developers have to be merged. Simple heuristics: more than 5,000 duplicate ids for about 10,000 developers. https://github.com/VizGrimoire/VizGrimoireR/blob/master/misc/unifypeople.py. Three CVSAnalY tables involved. scmlog: metadata for each commit (author, date, ...). people: metadata for ids corresponding to developers. upeople: ‘unified’ ids for developers. Calculate ‘age in project’ for active developers at a certain point in time, grouping them by ‘generations’.
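The ‘age in project’ calculation can be sketched in plain Python. This is a sketch under stated assumptions, not the analysis script from the talk: a developer counts as active if they committed within 180 days before the snapshot, age is measured from the first commit and bucketed in 365-day steps, and the identity-to-developer mapping stands in for the people and upeople tables. All names and data below are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def generations(commits, unified, snapshot,
                active_window=timedelta(days=180), bucket_days=365):
    """Count active developers per age bucket at the snapshot date.

    commits: iterable of (identity_id, commit_date) pairs.
    unified: dict mapping identity ids to unified developer ids.
    """
    first_commit = {}
    active = set()
    for identity, day in commits:
        if day > snapshot:
            continue  # ignore activity after the snapshot
        developer = unified.get(identity, identity)
        if developer not in first_commit or day < first_commit[developer]:
            first_commit[developer] = day
        if snapshot - day <= active_window:
            active.add(developer)
    return Counter(
        (snapshot - first_commit[developer]).days // bucket_days
        for developer in active
    )


# Hypothetical data: identities 1 and 2 belong to the same person.
UNIFIED = {1: "alice", 2: "alice", 3: "bob", 4: "carol"}
COMMITS = [
    (1, date(2005, 1, 10)),
    (2, date(2013, 2, 1)),
    (3, date(2012, 11, 15)),
    (3, date(2013, 1, 20)),
    (4, date(2008, 6, 1)),  # carol is no longer active in 2013
]

if __name__ == "__main__":
    print(generations(COMMITS, UNIFIED, date(2013, 3, 1)))
```

Merging identities first matters here: without it, the 2013 commit by identity 2 would look like a newcomer instead of a developer with eight years in the project.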
  37. Experience: demography pyramid (git)

    [Figure: generations (active authors of commits), March 1st, 2013. Axes: age, number of developers.]
  38. Experience: demography pyramid (git)

    [Figure: comparison of pyramids every two years, for 2003-03-01, 2005-03-01, 2007-03-01, 2009-03-01, 2011-03-01 and 2013-03-01. Axes: age, number of developers.]
  39. Experience: demography pyramid (git)

    [Figure: one panel per date, for 2003-03-01, 2005-03-01, 2007-03-01, 2009-03-01, 2011-03-01 and 2013-03-01. Axes: age (quarters), number of developers.]
  40. Experience: demography pyramid (git)

    [Figure: 3D version of pyramids every two years.]
  41. Case: where do developers work?

    Study on the mailing list. Assumptions: time zones of mail tools are correct; time zones correspond roughly to geographical areas (http://en.wikipedia.org/wiki/Time_zone#UTC_offsets_worldwide). Americas: GMT-8 to GMT-2 (US/Canada: -8 to -4). Europe/Africa/Middle East: GMT to GMT+5. East Asia/Australia: GMT+8 to GMT+11. All the info is in one MLStats table. messages: main headers of each message.
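The mapping from time zone to region can be sketched with the standard library alone. The region boundaries below are the ones listed on the slide; everything else (function names, the "Other" label, the sample headers) is an assumption made for this sketch, which works on raw Date headers rather than on the MLStats messages table. Note that `email.utils.parsedate_tz` reports a header without a usable zone as offset 0, so such messages end up counted under GMT.

```python
from collections import Counter
from email.utils import parsedate_tz

# (region, lowest offset, highest offset), in hours relative to GMT.
REGIONS = [
    ("Americas", -8, -2),
    ("Europe/Africa/Middle East", 0, 5),
    ("East Asia/Australia", 8, 11),
]


def utc_offset_hours(date_header):
    """Return the UTC offset of a Date header in hours, or None if unparseable."""
    parsed = parsedate_tz(date_header)
    if parsed is None:
        return None
    return parsed[9] / 3600  # element 9 is the offset in seconds


def region_for(offset):
    """Classify an offset (in hours) into one of the regions."""
    if offset is None:
        return "Unknown"
    for name, low, high in REGIONS:
        if low <= offset <= high:
            return name
    return "Other"


def messages_per_region(date_headers):
    """Count messages per region from an iterable of Date header values."""
    return Counter(region_for(utc_offset_hours(h)) for h in date_headers)


if __name__ == "__main__":
    sample = [
        "Wed, 23 Oct 2013 10:00:00 +0200",
        "Wed, 23 Oct 2013 01:00:00 -0700",
        "Wed, 23 Oct 2013 17:00:00 +0900",
        "Wed, 23 Oct 2013 13:30:00 +0530",
    ]
    print(messages_per_region(sample))
```

Offsets that fall between the listed ranges, such as +5:30, are reported as "Other" instead of being forced into a region.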
  42. Timezone origin of messages

    [Figure: number of messages per time zone. Axes: timezones (relative to GMT, from -10 to +10), messages.]
  43. Timezone origin of messages (2002, 2007, 2012)

    [Figure: three panels, one per year (2002, 2007, 2012), each showing the number of messages per time zone, from -10 to +10 relative to GMT.]
  44. Takeaways

    Run MetricsGrimoire on the repositories, get your own databases. Use our databases, use vizGrimoire to run your own analysis, produce your own visualizations. Use our analysis scripts, modify them for your own purposes.
  45. This is the end

    Have you learned something useful? [I would love to know what interested you the most] [...and the least] Code for the examples on the Linux kernel: https://github.com/VizGrimoire/VizGrimoireR/tree/master/examples/linux. Databases: http://bitergia.com/public/previews/2013_10_linux/browser/data/db