Slide 1

Slide 1 text

How people actually write Puppet
Analyzing 7.5 million lines of Puppet code
Gareth Rushgrove

Slide 3

Slide 3 text

@garethr

Slide 4

Slide 4 text

What if we were to analyze all of the code written in the Puppet language?

Slide 5

Slide 5 text

- But why?
- GitHub and BigQuery
- Libraries.io and the Forge
- Caveats and limitations

Slide 6

Slide 6 text

WARNING: This presentation contains far too many SQL queries with embedded regex

Slide 7

Slide 7 text

But why?
Why analyzing code is useful

Slide 8

Slide 8 text

Making design decisions

Slide 9

Slide 9 text

What features could/should be removed?

Slide 10

Slide 10 text

Identify common problems which can be addressed with additional tooling

Slide 11

Slide 11 text

See how and where new features are adopted by users

Slide 12

Slide 12 text

Identify contributors for more qualitative research

Slide 13

Slide 13 text

GitHub and BigQuery
Our main source of data

Slide 14

Slide 14 text

Open source code on GitHub available in BigQuery

Slide 15

Slide 15 text

BigQuery provides a SQL interface to all of the code on GitHub, along with the associated metadata
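As a rough illustration of what such a query can look like, here is a minimal Python sketch that builds one. It is not one of the queries from the talk. It assumes the public `bigquery-public-data.github_repos` dataset with its `files` and `contents` tables, and it assumes Puppet code can be identified by the `.pp` extension. The helper name `puppet_size_query` is hypothetical.

```python
def puppet_size_query(extension: str = ".pp") -> str:
    """Build a BigQuery Standard SQL query that sizes up files with one extension."""
    # Allow only simple extensions so the value is safe to embed in the SQL text.
    if not extension.startswith(".") or not extension[1:].isalnum():
        raise ValueError(f"unexpected extension: {extension!r}")
    # Assumed schema: files(id, path, ...) joined to contents(id, size, content, ...).
    return f"""
SELECT
  COUNT(*) AS files,
  SUM(c.size) AS bytes,
  SUM(ARRAY_LENGTH(SPLIT(c.content, '\\n'))) AS lines
FROM `bigquery-public-data.github_repos.files` AS f
JOIN `bigquery-public-data.github_repos.contents` AS c
  ON f.id = c.id
WHERE ENDS_WITH(f.path, '{extension}')
""".strip()


# Print the SQL so it can be pasted into the BigQuery console.
print(puppet_size_query())
```

The same file can appear in many repositories, so a join like this counts copies more than once. Whether that is desirable depends on the question being asked.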

Slide 16

Slide 16 text

puppetlabs/puppet-bigquery

Slide 17

Slide 17 text

Repository contains all the queries used in this talk

Slide 18

Slide 18 text

How many bytes of Puppet code?

Slide 19

Slide 19 text

How many lines of Puppet code?

Slide 20

Slide 20 text

In how many files?

Slide 21

Slide 21 text

From how many contributors?

Slide 22

Slide 22 text

So we have lots of code from lots of people. What questions might we ask?

Slide 23

Slide 23 text

How do people license their Puppet code?

Slide 24

Slide 24 text

How do people license their Puppet code?
mit (7345), apache-2.0 (6368), gpl-3.0 (1537), gpl-2.0 (1512), bsd-3-clause (805), bsd-2-clause (373), mpl-2.0 (287), agpl-3.0 (160), isc (121), unlicense (105)

Slide 25

Slide 25 text

Mainly permissive licenses, but nearly 15% is GPL

Slide 26

Slide 26 text

What do people name their classes?
apache (153), mysql (131), php (129), base (118), main (118), nginx (110), git (84), python (82), mysql::params (69), ssh (64), mysql::server (59), apt (59), puppet::params (56), mysql::config (55), nodejs (53), apt_get_update (52)
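Counts like these come from pattern matching over manifest text. The following is a small local sketch of the idea in Python, assuming a regular expression is good enough to spot class definitions. The pattern `CLASS_RE` and the sample manifests are illustrative, not the expression used in the talk, and a regex will miss or miscount cases that a real parser would handle.

```python
import re
from collections import Counter

# Heuristic: "class <name>" at the start of a line, followed by "(", "{" or "inherits".
# It ignores resource-like declarations such as: class { 'apache': }
CLASS_RE = re.compile(
    r"^\s*class\s+([a-z][a-z0-9_]*(?:::[a-z][a-z0-9_]*)*)\s*(?=[({]|inherits\b)",
    re.MULTILINE,
)


def count_class_names(manifests):
    """Count class definitions across an iterable of manifest strings."""
    counts = Counter()
    for text in manifests:
        counts.update(CLASS_RE.findall(text))
    return counts


# Made-up sample manifests, for illustration only.
sample = [
    "class apache {\n  package { 'apache2': ensure => installed }\n}\n",
    "class mysql::params {\n}\nclass mysql::server inherits mysql::params {\n}\n",
    "class apache ($port = 80) {\n}\n",
    "node default {\n  class { 'apache': }\n}\n",
]

counts = count_class_names(sample)
for name, n in counts.most_common():
    print(f"{name} ({n})")
```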

Slide 27

Slide 27 text

What do people name defined types?
add_dotdeb (75), mysql_db (68), mysql_nginx_default_conf (57), mongodb_db (52), nginx_vhost (49), postgresql_db (40), mariadb_db (39), mariabdb_nginx_default_conf (38), safepackage (31), iptables_port (26)

Slide 28

Slide 28 text

What packages do people manage with Puppet?
git (552), curl (386), wget (239), nginx (246), apache2 (244), vim (243), unzip (238), build-essential (197), python-pip (185), mysql (184), ntp (176), openssh-server (156), nodejs (152), httpd (138)

Slide 29

Slide 29 text

What services do people manage with Puppet?
apache2 (314), nginx (280), mysql (239), httpd (165), puppet (125), iptables (119), ssh (105), postgresql (93), php5-fpm (88), neutron-server (81), ntp (76), postfix (74), ntpd (71), sshd (70), mysqld (69)

Slide 30

Slide 30 text

These give a good indication of the most common stacks managed with Puppet

Slide 31

Slide 31 text

- Apache and Nginx
- PHP, Python and Node.js
- MySQL and PostgreSQL

Slide 32

Slide 32 text

It would be interesting to look at this over time too

Slide 33

Slide 33 text

Who committed most of this code?

Slide 34

Slide 34 text

Popular file names containing Puppet code

Slide 35

Slide 35 text

Simple Puppet Module Structure in 2009

Slide 36

Slide 36 text

Evidence that the install, config, service pattern became the de facto way of organising code

Slide 37

Slide 37 text

What types are used the most?

Slide 38

Slide 38 text

Which resource types are most used?
file (30298), package (22162), exec (16825), service (11112), user (3951), host (2361), group (2181), notify (2151), yumrepo (1229), cron (1122), stage (429), resources (380), mount (373), ssh_authorized_key (271)
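A tally like this can be approximated locally with a regular expression over manifest text. The sketch below is in Python and assumes a resource declaration looks like a lowercase type name, an opening brace, a title, and a colon. The names `RESOURCE_RE` and `count_resource_types` and the sample manifest are hypothetical, and the pattern is a heuristic, not a Puppet parser.

```python
import re
from collections import Counter

# Building blocks for the heuristic pattern.
TYPE = r"[a-z][a-z0-9_]*(?:::[a-z][a-z0-9_]*)*"
TITLE = r"""(?:'[^'\n]*'|"[^"\n]*"|\$[\w:]+|\[[^\]]*\])"""
# A type name, "{", a title (string, variable or array), then ":".
RESOURCE_RE = re.compile(rf"^[ \t]*({TYPE})[ \t]*\{{[ \t]*{TITLE}[ \t]*:", re.MULTILINE)

# "class { 'name': }" has the same shape but declares a class, not a resource.
NOT_RESOURCE_TYPES = {"class"}


def count_resource_types(manifest: str) -> Counter:
    """Count resource declarations per type in one manifest string."""
    names = RESOURCE_RE.findall(manifest)
    return Counter(n for n in names if n not in NOT_RESOURCE_TYPES)


# Made-up sample manifest, for illustration only.
sample = """
package { 'git':
  ensure => installed,
}
file { '/etc/motd':
  mode => '0644',
}
file { ['/srv', '/srv/www']:
  ensure => directory,
}
service { 'nginx':
  ensure => running,
}
exec { 'apt-get update': path => '/usr/bin' }
class { 'apache': }
"""

resource_counts = count_resource_types(sample)
for name, n in resource_counts.most_common():
    print(f"{name} ({n})")
```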

Slide 39

Slide 39 text

Which resource types are least used?
interface (260), zone (207), selboolean (108), router (103), tidy (76), sshkey (66), schedule (42), filebucket (42), mailalias (40), vlan (31), selmodule (25), zfs (11), mcx (6), scheduled_task (5), zpool (3), k5login (3), computer (2), maillist (1)

Slide 40

Slide 40 text

More than 50% of resources are file and package

Slide 41

Slide 41 text

exec accounts for ~18% of resources
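These percentages can be checked against the counts on the two resource-type slides. The Python sketch below assumes the types listed on those slides make up the whole population, which is an approximation because types not shown on the slides are left out of the denominator.

```python
# Counts copied from the "most used" and "least used" slides.
most_used = {
    "file": 30298, "package": 22162, "exec": 16825, "service": 11112,
    "user": 3951, "host": 2361, "group": 2181, "notify": 2151,
    "yumrepo": 1229, "cron": 1122, "stage": 429, "resources": 380,
    "mount": 373, "ssh_authorized_key": 271,
}
least_used = {
    "interface": 260, "zone": 207, "selboolean": 108, "router": 103,
    "tidy": 76, "sshkey": 66, "schedule": 42, "filebucket": 42,
    "mailalias": 40, "vlan": 31, "selmodule": 25, "zfs": 11, "mcx": 6,
    "scheduled_task": 5, "zpool": 3, "k5login": 3, "computer": 2,
    "maillist": 1,
}

counts = {**most_used, **least_used}
# Assumption: the listed types are treated as the full set of resources.
total = sum(counts.values())

file_and_package = (counts["file"] + counts["package"]) / total
exec_share = counts["exec"] / total

print(f"file + package: {file_and_package:.1%}")
print(f"exec: {exec_share:.1%}")
```

Under that assumption, file and package together come to roughly 55% of the listed resources and exec to roughly 18%, in line with the figures above.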

Slide 42

Slide 42 text

A very long tail. maillist was used once in 7.5 million lines

Slide 43

Slide 43 text

How popular are the nagios types?
service (160), command (56), host (49), servicedependency (35), contact (18), servicegroup (10), hostextinfo (8), timeperiod (5), hostdependency (3), serviceescalation (2), hostescalation (1)

Slide 44

Slide 44 text

Which new features are used?

Slide 45

Slide 45 text

Which data types are being used?
String (603), Boolean (411), Integer (174), Hash (132), Array (82), Default (14), Float (5), Numeric (3), Undef (1)

Slide 46

Slide 46 text

Which abstract data types are in use?
Optional (611), Enum (277), Data (162), Type (145), Variant (134), All (127), Pattern (73), Tuple (13), Collection (11), Struct (11), Scalar (2)

Slide 47

Slide 47 text

How many repositories contain Gemfiles?

Slide 48

Slide 48 text

What gems are popular in Puppet projects?

Slide 49

Slide 49 text

What gems are popular in Puppet projects?
puppet (1285), puppetlabs_spec_helper (1268), rake (1215), puppet-lint (837), metadata-json-lint (726), beaker-rspec (694), rspec-puppet (676), puppet-blacksmith (518), beaker (509), serverspec (444)
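One way to arrive at numbers of this kind is to pull gem names out of each Gemfile and count how many Gemfiles mention each one. The Python sketch below assumes that a line starting with `gem` followed by a quoted name is a dependency. `GEM_RE`, `count_gems` and the sample Gemfiles are illustrative only, and Gemfiles that build gem names dynamically in Ruby will not be captured.

```python
import re
from collections import Counter

# A line that starts with: gem 'name' or gem "name". Commented-out lines do not match.
GEM_RE = re.compile(r"""^\s*gem\s+['"]([^'"]+)['"]""", re.MULTILINE)


def count_gems(gemfiles):
    """Count, per gem, how many Gemfiles declare it (each Gemfile counts once)."""
    counts = Counter()
    for text in gemfiles:
        counts.update(set(GEM_RE.findall(text)))
    return counts


# Made-up sample Gemfiles, for illustration only.
gemfiles = [
    "source 'https://rubygems.org'\n"
    "gem 'puppet', ENV['PUPPET_VERSION']\n"
    "gem 'rake'\n"
    "gem 'puppet-lint'\n",
    'source "https://rubygems.org"\n'
    "group :test do\n"
    '  gem "rake"\n'
    '  gem "puppet-lint", ">= 2.0"\n'
    "  # gem 'beaker'\n"
    "end\n",
]

gem_counts = count_gems(gemfiles)
for name, n in gem_counts.most_common():
    print(f"{name} ({n})")
```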

Slide 50

Slide 50 text

Which puppet-lint plugins are most used?
unquoted_string (366), leading_zero (344), absolute_classname (318), trailing_comma (310), version_comparison (248), variable_contains_uppercase (222), beginning_with_digits (191), empty_string (156), undef_in_function (147), spaceship_operator_without_tag (135)

Slide 51

Slide 51 text

Other sources of data
Libraries.io, the Forge API and more

Slide 52

Slide 52 text

Different systems often have different views on the same object

Slide 53

Slide 53 text

Libraries.io

Slide 54

Slide 54 text

Puppet Forge API

Slide 55

Slide 55 text

forgeapi.puppetlabs.com/v3/modules
{
  "uri": "/v3/modules/arioch-keepalived",
  "slug": "arioch-keepalived",
  "name": "keepalived",
  "downloads": 5551609,
  "created_at": "2013-07-02 09:13:33 -0700",
  "updated_at": "2016-12-28 20:00:02 -0800",
  "deprecated_at": null,
  "deprecated_for": null,
  "superseded_by": null,
  "supported": false,
  "endorsement": "approved",
  "module_group": "base",
  "owner": {
    "uri": "/v3/users/arioch",
    "slug": "arioch",
    "username": "arioch",
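A record like this is easy to reduce to the handful of fields needed for analysis. The Python sketch below parses a local copy of the fields shown on the slide. The slide cuts the record off partway through, so the sample closes the `owner` object after the last visible field and includes nothing that is not shown. The `summarize` helper is hypothetical, and the sketch makes no network request.

```python
import json

# Only the fields visible on the slide; the real response contains more.
SAMPLE_RECORD = """
{
  "uri": "/v3/modules/arioch-keepalived",
  "slug": "arioch-keepalived",
  "name": "keepalived",
  "downloads": 5551609,
  "created_at": "2013-07-02 09:13:33 -0700",
  "updated_at": "2016-12-28 20:00:02 -0800",
  "deprecated_at": null,
  "supported": false,
  "endorsement": "approved",
  "module_group": "base",
  "owner": {"uri": "/v3/users/arioch", "slug": "arioch", "username": "arioch"}
}
"""


def summarize(module: dict) -> dict:
    """Keep the fields that are useful for counting and ranking modules."""
    return {
        "slug": module["slug"],
        "owner": module["owner"]["username"],
        "downloads": module["downloads"],
        "endorsement": module.get("endorsement"),
        "supported": module.get("supported", False),
        "deprecated": module.get("deprecated_at") is not None,
    }


summary = summarize(json.loads(SAMPLE_RECORD))
print(summary)
```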

Slide 56

Slide 56 text

Puppet version dependencies

Slide 57

Slide 57 text

Long tail of specific Puppet version requirements

Slide 58

Slide 58 text

What licenses are popular for Forge modules?

Slide 59

Slide 59 text

Interesting to note that while MIT is the most common license for Puppet repositories, most published modules are Apache licensed

Slide 60

Slide 60 text

Caveats
Limitations of the data

Slide 61

Slide 61 text

Obviously this is a subset of all Puppet code

Slide 62

Slide 62 text

Software has bugs. Including this software.

Slide 63

Slide 63 text

How does private Puppet code vary from this dataset?

Slide 64

Slide 64 text

Conclusions
If all you remember is...

Slide 65

Slide 65 text

The ability to ask questions of a large data set is a useful design tool

Slide 66

Slide 66 text

For example, we could use this approach to see how people adopt the new Puppet Tasks

Slide 67

Slide 67 text

So many more questions:
- Analyzing hieradata
- Parsing all of the Puppet code
- Seeing changes over time

Slide 68

Slide 68 text

Lots of ideas too:
- Make it easy to run against your own code
- Some way of submitting aggregates
- Publish public BigQuery tables

Slide 69

Slide 69 text

Any questions?
And thanks for listening