Slide 1

Slide 1 text

Parsing a distribution name is sometimes hard
Kenichi Ishigaki @charsbar
PerlCon 2019, Aug 9, 2019

Slide 2

Slide 2 text

CPAN::Groonga (@LPW2018)
https://grep.cpanauthors.org/
You can grep the source code of Perl5/6 modules

Slide 3

Slide 3 text

CPAN services need to parse distribution names for grouping

Slide 4

Slide 4 text

Usually it's easy

  I/IS/ISHIGAKI/Module-CPANTS-Analyse-1.01.tar.gz
  S/SK/SKAJI/Perl6/App-Mi6-0.0.2.tar.gz

• The blue part (I/IS/ISHIGAKI, S/SK/SKAJI) is the author's directory based on their ID
• The purple part (Perl6) is a subdirectory under the author's dir
• The red part (Module-CPANTS-Analyse, App-Mi6) is the name of the distribution
• The orange part (1.01, 0.0.2) is the version of the distribution
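The easy case described above can be sketched with a single regex in core Perl. This is only an illustration: the helper name split_easy_path and the list of accepted extensions are assumptions made for this example, not part of CPAN::DistnameInfo or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the talk): split a well-formed CPAN path
# into author ID, subdirectory, distribution name and version.
sub split_easy_path {
    my ($path) = @_;
    my ($cpanid, $subdir, $dist, $version) = $path =~ m{
        \A [A-Z] / [A-Z]{2} / ([A-Z][A-Z0-9-]*) /   # author directory
        ( (?: [^/]+ / )* )                          # optional subdirectories
        (.+) - ( v?[0-9][0-9._]* )                  # name, hyphen, version
        \. (?: tar\.gz | tgz | zip ) \z             # archive extension
    }x or return;
    $subdir =~ s{/\z}{};    # drop the trailing slash, if any
    return { cpanid => $cpanid, subdir => $subdir, dist => $dist, version => $version };
}

my $info = split_easy_path("S/SK/SKAJI/Perl6/App-Mi6-0.0.2.tar.gz");
print join(" | ", @{$info}{qw(cpanid subdir dist version)}), "\n";
# prints: SKAJI | Perl6 | App-Mi6 | 0.0.2
```

Anything that does not follow the name-hyphen-version pattern simply fails to match here, which is where the hard cases begin.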

Slide 5

Slide 5 text

CPAN::DistnameInfo
the de facto standard Perl5 module written by Graham Barr

Slide 6

Slide 6 text

CPAN::DistnameInfo but unmaintained for years

Slide 7

Slide 7 text

CPAN::DistnameInfo unmaintained by the toolchain gang, too

Slide 8

Slide 8 text

CPAN::DistnameInfo I've been using a patched version for CPANTS but I didn't want to repeat that for CPAN::Groonga

Slide 9

Slide 9 text

CPAN::DistnameInfo I was going to ping the gang, but I thought twice: let's test it with BackPAN first

Slide 10

Slide 10 text

I found something

Slide 11

Slide 11 text

CPAN::DistnameInfo says...

  my $path = "E/ER/ERWANMAS/v0.10.zip";
  say encode_json({ CPAN::DistnameInfo->new($path)->properties });

  {
    "cpanid" : "ERWANMAS",
    "dist" : "v",
    "distvname" : "v0.10",
    "extension" : "zip",
    "filename" : "v0.10.zip",
    "maturity" : "released",
    "pathname" : "E/ER/ERWANMAS/v0.10.zip",
    "version" : "0.10"
  }

Slide 12

Slide 12 text

Or ...

  my $path = "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz";
  say encode_json({ CPAN::DistnameInfo->new($path)->properties });

  {
    "cpanid" : "SONNY",
    "dist" : "DBIx-Class-InflateColumn",
    "distvname" : "DBIx-Class-InflateColumn-S3",
    "extension" : "tar.gz",
    "filename" : "DBIx-Class-InflateColumn-S3.tar.gz",
    "maturity" : "released",
    "pathname" : "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
    "version" : "S3"
  }

But really?

Slide 13

Slide 13 text

Of course not https://metacpan.org/release/SONNY/DBIx-Class-InflateColumn-S3

Slide 14

Slide 14 text

Of course not
https://metacpan.org/requires/distribution/DBIx-Class-InflateColumn?sort=[[2,1]]

Slide 15

Slide 15 text

More delicate cases

  my $path = "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz";
  say encode_json({ CPAN::DistnameInfo->new($path)->properties });

  {
    "cpanid" : "HARPREET",
    "dist" : "XMS-MotifSetv",
    "distvname" : "XMS-MotifSetv1.0",
    "extension" : "tar.gz",
    "filename" : "XMS-MotifSetv1.0.tar.gz",
    "maturity" : "released",
    "pathname" : "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz",
    "version" : "1.0"
  }

Slide 16

Slide 16 text

More delicate cases

Slide 17

Slide 17 text

More delicate cases

  my $path = "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz";
  say encode_json({ CPAN::DistnameInfo->new($path)->properties });

  {
    "cpanid" : "MPERRY",
    "dist" : "Config-INI-Reader",
    "distvname" : "Config-INI-Reader-Encrypted2",
    "extension" : "tar.gz",
    "filename" : "Config-INI-Reader-Encrypted2.tar.gz",
    "maturity" : "released",
    "pathname" : "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz",
    "version" : "Encrypted2"
  }

Slide 18

Slide 18 text

More delicate cases

Slide 19

Slide 19 text

More delicate cases

  my $path = "C/CA/CAFFIEND/font_ft2_0.1.0.tgz";
  say encode_json({ CPAN::DistnameInfo->new($path)->properties });

  {
    "cpanid" : "CAFFIEND",
    "dist" : "font_ft",
    "distvname" : "font_ft2_0.1.0",
    "extension" : "tgz",
    "filename" : "font_ft2_0.1.0.tgz",
    "maturity" : "released",
    "pathname" : "C/CA/CAFFIEND/font_ft2_0.1.0.tgz",
    "version" : "2_0.1.0"
  }

Slide 20

Slide 20 text

More delicate cases

Slide 21

Slide 21 text

Why does this happen?
• CPAN::DistnameInfo looks for a distribution name and a version at the same time (using a regex)
• But it might be better to look for a version first, then treat the rest as a name
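The "version first, then the rest is the name" idea can be sketched in a few lines of core Perl. This is a deliberately simplified illustration, not the Parse::Distname implementation: the function name version_first, the accepted extensions, and the rule for what counts as a version are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the Parse::Distname implementation:
# look for a version at the end of the file name first,
# then treat whatever precedes it as the distribution name.
sub version_first {
    my ($filename) = @_;
    (my $base = $filename) =~ s/\.(?:tar\.gz|tgz|zip)\z//;

    # Assumed rule: a version is an optional "v" plus dot-separated numbers,
    # anchored at the end and preceded by "-", "_" or the start of the string.
    if ($base =~ /\A(?:(.*?)[-_])?(v?[0-9]+(?:\.[0-9]+)*)\z/) {
        return { dist => defined $1 ? $1 : '', version => $2 };
    }

    # No version found: the whole base name is the distribution name.
    return { dist => $base, version => undef };
}

for my $file (qw(v0.10.zip DBIx-Class-InflateColumn-S3.tar.gz font_ft2_0.1.0.tgz)) {
    my $r = version_first($file);
    printf "%-40s dist=%-30s version=%s\n",
        $file, "'$r->{dist}'", defined $r->{version} ? $r->{version} : 'undef';
}
```

With this rule, v0.10.zip yields an empty name and version v0.10, DBIx-Class-InflateColumn-S3 keeps its full name with no version, and font_ft2_0.1.0 splits into font_ft2 and 0.1.0. Names such as XMS-MotifSetv1.0, where the version is glued to the name without a separator, need extra heuristics that this sketch leaves out.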

Slide 22

Slide 22 text

Parse::Distname
https://metacpan.org/release/Parse-Distname
So I wrote a new module as a PoC, instead of applying a breaking change to the existing code

Slide 23

Slide 23 text

Let's see

  my $path = "E/ER/ERWANMAS/v0.10.zip";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid" : "ERWANMAS",
  - "dist" : "v",
  + "dist" : "",
    "distvname" : "v0.10",
    "extension" : "zip",
    "filename" : "v0.10.zip",
    "maturity" : "released",
    "pathname" : "E/ER/ERWANMAS/v0.10.zip",
  - "version" : "0.10"
  + "version" : "v0.10"
  }

Slide 24

Slide 24 text

Let's see

  my $path = "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid" : "SONNY",
  - "dist" : "DBIx-Class-InflateColumn",
  + "dist" : "DBIx-Class-InflateColumn-S3",
    "distvname" : "DBIx-Class-InflateColumn-S3",
    "extension" : "tar.gz",
    "filename" : "DBIx-Class-InflateColumn-S3.tar.gz",
    "maturity" : "released",
    "pathname" : "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
  - "version" : "S3"
  + "version" : null
  }

Slide 25

Slide 25 text

Let's see

  my $path = "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid" : "HARPREET",
  - "dist" : "XMS-MotifSetv",
  + "dist" : "XMS-MotifSet",
    "distvname" : "XMS-MotifSetv1.0",
    "extension" : "tar.gz",
    "filename" : "XMS-MotifSetv1.0.tar.gz",
    "maturity" : "released",
    "pathname" : "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz",
  - "version" : "1.0"
  + "version" : "v1.0"
  }

Slide 26

Slide 26 text

Let's see

  my $path = "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid": "MPERRY",
  - "dist": "Config-INI-Reader",
  + "dist": "Config-INI-Reader-Encrypted",
    "distvname": "Config-INI-Reader-Encrypted2",
    "extension": "tar.gz",
    "filename": "Config-INI-Reader-Encrypted2.tar.gz",
    "maturity": "released",
    "pathname": "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz",
  - "version": "Encrypted2"
  + "version": "2"
  }

Slide 27

Slide 27 text

Let's see

  my $path = "C/CA/CAFFIEND/font_ft2_0.1.0.tgz";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid": "CAFFIEND",
  - "dist": "font_ft",
  + "dist": "font_ft2",
    "distvname": "font_ft2_0.1.0",
    "extension": "tgz",
    "filename": "font_ft2_0.1.0.tgz",
    "maturity": "released",
    "pathname": "C/CA/CAFFIEND/font_ft2_0.1.0.tgz",
  - "version": "2_0.1.0"
  + "version": "0.1.0"
  }

Slide 28

Slide 28 text

Fixed 200+ cases
• Out of 330000+ BackPAN distributions
• Most cases are ancient, or accidental, and often removed already
• See https://github.com/charsbar/Parse-Distname/blob/master/xt/walk_through.t for details
• Parse::Distname also contains a few patches for CPAN::DistnameInfo

Slide 29

Slide 29 text

May not be perfect yet

  my $path = "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid" : "CDRAKE",
  - "dist" : "Crypt",
  + "dist" : "Crypt-MatrixSSL",
    "distvname" : "Crypt-MatrixSSL3",
    "extension" : "tar.gz",
    "filename" : "Crypt-MatrixSSL3.tar.gz",
    "maturity" : "released",
    "pathname" : "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz",
  - "version" : "MatrixSSL3"
  + "version" : "3"
  }

Looks better, but...

Slide 30

Slide 30 text

May not be perfect yet

Slide 31

Slide 31 text

Fixed this morning (0.04)

  my $path = "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz";
  say encode_json({ Parse::Distname->new($path)->properties });

  {
    "cpanid" : "CDRAKE",
  - "dist" : "Crypt",
  + "dist" : "Crypt-MatrixSSL3",
    "distvname" : "Crypt-MatrixSSL3",
    "extension" : "tar.gz",
    "filename" : "Crypt-MatrixSSL3.tar.gz",
    "maturity" : "released",
    "pathname" : "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz",
  - "version" : "MatrixSSL3"
  + "version" : null
  }

... by making it an exception

Slide 32

Slide 32 text

Dogfooding
• I have started using this for CPANTS and CPAN::Groonga
• If everything goes well...?

Slide 33

Slide 33 text

Caveats for migration
• The distribution name may become empty (and your database may complain about this)
• Internal hash keys have changed
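One possible way to cope with the empty-name caveat is to choose a non-empty grouping key before writing to the database. The sketch below is an assumption, not something Parse::Distname provides: the helper name grouping_key and the fallback to distvname are made up for illustration, and a real service may prefer a different policy.

```perl
use strict;
use warnings;

# Hypothetical guard (not part of Parse::Distname): pick a non-empty
# grouping key from a parsed properties hash before storing it.
sub grouping_key {
    my (%props) = @_;
    return $props{dist} if defined $props{dist} && length $props{dist};
    # Assumed policy: fall back to the name-plus-version string
    return $props{distvname};
}

print grouping_key(dist => '', distvname => 'v0.10'), "\n";
# prints: v0.10
```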

Slide 34

Slide 34 text

Thanks