Parsing a distribution name is sometimes hard

LT at PerlCon 2019

Kenichi Ishigaki

August 09, 2019

Transcript

  1. Parsing a distribution name is sometimes hard
    Kenichi Ishigaki @charsbar
    PerlCon 2019, Aug 9, 2019
  2. CPAN::Groonga (@LPW2018)
    https://grep.cpanauthors.org/
    You can grep the source code of Perl5/6 modules
  3. CPAN services need to parse distribution names for grouping

  4. Usually it's easy
    I/IS/ISHIGAKI/Module-CPANTS-Analyse-1.01.tar.gz
    S/SK/SKAJI/Perl6/App-Mi6-0.0.2.tar.gz
    • The blue part (I/IS/ISHIGAKI, S/SK/SKAJI) is the author's directory based on their ID
    • The purple part (Perl6) is a subdirectory under the author's dir
    • The red part (Module-CPANTS-Analyse, App-Mi6) is the name of the distribution
    • The orange part (1.01, 0.0.2) is the version of the distribution
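The four parts above can be pulled apart with a few regexes. This is a minimal sketch for the easy case only; `split_cpan_path` and its list of archive extensions are assumptions made for illustration and are not part of CPAN::DistnameInfo or Parse::Distname.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; it only handles the "easy" layout.
sub split_cpan_path {
    my ($path) = @_;

    # Author directory: first letter, first two characters, then the full ID.
    my ($cpanid, $rest) = $path =~ m{\A[A-Z]/[A-Z0-9-]{2}/([A-Z0-9-]+)/(.+)\z}
        or return;

    # Anything before the last slash is a subdirectory under the author's dir.
    my ($subdir, $file) = $rest =~ m{\A(?:(.+)/)?([^/]+)\z};

    # Drop a few common archive extensions (assumed list, not exhaustive).
    (my $distvname = $file) =~ s/\.(?:tar\.gz|tar\.bz2|tgz|zip)\z//;

    # Easy case only: a name, a hyphen, then a numeric version.
    my ($dist, $version) = $distvname =~ /\A(.+)-([0-9][0-9._]*)\z/;

    return {
        cpanid  => $cpanid,
        subdir  => $subdir // '',
        dist    => $dist,
        version => $version,
    };
}

my $info = split_cpan_path("S/SK/SKAJI/Perl6/App-Mi6-0.0.2.tar.gz");
print join(", ", @{$info}{qw(cpanid subdir dist version)}), "\n";
# prints: SKAJI, Perl6, App-Mi6, 0.0.2
```

The last regex is exactly where the hard cases live: it assumes a hyphen followed by a purely numeric version, which real uploads do not always follow.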
  5. CPAN::DistnameInfo
    the de facto standard Perl5 module written by Graham Barr

  6. CPAN::DistnameInfo
    but unmaintained for years

  7. CPAN::DistnameInfo
    unmaintained by the toolchain gang, too

  8. CPAN::DistnameInfo
    I've been using a patched version for CPANTS, but I didn't want to repeat that for CPAN::Groonga

  9. CPAN::DistnameInfo
    I was going to ping the gang, but I thought twice: let's test it with BackPAN first
  10. I found something

  11. CPAN::DistnameInfo says...
        my $path = "E/ER/ERWANMAS/v0.10.zip";
        say encode_json({ CPAN::DistnameInfo->new($path)->properties });

        {
          "cpanid" : "ERWANMAS",
          "dist" : "v",
          "distvname" : "v0.10",
          "extension" : "zip",
          "filename" : "v0.10.zip",
          "maturity" : "released",
          "pathname" : "E/ER/ERWANMAS/v0.10.zip",
          "version" : "0.10"
        }
  12. Or ...
        my $path = "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz";
        say encode_json({ CPAN::DistnameInfo->new($path)->properties });

        {
          "cpanid" : "SONNY",
          "dist" : "DBIx-Class-InflateColumn",
          "distvname" : "DBIx-Class-InflateColumn-S3",
          "extension" : "tar.gz",
          "filename" : "DBIx-Class-InflateColumn-S3.tar.gz",
          "maturity" : "released",
          "pathname" : "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
          "version" : "S3"
        }
    But really?
  13. Of course not https://metacpan.org/release/SONNY/DBIx-Class-InflateColumn-S3

  14. Of course not https://metacpan.org/requires/distribution/DBIx-Class-InflateColumn?sort=[[2,1]]

  15. More delicate cases
        my $path = "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz";
        say encode_json({ CPAN::DistnameInfo->new($path)->properties });

        {
          "cpanid" : "HARPREET",
          "dist" : "XMS-MotifSetv",
          "distvname" : "XMS-MotifSetv1.0",
          "extension" : "tar.gz",
          "filename" : "XMS-MotifSetv1.0.tar.gz",
          "maturity" : "released",
          "pathname" : "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz",
          "version" : "1.0"
        }
  16. More delicate cases

  17. More delicate cases
        my $path = "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz";
        say encode_json({ CPAN::DistnameInfo->new($path)->properties });

        {
          "cpanid" : "MPERRY",
          "dist" : "Config-INI-Reader",
          "distvname" : "Config-INI-Reader-Encrypted2",
          "extension" : "tar.gz",
          "filename" : "Config-INI-Reader-Encrypted2.tar.gz",
          "maturity" : "released",
          "pathname" : "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz",
          "version" : "Encrypted2"
        }
  18. More delicate cases

  19. More delicate cases
        my $path = "C/CA/CAFFIEND/font_ft2_0.1.0.tgz";
        say encode_json({ CPAN::DistnameInfo->new($path)->properties });

        {
          "cpanid" : "CAFFIEND",
          "dist" : "font_ft",
          "distvname" : "font_ft2_0.1.0",
          "extension" : "tgz",
          "filename" : "font_ft2_0.1.0.tgz",
          "maturity" : "released",
          "pathname" : "C/CA/CAFFIEND/font_ft2_0.1.0.tgz",
          "version" : "2_0.1.0"
        }
  20. More delicate cases

  21. Why does this happen?
    • CPAN::DistnameInfo looks for a distribution name and a version at the same time (using a regex)
    • But it might be better to look for a version first, then treat the rest as a name
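The difference between the two strategies can be sketched with two deliberately simplified regexes. Both subs below are hypothetical illustrations: neither regex is taken from CPAN::DistnameInfo or Parse::Distname, and the real modules handle many more cases.

```perl
use strict;
use warnings;

# Name and version in one pass: the name is the shortest prefix that ends in
# a letter and is followed (after an optional separator) by a digit;
# everything from that digit on becomes the version.
sub split_at_once {
    my ($distvname) = @_;
    my ($dist, $version) = $distvname =~ /\A(.+?[A-Za-z])[-_]?([0-9].*)\z/
        or return ($distvname, undef);
    return ($dist, $version);
}

# Version first: anchor a dotted number at the end of the string, then treat
# whatever is left in front of it as the name.
sub split_version_first {
    my ($distvname) = @_;
    my ($dist, $version) = $distvname =~ /\A(.*?)[-_]?(v?[0-9]+(?:\.[0-9]+)+)\z/
        or return ($distvname, undef);
    return ($dist, $version);
}

for my $distvname ("Module-CPANTS-Analyse-1.01", "font_ft2_0.1.0") {
    printf "%-28s at once: %s / %s   version first: %s / %s\n",
        $distvname, split_at_once($distvname), split_version_first($distvname);
}
```

Both subs agree on Module-CPANTS-Analyse-1.01. For font_ft2_0.1.0 the one-pass sketch stops at the first digit and yields font_ft / 2_0.1.0, while the version-first sketch yields font_ft2 / 0.1.0. A name with no dotted number at the end, such as DBIx-Class-InflateColumn-S3, comes back from the version-first sketch whole with an undefined version.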
  22. Parse::Distname
    https://metacpan.org/release/Parse-Distname
    So I wrote a new module as a PoC, instead of applying a breaking change to the existing code
  23. Let's see
        my $path = "E/ER/ERWANMAS/v0.10.zip";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid" : "ERWANMAS",
        - "dist" : "v",
        + "dist" : "",
          "distvname" : "v0.10",
          "extension" : "zip",
          "filename" : "v0.10.zip",
          "maturity" : "released",
          "pathname" : "E/ER/ERWANMAS/v0.10.zip",
        - "version" : "0.10"
        + "version" : "v0.10"
        }
  24. Let's see
        my $path = "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid" : "SONNY",
        - "dist" : "DBIx-Class-InflateColumn",
        + "dist" : "DBIx-Class-InflateColumn-S3",
          "distvname" : "DBIx-Class-InflateColumn-S3",
          "extension" : "tar.gz",
          "filename" : "DBIx-Class-InflateColumn-S3.tar.gz",
          "maturity" : "released",
          "pathname" : "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
        - "version" : "S3"
        + "version" : null
        }
  25. Let's see
        my $path = "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid" : "HARPREET",
        - "dist" : "XMS-MotifSetv",
        + "dist" : "XMS-MotifSet",
          "distvname" : "XMS-MotifSetv1.0",
          "extension" : "tar.gz",
          "filename" : "XMS-MotifSetv1.0.tar.gz",
          "maturity" : "released",
          "pathname" : "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz",
        - "version" : "1.0"
        + "version" : "v1.0"
        }
  26. Let's see
        my $path = "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid": "MPERRY",
        - "dist": "Config-INI-Reader",
        + "dist": "Config-INI-Reader-Encrypted",
          "distvname": "Config-INI-Reader-Encrypted2",
          "extension": "tar.gz",
          "filename": "Config-INI-Reader-Encrypted2.tar.gz",
          "maturity": "released",
          "pathname": "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz",
        - "version": "Encrypted2"
        + "version": "2"
        }
  27. Let's see
        my $path = "C/CA/CAFFIEND/font_ft2_0.1.0.tgz";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid": "CAFFIEND",
        - "dist": "font_ft",
        + "dist": "font_ft2",
          "distvname": "font_ft2_0.1.0",
          "extension": "tgz",
          "filename": "font_ft2_0.1.0.tgz",
          "maturity": "released",
          "pathname": "C/CA/CAFFIEND/font_ft2_0.1.0.tgz",
        - "version": "2_0.1.0"
        + "version": "0.1.0"
        }
  28. Fixed 200+ cases
    • Out of 330000+ BackPAN distributions
    • Most cases are ancient, or accidental, and often removed already
    • See https://github.com/charsbar/Parse-Distname/blob/master/xt/walk_through.t for details
    • Parse::Distname also contains a few patches for CPAN::DistnameInfo
  29. May not be perfect yet
        my $path = "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid" : "CDRAKE",
        - "dist" : "Crypt",
        + "dist" : "Crypt-MatrixSSL",
          "distvname" : "Crypt-MatrixSSL3",
          "extension" : "tar.gz",
          "filename" : "Crypt-MatrixSSL3.tar.gz",
          "maturity" : "released",
          "pathname" : "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz",
        - "version" : "MatrixSSL3"
        + "version" : "3"
        }
    Looks better, but...
  30. May not be perfect yet

  31. Fixed this morning (0.04)
        my $path = "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz";
        say encode_json({ Parse::Distname->new($path)->properties });

        {
          "cpanid" : "CDRAKE",
        - "dist" : "Crypt",
        + "dist" : "Crypt-MatrixSSL3",
          "distvname" : "Crypt-MatrixSSL3",
          "extension" : "tar.gz",
          "filename" : "Crypt-MatrixSSL3.tar.gz",
          "maturity" : "released",
          "pathname" : "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz",
        - "version" : "MatrixSSL3"
        + "version" : null
        }
    ... by making it an exception
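One way to picture "making it an exception" is a lookup table consulted before the generic rule runs. The sketch below is only an assumption about the general shape of such a fix; `%NO_VERSION`, `split_with_exceptions` and the regex are hypothetical and do not describe how Parse::Distname 0.04 is actually implemented.

```perl
use strict;
use warnings;

# Hypothetical exception table: names that end in a digit but carry no version.
my %NO_VERSION = map { $_ => 1 } qw(Crypt-MatrixSSL3);

# Simplified, hypothetical rule: a trailing number is taken as the version,
# unless the whole name is listed as an exception.
sub split_with_exceptions {
    my ($distvname) = @_;
    return ($distvname, undef) if $NO_VERSION{$distvname};

    my ($dist, $version) = $distvname =~ /\A(.*?)[-_]?([0-9]+(?:\.[0-9]+)*)\z/
        or return ($distvname, undef);
    return ($dist, $version);
}

my ($dist, $version) = split_with_exceptions("Config-INI-Reader-Encrypted2");
print "$dist / $version\n";    # prints: Config-INI-Reader-Encrypted / 2
```

Without the table entry, the same rule would split Crypt-MatrixSSL3 into Crypt-MatrixSSL and 3. With it, the name comes back whole and the version is undefined.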
  32. Dogfooding
    • I have started using this for CPANTS and CPAN::Groonga
    • If everything goes well...?
  33. Caveats for migration
    • Distribution name may become empty (and your database may complain about this)
    • Internal hash keys have changed
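For the empty-name caveat, a small guard before writing to the database is one possible approach. This sketch assumes the `dist` and `distvname` keys seen in the properties output earlier in the talk; `dist_key_for_storage` and its fallback to `distvname` are hypothetical choices, not something either module provides.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a non-empty value to store as the distribution name.
# Falling back to distvname is an assumption of this sketch.
sub dist_key_for_storage {
    my ($info) = @_;    # hash ref with "dist" and "distvname" keys
    my $dist = $info->{dist};
    return $dist if defined $dist && length $dist;
    return $info->{distvname};
}

print dist_key_for_storage({ dist => "", distvname => "v0.10" }), "\n";
# prints: v0.10
```

Whether a fallback, a NULL column, or skipping the row is right depends on how the database groups distributions.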
  34. Thanks