Parsing a distribution name is sometimes hard

LT at PerlCon 2019

Kenichi Ishigaki

August 09, 2019

Transcript

  1. Parsing a distribution name
    is sometimes hard
    Kenichi Ishigaki
    @charsbar
    PerlCon 2019
    Aug 9, 2019

  2. CPAN::Groonga (@LPW2018)
    https://grep.cpanauthors.org/
    You can grep the source code of Perl5/6 modules

  3. CPAN services need to
    parse distribution names
    for grouping

  4. Usually it's easy
    I/IS/ISHIGAKI/Module-CPANTS-Analyse-1.01.tar.gz
    S/SK/SKAJI/Perl6/App-Mi6-0.0.2.tar.gz
    • The blue part is the author's directory based on their ID
    • The purple part is a subdirectory under the author's dir
    • The red part is the name of the distribution
    • The orange part is the version of the distribution
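
The four parts listed above can be pulled apart with two regular expressions in the easy case. This is only a minimal sketch: `split_cpan_path` is a hypothetical helper written for this note, not part of CPAN::DistnameInfo or any other module, and its version pattern deliberately handles nothing but plain dotted versions.

```perl
use strict;
use warnings;

# Hypothetical helper: split a CPAN path into author, subdirectory,
# distribution name and version. Only the "easy" shape is supported.
sub split_cpan_path {
    my ($path) = @_;

    # "I/IS/ISHIGAKI/" is the author's directory; anything between it and
    # the last slash is an optional subdirectory.
    my ($author, $subdir, $file) =
        $path =~ m{\A [A-Z] / [A-Z]{2} / ([A-Z][A-Z0-9-]*) / (?: (.+) / )? ([^/]+) \z}x
        or return;

    # Easy case only: "<name>-<dotted version>.<extension>".
    my ($name, $version) =
        $file =~ m{\A (.+) - (v?[0-9]+(?:\.[0-9]+)*) \. (?:tar\.gz|tgz|zip) \z}x
        or return;

    return {
        author  => $author,
        subdir  => $subdir,    # undef when there is no subdirectory
        name    => $name,
        version => $version,
    };
}

for my $path (
    "I/IS/ISHIGAKI/Module-CPANTS-Analyse-1.01.tar.gz",
    "S/SK/SKAJI/Perl6/App-Mi6-0.0.2.tar.gz",
) {
    my $parts = split_cpan_path($path) or next;
    printf "%s | %s | %s | %s\n",
        $parts->{author}, $parts->{subdir} // '(none)',
        $parts->{name}, $parts->{version};
}
```

Anything that does not fit this shape makes the helper return nothing, which is exactly where the harder cases later in the deck begin.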

  5. CPAN::DistnameInfo
    the de facto standard Perl5 module
    written by Graham Barr

  6. CPAN::DistnameInfo
    but unmaintained for years

  7. CPAN::DistnameInfo
    unmaintained by the toolchain gang, too

  8. CPAN::DistnameInfo
    I've been using a patched version
    for CPANTS but I didn't want to
    repeat that for CPAN::Groonga

  9. CPAN::DistnameInfo
    I was going to ping the gang,
    but I thought twice:
    let's test it with BackPAN first

  10. I found something

  11. CPAN::DistnameInfo says...
    my $path = "E/ER/ERWANMAS/v0.10.zip";
    say encode_json({ CPAN::DistnameInfo->new($path)->properties });
    {
    "cpanid" : "ERWANMAS",
    "dist" : "v",
    "distvname" : "v0.10",
    "extension" : "zip",
    "filename" : "v0.10.zip",
    "maturity" : "released",
    "pathname" : "E/ER/ERWANMAS/v0.10.zip",
    "version" : "0.10"
    }

  12. Or ...
    my $path = "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz";
    say encode_json({ CPAN::DistnameInfo->new($path)->properties });
    {
    "cpanid" : "SONNY",
    "dist" : "DBIx-Class-InflateColumn",
    "distvname" : "DBIx-Class-InflateColumn-S3",
    "extension" : "tar.gz",
    "filename" : "DBIx-Class-InflateColumn-S3.tar.gz",
    "maturity" : "released",
    "pathname" : "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
    "version" : "S3"
    }
    But really?

  13. Of course not
    https://metacpan.org/release/SONNY/DBIx-Class-InflateColumn-S3

  14. Of course not
    https://metacpan.org/requires/distribution/DBIx-Class-InflateColumn?sort=[[2,1]]

  15. More delicate cases
    my $path = "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz";
    say encode_json({ CPAN::DistnameInfo->new($path)->properties });
    {
    "cpanid" : "HARPREET",
    "dist" : "XMS-MotifSetv",
    "distvname" : "XMS-MotifSetv1.0",
    "extension" : "tar.gz",
    "filename" : "XMS-MotifSetv1.0.tar.gz",
    "maturity" : "released",
    "pathname" : "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz",
    "version" : "1.0"
    }

  16. More delicate cases

  17. More delicate cases
    my $path = "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz";
    say encode_json({ CPAN::DistnameInfo->new($path)->properties });
    {
    "cpanid" : "MPERRY",
    "dist" : "Config-INI-Reader",
    "distvname" : "Config-INI-Reader-Encrypted2",
    "extension" : "tar.gz",
    "filename" : "Config-INI-Reader-Encrypted2.tar.gz",
    "maturity" : "released",
    "pathname" : "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz",
    "version" : "Encrypted2"
    }

  18. More delicate cases

  19. More delicate cases
    my $path = "C/CA/CAFFIEND/font_ft2_0.1.0.tgz";
    say encode_json({ CPAN::DistnameInfo->new($path)->properties });
    {
    "cpanid" : "CAFFIEND",
    "dist" : "font_ft",
    "distvname" : "font_ft2_0.1.0",
    "extension" : "tgz",
    "filename" : "font_ft2_0.1.0.tgz",
    "maturity" : "released",
    "pathname" : "C/CA/CAFFIEND/font_ft2_0.1.0.tgz",
    "version" : "2_0.1.0"
    }

  20. More delicate cases

  21. Why does this happen?
    • CPAN::DistnameInfo looks for a
    distribution name and a version at
    the same time (using regex)
    • But it might be better to look for a
    version first, then treat the rest as
    a name
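
The difference between the two strategies can be illustrated with a pair of toy parsers. Both regular expressions below are assumptions made up for this sketch; they are far simpler than the real code in CPAN::DistnameInfo or Parse::Distname, and the function names `parse_together` and `parse_version_first` are hypothetical. The input is the file name with its extension already removed.

```perl
use strict;
use warnings;

# Toy parser 1: a single regex decides where the name ends and the
# version begins, so the first digit it meets wins.
sub parse_together {
    my ($stem) = @_;
    my ($name, $version) =
        $stem =~ m{\A ([A-Za-z][A-Za-z_-]*?) [-_]? ([0-9].*) \z}x
        or return ($stem, undef);
    return ($name, $version);
}

# Toy parser 2: cut a version-looking tail (optional "v", dotted digits)
# off the end first; whatever remains is the name.
sub parse_version_first {
    my ($stem) = @_;
    my $name = $stem;
    my $version;
    if ($name =~ s{ [-_]? ( v? [0-9]+ (?: \. [0-9]+ )+ ) \z}{}x) {
        $version = $1;
    }
    return ($name, $version);
}

for my $stem (qw(v0.10 font_ft2_0.1.0 XMS-MotifSetv1.0)) {
    my ($n1, $v1) = parse_together($stem);
    my ($n2, $v2) = parse_version_first($stem);
    printf "%-18s together: [%s] [%s]   version first: [%s] [%s]\n",
        $stem, $n1, $v1 // 'undef', $n2, $v2 // 'undef';
}
```

On these three inputs the first toy splits `font_ft2_0.1.0` into `font_ft` and `2_0.1.0`, while the version-first toy yields `font_ft2` and `0.1.0`. The toys say nothing about how either real module behaves on other inputs.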

  22. Parse::Distname
    https://metacpan.org/release/Parse-Distname
    So I wrote a new module as a PoC,
    instead of applying a breaking
    change to the existing code

  23. Let's see
    my $path = "E/ER/ERWANMAS/v0.10.zip";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid" : "ERWANMAS",
    - "dist" : "v",
    + "dist" : "",
    "distvname" : "v0.10",
    "extension" : "zip",
    "filename" : "v0.10.zip",
    "maturity" : "released",
    "pathname" : "E/ER/ERWANMAS/v0.10.zip",
    - "version" : "0.10"
    + "version" : "v0.10"
    }

  24. Let's see
    my $path = "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid" : "SONNY",
    - "dist" : "DBIx-Class-InflateColumn",
    + "dist" : "DBIx-Class-InflateColumn-S3",
    "distvname" : "DBIx-Class-InflateColumn-S3",
    "extension" : "tar.gz",
    "filename" : "DBIx-Class-InflateColumn-S3.tar.gz",
    "maturity" : "released",
    "pathname" : "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
    - "version" : "S3"
    + "version" : null
    }

  25. Let's see
    my $path = "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid" : "HARPREET",
    - "dist" : "XMS-MotifSetv",
    + "dist" : "XMS-MotifSet",
    "distvname" : "XMS-MotifSetv1.0",
    "extension" : "tar.gz",
    "filename" : "XMS-MotifSetv1.0.tar.gz",
    "maturity" : "released",
    "pathname" : "H/HA/HARPREET/XMS-MotifSetv1.0.tar.gz",
    - "version" : "1.0"
    + "version" : "v1.0"
    }

  26. Let's see
    my $path = "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid": "MPERRY",
    - "dist": "Config-INI-Reader",
    + "dist": "Config-INI-Reader-Encrypted",
    "distvname": "Config-INI-Reader-Encrypted2",
    "extension": "tar.gz",
    "filename": "Config-INI-Reader-Encrypted2.tar.gz",
    "maturity": "released",
    "pathname": "M/MP/MPERRY/Config-INI-Reader-Encrypted2.tar.gz",
    - "version": "Encrypted2"
    + "version": "2"
    }

  27. Let's see
    my $path = "C/CA/CAFFIEND/font_ft2_0.1.0.tgz";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid": "CAFFIEND",
    - "dist": "font_ft",
    + "dist": "font_ft2",
    "distvname": "font_ft2_0.1.0",
    "extension": "tgz",
    "filename": "font_ft2_0.1.0.tgz",
    "maturity": "released",
    "pathname": "C/CA/CAFFIEND/font_ft2_0.1.0.tgz",
    - "version": "2_0.1.0"
    + "version": "0.1.0"
    }

  28. Fixed 200+ cases
    • Out of 330000+ BackPAN distributions
    • Most cases are ancient, or accidental, and
    often removed already
    • See https://github.com/charsbar/Parse-Distname/blob/master/xt/walk_through.t
    for details
    • Parse::Distname also contains a few patches
    for CPAN::DistnameInfo

  29. May not be perfect yet
    my $path = "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid" : "CDRAKE",
    - "dist" : "Crypt",
    + "dist" : "Crypt-MatrixSSL",
    "distvname" : "Crypt-MatrixSSL3",
    "extension" : "tar.gz",
    "filename" : "Crypt-MatrixSSL3.tar.gz",
    "maturity" : "released",
    "pathname" : "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz",
    - "version" : "MatrixSSL3"
    + "version" : "3"
    }
    Looks better, but...

  30. May not be perfect yet

  31. Fixed this morning (0.04)
    my $path = "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz";
    say encode_json({ Parse::Distname->new($path)->properties });
    {
    "cpanid" : "CDRAKE",
    - "dist" : "Crypt",
    + "dist" : "Crypt-MatrixSSL3",
    "distvname" : "Crypt-MatrixSSL3",
    "extension" : "tar.gz",
    "filename" : "Crypt-MatrixSSL3.tar.gz",
    "maturity" : "released",
    "pathname" : "C/CD/CDRAKE/Crypt-MatrixSSL3.tar.gz",
    - "version" : "MatrixSSL3"
    + "version" : null
    }
    ... by making it an exception

  32. Dogfooding
    • I have started using this for CPANTS
    and CPAN::Groonga
    • If everything goes well...?

  33. Caveats for migration
    • Distribution name may become
    empty (and your database may
    complain about this)
    • Internal hash keys have changed
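
One way to cope with an empty distribution name is to pick a fallback key before grouping or inserting into a database. The sketch below is only an illustration: `grouping_key` is a hypothetical helper, falling back to the file name is an assumption rather than anything Parse::Distname prescribes, and the records are hand-written hashes shaped like the `properties` output shown on the earlier slides.

```perl
use strict;
use warnings;

# Hypothetical helper: return a non-empty key for grouping releases.
# Assumption of this sketch: fall back to the file name when the parsed
# distribution name is empty or undefined.
sub grouping_key {
    my ($info) = @_;
    my $dist = $info->{dist};
    return $dist if defined $dist && length $dist;
    return $info->{filename};
}

# Records copied by hand from the slide output, not produced by a parser.
my @records = (
    {
        dist     => "",
        version  => "v0.10",
        filename => "v0.10.zip",
        pathname => "E/ER/ERWANMAS/v0.10.zip",
    },
    {
        dist     => "DBIx-Class-InflateColumn-S3",
        version  => undef,
        filename => "DBIx-Class-InflateColumn-S3.tar.gz",
        pathname => "S/SO/SONNY/DBIx-Class-InflateColumn-S3.tar.gz",
    },
);

# Group path names under the chosen key.
my %groups;
push @{ $groups{ grouping_key($_) } }, $_->{pathname} for @records;

print "$_: @{ $groups{$_} }\n" for sort keys %groups;
```

Note that the second record also has an undefined version, so any column or comparison that assumes a version string needs the same kind of guard.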
