Multilingual Stemming on CPAN

The past year has seen several advances in stemming algorithms and tools on CPAN, including stemmers for natural languages that previously had none in Perl. Let’s take a whirlwind tour of new stemming development in 2013–2014 and demonstrate how Shutterstock uses these stemmers for multilingual information retrieval.

Presented at:
◦ 2014-05-03: DC–Baltimore Perl Workshop (DCBPW) 2014, Silver Spring, MD
◦ 2014-05-29: New York Perl Mongers (NY.pm), New York, NY
◦ 2014-06-23: YAPC::NA 2014, Orlando, FL
◦ 2014-06-27: Open Source Bridge 2014, Portland, OR

Video: https://youtu.be/fpxJHuopS5Y?t=4m12s

Nova Patch

June 27, 2014

Transcript

  1. Advances in Multilingual Stemming on CPAN Nick Patch @nickpatch Shutterstock

  2–9. hacking → hack; hacker → hack; hacked → hack; hack → hack

  10–11. gurgled → gurgl

  12–13. gurgled → gurgl; gurgling → gurgl

  14–15. gurgled → gurgl; gurgling → gurgl; gurgle → gurgl

  16. stem("hacker")

  17. stem("hacker") eq stem("hacking")

  18. indexer(stem("hacker"))

  19. indexer(stem("hacker")); lookup(stem("hacking"))

  20. Lingua::Stem::Any

  21. Lingua::Stem::Any: bg cs da de en eo es fa f fr gl hu io it nl no pt ro ru sv tr

  22. use Lingua::Stem::Any; $stemmer = Lingua::Stem::Any->new( language => $language ); $stem = $stemmer->stem($word);

  23. Attributes: language, source, cache, exceptions, casefold, normalize

  24. Methods: stem($word), stem(@words), stem_in_place(\@words)

  25. Methods: languages, languages($source), sources, sources($lang), clear_cache

  26. Lingua::Stem::UniNE::CS Czech. Image by NuclearVacuum on Wikimedia Commons / CC BY-SA 3.0

  27. Lingua::Stem::UniNE::CS Czech; Lingua::Stem::UniNE::BG Bulgarian. Image by NuclearVacuum on Wikimedia Commons / CC BY-SA 3.0

  28. Lingua::Stem::UniNE::FA Persian. Image by Mani1 on Wikimedia Commons / public domain

  29. Lingua::Stem::Patch::EO Esperanto. Image by Ionut Cojocaru on Wikimedia Commons / CC BY 3.0

  30. Lingua::Stem::Patch::IO Ido. Image by Ionut Cojocaru on Wikimedia Commons / CC BY 3.0

  31. Lingua::Stem::TLH ?! Klingon?! Image by NASA and ESA / public domain

  32. TODO: pl Polish, ar Arabic, bn Bengali, hi Hindi, mr Marathi

  33. Nick Patch @nickpatch Shutterstock
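
The stem, indexer, and lookup slides (16–19) describe one idea: stem words at index time and stem the query the same way, so different inflections land on the same index key. Here is a minimal sketch of that idea in core Perl. It is not the Shutterstock implementation, and `toy_stem`, `indexer`, `lookup`, and `%index` are hypothetical names made up for illustration. The suffix-stripping rule is a deliberately crude stand-in for a real stemmer such as the `stem` method of Lingua::Stem::Any.

```perl
use strict;
use warnings;

# Hypothetical toy stemmer (assumption, not a real algorithm):
# lowercases the word and strips one common English suffix.
# A real application would call a proper stemmer here instead.
sub toy_stem {
    my ($word) = @_;
    my $stem = lc $word;
    $stem =~ s/(?:ing|er|ed|e)\z// if length($stem) > 4;
    return $stem;
}

# Inverted index: stem => { document ID => 1 }
my %index;

# Index every word of a document under its stem.
sub indexer {
    my ($id, $text) = @_;
    for my $word ($text =~ /\w+/g) {
        $index{ toy_stem($word) }{$id} = 1;
    }
    return;
}

# Stem the query term the same way, then return matching document IDs.
sub lookup {
    my ($term) = @_;
    my $docs = $index{ toy_stem($term) } || {};
    return sort { $a <=> $b } keys %$docs;
}

indexer(1, 'The hacker gurgled');
indexer(2, 'Hacking on CPAN');

print join(',', lookup('hacked')),   "\n";    # prints "1,2"
print join(',', lookup('gurgling')), "\n";    # prints "1"
```

Because both sides pass through the same function, the query "hacked" finds documents that contain only "hacker" or "Hacking". Swapping `toy_stem` for a language-aware stemmer would leave the index and lookup code unchanged.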