Enabling Exploration through Text Analytics

This 2009 Text Analytics Summit presentation argues that information-seeking tools must support exploratory search, using two real-world needs as examples: finding health information and finding work-related information. It names the text analytics techniques that enable exploration (categorization, named entity detection, term extraction, and sentiment analysis), notes their caveats, and shows how a usable tag vocabulary can be built from author-supplied tags or search logs. It concludes that the technology is available here and now, and that better tags are a starting point for further improvement.

Daniel Tunkelang

May 22, 2026

Transcript

  1. Enabling Exploration through Text Analytics

    Daniel Tunkelang, Chief Scientist, Endeca
    © 2009 Endeca Technologies, Inc. All rights reserved.
  2. overview

    • information seeking tools need to support exploration
    • text analytics can help
    • you can do this here and now
  3. real-world information seeking examples

    • looking for health information
    • looking for work-related information
    reminder: search and text analytics are a means, not an end
  4. example 1: looking for health information

    six months into my wife’s pregnancy, we discovered that she had gestational diabetes
    how to learn more?
  5. maybe the private sector knows best: webmd

    powered by
  6. example 2: looking for work-related information

    need to ramp up summer interns on text mining
    how to find a good book?
  7. triangle research libraries: next-gen catalog

    powered by
  8. faceted search enables query refinement

    powered by
  9. take-away #1

    exploratory search support: a must-have for many information needs
  10. text analytics

    • categorization
    • named entity detection
    • term extraction
    • sentiment analysis
    vague term, lots of see-alsos: text mining, information extraction, content enrichment
  11. newssift: text analytics enabling exploration

    powered by
    categorization, named entity detection, term extraction, sentiment analysis
  12. facebook: the good

    powered by
    Social Utility, Iphone Application
  13. facebook: the bad

    powered by
    Criminal Behavior, Litigation And Settlement
  14. take-away #2

    text analytics enable exploratory search
  15. caveats

    • rule-based techniques are domain-specific
    • statistical techniques rely on trained models
    • plan for errors, inconsistency
    • document vs. corpus analysis
  16. problems with entity extraction

    • moderate precision, but low recall
    • not just noisy, but inconsistent
    • corpus analysis can help!

    Person                   | Location            | Organization
    ABDUL-KARIM KHALAF (1)   | ALTOONA, PA (1)     | ABC News Inc. (1)
    ABDULRAHMAN ABDULLAH (1) | Afghanistan (7)     | Air Force (1)
    AL GORE (1)              | Africa (5)          | Amazon.com Inc. (1)
    ALEX TREBEK (1)          | Akihabara (1)       | American Airlines Inc. (1)
    ALI HASSAN AL (1)        | Alaska (3)          | Apple (1)
    AMANDA MARCOTTE (1)      | Allegheny (1)       | Arctic National Wildlife Refuge (1)
    AMY WINEHOUSE (1)        | Americas (17)       | Arianna Huffington (1)
    ANDERS ERICSSON (1)      | Appalachia (1)      | Australian Liberal Party (1)
    ANDREW LLOYD WEBBER (1)  | Argentina (1)       | Bad News Bears (1)
    ANTHONY MWANGI (1)       | Arizona (11)        | Bear Stearns (2)
    ANTONIN SCALIA (1)       | Arkansas (7)        | Big Apple Companies (1)
    ARYE BARAK (1)           | Arlington, Va. (2)  | BioDiversity Research Institute (1)
    Aaron Sorkin (1)         | Arrest (1)          | Bloomberg LP (3)
    Abbie Hoffman (1)        | Asia (1)            | Bob Dole (1)
    Abe Lincoln (1)          | Atlanta (2)         | Bocuse d’Or World Cuisine Contest (1)
    Abe Weiss (1)            | Austin (1)          | Boston Globe (1)
    Abraham Lincoln (1)      | Austin, Texas (1)   | Boston Tea Party (1)
    Adlai Stephenson (1)     | Australia (1)       | Budweiser (1)

    entries repeated on the slide: Arrest (1), Asia (1), ALTOONA, PA (1), Abe Lincoln (1), Bob Dole (1), Boston Tea Party (1), Abraham Lincoln (1)
  17. division of labor

    • people supply vocabulary
    • machine annotates documents
    image: http://www.precolumbianwomen.com/images/inca-labor.10.gif
  18. example: ACM digital library

    • opportunity
      – repository of (sometimes) author-tagged documents
      – high-precision tags: very few false positives
    • challenge
      – poor reuse of vocabulary: most tags unique
      – low-recall tags: 90% false negatives
    as is, tags were not useful for exploration
  19. solution

    • bootstrap on author-supplied tags
    • prune 600K+ tags to 10K by
      – imposing frequency threshold
      – normalizing by case and singular/plural
      – eliminating infrequent subphrases
    • mine documents using resulting vocabulary
    • manually validate most frequently assigned tags
  20. if you prefer sports to computer science

    • no author-supplied tags
    • use search logs instead
    • supplement with authority files
      – team names
      – player names
    • mine documents using resulting vocabulary
  21. take-away #3

    this is not vaporware; text analytics to enable exploration is available here and now
  22. looking forward

    • better tags are the beginning, not the end
    • improve with manual and automatic processing
    • give users control over precision / recall trade-off
    • help users and content creators help you
  23. in closing

    • exploratory search = must-have, not nice-to-have
    • text analytics are a key enabler
    • the technology is real, here, and now
  24. thank you…and come to SIGIR!

    communication 1.0
      email: [email protected]
    communication 2.0
      blog: http://thenoisychannel.com
      twitter: http://twitter.com/dtunkelang
    SIGIR: July 19-23 in Boston
    Industry Track on July 22nd!
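
The vocabulary bootstrap listed on the "solution" slide (item 19 of the transcript) could be sketched roughly as below. This is a minimal illustration, not the method used for the ACM digital library: the function names (`singular`, `normalize`, `build_vocabulary`, `annotate`), the `min_count` parameter, the naive plural stripping, and the sample tags are all assumptions made for the example. In particular, "eliminating infrequent subphrases" is read here as "drop a tag that occurs inside a longer kept tag and is less frequent than it", which is only one possible interpretation.

```python
import re
from collections import Counter


def singular(word):
    """Very naive plural stripping (assumption: a real system would use a stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def tokens(text):
    """Lowercase, split on non-alphanumerics, and singularize each token."""
    return [singular(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def normalize(tag):
    """Normalize a tag by case and singular/plural."""
    return " ".join(tokens(tag))


def build_vocabulary(author_tags, min_count=2):
    """Prune raw author tags into a smaller controlled vocabulary."""
    counts = Counter(normalize(t) for t in author_tags)
    counts.pop("", None)  # discard tags that normalize to nothing

    # Step 1: frequency threshold.
    kept = {t: c for t, c in counts.items() if c >= min_count}

    # Step 2: drop subphrases that are rarer than a longer tag containing them
    # (hypothetical reading of "eliminating infrequent subphrases").
    def is_subphrase(short, long):
        return short != long and f" {short} " in f" {long} "

    return {
        t
        for t, c in kept.items()
        if not any(is_subphrase(t, other) and c < oc for other, oc in kept.items())
    }


def annotate(text, vocabulary):
    """Mine a document: return every vocabulary term found as a whole phrase."""
    toks = tokens(text)
    max_len = max((len(t.split()) for t in vocabulary), default=0)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(toks) - n + 1):
            phrase = " ".join(toks[i:i + n])
            if phrase in vocabulary:
                found.add(phrase)
    return found


# Made-up sample tags, purely for illustration.
sample_tags = [
    "Text Mining", "text mining",
    "Search Engines", "search engine",
    "Faceted Search", "faceted search", "Faceted search",
    "search", "search",
    "my one-off tag",
]
vocab = build_vocabulary(sample_tags, min_count=2)
print(sorted(vocab))  # ['faceted search', 'search engine', 'text mining']
print(sorted(annotate("We apply text-mining to improve search engines.", vocab)))
```

Because documents are matched against the pruned vocabulary rather than against each author's own tags, a document can receive a tag its author never supplied, which is how this kind of mining addresses the low recall noted on the ACM slide. The final step on that slide, manual validation of the most frequently assigned tags, is left out of the sketch.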