Building Information Retrieval and Text Analytics Systems for Tech Support

Radialpoint (www.radialpoint.com) is a Montreal-based software company with deep experience in providing products and services for consumer technology support. We’re engaged in research and development of a new semantic IR system (a vertical search engine) designed to provide accurate knowledge that helps resolve consumer technology issues. To this end, we’re exploring ways to combine Information Retrieval, Natural Language Processing and Machine Learning to discover and organize knowledge relevant to tech support.

We’re also collecting search queries and click-through data from live tech support sessions performed by several call centres. This data helps us determine real-life information needs and focus the web crawler on collecting the most relevant content for the search index.

We will present some early findings on search query clustering and discuss challenges and research topics that may help in the creation of such a system. We will also discuss our current work on a semantic processing pipeline for knowledge bases, and some preliminary results on adding entity annotations to an ElasticSearch index.

Besides the technical aspects of this work, Radialpoint is interested in discussing opportunities for collaboration with UdeM and its students.

Alexis Smirnov

January 22, 2014

Transcript

  1. Content acquisition challenge
     • How to identify a subset of the web content relevant to tech support?
  2. Reveal
     • Makes Google search more effective in cases where KBs don’t have the knowledge required
  3. Recommendation system powered by a professional network
     • Query strings, Google results, clicks
     • Query filtering
     • Session identification
     • Clustering
  4. Collection
     • 4 months, 145 participants
     • Hand-classified as “personal” or “business”
       – Sample of 100 queries annotated by 2nd annotator, Cohen's kappa of 0.87
     • 5,218 unique search queries: 31/69 % personal/business split

     Type       Query string
     Personal   Amoxicillin alcohol
     Business   audio device on high definition audio bus windows xp driver dell
     Business   Avaya Voip
     Personal   Best poutine store in Montreal
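The agreement figure on this slide is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of that calculation; the ten labels per annotator are made up for illustration and are not the talk's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations (not from the slides): one disagreement in ten.
annotator_1 = ["personal"] * 3 + ["business"] * 7
annotator_2 = ["personal"] * 2 + ["business"] * 8
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

With 9 of 10 labels matching, the raw agreement is 0.90 but kappa is lower, because two annotators who mostly say "business" would often agree by chance.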
  5. A query cluster example
     • can not open links in windows live mail\
     • cannot open hyperlinks windows live mail file cannot be found
     • live mail prohibited file type
     • live mail prohibited file type how to allow
     • windows live mail cannot open links
     • windows live mail cannot open links xp
     • windows live mail cannot open pdf files
     • windows live mail cannot open pdf files command failed to execute
     • windows live mail file cannot be found links
  6. Query Classification: Tried Approaches
     • TF/IDF – documents are too small (-)
     • Singular Value Decomposition for Latent Semantic Analysis (LSA) – not enough data (-)
     • Data normalization – remove extra spaces, lowercase, remove stop words, remove short/long queries (+/-)
     • Maximum Entropy Classification (+)
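The data normalization steps listed on this slide can be sketched as below. This is only an illustration: the stop-word list is a tiny hypothetical one, and measuring "short/long" in terms (with the `min_terms` and `max_terms` thresholds) is an assumption, since the slide does not say how length was measured.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "is", "in", "of", "to", "for", "on", "how", "and"})


def normalize_query(query, min_terms=1, max_terms=10):
    """Return the normalized query, or None if it is too short or too long.

    min_terms and max_terms are assumed thresholds, not values from the talk.
    """
    # Collapse runs of whitespace, trim, and lowercase.
    cleaned = re.sub(r"\s+", " ", query).strip().lower()
    # Drop stop words (and the empty string produced by an empty query).
    terms = [t for t in cleaned.split(" ") if t and t not in STOP_WORDS]
    if not (min_terms <= len(terms) <= max_terms):
        return None
    return " ".join(terms)


print(normalize_query("  Windows  Live Mail   cannot open links "))
print(normalize_query("how much fiber is in oatmeal"))
print(normalize_query("the of"))
```

The slide rates normalization as mixed (+/-); aggressive steps such as stop-word removal shorten queries that are already very small documents.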
  7. Results: 78.9%
     • 100 tech support agents for training
     • 45 for testing
     • 3895 unique queries for training, 36% of them personal
     • 1417 unique queries for testing, 18% of them personal
     • 27% of the corpus
     • Trained MaxEnt model using words as features
     • Keeping capitalization
     • Ignoring numbers
     • Correctly identified 210 out of 267 personal queries: recall 78.7%
     • Total of 317 queries marked as personal: precision 66.2%
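The recall and precision on this slide follow directly from the three counts it reports (210 correctly identified, 267 actually personal, 317 marked as personal). A short sketch of the arithmetic is below; the F1 score it prints is derived from those counts and is not a figure reported in the talk.

```python
def precision_recall_f1(true_positives, actual_positives, predicted_positives):
    """Compute precision, recall and their harmonic mean (F1) from counts."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the slide: 210 of 267 personal queries found,
# 317 queries marked as personal in total.
precision, recall, f1 = precision_recall_f1(210, 267, 317)
print(f"precision={precision:.1%} recall={recall:.1%} F1={f1:.1%}")
# prints: precision=66.2% recall=78.7% F1=71.9%
```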
  8. Qualitative error analysis

     Type                                                              Erroneously predicted as   Count   Percentage
     Queries with stop-words: "how much fiber is in oatmeal"           Business                   31      18.9
     Personal queries with business terms: "best games for gamecube"   Business                   20      12.1
     Queries with 1 term: "igfxtray"                                   Personal                   54      32.9
     Only numbers: "7035 7036"                                         Personal                   7       4.3
     Error in gold standard: "hines ward"                              Personal                   6       3.6
     Other                                                             Personal                   46      28
  9. Search Session Identification
     • the same browser session
     • the difference between two queries lower than 15 minutes
     • queries in the same cluster
     • anything else?
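The first two criteria on this slide (same browser session, less than 15 minutes between two queries) can be sketched as a simple grouping pass. The event tuple layout, the browser identifiers and the timestamps below are hypothetical; the gap is measured against the previous query of the same browser, which is one possible reading of the slide. The third criterion (queries in the same cluster) is not implemented here.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=15)  # threshold named on the slide


def split_into_sessions(events, max_gap=MAX_GAP):
    """Group (browser_id, timestamp, query) tuples into search sessions.

    A query joins the current session when it comes from the same browser
    and follows the previous query by less than max_gap.
    """
    sessions = []
    previous = None  # (browser_id, timestamp) of the last query seen
    for browser_id, timestamp, query in sorted(events, key=lambda e: (e[0], e[1])):
        same_session = (
            previous is not None
            and previous[0] == browser_id
            and timestamp - previous[1] < max_gap
        )
        if same_session:
            sessions[-1].append(query)
        else:
            sessions.append([query])
        previous = (browser_id, timestamp)
    return sessions


# Hypothetical log entries; only the query strings come from the slides.
events = [
    ("b1", datetime(2014, 1, 22, 9, 0), "windows live mail cannot open links"),
    ("b1", datetime(2014, 1, 22, 9, 5), "windows live mail cannot open links xp"),
    ("b1", datetime(2014, 1, 22, 9, 40), "Avaya Voip"),
    ("b2", datetime(2014, 1, 22, 9, 2), "live mail prohibited file type"),
]
sessions = split_into_sessions(events)
print(sessions)
```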
  10. Search Session Clustering
     • Clicked URLs
     • Top X Google results
     • Levenshtein distance on query terms
       – Are “XP install” and “HP install” close?
     • Cosine similarity on query terms
     • Freebase for entity resolution
     • Latent Semantic Analysis
     [Figure: hierarchical clustering dendrogram]
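Two of the similarity signals on this slide, Levenshtein distance and cosine similarity on query terms, can be sketched with the standard library alone. The sketch assumes character-level edit distance over the whole query string and raw term counts for the cosine; the slide does not specify either choice.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cosine_similarity(query_a, query_b):
    """Cosine similarity between bag-of-words term-count vectors."""
    vec_a = Counter(query_a.lower().split())
    vec_b = Counter(query_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = (math.sqrt(sum(c * c for c in vec_a.values()))
            * math.sqrt(sum(c * c for c in vec_b.values())))
    return dot / norm if norm else 0.0


distance = levenshtein("XP install", "HP install")
similarity = cosine_similarity("XP install", "HP install")
print(distance, round(similarity, 2))
```

For the slide's example the edit distance is a single substitution while only one of two terms is shared, so a purely string-based measure treats the two queries as close even though "XP" and "HP" name different things. That is presumably why the slide also lists entity resolution.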
  11. • What crawling strategy is most appropriate?
     • How to select content for processing?
     • What is the ontology of tech support knowledge?
     • How to resolve entities in such ontology?
     • What entities can be extracted by rule-based system?
     • How to create an effective training set?
     • How to rank connections within the Knowledge Graph?
     • How to expand the query?
     • How to encode context of the support session in the query?
  12. Working together
     • Internship program
     • Research partnership
     • PhD sponsorship, MITACS
     • NSERC grant
     • www.radialpoint.com/about-radialpoint/career
     • Etc.