Find and be Found: Information Retrieval at LinkedIn

This ACM SIGIR 2013 Industry Track presentation discusses how LinkedIn's search works. LinkedIn search is personalized based on the searcher's profile and network. Query understanding involves tagging queries to determine entity types such as people, companies, or skills. Ranking is also personalized, using machine-learned models trained on search logs to determine relevance for a specific user's query. The system aims to provide both globally and personally relevant results, since about two-thirds of search clicks go to results outside the searcher's network.

Daniel Tunkelang

May 23, 2026

Transcript

  1. Recruiting Solutions. Find and be Found: Information Retrieval at LinkedIn
    Shakti Sinha (Head, Search Relevance) and Daniel Tunkelang (Head, Query Understanding)
  2. Let’s talk a bit about how it all works.
    • Query Understanding
    • Ranking
    More at http://data.linkedin.com/search.
  3. Query tagging: key to query understanding.
    • Using human judgments to evaluate tag precision.
      – Extremely accurate (> 99%) for identifying person names.
      – Harder to distinguish company vs. title vs. skill (e.g., oracle dba).
    • Comparing CTR for tag matches vs. non-matches.
      – Difference can be large enough to suggest filtering rather than ranking.
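
The CTR comparison on this slide can be sketched as follows. This is a minimal illustration, assuming a hypothetical log where each impression records whether the result matched the query's tag and whether it was clicked; the function name, the log shape, and the toy data are assumptions, not part of the talk.

```python
from collections import defaultdict

def ctr_by_tag_match(impressions):
    """Return click-through rate keyed by whether the result matched the query tag.

    `impressions` is an iterable of (tag_match, clicked) boolean pairs.
    This log shape is a hypothetical simplification.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for tag_match, clicked in impressions:
        shown[tag_match] += 1
        if clicked:
            clicks[tag_match] += 1
    return {match: clicks[match] / shown[match] for match in shown}

# Toy data, invented for illustration only.
toy_log = [(True, True), (True, True), (True, False), (True, False),
           (False, True), (False, False), (False, False), (False, False)]
rates = ctr_by_tag_match(toy_log)
print(rates[True], rates[False])  # 0.5 0.25
```

If the gap between the two rates is large for a given tag type, that is the kind of evidence the slide describes as favoring a filter on the tag over a ranking feature.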
  4. Detecting navigational vs. exploratory queries.
    Pre-retrieval
    • Sequence of query tags.
    Post-retrieval
    • Distribution of scores / features.
    Click behavior
    • Title searches >50x more likely to get 2+ clicks than name searches.
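
The pre-retrieval signal, the sequence of query tags, could be used along these lines. The rule below is a deliberately naive sketch, assuming hypothetical tag names (`FIRST_NAME`, `LAST_NAME`, `TITLE`, `SKILL`) and a hand-written rule; the talk does not say how the tag sequence is actually turned into a decision.

```python
def classify_intent(tags):
    """Guess query intent from the sequence of query tags alone.

    Hypothetical rule: a query tagged only with person-name tags is treated
    as navigational; anything else is treated as exploratory.
    """
    name_tags = {"FIRST_NAME", "LAST_NAME"}
    if tags and all(tag in name_tags for tag in tags):
        return "navigational"
    return "exploratory"

name_intent = classify_intent(["FIRST_NAME", "LAST_NAME"])
title_intent = classify_intent(["SKILL", "TITLE"])
print(name_intent, title_intent)  # navigational exploratory
```

In practice such a rule would presumably be combined with the post-retrieval and click-behavior signals listed on the slide.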
  5. Query expansion for exploratory queries.
    Example query: software patent lawyer
    Query expansions derived from reformulations, e.g., lawyer -> attorney.
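
One way to derive expansions from reformulations is to count single-term substitutions between consecutive queries. The sketch below assumes a hypothetical list of (before, after) query pairs and a simple frequency threshold; the mining method, the threshold, and the OR syntax are illustrative assumptions rather than the approach used in the talk.

```python
from collections import Counter

def mine_expansions(reformulations, min_count=2):
    """Collect term substitutions seen in at least `min_count` reformulations."""
    counts = Counter()
    for before, after in reformulations:
        a, b = before.split(), after.split()
        if len(a) != len(b):
            continue  # only consider same-length rewrites in this sketch
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:  # exactly one term was swapped
            counts[diffs[0]] += 1
    expansions = {}
    for (src, dst), n in counts.items():
        if n >= min_count:
            expansions.setdefault(src, set()).add(dst)
    return expansions

def expand(query, expansions):
    """Rewrite each term as an OR-group of itself and its mined alternatives."""
    groups = []
    for term in query.split():
        alts = [term] + sorted(expansions.get(term, ()))
        groups.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " ".join(groups)

# Toy reformulation pairs, invented for illustration only.
toy_pairs = [("patent lawyer", "patent attorney"),
             ("tax lawyer", "tax attorney"),
             ("software lawyer", "software engineer")]
expansions = mine_expansions(toy_pairs)
expanded = expand("software patent lawyer", expansions)
print(expanded)  # software patent (lawyer OR attorney)
```

The threshold keeps one-off rewrites such as lawyer -> engineer from becoming expansions.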
  6. Understanding misspelled queries.
    • daniel tankalong: Did you mean daniel tunkelang?
    • infomation retrieval: Did you mean information retrieval?
    • marisa meyer: Did you mean marissa mayer?
    • ingenero eletrico: Did you mean ingeniero electrico?
    • jonathan podemsky: Did you mean johnathan podemsky?
    • desenista industrail: Did you mean desenhista industrial?
  7. Spelling out the details.
    • Data sources: entity data (people, companies), successful queries (tunkelang), reformulations (marisa => marissa).
    • INDEX: n-grams (dublin => du ub bl li in), metaphones (mark/marc => MRK), word pairs (johnathan podemsky).
    • Example: for the query "marisa meyer yoohoo", candidates per word are marissa / marisa, meyer / mayer, yahoo / yoohoo.
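
The n-gram part of the index can be illustrated with character bigrams, matching the slide's dublin => du ub bl li in example. This is a minimal sketch, assuming a tiny hypothetical vocabulary and Jaccard overlap as the similarity measure; the metaphone and word-pair indexes from the slide are not shown, and the scoring choice is an assumption.

```python
from collections import defaultdict

def bigrams(word):
    """Split a word into overlapping character bigrams."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def build_index(vocabulary):
    """Map each bigram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in bigrams(word):
            index[gram].add(word)
    return index

def candidates(word, index, min_overlap=0.5):
    """Return vocabulary words ranked by Jaccard overlap of bigram sets."""
    grams = set(bigrams(word))
    seen = set()
    for gram in grams:
        seen |= index.get(gram, set())
    scored = []
    for cand in seen:
        cand_grams = set(bigrams(cand))
        score = len(grams & cand_grams) / len(grams | cand_grams)
        if score >= min_overlap:
            scored.append((score, cand))
    # Highest overlap first; ties broken alphabetically.
    return [cand for score, cand in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical vocabulary, standing in for entity data and successful queries.
index = build_index(["marissa", "mayer", "yahoo", "dublin"])
print(bigrams("dublin"))            # ['du', 'ub', 'bl', 'li', 'in']
print(candidates("marisa", index))  # ['marissa']
```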
  8. Relevant results can be in or out of network.
    • Searcher’s network matters for relevance.
      – Within-network results have higher CTR.
    • But the network is not enough.
      – About two thirds of search clicks come from out-of-network results.
  9. Personalized machine-learned ranking.
    • Data point is a triple (searcher, query, document).
      – Searcher features are important!
    • Labels: Is this document relevant to the query and the user?
      – Depends on the user’s network, location, etc.
      – Too much to ask a random person to judge.
    • Training data has to be collected from search logs.
  10. Search log data has biases.
    • Presentation bias
      – Results shown higher tend to get clicked more often.
      – Use FairPairs [Radlinski and Joachims, AAAI’06].
    Figure: adjacent result pairs marked "flipped" or "not flipped"; clicks on those pairs yield training data.
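
The FairPairs idea can be sketched roughly as below. This is a simplified illustration based on our reading of the cited paper: adjacent results are paired starting at a random offset, each pair is flipped with probability 1/2, and a click on the lower result of a pair is recorded as a preference for it over the result shown above it. The function names and data shapes are assumptions, and details may differ from both the paper and LinkedIn's implementation.

```python
import random

def fair_pairs_present(ranking, rng):
    """Randomize a ranking FairPairs-style.

    Returns (presented, pairs), where `pairs` lists the (upper, lower)
    documents of each adjacent pair considered for flipping, as shown.
    """
    presented = list(ranking)
    start = rng.choice([0, 1])  # pair positions (0,1),(2,3).. or (1,2),(3,4)..
    pairs = []
    for i in range(start, len(presented) - 1, 2):
        if rng.random() < 0.5:  # flip each pair with probability 1/2
            presented[i], presented[i + 1] = presented[i + 1], presented[i]
        pairs.append((presented[i], presented[i + 1]))
    return presented, pairs

def preferences_from_clicks(pairs, clicked):
    """Emit (preferred, over) whenever the lower document of a pair was clicked."""
    return [(lower, upper) for upper, lower in pairs if lower in clicked]

presented, pairs = fair_pairs_present(["a", "b", "c", "d", "e"], random.Random(0))
prefs = preferences_from_clicks([("a", "b"), ("c", "d")], clicked={"b", "c"})
print(prefs)  # [('b', 'a')]
```

Because each pair is equally likely to appear in either order, position effects within a pair cancel out on average, which is what makes the recorded preferences usable as training data.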
  11. Search log data has biases.
    • Sample bias
      – User clicks or skips only what is shown.
      – What about low-scoring results from the existing model?
      – Add low-scoring results as ‘easy negatives’ so the model learns bad results not presented to the user.
    Figure: result pages 1, 2, 3, …, n, with low-scoring results assigned label 0.
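
Adding easy negatives might look like the following. This is a minimal sketch, assuming hypothetical inputs: logged rows of (document id, label) for results the user saw, plus the existing model's full ranking from best to worst. How many negatives to add and where to sample them from are assumptions not specified in the talk.

```python
def add_easy_negatives(logged_rows, ranked_doc_ids, shown_ids, n_easy=2):
    """Append the lowest-ranked unseen documents as label-0 training rows.

    logged_rows:    list of (doc_id, label) from results the user actually saw.
    ranked_doc_ids: all retrieved doc ids, ordered best to worst by the
                    existing model.
    shown_ids:      doc ids that were presented to the user.
    """
    shown = set(shown_ids)
    unseen = [doc_id for doc_id in ranked_doc_ids if doc_id not in shown]
    easy = unseen[-n_easy:] if n_easy > 0 else []
    return logged_rows + [(doc_id, 0) for doc_id in easy]

# Toy data, invented for illustration only.
logged = [("d1", 1), ("d2", 0)]
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
rows = add_easy_negatives(logged, ranked, shown_ids=["d1", "d2"])
print(rows)  # [('d1', 1), ('d2', 0), ('d5', 0), ('d6', 0)]
```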
  12. How to train your model.
    • Train simple models to resemble complex ones.
      – Build Additive Groves model [Sorokina et al, ECML ’07], which is good at detecting interactions.
    • Build tree with logistic regression leaves.
    • By restricting tree to user and query features, only regression model evaluated for each document.
    Figure: a decision tree with split nodes "X_2 = ?" and "X_10 < 0.1234 ?" and three logistic regression leaves:
      α_0 + α_1 P(x_1) + ... + α_n Q(x_n)
      β_0 + β_1 T(x_1) + ... + β_n x_n
      γ_0 + γ_1 R(x_1) + ... + γ_n Q(x_n)
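
The tree with logistic regression leaves can be sketched as below. This is an illustrative sketch only: the tree splits on searcher and query features, so it is walked once per query, and the selected leaf's logistic regression is then evaluated for each document. The class names, the `query_is_name` feature, and the weights are all hypothetical, and no training (Additive Groves or otherwise) is shown.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class Leaf:
    """Logistic regression over document features."""
    bias: float
    weights: List[float]

    def score(self, doc_features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, doc_features))
        return 1.0 / (1.0 + math.exp(-z))

@dataclass
class Split:
    """Internal node; its test looks only at searcher and query features."""
    test: Callable[[dict], bool]
    if_true: "Node"
    if_false: "Node"

Node = Union[Leaf, Split]

def select_leaf(node, context):
    """Walk the tree using searcher/query context until a leaf is reached."""
    while isinstance(node, Split):
        node = node.if_true if node.test(context) else node.if_false
    return node

def rank(tree, context, docs):
    """Pick one leaf per query, then score every document with that leaf."""
    leaf = select_leaf(tree, context)
    return sorted(docs, key=lambda doc_id: leaf.score(docs[doc_id]), reverse=True)

# Hypothetical two-leaf tree and toy document features.
tree = Split(test=lambda ctx: ctx["query_is_name"],
             if_true=Leaf(0.0, [1.0, 0.0]),
             if_false=Leaf(0.0, [0.0, 1.0]))
docs = {"a": [0.9, 0.1], "b": [0.2, 0.8]}
name_order = rank(tree, {"query_is_name": True}, docs)
other_order = rank(tree, {"query_is_name": False}, docs)
print(name_order, other_order)  # ['a', 'b'] ['b', 'a']
```

The design point from the slide is visible in `rank`: the per-document cost is a single linear model, while interactions between searcher and query features are captured by the choice of leaf.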
  13. Take-Aways
    • LinkedIn’s search problem is unique because of the deep role of personalization: users are an integral part of the corpus.
    • Query understanding allows us to optimize for entity-oriented search against semi-structured content.
    • Ranking requires us to contextually apply global and personalized user, query, and document features.
  14. Want to learn more?
    • Check out http://data.linkedin.com/search.
    • Contact us:
      – Shakti: [email protected] http://linkedin.com/in/sdsinha
      – Daniel: [email protected] http://linkedin.com/in/dtunkelang
      – Asif: [email protected] http://linkedin.com/in/asifmakhani
    • Did we mention that we’re hiring?