Socializing Search. Professionally.

This 2014 O'Reilly Strata presentation discusses LinkedIn's approach to search.

LinkedIn has a unique data collection: the 277M+ members who use LinkedIn are also the most valuable entities in its corpus, which consists of people, companies, jobs, and a rich content ecosystem. Its members use LinkedIn to satisfy a diverse set of navigational and exploratory information needs, which it addresses by leveraging semi-structured and social content to understand their query intent and deliver a personalized search experience.

As a result, it has built a system quite different from those used for web or enterprise search. This talk discusses how it has addressed the unique scalability, performance, and search quality challenges in order to deliver billions of deeply personalized searches to its members. Although many of the challenges are unique to LinkedIn, the ideas should prove useful to other folks thinking about entity-oriented search or working with large-scale social network data.

Daniel Tunkelang

May 26, 2026

Transcript

  1. Socializing Search. Professionally.
    Sriram Sankar, Principal Staff Engineer
    Daniel Tunkelang, Head, Query Understanding
  2. Machine-learned ranking, socially.
    - Relevance models incorporate user features: score = P(Document | Query, User)
    - Our model: tree with logistic regression leaves.
    [Figure: decision tree with a node "X2 = ?" branching on X2 = 0 and X2 = 1, and a node "X10 < 0.1234 ?" branching on Yes and No.]
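The "tree with logistic regression leaves" on this slide can be illustrated with a small sketch. The slide does not give the features, thresholds, or weights, so everything named here (`Leaf`, `Split`, `score`, the feature indices, and the numbers) is a hypothetical stand-in: internal nodes route a feature vector built from the query, document, and user, and the leaf that is reached applies its own logistic regression.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    """Logistic regression over the feature vector (hypothetical weights)."""
    weights: Sequence[float]
    bias: float

    def score(self, x: Sequence[float]) -> float:
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Split:
    """Internal node: tests one feature against a threshold."""
    feature: int
    threshold: float
    below: "Node"
    at_or_above: "Node"


Node = Union[Leaf, Split]


def score(node: Node, x: Sequence[float]) -> float:
    """Walk down the tree, then score with the logistic model at the leaf."""
    while isinstance(node, Split):
        node = node.below if x[node.feature] < node.threshold else node.at_or_above
    return node.score(x)


# Toy model: split on feature 1, with a different linear model in each leaf.
model = Split(
    feature=1,
    threshold=0.5,
    below=Leaf(weights=[2.0, 0.0], bias=-1.0),
    at_or_above=Leaf(weights=[0.0, 0.0], bias=math.log(3.0)),
)
```

One reason to combine the two model types, under these assumptions, is that the tree can separate very different kinds of (query, user) situations while each leaf keeps a simple, smooth scoring function.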
  3. Query understanding can act as a relevance filter.
    for i in [1..n]
      s ← w_1 w_2 … w_i
      if P_c(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← P_c(s)
        B[i] ← {a}
      for j in [1..i-1]
        for b in B[j]
          s ← w_{j+1} w_{j+2} … w_i
          if P_c(s) > 0
            a ← new Segment()
            a.segs ← b.segs ∪ {s}
            a.prob ← b.prob * P_c(s)
            B[i] ← B[i] ∪ {a}
      sort B[i] by prob
      truncate B[i] to size k
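The segmentation pseudocode on this slide reads as a beam search over query prefixes, and a runnable sketch may make that easier to follow. It assumes that B[j] holds the best segmentations of the first j words, so each appended segment starts right after that prefix. The names `Segmentation`, `segment`, and `toy_p_c`, and the toy probabilities, are illustrative and not from the talk; `p_c` stands in for the slide's P_c.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segmentation:
    segs: Tuple[str, ...]  # segments chosen so far
    prob: float            # product of the segment probabilities


def segment(words: Sequence[str],
            p_c: Callable[[str], float],
            k: int = 5) -> List[Segmentation]:
    """Return up to k most probable segmentations of the whole query."""
    n = len(words)
    # beams[i] holds the best segmentations of words[:i]; the empty
    # segmentation in beams[0] covers the slide's "single segment" case.
    beams: List[List[Segmentation]] = [[] for _ in range(n + 1)]
    beams[0] = [Segmentation((), 1.0)]
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            s = " ".join(words[j:i])
            p = p_c(s)
            if p <= 0:
                continue  # unknown segment: the filter rejects it
            for b in beams[j]:
                candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        candidates.sort(key=lambda a: a.prob, reverse=True)
        beams[i] = candidates[:k]  # truncate the beam to size k
    return beams[n]


# Hypothetical segment probabilities, for illustration only.
TOY = {"software": 0.2, "engineer": 0.2, "software engineer": 0.5, "linkedin": 0.6}


def toy_p_c(s: str) -> float:
    return TOY.get(s, 0.0)


results = segment("software engineer linkedin".split(), toy_p_c, k=5)
```

With these toy numbers, two segmentations survive, and the one that treats "software engineer" as a single segment ranks first. A query containing a word with no known segment yields an empty result, which is the sense in which segmentation acts as a filter.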
  4. Coming soon: entity-driven search assist.
    [Screenshot: search suggestions including "Jobs at LinkedIn", "People currently working at LinkedIn", and "People who used to work at LinkedIn".]
  5. Infrastructure: Lucene
    - Map of terms to documents (the index)
    - Provides an API to add and remove documents to the index
    - Provides an API to query the index
  6. [Figure: two documents of filler text, numbered 1 and 2; one mentions Daniel and LinkedIn, the other Sriram and LinkedIn. An inverted index maps the terms Daniel, Sriram, and LinkedIn to the documents that contain them; a forward index maps each document to its contents.]
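The inverted and forward indexes pictured on this slide, together with the add, remove, and query operations listed on the Infrastructure slide, can be illustrated with a toy in-memory index. This is a plain-Python sketch of the data structures only. It is not Lucene's API, and `TinyIndex` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TinyIndex:
    """Toy index: an inverted index plus a forward index."""

    def __init__(self) -> None:
        self.inverted: Dict[str, Set[int]] = defaultdict(set)  # term -> doc ids
        self.forward: Dict[int, str] = {}                      # doc id -> text

    def add(self, doc_id: int, text: str) -> None:
        if doc_id in self.forward:
            self.remove(doc_id)  # re-adding replaces the old version
        self.forward[doc_id] = text
        for term in text.lower().split():
            self.inverted[term].add(doc_id)

    def remove(self, doc_id: int) -> None:
        text = self.forward.pop(doc_id, None)
        if text is None:
            return
        for term in set(text.lower().split()):
            postings = self.inverted[term]
            postings.discard(doc_id)
            if not postings:
                del self.inverted[term]

    def query(self, *terms: str) -> List[int]:
        """Return ids of documents containing every term (AND query)."""
        if not terms:
            return []
        lists = [self.inverted.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*lists))


index = TinyIndex()
index.add(1, "Daniel works at LinkedIn")
index.add(2, "Sriram works at LinkedIn")
```

Here `index.query("linkedin")` finds both documents through the inverted index, while `index.forward[1]` recovers the stored text of a document once it has been retrieved.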
  7. Extremely easy to build a search engine, but difficult to get sophisticated.
  8. The LinkedIn Search Stack
    [Diagram labels: Request, Query Rewriter, Index Retrieval, Scorer, Sorter/Blender, Response; Offline Data Building, Live Updates, Data.]
  9. Search Index Served by Lucene
    - Inverted index
    - Forward index
    - Static-rank-based document ordering
  10. Offline Data Builds on Hadoop
    - Multi-stage map-reduce pipeline allows complex data processing
    - Produces sharded single-segment Lucene index with documents sorted by static rank
    - Produces data models for use in query rewriting
  11. Live Data Updates
    - Feed-based framework to support updates to offline data builds
    - Lucene enhanced with a partial index update capability
  12. Query Rewriting (and Planning)
    - Accepts raw query and user metadata
    - Produces Lucene retrieval query and metadata for scoring
    - May use data models built offline
  13. Index Retrieval
    - Lucene query built by query rewriter is used to retrieve documents from the Lucene index
    - Documents are retrieved in static rank order (best document first)
    - Retrieval may be early-terminated, given that retrieval is in static rank order
    - No scoring is performed during retrieval
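Early termination over a static-rank-ordered index, as described on this slide, can be sketched as follows. The sketch assumes document ids are assigned in static rank order (id 0 is the best document) and that each posting list is sorted by id, so an AND query can stop as soon as enough matches are found. The function name `retrieve`, the `limit` parameter, and the sample postings are hypothetical.

```python
from typing import Dict, List, Sequence


def retrieve(postings: Dict[str, List[int]],
             terms: Sequence[str],
             limit: int) -> List[int]:
    """Intersect sorted posting lists, stopping after `limit` matches."""
    lists = [postings.get(t, []) for t in terms]
    if limit <= 0 or not lists or any(not p for p in lists):
        return []
    cursors = [0] * len(lists)
    hits: List[int] = []
    while len(hits) < limit:  # early termination: enough documents found
        # A match can be no earlier than the largest id under any cursor.
        target = max(p[c] for p, c in zip(lists, cursors))
        aligned = True
        for n, p in enumerate(lists):
            while cursors[n] < len(p) and p[cursors[n]] < target:
                cursors[n] += 1
            if cursors[n] == len(p):
                return hits  # one list is exhausted: no more matches
            if p[cursors[n]] != target:
                aligned = False
        if aligned:
            hits.append(target)  # no scoring happens here, only matching
            for n in range(len(lists)):
                cursors[n] += 1
                if cursors[n] == len(lists[n]):
                    return hits
    return hits


# Hypothetical postings; smaller id means better static rank.
POSTINGS = {
    "linkedin": [0, 1, 2, 3, 5, 8],
    "engineer": [1, 3, 4, 5, 8],
}
```

Because the lists are in static rank order, the first `limit` matches are also the best ones by static rank, which is what makes stopping early safe under these assumptions.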
  14. Scoring
    - Scoring is performed after retrieval
    - Its input is the retrieved document (i.e., includes the forward index), a description of how the retrieval query matched the document, and the scoring metadata produced by the rewriter
    - Costly features can be computed offline during the index building process in Hadoop, e.g., tf/idf calculations
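The three scorer inputs named on this slide can be made concrete with a small sketch of a scoring pass that runs after retrieval. All field names (`static_rank`, `field_weights`, `connections`, `connection_boost`) and the weighted-sum formula are hypothetical placeholders. The talk's actual model is the tree with logistic regression leaves mentioned earlier, not this sum.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RetrievedDoc:
    doc_id: int
    forward: Dict[str, float]  # stored fields, incl. features computed offline
    matched_fields: Set[str]   # how the retrieval query matched this document


def score(doc: RetrievedDoc, metadata: dict) -> float:
    """Combine offline features, match description, and rewriter metadata."""
    s = doc.forward.get("static_rank", 0.0)  # precomputed during index build
    s += sum(metadata["field_weights"].get(f, 0.0) for f in doc.matched_fields)
    if doc.doc_id in metadata["connections"]:  # personalization signal
        s += metadata["connection_boost"]
    return s


def rank(docs: List[RetrievedDoc], metadata: dict) -> List[RetrievedDoc]:
    """Score every retrieved document, then sort best first."""
    return sorted(docs, key=lambda d: score(d, metadata), reverse=True)


# Hypothetical scoring metadata, as a query rewriter might produce it.
metadata = {
    "field_weights": {"name": 2.0, "headline": 0.5},
    "connections": {7},
    "connection_boost": 1.0,
}
docs = [
    RetrievedDoc(3, {"static_rank": 0.9}, {"headline"}),
    RetrievedDoc(7, {"static_rank": 0.5}, {"name"}),
]
ranked = rank(docs, metadata)
```

In this toy run the document with the lower static rank wins because it matched on the name field and belongs to the searcher's connections, which illustrates why retrieval order and final order can differ.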
  15. Summary
    Quality
    - LinkedIn Search leverages the economic graph.
    - Social means that relevance is highly personalized.
    - Less is more: query understanding is a relevance filter.
    - Moving in the direction of suggesting structured queries.
    System
    - Powered by Lucene, but with additional components.
    - Offline data builds on Hadoop, partial index updates.
    - Index uses static ranking and early termination.
    - Scoring performed outside of Lucene.