s ← w1 w2 … wi
if Pc(s) > 0
    a ← new Segment()
    a.segs ← {s}
    a.prob ← Pc(s)
    B[i] ← {a}
for j in [1..i-1]
    for b in B[j]
        s ← wj wj+1 … wi
        if Pc(s) > 0
            a ← new Segment()
            a.segs ← b.segs ∪ {s}
            a.prob ← b.prob * Pc(s)
            B[i] ← B[i] ∪ {a}
sort B[i] by prob
truncate B[i] to size k
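The pseudocode above can be sketched in Python as a small beam search over prefixes of the query. This is a minimal illustration, not the production implementation: the function name `segment_query`, the callable `pc` (standing in for Pc), and the toy probability table are all assumptions made for the example. The sketch also assumes that the segment appended to a segmentation in B[j] starts at word j+1, so that it does not overlap the prefix B[j] already covers, and it treats B[i] as empty when no candidate has positive probability.

```python
from typing import Callable, List, Tuple

# A segmentation is (list of segments, probability of the segmentation).
Segmentation = Tuple[List[str], float]


def segment_query(words: List[str],
                  pc: Callable[[str], float],
                  k: int = 3) -> List[Segmentation]:
    """Return up to k most probable segmentations of `words`.

    beams[i] plays the role of B[i]: the best segmentations of words[:i].
    `pc` is a hypothetical stand-in for Pc, the segment probability.
    """
    n = len(words)
    beams: List[List[Segmentation]] = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        candidates: List[Segmentation] = []
        # Case 1: the whole prefix w1..wi is a single segment.
        whole = " ".join(words[:i])
        p_whole = pc(whole)
        if p_whole > 0:
            candidates.append(([whole], p_whole))
        # Case 2: extend each segmentation of w1..wj with one more segment.
        for j in range(1, i):
            seg = " ".join(words[j:i])  # words j+1..i in 1-based terms (assumption)
            p_seg = pc(seg)
            if p_seg > 0:
                for segs, prob in beams[j]:
                    candidates.append((segs + [seg], prob * p_seg))
        # Sort by probability and truncate to the beam size k.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams[i] = candidates[:k]
    return beams[n]


# Toy probabilities, invented purely for illustration.
TOY_PC = {
    "linkedin": 0.5,
    "software": 0.2,
    "engineer": 0.2,
    "software engineer": 0.3,
    "linkedin software": 0.01,
}


def toy_pc(segment: str) -> float:
    return TOY_PC.get(segment, 0.0)


if __name__ == "__main__":
    for segs, prob in segment_query(["linkedin", "software", "engineer"], toy_pc, k=2):
        print(segs, prob)
```

With the toy table, the highest-ranked segmentation keeps "software engineer" together as one segment, which is the behavior the beam is meant to capture. Truncating to k after every prefix keeps the work bounded at the cost of possibly discarding a segmentation that would have won later.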
query interpretation.
§ Bucket search log analysis by query classes.
§ Query rewriting specific to query classes.
§ …
Query understanding focuses on set-level metrics. Not just about best answer, but getting to best question.
queries targeted by spammers.
– 10,000 most common non-name queries.
§ Look at top results for a generic user.
– i.e., show unpersonalized search results.
§ Remove private profiles.
– Members first! Can’t sacrifice privacy to fight spammers.
§ Label data by crowdsourcing.
– Relevance is subjective, but spam is relatively objective.
probability between 0 and 1.
§ Use spam score as piecewise linear factor:
    if score < spam_min:  # not a spammer
        relevance *= 1.0
    elif score > spam_max:  # spammer
        relevance *= 0.0
    else:  # linear function of spamminess
        relevance *= (spam_max - score) / (spam_max - spam_min)
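The piecewise linear factor above can be wrapped in a small function to make the boundary behavior explicit. This is a sketch under assumptions: the function name `spam_factor` and the threshold values 0.2 and 0.8 are hypothetical and chosen only for illustration. The slide does not give actual thresholds.

```python
def spam_factor(score: float, spam_min: float, spam_max: float) -> float:
    """Multiplier in [0, 1] applied to relevance, given a spam score.

    Below spam_min the result is untouched (factor 1.0); above spam_max it
    is suppressed entirely (factor 0.0); in between the factor falls
    linearly from 1.0 to 0.0.
    """
    if spam_min >= spam_max:
        raise ValueError("spam_min must be strictly less than spam_max")
    if score < spam_min:   # not a spammer
        return 1.0
    if score > spam_max:   # spammer
        return 0.0
    # linear function of spamminess
    return (spam_max - score) / (spam_max - spam_min)


if __name__ == "__main__":
    relevance = 10.0
    # Hypothetical thresholds, for illustration only.
    for score in (0.1, 0.5, 0.9):
        print(score, relevance * spam_factor(score, spam_min=0.2, spam_max=0.8))
```

Because the factor is continuous at both thresholds, a profile whose score drifts slightly across `spam_min` or `spam_max` does not see an abrupt jump in ranking.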
which features we use for spam detection, or spammers will work around them.
§ Spammers will try to reverse-engineer us anyway.
§ Personalization benefits us and our legitimate users – it’s hard to spam your way to high personalized ranking.
§ Fighting spam is all about making the investment less profitable for the spammer.
heavy-tailed distribution, even the most popular queries account for a small fraction of the distribution.
§ We don’t want to suggest generic queries that would produce useless results.
– e.g., c -> company, j -> jobs
§ Goal is not only to infer the user’s intent but also to suggest a search that yields relevant results across content types.
by likelihood of a successful search.
– Consider click-through behavior as well as downstream actions.
§ Bootstrap using what we know from pre-unified search behavior.
– Tricky part is compensating for findability bias.
§ Continuously evaluate and collect feedback through user behavior.
– E.g., members using the left rail to select a particular vertical.
two-part computation: P(Content Type | User, Query) × P(Document | User, Query, Content Type)
§ Intent detection comes first: inefficient to send all queries to all verticals.
§ Secondary components introduce diversity.
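The two-part computation above can be illustrated with a short sketch in which intent detection picks a few content types first and only those verticals contribute documents. Everything named here is an assumption for the example: the function `rank_documents`, the `max_types` cutoff, and the probability tables are hypothetical, and the diversity component mentioned on the slide is not modeled.

```python
from typing import Dict, List, Tuple


def rank_documents(type_probs: Dict[str, float],
                   doc_probs: Dict[str, Dict[str, float]],
                   max_types: int = 2) -> List[Tuple[str, str, float]]:
    """Rank documents by P(type | user, query) * P(doc | user, query, type).

    type_probs: content type -> P(Content Type | User, Query)
    doc_probs:  content type -> {doc id -> P(Document | User, Query, Content Type)}
    Only the `max_types` most likely content types are queried, mirroring
    the point that sending every query to every vertical is inefficient.
    """
    # Step 1: intent detection selects the verticals worth querying.
    top_types = sorted(type_probs, key=lambda t: type_probs[t], reverse=True)[:max_types]
    # Step 2: score documents from the selected verticals only.
    scored = [
        (doc_id, ctype, type_probs[ctype] * p_doc)
        for ctype in top_types
        for doc_id, p_doc in doc_probs.get(ctype, {}).items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


if __name__ == "__main__":
    # Toy numbers, invented for illustration.
    type_probs = {"people": 0.7, "jobs": 0.25, "companies": 0.05}
    doc_probs = {
        "people": {"p1": 0.6, "p2": 0.4},
        "jobs": {"j1": 0.9},
        "companies": {"c1": 1.0},
    }
    for doc_id, ctype, score in rank_documents(type_probs, doc_probs):
        print(doc_id, ctype, round(score, 3))
```

In the toy data the "companies" vertical is never consulted, even though its only document has the highest within-vertical probability, because its content-type probability falls below the cutoff.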
queries as early as possible.
§ Fight the spammers that be.
§ Unify and simplify the search experience.
Goal: help LinkedIn’s 200M+ members find and be found.