Searching over streams with Luwak and Samza

Talk co-presented with Alan Woodward at FOSDEM, Brussels, Belgium, on 31 January 2015. http://martin.kleppmann.com/2015/01/31/searching-over-streams-at-fosdem.html

Abstract:

Real-time searching over streams is useful in a number of contexts. For example, companies may want to detect whenever they are mentioned in a news feed; or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.

Luwak (https://github.com/flaxsearch/luwak) provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by filtering out queries that it can detect will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.
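The filtering idea can be illustrated with a minimal sketch. This is not Luwak's API: the `ToyMonitor` class, its method names, and the restriction to simple "all terms must appear" queries are assumptions made for illustration only. Each query is indexed under one of its required terms, so a document only needs to be checked against queries whose indexed term it actually contains.

```python
from collections import defaultdict


class ToyMonitor:
    """Toy stand-in for a query monitor (hypothetical, not Luwak's real API)."""

    def __init__(self):
        self.queries = {}                 # query id -> set of required terms
        self.by_term = defaultdict(set)   # term -> ids of queries indexed under it

    def register(self, query_id, terms):
        """Register a conjunctive query: every term must occur in the document."""
        required = {t.lower() for t in terms}
        if not required:
            raise ValueError("a query needs at least one term")
        self.queries[query_id] = required
        # Index the query under a single required term. If that term is
        # missing from a document, the query cannot possibly match it.
        self.by_term[min(required)].add(query_id)

    def candidates(self, doc_terms):
        """Cheap filter: ids of queries whose indexed term is in the document."""
        found = set()
        for term in doc_terms:
            found |= self.by_term.get(term, set())
        return found

    def match(self, text):
        """Run the full check only on the candidates that survive the filter."""
        doc_terms = set(text.lower().split())
        return sorted(
            qid for qid in self.candidates(doc_terms)
            if self.queries[qid] <= doc_terms
        )


monitor = ToyMonitor()
monitor.register("q1", ["samza", "kafka"])
monitor.register("q2", ["luwak"])
monitor.register("q3", ["kafka", "zookeeper"])
matches = monitor.match("Kafka and Samza at FOSDEM")
```

Here `q2` is never evaluated against the document, because its indexed term does not occur in it. `q3` survives the filter but fails the full check. The filter may let through queries that do not match, but it must never discard one that does.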

Samza (http://samza.apache.org/) provides a framework for such parallelization by partitioning and recombining both the document streams and the query set (which can be treated as just another stream). It also provides fault-tolerance mechanisms that allow swift recovery from machine failure without losing documents or queries.
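One possible partition-and-recombine layout is sketched below as a single-process simulation. It does not use Samza's API. The layout itself is an assumption: queries are spread across partitions by a stable hash of their id, every document is sent to every partition, and the per-partition matches are merged by document id. The function names and `NUM_PARTITIONS` are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(query_id, num_partitions=NUM_PARTITIONS):
    # Use a stable hash so a query always lands on the same partition
    # (Python's built-in hash() of a str varies between processes).
    return zlib.crc32(query_id.encode("utf-8")) % num_partitions


def build_partitions(queries, num_partitions=NUM_PARTITIONS):
    """Split the query set so that each partition holds a disjoint subset."""
    partitions = [dict() for _ in range(num_partitions)]
    for query_id, terms in queries.items():
        index = partition_for(query_id, num_partitions)
        partitions[index][query_id] = {t.lower() for t in terms}
    return partitions


def match_partition(partition, doc_terms):
    """Match one document against the queries held by a single partition."""
    return [qid for qid, required in partition.items() if required <= doc_terms]


def search_stream(documents, partitions):
    """Send each document to every partition, then recombine the matches."""
    combined = defaultdict(list)
    for doc_id, text in documents:
        doc_terms = set(text.lower().split())
        for partition in partitions:
            combined[doc_id].extend(match_partition(partition, doc_terms))
        combined[doc_id].sort()
    return dict(combined)


queries = {"q1": ["samza"], "q2": ["luwak"], "q3": ["kafka", "samza"]}
documents = [("d1", "Samza reads from Kafka"), ("d2", "nothing relevant here")]
partitions = build_partitions(queries)
results = search_stream(documents, partitions)
```

In a real deployment each partition would run on a separate machine, and the recombination step would itself consume a stream of partial results keyed by document id.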

Martin Kleppmann

January 31, 2015

Transcript

  1. Kafka at LinkedIn • 350+ commodity machines • 8,000+ topics • 140,000+ partitions • 278 billion messages/day • 49 TB/day in • 176 TB/day out • Peak load – 4.4 million messages per second – 6 gigabits/sec inbound – 21 gigabits/sec outbound
  2. • Flax Luwak: https://github.com/flaxsearch/luwak • Apache Kafka: http://kafka.apache.org/ • Apache Samza: http://samza.apache.org/ • Martin Kleppmann - @martinkl • Alan Woodward - @romseygeek