Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Searching over streams with Luwak and Samza

Searching over streams with Luwak and Samza

Talk co-presented with Alan Woodward at FOSDEM, Brussels, Belgium, on 31 January 2015. http://martin.kleppmann.com/2015/01/31/searching-over-streams-at-fosdem.html


Real-time searching over streams is useful in a number of contexts. For example, companies may want to detect whenever they are mentioned in a news feed; or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.

Luwak (https://github.com/flaxsearch/luwak) provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by filtering out queries that it can detect will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.

Samza (http://samza.apache.org/) provides a framework for such parallelization, by partitioning and recombining both the document streams and the query set (which can be treated as just another stream), and also provides fault-tolerance mechanisms that allows swift recovery from machine failure, without losing documents or queries.

Martin Kleppmann

January 31, 2015

More Decks by Martin Kleppmann

Other Decks in Programming


  1. Ka#a  at  LinkedIn   •  350+  commodity  machines   • 

    8,000+  topics   •  140,000+  par==ons   •  278  Billion  messages/day   •  49  TB/day  in   •  176  TB/day  out   •  Peak  Load   –  4.4  Million  messages  per  second   –  6  Gigabits/sec  Inbound   –  21  Gigabits/sec  Outbound  
  2. •  Flax  Luwak:   hRps://github.com/flaxsearch/luwak   •  Apache  Ka#a:  hRp://ka#a.apache.org/

      •  Apache  Samza:  hRp://samza.apache.org/   •  Mar=n  Kleppmann  -­‐  @mar=nkl   •  Alan  Woodward  -­‐  @romseygeek