Searching over streams with Luwak and Samza

Talk co-presented with Alan Woodward at FOSDEM, Brussels, Belgium, on 31 January 2015.


Real-time searching over streams is useful in a number of contexts. For example, companies may want to detect whenever they are mentioned in a news feed; or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.

Luwak provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by first filtering out the queries that it can determine will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.
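
The filtering idea can be illustrated with a small sketch. This is not Luwak's API or its actual algorithm: the class `TinyMonitor`, its methods, and the restriction to queries that require all of their terms are assumptions made purely for illustration. Each query is indexed under one of its required terms, so a document only triggers full evaluation of the queries whose indexed term it contains.

```python
from collections import defaultdict


class TinyMonitor:
    """Toy illustration of prefiltering stored queries (hypothetical, not Luwak's API)."""

    def __init__(self):
        self.queries = {}                # query id -> set of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries indexed under it

    def register(self, query_id, required_terms):
        # Assumes at least one required term per query.
        terms = {t.lower() for t in required_terms}
        self.queries[query_id] = terms
        # Index the query under a single one of its terms: if that term is
        # absent from a document, the query cannot match and is skipped.
        self.by_term[min(terms)].add(query_id)

    def match(self, text):
        doc_terms = set(text.lower().split())
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Only the surviving candidates are fully evaluated.
        matched = sorted(q for q in candidates if self.queries[q] <= doc_terms)
        return matched, len(candidates)
```

With three registered queries, a document that contains only one of the indexed terms leads to full evaluation of just the queries filed under that term; the remaining queries are never touched.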

Samza provides a framework for such parallelization, by partitioning and recombining both the document streams and the query set (which can be treated as just another stream). It also provides fault-tolerance mechanisms that allow swift recovery from machine failure, without losing documents or queries.
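
One possible arrangement of this partitioning can be sketched in plain Python, with no Samza code involved. The sketch assumes that queries are split across partitions by a hash of their ID, that every document is sent to every partition, and that the partial results are then merged. The function names, the single-term queries, and the partition count are all hypothetical.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(query_id):
    # Stable hash, so a given query always lands on the same partition.
    return zlib.crc32(query_id.encode("utf-8")) % NUM_PARTITIONS


def route_queries(queries):
    """Split {query_id: term} into one query subset per partition."""
    partitions = [dict() for _ in range(NUM_PARTITIONS)]
    for query_id, term in queries.items():
        partitions[partition_for(query_id)][query_id] = term
    return partitions


def match_partition(partition, text):
    """Work done by one partition: match the document against its own queries."""
    words = set(text.lower().split())
    return [qid for qid, term in partition.items() if term in words]


def match_document(partitions, text):
    """Send the document to every partition, then recombine the partial results."""
    combined = []
    for partition in partitions:
        combined.extend(match_partition(partition, text))
    return sorted(combined)
```

In this arrangement each partition holds only a fraction of the query set, so adding partitions raises the number of queries that fit in memory, at the cost of every document being processed once per partition.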


Martin Kleppmann

January 31, 2015