Searching over streams with Luwak and Samza

Talk co-presented with Alan Woodward at FOSDEM, Brussels, Belgium, on 31 January 2015. http://martin.kleppmann.com/2015/01/31/searching-over-streams-at-fosdem.html

Abstract:

Real-time searching over streams is useful in a number of contexts. For example, companies may want to detect whenever they are mentioned in a news feed; or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.

Luwak (https://github.com/flaxsearch/luwak) provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by filtering out queries that it can detect will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.
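The filtering idea described above can be illustrated with a toy sketch. This is not Luwak's API: the `ToyMonitor` class, its method names, and the restriction to purely conjunctive term queries are all assumptions made for illustration. Each query is indexed under one of its required terms, so a document only needs to be checked against queries whose indexed term it actually contains.

```python
from collections import defaultdict


class ToyMonitor:
    """Hypothetical stand-in for a query monitor; not Luwak's actual API."""

    def __init__(self):
        self.queries = {}                # query id -> set of required terms
        self.by_term = defaultdict(set)  # indexed term -> ids of queries filed under it

    def register(self, query_id, required_terms):
        terms = {t.lower() for t in required_terms}
        if not terms:
            raise ValueError("a query needs at least one required term")
        self.queries[query_id] = terms
        # File the query under a single required term. If that term is missing
        # from a document, this all-terms-required query cannot match it.
        self.by_term[min(terms)].add(query_id)

    def match(self, text):
        doc_terms = set(text.lower().split())
        # Filtering step: collect only queries whose indexed term is in the document.
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Verification step: run the full check on the surviving candidates only.
        matches = sorted(q for q in candidates if self.queries[q] <= doc_terms)
        return matches, len(candidates)


monitor = ToyMonitor()
monitor.register("q1", ["samza"])
monitor.register("q2", ["luwak", "lucene"])
monitor.register("q3", ["kafka", "linkedin"])
print(monitor.match("Luwak runs stored queries on top of Lucene"))
# prints (['q2'], 1): one candidate survived the filter, and it matched
```

In this sketch all queries live in one in-memory dictionary, which mirrors the single-node, queries-in-RAM design mentioned above and is also why it stops scaling at some point.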

Samza (http://samza.apache.org/) provides a framework for such parallelization, by partitioning and recombining both the document streams and the query set (which can be treated as just another stream). It also provides fault-tolerance mechanisms that allow swift recovery from machine failure, without losing documents or queries.
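One possible way to picture the partitioning is sketched below in plain Python, without using Samza's API. The layout is an assumption, not something the abstract specifies: queries arrive as a stream and are hash-partitioned by id, every document is sent to every query partition, and the partial match lists are recombined per document. `MatcherTask`, `partition_for`, and `run` are hypothetical names, and fault tolerance is not modeled.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count, chosen only for the example


def partition_for(query_id, num_partitions=NUM_PARTITIONS):
    # crc32 gives the same result in every process, unlike the built-in hash().
    return zlib.crc32(query_id.encode("utf-8")) % num_partitions


class MatcherTask:
    """Hypothetical per-partition task holding only its own share of the queries."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def on_query(self, query_id, required_terms):
        # The query set is consumed as just another stream of messages.
        self.queries[query_id] = {t.lower() for t in required_terms}

    def on_document(self, text):
        doc_terms = set(text.lower().split())
        return [q for q, terms in self.queries.items() if terms <= doc_terms]


def run(queries, documents, num_partitions=NUM_PARTITIONS):
    tasks = [MatcherTask() for _ in range(num_partitions)]
    # Route each query to exactly one partition.
    for query_id, terms in queries:
        tasks[partition_for(query_id, num_partitions)].on_query(query_id, terms)
    combined = {}
    for doc_id, text in documents:
        # Send the document to every partition, then recombine the partial results.
        partial = [task.on_document(text) for task in tasks]
        combined[doc_id] = sorted(q for part in partial for q in part)
    return combined


queries = [("q1", ["samza"]), ("q2", ["luwak", "lucene"]), ("q3", ["kafka"])]
documents = [("d1", "Samza reads from Kafka"), ("d2", "nothing relevant here")]
print(run(queries, documents))
# prints {'d1': ['q1', 'q3'], 'd2': []}
```

The combined result does not depend on which partition a query landed on, so the number of partitions can be changed to spread a larger query set over more tasks.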

Martin Kleppmann

January 31, 2015

Transcript

  48. Kafka at LinkedIn
      • 350+ commodity machines
      • 8,000+ topics
      • 140,000+ partitions
      • 278 billion messages/day
      • 49 TB/day in
      • 176 TB/day out
      • Peak load
        – 4.4 million messages per second
        – 6 gigabits/sec inbound
        – 21 gigabits/sec outbound
  59. • Flax Luwak: https://github.com/flaxsearch/luwak
      • Apache Kafka: http://kafka.apache.org/
      • Apache Samza: http://samza.apache.org/
      • Martin Kleppmann - @martinkl
      • Alan Woodward - @romseygeek