Data Streams with Elasticsearch

Datalayer: 2017 talk - datalayer.com

Why do we need data streams,
Current approaches,
Proposing an alternative: Let's build a streaming DB,
How to build one yourself,
Previewing Streams: A realtime layer for Elasticsearch,
Benchmarks and upcoming features.

Siddharth Kothari

May 17, 2017

Transcript

  1. Use Cases
     1. Streams and Firehoses from #IoT
     2. Monitoring Systems
     3. Analytics
     4. E-commerce: Search, Price Monitoring
     5. Fraud Detection (Cyber Security, Payments)
  2. Typical Approach: Limitations
     • Middleware logic that connects DB with realtime protocol like WS. Edge Cases
     • Cannot handle complex realtime scenarios.
     • Scaling of WS connections.
  3. Approach to Complex Realtime Scenarios
     [Diagram labels: Streaming Framework, DB Layer, Middleware + Realtime Protocol, Scaling RP, Edge Cases, Monitoring]
     A complex System
  4. Alternative: A Streaming Database System
     • Take the best parts of a DB system
     • Bake in realtime protocol
     • Make the middleware layer optional
  5. Elasticsearch
     • Distributed Full-text Search based on Lucene
     • Can scale to many nodes and highly available
     • Analytics, Document Oriented, Open Source
  6. Percolation in Elasticsearch (aka Search in Reverse)
     1. Indexing a Query
     2. Matches when new documents are added
     3. Distributed design since v1.0.0
  7. Baking Realtime
     • Queries as subscriptions (HTTP Streaming / Websockets)
     • Publish matches to subscribers
     • Works as is with the ES API
  8. Data Streams: Topology
     • Beyond Percolation, keep the doc store model of ES.
     • Every document is a topic (channel).
     • Every query to the DB also maps to a topic (channel).
  9. Streams: Benefits
     • Performant, because Nginx!
     • Can work anywhere, because Docker!
     • All the good parts of existing data layers.
  10. Streams: Features
      • A nifty Time To Live (ttl) feature.
      • Optionally, stream without storing.
      • Interval and frequency based queries.
  11. Design Pattern
      Index > Store > Stream > Act
      [Diagram labels for "Act": AWS Lambda, Another REST API, Send an E-mail]
  12. Benchmarks: Indexing Overhead
      [Chart: Median Throughput / sec (axis 0 to 130000) on 3, 6 and 9 m3.2xlarge nodes; series: Elasticsearch, Elasticsearch with Streams]
      • <1% indexing overhead
      • Dataset: 8 Million Location Docs
  13. Benchmarks: Indexing Overhead
      [Chart: Median Throughput / sec (axis 0 to 400000) on 3, 6 and 9 m3.2xlarge nodes; series: Elasticsearch, Elasticsearch with Streams]
      • <1% indexing overhead
      • Dataset: 8 Million Location Points
  14. Benchmarks: More
      • Broadcasts at <1s latency for 10,000 clients.
      • ~20% CPU overhead at indexing time.
      • Number of streams nodes can scale.
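
The "search in reverse" idea from the percolation and topology slides can be sketched in a few lines of Python. This is a toy in-memory illustration under stated assumptions, not the Streams implementation and not the Elasticsearch API: the class `TinyStreamingStore`, its methods, the `store` flag, and the exact-match filter format are all hypothetical names chosen for the example. Queries are registered first, each one acts as a channel, and every newly indexed document is published to the subscribers of the queries it matches.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Document = Dict[str, object]
Callback = Callable[[str, Document], None]


class TinyStreamingStore:
    """Hypothetical in-memory stand-in for a doc store with query subscriptions."""

    def __init__(self) -> None:
        self.docs: Dict[str, Document] = {}
        # query_id -> field/value pairs that must all match exactly
        self.queries: Dict[str, Document] = {}
        # query_id -> callbacks notified when a new document matches
        self.subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, query_id: str, term_filter: Document, callback: Callback) -> None:
        """Register ("index") a query and attach a subscriber to its channel."""
        self.queries[query_id] = term_filter
        self.subscribers[query_id].append(callback)

    def index(self, doc_id: str, doc: Document, store: bool = True) -> List[str]:
        """Optionally store the document, then publish it to every matching query."""
        if store:
            self.docs[doc_id] = doc
        matched = []
        for query_id, term_filter in self.queries.items():
            if all(doc.get(field) == value for field, value in term_filter.items()):
                matched.append(query_id)
                for callback in self.subscribers[query_id]:
                    callback(doc_id, doc)
        return matched


received: List[str] = []
db = TinyStreamingStore()
db.subscribe(
    "fraud-alerts",
    {"type": "payment", "flagged": True},
    lambda doc_id, doc: received.append(doc_id),
)

db.index("p1", {"type": "payment", "flagged": False})  # stored, no match
db.index("p2", {"type": "payment", "flagged": True})   # stored and streamed
db.index("p3", {"type": "payment", "flagged": True}, store=False)  # streamed only
```

In this sketch the subscriber is a plain callback; in a real deployment it would presumably be a WebSocket or HTTP streaming connection, and the matching would be done by the search engine rather than a Python loop.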