Data Streams with Elasticsearch

Datalayer: 2017 talk - datalayer.com

Why do we need data streams,
Current approaches,
Proposing an alternative: Let's build a streaming DB,
How to build one yourself,
Previewing Streams: A realtime layer for Elasticsearch,
Benchmarks and upcoming features.

Siddharth Kothari

May 17, 2017

Transcript

  1. Use Cases

    1. Streams and Firehoses from #IoT
    2. Monitoring Systems
    3. Analytics
    4. E-commerce: Search, Price Monitoring
    5. Fraud Detection (Cyber Security, Payments)
  2. Typical Approach: Limitations

    • Middleware logic that connects DB with realtime protocol like WS. Edge Cases
    • Cannot handle complex realtime scenarios.
    • Scaling of WS connections.
  3. Approach to Complex Realtime Scenarios

    [Diagram labels: Streaming Framework, DB Layer, Middleware + Realtime Protocol, Scaling RP, Edge Cases, Monitoring. Caption: A complex system]
  4. Alternative: A Streaming Database System

    • Take the best parts of a DB system
    • Bake in realtime protocol
    • Make the middleware layer optional
  5. Elasticsearch

    • Distributed full-text search based on Lucene
    • Scales to many nodes and is highly available
    • Analytics, Document Oriented, Open Source
  6. Percolation in Elasticsearch (aka Search in Reverse)

    1. Indexing a Query
    2. Matches when new documents are added
    3. Distributed design since v1.0.0
  7. Baking Realtime

    • Queries as subscriptions (HTTP Streaming / Websockets)
    • Publish matches to subscribers
    • Works as is with the ES API
  8. Data Streams: Topology

    • Beyond Percolation, keep the doc store model of ES.
    • Every document is a topic (channel).
    • Every query to the DB also maps to a topic (channel).
  9. Streams: Benefits

    • Performant, because Nginx!
    • Can work anywhere, because Docker!
    • All the good parts of existing data layers.
  10. Streams: Features

    • A nifty Time To Live (ttl) feature.
    • Optionally, stream without storing.
    • Interval and frequency based queries.
  11. Design Pattern

    Index > Store > Stream > Act
    [Diagram labels for the Act step: AWS Lambda, Another REST API, Send an E-mail]
  12. Benchmarks: Indexing Overhead

    [Chart: median throughput per second (axis 0 to 130,000) on m3.2xlarge nodes for 3-node, 6-node, and 9-node clusters; series: Elasticsearch, Elasticsearch with Streams]
    <1% indexing overhead. Dataset: 8 Million Location Docs
  13. Benchmarks: Indexing Overhead

    [Chart: median throughput per second (axis 0 to 400,000) on m3.2xlarge nodes for 3-node, 6-node, and 9-node clusters; series: Elasticsearch, Elasticsearch with Streams]
    <1% indexing overhead. Dataset: 8 Million Location Points
  14. Benchmarks: More

    • Broadcasts at <1s latency for 10,000 clients.
    • ~20% CPU overhead at indexing time.
    • Number of streams nodes can scale.
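
The "search in reverse" idea from slides 6 to 8 (register a query first, then publish each newly indexed document to the subscribers of every query it matches) and the "stream without storing" option from slide 10 can be illustrated with a small in-memory sketch. This is not the Elasticsearch API and not the Streams implementation from the talk: the class `StreamingIndex`, its methods, and the use of plain Python predicates in place of real queries are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Predicate = Callable[[Document], bool]        # stand-in for a stored query
Subscriber = Callable[[str, Document], None]  # receives (query_id, document)


@dataclass
class StreamingIndex:
    """Toy in-memory model of a percolating document store (hypothetical)."""

    docs: Dict[str, Document] = field(default_factory=dict)
    queries: Dict[str, Predicate] = field(default_factory=dict)
    subscribers: Dict[str, List[Subscriber]] = field(default_factory=dict)

    def subscribe(self, query_id: str, predicate: Predicate,
                  callback: Subscriber) -> None:
        """Register ("index") a query and attach a subscriber to its topic."""
        self.queries[query_id] = predicate
        self.subscribers.setdefault(query_id, []).append(callback)

    def index(self, doc_id: str, doc: Document, store: bool = True) -> List[str]:
        """Optionally store a document, then publish it to matching queries."""
        if store:
            self.docs[doc_id] = doc
        # Search in reverse: run every registered query against the new document.
        matched = [qid for qid, pred in self.queries.items() if pred(doc)]
        for qid in matched:
            for callback in self.subscribers[qid]:
                callback(qid, doc)  # the "Act" step: any side effect goes here
        return matched


# Usage: one subscription, three documents, the last one streamed without storing.
received = []
idx = StreamingIndex()
idx.subscribe(
    "cheap-items",
    lambda d: d.get("price", 0) < 10,
    lambda qid, d: received.append((qid, d["name"])),
)
idx.index("1", {"name": "pen", "price": 2})
idx.index("2", {"name": "laptop", "price": 900})
idx.index("3", {"name": "cup", "price": 5}, store=False)
```

In this sketch the callback plays the role of the "Act" step in the Index > Store > Stream > Act pattern; in a real deployment it would be a network push over HTTP streaming or WebSockets rather than a local function call.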