Apache Druid

This presentation gives an overview of the Apache Druid project. It covers areas like use cases, features, architecture and users.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Music by

"Little Planet", composed and performed by Bensound from http://www.bensound.com/

Mike Frampton

May 14, 2020

Transcript

  1. What Is Druid?
    • Real Time Analytics Database
    • Distributed Architecture
    • Open Source
    • Highly Performant
    • Time Series Database
    • Apache 2 License
    • Written in Java
  2. Druid Use Cases
    • User activity and behaviour
    • Network flows
    • Digital marketing
    • Application performance management
    • IoT and device metrics
    • OLAP and business intelligence
    Suited to real time data ingestion, fast queries and high uptime.
  3. Druid Features
    • Column-oriented storage
    • Native search indexes
    • Streaming and batch ingest
    • Flexible schemas
    • Time-optimized partitioning
    • SQL support
    • Horizontal scalability
    • Easy operation
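To make the "Time-optimized partitioning" bullet more concrete, here is a minimal Python sketch of the general idea: events are bucketed into fixed time chunks, so a query over a time range only has to look at the chunks that overlap it. This is a conceptual illustration and not Druid's implementation; the function names and the hourly granularity are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def chunk_start(ts, granularity):
    """Return the start of the fixed-size time chunk that contains ts."""
    return EPOCH + ((ts - EPOCH) // granularity) * granularity

def partition_by_time(events, granularity=timedelta(hours=1)):
    """Group (timestamp, payload) events by the time chunk they fall into."""
    chunks = defaultdict(list)
    for ts, payload in events:
        chunks[chunk_start(ts, granularity)].append(payload)
    return dict(chunks)

def chunks_for_query(chunks, start, end, granularity=timedelta(hours=1)):
    """Keep only the chunks that overlap the half-open range [start, end)."""
    return {
        c: rows for c, rows in chunks.items()
        if c < end and c + granularity > start
    }

# Hypothetical events, used only for illustration.
events = [
    (datetime(2020, 5, 14, 10, 5, tzinfo=timezone.utc), "a"),
    (datetime(2020, 5, 14, 10, 59, tzinfo=timezone.utc), "b"),
    (datetime(2020, 5, 14, 11, 0, tzinfo=timezone.utc), "c"),
]
chunks = partition_by_time(events)
```

With hourly chunks, the first two events share one chunk and the third starts a new one, so a query limited to the 10:00 hour never touches the 11:00 data.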
  4. Druid Users
    • Airbnb
    • Alibaba
    • Booking.com
    • Cisco
    • eBay
    • Hulu
    • Lyft
    • Outbrain
    • PayPal
    • Pinterest
    • Slack
    • Twitter
    • Walmart
    • Yahoo
    These are some of the better-known users, among many others.
  5. Druid MetaStore
    • Stores metadata about the system and the data stored
    • Can use the following databases
      – Derby, MySQL, PostgreSQL
    • Stores metadata such as
      – Segments, Rules, Config
      – Tasks, Audit
  6. Druid Deep Storage
    • Deep storage persists Druid segment data
    • Uses storage like
      – Local Mounts, AWS S3, HDFS
    • Core extensions available from Druid committers
    • Extension examples include
      – Azure, Cassandra, Cloudfiles
  7. Druid Processes
    • Historical – store and query historic data
    • MiddleManager – ingest new data
    • Broker – process client queries
    • Coordinator – watch over Historical processes
    • Overlord – watch over MiddleManager processes
    • Router – optional – provide a unified API gateway
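The process roles above can be written down as a small lookup table. The Python sketch below pairs each process with its role from the slide and builds a health-check URL for it. The port numbers are the usual defaults of a stock Druid configuration and are an assumption here, as are the host name and the helper name `health_url`; a real cluster may use different values.

```python
# Roles come from the slide; ports are assumed stock defaults, not guaranteed.
PROCESSES = {
    "historical": ("store and query historic data", 8083),
    "middlemanager": ("ingest new data", 8091),
    "broker": ("process client queries", 8082),
    "coordinator": ("watch over Historical processes", 8081),
    "overlord": ("watch over MiddleManager processes", 8090),
    "router": ("optional unified API gateway", 8888),
}

def health_url(process, host="localhost"):
    """Build the URL of the /status/health endpoint for one process."""
    _role, port = PROCESSES[process]
    return f"http://{host}:{port}/status/health"

for name, (role, _port) in PROCESSES.items():
    print(f"{name:14s} {role:38s} {health_url(name)}")
```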
  8. Druid Query
    • Druid supports JSON and SQL based queries
    • The SQL syntax is as follows
    • GROUPING SETS computes several groupings in a single query, which reduces repeated scanning
    • ROLLUP provides grouped data for each level of aggregation
    • CUBE provides grouped data for each combination of the grouping columns
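As an illustration of the grouping bullets above, the following Python sketch expands ROLLUP and CUBE into the equivalent GROUPING SETS and builds, without sending, an HTTP request for Druid's SQL endpoint (`/druid/v2/sql`). The datasource `wikipedia`, the column names, the host and port, and all helper names are assumptions made for the example.

```python
import json
import urllib.request
from itertools import combinations

def rollup_sets(columns):
    """ROLLUP: one grouping per hierarchy level, from all columns down to none."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]

def cube_sets(columns):
    """CUBE: one grouping per combination of the columns, including none."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]

def grouping_sets_clause(sets):
    """Render a list of column tuples as a GROUPING SETS clause."""
    rendered = ", ".join("(" + ", ".join(s) + ")" for s in sets)
    return f"GROUPING SETS ({rendered})"

def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build (but do not send) a POST request carrying a SQL query as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical datasource and columns, used only for illustration.
columns = ["channel", "countryName"]
sql = (
    "SELECT channel, countryName, COUNT(*) AS edits "
    "FROM wikipedia "
    "GROUP BY " + grouping_sets_clause(rollup_sets(columns))
)
request = build_sql_request(sql)
print(sql)
```

Passing `request` to `urllib.request.urlopen` would submit the query to a running cluster; the sketch stops before that step so it can be read and run without one.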
  9. Druid High Availability (HA)
    • Use 3 or 5 ZooKeeper nodes on their own hardware
    • For the MetaStore, use MySQL or PostgreSQL
      – With replication and failover
    • Use multiple Coordinators and Overlords
      – Using the same MetaStore and ZooKeeper
    • Scale Brokers horizontally
    • Use a load balancer
  10. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
  11. Connect
    • Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration