Interactive Querying of Streams Using Apache Pulsar

Streamlio
September 12, 2019


As applications become more reliant on real-time data, streaming/messaging platforms have become increasingly popular and crucial to any data pipeline. Today, many streaming/messaging platforms are used only to access the most recent events in a stream; however, there is tremendous value to be unlocked if the full history of a stream can be queried interactively. Pulsar SQL is a query layer built on top of Apache Pulsar, a next-generation messaging platform, that enables users to dynamically query all streams, old and new, stored inside Pulsar. Users can thus unlock insights from both new and historical stream data in a single system. Pulsar SQL leverages Presto and Apache Pulsar’s unique architecture to execute queries in a highly scalable fashion, regardless of how many partitions make up the topics behind the streams. In this talk, we will examine the use cases and advantages of interactively querying events within a streaming/messaging platform, and how Pulsar enables users to do so in a user-friendly and efficient manner.






  2. AGENDA 1. Talk about the use cases 2. Existing Architectures 3. Apache Pulsar Overview 4. Pulsar SQL 5. Demo!
  3. WHAT ARE STREAMS? Continuous flows of data… Almost all data originates in this form
  4. INTERACTIVE QUERYING OF STREAMS? Querying both latest and historical data

  5. HOW IS IT USEFUL? • Speed (i.e. data-driven processing) - Act faster • Accuracy - In many contexts the wrong decision may be made without visibility into the most current data - For example, historical data may predict that a user is interested in buying a particular item, but if the analytics don’t also know that the user purchased that item two minutes ago, they will make the wrong recommendation • Simplification - A single place to access current and historical data
  6. DEBUGGING • Errors and exceptions • Troubleshooting systems and networks • Have we seen these errors before?
  7. MONITORING (AUDIT LOGS) • Answering the “What, When, Who, Why” • Suspicious access patterns • Example: auditing CDC logs in financial institutions
  8. EXPLORING • Raw or enriched data • Really simplifies access if data is all in one location
  9. LOTS OF USE CASES • Data analytics • Business Intelligence • Real-time dashboards • etc…
  10. STREAM PROCESSING PATTERN (diagram) Compute, messaging, and storage layers: Data Ingestion → Data Processing/Querying → Results Storage → Data Serving
  11. EXISTING SOLUTIONS (diagram) Messaging and real-time compute for stream querying; HDFS storage for data querying
  12. PROBLEMS WITH EXISTING SOLUTIONS • Multiple systems • Duplication of data - Data consistency: where is the source of truth? • Latency between data ingestion and when data is queryable

  14. HISTORY • Project started at Yahoo around 2012 and went through various iterations • Open-sourced in September 2016 • Entered the Apache Incubator in June 2017 • Graduated to TLP in September 2018 • Over 160 contributors and 4100 stars

  16. APACHE PULSAR Flexible messaging + streaming system backed by durable log storage
  17. EVENT STORE


  19. CORE CONCEPTS (diagram) Producers append messages to a topic over time; multiple consumers each read the topic independently
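To make the core model concrete, here is a toy Python sketch (not the Pulsar client API) of a topic as an append-only log over time, where each consumer tracks its own read position independently:

```python
# Toy model (not the Pulsar client API): a topic is an append-only log;
# each consumer keeps its own cursor and reads independently.
class Topic:
    def __init__(self):
        self.log = []          # messages in publish order
        self.cursors = {}      # consumer name -> next position to read

    def publish(self, msg):
        self.log.append(msg)

    def subscribe(self, consumer):
        self.cursors.setdefault(consumer, 0)

    def receive(self, consumer):
        pos = self.cursors[consumer]
        if pos >= len(self.log):
            return None        # nothing new yet
        self.cursors[consumer] = pos + 1
        return self.log[pos]

t = Topic()
t.subscribe("a"); t.subscribe("b")
t.publish("m1"); t.publish("m2")
print(t.receive("a"))  # m1 -- consumer "a" reads from its own cursor
print(t.receive("a"))  # m2
print(t.receive("b"))  # m1 -- consumer "b" is unaffected by "a"
```

Pulsar’s real subscriptions add modes (exclusive, shared, failover), but the essential idea is the same: the topic is a durable log, and consuming does not remove data.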

  20. CORE CONCEPTS (diagram) An Apache Pulsar cluster is divided into tenants (e.g. Marketing, Sales, Security), tenants into namespaces (e.g. Campaigns, Analytics, Data Transformation, Data Integration, Microservices), and namespaces into topics (e.g. Visits, Conversions, Responses, Transactions, Interactions, Log events, Signatures, Accesses)
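Topic names encode this hierarchy. As a small illustrative helper (hypothetical; not part of the Pulsar client), a fully qualified topic name can be built as:

```python
# Illustrative helper (hypothetical; not part of the Pulsar client API).
# Pulsar topic names follow: persistent://<tenant>/<namespace>/<topic>
def topic_name(tenant, namespace, topic, persistent=True):
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"

print(topic_name("marketing", "campaigns", "visits"))
# persistent://marketing/campaigns/visits
```

Since Pulsar 2.0 the cluster component is no longer part of the topic name; `public/default` is the tenant/namespace available out of the box.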
  21. ARCHITECTURE Multi-layer, scalable architecture • Independent layers for processing, serving and storage • Messaging and processing built on Apache Pulsar • Storage built on Apache BookKeeper (diagram: producers and consumers connect to brokers in the messaging layer; bookies provide event storage; function workers provide processing)
  22. SEGMENT CENTRIC STORAGE • In addition to partitioning, messages are stored in segments (based on time and size) • Segments are independent of each other and spread across all storage nodes
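A toy Python sketch of the idea, with assumed names and a size-only roll policy (real BookKeeper rolls segments on both time and size, and places each segment on a replicated ensemble rather than a single node):

```python
# Illustrative sketch of segment-centric storage (not BookKeeper code):
# a segment is closed ("rolled") once it reaches a size limit, and each
# segment can be placed on storage nodes independently of the others.
import itertools

MAX_ENTRIES = 3        # roll after this many entries (stands in for size)
nodes = ["bookie-1", "bookie-2", "bookie-3"]

def assign_segments(messages):
    segments = []
    node_cycle = itertools.cycle(nodes)
    for i in range(0, len(messages), MAX_ENTRIES):
        segments.append({
            "entries": messages[i:i + MAX_ENTRIES],
            "node": next(node_cycle),   # independent placement per segment
        })
    return segments

for seg in assign_segments([f"m{i}" for i in range(8)]):
    print(seg["node"], seg["entries"])
```

Because placement is per segment rather than per partition, adding a storage node makes it immediately eligible for new segments, with no rebalancing of old data.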
  23. SEGMENT CENTRIC • Unbounded log storage • Instant scaling without data rebalancing • High write and read availability via maximized data placement options • Fast replica repair — many-to-many reads
  24. WRITES • Every segment/ledger has an ensemble • Each entry in a ledger has: - Write quorum: the nodes of the ensemble to which it is written (usually all) - Ack quorum: the nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority)
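The acknowledgement rule can be sketched in a few lines of Python (an illustrative model with assumed names, not BookKeeper internals):

```python
# Illustrative model (not BookKeeper internals): an entry is sent to the
# write quorum and is acknowledged to the client once at least
# ack-quorum nodes have confirmed the write.
def entry_acked(write_quorum_size, ack_quorum_size, responses):
    """responses: set of write-quorum nodes that confirmed the write."""
    assert ack_quorum_size <= write_quorum_size
    return len(responses) >= ack_quorum_size

# Ensemble of 3, write quorum 3, ack quorum 2 (a majority):
print(entry_acked(3, 2, {"bookie-1", "bookie-3"}))  # True
print(entry_acked(3, 2, {"bookie-2"}))              # False
```

Separating the two quorums lets writes tolerate a slow node: durability is governed by the write quorum, while latency is governed by the ack quorum.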
  25. BOOKKEEPER INTERNALS • Separate I/O paths for reads and writes • Optimized for writes, tailing reads, and catch-up reads

  27. TIERED STORAGE Unlimited topic storage capacity. Achieves true “stream storage”: keep the raw data forever in stream form
  28. TIERED STORAGE • Leverage cloud storage services to offload cold data — completely transparent to clients • Extremely cost effective — backends: S3 (GCS and HDFS coming) • Example: retain all data for 1 month — offload all messages older than 1 day to S3
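The example policy on this slide can be modeled as follows (an illustrative sketch with assumed names; in practice offloading is configured via namespace policies, not application code):

```python
# Illustrative sketch of the tiered-storage policy from the slide
# (not the Pulsar offloader): retain everything for 30 days, and
# offload segments older than 1 day to S3.
DAY = 24 * 60 * 60

def place_segments(segments, now, retention=30 * DAY, offload_after=1 * DAY):
    """segments: dict of segment id -> time the segment was closed."""
    placement = {}
    for seg_id, closed_at in segments.items():
        age = now - closed_at
        if age > retention:
            placement[seg_id] = "delete"
        elif age > offload_after:
            placement[seg_id] = "s3"          # cold: offloaded, still queryable
        else:
            placement[seg_id] = "bookkeeper"  # hot: still on the bookies
    return placement

now = 100 * DAY
segs = {"seg-old": now - 40 * DAY, "seg-cold": now - 3 * DAY, "seg-hot": now - 3600}
print(place_segments(segs, now))
```

Because offloading happens at segment granularity, clients keep reading through the same topic; the broker resolves whether a given segment lives on the bookies or in S3.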
  29. SCHEMA REGISTRY • Stores information on the data structure — stored in BookKeeper • Enforces data types on a topic • Allows for compatible schema evolution
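As a rough illustration of what a compatibility check does (a toy model; Pulsar’s registry applies Avro-style rules, and `is_compatible` is a hypothetical name):

```python
# Toy compatibility check (illustrative; Pulsar's registry applies
# Avro-style rules). An evolution is accepted here if no existing field
# is removed and every newly added field carries a default value.
def is_compatible(old_fields, new_fields):
    for name in old_fields:
        if name not in new_fields:
            return False                  # removing a field breaks old readers
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False                  # new field must have a default
    return True

old = {"user_id": {"type": "string"}}
new_ok = {"user_id": {"type": "string"}, "country": {"type": "string", "default": ""}}
new_bad = {"user_id": {"type": "string"}, "country": {"type": "string"}}
print(is_compatible(old, new_ok))   # True
print(is_compatible(old, new_bad))  # False
```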
  30. APACHE PULSAR • Multi-tenancy - A single cluster can support many tenants and use cases • Seamless cluster expansion - Expand the cluster without any downtime • High throughput & low latency - Can reach 1.8M messages/s in a single partition with publish latency of 5 ms at the 99th percentile • Durability - Data replicated and synced to disk • Geo-replication - Out-of-the-box support for geographically distributed applications • Unified messaging model - Supports both topic and queue semantics in a single model • Tiered storage - Hot/warm data for real-time access and cold event data in cheaper storage • Pulsar Functions - Flexible, lightweight compute • Highly scalable - Can support millions of topics, making data modeling easier
  31. PULSAR SQL • Interactive SQL queries over data stored in Pulsar • Query old and real-time data
  32. PULSAR SQL / 2 • Based on Presto by Facebook • Presto is a distributed query execution engine • Fetches data from multiple sources (HDFS, S3, MySQL, …) • Full SQL compatibility
  33. PULSAR SQL / 3 • Pulsar connector for Presto • Reads data directly from BookKeeper — bypasses the Pulsar broker • Can also read data offloaded to tiered storage (S3, GCS, etc.) • Many-to-many data reads • Data is split even within a single partition — multiple workers can read in parallel from a single Pulsar partition • Time-based indexing — use “publishTime” in predicates to reduce the data read from disk
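The time-based pruning works roughly like this sketch (illustrative, not the connector code: each segment is reduced to a min/max publish-time range, and the SQL in the comment assumes the connector's `__publish_time__` column and `pulsar` catalog naming):

```python
# Illustrative sketch of time-based segment pruning (not the Presto
# connector code). A query such as:
#
#   SELECT * FROM pulsar."public/default"."visits"
#   WHERE __publish_time__ > timestamp '2019-09-12 00:00:00'
#
# only needs to read segments whose publish-time range can overlap
# the predicate; everything else is skipped without touching disk.
def segments_to_read(ranges, t_min):
    """ranges: list of (segment_id, min_publish_ts, max_publish_ts)."""
    return [sid for sid, lo, hi in ranges if hi > t_min]

ranges = [("seg-0", 0, 100), ("seg-1", 100, 200), ("seg-2", 200, 300)]
print(segments_to_read(ranges, 150))  # ['seg-1', 'seg-2']
```

Because publish time increases monotonically within a segment, the connector can also seek within a matching segment instead of scanning it from the start.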

  35. BENEFITS • No need to move data into another system for querying • Read data in parallel - Performance not impacted by partitioning - Increase read throughput by increasing the write quorum - Newly arrived data can be queried immediately
  36. PERFORMANCE • Setup - 3 nodes - 12 CPU cores - 128 GB RAM - 2 x 1.2 TB NVMe disks • Results - JSON (compressed): ~60 million rows/second - Avro (compressed): ~50 million rows/second
  37. DEMO

  38. APACHE PULSAR IN PRODUCTION @SCALE • 4+ years • Serves 2.3 million topics • 700 billion messages/day • 500+ bookie nodes • 200+ broker nodes • Average latency < 5 ms; 99.9th percentile 15 ms (with strong durability guarantees) • Zero data loss • 150+ applications • Self-served provisioning • Full-mesh cross-datacenter replication across 8+ data centers

  39. Job search website (LinkedIn, Indeed, etc.) • Using Pulsar to track job searches by users • Search params • Search results • Analysis of user behavior
  40. FUTURE WORK • Performance tuning • Store data in columnar format - Improve compression ratio - Materialize relevant columns • Support different indices
  41. QUESTIONS? • Try Apache Pulsar yourself! - Sign up for the Streamlio sandbox