Streamlio
September 12, 2019

Interactive Querying of Streams Using Apache Pulsar

As applications become more reliant on real-time data, streaming and messaging platforms have become increasingly popular and crucial to any data pipeline. Today, many of these platforms are used only to access the most recent events in a stream; however, tremendous value can be unlocked if the full history of a stream can be queried interactively. Pulsar SQL is a query layer built on top of Apache Pulsar (a next-generation messaging platform) that enables users to dynamically query all streams, old and new, stored inside Pulsar. Users can thus gain insights from both new and historical streams of data in a single system. Pulsar SQL leverages Presto and Apache Pulsar’s unique architecture to execute queries in a highly scalable fashion, regardless of the number of topic partitions that make up the streams. In this talk, we will examine the use cases and advantages of interactively querying events within a streaming messaging platform, and how Pulsar enables users to do so in a user-friendly and efficient manner.

Transcript

  1. AGENDA
    1. Talk about the use cases
    2. Existing Architectures
    3. Apache Pulsar Overview
    4. Pulsar SQL
    5. Demo!
  2. HOW IS IT USEFUL?
    • Speed (i.e. data-driven processing)
      - Act faster
    • Accuracy
      - In many contexts the wrong decision may be made if you do not have visibility that includes the most current data
      - For example, historical data is useful for predicting that a user is interested in buying a particular item, but if the analytics don’t also know that the user purchased that item two minutes ago, they will make the wrong recommendation
    • Simplification
      - Single place to go to access current and historical data
  3. MONITORING (AUDIT LOGS)
    • Answering the “What, When, Who, Why”
    • Suspicious access patterns
    • Example
      - Auditing CDC logs in financial institutions
  4. STREAM PROCESSING PATTERN
    [Diagram labels: Compute, Messaging, Storage; Data Ingestion, Data Processing / Querying, Results Storage, Data Storage, Data Serving]
  5. PROBLEMS WITH EXISTING SOLUTIONS
    • Multiple systems
    • Duplication of data
      - Data consistency: where is the source of truth?
    • Latency between data ingestion and when data is queryable
  6. HISTORY
    • Project started at Yahoo around 2012 and went through various iterations
    • Open-sourced in September 2016
    • Entered the Apache Incubator in June 2017
    • Graduated as a top-level project (TLP) in September 2018
    • Over 160 contributors and 4100 stars
  7. CORE CONCEPTS
    [Diagram: an Apache Pulsar cluster organized as a hierarchy of tenants, namespaces, and topics]
    • Tenants: Marketing, Sales, Security
    • Namespaces: Analytics, Campaigns, Data Transformation, Data Integration, Microservices
    • Topics: Visits, Conversions, Responses, Conversions, Transactions, Interactions, Log events, Signatures, Accesses
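
The tenant / namespace / topic hierarchy shows up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. The sketch below splits such a name into its parts. The helper `parse_topic_name` and the example pairing of the Marketing tenant with the Campaigns namespace and Visits topic are illustrative assumptions, not something the slide states.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(full_name: str) -> TopicName:
    """Split 'domain://tenant/namespace/topic' into its four parts."""
    domain, sep, rest = full_name.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in {full_name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(domain, *parts)


# Hypothetical topic built from the labels on the slide.
name = parse_topic_name("persistent://marketing/campaigns/visits")
print(name.tenant, name.namespace, name.topic)
```
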
  8. ARCHITECTURE
    Multi-layer, scalable architecture
    • Independent layers for processing, serving and storage
    • Messaging and processing built on Apache Pulsar
    • Storage built on Apache BookKeeper
    [Diagram: producers and consumers connect to the messaging layer (brokers); event storage layer (bookies); function processing layer (workers)]
  9. SEGMENT CENTRIC STORAGE
    • In addition to partitioning, messages are stored in segments (based on time and size)
    • Segments are independent from each other and spread across all storage nodes
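
A minimal sketch of the idea on this slide: a log rolls over to a new segment when a size or age limit is hit, and each new segment is placed on a different subset of storage nodes. The class names, the limits, and the simple rotating placement are assumptions made for illustration; BookKeeper's actual placement policy is more involved.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    segment_id: int
    opened_at: float              # seconds, supplied by the caller
    bookies: Tuple[str, ...]      # storage nodes holding this segment
    size_bytes: int = 0


class SegmentedLog:
    """Toy log that rolls to a new segment on a size or age limit."""

    def __init__(self, bookies: List[str], ensemble_size: int,
                 max_bytes: int, max_age_s: float) -> None:
        if not 0 < ensemble_size <= len(bookies):
            raise ValueError("ensemble_size must be between 1 and len(bookies)")
        self.bookies = bookies
        self.ensemble_size = ensemble_size
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.segments: List[Segment] = []

    def _open_segment(self, now: float) -> Segment:
        # Rotate the starting node so consecutive segments land on
        # different subsets of nodes (a stand-in for a real placement policy).
        start = len(self.segments) % len(self.bookies)
        nodes = tuple(self.bookies[(start + i) % len(self.bookies)]
                      for i in range(self.ensemble_size))
        segment = Segment(len(self.segments), now, nodes)
        self.segments.append(segment)
        return segment

    def append(self, payload: bytes, now: float) -> int:
        """Append a message and return the id of the segment that holds it."""
        current = self.segments[-1] if self.segments else self._open_segment(now)
        too_big = current.size_bytes + len(payload) > self.max_bytes
        too_old = now - current.opened_at >= self.max_age_s
        # Never roll an empty segment; otherwise roll on either limit.
        if current.size_bytes > 0 and (too_big or too_old):
            current = self._open_segment(now)
        current.size_bytes += len(payload)
        return current.segment_id


log = SegmentedLog(["b1", "b2", "b3", "b4", "b5"], ensemble_size=3,
                   max_bytes=100, max_age_s=60.0)
ids = [log.append(b"x" * 60, now=0.0),    # first segment
       log.append(b"x" * 60, now=1.0),    # size limit reached: new segment
       log.append(b"x" * 10, now=100.0)]  # age limit reached: new segment
print(ids, [s.bookies for s in log.segments])
```

Because every segment carries its own node list, adding a storage node only affects where future segments go; existing segments stay where they are.
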
  10. SEGMENT CENTRIC
    • Unbounded log storage
    • Instant scaling without data rebalancing
    • High write and read availability via maximized data placement options
    • Fast replica repair: many-to-many read
  11. WRITES
    • Every segment/ledger has an ensemble
    • Each entry in a ledger has a
      ✦ Write quorum: nodes of the ensemble to which it is written (usually all)
      ★ Ack quorum: nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority)
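
The relationship between ensemble, write quorum, and ack quorum can be sketched as follows. This is a simplified model, not BookKeeper's client code: the function names are hypothetical, and the round-robin choice of which nodes receive each entry is an assumption made for illustration.

```python
from typing import List, Sequence, Set


def write_set(ensemble: Sequence[str], entry_id: int, write_quorum: int) -> List[str]:
    """Nodes that receive a given entry: write_quorum consecutive
    ensemble members, starting at a position derived from the entry id."""
    if not 0 < write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and len(ensemble)")
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def is_acknowledged(ensemble: Sequence[str], entry_id: int, write_quorum: int,
                    ack_quorum: int, responded: Set[str]) -> bool:
    """An entry is acknowledged once ack_quorum nodes of its write set respond."""
    if not 0 < ack_quorum <= write_quorum:
        raise ValueError("ack_quorum must be between 1 and write_quorum")
    targets = write_set(ensemble, entry_id, write_quorum)
    return sum(1 for node in targets if node in responded) >= ack_quorum


ensemble = ["b1", "b2", "b3"]
# Write to all three nodes, wait for a majority (two) to respond.
print(is_acknowledged(ensemble, 0, write_quorum=3, ack_quorum=2,
                      responded={"b1", "b3"}))
print(is_acknowledged(ensemble, 0, write_quorum=3, ack_quorum=2,
                      responded={"b2"}))
```

With an ack quorum smaller than the write quorum, a single slow node does not delay acknowledgement, while the entry still ends up on every node in its write set.
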
  12. BOOKKEEPER INTERNAL
    • Separate IO path for reads and writes
    • Optimized for writing, tailing reads, catch-up reads
  13. TIERED STORAGE
    • Leverage cloud storage services to offload cold data
      - Completely transparent to clients
    • Extremely cost effective
      - Backends: S3 (GCS and HDFS coming)
    • Example: retain all data for 1 month
      - Offload all messages older than 1 day to S3
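
The example policy on this slide (keep one month of data, move anything older than one day to S3) amounts to a simple age-based decision per segment. The sketch below models only that decision; the function name, the tier labels, and the use of 30 days for "1 month" are assumptions, and it does not reflect how Pulsar is actually configured or triggered to offload.

```python
from datetime import timedelta


def storage_tier(age: timedelta,
                 offload_after: timedelta = timedelta(days=1),
                 retain_for: timedelta = timedelta(days=30)) -> str:
    """Classify a segment by age under the slide's example policy."""
    if age >= retain_for:
        return "expired"       # past retention, eligible for deletion
    if age >= offload_after:
        return "offloaded"     # cold data, served from cloud storage (e.g. S3)
    return "bookkeeper"        # recent data, served from the storage nodes


for age in (timedelta(hours=2), timedelta(days=3), timedelta(days=45)):
    print(age, "->", storage_tier(age))
```
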
  14. SCHEMA REGISTRY
    • Store information on the data structure
      - Stored in BookKeeper
    • Enforce data types on topic
    • Allow for compatible schema evolutions
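
One way to picture a "compatible schema evolution" is a check that a reader using the new schema can still decode data written with the old one. The sketch below is a deliberately simplified, Avro-like rule (shared fields must keep their type, added fields need a default); the representation of a schema as a dict and the function `can_read` are hypothetical and are not the registry's real compatibility logic.

```python
from typing import Dict, Tuple

# Hypothetical schema representation: field name -> (type name, has_default)
Schema = Dict[str, Tuple[str, bool]]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if data written with `writer` can be decoded using `reader`."""
    for name, (field_type, has_default) in reader.items():
        if name in writer:
            if writer[name][0] != field_type:
                return False   # type of a shared field changed
        elif not has_default:
            return False       # new field with no default to fall back on
    return True


v1: Schema = {"user": ("string", False), "item": ("string", False)}
v2: Schema = {**v1, "country": ("string", True)}    # adds a field with a default
v3: Schema = {**v1, "country": ("string", False)}   # adds a field without one

print(can_read(v2, v1), can_read(v3, v1))
```
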
  15. APACHE PULSAR
    • Multi-tenancy: a single cluster can support many tenants and use cases
    • Seamless cluster expansion: expand the cluster without any downtime
    • High throughput and low latency: can reach 1.8 M messages/s in a single partition, with a publish latency of 5 ms at the 99th percentile
    • Durability: data replicated and synced to disk
    • Geo-replication: out-of-the-box support for geographically distributed applications
    • Unified messaging model: supports both topic and queue semantics in a single model
    • Tiered storage: hot/warm data for real-time access and cold event data in cheaper storage
    • Pulsar Functions: flexible, lightweight compute
    • Highly scalable: can support millions of topics, which makes data modeling easier
  16. PULSAR SQL
    • Interactive SQL queries over data stored in Pulsar
    • Query old and real-time data
  17. PULSAR SQL / 2
    • Based on Presto by Facebook (https://prestodb.io/)
    • Presto is a distributed query execution engine
    • Fetches the data from multiple sources (HDFS, S3, MySQL, …)
    • Full SQL compatibility
  18. PULSAR SQL / 3
    • Pulsar connector for Presto
    • Read data directly from BookKeeper, bypassing the Pulsar broker
    • Can also read data offloaded to Tiered Storage (S3, GCS, etc.)
    • Many-to-many data reads
    • Data is split even on a single partition: multiple workers can read data in parallel from a single Pulsar partition
    • Time-based indexing: use “publishTime” in predicates to reduce the data being read from disk
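
To illustrate the publish-time predicate, the sketch below builds a Presto query string that restricts a topic scan to a time window. It assumes the connector is registered as the `pulsar` catalog, that the schema is named `tenant/namespace`, and that the publish time is exposed as a `__publish_time__` column; check these names against your Pulsar version. The helper and the topic are hypothetical, and the string formatting is for illustration only (it does no escaping of identifiers).

```python
from datetime import datetime


def publish_time_query(tenant: str, namespace: str, topic: str,
                       start: datetime, end: datetime) -> str:
    """Build a query that scans only messages published in [start, end)."""
    table = f'pulsar."{tenant}/{namespace}"."{topic}"'
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"SELECT * FROM {table} "
        f"WHERE __publish_time__ >= timestamp '{start.strftime(fmt)}' "
        f"AND __publish_time__ < timestamp '{end.strftime(fmt)}'"
    )


query = publish_time_query("public", "default", "visits",
                           datetime(2019, 9, 12, 0, 0, 0),
                           datetime(2019, 9, 12, 1, 0, 0))
print(query)
```

A bounded window like this gives the time-based index something to work with, so the engine can skip data outside the window instead of reading the whole topic from disk.
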
  19. BENEFITS
    • No need to move data into another system for querying
    • Read data in parallel
      - Performance not impacted by partitioning
      - Increase throughput by increasing write quorum
      - Newly arrived data can be queried immediately
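
Reading a single partition in parallel comes down to cutting its range of entries into contiguous pieces and handing each piece to a different worker. The sketch below shows one even way to do that; the function name and the inclusive `(start, end)` representation are assumptions, and the real connector's split logic may differ.

```python
from typing import List, Tuple


def split_entries(first_entry: int, last_entry: int,
                  num_splits: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [first_entry, last_entry] into up to
    num_splits contiguous, near-equal (start, end) ranges."""
    total = last_entry - first_entry + 1
    if total <= 0 or num_splits <= 0:
        return []
    num_splits = min(num_splits, total)   # never create empty splits
    base, extra = divmod(total, num_splits)
    splits, start = [], first_entry
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # first `extra` splits get one more
        splits.append((start, start + size - 1))
        start += size
    return splits


# Ten entries of one partition shared among three workers.
print(split_entries(0, 9, 3))
```

Since the number of splits is independent of the number of partitions, query parallelism can be raised without repartitioning the topic.
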
  20. PERFORMANCE
    • Setup
      - 3 nodes
      - 12 CPU cores
      - 128 GB RAM
      - 2 x 1.2 TB NVMe disks
    • Results
      - JSON (compressed): ~60 million rows/second
      - Avro (compressed): ~50 million rows/second
  21. APACHE PULSAR IN PRODUCTION @SCALE
    • 4+ years
    • Serves 2.3 million topics
    • 700 billion messages/day
    • 500+ bookie nodes
    • 200+ broker nodes
    • Average latency < 5 ms
    • 99.9th percentile latency: 15 ms (strong durability guarantees)
    • Zero data loss
    • 150+ applications
    • Self-served provisioning
    • Full-mesh cross-datacenter replication
      - 8+ data centers
  22. USE CASE: JOB SEARCH ANALYTICS (ZHAOPIN)
    • ZhaoPin: Chinese job search website (similar to LinkedIn, Indeed, etc.)
    • Using Pulsar to track job searches by users
      - Search params
      - Search results
    • Analysis on user behavior
  23. FUTURE WORK
    • Performance tuning
    • Store data in columnar format
      - Improve compression ratio
      - Materialize relevant columns
    • Support different indices