Interactive Querying of Streams Using Apache Pulsar

Streamlio
September 12, 2019


As applications become more reliant on real-time data, streaming/messaging platforms have become increasingly popular and crucial to any data pipeline. Today, many streaming/messaging platforms are used only to access the most recent events in a stream; however, there is tremendous value to be unlocked if the full history of a stream can be queried interactively. Pulsar SQL is a query layer built on top of Apache Pulsar, a next-generation messaging platform, that enables users to dynamically query all streams, old and new, stored inside Pulsar. Users can thus unlock insights from both new and historical stream data in a single system. Pulsar SQL leverages Presto and Apache Pulsar’s unique architecture to execute queries in a highly scalable fashion, regardless of how many partitions make up the topics behind the streams. In this talk, we will examine the use cases and advantages of interactively querying events within a streaming/messaging platform, and how Pulsar enables users to do so in a user-friendly and efficient manner.






  2. AGENDA 1. Talk about the use cases 2. Existing Architectures 3. Apache Pulsar Overview 4. Pulsar SQL 5. Demo!
  3. WHAT ARE STREAMS? Continuous flows of data… Almost all data originates in this form
  4. INTERACTIVE QUERYING OF STREAMS? Querying both latest and historical data

  5. HOW IS IT USEFUL? • Speed (i.e. data-driven processing) - Act faster • Accuracy - In many contexts the wrong decision may be made without visibility into the most current data - For example, historical data may predict that a user is interested in buying a particular item, but if the analytics don’t also know that the user purchased that item two minutes ago, they will make the wrong recommendation • Simplification - A single place to access current and historical data
  6. DEBUGGING • Errors and exceptions • Troubleshooting systems and networks • Have we seen these errors before?
  7. MONITORING (AUDIT LOGS) • Answering the “What, When, Who, Why” • Suspicious access patterns • Example: auditing CDC logs in financial institutions
  8. EXPLORING • Raw or enriched data • Really simplifies access if data is all in one location
  9. LOTS OF USE CASES • Data analytics • Business Intelligence • Real-time dashboards • etc…
  10. STREAM PROCESSING PATTERN (diagram) Compute, messaging, and storage layers: Data Ingestion → Data Processing/Querying → Results Storage → Data Serving
  11. EXISTING SOLUTIONS (diagram) Messaging and real-time compute for stream querying; HDFS storage for data querying
  12. PROBLEMS WITH EXISTING SOLUTIONS • Multiple systems • Duplication of data - Data consistency: where is the source of truth? • Latency between data ingestion and when data is queryable

  14. HISTORY • Project started at Yahoo around 2012 and went through various iterations • Open-sourced in September 2016 • Entered the Apache Incubator in June 2017 • Graduated to TLP in September 2018 • Over 160 contributors and 4100 stars

  16. APACHE PULSAR Flexible messaging + streaming system backed by durable log storage
  17. EVENT STORE


  19. CORE CONCEPTS (diagram) Producers append messages to a topic over time; multiple consumers each read the topic independently
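To make the core model concrete, here is a toy Python sketch (not the Pulsar client API) of a topic as an append-only log over time, where each consumer tracks its own read position independently:

```python
# Toy model (not the Pulsar client API): a topic is an append-only log;
# each consumer keeps its own cursor and reads independently.
class Topic:
    def __init__(self):
        self.log = []          # messages in publish order
        self.cursors = {}      # consumer name -> next position to read

    def publish(self, msg):
        self.log.append(msg)

    def subscribe(self, consumer):
        self.cursors.setdefault(consumer, 0)

    def receive(self, consumer):
        pos = self.cursors[consumer]
        if pos >= len(self.log):
            return None        # nothing new yet
        self.cursors[consumer] = pos + 1
        return self.log[pos]

t = Topic()
t.subscribe("a"); t.subscribe("b")
t.publish("m1"); t.publish("m2")
print(t.receive("a"))  # m1 -- consumer "a" reads from its own cursor
print(t.receive("a"))  # m2
print(t.receive("b"))  # m1 -- consumer "b" is unaffected by "a"
```

Pulsar’s real subscriptions add modes (exclusive, shared, failover), but the essential idea is the same: the topic is a durable log, and consuming does not remove data.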

  20. CORE CONCEPTS (diagram) An Apache Pulsar cluster is divided into tenants (e.g. Marketing, Sales, Security), tenants into namespaces (e.g. Campaigns, Analytics, Data Transformation, Data Integration, Microservices), and namespaces into topics (e.g. Visits, Conversions, Responses, Transactions, Interactions, Log events, Signatures, Accesses)
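Topic names encode this hierarchy. As a small illustrative helper (hypothetical; not part of the Pulsar client), a fully qualified topic name can be built as:

```python
# Illustrative helper (hypothetical; not part of the Pulsar client API).
# Pulsar topic names follow: persistent://<tenant>/<namespace>/<topic>
def topic_name(tenant, namespace, topic, persistent=True):
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"

print(topic_name("marketing", "campaigns", "visits"))
# persistent://marketing/campaigns/visits
```

Since Pulsar 2.0 the cluster component is no longer part of the topic name; `public/default` is the tenant/namespace available out of the box.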
  21. ARCHITECTURE Multi-layer, scalable architecture • Independent layers for processing, serving and storage • Messaging and processing built on Apache Pulsar • Storage built on Apache BookKeeper (diagram: producers and consumers connect to brokers in the messaging layer; bookies provide event storage; function workers provide processing)
  22. SEGMENT CENTRIC STORAGE • In addition to partitioning, messages are stored in segments (based on time and size) • Segments are independent of each other and spread across all storage nodes
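A toy Python sketch of the idea, with assumed names and a size-only roll policy (real BookKeeper rolls segments on both time and size, and places each segment on a replicated ensemble rather than a single node):

```python
# Illustrative sketch of segment-centric storage (not BookKeeper code):
# a segment is closed ("rolled") once it reaches a size limit, and each
# segment can be placed on storage nodes independently of the others.
import itertools

MAX_ENTRIES = 3        # roll after this many entries (stands in for size)
nodes = ["bookie-1", "bookie-2", "bookie-3"]

def assign_segments(messages):
    segments = []
    node_cycle = itertools.cycle(nodes)
    for i in range(0, len(messages), MAX_ENTRIES):
        segments.append({
            "entries": messages[i:i + MAX_ENTRIES],
            "node": next(node_cycle),   # independent placement per segment
        })
    return segments

for seg in assign_segments([f"m{i}" for i in range(8)]):
    print(seg["node"], seg["entries"])
```

Because placement is per segment rather than per partition, adding a storage node makes it immediately eligible for new segments, with no rebalancing of old data.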
  23. SEGMENT CENTRIC • Unbounded log storage • Instant scaling without data rebalancing • High write and read availability via maximized data placement options • Fast replica repair — many-to-many reads
  24. WRITES • Every segment/ledger has an ensemble • Each entry in a ledger has: - Write quorum: the nodes of the ensemble to which it is written (usually all) - Ack quorum: the nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority)
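The acknowledgement rule can be sketched in a few lines of Python (an illustrative model with assumed names, not BookKeeper internals):

```python
# Illustrative model (not BookKeeper internals): an entry is sent to the
# write quorum and is acknowledged to the client once at least
# ack-quorum nodes have confirmed the write.
def entry_acked(write_quorum_size, ack_quorum_size, responses):
    """responses: set of write-quorum nodes that confirmed the write."""
    assert ack_quorum_size <= write_quorum_size
    return len(responses) >= ack_quorum_size

# Ensemble of 3, write quorum 3, ack quorum 2 (a majority):
print(entry_acked(3, 2, {"bookie-1", "bookie-3"}))  # True
print(entry_acked(3, 2, {"bookie-2"}))              # False
```

Separating the two quorums lets writes tolerate a slow node: durability is governed by the write quorum, while latency is governed by the ack quorum.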
  25. BOOKKEEPER INTERNALS • Separate I/O paths for reads and writes • Optimized for writes, tailing reads, and catch-up reads

  27. TIERED STORAGE Unlimited topic storage capacity. Achieves true “stream storage”: keep the raw data forever in stream form
  28. TIERED STORAGE • Leverage cloud storage services to offload cold data — completely transparent to clients • Extremely cost effective — backends: S3 (GCS and HDFS coming) • Example: retain all data for 1 month — offload all messages older than 1 day to S3
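The example policy on this slide can be modeled as follows (an illustrative sketch with assumed names; in practice offloading is configured via namespace policies, not application code):

```python
# Illustrative sketch of the tiered-storage policy from the slide
# (not the Pulsar offloader): retain everything for 30 days, and
# offload segments older than 1 day to S3.
DAY = 24 * 60 * 60

def place_segments(segments, now, retention=30 * DAY, offload_after=1 * DAY):
    """segments: dict of segment id -> time the segment was closed."""
    placement = {}
    for seg_id, closed_at in segments.items():
        age = now - closed_at
        if age > retention:
            placement[seg_id] = "delete"
        elif age > offload_after:
            placement[seg_id] = "s3"          # cold: offloaded, still queryable
        else:
            placement[seg_id] = "bookkeeper"  # hot: still on the bookies
    return placement

now = 100 * DAY
segs = {"seg-old": now - 40 * DAY, "seg-cold": now - 3 * DAY, "seg-hot": now - 3600}
print(place_segments(segs, now))
```

Because offloading happens at segment granularity, clients keep reading through the same topic; the broker resolves whether a given segment lives on the bookies or in S3.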
  29. SCHEMA REGISTRY • Stores information on the data structure — stored in BookKeeper • Enforces data types on a topic • Allows for compatible schema evolution
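As a rough illustration of what a compatibility check does (a toy model; Pulsar’s registry applies Avro-style rules, and `is_compatible` is a hypothetical name):

```python
# Toy compatibility check (illustrative; Pulsar's registry applies
# Avro-style rules). An evolution is accepted here if no existing field
# is removed and every newly added field carries a default value.
def is_compatible(old_fields, new_fields):
    for name in old_fields:
        if name not in new_fields:
            return False                  # removing a field breaks old readers
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False                  # new field must have a default
    return True

old = {"user_id": {"type": "string"}}
new_ok = {"user_id": {"type": "string"}, "country": {"type": "string", "default": ""}}
new_bad = {"user_id": {"type": "string"}, "country": {"type": "string"}}
print(is_compatible(old, new_ok))   # True
print(is_compatible(old, new_bad))  # False
```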
  30. APACHE PULSAR • Multi-tenancy - A single cluster can support many tenants and use cases • Seamless cluster expansion - Expand the cluster without any downtime • High throughput & low latency - Can reach 1.8M messages/s in a single partition with publish latency of 5 ms at the 99th percentile • Durability - Data replicated and synced to disk • Geo-replication - Out-of-the-box support for geographically distributed applications • Unified messaging model - Supports both topic and queue semantics in a single model • Tiered storage - Hot/warm data for real-time access and cold event data in cheaper storage • Pulsar Functions - Flexible, lightweight compute • Highly scalable - Can support millions of topics, making data modeling easier
  31. PULSAR SQL • Interactive SQL queries over data stored in Pulsar • Query old and real-time data
  32. PULSAR SQL / 2 • Based on Presto by Facebook • Presto is a distributed query execution engine • Fetches data from multiple sources (HDFS, S3, MySQL, …) • Full SQL compatibility
  33. PULSAR SQL / 3 • Pulsar connector for Presto • Reads data directly from BookKeeper — bypasses the Pulsar broker • Can also read data offloaded to tiered storage (S3, GCS, etc.) • Many-to-many data reads • Data is split even within a single partition — multiple workers can read in parallel from a single Pulsar partition • Time-based indexing — use “publishTime” in predicates to reduce the data read from disk
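The time-based pruning works roughly like this sketch (illustrative, not the connector code: each segment is reduced to a min/max publish-time range, and the SQL in the comment assumes the connector's `__publish_time__` column and `pulsar` catalog naming):

```python
# Illustrative sketch of time-based segment pruning (not the Presto
# connector code). A query such as:
#
#   SELECT * FROM pulsar."public/default"."visits"
#   WHERE __publish_time__ > timestamp '2019-09-12 00:00:00'
#
# only needs to read segments whose publish-time range can overlap
# the predicate; everything else is skipped without touching disk.
def segments_to_read(ranges, t_min):
    """ranges: list of (segment_id, min_publish_ts, max_publish_ts)."""
    return [sid for sid, lo, hi in ranges if hi > t_min]

ranges = [("seg-0", 0, 100), ("seg-1", 100, 200), ("seg-2", 200, 300)]
print(segments_to_read(ranges, 150))  # ['seg-1', 'seg-2']
```

Because publish time increases monotonically within a segment, the connector can also seek within a matching segment instead of scanning it from the start.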

  35. BENEFITS • No need to move data into another system for querying • Read data in parallel - Performance not impacted by partitioning - Increase read throughput by increasing the write quorum - Newly arrived data can be queried immediately
  36. PERFORMANCE • Setup - 3 nodes - 12 CPU cores - 128 GB RAM - 2 x 1.2 TB NVMe disks • Results - JSON (compressed): ~60 million rows/second - Avro (compressed): ~50 million rows/second
  37. DEMO

  38. APACHE PULSAR IN PRODUCTION @SCALE • 4+ years • Serves 2.3 million topics • 700 billion messages/day • 500+ bookie nodes • 200+ broker nodes • Average latency < 5 ms; 99.9th percentile 15 ms (with strong durability guarantees) • Zero data loss • 150+ applications • Self-served provisioning • Full-mesh cross-datacenter replication across 8+ data centers

  39. Job search website (LinkedIn, Indeed, etc.) • Using Pulsar to track job searches by users • Search params • Search results • Analysis of user behavior
  40. FUTURE WORK • Performance tuning • Store data in columnar format - Improve compression ratio - Materialize relevant columns • Support different indices
  41. QUESTIONS? • Try Apache Pulsar yourself! - Sign up for the Streamlio sandbox