INTERACTIVE QUERYING OF STREAMS
USING APACHE PULSAR
http://pulsar.apache.org
Jerry Peng
Slide 2
AGENDA
1. Use cases
2. Existing Architectures
3. Apache Pulsar Overview
4. Pulsar SQL
5. Demo!
Slide 3
WHAT ARE STREAMS?
Continuous flows of data…
Almost all data originate in this form
Slide 4
INTERACTIVE QUERYING OF STREAMS?
Querying both latest and historical data
Slide 5
HOW IS IT USEFUL?
• Speed (i.e. data-driven processing)
- Act faster
• Accuracy
- In many contexts, the wrong decision may be made without visibility into the most current data
- For example, historical data is useful for predicting that a user is interested in buying a particular item, but if the analytics don’t also know that the user purchased that item two minutes ago, they will make the wrong recommendation
• Simplification:
- Single place to go to access current and historical data
Slide 6
DEBUGGING
• Errors and exceptions
• Troubleshooting systems and networks
• Have we seen these errors before?
Slide 7
MONITORING (AUDIT LOGS)
• Answering the “What, When, Who, Why”
• Suspicious access patterns
• Example: auditing CDC logs in financial institutions
Slide 8
EXPLORING
• Raw or enriched data
• Access is much simpler when all data is in one location
Slide 9
LOTS OF USE CASES
• Data analytics
• Business Intelligence
• Real-time dashboards
• etc…
Slide 10
STREAM PROCESSING PATTERN
[Diagram. Layers: Messaging, Compute, Storage. Stages: Data Ingestion, Data Processing / Querying, Data Storage, Results Storage, Data Serving]
Slide 11
EXISTING SOLUTIONS
[Diagram labels: Data Stream, Messaging, Real-time compute, Storage, HDFS, Querying]
Slide 12
PROBLEMS WITH EXISTING SOLUTIONS
• Multiple Systems
• Duplication of data
- Data consistency. Where is the source of truth?
• Latency between data ingestion and when data is queryable
Slide 13
THIS IS WHERE APACHE PULSAR AND PULSAR SQL COME IN…
Slide 14
HISTORY
• Project started at Yahoo around 2012
and went through various iterations
• Open-sourced in September 2016
• Entered the Apache Incubator in June 2017
• Graduated as a top-level project (TLP) in September 2018
• Over 160 contributors and 4100 stars
Slide 15
EXAMPLES OF PULSAR USERS AND
CONTRIBUTORS
Slide 16
APACHE PULSAR
Flexible Messaging + Streaming System
backed by durable log storage
Slide 17
EVENT STORE
Slide 18
PULSAR OVERVIEW
Slide 19
CORE CONCEPTS
[Diagram: producers publish to a topic and consumers read from it, shown over time]
ARCHITECTURE
Multi-layer, scalable architecture
• Independent layers for processing, serving and storage
• Messaging and processing built on Apache Pulsar
• Storage built on Apache BookKeeper
[Diagram: producers and consumers connect to the messaging layer (brokers); event storage layer (bookies); function processing layer (workers)]
Slide 22
SEGMENT CENTRIC STORAGE
• In addition to partitioning, messages are stored in segments (based on time and size)
• Segments are independent of each other and spread across all storage nodes
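A minimal sketch of the segment idea, not BookKeeper's actual implementation. The names `Segment`, `SegmentedLog`, and `max_entries` are hypothetical, only size-based rollover is modeled (time-based rollover is left out), and the round-robin placement is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    nodes: tuple                               # storage nodes holding this segment
    entries: list = field(default_factory=list)


class SegmentedLog:
    """Toy log that rolls over to a new segment after max_entries appends."""

    def __init__(self, storage_nodes, replicas=2, max_entries=3):
        self.storage_nodes = list(storage_nodes)
        self.replicas = replicas
        self.max_entries = max_entries
        self.segments = []
        self._next_start = 0                   # rotates placement across nodes

    def _new_segment(self):
        # Each segment picks its own nodes, independently of earlier segments.
        n = len(self.storage_nodes)
        nodes = tuple(
            self.storage_nodes[(self._next_start + i) % n]
            for i in range(self.replicas)
        )
        self._next_start = (self._next_start + 1) % n
        self.segments.append(Segment(nodes))

    def append(self, message):
        # Roll over when there is no segment yet or the current one is full.
        if not self.segments or len(self.segments[-1].entries) >= self.max_entries:
            self._new_segment()
        self.segments[-1].entries.append(message)


def build_demo_log():
    log = SegmentedLog(["bookie-1", "bookie-2", "bookie-3"], replicas=2, max_entries=3)
    for i in range(7):
        log.append(f"msg-{i}")
    return log
```

With seven messages and three entries per segment, the demo log ends up with three segments, each placed on a different pair of nodes. Adding a storage node only affects where future segments land, which is why no existing data has to move.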
Slide 23
SEGMENT CENTRIC
• Unbounded log storage
• Instant scaling without data rebalancing
• High write and read availability via maximized data placement options
• Fast replica repair — many-to-many read
Slide 24
WRITES
• Every segment/ledger has an ensemble
• Each entry in a ledger has a
- Write quorum: the nodes of the ensemble to which it is written (usually all)
- Ack quorum: the number of nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority)
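The relationship between ensemble, write quorum, and ack quorum can be sketched as follows. This is an illustration only: `write_set` and `is_acknowledged` are hypothetical helper names, and the round-robin choice of target nodes is an assumption, not a statement of how BookKeeper must place entries.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Nodes of the ensemble that receive this entry (round-robin striping assumed)."""
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


def is_acknowledged(responded, targets, ack_quorum):
    """An entry counts as acknowledged once ack_quorum of its target nodes respond."""
    return len(set(responded) & set(targets)) >= ack_quorum


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
# Write quorum equal to the ensemble size: the entry goes to every node.
targets = write_set(entry_id=4, ensemble=ensemble, write_quorum=3)
```

With an ack quorum of 2 (a majority of 3), the writer can acknowledge the entry as soon as any two of the three target nodes respond, so one slow node does not delay the write.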
Slide 25
BOOKKEEPER INTERNAL
• Separate IO path for reads and writes
• Optimized for writing, tailing reads,
catch-up reads
Slide 26
SEGMENTS VS PARTITIONS
Slide 27
TIERED STORAGE
Unlimited topic storage capacity
Achieves true “stream storage”: keeps
the raw data forever in stream form
Slide 28
TIERED STORAGE
• Leverages cloud storage services to offload cold data; completely transparent to clients
• Extremely cost effective; backends: S3 (GCS and HDFS coming)
• Example: retain all data for 1 month, offloading all messages older than 1 day to S3
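The example policy can be written down as a small decision function. This is a sketch under stated assumptions: `storage_tier` is a hypothetical name, "1 month" is taken to mean 30 days, and real offloading is configured in Pulsar rather than coded by hand like this.

```python
from datetime import timedelta

RETENTION = timedelta(days=30)      # "retain all data for 1 month" (30 days assumed)
OFFLOAD_AFTER = timedelta(days=1)   # "offload all messages older than 1 day"


def storage_tier(age):
    """Return where data of the given age lives under the example policy."""
    if age > RETENTION:
        return "deleted"            # past retention, no longer kept
    if age > OFFLOAD_AFTER:
        return "s3"                 # cold data, offloaded to cloud storage
    return "bookkeeper"             # hot data, still on the storage nodes
```

For instance, `storage_tier(timedelta(hours=2))` falls in the hot tier, while data that is five days old is served from S3. Clients read both the same way, which is the transparency the slide refers to.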
Slide 29
SCHEMA REGISTRY
• Stores information about the data structure (stored in BookKeeper)
• Enforces data types on a topic
• Allows compatible schema evolution
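To make "compatible schema evolution" concrete, here is a toy backward-compatibility check. It is a simplified, Avro-like rule written for illustration, not Pulsar's actual compatibility logic; the dict-based schema format and the name `is_backward_compatible` are assumptions.

```python
def is_backward_compatible(old_schema, new_schema):
    """True if a reader using new_schema can decode data written with old_schema.

    Schemas are toy dicts: field name -> {"type": str, "default": optional}.
    """
    for name, spec in new_schema.items():
        if name in old_schema:
            if old_schema[name]["type"] != spec["type"]:
                return False        # the field's type changed
        elif "default" not in spec:
            return False            # new required field that old data lacks
    return True


v1 = {"user": {"type": "string"}, "query": {"type": "string"}}
# Adds an optional field with a default: old data can still be read.
v2 = {**v1, "results": {"type": "int", "default": 0}}
# Changes the type of an existing field: old data cannot be read.
v3 = {"user": {"type": "int"}, "query": {"type": "string"}}
```

Under this rule, moving a topic from `v1` to `v2` would be accepted and moving it to `v3` would be rejected, which is the kind of enforcement a schema registry provides.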
Slide 30
APACHE PULSAR
Multi-tenancy
A single cluster can support
many tenants and use cases
Seamless Cluster Expansion
Expand the cluster without any
down time
High throughput & low latency
Can reach 1.8 M messages/s in
a single partition with a publish
latency of 5 ms at the 99th percentile
Durability
Data replicated and synced to
disk
Geo-replication
Out-of-the-box support for
geographically distributed
applications
Unified messaging model
Supports both topic and queue
semantics in a single model
Tiered Storage
Hot/warm data for real time access
and cold event data in cheaper
storage
Pulsar Functions
Flexible, lightweight compute
Highly scalable
Can support millions of topics,
which makes data modeling easier
Slide 31
PULSAR SQL
• Interactive SQL queries over data stored in Pulsar
• Query old and real-time data
Slide 32
PULSAR SQL / 2
• Based on Presto by Facebook — https://prestodb.io/
• Presto is a distributed query execution engine
• Fetches the data from multiple sources (HDFS, S3, MySQL, …)
• Full SQL compatibility
Slide 33
PULSAR SQL / 3
• Pulsar connector for Presto
• Reads data directly from BookKeeper, bypassing the Pulsar broker
• Can also read data offloaded to Tiered Storage (S3, GCS, etc.)
• Many-to-many data reads
• Data is split even within a single partition: multiple workers can read in parallel from a single Pulsar partition
• Time-based indexing: use “publishTime” in predicates to reduce the data read from disk
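Two of the points above, splitting a single partition across workers and pruning by publish time, can be sketched in a few lines. This is an illustration of the idea, not the connector's code: `split_entries`, `prune_by_publish_time`, and the segment dict fields are hypothetical names.

```python
def split_entries(first_entry, last_entry, num_splits):
    """Divide an inclusive entry range into up to num_splits contiguous splits."""
    total = last_entry - first_entry + 1
    base, extra = divmod(total, num_splits)
    splits, start = [], first_entry
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first splits
        if size == 0:
            continue                            # fewer entries than requested splits
        splits.append((start, start + size - 1))
        start += size
    return splits


def prune_by_publish_time(segments, start_ms, end_ms):
    """Keep only segments whose publish-time range overlaps [start_ms, end_ms]."""
    return [
        s for s in segments
        if s["max_publish_ms"] >= start_ms and s["min_publish_ms"] <= end_ms
    ]


segments = [
    {"id": 0, "min_publish_ms": 0, "max_publish_ms": 999},
    {"id": 1, "min_publish_ms": 1000, "max_publish_ms": 1999},
    {"id": 2, "min_publish_ms": 2000, "max_publish_ms": 2999},
]
# A predicate on publish time between 1500 and 2500 ms needs only segments 1 and 2.
needed = prune_by_publish_time(segments, 1500, 2500)
```

Each tuple returned by `split_entries` can be handed to a different worker, so even one partition is read in parallel. The pruning step means a query with a publish-time predicate never touches segments outside the requested window.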
Slide 34
PULSAR SQL ARCHITECTURE
Slide 35
BENEFITS
• No need to move data into another system for querying
• Read data in parallel
- Performance not impacted by partitioning
- Increase read throughput by increasing the write quorum
- Newly arrived data can be queried immediately
Slide 36
PERFORMANCE
• Setup
- 3 nodes
- 12 CPU cores
- 128 GB RAM
- 2 x 1.2 TB NVMe disks
• Results
- JSON (compressed): ~60 million rows / second
- Avro (compressed): ~50 million rows / second
Slide 37
DEMO
Slide 38
APACHE PULSAR IN PRODUCTION @SCALE
4+ years
Serves 2.3 million topics
700 billion messages/day
500+ bookie nodes
200+ broker nodes
Average latency < 5 ms
99.9th percentile latency: 15 ms (with strong durability guarantees)
Zero data loss
150+ applications
Self-service provisioning
Full-mesh cross-datacenter replication - 8+ data centers
Slide 39
USE CASE: JOB SEARCH ANALYTICS
ZHAOPIN
• Zhaopin: Chinese job search website (similar to LinkedIn, Indeed, etc.)
• Uses Pulsar to track job searches by users
- Search parameters
- Search results
• Analysis of user behavior
Slide 40
FUTURE WORK
• Performance tuning
• Store data in columnar format
- Improve compression ratio
- Materialize relevant columns
• Support different indices