Slide 1

Slide 1 text

INTERACTIVE QUERYING OF STREAMS USING APACHE PULSAR
http://pulsar.apache.org
Jerry Peng

Slide 2

Slide 2 text

AGENDA
1. Talk about the use cases
2. Existing Architectures
3. Apache Pulsar Overview
4. Pulsar SQL
5. Demo!

Slide 3

Slide 3 text

WHAT ARE STREAMS?
Continuous flows of data…
Almost all data originate in this form

Slide 4

Slide 4 text

INTERACTIVE QUERYING OF STREAMS?
Querying both the latest and historical data

Slide 5

Slide 5 text

HOW IS IT USEFUL?
• Speed (i.e. data-driven processing)
  - Act faster
• Accuracy
  - In many contexts, the wrong decision may be made if your view does not include the most current data
  - For example, historical data is useful for predicting that a user is interested in buying a particular item, but if the analytics don't also know that the user purchased that item two minutes ago, they will make the wrong recommendation
• Simplification
  - Single place to access current and historical data

Slide 6

Slide 6 text

DEBUGGING
• Errors and exceptions
• Troubleshooting systems and networks
• Have we seen these errors before?

Slide 7

Slide 7 text

MONITORING (AUDIT LOGS)
• Answering the “What, When, Who, Why”
• Suspicious access patterns
• Example: auditing CDC (change data capture) logs in financial institutions

Slide 8

Slide 8 text

EXPLORING
• Raw or enriched data
• Access is much simpler when all the data is in one location

Slide 9

Slide 9 text

LOTS OF USE CASES
• Data analytics
• Business Intelligence
• Real-time dashboards
• etc…

Slide 10

Slide 10 text

STREAM PROCESSING PATTERN
[Diagram labels: Compute, Messaging, Storage; Data Ingestion, Data Processing / Querying, Results Storage, Data Storage, Data Serving]

Slide 11

Slide 11 text

EXISTING SOLUTIONS
[Diagram labels: HDFS, Messaging, Real-time compute, Storage, Data Stream, Querying]

Slide 12

Slide 12 text

PROBLEMS WITH EXISTING SOLUTIONS
• Multiple systems
• Duplication of data
  - Data consistency: where is the source of truth?
• Latency between data ingestion and when the data becomes queryable

Slide 13

Slide 13 text

THIS IS WHERE APACHE PULSAR AND PULSAR SQL COME IN…

Slide 14

Slide 14 text

HISTORY
• Project started at Yahoo around 2012 and went through various iterations
• Open-sourced in September 2016
• Entered the Apache Incubator in June 2017
• Graduated as a top-level project (TLP) in September 2018
• Over 160 contributors and 4,100 stars

Slide 15

Slide 15 text

EXAMPLES OF PULSAR USERS AND CONTRIBUTORS

Slide 16

Slide 16 text

APACHE PULSAR
Flexible messaging + streaming system backed by durable log storage

Slide 17

Slide 17 text

EVENT STORE

Slide 18

Slide 18 text

PULSAR OVERVIEW

Slide 19

Slide 19 text

CORE CONCEPTS
[Diagram labels: Topic, Producers, Consumers, Time]

Slide 20

Slide 20 text

CORE CONCEPTS
Apache Pulsar Cluster
[Diagram: hierarchy of tenants, namespaces and topics]
• Tenants: Marketing, Sales, Security
• Namespaces: Analytics, Campaigns, Data Transformation, Data Integration, Microservices
• Topics: Visits, Conversions, Responses, Conversions, Transactions, Interactions, Log events, Signatures, Accesses
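The tenant / namespace / topic hierarchy shows up directly in how Pulsar names topics: a fully qualified topic name has the form `persistent://tenant/namespace/topic`. The sketch below is a minimal illustration of that naming scheme, not Pulsar client code. The `TopicName` type and `parse_topic` helper are hypothetical names, and pairing the slide's labels as `marketing/analytics/visits` is an assumption about how the diagram connects them.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    """One fully qualified Pulsar topic name (hypothetical helper type)."""
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str

    def __str__(self) -> str:
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name: str) -> TopicName:
    """Split 'domain://tenant/namespace/topic' into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unknown or missing domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, parts[0], parts[1], parts[2])


# Assumed pairing of the slide's labels: tenant Marketing, namespace Analytics,
# topic Visits.
visits = parse_topic("persistent://marketing/analytics/visits")
print(visits.tenant, visits.namespace, visits.topic)
```

Because the tenant and namespace are part of every topic name, policies and quotas can be attached at those levels rather than per topic.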

Slide 21

Slide 21 text

ARCHITECTURE
Multi-layer, scalable architecture
• Independent layers for processing, serving and storage
• Messaging and processing built on Apache Pulsar
• Storage built on Apache BookKeeper
[Diagram labels: Producers, Consumers; Messaging (Brokers); Event storage (Bookies); Function Processing (Workers)]

Slide 22

Slide 22 text

SEGMENT CENTRIC STORAGE
• In addition to partitioning, messages are stored in segments (rolled over based on time and size)
• Segments are independent of each other and spread across all storage nodes
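To make the segment idea concrete, here is a toy model of a log that rolls over to a new segment when a size or age limit is hit and places each new segment on a different subset of storage nodes. It is a sketch under stated assumptions, not BookKeeper's implementation: `SegmentedLog`, its thresholds and the round-robin placement are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    nodes: Tuple[str, ...]          # storage nodes holding this segment
    opened_at: float                # time the segment was opened (seconds)
    size: int = 0                   # bytes written so far
    entries: List[bytes] = field(default_factory=list)


class SegmentedLog:
    """Toy append-only log split into independently placed segments."""

    def __init__(self, nodes: Sequence[str], replicas: int = 2,
                 max_bytes: int = 1024, max_age_s: float = 60.0) -> None:
        if not 1 <= replicas <= len(nodes):
            raise ValueError("replicas must be between 1 and len(nodes)")
        self.nodes = list(nodes)
        self.replicas = replicas
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.segments: List[Segment] = []
        self._next = 0              # round-robin cursor (assumed placement policy)

    def _open_segment(self, now: float) -> Segment:
        n = len(self.nodes)
        chosen = tuple(self.nodes[(self._next + i) % n]
                       for i in range(self.replicas))
        self._next = (self._next + 1) % n
        seg = Segment(chosen, now)
        self.segments.append(seg)
        return seg

    def append(self, payload: bytes, now: float) -> Segment:
        cur = self.segments[-1] if self.segments else None
        too_big = cur is not None and cur.size + len(payload) > self.max_bytes
        too_old = cur is not None and now - cur.opened_at >= self.max_age_s
        if cur is None or too_big or too_old:
            cur = self._open_segment(now)   # roll over on size or time
        cur.entries.append(payload)
        cur.size += len(payload)
        return cur


log = SegmentedLog(["bookie-1", "bookie-2", "bookie-3"], max_bytes=10)
for i, t in enumerate([0.0, 1.0, 2.0, 100.0]):
    seg = log.append(b"12345", now=t)
    print(i, seg.nodes)
```

Since each segment chooses its nodes independently, a node added to the list is used by the next segment that opens, with no existing data moved.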

Slide 23

Slide 23 text

SEGMENT CENTRIC
• Unbounded log storage
• Instant scaling without data rebalancing
• High write and read availability via maximized data placement options
• Fast replica repair: many-to-many read

Slide 24

Slide 24 text

WRITES
• Every segment/ledger has an ensemble (the set of storage nodes that hold it)
• Each entry in a ledger has a
  - Write quorum: nodes of the ensemble to which it is written (usually all)
  - Ack quorum: nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority)
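The relationship between ensemble, write quorum and ack quorum can be sketched in a few lines. This is a simplified model rather than BookKeeper code: the `write_entry` function is hypothetical, and striping entries across the ensemble by `entry_id % len(ensemble)` is an assumption about placement that only matters when the write quorum is smaller than the ensemble.

```python
from typing import List, Sequence, Set, Tuple


def write_entry(ensemble: Sequence[str], entry_id: int,
                write_quorum: int, ack_quorum: int,
                responding: Set[str]) -> Tuple[List[str], bool]:
    """Return (nodes the entry is sent to, whether the write is acknowledged).

    `responding` is the set of nodes that answer the write in this toy model.
    """
    if not 1 <= ack_quorum <= write_quorum <= len(ensemble):
        raise ValueError("need 1 <= ack_quorum <= write_quorum <= len(ensemble)")
    start = entry_id % len(ensemble)            # assumed round-robin striping
    write_set = [ensemble[(start + i) % len(ensemble)]
                 for i in range(write_quorum)]
    acks = sum(1 for node in write_set if node in responding)
    return write_set, acks >= ack_quorum


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
# Written to all three nodes; two responses are enough to acknowledge.
print(write_entry(ensemble, 0, write_quorum=3, ack_quorum=2,
                  responding={"bookie-1", "bookie-3"}))
# Only one node responds, so the entry is not acknowledged.
print(write_entry(ensemble, 1, write_quorum=3, ack_quorum=2,
                  responding={"bookie-2"}))
```

Setting the ack quorum below the write quorum lets a write complete without waiting for the slowest node, while the entry is still sent to every node in the write quorum.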

Slide 25

Slide 25 text

BOOKKEEPER INTERNAL
• Separate IO path for reads and writes
• Optimized for writing, tailing reads, catch-up reads

Slide 26

Slide 26 text

SEGMENTS VS PARTITIONS

Slide 27

Slide 27 text

TIERED STORAGE
Unlimited topic storage capacity
Achieves true “stream storage”: keep the raw data forever in stream form

Slide 28

Slide 28 text

TIERED STORAGE
• Leverage cloud storage services to offload cold data
  - Completely transparent to clients
• Extremely cost effective
  - Backends: S3 (GCS and HDFS coming)
• Example: retain all data for 1 month
  - Offload all messages older than 1 day to S3
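The retention example can be read as a simple rule on message age. The sketch below only expresses that rule and does not show how Pulsar is configured or how offloading is triggered internally. The `storage_tier` function and tier names are hypothetical, and "1 month" is approximated as 30 days.

```python
from datetime import timedelta

RETENTION = timedelta(days=30)      # "1 month", approximated as 30 days (assumption)
OFFLOAD_AFTER = timedelta(days=1)   # older than this moves to cloud storage


def storage_tier(age: timedelta) -> str:
    """Where a message of the given age lives under the example policy."""
    if age > RETENTION:
        return "expired"            # past retention, eligible for deletion
    if age > OFFLOAD_AFTER:
        return "s3"                 # cold data, offloaded
    return "bookkeeper"             # hot data, served from the storage nodes


for age in (timedelta(hours=2), timedelta(days=3), timedelta(days=45)):
    print(age, storage_tier(age))
```

Under this policy only the most recent day of data sits on the BookKeeper nodes, while clients read across both tiers in the same way.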

Slide 29

Slide 29 text

SCHEMA REGISTRY
• Store information on the data structure
  - Stored in BookKeeper
• Enforce data types on topics
• Allow for compatible schema evolution

Slide 30

Slide 30 text

APACHE PULSAR
• Multi-tenancy: a single cluster can support many tenants and use cases
• Seamless cluster expansion: expand the cluster without any downtime
• High throughput and low latency: can reach 1.8 M messages/s in a single partition, with a publish latency of 5 ms at the 99th percentile
• Durability: data replicated and synced to disk
• Geo-replication: out-of-the-box support for geographically distributed applications
• Unified messaging model: supports both topic and queue semantics in a single model
• Tiered storage: hot/warm data for real-time access and cold event data in cheaper storage
• Pulsar Functions: flexible, lightweight compute
• Highly scalable: can support millions of topics, which makes data modeling easier

Slide 31

Slide 31 text

PULSAR SQL
• Interactive SQL queries over data stored in Pulsar
• Query old and real-time data

Slide 32

Slide 32 text

PULSAR SQL / 2
• Based on Presto by Facebook: https://prestodb.io/
• Presto is a distributed query execution engine
• Fetches the data from multiple sources (HDFS, S3, MySQL, …)
• Full SQL compatibility

Slide 33

Slide 33 text

PULSAR SQL / 3
• Pulsar connector for Presto
• Read data directly from BookKeeper
  - Bypasses the Pulsar broker
• Can also read data offloaded to tiered storage (S3, GCS, etc.)
• Many-to-many data reads
• Data is split even on a single partition
  - Multiple workers can read data in parallel from a single Pulsar partition
• Time-based indexing
  - Use “publishTime” in predicates to reduce the data read from disk
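A publish-time predicate might look like the query built below. This is a sketch under assumptions: the slide calls the field “publishTime”, while the query uses the `__publish_time__` column and the `pulsar."tenant/namespace"."topic"` table naming from the Pulsar SQL documentation, which may differ between versions. The `recent_rows_query` helper is a hypothetical name, and the tenant, namespace and topic values are placeholders.

```python
import re
from datetime import datetime

_IDENT = re.compile(r"[A-Za-z0-9_\-]+")


def recent_rows_query(tenant: str, namespace: str, topic: str,
                      since: datetime, limit: int = 100) -> str:
    """Build a Presto query that restricts the scan by publish time."""
    for part in (tenant, namespace, topic):
        # Keep identifiers simple so they can be safely placed in quotes.
        if not _IDENT.fullmatch(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    ts = since.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f'SELECT * FROM pulsar."{tenant}/{namespace}"."{topic}" '
        f"WHERE __publish_time__ > timestamp '{ts}' "
        f"ORDER BY __publish_time__ DESC LIMIT {int(limit)}"
    )


# Placeholder names; substitute a real tenant, namespace and topic.
print(recent_rows_query("public", "default", "events",
                        datetime(2019, 1, 1, 12, 0, 0), limit=10))
```

The resulting string would be pasted into the Presto CLI or sent through a Presto client; with the time predicate in place, the connector can skip data published before the cutoff instead of scanning the whole topic.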

Slide 34

Slide 34 text

PULSAR SQL ARCHITECTURE

Slide 35

Slide 35 text

BENEFITS
• No need to move data into another system for querying
• Read data in parallel
  - Performance not impacted by partitioning
  - Increase read throughput by increasing the write quorum (more replicas to read from)
  - Newly arrived data can be queried immediately

Slide 36

Slide 36 text

PERFORMANCE
• Setup
  - 3 nodes
  - 12 CPU cores
  - 128 GB RAM
  - 2 x 1.2 TB NVMe disks
• Results
  - JSON (compressed): ~60 million rows / second
  - Avro (compressed): ~50 million rows / second

Slide 37

Slide 37 text

DEMO

Slide 38

Slide 38 text

APACHE PULSAR IN PRODUCTION @SCALE
• 4+ years
• Serves 2.3 million topics
• 700 billion messages/day
• 500+ bookie nodes
• 200+ broker nodes
• Average latency < 5 ms
• 99.9th percentile latency 15 ms (strong durability guarantees)
• Zero data loss
• 150+ applications
• Self-served provisioning
• Full-mesh cross-datacenter replication: 8+ data centers

Slide 39

Slide 39 text

USE CASE: JOB SEARCH ANALYTICS
ZHAOPIN
• ZhaoPin: Chinese job search website (comparable to LinkedIn, Indeed, etc.)
• Using Pulsar to track job searches by users
  - Search params
  - Search results
• Analysis of user behavior

Slide 40

Slide 40 text

FUTURE WORK
• Performance tuning
• Store data in columnar format
  - Improve compression ratio
  - Materialize relevant columns
• Support different indices

Slide 41

Slide 41 text

QUESTIONS?
• Try Apache Pulsar yourself!
  - Sign up for the Streamlio sandbox: cloud.streamlio.com