CrateDB: Distributed SQL database built on top of Lucene
Marija Selakovic, Developer Advocate at Crate.io, and Marios Trivyzas, Senior Software Engineer at Crate.io, explain CrateDB in detail as part of the Carnegie Mellon University database seminar series 2022.
NoSQL Performance
§ Run ad-hoc queries across 100s of billions of records in < 1 second
§ Use columnar storage for hyper-fast aggregations
§ With fully distributed query engine on top of Apache Lucene®

Ease of Use and Flexibility
§ Standard SQL and PostgreSQL Wire Protocol for easy onboarding and great tool compatibility
§ Support for dynamic objects and schemas
§ Analyze relational, object, time-series, geo, blob and full-text data all in one system

Scalability
§ Scale from a single node to analyzing petabytes of data across hundreds of nodes in a cluster
§ Fully-managed Cloud and On-Premises (OSS)
CREATE TABLE "doc"."readings" (
  "ts" TIMESTAMP WITH TIME ZONE,
  "device_id" TEXT,
  "data" OBJECT AS (
    "name" TEXT,
    "value" DOUBLE PRECISION
  )
)

• CrateDB stores data in relational format
• Objects as JSON-like documents in columns
• Arbitrary number of attributes and nesting levels
• Also stores the original document in JSON format

// obj JSON
{ "name": "battery", "value": 90.2 }
CREATE TABLE IF NOT EXISTS "doc"."readings" (
  "ts" TIMESTAMP WITH TIME ZONE,
  "device_id" TEXT,
  "data" OBJECT(DYNAMIC) AS (
    "sensor" OBJECT(STRICT),
    "event" OBJECT(IGNORED)
  )
)

DYNAMIC
• Highest flexibility
• New columns allowed

STRICT
• Must follow the column definition
• Lowest flexibility

IGNORED
• Indexes columns that follow the definition
• Ignores the ones that don't
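A minimal sketch of how the column policies behave on insert, assuming the "doc"."readings" table above (the nested keys "firmware" and "unknown_key" are illustrative only):

-- Accepted: "data" is DYNAMIC, so the previously unknown key "firmware" creates a new column.
INSERT INTO "doc"."readings" ("ts", "device_id", "data")
VALUES (now(), 'dev-1', {sensor = {}, event = {}, firmware = '1.2.0'});

-- Rejected: "sensor" is STRICT, so keys outside its column definition raise an error.
-- INSERT INTO "doc"."readings" ("ts", "device_id", "data")
-- VALUES (now(), 'dev-1', {sensor = {unknown_key = 1}});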
information_schema: metadata about tables, partitions, columns, constraints

• System tables:
  • sys.cluster
  • sys.nodes
  • sys.shards
  • sys.jobs
  • sys.operations
• KILL job_id stops a single job
• KILL ALL stops every single job on each node
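A short, hedged example of inspecting the cluster through the system tables and stopping a job; the job id below is only a placeholder:

-- Inspect the nodes of the cluster
SELECT name, hostname, version['number'] FROM sys.nodes;

-- Find running statements ...
SELECT id, stmt, started FROM sys.jobs ORDER BY started;

-- ... and stop a single one by its job id (or everything with KILL ALL)
KILL '<job_id>';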
Use cases supported by CrateDB:
• Fast queries and aggregations
• Full-text search
• SQL as a query language
• Geospatial data
• Logical replication
• Hybrid deployments
• High-availability and scalability

Use cases not supported by CrateDB:
• Transactional workloads (OLTP)
• Highly hierarchical/normalized data models
• Foreign keys
[Diagram: an application connecting to several (crate) nodes; scaling by adding more nodes]

CrateDB is a distributed system built on a shared-nothing architecture. Every node is "equal" and can take over every task within the cluster.

• Scaling becomes easier
• No single point of failure
• Shards are automatically balanced across nodes
[Diagram: clients connect to a CrateDB node via the PostgreSQL Wire Protocol or the HTTP endpoint; beneath them sit the distributed execution engine and blob storage]

Query processing on a CrateDB node:
• Parses the SQL statement and creates an AST
• Semantic processing of the statement
• Execution plan and optimizations
• Executes the plan in the cluster
• Encryption is available for all of the supported protocols (PostgreSQL, HTTP, Transport)
• The encryption protocol is TLSv1.2
• Configuration via host-based authentication
• For the transport protocol, trust and cert are valid authentication methods
• If enabled, transport protocol encryption applies to the whole cluster
▪ Lucene is used as the storage engine
▪ Indexing strategy based on Lucene:
  o Inverted index for text values
  o Block KD-trees for numerical values (including timestamps, geo-points, geo-shapes, etc.)
▪ Column-based data storage is built for more effective sorting and aggregations, known as doc values
▪ Doc values are enabled by default and can currently be disabled only for text fields
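A hedged sketch of turning doc values off; in CrateDB the column store can be disabled per TEXT column (table and column names are illustrative):

CREATE TABLE IF NOT EXISTS "doc"."articles" (
  "id" INTEGER,
  -- Doc values are on by default; for TEXT columns they can be switched off explicitly.
  "body" TEXT STORAGE WITH (columnstore = false)
);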
▪ An analyzer converts text into tokens, which are then searched
▪ Consists of:
  o Tokenizer
  o Zero or more token filters
  o Zero or more char filters

Standard analyzer:
  o standard tokenizer
  o lowercase filter
  o stop token filter

Example: "The quick brown fox jumps Over the lAzY DOG." → quick, brown, fox, jumps, lazy, dog
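A hedged sketch of defining a custom analyzer in CrateDB's SQL dialect; the analyzer name and the chosen building blocks are illustrative:

-- A custom analyzer composed of a tokenizer and token filters,
-- mirroring the tokenizer -> token filter pipeline described above.
CREATE ANALYZER lowercase_english (
  TOKENIZER standard,
  TOKEN_FILTERS (lowercase, stop)
);

Such an analyzer can then be referenced by name when defining a full-text index, e.g. WITH (analyzer = 'lowercase_english').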
▪ An index is created for every column in the database
▪ Plain (default) and fulltext
▪ Fast search in billions of records

Table data:
  ID | model        | quantity
  1  | Almond Milk  | 60
  2  | Almond Flour | 50
  3  | Milk         | 20

Fulltext index:
  name   | IDs
  Almond | 1, 2
  Milk   | 1, 3
  Flour  | 2

Plain index:
  name         | IDs
  Almond Milk  | 1
  Almond Flour | 2
  Milk         | 3
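A hedged sketch contrasting the two index types on data like the above; the table name "doc"."products" and the index name "model_ft" are illustrative:

CREATE TABLE IF NOT EXISTS "doc"."products" (
  "id" INTEGER,
  "model" TEXT,          -- plain index (default): the whole value is a single term
  "quantity" INTEGER,
  INDEX "model_ft" USING FULLTEXT ("model") WITH (analyzer = 'standard')   -- analyzed terms
);

-- Matches ids 1 and 2 via the full-text index ...
SELECT "id", "model" FROM "doc"."products" WHERE MATCH("model_ft", 'Almond');

-- ... while the plain index only matches the exact value (id 1).
SELECT "id", "model" FROM "doc"."products" WHERE "model" = 'Almond Milk';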
▪ The inverted index is built for searching documents by terms
▪ Not for column-oriented retrieval of data
▪ Doc values: column-based data storage built at index time
▪ Fast aggregation, sorting and grouping

Table data:
  ID | model  | quantity
  1  | Milk   | 60
  2  | Almond | 50
  3  | Lemon  | 20

Doc values for the quantity column:
  ID | quantity
  1  | 60
  2  | 50
  3  | 20
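A hedged example of an aggregation that is served from doc values rather than the inverted index, assuming the "doc"."products" table from the previous sketch:

-- Grouping and summing read the columnar doc values of "model" and "quantity".
SELECT "model", sum("quantity") AS total_quantity
FROM "doc"."products"
GROUP BY "model"
ORDER BY total_quantity DESC;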
• LZ4 (default): good tradeoff between storage efficiency and query performance
• Deflate: more aggressive compression at the cost of query performance
• Doc values are delta-encoded, bit-packed and GCD-compressed
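A hedged sketch of selecting the codec per table; the table name is illustrative, and 'best_compression' is assumed to select Deflate while 'default' keeps LZ4:

CREATE TABLE IF NOT EXISTS "doc"."readings_archive" (
  "ts" TIMESTAMP WITH TIME ZONE,
  "payload" TEXT
) WITH (codec = 'best_compression');   -- trade query speed for smaller storage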
SELECT name
FROM (SELECT name FROM users WHERE age = 30) AS n
WHERE n.name = 'foo';

Rename[name] AS n
  └ Collect[doc.users | [name] | ((name = 'foo') AND (age = 30))]

• The two predicates are merged to reduce the number of rows to be collected
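Such logical plans can be inspected directly; a short example using EXPLAIN on the statement above:

EXPLAIN SELECT name
FROM (SELECT name FROM users WHERE age = 30) AS n
WHERE n.name = 'foo';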
• Every node holds information about data distribution
• Every table is split into multiple shards by default
• Routing table: information where the shards of the table are located

CREATE TABLE t1 (
  name STRING
) CLUSTERED INTO 3 SHARDS;

[Diagram: shard t1.s1 on Node 1, shard t1.s2 on Node 2, shard t1.s3 on Node 3]
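A hedged sketch of explicit routing; CLUSTERED BY selects the column whose hashed value decides the target shard (table and column names are illustrative):

-- Rows with the same device_id hash to the same shard,
-- keeping per-device reads local to a single shard.
CREATE TABLE IF NOT EXISTS "doc"."readings_sharded" (
  "ts" TIMESTAMP WITH TIME ZONE,
  "device_id" TEXT,
  "value" DOUBLE PRECISION
) CLUSTERED BY ("device_id") INTO 6 SHARDS;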
• Each logical operator is mapped to one or more execution phases (e.g., Collect -> CollectPhase)
• Three types of nodes:
  • Collect node: from which data is gathered
  • Handler node: communication with a client
  • Merge node: merges the results from the collect nodes

[Diagram: two collect nodes send their results to a merge node; the merge node and the handler node that talks to the client are the same node]
SELECT name FROM users

Logical plan:
  Collect[doc.users | [name] | true]

Physical execution plan:
  RoutedCollectPhase:
    toCollect: [Ref{name}]
    routing:
      node-0:
        users: [0, 1]
      node-1:
        users: [2, 3]
  MergePhase:
    executionNodes: [node-0]
    numUpstreams: 2

• toCollect: attribute name(s) to be collected
• routing: nodes that should be queried and the shard ids of the table
• executionNodes: handler node, i.e. the node that merges the data
SELECT name, age FROM users WHERE name LIKE 'A%' ORDER BY id LIMIT 50

Logical plan:
  Eval[name, age]
    └ Fetch[name, age, id]
      └ Limit[50::bigint;0]
        └ OrderBy[id ASC]
          └ Collect[doc.users | [_fetchid, id] | (name LIKE 'A%')]

• _fetchid: system column used to fetch values of other attributes
• Fetch operator: uses one or more _fetchid columns to fetch the values of the remaining attributes
• Limit operator: limits the output to at most 50 records
• OrderBy operator: returns the _fetchids ordered by the id column
for each tuple l ∈ L do
  for each tuple r ∈ R do
    if l.a Θ r.b
      put tuple (l, r) in Q

• Default algorithm: nested loop
• Each tuple in the inner relation is compared with each tuple in the outer relation
• For some tables, the nested loop can be executed on the handler node
• Build phase: scan the smaller table and store hash values of the join attributes in a hash table
• Probe phase: each row of the other table is probed against the hash table
• The dataset is divided into blocks and the algorithm repeats for each block
• Complexity: O(M+N)

[Diagram: rows from the left table are stored in the hash table; rows from the right table probe it, the join condition is validated and a combined row is emitted]
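A hedged example of a query shape that can use the (block) hash join, i.e. a join with an equality condition; table and column names are illustrative:

-- The planner can build a hash table on the smaller side ("devices")
-- and probe it with the rows of "readings".
SELECT r."ts", r."value", d."name"
FROM "doc"."readings" AS r
JOIN "doc"."devices" AS d ON r."device_id" = d."id";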
• Hash every row in each shard
• Assign each row to a node using the modulo operator: node = hash value % total nodes
• After partitioning, apply the block hash join algorithm
• Rows with matching hashes are on the same node

[Diagram: each node partitions its rows with hash % 3, runs a block hash join on its partition, and the results are merged]
• Results are transferred in batches instead of individual data rows
• Two modes for SELECT queries:
  • Direct response: the collect node sends the full result to the handler
  • Paging: the collect node sends results in batches
• Which mode is used depends on the query (LIMIT), the number of nodes and shards
• Idea: if only few results are expected, direct response performs better
[Diagram: message flow between the Client, Node 0 (handler) and Node 1 for both modes]

Direct response:
  1. SELECT query
  2. job request (2/3: job request/result on the handler itself)
  3. result
  4. merged result

Paging:
  1. SELECT query
  2. job request
  3. empty ack.
  4. subset of data
  5. request for more data
  6. merged result
Allows one CrateDB cluster to replicate table(s) to another CrateDB cluster

§ Publisher: cluster that makes data available to other clusters through publications
§ Subscriber: cluster that receives replicated data from one or more publishers
§ Besides data, all operation types such as INSERT, DELETE, UPDATE and schema changes are also replicated
§ Publications can contain all tables, a subset of tables or no tables
§ No aggregations during replication
§ Tables on the subscriber are read-only
§ The connection string contains parameters in key-value format
§ WITH enabled parameter: specifies whether the subscription should be actively replicating

CREATE PUBLICATION name { FOR TABLE table_name [, ...] | FOR ALL TABLES }

CREATE SUBSCRIPTION name
  CONNECTION 'path_to_publisher'
  PUBLICATION publication_name [, ...]
  [ WITH (parameter_name [= value] [, ...]) ]
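A hedged end-to-end sketch of the two statements; the publication/subscription names, host, port and user are placeholders:

-- On the publishing cluster: expose a table for replication.
CREATE PUBLICATION metrics_pub FOR TABLE doc.readings;

-- On the subscribing cluster: connect to the publisher and start replicating.
CREATE SUBSCRIPTION metrics_sub
  CONNECTION 'crate://publisher.example.com:4300?user=repl_user'
  PUBLICATION metrics_pub
  WITH (enabled = true);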
• Keep data in the cluster where it is created
• Replicate only specific data to local clusters
• Access data with lower latency
• No need to duplicate all data

[Diagram: a central reporting cluster and clusters in Region 1, Region 2 and Region 3]
• Keep data close to where it is needed and replicate only specific data to a central cluster
• Access data from multiple sources in one single location
• No need to duplicate all data

[Diagram: clusters in Region 1, Region 2 and Region 3 replicating to a central reporting cluster]
▪ The PostgreSQL wire protocol makes CrateDB compatible with many tools
▪ SQL queries must be supported by CrateDB's SQL dialect
▪ Some limitations [1]:
  o No support for TIME without a time zone
  o No support for multidimensional arrays
  o No support for INTERVAL input units (e.g., MILLISECONDS, MICROSECONDS)

[1] https://crate.io/docs/crate/reference/en/5.0/interfaces/postgres.html
▪ CrateDB integrations with more systems & tools
▪ Improve user experience and functionality of COPY FROM/TO
▪ Optimize source parsing to improve insert/update performance
▪ Further optimizations on distributed query execution
Self-healing:
1. Automatic Failover: all data available, performance reduced
2. Node joins
3. Automatic Synchronization, Automatic Rebalancing
4. Data fully replicated again
... INTEGER
) CLUSTERED INTO 3 SHARDS
WITH (number_of_replicas = '1')

• Built-in replication enables high availability
• Data is replicated on a shard level
• The user decides on the number of replicas
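A hedged sketch of setting and later changing the replica count; the table name t2 and its columns are illustrative, and '1-2' is CrateDB's range notation for a minimum and maximum number of replicas:

-- One replica per shard at creation time ...
CREATE TABLE IF NOT EXISTS t2 (
  name TEXT,
  quantity INTEGER
) CLUSTERED INTO 3 SHARDS
WITH (number_of_replicas = '1');

-- ... later relaxed to "at least 1, at most 2" replicas, depending on cluster size.
ALTER TABLE t2 SET (number_of_replicas = '1-2');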
• Support for user-defined functions
• Example: filter all elements in an array within a given range

CREATE OR REPLACE FUNCTION filter(ARRAY(INTEGER), INTEGER, INTEGER)
RETURNS ARRAY(INTEGER)
LANGUAGE JAVASCRIPT
AS 'function filter(array_integer, min, max) {
  return array_integer.filter(el => el >= min && el <= max);
}';
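A short usage example for the UDF above:

-- Keeps only the values between 10 and 40 (inclusive): [15, 30]
SELECT filter([5, 15, 30, 50], 10, 40) AS filtered;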