
CrateDB: Distributed SQL database built on top of Lucene

Marija Selakovic, Developer Advocate at Crate.io, and Marios Trivyzas, Senior Software Engineer at Crate.io, explain CrateDB in detail as part of the Carnegie Mellon University database seminar series 2022.

cratedb

June 23, 2023

Transcript

  1. The best of SQL and NoSQL

    (diagram: CrateDB at the intersection of Search Engine, SQL, and NoSQL)

    Performance
    • Run ad-hoc queries across 100s of billions of records in < 1 second
    • Use columnar storage for hyper-fast aggregations
    • Fully distributed query engine on top of Apache Lucene®

    Ease of Use and Flexibility
    • Standard SQL and the PostgreSQL Wire Protocol for easy onboarding and broad tool compatibility
    • Support for dynamic objects and schemas
    • Analyze relational, object, time-series, geo, blob, and full-text data all in one system

    Scalability
    • Scale from a single node to analyzing petabytes of data across hundreds of nodes in a cluster
    • Fully managed Cloud and On-Premises (OSS)
  2. Document store in CrateDB

    CREATE TABLE IF NOT EXISTS "doc"."readings" (
      "ts" TIMESTAMP WITH TIME ZONE,
      "device_id" TEXT,
      "data" OBJECT AS (
        "name" TEXT,
        "value" DOUBLE PRECISION
      )
    );

    • CrateDB stores data in relational format
    • Objects are stored as JSON-like documents in columns
    • Arbitrary number of attributes and nesting levels
    • The original document is also stored in JSON format:

    // data as JSON
    { "name": "battery", "value": 90.2 }
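    A minimal usage sketch (device ids and values are illustrative): object
    columns are written as object literals and read with bracket notation:

    INSERT INTO "doc"."readings" ("ts", "device_id", "data")
    VALUES (now(), 'dev-1', {"name" = 'battery', "value" = 90.2});

    SELECT "ts", "data"['name'], "data"['value']
    FROM "doc"."readings"
    WHERE "data"['name'] = 'battery';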
  3. Dynamic schemas

    Flexibility to handle evolving data structures:

    CREATE TABLE IF NOT EXISTS "doc"."readings" (
      "ts" TIMESTAMP WITH TIME ZONE,
      "device_id" TEXT,
      "data" OBJECT(DYNAMIC) AS (
        "sensor" OBJECT(STRICT),
        "event" OBJECT(IGNORED)
      )
    );

    • DYNAMIC: highest flexibility; new columns are allowed
    • STRICT: must follow the column definition; lowest flexibility
    • IGNORED: indexes columns that follow the definition and ignores the ones that don't
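    A sketch of how the policies behave on insert (attribute names and values
    are illustrative):

    -- OK: the DYNAMIC object extends the schema with the new attribute
    INSERT INTO "doc"."readings" ("ts", "device_id", "data")
    VALUES (now(), 'dev-2', {"firmware" = 'v2'});

    -- Fails: the STRICT sub-object rejects attributes not in the definition
    INSERT INTO "doc"."readings" ("ts", "device_id", "data")
    VALUES (now(), 'dev-3', {"sensor" = {"firmware" = 'v2'}});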
  4. Database administration

    • information_schema and pg_* tables: information about tables, partitions, columns, constraints
    • System tables: sys.cluster, sys.nodes, sys.shards, sys.jobs, sys.operations
    • KILL job_id stops a single job; KILL ALL stops every job on each node
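    A sketch of inspecting and cancelling running work (the job id shown is
    illustrative):

    -- list currently running statements
    SELECT id, node['name'], started, stmt FROM sys.jobs;

    -- cancel one job by its id
    KILL '730ea746-fcbb-42f4-8a19-33dcb6aff0b8';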
  5. When to use CrateDB

    • Time-series data
    • Internet of Things
    • Fast queries and aggregations
    • Full-text search
    • SQL as a query language
    • Geospatial data
    • Logical replication
    • Hybrid deployments
    • High availability and scalability

    Use cases not supported by CrateDB:
    • Transactional workloads (OLTP)
    • Highly hierarchical/normalized data models
    • Foreign keys
  6. Shared-nothing architecture

    (diagram: an application connected to a cluster of identical crate nodes; scaling by adding more nodes)

    CrateDB is a distributed system built on a shared-nothing architecture. Every node is "equal" and can take over every task within the cluster.
    • Scaling becomes easier
    • No single point of failure
    • Shards are automatically balanced across nodes
  7. Overview

    Clients connect to a CrateDB node via the PostgreSQL Wire Protocol or the HTTP endpoint. A query then passes through:
    • Parser: parses the SQL statement and creates an AST
    • Analyzer: semantic processing of the statement
    • Planner & Optimizer: builds the execution plan and applies optimizations
    • Distributed Execution Engine: executes the plan in the cluster
    Storage is provided by Lucene storage and blob storage.
  8. Node-to-node communication

    • Encryption can be enabled for each of the supported protocols (PostgreSQL, HTTP, Transport)
    • The encryption protocol is TLSv1.2
    • Configuration via host-based authentication
    • For the transport protocol, trust and cert are valid authentication methods
    • If enabled, transport protocol encryption applies to the whole cluster
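    A minimal crate.yml sketch (setting names follow CrateDB's SSL and
    host-based authentication configuration; paths and values are illustrative):

    ssl.transport.mode: on                       # encrypt node-to-node traffic cluster-wide
    ssl.keystore_filepath: /path/to/keystore.jks
    ssl.keystore_password: changeit
    auth.host_based.enabled: true
    auth.host_based.config:
      0:
        protocol: transport
        method: cert                             # trust and cert are valid for transport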
  9. Everything is the index

    • CrateDB utilizes Lucene as its storage engine
    • Indexing strategy based on Lucene:
      o Inverted index for text values
      o Block KD-trees for numerical values (including timestamps, geo-points, geo-shapes, etc.)
    • Column-based data storage, known as doc values, is built for more effective sorting and aggregations
    • Doc values are enabled by default and can currently be disabled only for text fields
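    A sketch of opting out per column (table and column names illustrative):
    the inverted index and the columnar doc-values store can be disabled for
    text columns at table creation:

    CREATE TABLE logs (
      ts TIMESTAMP WITH TIME ZONE,
      raw TEXT INDEX OFF STORAGE WITH (columnstore = false)  -- no inverted index, no doc values
    );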
  10. Fulltext index with analyzer

    • Splits the content into tokens, which are then searched
    • An analyzer consists of:
      o a tokenizer
      o zero or more token filters
      o zero or more char filters

    "The quick brown fox jumps Over the lAzY DOG." -> quick, brown, fox, jumps, lazy, dog

    Standard analyzer:
    o standard tokenizer
    o lowercase filter
    o stop token filter
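    A sketch of defining and querying a fulltext index (table and index names
    illustrative):

    CREATE TABLE docs (
      body TEXT,
      INDEX body_ft USING FULLTEXT (body) WITH (analyzer = 'standard')
    );

    SELECT body FROM docs WHERE MATCH(body_ft, 'quick fox');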
  11. Inverted index

    • Mapping from content to its location in the database
    • Two kinds: plain (default) and fulltext
    • Fast search in billions of records

    Source table:
    ID | model        | quantity
    1  | Almond Milk  | 60
    2  | Almond Flour | 50
    3  | Milk         | 20

    Fulltext index (per token):
    name   | IDs
    Almond | 1, 2
    Milk   | 1, 3
    Flour  | 2

    Plain index (whole values):
    name         | IDs
    Almond Milk  | 1
    Almond Flour | 2
    Milk         | 3
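    A sketch of how the two kinds answer queries (table and index names
    illustrative, data as above):

    -- plain index: exact value match; returns only row 3
    SELECT id FROM products WHERE model = 'Milk';

    -- fulltext index: token match; returns rows 1 and 3
    SELECT id FROM products WHERE MATCH(model_ft, 'Milk');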
  12. Doc values

    • Inverted indexes are a good solution for searching documents by terms
    • They are not suited for column-oriented retrieval of data
    • Doc values: column-based data storage built at index time
    • Fast aggregation, sorting, and grouping

    Source table:
    ID | model  | quantity
    1  | Milk   | 60
    2  | Almond | 50
    3  | Lemon  | 20

    Doc values (column store for quantity):
    ID | quantity
    1  | 60
    2  | 50
    3  | 20
  13. Compression

    CrateDB has two compression algorithms for tables:
    • LZ4 (default): a good tradeoff between storage efficiency and query performance
    • Deflate: more aggressive compression at the cost of query performance
    • Doc values are delta-encoded, bit-packed, and GCD-compressed
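    The algorithm is selected per table via the codec parameter (a sketch;
    table and columns illustrative; 'best_compression' maps to Deflate, the
    default codec to LZ4):

    CREATE TABLE metrics (
      ts TIMESTAMP WITH TIME ZONE,
      value DOUBLE PRECISION
    ) WITH (codec = 'best_compression');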
  14. Compression: test

    CREATE TABLE readings (
      time TIMESTAMP,
      device_id TEXT,
      battery_level DOUBLE PRECISION,
      battery_status TEXT,
      battery_temperature DOUBLE PRECISION,
      bssid TEXT,
      cpu_avg_1min DOUBLE PRECISION,
      cpu_avg_5min DOUBLE PRECISION,
      cpu_avg_15min DOUBLE PRECISION,
      mem_free DOUBLE PRECISION,
      mem_used DOUBLE PRECISION,
      rssi DOUBLE PRECISION,
      ssid TEXT
    );

    120 million rows, index on time and device_id:

    CrateDB (LZ4):     19.7 GB (-49 % vs. PostgreSQL)
    CrateDB (Deflate): 14.4 GB (-63 % vs. PostgreSQL)
    PostgreSQL:        39.0 GB
  15. Logical plan

    An abstraction of all transformation steps needed to execute a query.

    EXPLAIN SELECT name FROM users;
    -> Collect[doc.users | [name] | true]

    EXPLAIN SELECT name FROM users WHERE name = 'name';
    -> Collect[doc.users | [name] | (name = 'name')]

    The Collect operator carries:
    • the table name (doc.users)
    • the attribute (name)
    • the query expression (true); with a WHERE clause, a term query replaces the true expression
  16. Filter push down

    EXPLAIN SELECT name FROM (SELECT name FROM users WHERE age = 30) AS n
    WHERE n.name = 'foo';

    Rename[name] AS n
    └ Collect[doc.users | [name] | ((name = 'foo') AND (age = 30))]

    The two predicates are merged to reduce the number of rows to be collected.
  17. Physical execution plan

    • The logical plan does not reason about data distribution
    • Every table is split into multiple shards by default
    • Routing table: information about where the shards of a table are located

    CREATE TABLE t1 (
      name STRING
    ) CLUSTERED INTO 3 SHARDS;

    (diagram: shard t1.s1 on Node 1, shard t1.s2 on Node 2, shard t1.s3 on Node 3)
  18. Physical execution plan (1)

    • Consists of one or more execution phases (e.g., Collect -> CollectPhase)
    • Three types of nodes:
      • Collect nodes: nodes from which data is gathered
      • Handler node: handles communication with a client
      • Merge nodes: merge the results from collect nodes

    (diagram: two collect nodes feed a merge node; the merge node and the handler node that serves the client live on the same node)
  19. Distributed query execution: SELECT

    Logical plan:
    EXPLAIN SELECT name FROM users;
    -> Collect[doc.users | [name] | true]

    Physical execution plan:
    RoutedCollectPhase:
      toCollect: [Ref{name}]     -- the attribute name
      routing:                   -- nodes to be queried and the shard ids on each
        node-0: users: [0, 1]
        node-1: users: [2, 3]
    MergePhase:
      executionNodes: [node-0]   -- handler node: the node that merges the data
      numUpstreams: 2
  20. Distributed query execution: SELECT with LIMIT

    SELECT name, age FROM users WHERE name LIKE 'A%' ORDER BY id LIMIT 50;

    Logical plan:
    Eval[name, age]
    └ Fetch[name, age, id]
      └ Limit[50::bigint;0]
        └ OrderBy[id ASC]
          └ Collect[doc.users | [_fetchid, id] | (name LIKE 'A%')]

    • _fetchid: system column used to fetch the values of other attributes
    • Fetch operator: resolves one or more _fetchid columns into attribute values
    • Limit operator: limits the output to at most 50 records
    • OrderBy operator: returns the _fetchids ordered by the id column
  21. Execution strategy

    (diagram: nodes N1, N2, N3 each collect _fetchid values with Limit 50; the results are merged with Limit 50, then a fetch resolves _fetchid -> [name, details])

    • Loading values is expensive
    • CrateDB first collects only _fetchid values
    • A fetch operation then retrieves the name and details values
    • Records are merged and reduced to 50 values
  22. Distributed joins in CrateDB

    for each tuple l ∈ L do
      for each tuple r ∈ R do
        if l.a Θ r.b then
          put tuple(l, r) in Q

    • Default algorithm: nested loop
    • Each tuple of the inner relation is compared with each tuple of the outer relation
    • For some tables, the nested loop can be executed on the handler node
  23. Distributed nested loop

    1. Find the location of the required shards
    2. Broadcast data to the nodes holding the shards they are joining with
    3. Each node starts its nested loop
    4. Intermediate results are pushed to the handler node and merged

    Complexity: O(N²)

    (diagram: shards L1/R1 on Node 1 and L2/R2 on Node 2; data is broadcast, each node runs a nested loop, and results are pushed to the handler node for the merge)
  24. Block hash join algorithm

    • Algorithm for processing equi-joins
    • Build phase: scan the smaller table and store hash values of the join attributes in a hash table
    • Probe phase: each row of the other table is probed against the hash table; on a match, the join condition is validated and the combined row is emitted
    • The dataset is divided into blocks, and the algorithm repeats for each block

    Complexity: O(M+N)
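    The block hash join applies to equality predicates (a sketch; table and
    column names illustrative):

    SELECT u.name, o.total
    FROM users u
    JOIN orders o ON u.id = o.user_id;   -- equi-join -> eligible for the hash join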
  25. Distributed block hash join algorithm

    • Compute a hash for every row in each shard
    • Assign each row to a node using the modulo operator: node = hash value % total nodes
    • After partitioning, apply the block hash join algorithm on each node
    • Rows with matching hashes end up on the same node

    (diagram: nodes N1, N2, N3 each partition rows by hash % 3, run a block hash join locally, and the results are merged)
  26. Direct response vs paging

    • The PostgreSQL wire protocol supports streaming of individual data rows
    • Two modes for SELECT queries:
      • Direct response: the collect node sends the full result to the handler
      • Paging: the collect node sends results in batches
    • Which mode is used depends on the query (LIMIT) and the number of nodes and shards
    • Idea: if only few results are expected, direct response performs better
  27. Direct response vs paging (1)

    Direct response (Client, Node 0, Node 1):
    1. Client sends the SELECT query to Node 0
    2. Node 0 sends a job request to Node 1
    3. Node 1 returns the full result
    4. Node 0 sends the merged result to the client

    Paging (Client, Node 0, Node 1):
    1. Client sends the SELECT query to Node 0
    2. Node 0 sends a job request to Node 1
    3. Node 1 replies with an empty acknowledgement
    4. Node 1 sends a subset of the data
    5. Node 0 requests more data as needed
    6. Node 0 sends the merged result to the client
  28. Logical replication: publish-subscribe model

    Logical replication allows a CrateDB cluster to replicate table(s) to another CrateDB cluster.

    • Publisher: cluster that makes data available to other clusters through publications
    • Subscriber: cluster that receives replicated data from one or more publishers
    • Besides data, all operation types such as INSERT, DELETE, and UPDATE, as well as schema changes, are also replicated
  29. Creating publications and subscriptions

    CREATE PUBLICATION name { FOR TABLE table_name [, ...] | FOR ALL TABLES }

    CREATE SUBSCRIPTION name
      CONNECTION 'path_to_publisher'
      PUBLICATION publication_name [, ...]
      [ WITH (parameter_name [= value] [, ...]) ]

    • A publication can contain all tables, a subset of tables, or no tables
    • No aggregations during replication
    • Tables on the subscriber are read-only
    • The connection string contains parameters in key-value format
    • The WITH enabled parameter specifies whether the subscription should be actively replicating
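    A concrete sketch (cluster host, user, and names are illustrative; see the
    logical replication docs for the exact connection-string format):

    -- on the publishing cluster
    CREATE PUBLICATION metrics_pub FOR TABLE doc.readings;

    -- on the subscribing cluster
    CREATE SUBSCRIPTION metrics_sub
      CONNECTION 'crate://publisher.example.com:4300?user=repl_user&password=secret'
      PUBLICATION metrics_pub
      WITH (enabled = true);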
  30. Use case: improving data locality

    (diagram: a central cluster replicating selected data to clusters in Regions 1-3)

    Benefits:
    • Collect data where it is created
    • Replicate only specific data to local clusters
    • Access data with lower latency
    • No need to duplicate all data
  31. Use case: central reporting

    (diagram: clusters in Regions 1-3 replicating selected data to a central reporting cluster)

    Benefits:
    • Collect data close to where it is needed and replicate only specific data to a central cluster
    • Access data from multiple sources in one single location
    • No need to duplicate all data
  32. Ecosystem

    • PostgreSQL wire protocol v3 [1] => compatibility with many tools
    • SQL queries must be supported by CrateDB's SQL dialect
    • Some limitations:
      o No support for TIME without a time zone
      o No support for multidimensional arrays
      o No support for INTERVAL input units (e.g., MILLISECONDS, MICROSECONDS)

    [1] https://crate.io/docs/crate/reference/en/5.0/interfaces/postgres.html
  33. What is next

    • Improve PostgreSQL ecosystem compatibility
    • Facilitate CrateDB integrations with more systems & tools
    • Improve user experience and functionality of COPY FROM/TO
    • Optimize source parsing to improve insert/update performance
    • Further optimizations of distributed query execution
  34. Securing high availability

    (diagram: a four-step cycle)
    1. Node leaves (hardware failure, rolling update)
    2. Automatic failover: all data remains available, performance is reduced
    3. Self-healing: data is fully replicated again
    4. Node joins: automatic synchronization and rebalancing
  35. Lucene refresh / flush

    (diagram: write path from RAM to disk)
    • A new record is written to the in-memory buffer and, for persistence, to the translog (transaction log)
    • Refresh (every ~1000 ms) turns the in-memory buffer into an in-memory segment, making the data queryable
    • Flush (a Lucene commit, every ~5000 ms) persists segments to disk; during recovery the translog is replayed
    • Smaller segments (e.g., Segment 0, 1, 2) are merged into larger ones (e.g., Segment 4, 5) in the background
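    These mechanics surface in SQL (a sketch; the table and interval value are
    illustrative):

    REFRESH TABLE readings;                               -- force a refresh so recent writes become visible
    ALTER TABLE readings SET (refresh_interval = 1000);   -- refresh period in milliseconds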
  36. Shard replication

    CREATE TABLE t1 (
      name STRING,
      month INTEGER
    ) CLUSTERED INTO 3 SHARDS
    WITH (number_of_replicas = '1');

    • Built-in replication enables high availability
    • Data is replicated at the shard level
    • The user decides on the number of replicas
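    The replica count can be changed at runtime, and shard allocation inspected
    through sys.shards (a sketch):

    ALTER TABLE t1 SET (number_of_replicas = '2');

    SELECT table_name, id, "primary", node['name']
    FROM sys.shards
    WHERE table_name = 't1';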
  37. User-defined functions

    • For various analytics tasks, CrateDB supports user-defined functions
    • Example: filter all elements of an array within a given range

    CREATE OR REPLACE FUNCTION filter(ARRAY(INTEGER), INTEGER, INTEGER)
    RETURNS ARRAY(INTEGER)
    LANGUAGE JAVASCRIPT
    AS 'function filter(array_integer, min, max) {
      return array_integer.filter(el => el >= min && el <= max);
    }';
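    A usage sketch (values illustrative):

    SELECT filter([1, 5, 10, 20], 4, 15);   -- -> [5, 10]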