Upgrade to Pro — share decks privately, control downloads, hide ads and more …

CrateDB: A Database or a Search Engine? Both!

CrateDB: A Database or a Search Engine? Both!

CrateDB is an open-source, distributed SQL database that runs daily queries on millions of data records. It is built for data-intensive analytics with a PostgreSQL interface that runs anywhere, from on-premises to the cloud. In this talk, I will give you an introduction to CrateDB, its architecture, and how to make ingestion and analysis of a large amount of data extremely simple and efficient. You will learn how to get all the benefits of traditional databases while utilizing the scalability and efficiency of a distributed search engine.

cratedb

June 23, 2023
Tweet

More Decks by cratedb

Other Decks in Technology

Transcript

  1. What is CrateDB Architecture Logical Replication Try out CrateDB When

    to Use CrateDB? Summary 01 02 03 04 05 06 Outline 2 Crate.io
  2. About CrateDB 4 Crate.io • Since 2014: https://github.com/crate/crate • A

    distributed, horizontally scaling database for data intensive analytics solutions • Open Source under Apache License 2.0 • Perfect choice for: ◦ Event data (high-volume, non-transactional) ◦ Mixed structured/unstructured data ◦ Fast analytical queries ◦ Highly scalable deployments
  3. SQL • PostgreSQL-compatible interface for easy data consumption • Standard

    SQL as query language • Table schemas are represented in a relational format NoSQL • Dynamic objects and schemas • Based on Apache Lucene • All columns get indexed by default with low overhead • Eventual consistency to achieve high ingest performance CrateDB Combines the Best of SQL and NoSQL 5 Crate.io
  4. PostgreSQL Compatibility 6 Crate.io PostgreSQL Wire Protocol and Dialect Compatible

    with many tools and utilities through included system catalog
  5. ▪ SQL Engine for parsing and executing SQL statements ▪

    Distributed, shared-nothing architecture ▪ Every node is “equal” and can perform any task ▪ Benefits: o Scaling becomes easier o Automatic data rebalancing o No single point of failure CrateDB Architecture 8 Crate.io Client Parser Analyzer Planner & Optimizer Lucene Storage Blob Storage Distributed Execution Engine PostgreSQL Wire Protocol HTTP Endpoint Client Client Client Client Client CrateDB node
  6. Lucene Index 9 Crate.io ID name quantity 1 Almond Milk

    100 2 Almond Flour 200 3 Milk 300 ID name 1 Almond Milk 2 Almond Flour 3 Milk name IDs Almond 1, 2 Milk 2, 3 Flour 2 Fulltext Index / BKD-Trees Column store Benefits: • Fast Search • Very fast aggregations CrateDB by defaults indexes all data similar to Search Engines
  7. Sharding ▪ Every table is split into multiple shards by

    default ▪ Shards are distributed evenly across all nodes in the cluster CREATE TABLE metrics ( ”ts" TIMESTAMP, "name" TEXT, "tags" OBJECT(DYNAMIC), "fields" OBJECT(DYNAMIC) ) CLUSTERED INTO 3 SHARDS;
  8. Partitioning ▪ Partitioning splits up a large table into smaller

    chunks ▪ Minimize the number of records that need to be queried CREATE TABLE metrics ( ”ts" TIMESTAMP, ”ts_month" GENERATED ALWAYS AS DATE_TRUNC('month’, “ts"), "name" TEXT, "tags" OBJECT(DYNAMIC), "fields" OBJECT(DYNAMIC) ) CLUSTERED INTO 3 SHARDS PARTITIONED BY (ts_month);
  9. Logical Replication: Publish-Subscribe Model 13 Crate.io Publishers Subscribers § Publisher:

    database instance that makes data available to other locations through publications § Subscriber: database instance that receives replicated data from one or more publishers § Cluster can have both publications and subscriptions
  10. Creating Publications and Subscriptions 14 Crate.io § Publication can contain

    all tables, subset of tables or no tables § All operations are replicated § No aggregations during replication § Tables on the subscriber are read-only § Connection string contains parameters in the key-value format § WITH enabled parameter: it specifies whether the subscription should be actively replicating CREATE PUBLICATION name {FOR TABLE table_name [,…] | FOR ALL TABLES} CREATE SUBSCRIPTION name CONNECTION ‘path_to_publisher’ PUBLICATION publication_name [, …] [WITH (parameter_name [= value], [,…], …]
  11. Use Case: Central Reporting 15 Crate.io Local storage and centralized

    reporting Replicate data from a large number of smaller, localized clusters back to a centralized reporting cluster. Benefits § Collect data close to where it’s needed, and replicate only specific data to a central cluster § Access data from multiple sources in one single location § No need to duplicate all data
  12. Use Case: Low Latency 16 Crate.io Central storage and local

    replicas Replicate data from a large cluster to a large number of smaller localized clusters for low latency access. Benefits of logical replication § Collect data where it’s created, and replicate only specific data to local clusters § Access data from close to the location with lower latency § Replicate selected data to select locations
  13. • Download CrateDB and run it from a tarball bash

    -c "$(curl -L https://try.crate.io/)" • Access the Admin UI via following link: https://localhost:4200 Getting Started 18 Crate.io Admin UI Web interface Overview (state, version) Query console Sharding overview Node information Cluster monitoring
  14. More Features 20 Crate.io • Geo search • User Defined

    Functions • Snapshots • User management • Schema/Table privileges • SSL encryption • Dynamic objects • COPY FROM and COPY TO • … docs.crate.io
  15. CrateDB Use Cases 22 Crate.io Want or have to use

    SQL Time Series Data Real-time Analytics Large amounts of data Internet of Things Geospatial data Structured/Semi-structured data Thousands of queries per seconds Transactional workloads (OLTP) Foreign Keys Highly hierarchical/normalized data
  16. • Highly scalable data infrastructure is a key • CrateDB

    makes it simple to store and analyze massive amounts of data • Unlimited scalability • Efficient, dynamic architecture • Dynamic schemas • Easy integration due to PostgreSQL compatibility • Fast in aggregation and search • Open source Summary 23 community.crate.io