Slide 1

Slide 1 text

November 23, 2022 Marija Selakovic Not all time-series are equal Challenges of storing and analyzing industrial data

Slide 2

Slide 2 text

2 About the speaker • Developer advocate @ Crate.io • Ph.D. in computer science at TU Darmstadt • Background in software engineering and data science • Interests in big data, community engagement, and helping developers to be successful with CrateDB

Slide 3

Slide 3 text

What is CrateDB Industrial time-series Data frequency Semi structured data Availability and reliability 1 2 3 4 5 6 Q&A

Slide 4

Slide 4 text

4 What is CrateDB A distributed, open-source SQL database for real-time analytics

Slide 5

Slide 5 text

5 The best of SQL and NoSQL Search Engine SQL NoSQL Performance § Run ad-hoc queries across 100s of billions of records in < 1 second § Use columnar storage for hyper-fast aggregations § With fully distributed query engine on top of Apache Lucene® Ease of Use and Flexibility § Standard SQL and PostgreSQL Wire Protocol for easy onboarding and great tool compatibility § Support for dynamic objects and schemas § Analyze relational, object, time-series, geo and full-text data all in one system Scalability § Scale from a single node to analyzing petabytes of data across hundreds of nodes in a cluster § Fully-managed Cloud, Hybrid and On-Premises (OSS)

Slide 6

Slide 6 text

6 Products and services on Kubernetes Private region Run your own cloud region on your Kubernetes infrastructure Quick start Deploy new cluster within minutes Scale with ease Add or remove nodes with zero down time Backup and recovery Fully automated (full/incremental) DBaaS Worry-free operations Hosted and managed by Crate.io Engineers on your behalf Apache License v2.0 Run on your own On any hardware

Slide 7

Slide 7 text

Industrial time-series

Slide 8

Slide 8 text

8 What are time-series • Data describing how a metric changes over time • Data points come with a timestamp: the specific moment in time in which the value was collected • Applications: manufacturing, financial markets, energy, application monitoring, streaming, etc.

Slide 9

Slide 9 text

9 Example Timestamp Device_id Battery CPU Memory 1480538130000 demo002696 temperature: 90.2 level: 73 status: discharging avg_1min: 26.73 avg_5min: 13.79 avg_15min: 9.32 free: 607689289 used 392310711 1480325490000 demo000153 temperature: 89.7 level: 27 status: discharging avg_1min: 6.62 avg_5min: 7.02 avg_15min: 7.63 free: 582462374 used 417537626 1480161150000 demo000403 temperature: 90.6 level: 54 status: discharging avg_1min: 4.5 avg_5min: 6.64 avg_15min: 9.46 free: 480858681 used 519141319 1480778190000 demo000138 temperature: 87.3 level: 73 status: discharging avg_1min: 8.67 avg_5min: 7.62 avg_15min: 9.16 free: 617321184 used 382678816 1480585170000 demo000905 temperature: 89 level: 15 status: discharging avg_1min: 6.14 avg_5min: 10.68 avg_15min: 15.22 free: 643883624 used 356116376 1480089360000 demo001177 temperature: 90.6 level: 75 status: discharging avg_1min: 6.93 avg_5min: 6.06 avg_15min: 7.69 free: 747529797 used 252470203

Slide 10

Slide 10 text

10 Industrial time-series data • Time-series became increasingly important for the manufacturing industry • Characterized by huge amounts of data • Various data formats: the number of types of signals and sensors is very high • High-frequency and low-frequency data need to be combined for effective analysis • The potential for valuable insights with right technology

Slide 11

Slide 11 text

11 Relational databases Pros ü SQL accessibility ü High data availability ü Easy integration Cons ✖ Difficult to scale ✖ Performance degrades with high data volumes NoSQL databases Pros ü Scalability ü Flexibility ü Distributions Cons ✖ Difficult integration ✖ Expensive ✖ High complexity Time-series databases Pros ü Built for time-series Cons ✖ Performance degrades for complex aggregations ✖ Need for another database Technology for time-series data

Slide 12

Slide 12 text

Not all time-series are equal 12 Not all time-series use cases are the same! Not all time-series solutions are made equal for industrial time-series!

Slide 13

Slide 13 text

13 Challenges Availability Semi structured data Data frequency

Slide 14

Slide 14 text

Availability

Slide 15

Slide 15 text

15 Why the availability is important • Availablity: proportion of time the system can be used • Industries produce data 24/7 • Crucial for: • Keep production running • Real-time data analytics • Ensuring high-performance and quality

Slide 16

Slide 16 text

16 Shared-nothing architecture Node (crate) Node (crate) Node (crate) Node (crate) Node (crate) Application Scaling by adding more nodes CrateDB is a distributed system built in a shared-nothing architecture. Every node is “equal” and can take over every task within the cluster. Scaling becomes easier No single point of failure Shards are automatically balanced across nodes

Slide 17

Slide 17 text

17 Securing high availability Node leaves Hardware failure Rolling Update Self-healing Data fully replicated again Node joins Automatic Synchronization Automatic Rebalancing Automatic Failover All data available Performance reduced 2 1 3 4

Slide 18

Slide 18 text

18 Shard replication CREATE TABLE t1 ( name STRING, month INTEGER ) CLUSTERED INTO 3 SHARDS WITH (number_of_replicas = ‘1’) • Built-in replication enables high availability • Data is replicated on a shard level • User decides on the number of replicas

Slide 19

Slide 19 text

19 Automatic failover Replica shards can become primary shards when the primary shard becomes unavailable.

Slide 20

Slide 20 text

20 Logical replication Logical replication allows a CrateDB cluster to replicate table(s) to another CrateDB cluster Publishers Subscribers § Publisher: cluster that makes data available to other clusters through publications § Subscriber: cluster that receives replicated data from one or more publishers § A cluster can have both publications and subscriptions § Data on subscriber are read only

Slide 21

Slide 21 text

Improving data locality Benefits: • Collect data where it is created • Replicate only specific data to local clusters • Access data with lower latency • No need to duplicate all data Central reporting cluster Region 1 Region 2 Region 3

Slide 22

Slide 22 text

Semi structured data

Slide 23

Slide 23 text

23 Complex measurements • Single machine might consist of submodules • Naturally creating structures in data • Multiple time-series Keep the structure Fast and easy search Aggregated values

Slide 24

Slide 24 text

24 Example: hierarchical structures readings ts device_id data sensors name value event type ▪ Readings: contains data from different sensors ▪ Solution 1: flatten all data ▪ Solution 2: keep the structure

Slide 25

Slide 25 text

25 Document store in CrateDB CREATE TABLE IF NOT EXISTS “doc”.“readings” ( “ts” TIMESTAMP WITH TIME ZONE, “device_id” TEXT, “data” OBJECT AS ( “name” TEXT, “value” Number, ) ) • CrateDB stores data in relational format • Store data as JSON-like documents in columns • Arbitrary number of attributes and nesting levels !// obj JSON { “name” : “battery”, “value” : “90.2” }

Slide 26

Slide 26 text

26 Dynamic schemas Flexibility to handle evolving data structures CREATE TABLE IF NOT EXISTS “doc”.“readings” ( “ts” TIMESTAMP WITH TIME ZONE, ”device_id” TEXT, ”data” OBJECT(DYNAMIC) AS ( ”sensor” OBJECT(STRICT), ”event” OBJECT(IGNORED) ) ) DYNAMIC • Highest flexibility • New columns allowed STRICT • Must follow column definition • Lowest flexibility IGNORED • Indexes columns that follow the definition • Ignores the ones that don’t

Slide 27

Slide 27 text

27 Querying semi-structured document Extract a top-level columns SELECT ts, data FROM readings Extract nested fields SELECT data[‘sensor’], data[‘event’] FROM readings { ”ts”:5678973, “device_id”: “drill001”, “data”: { “sensor” : {“name” : “temp1”, “value”: 21}, “event” : {“type” : “info”} } }

Slide 28

Slide 28 text

28 Indexing in CrateDB ▪ CrateDB utilizes Lucene as storage engine ▪ Indexing strategy based on Lucene: o Inverted index for text values o Block KD-Trees for numerical values ▪ Column-based data storage is built for more effective sorting and aggregations. Known as doc values. ▪ Doc values are enabled per default and can be disabled only for text fields

Slide 29

Slide 29 text

29 Inverted index ▪ Mapping from content to its location in the database ▪ Plain (default) and fulltext ▪ Fast search in billions of records ID model quantity 1 Temperature Sensor 60 2 Vibration Sensor 50 3 Pressure Sensor 20 name IDs Temperature 1 Vibration 2 Pressure 3 Sensor 1, 2, 3 name IDs Temperature Sensor 1 Vibration Sensor 2 Pressure Sensor 3 Fulltext index Plain index

Slide 30

Slide 30 text

30 Fast aggregation ▪ Inverted indexes is good solution for searching documents by terms ▪ Not for column-oriented retrieval of data ▪ Doc values: column-based data storage built at index time ▪ Fast aggregation, sorting and grouping ID model quantity 1 Temperature Sensor 60 2 Vibration Sensor 50 3 Pressure Sensor 20 ID quantity 1 60 2 50 3 20

Slide 31

Slide 31 text

Data frequency

Slide 32

Slide 32 text

32 Frequencies in time-series data • Refers to the number of times data is sent in given interval • High frequency data + business process data • Different sensors send data in different frequencies • One record per measurement huge overhead • Is one database enough? 100 000 sig/sec 100 sig/sec business data

Slide 33

Slide 33 text

33 CREATE TABLE IF NOT EXISTS “doc”.“readings” ( “ts” TIMESTAMP WITH TIME ZONE, “name” TEXT, “device_id” TEXT, “data” ARRAY(INTEGER) ) • CrateDB supports arrays, including arrays of objects • Several measurements can be stored in a single row “data” : [21, 20, 23, 21, 19, 18, 24] Arrays Extract values from arrays: SELECT data[1] FROM readings SELECT data[1:100] FROM readings

Slide 34

Slide 34 text

34 Joins in CrateDB Keep all data in a single database! Schema info: • device_id • manufacturer • model • version Schema readings: • ts • name • device_id • data 3,000 device information thousands of records per sec SELECT readings.data FROM info JOIN readings ON info.device_id= readings.device_id WHERE info.manufacturer = 'iobeam' AND info.model = 'S7' Example: find sensor data for iobeam S7 machine

Slide 35

Slide 35 text

35 Types of joins • CrateDB supports cross, inner and outer joins • Allows efficient query over large dataset • Algorithms: nested loop and distributed hash join algorithm • Optimizations for distributed execution1 1https://archive.fosdem.org/2022/schedule/event/distributed_join_cratedb/

Slide 36

Slide 36 text

36 User-defined functions • For various analytics tasks, CrateDB supports user-defined functions • Example: filter all elements in an array within a given range CREATE OR REPLACE FUNCTION filter(ARRAY(INTEGER), INTEGER, INTEGER) RETURNS ARRAY(INTEGER) LANGUAGE JAVASCRIPT AS 'function array_filter(array_integer, min, max) { return array_integer.filter(el => el >= min && el <= max); }';

Slide 37

Slide 37 text

37 When to use CrateDB Time-series data Internet of things Fast queries and aggregations Full-text search SQL as a query language Geospatial data Logical replication Hybrid deployments High-availability and scalability Use cases not supported by CrateDB: Transactional workloads (OLTP) Highly hierarchical/normalized data models Foreign Keys

Slide 38

Slide 38 text

38 Takeaways • Huge interest on time-series data • High volumes on valuable data • Availability, semi-structured data and different data frequencies • CrateDB offers features for handling time-series data Start today with $200 free credits

Slide 39

Slide 39 text

Join our community: community.crate.io Contact: [email protected] Thank you! Contribute: https://github.com/crate/crate