Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Not all time-series are equal ​ Challenges of...

cratedb
June 23, 2023

Not all time-series are equal ​ Challenges of storing and analyzing industrial data

In this talk, we will discuss how different data properties affect solutions for data storage, data retention, and data integration. Furthermore, we will illustrate how the requirements for storing industrial data can be satisfied with CrateDB: an open-source, distributed database that makes storage and analysis of massive amounts of data simple and efficient.

cratedb

June 23, 2023
Tweet

More Decks by cratedb

Other Decks in Technology

Transcript

  1. November 23, 2022 Marija Selakovic Not all time-series are equal

    Challenges of storing and analyzing industrial data
  2. 2 About the speaker • Developer advocate @ Crate.io •

    Ph.D. in computer science at TU Darmstadt • Background in software engineering and data science • Interests in big data, community engagement, and helping developers to be successful with CrateDB
  3. 5 The best of SQL and NoSQL Search Engine SQL

    NoSQL Performance § Run ad-hoc queries across 100s of billions of records in < 1 second § Use columnar storage for hyper-fast aggregations § With fully distributed query engine on top of Apache Lucene® Ease of Use and Flexibility § Standard SQL and PostgreSQL Wire Protocol for easy onboarding and great tool compatibility § Support for dynamic objects and schemas § Analyze relational, object, time-series, geo and full-text data all in one system Scalability § Scale from a single node to analyzing petabytes of data across hundreds of nodes in a cluster § Fully-managed Cloud, Hybrid and On-Premises (OSS)
  4. 6 Products and services on Kubernetes Private region Run your

    own cloud region on your Kubernetes infrastructure Quick start Deploy new cluster within minutes Scale with ease Add or remove nodes with zero down time Backup and recovery Fully automated (full/incremental) DBaaS Worry-free operations Hosted and managed by Crate.io Engineers on your behalf Apache License v2.0 Run on your own On any hardware
  5. 8 What are time-series • Data describing how a metric

    changes over time • Data points come with a timestamp: the specific moment in time in which the value was collected • Applications: manufacturing, financial markets, energy, application monitoring, streaming, etc.
  6. 9 Example Timestamp Device_id Battery CPU Memory 1480538130000 demo002696 temperature:

    90.2 level: 73 status: discharging avg_1min: 26.73 avg_5min: 13.79 avg_15min: 9.32 free: 607689289 used 392310711 1480325490000 demo000153 temperature: 89.7 level: 27 status: discharging avg_1min: 6.62 avg_5min: 7.02 avg_15min: 7.63 free: 582462374 used 417537626 1480161150000 demo000403 temperature: 90.6 level: 54 status: discharging avg_1min: 4.5 avg_5min: 6.64 avg_15min: 9.46 free: 480858681 used 519141319 1480778190000 demo000138 temperature: 87.3 level: 73 status: discharging avg_1min: 8.67 avg_5min: 7.62 avg_15min: 9.16 free: 617321184 used 382678816 1480585170000 demo000905 temperature: 89 level: 15 status: discharging avg_1min: 6.14 avg_5min: 10.68 avg_15min: 15.22 free: 643883624 used 356116376 1480089360000 demo001177 temperature: 90.6 level: 75 status: discharging avg_1min: 6.93 avg_5min: 6.06 avg_15min: 7.69 free: 747529797 used 252470203
  7. 10 Industrial time-series data • Time-series became increasingly important for

    the manufacturing industry • Characterized by huge amounts of data • Various data formats: the number of types of signals and sensors is very high • High-frequency and low-frequency data need to be combined for effective analysis • The potential for valuable insights with right technology
  8. 11 Relational databases Pros ü SQL accessibility ü High data

    availability ü Easy integration Cons ✖ Difficult to scale ✖ Performance degrades with high data volumes NoSQL databases Pros ü Scalability ü Flexibility ü Distributions Cons ✖ Difficult integration ✖ Expensive ✖ High complexity Time-series databases Pros ü Built for time-series Cons ✖ Performance degrades for complex aggregations ✖ Need for another database Technology for time-series data
  9. Not all time-series are equal 12 Not all time-series use

    cases are the same! Not all time-series solutions are made equal for industrial time-series!
  10. 15 Why the availability is important • Availablity: proportion of

    time the system can be used • Industries produce data 24/7 • Crucial for: • Keep production running • Real-time data analytics • Ensuring high-performance and quality
  11. 16 Shared-nothing architecture Node (crate) Node (crate) Node (crate) Node

    (crate) Node (crate) Application Scaling by adding more nodes CrateDB is a distributed system built in a shared-nothing architecture. Every node is “equal” and can take over every task within the cluster. Scaling becomes easier No single point of failure Shards are automatically balanced across nodes
  12. 17 Securing high availability Node leaves Hardware failure Rolling Update

    Self-healing Data fully replicated again Node joins Automatic Synchronization Automatic Rebalancing Automatic Failover All data available Performance reduced 2 1 3 4
  13. 18 Shard replication CREATE TABLE t1 ( name STRING, month

    INTEGER ) CLUSTERED INTO 3 SHARDS WITH (number_of_replicas = ‘1’) • Built-in replication enables high availability • Data is replicated on a shard level • User decides on the number of replicas
  14. 20 Logical replication Logical replication allows a CrateDB cluster to

    replicate table(s) to another CrateDB cluster Publishers Subscribers § Publisher: cluster that makes data available to other clusters through publications § Subscriber: cluster that receives replicated data from one or more publishers § A cluster can have both publications and subscriptions § Data on subscriber are read only
  15. Improving data locality Benefits: • Collect data where it is

    created • Replicate only specific data to local clusters • Access data with lower latency • No need to duplicate all data Central reporting cluster Region 1 Region 2 Region 3
  16. 23 Complex measurements • Single machine might consist of submodules

    • Naturally creating structures in data • Multiple time-series Keep the structure Fast and easy search Aggregated values
  17. 24 Example: hierarchical structures readings ts device_id data sensors name

    value event type ▪ Readings: contains data from different sensors ▪ Solution 1: flatten all data ▪ Solution 2: keep the structure
  18. 25 Document store in CrateDB CREATE TABLE IF NOT EXISTS

    “doc”.“readings” ( “ts” TIMESTAMP WITH TIME ZONE, “device_id” TEXT, “data” OBJECT AS ( “name” TEXT, “value” Number, ) ) • CrateDB stores data in relational format • Store data as JSON-like documents in columns • Arbitrary number of attributes and nesting levels !// obj JSON { “name” : “battery”, “value” : “90.2” }
  19. 26 Dynamic schemas Flexibility to handle evolving data structures CREATE

    TABLE IF NOT EXISTS “doc”.“readings” ( “ts” TIMESTAMP WITH TIME ZONE, ”device_id” TEXT, ”data” OBJECT(DYNAMIC) AS ( ”sensor” OBJECT(STRICT), ”event” OBJECT(IGNORED) ) ) DYNAMIC • Highest flexibility • New columns allowed STRICT • Must follow column definition • Lowest flexibility IGNORED • Indexes columns that follow the definition • Ignores the ones that don’t
  20. 27 Querying semi-structured document Extract a top-level columns SELECT ts,

    data FROM readings Extract nested fields SELECT data[‘sensor’], data[‘event’] FROM readings { ”ts”:5678973, “device_id”: “drill001”, “data”: { “sensor” : {“name” : “temp1”, “value”: 21}, “event” : {“type” : “info”} } }
  21. 28 Indexing in CrateDB ▪ CrateDB utilizes Lucene as storage

    engine ▪ Indexing strategy based on Lucene: o Inverted index for text values o Block KD-Trees for numerical values ▪ Column-based data storage is built for more effective sorting and aggregations. Known as doc values. ▪ Doc values are enabled per default and can be disabled only for text fields
  22. 29 Inverted index ▪ Mapping from content to its location

    in the database ▪ Plain (default) and fulltext ▪ Fast search in billions of records ID model quantity 1 Temperature Sensor 60 2 Vibration Sensor 50 3 Pressure Sensor 20 name IDs Temperature 1 Vibration 2 Pressure 3 Sensor 1, 2, 3 name IDs Temperature Sensor 1 Vibration Sensor 2 Pressure Sensor 3 Fulltext index Plain index
  23. 30 Fast aggregation ▪ Inverted indexes is good solution for

    searching documents by terms ▪ Not for column-oriented retrieval of data ▪ Doc values: column-based data storage built at index time ▪ Fast aggregation, sorting and grouping ID model quantity 1 Temperature Sensor 60 2 Vibration Sensor 50 3 Pressure Sensor 20 ID quantity 1 60 2 50 3 20
  24. 32 Frequencies in time-series data • Refers to the number

    of times data is sent in given interval • High frequency data + business process data • Different sensors send data in different frequencies • One record per measurement huge overhead • Is one database enough? 100 000 sig/sec 100 sig/sec business data
  25. 33 CREATE TABLE IF NOT EXISTS “doc”.“readings” ( “ts” TIMESTAMP

    WITH TIME ZONE, “name” TEXT, “device_id” TEXT, “data” ARRAY(INTEGER) ) • CrateDB supports arrays, including arrays of objects • Several measurements can be stored in a single row “data” : [21, 20, 23, 21, 19, 18, 24] Arrays Extract values from arrays: SELECT data[1] FROM readings SELECT data[1:100] FROM readings
  26. 34 Joins in CrateDB Keep all data in a single

    database! Schema info: • device_id • manufacturer • model • version Schema readings: • ts • name • device_id • data 3,000 device information thousands of records per sec SELECT readings.data FROM info JOIN readings ON info.device_id= readings.device_id WHERE info.manufacturer = 'iobeam' AND info.model = 'S7' Example: find sensor data for iobeam S7 machine
  27. 35 Types of joins • CrateDB supports cross, inner and

    outer joins • Allows efficient query over large dataset • Algorithms: nested loop and distributed hash join algorithm • Optimizations for distributed execution1 1https://archive.fosdem.org/2022/schedule/event/distributed_join_cratedb/
  28. 36 User-defined functions • For various analytics tasks, CrateDB supports

    user-defined functions • Example: filter all elements in an array within a given range CREATE OR REPLACE FUNCTION filter(ARRAY(INTEGER), INTEGER, INTEGER) RETURNS ARRAY(INTEGER) LANGUAGE JAVASCRIPT AS 'function array_filter(array_integer, min, max) { return array_integer.filter(el => el >= min && el <= max); }';
  29. 37 When to use CrateDB Time-series data Internet of things

    Fast queries and aggregations Full-text search SQL as a query language Geospatial data Logical replication Hybrid deployments High-availability and scalability Use cases not supported by CrateDB: Transactional workloads (OLTP) Highly hierarchical/normalized data models Foreign Keys
  30. 38 Takeaways • Huge interest on time-series data • High

    volumes on valuable data • Availability, semi-structured data and different data frequencies • CrateDB offers features for handling time-series data Start today with $200 free credits