
Data modelling, management and storage with Apache Cassandra and CFML

We are all more or less used to relational database systems like MySQL or SQL Server. Over the last 10 years or so, NoSQL databases like MongoDB have become more popular and have gained traction in certain tech stacks and for certain applications.

An underutilised approach to storing and managing information is the wide-column store, such as Apache Cassandra. This talk will introduce the concept of wide-column stores and the more general idea of column-oriented databases, and what they’re useful for in the first place.

From there we’re going to build a simple data model for a practical use case and look at how data modelling for column stores differs substantially from what we almost intuitively do for traditional relational database systems.

Cassandra is also distributed and highly fault-tolerant on commodity hardware, offers linear scalability, and comes with its own query language, CQL. We’re going to look into how this works with a ring-based cluster setup and quorum-based availability. Finally, we’ll put all of this together and look at how to use Cassandra with CFML.

At the end of the talk, people will have a better understanding of when, where and how it makes sense to use column stores in their application architectures.

This version of the talk was given at CFCamp Germany 2024 on June 14, 2024.

Kai Koenig

June 16, 2024

Transcript

  1. Hello, my name is Kai • Software Architect in the

    back end and mobile app space • Work interests: ◦ CFML, JVM tech, Python, Ruby, AWS and Google Cloud and managing *nix-based infrastructure platforms ◦ Android, Kotlin, Flutter and building SDKs
  2. Relational and NoSQL Databases • Follow a structured schema for

    data organization • Use SQL for querying and managing data • Ensure ACID properties for data integrity • Modelling relationships via normalisation Relational Databases • Offer more flexibility in schema design • Use varied data models like document, key-value, wide-column, or graph • Optimize for distributed scale and speed NoSQL Databases
  3. Introduction to Wide-Column Stores • Column-oriented database - one of

    many types of NoSQL databases • Key differences in data modeling and storage architecture compared to relational databases • 2-dimensional key-value store • Usually has tables, rows and columns - but: ◦ Name and format of columns varies from row to row ◦ Some databases store data internally by columns, using specific compression techniques
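The "2-dimensional key-value store" idea can be illustrated with a toy sketch (plain Python dicts, not Cassandra's actual storage engine; all row keys and column names below are made up):

```python
# A wide-column store as a two-dimensional key-value map: the outer key is
# the row key; each row holds its own set of named columns, and the set of
# columns may differ from row to row.
store = {
    "mag-1": {"name": "Example Monthly", "frequency": "monthly"},
    "mag-2": {"name": "Example Weekly", "publisher": "Example Press"},  # different columns
}

def get(row_key, column):
    """Look up a single cell by (row key, column name)."""
    return store.get(row_key, {}).get(column)

print(get("mag-1", "name"))      # Example Monthly
print(get("mag-2", "frequency")) # None - this row simply has no such column
```

The point of the sketch: unlike a relational table, a missing column is not a NULL cell in a fixed schema, it just isn't stored for that row.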
  4. Historical Background • Apache Cassandra was developed at Facebook to

    address the limitations of existing database systems. • It combines techniques from Amazon’s Dynamo for distributed storage and Google’s Bigtable for data storage and management. • Goal: system capable of handling large-scale data with high scalability and availability. Origin of Apache Cassandra
  5. Scaling out on commodity hardware allows for cost-effective expansion by

    adding more nodes, thus increasing throughput and storage capacity linearly. Global availability at low latency is achieved through replication and data distribution across multiple data centers. Design Objectives Full multi-master database replication ensures no single point of failure and enables data to be written to any node in the cluster. Flexible schema design supports dynamic data models, allowing new columns to be added without downtime and improving adaptability to changing requirements.
  6. Apache Cassandra has a distributed architecture that allows it to

    be highly fault-tolerant on commodity hardware. The linear scalability of Apache Cassandra lets it handle large amounts of data and traffic with ease. CQL is Cassandra's query language and offers a lot of similarity to SQL, making it easier for developers familiar with SQL to transition. Distributed Architecture Linear Scalability Cassandra Query Language (CQL) Apache Cassandra ensures data availability through its quorum-based system, even in the face of network partitions and individual or multi-node failures. Apache Cassandra: Key Features Quorum-based Availability
  7. Partition: The first part of a primary key, determining data

    placement in the cluster. Efficient queries include the partition key. Identifies where data is stored. Table: Represents a collection of partitions with a defined schema. Tables store rows and columns, and can add columns without downtime. Keyspace: Defines data replication and organizational structure. Each keyspace contains multiple tables and specifies replication strategies. Key Components and Terminology Row and Column: Rows are collections of columns identified by unique primary keys. Columns store individual pieces of data within rows.
  8. Cassandra Query Language (CQL) CQL resembles SQL in syntax, making

    it accessible for users familiar with relational databases. It simplifies data manipulation and querying in Cassandra. With CQL, users can define and modify keyspaces, tables, and columns. It supports creating, altering, and dropping schemas with simple commands. SQL-like nature Schema Creation and Updates Data Access CQL allows efficient data retrieval and updates. Users can insert, update, delete, and query data using familiar SQL-like commands.
  9. User-defined functions and aggregates enable custom computations within queries. User-defined

    types (UDTs) allow creation of custom data types to better model application data. Single partition lightweight transactions ensure atomicity with compare and set semantics. Advanced CQL Features Local secondary indices enhance query performance on non-primary key columns. Collection types such as sets, maps, and lists provide flexibility in data storage.
  10. Nodes in Cassandra use a consistent hashing ring for dataset

    partitioning, allowing for efficient distribution and scaling of data across the cluster. Cassandra employs a local persistence engine based on a Log-Structured Merge Tree (LSM) to manage storage on individual nodes. Cassandra handles request coordination over a partitioned dataset, ensuring that read and write requests are efficiently managed across multiple nodes. Local Persistence Request Coordination Ring Membership Dynamo Techniques in Cassandra
  11. Consistent Hashing Cassandra uses consistent hashing with a token ring

    to partition data across nodes. Each node is assigned one or more tokens, defining ownership of data. Unlike naive hashing, consistent hashing minimizes data redistribution when nodes are added or removed, ensuring balanced data distribution and efficient scaling. Token Ring Advantages Virtual Nodes (vnodes) Vnodes allow a single physical node to own multiple tokens, enhancing load balancing and fault tolerance by distributing the data more evenly across the cluster.
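The token ring described above can be sketched in a few lines of Python (node names, token values and the MD5-based hash are all illustrative stand-ins, not Cassandra's actual partitioner):

```python
import bisect
import hashlib

# Toy token ring: each node owns one token; a key is placed on the first
# node whose token is at or past the key's hash position, wrapping around.
RING_SIZE = 2 ** 32
tokens = sorted([(100, "node-a"), (2 ** 31, "node-b"), (3 * 2 ** 30, "node-c")])
token_values = [t for t, _ in tokens]

def key_token(key: str) -> int:
    """Hash a partition key onto the ring (MD5 purely for the demo)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % RING_SIZE

def owner(key: str) -> str:
    """Walk clockwise to the first token at or past the key's position."""
    idx = bisect.bisect_left(token_values, key_token(key)) % len(tokens)
    return tokens[idx][1]

print(owner("mag-1"))
```

Adding a node means inserting one more token into the sorted list: only keys in that node's new arc move, which is the "minimal redistribution" advantage over naive `hash(key) % node_count`. Vnodes are the same idea with many tokens per physical node.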
  12. Replication Strategies NetworkTopologyStrategy allows for replication factor specification per datacenter.

    It ensures replicas are placed on different racks to avoid single points of failure, enhancing fault tolerance. SimpleStrategy is used for single datacenter clusters. It replicates data uniformly across nodes without considering rack or datacenter topology, mainly for testing or small-scale deployments. NetworkTopologyStrategy SimpleStrategy Transient Replication Transient replication is an experimental feature that allows some replicas to store only unrepaired data, reducing storage overhead while maintaining availability. It is not yet recommended for production use.
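SimpleStrategy's placement rule is simple enough to sketch directly (node names are hypothetical; real placement starts from the token owner on the ring):

```python
# SimpleStrategy sketch: starting from the node that owns the token, walk
# the ring clockwise and take the next RF distinct nodes, ignoring racks
# and datacenters entirely.
ring = ["node-a", "node-b", "node-c", "node-d"]  # nodes in token order

def simple_strategy_replicas(owner_index: int, rf: int) -> list[str]:
    return [ring[(owner_index + i) % len(ring)] for i in range(rf)]

print(simple_strategy_replicas(2, 3))  # ['node-c', 'node-d', 'node-a']
```

NetworkTopologyStrategy follows the same clockwise walk but skips nodes until the rack and datacenter constraints are satisfied, which is why it is the one to use outside of tests.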
  13. Anti-Entropy Repair Read repair occurs during read operations when discrepancies

    are found between replicas. The coordinator node sends the latest data to out-of-date replicas to ensure eventual consistency. Read Repair Hinted Handoff Hinted handoff temporarily stores write operations for unreachable nodes. Once the node becomes available, the stored hints are forwarded to it, ensuring no data loss. Anti-entropy repair uses Merkle trees to compare data across replicas and identify differences. This process ensures all replicas eventually converge to the same state. Replica Synchronization and Repair
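The Merkle-tree comparison behind anti-entropy repair can be sketched with hashes over token ranges (a flat two-level "tree" here for brevity; the real trees are deeper and built per table):

```python
import hashlib

# Hash each token range on a replica, then combine into a root hash.
def range_hashes(data: dict) -> dict:
    return {k: hashlib.sha256(v.encode()).hexdigest() for k, v in sorted(data.items())}

def root(hashes: dict) -> str:
    return hashlib.sha256("".join(hashes.values()).encode()).hexdigest()

replica_a = {"range-1": "rows...", "range-2": "rows..."}
replica_b = {"range-1": "rows...", "range-2": "stale rows"}

ha, hb = range_hashes(replica_a), range_hashes(replica_b)
# If the roots match, the replicas agree and no data needs to move at all;
# otherwise drill into the per-range hashes to find what to stream.
if root(ha) != root(hb):
    out_of_sync = [k for k in ha if ha[k] != hb[k]]
    print(out_of_sync)  # ['range-2']
```

Only the mismatching ranges are streamed between replicas, which is what makes repair cheap when the replicas mostly agree.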
  14. Tunable Consistency Levels (selection) ONE: Only a single replica
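The consistency levels reduce to a count of required acknowledgements, which is easy to sketch for a single-datacenter keyspace (simplified: LOCAL_QUORUM counts only local replicas, which in one datacenter equals QUORUM):

```python
# Required acknowledgements per consistency level for replication factor rf.
def required_acks(level: str, rf: int) -> int:
    return {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,
        "LOCAL_QUORUM": rf // 2 + 1,  # single-datacenter simplification
        "ALL": rf,
    }[level]

print(required_acks("QUORUM", 3))  # 2
# Reads and writes overlap (R + W > RF) when both use QUORUM, which is
# why quorum reads see the latest quorum write:
print(required_acks("QUORUM", 3) * 2 > 3)  # True
```

This R + W > RF overlap is the arithmetic behind "balancing consistency and availability" in the slide above.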
  15. Gossip Protocol for cluster membership & bootstrap Cassandra uses a

    gossip protocol to manage cluster membership. Nodes exchange information about their own state and the state of other nodes they are aware of, ensuring all nodes have a consistent view of the cluster. Each node periodically selects a random node to exchange state information with. This includes details about node status, token ownership, and schema versions, using versioned vector clocks to track updates. Cluster Membership State Information Exchange Seed Nodes Seed nodes are used to bootstrap the cluster by providing initial contact points for new nodes. They become hotspots for gossip communication, ensuring new nodes can join the cluster by connecting to them.
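The state exchange can be sketched as a per-peer versioned merge (a deliberately minimal model: real gossip uses generation/version pairs and a three-phase syn/ack/ack2 exchange):

```python
# Gossip state merge sketch: each node's view maps peers to
# (version, status); on exchange, the higher version wins per peer.
def merge_views(local: dict, remote: dict) -> dict:
    merged = dict(local)
    for peer, (version, status) in remote.items():
        if peer not in merged or version > merged[peer][0]:
            merged[peer] = (version, status)
    return merged

view_a = {"node-b": (3, "UP"), "node-c": (1, "UP")}
view_b = {"node-b": (2, "UP"), "node-c": (4, "DOWN")}
print(merge_views(view_a, view_b))
# {'node-b': (3, 'UP'), 'node-c': (4, 'DOWN')}
```

Repeating this pairwise merge with random peers is what converges every node to a consistent view of the cluster.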
  16. Failure Detection Mechanism • Every node makes independent decisions about

    peer availability. • Uses heartbeat state to monitor node status in the cluster. • Convicts nodes if no heartbeat detected within a threshold time. Phi Accrual Failure Detector • UP state: Node is considered available. • DOWN state: Node is considered unavailable. • State decisions are local and based on successful network communication. UP and DOWN State
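The phi accrual idea can be sketched with a simplifying assumption: if heartbeat intervals were exponentially distributed with the observed mean, then phi is just the (log-scaled) improbability of the current silence. The real detector fits the observed inter-arrival distribution, so this is only the shape of the computation:

```python
import math

# phi = -log10(P(no heartbeat for this long)); under the exponential
# assumption that collapses to a linear function of the silence time.
def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
    return time_since_last_heartbeat / (mean_interval * math.log(10))

THRESHOLD = 8.0  # conviction threshold, configurable in practice

print(phi(1.0, 1.0) > THRESHOLD)   # False - one interval late, still UP
print(phi(30.0, 1.0) > THRESHOLD)  # True  - long silence, convict as DOWN
```

Because phi accrues continuously instead of flipping on a fixed timeout, each node can trade detection speed against false positives by tuning the threshold locally.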
  17. Data distribution using consistent hashing ensures balanced load and storage

    across all nodes, maintaining high performance. The system is designed to run efficiently on commodity hardware, avoiding the need for expensive specialized hardware. Cassandra allows linear scaling by simply adding more nodes to the cluster, enhancing both storage and computational capacity. Scaling Out on Commodity Hardware
  18. CAP Theorem and Cassandra • The CAP theorem states that

    a distributed database can only provide two out of the three guarantees: Consistency, Availability, and Partition Tolerance. • Cassandra prioritizes Availability and Partition Tolerance, ensuring that the system continues to operate even if some network messages are lost or delayed. • By compromising on immediate consistency, Cassandra achieves high availability and fault tolerance, making it suitable for applications that require continuous operation despite network issues. Cassandra’s Approach
  19. • Process of identifying entities and their relationships • Data

    stored in normalized tables with foreign keys • Queries driven by table structure and joins Data Modeling in Relational Databases • Data modeling based on application queries • Data structured to optimize query performance • Denormalization used to group related data in single tables Cassandra's Query-Driven Modeling What is Data Modeling?
  20. Query-Driven Modeling In Cassandra, the structure and organization of data

    are determined by the application queries. Data modeling starts with defining the queries first, and then designing tables to support those queries efficiently. Cassandra uses denormalization to optimize read performance. Data is duplicated across multiple tables, eliminating the need for joins and ensuring that queries can be executed swiftly by accessing a single table. Query-Driven Approach Denormalization No Joins Unlike relational databases, Cassandra does not support joins. All necessary data for a query must reside in a single table, which reduces complexity and enhances read efficiency.
  21. Partition keys help in distributing data evenly among nodes, improving

    load balancing and fault tolerance. Efficient partitioning reduces the number of nodes involved in a query, minimizing latency and improving read performance. Choosing the right primary key ensures even data distribution across the cluster, preventing hotspots. Goals of Data Modeling in Cassandra Optimal selection of keys enhances write throughput by avoiding contention and ensuring swift data access.
  22. Understanding Partitions • Partitions enable distributed data storage across multiple

    nodes, enhancing scalability and fault tolerance. • A partition key is used to determine the partition in which a row of data will be stored, ensuring even data distribution. • Consistent hashing is employed to map partition keys to nodes, allowing for quick data retrieval and balanced load distribution. • Each partition is stored on multiple nodes (replicas) to ensure data availability and redundancy. Role of Partitions in Cassandra
  23. Partition Key Examples In table t with primary key id,

    the partition key is generated from id for data distribution. In table t with composite primary key (id, c), the partition key is generated from id, and c is the clustering key. Simple Key Example Composite Key Examples
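The split described above can be sketched as a tiny helper (simplified: this takes only the first primary-key component as the partition key, ignoring composite partition keys like `((a, b), c)`):

```python
# PRIMARY KEY (id)    -> partition key = id, no clustering keys
# PRIMARY KEY (id, c) -> partition key = id, clustering key = c
# The partition key decides placement; clustering keys order rows
# within the partition.
def split_primary_key(primary_key: tuple) -> tuple:
    partition_key, clustering_keys = primary_key[0], primary_key[1:]
    return partition_key, clustering_keys

print(split_primary_key(("id",)))      # ('id', ())
print(split_primary_key(("id", "c")))  # ('id', ('c',))
```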
  24. Magazine Data Set Example (1) List all magazine names with

    their publication frequency. Query 1 Limited to columns needed and partition key is the id. Model for Query 1 We have a magazine dataset: Magazine ID Publication Frequency Publication Date Publisher
  25. Magazine Data Set Example (2) List all magazine names by

    publisher. Query 2 We have a magazine dataset: Magazine ID Publication Frequency Publication Date Publisher Additional column Publisher for partition key. Id becomes the clustering key for sorting in a partition. Model for Query 2
  26. Designing Schema Designing a schema in Cassandra starts with understanding

    the queries your application will perform. Identify the primary entities and their attributes, then design tables to efficiently support these queries. CREATE TABLE magazine_name ( id int PRIMARY KEY, name text, publicationFrequency text ) Schema Design Process Example Query Q1 Example Query Q2 CREATE TABLE magazine_publisher ( publisher text, id int, name text, publicationFrequency text, PRIMARY KEY (publisher, id) ) WITH CLUSTERING ORDER BY (id DESC)
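The application-side consequence of the two tables above is a dual write, sketched here with plain dicts standing in for the tables (sample values are made up):

```python
# Denormalisation in practice: since Cassandra has no joins, the
# application writes the same magazine into both query-specific tables.
magazine_name = {}       # partition key: id        -> serves Q1
magazine_publisher = {}  # partition key: publisher -> serves Q2

def insert_magazine(mag_id: int, name: str, frequency: str, publisher: str):
    magazine_name[mag_id] = {"name": name, "publicationFrequency": frequency}
    magazine_publisher.setdefault(publisher, {})[mag_id] = {
        "name": name, "publicationFrequency": frequency,
    }

insert_magazine(1, "Example Monthly", "monthly", "Example Press")
print(magazine_name[1]["name"])                  # Example Monthly
print(list(magazine_publisher["Example Press"])) # [1]
```

In real code both inserts would be CQL statements, ideally issued together in a logged batch so a partial failure doesn't leave the tables inconsistent.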
  27. Disk Space Monitor the number of values and disk space

    per partition. Aim for fewer than 100,000 values and 100MB per partition for optimal performance. Partition Size Data Redundancy Data Model Analysis To be expected, but identify and minimize duplicate data across tables. While necessary for denormalization, excessive redundancy can waste disk space and affect performance. Calculate and monitor disk space usage for each table. Efficiently managed disk space ensures better performance and allows for future scalability. Use Lightweight Transactions (LWT) sparingly. They can impact performance, so it's best to reserve them for critical consistency requirements. Lightweight Transactions
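The rule of thumb above translates directly into a health check one might run against partition statistics (the thresholds are the guidelines from the slide, not hard limits):

```python
# Flag partitions exceeding roughly 100,000 values or 100 MB, since
# oversized partitions hurt read performance and compaction.
MAX_VALUES = 100_000
MAX_BYTES = 100 * 1024 * 1024

def partition_healthy(value_count: int, size_bytes: int) -> bool:
    return value_count < MAX_VALUES and size_bytes < MAX_BYTES

print(partition_healthy(50_000, 10 * 1024 * 1024))   # True
print(partition_healthy(250_000, 10 * 1024 * 1024))  # False - too many values
```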
  28. Defining Application Queries (2) • Q1. Find hotels near a

    given point of interest. • Q2. Find information about a given hotel, such as its name and location. • Q3. Find points of interest near a given hotel. • Q4. Find an available room in a given date range. • Q5. Find the rate and amenities for a room. • Q6. Lookup a reservation by confirmation number. • Q7. Lookup a reservation by hotel, date, and guest name. • Q8. Lookup all reservations by guest name. • Q9. View guest details.
  29. Logical Data Modeling Identify the primary entity type and its

    attributes. Create tables for each query, capturing entities and relationships from the conceptual model. Design the primary key to ensure data is uniquely identifiable. Partition key columns group related data, and clustering columns support desired sort ordering. Creating Logical Models Primary Key Design Query Support Align table design with application queries. Ensure tables are structured to efficiently support the required queries, minimizing partitions read.
  30. Physical Data Modeling Transition from logical to physical data models

    involves defining specific data types for each column and considering how data will be stored on disk. Use CQL data types like text, int, and uuid. Collections and user-defined types (UDTs) can help manage complex data structures and reduce redundancy. From Logical to Physical Data Types User-Defined Types UDTs allow encapsulation of multiple fields into a single column. For example, an address type might include street, city, and zip code fields.
  31. • JDBC driver for Cassandra ◦ Usable in tooling like

    DataGrip ◦ Usable in any JVM-based platform, incl. ACF/Lucee and BoxLang ◦ Custom driver setup via admin tools or cfconfig etc • Caveats ◦ All communication has to be managed from within the query string ◦ Tags like <CFTRANSACTION> or <CFQUERYPARAM> don’t work as expected in this context ◦ Optional: Use Java PreparedStatements • Connection design ◦ Remember that every node can be connected to - what would that look like from an application server cluster? Application Use