Apache Cassandra: Introduction and Use Cases

This presentation on Apache Cassandra was given by Alex Volochev (Developer Advocate, DataStax) at NoSQL Day 2021.

Postgres Professional

May 20, 2021

Transcript

  1. Intro to Cassandra
      1. What is Cassandra and when you need it
      2. Downsides :(
      3. Use-Cases
  2. Apache Cassandra™ = NoSQL Distributed Database
    1 Installation = 1 NODE
      ✔ Capacity = ~2-4 TB
      ✔ Throughput = LOTS Tx/sec/core
    Communication:
      ✔ Gossiping
    DataCenter | Ring (diagram: seven NODEs arranged in a ring)
  3. Apache Cassandra™ = NoSQL Distributed Database
    - Big Data Ready (Elastic scaling)
    - Highest Availability (Masterless, replication)
    - Geographical Distribution (Native multi-datacenter deployments)
    - Read/Write Performance (Very fast, linear scaling)
    - Vendor Independent (Apache Software Foundation OSS Project)
  4. Data is Distributed
    Partition Key: Country

      Country  City         Population
      USA      New York     8.000.000
      USA      Los Angeles  4.000.000
      FR       Paris        2.230.000
      DE       Berlin       3.350.000
      UK       London       9.200.000
      AU       Sydney       4.900.000
      FR       Toulouse     1.100.000
      JP       Tokyo        37.430.000
      IN       Mumbai       20.200.000
      DE       Nuremberg    500.000
      CA       Montreal     4.200.000
      CA       Toronto      6.200.000
  5. Data is Distributed
    The same rows, grouped by partition key:

      Country  City         Population
      USA      New York     8.000.000
      USA      Los Angeles  4.000.000
      UK       London       9.200.000
      AU       Sydney       4.900.000
      JP       Tokyo        37.430.000
      IN       Mumbai       20.200.000
      DE       Berlin       3.350.000
      DE       Nuremberg    500.000
      CA       Montreal     4.200.000
      CA       Toronto      6.200.000
      FR       Toulouse     1.100.000
      FR       Paris        2.230.000
  6. Partitioning
    Partition Key: CO

      CO  City       Population
      AU  Sydney     4.900.000
      CA  Toronto    6.200.000
      CA  Montreal   4.200.000
      DE  Berlin     3.350.000
      DE  Nuremberg  500.000

    Partitioner: Murmur3 Hashing. Each partition key value becomes a token:

      CO  City       Population
      59  Sydney     4.900.000
      12  Toronto    6.200.000
      12  Montreal   4.200.000
      45  Berlin     3.350.000
      45  Nuremberg  500.000

    Tokens owned by the Cassandra Nodes: 1-25 | 26-50 | 51-75 | 76-100
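The partitioning step on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the slide names Murmur3, which is not in the Python standard library, so `hashlib.md5` stands in as the hash function. The 1-100 token space and the four token ranges come from the slide; the node names and the helpers `token_for` and `node_for` are hypothetical.

```python
import hashlib

# Token ranges from the slide: four nodes, each owning a quarter of a 1-100 token space.
# The node names are made up for this sketch.
TOKEN_RANGES = {
    "node1": range(1, 26),
    "node2": range(26, 51),
    "node3": range(51, 76),
    "node4": range(76, 101),
}


def token_for(partition_key):
    """Hash a partition key to a token in 1..100 (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 + 1


def node_for(token):
    """Find the node whose token range contains the given token."""
    for node, tokens in TOKEN_RANGES.items():
        if token in tokens:
            return node
    raise ValueError(f"token {token} is outside the 1-100 token space")


# Rows from the slide as (partition key, city).
rows = [("AU", "Sydney"), ("CA", "Toronto"), ("CA", "Montreal"),
        ("DE", "Berlin"), ("DE", "Nuremberg")]

# Only the partition key is hashed, so rows sharing a key land on the same node.
placement = {city: node_for(token_for(country)) for country, city in rows}
```

With the slide's own example tokens, `node_for(12)`, `node_for(45)` and `node_for(59)` fall into the first, second and third ranges respectively. The tokens this sketch computes will differ from the slide's, since the hash function is different.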
  7. Data is Replicated
    RF = 3: Replication Factor 3 means that every row is stored on 3 different nodes.
    Ring diagram, node tokens: 0, 17, 33, 50, 67, 83
  8. Replication within the Ring
    RF = 3 | node tokens: 0, 17, 33, 50, 67, 83 | 59 (data)
  9. Replication within the Ring
    RF = 3 | node tokens: 0, 17, 33, 50, 67, 83 | 59 (data)
  10. Replication within the Ring
    RF = 3 | node tokens: 0, 17, 33, 50, 67, 83 | 59 (data), 59 (data), 59 (data)
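One way to read these ring slides is as a replica-selection rule: the node that owns the token stores the first copy, and the next nodes clockwise store the rest. The sketch below assumes that simple clockwise placement and that a node owns the tokens up to and including its own. The node tokens and RF = 3 come from the slides, while `replicas_for` is a hypothetical helper and the slides do not say which three nodes actually receive token 59.

```python
from bisect import bisect_left

# Node positions on the ring, taken from the slides.
NODE_TOKENS = [0, 17, 33, 50, 67, 83]


def replicas_for(token, rf=3):
    """Return the tokens of the rf nodes that would store data with this token."""
    # The first node whose token is >= the data token owns it;
    # past the last node, the search wraps around to the start of the ring.
    start = bisect_left(NODE_TOKENS, token) % len(NODE_TOKENS)
    # The remaining copies go to the next nodes clockwise.
    return [NODE_TOKENS[(start + i) % len(NODE_TOKENS)] for i in range(rf)]


replicas_of_59 = replicas_for(59)
```

Under these assumptions the data with token 59 is stored on the nodes at 67, 83 and 0, which also shows the wrap-around at the end of the ring.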
  11. Node Failure
    RF = 3 | node tokens: 0, 17, 33, 50, 67, 83 | 59 (data), 59 (data), 59 (data) | Hint
  12. Node Failure
    RF = 3 | node tokens: 0, 17, 33, 50, 67, 83 | 59 (data), 59 (data), 59 (data) | Hint | Recovered
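The "Hint" and "Recovered" labels on these two slides describe hinted handoff: a write aimed at an unavailable replica is kept as a hint and replayed once the node is back. The following is a toy in-memory model of that idea, not Cassandra's actual mechanism. The `Coordinator` class and the replica names `n67`, `n83` and `n0` are made up for illustration.

```python
class Coordinator:
    """Toy coordinator that keeps hints for replicas that are down."""

    def __init__(self, replicas):
        self.storage = {name: {} for name in replicas}  # per-replica key/value data
        self.down = set()                               # replicas currently unavailable
        self.hints = []                                 # pending (replica, key, value)

    def write(self, key, value):
        """Write to every live replica and return the number of acknowledgements."""
        acks = 0
        for name in self.storage:
            if name in self.down:
                # Replica is unreachable: remember the write instead of losing it.
                self.hints.append((name, key, value))
            else:
                self.storage[name][key] = value
                acks += 1
        return acks

    def recover(self, name):
        """Mark a replica as up again and replay the hints stored for it."""
        self.down.discard(name)
        remaining = []
        for target, key, value in self.hints:
            if target == name:
                self.storage[name][key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining


coordinator = Coordinator(["n67", "n83", "n0"])
coordinator.down.add("n83")                 # node failure
acks = coordinator.write(59, "data")        # two replicas acknowledge, one hint is kept
hints_while_down = list(coordinator.hints)
coordinator.recover("n83")                  # node recovered, hint replayed
```

With RF = 3 and one node down, the write still collects two acknowledgements, which is enough for QUORUM.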
  13. CAP Theorem
    Consistency | Availability | Partition Tolerance
    In a distributed environment, you can guarantee only two of these three qualities :(
  14. Is Cassandra AP or CP?
    Cassandra is configurably consistent. At any moment, for any particular query, you can set the Consistency Level you require.
    Cassandra Consistency Levels:
      • ANY
      • ONE
      • TWO, THREE
      • QUORUM
      • ALL
  15. Is Cassandra AP or CP?
    Cassandra is configurably consistent. At any moment, for any particular query, you can set the Consistency Level you require.
    Cassandra Multi-DC Consistency Levels:
      • LOCAL_ONE
      • LOCAL_QUORUM
      • EACH_QUORUM
  16. Immediate Consistency – The Right Way
    Client: Write CL = QUORUM
    Client: Read CL = QUORUM
  17. Immediate Consistency – The Right Way
    CL Write + CL Read > RF ← Immediate Consistency
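The rule CL Write + CL Read > RF can be checked with simple arithmetic: if the replicas that acknowledged the write and the replicas consulted by the read add up to more than RF, the two sets must overlap in at least one node. A minimal sketch, assuming a single data center and a quorum of floor(RF / 2) + 1; the function names are hypothetical, and ANY is left out because it applies to writes only.

```python
def replicas_required(level, rf):
    """Number of replicas that must respond for a given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,  # a majority of the replicas
        "ALL": rf,
    }
    return levels[level]


def is_immediately_consistent(write_cl, read_cl, rf):
    """Apply the slide's rule: CL Write + CL Read > RF."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf


quorum_both = is_immediately_consistent("QUORUM", "QUORUM", rf=3)  # 2 + 2 > 3
one_both = is_immediately_consistent("ONE", "ONE", rf=3)           # 1 + 1 > 3 fails
```

With RF = 3, QUORUM on both sides satisfies the rule, ONE on both sides does not, and combinations such as writing at ALL and reading at ONE also pass at the cost of write availability.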
  18. Normalization
    “Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as part of his relational model.”
    PROS: Simple write, Data Integrity
    CONS: Slow read, Complex Queries

    Employees
      userId  firstName  lastName
      1       Edgar      Codd
      2       Raymond    Boyce

    Departments
      departmentId  department
      1             Engineering
      2             Math
  19. Denormalization
    “Denormalization is a strategy used on a database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data”
    PROS: Quick Read, Simple Queries
    CONS: Multiple Writes, Manual Integrity

    Employees
      userId  firstName  lastName  department
      1       Edgar      Codd      Engineering
      2       Raymond    Boyce     Math
      3       Sage       Lahja     Math
      4       Juniper    Jones     Botany
  20. Keyspaces
      CREATE KEYSPACE users
        WITH REPLICATION = {
          'class' : 'NetworkTopologyStrategy',
          'us-west-1' : 3,
          'eu-central-1' : 5
        };
    Callouts: keyspace (users), replication strategy (NetworkTopologyStrategy), replication factor by data center ('us-west-1' : 3, 'eu-central-1' : 5)
  21. Creating a Table in CQL
      CREATE TABLE killrvideo.users_by_city (
        city text,
        last_name text,
        first_name text,
        address text,
        email text,
        PRIMARY KEY ((city), last_name, first_name, email));
    Callouts: keyspace (killrvideo), table (users_by_city), column definitions, Primary key, Partition key (city), Clustering columns (last_name, first_name, email)
  22. Primary Key
      CREATE TABLE killrvideo.users_by_city (
        city text,
        last_name text,
        first_name text,
        address text,
        email text,
        PRIMARY KEY ((city), last_name, first_name, email));
    Callouts: Partition key, Clustering columns
    An identifier for a row. Consists of at least one Partition Key and zero or more Clustering Columns.
    MUST ENSURE UNIQUENESS. MAY DEFINE SORTING.
    Good Examples:
      PRIMARY KEY ((city), last_name, first_name, email);
      PRIMARY KEY (user_id);
    Bad Example:
      PRIMARY KEY ((city), last_name, first_name);
  23. Partition Key
      CREATE TABLE killrvideo.users_by_city (
        city text,
        last_name text,
        first_name text,
        address text,
        email text,
        PRIMARY KEY ((city), last_name, first_name, email));
    Callouts: Partition key, Clustering columns
    An identifier for a partition. Consists of at least one column, may have more if needed.
    PARTITIONS ROWS.
    Good Examples:
      PRIMARY KEY (user_id);
      PRIMARY KEY ((video_id), comment_id);
    Bad Example:
      PRIMARY KEY ((sensor_id), logged_at);
  24. Clustering Column(s)
      CREATE TABLE killrvideo.users_by_city (
        city text,
        last_name text,
        first_name text,
        address text,
        email text,
        PRIMARY KEY ((city), last_name, first_name, email));
    Callouts: Partition key, Clustering columns
    Used to ensure uniqueness and sorting order. Optional.
    Not Unique:
      PRIMARY KEY ((city), last_name, first_name);
      PRIMARY KEY ((city), last_name, first_name, email);
    Not Sorted:
      PRIMARY KEY ((video_id), comment_id);
      PRIMARY KEY ((video_id), created_at, comment_id);
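Why a key without `email` counts as not unique can be shown with a small model. In Cassandra an INSERT behaves as an upsert, so two rows with the same primary key collapse into one. The sketch below imitates that with a Python dict keyed by the primary key columns; the `insert` helper and the sample users are made up for illustration.

```python
def insert(table, row, primary_key):
    """Upsert a row: a later row with the same primary key replaces the earlier one."""
    key = tuple(row[column] for column in primary_key)
    table[key] = row


# Two different people who share a city and a full name (invented sample data).
rows = [
    {"city": "Paris", "last_name": "Martin", "first_name": "Anna",
     "email": "anna.one@example.com"},
    {"city": "Paris", "last_name": "Martin", "first_name": "Anna",
     "email": "anna.two@example.com"},
]

without_email = {}  # models PRIMARY KEY ((city), last_name, first_name)
with_email = {}     # models PRIMARY KEY ((city), last_name, first_name, email)
for row in rows:
    insert(without_email, row, ("city", "last_name", "first_name"))
    insert(with_email, row, ("city", "last_name", "first_name", "email"))

rows_kept_without_email = len(without_email)
rows_kept_with_email = len(with_email)
```

The shorter key keeps only the second user, silently overwriting the first, while the key that ends in `email` keeps both.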
  25. Rules of a Good Partition
    • Store together what you retrieve together
    • Avoid big partitions
    • Avoid hot partitions
    Example: open a video? Get the comments in a single query!
      PRIMARY KEY ((video_id), created_at, comment_id);
      PRIMARY KEY ((comment_id), created_at);
    The Slide of the Year Award!
  26. Rules of a Good Partition
    • Store together what you retrieve together
    • Avoid big partitions
    • Avoid hot partitions
    Partition size guidance:
      • Up to 2 billion cells per partition
      • Up to ~100k rows in a partition
      • Up to ~100MB in a Partition
    Examples:
      PRIMARY KEY ((video_id), created_at, comment_id);
      PRIMARY KEY ((country), user_id);
    The Slide of the Year Award!
  27. Rules of a Good Partition
    • Store together what you retrieve together
    • Avoid big and constantly growing partitions!
    • Avoid hot partitions
    Example: a huge IoT infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds. Every sensor reports its UUID, the timestamp of the report, and the sensor’s value.
      • Sensor ID: UUID
      • Timestamp: Timestamp
      • Value: float
      PRIMARY KEY ((sensor_id), reported_at);
    The Slide of the Year Award!
  28. Rules of a Good Partition
    • Store together what you retrieve together
    • Avoid big and constantly growing partitions!
    • Avoid hot partitions
    Example: a huge IoT infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds. Every sensor reports its UUID, the timestamp of the report, and the sensor’s value.
      • Sensor ID: UUID
      • MonthYear: Integer or String
      • Timestamp: Timestamp
      • Value: float
      PRIMARY KEY ((sensor_id), reported_at);
    BUCKETING:
      PRIMARY KEY ((sensor_id, month_year), reported_at);
    The Slide of the Year Award!
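The bucketing idea can be sketched as a function that derives the `month_year` component from the report timestamp, together with the row-count arithmetic for a sensor reporting every 10 seconds. The "YYYY-MM" string format and the helper names `month_bucket` and `partition_key` are assumptions for this sketch; the slide only says the bucket is an integer or a string.

```python
from datetime import datetime, timezone

REPORT_INTERVAL_SECONDS = 10  # from the slide: one report every 10 seconds


def month_bucket(reported_at):
    """Derive the month_year bucket from a timestamp (assumed format: YYYY-MM)."""
    return reported_at.strftime("%Y-%m")


def partition_key(sensor_id, reported_at):
    """Composite partition key (sensor_id, month_year) as on the bucketing slide."""
    return (sensor_id, month_bucket(reported_at))


# Rows written by a single sensor over different periods.
rows_per_day = 24 * 60 * 60 // REPORT_INTERVAL_SECONDS  # 8,640
rows_per_31_day_month = rows_per_day * 31               # 267,840
rows_per_year = rows_per_day * 365                      # 3,153,600

may_key = partition_key("sensor-a", datetime(2021, 5, 20, tzinfo=timezone.utc))
june_key = partition_key("sensor-a", datetime(2021, 6, 1, tzinfo=timezone.utc))
```

Without bucketing, a sensor's partition grows by about 3.15 million rows per year and never stops growing. With a monthly bucket each partition stops growing when the month ends. By this arithmetic a monthly partition still holds roughly 268k rows, above the ~100k rows suggested on slide 26, so a smaller bucket such as a day may be worth considering at this reporting rate. That is a reading of the numbers, not a recommendation made in the talk.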
  29. Rules of a Good Partition
    • Store together what you retrieve together
    • Avoid big partitions
    • Avoid hot partitions
    Examples:
      PRIMARY KEY ((video_id), created_at, comment_id);
      PRIMARY KEY (user_id);
      PRIMARY KEY ((country), user_id);
    The Slide of the Year Award!
  30. Intro to Cassandra
      1. What is Cassandra and when you need it
      2. Downsides :(
      3. Use-Cases
  31. OLTP / OLAP
    OLTP: OnLine Transaction Processing
      - Need answers NOW
      - Simple Queries
      - Queries don’t change often
    OLAP: OnLine Analytical Processing
      - Answers can wait
      - Complex Queries
      - Queries tend to change (Adhoc)
    Diagram label: RDBMS
  32. Downsides
    1. Search Capabilities :(
       Partition key is required, no search on Data Columns, limited aggregations
    2. Complex Management
       It's easy to shoot yourself in the foot
    3. Lack of Cassandra-enabled developers
       Companies like Apple or Netflix are vacuuming the job market
  33. Solutions
    1. Teamwork (partner up with Apache Spark, Elasticsearch etc.)
    2. Serverless DataStax Astra (astra.datastax.com)
    3. DataStax Certification Program (academy.datastax.com)
  34. Intro to Cassandra
      1. What is Cassandra and when you need it
      2. Downsides :(
      3. Use-Cases
  35. Understanding Use Cases
    Labels on the slide, in the order extracted:
      High Throughput, High Volume, Heavy Writes, Heavy Reads, Event Streaming, Log Analytics, Internet of Things, Other Time Series,
      Mission-Critical, No Data Loss, Always-on,
      Scalability, Availability, Distributed, Cloud-native,
      Banking, Pricing, Market Data, Inventory, Global Retail, Tracking / Logistics, Customer Experience,
      API Layer, Hybrid-cloud, Enterprise Data Layer, Multi-cloud,
      Modern Cloud Applications, Global Presence, Workload Mobility, Compliance / GDPR