Slide 1

Slide 1 text

Intro to Apache Cassandra: Decentralised NoSQL Database for Big Data

Slide 2

Slide 2 text

Intro to Cassandra
1. What is Cassandra and when you need it
2. Downsides :(
3. Use-Cases

Slide 3

Slide 3 text

Apache Cassandra™ = NoSQL Distributed Database
[Diagram: seven nodes (NODE), labeled DataCenter | Ring]
1 Installation = 1 NODE
✔ Capacity = ~2-4 TB
✔ Throughput = LOTS Tx/sec/core
Communication:
✔ Gossiping

Slide 4

Slide 4 text

Apache Cassandra™ = NoSQL Distributed Database
- Big Data Ready (Elastic scaling)
- Highest Availability (Masterless, replication)
- Geographical Distribution (Native multi-datacenter deployments)
- Read/Write Performance (Very fast, linear scaling)
- Vendor Independent (Apache Software Foundation OSS Project)

Slide 5

Slide 5 text

Vertical Scalability?
Vertical scaling requires one large, expensive machine
+ Easier to get
- Expensive
- Still a SPoF (single point of failure)

Slide 6

Slide 6 text

Horizontal Scalability!
Horizontal scaling requires multiple less expensive commodity machines
+ Cheaper
+ Removes the SPoF (single point of failure)
- Logistically harder

Slide 7

Slide 7 text

Cassandra Scales Horizontally
100,000 transactions/second
200,000 transactions/second
400,000 transactions/second

Slide 8

Slide 8 text

Cassandra Scales Linearly
Recommended reading: netflixtechblog.com/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e

Slide 9

Slide 9 text

Data is Distributed
Partition Key: Country
Country  City         Population
USA      New York     8.000.000
USA      Los Angeles  4.000.000
FR       Paris        2.230.000
DE       Berlin       3.350.000
UK       London       9.200.000
AU       Sydney       4.900.000
FR       Toulouse     1.100.000
JP       Tokyo        37.430.000
IN       Mumbai       20.200.000
DE       Nuremberg    500.000
CA       Montreal     4.200.000
CA       Toronto      6.200.000

Slide 10

Slide 10 text

Data is Distributed
Country  City         Population
USA      New York     8.000.000
USA      Los Angeles  4.000.000
UK       London       9.200.000
AU       Sydney       4.900.000
JP       Tokyo        37.430.000
IN       Mumbai       20.200.000
DE       Berlin       3.350.000
DE       Nuremberg    500.000
CA       Montreal     4.200.000
CA       Toronto      6.200.000
FR       Toulouse     1.100.000
FR       Paris        2.230.000

Slide 11

Slide 11 text

Partitioning
Partition Key: CO
CO  City       Population
AU  Sydney     4.900.000
CA  Toronto    6.200.000
CA  Montreal   4.200.000
DE  Berlin     3.350.000
DE  Nuremberg  500.000
Partitioner (Murmur3 Hashing) replaces each partition key with a token:
CO  City       Population
59  Sydney     4.900.000
12  Toronto    6.200.000
12  Montreal   4.200.000
45  Berlin     3.350.000
45  Nuremberg  500.000
Tokens owned by the Cassandra Nodes: 1-25, 26-50, 51-75, 76-100
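The partitioning step can be sketched in Python as follows. This is a toy under stated assumptions: SHA-256 from the standard library stands in for Murmur3, tokens are squeezed into 1-100 to match the slide (real Cassandra tokens span a 64-bit range), and the node names and the `token_for` / `node_for` helpers are hypothetical.

```python
import hashlib

# Token ranges per node, taken from the slide (1-25, 26-50, 51-75, 76-100).
# The node names are made up for illustration.
NODE_RANGES = {
    "node-a": range(1, 26),
    "node-b": range(26, 51),
    "node-c": range(51, 76),
    "node-d": range(76, 101),
}


def token_for(partition_key: str) -> int:
    """Map a partition key to a toy token in 1..100 (SHA-256 stands in for Murmur3)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 + 1


def node_for(token: int) -> str:
    """Return the node whose token range contains the given token."""
    for node, tokens in NODE_RANGES.items():
        if token in tokens:
            return node
    raise ValueError(f"token {token} is outside 1..100")


if __name__ == "__main__":
    for country in ["AU", "CA", "DE"]:
        token = token_for(country)
        print(country, token, node_for(token))
```

The same key always hashes to the same token, so all rows of one partition land on the same node. The tokens this toy produces will not match the 59, 12 and 45 shown on the slide.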

Slide 12

Slide 12 text

Data is Replicated
RF = 3
Replication Factor 3 means that every row is stored on 3 different nodes
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83]

Slide 13

Slide 13 text

Replication within the Ring
RF = 3
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83; one row with token 59 (data)]

Slide 14

Slide 14 text

Replication within the Ring
RF = 3
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83; one row with token 59 (data)]

Slide 15

Slide 15 text

Replication within the Ring
RF = 3
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83; three copies of the row with token 59 (data)]
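A minimal Python sketch of how the three replicas for token 59 could be chosen on this ring. It assumes the convention that each node owns the token range ending at its own token, and that the remaining replicas are simply the next nodes clockwise (the behaviour of SimpleStrategy). NetworkTopologyStrategy additionally takes racks and datacenters into account. The `replicas_for` helper is hypothetical.

```python
from bisect import bisect_left

# Node tokens from the ring on the slides, in sorted order.
RING = [0, 17, 33, 50, 67, 83]


def replicas_for(token: int, ring: list[int], rf: int) -> list[int]:
    """Return the tokens of the rf nodes that store a row with this token."""
    if rf > len(ring):
        raise ValueError("replication factor exceeds the number of nodes")
    # First node whose token is >= the row token; wrap around past the end.
    start = bisect_left(ring, token) % len(ring)
    # Walk clockwise from the owner to collect rf distinct nodes.
    return [ring[(start + i) % len(ring)] for i in range(rf)]


if __name__ == "__main__":
    print(replicas_for(59, RING, rf=3))
```

Under these assumptions the row with token 59 is stored on the nodes at tokens 67, 83 and 0. The slide's diagram may highlight different nodes if it uses another ownership convention.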

Slide 16

Slide 16 text

Node Failure
RF = 3
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83; three copies of the row with token 59 (data); label: Hint]

Slide 17

Slide 17 text

Node Failure Recovered
RF = 3
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83; three copies of the row with token 59 (data); label: Hint]
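The hint mechanism can be illustrated with a toy in-memory model in Python. This is only a sketch of the idea: writes aimed at a node that is down are remembered as hints and replayed when the node comes back. The `ToyCluster` class and its methods are hypothetical. Real Cassandra keeps hints on the coordinator for a limited time window and has further repair mechanisms that are not modelled here.

```python
class ToyCluster:
    """In-memory stand-in for a ring of nodes with hinted handoff."""

    def __init__(self, nodes):
        self.data = {node: {} for node in nodes}  # node -> {key: value}
        self.up = set(nodes)                      # nodes currently reachable
        self.hints = []                           # (target_node, key, value)

    def fail(self, node):
        """Mark a node as down."""
        self.up.discard(node)

    def write(self, replicas, key, value):
        """Write to every replica; store a hint for each replica that is down."""
        acks = 0
        for node in replicas:
            if node in self.up:
                self.data[node][key] = value
                acks += 1
            else:
                self.hints.append((node, key, value))
        return acks

    def recover(self, node):
        """Bring a node back and replay the hints that were stored for it."""
        self.up.add(node)
        remaining = []
        for target, key, value in self.hints:
            if target == node:
                self.data[node][key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining


if __name__ == "__main__":
    cluster = ToyCluster([0, 17, 33, 50, 67, 83])
    cluster.fail(83)
    print(cluster.write([67, 83, 0], key=59, value="data"))  # acknowledged writes
    cluster.recover(83)
    print(cluster.data[83])
```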

Slide 18

Slide 18 text

CAP Theorem
- Consistency
- Availability
- Partition Tolerance
In a distributed environment, you can guarantee only two of these three qualities :(

Slide 19

Slide 19 text

Is Cassandra AP or CP?
Cassandra is configurably consistent. At any moment, for any particular query, you can set the Consistency Level you require.
Cassandra Consistency Levels:
● ANY
● ONE
● TWO, THREE
● QUORUM
● ALL
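As a rough illustration, each consistency level translates into a number of replica acknowledgements for a given replication factor. The Python sketch below encodes that mapping. The `required_acks` helper is hypothetical, and ANY is simplified to 1 even though for writes a stored hint is enough to satisfy it.

```python
def required_acks(consistency_level: str, rf: int) -> int:
    """Return how many replicas must acknowledge a request at this level."""
    levels = {
        "ANY": 1,              # simplification: a stored hint also counts for writes
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,  # strict majority of the replicas
        "ALL": rf,
    }
    try:
        needed = levels[consistency_level.upper()]
    except KeyError:
        raise ValueError(f"unknown consistency level: {consistency_level}") from None
    if needed > rf:
        raise ValueError(f"{consistency_level} needs {needed} replicas, RF is {rf}")
    return needed


if __name__ == "__main__":
    for level in ["ONE", "QUORUM", "ALL"]:
        print(level, required_acks(level, rf=3))
```

With RF = 3, QUORUM needs 2 acknowledgements, so one replica can be down without failing the request. ALL needs all 3.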

Slide 20

Slide 20 text

Is Cassandra AP or CP?
Cassandra is configurably consistent. At any moment, for any particular query, you can set the Consistency Level you require.
Cassandra Multi-DC Consistency Levels:
● LOCAL_ONE
● LOCAL_QUORUM
● EACH_QUORUM
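A small Python sketch of how the multi-datacenter levels differ in the number of replicas involved. The helper names are hypothetical, and the per-datacenter replication factors are illustrative (they mirror the keyspace example shown later in the deck).

```python
def quorum(rf: int) -> int:
    """Strict majority of rf replicas."""
    return rf // 2 + 1


def local_quorum(dc_rf: dict[str, int], local_dc: str) -> int:
    """LOCAL_QUORUM: a majority of the replicas in the coordinator's datacenter."""
    return quorum(dc_rf[local_dc])


def each_quorum(dc_rf: dict[str, int]) -> dict[str, int]:
    """EACH_QUORUM: a majority of the replicas in every datacenter."""
    return {dc: quorum(rf) for dc, rf in dc_rf.items()}


def global_quorum(dc_rf: dict[str, int]) -> int:
    """QUORUM: a majority of all replicas, counted across datacenters."""
    return quorum(sum(dc_rf.values()))


if __name__ == "__main__":
    replication = {"us-west-1": 3, "eu-central-1": 5}  # illustrative values
    print(local_quorum(replication, "us-west-1"))
    print(each_quorum(replication))
    print(global_quorum(replication))
```

LOCAL_ONE is the simplest case: a single replica in the local datacenter. The local levels avoid waiting on cross-datacenter round trips, at the cost of weaker guarantees for clients reading in another datacenter.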

Slide 21

Slide 21 text

Consistency Level One
Client Write
CL = ONE
RF = 3

Slide 22

Slide 22 text

Consistency Level Quorum
Client Write
CL = QUORUM
RF = 3

Slide 23

Slide 23 text

Consistency Level ALL
Client Write
CL = ALL
RF = 3

Slide 25

Slide 25 text

Immediate Consistency – The Right Way
Client Write: CL = QUORUM
Client Read: CL = QUORUM

Slide 26

Slide 26 text

Immediate Consistency – The Right Way
Immediate Consistency: CL Write + CL Read > RF
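The rule can be checked by brute force: if the number of replicas written plus the number of replicas read exceeds RF, every possible read set shares at least one replica with every possible write set, so a read always touches the latest write. The Python sketch below enumerates all combinations. The `always_overlap` helper is hypothetical and models replicas as plain integers.

```python
from itertools import combinations


def always_overlap(write_acks: int, read_acks: int, rf: int) -> bool:
    """True if every write set of replicas intersects every read set."""
    replicas = range(rf)
    return all(
        set(written) & set(read)  # empty intersection means a stale read is possible
        for written in combinations(replicas, write_acks)
        for read in combinations(replicas, read_acks)
    )


if __name__ == "__main__":
    rf = 3
    for w in range(1, rf + 1):
        for r in range(1, rf + 1):
            print(f"W={w} R={r} W+R>RF={w + r > rf} overlap={always_overlap(w, r, rf)}")
```

For RF = 3, QUORUM writes with QUORUM reads (2 + 2 > 3) always overlap, while ONE with ONE (1 + 1) does not.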

Slide 27

Slide 27 text

Data Distributed Everywhere
• Geographic Distribution
• Hybrid-Cloud and Multi-Cloud
[Diagram label: On-premise]

Slide 28

Slide 28 text

Normalization
“Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as part of his relational model.”
PROS: Simple write, Data Integrity
CONS: Slow read, Complex Queries
Employees
userId  firstName  lastName
1       Edgar      Codd
2       Raymond    Boyce
Departments
departmentId  department
1             Engineering
2             Math

Slide 29

Slide 29 text

Denormalization
“Denormalization is a strategy used on a database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data”
PROS: Quick Read, Simple Queries
CONS: Multiple Writes, Manual Integrity
Employees
userId  firstName  lastName  department
1       Edgar      Codd      Engineering
2       Raymond    Boyce     Math
3       Sage       Lahja     Math
4       Juniper    Jones     Botany

Slide 30

Slide 30 text

Keyspaces
    CREATE KEYSPACE users
      WITH REPLICATION = {
        'class' : 'NetworkTopologyStrategy',
        'us-west-1' : 3,
        'eu-central-1' : 5
      };
Callouts: keyspace (users), replication strategy (NetworkTopologyStrategy), replication factor by data center ('us-west-1' : 3, 'eu-central-1' : 5)

Slide 31

Slide 31 text

Creating a Table in CQL
    CREATE TABLE killrvideo.users_by_city (
      city text,
      last_name text,
      first_name text,
      address text,
      email text,
      PRIMARY KEY ((city), last_name, first_name, email)
    );
Callouts: keyspace (killrvideo), table (users_by_city), column definitions, primary key, partition key (city), clustering columns (last_name, first_name, email)

Slide 32

Slide 32 text

Primary Key
    CREATE TABLE killrvideo.users_by_city (
      city text,
      last_name text,
      first_name text,
      address text,
      email text,
      PRIMARY KEY ((city), last_name, first_name, email)
    );
Callouts: partition key (city), clustering columns (last_name, first_name, email)
An identifier for a row. Consists of at least one Partition Key and zero or more Clustering Columns.
MUST ENSURE UNIQUENESS. MAY DEFINE SORTING.
Good Examples:
    PRIMARY KEY ((city), last_name, first_name, email);
    PRIMARY KEY (user_id);
Bad Example:
    PRIMARY KEY ((city), last_name, first_name);
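Why uniqueness matters: in Cassandra, writing a row whose primary key already exists overwrites the existing row instead of raising an error. The Python sketch below models a table as a dictionary keyed by the primary-key columns to show the effect. The `insert` and `count_rows` helpers and the sample users are made up for illustration.

```python
def insert(table: dict, primary_key_columns: tuple, row: dict) -> None:
    """Store a row under its primary key; an existing key is silently overwritten."""
    key = tuple(row[column] for column in primary_key_columns)
    table[key] = row


def count_rows(primary_key_columns: tuple, rows: list) -> int:
    """Insert all rows into an empty table and report how many survive."""
    table = {}
    for row in rows:
        insert(table, primary_key_columns, row)
    return len(table)


# Two different people who share city, last name and first name (made-up data).
USERS = [
    {"city": "Paris", "last_name": "Martin", "first_name": "Anne",
     "email": "anne.martin@example.com"},
    {"city": "Paris", "last_name": "Martin", "first_name": "Anne",
     "email": "a.martin@example.org"},
]

if __name__ == "__main__":
    print(count_rows(("city", "last_name", "first_name"), USERS))
    print(count_rows(("city", "last_name", "first_name", "email"), USERS))
```

With `(city, last_name, first_name)` as the key, the second user replaces the first and only one row remains. Adding `email` keeps both.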

Slide 33

Slide 33 text

Partition Key
    CREATE TABLE killrvideo.users_by_city (
      city text,
      last_name text,
      first_name text,
      address text,
      email text,
      PRIMARY KEY ((city), last_name, first_name, email)
    );
Callouts: partition key (city), clustering columns (last_name, first_name, email)
An identifier for a partition. Consists of at least one column, may have more if needed.
PARTITIONS ROWS.
Good Examples:
    PRIMARY KEY (user_id);
    PRIMARY KEY ((video_id), comment_id);
Bad Example:
    PRIMARY KEY ((sensor_id), logged_at);

Slide 34

Slide 34 text

Clustering Column(s)
    CREATE TABLE killrvideo.users_by_city (
      city text,
      last_name text,
      first_name text,
      address text,
      email text,
      PRIMARY KEY ((city), last_name, first_name, email)
    );
Callouts: partition key (city), clustering columns (last_name, first_name, email)
Used to ensure uniqueness and sorting order. Optional.
Not Unique:
    PRIMARY KEY ((city), last_name, first_name);
  Better:
    PRIMARY KEY ((city), last_name, first_name, email);
Not Sorted:
    PRIMARY KEY ((video_id), comment_id);
  Better:
    PRIMARY KEY ((video_id), created_at, comment_id);
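The sorting role of clustering columns can be sketched in Python: rows are grouped by partition key and, inside each partition, ordered by the clustering columns in the order they are declared. The `build_partitions` helper and the sample comments are hypothetical, and only the default ascending order is modelled.

```python
from collections import defaultdict


def build_partitions(rows, partition_key, clustering_columns):
    """Group rows by partition key and sort each partition by its clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for partition in partitions.values():
        partition.sort(key=lambda r: tuple(r[c] for c in clustering_columns))
    return dict(partitions)


# Made-up comments on one video; comment ids are not chronological.
COMMENTS = [
    {"video_id": "v1", "created_at": "2024-01-03", "comment_id": "a"},
    {"video_id": "v1", "created_at": "2024-01-01", "comment_id": "c"},
    {"video_id": "v1", "created_at": "2024-01-02", "comment_id": "b"},
]

if __name__ == "__main__":
    by_id = build_partitions(COMMENTS, "video_id", ("comment_id",))
    by_time = build_partitions(COMMENTS, "video_id", ("created_at", "comment_id"))
    print([c["created_at"] for c in by_id["v1"]])
    print([c["created_at"] for c in by_time["v1"]])
```

With `comment_id` as the only clustering column, the comments come back in id order, which says nothing about time. Putting `created_at` first returns them chronologically, and `comment_id` still guarantees uniqueness.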

Slide 35

Slide 35 text

Rules of a Good Partition
The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
Example: open a video? Get the comments in a single query!
    PRIMARY KEY ((video_id), created_at, comment_id);
    PRIMARY KEY ((comment_id), created_at);

Slide 36

Slide 36 text

Rules of a Good Partition
The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
Partition size limits:
● Up to 2 billion cells per partition
● Up to ~100k rows in a partition
● Up to ~100 MB in a partition
    PRIMARY KEY ((video_id), created_at, comment_id);
    PRIMARY KEY ((country), user_id);

Slide 37

Slide 37 text

Rules of a Good Partition
The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big and constantly growing partitions!
● Avoid hot partitions
Example: a huge IoT infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds. Every sensor reports its UUID, the timestamp of the report, and the sensor's value.
● Sensor ID: UUID
● Timestamp: Timestamp
● Value: float
    PRIMARY KEY ((sensor_id), reported_at);
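A quick back-of-the-envelope calculation in Python shows why a partition keyed only by `sensor_id` keeps growing. It takes the 10-second reporting interval from the example and the ~100k rows guideline from the earlier slide. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
ROW_GUIDELINE = 100_000  # "~100k rows in a partition" from the earlier slide


def rows_per_day(report_interval_seconds: int) -> int:
    """Rows one sensor adds to its partition per day."""
    return SECONDS_PER_DAY // report_interval_seconds


def days_until_row_guideline(report_interval_seconds: int) -> float:
    """Days until a single sensor's partition reaches the row guideline."""
    return ROW_GUIDELINE / rows_per_day(report_interval_seconds)


if __name__ == "__main__":
    per_day = rows_per_day(10)
    print(per_day)                                   # rows per sensor per day
    print(per_day * 365)                             # rows per sensor per year
    print(round(days_until_row_guideline(10), 1))    # days to reach ~100k rows
```

At one report every 10 seconds a sensor writes 8,640 rows per day, so its partition passes ~100k rows in under 12 days and never stops growing.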

Slide 38

Slide 38 text

Rules of a Good Partition
The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big and constantly growing partitions!
● Avoid hot partitions
Example: a huge IoT infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds. Every sensor reports its UUID, the timestamp of the report, and the sensor's value.
    PRIMARY KEY ((sensor_id), reported_at);
BUCKETING
    PRIMARY KEY ((sensor_id, month_year), reported_at);
● Sensor ID: UUID
● MonthYear: Integer or String
● Timestamp: Timestamp
● Value: float
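On the application side, bucketing means deriving the extra partition-key column from the timestamp before every write and read. Below is a minimal Python sketch, assuming the `month_year` bucket is stored as a `YYYY-MM` string computed in UTC. The slide allows an integer or a string, so this format and the helper names are assumptions.

```python
from datetime import datetime, timezone
from uuid import UUID


def month_bucket(reported_at: datetime) -> str:
    """Return the month bucket (YYYY-MM, in UTC) for a timezone-aware timestamp."""
    if reported_at.tzinfo is None:
        raise ValueError("reported_at must be timezone-aware")
    return reported_at.astimezone(timezone.utc).strftime("%Y-%m")


def partition_key(sensor_id: UUID, reported_at: datetime) -> tuple[UUID, str]:
    """Build the composite partition key (sensor_id, month_year)."""
    return (sensor_id, month_bucket(reported_at))


if __name__ == "__main__":
    sensor = UUID("12345678-1234-5678-1234-567812345678")  # made-up sensor id
    print(partition_key(sensor, datetime(2024, 1, 31, 23, 59, 50, tzinfo=timezone.utc)))
    print(partition_key(sensor, datetime(2024, 2, 1, 0, 0, 0, tzinfo=timezone.utc)))
```

Each sensor now starts a fresh partition every month, and a query spanning several months has to read one partition per bucket. The bucket size is a trade-off: at one report every 10 seconds a monthly bucket still holds up to 8,640 × 31 = 267,840 rows, so a shorter bucket may be needed to stay near the ~100k rows guideline.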

Slide 39

Slide 39 text

Rules of a Good Partition
The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
    PRIMARY KEY ((video_id), created_at, comment_id);
    PRIMARY KEY (user_id);
    PRIMARY KEY ((country), user_id);

Slide 40

Slide 40 text

Intro to Cassandra
1. What is Cassandra and when you need it
2. Downsides :(
3. Use-Cases

Slide 41

Slide 41 text

OLTP / OLAP
● OLTP: OnLine Transaction Processing
  - Need answers NOW
  - Simple Queries
  - Queries don’t change often
● OLAP: OnLine Analytical Processing
  - Answers can wait
  - Complex Queries
  - Queries tend to change (ad hoc)
[Diagram label: RDBMS]

Slide 42

Slide 42 text

Downsides
1. Search Capabilities :(
   Partition key is required, no search on Data Columns, limited aggregations
2. Complex Management
   It's easy to shoot yourself in the foot
3. Lack of Cassandra-enabled developers
   Companies like Apple or Netflix are vacuuming the job market

Slide 43

Slide 43 text

Solutions
1. Teamwork (partner up with Apache Spark, Elasticsearch etc.)
2. Serverless DataStax Astra (astra.datastax.com)
3. DataStax Certification Program (academy.datastax.com)

Slide 44

Slide 44 text

Intro to Cassandra
1. What is Cassandra and when you need it
2. Downsides :(
3. Use-Cases

Slide 45

Slide 45 text

Understanding Use Cases
Categories: Scalability, Availability, Distributed, Cloud-native
Labels on the slide: High Throughput, High Volume, Heavy Writes, Heavy Reads, Event Streaming, Log Analytics, Internet of Things, Other Time Series, Mission-Critical, No Data Loss, Always-on, Banking, Pricing, Market Data, Inventory, Global Retail, Tracking / Logistics, Customer Experience, API Layer, Hybrid-cloud, Enterprise Data Layer, Multi-cloud, Modern Cloud Applications, Global Presence, Workload Mobility, Compliance / GDPR

Slide 46

Slide 46 text

Thank You