Introduction to Big Data and Cassandra

Slide 1

Slide 1 text

Introduction to Big Data and Cassandra
Group 02:
• Nguyen Dang Anh Thi
• Dang Phuong Nam
• Le Minh Nghia
Presenter: Nguyen Dang Anh Thi

Slide 2

Slide 2 text

Overview of Big Data
• Data sources: social networks, IoT devices, the stock market.
• 35 zettabytes of data in 2020, according to Digital Universe.

Slide 3

Slide 3 text

Overview of storage: RDBMS
• Relational databases are a poor fit for storing big data.
• Fixed schema.
• Not distributed, hard to scale.
• Poor write performance under high throughput.

Slide 4

Slide 4 text

Overview of storage: NoSQL
• NoSQL is better suited to big data applications.
• Flexibility: dynamic schema; can store unstructured and semi-structured data.
• Scalability: most NoSQL databases are distributed.
• High performance.

Slide 5

Slide 5 text

Introducing Cassandra: Cassandra history
• Cassandra was developed at Facebook in 2007 by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik to power the inbox search feature.
• Facebook released Cassandra as an open-source project on Google Code in July 2008.
• In March 2009 it became an Apache Incubator project.
• On February 17, 2010 it graduated to a top-level project.
• It has since been adopted by many large companies.

Slide 6

Slide 6 text

Inbox search problem
Requirements:
- Scalability, with data distributed across multiple data centers.
- Millions of simultaneous users.
- Billions of writes per day.
A user wants to search his or her inbox for messages using one of two strategies:
• Term search: search by keyword.
• Interaction search: search by username.

Slide 7

Slide 7 text

Bigtable (2006) and the Dynamo paper (2007)
A data model that is:
• Reliable.
• High-performing.
• Always available.
• Richer data model.
• One key, many values.
• Fast sequential access.

Slide 8

Slide 8 text

Cassandra (2009)
- Distributed features from Dynamo.
- Data model and storage from Bigtable.

Slide 9

Slide 9 text

Detail about Cassandra: Architecture overview
• Node: the place where data is stored.
• Data center: a collection of related nodes.
• Cluster: a component that contains one or more data centers.
• Commit log: the crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table: a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes a single column family will have multiple mem-tables.
• SSTable: a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

Slide 10

Slide 10 text

• Peer-to-peer, masterless ring architecture.
• Every node is the same: no master, no slave.
• Data is partitioned among all nodes in the cluster.
• Data is replicated to ensure fault tolerance.

Slide 11

Slide 11 text

Detail about Cassandra: Data modeling
• Keyspace is the outermost container for data in Cassandra.
• Columns are grouped into column families.
• Each column has:
  ▪ Name
  ▪ Value
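As a rough mental model only (not Cassandra's actual storage format), the containment hierarchy on this slide can be pictured as nested maps: keyspace, then column family, then row key, then columns. The names `demo_keyspace` and `users` and the sample rows below are hypothetical.

```python
from collections import defaultdict


def make_keyspace():
    """Illustrative layout: column family -> row key -> {column name: value}."""
    return defaultdict(lambda: defaultdict(dict))


def put(keyspace, column_family, row_key, column_name, value):
    """Set one column (a name/value pair) in the row identified by row_key."""
    keyspace[column_family][row_key][column_name] = value


def get_row(keyspace, column_family, row_key):
    """Return all columns of a row as a plain dict (empty if the row is absent)."""
    return dict(keyspace[column_family].get(row_key, {}))


demo_keyspace = make_keyspace()
put(demo_keyspace, "users", "thi", "email", "thi@example.com")
put(demo_keyspace, "users", "thi", "city", "Ho Chi Minh City")
# Rows in the same column family do not have to carry the same columns.
put(demo_keyspace, "users", "nam", "email", "nam@example.com")
```

Here `get_row(demo_keyspace, "users", "thi")` returns both columns of that row, while the `nam` row holds only an `email` column.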

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Detail about Cassandra: Partitioning
Source: https://www.scnsoft.com/blog/cassandra-performance

Slide 14

Slide 14 text

There are two basic data partitioning strategies:
• Random partitioning: the recommended strategy, and the default in early Cassandra releases (later releases default to a Murmur3 hash instead of MD5). It spreads data as evenly as possible across all nodes using an MD5 hash of every column family row key.
• Ordered partitioning: stores column family row keys in sorted order across the nodes in a database cluster.
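The random (hash-based) strategy can be sketched as a toy token ring. This is a simplified illustration, not Cassandra's partitioner code: the `TokenRing` class and node names are hypothetical, and deriving a node's token from the hash of its name is an assumption made only to keep the example short.

```python
import hashlib
from bisect import bisect_left


def md5_token(row_key: str) -> int:
    """Map a row key to an integer token using its 128-bit MD5 hash."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


class TokenRing:
    """Toy ring: each node owns the range of tokens ending at its own token."""

    def __init__(self, node_names):
        # Assumption for this sketch: a node's token is the hash of its name.
        self._ring = sorted((md5_token(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]

    def owner(self, row_key: str) -> str:
        """Return the first node clockwise whose token is >= the key's token."""
        index = bisect_left(self._tokens, md5_token(row_key))
        # Wrap around to the first node when the key hashes past the last token.
        return self._ring[index % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
placement = {key: ring.owner(key) for key in ("user:1", "user:2", "user:3")}
```

Because the hash scatters keys that are adjacent in sort order, neighbouring row keys usually land on different nodes, which is what evens out the load. It is also why range scans over row keys are the use case for ordered partitioning.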

Slide 15

Slide 15 text

Detail about Cassandra: Replication
• Replication factor: the number of copies of the data maintained on different nodes. A replication factor of 3 means that 3 copies of the data are kept on 3 different nodes, so even if 2 of those nodes go down, one copy of the data is still safe.
• Replication strategy: there are two replication strategies.

Slide 16

Slide 16 text

Simple strategy: used when there is only one data center. Replicas are placed on successive nodes, moving clockwise around the ring.
Source: https://data-flair.training/blogs/cassandra-architecture/
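The clockwise placement can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `simple_strategy_replicas` and the node names are hypothetical, and the ring is reduced to a list of nodes in ring order.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the nodes holding a row: the primary, then the next nodes clockwise."""
    if replication_factor > len(ring_nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [
        ring_nodes[(primary_index + step) % len(ring_nodes)]  # wrap around the ring
        for step in range(replication_factor)
    ]


ring_nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical nodes, in ring order
replicas = simple_strategy_replicas(ring_nodes, primary_index=3, replication_factor=3)
# replicas == ["n4", "n5", "n1"]
```

With a replication factor of 3 and the primary replica on `n4`, the copies wrap around the end of the ring and land on `n4`, `n5` and `n1`.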

Slide 17

Slide 17 text

Network topology strategy: this strategy is highly recommended because it leaves room to expand to multiple data centers in the future.
Source: https://data-flair.training/blogs/cassandra-architecture/

Slide 18

Slide 18 text

Detail about Cassandra: Key features
Source: https://data-flair.training/blogs/cassandra-architecture/

Slide 19

Slide 19 text

Open Source
• Cassandra is powerful and reliable, and it is free.
• Cassandra can be integrated with other Apache open-source projects such as Hadoop, Apache Pig and Apache Hive.
Peer-to-Peer Architecture
• No single point of failure.
• Any number of servers/nodes can be added to any Cassandra cluster in any of the data centers.
• Any server can serve requests from any client.

Slide 20

Slide 20 text

Elastic Scalability
• A Cassandra cluster can easily be scaled up or down.
• Any number of nodes can be added to or removed from a Cassandra cluster without much disturbance.
• You don't have to restart the cluster or change the application's queries while scaling up or down.
High Availability and Fault Tolerance
• Data replication makes Cassandra highly available and fault-tolerant (each piece of data is stored in more than one location).
• Data replication can also happen across multiple data centers.

Slide 21

Slide 21 text

High Performance
• The basic idea behind developing Cassandra was to harness the hidden capabilities of several multicore machines.
• Cassandra has proven highly reliable with large data sets.
Column Oriented
• Unlike traditional databases, where column names consist only of metadata, in Cassandra column names can also hold actual data.
• A Cassandra row can contain a very large number of columns.
⇒ Cassandra is endowed with a rich data model.

Slide 22

Slide 22 text

Tunable Consistency
• Eventual consistency: the client is acknowledged as soon as the cluster accepts the write.
• Strong consistency: every update is propagated to all the nodes that hold the data in question.
• A mixture of the two is also possible.
Schema Free
• In Cassandra, columns can be created at will within rows.
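One common way to reason about the "mixture" of the two consistency modes is the replica-overlap rule: if the number of replicas that must acknowledge a read (R) plus the number that must acknowledge a write (W) exceeds the replication factor (RF), every read overlaps the latest write. The helper names below are hypothetical; this is a sketch of the arithmetic, not of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)  # 2 of the 3 replicas

quorum_both = is_strongly_consistent(q, q, rf)   # quorum reads + quorum writes
one_and_one = is_strongly_consistent(1, 1, rf)   # fastest, eventually consistent
read_one_write_all = is_strongly_consistent(1, rf, rf)
```

With RF = 3, quorum reads combined with quorum writes satisfy the rule (2 + 2 > 3), while reading and writing a single replica does not (1 + 1 is not greater than 3), so that combination trades consistency for latency.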

Slide 23

Slide 23 text

Detail about Cassandra: Read Operation
Source: https://www.scnsoft.com/blog/cassandra-performance

Slide 24

Slide 24 text

Detail about Cassandra: Write Operation
When a write request arrives at a node:
• First, the write is recorded in the commit log.
• The data is then stored in the mem-table.
• When the mem-table is full, its data is flushed to an SSTable data file.
Source: https://www.scnsoft.com/blog/cassandra-performance
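The write steps above can be sketched with a toy in-memory model. This is an illustration only: `ToyNode`, its `memtable_limit` threshold (counted in entries rather than bytes) and the `read` helper are hypothetical simplifications, and nothing is actually written to disk.

```python
class ToyNode:
    """Toy model of one node's write path: commit log -> mem-table -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entries)
        self.commit_log = []  # append-only record of every write, for crash recovery
        self.memtable = {}    # in-memory key -> value
        self.sstables = []    # immutable, sorted snapshots standing in for disk files

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write in the commit log
        self.memtable[key] = value            # 2. store it in the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the threshold is reached

    def flush(self):
        """Freeze the mem-table into a sorted, immutable SSTable and start afresh."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        """Simplified lookup: mem-table first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, stored_value in sstable:
                if stored_key == key:
                    return stored_value
        return None


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)  # second entry reaches the threshold and triggers a flush
node.write("a", 3)  # newer value for "a" lives in the fresh mem-table
```

A write never modifies an existing SSTable in this model; the newer value for `"a"` simply shadows the flushed one. That append-only shape is one reason the write path is cheap.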

Slide 25

Slide 25 text

Cassandra performance benchmark
Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Slide 26

Slide 26 text

Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Slide 27

Slide 27 text

Mixed operational and analytical workload: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Slide 28

Slide 28 text

Mixed operational and analytical workload: Cassandra vs. MongoDB vs. HBase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Slide 29

Slide 29 text

Comparison with MySQL
• MySQL, > 50 GB of data: writes average ~300 ms; reads average ~350 ms.
• Cassandra, > 50 GB of data: writes average 0.12 ms; reads average 15 ms.
Source: statistics provided by the authors, using Facebook data.

Slide 30

Slide 30 text

Summary
Advantages:
Disadvantages:

Slide 31

Slide 31 text

When not to use Cassandra:
• Tables have multiple access paths. Example: lots of secondary indexes.
• The application depends on identifying rows with sequential values, such as MySQL autoincrement or Oracle sequences.
• ACID: Cassandra does not do ACID. LSD, sulphuric or any other kind. If you think you need it, go elsewhere. People often think they need it when they don't.
• Aggregates: Cassandra does not support aggregates. If you need to do a lot of them, consider another database.
• Joins: You may be able to data-model your way out of this one, but take care.
• Locks: Cassandra does not support locking, and there is a good reason for this. Don't try to implement locks yourself. I have seen the end result of people trying to do locks using Cassandra, and the results were not pretty.
• Updates: Cassandra is very good at writes and okay with reads. Updates and deletes are implemented as special cases of writes, and that has consequences that are not immediately obvious.
• Transactions: CQL has no begin/commit transaction syntax. If you think you need it, Cassandra is a poor choice for you. Don't try to simulate it. The results won't be pretty.

Slide 32

Slide 32 text

When to use Cassandra:
• Distributed: runs on more than one server node.
• Scale linearly: by adding nodes, not by adding hardware to existing nodes.
• Work globally: a cluster may be geographically distributed.
• Favor writes over reads: writes are an order of magnitude faster than reads.
• Democratic peer-to-peer architecture: no master/slave.
• Favor partition tolerance and availability over consistency: eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem).
• Support fast targeted reads by primary key: focus on primary key reads; alternative paths are very sub-optimal.
• Support data with a defined lifetime: data in a Cassandra database can be given a defined lifetime. There is no need to delete it; once the lifetime expires, the data goes away.
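The "defined lifetime" point can be pictured with a small time-to-live (TTL) store. This is a conceptual sketch, not how Cassandra implements expiry: the `TtlStore` class is hypothetical, and the current time is passed in explicitly as a number of seconds so the behaviour is easy to follow. In CQL the lifetime is attached to a write with `USING TTL`.

```python
class TtlStore:
    """Toy key-value store in which every entry carries an expiry time."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds, now):
        """Store a value that stops being visible ttl_seconds after `now`."""
        self._data[key] = (value, now + ttl_seconds)

    def get(self, key, now):
        """Return the value if it is still alive at `now`, otherwise None."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._data[key]  # expired data goes away without an explicit delete
            return None
        return value


store = TtlStore()
store.put("session:42", "logged-in", ttl_seconds=60, now=1000)
before_expiry = store.get("session:42", now=1030)  # still within the 60 s lifetime
after_expiry = store.get("session:42", now=1060)   # lifetime has elapsed
```

The caller never issues a delete: the entry written at time 1000 with a 60-second lifetime is readable at 1030 and gone at 1060.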

Slide 33

Slide 33 text

Companies that use Cassandra
• Twitter uses Cassandra for analytics: real-time analytics, geolocation and places-of-interest data, and data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics data.
• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.

Slide 34

Slide 34 text

References
• https://www.slideshare.net/asismohanty/cassandra-basics-20
• https://data-flair.training/blogs/cassandra-architecture/
• https://www.slideshare.net/quangntta/introduction-to-cassandra-59962524?from_action=save
• https://data-flair.training/blogs/cassandra-data-model/
• https://technospirituality.com/2016/07/apache-cassandra-a-quick-look/
• https://www.slideshare.net/DataStax/understanding-data-partitioning-and-replication-in-apache-cassandra

Slide 35

Slide 35 text

References
• https://www.scnsoft.com/blog/cassandra-performance
• https://www.edureka.co/blog/interview-questions/cassandra-interview-questions/
• https://www.gocit.vn/files/Cassandra.The.Definitive.Guide-www.gocit.vn.pdf
• https://www.edureka.co/blog/apache-cassandra-advantages/