Introduction to Big data and Cassandra

Anh Thi Nguyen
December 07, 2020

Transcript

  1. Introduction to big data and Cassandra

    Group 02:
    • Nguyen Dang Anh Thi
    • Dang Phuong Nam
    • Le Minh Nghia
    Presenter: Nguyen Dang Anh Thi
  2. Overview about Big Data

    • Social networks
    • IoT devices
    • Stock market
    35 zettabytes in 2020, according to Digital Universe.
  3. Overview about storage: RDBMS

    Relational databases are not good for storing big data:
    • Fixed schema.
    • Not distributed, hard to scale.
    • Poor write performance under high throughput.
  4. Overview about storage: NoSQL

    NoSQL is better for big data applications:
    • Flexibility: dynamic schema, can store unstructured and semi-structured data.
    • Scalable: most NoSQL databases are distributed.
    • High performance.
  5. Introduce Cassandra: Cassandra History

    • Cassandra was developed in 2007 at Facebook by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik to power the inbox search feature.
    • Facebook released Cassandra as an open-source project on Google Code in July 2008.
    • In March 2009 it became an Apache Incubator project.
    • On February 17, 2010 it graduated to a top-level project.
    • It has since been adopted by many big companies.
  6. Inbox search problem

    Requirements:
    - Scalability, with data distributed across multiple data centers.
    - Millions of simultaneous users.
    - Billions of writes per day.
    A user wants to search his or her inbox for messages using one of two strategies:
    • Term search: search by keyword.
    • Interaction search: search by username.
  7. Bigtable (2006) and the Dynamo paper (2007)

    A data model that is:
    • Reliable.
    • Highly performant.
    • Always available.
    • Richer in its data model.
    • One key, many values.
    • Fast sequential access.
  8. Detail about Cassandra: Architecture overview

    • Node: the place where data is stored.
    • Data center: a collection of related nodes.
    • Cluster: a component that contains one or more data centers.
    • Commit log: a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
    • Mem-table: a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes a single column family has multiple mem-tables.
    • SSTable: a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
  9. Peer-to-peer, masterless, ring architecture.

    • Every node is the same: no master, no slave.
    • Data is partitioned among all nodes in the cluster.
    • Data is replicated to ensure fault tolerance.
  10. Detail about Cassandra: Data modeling

    • Keyspace is the outermost container for data in Cassandra.
    • Columns are grouped into Column Families.
    • Each Column has:
      ▪ Name
      ▪ Value
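The containment described on this slide can be pictured with nested dictionaries. This is only an illustrative sketch: the `users` column family, the row keys, and the `get_column` helper are hypothetical names, not part of the deck or of any Cassandra API.

```python
# Hypothetical illustration: nested dicts standing in for Cassandra's containers.
# keyspace -> column family -> row key -> {column name: column value}
keyspace = {
    "users": {  # a column family (hypothetical name)
        "user:1": {"name": "alice", "city": "Hanoi"},
        "user:2": {"name": "bob"},  # rows need not share the same columns
    }
}


def get_column(keyspace, column_family, row_key, column_name):
    """Look up one column value; returns None when the row lacks that column."""
    return keyspace[column_family][row_key].get(column_name)
```

Here `get_column(keyspace, "users", "user:2", "city")` returns `None`, because that row simply has no `city` column.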
  12. There are two basic data partitioning strategies:

    • Random partitioning: this is the default and recommended strategy. It partitions data as evenly as possible across all nodes using an MD5 hash of every column family row key (the RandomPartitioner; Cassandra 1.2 and later default to the Murmur3Partitioner, which applies the same idea with a MurmurHash).
    • Ordered partitioning: stores column family row keys in sorted order across the nodes in a database cluster.
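The random partitioning idea can be sketched in a few lines of Python. This is a simplified model, not Cassandra's implementation: the `Ring` class, the evenly spaced tokens, and the full 128-bit token space are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left

TOKEN_SPACE = 2 ** 128  # MD5 produces 128-bit digests


def md5_token(row_key: str) -> int:
    """Map a row key to an integer token via its MD5 digest."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_names):
        self.nodes = list(node_names)
        n = len(self.nodes)
        # Assumption: one token per node, evenly spaced over the token space.
        self.tokens = [(i + 1) * TOKEN_SPACE // n - 1 for i in range(n)]

    def owner(self, row_key: str) -> str:
        """Return the first node whose token is >= the key's token."""
        idx = bisect_left(self.tokens, md5_token(row_key)) % len(self.nodes)
        return self.nodes[idx]
```

Because the hash spreads keys roughly uniformly over the token space, each node in `Ring(["node-a", "node-b", "node-c", "node-d"])` ends up owning about a quarter of the keys, regardless of how the keys themselves are distributed.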
  13. Detail about Cassandra: Replication

    • Replication Factor: the number of copies of the data maintained on different nodes. A Replication Factor of 3 means that 3 copies of the data are maintained on 3 different nodes, so if 2 of the nodes go down, one copy of the data is still safe.
    • Replication Strategy: there are two replication strategies, simple strategy and network topology strategy.
  14. Simple strategy: This strategy is used when there is only one data center. The first replica is placed on the node selected by the partitioner, and the remaining replicas are placed on the next nodes clockwise around the ring.

    Source: https://data-flair.training/blogs/cassandra-architecture/
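The clockwise placement of the simple strategy, together with the replication factor from the previous slide, can be sketched as follows. The function name and the list-based ring are hypothetical; the sketch assumes the node that owns the key is already known.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the nodes holding a row's replicas under a simple-strategy model.

    ring_nodes is the list of nodes in ring (clockwise) order, primary_index
    is the position of the node that owns the row's token, and the remaining
    replicas go to the next nodes clockwise, wrapping around the ring.
    """
    if replication_factor > len(ring_nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [
        ring_nodes[(primary_index + offset) % len(ring_nodes)]
        for offset in range(replication_factor)
    ]
```

For a four-node ring `["n1", "n2", "n3", "n4"]` with the primary at index 3 and a replication factor of 3, the replicas land on `n4`, `n1`, and `n2`.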
  15. Network topology strategy: This strategy is highly recommended because it makes it easier to expand to multiple data centers when future use requires it.

    Source: https://data-flair.training/blogs/cassandra-architecture/
  16. Open Source

    • Cassandra, though it is very powerful and reliable, is free.
    • Cassandra can be integrated with other Apache open-source projects such as Hadoop, Apache Pig, and Apache Hive.

    Peer-to-Peer architecture

    • No single point of failure.
    • Any number of servers/nodes can be added to any Cassandra cluster in any of the data centers.
    • Any server can handle requests from any client.
  17. Elastic Scalability

    • A Cassandra cluster can easily be scaled up or scaled down.
    • Any number of nodes can be added to or removed from a Cassandra cluster without much disturbance.
    • You don’t have to restart the cluster or change the queries of the Cassandra application while scaling up or down.

    High Availability and Fault Tolerance

    • Data replication makes Cassandra highly available and fault-tolerant (each piece of data is stored at more than one location).
    • Data replication can also happen across multiple data centers.
  18. High Performance

    • The basic idea behind developing Cassandra was to harness the hidden capabilities of several multicore machines.
    • Cassandra has proven itself to be highly reliable when handling large data sets.

    Column Oriented

    • Unlike traditional databases, where column names consist only of metadata, in Cassandra column names can also carry actual data.
    • Cassandra rows can consist of a very large number of columns.
    ⇒ Cassandra is endowed with a rich data model.
  19. Tunable Consistency

    • Eventual consistency means that the client receives an acknowledgement as soon as the cluster accepts the write.
    • Strong consistency means that any update is broadcast to all the nodes where the particular data is stored.
    • A mixture of the two is also possible.

    Schema Free

    • In Cassandra, columns can be created at will within the rows.
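One common way to reason about the mixture of the two consistency modes is the quorum rule: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps at least one replica holding the latest write. The slide does not state this rule; the sketch below adds it as background, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when the read set and write set must share at least one replica."""
    return write_replicas + read_replicas > replication_factor
```

With a replication factor of 3, quorum writes and quorum reads (2 + 2 > 3) overlap, while single-replica writes and reads (1 + 1 > 3 is false) only give eventual consistency.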
  20. Detail about Cassandra: Write Operation

    When a write request comes to the node:
    • First, it is logged in the Commit Log.
    • The data is then captured and stored in the Mem-Table.
    • When the mem-table is full, the data is flushed to the SSTable data file.
    Source: https://www.scnsoft.com/blog/cassandra-performance
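The write steps above can be sketched as a toy in-memory model. This is a simplification under stated assumptions: the `Node` class, the `memtable_limit` threshold counted in rows, and clearing the commit log on flush are illustrative choices, not Cassandra's actual storage engine.

```python
class Node:
    """Toy model of a single node's write path (all names are hypothetical)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only log used for crash recovery
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, key-sorted "files" flushed to disk
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then store it in the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the threshold is reached

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Assumption: flushed entries no longer need the commit log.
        self.commit_log.clear()
```

Writes never touch existing SSTables in this model; they only append to the log and update memory, which is one intuition for why the write path is cheap.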
  21. Cassandra performance benchmark

    Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase
    Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  22. Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase

    Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  23. Mixed Operational and Analytical Workload: Cassandra vs. MongoDB vs. HBase vs. Couchbase

    Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  24. Mixed Operational and Analytical Workload: Cassandra vs. MongoDB vs. HBase

    Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  25. Comparison with MySQL

    • MySQL, > 50 GB of data:
      ▪ Writes average: ~300 ms
      ▪ Reads average: ~350 ms
    • Cassandra, > 50 GB of data:
      ▪ Writes average: 0.12 ms
      ▪ Reads average: 15 ms
    Source: stats provided by the authors using Facebook data.
  26. When not to use Cassandra:

    • Tables have multiple access paths. Example: lots of secondary indexes.
    • The application depends on identifying rows with sequential values, such as MySQL autoincrement or Oracle sequences.
    • Cassandra does not do ACID. LSD, Sulphuric or any other kind. If you think you need it, go elsewhere. Many times people think they need it when they don’t.
    • Aggregates: Cassandra does not support aggregates; if you need to do a lot of them, consider another database.
    • Joins: You may be able to data model yourself out of this one, but take care.
    • Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don’t try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty.
    • Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes, and that has consequences that are not immediately obvious.
    • Transactions: CQL has no begin/commit transaction syntax. If you think you need it, then Cassandra is a poor choice for you. Don’t try to simulate it. The results won’t be pretty.
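The point about updates and deletes being special cases of writes can be illustrated with a toy append-only store. This is a sketch under stated assumptions: the list-based store, the explicit timestamps, and the `TOMBSTONE` marker are hypothetical stand-ins for the idea that a delete is recorded as a new write and that reads return the newest version.

```python
TOMBSTONE = object()  # marker written in place of a value when a key is deleted


def write(store, key, value, timestamp):
    """An insert or an update is the same operation: append a new version."""
    store.append((timestamp, key, value))


def delete(store, key, timestamp):
    """A delete is also a write: it appends a tombstone instead of removing data."""
    store.append((timestamp, key, TOMBSTONE))


def read(store, key):
    """Return the newest version of the key, or None if absent or deleted."""
    versions = [(ts, value) for ts, k, value in store if k == key]
    if not versions:
        return None
    _, newest = max(versions, key=lambda version: version[0])
    return None if newest is TOMBSTONE else newest
```

One non-obvious consequence shown here: after a delete the store holds more entries than before, and old versions stay around until something cleans them up.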
  27. When to use Cassandra:

    • Distributed: Runs on more than one server node.
    • Scale linearly: By adding nodes, not more hardware on existing nodes.
    • Work globally: A cluster may be geographically distributed.
    • Favor writes over reads: Writes are an order of magnitude faster than reads.
    • Democratic peer-to-peer architecture: No master/slave.
    • Favor partition tolerance and availability over consistency: Eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem).
    • Support fast targeted reads by primary key: Focus on primary key reads; alternative paths are very sub-optimal.
    • Support data with a defined lifetime: Data in a Cassandra database can be given a defined lifetime, so there is no need to delete it; after the lifetime expires, the data goes away.
  28. Companies that use Cassandra

    • Twitter is using Cassandra for analytics: real-time analytics, geolocation and places of interest data, and data mining over the entire user store.
    • Mahalo uses it for its primary near-time data store.
    • Facebook still uses it for inbox search, though they are using a proprietary fork.
    • Digg uses it for its primary near-time data store.
    • Rackspace uses it for its cloud service, monitoring, and logging.
    • Reddit uses it as a persistent cache.
    • Cloudkick uses it for monitoring statistics and analytics.
    • Ooyala uses it to store and serve near real-time video analytics data.
    • SimpleGeo uses it as the main data store for its real-time location infrastructure.
    • Onespot uses it for a subset of its main data store.