
Introduction to Big data and Cassandra

Anh Thi Nguyen
December 07, 2020


Transcript

  1. Introduction to big data and Cassandra

    Group 02: • Nguyen Dang Anh Thi • Dang Phuong Nam • Le Minh Nghia. Presenter: Nguyen Dang Anh Thi
  2. Overview about Big Data

    Sources: social networks, IoT devices, the stock market. 35 zettabytes in 2020, according to Digital Universe.
  3. Overview about storage: RDBMS

    • Relational databases are not good for storing big data. • Fixed schema. • Not distributed, hard to scale. • Poor write performance under high-throughput workloads.
  4. Overview about storage: NoSQL

    • NoSQL is better for big data applications. • Flexibility: dynamic schema, can store unstructured and semi-structured data. • Scalable: most NoSQL databases are distributed. • High performance.
  5. Introducing Cassandra: Cassandra History

    • Cassandra was developed in 2007 at Facebook by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik to power the inbox search feature. • Facebook released Cassandra as an open-source project on Google Code in July 2008. • In March 2009 it became an Apache Incubator project. • On February 17, 2010 it graduated to a top-level project. • It has since been adopted by many big companies.
  6. Inbox search problem

    Requirements: - Scalability, with data distributed across multiple data centers. - Millions of simultaneous users. - Billions of writes per day. A user wants to search his or her inbox for messages using one of two strategies: • Term search: search by keyword. • Interaction search: search by username.
  7. Bigtable (2006) and the Dynamo paper (2007)

    Dynamo paper (2007): a data model that is: • Reliable. • Highly performant. • Always available. Bigtable (2006): • Richer data model. • One key, many values. • Fast sequential access.
  8. Cassandra (2009)

    - Distributed features from Dynamo. - Data model and storage from Bigtable.
  9. Detail about Cassandra: Architecture overview

    • Node: the place where data is stored. • Data center: a collection of related nodes. • Cluster: a component that contains one or more data centers. • Commit log: the crash-recovery mechanism in Cassandra. Every write operation is written to the commit log. • Mem-table: a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables. • SSTable: a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
  10. • Peer-to-peer, masterless ring architecture. • Every node is the same: no master, no slave. • Data is partitioned among all nodes in the cluster. • Data is replicated to ensure fault tolerance.
  11. Detail about Cassandra: Data modeling

    • Keyspace is the outermost container for data in Cassandra. • Columns are grouped into Column Families. • Each Column has ▪ Name ▪ Value
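
The containment described on this slide (keyspace, then column family, then row, then named columns) can be pictured with nested dictionaries. This is only a rough sketch for intuition, not how Cassandra stores data; the names `keyspace`, `users`, `alice`, `bob`, and `get_column` are hypothetical and do not come from the deck.

```python
# Hypothetical example data: a keyspace holding one column family ("users").
# Each row key maps to its own set of columns (name -> value).
keyspace = {
    "users": {
        "alice": {"email": "alice@example.com", "city": "Hanoi"},
        "bob": {"email": "bob@example.com"},  # rows need not share the same columns
    }
}


def get_column(keyspace, column_family, row_key, column_name):
    """Return the value of one column, or None if any level is missing."""
    return keyspace.get(column_family, {}).get(row_key, {}).get(column_name)


city = get_column(keyspace, "users", "alice", "city")
missing = get_column(keyspace, "users", "bob", "city")
```

Here `city` holds the stored value, while `missing` is `None` because the row for `bob` has no `city` column.
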
  12. [Figure]

  13. Detail about Cassandra: Partitioning. Source: https://www.scnsoft.com/blog/cassandra-performance

  14. There are two basic data partitioning strategies:

    • Random partitioning: the recommended strategy, and the default in older Cassandra versions (since Cassandra 1.2 the default is the Murmur3 partitioner, which follows the same idea with a different hash). It partitions data as evenly as possible across all nodes using an MD5 hash of every column family row key. • Ordered partitioning: stores column family row keys in sorted order across the nodes in a database cluster.
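
A minimal sketch of hash-based partitioning on a ring, assuming each node owns a single token. Deriving a node's token from its name is an assumption made only for this illustration (real clusters assign tokens differently), and `token_for`, `Ring`, and the node names are hypothetical.

```python
import bisect
import hashlib


def token_for(row_key: str) -> int:
    """Map a row key to a 128-bit integer token using MD5."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


class Ring:
    """Toy token ring: each node owns one token (an assumption of this sketch)."""

    def __init__(self, node_names):
        # Sorted (token, node) pairs form the ring in clockwise order.
        self.tokens = sorted((token_for(name), name) for name in node_names)

    def owner(self, row_key: str) -> str:
        """Return the first node at or clockwise after the key's token."""
        position = bisect.bisect_left(self.tokens, (token_for(row_key), ""))
        return self.tokens[position % len(self.tokens)][1]  # wrap past the last token


ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.owner("user:42")
```

Because the hash spreads keys roughly uniformly over the token space, rows end up distributed across nodes without any node needing to know the key ordering.
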
  15. Detail about Cassandra: Replication

    • Replication Factor: the number of copies of the data maintained on different nodes. A Replication Factor of 3 means that 3 copies of the data are maintained on 3 different nodes, so if 2 of the nodes go down we still have one safe copy of the data. • Replication Strategy: there are two replication strategies.
  16. Simple strategy: This strategy is used when there is only one data center.

    Replicas are placed on successive nodes, moving clockwise around the ring. Source: https://data-flair.training/blogs/cassandra-architecture/
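
The clockwise placement can be sketched as follows, assuming the ring is given as a list of node names in clockwise order and the primary node for a row is already known. The function name and the node names are hypothetical.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the primary node plus the next nodes clockwise, up to the replication factor."""
    count = min(replication_factor, len(ring_nodes))  # cannot exceed the number of nodes
    return [ring_nodes[(primary_index + step) % len(ring_nodes)] for step in range(count)]


# Hypothetical four-node ring; the row's primary node is the last one in the list.
replicas = simple_strategy_replicas(["n1", "n2", "n3", "n4"], primary_index=3, replication_factor=3)
```

With the primary at the end of the list, the walk wraps around to the start of the ring, so `replicas` is `["n4", "n1", "n2"]`.
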
  17. Network topology strategy: This strategy is highly recommended

    because it allows the cluster to expand in the future, for example to multiple data centers. Source: https://data-flair.training/blogs/cassandra-architecture/
  18. Detail about Cassandra: Key features. Source: https://data-flair.training/blogs/cassandra-architecture/

  19. Open Source

    • Cassandra, though very powerful and reliable, is free. • Cassandra can be integrated with other Apache open-source projects such as Hadoop, Apache Pig, and Apache Hive. Peer-to-Peer architecture • No single point of failure. • Any number of servers/nodes can be added to any Cassandra cluster in any of the data centers. • Any server can serve requests from any client.
  20. Elastic Scalability

    • A Cassandra cluster can easily be scaled up or down. • Any number of nodes can be added to or removed from a Cassandra cluster without much disturbance. • You don’t have to restart the cluster or change the application's queries while scaling up or down. High Availability and Fault Tolerance • Data replication makes Cassandra highly available and fault-tolerant (each piece of data is stored at more than one location). • Data replication can also happen across multiple data centers.
  21. High Performance

    • The basic idea behind developing Cassandra was to harness the hidden capabilities of several multicore machines. • Cassandra has proven itself to be highly reliable with large data sets. Column Oriented • Unlike traditional databases, where column names consist only of metadata, in Cassandra column names can also consist of the actual data. • Cassandra rows can consist of a very large number of columns. ⇒ Cassandra is endowed with a rich data model.
  22. Tunable Consistency

    • Eventual consistency means that the client is acknowledged as soon as the cluster accepts the write. • Strong consistency means that any update is broadcast to all the nodes where the particular data is situated. • A mixture of the two is also possible. Schema Free • In Cassandra, columns can be created at will within the rows.
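
One common way to reason about the mixture of the two is replica overlap: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read touches at least one replica that holds the latest write. The sketch below only does that arithmetic; the function names are hypothetical and it is not a Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when any read replica set must overlap any write replica set."""
    return write_acks + read_acks > replication_factor


rf = 3
strong = reads_see_latest_write(rf, quorum(rf), quorum(rf))  # 2 + 2 > 3
eventual = reads_see_latest_write(rf, 1, 1)                  # 1 + 1 is not > 3
```

With a replication factor of 3, quorum writes and quorum reads overlap (`strong` is `True`), while single-replica writes and reads do not (`eventual` is `False`), trading consistency for lower latency.
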
  23. Detail about Cassandra: Read Operation. Source: https://www.scnsoft.com/blog/cassandra-performance

  24. Detail about Cassandra: Write Operation

    When a write request comes to the node: • First, it is logged in the Commit Log. • The data is then captured and stored in the Mem-Table. • When the mem-table is full, the data is flushed to an SSTable data file. Source: https://www.scnsoft.com/blog/cassandra-performance
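
The write steps above can be sketched as a toy in-memory model. It is heavily simplified: nothing is written to disk, the commit log is never truncated, and the flush threshold is a row count chosen for the example. The class name `Node`, the `memtable_limit` parameter, and the `read` method are assumptions added for illustration.

```python
class Node:
    """Toy model of one node's write path: commit log, mem-table, SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record used for crash recovery
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # flushed tables, oldest first; never modified after flush
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: log the write
        self.memtable[key] = value            # step 2: store it in the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # step 3: flush when the mem-table is full

    def flush(self):
        # An SSTable keeps its rows sorted by key.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the mem-table, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # reaches the limit and triggers a flush
node.write("a", 3)   # newer value stays in the mem-table
```

After these three writes the node has one SSTable holding the first two rows, and reading `"a"` returns the newer value from the mem-table. Writes only append to the log and update memory, which is one reason the write path is cheap.
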
  25. Cassandra performance benchmark

    Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase. Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  26. Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase

    Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  27. Mixed Operational And Analytical Workload

    Cassandra vs. MongoDB vs. HBase vs. Couchbase. Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  28. Mixed Operational And Analytical Workload

    Cassandra vs. MongoDB vs. HBase. Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks
  29. Comparison with MySQL

    • MySQL, > 50 GB of data: average write ~300 ms, average read ~350 ms. • Cassandra, > 50 GB of data: average write 0.12 ms, average read 15 ms. Source: statistics provided by the authors, using Facebook data.
  30. Summary: Advantages, Disadvantages

  31. When not to use Cassandra:

    • Tables have multiple access paths. Example: lots of secondary indexes. • The application depends on identifying rows with sequential values, such as MySQL autoincrement or Oracle sequences. • Cassandra does not do ACID. LSD, Sulphuric or any other kind. If you think you need it, go elsewhere. Many times people think they need it when they don’t. • Aggregates: Cassandra does not support aggregates; if you need to do a lot of them, consider another database. • Joins: You may be able to data model yourself out of this one, but take care. • Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don’t try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty. • Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes, and that has consequences that are not immediately obvious. • Transactions: CQL has no begin/commit transaction syntax. If you think you need it, then Cassandra is a poor choice for you. Don’t try to simulate it. The results won’t be pretty.
  32. When to use Cassandra:

    • Distributed: Runs on more than one server node. • Scale linearly: By adding nodes, not more hardware on existing nodes. • Work globally: A cluster may be geographically distributed. • Favor writes over reads: Writes are an order of magnitude faster than reads. • Democratic peer-to-peer architecture: No master/slave. • Favor partition tolerance and availability over consistency: Eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem). • Support fast targeted reads by primary key: Focus on primary key reads; alternative paths are very sub-optimal. • Support data with a defined lifetime: Data in a Cassandra database can be given a defined lifetime. There is no need to delete it; after the lifetime expires the data goes away.
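
The last point, data with a defined lifetime, can be illustrated with a small time-to-live store. This is a conceptual sketch only: the class `TtlStore` and its methods are hypothetical, time is passed in explicitly to keep the example deterministic, and it does not model how Cassandra expires data internally.

```python
class TtlStore:
    """Toy key-value store in which each entry may carry an expiry time."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def put(self, key, value, now, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and now >= expires_at:
            del self._data[key]  # expired: behaves as if it had been deleted
            return None
        return value


store = TtlStore()
store.put("session:1", "active", now=100, ttl_seconds=10)
before = store.get("session:1", now=105)  # still within its lifetime
after = store.get("session:1", now=110)   # lifetime has elapsed
```

The caller never issues a delete: `before` is the stored value and `after` is `None` once the lifetime has passed.
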
  33. Companies that use Cassandra

    • Twitter uses Cassandra for analytics: real-time analytics, geolocation and places of interest data, and data mining over the entire user store. • Mahalo uses it for its primary near-time data store. • Facebook still uses it for inbox search, though they are using a proprietary fork. • Digg uses it for its primary near-time data store. • Rackspace uses it for its cloud service, monitoring, and logging. • Reddit uses it as a persistent cache. • Cloudkick uses it for monitoring statistics and analytics. • Ooyala uses it to store and serve near real-time video analytics data. • SimpleGeo uses it as the main data store for its real-time location infrastructure. • Onespot uses it for a subset of its main data store.
  34. References

    • https://www.slideshare.net/asismohanty/cassandra-basics-20 • https://data-flair.training/blogs/cassandra-architecture/ • https://www.slideshare.net/quangntta/introduction-to-cassandra-59962524?from_action=save • https://data-flair.training/blogs/cassandra-data-model/ • https://technospirituality.com/2016/07/apache-cassandra-a-quick-look/ • https://www.slideshare.net/DataStax/understanding-data-partitioning-and-replication-in-apache-cassandra
  35. References

    • https://www.scnsoft.com/blog/cassandra-performance • https://www.edureka.co/blog/interview-questions/cassandra-interview-questions/ • https://www.gocit.vn/files/Cassandra.The.Definitive.Guide-www.gocit.vn.pdf • https://www.edureka.co/blog/apache-cassandra-advantages/
  36. Demo: Cassandra on Docker

  37. Thank you