
StratioDeep: An Integration Layer Between Spark and Cassandra

Stratio
December 04, 2013

We present StratioDeep, an integration layer between the Spark distributed computing framework and Cassandra, a NoSQL distributed database.

Cassandra brings together the distributed system technologies from Dynamo and the data model from Google’s BigTable. Like Dynamo, Cassandra is eventually consistent and based on a P2P model without a single point of failure. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. For these reasons, Cassandra (C*) is one of the most popular NoSQL databases. One of its handicaps, however, is that the schema has to be modeled around the queries that will be executed, because C* is oriented toward search by key.
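
To make the key-oriented limitation concrete, here is a minimal Python sketch of a toy table partitioned by key. It is not Cassandra's API: `ToyTable`, `get`, `scan_where` and the sample rows are hypothetical names and data used only for illustration. A lookup by key touches a single partition, while a predicate on any other column has to visit every row.

```python
import zlib


class ToyTable:
    """Toy stand-in for a key-partitioned table (an assumption, not Cassandra's API)."""

    def __init__(self, num_partitions=4):
        # Each partition maps a row key to a dict of column values.
        self.partitions = [{} for _ in range(num_partitions)]
        self.rows_read = 0  # rows touched by the most recent read

    def _partition_for(self, key):
        # Deterministic hash so a key always lands in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, columns):
        self._partition_for(key)[key] = columns

    def get(self, key):
        # Key lookup: only one partition is consulted.
        row = self._partition_for(key).get(key)
        self.rows_read = 0 if row is None else 1
        return row

    def scan_where(self, column, value):
        # Non-key predicate: every row in every partition is inspected.
        self.rows_read = 0
        matches = []
        for partition in self.partitions:
            for key, columns in partition.items():
                self.rows_read += 1
                if columns.get(column) == value:
                    matches.append(key)
        return sorted(matches)


table = ToyTable()
table.put("user:1", {"country": "ES", "age": 31})
table.put("user:2", {"country": "US", "age": 45})
table.put("user:3", {"country": "ES", "age": 27})

by_key = table.get("user:2")
key_cost = table.rows_read

spanish_users = table.scan_where("country", "ES")
scan_cost = table.rows_read
```

In this toy model the cost of the key lookup stays constant as the table grows, whereas the cost of the scan grows with the total number of rows. That gap is what pushes schema design toward the queries known in advance.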

Integrating C* and Spark gives us a system that combines the best of both worlds.

Existing integrations between the two systems are not satisfactory: they basically provide an HDFS abstraction layer over C*. We believe this solution is inefficient because it introduces significant overhead between the two systems.

The purpose of our work has been to provide a much lower-level integration that not only performs better but also lets Cassandra address a wide range of new use cases, thanks to the power of the Spark distributed computing framework.

We’ve already deployed this solution in real applications with diverse clients: pattern detection, log mining, fraud detection, sentiment analysis and financial transaction analysis.

In addition, this integration is the building block for our challenging and novel Lambda architecture, which is based entirely on Cassandra.

To complete the integration, we provide a seamless extension to the Cassandra Query Language (CQL). CQL is oriented to key-based search and, as such, is not a good choice for queries that move a huge amount of data. We’ve extended CQL to provide a user-friendly interface, which is a new approach to batch processing over C*. It consists of an abstraction layer that translates custom CQL queries into Spark jobs and delegates to Spark the complexity of distributing the query over the underlying cluster of commodity machines.
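
The translation idea can be sketched as follows. This is a plain-Python illustration under stated assumptions, not StratioDeep's extended CQL and not Spark's API: the `Query` descriptor, `run_on_partition` and `execute` are hypothetical names, and the "cluster" is just a list of in-memory partitions. The sketch only shows that a declarative description of a filter plus a grouped count can be turned mechanically into a function that runs on each partition, followed by a merge of the partial results.

```python
from collections import Counter
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Query:
    """Hypothetical query: count rows per `group_by` where `where_column` equals `where_value`."""
    group_by: str
    where_column: str
    where_value: object


def run_on_partition(query, rows):
    """Per-partition step: filter and count using only local rows."""
    counts = Counter()
    for row in rows:
        if row.get(query.where_column) == query.where_value:
            counts[row[query.group_by]] += 1
    return counts


def execute(query, partitions):
    """Merge step: combine the partial counts produced by every partition."""
    partials = [run_on_partition(query, rows) for rows in partitions]
    return dict(reduce(lambda a, b: a + b, partials, Counter()))


# Three partitions of log rows; the last one is empty.
partitions = [
    [{"level": "ERROR", "host": "a"}, {"level": "INFO", "host": "a"}],
    [{"level": "ERROR", "host": "b"}, {"level": "ERROR", "host": "a"}],
    [],
]

errors_per_host = execute(Query("host", "level", "ERROR"), partitions)
```

In a Spark-based system the per-partition function would be shipped to worker nodes instead of being called in a local loop, but the split into a local step and a merge step is the same.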

Transcript

  1. StratioDeep: An efficient data mining solution
    “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell #StratioB
  2. Why we also need Spark
    • In Cassandra, you need to design the schema with the query in mind
    • Every other type of query is either very inefficient or impossible to resolve
  3. StratioDeep features (I)
    • Supports CQL3 features
    • Use of secondary indexes
    • Small codebase (fewer bugs)
  4. StratioDeep features (II)
    Provides a Java-friendly API:
    • Developers map Column Families to custom serializable POJOs
    • StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs
    • SQL-like Domain Specific Language
  5. Stratio DSL (I)
    SQL-like domain specific language:
    • Built on top of Spark’s API
    • SQL + Linq abstractions
    • Unique interface to all Stratio platform modules
  6. Stratio DSL (II)
    Stratio RT extension
    • Built on top of the Spark Streaming API
    Stratio BUS extension
    • Registration of new channels/consumers/producers
    Cross-module integration with StratioMeta
    • Lets us create flows of data between StratioDeep and StratioRT
    • Materialized views, live queries, alerts, etc.
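
Slide 4 mentions mapping Column Families to custom POJOs and running calculations directly over them. The following is a rough Python analogue that uses a dataclass in place of a POJO. `Tweet`, its fields, `row_to_entity` and the sample rows are all hypothetical and do not reflect StratioDeep's actual Java API.

```python
from dataclasses import dataclass, fields


@dataclass
class Tweet:
    """Hypothetical entity standing in for a user-defined POJO."""
    tweet_id: str
    author: str
    retweets: int


def row_to_entity(entity_cls, row):
    """Build an entity from a raw row (a dict), keeping only the declared fields."""
    declared = {f.name for f in fields(entity_cls)}
    return entity_cls(**{k: v for k, v in row.items() if k in declared})


# Raw rows as they might come back from a column family (made-up data).
raw_rows = [
    {"tweet_id": "t1", "author": "ana", "retweets": 3, "internal_ts": 1},
    {"tweet_id": "t2", "author": "bob", "retweets": 5, "internal_ts": 2},
]

tweets = [row_to_entity(Tweet, row) for row in raw_rows]

# Calculations are then written against typed objects, not raw columns.
total_retweets = sum(t.retweets for t in tweets)
```

Working with typed entities keeps column names and conversions in one place, so the analysis code reads like ordinary object code.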