December 04, 2013

StratioDeep: An Integration Layer Between Spark and Cassandra

We present StratioDeep, an integration layer between the Spark distributed computing framework and Cassandra, a NoSQL distributed database.

Cassandra brings together the distributed systems technology of Dynamo and the data model of Google’s BigTable. Like Dynamo, Cassandra is eventually consistent and based on a P2P model without a single point of failure. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. For these reasons, C* is one of the most popular NoSQL databases. One of its handicaps, however, is that the schema must be modeled around the queries that will be executed, because C* is oriented toward search by key.
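To make the key-oriented limitation concrete, here is a minimal Java sketch. It is not the Cassandra API: `KeyedTable`, `Event`, and the column names are hypothetical, and an in-memory map stands in for a table partitioned by key. It only illustrates why a query on the key is cheap while any other query has to visit every partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a key-oriented table; NOT the Cassandra API.
class KeyedTable {
    // One row per event; userId plays the role of the partition key.
    static final class Event {
        final String userId;
        final String country;
        final long amount;

        Event(String userId, String country, long amount) {
            this.userId = userId;
            this.country = country;
            this.amount = amount;
        }
    }

    private final Map<String, List<Event>> partitions = new HashMap<>();

    void insert(Event e) {
        partitions.computeIfAbsent(e.userId, k -> new ArrayList<>()).add(e);
    }

    // The query the schema was designed for: a single lookup on the key.
    List<Event> byUser(String userId) {
        return partitions.getOrDefault(userId, List.of());
    }

    // A query the schema was not designed for: every partition must be scanned.
    List<Event> byCountry(String country) {
        List<Event> out = new ArrayList<>();
        for (List<Event> rows : partitions.values()) {
            for (Event e : rows) {
                if (e.country.equals(country)) {
                    out.add(e);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        KeyedTable t = new KeyedTable();
        t.insert(new Event("alice", "ES", 10));
        t.insert(new Event("bob", "IT", 20));
        t.insert(new Event("alice", "IT", 30));
        System.out.println(t.byUser("alice").size());  // key lookup
        System.out.println(t.byCountry("IT").size());  // full scan
    }
}
```

In a real cluster the full scan in `byCountry` is the kind of work that is better spread over many machines than pushed through a key-oriented query interface.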

Integrating C* and Spark gives us a system that combines the best of both worlds.

Existing integrations between the two systems are not satisfactory: they basically provide an HDFS abstraction layer over C*. We believe this solution is inefficient because it introduces significant overhead between the two systems.

The purpose of our work has been to provide a much lower-level integration that not only performs better but also lets Cassandra address a wide range of new use cases, thanks to the power of the Spark distributed computing framework.

We’ve already deployed this solution in real applications for diverse clients, including pattern detection, log mining, fraud detection, sentiment analysis and financial transaction analysis.

In addition, this integration is the building block for our challenging and novel Lambda architecture, which is based entirely on Cassandra.

To complete the integration, we provide a seamless extension to the Cassandra Query Language. CQL is oriented toward key-based search and, as such, is not a good choice for queries that move a huge amount of data. We’ve extended CQL to provide a user-friendly interface, which is a new approach to batch processing over C*. It consists of an abstraction layer that translates custom CQL queries into Spark jobs and delegates to Spark the complexity of distributing the query over the underlying cluster of commodity machines.
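The idea of turning a declarative query into a distributed job can be sketched in plain Java. This is neither the StratioDeep nor the Spark API: `CountQuery`, its fields, and the sample rows are hypothetical, and a parallel stream over in-memory lists stands in for a cluster. The sketch assumes a query of the shape "count rows matching a condition, grouped by a column" and splits it into a per-partition phase and a merge phase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch; NOT the StratioDeep or Spark API.
class CountQuery<R> {
    private final Predicate<R> where;           // plays the role of a WHERE clause
    private final Function<R, String> groupBy;  // plays the role of a GROUP BY column

    CountQuery(Predicate<R> where, Function<R, String> groupBy) {
        this.where = where;
        this.groupBy = groupBy;
    }

    // "Map" phase: runs independently on each partition.
    Map<String, Long> partial(List<R> partition) {
        Map<String, Long> counts = new HashMap<>();
        for (R row : partition) {
            if (where.test(row)) {
                counts.merge(groupBy.apply(row), 1L, Long::sum);
            }
        }
        return counts;
    }

    // "Reduce" phase: merges the per-partition counts into one result.
    Map<String, Long> run(List<List<R>> partitions) {
        return partitions.parallelStream()
                .map(this::partial)
                .reduce(new HashMap<String, Long>(), (a, b) -> {
                    Map<String, Long> merged = new HashMap<>(a);
                    b.forEach((k, v) -> merged.merge(k, v, Long::sum));
                    return merged;
                });
    }

    public static void main(String[] args) {
        // Rows are {country, amount} pairs split over two partitions.
        List<List<String[]>> partitions = List.of(
                List.of(new String[]{"ES", "10"}, new String[]{"IT", "200"}),
                List.of(new String[]{"IT", "300"}, new String[]{"ES", "500"}));
        // Roughly: SELECT country, COUNT(*) ... WHERE amount > 100 GROUP BY country
        CountQuery<String[]> q = new CountQuery<>(
                r -> Long.parseLong(r[1]) > 100, r -> r[0]);
        System.out.println(q.run(partitions));
    }
}
```

The user only states the condition and the grouping column; how the work is split across partitions and merged again stays hidden behind `run`, which is the division of labor the abstraction layer described above aims for.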

Stratio

Transcript

StratioDeep: an integration layer between Spark and Cassandra
Our customers #StratioB
StratioDeep: An efficient data mining solution. “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell
Why we use Cassandra
Case A: One user – lots of data
Case B: Many users – little data
Case C: Many users – lots of data
Why we also need Spark: • In Cassandra, you need to design the schema with the query in mind • Every other type of query is either very inefficient or impossible to resolve
Challenge Accepted
StratioDeep features (I): • Supports CQL3 features • Use of secondary indexes • Small codebase (fewer bugs)
StratioDeep features (II): Provides a Java-friendly API: • Developers map Column Families to custom serializable POJOs • StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs • SQL-like Domain Specific Language
Stratio DSL (I): SQL-like domain specific language: • Built on top of Spark’s API • SQL + Linq abstractions • A single interface to all Stratio platform modules
Stratio DSL (II): Stratio RT extension: • Built on top of the Spark Streaming API. Stratio BUS extension: • Registration of new channels/consumers/producers. Cross-module integration with StratioMeta: • Lets us create flows of data between StratioDeep and StratioRT • Materialized views, live queries, alerts, etc.
Conclusion: Use case A, Use case C
THANKS Luca Rosellini @luca_rosellini, Alvaro Agea @alvaroagea