Efficient solution integrating Spark & Cassandra by ALVARO AGEA & LUCA ROSELLINI Big Data Spain 2013

An efficient data mining solution by integrating Spark and Cassandra
Alvaro Agea Herradón – Luca Rosellini

AN EFFICIENT DATA MINING SOLUTION

Hadoop?

Cassandra?

Spark?

Stratio Deep: an efficient data mining solution
“Two and two are four? Sometimes… Sometimes they are five.” G. Orwell #StratioB

Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?

Cassandra
• Based on Dynamo…
  • Replication, Key/Value, P2P
• And based on Bigtable…
  • Column oriented

ROBUST FAST EFFICIENT

NO BOTTLENECK REPLICATED DECENTRALIZED

Another Database?

Case A: One user – Lots of data

Case B: Many users – Little data

Case C: Many users – Lots of data

Crawler app: Cassandra, I choose you
• 100 M indexed pages
• 3k reads
• Query time < 1s

But…

Marketing walks in

New query
“I need to find all the references to the domain ACME. I need the answer by Friday.”

Problem
• Cassandra is not well suited to resolving this type of query
• You need to design the schema with the query in mind
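To make the problem concrete, here is a toy model in plain Python (not the Cassandra API). It assumes a hypothetical table partitioned by page URL, with invented data: the lookup the schema was designed for is a single key access, while the marketing query has no key to use and must visit every row.

```python
from urllib.parse import urlparse

# Hypothetical table, partitioned by page URL (the query the crawler needs).
pages_by_url = {
    "http://example.org/a": {"links": ["http://acme.com/products", "http://example.org/b"]},
    "http://example.org/b": {"links": ["http://example.net/"]},
    "http://example.com/c": {"links": ["http://www.acme.com/"]},
}

def get_page(url):
    """Designed-for query: one lookup by partition key."""
    return pages_by_url[url]

def pages_referencing(domain):
    """New query: no key matches it, so every row has to be scanned."""
    hits = []
    for url, row in pages_by_url.items():
        for link in row["links"]:
            host = urlparse(link).hostname or ""
            if host == domain or host.endswith("." + domain):
                hits.append(url)
                break  # one matching link is enough for this page
    return sorted(hits)
```

With 100 M pages, the second function is the kind of full scan that calls for a distributed processing engine rather than a single client.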

Challenge Accepted

What options do we have?
• Run a Hive query on top of C*
• Write an ETL script and load the data into another DB
• Clone the cluster

And now… what can we do?
“We can't solve problems by using the same kind of thinking we used when we created them” Albert Einstein

Spark
• Alternative to MapReduce
• A low latency cluster computing system
• For very large datasets
• Created by UC Berkeley AMP Lab in 2010
• May be 100 times faster than MapReduce for:
  • Iterative algorithms
  • Interactive data mining

Logistic regression in Spark vs Hadoop
SOURCE | http://spark.incubator.apache.org/
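Logistic regression is a useful benchmark here because it is iterative: every step makes a full pass over the same dataset, so a system that keeps the data in memory between passes avoids re-reading it each time. The following is a minimal single-machine sketch in plain Python (no Spark involved), using synthetic data and arbitrary hyperparameters chosen only for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(points, iterations=200, lr=0.5):
    """Batch gradient descent. `points` is a list of (features, label), label in {0, 1}."""
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in points:  # full pass over the same in-memory dataset
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * gj / len(points) for wi, gj in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0

# Synthetic data: the first feature is a constant bias term, label is 1 when x > 0.
random.seed(0)
points = [([1.0, x], 1 if x > 0 else 0)
          for x in (random.uniform(-1, 1) for _ in range(100))]
weights = train(points)
```

The inner loop is the part a cluster would distribute; the outer loop is what makes reloading the data on every iteration so costly.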

WHO USES SPARK?

Spark and Cassandra: integration points

INTEGRATION POINTS: HDFS OVER CASSANDRA
Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.
Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA
Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
  • Uses Cassandra’s new CqlPagingInputFormat

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
CQL3 Integration
• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity
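"Respects data locality" can be pictured as follows: the input is cut into one split per token range, each split remembers which nodes hold replicas of that range, and the scheduler tries to run the task on one of those nodes. This is a toy sketch in plain Python; the ring layout, host names, and the `make_splits` and `schedule` helpers are hypothetical and are not part of Cassandra, Spark, or Stratio Deep.

```python
from collections import namedtuple

Split = namedtuple("Split", "start_token end_token preferred_hosts")

def make_splits(ring):
    """`ring` is a list of (end_token, replica_hosts) sorted by token.
    Each range starts where the previous one ends; the ring wraps around."""
    splits = []
    prev = ring[-1][0]
    for end, hosts in ring:
        splits.append(Split(prev, end, tuple(hosts)))
        prev = end
    return splits

def schedule(splits, workers):
    """Place each split on a worker co-located with a replica when possible,
    falling back to round-robin when no worker is local."""
    plan = {}
    for i, s in enumerate(splits):
        local = [w for w in workers if w in s.preferred_hosts]
        plan[(s.start_token, s.end_token)] = local[0] if local else workers[i % len(workers)]
    return plan

# Hypothetical three-range ring with replication factor 2.
ring = [(0, ["node1", "node2"]), (100, ["node2", "node3"]), (200, ["node3", "node1"])]
plan = schedule(make_splits(ring), workers=["node3", "node1"])
```

When a worker runs on the same machine as a replica, the rows for its split can be read without crossing the network.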

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
CQL3 Integration (II)
Provides a Java-friendly API:
• Developers map Column Families to custom serializable POJOs
• StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
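The idea of mapping rows to plain objects and then computing over those objects can be sketched without any framework. The example below is a language-neutral illustration in plain Python, not the StratioDeep Java API: the `Page` entity, the `to_entity` mapper, and the sample rows are all hypothetical, and the built-in `map`, `filter`, and `reduce` stand in for the distributed operations.

```python
from dataclasses import dataclass
from functools import reduce
from urllib.parse import urlparse

@dataclass(frozen=True)
class Page:
    """Hypothetical entity for a `page` column family."""
    url: str
    links: tuple

def to_entity(row):
    """Map one raw row (a dict of column name to value) to a Page."""
    return Page(url=row["url"], links=tuple(row["links"]))

def references(page, domain):
    hosts = (urlparse(link).hostname or "" for link in page.links)
    return any(h == domain or h.endswith("." + domain) for h in hosts)

# Invented sample rows, as they might come back from the store.
rows = [
    {"url": "http://example.org/a", "links": ["http://acme.com/products"]},
    {"url": "http://example.org/b", "links": ["http://example.net/"]},
    {"url": "http://example.com/c", "links": ["http://www.acme.com/"]},
]

pages = map(to_entity, rows)
acme_pages = filter(lambda p: references(p, "acme.com"), pages)
count = reduce(lambda acc, _: acc + 1, acme_pages, 0)
```

The calculation is written against `Page` objects only; how the rows are fetched and where the functions run is left to the integration layer.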

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
CQL3 Integration (III)
Drawbacks:
• Still not performing as well as we’d like
  • Uses Cassandra’s Hadoop Interface
• No analyst-friendly interface:
  • No SQL-like query features

Future extensions: what are we currently working on?
Bring the integration to another level:
• Drop Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable files
• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power

Conclusion

THANKS
