Efficient solution integrating Spark & Cassandra by ALVARO AGEA & LUCA ROSELLINI Big Data Spain 2013

Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than running Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is.

Session presented at Big Data Spain 2013 Conference
7th Nov 2013
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2013/conference/an-efficient-data-mining-solution-by-integrating-spark-and-cassandra

Big Data Spain

November 15, 2013

Transcript

  1. An efficient data mining solution by integrating Spark and Cassandra

    Alvaro Agea Herradón – Luca Rosellini
  2. AN EFFICIENT DATA MINING SOLUTION

  3. Hadoop?

  4. Cassandra?

  5. Spark?

  6. Stratio Deep: An efficient data mining solution #StratioB
    “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell
  7. None
  8. Goals #StratioB
    • Why do you need Cassandra?
    • What is the problem?
    • Why do you need Spark?
    • How do they work together?
  9. Cassandra #StratioB
    • Based on Dynamo…
    • Replication, Key/Value, P2P
    • And based on Big Table…
    • Column oriented
  10. ROBUST FAST EFFICIENT

  11. NO BOTTLENECK REPLICATED DECENTRALIZED

  12. Another Database?

  13. Why?

  14. Case A: One user – Lots of data #StratioB

  15. Case B: Many users – Little data #StratioB

  16. Case C: Many users – Lots of data #StratioB

  17. Crawler app #StratioB
    Cassandra, I choose you
    100 M indexed pages
    3k reads
    Query time < 1s
  18. But…

  19. Marketing walks in

  20. New query #StratioB
    “I need to find all the references to the domain ACME. I need the answer by Friday.”
  21. Problem #StratioB
    Cassandra is not well suited to resolving this type of query.
    You need to design the schema with the query in mind.
  22. Challenge Accepted

  23. What options do we have? #StratioB
    • Run Hive Query on top of C*
    • Write an ETL script and load data into another DB
    • Clone the cluster
  24. What options do we have? #StratioB
    Run Hive Query on top of C*
    Write ETL scripts and load into another DB
    Clone the cluster
  25. And now… what can we do? #StratioB
    “We can't solve problems by using the same kind of thinking we used when we created them” Albert Einstein
  26. Spark #StratioB
    • Alternative to MapReduce
    • A low-latency cluster computing system
    • For very large datasets
    • Created by UC Berkeley AMP Lab in 2010
    • May be 100 times faster than MapReduce for:
      - Interactive algorithms
      - Interactive data mining
  27. Logistic regression in Spark vs Hadoop SOURCE | http://spark.incubator.apache.org/ #StratioB

  28. WHO USES SPARK?

  29. Spark and Cassandra Integration points #StratioB

  30. INTEGRATION POINTS: HDFS OVER CASSANDRA #StratioB
    Cassandra’s HDFS abstraction layer
    Advantages:
    • Easily integrates with legacy systems.
    Drawbacks:
    • Very high-level: no access to Cassandra’s low-level features.
    • Questionable performance.
  31. INTEGRATION POINTS: HDFS OVER CASSANDRA #StratioB
    Cassandra’s Hadoop Interface
    • Thrift protocol
    • CQL3 (our implementation)
      - Uses Cassandra’s new CqlPagingInputFormat
  32. INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3 #StratioB
    CQL3 Integration
    • Supports CQL3 features
    • Respects data locality
    • Good compromise between performance and implementation complexity
  33. INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3 #StratioB
    CQL3 Integration (II)
    Provides a Java-friendly API:
    • Developers map Column Families to custom serializable POJOs
    • StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
  34. Demo

  35. INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3 #StratioB
    CQL3 Integration (III)
    Drawbacks:
    • Still not performing as well as we’d like
      - Uses Cassandra’s Hadoop Interface
    • No analyst-friendly interface:
      - No SQL-like query features
  36. Future extensions: What are we currently working on? #StratioB
    Bring the integration to another level:
    • Dump Cassandra’s Hadoop Interface
    • Direct access to Cassandra’s SSTable(s) files.
    • Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power
  37. #StratioB Conclusion

  38. THANKS

  39. None
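
The problem described in slides 20 and 21 can be illustrated with a minimal sketch. It is not code from the talk and uses neither Cassandra nor Spark: the `Page` type, the partitioning scheme, the sample URLs and every function name below are assumptions made for illustration. It models a crawler store keyed by page URL, where a lookup by key touches a single partition, while "find all the references to the domain ACME" has to read every partition. Each per-partition scan is independent of the others, which is the kind of work a distributed engine could run on the node that already holds the data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse
from zlib import crc32


@dataclass(frozen=True)
class Page:
    url: str                 # row key: the only efficient access path
    links: Tuple[str, ...]   # outgoing links found by the (hypothetical) crawler


Partition = Dict[str, Page]


def partition_of(url: str, n: int) -> int:
    """Pick a partition from a stable hash of the row key."""
    return crc32(url.encode("utf-8")) % n


def build_partitions(pages: List[Page], n: int) -> List[Partition]:
    """Spread pages over n partitions, keyed by URL."""
    parts: List[Partition] = [{} for _ in range(n)]
    for page in pages:
        parts[partition_of(page.url, n)][page.url] = page
    return parts


def get_page(parts: List[Partition], url: str) -> Optional[Page]:
    """Query the schema was designed for: one key, one partition."""
    return parts[partition_of(url, len(parts))].get(url)


def scan_partition(part: Partition, domain: str) -> List[str]:
    """Per-partition work: URLs of local pages that link to `domain`."""
    return [
        page.url
        for page in part.values()
        if any(urlparse(link).hostname == domain for link in page.links)
    ]


def find_references(parts: List[Partition], domain: str) -> List[str]:
    """Query the schema was not designed for: every partition is read.

    The loop is sequential here; the scans do not depend on each other.
    """
    hits: List[str] = []
    for part in parts:
        hits.extend(scan_partition(part, domain))
    return sorted(hits)


# Made-up sample data for illustration only.
PAGES = [
    Page("http://a.example/", ("http://acme.example/products",)),
    Page("http://b.example/", ("http://other.example/",)),
    Page("http://c.example/", ("http://b.example/", "http://acme.example/")),
]
PARTS = build_partitions(PAGES, 3)
```

With this data, `get_page(PARTS, "http://b.example/")` reads one partition, whereas `find_references(PARTS, "acme.example")` visits all three.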