
Spark & Riak - Introduction to the spark-riak-connector


A short presentation of how Riak and Spark relate, with a demo of the spark-riak-connector through several examples available on GitHub: https://github.com/ogirardot/spark-riak-example

Olivier Girardot

January 14, 2016

Transcript

  1. Me, Myself & I
     - Associate at LateralThoughts.com
     - Scala, Java, Python developer
     - Data Engineer @ Axa & Carrefour
     - Apache Spark trainer with Databricks
  2. And the Other One …
     - Director of Sales @ Basho Technologies (Basho makes Riak)
     - Formerly of MySQL France
     - Co-founder of MariaDB
     - Funny accent
  3. Quick Introduction …
     - 2011: creators of Riak
       - Riak KV: NoSQL key-value database
       - Riak S2: large object storage
     - 2015: new products
       - Basho Data Platform: integrated NoSQL databases, caching, in-memory analytics, and search
       - Riak TS: NoSQL time series database
     - 120+ employees; global offices in Seattle (HQ), Washington DC, London, Paris, Tokyo
     - 300+ enterprise customers, 1/3 of the Fortune 50
  4. Prioritized Needs
     - High availability: critical data
     - High scale: heavy reads & writes
     - Geo locality: multiple data centers
     - Operational simplicity: resources don't scale as clusters do
     - Data accuracy: write conflict options

     Riak S2 use cases: large object store, content distribution, web & cloud services, active archives
     Riak KV use cases: user data, session data, profile data, real-time data, log data
     Riak TS use cases: IoT/devices, financial/economic data, scientific observations, log data
  5. Pre-Requisites
     To use the Spark Riak Connector, as of now, you need to build it yourself:
     - Clone https://github.com/basho/spark-riak-connector
     - `git checkout v1.1.0`
     - `mvn clean install`
  6. Reading from Riak
     Connect to a Riak KV cluster from Spark and query it:
     - Full scan
     - Using keys
     - Using secondary indexes (2i)
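The three query styles above can be sketched as follows. This is a hedged sketch, not code from the deck: the query methods (`queryAll`, `queryBucketKeys`, `query2iRange`) and the `spark.riak.connection.host` property are taken from the connector's v1.x API as I understand it, and the bucket name, keys, and index name are made-up examples. Running it requires a live Riak node.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// The connector's implicits add riakBucket to SparkContext (assumed package).
import com.basho.riak.spark._

object QueryStyles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("riak-query-styles")
      // Point the connector at a Riak node's protobuf endpoint (assumed property name).
      .set("spark.riak.connection.host", "127.0.0.1:8087")
    val sc = new SparkContext(conf)

    val rdd = sc.riakBucket[String]("test-bucket")

    // 1. Full scan: read every value in the bucket.
    val all = rdd.queryAll()

    // 2. By keys: fetch only the listed keys.
    val some = rdd.queryBucketKeys("key-1", "key-2")

    // 3. By secondary index (2i): range query on an indexed field.
    val ranged = rdd.query2iRange("creationNo", 1L, 100L)

    println(s"full scan count: ${all.count()}")
    sc.stop()
  }
}
```

A full scan streams the whole bucket and is the most expensive option; prefer key lists or 2i ranges when you can bound the query.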
  7. Loading data
     On your SparkContext, you can use:
     - riakBucket[V](bucketName: String): RiakRDD[V]
     - riakBucket[V](bucketName: String, bucketType: String): RiakRDD[V]
     - riakBucket[K, V](bucketName: String, convert: (Location, RiakObject) => (K, V)): RiakRDD[(K, V)]
     - …
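The overloads above differ in how values are materialized. A hedged sketch of each, assuming a `SparkContext` already configured for the connector; the bucket names and bucket type are illustrative, and the `Location`/`RiakObject` accessors (`getKeyAsString`, `getValue.toStringUtf8`) are from the Riak Java client as I recall it:

```scala
import com.basho.riak.client.core.query.{Location, RiakObject}
import com.basho.riak.spark._
import org.apache.spark.SparkContext

def examples(sc: SparkContext): Unit = {
  // Simplest form: values deserialized to String from the default bucket type.
  val values = sc.riakBucket[String]("users")

  // Same, but targeting an explicit bucket type.
  val typed = sc.riakBucket[String]("users", "my-bucket-type")

  // With a convert function: build (key, value) pairs from the raw
  // Location/RiakObject, e.g. to keep the Riak key alongside the value.
  val pairs = sc.riakBucket[String, String](
    "users",
    (location: Location, obj: RiakObject) =>
      (location.getKeyAsString, obj.getValue.toStringUtf8)
  )
}
```

The `convert` variant is the escape hatch: it hands you the raw object so you can control deserialization and key handling yourself instead of relying on the connector's default mapping.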
  8. Spark Riak Connector - Roadmap (DRAFT)
     - Better integration with Riak TS
       - Enhanced DataFrames, based on Riak TS schema APIs
       - Server-side aggregations and grouping, using TS SQL commands
     - Speed
       - Data locality (partition RDDs according to replication in the cluster): launch Spark executors on the same nodes where the data resides
       - Better mapping from vnodes to Spark workers, using the coverage plan
     - Better support for Riak data types (CRDTs) and Search queries
       - Today this requires using the Java Riak client APIs
     - Spark Streaming
       - Provide an example and sample integration with Apache Kafka
       - Improve reliability by using Riak for checkpoints and the WAL
     - Add examples and documentation for Python support