
Spark & Riak - Introduction to the spark-riak-connector


A short presentation of how Riak and Spark relate, with a demo of the spark-riak-connector through several examples available on GitHub: https://github.com/ogirardot/spark-riak-example

Olivier Girardot

January 14, 2016

Transcript

  1. Me, Myself & I
     - Associate at LateralThoughts.com
     - Scala, Java, Python developer
     - Data Engineer @ Axa & Carrefour
     - Apache Spark trainer with Databricks
  2. And the Other One …
     - Director of Sales @ Basho Technologies (Basho makes Riak)
     - Formerly of MySQL France
     - Co-founder of MariaDB
     - Funny accent
  3. Quick Introduction …
     - 2011: creators of Riak
       - Riak KV: NoSQL key-value database
       - Riak S2: large object storage
     - 2015: new products
       - Basho Data Platform: integrated NoSQL databases, caching, in-memory analytics, and search
       - Riak TS: NoSQL time series database
     - 120+ employees; global offices in Seattle (HQ), Washington DC, London, Paris, Tokyo
     - 300+ enterprise customers, 1/3 of the Fortune 50
  4. Prioritized Needs
     - High availability: critical data
     - High scale: heavy reads & writes
     - Geo locality: multiple data centers
     - Operational simplicity: resources don't scale as clusters do
     - Data accuracy: write conflict options

     Riak S2 use cases: large object store, content distribution, web & cloud services, active archives
     Riak KV use cases: user data, session data, profile data, real-time data, log data
     Riak TS use cases: IoT/devices, financial/economic data, scientific observations, log data
  5. Pre-Requisites
     To use the Spark Riak Connector, as of now, you need to build it yourself:
     - Clone https://github.com/basho/spark-riak-connector
     - `git checkout v1.1.0`
     - `mvn clean install`
  6. Reading from Riak
     Connect to a Riak KV cluster from Spark and query it:
     - Full scan
     - Using keys
     - Using secondary indexes (2i)
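The three query styles above can be sketched as follows. This is a hedged sketch, not code from the deck: the query methods (`queryAll`, `queryBucketKeys`, `query2iRange`) and the `spark.riak.connection.host` property are taken from the connector's v1.x API as I understand it, and the bucket name, keys, and index name are made-up examples. Running it requires a live Riak node.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// The connector's implicits add riakBucket to SparkContext (assumed package).
import com.basho.riak.spark._

object QueryStyles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("riak-query-styles")
      // Point the connector at a Riak node's protobuf endpoint (assumed property name).
      .set("spark.riak.connection.host", "127.0.0.1:8087")
    val sc = new SparkContext(conf)

    val rdd = sc.riakBucket[String]("test-bucket")

    // 1. Full scan: read every value in the bucket.
    val all = rdd.queryAll()

    // 2. By keys: fetch only the listed keys.
    val some = rdd.queryBucketKeys("key-1", "key-2")

    // 3. By secondary index (2i): range query on an indexed field.
    val ranged = rdd.query2iRange("creationNo", 1L, 100L)

    println(s"full scan count: ${all.count()}")
    sc.stop()
  }
}
```

A full scan streams the whole bucket and is the most expensive option; prefer key lists or 2i ranges when you can bound the query.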
  7. Loading data
     On your SparkContext, you can use:
     - riakBucket[V](bucketName: String): RiakRDD[V]
     - riakBucket[V](bucketName: String, bucketType: String): RiakRDD[V]
     - riakBucket[K, V](bucketName: String, convert: (Location, RiakObject) => (K, V)): RiakRDD[(K, V)]
     - …
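The overloads above differ in how values are materialized. A hedged sketch of each, assuming a `SparkContext` already configured for the connector; the bucket names and bucket type are illustrative, and the `Location`/`RiakObject` accessors (`getKeyAsString`, `getValue.toStringUtf8`) are from the Riak Java client as I recall it:

```scala
import com.basho.riak.client.core.query.{Location, RiakObject}
import com.basho.riak.spark._
import org.apache.spark.SparkContext

def examples(sc: SparkContext): Unit = {
  // Simplest form: values deserialized to String from the default bucket type.
  val values = sc.riakBucket[String]("users")

  // Same, but targeting an explicit bucket type.
  val typed = sc.riakBucket[String]("users", "my-bucket-type")

  // With a convert function: build (key, value) pairs from the raw
  // Location/RiakObject, e.g. to keep the Riak key alongside the value.
  val pairs = sc.riakBucket[String, String](
    "users",
    (location: Location, obj: RiakObject) =>
      (location.getKeyAsString, obj.getValue.toStringUtf8)
  )
}
```

The `convert` variant is the escape hatch: it hands you the raw object so you can control deserialization and key handling yourself instead of relying on the connector's default mapping.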
  8. Spark Riak Connector - Roadmap (DRAFT)
     - Better integration with Riak TS
       - Enhanced DataFrames, based on Riak TS schema APIs
       - Server-side aggregations and grouping, using TS SQL commands
     - Speed
       - Data locality (partition RDDs according to replication in the cluster): launch Spark executors on the same nodes where the data resides
       - Better mapping from vnodes to Spark workers, using the coverage plan
     - Better support for Riak data types (CRDTs) and Search queries
       - Today this requires using the Java Riak client APIs
     - Spark Streaming
       - Provide an example and sample integration with Apache Kafka
       - Improve reliability by using Riak for checkpoints and the WAL
     - Add examples and documentation for Python support