Slide 1

SPARK & RIAK: INTRODUCTION TO THE SPARK-RIAK-CONNECTOR

Slide 2

Me, Myself & I
- Associate at LateralThoughts.com
- Scala, Java, and Python developer
- Data Engineer @ Axa & Carrefour
- Apache Spark trainer with Databricks

Slide 3

And the Other One…
- Director of Sales @ Basho Technologies (Basho makes Riak)
- Formerly of MySQL France
- Co-founder of MariaDB
- Funny accent

Slide 4

Quick Introduction…
- 2011: Creators of Riak
  - Riak KV: NoSQL key-value database
  - Riak S2: large object storage
- 2015: New products
  - Basho Data Platform: integrated NoSQL databases, caching, in-memory analytics, and search
  - Riak TS: NoSQL time series database
- 120+ employees
- Global offices: Seattle (HQ), Washington DC, London, Paris, Tokyo
- 300+ enterprise customers, 1/3 of the Fortune 50

Slide 5

No content

Slide 6

PRIORITIZED NEEDS
- High Availability: critical data
- High Scale: heavy reads & writes
- Geo Locality: multiple data centers
- Operational Simplicity: resources don't scale as clusters
- Data Accuracy: write conflict options

RIAK S2 USE CASES
- Large object store
- Content distribution
- Web & cloud services
- Active archives

RIAK KV USE CASES
- User data
- Session data
- Profile data
- Real-time data
- Log data

RIAK TS USE CASES
- IoT/devices
- Financial/economic
- Scientific observations
- Log data

Slide 7

The Evolution of NoSQL
- Unstructured data platforms
- Multi-model solutions
- Point solutions

Slide 8

Basho Data Platform …

Slide 9

ABOUT SPARK & RIAK

Slide 10

Spark & Riak
Disclaimer: the following presentation uses:
- Spark v1.5.2
- Spark-Riak-Connector v1.1.0

Slide 11

Pre-Requisites
To use the Spark-Riak-Connector, as of now, you need to build it yourself:
- Clone https://github.com/basho/spark-riak-connector
- `git checkout v1.1.0`
- `mvn clean install`

Slide 12

Bootstrapped project
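
A minimal sketch of what the bootstrapped build could declare, assuming sbt and that the connector was installed to the local Maven repository by the `mvn clean install` step above (the artifact coordinates are assumptions based on the connector's POM):

```scala
// build.sbt (sketch)
scalaVersion := "2.10.6"

// pick up the locally built connector from `mvn clean install`
resolvers += Resolver.mavenLocal

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "1.5.2" % "provided",
  "com.basho.riak"   %  "spark-riak-connector_2.10" % "1.1.0"
)
```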

Slide 13

Reading from Riak
- Connect to a Riak KV cluster from Spark
- Query it:
  - Full scan
  - Using keys
  - Using secondary indexes (2i)

Slide 14

Connecting to Riak
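
A minimal sketch of the connection setup, using the connector's `spark.riak.connection.host` property; the host and port are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-riak-example")
  // Riak protocol buffers endpoint, as "host:port" (placeholder values)
  .set("spark.riak.connection.host", "127.0.0.1:8087")

val sc = new SparkContext(conf)
```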

Slide 15

Loading data from Riak. On your SparkContext, you can use:
- `riakBucket[V](bucketName: String): RiakRDD[V]`
- `riakBucket[V](bucketName: String, bucketType: String): RiakRDD[V]`
- `riakBucket[K, V](bucketName: String, convert: (Location, RiakObject) => (K, V)): RiakRDD[(K, V)]`
- …
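
For example, assuming the implicits from slide 18 are in scope and a bucket named `test-data` exists:

```scala
import com.basho.riak.spark._

// Builds a RiakRDD over the bucket; no data is fetched yet
val rdd = sc.riakBucket[String]("test-data")
```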

Slide 16

You then need to add a query, otherwise…

Slide 17

Find all, and find by key(s); a sketch of both follows.
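
Both queries, with hypothetical bucket and key names:

```scala
// Full scan over the bucket
val all = sc.riakBucket[String]("users").queryAll()

// Fetch only the listed keys
val some = sc.riakBucket[String]("users").queryBucketKeys("alice", "bob", "carol")
```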

Slide 18

The implicits that give you the riak* methods come from a single import, shown below.
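
```scala
// Brings riakBucket on SparkContext and saveToRiak on RDDs into scope
import com.basho.riak.spark._
```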

Slide 19

Reading from Riak
- Using case classes
- Using secondary indexes (2i)
(A sketch of both follows.)
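
A sketch of both reads, assuming JSON values that deserialize into the case class; the bucket, field, and index names are hypothetical:

```scala
import com.basho.riak.spark._

case class User(name: String, age: Int)

// Full scan, deserializing each stored JSON value into a User
val users = sc.riakBucket[User]("users").queryAll()

// Range query over the hypothetical integer 2i index "age_int"
val adults = sc.riakBucket[User]("users").query2iRange("age_int", 18L, 120L)
```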

Slide 20

Basic I/O
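
A minimal read-transform-write round trip, assuming hypothetical bucket names; `saveToRiak` comes from the same implicits:

```scala
import com.basho.riak.spark._

// Read every value from one bucket…
val in = sc.riakBucket[String]("input-bucket").queryAll()

// …transform it, then write the results to another bucket
in.map(_.toUpperCase).saveToRiak("output-bucket")
```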

Slide 21

Mapping Objects to Buckets
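
One way to control the object-to-bucket mapping is the converter overload from slide 15; a sketch that keeps the Riak key alongside the value (the bucket name and accessor details are assumptions):

```scala
import com.basho.riak.client.core.query.{Location, RiakObject}
import com.basho.riak.spark._

// A custom converter receives the Location (bucket + key) and the raw
// RiakObject, and decides what each RDD element looks like
val pairs = sc.riakBucket[String, String](
  "users",
  (location: Location, obj: RiakObject) =>
    (location.getKeyAsString, obj.getValue.toStringUtf8)
).queryAll()
```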

Slide 22

No content

Slide 23

Adding fields during save
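
The connector can derive extra fields at write time from annotations on the case class; a sketch, assuming the `RiakKey`/`RiakIndex` annotations from the Riak Java client (field names and values are illustrative):

```scala
import com.basho.riak.client.api.annotations.{RiakIndex, RiakKey}
import com.basho.riak.spark._
import scala.annotation.meta.field

// user_id becomes the Riak key; group_id is written as the 2i index "groupId"
case class ORMDomainObject(
  @(RiakKey@field) user_id: String,
  @(RiakIndex@field)(name = "groupId") group_id: Long,
  login: String)

sc.parallelize(Seq(ORMDomainObject("u1", 42L, "alice"))).saveToRiak("users")
```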

Slide 24

Spark Riak Connector: Roadmap (DRAFT)
- Better integration with Riak TS
  - Enhanced DataFrames, based on Riak TS schema APIs
  - Server-side aggregations and grouping, using TS SQL commands
- Speed
  - Data locality (partition RDDs according to replication in the cluster): launch Spark executors on the same nodes where the data resides
  - Better mapping from vnodes to Spark workers using the coverage plan
- Better support for Riak data types (CRDTs) and search queries
  - Today this requires using the Riak Java client APIs
- Spark Streaming
  - Provide an example and sample integration with Apache Kafka
  - Improve reliability using Riak for checkpoints and WAL
- Add examples and documentation for Python support

Slide 25

Thank you
@ogirardot / [email protected]
https://github.com/ogirardot/spark-riak-example
https://speakerdeck.com/ogirardot/spark-and-riak-introduction-to-the-spark-riak-connector
@mcarney23 / [email protected]
fr.basho.com