clusters
• Makes it easy and fast to process large datasets
• APIs in Scala, Python, Java, and R
• Libraries for SQL, streaming, machine learning, …
• It's fundamentally different to what's come before
a collection of elements that is immutable, distributed, and fault-tolerant.
• Transformations can be applied to an RDD, producing a new RDD.
• Actions can be applied to an RDD to obtain a value.
• RDDs are lazy: transformations are only computed when an action needs a result.
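A minimal sketch of the transformation/action split, assuming an existing SparkContext named sc (the variable name is an assumption, not from the slides): the filter and map calls only record lineage, and nothing executes until the count action.

    // Transformations are lazy: they build new RDDs without running anything
    val nums   = sc.parallelize(1 to 1000000)  // immutable, distributed RDD
    val evens  = nums.filter(_ % 2 == 0)       // transformation: records lineage only
    val scaled = evens.map(_ * 2)              // another lazy transformation
    val result = scaled.count()                // action: triggers the actual computation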
Lineage can be used to reconstruct lost partitions:

    // `spark` here is a SparkContext
    val searches = spark.textFile("hdfs://...")  // HDFS RDD
      .filter(_.contains("Search"))              // Filtered RDD
      .map(_.split("\t")(2))                     // Mapped RDD
      .cache()                                   // Cached RDD

    val count = searches
      .filter(_.contains("MongoDB"))             // Filtered RDD
      .count()                                   // Count (action)

Lineage: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → count
Oracle to MongoDB and Apache Spark to support a 100x performance improvement

Problem
• China Eastern targets 130,000 seats sold every day across its web and mobile channels
• A new fare-calculation engine needed to support 20,000 search queries per second, but the existing Oracle platform supported only 200 per second

Solution
• Apache Spark is used for fare calculations, applying business rules stored in MongoDB
• Fare calculations are written to MongoDB for access by the search application
• The MongoDB Connector for Apache Spark allows seamless integration, with data-locality awareness across the cluster

Results
• A cluster of fewer than 20 API, Spark, and MongoDB nodes supports 180 million fare calculations and 1.6 billion searches per day
• Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers
• MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support
2. Partition the data.
3. Calculate the partitions.
4. Get the preferred locations and allocate workers.
5. For each partition:
   i. Query MongoDB and return a cursor.
   ii. Iterate the cursor and sum up the data.
6. Finally, the Spark application returns the sum of the sums.
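A sketch of what these steps look like from the driver's side, assuming the connector's 2.x Scala API and a collection with a numeric amount field (the URI, collection, and field name are illustrative assumptions). The connector handles the partitioning and preferred locations; each task opens a cursor over its partition, and Spark combines the per-partition sums.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.mongodb.spark.MongoSpark

    object MongoSumSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("mongo-sum-sketch")
          .set("spark.mongodb.input.uri", "mongodb://localhost/test.fares") // assumed URI
        val sc = new SparkContext(conf)

        // The connector partitions the collection and reports preferred
        // locations so tasks can be scheduled close to the data
        val rdd = MongoSpark.load(sc)

        // Each task iterates its partition's cursor and sums locally;
        // Spark then returns the sum of the per-partition sums
        val total = rdd.map(_.getInteger("amount").intValue).sum()
        println(s"total = $total")
      }
    }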
extendable. Configure via SparkConf, options, or ReadConfig/WriteConfig instances (see the sketch after this list).

Read configuration
• URI, database name, and collection name
• Read preference and read concern
• Partitioner
• Sample size (for inferring the schema)
• Local threshold (for choosing the mongos)

Write configuration
• URI, database name, and collection name
• Write concern
• Local threshold (for choosing the mongos)
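A minimal sketch of per-operation configuration with the connector's 2.x Scala API, assuming an existing SparkContext sc whose SparkConf already sets the default spark.mongodb.input.uri and spark.mongodb.output.uri (the collection names and option values below are illustrative assumptions):

    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.{ReadConfig, WriteConfig}

    // Override the default read settings for this one load
    val readConfig = ReadConfig(
      Map(
        "collection"          -> "fares",            // assumed collection
        "readPreference.name" -> "secondaryPreferred"
      ),
      Some(ReadConfig(sc)))  // fall back to the SparkConf defaults

    val rdd = MongoSpark.load(sc, readConfig)

    // Write to a different collection with a stricter write concern
    val writeConfig = WriteConfig(
      Map(
        "collection"     -> "faresOut",              // assumed collection
        "writeConcern.w" -> "majority"
      ),
      Some(WriteConfig(sc)))

    MongoSpark.save(rdd, writeConfig)

Passing Some(ReadConfig(sc)) / Some(WriteConfig(sc)) as the default means any option not overridden in the Map is inherited from the SparkConf, which keeps per-job overrides small.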