
Spark MongoDB Connector Introduction - London MUG

rozza
September 08, 2016

Slides from the London MongoDB User Group Sept 2016


Transcript

  1. What is Spark? Fast and general computing engine for clusters
     •  Makes it easy and fast to process large datasets
     •  APIs in Scala, Python, Java, R
     •  Libraries for SQL, streaming, machine learning, …
     •  It’s fundamentally different to what’s come before
  2. Why not just use Hadoop?
     •  Spark is FAST
        –  Faster to write
        –  Faster to run: up to 100x faster than Hadoop in memory, 10x faster on disk
  3. Spark Programming Model: Resilient Distributed Datasets
     •  An RDD is a collection of elements that is immutable, distributed and fault-tolerant
     •  Transformations can be applied to an RDD, resulting in a new RDD
     •  Actions can be applied to an RDD to obtain a value
     •  RDDs are lazy: nothing is computed until an action runs (see the sketch below)
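     To make the transformation/action split concrete, here is a minimal sketch in plain Spark Scala (assuming an existing SparkContext named sc; all names are illustrative):

        // Transformations build new RDDs lazily; nothing executes yet
        val numbers = sc.parallelize(1 to 100)      // RDD[Int]
        val evens   = numbers.filter(_ % 2 == 0)    // transformation: new RDD
        val doubled = evens.map(_ * 2)              // transformation: still lazy
        // An action triggers the actual computation and returns a value
        val total   = doubled.reduce(_ + _)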
  4. RDD Operations
     •  Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
     •  Actions: reduce, collect, count, save, lookupKey, take, foreach
  5. Built-in fault tolerance: RDDs maintain lineage information that can be used to reconstruct lost partitions

        val searches = sc.textFile("hdfs://...")
          .filter(_.contains("Search"))
          .map(_.split("\t")(2))
          .cache()
          .filter(_.contains("MongoDB"))
          .count()

     [Lineage diagram: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count]
  6. Spark topology
     [Diagram: Spark Driver and Cluster Manager coordinating Workers 1…n, each reading from the Data source]
  7. Spark high-level view
     •  Spark Core – Unstructured Data
     •  Spark SQL – Structured Data
     •  Spark Streaming
     •  MLlib
     •  GraphX
  8. Spark & MongoDB [logo slide]
  9. Fare Calculation Engine: one of the world’s largest airlines migrates from Oracle to MongoDB and Apache Spark to support a 100x performance improvement
     •  Problem
        –  China Eastern targeting 130,000 seats sold every day across its web and mobile channels
        –  New fare calculation engine needed to support 20,000 search queries per second, but the current Oracle platform supported only 200 per second
     •  Solution
        –  Apache Spark used for fare calculations, using business rules stored in MongoDB
        –  Fare calculations written to MongoDB for access by the search application
        –  MongoDB Connector for Apache Spark allows seamless integration with data locality awareness across the cluster
     •  Results
        –  Cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day
        –  Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers
        –  MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support
  10. Reads under the hood: MongoSpark.load(sparkSession).count()
      1.  Create a MongoRDD[Document]
      2.  Partition the data (calculate the partitions)
      3.  Get the preferred locations and allocate workers
      4.  For each partition:
          i.   Query and return the cursor
          ii.  Iterate the cursor and sum up the data
      5.  Finally, the Spark application returns the sum of the sums
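      A hedged sketch of this read path (assumes the Spark 2.0-style SparkSession API shown on the slide; the app name, URI, database and collection are placeholders):

         import org.apache.spark.sql.SparkSession
         import com.mongodb.spark.MongoSpark

         val sparkSession = SparkSession.builder()
           .appName("mongo-read-sketch")                         // illustrative name
           .config("spark.mongodb.input.uri",
             "mongodb://127.0.0.1/test.myCollection")            // placeholder URI
           .getOrCreate()

         // Partitions the collection, queries each partition and
         // sums the per-partition counts, as described above
         val total = MongoSpark.load(sparkSession).count()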
  11. Writes under the hood: MongoSpark.save(rdd)
      1.  Create a MongoDB Connector
      2.  For each partition:
          i.   Group the data in batches
          ii.  Insert into the collection
      *  DataFrames will upsert if there is an `_id`
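      A matching write sketch, reusing the sparkSession from the read sketch and assuming spark.mongodb.output.uri is set on its config (the data is illustrative):

         import org.bson.Document
         import com.mongodb.spark.MongoSpark

         val sc = sparkSession.sparkContext
         val documents = sc.parallelize((1 to 10).map(i => Document.parse(s"{ test : $i }")))
         MongoSpark.save(documents)   // each partition is grouped into batches and inserted

      MongoSpark.save also accepts DataFrames; as noted above, a DataFrame with an `_id` field is written as an upsert rather than a plain insert.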
  12. Performance
      •  MongoDB usual suspects
         –  Document design
         –  Indexes
         –  Read Concern / Write Concern
      •  Spark specifics
         –  Partitioning strategy
         –  Data locality
  13. Configuration: the connector is designed to be highly configurable and extendable. Configure via the Spark Conf, options or ReadConfig/WriteConfig instances (see the sketch below).
      •  Read configuration
         –  URI, database name and collection name
         –  Read Preference, Read Concern
         –  Partitioner
         –  Sample size (for inferring the schema)
         –  Local threshold (for choosing the MongoS)
      •  Write configuration
         –  URI, database name and collection name
         –  Write Concern
         –  Local threshold (for choosing the MongoS)
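      As an example of per-operation configuration, a sketch using a ReadConfig instance (the collection name and read preference are placeholders; unset values fall back to the SparkContext configuration):

         import com.mongodb.spark.MongoSpark
         import com.mongodb.spark.config.ReadConfig

         // Override the collection and read preference for this read only
         val readConfig = ReadConfig(
           Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"),
           Some(ReadConfig(sparkSession.sparkContext)))
         val customRdd = MongoSpark.load(sparkSession.sparkContext, readConfig)
         println(customRdd.count)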
  14. Data locality on sharded clusters
      [Diagram: Spark workers reading via MongoS routers in front of the MongoD shards]
      •  Configure: ReadPreference, LocalThreshold, MongoShardedPartitioner (see the sketch below)
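      A hedged sketch of such a setup via the Spark Conf (hostnames, database and collection are placeholders; the option values follow the connector’s documented keys):

         import org.apache.spark.sql.SparkSession

         val session = SparkSession.builder()
           .config("spark.mongodb.input.uri",
             "mongodb://mongos1:27017,mongos2:27017/db.coll")        // placeholder mongos hosts
           .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
           .config("spark.mongodb.input.readPreference.name", "nearest")
           .config("spark.mongodb.input.localThreshold", "15")       // ms window for choosing a MongoS
           .getOrCreate()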
  15. Spark and MongoDB
      •  An extremely powerful combination
      •  Many possible use cases
      •  Warning: some operations may be faster if performed using the Aggregation Framework
      •  Evolving all the time