

mongodb
June 26, 2012

MongoDB DC 2012: Why We Chose MongoDB to Put Big Data "On The Map"

Nicholas Knize, Thermopylae Sciences and Technology
Traditional Geospatial Information Systems (GIS) have depended heavily on Relational Databases (RDBMS) for indexing. Relational databases prioritize long-running transactions, predefined, fixed, normalized schemas, and large joins, which leave them ill-equipped to support the Big Data challenges of data volume, variability, and velocity. Relational database theory is not optimal for geospatial applications, and decent relational geospatial databases are expensive and typically difficult to maintain. Learn how Thermopylae Sciences and Technology applied MongoDB, a non-relational (NoSQL) database, to tackle dynamic geospatial data at massive scale. The presentation covers why NoSQL technology was chosen over relational databases for scaling geospatial data, why MongoDB (and what enhancements were made to the spatial indexer), and where it is being used.


Transcript

  1. ISPATIAL OVERVIEW: ACCOMPLISHING THE IMPOSSIBLE
    • Expose enterprise data in a geo-temporal, user-defined environment
    • Provide a flexible and scalable spatial indexing framework for heterogeneous data
    • Visualize spatially referenced data on 3D globe & 2D maps
    • Manage real-time data feeds and mobile messaging
    • View data over geo-rectified imagery with 3D terrain
    • Support mission planning and simulation
    • Provide real-time collaboration and sharing

  2. NOSQL OPTIONS
    Subset of Evaluated NoSQL Options
    • Cassandra
      – Nice Bring Your Own Index (BYOI) design
      – … but Java, Java, Java… Memory management can be an issue
      – Adding new nodes can be a pain (Token Changes, nodetool)
      – Key-Value store… good for simple data models
    • HBase
      – Nice BigTable model
      – Theory grounded heavily in C.A.P, inflexible trade-offs
      – Complicated setup and maintenance
    • CouchDB
      – Provides some GeoSpatial functionality
      – HEAVILY dependent on Map-Reduce model (complicated design)
      – Erlang based – poor multi-threaded heap management

  3. WHY TST LIKES MONGODB
    Why MongoDB for Thermopylae?
    • Documents based on JavaScript Object Notation (JSON) – a GeoJSON match made in heaven!
    • C++ - No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging
    • Disk storage is memory mapped, enabling fast swapping when necessary
    • Built-in auto-failover with replica sets and fast recovery with journaling
    • Tunable Consistency – Consistency defined at application layer
    • Schema Flexible – friendly properties of SQL enable easy port
    • Provided initial spatial indexing support – Point based limited!

  4. GEOHASH BTREE ISSUES
    Issues with the Geohash b-Tree approach
    • Neighbors aren't so close!
      – Neighboring points on the Geoid may end up on opposite ends of the plane
      – Impacts search efficiency
    • What about Geometry?
      – Doesn't support > 2D
      – Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document

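
The "neighbors aren't so close" issue can be illustrated with a minimal sketch of a geohash-style key, built by alternately bisecting longitude and latitude and interleaving the resulting bits. The helper name `interleaved_hash` and the 16 bits per axis are assumptions made for illustration; this is not MongoDB's actual geohash implementation.

```python
def interleaved_hash(lat, lon, bits_per_axis=16):
    """Geohash-style integer key: interleave longitude and latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    key = 0
    for _ in range(bits_per_axis):
        # One longitude bit: 1 if the point lies in the upper half of the range.
        mid = (lon_lo + lon_hi) / 2
        if lon >= mid:
            key = (key << 1) | 1
            lon_lo = mid
        else:
            key = key << 1
            lon_hi = mid
        # One latitude bit, same rule.
        mid = (lat_lo + lat_hi) / 2
        if lat >= mid:
            key = (key << 1) | 1
            lat_lo = mid
        else:
            key = key << 1
            lat_hi = mid
    return key


# Two points a few hundred metres apart, on either side of (0, 0).
north_east = interleaved_hash(0.001, 0.001)
south_west = interleaved_hash(-0.001, -0.001)
print(f"{north_east:032b}")
print(f"{south_west:032b}")
```

The two keys differ in their very first bits, so a B-tree ordered on this key stores the two points far apart even though they are close together on the ground.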
  5. MULTI-LOCATION CLIPPING
    Mongo Multi-location Document Clipping Issues
    ($within search doesn't always work w/ multi-location)
    [Figure: four cases (Case 1 to Case 4) of a search polygon against a multi-location document (aka polygon); two cases are labeled "Success!" and two "Fail!"]

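
One plausible way a point-based index can miss a polygon is sketched below: if only the vertices are indexed, a search box that sits entirely inside a large polygon contains none of them. This is an assumed failure mode for illustration only; the slide's four cases cannot be recovered from the transcript, and the helper names are hypothetical, not MongoDB API calls.

```python
def point_in_box(pt, box):
    """True if pt = (x, y) lies inside box = (min_x, min_y, max_x, max_y)."""
    x, y = pt
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def any_vertex_within(vertices, box):
    """Point-based match: succeeds only if some indexed vertex falls in the box."""
    return any(point_in_box(v, box) for v in vertices)


def mbr(vertices):
    """Minimum bounding rectangle of a list of (x, y) vertices."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A large square "polygon" stored as four points, and a small search box inside it.
polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]
search_box = (4, 4, 6, 6)

print(any_vertex_within(polygon, search_box))      # vertex-only test misses the polygon
print(boxes_intersect(mbr(polygon), search_box))   # bounding-box test finds it
```

A bounding-box test catches the overlap that the vertex-only test misses, which is the motivation for indexing geometries by their bounding rectangles.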
  6. CUSTOM TUNED SPATIAL INDEXER
    Thermopylae Custom Tuned MongoDB for Geo
    TST leverages Guttman's 1984 research in R/R* Trees
    • R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
    • Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))
    • Splits and merges optimized by minimizing overlaps
    • The leaves point to the actual objects (stored on disk probably)
    • Height balanced – search is always O(log n)

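
The node properties above can be sketched in a few lines. This is a minimal 2D illustration, assuming rectangles are `(min_x, min_y, max_x, max_y)` tuples; the `Node` class and its fields are hypothetical and are not taken from the Thermopylae indexer.

```python
import math
from dataclasses import dataclass, field


def union(a, b):
    """Smallest rectangle (min_x, min_y, max_x, max_y) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


@dataclass
class Node:
    max_entries: int                      # "m" on the slide
    is_leaf: bool = True
    children: list = field(default_factory=list)  # rectangles in a leaf, Nodes otherwise

    @property
    def min_entries(self):
        """Minimum fill, ceil(m/2), as stated on the slide."""
        return math.ceil(self.max_entries / 2)

    def mbr(self):
        """Minimum bounding rectangle of everything stored below this node."""
        rects = self.children if self.is_leaf else [c.mbr() for c in self.children]
        box = rects[0]
        for r in rects[1:]:
            box = union(box, r)
        return box


leaf1 = Node(max_entries=4, children=[(0, 0, 1, 1), (2, 2, 3, 4)])
leaf2 = Node(max_entries=4, children=[(5, 5, 6, 6), (4, 5, 5, 6)])
root = Node(max_entries=4, is_leaf=False, children=[leaf1, leaf2])
print(root.mbr())
```

Because each node's rectangle is derived from its children, a search can discard a whole subtree as soon as the query misses that one rectangle.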
  7. R*-TREE INDEX OBJECTIVES
    R*-Tree Spatial Index Example
    • Sample insertion result for 4th order tree
    • Objectives:
      1. Minimize area
      2. Minimize overlaps
      3. Minimize margins
      4. Maximize inner node utilization
    [Figure: example tree with rectangles labeled a through p]

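
The four objectives correspond to simple measurable quantities. The sketch below computes them for 2D rectangles given as `(min_x, min_y, max_x, max_y)`; the function names are illustrative, and "margin" is taken here to mean the rectangle's perimeter, which is an assumption about the slide's usage.

```python
def area(r):
    """Area covered by rectangle r."""
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r):
    """Perimeter of rectangle r (assumed meaning of 'margin')."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlap(a, b):
    """Area shared by rectangles a and b, or 0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def utilization(entry_count, max_entries):
    """Fraction of a node's capacity that is in use."""
    return entry_count / max_entries


print(area((0, 0, 2, 3)), margin((0, 0, 2, 3)))
print(overlap((0, 0, 2, 2), (1, 1, 3, 3)))
print(utilization(3, 4))
```

A split or insertion heuristic can then compare candidate groupings by these numbers, preferring smaller area, overlap, and margin and higher utilization.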
  8. R*-TREE INSERT EXAMPLE
    Insert
    • Similar to insertion into B+-tree but may insert into any leaf; leaf splits in case capacity exceeded.
      – Which leaf to insert into?
      – How to split a node?

  9. Insert: Leaf Selection
    • Follow a path from root to leaf.
    • At each node move into subtree whose MBR area increases least with addition of new rectangle.
    [Figure: nodes m, n, o, p]

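
The leaf-selection rule can be sketched as follows: at each node, pick the child whose MBR would grow least if the new rectangle were added. Rectangles are assumed to be `(min_x, min_y, max_x, max_y)` tuples, `choose_subtree` is a hypothetical helper name, and breaking ties by smaller area is an added assumption that the slide does not state.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr, rect):
    """Extra area needed for mbr to also cover rect."""
    return area(union(mbr, rect)) - area(mbr)


def choose_subtree(child_mbrs, rect):
    """Index of the child whose MBR area increases least (ties: smaller area)."""
    return min(
        range(len(child_mbrs)),
        key=lambda i: (enlargement(child_mbrs[i], rect), area(child_mbrs[i])),
    )


children = [(0, 0, 4, 4), (10, 10, 14, 14)]
print(choose_subtree(children, (3, 3, 5, 5)))      # grows the first child least
print(choose_subtree(children, (11, 11, 12, 12)))  # already inside the second child
```

Repeating this choice from the root down to a leaf gives the path that the insert follows.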
  10. Query
    • Start at root
    • Find all overlapping MBRs
    • Search subtrees recursively
    [Figure: example tree with rectangles labeled a through p]

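
The three query steps can be sketched as a short recursive function. The `Entry` and `Node` classes and the tiny tree below are made up for illustration; the labels echo the figure's naming, but the geometry is invented and is not the slide's example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    mbr: tuple   # (min_x, min_y, max_x, max_y)
    name: str


@dataclass
class Node:
    mbr: tuple
    children: list = field(default_factory=list)  # Nodes, or Entries at leaf level


def intersects(a, b):
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, query):
    """Collect names of all entries whose MBR overlaps the query rectangle."""
    hits = []
    for child in node.children:
        if not intersects(child.mbr, query):
            continue                      # prune: nothing below can match
        if isinstance(child, Entry):
            hits.append(child.name)
        else:
            hits.extend(search(child, query))
    return hits


m = Node((0, 0, 5, 5), [Entry((0, 0, 2, 2), "a"), Entry((3, 3, 5, 5), "b")])
n = Node((10, 10, 15, 15), [Entry((10, 10, 12, 12), "c"), Entry((13, 13, 15, 15), "d")])
root = Node((0, 0, 15, 15), [m, n])

print(search(root, (1, 1, 4, 4)))
```

Subtrees whose MBR misses the query are skipped entirely, which is where the index saves work over a full scan.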
  11. Query
    • Search m.
    [Figure: example tree with rectangles labeled a through p]

  12. GEO-SHARDING
    Geo-Sharding – (in work)
    Scalable Distributed R* Tree (SD-r*Tree)
    Balanced binary tree, distributed on a set of servers:
    • Each internal node has exactly two children
    • Each leaf node stores a subset of the indexed dataset
    • At each node, the heights of the subtrees differ by at most one
    • Each server stores one data node and one "routing" node

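
The structural rules listed for the SD-r*Tree can be checked with a small sketch. It models only the tree shape (leaves as data nodes, internal nodes as routing nodes) and leaves out spatial coverage and server placement; `TreeNode`, the node names, and `satisfies_invariants` are hypothetical and do not come from the in-work implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeNode:
    name: str
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

    @property
    def is_data_node(self):
        """Leaves hold data; internal nodes only route."""
        return self.left is None and self.right is None


def height(node):
    """Height of a subtree; an empty subtree counts as -1, a leaf as 0."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def satisfies_invariants(node):
    """Check: internal nodes have exactly two children, subtree heights differ by <= 1."""
    if node.is_data_node:
        return True
    if node.left is None or node.right is None:
        return False
    if abs(height(node.left) - height(node.right)) > 1:
        return False
    return satisfies_invariants(node.left) and satisfies_invariants(node.right)


balanced = TreeNode("r1", TreeNode("d0"), TreeNode("r2", TreeNode("d1"), TreeNode("d2")))
lopsided = TreeNode("r1", TreeNode("d0"),
                    TreeNode("r2", TreeNode("d1"),
                             TreeNode("r3", TreeNode("d2"), TreeNode("d3"))))
print(satisfies_invariants(balanced), satisfies_invariants(lopsided))
```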
  13. SD-r*Tree DATA STRUCTURE
    SD-r*Tree Data Structure Illustration
    • di = Data Node (Chunk)
    • ri = Coverage Node
    Leveraged work from Litwin, Mouza, Rigaux 2007
    [Figure: data nodes d0, d1, d2 and coverage nodes r1, r2, shown with their spatial coverage (regions a through e)]

  14. SD-r*TREE STRUCTURE DISTRIBUTION
    SD-r*Tree Structure Distribution
    [Figure: nodes d0, d1, d2, r1, r2 distributed across GeoShard 1, GeoShard 2, and GeoShard 3, behind mongos]

  15. BEYOND 4-DIMENSIONS
    Next Steps: Beyond 4-Dimensions - X-Tree (Berchtold, Keim, Kriegel – 1996)
    • Avoid MBR overlaps
    • Avoid node splits (main cause for high overlap)
    • Introduce new node structure: Supernodes – Large Directory nodes of variable size
    [Figure legend: Normal Internal Nodes, Supernodes, Data Nodes]

  16. CONCLUSION
    T-Sciences Custom Tuned Spatial Indexer
    • Optimized Spatial Search – Finds intersecting MBR and recurses into those nodes
    • Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to guide search
      – 28% reduction in number of nodes touched
    • Optimized Deletes – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full
    • Low maintenance – Leverages MongoDB's automatic data compaction and partitioning

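
The "Hilbert Value of MBR centroid" can be computed as sketched below, using the standard iterative mapping from a grid cell to its distance along a Hilbert curve. The slides do not say how the indexer uses this value to guide inserts, so the sketch stops at computing it; the function names, the grid `order`, and the `world` extent are all assumptions.

```python
def hilbert_index(order, x, y):
    """Distance along the Hilbert curve of cell (x, y) on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def centroid_hilbert_value(mbr, world, order=16):
    """Hilbert value of the centroid of mbr, both given as (min_x, min_y, max_x, max_y)."""
    n = 1 << order
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    # Scale the centroid into grid coordinates, clamping the upper edge into range.
    gx = min(n - 1, int((cx - world[0]) / (world[2] - world[0]) * n))
    gy = min(n - 1, int((cy - world[1]) / (world[3] - world[1]) * n))
    return hilbert_index(order, gx, gy)


world = (0, 0, 4, 4)
print(centroid_hilbert_value((0, 0, 1, 1), world, order=2))
print(centroid_hilbert_value((3, 0, 4, 1), world, order=2))
```

Unlike the bit-interleaved key, consecutive Hilbert values always belong to adjacent grid cells, so rectangles with nearby centroids tend to receive nearby values.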
  17. EXAMPLE
    Example Use Case – OSINT (Foursquare Data)
    • Sample Foursquare data set mashed with Government Intel Data
    • 1 million Geo Document test (points and polys)
    • 4 server replica set
    • ~350ms query response
    • ~300% improvement over PostGIS

  18. GOVERNMENT CUSTOMERS
    Key Customers - Government
    • US Dept of State Bureau of Diplomatic Security
      – Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.
    • US Army Intelligence Security Command
      – Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.
    • US Southern Command
      – Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
      – Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)

  19. RDBMS WEAKNESSES
    • Writes are accomplished using in-place update on disk (crazy disk swapping rate)
    • Table joins, updates, and large queries quickly outgrow disk cache, requiring many random disk seeks (performance bottleneck!!)
    • Strict consistency requirements impact scalability (e.g. Postgres uses Multiversion Consistency, commonly resulting in stale data)
    • As data centers grow, the probability of node failure (due to Disk Writes, Consistency, and Atomic operations) increases