MongoDB NYC 2012: 80 Million Voters and Counting

mongodb
May 29, 2012

MongoDB NYC 2012: 80 Million Voters and Counting, Daniel Weitzenfeld, 5ivePoints. At 5ivePoints, we have a MongoDB database with 85+ million US voters, every one of them geocoded. Our application allows supporters of political campaigns across the country to pull up nearby voters on their mobile phones to canvass. Moreover, campaigns can create custom filters using any combination of voter attributes, and we provide data-driven filters that update in near real time based on campaign results. But with a collection of this size, each index is on the order of gigabytes, and unindexed queries aren't feasible. I'll discuss how we support flexible user queries, integrating both geospatial and other attributes, with only 4 indexes, minimizing page faults caused by pulling rarely used indexes into memory.

Transcript

  1. •  On-demand door-to-door canvassing (hence MongoDB)
    •  Voter lookup by name
    •  Call lists
    •  Social web integration
  2. The Voter-Retrieval Challenge
    •  Retrieve nearby voters matching a complex query.
    •  Query can be generated by user, or by app
    [Figure labels: $$, Voter, Restaurant, $$]
  3. Constraint
    •  Minimize indexes
    •  With 130+ million records, each index is on the order of 3~5 GB
    •  MongoDB needs indexes in memory for best results
  4. Our Two-Part Solution
    1.  Client-side filtering with decision trees
    2.  Store all those districts as polygon objects
    Result:
    –  Only 4 indexes on voter collection
    –  Indexes fit in 17 GB RAM
  5. Demonstration
    [Figure: score labels ranging from 25 to 100]
    1. db.V.find({g:{$near:[x,y]}}).limit(50)
    2. Run voters through appropriate decision tree
    3. Return only high performers, with “Query-Matching” score
    Regardless of tree/filter complexity: Index: geospatial 1
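The three steps above can be sketched in plain JavaScript. This is a minimal illustration, not the speaker's implementation: only the `$near` query shape comes from the slide, while `nearQuery`, `filterCandidates`, `toyScore`, the cutoff of 50, and the sample voters are hypothetical names and values chosen for the example.

```javascript
// Step 1 (runs in MongoDB): fetch the 50 voters nearest to [x, y].
// In the mongo shell: db.V.find(nearQuery(x, y)).limit(50)
function nearQuery(x, y) {
  return { g: { $near: [x, y] } };
}

// Steps 2 and 3 (run in the application): score each candidate and keep
// only those at or above a cutoff. `scoreVoter` stands in for a decision
// tree; `minScore` is a hypothetical cutoff, not a value from the talk.
function filterCandidates(voters, scoreVoter, minScore) {
  return voters
    .map((voter) => ({ voter, score: scoreVoter(voter) }))
    .filter((result) => result.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// Made-up voters standing in for the result of step 1.
const candidates = [
  { name: "A", age: 30 },
  { name: "B", age: 62 },
  { name: "C", age: 47 },
];
const toyScore = (voter) => (voter.age >= 45 ? 95 : 25);
const matches = filterCandidates(candidates, toyScore, 50);
console.log(matches.map((m) => m.voter.name));
```

Because the attribute logic lives entirely in `scoreVoter`, the database query stays the same no matter how complicated the filter becomes, which is why only the geospatial index is involved.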
  6. Why Decision Trees?
    •  Easy to implement, using recursion
    •  Easy fuzzy matching
    •  Easy handling of missing values, categorical and continuous data
    •  Non-linearity
    •  Easily stored in Mongo, and translated to/from PHP array
    •  Wide usage in machine learning means ML output can plug right in
  7. Why Decision Trees?
    •  Easy to run an object through a tree using recursion
    [Figure: tree with leaf scores 100, 95, 75, 65, 50, 25 and two branches marked “subtree”]
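A recursive walk of this kind might look like the sketch below. The node layout (`attr`, `threshold`, `below`, `atOrAbove`, `score`) and the attributes `age` and `electionsVoted` are assumptions made for illustration; the slides do not show the actual schema.

```javascript
// A node is either a leaf ({ score }) or a split on one numeric attribute:
// { attr, threshold, below, atOrAbove }. These field names are assumptions.
function runTree(node, voter) {
  if ("score" in node) {
    return node.score; // leaf: recursion ends here
  }
  const branch = voter[node.attr] >= node.threshold ? node.atOrAbove : node.below;
  return runTree(branch, voter); // descend into the chosen subtree
}

// A made-up two-level tree.
const tree = {
  attr: "age",
  threshold: 45,
  below: { score: 25 },
  atOrAbove: {
    attr: "electionsVoted",
    threshold: 3,
    below: { score: 65 },
    atOrAbove: { score: 100 },
  },
};

console.log(runTree(tree, { age: 50, electionsVoted: 4 })); // 100
console.log(runTree(tree, { age: 30, electionsVoted: 4 })); // 25
```

Since the tree is just a nested object, it can be stored as a single document and read back without any translation layer.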
  8. Why Decision Trees?
    •  Easy handling of missing values, categorical and continuous data
    [Figure: tree with leaf scores 100, 95, 65, 50, 25 and a split with branches “>=45”, “<45”, and “missing”]
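One way to handle all three cases is to give every split an explicit branch for missing data, as in the sketch below. The field names (`missing`, `values`, `threshold`) and the sample tree are assumptions; the slide only shows that a split can have `>=45`, `<45`, and `missing` branches.

```javascript
// Pick the child for a voter at one split. Field names are assumptions:
//   node.missing    child used when the attribute is absent
//   node.values     categorical split: map from attribute value to child
//   node.threshold  continuous split: `below` / `atOrAbove` children
function chooseBranch(node, voter) {
  const value = voter[node.attr];
  if (value === undefined || value === null) {
    return node.missing;
  }
  if (node.values) {
    // Categories not listed in the split fall back to the missing branch.
    return Object.prototype.hasOwnProperty.call(node.values, value)
      ? node.values[value]
      : node.missing;
  }
  return value >= node.threshold ? node.atOrAbove : node.below;
}

function scoreWithMissing(node, voter) {
  if ("score" in node) {
    return node.score;
  }
  return scoreWithMissing(chooseBranch(node, voter), voter);
}

// Made-up tree: split on age, and fall back to party when age is unknown.
const ageSplit = {
  attr: "age",
  threshold: 45,
  atOrAbove: { score: 95 },
  below: { score: 25 },
  missing: {
    attr: "party",
    values: { D: { score: 65 }, R: { score: 50 } },
    missing: { score: 50 },
  },
};

console.log(scoreWithMissing(ageSplit, { age: 60 }));     // 95
console.log(scoreWithMissing(ageSplit, { party: "D" }));  // 65
```

Sending unlisted categories to the `missing` branch is a design choice of this sketch; it keeps every voter scorable even when the data is incomplete.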
  9. Why Decision Trees?
    •  Wide usage in machine learning means ML output can plug right in
    –  Single Tree
    –  Multiple Trees:
      •  Random Forests
      •  Gradient Boosted Trees
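To show how multiple trees could plug into the same scoring step, the sketch below combines per-tree scores in the two standard ways: averaging, as random forests do, and a scaled sum on top of a base score, as gradient boosted trees do. Each tree is assumed to be wrapped as a function from voter to score; `forestScore`, `boostedScore`, and the toy trees are hypothetical.

```javascript
// Each tree is represented as a function voter => score (an assumption
// made to keep this example independent of any tree storage format).

// Random-forest style: average the trees' scores.
function forestScore(trees, voter) {
  const total = trees.reduce((sum, tree) => sum + tree(voter), 0);
  return total / trees.length;
}

// Gradient-boosting style: start from a base score and add each tree's
// output, scaled by a learning rate.
function boostedScore(trees, voter, baseScore, learningRate) {
  return trees.reduce((sum, tree) => sum + learningRate * tree(voter), baseScore);
}

// Made-up trees, each reduced to a single split.
const toyTrees = [
  (voter) => (voter.age >= 45 ? 90 : 30),
  (voter) => (voter.electionsVoted >= 3 ? 80 : 40),
];

console.log(forestScore(toyTrees, { age: 50, electionsVoted: 1 })); // 65
```

Either combiner returns a single number, so it can replace a single tree's score without changing the surrounding filtering code.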
  10. Index on all districts?
    I’m running for Dog Catcher in ____________. What voters do you have in my district?
    Many levels of district:
    •  Town
    •  School District
    •  County
    •  State Legislature, lower house
    •  State Legislature, upper house
    •  Congressional
  11. Or index on none?
    1. db.DISTRICTS.findOne({'name':'Hicksville'})
       Hicksville polygon from Census shapefiles
    2. db.VOTERS.find({'g':{$within:<Hicksville polygon>}})
    Index: geospatial 1
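The two-step lookup above can be sketched as a small query builder. The slide does not show how the polygon is stored, so the `polygon` field, its coordinates, and the helper `votersInDistrict` are assumptions; `$within` with `$polygon` is the operator form for MongoDB's legacy 2d indexes, and later MongoDB versions name the operator `$geoWithin`.

```javascript
// Step 1 result: a district document such as
// db.DISTRICTS.findOne({'name':'Hicksville'}) might return.
// The `polygon` field name and the coordinates are made up.
const district = {
  name: "Hicksville",
  polygon: [
    [-73.55, 40.75],
    [-73.50, 40.75],
    [-73.50, 40.79],
    [-73.55, 40.79],
  ],
};

// Step 2: build the filter for db.VOTERS.find(...). Only the geospatial
// index on `g` is needed; no per-district index on the voter collection.
function votersInDistrict(d) {
  return { g: { $within: { $polygon: d.polygon } } };
}

console.log(JSON.stringify(votersInDistrict(district)));
```

Because district membership is computed from geometry at query time, adding another level of district means adding documents to the districts collection, not another index on the voters.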