MongoDB NYC 2012: 80 Million Voters and Counting

130 Million Voters and Counting

Daniel Weitzenfeld, Director of Analytics @DanielWDanielW @5ivepoints

What 5ivepoints does

•  On-demand door-to-door canvassing (hence MongoDB)
•  Voter lookup by name
•  Call lists
•  Social web integration

MongoDB stack

Use clever application/schema design to minimize necessary indexes.

The Voter-Retrieval Challenge
•  Retrieve nearby voters matching a complex query.
•  Query can be generated by the user or by the app.
[Figure labels: Voter, Restaurant, $$]

Constraint
•  Minimize indexes
•  With 130+ million records, each index is on the order of 3 to 5 GB
•  MongoDB needs indexes in memory for best results

Without constraint…
Candidate indexes: party, background, gender, religion, income, issues + geospatial, name, ID numbers… districts

Our Two-Part Solution
1.  Client-side filtering with decision trees
2.  Store all those districts as polygon objects
Result:
–  Only 4 indexes on voter collection
–  Indexes fit in 17 GB RAM
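As a rough check on these numbers, the quoted 3 to 5 GB per index can be multiplied out. This is only a sketch: it assumes every index costs the same amount, and the unconstrained count (one index per field and district level named on the slides) is a hypothetical tally, not a figure from the talk.

```javascript
// Rough index-memory arithmetic using the per-index range quoted on the slide.
// Assumption: every index costs the same 3 to 5 GB.
const GB_PER_INDEX = { low: 3, high: 5 };

function indexBudget(indexCount) {
  return {
    low: indexCount * GB_PER_INDEX.low,
    high: indexCount * GB_PER_INDEX.high,
  };
}

// Four indexes, as in the stated result.
const actualBudget = indexBudget(4);

// Hypothetical unconstrained case: 6 demographic fields, 3 lookup fields
// (geospatial, name, ID numbers) and 6 district levels, one index each.
const unconstrainedBudget = indexBudget(6 + 3 + 6);

console.log(actualBudget, unconstrainedBudget);
```

Under these assumptions four indexes need 12 to 20 GB, which is consistent with the 17 GB quoted, while the hypothetical unconstrained tally would need 45 to 75 GB.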

Part 1: Client-side filtering with decision trees

What’s a decision tree?
[Figure: example decision tree with leaf scores 100, 95, 75, 65, 50, 25]

Demonstration
[Figure: example decision tree and per-voter scores]
1. db.V.find({g:{$near:[x,y]}}).limit(50)
2. Run voters through appropriate decision tree
3. Return only high performers, with “Query-Matching” score
Regardless of tree/filter complexity:
Indexes used: geospatial (1)
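The three steps above can be sketched in plain JavaScript. The tree format (`field`, `equals`, `atLeast`, `yes`, `no`, `score`), the function names, and the sample voters are all hypothetical; the talk does not show its actual schema. The `nearby` array stands in for the documents returned by the `$near` query in step 1.

```javascript
// Hypothetical tree format: internal nodes test one field, leaves carry a score.
const demoTree = {
  field: 'party', equals: 'D',
  yes: { field: 'age', atLeast: 45, yes: { score: 100 }, no: { score: 75 } },
  no: { score: 25 },
};

// Walk the tree recursively until a leaf is reached.
function scoreVoter(node, voter) {
  if ('score' in node) return node.score;
  const value = voter[node.field];
  const pass = 'equals' in node ? value === node.equals : value >= node.atLeast;
  return scoreVoter(pass ? node.yes : node.no, voter);
}

// Steps 2 and 3: score each candidate, keep the high performers, best first.
function filterCandidates(candidates, tree, minScore) {
  return candidates
    .map((voter) => ({ voter, score: scoreVoter(tree, voter) }))
    .filter((result) => result.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// Stand-in for the result of step 1: db.V.find({g:{$near:[x,y]}}).limit(50)
const nearby = [
  { name: 'A', party: 'D', age: 52 },
  { name: 'B', party: 'R', age: 60 },
  { name: 'C', party: 'D', age: 30 },
];

const matches = filterCandidates(nearby, demoTree, 75);
console.log(matches.map((m) => `${m.voter.name}:${m.score}`));
```

Because the filtering happens in application code, a more complex tree changes only the work done on the 50 fetched documents, not the indexes the database needs.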

Why Decision Trees?
•  Easy to implement, using recursion
•  Easy fuzzy matching
•  Easy handling of missing values, categorical and continuous data
•  Non-linearity
•  Easily stored in Mongo, and translated to/from PHP array
•  Wide usage in machine learning means ML output can plug right in

Why Decision Trees?
•  Easy to run an object through a tree using recursion
[Figure: decision tree with leaf scores 100, 95, 75, 65, 50, 25; each branch is itself a subtree]

Why Decision Trees?
•  Easy fuzzy matching
[Figure: tree with leaf scores 100, 95, 25 and branches labeled $$$, $$, $$$]

Why Decision Trees?
•  Easy handling of missing values, categorical and continuous data
[Figure: tree with leaf scores 100, 95, 65, 50, 25 and branches labeled "missing", ">=45", "<45"]
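One way to handle all three cases is to give every split an explicit `missing` branch next to its categorical or threshold branches. The node shapes and field names below are assumptions for illustration, as is the choice to send unknown categories down the `missing` branch; only the `>=45` / `<45` / `missing` labels come from the slide.

```javascript
// Hypothetical node shapes (not from the talk):
//   { score }                                    leaf
//   { field, threshold, gte, lt, missing }       continuous split
//   { field, branches: { value: node }, missing } categorical split
const mixedTree = {
  field: 'party',
  branches: {
    D: {
      field: 'age', threshold: 45,
      gte: { score: 100 }, lt: { score: 65 }, missing: { score: 95 },
    },
    I: { score: 50 },
  },
  missing: { score: 25 },
};

function scoreWithMissing(node, voter) {
  if ('score' in node) return node.score;
  const value = voter[node.field];
  // Missing value: follow the dedicated branch.
  if (value === undefined || value === null) {
    return scoreWithMissing(node.missing, voter);
  }
  // Continuous split: compare against the threshold.
  if ('threshold' in node) {
    return scoreWithMissing(value >= node.threshold ? node.gte : node.lt, voter);
  }
  // Categorical split: unknown categories are treated like missing values.
  const known = Object.prototype.hasOwnProperty.call(node.branches, value);
  return scoreWithMissing(known ? node.branches[value] : node.missing, voter);
}

console.log(scoreWithMissing(mixedTree, { party: 'D' }));
```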

Why Decision Trees?
•  Non-linearity
[Figure: plot of Score (0 to 100) against Voter Income]

Why Decision Trees?
•  Easily stored in Mongo, and translated to/from PHP array

Why Decision Trees?
•  Wide usage in machine learning means ML output can plug right in
   –  Single Tree
   –  Multiple Trees:
      •  Random Forests
      •  Gradient Boosted Trees
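To illustrate how multiple trees could plug in, the sketch below combines per-tree outputs in the two usual ways: a random forest (for numeric scores) averages its trees, while gradient boosting adds scaled per-tree corrections to a base value. The tree format, function names, and numbers are hypothetical and not taken from the talk.

```javascript
// Hypothetical threshold-only tree format: { field, threshold, gte, lt } or { score }.
function walk(node, voter) {
  if ('score' in node) return node.score;
  return walk(voter[node.field] >= node.threshold ? node.gte : node.lt, voter);
}

// Random-forest style: the final score is the mean of the per-tree scores.
function forestScore(trees, voter) {
  const total = trees.reduce((sum, tree) => sum + walk(tree, voter), 0);
  return total / trees.length;
}

// Gradient-boosting style: a base value plus scaled per-tree corrections.
function boostedScore(trees, voter, base, learningRate) {
  return trees.reduce((sum, tree) => sum + learningRate * walk(tree, voter), base);
}

const forest = [
  { field: 'age', threshold: 45, gte: { score: 80 }, lt: { score: 40 } },
  { field: 'income', threshold: 50000, gte: { score: 60 }, lt: { score: 20 } },
];
const corrections = [
  { field: 'age', threshold: 45, gte: { score: 10 }, lt: { score: -10 } },
  { field: 'income', threshold: 50000, gte: { score: 4 }, lt: { score: -4 } },
];

const sampleVoter = { age: 50, income: 30000 };
console.log(forestScore(forest, sampleVoter));
console.log(boostedScore(corrections, sampleVoter, 50, 0.5));
```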

Part 2: Store all those districts as polygon objects

Index on all districts?
“I’m running for Dog Catcher in ____________. What voters do you have in my district?”
Many levels of district:
•  Town
•  School District
•  County
•  State Legislature, lower house
•  State Legislature, upper house
•  Congressional

Or index on none?
1. db.DISTRICTS.findOne({'name':'Hicksville'})
   Hicksville polygon from Census shapefiles
2. db.VOTERS.find({'g':{$within:<Hicksville polygon>}})
Indexes used: geospatial (1)
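The two steps can be mimicked in memory to show what the polygon lookup does. This is an illustration only: in MongoDB the containment test runs server-side against the geospatial index, and the ray-casting routine here is a standard textbook method, not MongoDB's implementation. The `polygon` field name and the toy coordinates are assumptions.

```javascript
// Ray-casting point-in-polygon test; polygon is an array of [x, y] vertices.
function pointInPolygon([px, py], polygon) {
  let inside = false;
  for (let i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    const [xi, yi] = polygon[i];
    const [xj, yj] = polygon[j];
    // Does a horizontal ray from the point cross the edge (j -> i)?
    const crosses = (yi > py) !== (yj > py) &&
      px < ((xj - xi) * (py - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// In-memory stand-in for the two queries on the slide.
function votersInDistrict(districts, voters, name) {
  const district = districts.find((d) => d.name === name); // like findOne
  if (!district) return [];
  return voters.filter((v) => pointInPolygon(v.g, district.polygon)); // like $within
}

// Toy data: a square "district" and three voter locations.
const districts = [
  { name: 'Hicksville', polygon: [[0, 0], [4, 0], [4, 4], [0, 4]] },
];
const voters = [
  { id: 1, g: [2, 2] },
  { id: 2, g: [5, 2] },
  { id: 3, g: [1, 3] },
];

const inHicksville = votersInDistrict(districts, voters, 'Hicksville');
console.log(inHicksville.map((v) => v.id));
```

The design point is that a new kind of district only adds a document to the districts collection; the voter collection and its single geospatial index stay unchanged.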

Bonus: Random Sampling
[Figure: Hicksville bounding box]
Much easier/faster/more random than:
db.VOTERS.find({'Town':'Hicksville'}).skip(random int)
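The slide does not spell out the sampling method. One plausible reading, sketched here as an assumption, is to pick a random point inside the district's bounding box and fetch the voters nearest to it with a `$near` query. The function names are hypothetical, and the random source is injectable so the example is repeatable.

```javascript
// Axis-aligned bounding box of a polygon given as [x, y] vertices.
function boundingBox(polygon) {
  const xs = polygon.map(([x]) => x);
  const ys = polygon.map(([, y]) => y);
  return {
    minX: Math.min(...xs), maxX: Math.max(...xs),
    minY: Math.min(...ys), maxY: Math.max(...ys),
  };
}

// Uniform random point inside the box; rng returns a number in [0, 1).
function randomPointIn(box, rng = Math.random) {
  return [
    box.minX + rng() * (box.maxX - box.minX),
    box.minY + rng() * (box.maxY - box.minY),
  ];
}

// Build the query document for voters nearest the random point.
function sampleQuery(polygon, rng = Math.random) {
  return { g: { $near: randomPointIn(boundingBox(polygon), rng) } };
}

const districtPolygon = [[0, 0], [4, 0], [4, 2], [0, 2]];
const fixedQuery = sampleQuery(districtPolygon, () => 0.5);
console.log(JSON.stringify(fixedQuery));
```

A point drawn from the bounding box can fall outside an irregular polygon, so a real implementation would likely redraw or combine the query with the polygon filter. By contrast, `skip(n)` makes the server walk past `n` documents before returning any, which gets slower as the offset grows.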

Use clever application/schema design to minimize necessary indexes.

