Slide 1

Slide 1 text

NAVIGATING THE CAMPAIGN CONTRIBUTION NETWORK Bobby Norton 1 Dev Day - Detroit, MI November 17, 2012 image: http://skyeome.net/wordpress/?p=102

Slide 2

Slide 2 text

Learning Systems Institute
Lockheed Martin Simulation & Training
ThoughtWorks
DRW Trading Group
Aurelius
Education Startup (Coming soon...)

Slide 3

Slide 3 text

If string theory fails to provide a testable prediction, then nobody should believe it. S. James Gates, Jr. - University of Maryland

Slide 4

Slide 4 text

In string theory, the Planck length is the order of magnitude of the oscillating strings that form elementary particles, and shorter lengths do not make physical sense. http://en.wikipedia.org/wiki/Planck_length

Slide 5

Slide 5 text

Charge radius of a proton: 0.8768 × 10^-15 m

Slide 6

Slide 6 text

A difference of 10^20 m

Slide 7

Slide 7 text

10^20 m: Approximate width of the Milky Way galaxy

Slide 8

Slide 8 text

http://ut-images.s3.amazonaws.com/wp-content/uploads/2010/11/milkyway.jpg

Slide 9

Slide 9 text

It’s difficult to make scientific claims about something that can’t be measured.

Slide 10

Slide 10 text

GOAL: Testable predictions about complex systems

Slide 11

Slide 11 text

CLAIM: Graph databases provide us with an efficient means to create predictive models of complex systems.

Slide 12

Slide 12 text

COMPLEX SYSTEMS
• Cascading failures
• Unclear boundaries
• May be capable of adaptation
• Nonlinear (exhibit a “Butterfly Effect”)
• May be nested, a system of systems
• Often exhibit small world and scale-free topologies
http://en.wikipedia.org/wiki/Complex_system#Features_of_complex_systems

Slide 13

Slide 13 text

Complexity can emerge from simple components

Slide 14

Slide 14 text

http://en.wikipedia.org/wiki/File:KochFlake.svg

Slide 15

Slide 15 text

http://flowingdata.com/2008/03/12/17-ways-to-visualize-the-twitter-universe/

Slide 16

Slide 16 text

Skitter data depicting a macroscopic snapshot of Internet connectivity, with selected backbone ISPs. By K. C. Claffy. http://www.caida.org/publications/papers/bydate/index.xml

Slide 17

Slide 17 text

http://upload.wikimedia.org/wikipedia/commons/0/03/C.elegans-brain-network.jpg

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

WORK IN PROGRESS: Exploring the network of campaign contributions

Slide 20

Slide 20 text

http://www.fec.gov/finance/disclosure/ftpdet.shtml#2011_2012

Slide 21

Slide 21 text

C00514893|N|Q1|P|12951391265|15E|IND|KATZ, DAVID|SAN FRANCISCO|CA|94110|GROUPON, INC.|VP/ GM|03202012|1000|C00401224|C3694043A|776253||* EARMARKED CONTRIBUTION: SEE BELOW| 4051020121155797607
C00496778|A|Q1|P|12951584507|15|IND|KALATHIL, VINOD|CHICAGO|IL|60607|GROUPON/DIRECTOR OF INTERNAL AUDIT|DIRECTOR OF INTERNAL AUDIT|03152012|1000||C6914012|781193||| 4051520121155972880
C00420760|N|Q2|P|12020554070|15|IND|KIMET, CAROLYN|EVANSTON|IL|80201|GROUPON|E-MARKETING DIRECTOR|06052012|1000||SA0802094012139|802475|||2080220121160111657
C00494740|N|M8|P|12952683408|15|IND|GERSTER, DAVID|BURLINGAME|CA|94010|GROUPON, INC.| DIRECTOR OF ANALYTICS|07112012|2000||C17532152|805999|||4082820121161649967
C00431445|N|M9|P|12972391292|15|IND|KATZ, DAVID|SAN FRANCISCO|CA|94110|GROUPON, INC.|VP/ GM MOBILE|08312012|1000||C20387409|811365|||4100320121165864857
C00494740|N|M10|P|12960022928|15|IND|KLATT, KYLE|CHICAGO|IL|60657|GROUPON|SENIOR CAMPAIGN ORGANIZER|09092012|250||C21392360|821033|||4102520121168451880
C00501692|A|Q3|P|12950510503|15|IND|BAVDA, MRUGESH|CHICAGO|IL|60622|GROUPON|MARKET PLANNER|09272011|500||C7251377|765709|||4030120121152599889
C00401224|N|M4||12971009695|24T|IND|KATZ, DAVID|SAN FRANCISCO|CA|94110|GROUPON, INC.|VP/ GM|03202012|1000|C00514893|SA11AI_5025357|778552||EARMARKED FOR PEOPLE FOR DEREK KILMER (C00514893)|4060920121156872726
C00494930|N|Q2|G|12020551850|15|IND|KLAUMINZER, JAY|ROCKY RIVER|OH|44116|GROUPON| REGIONAL VICE PRESIDENT|06292012|1000||SA0803123312246|802781|||2080620121160328596
C00494740|A|Q2|P|11971580062|15|IND|LEFKOFSKY, ERIC|GLENCOE|IL|60022|GROUPON|OWNER| 04152011|35800||C11008231|748092|||4101820111143685060
C00494740|N|M9|P|12972228143|15|IND|BAKER, CAROLINE|CHICAGO|IL|60616|GROUPON|OPERATIONS| 08312012|250||C20245573|810872|||4092720121165000105
C00431445|N|M9|P|12972342658|15|IND|HUTMACHER, AMY|GRAND JUNCTION|CO|81506|GROUPON| PRODUCT MANAGER|08082012|250||C18876006|811365|||4100320121165718953
C00494740|N|M9|P|12972226974|15|IND|MASON, ANDREW DIVVENS|CHICAGO|IL|60612|GROUPON| CEO|08102012|35800||C19080692|810872|||4092720121164996598
C00431445|N|M9|P|12972331365|15|IND|RASMUSSEN, ERIC|MENLO PARK|CA|94025|GROUPON.COM| MARKETING|08312012|250||C20380155|811365|||4100320121165685074
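
Each record above is a single pipe-delimited line with 21 fields. As a rough sketch (not code from the talk), the Java below maps one line to named fields. The column names are an assumption based on the FEC's published layout for the individual contributions file and should be checked against the data dictionary for the file you download; the class name FecRecord is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FecRecord {
    // Assumed column names (from the FEC's description of the individual
    // contributions file); verify against the data dictionary before relying on them.
    static final String[] COLUMNS = {
        "CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM",
        "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE",
        "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT",
        "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"
    };

    static Map<String, String> parse(String line) {
        // A limit of -1 keeps trailing empty fields, which split() drops by default.
        String[] fields = line.split("\\|", -1);
        if (fields.length != COLUMNS.length) {
            throw new IllegalArgumentException(
                "Expected " + COLUMNS.length + " fields but found " + fields.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            // Trim stray whitespace around values such as " CEO".
            record.put(COLUMNS[i], fields[i].trim());
        }
        return record;
    }

    public static void main(String[] args) {
        String line = "C00494740|N|M10|P|12960022928|15|IND|KLATT, KYLE|CHICAGO|IL|60657|"
            + "GROUPON|SENIOR CAMPAIGN ORGANIZER|09092012|250||C21392360|821033|||"
            + "4102520121168451880";
        Map<String, String> record = parse(line);
        System.out.println(record.get("NAME") + " -> " + record.get("CMTE_ID")
            + ": " + record.get("TRANSACTION_AMT"));
    }
}
```

Rejecting lines with an unexpected field count is one simple way to flag the incomplete records that an ETL step has to skip.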

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

FEC NODES AND EDGES
• candidates --contribute_to--> committees
• committees --contribute_to--> committees
• donors --contribute_to--> committees
• donors --employed_by--> companies
• companies --contribute_to--> committees
• committees --contribute_to--> candidates

Slide 24

Slide 24 text

Property graph: “key/value-based, directed, multi-relational” http://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
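
To make the three adjectives in that definition concrete, here is a minimal in-memory sketch in Java. It is illustrative only and is not how Neo4j or TinkerPop store graphs; the class TinyPropertyGraph and the placeholder property values are hypothetical, while the edge labels follow the FEC model on the previous slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TinyPropertyGraph {
    static final class Node {
        // Key/value-based: every node carries an arbitrary property map.
        final Map<String, Object> properties = new HashMap<>();
    }

    static final class Edge {
        final Node source;   // Directed: an edge runs from source to target.
        final Node target;
        final String label;  // Multi-relational: the label names the relationship type.
        final Map<String, Object> properties = new HashMap<>();

        Edge(Node source, String label, Node target) {
            this.source = source;
            this.label = label;
            this.target = target;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    Node addNode(String key, Object value) {
        Node node = new Node();
        node.properties.put(key, value);
        return node;
    }

    Edge addEdge(Node source, String label, Node target) {
        Edge edge = new Edge(source, label, target);
        edges.add(edge);
        return edge;
    }

    // Linear scan for brevity; a graph database would index adjacency instead.
    List<Edge> outEdges(Node node, String label) {
        List<Edge> result = new ArrayList<>();
        for (Edge edge : edges) {
            if (edge.source == node && edge.label.equals(label)) {
                result.add(edge);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TinyPropertyGraph graph = new TinyPropertyGraph();
        Node donor = graph.addNode("name", "EXAMPLE DONOR");          // placeholder values
        Node company = graph.addNode("name", "EXAMPLE COMPANY");
        Node committee = graph.addNode("committee_id", "C00000000");
        graph.addEdge(donor, "employed_by", company);
        graph.addEdge(donor, "contribute_to", committee).properties.put("amount", 250);
        System.out.println(graph.outEdges(donor, "contribute_to").size());
    }
}
```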

Slide 25

Slide 25 text

http://philcrissman.com/images/posts/3618515225_6446a9876b.jpg

Slide 26

Slide 26 text

ETL Extract Transform Load

Slide 27

Slide 27 text

CREATING NODES

Slide 28

Slide 28 text

public int saveEntries(BatchInserter inserter, BatchIndex index) {
    int count = 0;
    try {
        for (String line : lines) {
            String[] fields = line.split("\\|");
            // Candidate address data is the most inconsistent in this file, so we skip that entirely.
            // There are also incomplete records, e.g. CAND_ID H2NJ02177, that we can skip since they
            // aren't referenced anywhere else.
            if (dirty(fields, 10)) continue;
            Map<String, Object> candidate = transform(fieldEnum, fields);
            long candidateId = inserter.createNode(candidate);
            index.add(candidateId, candidate);
            // TODO: Convert the COMMITTEE_ID property to a relationship with a committee
            count++;
        }
    } catch (Exception e) {
        throw new RuntimeException("Failed to write candidates:", e);
    }
    return count;
}
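
The slide calls dirty(fields, 10) and transform(fieldEnum, fields) without showing them. The sketch below is one guess at what such helpers could look like, not the talk's implementation: the dirty rule (too few fields or a blank id), the CandidateField enum, and the lower-cased property keys are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CandidateEtl {
    // Hypothetical subset of columns; the real candidate file has more.
    enum CandidateField { CAND_ID, NAME, PARTY }

    // Assumed rule: a record is dirty when it has fewer fields than expected
    // or when its first field (the id) is blank.
    static boolean dirty(String[] fields, int expectedLength) {
        return fields.length < expectedLength || fields[0].trim().isEmpty();
    }

    // Builds a property map keyed by the lower-cased enum name, using each
    // enum constant's ordinal as the column position. Blank values are omitted.
    static <E extends Enum<E>> Map<String, Object> transform(E[] fieldEnum, String[] fields) {
        Map<String, Object> properties = new HashMap<>();
        for (E field : fieldEnum) {
            int position = field.ordinal();
            if (position < fields.length && !fields[position].trim().isEmpty()) {
                properties.put(field.name().toLowerCase(Locale.ROOT), fields[position].trim());
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        // Made-up record for illustration only; -1 keeps trailing empty fields.
        String[] fields = "X0000000|EXAMPLE, CANDIDATE|".split("\\|", -1);
        if (!dirty(fields, CandidateField.values().length)) {
            System.out.println(transform(CandidateField.values(), fields));
        }
    }
}
```

Keying the map by enum name keeps column positions in one place, so a change in the file layout only touches the enum.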

Slide 29

Slide 29 text

CREATING EDGES

Slide 30

Slide 30 text

for (String line : lines) {
    String[] fields = line.split("\\|");
    if (dirty(fields, 10)) {
        dirty++;
        continue;
    }

Slide 31

Slide 31 text

Map<String, Object> props = transactionProperties(fields);
String sourceData = fields[Fields.SOURCE_COMMITTEE_ID.ordinal()];
String targetData = fields[Fields.TARGET_COMMITTEE_ID.ordinal()];
Long sourceId = committeeIndex.find("committee_id", sourceData);
Long targetId = committeeIndex.find("committee_id", targetData);
if (sourceId == null || targetId == null) {
    props.put("source", sourceData);
    props.put("target", targetData);
    int amount = Integer.parseInt(fields[Fields.AMOUNT.ordinal()].trim());
    dirtyMoney += amount;
    dirty++;
} else {
    inserter.createRelationship(sourceId, targetId, Relationships.CONTRIBUTED, props);
    count++;
}

Slide 32

Slide 32 text

INDEXING

Slide 33

Slide 33 text

public BatchIndex(BatchInserter inserter, String indexName, String indexProperty) {
    this.indexProperty = indexProperty;
    indexProvider = new LuceneBatchInserterIndexProvider(inserter);
    index = indexProvider.nodeIndex(indexName, MapUtil.stringMap("type", "exact"));
    index.setCacheCapacity(indexProperty, 100000);
}
http://lucene.apache.org

Slide 34

Slide 34 text

public void add(long nodeId, Map<String, Object> properties) {
    if (!properties.containsKey(indexProperty)) {
        throw new RuntimeException(
            String.format("Node %d is missing property %s", nodeId, indexProperty));
    }
    index.add(nodeId, properties);
}

Slide 35

Slide 35 text

public Long find(String key, String value) {
    IndexHits<Long> ids = index.get(key, value);
    if (!ids.hasNext()) {
        ids.close();
        return null;
    }
    Long nodeId = ids.next();
    ids.close();
    return nodeId;
}

Slide 36

Slide 36 text

NETWORK ANALYSIS Queries Algorithms

Slide 37

Slide 37 text

How much has the Obama campaign spent?
obama = g.V.filter {it.name == "OBAMA, BARACK"}.next()
x = []
obama.outE.amount.store(x) { it.toInteger() }
x.inject(0) { acc, val -> acc + val }
==>187447424*
*given the data I’ve loaded so far...
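
For readers less familiar with Gremlin, the query sums the amount property over one vertex's outgoing edges. A plain-Java sketch of the same aggregation follows; the Contribution class and the sample values are made up for illustration and are not FEC data. Amounts are kept as text, as in the source file, which is why the Gremlin version calls toInteger().

```java
import java.util.Arrays;
import java.util.List;

final class OutgoingTotal {
    static final class Contribution {
        final String source;
        final String target;
        final String amount;  // kept as text, as in the pipe-delimited file

        Contribution(String source, String target, String amount) {
            this.source = source;
            this.target = target;
            this.amount = amount;
        }
    }

    // Sums the amounts on every edge that leaves the given source.
    // A long accumulator avoids overflow on large totals.
    static long totalFrom(String source, List<Contribution> contributions) {
        long total = 0;
        for (Contribution contribution : contributions) {
            if (contribution.source.equals(source)) {
                total += Long.parseLong(contribution.amount.trim());
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Contribution> contributions = Arrays.asList(
            new Contribution("A", "B", "1000"),
            new Contribution("A", "C", " 250"),
            new Contribution("B", "C", "500"));
        System.out.println(totalFrom("A", contributions));
    }
}
```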

Slide 38

Slide 38 text

To which committee did the Obama campaign make the most contributions?
obama = g.V.filter {it.name == "OBAMA, BARACK"}.next()
x = [:]
obama.out.groupCount(x).iterate()
top = x.sort {a,b -> b.value <=> a.value}[0..9]
gremlin> top.keySet().toArray().first().name
==>WORKING AMERICA
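
The groupCount step tallies how often each neighbouring vertex is reached, and the sort then ranks those tallies in descending order. The Java below sketches the same two steps over a plain edge list; the class name, the edge representation as {source, target} pairs, and the sample names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopRecipients {
    // Counts outgoing edges per target for one source.
    // Each edge is a {source, target} pair of node names.
    static Map<String, Integer> countByTarget(String source, List<String[]> edges) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] edge : edges) {
            if (edge[0].equals(source)) {
                counts.merge(edge[1], 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to n entries, highest count first.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
            new String[] {"A", "X"},
            new String[] {"A", "Y"},
            new String[] {"A", "Y"},
            new String[] {"B", "X"});
        System.out.println(top(countByTarget("A", edges), 10).get(0).getKey());
    }
}
```

Capping the sublist at the number of entries keeps the ranking usable when a source has fewer than n distinct recipients.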

Slide 39

Slide 39 text

Who is going to win the election? (according to eigenvector centrality)
m = [:]; c = 0;
g.V.out.groupCount(m).loop(2){c++ < 1000}
top = m.sort{-it.value}[0..9]
top.keySet().toArray().first().name
==>OBAMA FOR AMERICA
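
The Gremlin traversal above approximates eigenvector centrality by repeatedly following outgoing edges and counting arrivals. A different but related technique is explicit power iteration, sketched below in Java under some assumptions: every node appears as a key in the adjacency map, scores flow along edge direction (a node is central when central nodes point to it), and the graph is strongly connected so the iteration settles. Adding each node's previous score at every step is equivalent to iterating on the adjacency matrix plus the identity, which leaves the dominant eigenvector unchanged and avoids oscillation on periodic graphs. The class and node names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EigenvectorCentrality {
    // outNeighbors maps each node to the nodes it points to.
    // Assumption: every node is present as a key, possibly with an empty list.
    static Map<String, Double> centrality(Map<String, List<String>> outNeighbors, int iterations) {
        Map<String, Double> score = new HashMap<>();
        for (String node : outNeighbors.keySet()) {
            score.put(node, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            // Start from the previous scores (the identity term), then add inflow.
            Map<String, Double> next = new HashMap<>(score);
            for (Map.Entry<String, List<String>> entry : outNeighbors.entrySet()) {
                double sourceScore = score.get(entry.getKey());
                for (String target : entry.getValue()) {
                    next.merge(target, sourceScore, Double::sum);
                }
            }
            // Normalize to unit length so the values stay bounded.
            double sumOfSquares = 0.0;
            for (double value : next.values()) {
                sumOfSquares += value * value;
            }
            double norm = Math.sqrt(sumOfSquares);
            for (Map.Entry<String, Double> entry : next.entrySet()) {
                entry.setValue(entry.getValue() / norm);
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Collections.singletonList("C"));
        graph.put("C", Collections.singletonList("A"));
        System.out.println(centrality(graph, 100));
    }
}
```

In the small example graph, C receives edges from both A and B, so it ends up with the highest score.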

Slide 40

Slide 40 text

This approach relies on mining a network hidden in relational data

Slide 41

Slide 41 text

Chapter 13: Crouching Table, Hidden Network

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

NEXT STEPS
• Open source release for hackers and data scientists (announcement coming soon via Twitter)
• Finish loading FEC data and continue network analysis
• Visualize the network
• Compare results to other implementations to validate the original claim

Slide 44

Slide 44 text

@bobbynorton linkedin.com/in/bobbynorton