PyData - June 6th - Introducing SixFifty

John Sandall on SixFifty's Story: Why Are UK Elections So Hard To Predict?

When Theresa May announced plans on April 18th for the UK to hold a general election, the news was met with widespread cynicism. As self-confessed psephologists (and huge fans of Nate Silver's FiveThirtyEight data blog), however, we were thrilled at the opportunity. SixFifty is a collaboration of data scientists, software engineers, data journalists and political operatives, brought together within hours of the snap general election being announced. Our goals:

• Understand why forecasting elections in the UK using open data is notoriously difficult, and to see how far good statistical practice and modern machine learning methods can take us.

• Make political and demographic data more open and accessible by showcasing and releasing cleaned versions of the datasets we're using.

• Improve statistical literacy by communicating our methodology at a non-technical level, especially around concepts fundamental to elections and polling.

About the speaker:

John Sandall spends his days working as a self-employed data science consultant and his nights working as a Data Science Instructor at General Assembly. Previously he was Lead Data Scientist at mobile ticketing startup YPlan, quantitative analyst at Apple, strategy/tech lead for non-profit startup STIR Education, co-founder of an ed-tech startup, and dabbled in genomics at Imperial College London.


John Sandall

June 27, 2017

Transcript

  1. 5.

    WHAT IS DATA SCIENCE? Sidenote: Why am I addicted to poll trackers? Habit Forming 101 1. Cue
  2. 6.

    WHAT IS DATA SCIENCE? Sidenote: Why am I addicted to poll trackers? Habit Forming 101 1. Cue 2. Routine
  3. 7.

    Sidenote: Why am I addicted to poll trackers? Habit Forming 101 1. Cue 2. Routine 3. Reward "Predictable feedback loops don’t create desire" – Nir Eyal
  4. 10.

    Uniform National Swing: A Case Study Welcome to Sheffield Hallam 2010 results for Sheffield Hallam • CON: 24% • LAB: 16% • LD: 53% MP: Nick Clegg.
  5. 11.

    Step 1. Compare national results with latest polling 2010 national results • CON: 36% • LAB: 29% • LD: 23% 2015 polling • CON: 33% • LAB: 33% • LD: 9% Uniform National Swing: A Case Study
  6. 12.

    Step 2. Calculate "uniform national swing" (i.e. uplift) 2010 -> 2015 UNS • CON: -8% • LAB: +13% • LD: -62% Uniform National Swing: A Case Study
  7. 13.

    Step 3. Apply UNS to each constituency 2015 forecast for Sheffield Hallam • CON: 24% less 8% = 22% • LAB: 16% add 13% = 18% • LD: 53% less 62% = 20% Uniform National Swing: A Case Study
  8. 14.

    Step 4. Forecast winner 2015 forecast for Sheffield Hallam • CON: 22% <- Conservative victory! • LAB: 18% • LD: 20% Uniform National Swing: A Case Study
  9. 15.

    Step 5. So who won? 2015 forecast for Sheffield Hallam • CON: 22% <- Conservative victory! • LAB: 18% • LD: 20% 2015 result for Sheffield Hallam • CON: 14% • LAB: 36% • LD: 40% <- Lib Dem victory! Uniform National Swing: A Case Study
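Steps 1–5 above can be sketched in a few lines of Python. Note that the slide's swings are proportional rather than additive: the Conservatives' −8% is 33/36 − 1 ≈ −8.3% of their previous share, not 8 percentage points. The figures below come straight from the Sheffield Hallam example; rounding may differ slightly from the slide numbers.

```python
# Proportional uniform national swing (UNS), using the Sheffield
# Hallam figures from the slides (vote shares in %).
national_2010 = {"CON": 36.0, "LAB": 29.0, "LD": 23.0}
polling_2015  = {"CON": 33.0, "LAB": 33.0, "LD": 9.0}
hallam_2010   = {"CON": 24.0, "LAB": 16.0, "LD": 53.0}

# Step 2: relative change in each party's national share.
swing = {p: polling_2015[p] / national_2010[p] - 1 for p in national_2010}
# ≈ {'CON': -0.083, 'LAB': 0.138, 'LD': -0.609}

# Step 3: scale the constituency's previous result by the same factor.
forecast = {p: hallam_2010[p] * (1 + swing[p]) for p in hallam_2010}

# Step 4: the forecast winner is the party with the largest share.
winner = max(forecast, key=forecast.get)
print(forecast)  # CON ≈ 22, LAB ≈ 18, LD ≈ 21
print(winner)    # CON — yet the seat actually went to the Lib Dems in 2015
```

Step 5 is exactly where this model breaks: the swing is national, so it cannot see constituency-level effects such as a well-known incumbent.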
  10. 16.

    Is there a better way? • Use rigorous & modern modelling techniques. • Cross-validate, backtest, evaluate for predictive accuracy. • Open source our code, data, methodology. • Blend in multiple data sources, not just polling. • Greater understanding of what drives election outcomes. Uniform National Swing: A Case Study
  11. 18.

    Average forecast (left) vs actual results (right) for the 2015 general election What went wrong in 2015?
  12. 20.

    • "Shy Tories"? Mostly a convenient myth. • "Lazy Labour"? • Biased sampling? • Herding? What went wrong in 2015?
  13. 21.

    • "Shy Tories"? Mostly a convenient myth. • "Lazy Labour"? Partially true. • Biased sampling? • Herding? What went wrong in 2015?
  14. 22.

    • "Shy Tories"? Mostly a convenient myth. • "Lazy Labour"? Partially true. • Biased sampling? Definitely true. • Herding? What went wrong in 2015?
  15. 24.

    • "Shy Tories"? Mostly a convenient myth. • "Lazy Labour"? Partially true. • Biased sampling? Definitely true. • Herding? Definitely true. What went wrong in 2015?
  16. 25.

    • "Shy Tories"? Mostly a convenient myth. • "Lazy Labour"? Partially true. • Biased sampling? Definitely true. • Herding? Definitely true. One survey found that 75% of US adults don't trust surveys. What went wrong in 2015?
  17. 28.

    Raw data contains: • Voting Intention (“Which party will you be voting for on June 8th?”) • Party leader satisfaction • Policy preferences (“Do you think tuition fees should be abolished?”) • Demographic background (location, gender, age, education, etc.) • Voted during EU Referendum? Remain or Leave? • Voted during 2015 general election? Which party voted for? • Questions designed to gauge likelihood of voting
  18. 36.

    Rolling our own open polling data pipeline 1. Alert 2. Update 3. Automate https://github.com/six50/pipeline/
  19. 43.

    The Plan 1. Replicate UNS model 2. Check predictions against other forecasts 3. Evaluate with historical election data 4. Build ML model 5. Iterate...
  20. 44.

    UNS models National UNS forecast for 2017 • CON: 337 • LAB: 236 • SNP: 47 • LD: 6 • PC: 5 • Other: 19
  21. 45.

    National UNS forecast for 2017 • CON: 337 • LAB: 236 • SNP: 47 • LD: 6 • PC: 5 • Other: 19 Regional UNS forecast for 2017 • CON: 342 • LAB: 236 • SNP: 43 • LD: 6 • PC: 4 • Other: 19 UNS models
  22. 47.

    Model Evaluation • Repeat for GE2010 -> GE2015 • Forecast GE2015 seat-by-seat • Evaluate: • National Vote Share (MAE) = 2% error • National Seat Count (total) = 135 incorrectly called • Seat-by-seat Accuracy = 79% correctly called • Seat vote share MAE = 4.6% mean error / party / seat
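The seat-level metrics above are simple to compute once you have forecast and actual vote shares per seat. A minimal sketch, using two invented seats purely for illustration (these are not SixFifty's evaluation data):

```python
# Hypothetical per-seat forecast vs. actual vote shares (%).
forecast = {
    "Seat A": {"CON": 22, "LAB": 18, "LD": 20},
    "Seat B": {"CON": 15, "LAB": 60, "LD": 10},
}
actual = {
    "Seat A": {"CON": 14, "LAB": 36, "LD": 40},
    "Seat B": {"CON": 13, "LAB": 63, "LD": 8},
}

def winner(shares):
    """Party with the largest vote share."""
    return max(shares, key=shares.get)

# Seat-by-seat accuracy: share of seats where the forecast winner won.
correct = sum(winner(forecast[s]) == winner(actual[s]) for s in forecast)
accuracy = correct / len(forecast)   # 0.5: Seat A called wrong, Seat B right

# Seat vote-share MAE: mean absolute error per party per seat.
errors = [abs(forecast[s][p] - actual[s][p]) for s in forecast for p in forecast[s]]
mae = sum(errors) / len(errors)

print(f"accuracy={accuracy:.0%}, MAE={mae:.1f} points")
```

The same two functions cover the national metrics too: sum votes over seats before taking the MAE, and count forecast winners per party for the seat totals.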
  23. 48.

    Build ML Model • Polling data only for 2010, 2015, 2017 • CON / LAB / LD / UKIP / GRN only • Calculate and forecast using UNS => useful feature! • Features: • Region • Electorate (registered, total who voted, total votes cast) • Party • Previous election: total votes, vote share, won constituency? • Current election: poll vote share, swing • UNS forecast: vote share (%), predicted winner
  24. 50.

    Build ML Model • Evaluation • 5x5 cross-validation on GE2015 predictions from GE2010 data • Models (scikit-learn) • Linear Regression (Simple, Lasso, Ridge) • Ensemble (Random Forest, Gradient Boosting, Extra Trees) • Neural net (MLPRegressor) • Tune best default model (Gradient Boosting)
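The comparison could be set up roughly as below with scikit-learn. Everything here is illustrative: the features are synthetic stand-ins for the real ones listed on the previous slide (previous vote share, poll share, UNS forecast), the "5x5" scheme is interpreted as RepeatedKFold with 5 splits and 5 repeats, and only a subset of the models is shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 features per (party, seat) row, target = vote share.
X = rng.uniform(0, 60, size=(400, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(0, 2, size=400)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "gb": GradientBoostingRegressor(random_state=0),
}

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)  # the "5x5" scheme
scores = {}
for name, model in models.items():
    # scikit-learn maximises scores, so MAE comes back negated.
    neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()

best = min(scores, key=scores.get)  # lowest mean absolute error wins
print({k: round(v, 2) for k, v in scores.items()})
print("best:", best)
```

On SixFifty's real features gradient boosting came out on top; on this synthetic linear target the plain linear models do, a reminder that model rankings are entirely data-dependent.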
  25. 52.

    • Seat vote share MAE • UNS model = 4.6% mean error / party / seat • GB model = 2.3% mean error / party / seat Evaluate ML Model
  26. 54.

    Regional UNS forecast for 2017 • CON: 342 • LAB: 236 • SNP: 43 • LD: 6 • PC: 4 • Other: 19 ML forecast for 2017 • CON: 332 • LAB: 232 • SNP: 56 • LD: 5 • PC: 4 • Other: 19 Final Prediction
  27. 58.

    BREAK INTO DATA SCIENCE UPDATE This section has been added after the PyData talk. 1. Final forecast. 2. Comparison with election result. 3. Comparison with other forecasts.
  28. 59.

    Regional UNS forecast for 2017 • CON: 342 • LAB: 236 • SNP: 43 • LD: 6 • PC: 4 • GRN: 1 • Other: 19 ← Majority of 34 ML forecast for 2017 • CON: 336 • LAB: 232 • SNP: 52 • LD: 6 • PC: 4 • GRN: 1 • Other: 19 ← Majority of 22 Final Prediction
  29. 60.
  30. 61.

    ML forecast for 2017 • CON: 336 • LAB: 232 • SNP: 52 • LD: 6 • PC: 4 • GRN: 1 • Other: 19 ← Majority of 22 Actual result • CON: 317 • LAB: 262 • SNP: 35 • LD: 12 • PC: 4 • GRN: 1 • Other: 19 ← 9 short of majority Final Prediction
  31. 62.
  32. 63.

    TL;DR In seven weeks we built the most accurate election forecast made purely from public data.