SixFifty's Story: Why Are UK Elections So Hard To Predict?

BREAK INTO DATA SCIENCE
SixFifty's Story: Why Are UK Elections So Hard To Predict?
John Sandall, 2nd October 2017
@john_sandall @SixFiftyData

BREAK INTO DATA SCIENCE 18th April 2017

WHAT IS DATA SCIENCE? Call to arms

WHAT IS DATA SCIENCE? Inspiration

Why SixFifty?
The UK has 650 parliamentary constituencies.

BREAK INTO DATA SCIENCE How To Forecast A General Election

Uniform National Swing: A Case Study
Welcome to Sheffield Hallam!
2010 results for Sheffield Hallam
• CON: 24%
• LAB: 16%
• LD: 53%
MP: Nick Clegg.


Step 1. Compare national results with latest polling
2010 national results
• CON: 36%
• LAB: 29%
• LD: 23%
2015 polling
• CON: 33%
• LAB: 33%
• LD: 9%

Step 2. Calculate "uniform national swing" (i.e. uplift: each party's relative change in national vote share)
2010 -> 2015 UNS
• CON: -8%
• LAB: +13%
• LD: -62%

Step 3. Apply UNS to each constituency
2015 forecast for Sheffield Hallam
• CON: 24% less 8% = 22%
• LAB: 16% add 13% = 18%
• LD: 53% less 62% = 20%

Step 4. Forecast winner
2015 forecast for Sheffield Hallam
• CON: 22% <- Conservative victory!
• LAB: 18%
• LD: 20%
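
Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the SixFifty code: the function and variable names are hypothetical, and the uplift is read as a relative (multiplicative) change, which is the interpretation that matches the arithmetic on the slides. Because it works from unrounded uplifts, the results differ slightly from the rounded slide figures (the LD forecast comes out nearer 21% than 20%).

```python
# National vote shares (%) taken from the slides above.
NATIONAL_2010 = {"CON": 36, "LAB": 29, "LD": 23}
POLLING_2015 = {"CON": 33, "LAB": 33, "LD": 9}
# 2010 constituency result (%) for Sheffield Hallam, also from the slides.
SHEFFIELD_HALLAM_2010 = {"CON": 24, "LAB": 16, "LD": 53}


def national_uplift(previous, polling):
    """Relative change in each party's national share (e.g. -0.083 for -8.3%)."""
    return {party: polling[party] / previous[party] - 1 for party in previous}


def apply_uplift(constituency, uplift):
    """Scale each party's previous constituency share by the national uplift."""
    return {party: share * (1 + uplift[party]) for party, share in constituency.items()}


uplift = national_uplift(NATIONAL_2010, POLLING_2015)
forecast = apply_uplift(SHEFFIELD_HALLAM_2010, uplift)
winner = max(forecast, key=forecast.get)  # party with the largest forecast share

for party, share in forecast.items():
    print(f"{party}: {share:.1f}%")
print("Forecast winner:", winner)
```

The same two functions could be applied to every constituency in turn, which is all a uniform swing model does.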

Step 5. So who won?
2015 forecast for Sheffield Hallam
• CON: 22% <- Conservative victory!
• LAB: 18%
• LD: 20%
2015 result for Sheffield Hallam
• CON: 14%
• LAB: 36%
• LD: 40% <- Lib Dem victory!

Is there a better way?
• Use regional polling where available.
• Model out regional polls from national poll breakdowns.
• Adjust each pollster for historical reliability or bias.
• Adjust polls based on how they weight undecided voters.
• Adjust based on current sentiments around polling accuracy.

But really... is there a better way?
• Use rigorous & modern modelling techniques.
• Cross-validate, backtest, evaluate for predictive accuracy.
• Blend in multiple data sources, not just polling.
• Greater understanding of what drives election outcomes.
• Open source our code, data, methodology.

BREAK INTO DATA SCIENCE What Went Wrong In 2015?

Figure: average forecasts for the 2015 general election.

Figure: average forecast (left) vs actual results (right) for the 2015 general election.


• "Shy Tories"? Mostly a convenient myth.
• "Lazy Labour"? Partially true.
• Biased sampling? Definitely true.
• Herding? Definitely true.
One survey found that 75% of US adults don't trust surveys.

BREAK INTO DATA SCIENCE Polling data

What does polling data look like?

Raw data contains:
• Voting Intention (“Which party will you be voting for on June 8th?”)
• Party leader satisfaction
• Policy preferences (“Do you think tuition fees should be abolished?”)
• Demographic background (location, gender, age, education, etc)
• Voted during EU Referendum? Remain or Leave?
• Voted during 2015 general election? Which party voted for?
• Questions designed to gauge likelihood of voting

Raw data looks like:

Automated data extraction? http://tabula.technology/

Open polling data? bit.ly/UKPoliticsDatasets

Open polling data? http://opinionbee.uk/

Easily accessible? Methodology? Regional?


Rolling our own open polling data pipeline
1. Alert
2. Update
3. Automate
https://github.com/six50/pipeline/

Regional polling data

D3 poll tracker https://sixfifty.org.uk/polls

Polling data on SixFifty.org.uk https://sixfifty.org.uk/polls

BREAK INTO DATA SCIENCE Technology Stack

Guiding Principles
1. Serverless: Extreme time constraints. DevOps skills needed elsewhere.
2. Tech Agnostic: Volunteers working ad-hoc need to plug-and-play.
3. Ubiquitous Tech: Stick to what most people know. No time to learn new tech. Stay lean.
4. Self Organising: Create an infrastructure for volunteers to work independently.

Stack
• Project management, code management, comms: Phabricator. GitHub. Slack.
• Polling data pipeline: RSS → Slack → PDF extraction → Google Sheets → Python (pandas) → S3.
• Data visualisation: R (ggplot) for social media poll trackers. D3 for interactive website tracker.
• Modelling: Python (pandas, scikit-learn). R (dplyr).
No data stored in git.

Open Data http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/

Open Data https://github.com/six50/pipeline

Are Polls Open Data?

BREAK INTO DATA SCIENCE The SixFifty Forecast Model

The Plan
1. Replicate UNS model
2. Check predictions against other forecasts
3. Evaluate with historical election data
4. Build ML model
5. Iterate...


UNS models
National UNS forecast for 2017
• CON: 337 ← Majority of 24
• LAB: 236
• SNP: 47
• LD: 6
• PC: 5
• Other: 19
Regional UNS forecast for 2017
• CON: 342 ← Majority of 34
• LAB: 236
• SNP: 43
• LD: 6
• PC: 4
• Other: 19
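
The majority figures follow from the seat counts if a majority is taken as a party's seats minus all other seats combined. That convention is an assumption here, since the slides do not state the formula, and it ignores details such as the Speaker or abstaining MPs. A minimal sketch, with a hypothetical function name:

```python
TOTAL_SEATS = 650  # number of UK parliamentary constituencies


def majority(seats, total_seats=TOTAL_SEATS):
    """Seats held minus seats held by everyone else; negative means no majority."""
    return seats - (total_seats - seats)


# Conservative seat counts from the two UNS forecasts above.
for label, con_seats in [("National UNS", 337), ("Regional UNS", 342)]:
    print(f"{label}: CON {con_seats} seats, majority of {majority(con_seats)}")
```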

Comparison With Other Forecasts
Forecast: Predicted Conservative Majority
• 2015 result: +10
• YouGov: -24 (hung)
• SixFifty national UNS model: +24
• New Statesman: +24
• SixFifty regional UNS model: +34
• Lord Ashcroft: +64
• Elections Etc: +66
• Electoral Calculus: +66
• Election Forecast: +82


Model Evaluation
1. Repeat for GE2010 -> GE2015
2. Forecast GE2015 seat-by-seat
3. Evaluate:
• National Vote Share (MAE) = 2% error
• National Seat Count (total) = 135 incorrectly called
• Seat-by-seat Accuracy = 79% correctly called
• Seat vote share MAE = 4.6% mean error / party / seat
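
The seat-level metrics can be illustrated with the one seat the deck works through in full, Sheffield Hallam. The exact metric definitions used by the project are not given, so this sketch assumes the plainest reading: mean absolute error in vote share per party per seat, and a seat counted as correctly called when the forecast winner matches the actual winner. Function names are hypothetical.

```python
def mean_absolute_error(forecast, actual):
    """Mean absolute difference in vote share (percentage points) across parties."""
    return sum(abs(forecast[p] - actual[p]) for p in forecast) / len(forecast)


def winner(shares):
    """Party with the largest vote share."""
    return max(shares, key=shares.get)


# One seat only: the Sheffield Hallam figures from the case study slides.
seats = {
    "Sheffield Hallam": {
        "forecast": {"CON": 22, "LAB": 18, "LD": 20},
        "actual": {"CON": 14, "LAB": 36, "LD": 40},
    },
}

seat_mae = {
    name: mean_absolute_error(s["forecast"], s["actual"]) for name, s in seats.items()
}
correct = sum(winner(s["forecast"]) == winner(s["actual"]) for s in seats.values())

print(f"Seat vote share MAE: {sum(seat_mae.values()) / len(seat_mae):.1f} points")
print(f"Seats correctly called: {correct} of {len(seats)}")
```

With all 650 seats in the `seats` dictionary, the same two summaries would correspond to the last two evaluation bullets.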

Build ML Model
• Polling data only for 2010, 2015, 2017
• CON / LAB / LD / UKIP / GRN only
• Calculate and forecast using UNS => useful feature!
• Features:
  • Region
  • Electorate (registered, total who voted, total votes cast)
  • Party
  • Previous election: total votes, vote share, won constituency?
  • Current election: poll vote share, swing
  • UNS forecast: vote share (%), predicted winner


Build ML Model
Evaluation
• 5x5 cross-validation on GE2015 predictions from GE2010 data
Models (scikit-learn)
• Linear Regression (Simple, Lasso, Ridge)
• Ensemble (Random Forest, Gradient Boosting, Extra Trees)
• Neural net (MLPRegressor)
Tune best default model (Gradient Boosted Trees)
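
A comparison of default models along these lines might look like the sketch below. It is not the SixFifty code. Two assumptions are made: "5x5" is read as 5-fold cross-validation repeated 5 times, and the data is a synthetic stand-in generated with `make_regression`, since the real feature table is not shown in the deck.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (assumption): the real features and targets are not shown.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=20, random_state=0),
    "mlp": MLPRegressor(max_iter=500, random_state=0),
}

# "5x5" read here as 5-fold cross-validation repeated 5 times (assumption).
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()  # flip sign: scikit-learn reports negated MAE

# Lowest mean absolute error first.
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {mae:.2f}")
```

The ranking on synthetic data says nothing about the election problem; the point is only the shape of the loop, after which the best default model would be tuned further.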

Build ML Model Tune best default model (Gradient Boosting)

Evaluate ML Model
Seat vote share MAE
• UNS model = 4.6% mean error / party / seat
• GB model = 2.3% mean error / party / seat
TL;DR: 50% reduction in average error per seat

Feature Importance

Final Prediction
Regional UNS forecast for 2017
• CON: 342 ← Majority of 34
• LAB: 236
• SNP: 43
• LD: 6
• PC: 4
• GRN: 1
• Other: 19
ML forecast for 2017
• CON: 336 ← Majority of 22
• LAB: 232
• SNP: 52
• LD: 6
• PC: 4
• GRN: 1
• Other: 19

Comparison With Other Forecasts
Forecast: Predicted Conservative Majority
• YouGov: -24 (hung)
• Final SixFifty ML model: +22
• SixFifty national UNS model: +24
• New Statesman: +24
• SixFifty regional UNS model: +34
• Lord Ashcroft: +64
• Elections Etc: +66
• Electoral Calculus: +66
• Election Forecast: +82

Comparison With Other Forecasts https://www.thesun.co.uk/news/3686937/pollster-yougov-is-mocked-over-utter-tripe-poll-which-shows-theresa-may-losing-her-majority-in-election/

Election Night!

BREAK INTO DATA SCIENCE Reflection

We Tried To Do Too Much!

In seven weeks we...
• Created a multitude of explainer articles and tech blogs
• Built a best practice open data pipeline for polling data
• Created multiple election forecasts using machine learning
• Wrote scripts for scraping Twitter & tagging live TV
• Ran an election data hackathon at Newspeak House
• Lots of dead ends!

BREAK INTO DATA SCIENCE How Did We Do?

How Did We Do? Our Prediction What Happened

Final Prediction
ML forecast for 2017
• CON: 336 ← Majority of 22
• LAB: 232
• SNP: 52
• LD: 6
• PC: 4
• GRN: 1
• Other: 19
Actual result
• CON: 317 ← 9 short of majority
• LAB: 262
• SNP: 35
• LD: 12
• PC: 4
• GRN: 1
• Other: 19

UK #GE2017 Final Seat Projections

TL;DR: In seven weeks we built the most accurate election forecast based purely on public data.

BREAK INTO DATA SCIENCE Next steps

BREAK INTO DATA SCIENCE Thank You
