SixFifty's Story: Why Are UK Elections So Hard To Predict?

John Sandall
September 20, 2017

Presented at the Applied Data Engineering meetup, London, September 2017.

https://www.meetup.com/Applied-Data-Engineering-London/events/242957677/

---

When Theresa May announced plans on April 18th for the UK to hold a general election, the news was met with much cynicism. However, as self-confessed psephologists (and huge fans of Nate Silver's FiveThirtyEight datablog), we were thrilled at the opportunity. SixFifty is a collaboration of data scientists, software engineers, data journalists and political operatives brought together within hours of the snap general election being announced.

Our goals:

• Understand why forecasting elections in the UK using open data is notoriously difficult, and see how far good statistical practice and modern machine learning methods can take us.

• Make political and demographic data more open and accessible by showcasing and releasing cleaned versions of the datasets we're using.

• We also hope that by communicating our methodology at a non-technical level we will contribute to improving statistical literacy, especially around concepts fundamental to elections, polling and open data.

In this talk we will cover our approach to creating an open polling data pipeline, the challenges we faced (especially around data provenance), the infrastructural design decisions made to remain lean under strict resource and time limitations, and the various technologies used to transform PDF polling tables into an election forecast more accurate than any other published prediction using open data.

Transcript

  1. BREAK INTO DATA SCIENCE SixFifty's Story: Why Are UK Elections So Hard To Predict? John Sandall, 20th September 2017. @john_sandall @SixFiftyData

  2. 18th April 2017

  3. WHAT IS DATA SCIENCE? Call to arms

  4. Inspiration

  5. Sidenote: Why am I addicted to poll trackers? Habit Forming 101: 1. Cue

  6. Sidenote: Why am I addicted to poll trackers? Habit Forming 101: 1. Cue 2. Routine

  7. Sidenote: Why am I addicted to poll trackers? Habit Forming 101: 1. Cue 2. Routine 3. Reward. "Predictable feedback loops don’t create desire" – Nir Eyal

  8. Why SixFifty? The UK has 650 parliamentary constituencies.

  9. How To Forecast A General Election

  10. Uniform National Swing: A Case Study. Welcome to Sheffield Hallam!
    2010 results for Sheffield Hallam:
    • CON: 24%
    • LAB: 16%
    • LD: 53%
    MP: Nick Clegg.

  11. Uniform National Swing: A Case Study. Step 1. Compare national results with latest polling
    2010 national results:
    • CON: 36%
    • LAB: 29%
    • LD: 23%
    2015 polling:
    • CON: 33%
    • LAB: 33%
    • LD: 9%

  12. Uniform National Swing: A Case Study. Step 2. Calculate "uniform national swing" (i.e. uplift)
    2010 -> 2015 UNS:
    • CON: -8%
    • LAB: +13%
    • LD: -62%

  13. Uniform National Swing: A Case Study. Step 3. Apply UNS to each constituency
    2015 forecast for Sheffield Hallam:
    • CON: 24% less 8% = 22%
    • LAB: 16% add 13% = 18%
    • LD: 53% less 62% = 20%

  14. Uniform National Swing: A Case Study. Step 4. Forecast winner
    2015 forecast for Sheffield Hallam:
    • CON: 22% <- Conservative victory!
    • LAB: 18%
    • LD: 20%

  15. Uniform National Swing: A Case Study. Step 5. So who won?
    2015 forecast for Sheffield Hallam:
    • CON: 22% <- Conservative victory!
    • LAB: 18%
    • LD: 20%
    2015 result for Sheffield Hallam:
    • CON: 14%
    • LAB: 36%
    • LD: 40% <- Lib Dem victory!
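
Steps 1 to 4 of the case study can be sketched in a few lines of Python. This is a minimal illustration, not the SixFifty code: it assumes the "swing" on the slides is a relative uplift (latest poll share divided by the previous national share), which is what the Sheffield Hallam arithmetic implies, and the function names are made up for the example. The slides quote rounded percentages, so recomputed figures can differ by a point or so, but the forecast winner is the same.

```python
# Figures taken from the slides (vote share, %).
NATIONAL_2010 = {"CON": 36, "LAB": 29, "LD": 23}
POLLS_2015 = {"CON": 33, "LAB": 33, "LD": 9}
HALLAM_2010 = {"CON": 24, "LAB": 16, "LD": 53}


def relative_swing(previous, current):
    """Step 2: uplift per party, e.g. -0.083 means 'down about 8%'."""
    return {party: current[party] / previous[party] - 1 for party in previous}


def apply_swing(seat_result, swing):
    """Step 3: scale a constituency's previous result by the national uplift."""
    return {party: share * (1 + swing[party]) for party, share in seat_result.items()}


def predicted_winner(forecast):
    """Step 4: the party with the largest forecast share wins the seat."""
    return max(forecast, key=forecast.get)


swing = relative_swing(NATIONAL_2010, POLLS_2015)
hallam_forecast = apply_swing(HALLAM_2010, swing)

for party, share in hallam_forecast.items():
    print(f"{party}: {share:.1f}%")
print("Forecast winner:", predicted_winner(hallam_forecast))
```

Running the same functions over every constituency's previous result would give a seat-by-seat forecast, which is what makes the method cheap to replicate and easy to get wrong in seats like this one.
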
  16. Uniform National Swing: A Case Study. Is there a better way?
    • Use regional polling where available.
    • Model out regional polls from national poll breakdowns.
    • Adjust each pollster for historical reliability or bias.
    • Adjust polls based on how they weight undecided voters.
    • Adjust based on current sentiments around polling accuracy.

  17. Uniform National Swing: A Case Study. But really...is there a better way?
    • Use rigorous & modern modelling techniques.
    • Cross-validate, backtest, evaluate for predictive accuracy.
    • Blend in multiple data sources, not just polling.
    • Greater understanding of what drives election outcomes.
    • Open source our code, data, methodology.

  18. Polling 101

  19. What went wrong in 2015? Average forecast (left) vs actual results (right) for the 2015 general election

  20. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"? Partially true.
    • Biased sampling? Definitely true.
    • Herding?

  21. What went wrong in 2015?

  22. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"? Partially true.
    • Biased sampling? Definitely true.
    • Herding? Definitely true.
    One survey found that 75% of US adults don't trust surveys.

  23. Polling data

  24. What does polling data look like?

  25. Raw data contains:
    • Voting Intention (“Which party will you be voting for on June 8th?”)
    • Party leader satisfaction
    • Policy preferences (“Do you think tuition fees should be abolished?”)
    • Demographic background (location, gender, age, education, etc)
    • Voted during EU Referendum? Remain or Leave?
    • Voted during 2015 general election? Which party voted for?
    • Questions designed to gauge likelihood of voting

  26. Raw data looks like:

  27. Automated data extraction? http://tabula.technology/

  28. Open polling data? bit.ly/UKPoliticsDatasets

  29. Open polling data? http://opinionbee.uk/

  30. Easily accessible? Methodology? Regional?

  31. Rolling our own open polling data pipeline 1. Alert

  32. Rolling our own open polling data pipeline 1. Alert 2. Update

  33. Rolling our own open polling data pipeline 1. Alert 2. Update 3. Automate https://github.com/six50/pipeline/

  34. Regional polling data

  35. Regional polling data

  36. Regional polling data

  37. D3 poll tracker https://sixfifty.org.uk/polls

  38. Polling data on SixFifty.org.uk https://sixfifty.org.uk/polls

  39. Technology Stack

  40. Guiding Principles
    1. Serverless: Extreme time constraints. DevOps skills needed elsewhere.
    2. Tech Agnostic: Volunteers working ad-hoc need to plug-and-play.
    3. Ubiquitous Tech: Stick to what most people know. No time to learn new tech. Stay lean.
    4. Self Organising: Create an infrastructure for volunteers to work independently.

  41. Stack
    • Project management, code management, comms: Phabricator. GitHub. Slack.
    • Polling data pipeline: RSS → Slack → PDF extraction → Google Sheets → Python (pandas) → S3.
    • Data visualisation: R (ggplot) for social media poll trackers. D3 for interactive website tracker.
    • Modelling: Python (pandas, scikit-learn). R (dplyr).

  42. Stack
    • Project management, code management, comms: Phabricator. GitHub. Slack.
    • Polling data pipeline: RSS → Slack → PDF extraction → Google Sheets → Python (pandas) → S3.
    • Data visualisation: R (ggplot) for social media poll trackers. D3 for interactive website tracker.
    • Modelling: Python (pandas, scikit-learn). R (dplyr).
    No data stored in git

  43. Open Data http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/

  44. Open Data https://github.com/six50/pipeline

  45. Open Data https://github.com/six50/pipeline

  46. Are Polls Open Data?

  47. Are Polls Open Data?

  48. Are Polls Open Data?

  49. The SixFifty Forecast Model

  50. The Plan
    1. Replicate UNS model
    2. Check predictions against other forecasts
    3. Evaluate with historical election data
    4. Build ML model
    5. Iterate...

  51. UNS models
    National UNS forecast for 2017:
    • CON: 337 ← Majority of 24
    • LAB: 236
    • SNP: 47
    • LD: 6
    • PC: 5
    • Other: 19

  52. UNS models
    National UNS forecast for 2017:
    • CON: 337 ← Majority of 24
    • LAB: 236
    • SNP: 47
    • LD: 6
    • PC: 5
    • Other: 19
    Regional UNS forecast for 2017:
    • CON: 342 ← Majority of 34
    • LAB: 236
    • SNP: 43
    • LD: 6
    • PC: 4
    • Other: 19

  53. Comparison With Other Forecasts
    Forecast                      Predicted Conservative Majority
    2015 result                   +10
    YouGov                        -24 (hung)
    SixFifty national UNS model   +24
    New Statesman                 +24
    SixFifty regional UNS model   +34
    Lord Ashcroft                 +64
    Elections Etc                 +66
    Electoral Calculus            +66
    Election Forecast             +82

  54. UNS models

  55. Model Evaluation
    1. Repeat for GE2010 -> GE2015
    2. Forecast GE2015 seat-by-seat
    3. Evaluate:
        • National Vote Share (MAE) = 2% error
        • National Seat Count (total) = 135 incorrectly called
        • Seat-by-seat Accuracy = 79% correctly called
        • Seat vote share MAE = 4.6% mean error / party / seat
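
The two seat-level metrics on this slide can be computed along the following lines. This is a hedged sketch rather than the project's evaluation code: the function names are illustrative, the Sheffield Hallam figures come from the earlier case study, and "Example Seat" is invented purely to give the functions a second data point.

```python
# Forecast and actual vote shares (%) per constituency.
# Sheffield Hallam is from the slides; "Example Seat" is hypothetical.
forecasts = {
    "Sheffield Hallam": {"CON": 22, "LAB": 18, "LD": 20},
    "Example Seat": {"CON": 45, "LAB": 35, "LD": 10},
}
results = {
    "Sheffield Hallam": {"CON": 14, "LAB": 36, "LD": 40},
    "Example Seat": {"CON": 47, "LAB": 33, "LD": 9},
}


def seat_winner(shares):
    """Party with the largest vote share in one constituency."""
    return max(shares, key=shares.get)


def seat_accuracy(forecasts, results):
    """Fraction of seats where the forecast winner matches the actual winner."""
    correct = sum(
        seat_winner(forecasts[seat]) == seat_winner(results[seat]) for seat in results
    )
    return correct / len(results)


def vote_share_mae(forecasts, results):
    """Mean absolute error in vote share, averaged per party per seat."""
    errors = [
        abs(forecasts[seat][party] - results[seat][party])
        for seat in results
        for party in results[seat]
    ]
    return sum(errors) / len(errors)


accuracy = seat_accuracy(forecasts, results)
mae = vote_share_mae(forecasts, results)
print(f"Seat-by-seat accuracy: {accuracy:.0%}")
print(f"Seat vote share MAE: {mae:.1f} points / party / seat")

# Consistency check on the slide's own numbers: 135 wrong calls out of 650 seats.
slide_accuracy = (650 - 135) / 650
print(f"Slide figures imply {slide_accuracy:.0%} of seats correctly called")
```

The last line shows that the "135 incorrectly called" and "79% correctly called" figures describe the same result in two ways.
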
  56. Build ML Model
    • Polling data only for 2010, 2015, 2017
    • CON / LAB / LD / UKIP / GRN only
    • Calculate and forecast using UNS => useful feature!
    • Features:
        • Region
        • Electorate (registered, total who voted, total votes cast)
        • Party
        • Previous election: total votes, vote share, won constituency?
        • Current election: poll vote share, swing
        • UNS forecast: vote share (%), predicted winner

  57. Build ML Model

  58. Build ML Model
    Evaluation:
    • 5x5 cross-validation on GE2015 predictions from GE2010 data
    Models (scikit-learn):
    • Linear Regression (Simple, Lasso, Ridge)
    • Ensemble (Random Forest, Gradient Boosting, Extra Trees)
    • Neural net (MLPRegressor)
    Tune best default model (Gradient Boosted Trees)
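
To make the evaluation scheme concrete, here is a small standard-library sketch of the splitting step. It assumes "5x5 cross-validation" means 5-fold cross-validation repeated 5 times with a fresh shuffle each time (the slide does not spell this out), and it uses 650 rows, one per constituency, only as a placeholder size. In scikit-learn, `sklearn.model_selection.RepeatedKFold` provides the same kind of splitter.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    Each repeat reshuffles the rows, cuts them into n_splits folds, and uses
    every fold once as the held-out test set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for held_out in range(n_splits):
            test = folds[held_out]
            train = [
                idx
                for fold_number, fold in enumerate(folds)
                if fold_number != held_out
                for idx in fold
            ]
            yield train, test


# Placeholder size: one row per constituency (an assumption for illustration).
splits = list(repeated_kfold(n_samples=650))
print(f"{len(splits)} train/test splits")
print(f"first split: {len(splits[0][0])} training rows, {len(splits[0][1])} test rows")
```

Each candidate model would be fitted on the training rows of every split and scored on the held-out rows, and the 25 scores averaged before choosing which model to tune.
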
  59. Build ML Model: Tune best default model (Gradient Boosting)

  60. Evaluate ML Model
    Seat vote share MAE:
    • UNS model = 4.6% mean error / party / seat
    • GB model = 2.3% mean error / party / seat
    TL;DR – 50% reduction in average error per seat

  61. Feature Importance

  62. Final Prediction
    Regional UNS forecast for 2017:
    • CON: 342 ← Majority of 34
    • LAB: 236
    • SNP: 43
    • LD: 6
    • PC: 4
    • GRN: 1
    • Other: 19
    ML forecast for 2017:
    • CON: 336 ← Majority of 22
    • LAB: 232
    • SNP: 52
    • LD: 6
    • PC: 4
    • GRN: 1
    • Other: 19

  63. Comparison With Other Forecasts
    Forecast                      Predicted Conservative Majority
    YouGov                        -24 (hung)
    Final SixFifty ML model       +22
    SixFifty national UNS model   +24
    New Statesman                 +24
    SixFifty regional UNS model   +34
    Lord Ashcroft                 +64
    Elections Etc                 +66
    Electoral Calculus            +66
    Election Forecast             +82

  64. Comparison With Other Forecasts https://www.thesun.co.uk/news/3686937/pollster-yougov-is-mocked-over-utter-tripe-poll-which-shows-theresa-may-losing-her-majority-in-election/

  65. Comparison With Other Forecasts https://www.thesun.co.uk/news/3686937/pollster-yougov-is-mocked-over-utter-tripe-poll-which-shows-theresa-may-losing-her-majority-in-election/

  66. Election Night!

  67. Reflection

  68. We Tried To Do Too Much!
    In seven weeks we...
    • Created a multitude of explainer articles and tech blogs
    • Built a best practice open data pipeline for polling data
    • Created multiple election forecasts using machine learning
    • Wrote scripts for scraping Twitter & tagging live TV
    • Ran an election data hackathon at Newspeak House
    • Lots of dead ends!

  69. We Tried To Do Too Much!

  70. We Tried To Do Too Much!

  71. We Tried To Do Too Much!

  72. How Did We Do?

  73. How Did We Do? Our Prediction What Happened

  74. Final Prediction
    ML forecast for 2017:
    • CON: 336 ← Majority of 22
    • LAB: 232
    • SNP: 52
    • LD: 6
    • PC: 4
    • GRN: 1
    • Other: 19
    Actual result:
    • CON: 317 ← 9 short of majority
    • LAB: 262
    • SNP: 35
    • LD: 12
    • PC: 4
    • GRN: 1
    • Other: 19
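
The majority figures and the gap between forecast and result follow directly from the seat counts. The sketch below recomputes them from the numbers on this slide; the helper names are illustrative, and "majority" is taken to mean the leading party's seats minus everyone else's, which matches the figures quoted in the talk.

```python
TOTAL_SEATS = 650

# Seat counts as shown on the slide.
ML_FORECAST = {"CON": 336, "LAB": 232, "SNP": 52, "LD": 6, "PC": 4, "GRN": 1, "Other": 19}
ACTUAL = {"CON": 317, "LAB": 262, "SNP": 35, "LD": 12, "PC": 4, "GRN": 1, "Other": 19}


def majority(seats, party="CON", total=TOTAL_SEATS):
    """Seats held by `party` minus seats held by all other parties combined."""
    return 2 * seats[party] - total


def seats_short_of_majority(seats, party="CON", total=TOTAL_SEATS):
    """How many more seats `party` needs to hold more than half the chamber."""
    needed = total // 2 + 1
    return max(0, needed - seats[party])


# Per-party forecast error (positive means the forecast was too high).
seat_errors = {party: ML_FORECAST[party] - ACTUAL[party] for party in ACTUAL}
total_abs_error = sum(abs(error) for error in seat_errors.values())

print("Forecast Conservative majority:", majority(ML_FORECAST))
print("Actual seats short of a majority:", seats_short_of_majority(ACTUAL))
print("Per-party seat error:", seat_errors)
print("Total absolute seat error:", total_abs_error)
```

Summing absolute per-party errors is only one of several reasonable ways to score a seat projection; it is used here because it needs nothing beyond the two columns on the slide.
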
  75. UK #GE2017 Final Seat Projections

  76. UK #GE2017 Final Seat Projections. TL;DR: In seven weeks we built the most accurate election forecast built purely from public data.

  77. Next steps

  78. Thank You