
Psephology 101

John Sandall is founder of SixFifty, a non-partisan collaboration of data scientists, software engineers, journalists and political experts who came together to try to predict the UK General Election using open data and advanced modelling techniques. He is also a Fellow of Newspeak House and co-organiser of PyData Bristol. He will talk through his experience of building models using machine learning techniques and publicly available datasets.

John Sandall

June 06, 2019

Transcript

  1. Comparison With Other Forecasts

     Forecast                        Predicted Conservative Majority
     YouGov                          -24 (hung)
     Final SixFifty ML model         +22
     SixFifty national UNS model     +24
     New Statesman                   +24
     SixFifty regional UNS model     +34
     Lord Ashcroft                   +64
     Elections Etc                   +66
     Electoral Calculus              +66
     Election Forecast               +82

  2. TL;DR In seven weeks we built the most accurate election forecast made purely from public data. UK #GE2017 Final Seat Projections
  3. What are we doing? National vote share forecasts get a lot of media attention...
     https://www.theguardian.com/politics/2019/jun/01/brexit-party-nigel-farage-lead-opinion-poll-conservatives-opinium
     https://www.electoralcalculus.co.uk/
  4. What are we doing? ...but seat-level forecasts can feed into strategic decision making.
     https://www.electoralcalculus.co.uk/
  5. Forecasting Techniques
     1. Extrapolate from national polls.
     2. Extrapolate from regional polling.
     3. Maybe we can get better seat-level accuracy by taking into account what happened last time?
  6. Uniform National Swing: A Case Study
     Welcome to Sheffield Hallam! (MP: Nick Clegg)

     2010 Results (Sheffield Hallam): Con 24%, Lab 16%, Lib Dem 53%
  7. Uniform National Swing: A Case Study
     Step 1. Compare national results with latest polling.

     Party     2010 Results (Hallam)   2010 National Results   2015 National Polling
     Con       24%                     36%                     33%
     Lab       16%                     29%                     33%
     Lib Dem   53%                     23%                     9%
  8. Uniform National Swing: A Case Study
     Step 2. Calculate "uniform national swing" (i.e. uplift).

     Party     2010 (Hallam)   2010 National   2015 Polling   UNS
     Con       24%             36%             33%            -8%
     Lab       16%             29%             33%            +13%
     Lib Dem   53%             23%             9%             -62%
  9. Uniform National Swing: A Case Study
     Step 3. Apply UNS to each constituency.

     Party     2010 (Hallam)   2010 National   2015 Polling   UNS    2015 Forecast (Hallam)
     Con       24%             36%             33%            -8%    22%
     Lab       16%             29%             33%            +13%   18%
     Lib Dem   53%             23%             9%             -62%   20%
  10. Uniform National Swing: A Case Study
      Step 4. Forecast winner (Conservative victory?)

      Party     2010 (Hallam)   2010 National   2015 Polling   UNS    2015 Forecast (Hallam)
      Con       24%             36%             33%            -8%    22%
      Lab       16%             29%             33%            +13%   18%
      Lib Dem   53%             23%             9%             -62%   20%
  11. Uniform National Swing: A Case Study
      Step 5. So who won? (Liberal Democrat victory)

      Party     2010 (Hallam)   2010 National   2015 Polling   UNS    2015 Forecast   2015 Actual (Hallam)
      Con       24%             36%             33%            -8%    22%             14%
      Lab       16%             29%             33%            +13%   18%             36%
      Lib Dem   53%             23%             9%             -62%   20%             40%
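For concreteness, here is a minimal Python sketch of the UNS arithmetic in the tables above (the multiplicative "uplift" form; figures are taken from the slides, so rounding may differ by a point):

```python
# Uniform National Swing as a multiplicative uplift, per the case study.
national_2010 = {"Con": 0.36, "Lab": 0.29, "LD": 0.23}
polling_2015  = {"Con": 0.33, "Lab": 0.33, "LD": 0.09}
hallam_2010   = {"Con": 0.24, "Lab": 0.16, "LD": 0.53}

# Step 2: swing = relative change in each party's national share.
swing = {p: polling_2015[p] / national_2010[p] - 1 for p in national_2010}

# Step 3: apply the same national uplift to the constituency's 2010 share.
forecast = {p: hallam_2010[p] * (1 + swing[p]) for p in hallam_2010}

for party in forecast:
    print(f"{party}: swing {swing[party]:+.0%}, forecast {forecast[party]:.0%}")
# Forecasts roughly Con 22%, Lab 18%, LD 20%; the actual 2015 result was
# Con 14%, Lab 36%, LD 40%, i.e. UNS badly missed Sheffield Hallam.
```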
  12. Uniform National Swing: A Case Study
      Is there a better way?
      • Use regional polling where available.
      • Model out regional polls from national poll breakdowns.
      • Adjust each pollster for historical reliability or bias (sketched in code below).
      • Adjust polls based on how they weight undecided voters.
      • Adjust based on current sentiment around polling accuracy.
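One of those adjustments, correcting each pollster for its historical house effect, might look like the following sketch. The pollster names and bias figures are purely illustrative, not real measurements:

```python
# Illustrative only: de-bias each poll by its pollster's historical
# house effect before averaging. All names and figures are made up.
polls = [
    {"pollster": "Pollster A", "Con": 0.42, "Lab": 0.38},
    {"pollster": "Pollster B", "Con": 0.44, "Lab": 0.36},
]
# Historical mean error per pollster (positive = overstates that party).
house_effect = {
    "Pollster A": {"Con": +0.01, "Lab": -0.02},
    "Pollster B": {"Con": +0.03, "Lab": 0.00},
}

def debias(poll):
    bias = house_effect[poll["pollster"]]
    return {party: share - bias.get(party, 0.0)
            for party, share in poll.items() if party != "pollster"}

adjusted = [debias(p) for p in polls]
average = {party: sum(p[party] for p in adjusted) / len(adjusted)
           for party in adjusted[0]}
print(average)  # a de-biased national polling average
```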
  13. Is there a better way?
      • Use rigorous & modern modelling techniques.
      • Cross-validate, backtest, evaluate for predictive accuracy.
      • Blend in multiple data sources, not just polling.
      • Build a greater understanding of what drives election outcomes.
      • Open source our code, data and methodology.
  14. Backtesting If we simulated the last three elections... ...using ONLY data that was available before election night... ...how does each technique do?
  15. Backtesting If we simulated the last three elections... ...using ONLY data that was available before election night... ...how does each technique do?

      [Chart: "Error / seat" for each forecasting technique at the 2010, 2015 and 2017 elections; y-axis runs from 50% to 75%]

  16. Backtesting If we simulated the last three elections... ...using ONLY data that was available before election night... ...how does each technique do?

      [Chart: as above]
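A hedged sketch of what such a backtest harness could look like. The model interface and data-loading callables are assumptions for illustration, not the actual SixFifty code:

```python
# Illustrative backtest harness: for each past election, fit on data
# available before election night only, then score seat-level calls.
def backtest(model_cls, load_polls_before, load_results,
             elections=(2010, 2015, 2017)):
    """Return, per election year, the share of seats called correctly.

    model_cls, load_polls_before and load_results are hypothetical
    stand-ins supplied by the caller, not a real SixFifty API.
    """
    scores = {}
    for year in elections:
        model = model_cls().fit(load_polls_before(year))  # pre-election data only
        predicted = model.predict_seats()                 # {constituency: party}
        actual = load_results(year)                       # actual winners
        correct = sum(predicted[seat] == actual[seat] for seat in actual)
        scores[year] = correct / len(actual)
    return scores
```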
  17. Raw data contains (a sample row is sketched below):
      • Voting Intention (“Which party will you be voting for on June 8th?”)
      • Party leader satisfaction
      • Policy preferences (“Do you think tuition fees should be abolished?”)
      • Demographic background (location, gender, age, education, etc.)
      • Voted during EU Referendum? Remain or Leave?
      • Voted during 2015 general election? Which party did they vote for?
      • Questions designed to gauge likelihood of voting
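As a rough illustration only, a single respondent's row in such a raw dataset might look something like this; every field name here is hypothetical:

```python
# Hypothetical shape of one respondent's record; all field names invented.
respondent = {
    "voting_intention": "Lab",       # "Which party will you be voting for on June 8th?"
    "leader_satisfaction": {"May": 4, "Corbyn": 6},  # e.g. on a 0-10 scale
    "abolish_tuition_fees": True,    # policy preference question
    "region": "Yorkshire and The Humber",
    "gender": "F",
    "age": 34,
    "education": "Degree",
    "eu_ref_vote": "Remain",         # voted in the EU referendum?
    "ge2015_vote": "LD",             # which party in 2015?
    "likelihood_to_vote": 9,         # gauges turnout probability
}
```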
  18. Auto-extraction of polling data
      • Pay people? Done this: expensive, inaccurate.
      • Scraping PDFs (Tabula)? Done this: costly, brittle, doesn't scale (see the sketch after this list).
      • Deep Learning? Dropbox solved a much simpler problem with a much larger team; solve this and you have a unicorn!
      • Collaboration with pollsters? A slow road, but the most realistic; why would they invest in this?
      • Regardless, detailed historical data pre-2012 (with demographic breakouts) is hard to find.
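The Tabula route above is typically driven from Python via tabula-py, a wrapper around Tabula. A minimal sketch, where the PDF file name is a placeholder:

```python
# Minimal sketch of scraping polling tables from a PDF with tabula-py
# (pip install tabula-py; needs a Java runtime). File name is hypothetical.
import tabula

# Each table Tabula detects comes back as a pandas DataFrame.
tables = tabula.read_pdf("pollster_tables.pdf", pages="all", lattice=True)

for df in tables:
    print(df.head())
# In practice this is brittle: pollsters change layouts between releases,
# merged header cells split badly, and every new format needs hand-tuning.
```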
  19. Data challenges
      • No single data hub for political science.
      • Lack of consistent identifiers makes it hard to join datasets (see the sketch after this list).
      • Joining overlapping geographic regions is difficult (e.g. census districts to electoral constituencies).
      • There is no mechanism for political scientists and analysts to share reliably pre-processed data.
      • Lack of clear data licences inhibits sharing/republishing data.
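The identifier problem is concrete: the same seat can appear as "Sheffield, Hallam" in one file and "Sheffield Hallam" in another, so naive joins silently drop rows. A sketch of the kind of normalisation involved, assuming hypothetical file and column names:

```python
import pandas as pd

def standardise_name(name: str) -> str:
    """Normalise constituency names so that datasets join cleanly."""
    return name.lower().replace(",", "").replace("&", "and").strip()

# Hypothetical file and column names.
results = pd.read_csv("ge2015_results.csv")
census = pd.read_csv("census_by_constituency.csv")

results["key"] = results["constituency"].map(standardise_name)
census["key"] = census["constituency_name"].map(standardise_name)

merged = results.merge(census, on="key", how="inner", validate="1:1")
# Rows that still fail to match need a hand-maintained lookup, ideally
# keyed on ONS GSS codes (English constituencies begin "E14"), not names.
```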
  20. My solution
      • Python package for automatically downloading, processing, joining and creating model-ready datasets (a usage sketch follows below).
      • Automatically cleans and standardises lookup identifiers.
      • Moves towards open data whilst respecting licences by doing all of this on device.

      What else can I do? What can communities like Campaign Lab do?
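To illustrate the idea, using such a package might feel like the sketch below; the package name and every function in it are hypothetical, not the real library:

```python
# Hypothetical API sketch for a "download, clean, join" elections package.
import elections_data  # hypothetical package name

# Raw sources are fetched on demand, cached locally, and cleaned on the
# user's own machine, so no licensed data is ever redistributed.
results = elections_data.load("ge2015_results")
census = elections_data.load("census_2011", geography="constituency")

# Joins happen on standardised identifiers, yielding a model-ready table.
df = elections_data.join(results, census)
```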