PyData - June 6th - Introducing SixFifty

John Sandall on SixFifty's Story: Why Are UK Elections So Hard To Predict?

When Theresa May announced plans on April 18th for the UK to hold a general election, the news was met with much cynicism. However, as self-confessed psephologists (and huge fans of Nate Silver's FiveThirtyEight datablog), we were thrilled at the opportunity. SixFifty is a collaboration of data scientists, software engineers, data journalists and political operatives brought together within hours of the snap general election being announced. Our goals:

• Understand why forecasting elections in the UK using open data is notoriously difficult, and see how far good statistical practice and modern machine learning methods can take us.

• Make political and demographic data more open and accessible by showcasing and releasing cleaned versions of the datasets we're using.

• Improve statistical literacy, especially around concepts fundamental to elections and polling, by communicating our methodology at a non-technical level.

About the speaker:

John Sandall spends his days working as a self-employed data science consultant and his nights working as a Data Science Instructor at General Assembly. Previously he was Lead Data Scientist at mobile ticketing startup YPlan, quantitative analyst at Apple, strategy/tech lead for non-profit startup STIR Education, co-founder of an ed-tech startup, and dabbled in genomics at Imperial College London.

John Sandall

June 27, 2017

Transcript

  1. BREAK INTO DATA SCIENCE
    Introducing SixFifty
    John Sandall
    6th June 2017
    @john_sandall
    @SixFiftyData

  2. BREAK INTO DATA SCIENCE
    18th April 2017

  3. WHAT IS DATA SCIENCE? Call to arms

  4. WHAT IS DATA SCIENCE?
    Inspiration

  5. WHAT IS DATA SCIENCE?
    Sidenote: Why am I addicted to poll trackers?
    Habit Forming 101
    1. Cue

  6. WHAT IS DATA SCIENCE?
    Sidenote: Why am I addicted to poll trackers?
    Habit Forming 101
    1. Cue
    2. Routine

  7. Sidenote: Why am I addicted to poll trackers?
    Habit Forming 101
    1. Cue
    2. Routine
    3. Reward
    "Predictable feedback loops don’t create desire" – Nir Eyal

  8. Why SixFifty?
    The UK has 650 parliamentary constituencies.

  9. BREAK INTO DATA SCIENCE
    General Election
    Forecasts

  10. Uniform National Swing: A Case Study
    Welcome to Sheffield Hallam
    2010 results for Sheffield Hallam
    • CON: 24%
    • LAB: 16%
    • LD: 53%
    MP: Nick Clegg.

  11. Uniform National Swing: A Case Study
    Step 1. Compare national results with latest polling
    2010 national results
    • CON: 36%
    • LAB: 29%
    • LD: 23%
    2015 polling
    • CON: 33%
    • LAB: 33%
    • LD: 9%

  12. Uniform National Swing: A Case Study
    Step 2. Calculate "uniform national swing" (i.e. uplift)
    2010 -> 2015 UNS (relative change in national vote share)
    • CON: -8%
    • LAB: +13%
    • LD: -62%

  13. Uniform National Swing: A Case Study
    Step 3. Apply UNS to each constituency
    2015 forecast for Sheffield Hallam
    • CON: 24% less 8% = 22%
    • LAB: 16% add 13% = 18%
    • LD: 53% less 62% = 20%

  14. Uniform National Swing: A Case Study
    Step 4. Forecast winner
    2015 forecast for Sheffield Hallam
    • CON: 22% <- Conservative victory!
    • LAB: 18%
    • LD: 20%
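
Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration rather than SixFifty's own code: the helper names are hypothetical, and the swing is treated as a relative change ("uplift"), which is how the slides apply it. Because the sketch starts from the rounded national figures shown on the slides, the last digit of a forecast can differ slightly from the slide values.

```python
# Vote shares (%) copied from the case study slides.
national_2010 = {"CON": 36, "LAB": 29, "LD": 23}
polling_2015 = {"CON": 33, "LAB": 33, "LD": 9}
sheffield_hallam_2010 = {"CON": 24, "LAB": 16, "LD": 53}


def relative_swing(previous, current):
    """Relative change in national vote share per party (hypothetical helper)."""
    return {party: (current[party] - previous[party]) / previous[party]
            for party in previous}


def apply_swing(seat_shares, swing):
    """Scale each party's previous share in one seat by the national swing."""
    return {party: share * (1 + swing[party])
            for party, share in seat_shares.items()}


swing = relative_swing(national_2010, polling_2015)
forecast = apply_swing(sheffield_hallam_2010, swing)

# The forecast winner is simply the party with the largest forecast share.
winner = max(forecast, key=forecast.get)

for party, share in forecast.items():
    print(f"{party}: {share:.1f}%")
print("Forecast winner:", winner)
```

The same swing is applied to every constituency, which is what makes the method "uniform" and also why it cannot capture local effects such as the one seen in Sheffield Hallam.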

  15. Uniform National Swing: A Case Study
    Step 5. So who won?
    2015 forecast for Sheffield Hallam
    • CON: 22% <- Conservative victory!
    • LAB: 18%
    • LD: 20%
    2015 result for Sheffield Hallam
    • CON: 14%
    • LAB: 36%
    • LD: 40% <- Lib Dem victory!

  16. Uniform National Swing: A Case Study
    Is there a better way?
    • Use rigorous & modern modelling techniques.
    • Cross-validate, backtest, evaluate for predictive accuracy.
    • Open source our code, data, methodology.
    • Blend in multiple data sources, not just polling.
    • Greater understanding of what drives election outcomes.

  17. BREAK INTO DATA SCIENCE
    Polling

  18. What went wrong in 2015?
    Average forecast (left) vs actual results (right) for the 2015 general election

  19. What went wrong in 2015?
    • "Shy Tories"?
    • "Lazy Labour"?
    • Biased sampling?
    • Herding?

  20. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"?
    • Biased sampling?
    • Herding?

  21. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"? Partially true.
    • Biased sampling?
    • Herding?

  22. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"? Partially true.
    • Biased sampling? Definitely true.
    • Herding?

  23. What went wrong in 2015?

  24. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"? Partially true.
    • Biased sampling? Definitely true.
    • Herding? Definitely true.

  25. What went wrong in 2015?
    • "Shy Tories"? Mostly a convenient myth.
    • "Lazy Labour"? Partially true.
    • Biased sampling? Definitely true.
    • Herding? Definitely true.
    One survey found that 75% of US adults don't trust surveys.

  26. BREAK INTO DATA SCIENCE
    Polling data

  27. What does polling data look like?

  28. Raw data contains:
    • Voting Intention (“Which party will you be voting for on June 8th?”)
    • Party leader satisfaction
    • Policy preferences (“Do you think tuition fees should be abolished?”)
    • Demographic background (location, gender, age, education, etc.)
    • Voted during EU Referendum? Remain or Leave?
    • Voted during 2015 general election? For which party?
    • Questions designed to gauge likelihood of voting

  29. Raw data looks like:

  30. Automated data extraction?
    http://tabula.technology/

  31. Open polling data?
    bit.ly/UKPoliticsDatasets

  32. Open polling data?
    http://opinionbee.uk/

  33. Easily accessible? Methodology? Regional?

  34. Rolling our own open polling data pipeline
    1. Alert

  35. Rolling our own open polling data pipeline
    1. Alert
    2. Update

  36. Rolling our own open polling data pipeline
    1. Alert
    2. Update
    3. Automate
    https://github.com/six50/pipeline/

  37. Regional polling data

  38. Regional polling data

  39. Regional polling data

  40. D3 poll tracker
    https://sixfifty.org.uk/polls

  41. Polling data on SixFifty.org.uk
    https://sixfifty.org.uk/polls

  42. BREAK INTO DATA SCIENCE
    Forecast Model

  43. The Plan
    1. Replicate UNS model
    2. Check predictions against other forecasts
    3. Evaluate with historical election data
    4. Build ML model
    5. Iterate...

  44. UNS models
    National UNS forecast for 2017
    • CON: 337
    • LAB: 236
    • SNP: 47
    • LD: 6
    • PC: 5
    • Other: 19

  45. UNS models
    National UNS forecast for 2017
    • CON: 337
    • LAB: 236
    • SNP: 47
    • LD: 6
    • PC: 5
    • Other: 19
    Regional UNS forecast for 2017
    • CON: 342
    • LAB: 236
    • SNP: 43
    • LD: 6
    • PC: 4
    • Other: 19

  46. UNS models

  47. Model Evaluation
    • Repeat for GE2010 -> GE2015
    • Forecast GE2015 seat-by-seat
    • Evaluate:
        • National Vote Share (MAE) = 2% error
        • National Seat Count (total) = 135 incorrectly called
        • Seat-by-seat Accuracy = 79% correctly called
        • Seat vote share MAE = 4.6% mean error / party / seat
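
As a rough illustration of two of these metrics, the sketch below computes the seat vote share MAE and the seat-by-seat accuracy for a single seat, using the Sheffield Hallam forecast and result from the case study earlier in the deck. The function names and the dictionary layout are assumptions made for this example; the evaluation described on the slide covered every constituency.

```python
# One seat only, for illustration: Sheffield Hallam figures (%) from the
# Uniform National Swing case study slides.
forecasts = {"Sheffield Hallam": {"CON": 22, "LAB": 18, "LD": 20}}
results = {"Sheffield Hallam": {"CON": 14, "LAB": 36, "LD": 40}}


def seat_vote_share_mae(forecasts, results):
    """Mean absolute error in vote share, averaged over every party in every seat."""
    errors = [
        abs(forecasts[seat][party] - results[seat][party])
        for seat in forecasts
        for party in forecasts[seat]
    ]
    return sum(errors) / len(errors)


def seat_accuracy(forecasts, results):
    """Fraction of seats where the forecast winner matches the actual winner."""
    correct = sum(
        max(forecasts[seat], key=forecasts[seat].get)
        == max(results[seat], key=results[seat].get)
        for seat in forecasts
    )
    return correct / len(forecasts)


mae = seat_vote_share_mae(forecasts, results)
accuracy = seat_accuracy(forecasts, results)
print(f"Seat vote share MAE: {mae:.1f} points per party per seat")
print(f"Seat-by-seat accuracy: {accuracy:.0%}")
```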

  48. Build ML Model
    • Polling data only for 2010, 2015, 2017
    • CON / LAB / LD / UKIP / GRN only
    • Calculate and forecast using UNS => useful feature!
    • Features:
        • Region
        • Electorate (registered, total who voted, total votes cast)
        • Party
        • Previous election: total votes, vote share, won constituency?
        • Current election: poll vote share, swing
        • UNS forecast: vote share (%), predicted winner

  49. Build ML Model

  50. Build ML Model
    • Evaluation
        • 5x5 cross-validation on GE2015 predictions from GE2010 data
    • Models (scikit-learn)
        • Linear Regression (Simple, Lasso, Ridge)
        • Ensemble (Random Forest, Gradient Boosting, Extra Trees)
        • Neural net (MLPRegressor)
    • Tune best default model (Gradient Boosting)
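
A minimal sketch of this kind of model comparison in scikit-learn. It assumes that "5x5 cross-validation" means 5-fold cross-validation repeated 5 times, and it runs on synthetic data because the real feature table is not reproduced in the slides. Only a subset of the listed models is included to keep the example short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the per-seat, per-party feature table (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 40 * X[:, 0] + 10 * X[:, 1] * X[:, 2] + rng.normal(0, 1, size=300)

# Assumption: "5x5" is read as 5 folds repeated 5 times (25 fits per model).
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

mae_by_model = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    mae_by_model[name] = -scores.mean()
    print(f"{name}: MAE = {mae_by_model[name]:.2f} over {len(scores)} fits")
```

Ranking the models by mean MAE with default settings, then tuning only the best one, mirrors the workflow listed on the slide.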

  51. Build ML Model
    • Tune best default model (Gradient Boosting)

  52. Evaluate ML Model
    • Seat vote share MAE
        • UNS model = 4.6% mean error / party / seat
        • GB model = 2.3% mean error / party / seat

  53. Feature Importance

  54. Final Prediction
    Regional UNS forecast for 2017
    • CON: 342
    • LAB: 236
    • SNP: 43
    • LD: 6
    • PC: 4
    • Other: 19
    ML forecast for 2017
    • CON: 332
    • LAB: 232
    • SNP: 56
    • LD: 5
    • PC: 4
    • Other: 19

  55. Caveats
    • Very friendly to SNP!
    • Overfitting to 2015 election
    • Wish we had more data!

  56. BREAK INTO DATA SCIENCE
    Next steps

  57. BREAK INTO DATA SCIENCE
    Thank You

  58. BREAK INTO DATA SCIENCE
    UPDATE
    This section has been added after the PyData talk.
    1. Final forecast.
    2. Comparison with election result.
    3. Comparison with other forecasts.

  59. Final Prediction
    Regional UNS forecast for 2017
    • CON: 342 ← Majority of 34
    • LAB: 236
    • SNP: 43
    • LD: 6
    • PC: 4
    • GRN: 1
    • Other: 19
    ML forecast for 2017
    • CON: 336 ← Majority of 22
    • LAB: 232
    • SNP: 52
    • LD: 6
    • PC: 4
    • GRN: 1
    • Other: 19

  60. View Slide

  61. Final Prediction
    ML forecast for 2017
    • CON: 336 ← Majority of 22
    • LAB: 232
    • SNP: 52
    • LD: 6
    • PC: 4
    • GRN: 1
    • Other: 19
    Actual result
    • CON: 317 ← 9 short of majority
    • LAB: 262
    • SNP: 35
    • LD: 12
    • PC: 4
    • GRN: 1
    • Other: 19
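
The majority figures quoted on these slides follow directly from the seat counts and the 650-seat total. The short sketch below reproduces them and lists the per-party seat error of the ML forecast; the helper names are hypothetical and only the slide figures are used.

```python
TOTAL_SEATS = 650  # UK parliamentary constituencies

# Seat counts copied from the slide.
ml_forecast = {"CON": 336, "LAB": 232, "SNP": 52, "LD": 6,
               "PC": 4, "GRN": 1, "Other": 19}
actual = {"CON": 317, "LAB": 262, "SNP": 35, "LD": 12,
          "PC": 4, "GRN": 1, "Other": 19}


def majority(seats, total=TOTAL_SEATS):
    """Seats held minus seats held by everyone else (negative means no majority)."""
    return seats - (total - seats)


def seats_short_of_majority(seats, total=TOTAL_SEATS):
    """Extra seats needed to reach just over half the House (0 if already there)."""
    return max(0, total // 2 + 1 - seats)


# Forecast minus actual: positive means the forecast was too high.
seat_errors = {party: ml_forecast[party] - actual[party] for party in actual}

print("Forecast majority:", majority(ml_forecast["CON"]))
print("Actual shortfall:", seats_short_of_majority(actual["CON"]))
print("Seat errors:", seat_errors)
```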

  62. View Slide

  63. TL;DR
    In seven weeks we built the most accurate election
    forecast based purely on public data.
