Slide 1

Slide 1 text

BREAK INTO DATA SCIENCE Introducing SixFifty John Sandall 6th June 2017 @john_sandall @SixFiftyData

Slide 2

Slide 2 text

BREAK INTO DATA SCIENCE 18th April 2017

Slide 3

Slide 3 text

WHAT IS DATA SCIENCE? Call to arms

Slide 4

Slide 4 text

WHAT IS DATA SCIENCE? Inspiration

Slide 5

Slide 5 text

WHAT IS DATA SCIENCE? Sidenote: Why am I addicted to poll trackers? Habit Forming 101 1. Cue

Slide 6

Slide 6 text

WHAT IS DATA SCIENCE? Sidenote: Why am I addicted to poll trackers? Habit Forming 101 1. Cue 2. Routine

Slide 7

Slide 7 text

Sidenote: Why am I addicted to poll trackers?
Habit Forming 101
1. Cue
2. Routine
3. Reward
"Predictable feedback loops don’t create desire" – Nir Eyal

Slide 8

Slide 8 text

Why SixFifty?
The UK has 650 parliamentary constituencies.

Slide 9

Slide 9 text

General Election Forecasts

Slide 10

Slide 10 text

Uniform National Swing: A Case Study
Welcome to Sheffield Hallam
2010 results for Sheffield Hallam
• CON: 24%
• LAB: 16%
• LD: 53%
MP: Nick Clegg.

Slide 11

Slide 11 text

Uniform National Swing: A Case Study
Step 1. Compare national results with latest polling
2010 national results
• CON: 36%
• LAB: 29%
• LD: 23%
2015 polling
• CON: 33%
• LAB: 33%
• LD: 9%

Slide 12

Slide 12 text

Uniform National Swing: A Case Study
Step 2. Calculate "uniform national swing" (i.e. uplift)
2010 -> 2015 UNS
• CON: -8%
• LAB: +13%
• LD: -62%

Slide 13

Slide 13 text

Uniform National Swing: A Case Study
Step 3. Apply UNS to each constituency
2015 forecast for Sheffield Hallam
• CON: 24% less 8% = 22%
• LAB: 16% add 13% = 18%
• LD: 53% less 62% = 20%

Slide 14

Slide 14 text

Uniform National Swing: A Case Study
Step 4. Forecast winner
2015 forecast for Sheffield Hallam
• CON: 22% <- Conservative victory!
• LAB: 18%
• LD: 20%

Slide 15

Slide 15 text

Uniform National Swing: A Case Study
Step 5. So who won?
2015 forecast for Sheffield Hallam
• CON: 22% <- Conservative victory!
• LAB: 18%
• LD: 20%
2015 result for Sheffield Hallam
• CON: 14%
• LAB: 36%
• LD: 40% <- Lib Dem victory!
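The five steps of the case study can be reproduced in a few lines of Python. This is a minimal sketch, not the SixFifty code: the function names are made up for illustration, and the only data used are the rounded percentages from the slides. Note that the slides apply the swing as a proportional uplift (24% less 8% = 22%) rather than adding percentage points, and the sketch follows that. Because the inputs are rounded, the reproduced figures can differ from the slides by about a point.

```python
# Vote shares (%) as quoted on the slides; all values are rounded.
NATIONAL_2010 = {"CON": 36, "LAB": 29, "LD": 23}
POLLS_2015 = {"CON": 33, "LAB": 33, "LD": 9}
HALLAM_2010 = {"CON": 24, "LAB": 16, "LD": 53}


def national_uplift(previous, polls):
    """Step 2: relative change in each party's national share (-0.08 = 8% fall)."""
    return {party: polls[party] / previous[party] - 1 for party in previous}


def apply_uplift(seat_shares, uplift):
    """Step 3: scale each party's constituency share by its national uplift."""
    return {party: share * (1 + uplift[party]) for party, share in seat_shares.items()}


def forecast_winner(shares):
    """Step 4: the party with the largest forecast share wins the seat."""
    return max(shares, key=shares.get)


uplift = national_uplift(NATIONAL_2010, POLLS_2015)
forecast = apply_uplift(HALLAM_2010, uplift)
winner = forecast_winner(forecast)

for party, share in forecast.items():
    print(f"{party}: {share:.1f}%")
print("Forecast winner:", winner)
```

The sketch calls Sheffield Hallam for the Conservatives, matching the slide forecast and missing the actual Lib Dem hold in the same way.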

Slide 16

Slide 16 text

Uniform National Swing: A Case Study
Is there a better way?
• Use rigorous & modern modelling techniques.
• Cross-validate, backtest, evaluate for predictive accuracy.
• Open source our code, data, methodology.
• Blend in multiple data sources, not just polling.
• Greater understanding of what drives election outcomes.

Slide 17

Slide 17 text

Polling

Slide 18

Slide 18 text

What went wrong in 2015?
[Figure: Average forecast (left) vs actual results (right) for the 2015 general election]

Slide 19

Slide 19 text

What went wrong in 2015?
• "Shy Tories"?
• "Lazy Labour"?
• Biased sampling?
• Herding?

Slide 20

Slide 20 text

What went wrong in 2015?
• "Shy Tories"? Mostly a convenient myth.
• "Lazy Labour"?
• Biased sampling?
• Herding?

Slide 21

Slide 21 text

What went wrong in 2015?
• "Shy Tories"? Mostly a convenient myth.
• "Lazy Labour"? Partially true.
• Biased sampling?
• Herding?

Slide 22

Slide 22 text

What went wrong in 2015?
• "Shy Tories"? Mostly a convenient myth.
• "Lazy Labour"? Partially true.
• Biased sampling? Definitely true.
• Herding?

Slide 23

Slide 23 text

What went wrong in 2015?

Slide 24

Slide 24 text

What went wrong in 2015?
• "Shy Tories"? Mostly a convenient myth.
• "Lazy Labour"? Partially true.
• Biased sampling? Definitely true.
• Herding? Definitely true.

Slide 25

Slide 25 text

What went wrong in 2015?
• "Shy Tories"? Mostly a convenient myth.
• "Lazy Labour"? Partially true.
• Biased sampling? Definitely true.
• Herding? Definitely true.
One survey found that 75% of US adults don't trust surveys.

Slide 26

Slide 26 text

Polling data

Slide 27

Slide 27 text

What does polling data look like?

Slide 28

Slide 28 text

Raw data contains:
• Voting Intention (“Which party will you be voting for on June 8th?”)
• Party leader satisfaction
• Policy preferences (“Do you think tuition fees should be abolished?”)
• Demographic background (location, gender, age, education, etc.)
• Voted during EU Referendum? Remain or Leave?
• Voted during 2015 general election? Which party voted for?
• Questions designed to gauge likelihood of voting

Slide 29

Slide 29 text

Raw data looks like:

Slide 30

Slide 30 text

Automated data extraction? http://tabula.technology/

Slide 31

Slide 31 text

Open polling data? bit.ly/UKPoliticsDatasets

Slide 32

Slide 32 text

Open polling data? http://opinionbee.uk/

Slide 33

Slide 33 text

Easily accessible? Methodology? Regional?

Slide 34

Slide 34 text

Rolling our own open polling data pipeline 1. Alert

Slide 35

Slide 35 text

Rolling our own open polling data pipeline 1. Alert 2. Update

Slide 36

Slide 36 text

Rolling our own open polling data pipeline
1. Alert
2. Update
3. Automate
https://github.com/six50/pipeline/

Slide 37

Slide 37 text

Regional polling data

Slide 38

Slide 38 text

Regional polling data

Slide 39

Slide 39 text

Regional polling data

Slide 40

Slide 40 text

D3 poll tracker https://sixfifty.org.uk/polls

Slide 41

Slide 41 text

Polling data on SixFifty.org.uk https://sixfifty.org.uk/polls

Slide 42

Slide 42 text

Forecast Model

Slide 43

Slide 43 text

The Plan
1. Replicate UNS model
2. Check predictions against other forecasts
3. Evaluate with historical election data
4. Build ML model
5. Iterate...

Slide 44

Slide 44 text

UNS models
National UNS forecast for 2017
• CON: 337
• LAB: 236
• SNP: 47
• LD: 6
• PC: 5
• Other: 19

Slide 45

Slide 45 text

UNS models
National UNS forecast for 2017
• CON: 337
• LAB: 236
• SNP: 47
• LD: 6
• PC: 5
• Other: 19
Regional UNS forecast for 2017
• CON: 342
• LAB: 236
• SNP: 43
• LD: 6
• PC: 4
• Other: 19

Slide 46

Slide 46 text

UNS models

Slide 47

Slide 47 text

Model Evaluation
• Repeat for GE2010 -> GE2015
• Forecast GE2015 seat-by-seat
• Evaluate:
  • National Vote Share (MAE) = 2% error
  • National Seat Count (total) = 135 incorrectly called
  • Seat-by-seat Accuracy = 79% correctly called
  • Seat vote share MAE = 4.6% mean error / party / seat
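The seat-level metrics named above (seat vote share MAE and seat-by-seat accuracy) can be computed as follows. This is an illustrative sketch with hypothetical function names. The only seat used is Sheffield Hallam, taken from the earlier case-study slides, so the printed numbers describe that one seat and not the national figures quoted on this slide.

```python
def seat_mae(forecast, actual):
    """Mean absolute error in vote share (percentage points) across parties in one seat."""
    return sum(abs(forecast[p] - actual[p]) for p in actual) / len(actual)


def called_correctly(forecast, actual):
    """True if the forecast winner matches the actual winner."""
    return max(forecast, key=forecast.get) == max(actual, key=actual.get)


def evaluate(seats):
    """Average seat MAE and share of seats called correctly over (forecast, actual) pairs."""
    maes = [seat_mae(f, a) for f, a in seats]
    hits = [called_correctly(f, a) for f, a in seats]
    return sum(maes) / len(maes), sum(hits) / len(hits)


# Sheffield Hallam: UNS forecast vs 2015 result, as quoted on the slides.
hallam_forecast = {"CON": 22, "LAB": 18, "LD": 20}
hallam_actual = {"CON": 14, "LAB": 36, "LD": 40}

mae, accuracy = evaluate([(hallam_forecast, hallam_actual)])
print(f"Seat vote share MAE: {mae:.1f} points per party")
print(f"Seats called correctly: {accuracy:.0%}")
```

Passing one (forecast, actual) pair per constituency to `evaluate` would give the per-party, per-seat mean error and the accuracy figure in the form the slide reports them.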

Slide 48

Slide 48 text

Build ML Model
• Polling data only for 2010, 2015, 2017
• CON / LAB / LD / UKIP / GRN only
• Calculate and forecast using UNS => useful feature!
• Features:
  • Region
  • Electorate (registered, total who voted, total votes cast)
  • Party
  • Previous election: total votes, vote share, won constituency?
  • Current election: poll vote share, swing
  • UNS forecast: vote share (%), predicted winner

Slide 49

Slide 49 text

Build ML Model

Slide 50

Slide 50 text

Build ML Model
• Evaluation
  • 5x5 cross-validation on GE2015 predictions from GE2010 data
• Models (scikit-learn)
  • Linear Regression (Simple, Lasso, Ridge)
  • Ensemble (Random Forest, Gradient Boosting, Extra Trees)
  • Neural net (MLPRegressor)
• Tune best default model (Gradient Boosting)
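The slides do not spell out what "5x5 cross-validation" means; a common reading, assumed here, is 5-fold cross-validation repeated 5 times with a fresh shuffle each time, giving 25 train/test splits. The sketch below shows only that resampling scheme, in plain Python. The data are synthetic and the "model" is a stand-in that predicts the training mean, so none of the numbers relate to the talk's results. In scikit-learn the same splitting is available as `RepeatedKFold`.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (train, test) index lists: n_repeats reshuffles of an n_splits-fold partition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_splits):
            test = indices[fold::n_splits]  # every n_splits-th shuffled index
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


def mean_baseline_mae(y, train, test):
    """Stand-in model: predict the training mean, score by mean absolute error."""
    prediction = mean(y[i] for i in train)
    return mean(abs(y[i] - prediction) for i in test)


# Synthetic "vote shares" (%), generated for illustration only; not real election data.
data_rng = random.Random(42)
y = [data_rng.uniform(0, 60) for _ in range(100)]

scores = [mean_baseline_mae(y, train, test) for train, test in repeated_kfold(len(y))]
cv_mae = mean(scores)
print(f"{len(scores)} splits, cross-validated MAE = {cv_mae:.2f}")
```

Averaging over repeated shuffles reduces the dependence of the score on any single partition, which matters when there are only a few hundred constituencies to learn from.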

Slide 51

Slide 51 text

Build ML Model • Tune best default model (Gradient Boosting)

Slide 52

Slide 52 text

Evaluate ML Model
• Seat vote share MAE
  • UNS model = 4.6% mean error / party / seat
  • GB model = 2.3% mean error / party / seat

Slide 53

Slide 53 text

Feature Importance

Slide 54

Slide 54 text

Final Prediction
Regional UNS forecast for 2017
• CON: 342
• LAB: 236
• SNP: 43
• LD: 6
• PC: 4
• Other: 19
ML forecast for 2017
• CON: 332
• LAB: 232
• SNP: 56
• LD: 5
• PC: 4
• Other: 19

Slide 55

Slide 55 text

Caveats
• Very friendly to SNP!
• Overfitting to 2015 election
• Wish we had more data!

Slide 56

Slide 56 text

Next steps

Slide 57

Slide 57 text

Thank You

Slide 58

Slide 58 text

UPDATE
This section has been added after the PyData talk.
1. Final forecast.
2. Comparison with election result.
3. Comparison with other forecasts.

Slide 59

Slide 59 text

Final Prediction
Regional UNS forecast for 2017
• CON: 342 ← Majority of 34
• LAB: 236
• SNP: 43
• LD: 6
• PC: 4
• GRN: 1
• Other: 19
ML forecast for 2017
• CON: 336 ← Majority of 22
• LAB: 232
• SNP: 52
• LD: 6
• PC: 4
• GRN: 1
• Other: 19

Slide 61

Slide 61 text

Final Prediction
ML forecast for 2017
• CON: 336 ← Majority of 22
• LAB: 232
• SNP: 52
• LD: 6
• PC: 4
• GRN: 1
• Other: 19
Actual result
• CON: 317 ← 9 short of majority
• LAB: 262
• SNP: 35
• LD: 12
• PC: 4
• GRN: 1
• Other: 19
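The majority figures quoted on these slides follow from simple seat arithmetic. The sketch below uses hypothetical helper names and assumes all 650 seats count towards the total, which is a simplification, since the Speaker and abstaining MPs are ignored.

```python
TOTAL_SEATS = 650
SEATS_FOR_MAJORITY = TOTAL_SEATS // 2 + 1  # 326


def majority(seats, total=TOTAL_SEATS):
    """Seats held minus seats held by all other parties; negative means no majority."""
    return seats - (total - seats)


def seats_short(seats, needed=SEATS_FOR_MAJORITY):
    """How many more seats a party would need to reach a majority (0 if it has one)."""
    return max(0, needed - seats)


for label, con_seats in [("Regional UNS", 342), ("ML forecast", 336), ("Actual", 317)]:
    print(f"{label}: CON {con_seats}, majority {majority(con_seats)}, "
          f"short by {seats_short(con_seats)}")
```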

Slide 63

Slide 63 text

TL;DR: In seven weeks we built the most accurate election forecast based purely on public data.