Presented at the Applied Data Engineering meetup, London, September 2017.
https://www.meetup.com/Applied-Data-Engineering-London/events/242957677/
---
When Theresa May announced on April 18th that the UK would hold a general election, the news was met with much cynicism. However, as self-confessed psephologists (and huge fans of Nate Silver's FiveThirtyEight datablog), we were thrilled at the opportunity. SixFifty is a collaboration of data scientists, software engineers, data journalists and political operatives brought together within hours of the snap general election being announced.
Our goals:
• Understand why forecasting elections in the UK using open data is notoriously difficult, and see how far good statistical practice and modern machine learning methods can take us.
• Make political and demographic data more open and accessible by showcasing and releasing cleaned versions of the datasets we're using.
• Contribute to improving statistical literacy, especially around concepts fundamental to elections, polling and open data, by communicating our methodology at a non-technical level.
In this talk we will cover our approach to creating an open polling data pipeline; the challenges we faced, especially around data provenance; the infrastructural design decisions made to remain lean under strict resource and time limitations; and the various technologies used to transform PDF polling tables into an election forecast more accurate than any other published prediction using open data.