Dirty data makes analysis and machine learning harder (or impossible!) and more prone to failure. I'll talk about the techniques we use at ModelInsight to fix badly encoded, inconsistent and hard-to-parse text data, which let us prepare real-world industrial data for research.
Topics will include text cleaning through normalisation and similarity measures, date parsing, data joining and visualisation. This talk is aimed at helping you make rapid progress on new projects.
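To give a flavour of text cleaning through normalisation and similarity measures, here is a minimal sketch that assumes only the Python standard library. It is not necessarily the approach shown in the talk, and the helper names `normalise` and `similarity` are hypothetical, chosen for illustration.

```python
import difflib
import unicodedata


def normalise(text):
    """Return a case-folded, accent-stripped, whitespace-collapsed copy of text.

    Hypothetical helper for illustration; real projects may need other rules.
    """
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so that "é" becomes "e".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, then collapse runs of whitespace into single spaces.
    return " ".join(stripped.casefold().split())


def similarity(a, b):
    """Score two strings between 0.0 and 1.0 after normalising both."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()


if __name__ == "__main__":
    print(normalise("  Café   LTD "))
    print(similarity("Acme Ltd", "ACME  Ltd."))
```

Normalising before comparing means that differences in case, accents and spacing do not lower the score, so the similarity measure reflects only the remaining textual differences.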
Conference link:
https://www.euroscipy.org/2015/schedule/presentation/4/
Write-up:
http://ianozsvald.com/2015/08/28/euroscipy-2015-and-data-cleaning-on-text-for-ml-talk/