Is my regression working? • Why did it make that decision? • Can I calculate on Pandas in parallel? • Can I automate my machine learning? • See GitHub for examples • Last year – my introduction to Random Forests as a worked process with examples and graphs
Check that our model works as we’d expect in the real world – are the “important features” really important, or are they noise? • Help colleagues gain confidence in the model • Diagnose whether certain examples are poorly understood
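One quick way to test whether an “important feature” is real or noise is a permutation check: shuffle one feature’s column and see how much the score drops. The sketch below is a minimal stdlib-only illustration with an invented toy dataset and a hypothetical stand-in “model” (a simple threshold), not any particular library’s implementation.

```python
import random

# Toy data: y depends only on feature 0; feature 1 is pure noise.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def accuracy(X, y):
    # Hypothetical stand-in "model": threshold on feature 0.
    preds = [1 if row[0] > 0.5 else 0 for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

base = accuracy(X, y)

def permutation_drop(X, y, col):
    # Shuffle one column: a genuinely important feature causes a
    # large accuracy drop; a noise feature barely moves the score.
    shuffled = [row[:] for row in X]
    values = [row[col] for row in shuffled]
    random.shuffle(values)
    for row, v in zip(shuffled, values):
        row[col] = v
    return base - accuracy(shuffled, y)

print(permutation_drop(X, y, 0))  # big drop: feature 0 matters
print(permutation_drop(X, y, 1))  # near zero: feature 1 is noise
```

A feature whose shuffled score matches the baseline is contributing nothing the model uses – a cheap sanity check before showing feature importances to colleagues.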
Yellowbrick – visualizers that plug into sklearn • Classification – class balance, confusion matrix • Regression – y vs ŷ, residual errors • Presented at PyDataLondon 2017 • http://www.scikit-yb.org/en/latest/
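The quantities behind those plots are simple counts, which makes them easy to check by hand. A minimal stdlib sketch of class balance and a confusion matrix on made-up labels (Yellowbrick wraps the same diagnostics as sklearn-compatible visualizers):

```python
from collections import Counter

# Invented true and predicted labels for illustration.
y_true = ["cat", "dog", "dog", "cat", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "dog", "dog", "dog"]

# Class balance: are the classes roughly even, or does one dominate?
balance = Counter(y_true)
print(balance)

# Confusion matrix: counts of (true, predicted) pairs; off-diagonal
# entries are the mistakes worth investigating.
confusion = Counter(zip(y_true, y_pred))
for (true, pred), n in sorted(confusion.items()):
    print(f"true={true} pred={pred}: {n}")
```

A heavily imbalanced `balance` often explains a misleading headline accuracy before you even look at the confusion matrix.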
ELI5 & LIME can explain single examples • Expensive house – many rooms, low LSTAT %, good pupil/teacher ratio • Cheap house – high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio • These interpretations differ from the global feature importances • Also see Kat Jarmul’s keynote: https://blog.kjamistan.com/towards-interpretable-reliable-models/ • Michał Łopuszyński @ PyDataWarsaw: https://www.slideshare.net/lopusz/debugging-machinelearning
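For a linear model, a single-example explanation of this kind reduces to weight × value contributions per feature. The sketch below uses invented feature names and weights (loosely echoing the housing features above) to show the shape of a local explanation; it is an illustration of the idea, not ELI5’s or LIME’s actual algorithm.

```python
# Hypothetical linear house-price model: each feature's contribution
# to ONE prediction is weight * value, on top of a bias term.
weights = {"rooms": 5.0, "lstat_pct": -0.6, "pupil_teacher": -1.2}
bias = 30.0

def explain(example):
    # Per-feature contributions for a single example, largest first --
    # a local explanation, distinct from global feature importances.
    contribs = {name: weights[name] * value for name, value in example.items()}
    prediction = bias + sum(contribs.values())
    return prediction, sorted(contribs.items(), key=lambda kv: -abs(kv[1]))

expensive = {"rooms": 8, "lstat_pct": 4, "pupil_teacher": 14}
pred, contribs = explain(expensive)
print(pred)      # 30 + 40 - 2.4 - 16.8 = 50.8
print(contribs)  # room count dominates this particular prediction
```

Two houses with the same predicted price can have very different contribution lists, which is exactly why local explanations diverge from the global importances.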
• Pandas-compatible parallel processor • Runs on many cores and machines • Also see: Automated data exploration by Víctor Zabalza from Friday • http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
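The pattern a Pandas-compatible parallel engine such as Dask automates is split-apply-combine: partition the rows, process partitions concurrently, then combine the partial results. A dependency-free sketch with `concurrent.futures` and plain lists (real Dask operates on DataFrames and can also use processes or a cluster for true parallelism; threads here just keep the sketch stdlib-only):

```python
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1_000))

def partition(seq, n):
    # Split the rows into n roughly equal chunks ("partitions").
    size = (len(seq) + n - 1) // n
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def process(chunk):
    # Per-partition work; in Dask this would be a Pandas operation.
    return sum(x * x for x in chunk)

# Apply step: each partition is processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, partition(rows, 4)))

# Combine step: merge the per-partition results.
total = sum(partials)
print(total)
```

The same three steps apply whether the “workers” are threads, processes, or machines – which is what lets the per-partition code stay ordinary Pandas.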
Mercedes (6-week competition, 5 days of my effort) • Finished in the top 50% with little more than TPOT and a few days’ work • Ensembled 3 estimators (2 from TPOT) • http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing/
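The ensembling step can be sketched as a weighted average of several estimators’ predictions. The prediction values below are made up for illustration, and simple averaging is just one combiner – the slide doesn’t say how the three estimators were actually blended.

```python
# Hypothetical per-example predictions from three regressors
# (e.g. two TPOT-discovered pipelines plus one hand-built model).
preds_a = [100.2, 98.7, 103.1]
preds_b = [99.5, 99.0, 102.4]
preds_c = [101.0, 97.9, 103.8]

def ensemble(*prediction_lists, weights=None):
    # Weighted mean of the estimators' predictions, example by example;
    # equal weights by default.
    if weights is None:
        weights = [1 / len(prediction_lists)] * len(prediction_lists)
    return [sum(w * p for w, p in zip(weights, example))
            for example in zip(*prediction_lists)]

blended = ensemble(preds_a, preds_b, preds_c)
print(blended)
```

Averaging tends to cancel the uncorrelated errors of the individual models, which is why even a crude blend often beats its best single member on the leaderboard.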
Debug your models just like you debug your code – explain their workings to colleagues • Write-up: http://ianozsvald.com/ • Training next year – what do you need? • Questions in exchange for beer :-) • Please send me a postcard if this is useful • See my longer diagnosis Notebook on GitHub: