ianozsvald
November 16, 2018

On the Diagramatic Diagnosis of Data (BudapestBI 2018)

The wrong way to start your machine learning project is to “chuck everything into a model to see what happens”. The better way is to visualise your data to expose the relationships that you see, to confirm that your data looks good and to identify problems that are likely to make your life difficult. You’ll save time, you’ll understand “why” your data works and you’ll uncover problems sooner.
We’ll review ways to quickly and visually diagnose your data, to check it meets your assumptions and to prepare it for discussion with your colleagues. We’ll look at tools including Pandas, Seaborn and Pandas Profiling. At the end you’ll have new tools to help you confidently investigate new data with your associates.
This talk introduces Ian’s new discover_feature_relationships tool which will save you time during your Exploratory Data Analysis phase.
http://budapestbiforum.hu/2018/hu/eloadasok/on-the-diagramatic-diagnosis-of-data-ian-ozsvald-mor-consulting-ltd/

Transcript

  1. Introductions (@IanOzsvald[.com], BudapestBI 2018)
     • I’m an engineering data scientist
     • Consulting in AI + Data Science for 15+ years
     • Blog: IanOzsvald.com
  2. Community Announcement!
     • Have you thanked a speaker, an organiser or Bence yet? Lots of volunteered time – please say thanks
     • Thank contributors too!
     • Did I take a photo?
  3. Goals today
     • How long since you had brand new data?
     • Univariate investigations
     • Show relationships with seaborn
     • discover_feature_relationships – my new tool (feedback please!)
     • Data stories
  4. Discovering relationships
     • Project is on GitHub
     • Shows correlations and machine learned relationships for all feature pairs
     • RandomForest + cross validation + some assumptions
     • Categories encoded as labels
     • Feedback please!
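
The slide lists the ingredients (RandomForest, cross validation, label-encoded categories) without showing code. The following is a minimal sketch of that idea using scikit-learn, not the discover_feature_relationships API itself. The function name `pairwise_predictability`, the synthetic demo data, and the choice to treat every target as a regression problem scored by R² are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def pairwise_predictability(df, n_estimators=50, cv=3, random_state=0):
    """Score how well each single column predicts each other column.

    Returns a square DataFrame: rows are targets, columns are the single
    predictor. Each cell is the mean cross-validated R^2, floored at 0.
    """
    # Label-encode non-numeric columns (a simplifying assumption).
    encoded = df.copy()
    for col in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[col]):
            encoded[col] = encoded[col].astype("category").cat.codes

    cols = list(encoded.columns)
    scores = pd.DataFrame(np.nan, index=cols, columns=cols)
    for target in cols:
        for feature in cols:
            if target == feature:
                continue  # the diagonal stays NaN
            model = RandomForestRegressor(
                n_estimators=n_estimators, random_state=random_state
            )
            r2 = cross_val_score(
                model, encoded[[feature]], encoded[target], cv=cv, scoring="r2"
            )
            scores.loc[target, feature] = max(0.0, r2.mean())
    return scores


# Synthetic demo data (hypothetical, not from the talk).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
demo = pd.DataFrame({
    "x": x,
    "x_squared": x ** 2,            # predictable from x, but x's sign is lost
    "noise": rng.normal(size=300),  # unrelated to the other columns
})
scores = pairwise_predictability(demo)
print(scores.round(2))
```

In this demo `x` predicts `x_squared` well while the reverse direction scores near zero, which is the kind of one-way relationship a symmetric correlation matrix cannot express.
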
  5. Diff the upper and lower triangles
     CRIM vs RAD might be interesting as both sides had "some" predictive power
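
One way to read "diff the upper and lower triangles" is to subtract the score matrix's transpose from itself, so each feature pair gets a single asymmetry value. The sketch below assumes a matrix where rows are targets and columns are predictors. The column names and scores are hypothetical placeholders, not results from the talk.

```python
import numpy as np

cols = ["a", "b", "c"]
# Hypothetical scores: scores[t, f] is how well column f predicts column t.
scores = np.array([
    [np.nan, 0.25, 0.0],
    [0.75, np.nan, 0.5],
    [0.0, 0.5, np.nan],
])

# asym[t, f] > 0 means f predicts t better than t predicts f.
asym = scores - scores.T
# The weaker of the two directions: high values mean both sides have power.
mutual = np.minimum(scores, scores.T)

# Visit each unordered pair once via the upper triangle.
for i, j in zip(*np.triu_indices(len(cols), k=1)):
    print(
        f"{cols[i]} vs {cols[j]}: "
        f"asymmetry={asym[i, j]:+.2f}, weaker direction={mutual[i, j]:.2f}"
    )
```

A pair with a large asymmetry suggests a one-way relationship. A pair with a high weaker-direction score is one where both sides have some predictive power, as the slide notes for CRIM vs RAD.
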
  6. NetworkX to show relationships
     Who predicts MEDV directly and indirectly? What new data might we try to get, given these relationships?
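
The "directly and indirectly" question maps onto predecessors and ancestors in a directed graph. Here is a rough sketch with NetworkX, assuming an edge runs from predictor to target whenever a score clears a threshold. The node names, scores, and the 0.1 threshold are invented for illustration and are not taken from the talk.

```python
import networkx as nx

# Hypothetical (predictor, target, score) triples.
edges = [
    ("f1", "target", 0.6),
    ("f2", "target", 0.4),
    ("f3", "f1", 0.7),
    ("f4", "f3", 0.5),
    ("f5", "f2", 0.05),  # too weak, dropped by the threshold below
]
threshold = 0.1  # assumed cut-off for "has predictive power"

graph = nx.DiGraph()
for predictor, target, score in edges:
    if score >= threshold:
        graph.add_edge(predictor, target, weight=score)

# Direct predictors have an edge straight into the target.
direct = set(graph.predecessors("target"))
# Indirect predictors reach the target only through other features.
indirect = nx.ancestors(graph, "target") - direct

print("direct:", sorted(direct))
print("indirect:", sorted(indirect))
```

Indirect predictors are candidates for the slide's second question: a feature that only matters through an intermediate one may point at new data worth collecting.
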
  7. Data Stories (@bertil_hatt)
     • A short report describing the data and proposing things we could do with it
     • Stuff we trust (or don’t)
     • Interesting or unexpected relationships
     • Missing data (e.g. missingno)
     • Propose experiments that we might run on this data which generate a benefit
     • Document, don’t forget!
     https://medium.com/@bertil_hatt/what-does-bad-data-look-like-91dc2a7bcb7a
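
For the missing-data part of a data story, a per-column summary is often enough to start the conversation. The sketch below uses pandas only and a small made-up DataFrame. It is a tabular stand-in for the visual summaries the slide points to with missingno, not a use of that library.

```python
import numpy as np
import pandas as pd

# Made-up example data with gaps.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "city": ["Budapest", "London", None, None],
    "score": [0.2, 0.4, 0.1, 0.9],
})

# Count and fraction of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "fraction_missing": df.isna().mean(),
}).sort_values("fraction_missing", ascending=False)

print(missing)
```
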
  8. Conclusions
     • Visualise and communicate all of your data relationships
     • Visit PyDataLondon 2019 :-)
     • I’d love a postcard if you learned something?
     • See more examples: https://github.com/ianozsvald