Slide 1

Slide 1 text

Diagrammatic Diagnosis of Data BudapestBI 2018 Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald[.com] BudapestBI 2018
Introductions
● I’m an engineering data scientist
● Consulting in AI + Data Science for 15+ years
Blog: IanOzsvald.com

Slide 3

Slide 3 text

Community Announcement!
● Have you thanked a speaker, an organiser or Bence yet? Lots of volunteered time – please say thanks
● Thank contributors too!
● Did I take a photo?

Slide 4

Slide 4 text

Goals today
● How long since you had brand new data?
● Univariate investigations
● Show relationships with seaborn
● discover_feature_relationships – my new tool (feedback please!)
● Data stories

Slide 5

Slide 5 text

pandas_profiling (Titanic)

Slide 6

Slide 6 text

pandas_profiling

Slide 7

Slide 7 text

Describing Titanic relationships

Slide 8

Slide 8 text

Seaborn (Titanic data)
Non-formatted default pivot result
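A pivot like the one on this slide can be approximated with pandas alone. The sketch below uses a tiny made-up passenger table (not the real Titanic data); the column names `sex`, `pclass` and `survived` are assumptions borrowed from the usual Titanic CSV layout.

```python
import pandas as pd

# Tiny made-up sample for illustration, NOT the real Titanic data.
passengers = pd.DataFrame(
    {
        "sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
        "pclass": [1, 1, 3, 1, 3, 3, 3, 3],
        "survived": [1, 1, 0, 0, 0, 1, 0, 1],
    }
)

# The mean of a 0/1 column is the survival rate for each (sex, pclass) cell.
survival_rate = passengers.pivot_table(
    index="sex", columns="pclass", values="survived", aggfunc="mean"
)
print(survival_rate)
```

Passing the resulting table to `seaborn.heatmap(survival_rate, annot=True)` is one way to get a formatted, annotated version of the same pivot.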

Slide 9

Slide 9 text

Seaborn (Titanic data)

Slide 10

Slide 10 text

Seaborn (Titanic data)

Slide 11

Slide 11 text

Discovering relationships
● Project is on GitHub
● Shows correlations and machine-learned relationships for all feature pairs
● RandomForest + cross validation + some assumptions
● Categories are encoded as labels
● Feedback please!
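The general idea in the bullets above (a RandomForest scored with cross validation for every ordered pair of features) can be sketched with scikit-learn. This is not the discover_feature_relationships API; the function name `pairwise_scores`, its parameters and the synthetic data are all assumptions made for illustration, and only numeric columns are handled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def pairwise_scores(df, cv=3, seed=0):
    """Hypothetical helper: cell [target, feature] holds the mean
    cross-validated R^2 of predicting `target` from `feature` alone."""
    cols = list(df.columns)
    scores = pd.DataFrame(np.nan, index=cols, columns=cols)
    for target in cols:
        for feature in cols:
            if target == feature:
                continue  # leave the diagonal as NaN
            model = RandomForestRegressor(n_estimators=50, random_state=seed)
            r2 = cross_val_score(
                model, df[[feature]], df[target], cv=cv, scoring="r2"
            )
            scores.loc[target, feature] = r2.mean()
    return scores


# Synthetic data: y is fully determined by x, but x is not determined by y.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
demo = pd.DataFrame({"x": x, "y": x ** 2})

scores = pairwise_scores(demo)
print(scores.round(2))
```

Because each direction is scored separately, the resulting matrix need not be symmetric: here `x` predicts `y` well, while `y` predicts `x` poorly.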

Slide 12

Slide 12 text

Spearman feature correlations
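As a reminder of why Spearman is a useful default here, the sketch below compares it with Pearson on synthetic, monotonic but non-linear data (the columns `x`, `y` and `z` are invented for the example). Spearman is computed as the Pearson correlation of the ranks; pandas also offers `DataFrame.corr(method="spearman")` directly.

```python
import numpy as np
import pandas as pd

# Synthetic data: y rises and z falls monotonically with x, neither linearly.
x = np.arange(1, 21, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x / 4), "z": -(x ** 3)})

pearson = df.corr(method="pearson")
# Spearman correlation is the Pearson correlation of the ranked values.
spearman = df.rank().corr(method="pearson")

print(pearson.round(3))
print(spearman.round(3))
```

Spearman reports a perfect monotonic relationship for both pairs, while Pearson understates it because the relationships are not linear.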

Slide 13

Slide 13 text

Discovering relationships

Slide 14

Slide 14 text

Pandas scatter LSTAT vs MEDV

Slide 15

Slide 15 text

Seaborn JointGrid with alpha

Slide 16

Slide 16 text

Seaborn hex jointplot

Slide 17

Slide 17 text

Diff the upper and lower triangles
CRIM vs RAD might be interesting as both sides had "some" predictive power
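One way to diff the two triangles is to subtract the transpose of the score matrix from itself. The scores below are made up purely for illustration (they are not results from the talk), and the convention that cell [target, feature] holds how well `feature` predicts `target` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only; cell [target, feature] is how well
# `feature` predicts `target` (higher is better).
cols = ["CRIM", "RAD", "MEDV"]
scores = pd.DataFrame(
    [
        [np.nan, 0.2, 0.1],
        [0.8, np.nan, 0.3],
        [0.4, 0.3, np.nan],
    ],
    index=cols,
    columns=cols,
)

# Subtracting the transpose compares each pair in both directions.
asymmetry = scores - scores.T

# Keep only the upper triangle so every pair appears once,
# then rank the pairs by the size of the gap.
upper_mask = np.triu(np.ones(asymmetry.shape, dtype=bool), k=1)
ranked = (
    asymmetry.where(upper_mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(ranked)
```

Pairs with a large absolute gap are the non-symmetric relationships worth a closer look.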

Slide 18

Slide 18 text

A non-symmetric relationship
CRIM can predict RAD but RAD poorly predicts CRIM
So maybe we need better data?

Slide 19

Slide 19 text

NetworkX to show relationships
Who predicts MEDV directly and indirectly?
What new data might we try to get, given these relationships?
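The two questions above map naturally onto a directed graph. In the sketch below the edge list, the scores and the `threshold` cut-off are all made up for illustration (they are not results from the talk); an edge from A to B means "A predicts B".

```python
import networkx as nx

# Made-up (feature, target, score) triples for illustration only.
edges = [
    ("LSTAT", "MEDV", 0.7),
    ("RM", "MEDV", 0.6),
    ("CRIM", "RAD", 0.8),
    ("RAD", "TAX", 0.9),
    ("TAX", "LSTAT", 0.5),
    ("CHAS", "NOX", 0.1),
]
threshold = 0.4  # hypothetical cut-off for "predicts usefully"

graph = nx.DiGraph()
for feature, target, score in edges:
    if score >= threshold:
        graph.add_edge(feature, target, weight=score)

# Direct predictors have an edge straight into MEDV; indirect predictors
# reach MEDV only through one or more intermediate features.
direct = set(graph.predecessors("MEDV"))
indirect = nx.ancestors(graph, "MEDV") - direct

print("direct:", sorted(direct))
print("indirect:", sorted(indirect))
```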

Slide 20

Slide 20 text

Data Stories (@bertil_hatt)
● A short report describing the data and proposing things we could do with it
● Stuff we trust (or don’t)
● Interesting or unexpected relationships
● Missing data (e.g. missingno)
● Propose experiments that we might run on this data which generate a benefit
● Document, don’t forget!
https://medium.com/@bertil_hatt/what-does-bad-data-look-like-91dc2a7bcb7a
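For the missing-data part of a data story, a per-column summary is a quick starting point before reaching for a visual tool. The sketch below uses pandas only, on a small invented table with hypothetical column names.

```python
import numpy as np
import pandas as pd

# Small invented table for illustration.
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "fare": [7.25, 71.3, np.nan, 8.05],
        "sex": ["male", "female", "female", "male"],
    }
)

# Fraction of missing values per column, worst first.
missing_fraction = df.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

If missingno is installed, `missingno.matrix(df)` draws the same information as a row-by-row picture.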

Slide 21

Slide 21 text

Conclusions
● Visualise and communicate all of your data relationships
● Visit PyDataLondon 2019 :-)
● I’d love a postcard if you learned something?
● See more examples: https://github.com/ianozsvald