Slide 1

Slide 1 text

Statistically Solving Sneezes and Sniffles - A Work In Progress
PyDataBerlin 2016
License: CC By Attribution
Ian Ozsvald @IanOzsvald ModelInsight.io
Giles Weaver @GilesWeaver

Slide 2

Slide 2 text

[email protected] PyDataBerlin 2016 @IanOzsvald @gilesweaver
Who Are We?
● Ian - “Industrial Data Scientist” for 15 yrs
● Giles - bioinformatician turned data scientist

Slide 3

Slide 3 text

Goal
● Help my wife have a less sneezy life - therefore try to understand “what drives a person's Rhinitis?” (i.e. sneezes)
● Can we help folk reduce symptoms by explaining the drivers of those symptoms? A step towards “personalised medicine”?
● Could we help people reduce their medication?
● 10–30% of the Western population is affected by Allergic Rhinitis (overall ≈1.4 billion people?)
● Some antihistamines (AH) have negative health associations (e.g. anticholinergics [inc. U.S. Benadryl] linked to Alzheimer's)
● UK folk don't tend to use these AHs, but nobody knows the consequences of long-term usage

Slide 4

Slide 4 text

Counts of daily sneezes & AH
Using Seaborn and Pandas DataFrames for countplots
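A minimal sketch of the tally behind such a countplot, assuming the event log is a DataFrame with one timestamped row per event. The column names `timestamp` and `event` and the sample rows are hypothetical, not taken from the app's schema.

```python
import pandas as pd

# Hypothetical event log: one row per logged event (invented sample data).
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2015-06-01 08:15:00",
                "2015-06-01 08:16:30",
                "2015-06-01 21:00:00",
                "2015-06-02 09:00:00",
                "2015-06-04 07:45:00",
            ]
        ),
        "event": ["sneeze", "sneeze", "antihistamine", "sneeze", "sneeze"],
    }
)

sneezes = events[events["event"] == "sneeze"].set_index("timestamp")

# Resample to calendar days; days without a logged sneeze become 0.
daily_sneezes = sneezes["event"].resample("D").count()

# Tally how many days had 0, 1, 2, ... sneezes: the bar heights of a countplot.
days_per_count = daily_sneezes.value_counts().sort_index()

# With Seaborn installed, sns.countplot(x=daily_sneezes) should draw this tally.
```

Note that the resample step fills unlogged days with 0, so the tally cannot distinguish "no sneezes" from "nothing was logged".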

Slide 5

Slide 5 text

Hypothesis
● “Ian's wife Emily suffers from non-allergic Rhinitis” (not allergic or infectious Rhinitis)
● “Possibly it is weather related”
● “Alcohol might make things worse”
● “Airborne pollution might be a factor”
● (I trust that antihistamines work)
● We need to gather data so we can answer these questions - this is a small data problem
● Note - sneeze & AH behaviour is similar out of the country and at home (I'm not the cause! Nor, probably, is our cat, nor the apartment)

Slide 6

Slide 6 text

Data Gathering Methodology
● iOS
● Event logs
● GPS trace
● Editable history
● Open source
● >1yr old
github.com/radicalrobot/allergy-tracker

Slide 7

Slide 7 text

Some data issues
● Apple's DateTime epoch != the Unix DateTime epoch (use ISO 8601!)
● GPS on London Underground on iPhone 6 confidently reports location (0,0) # Nigeria?!
● Weak experimental design (in hindsight) - we're logging positive events - does “0 events” mean “nothing happened” or “we forgot to log stuff”?
● SQLite->DataFrame with Python for clean-up
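The epoch mismatch can be sketched as follows. Apple's reference date is 2001-01-01 00:00:00 UTC, while Unix time counts from 1970-01-01. That the app's SQLite store holds timestamps as seconds since Apple's reference date is an assumption here, and the helper names `apple_to_datetime` and `is_null_island` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Apple's reference date (NSDate / Core Data) versus the Unix epoch.
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def apple_to_datetime(seconds: float) -> datetime:
    """Convert seconds since Apple's reference date to an aware UTC datetime."""
    return APPLE_EPOCH + timedelta(seconds=seconds)


def is_null_island(lat: float, lon: float, tol: float = 1e-6) -> bool:
    """Flag a GPS fix of (0, 0), which is almost certainly a failed reading."""
    return abs(lat) < tol and abs(lon) < tol


# The gap between the two epochs, in seconds.
offset_seconds = int((APPLE_EPOCH - UNIX_EPOCH).total_seconds())

# Emitting ISO 8601 strings sidesteps the epoch question entirely.
stamp = apple_to_datetime(0.0).isoformat()
```

Reading an Apple timestamp as Unix time without this offset places every event about 31 years too early.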

Slide 8

Slide 8 text

Counts of daily sneezes & AH
Using Seaborn and Pandas DataFrames for countplots

Slide 9

Slide 9 text

Sneezes and AH over 1yr
Self-logged data by Emily

Slide 10

Slide 10 text

Sneezing by hour & day of week

Slide 11

Slide 11 text

AH by hour & day of week
Is a morning antihistamine used because of today's environment or yesterday's environment?

Slide 12

Slide 12 text

Sneezes and AH are related
3-day resampled sum of sneezes and AH usage - positive relationship. This means sneezing is a predictor of AH usage (checked by cross-validation)
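A minimal sketch of the 3-day resampling, using invented daily counts rather than the real log. The column names are hypothetical, and a Pearson correlation stands in here for whatever check the authors actually ran.

```python
import pandas as pd

# Invented daily counts for nine days (not the real data).
days = pd.date_range("2015-06-01", periods=9, freq="D")
daily = pd.DataFrame(
    {
        "sneezes": [4, 6, 5, 0, 1, 0, 3, 2, 4],
        "antihistamines": [1, 1, 1, 0, 0, 0, 1, 0, 1],
    },
    index=days,
)

# Sum each consecutive 3-day block to smooth out day-to-day noise.
three_day = daily.resample("3D").sum()

# A positive value indicates the two series rise and fall together.
correlation = three_day["sneezes"].corr(three_day["antihistamines"])
```

Summing over 3-day blocks also softens the ambiguity raised on the "AH by hour & day of week" slide, where a morning antihistamine may reflect the previous day's symptoms.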

Slide 13

Slide 13 text

How long does an AH last for?
Uses: Plan your day? Compare effectiveness of different treatments?

Slide 14

Slide 14 text

Learning Relationships
● Antihistamine usage is ≈50/50 use/no use per day - treat as a binary classification problem (not a timeseries)
● We want a robust, interpretable model
● Logistic Regression with randomly shuffled rows and cross-validation
● Can we find any strong features?
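The modelling setup above might look roughly like this in scikit-learn. The data is synthetic, and the choice of `StratifiedKFold` with 5 folds, feature scaling, and accuracy as the score are all assumptions; the slides only state "randomly shuffled rows and cross validation".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_days = 200

# Synthetic stand-ins for two daily features (not the real data).
sneezes = rng.poisson(3, n_days)
humidity = rng.normal(70, 10, n_days)

# Synthetic target: antihistamine use is more likely on sneezy days,
# and unrelated to humidity.
p_use = 1.0 / (1.0 + np.exp(-(sneezes - 3)))
took_ah = rng.random(n_days) < p_use

X = np.column_stack([sneezes, humidity])
y = took_ah.astype(int)

# Scaling puts the coefficients on a comparable footing for interpretation.
model = make_pipeline(StandardScaler(), LogisticRegression())

# shuffle=True treats each day as an independent row, ignoring time order.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
```

With a roughly 50/50 target, a mean accuracy near 0.5 would indicate that the features carry no usable signal.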

Slide 15

Slide 15 text

Features - weather and pollution
Annual NO2 pollution via LondonAir.org.uk
weatherData R package for Wunderground London City Airport

Slide 16

Slide 16 text

Features - Augmentation
● Weather & Pollution
● Raw data
● Create moving averages, rolling sums, standard deviations, differences of moving averages (for directionality) to smooth out days of noisy signal
● MyFitnessPal - extract alcohol use (from text strings)
● Feature set - row of relevant values (e.g. humidity, temperature, alcohol usage) and AH usage as binary Target (True or False)
● Oyster - bus and London Underground usage per day
github.com/ianozsvald/london_oyster_pdf_to_dataframe_parser
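The augmented features listed above can be sketched with pandas rolling windows. The humidity values, window lengths (2, 3 and 4 days), and column names are illustrative assumptions, not the ones used in the talk.

```python
import pandas as pd

# Invented daily humidity readings (not the real data).
days = pd.date_range("2015-06-01", periods=6, freq="D")
raw = pd.DataFrame({"humidity": [60.0, 62.0, 70.0, 80.0, 78.0, 65.0]}, index=days)

features = raw.copy()
features["humidity_ma3"] = raw["humidity"].rolling(3).mean()
features["humidity_sum3"] = raw["humidity"].rolling(3).sum()
features["humidity_std3"] = raw["humidity"].rolling(3).std()

# Short minus long moving average: positive when the signal is rising.
features["humidity_ma2_minus_ma4"] = (
    raw["humidity"].rolling(2).mean() - raw["humidity"].rolling(4).mean()
)

# The first rows lack a full window; drop them rather than guess values.
features = features.dropna()
```

Each raw signal expands into several columns this way, which is how a handful of measurements can grow into a wide feature set.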

Slide 17

Slide 17 text

1 Year Model
● 84 features (raw & augmented), 330 rows of daily data (resampled from sub-second timestamped raw events)
● Take a complex model, strip it down, remove everything that doesn't feel right...
● Left with few consistently predictive features - sneezes per day, previous day's AH usage
● Everything else is not very predictive
● What's wrong with 1 year of data?
● Are signals like external humidity and temperature useful as predictors in e.g. mid-summer or winter?

Slide 18

Slide 18 text

April-Aug 2015 Model
Days when Emily was exposed to 'the weather', not in a climate-controlled office - suddenly some features emerge
These boxplots show logistic regression coefficients from 5000 models built on 80% randomly sampled training data, and scores on 20% test data
Do we trust this?
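A sketch of the repeated 80/20 procedure described above, on synthetic data. `ShuffleSplit` is one plausible way to draw the random splits and is an assumption, as are the two synthetic features; 200 repeats are used here to keep the example quick, where the slide reports 5000.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(1)
n_days = 150

# Two synthetic, already standardised features: only the first is informative.
X = rng.normal(size=(n_days, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=n_days) > 0).astype(int)

n_models = 200  # the slide reports 5000; fewer here for speed
splitter = ShuffleSplit(n_splits=n_models, test_size=0.2, random_state=1)

coefs, scores = [], []
for train_idx, test_idx in splitter.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    coefs.append(model.coef_[0])                           # one coefficient per feature
    scores.append(model.score(X[test_idx], y[test_idx]))   # accuracy on the held-out 20%

coefs = np.array(coefs)
scores = np.array(scores)

# 5th, 50th and 95th percentile of each coefficient across all models:
# roughly what a boxplot per feature would display.
coef_summary = np.percentile(coefs, [5, 50, 95], axis=0)
```

A coefficient whose spread stays clear of zero across the resampled models is a more credible feature than one that changes sign, which is the question the boxplots are meant to answer.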

Slide 19

Slide 19 text

60 day training, sequential, model
60 day training, 40 day testing, rolling models
Coefficients exposed for the year - can I trust Humidity (neg and pos coefs)?
Can I trust Humidity (neg or 0 coef)?
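The rolling scheme can be sketched as a generator of index windows. The 60-day training and 40-day testing sizes come from the slide; the 40-day step between windows and the helper name `rolling_windows` are assumptions, since the slide does not say how far each window advances.

```python
def rolling_windows(n_rows, train_size=60, test_size=40, step=40):
    """Yield (train, test) index ranges that slide forward through time.

    Each test window starts immediately after its training window, so a
    model is never evaluated on days that precede its training data.
    """
    start = 0
    while start + train_size + test_size <= n_rows:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# 330 daily rows, matching the row count quoted for the 1 Year Model.
windows = list(rolling_windows(330))
```

Fitting one model per window and plotting each window's coefficients gives a view of how stable a feature such as humidity is across the year.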

Slide 20

Slide 20 text

New experiments
● Weekends and work-from-home days - can we replicate April-Aug model?
● We can evolve our approach
● 3 months consistent alcohol usage to be logged in the App
● Can I get data from BlueAir home hygrometer?
● Move house (not a joke)

Slide 21

Slide 21 text

Tools
● HDF5 for data sharing
● Pandas - resampling and query
● Notebooks rendered in GitHub
● Slack channels
● Reproducible using versioned DataFrames
● MinRK's Notebook scratchpad and QTConsole button
https://github.com/minrk/nbextension-scratchpad
https://github.com/minrk/ipython_extensions/

Slide 22

Slide 22 text

Next steps
● Get 2nd year of data (exciting!)
● We've added more people (iOS users - join us?)
● Regress on 3-day resampled data instead of the binary classifier?
● We're assuming i.i.d. - is this fair?
● Collaborate with academic groups
● Strap sensors to wife (e.g. AtmoTube)?

Slide 23

Slide 23 text

Conclusion
● Challenging problem - we have found 1 potential signal from scratch
● We can answer “how effective is an antihistamine” but not yet “does alcohol make sneezing more likely”
● No evidence (yet?) against air pollution
● Thanks to:

Slide 24

Slide 24 text

Does Alcohol Increase Sneezing?
"Possibly" - we need cleaner data. Hat tip to Jon Sedar for PyMC3 model

Slide 25

Slide 25 text

Appendix - 1 Year Model
● Consistent features - sneezes and previous day's AH usage, found via Recursive Feature Elimination and exhaustive model evaluation

Slide 26

Slide 26 text

Things that didn't work
● Autocorrelation - no signal?
● Regress on number of sneezes per day (heavily modified by AH usage!)
● Utilising food diary (text too complex)
Sneeze autocorrelation per day
Sneeze autocorrelation per hour
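For reference, a per-day autocorrelation check can be sketched with pandas. The series below is synthetic and deliberately built with a weekly pattern, to show what a signal would look like; per the slide, the real sneeze data showed no such peak.

```python
import pandas as pd

# Synthetic daily sneeze counts with an artificial 7-day cycle (not real data).
days = pd.date_range("2015-06-01", periods=28, freq="D")
daily_sneezes = pd.Series([5, 1, 1, 1, 1, 1, 4] * 4, index=days, dtype=float)

# Pearson correlation between the series and itself shifted by `lag` days.
autocorr_by_lag = {lag: daily_sneezes.autocorr(lag=lag) for lag in range(1, 8)}
```

Here the value at lag 7 stands out because the pattern repeats weekly; a flat profile across lags is what "no signal" means in this context.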