Statistically Solving Sniffs and Sniffles (PyDataBerlin 2016)

A group of us are attempting to solve my wife's Rhinitis using self-logged data, machine learning and a host of environmental features that we've logged. This talk discusses what did and didn't work and where we're taking the research in 2016.

ianozsvald

May 21, 2016

Transcript

  1. Statistically Solving Sneezes and Sniffles - A Work In Progress

    PyDataBerlin 2016
    License: CC By Attribution
    Ian Ozsvald @IanOzsvald ModelInsight.io
    Giles Weaver @GilesWeaver
  2. Who Are We?

    • Ian - “Industrial Data Scientist” for 15 yrs
    • Giles - bioinformatician turned Data Sci.
    [email protected] PyDataBerlin 2016 @IanOzsvald @gilesweaver
  3. Goal

    • Help my wife have a less sneezy life - therefore try to understand “what drives a person's Rhinitis?” (i.e. sneezes)
    • Can we help folk reduce symptoms by explaining the drivers of those symptoms? A step towards “personalised medicine”?
    • Could we help people reduce their medication?
    • 10–30% of Western population affected by Allergic Rhinitis (overall ≈1.4 billion people?)
    • Some antihistamines (AH) have negative health associations (e.g. anticholinergics [inc. U.S. Benadryl] linked to Alzheimer's)
    • UK folk don't tend to use these AHs but nobody knows the consequences of long-term usage
  4. Hypothesis

    • “Ian's wife Emily suffers from non-allergic Rhinitis” (not allergic or infectious Rhinitis)
    • “Possibly it is weather related”
    • “Alcohol might make things worse”
    • “Airborne pollution might be a factor”
    • (I trust that Antihistamines work)
    • We need to gather data so we can answer these questions – this is a small data problem
    • Note - sneeze & AH behaviour similar out of the country and when at home (I'm not the cause! Nor, probably, is our cat, nor the apartment)
  5. Data Gathering Methodology

    • iOS
    • Event logs
    • GPS trace
    • Editable history
    • Open Src
    • >1yr old
    github.com/radicalrobot/allergy-tracker
  6. Some data issues

    • Apple's DateTime epoch != the Unix DateTime epoch (use ISO 8601!)
    • GPS on London Underground on iPhone 6 confidently reports location (0,0) # Nigeria?!
    • Weak experimental design (in hindsight) - we're logging positive events - does “0 events” mean “nothing happened” or “we forgot to log stuff”?
    • SQLite->DataFrame with Python for clean-up
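A minimal sketch of the epoch fix mentioned on this slide. It relies on one fact: Apple's reference date is 2001-01-01 00:00:00 UTC, whereas Unix time counts from 1970-01-01 UTC. The function names and the sample rows are hypothetical and are not taken from the allergy-tracker app.

```python
from datetime import datetime, timedelta, timezone

# Apple's reference date vs the Unix epoch (both UTC).
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
APPLE_TO_UNIX_OFFSET_S = int((APPLE_EPOCH - UNIX_EPOCH).total_seconds())


def apple_seconds_to_iso8601(apple_seconds: float) -> str:
    """Convert seconds since Apple's reference date to an ISO 8601 UTC string."""
    return (APPLE_EPOCH + timedelta(seconds=apple_seconds)).isoformat()


def has_gps_fix(lat: float, lon: float) -> bool:
    """Treat an exact (0, 0) reading as 'no fix' rather than a real location."""
    return not (lat == 0.0 and lon == 0.0)


# Hypothetical logged rows: (seconds since Apple reference date, lat, lon).
rows = [(454_000_000.0, 51.5074, -0.1278), (454_000_600.0, 0.0, 0.0)]
cleaned = [
    (apple_seconds_to_iso8601(t), lat, lon)
    for t, lat, lon in rows
    if has_gps_fix(lat, lon)
]
```

Storing the converted ISO 8601 string rather than a raw number removes any doubt about which epoch a value was measured from.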
  7. AH by hour & day of week

    Is a morning antihistamine used because of today's environment or yesterday's environment?
  8. Sneezes and AH are related

    3 Day resampled sum of Sneezes and AH usage - positive relationship. This means Sneezing is a predictor of AH usage (checked by cross-validation)
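The 3-day resampled sum could be sketched with pandas roughly as follows. The event log here is synthetic and the column names `timestamp` and `event` are assumptions, not the app's real schema.

```python
import pandas as pd

# Synthetic event log (illustration only): one row per logged event.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2015-06-01 08:00", "2015-06-01 09:30", "2015-06-02 07:45",
                "2015-06-05 22:00", "2015-06-07 06:30", "2015-06-07 07:00",
                "2015-06-08 12:00", "2015-06-09 08:00", "2015-06-09 21:00",
            ]
        ),
        "event": [
            "sneeze", "antihistamine", "sneeze",
            "sneeze", "sneeze", "antihistamine",
            "sneeze", "antihistamine", "sneeze",
        ],
    }
)

# One indicator column per event type, indexed by time.
indicators = pd.get_dummies(events.set_index("timestamp")["event"]).astype(int)

# Sum events into 3-day bins; a bin with no logged events sums to 0.
three_day = indicators.resample("3D").sum()

# Pearson correlation between the two resampled counts.
correlation = three_day["sneeze"].corr(three_day["antihistamine"])
```

Note that a bin summing to 0 cannot distinguish "nothing happened" from "nothing was logged", which is the experimental design weakness raised earlier in the talk.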
  9. How long does an AH last for?

    Uses: Plan your day? Compare effectiveness of different treatments?
  10. Learning Relationships

    • Antihistamine usage is ≈50/50 use/no use per day - treat as binary classification problem (not timeseries)
    • We want a robust, interpretable model
    • Logistic Regression with randomly shuffled rows and cross validation
    • Can we find any strong features?
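A hedged sketch of this setup using scikit-learn, which is an assumption (the slides do not name the library). The data is synthetic: three made-up feature columns and a binary "AH used today" target driven by the first column only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_days = 330

# Synthetic daily features (illustration only): sneezes, humidity, temperature.
X = rng.normal(size=(n_days, 3))

# Synthetic binary target: probability of AH use rises with the first column.
p_use = 1.0 / (1.0 + np.exp(-1.5 * X[:, 0]))
y = (rng.random(n_days) < p_use).astype(int)

# Scale features so coefficients are comparable, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())

# shuffle=True treats days as exchangeable rows, i.e. not as a time series.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
```

Shuffling rows is only defensible if days are roughly independent, which the talk itself questions later under "Next steps".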
  11. Features - weather and pollution

    Annual NO2 pollution via LondonAir.org.uk
    weatherData R package for Wunderground London City Airport
  12. Features - Augmentation

    • Weather & Pollution
    • Raw data
    • Create moving averages, rolling sums, standard deviations, differences of moving averages (for directionality) to smooth out days of noisy signal
    • MyFitnessPal - extract alcohol use (from text strings)
    • Feature set - row of relevant values (e.g. humidity, temperature, alcohol usage) and AH usage as binary Target (True or False)
    • Oyster - bus and London Underground usage per day
    github.com/ianozsvald/london_oyster_pdf_to_dataframe_parser
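The augmentation step could look something like the sketch below. The humidity values, window lengths and column names are all assumptions for illustration. "Difference of moving averages" is read here as short window minus long window, which is one plausible interpretation of the slide.

```python
import pandas as pd

# Synthetic daily humidity readings (illustration only).
daily = pd.DataFrame(
    {"humidity": [60.0, 62.0, 70.0, 68.0, 55.0, 50.0, 58.0, 65.0]},
    index=pd.date_range("2015-06-01", periods=8, freq="D"),
)

features = daily.copy()
humidity = daily["humidity"]

# Smoothed views of a noisy daily signal.
features["humidity_ma3"] = humidity.rolling(3).mean()
features["humidity_sum3"] = humidity.rolling(3).sum()
features["humidity_std3"] = humidity.rolling(3).std()

# Short minus long moving average: positive when the signal is trending up.
features["humidity_ma_diff"] = humidity.rolling(2).mean() - humidity.rolling(4).mean()
```

The first rows of each rolling column are NaN until the window fills, so those days need dropping or imputing before modelling.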
  13. 1 Year Model

    • 84 features (raw & augmented), 330 rows of daily data (resampled from sub-second timestamped raw events)
    • Take a complex model, strip it down, remove everything that doesn't feel right...
    • Left with few consistently predictive features - Sneezes per day, Previous day's AH usage <sigh>
    • Everything else is not very predictive
    • What's wrong with 1 year of data?
    • Are signals like external humidity and temperature etc useful as predictors in e.g. mid-summer or winter?
  14. April-Aug 2015 Model

    Days when Emily was exposed to 'the weather', not in a climate controlled office - suddenly some features emerge
    These boxplots show LogReg. coefs. from 5000 models built on 80% randomly sampled training data and scores on 20% test data
    Do we trust this?
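A sketch of how coefficient distributions like those in the boxplots might be produced. It assumes scikit-learn and synthetic data, and it fits 200 models instead of the 5000 on the slide to keep it quick. Feature names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
feature_names = ["sneezes", "humidity", "temperature"]  # hypothetical
n_days = 150

# Synthetic data: only the first feature truly drives the target.
X = rng.normal(size=(n_days, len(feature_names)))
p_use = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
y = (rng.random(n_days) < p_use).astype(int)

n_models = 200  # the slide reports 5000
coefs = np.empty((n_models, len(feature_names)))
scores = np.empty(n_models)

for i in range(n_models):
    # A fresh random 80/20 split per model.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=i
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    coefs[i] = clf.coef_[0]
    scores[i] = clf.score(X_te, y_te)

# Quartiles per feature: the numbers a boxplot would draw.
q25, median, q75 = np.percentile(coefs, [25, 50, 75], axis=0)
```

A feature whose interquartile range sits clearly away from zero across resamples is a candidate signal. Because the splits overlap heavily, tight boxes overstate how certain we should be, which is one reason to ask "do we trust this?".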
  15. 60 day training, sequential, model

    60 day training, 40 day testing, rolling models
    Coefficients exposed for the year - can I trust Humidity (neg and pos coefs)?
    Can I trust Humidity (neg or 0 coef)?
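One way to generate the rolling 60-day train / 40-day test windows, in time order with no shuffling. The step between windows is not stated on the slide, so the 40-day step below is an assumption, as is the function name.

```python
def rolling_windows(n_rows, train_size=60, test_size=40, step=40):
    """Yield (train_range, test_range) pairs of row positions in time order.

    Each test window starts immediately after its training window, so the
    model is always scored on days that come later than the days it saw.
    """
    start = 0
    while start + train_size + test_size <= n_rows:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# 330 daily rows, as in the 1 year model.
windows = list(rolling_windows(330))
```

Fitting one model per window and recording its coefficients shows how they drift across the year, for example whether Humidity keeps a stable sign.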
  16. New experiments

    • Weekends and work-from-home days - can we replicate April-Aug model?
    • We can evolve our approach
    • 3 months consistent alcohol usage to be logged in the App
    • Can I get data from BlueAir home hygrometer?
    • Move house (not a joke)
  17. Tools

    • HDF5 for data sharing
    • Pandas - resampling and query
    • Notebooks rendered in GitHub
    • Slack channels
    • Reproducible using versioned DataFrames
    • MinRK's Notebook scratchpad and QTConsole button
    https://github.com/minrk/nbextension-scratchpad
    https://github.com/minrk/ipython_extensions/
  18. Next steps

    • Get 2nd year of data (exciting!)
    • We've added more people (iOS users - join us?)
    • Regress on 3 Day resampled data from binary clf?
    • We're assuming i.i.d. - is this fair?
    • Collaborate with academic groups
    • Strap sensors to wife (e.g. AtmoTube)?
  19. Conclusion

    • Challenging problem - we have found 1 potential signal from scratch
    • We can answer “how effective is an antihistamine” but not yet “does alcohol make sneezing more likely”
    • No evidence (yet?) against air pollution
    • Thanks to:
  20. Appendix - 1 Year Model

    • Consistent features - Sneezes, Previous day's AH usage <sigh> via Recursive Feature Elimination and exhaustive model evaluation
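A rough sketch of Recursive Feature Elimination, assuming scikit-learn's `RFE` wrapped around logistic regression. The slides do not say which implementation was used. The data is synthetic and built so that only two hypothetical columns carry signal.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_days = 330
names = ["sneezes", "previous_day_ah", "humidity", "temperature", "no2", "alcohol"]

# Synthetic features: previous_day_ah is binary, the rest are standardised noise.
X = rng.normal(size=(n_days, len(names)))
X[:, 1] = rng.integers(0, 2, size=n_days)

# Only sneezes and previous_day_ah influence the synthetic target.
logits = 2.0 * X[:, 0] + 2.0 * (X[:, 1] - 0.5)
y = (rng.random(n_days) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Repeatedly drop the feature with the smallest absolute coefficient.
selector = RFE(LogisticRegression(), n_features_to_select=2).fit(X, y)
selected = [name for name, keep in zip(names, selector.support_) if keep]
```

RFE ranks by coefficient size, so features should be on comparable scales before it is applied to real data.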
  21. Things that didn't work

    • Autocorrelation - no signal?
    • Regress on number of sneezes per day (heavily modified by AH usage!)
    • Utilising food diary (text too complex)
    [Figures: Sneeze autocorrelation per day; Sneeze autocorrelation per hour]
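For reference, a per-day autocorrelation check can be sketched with pandas as below. Both series are synthetic: independent Poisson counts standing in for "no signal", and a weekly pattern included only to show what a real periodic signal would look like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic daily sneeze counts with no day-to-day structure.
sneezes_per_day = pd.Series(
    rng.poisson(lam=3, size=330),
    index=pd.date_range("2015-01-01", periods=330, freq="D"),
)

# Correlation of the series with itself shifted by 1..14 days.
autocorr_by_lag = pd.Series(
    {lag: sneezes_per_day.autocorr(lag=lag) for lag in range(1, 15)}
)

# Contrast: a series that repeats every 7 days correlates strongly at lag 7.
weekly_pattern = pd.Series(np.tile([0, 0, 0, 0, 0, 5, 5], 10))
weekly_lag7 = weekly_pattern.autocorr(lag=7)
```

The same approach works per hour after resampling the raw events to hourly counts.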