real life example > Encourage Data Science thinking > Prepare you for real life Data Science → Help you embrace your vulnerabilities → Build your strength from them
processing, cleaning, understanding, ... → ~10% Analysis & ~10% Communication > Main analysis was a Bayesian Model → Each full run took ~4 days non-stop on a PC > I loved working on this project → High Pressure & STRESS but lots of learning! → Software: Stata, WinBUGS, LaTeX, HTML, R, Linux, ...
... > Analysis & Communication for another day > Look at real-life problems → There are no right answers! → You have to find the best compromises → Most importantly, you must be able to defend your choices! > Have fun!
cancer mortality in Portugal? > How does Portugal compare to other countries? Exercise: → Discuss what these questions mean to you → What needs to be defined? → What actions and/or decisions could be taken?
cancer mortality in Portugal? → How is cancer defined? By site? All combined? → What Trends? Current trends starting when? → What would you like to predict? Deaths in 10 or 20 years? → You need a subject matter expert > How does Portugal compare to other countries? → Which countries? European? Asian? Rich? Poor?
Data reported by countries → Civil registration systems > Compilation of mortality data by: → Age, sex, year and cause of death → International Classification of Diseases (ICD) > Available from: → http://www.who.int/healthinfo/mortality_data/en/
for mortality and morbidity statistics > Defines diseases, disorders, injuries and other related health conditions > Useful for: → Storing, retrieving & analysing health information → Sharing and comparing health information
of the WHO data → Issues have been simplified for this workshop → They still reflect the reality > Three areas will be covered → Making the raw data usable → Handling data differences → Understanding & managing “standards” & “definitions”
time and sex → Spain ~46M & Sweden ~10M → Better to use the rate rather than the number of deaths → Allows for fairer comparisons > Rate = 100,000 × (Number of Deaths / Population) → For 100,000 people, how many deaths would there be? Note: I use the UK notation, where the comma is a “thousands” separator, not a decimal point.
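The rate formula can be sketched in a few lines of Python. This is only an illustration: the populations are the rounded figures from the slide, while the death counts and the function name `rate_per_100k` are made up for the example.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Return the number of deaths per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100_000 * deaths / population


# Populations are the rounded slide figures; the death counts are
# hypothetical numbers chosen only to illustrate the comparison.
spain = rate_per_100k(deaths=9_200, population=46_000_000)
sweden = rate_per_100k(deaths=2_500, population=10_000_000)

print(f"Spain:  {spain:.1f} per 100,000")
print(f"Sweden: {sweden:.1f} per 100,000")
```

With these invented counts, the larger country has more deaths in absolute terms but the lower rate, which is why rates allow the fairer comparison.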
of → ICD10 Mortality Data → Population Data > Using the variable definitions: → What do you understand about the data? → How would you restructure the data to produce the graphs and tables in the previous slides?
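One possible restructuring, sketched under assumptions: sum deaths over causes within each country, year and sex, then match each group to its population record and compute a rate. The field names (`country`, `year`, `sex`, `cause`, `deaths`, `population`) and all values are hypothetical and simplified; the real files should be read against their variable definitions first.

```python
from collections import defaultdict

# Hypothetical, simplified records. Column names and values are assumptions
# for illustration, not the actual layout of the WHO files.
deaths_rows = [
    {"country": "A", "year": 2010, "sex": "F", "cause": "C16", "deaths": 30},
    {"country": "A", "year": 2010, "sex": "F", "cause": "C18", "deaths": 20},
    {"country": "A", "year": 2010, "sex": "M", "cause": "C16", "deaths": 45},
]
population_rows = [
    {"country": "A", "year": 2010, "sex": "F", "population": 500_000},
    {"country": "A", "year": 2010, "sex": "M", "population": 450_000},
]


def rates_by_group(deaths_rows, population_rows):
    """Sum deaths per (country, year, sex) and convert to rates per 100,000."""
    deaths = defaultdict(int)
    for row in deaths_rows:
        deaths[(row["country"], row["year"], row["sex"])] += row["deaths"]

    population = {
        (row["country"], row["year"], row["sex"]): row["population"]
        for row in population_rows
    }

    rates = {}
    for key, n_deaths in deaths.items():
        # Groups without a matching population record are skipped, not guessed.
        if key in population:
            rates[key] = 100_000 * n_deaths / population[key]
    return rates


rates = rates_by_group(deaths_rows, population_rows)
for key in sorted(rates):
    print(key, round(rates[key], 1))
```

Skipping unmatched groups is itself a choice that should be documented, since it silently drops data.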
a country? → Germany was formerly East and West Germany → Czech Republic & Slovakia were formerly Czechoslovakia > How do you define the European Union? → Start (1951) 6 countries – Now (2018) 28 countries → The UK has voted to leave (exit planned for 2019) → How can you fairly compare to the EU average?
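One possible compromise for the comparison question, offered as a sketch and not as the recommended answer: fix the list of member countries in advance and compute a pooled rate (total deaths over total population), so that large and small countries are weighted by size. The function name `pooled_rate`, the country labels and all numbers are hypothetical.

```python
def pooled_rate(records, members):
    """Rate per 100,000 for a fixed group: total deaths / total population."""
    total_deaths = sum(r["deaths"] for r in records if r["country"] in members)
    total_population = sum(
        r["population"] for r in records if r["country"] in members
    )
    if total_population == 0:
        raise ValueError("no population found for the chosen members")
    return 100_000 * total_deaths / total_population


# Hypothetical single-year records for three made-up countries.
records = [
    {"country": "X", "deaths": 100, "population": 1_000_000},  # rate 10
    {"country": "Y", "deaths": 600, "population": 3_000_000},  # rate 20
    {"country": "Z", "deaths": 900, "population": 2_000_000},  # not a member
]
members = {"X", "Y"}  # membership fixed up front and documented

group_rate = pooled_rate(records, members)
print(f"Pooled rate: {group_rate:.1f} per 100,000")
```

Whether to freeze membership at one date or let it change year by year is exactly the kind of choice that has to be defended and documented.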
- - -
1997 - Y - - Y
1998 - Y - - Y
1999 - Y - Y Y
2000 - Y - Y Y
2001 - Y - Y Y
2002 - Y Y Y Y
2003 - Y Y Y Y
2004 - Y - Y Y
2005 - Y - Y Y
2006 - Y - Y Y
2007 - Y Y Y Y
2008 - Y Y Y Y
2009 - Y Y Y Y
2010 - Y Y Y Y
2011 - Y Y Y Y
2012 - Y Y Y Y
2013 - Y Y Y Y
2014 Y Y Y Y Y
2015 - Y - Y Y
ICD10 Data?
Y Y Y
1995 Y Y Y Y Y
1996 Y - Y Y Y
1997 Y - Y Y -
1998 Y - Y Y -
1999 Y - Y - -
2000 Y - Y - -
2001 Y - Y - -
2002 Y - - - -
2003 Y - - - -
2004 Y - - - -
2005 Y - - - -
2006 Y - - - -
2007 Y - - - -
2008 Y - - - -
2009 Y - - - -
2010 Y - - - -
2011 Y - - - -
2012 Y - - - -
2013 Y - - - -
ICD9 Data?
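Availability grids like the two above can be generated from the data itself. The sketch below assumes a set of (country, year) pairs for which records exist; the country labels, the years and the function names `coverage_table` and `years_common_to_all` are hypothetical.

```python
def coverage_table(available, countries, years):
    """Return one text row per year: 'Y' where data exists, '-' where not."""
    rows = []
    for year in years:
        cells = ["Y" if (c, year) in available else "-" for c in countries]
        rows.append(f"{year} " + " ".join(cells))
    return rows


def years_common_to_all(available, countries, years):
    """Years in which every listed country has data."""
    return [
        year for year in years
        if all((c, year) in available for c in countries)
    ]


# Hypothetical availability: (country, year) pairs with at least one record.
available = {("A", 2001), ("A", 2002), ("B", 2002), ("B", 2003)}
countries = ["A", "B"]
years = range(2001, 2004)

table = coverage_table(available, countries, years)
common = years_common_to_all(available, countries, years)

print("\n".join(table))
print("Years with data for all countries:", common)
```

Restricting an analysis to the common years is only one option; it trades coverage for comparability.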
years where data is missing? > How do you handle it when → Some diseases could be split whilst others could be joined → Countries use ICD9 and ICD10 at different times > Should you use ICD7 & ICD8 data? → Is data going back to the 1950s comparable? → How do you handle partial coverage of death registrations?
used → Food classification, Medication classification, ... → Classifying professions, socioeconomic status, ... > They are useful for harmonisation → “Fairer” comparison, data sharing & retrieval, ... → Important to understand the details of implementation > Compromises have to be made for analysis → Document decisions and choices openly and transparently → By being honest you will save yourself a lot of stress later
> It usually takes the most time → Making the raw data usable → Handling data differences & abnormalities → Understanding & managing “standards” & “definitions” > There are always surprises → Work closely with a subject matter expert → Document everything openly and transparently