Data Mining in Mining Data: Statistical Approaches in Occupational Health Surveillance for a Mining Site

Queensland University of Technology, Brisbane, Australia
Bayesian Research Applications Group
ACEMS
Nicholas Tierney
Tuesday 5th July 1

Background • Research Problem • Case Study Population • Method • A small digression
into Bayesian methods • Results • Discussion • Future Work 2

Background 3

PhD Research: Industry Doctoral Training Centre (IDTC) • A new
PhD programme, aiming to create stronger links between research and industry • The IDTC links a PhD student with an Industry Partner • Key feature of the IDTC: the research problem is born out of industry, and not necessarily academia. 4

Immersion in Industry 5

Hunter Industrial Medicine • Firm of occupational health physicians •
Collect medical information on employees in various industries • Mining • Construction • Emergency Services 6

Example medical data 7

uin    date        sex     fev1_perc  smoked
0      2008-05-20  female  98         non_smoker
10820  2010-12-30  male    108        non_smoker
10943  2005-08-23  male    103        non_smoker
10943  2010-02-09  male    98         non_smoker
10943  2009-06-02  male    98         non_smoker

uin    seg             dust      days
0      tech_services   0.000000  0
10820  ug_maintenance  1.178260  0
10943  ug_operations   2.840808  0
10943  ug_operations   2.464896  252
10943  ug_maintenance  1.566031  448

Research Problem 8

Research Problem Industry is required by law to monitor employee
health and analyse employee health data, that is, to perform Occupational Health Surveillance (OHS), which is defined as: the systematic collection, analysis, and dissemination of employee exposure and health data to facilitate early detection of disease and dangerous exposures in the workplace. 9

Research Problem: • Current OHS analyses ignore features of the data,
such as repeated measurements and workplace structures • This leads to limited predictive performance and poor inferences. Research project aim: provide statistical guidance and support for Hunter Industrial Medicine, and identify occupational health risk profiles. 10

Case Study Population 11

Case Study • 2063 employee medical records from a mining
site • 1467 employees • Data collected from 2005 to 2011 • Employees grouped into Similar Exposure Groups (SEGs) • For example, the Administration SEG differs from Underground Maintenance. • Individuals may change SEG over time. • Medical visit frequency varies by SEG 12

Case Study Research Aim Explore two objectives: 1. Determine whether
smoking has an impact on lung function. 2. Explore the effect of dust exposure on lung function. 13

Exploring Case Study Data Variables of interest: • Lung function
(FEV1%), where $\mathrm{FEV1\%} = \frac{\mathrm{FEV1\,(L)}}{\mathrm{FEV1\,(Predicted)}} \times 100$ • SEG • Dust exposures • Gender • Smoking Status • Days since arrival 14
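As a small illustration of the FEV1% definition, the sketch below computes the percentage from a measured and a predicted FEV1, both in litres. The function name and the example values are hypothetical and are not taken from the case study data.

```python
def fev1_percent(fev1_litres: float, fev1_predicted_litres: float) -> float:
    """Return measured FEV1 as a percentage of the predicted FEV1."""
    if fev1_predicted_litres <= 0:
        raise ValueError("predicted FEV1 must be positive")
    return fev1_litres / fev1_predicted_litres * 100


# Hypothetical measurement: 3.43 L measured against 3.50 L predicted.
example_fev1_perc = fev1_percent(3.43, 3.50)
print(round(example_fev1_perc, 1))
```

A value near 100 means the measured FEV1 is close to what is predicted for that person.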

Lung Function (FEV1%) 15

Gender and Smoking 16

Dust Exposures • Dust exposures aren’t recorded at the same
time as medical visits. • Dust scores are interpolated by fitting a loess model for each SEG, and then predicting dust values for each SEG using corresponding dates in the medical data. 17
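The interpolation idea can be sketched as follows. This is a minimal, standard-library version of a locally weighted linear fit (degree-1 loess with tricube weights), not the code used for the talk. The slides do not state the loess settings, so the span of 0.75, the function names, and all dust values here are assumptions made up for illustration. Dates are represented as days since the first record.

```python
def tricube(u: float) -> float:
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_predict(xs, ys, x0, span=0.75):
    """Predict y at x0 from a locally weighted straight-line fit."""
    n = len(xs)
    k = min(n, max(2, round(span * n)))           # neighbours inside the window
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] * (1 + 1e-9) + 1e-12         # bandwidth, nudged to avoid all-zero weights
    w = [tricube(abs(x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw  # weighted means
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0          # fall back to a weighted mean
    return my + slope * (x0 - mx)


# Hypothetical dust measurements per SEG: (days, dust score).
dust_by_seg = {
    "ug_operations": ([0, 100, 200, 300, 400], [2.8, 2.6, 2.5, 2.3, 2.2]),
    "ug_maintenance": ([0, 150, 300, 450], [1.2, 1.3, 1.5, 1.6]),
}
# Hypothetical medical visits: (SEG, day of visit).
visits = [("ug_operations", 252), ("ug_maintenance", 448)]

# One fit per SEG, evaluated at the day of each medical visit.
interpolated = [
    (seg, day, loess_predict(*dust_by_seg[seg], day)) for seg, day in visits
]
```

Each visit receives a dust value predicted from its own SEG's curve, which is how dust records and medical records taken on different dates can be lined up.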

Method 18

Case Study Method Explore two objectives: 1. Determine whether smoking
has an impact on lung function. 2. Explore the effect of dust exposure on lung function. To do so we build a Bayesian hierarchical model • As far as we are aware, this is the first time Bayesian methods have been applied to this kind of OHS data. 19

A small digression into Bayesian methods 20

Why Bayesian? A statistical model aims to explain the phenomena/process
under study, and can be thought of as a proposed mechanism for the generation of data $y$. For a single observation $y_i$: $y_i \sim p(y_i \mid \theta)$, where $p(y_i \mid \theta)$ is the probability of the observed $y_i$ as a function of some parameter(s) $\theta$, the “likelihood”. • $p(y_i \mid \theta)$ is assumed to come from some parametric family (Normal, Poisson, Weibull, ...) 21

Why Bayesian Given a model, we aim to estimate $\theta$
in order to summarise or make inferential statements about the population under study • In the classical or “frequentist” setting, this is achieved by maximum likelihood techniques • In the frequentist framework, then, the focus is on the probability of the data given $\theta$: $p(Y \mid \theta)$ 22

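Why Bayesian Bayesian methods are based on the application of
Bayes’ theorem. For continuous $\theta$: $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int p(y \mid \theta)\,p(\theta)\,d\theta}$. For discrete $\theta$: $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\sum_{\theta} p(y \mid \theta)\,p(\theta)}$. Up to a constant of proportionality, $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$ 23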
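The discrete form of Bayes' theorem is easy to check by hand. The sketch below is a toy example unrelated to the mining data: θ is the probability that a coin lands heads, restricted to three candidate values with a uniform prior, and the data are 3 heads and 1 tail. All names and numbers are assumptions chosen for illustration.

```python
def discrete_posterior(prior, likelihood):
    """Bayes' theorem over a discrete set of theta values.

    prior and likelihood are dicts keyed by theta.
    """
    unnormalised = {t: likelihood[t] * prior[t] for t in prior}
    evidence = sum(unnormalised.values())          # the sum in the denominator
    return {t: v / evidence for t, v in unnormalised.items()}


# Three candidate values of theta with a uniform prior.
prior = {0.25: 1 / 3, 0.5: 1 / 3, 0.75: 1 / 3}

# Observed data y: 3 heads and 1 tail, so p(y | theta) is proportional
# to theta^3 * (1 - theta).
heads, tails = 3, 1
likelihood = {t: t ** heads * (1 - t) ** tails for t in prior}

posterior = discrete_posterior(prior, likelihood)
for theta, prob in sorted(posterior.items()):
    print(f"theta = {theta}: posterior = {prob:.3f}")
```

The posterior probabilities sum to one, and most of the mass moves to θ = 0.75, the candidate that makes the observed data most probable.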

Why Bayesian The main thing to take away from this
is that using Bayesian methods shifts the focus from the probability of the data given the parameters to the probability of the parameters given the data. This allows you to ask what you can say about θ given the data you have collected. 24

Benefits of Bayesian • Allows for the quantification of uncertainty
in θ and/or the incorporation of previous knowledge/“expert opinion” • Shifts the statement of probability from the data to the unknown parameters. • Distributions are given for parameters, rather than point estimates. You could think of point estimates as giving you the height of a mountain, and Bayesian methods as giving you a rich topographical map describing the landscape. 25

Priors Priors can be informative or noninformative • Informative: previous
knowledge or opinion about θ is available • Noninformative: little is known - allows the data to “speak for itself” 26

Bayesian Estimation Models are very rarely simple... The
Bayesian posterior is often analytically intractable, so we approximate it numerically with Markov chain Monte Carlo (MCMC). We use Gibbs sampling, an MCMC approach commonly used in Bayesian methods (C. P. Robert & Casella, 2004). 27
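To make Gibbs sampling concrete, here is a minimal sketch for a much simpler model than the one in this talk: observations $y_i \sim N(\mu, 1/\tau)$ with a Normal prior on the mean $\mu$ and a Gamma prior on the precision $\tau$. Each iteration draws $\mu$ from its full conditional given $\tau$, then $\tau$ from its full conditional given $\mu$. The prior settings, function name, and iteration counts are assumptions for illustration. The five data values are the fev1_perc column of the example medical data, reused only as toy input.

```python
import random


def gibbs_normal(y, n_iter=2000, burn_in=1000, seed=1):
    """Gibbs sampler for the mean and precision of Normal data."""
    rng = random.Random(seed)
    n, sum_y = len(y), sum(y)
    m0, p0 = 0.0, 1e-3       # assumed prior mean and precision for mu
    a, b = 0.001, 0.001      # assumed Gamma shape and rate for tau
    mu, tau = 0.0, 1.0       # arbitrary starting values
    draws = []
    for it in range(n_iter):
        # mu | tau, y is Normal with precision p0 + n * tau
        prec = p0 + n * tau
        mean = (p0 * m0 + tau * sum_y) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # tau | mu, y is Gamma(a + n/2, rate = b + sum of squares / 2);
        # random.gammavariate takes a scale, so pass 1 / rate
        rate = b + 0.5 * sum((v - mu) ** 2 for v in y)
        tau = rng.gammavariate(a + n / 2, 1 / rate)
        if it >= burn_in:    # keep only post-burn-in samples
            draws.append((mu, tau))
    return draws


y = [98, 108, 103, 98, 98]   # toy input: fev1_perc values from the example table
draws = gibbs_normal(y)
posterior_mean_mu = sum(m for m, _ in draws) / len(draws)
print(f"posterior mean of mu is about {posterior_mean_mu:.1f}")
```

With such a vague prior, the posterior mean of $\mu$ sits close to the sample mean of the data. Dedicated software automates this kind of sampling for larger hierarchical models.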

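Model 28
$$Y_{ij} \sim N(\mu_{ij}, \tau_y)$$
$$\mu_{ij} = \beta_0 + \delta x_{ij} + \alpha_i + \beta_i x_{ij} + \theta\,\mathrm{sex}_i + \lambda\,\mathrm{smoke}_{ij} + \eta\,\mathrm{dust}_{ij} + \sum_{k=1}^{n_{SEG}-1} \gamma_k\, I(\mathrm{SEG}_{ij} = k)$$
for $i = 1, \dots, n_{patients}$; for $j = 1, \dots, n_{observations,i}$; for $k = 1, \dots, n_{SEG}$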

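Model Priors 29
$$\alpha_i \sim N(0, \tau_\alpha), \quad \beta_i \sim N(0, \tau_\beta), \quad \gamma_k \sim N(0, \tau_{\gamma_k})$$
$$\beta_0, \delta, \theta, \lambda, \eta \sim N(0, 10^3)$$
$$\tau_\alpha, \tau_y, \tau_\beta \sim \Gamma(0.001, 0.001)$$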

Model Software The model was fit using JAGS, run for 20,000
iterations, with a burn-in of 10,000. 30

Results 31

Effects of Smoking, Dust, Gender, Days Figure 2. Posterior means
and 95% credible intervals for Smoking, Gender, Dust, and Days. 32

Effect of SEG Figure 3. Posterior means and credible intervals
for SEG effects. 33

Individual Effects 34

Results Summary The definition of a Bayesian credible interval as
a range of probable values for a parameter makes it easier to communicate model inferences and to show which individuals do not deviate much from the overall population change in lung function. • Highly unlikely that dust, gender, and the number of days since first visit have an effect on lung function • Highly likely that smoking has an impact on lung function. • SEG “Shot Drill Sample Explosion” is highly likely to have higher lung function than Administration after accounting for SEG-based dust exposure. 35

Using MCMC Samples to convey information differently 36
Taking every MCMC posterior sample (after the burn-in), we convert each sample to an indicator: 1 if the sample is < 0, and 0 otherwise. Averaging the indicators for each variable gives the proportion of samples below zero.

variable  samples  indicator
smoking   -0.5     1
smoking   -0.1     1
smoking   -1.3     1
dust      1.2      0
dust      0.9      0

variable  mean_indicator
smoking   0.99
dust      0.45
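The indicator calculation can be sketched in a few lines. The function name is hypothetical, and the input is only the handful of sample rows displayed on the slide, so the output differs from the slide's mean_indicator values (0.99 and 0.45), which come from the full chains.

```python
def prob_below_zero(samples):
    """Proportion of posterior samples that fall below zero."""
    indicators = [1 if s < 0 else 0 for s in samples]
    return sum(indicators) / len(indicators)


# Only the rows shown on the slide, not the full post-burn-in chains.
shown_samples = {
    "smoking": [-0.5, -0.1, -1.3],
    "dust": [1.2, 0.9],
}

mean_indicator = {
    name: prob_below_zero(vals) for name, vals in shown_samples.items()
}
print(mean_indicator)
```

Applied to every post-burn-in sample of a parameter, the mean indicator is a Monte Carlo estimate of the posterior probability that the effect is negative.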

Probability of Effect Being < 0 37

Probability of SEG effect being < 0 38

Probability of Individual effect being < 0 39

Results Summary • There appears to be no significant between-individual effect
for lung function over time. • ≈ 100% chance that smoking negatively impacts lung function, vs 85% chance for sex. • Low chance that SEGs have lung function lower than Administration. • Individuals can be identified for closer monitoring: those with ≥ 3 visits and > 50% chance of deviating negatively from the overall population. 40

Discussion 41

Discussion Strengths • Model accounts for repeated measures and workplace
structure. • Estimates of uncertainty are easy to understand. • Able to obtain probabilities of observing effects. Limitations • Model provides poorer prediction at extremes. • Only complete cases considered, missing data not fully accounted for. • Dust data imputation process is oversimplified. 42

Future Work 43

Future Work • Extend model to improve prediction at extremes
by using a t-distribution to model lung function. • Incorporate missing data using Bayesian approaches. • Improve dust imputation model, potentially using another Bayesian model for dust imputation. • Perform posterior predictions to find those individuals who are predicted to be unwell in the future. 44

Future Work Further explore industry questions of interest: 1. Explore
relationships between workplace exposures and lung function 2. Explore relationships between workplace exposures and hearing 3. Explore effectiveness of their smoking ban 4. Explore whether changes in protocol for filter changes affect diesel particulate levels 5. Identifying health risk profiles for individuals and groups 45

References 1: Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical
models using Gibbs sampling. 2: Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version 4-6. https://CRAN.R-project.org/package=rjags 3: R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/ 46

References 4: RStudio Team (2015). RStudio: Integrated Development for R.
RStudio, Inc., Boston, MA http://www.rstudio.com/. 5: Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2014). Bayesian data analysis (Vol. 2). Boca Raton, FL, USA: Chapman & Hall/CRC. 6: Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: A practical introduction to Bayesian analysis. CRC press. 47

Acknowledgements The author would like to thank and acknowledge the
United States Office of Naval Research for providing travel support for this conference. The author would also like to thank Xing Lee and Nicole White for their advice and collaboration with this project. 48

Contact Email: [email protected] Website: www.njtierney.com Twitter: nj_tierney Github: njtierney 49
