Milena Machado

(Federal Institute of Science and Technology of Espírito Santo, Brazil - IFES)

Title — Time series, principal component analysis and logistic regression: An application to the association between annoyance and air pollution

Abstract — Several studies have applied regression models to quantify the relationship between annoyance from environmental stressors and measured levels of those stressors, such as odors (Blanes-Vidal, 2015), noise (Klaeboe et al., 2000), vibration (Klaeboe et al., 2003) and various air pollutants (Rotko et al., 2002; Llop et al., 2008; Amundsen et al., 2008; Nikolopoulou et al., 2011; Machado et al., 2018). Most of these authors used linear and logistic regression to relate annoyance to the concentration of air pollutants, with a single pollutant as the only covariate. In this talk, we consider multiple pollutant covariates. These covariates are generally physically and statistically correlated, which implies multicollinearity and therefore inflated estimator variances and spurious results.

The objective of this work is to propose a combination of a time series model (VAR(1)), principal component analysis (PCA), and binary logistic regression to estimate the annoyance caused by particulate matter using more than one pollutant metric (PM10, TSP, and SP) in the same model. As Zamprogno (2020) pointed out, PCA requires variables that are not correlated in time, i.e., serially independent. In practice, air pollutant concentrations can hardly be assumed to be serially uncorrelated or stationary.

This work uses a time series model to transform the original air pollutant data into serially uncorrelated data (white noise) before applying PCA. Filtering with a VAR(1) model and then applying PCA yields components (linear combinations of the air pollutants) that are uncorrelated both in time and with each other. Multiple logistic regression on these components then allows us to calculate the relative risk (RR) of annoyance for each pollutant in the model. The relative risk estimates confirm that, in general, an increase in air pollutant concentrations significantly increases the probability of being annoyed.
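The filtering and PCA steps described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' implementation: it assumes NumPy is available, fits the VAR(1) model by ordinary least squares, runs PCA on the correlation matrix of the residuals, and uses hypothetical names (`var1_filter`, `pca_correlation`, `lag1_autocorr`) and an invented coefficient matrix `A`.

```python
import numpy as np

def var1_filter(X):
    """Fit X_t = c + A X_{t-1} + e_t by least squares and return the residuals e_t."""
    X = np.asarray(X, dtype=float)
    Y = X[1:]                                            # X_t
    Z = np.column_stack([np.ones(len(X) - 1), X[:-1]])   # [1, X_{t-1}]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)            # shape (1 + k, k)
    return Y - Z @ B

def pca_correlation(E):
    """PCA through the eigen-decomposition of the correlation matrix of E."""
    R = np.corrcoef(E, rowvar=False)
    eigval, eigvec = np.linalg.eigh(R)                   # ascending eigenvalues
    order = np.argsort(eigval)[::-1]                     # reorder to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    Z = (E - E.mean(axis=0)) / E.std(axis=0, ddof=1)     # standardized residuals
    return eigval, eigvec, Z @ eigvec                    # eigenvalues, loadings, scores

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a one-dimensional series."""
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

# Synthetic, serially correlated data standing in for the pollutant series
# (assumption: three series, invented stable coefficient matrix A).
rng = np.random.default_rng(0)
n, k = 500, 3
A = np.array([[0.6, 0.1, 0.0],
              [0.0, 0.5, 0.2],
              [0.1, 0.0, 0.4]])
X = np.zeros((n, k))
for t in range(1, n):
    X[t] = A @ X[t - 1] + rng.normal(size=k)

residuals = var1_filter(X)                               # approximately white noise
eigval, loadings, scores = pca_correlation(residuals)    # components of the filtered data
```

The component scores would then enter the logistic regression as covariates. With real monthly data the series are much shorter than the 500 points simulated here, so the residual autocorrelation should be checked rather than assumed.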

These results provide evidence of a significant correlation between perceived annoyance and groups of particulate matter. This study offers scientific arguments for policymakers to force some industries to reduce the emission of pollutants that generate nuisance and health risks.

Biography — I am an Associate Professor at the Federal Institute of Science and Technology of ES (IFES). I am a collaborating researcher at NQUALIAR (Group for Air Quality Studies) and at NUMES (Stochastics Modeling Study Group). At the moment, I am a visiting researcher at the L2S, supervised by Prof. Pascal Bondon. My work and interests are in multivariate statistical techniques such as Time Series Analysis, Principal Component Analysis, Multiple Correspondence Analysis and Logistic Regression Models, among others. The main application areas are Environmental Engineering (air quality) and Health Science (epidemiology), with a special interest in problems of atmospheric pollution.

S³ Seminar

March 16, 2023

Transcript

  1. Mar 2023. Profª. Milena MACHADO, Associate Professor, IFES, Brazil; visiting researcher

    Time series, principal component analysis and logistic regression: An application to the association between annoyance and air pollution
  2. The Vitoria region (Brazil)

    Metropolitan area: 1,689,714 inhabitants. Density: 1,490 inhabitants/km². Third-largest port system in Latin America. Industrial sites: steel plant, iron ore pellet mill, stone quarrying, cement and food industry, asphalt plant, etc. [Map showing the prevailing wind direction]
  3. Contribution

    Earlier regression models: only one pollutant. This work: time series analysis + PCA + logistic regression, giving regression models with more than one pollutant and relative risk estimates. [Image label: ETE Barueri (SP)]

    References for this talk:
    • Melo M.M. et al. “STUDY OF A SPATIAL AND TEMPORAL ANALYSIS FOR PARTICULATE MATTER” (Award: best oral presentation, Dust Conference, Italy, 2014).
    • Melo M.M. et al., Santos J., Reisen V. (2018) “A new methodology to derive settleable particulate matter guidelines to assist policy-makers on reducing public nuisance” (Journal of Atmospheric Environment).
    • Machado M., Reisen V.A., Santos J.M., Reis N.C., Frère S., Bondon P., Ispány M., Cotta H.H.A. (2020) “Use of multivariate time series techniques to estimate the impact of particulate matter on the perceived annoy” (Journal of Atmospheric Environment).
  4. Methodology, Vitoria (Brazil)

    Face-to-face survey (n = 2638). Panel by phone (n = 519) from 2011 to 2014. Measurement of air pollutants: Settled Particles (SP = PM2.5, PM10 and TSP). Analysis: time series analysis, principal component analysis, logistic regression, relative risk.
  5. Time series analysis

    [Figure: five monthly time series, x-axis in months with ticks from Jan 2011 to Feb 2015: settled particles flux (SP), PM10 (30-day mean), PM10 (30-day maximum), TSP (30-day mean) and TSP (30-day maximum)]
  6. RESULTS

    Question: Thinking about this month, how annoyed do you feel by dust, on a scale from 1 to 10, where 1 is not annoyed and 10 is extremely annoyed? (n = 2638)

    Levels of annoyance    Percentage of respondents
    0 to 2                 8%
    3 to 4                 8%
    5 to 6                 18%
    7 to 8                 32%
    9 to 10                34%

    [Figure: probability of being annoyed (0% to 100%) against exposure level x (axis from 1 to 21)]

    RMGV (Greater Vitória Metropolitan Region): -3.32, 0.449, 1.566; P(x=5) = 25%, P(x=10) = 76%, P(x=14) = 95%
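The three unlabelled numbers in the RMGV row are consistent with a single-covariate logistic curve. Reading -3.32 as the intercept and 0.449 as the slope (an interpretation, not something the slide states) reproduces the three reported probabilities, and exp(0.449) ≈ 1.567 is close to the third number, 1.566, which would then be the odds ratio per unit of exposure. A minimal Python sketch of that check, with a hypothetical function name:

```python
import math

def annoyance_probability(x, intercept=-3.32, slope=0.449):
    """Logistic curve P(annoyed | x); intercept and slope are assumed readings of the slide."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# Exposure levels quoted on the slide (the slide does not name the unit of x).
probabilities = {x: annoyance_probability(x) for x in (5, 10, 14)}

# Odds ratio for a one-unit increase in x under the same assumed slope.
odds_ratio = math.exp(0.449)

for x, p in probabilities.items():
    print(f"P(x={x}) = {100 * p:.0f}%")
print(f"odds ratio per unit of x = {odds_ratio:.3f}")
```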
  7. RESULTS: CORRELATION MATRIX FOR THE ORIGINAL VARIABLES (BEFORE TIME SERIES ANALYSIS, VAR(1))

    Variables        SP        PM10 (mean)   TSP (mean)   PM10 (max)   TSP (max)
    SP               1
    PM10 (mean)      0.424**   1
    TSP (mean)       0.278     0.764**       1
    PM10 (max)       0.409**   0.681**       0.654**      1
    TSP (max)        0.342*    0.701**       0.754**      0.772**      1

    ** significant at the 0.01 level; * significant at the 0.05 level

    According to Zamprogno (2013), the PCA technique requires stationary time series whose variables are not correlated in time (i.e., serially independent). It is therefore necessary to apply a vector autoregressive model as a filter to eliminate the temporal correlation and avoid spurious results.
  8. Auto-correlation function and partial auto-correlation function

    [Figures, one row per series (SP settled particles deposition rate, PM10 monthly mean, PM10 monthly maximum, TSP monthly mean, TSP monthly maximum): time series plot over 2011 to 2014, ACF against lag, and partial ACF against lag]
  9. After VAR(1) filter

    [Figures: ACF and partial ACF against lag (0 to 30) for each filtered series: SP (Settled Particles, 30 days), PM10 (mean 30 days), TSP (mean 30 days), PM10 (max 30 days), TSP (max 30 days)]
  10. RESULTS: CORRELATION MATRIX FOR THE VARIABLES AFTER APPLYING THE FILTERING MODEL

    Variables        SP        PM10 (mean)   TSP (mean)   PM10 (max)   TSP (max)
    SP               1.000
    PM10 (mean)      0.214     1.000
    TSP (mean)       -0.004    0.573**       1.000
    PM10 (max)       0.234     0.428**       0.344*       1.000
    TSP (max)        0.378*    0.533**       0.337*       0.685**      1.000

    ** significant at the 0.01 level; * significant at the 0.05 level

    PCA is then applied to the filtered series in order to remove the remaining cross-correlation (multicollinearity) among the variables.
  11. RESULTS OF FACTOR LOADINGS STATISTICS AND APPLICATION OF PCA

                             PC1      PC2      PC3      PC4      PC5
    Eigenvalue               2.576    1.071    0.681    0.396    0.276
    Variability (%)          51.528   21.426   13.622   7.913    5.510
    Cumulative %             51.528   72.955   86.577   94.490   100.000
    SP (monthly rate)        0.267    0.733*   -0.554   -0.269   -0.112
    PM10 (monthly mean)      0.495*   -0.257   -0.365   0.674    -0.319
    TSP (monthly mean)       0.400*   -0.583   -0.318   -0.607   0.172
    PM10 (monthly maximum)   0.492*   0.104    0.611*   -0.254   -0.557
    TSP (monthly maximum)    0.531*   0.214    0.293    0.200    0.739

    RESULTS: The components PC1, PC2 and PC3 explain about 86.6% of the total variability of the original data.
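The variability and cumulative rows follow directly from the eigenvalues: each eigenvalue divided by their sum, which for a PCA on five standardized variables is 5. A short Python check using the rounded eigenvalues printed on the slide (the variable names are illustrative, and small differences in the last digit come from that rounding):

```python
from itertools import accumulate

# Eigenvalues as printed on the slide (rounded to three decimals).
eigenvalues = [2.576, 1.071, 0.681, 0.396, 0.276]

total = sum(eigenvalues)                               # number of standardized variables
explained = [100 * ev / total for ev in eigenvalues]   # variability (%)
cumulative = list(accumulate(explained))               # cumulative %

# Components with eigenvalue above 1 (Kaiser rule, shown only for comparison;
# the talk retains three components based on cumulative variability).
n_kaiser = sum(ev > 1 for ev in eigenvalues)

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.2f}% (cumulative {c:.2f}%)")
```

Retaining PC3, whose eigenvalue is below 1, raises the cumulative variability from about 73% to about 87%.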
  12. RESULTS: THE RELATIVE RISK ESTIMATED BY THE VAR-PCA-LOGISTIC REGRESSION MODEL

    Pollutants               RR       CI (95%)          CI width
    SP                       1.462    (1.070; 1.854)    0.784
    PM10 (monthly mean)      1.649    (1.061; 2.237)    1.176
    TSP (monthly mean)       2.181    (1.471; 2.891)    1.42
    PM10 (monthly maximum)   2.411    (1.401; 3.421)    2.02
    TSP (monthly maximum)    1.822    (0.592; 3.052)    2.46

    The estimated relative risk implies that an increase equal to the interquartile variation of 2 g/m² per 30 days (the unit of the SP deposition rate) raises the probability of annoyance by a factor of about 1.5.

    Parameters estimated by the multiple logistic model:

                 $\tilde{\beta}$   Standard error
    PC1          0.053             0.202
    PC2          0.058             0.309
    PC3          -0.245            0.390
    Intercept    0.204             0.320

    $\widehat{RR}^{*}(x_i) \approx e^{x_i \hat{\beta}_i^{*}}$

    The RR can be defined as the strength of the association between exposure to a risk factor and the occurrence of an effect (here, annoyance).

    Y = 1 if the degree of annoyance is ≥ 7 (extremely annoyed); Y = 0 if the degree is < 7 (not or little annoyed). X = SP, TSP, PM10.

    $P(Y = 1) = \pi(\mathbf{X}) = \dfrac{e^{\beta_0 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \cdots + \beta_p x_p}}$
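The relative risk table can be read with a few lines of Python. The sketch below recomputes the interval widths, flags which intervals exclude 1, and inverts the slide's formula RR ≈ exp(x·β*) for SP. It assumes that the SP relative risk of 1.462 corresponds to the 2 g/m² per 30 days interquartile variation quoted on the slide; the function names are hypothetical, and the slide does not report the β* values themselves.

```python
import math

# (RR, lower bound, upper bound, reported width) as printed on the slide.
reported = {
    "SP": (1.462, 1.070, 1.854, 0.784),
    "PM10 (monthly mean)": (1.649, 1.061, 2.237, 1.176),
    "TSP (monthly mean)": (2.181, 1.471, 2.891, 1.42),
    "PM10 (monthly maximum)": (2.411, 1.401, 3.421, 2.02),
    "TSP (monthly maximum)": (1.822, 0.592, 3.052, 2.46),
}

def relative_risk(delta_x, beta_star):
    """RR for an exposure increase delta_x, following RR ≈ exp(x * beta*)."""
    return math.exp(delta_x * beta_star)

def implied_coefficient(rr, delta_x):
    """Coefficient beta* implied by a reported RR and exposure increase."""
    return math.log(rr) / delta_x

# Width of each 95% interval, to compare with the last column of the table.
widths = {name: round(hi - lo, 3) for name, (_, lo, hi, _) in reported.items()}

# An interval that lies entirely above 1 indicates a significant increase in risk.
excludes_one = {name: lo > 1.0 for name, (_, lo, _, _) in reported.items()}

# Assumption: the SP relative risk refers to an increase of 2 g/m² per 30 days.
beta_sp = implied_coefficient(1.462, 2.0)

for name in reported:
    print(f"{name}: width {widths[name]}, excludes 1: {excludes_one[name]}")
print(f"implied beta* for SP: {beta_sp:.4f}")
```

Four of the five intervals lie above 1. The interval for the TSP monthly maximum, (0.592; 3.052), contains 1, which fits the qualifier "in general" used when the talk describes the increase in risk as significant.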
  13. Conclusions

    • The combination of VAR, PCA and logistic regression (VAR-PCA-LOG) is proposed as a useful tool for considering a group of pollutants in the same model.
    • This study provides evidence of a significant correlation between particulate matter and perceived annoyance levels, indicating that, at least for particulate matter, perceived annoyance is related not to a single pollutant but to a group of pollutants.
    • The estimated relative risks show that, in general, an increase in air pollutant concentrations (i.e., the particulate matter metrics examined here: TSP, PM10 and SP) significantly increases the probability of being annoyed.
  14. Further work…

    1. Use the bootstrap technique, or others, to estimate more accurate confidence intervals for the results.
    2. Add other pollutants to the multiple model.

    Papers published:
    MACHADO, M.; SANTOS, J. M.; FRERE, S.; CHAGNON, P.; REISEN, V. A.; BONDON, P.; ISPÁNY, M.; MAVROIDIS, I.; REIS JR, N. C. Deconstruction of annoyance due to air pollution by multiple correspondence analyses. Environmental Science and Pollution Research, v. 28, n. 35, p. 47904-47920, 2021.
    MACHADO, M.; SANTOS, J. M.; REISEN, V. A.; PEGO E SILVA, A. F.; REIS JUNIOR, N. C.; BONDON, P.; MAVROIDIS, I.; PREZOTTI FILHO, P. R.; FRERE, S.; LIMA, A. T. Parameters influencing population annoyance pertaining to air pollution. Journal of Environmental Management, v. 323, p. 115955, 2022.