I revisited the NHANES 2017 to pre pandemic 2020 data, as there are so many variables to analyse. For this post, I wanted to analyze the labs that were performed on participants of the Mobile Examination Center portion of the exam, particularly the heavy metal labs. According to the article Heavy Metals Toxicity and the Environment, “Because of their high degree of toxicity, arsenic, cadmium, chromium, lead, and mercury rank among the priority metals that are of public health significance. These metallic elements are considered systemic toxicants that are known to induce multiple organ damage, even at lower levels of exposure.”
I wanted to take a look at not only the proportion of people who are at or above the detection limit, but if there are certain demographic variables that are correlated with higher amounts of heavy metals detected in the blood of participants.
Load Packages
1
2
3
4
5
6
7
|
suppressPackageStartupMessages({
library(tidyverse)
library(nhanesA)
library(survey)
library(janitor)
library(tableone)
})
|
Import Data
1
2
3
4
|
demo <- nhanes("P_DEMO") %>%
dplyr::select(SEQN, RIAGENDR, RIDAGEYR, RIDEXPRG, RIDRETH3, SDMVPSU, SDMVSTRA, WTMECPRP)
lab <- nhanes("P_PBCD") %>%
dplyr::select(SEQN, LBXBPB, LBDBPBLC, LBXBCD, LBDBCDLC, LBXTHG, LBDTHGLC)
|
Merge Data
1
2
3
4
5
|
df <- merge(demo,
lab,
by = c("SEQN"),
all.y = TRUE)
df$SEQN <- NULL
|
Investigate and Clean Data
1
2
|
init.table <- CreateTableOne(data = df, includeNA = TRUE)
print(init.table, showAllLevels = TRUE)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
|
##
## level
## n
## RIAGENDR (%) Male
## Female
## RIDAGEYR (mean (SD))
## RIDEXPRG (%) Yes, positive lab pregnancy test or self-reported pregnant at exam
## The participant was not pregnant at exam
## Cannot ascertain if the participant is pregnant at exam
## <NA>
## RIDRETH3 (%) Mexican American
## Other Hispanic
## Non-Hispanic White
## Non-Hispanic Black
## Non-Hispanic Asian
## Other Race - Including Multi-Racial
## SDMVPSU (mean (SD))
## SDMVSTRA (mean (SD))
## WTMECPRP (mean (SD))
## LBXBPB (mean (SD))
## LBDBPBLC (%) At or above the detection limit
## Below lower detection limit
## <NA>
## LBXBCD (mean (SD))
## LBDBCDLC (%) At or above the detection limit
## Below lower detection limit
## <NA>
## LBXTHG (mean (SD))
## LBDTHGLC (%) At or above the detection limit
## Below lower detection limit
## <NA>
##
## Overall
## n 13772
## RIAGENDR (%) 6818 (49.5)
## 6954 (50.5)
## RIDAGEYR (mean (SD)) 35.16 (24.73)
## RIDEXPRG (%) 87 ( 0.6)
## 1604 (11.6)
## 59 ( 0.4)
## 12022 (87.3)
## RIDRETH3 (%) 1763 (12.8)
## 1373 (10.0)
## 4565 (33.1)
## 3688 (26.8)
## 1483 (10.8)
## 900 ( 6.5)
## SDMVPSU (mean (SD)) 1.54 (0.54)
## SDMVSTRA (mean (SD)) 160.36 (6.95)
## WTMECPRP (mean (SD)) 23148.77 (27878.70)
## LBXBPB (mean (SD)) 1.03 (1.16)
## LBDBPBLC (%) 11103 (80.6)
## 4 ( 0.0)
## 2665 (19.4)
## LBXBCD (mean (SD)) 0.36 (0.48)
## LBDBCDLC (%) 9720 (70.6)
## 2382 (17.3)
## 1670 (12.1)
## LBXTHG (mean (SD)) 1.07 (2.06)
## LBDTHGLC (%) 8296 (60.2)
## 3806 (27.6)
## 1670 (12.1)
|
Here, I’m going to rename some variables to make the names more intuitive, as well as make the names of the categories more simple so that it is more readable when the output of the multivariate regression is printed.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
dff <- df %>%
mutate(
preg = as.factor(case_when(
RIDEXPRG == "Yes, positive lab pregnancy test or self-reported pregnant at exam" ~ "Yes",
RIDEXPRG == "The participant was not pregnant at exam" ~ "No",
TRUE ~ NA)),
race = as.factor(case_when(
RIDRETH3 == "Mexican American" | RIDRETH3 == "Other Hispanic" ~ "Hispanic",
RIDRETH3 == "Non-Hispanic White" ~ "White",
RIDRETH3 == "Non-Hispanic Black" ~"Black",
RIDRETH3 == "Non-Hispanic Asian" ~ "Asian",
RIDRETH3 == "Other Race - Including Multi-Racial" ~ "Other")),
gender = as.factor(RIAGENDR)
) %>%
rename(
age = RIDAGEYR,
psu = SDMVPSU,
strata = SDMVSTRA,
weight = WTMECPRP,
lead = LBXBPB,
lead_lim = LBDBPBLC,
cadmium = LBXBCD,
cadmium_lim = LBDBCDLC,
mercury = LBXTHG,
mercury_lim = LBDTHGLC
) %>%
dplyr::select(-RIDEXPRG, -RIDRETH3, -RIAGENDR)
|
Checking again, making sure that I didn’t miss anything.
1
2
|
clean.table <- CreateTableOne(data = dff, includeNA = TRUE)
print(clean.table, showAllLevels = TRUE)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
|
##
## level Overall
## n 13772
## age (mean (SD)) 35.16 (24.73)
## psu (mean (SD)) 1.54 (0.54)
## strata (mean (SD)) 160.36 (6.95)
## weight (mean (SD)) 23148.77 (27878.70)
## lead (mean (SD)) 1.03 (1.16)
## lead_lim (%) At or above the detection limit 11103 (80.6)
## Below lower detection limit 4 ( 0.0)
## <NA> 2665 (19.4)
## cadmium (mean (SD)) 0.36 (0.48)
## cadmium_lim (%) At or above the detection limit 9720 (70.6)
## Below lower detection limit 2382 (17.3)
## <NA> 1670 (12.1)
## mercury (mean (SD)) 1.07 (2.06)
## mercury_lim (%) At or above the detection limit 8296 (60.2)
## Below lower detection limit 3806 (27.6)
## <NA> 1670 (12.1)
## preg (%) No 1604 (11.6)
## Yes 87 ( 0.6)
## <NA> 12081 (87.7)
## race (%) Asian 1483 (10.8)
## Black 3688 (26.8)
## Hispanic 3136 (22.8)
## Other 900 ( 6.5)
## White 4565 (33.1)
## gender (%) Male 6818 (49.5)
## Female 6954 (50.5)
|
Survey Design
Creating the survey design object here.
1
2
3
4
5
6
|
metal.svy <- svydesign(id = ~psu,
strata = ~strata,
weights = ~weight,
nest = TRUE,
survey.lonely.psu = "adjust",
data = dff)
|
Proportions
I want to use the variables that show if the participant was at or above the detection limit to take a look at the weighted proportions of the survey participants blood heavy metal concentrations, which reflects the population as a whole.
1
|
svymean(~lead_lim, metal.svy, na.rm = TRUE)
|
1
2
3
|
## mean SE
## lead_limAt or above the detection limit 0.99970726 2e-04
## lead_limBelow lower detection limit 0.00029274 2e-04
|
1
|
svymean(~mercury_lim, metal.svy, na.rm = TRUE)
|
1
2
3
|
## mean SE
## mercury_limAt or above the detection limit 0.72948 0.014
## mercury_limBelow lower detection limit 0.27052 0.014
|
1
|
svymean(~cadmium_lim, metal.svy, na.rm = TRUE)
|
1
2
3
|
## mean SE
## cadmium_limAt or above the detection limit 0.8448 0.0071
## cadmium_limBelow lower detection limit 0.1552 0.0071
|
Wow, almost everyone is at or above the detection limit for lead, and a large majority for the other heavy metals.
Fit Multivariate Regressions
First I want to take a subset of the survey design object to remove missing values for lab values for any of the heavy metals.
1
2
3
4
|
metal.subset <- subset(metal.svy,
!is.na(lead)
& !is.na(cadmium)
& !is.na(mercury))
|
Then, to fit the model using svyglm() with the subset of the survey design object.
1
2
3
4
5
6
|
lead.fit <- svyglm(lead ~ gender + age + race,
metal.subset)
mercury.fit <- svyglm(mercury ~ gender + age + race,
metal.subset)
cadmium.fit <- svyglm(cadmium ~ gender + age + race,
metal.subset)
|
Diagnostics
I’m going to take a quick look at the diagnostic plots for each model, to make sure that there is a linear relationship. I’m going to start with the lead model.
1
2
|
par(mfrow = c(2, 2))
plot(lead.fit)
|
And then mercury.
1
2
|
par(mfrow = c(2, 2))
plot(mercury.fit)
|
And lastly, cadmium.
1
2
|
par(mfrow = c(2, 2))
plot(cadmium.fit)
|
Most of the plots look good, there are departures from the normal distribution, but overall the relationship is linear.
Now to take a look at the coefficients of the models.
Signifigant Coefficients
For lead,
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
##
## Call:
## svyglm(formula = lead ~ gender + age + race, design = metal.subset)
##
## Survey design:
## subset(metal.svy, !is.na(lead) & !is.na(cadmium) & !is.na(mercury))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.750177 0.058333 12.860 7.98e-11 ***
## genderFemale -0.268034 0.025504 -10.509 2.35e-09 ***
## age 0.016640 0.000466 35.705 < 2e-16 ***
## raceBlack -0.242977 0.054023 -4.498 0.000246 ***
## raceHispanic -0.308710 0.067818 -4.552 0.000218 ***
## raceOther -0.327197 0.062801 -5.210 4.98e-05 ***
## raceWhite -0.395893 0.058607 -6.755 1.88e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9845273)
##
## Number of Fisher Scoring iterations: 2
|
All significant statistically, some that stand out to me are that Non-Hispanic white is associated with the largest decrease when compared to the intercept which is Asian. Another is that the only positive coefficient is age although it is quite small.
Female is also lower than male, which is interesting as you would think they would be the same from household exposures as men and women live together, perhaps there is another place that men were more frequently exposed to lead.
And mercury,
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
##
## Call:
## svyglm(formula = mercury ~ gender + age + race, design = metal.subset)
##
## Survey design:
## subset(metal.svy, !is.na(lead) & !is.na(cadmium) & !is.na(mercury))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.127359 0.300128 7.088 9.63e-07 ***
## genderFemale -0.172979 0.047452 -3.645 0.00172 **
## age 0.015134 0.001399 10.819 1.46e-09 ***
## raceBlack -1.532638 0.298384 -5.136 5.87e-05 ***
## raceHispanic -1.582814 0.293034 -5.401 3.27e-05 ***
## raceOther -1.639871 0.297172 -5.518 2.53e-05 ***
## raceWhite -1.597285 0.287334 -5.559 2.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.284387)
##
## Number of Fisher Scoring iterations: 2
|
it seems to follow the same patterns as lead.
And cadmium,
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
##
## Call:
## svyglm(formula = cadmium ~ gender + age + race, design = metal.subset)
##
## Survey design:
## subset(metal.svy, !is.na(lead) & !is.na(cadmium) & !is.na(mercury))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2393745 0.0270553 8.848 3.64e-08 ***
## genderFemale 0.0748863 0.0116488 6.429 3.66e-06 ***
## age 0.0044088 0.0003573 12.338 1.62e-10 ***
## raceBlack -0.0203956 0.0282558 -0.722 0.47920
## raceHispanic -0.1487407 0.0241972 -6.147 6.58e-06 ***
## raceOther 0.0074420 0.0360248 0.207 0.83854
## raceWhite -0.0868630 0.0281021 -3.091 0.00601 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2304773)
##
## Number of Fisher Scoring iterations: 2
|
it definitely deviates from the pattern seen in the other two model’s coefficients: Black and Other Race are not statistically significant coefficients, and surprisingly in this model Female has a positive coefficient instead of negative. I want to dig deeper into these gender differences.
First I want to visualize the mean concentration for each type of heavy metal, in the two different gender groups.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
par(mfrow = c(1, 3),
xpd = NA)
svyboxplot(lead ~ gender,
metal.subset,
ylim = c(0, 3),
xaxt = "n",
ylab = "Blood Lead (ug/dL)",
col = c("#235347", "#8EB69B"))
svyboxplot(mercury ~ gender,
metal.subset,
ylim = c(0, 4),
xaxt = "n",
ylab = "Blood Mercury (ug/L)",
col = c("#235347", "#8EB69B"))
svyboxplot(cadmium ~ gender,
metal.subset,
ylim = c(0, 2),
xaxt = "n",
ylab = "Blood Cadmium (ug/L)",
col = c("#235347", "#8EB69B"),
bty = "L")
legend("bottomleft",
legend = c("Male", "Female"),
fill = c("#235347", "#8EB69B"),
horiz = TRUE,
inset = c(-1.5, -0.15),
cex = 1.2)
mtext("Heavy Metal Concentration in Blood by Gender",
side = 3,
line = -2.5,
outer = TRUE)
|
So there are some visible differences in the lead and cadmium group, and not as much within the mercury group. However, statistical significance can’t be determined just by looking at the graph.
Wilcoxon Signed Rank Test
I’m going to use the Wilcoxon Signed Rank test rather than the parametric t-test as the data showed a deviation from the normal distribution in the q-q plot.
1
|
svyranktest(lead ~ gender, metal.subset, test = c("wilcoxon"))
|
1
2
3
4
5
6
7
8
9
|
##
## Design-based KruskalWallis test
##
## data: lead ~ gender
## t = -10.78, df = 24, p-value = 1.111e-10
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score
## -0.08826645
|
1
|
svyranktest(cadmium ~ gender, metal.subset, test = c("wilcoxon"))
|
1
2
3
4
5
6
7
8
9
|
##
## Design-based KruskalWallis test
##
## data: cadmium ~ gender
## t = 12.499, df = 24, p-value = 5.349e-12
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score
## 0.1000802
|
1
|
svyranktest(mercury ~ gender, metal.subset, test = c("wilcoxon"))
|
1
2
3
4
5
6
7
8
9
|
##
## Design-based KruskalWallis test
##
## data: mercury ~ gender
## t = -0.91282, df = 24, p-value = 0.3704
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score
## -0.006822827
|
This test confirms my suspicions from the graph, that lead and cadmium do have statistically significant differences in the means between men and women.
Conclusion
Looking into this aspect of the NHANES survey was an interesting exploratory analysis that raised some questions for me regarding gender disparities in heavy metal exposure.
Exposure to these heavy metals can be linked to adverse health outcomes and it is concerning that such a large marjority of the population has detectable exposure in their blood.