One thing that I’ve learned since I’ve been looking into working with large complex survey data like the CDC’s NHANES is about survey weights and how to use them in your analysis. In this analysis I tried to use the survey package in R to use the survey design to analyze the data, as there are certain demographics that were oversampled and undersampled as a part of the survey design.
Load Packages
1
2
3
4
5
6
7
|
suppressPackageStartupMessages({
library(tidyverse)
library(nhanesA)
library(janitor)
library(survey)
library(gtsummary)
})
|
Load Data
1
2
|
demo <- nhanes("P_DEMO")
demo <- nhanesTranslate("P_DEMO", names(demo), data = demo)
|
## Translated columns: RIDSTATR RIAGENDR RIDRETH1 RIDRETH3 RIDEXMON DMDBORN4 DMDYRUSZ DMDEDUC2 DMDMARTZ RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA
1
2
|
exam <- nhanes("P_BMX")
exam <- nhanesTranslate("P_BMX", names(exam), data = exam)
|
## Translated columns: BMDSTATS BMIWT BMIHT BMDBMIC
Retain Useful Variables
1
2
3
4
5
|
demo_select <- demo %>%
select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH3, SDMVPSU, SDMVSTRA, WTMECPRP)
exam_select <- exam %>%
select(SEQN, BMXBMI)
|
Merge Data
1
2
3
|
merged_data <- merge(demo_select, exam_select,
by = c("SEQN"), all = TRUE)
merged_data$SEQN <- NULL
|
Clean Dataset to Make Analysis Easier
First I wanted to see which race categories there are so that I can capture all of them when I recode them into more simple categories.
1
|
merged_data %>% tabyl(RIDRETH3)
|
## RIDRETH3 n percent
## Mexican American 1990 0.12789203
## Other Hispanic 1544 0.09922879
## Non-Hispanic White 5271 0.33875321
## Non-Hispanic Black 4098 0.26336761
## Non-Hispanic Asian 1638 0.10526992
## Other Race - Including Multi-Racial 1019 0.06548843
Now, I’m going to rename some variables, and use case_when() to make the race_cat and bmi_cat columns, as well as factoring these columns so that they can be used in my analysis and are not just recognized as character strings by R.
I also added at the end to output a tibble of the raw NHANES data without the weights to print out how my clean data looks and make sure that I didn’t miss any columns in the renaming process.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
|
nhanes <- merged_data %>%
rename(
gender = RIAGENDR,
age = RIDAGEYR,
bmi = BMXBMI,
psu = SDMVPSU,
strata = SDMVSTRA,
weight = WTMECPRP) %>%
mutate(
race_cat = case_when(RIDRETH3 == "Mexican American" | RIDRETH3 == "Other Hispanic" ~ "Hispanic",
RIDRETH3 == "Non-Hispanic White" ~ "White",
RIDRETH3 == "Non-Hispanic Black" ~ "Black",
RIDRETH3 == "Non-Hispanic Asian" ~ "Asian",
RIDRETH3 == "Other Race - Including Multi-Racial" ~ "Other"),
bmi_cat = case_when(bmi < 18.5 ~ "Underweight",
bmi < 25 ~ "Normal",
bmi < 30 ~ "Overweight",
bmi >= 30 ~ "Obese")
) %>%
filter(!is.na(bmi)) %>%
mutate_if(., is.character, as.factor) %>%
select(-RIDRETH3)
# Unweighted
nhanes_tibble <- as_tibble(nhanes) %>%
filter(age >= 18) %>%
select(-psu, -strata, -weight) %>%
print()
|
## # A tibble: 8,790 × 5
## gender age bmi race_cat bmi_cat
## <fct> <dbl> <dbl> <fct> <fct>
## 1 Female 29 37.8 Asian Obese
## 2 Male 49 29.7 White Overweight
## 3 Male 36 21.9 White Normal
## 4 Male 68 30.2 Other Obese
## 5 Male 76 26.6 White Overweight
## 6 Female 44 39.1 Hispanic Obese
## 7 Female 33 28.9 Asian Overweight
## 8 Female 68 28.1 Black Overweight
## 9 Female 42 31.3 Asian Obese
## 10 Male 58 30.5 Hispanic Obese
## # ℹ 8,780 more rows
Refactor Order of BMI Categories
R shows factored variables autmatically in alphabetical order, but this did not make much sense for my chart as BMI categories make more sense from underweight to obese, so I used the factor() function to remedy this before I make my table.
1
2
|
nhanes$bmi_cat <- factor(nhanes$bmi_cat,
levels = c("Underweight", "Normal", "Overweight", "Obese"))
|
Survey Design
I learned that in order to properly analyze large survey data, and to make it representative of the population at large, you have to create a survey design object and use that in your analysis.
1
2
3
4
5
|
nhanes_design <- svydesign(id = ~psu,
strata = ~strata,
weights = ~weight,
nest = TRUE,
data = nhanes)
|
Apply Eligibility Criteria
It is also important that the data is subset in this manner when working with survey data.
1
|
nhanes_adult <- subset(nhanes_design, age >= 18)
|
Summary Statistics
BMI Categories by Race/Hispanic Origin, Gender, and Average Age Weighted
1
2
3
4
5
6
7
8
|
tbl_svysummary(nhanes_adult,
by = bmi_cat,
include = c(gender, age, race_cat),
label = list(gender ~ "Gender",
age ~ "Age",
race_cat ~ "Race/Hispanic Origin"),
statistic = list(age ~ "{mean} ({sd})")
)
|
Characteristic |
Underweight, N = 3,797,884 |
Normal, N = 62,452,871 |
Overweight, N = 76,984,839 |
Obese, N = 101,342,006 |
Gender |
|
|
|
|
Male |
1,148,289 (30%) |
27,234,973 (44%) |
41,256,750 (54%) |
48,289,482 (48%) |
Female |
2,649,595 (70%) |
35,217,899 (56%) |
35,728,089 (46%) |
53,052,524 (52%) |
Age |
37 (17) |
44 (19) |
50 (17) |
48 (17) |
Race/Hispanic Origin |
|
|
|
|
Asian |
385,874 (10%) |
6,348,964 (10%) |
5,473,194 (7.1%) |
2,321,160 (2.3%) |
Black |
613,071 (16%) |
6,036,949 (9.7%) |
7,381,673 (9.6%) |
13,643,866 (13%) |
Hispanic |
440,910 (12%) |
7,245,902 (12%) |
14,181,024 (18%) |
17,625,000 (17%) |
Other |
106,290 (2.8%) |
2,276,951 (3.6%) |
2,692,021 (3.5%) |
4,801,387 (4.7%) |
White |
2,251,739 (59%) |
40,544,106 (65%) |
47,256,927 (61%) |
62,950,593 (62%) |
BMI Categories by Race/Hispanic Origin, Gender, and Average Age Unweighted
1
2
3
4
5
6
7
8
|
tbl_summary(nhanes_tibble,
by = bmi_cat,
include = c(gender, age, race_cat),
label = list(gender ~ "Gender",
age ~ "Age",
race_cat ~ "Race/Hispanic Origin"),
statistic = list(age ~ "{mean} ({sd})")
)
|
Characteristic |
Normal, N = 2,185 |
Obese, N = 3,688 |
Overweight, N = 2,767 |
Underweight, N = 150 |
Gender |
|
|
|
|
Male |
1,044 (48%) |
1,644 (45%) |
1,521 (55%) |
62 (41%) |
Female |
1,141 (52%) |
2,044 (55%) |
1,246 (45%) |
88 (59%) |
Age |
46 (20) |
50 (17) |
52 (18) |
39 (20) |
Race/Hispanic Origin |
|
|
|
|
Asian |
478 (22%) |
163 (4.4%) |
403 (15%) |
27 (18%) |
Black |
494 (23%) |
1,175 (32%) |
609 (22%) |
48 (32%) |
Hispanic |
341 (16%) |
879 (24%) |
701 (25%) |
19 (13%) |
Other |
110 (5.0%) |
198 (5.4%) |
113 (4.1%) |
6 (4.0%) |
White |
762 (35%) |
1,273 (35%) |
941 (34%) |
50 (33%) |
Gender vs BMI T-Test
Normality Assumption
1
|
svyqqmath(~bmi, design = nhanes_adult)
|
This deviates from the normal distribution, so I applied a log transformation to the BMI variable to remedy this deviation from the normal distribution at the tail end of the data.
1
|
svyqqmath(~log10(bmi), design = nhanes_adult)
|
That looks a lot better, now time to visualize the data and run the analysis.
Boxplot Visualization
1
2
3
4
5
|
svyboxplot(log10(bmi) ~ gender,
design = nhanes_adult,
xlab = "Gender",
main = "Weighted Boxplot of Mean Log Transformed BMI by Gender",
ylab = "Log Transformed BMI")
|
T-Test
1
|
svyttest(log10(bmi) ~ gender, nhanes_adult)
|
##
## Design-based t-test
##
## data: log10(bmi) ~ gender
## t = 0.46451, df = 24, p-value = 0.6465
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## -0.005848037 0.009244940
## sample estimates:
## difference in mean
## 0.001698451
Race/Hispanic Origin vs. BMI ANOVA
1
2
3
4
5
6
|
svyboxplot(log10(bmi) ~ race_cat,
design = nhanes_adult,
xlab = "Race/Hispanic Origin",
ylab = "Log Transformed BMI",
main = "Weighted Boxplot of Mean Log Transformed BMI by
Race/Hispanic Origin")
|
In visually inspecting the boxplot, it looks like there is a difference between the various race/hispanic origin groups and BMI, but lets see.
Fit Model
1
|
race.bmi.glm <- svyglm(log10(bmi) ~ race_cat, nhanes_adult)
|
ANOVA
A barrier that I ran into was that the survey package in R does not seen to have an ANOVA function, and when you run the survey GLM through the regular ANOVA function it does not work as I thought it would. I did some researching around and I found that this is a way to assess this for complex surveys using a Wald test.
1
2
|
options(scipen = 999)
regTermTest(race.bmi.glm, ~race_cat)
|
## Wald test for race_cat
## in svyglm(formula = log10(bmi) ~ race_cat, design = nhanes_adult)
## F = 184.1178 on 4 and 21 df: p= 0.00000000000000050064
So it looks like there is a statistically significant difference between the various Race/Hispanic Origin categories and BMI, as hypothesized by the boxplot.
Conclusion
While the NHANES dataset is still very useful, I did have to dig a lot deeper and do more research to find out how I can properly use this data.
As I am also working through another book on statistics, I found out that there is not just parametric statistics if your data is normal, and non parmetric statistics if it isn’t, but that you’re allowed to transform your data to help fit better into the normal distribution so that you can use parametric statistics like T-Tests with more confidence.