NHANES BMI Analysis Using Survey Package

One thing that I’ve learned since I’ve been looking into working with large complex survey data like the CDC’s NHANES is about survey weights and how to use them in your analysis. In this analysis I tried to use the survey package in R to use the survey design to analyze the data, as there are certain demographics that were oversampled and undersampled as a part of the survey design.

Load Packages

1
2
3
4
5
6
7


suppressPackageStartupMessages({
library(tidyverse)
library(nhanesA)
library(janitor)
library(survey)
library(gtsummary)
})

Load Data

1
2


demo <- nhanes("P_DEMO")
demo <- nhanesTranslate("P_DEMO", names(demo), data = demo)

## Translated columns: RIDSTATR RIAGENDR RIDRETH1 RIDRETH3 RIDEXMON DMDBORN4 DMDYRUSZ DMDEDUC2 DMDMARTZ RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA

1
2


exam <- nhanes("P_BMX")
exam <- nhanesTranslate("P_BMX", names(exam), data = exam)

## Translated columns: BMDSTATS BMIWT BMIHT BMDBMIC

Retain Useful Variables

1
2
3
4
5


demo_select <- demo %>% 
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH3, SDMVPSU, SDMVSTRA, WTMECPRP)

exam_select <- exam %>% 
  select(SEQN, BMXBMI)

Merge Data

1
2
3


merged_data <- merge(demo_select, exam_select,
                     by = c("SEQN"), all = TRUE)
merged_data$SEQN <- NULL

Clean Dataset to Make Analysis Easier

First I wanted to see which race categories there are so that I can capture all of them when I recode them into more simple categories.

1

merged_data %>% tabyl(RIDRETH3)

##                             RIDRETH3    n    percent
##                     Mexican American 1990 0.12789203
##                       Other Hispanic 1544 0.09922879
##                   Non-Hispanic White 5271 0.33875321
##                   Non-Hispanic Black 4098 0.26336761
##                   Non-Hispanic Asian 1638 0.10526992
##  Other Race - Including Multi-Racial 1019 0.06548843

Now, I’m going to rename some variables, and use case_when() to make the race_cat and bmi_cat columns, as well as factoring these columns so that they can be used in my analysis and are not just recognized as character strings by R.

I also added at the end to output a tibble of the raw NHANES data without the weights to print out how my clean data looks and make sure that I didn’t miss any columns in the renaming process.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28


nhanes <- merged_data %>% 
  rename(
    gender = RIAGENDR,
    age = RIDAGEYR,
    bmi = BMXBMI,
    psu = SDMVPSU,
    strata = SDMVSTRA,
    weight = WTMECPRP) %>% 
  mutate(
    race_cat = case_when(RIDRETH3 == "Mexican American" | RIDRETH3 == "Other Hispanic" ~ "Hispanic",
                         RIDRETH3 == "Non-Hispanic White" ~ "White",
                         RIDRETH3 == "Non-Hispanic Black" ~ "Black",
                         RIDRETH3 == "Non-Hispanic Asian" ~ "Asian",
                         RIDRETH3 == "Other Race - Including Multi-Racial" ~ "Other"),
    bmi_cat = case_when(bmi < 18.5 ~ "Underweight",
                        bmi < 25 ~ "Normal",
                        bmi < 30 ~ "Overweight",
                        bmi >= 30 ~ "Obese")
  ) %>% 
  filter(!is.na(bmi)) %>% 
  mutate_if(., is.character, as.factor) %>% 
  select(-RIDRETH3)

# Unweighted 
nhanes_tibble <- as_tibble(nhanes) %>% 
  filter(age >= 18) %>%
  select(-psu, -strata, -weight) %>% 
  print()

## # A tibble: 8,790 × 5
##    gender   age   bmi race_cat bmi_cat   
##    <fct>  <dbl> <dbl> <fct>    <fct>     
##  1 Female    29  37.8 Asian    Obese     
##  2 Male      49  29.7 White    Overweight
##  3 Male      36  21.9 White    Normal    
##  4 Male      68  30.2 Other    Obese     
##  5 Male      76  26.6 White    Overweight
##  6 Female    44  39.1 Hispanic Obese     
##  7 Female    33  28.9 Asian    Overweight
##  8 Female    68  28.1 Black    Overweight
##  9 Female    42  31.3 Asian    Obese     
## 10 Male      58  30.5 Hispanic Obese     
## # ℹ 8,780 more rows

Refactor Order of BMI Categories

R shows factored variables autmatically in alphabetical order, but this did not make much sense for my chart as BMI categories make more sense from underweight to obese, so I used the factor() function to remedy this before I make my table.

1
2


nhanes$bmi_cat <- factor(nhanes$bmi_cat, 
                         levels = c("Underweight", "Normal", "Overweight", "Obese"))

Survey Design

I learned that in order to properly analyze large survey data, and to make it representative of the population at large, you have to create a survey design object and use that in your analysis.

1
2
3
4
5


nhanes_design <- svydesign(id = ~psu,
                           strata = ~strata,
                           weights = ~weight,
                           nest = TRUE,
                           data = nhanes)

Apply Eligibility Criteria

It is also important that the data is subset in this manner when working with survey data.

1

nhanes_adult <- subset(nhanes_design, age >= 18)

Summary Statistics

BMI Categories by Race/Hispanic Origin, Gender, and Average Age Weighted

1
2
3
4
5
6
7
8


tbl_svysummary(nhanes_adult,
               by = bmi_cat,
               include = c(gender, age, race_cat),
               label = list(gender ~ "Gender",
                            age ~ "Age",
                            race_cat ~ "Race/Hispanic Origin"),
               statistic = list(age ~ "{mean} ({sd})")
               )

Characteristic	Underweight, N = 3,797,884¹	Normal, N = 62,452,871¹	Overweight, N = 76,984,839¹	Obese, N = 101,342,006¹
Gender
Male	1,148,289 (30%)	27,234,973 (44%)	41,256,750 (54%)	48,289,482 (48%)
Female	2,649,595 (70%)	35,217,899 (56%)	35,728,089 (46%)	53,052,524 (52%)
Age	37 (17)	44 (19)	50 (17)	48 (17)
Race/Hispanic Origin
Asian	385,874 (10%)	6,348,964 (10%)	5,473,194 (7.1%)	2,321,160 (2.3%)
Black	613,071 (16%)	6,036,949 (9.7%)	7,381,673 (9.6%)	13,643,866 (13%)
Hispanic	440,910 (12%)	7,245,902 (12%)	14,181,024 (18%)	17,625,000 (17%)
Other	106,290 (2.8%)	2,276,951 (3.6%)	2,692,021 (3.5%)	4,801,387 (4.7%)
White	2,251,739 (59%)	40,544,106 (65%)	47,256,927 (61%)	62,950,593 (62%)
¹ n (%); Mean (SD)

BMI Categories by Race/Hispanic Origin, Gender, and Average Age Unweighted

1
2
3
4
5
6
7
8


tbl_summary(nhanes_tibble,
               by = bmi_cat,
               include = c(gender, age, race_cat),
               label = list(gender ~ "Gender",
                            age ~ "Age",
                            race_cat ~ "Race/Hispanic Origin"),
               statistic = list(age ~ "{mean} ({sd})")
               )

Characteristic	Normal, N = 2,185¹	Obese, N = 3,688¹	Overweight, N = 2,767¹	Underweight, N = 150¹
Gender
Male	1,044 (48%)	1,644 (45%)	1,521 (55%)	62 (41%)
Female	1,141 (52%)	2,044 (55%)	1,246 (45%)	88 (59%)
Age	46 (20)	50 (17)	52 (18)	39 (20)
Race/Hispanic Origin
Asian	478 (22%)	163 (4.4%)	403 (15%)	27 (18%)
Black	494 (23%)	1,175 (32%)	609 (22%)	48 (32%)
Hispanic	341 (16%)	879 (24%)	701 (25%)	19 (13%)
Other	110 (5.0%)	198 (5.4%)	113 (4.1%)	6 (4.0%)
White	762 (35%)	1,273 (35%)	941 (34%)	50 (33%)
¹ n (%); Mean (SD)

Gender vs BMI T-Test

Normality Assumption

1

svyqqmath(~bmi, design = nhanes_adult)

This deviates from the normal distribution, so I applied a log transformation to the BMI variable to remedy this deviation from the normal distribution at the tail end of the data.

1

svyqqmath(~log10(bmi), design = nhanes_adult)

That looks a lot better, now time to visualize the data and run the analysis.

Boxplot Visualization

1
2
3
4
5


svyboxplot(log10(bmi) ~ gender, 
           design = nhanes_adult,
           xlab = "Gender",
           main = "Weighted Boxplot of Mean Log Transformed BMI by Gender",
           ylab = "Log Transformed BMI")

T-Test

1

svyttest(log10(bmi) ~ gender, nhanes_adult)

## 
## 	Design-based t-test
## 
## data:  log10(bmi) ~ gender
## t = 0.46451, df = 24, p-value = 0.6465
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -0.005848037  0.009244940
## sample estimates:
## difference in mean 
##        0.001698451

Race/Hispanic Origin vs. BMI ANOVA

1
2
3
4
5
6


svyboxplot(log10(bmi) ~ race_cat, 
           design = nhanes_adult,
           xlab = "Race/Hispanic Origin",
           ylab = "Log Transformed BMI",
           main = "Weighted Boxplot of Mean Log Transformed BMI by 
           Race/Hispanic Origin")

In visually inspecting the boxplot, it looks like there is a difference between the various race/hispanic origin groups and BMI, but lets see.

Fit Model

1

race.bmi.glm <- svyglm(log10(bmi) ~ race_cat, nhanes_adult)

ANOVA

A barrier that I ran into was that the survey package in R does not seen to have an ANOVA function, and when you run the survey GLM through the regular ANOVA function it does not work as I thought it would. I did some researching around and I found that this is a way to assess this for complex surveys using a Wald test.

1
2


options(scipen = 999)
regTermTest(race.bmi.glm, ~race_cat)

## Wald test for race_cat
##  in svyglm(formula = log10(bmi) ~ race_cat, design = nhanes_adult)
## F =  184.1178  on  4  and  21  df: p= 0.00000000000000050064

So it looks like there is a statistically significant difference between the various Race/Hispanic Origin categories and BMI, as hypothesized by the boxplot.

Conclusion

While the NHANES dataset is still very useful, I did have to dig a lot deeper and do more research to find out how I can properly use this data.

As I am also working through another book on statistics, I found out that there is not just parametric statistics if your data is normal, and non parmetric statistics if it isn’t, but that you’re allowed to transform your data to help fit better into the normal distribution so that you can use parametric statistics like T-Tests with more confidence.