For this post, I decided to analyze NHANES data from 2011-2016 once again, as there are so many variables to explore and so many possible research questions that can be investigated from this data alone. I decided to analyze trends in very general liver health data: the proportion of the population that has been told they have a liver condition, and the average age at diagnosis among those who responded yes.
There are many kinds of liver conditions, caused by genetic, viral, or lifestyle factors. They are not always serious, but they definitely can be if they go untreated. So, I was interested in seeing whether the responses to these survey interview questions changed over the three two-year cycles.
Load Packages
Load Data
Looking back at my prior NHANES projects, I realized that the code to import the data was quite repetitive, so I decided to face my fear of writing functions and write some simple ones to import the data I needed from the three survey cycles. I created one for the demographic data and one for the liver condition data, since I would be selecting the same columns from all three data sets.
Demographic Import Function
Liver Import Function
And now to apply my functions to import the data that I need in a less repetitive manner.
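As a rough sketch of what import helpers like these could look like (not necessarily the exact code used here): this assumes the NHANES XPT files have already been downloaded into the working directory and that haven and dplyr are available. The function names import.demo() and import.liver() and the exact column lists are illustrative placeholders. MCQ160L and MCQ180L are the NHANES medical conditions variables for the liver condition question and its age follow-up.

```r
library(haven)  # read_xpt() reads SAS transport (.XPT) files
library(dplyr)

# Hypothetical helper: read one demographics file and keep only the columns
# needed later (ID, age, pregnancy status, interview weight, design variables).
import.demo <- function(path) {
  read_xpt(path) %>%
    select(SEQN, RIDAGEYR, RIDEXPRG, WTINT2YR, SDMVPSU, SDMVSTRA)
}

# Hypothetical helper: read one medical conditions file and keep the liver
# condition question and its age-at-diagnosis follow-up.
import.liver <- function(path) {
  read_xpt(path) %>%
    select(SEQN, MCQ160L, MCQ180L)
}

# Example usage, assuming the files sit in the working directory:
# demo_11  <- import.demo("DEMO_G.XPT")   # 2011-2012
# demo_13  <- import.demo("DEMO_H.XPT")   # 2013-2014
# demo_15  <- import.demo("DEMO_I.XPT")   # 2015-2016
# liver_11 <- import.liver("MCQ_G.XPT")
# liver_13 <- import.liver("MCQ_H.XPT")
# liver_15 <- import.liver("MCQ_I.XPT")
```

Because each helper takes only a file path, adding another survey cycle later is a one-line change.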
Merge Data
I’m also going to create a function to merge the demographic and liver data by the unique identifier SEQN and then drop that column, as it won’t be needed past this step.
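A minimal sketch of such a merge helper, using base R and two tiny made-up data frames. The inner join (keeping only respondents present in both files) is an assumption on my part, and the toy values are invented purely for illustration.

```r
# Merge one cycle's demographic and liver data on SEQN, then drop SEQN.
merge.nhanes <- function(demo, liver) {
  merged <- merge(demo, liver, by = "SEQN")  # inner join on respondent ID
  merged$SEQN <- NULL                        # ID is not needed after merging
  merged
}

# Made-up data: respondents 2 and 3 appear in both files
demo_toy  <- data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(34, 51, 12))
liver_toy <- data.frame(SEQN = c(2, 3, 4), MCQ160L = c(1, 2, 2))

merged_toy <- merge.nhanes(demo_toy, liver_toy)
print(merged_toy)
```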
Then I applied the merge.nhanes() function to each cycle and created a separate column identifying which cycle each row came from, since that is the variable I want to compare to see whether anything has changed over time. Finally, I created my merged (and still uncleaned) dataset using rbind().
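The labelling and stacking step can be sketched like this. The three small data frames stand in for the merged per-cycle data, and the object names and cycle labels are placeholders rather than the names actually used.

```r
# Made-up stand-ins for the three merged per-cycle data frames
cycle_11 <- data.frame(RIDAGEYR = c(34, 51), MCQ160L = c(2, 1))
cycle_13 <- data.frame(RIDAGEYR = c(67, 45), MCQ160L = c(1, 2))
cycle_15 <- data.frame(RIDAGEYR = c(29, 72), MCQ160L = c(2, 2))

# Tag every row with the survey cycle it came from
cycle_11$cycle <- "2011-2012"
cycle_13$cycle <- "2013-2014"
cycle_15$cycle <- "2015-2016"

# Stack the cycles; rbind() needs identical column names in each data frame
nhanes_raw <- rbind(cycle_11, cycle_13, cycle_15)
nhanes_raw$cycle <- factor(nhanes_raw$cycle)

table(nhanes_raw$cycle)
```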
Clean Data
I then use dplyr’s mutate(), rename(), and select() to clean up the dataset and give it more intuitive variable names. I also had to go back and make sure that every type of missing-data coding was accounted for: when running summary statistics later, I noticed that the liver age variable was way outside any age at which someone would even be alive. I assumed there was alternative coding for NAs that I had missed, and it turned out some observations were coded as 99999 instead of NA. I fixed that with a simple if_else() call.
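A small sketch of this kind of recoding on made-up rows. The new variable names liver and liverage match the ones used later in the post, but the other names are placeholders. Treating 77777 ("refused") as missing alongside 99999, and keeping only the yes/no codes 1 and 2 for MCQ160L, are assumptions based on the usual NHANES coding.

```r
library(dplyr)

# Made-up raw rows: one respondent has the 99999 "don't know" code
nhanes_raw <- data.frame(
  RIDAGEYR = c(34, 51, 67, 45),
  MCQ160L  = c(2, 1, 1, 9),
  MCQ180L  = c(NA, 40, 99999, NA),
  cycle    = c("2011-2012", "2011-2012", "2013-2014", "2015-2016")
)

nhanes_clean <- nhanes_raw %>%
  rename(age = RIDAGEYR) %>%
  mutate(
    # keep only yes (1) / no (2); refused and don't know become NA
    liver = if_else(MCQ160L %in% c(1, 2), MCQ160L, NA_real_),
    # special codes are not real ages, so recode them to NA
    liverage = if_else(MCQ180L %in% c(77777, 99999), NA_real_, MCQ180L)
  ) %>%
  select(cycle, age, liver, liverage)

summary(nhanes_clean$liverage)
```

Without the second if_else(), a handful of 99999 values is enough to drag the mean age at diagnosis into the thousands.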
Summary Statistics
Now to take a look at a summary table of my data to make sure that everything was coded in the way that I want it to be, and that there are no glaring outliers anymore that may show that the data was not cleaned properly.
Looks good! Doesn’t say that the average age that someone was told that they had a liver condition was 49,000 anymore haha.
Survey Design Objects
Now for an important part of the survey analysis process in R: creating a survey design object from my cleaned-up data frame using the PSU, weight, and strata variables provided by NHANES.
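For reference, a design object of this kind can be sketched with svydesign() from the survey package. The data below are invented, and the column names psu, strata, and weight are placeholders for whatever the cleaned data frame calls the NHANES SDMVPSU, SDMVSTRA, and interview weight variables.

```r
library(survey)

# Made-up data: 3 cycles x 2 strata x 2 PSUs x 2 respondents
toy <- data.frame(
  cycle  = rep(c("2011-2012", "2013-2014", "2015-2016"), each = 8),
  strata = rep(c(90, 91, 104, 105, 119, 120), each = 4),
  psu    = rep(c(1, 1, 2, 2), times = 6),
  weight = rep(c(12000, 30000, 18000, 25000), times = 6),
  liver  = c(1, 0, 0, 0, 0, 1, 0, 0,
             0, 0, 1, 0, 1, 0, 0, 1,
             0, 1, 0, 0, 0, 0, 1, 1)
)

# nest = TRUE because PSU numbers repeat within each stratum
nhanes_design <- svydesign(
  ids     = ~psu,
  strata  = ~strata,
  weights = ~weight,
  nest    = TRUE,
  data    = toy
)

summary(nhanes_design)
```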
Apply Exclusion Criteria
Next, I apply my exclusion criteria to the survey design object to exclude participants under 18, those currently confirmed to be pregnant, and those with missing values for the liver variable (which stores the response to the question, “Were you ever told that you have a liver condition?”).
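A sketch of applying exclusions like these with subset() on the design object rather than on the raw data frame, so that the design information is carried along. The data are made up, the variable names are placeholders, and coding pregnant as 1 = confirmed pregnant follows the NHANES RIDEXPRG convention as I understand it.

```r
library(survey)

# Made-up respondents: two under 18, one pregnant, one with missing liver answer
toy <- data.frame(
  strata   = rep(c(90, 91), each = 4),
  psu      = rep(c(1, 1, 2, 2), times = 2),
  weight   = c(12000, 30000, 18000, 25000, 14000, 22000, 16000, 27000),
  age      = c(34, 12, 51, 67, 45, 29, 16, 72),
  pregnant = c(2, NA, NA, NA, 1, 2, NA, NA),
  liver    = c(0, 0, 1, 1, 0, NA, 0, 1)
)

des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 nest = TRUE, data = toy)

# Keep adults who are not confirmed pregnant and who answered the liver question.
# A missing pregnancy status is treated as "not confirmed pregnant".
des_analytic <- subset(
  des,
  age >= 18 & (is.na(pregnant) | pregnant != 1) & !is.na(liver)
)

# Number of respondents remaining in the analytic sample
sum(weights(des_analytic) > 0)
```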
Proportion of Population with a Liver Condition
I’m going to use the svyby() function to calculate the proportion of people diagnosed with a liver condition. Using the survey package takes the weights into account, so the result reflects the proportion of the whole population that has been told they have a liver condition rather than just the group surveyed.
I also found that you can extract elements from the svyby() output if you assign it to an object, which I did in order to graph the values in a bar chart.
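Roughly, the calculation and extraction could look like the sketch below. It runs on made-up data, so the numbers mean nothing. It assumes liver is coded 0/1 so that the weighted mean is a proportion, and all object and column names are placeholders.

```r
library(survey)

# Made-up data: 3 cycles x 2 strata x 2 PSUs x 2 respondents
toy <- data.frame(
  cycle  = rep(c("2011-2012", "2013-2014", "2015-2016"), each = 8),
  strata = rep(c(90, 91, 104, 105, 119, 120), each = 4),
  psu    = rep(c(1, 1, 2, 2), times = 6),
  weight = rep(c(12000, 30000, 18000, 25000), times = 6),
  liver  = c(1, 0, 0, 0, 0, 1, 0, 0,
             0, 0, 1, 0, 1, 0, 0, 1,
             0, 1, 0, 0, 0, 0, 1, 1)
)
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 nest = TRUE, data = toy)

# Weighted proportion with a liver condition, by survey cycle
prop <- svyby(~liver, ~cycle, des, svymean)

# Pull the estimates and standard errors out of the svyby object
prop_df <- data.frame(
  cycle      = prop$cycle,
  proportion = coef(prop),
  se         = SE(prop)
)

barplot(prop_df$proportion, names.arg = prop_df$cycle,
        xlab = "Survey cycle", ylab = "Proportion told they have a liver condition")
```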

Change in Proportion of Population with a Liver Condition Over Time
I’m going to make a quick table using the svytable() function to take a look at how many individuals in the population have been told that they have a liver condition vs. those who have not over the three survey cycles.
Just by looking at the table, the number of yes answers has gotten greater, but so has the population overall. I’m going to do a chi-squared test to see if there are any statistically significant differences in the proportions.
At a p-value threshold of 0.05, the null hypothesis that there are no differences in the proportions cannot be rejected.
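The table and test described in this section can be sketched as follows. This uses invented data, so its output has nothing to do with the result reported above. Choosing the Rao-Scott F statistic (statistic = "F", the default in svychisq()) is an assumption, and the variable names are placeholders.

```r
library(survey)

# Made-up data: 3 cycles x 2 strata x 2 PSUs x 2 respondents
toy <- data.frame(
  cycle  = factor(rep(c("2011-2012", "2013-2014", "2015-2016"), each = 8)),
  strata = rep(c(90, 91, 104, 105, 119, 120), each = 4),
  psu    = rep(c(1, 1, 2, 2), times = 6),
  weight = rep(c(12000, 30000, 18000, 25000), times = 6),
  liver  = factor(c(1, 0, 0, 0, 0, 1, 0, 0,
                    0, 0, 1, 0, 1, 0, 0, 1,
                    0, 1, 0, 0, 0, 0, 1, 1),
                  levels = c(0, 1), labels = c("no", "yes"))
)
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 nest = TRUE, data = toy)

# Weighted counts: estimated number of people in the population per cell
svytable(~liver + cycle, des)

# Design-based chi-squared test of association between liver and cycle
chi <- svychisq(~liver + cycle, des, statistic = "F")
chi$p.value
```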
Average Age at Diagnosis
Another variable whose change over the three survey cycles I want to look at more closely is liverage. This is a follow-up question for those who responded yes to the previous question, capturing the age at which they were first told that they had a liver condition.
I used the svyboxplot() function to visualize this data, comparing the distribution of age at diagnosis between survey cycles.

Change in Average Age at Diagnosis of a Liver Condition Over Time
In order to compare the average age at diagnosis across the survey cycle groups, I performed an analysis-of-variance-style test by fitting a svyglm() model and passing it to regTermTest(), since there is no direct aov() counterpart in this package that takes the survey design into consideration.
It looks like this variable also does not show any statistically significant differences between the survey cycles at a p = 0.05 threshold.
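For anyone curious, the svyglm() plus regTermTest() approach described in this section can be sketched like this. The data are invented and stand in for a design already restricted to people who reported a diagnosis, so the output is unrelated to the result above. Variable names are placeholders.

```r
library(survey)

# Made-up data: every row is a respondent with an age at diagnosis
toy <- data.frame(
  cycle    = factor(rep(c("2011-2012", "2013-2014", "2015-2016"), each = 8)),
  strata   = rep(c(90, 91, 104, 105, 119, 120), each = 4),
  psu      = rep(c(1, 1, 2, 2), times = 6),
  weight   = rep(c(12000, 30000, 18000, 25000), times = 6),
  liverage = c(42, 55, 38, 61, 47, 33, 58, 50,
               45, 52, 36, 64, 40, 57, 49, 31,
               53, 44, 60, 39, 35, 48, 56, 43)
)
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 nest = TRUE, data = toy)

# Linear model of age at diagnosis on survey cycle (gaussian family by default)
fit <- svyglm(liverage ~ cycle, design = des)

# Wald test that all cycle coefficients are zero, i.e. equal mean age across cycles
cycle_test <- regTermTest(fit, ~cycle)
cycle_test
```

Testing the whole cycle term at once is what makes this comparable to a one-way ANOVA, rather than reading the individual coefficient p-values from summary(fit).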
Conclusion
I wanted to do a trend analysis over the years as it was a bit of a departure from what I have already done with the NHANES data in the past. In doing this analysis, I also started using R’s ability to write custom functions to avoid writing repetitive code, which was very intimidating to me before starting this, but I realized that it’s just generalizing regular R code.
I would like to do future analyses with even more years. Since there were only three cycles to compare, there were not too many changes noted, and perhaps there is more of a long-term (10+ year) trend that I’m missing by just focusing on a six-year time period. I’m also excited for the release of the newest NHANES data, which would be the first post-pandemic data, so that we can see how the pandemic changed trends in population health data.
This trend analysis can also be expanded upon by adding the laboratory data for liver health markers from the Mobile Examination Center (MEC) portion of the survey. One thing to note that would change if this was added as a variable is the survey weights, which would be the MEC weights rather than the interview weights, as fewer people came in for the MEC exam than completed the at-home interview.