The Center for Large Data Research and Data Sharing in Rehabilitation (CLDR) is a consortium of investigators from the University of Texas Medical Branch, Colorado State University, Cornell University, and the University of Michigan. The CLDR is funded by the NIH National Institute of Child Health and Human Development, through the National Center for Medical Rehabilitation Research, the National Institute of Neurological Disorders and Stroke, and the National Institute of Biomedical Imaging and Bioengineering (grant P2CHD065702).

Education & Training: Analyzing Data

Analyzing Nested (Clustered) Data Overview

Most large data sets that can be used for rehabilitation-related research contain data that are inherently 'nested' or 'clustered.' Persons who see the same provider, are admitted to the same hospital, or live in the same community share common characteristics, experiences, and environmental influences. As a result, individuals within a group or setting (commonly referred to as a 'context') tend to be more similar to each other, in both health determinants and health outcomes, than individuals chosen at random from all groups. This correlation (dependency) among observations violates the independence assumption of standard regression analysis and leads to biased (often underestimated) standard errors of parameter estimates. Thus, regardless of whether your specific research question includes factors from more than one level, it is necessary to account for the hierarchical structure of potentially correlated observations within these data sets.
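
The degree of similarity within groups is commonly summarized by the intraclass correlation coefficient (ICC), the share of outcome variance that lies between groups. The following is a minimal sketch, assuming simulated data and equal cluster sizes; the function names and the simulated "hospital" values are hypothetical and not part of any CLDR data set. It estimates the ICC with the one-way ANOVA estimator and converts it into a design effect, 1 + (m - 1) × ICC, for clusters of size m.

```python
import random
import statistics


def simulate_clusters(n_clusters, cluster_size, between_sd, within_sd, seed=0):
    """Simulate outcomes for persons nested in clusters (hypothetical data)."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_clusters):
        # Effect shared by every person in the same cluster (e.g., same hospital).
        cluster_effect = rng.gauss(0.0, between_sd)
        clusters.append(
            [cluster_effect + rng.gauss(0.0, within_sd) for _ in range(cluster_size)]
        )
    return clusters


def anova_icc(clusters):
    """One-way ANOVA estimator of the ICC; assumes equal cluster sizes."""
    k = len(clusters)
    m = len(clusters[0])
    all_values = [y for cluster in clusters for y in cluster]
    grand_mean = statistics.fmean(all_values)
    cluster_means = [statistics.fmean(cluster) for cluster in clusters]

    # Mean squares between and within clusters.
    ms_between = m * sum((cm - grand_mean) ** 2 for cm in cluster_means) / (k - 1)
    ms_within = sum(
        (y - cm) ** 2 for cluster, cm in zip(clusters, cluster_means) for y in cluster
    ) / (k * (m - 1))

    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)


def design_effect(icc, cluster_size):
    """Variance inflation of a mean relative to simple random sampling."""
    return 1.0 + (cluster_size - 1) * icc


# Between-cluster SD 1 and within-cluster SD 2 imply a true ICC of 1 / (1 + 4) = 0.2.
clusters = simulate_clusters(n_clusters=200, cluster_size=20, between_sd=1.0, within_sd=2.0)
icc_hat = anova_icc(clusters)
print("estimated ICC:", round(icc_hat, 3))
print("design effect:", round(design_effect(icc_hat, 20), 2))
```

With a true ICC of 0.2 and 20 persons per cluster, the design effect is 4.8, so 4,000 clustered observations carry roughly as much information about a mean as about 830 independent ones.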

There are two basic approaches to working with nested data:

  1. Adjust standard errors of individual-level predictors to account for the potential bias introduced by ignoring the nested structure of the data, or
  2. Model the structure of the data and partition the variance attributable to the different levels.
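
In its simplest form, the second approach can be written as a two-level random-intercept model. The notation is an assumption introduced here for illustration: person i is nested in group j, u_j is the group-level deviation, and e_ij is the person-level residual.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \tau^2), \quad e_{ij} \sim N(0, \sigma^2)

\operatorname{Var}(y_{ij} \mid x_{ij}) = \tau^2 + \sigma^2,
\qquad \rho = \frac{\tau^2}{\tau^2 + \sigma^2}
```

Here \tau^2 is the variance attributable to the group level, \sigma^2 is the variance attributable to the individual level, and \rho is the proportion of the total variance that lies between groups.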

A generalized estimating equation (GEE) can be used when the objective is simply to adjust the standard errors. In contrast, a multilevel model (a hierarchical linear model [HLM] for numerical outcome variables or a hierarchical generalized linear model [HGLM] for categorical outcome variables) can accommodate either objective: controlling for or modeling the correlated observations, including repeated measures on the same subjects over time.
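
To make the standard-error adjustment concrete, the sketch below compares a naive standard error with a cluster-robust ("sandwich") standard error for a regression slope. For a linear model with an independence working correlation, the GEE point estimates coincide with ordinary least squares, and the robust variance takes the form computed here (without any small-sample correction). The data are simulated and the function names are hypothetical; a real analysis would rely on established statistical software rather than hand-written code.

```python
import math
import random


def simulate(n_clusters=100, cluster_size=20, seed=1):
    """Hypothetical data: patients nested in hospitals, hospital-level predictor x."""
    rng = random.Random(seed)
    rows = []  # each row is (cluster_id, x, y)
    for g in range(n_clusters):
        x = rng.gauss(0.0, 1.0)  # hospital-level characteristic
        u = rng.gauss(0.0, 1.0)  # unobserved hospital effect shared by its patients
        for _ in range(cluster_size):
            y = 1.0 + 0.5 * x + u + rng.gauss(0.0, 2.0)
            rows.append((g, x, y))
    return rows


def slope_with_standard_errors(rows):
    """Return (slope, naive SE, cluster-robust SE) for y = a + b * x."""
    n = len(rows)
    x_bar = sum(x for _, x, _ in rows) / n
    y_bar = sum(y for _, _, y in rows) / n
    sxx = sum((x - x_bar) ** 2 for _, x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for _, x, y in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    rss = 0.0
    cluster_scores = {}  # per-cluster sum of (x - x_bar) * residual
    for g, x, y in rows:
        resid = y - intercept - slope * x
        rss += resid ** 2
        cluster_scores[g] = cluster_scores.get(g, 0.0) + (x - x_bar) * resid

    # Naive SE treats all n observations as independent.
    naive_se = math.sqrt(rss / (n - 2) / sxx)
    # Robust SE sums the scores within each cluster before squaring them.
    robust_se = math.sqrt(sum(s ** 2 for s in cluster_scores.values())) / sxx
    return slope, naive_se, robust_se


slope, naive_se, robust_se = slope_with_standard_errors(simulate())
print("slope:", round(slope, 3))
print("naive SE:", round(naive_se, 3))
print("cluster-robust SE:", round(robust_se, 3))
```

Because the predictor varies only between hospitals in this simulation, the naive standard error understates the uncertainty, and the cluster-robust value is noticeably larger. The slope estimate itself is unchanged; only its standard error is adjusted.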

Citations

  • P Burton et al. (1998) Extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modeling. Statist Med. 17: 1261-1291.
  • H Goldstein et al. (2002) Tutorial in Biostatistics: Multilevel modeling of medical data. Statist Med. 21: 3291-3315.
  • JA Hanley et al. (2003) Statistical analysis of correlated data using generalized estimating equations: an orientation. Am J Epidemiol. 157(4): 364-375.
  • JB Bingenheimer & SW Raudenbush (2004) Statistical and substantive inferences in public health: Issues in the application of multilevel models. Annu Rev Public Health. 25: 53-77.
  • J Merlo et al. (2005) A brief conceptual tutorial on multilevel analysis in social epidemiology: Investigating contextual phenomena in different groups of people. J Epidemiol Community Health. 59: 729-736.



