
Resource Archive

  1. HIPAA: Security (HIPAA compliance) for Administrative Data sets.
  2. See Dr. Linda S. Elting's presentation on Security, Privacy, and Ethical Issues in Database Research.
  3. Data Use Agreements.

Analyzing Nested (Clustered) Data


Most large data sets that can be used for rehabilitation-related research contain data that are inherently 'nested' or 'clustered.' Persons who see the same provider, are admitted to the same hospital, or live in the same community share common characteristics, experiences, and environmental influences. As a result, individuals within a group or setting (commonly referred to as 'context') tend to be more similar to each other, in terms of both health determinants and health outcomes, than individuals chosen at random from all groups. This correlation (dependency) of observations violates the independence assumption of standard regression analysis, which leads to biased standard errors for the parameter estimates. Thus, regardless of whether your specific research question includes factors from more than one level, it is necessary to account for the hierarchical nature of potentially correlated observations within these data sets.

There are two basic approaches to working with nested data: 1) adjust the standard errors of individual-level predictors to account for the potential bias introduced by ignoring the nested structure of the data, or 2) model the structure of the data and partition the variance attributable to the different levels. A generalized estimating equation (GEE) can be used when the objective is simply to adjust the standard errors. Conversely, a multilevel model (hierarchical linear model [HLM] or hierarchical generalized linear model [HGLM] for numerical and categorical outcome variables, respectively) can accommodate either objective: controlling for or modeling the correlated observations, including repeated measures on the same subjects over time.
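To make the effect of clustering concrete, the following is a minimal sketch in Python that estimates the intraclass correlation (ICC) with the one-way ANOVA estimator and then computes the design effect, 1 + (m - 1) x ICC, for equal-sized clusters. It is an illustration only, not a GEE or multilevel model. The function names (`icc_oneway`, `design_effect`) and the hospital outcome values are hypothetical and do not come from any of the resources listed here.

```python
from statistics import mean

def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC; assumes equal-sized clusters."""
    k = len(groups)       # number of clusters (e.g., hospitals)
    m = len(groups[0])    # observations per cluster
    grand = mean([x for g in groups for x in g])
    # Between-cluster mean square
    msb = m * sum((mean(g) - grand) ** 2 for g in groups) / (k - 1)
    # Within-cluster mean square
    msw = sum((x - mean(g)) ** 2 for g in groups for x in g) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

def design_effect(icc, m):
    """Variance inflation of a mean due to clustering (equal cluster size m)."""
    return 1 + (m - 1) * icc

# Hypothetical outcome scores for patients nested in three hospitals
hospitals = [
    [10, 12, 11, 13],
    [20, 22, 21, 23],
    [30, 32, 31, 33],
]

icc = icc_oneway(hospitals)
deff = design_effect(icc, m=4)
print(f"ICC = {icc:.3f}, design effect = {deff:.2f}")
```

In this made-up example almost all of the variation lies between hospitals, so the ICC is close to 1 and the design effect approaches the cluster size. An analysis that treated the 12 patients as independent would therefore report standard errors that are too small.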


  • P Burton et al. (1998) Extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modeling. Statist Med. 17: 1261-1291.
  • H Goldstein et al. (2002) Tutorial in Biostatistics: Multilevel modeling of medical data. Statist Med. 21: 3291-3315.
  • JA Hanley et al. (2003) Statistical analysis of correlated data using generalized estimating equations: an orientation. Am J Epidemiol. 157(4): 364-375.
  • JB Bingenheimer & SW Raudenbush (2004) Statistical and substantive inferences in public health: Issues in the application of multilevel models. Annu Rev Public Health. 25: 53-77.
  • J Merlo et al. (2005) A brief conceptual tutorial on multilevel analysis in social epidemiology: Investigating contextual phenomena in different groups of people. J Epidemiol Community Health. 59: 729-736.

Analyses Resources

  • Linear Models Video - View the video link for an introduction to hierarchical linear models (HLM) and hierarchical generalized linear models (HGLM). The 90-min video features Professor Ann O'Connell from The Ohio State University. It was recorded during an educational symposium on multilevel modeling at UTMB sponsored by the Center for Rehabilitation Research using Large Data sets in April 2011.


Research involving administrative datasets or large national surveys typically lacks one or more of the three design criteria that define rigorous "experimental research" designs: manipulation, randomization, and control. While randomized controlled trials (RCTs) are the epitome of experimental research and remain the gold standard for inferring causation, methodological advances over the past 20 years have greatly increased our interest in and understanding of quasi-experimental or "observational" research. A major advantage of existing claims or survey data is that they reflect routine practice for large and representative populations, in contrast to the much smaller and often healthier patient populations recruited in clinical trials. In other words, these datasets capture the characteristics and experiences of everyday patients in everyday clinical settings. Moreover, these resources provide the only way to assess policy- or practice-related changes, the so-called "natural experiments."

The fundamental strength of RCTs is the primary criticism of quasi-experimental research: internal validity - the degree to which the relationship between the treatment and outcome is free from the effects of extraneous factors. In routine practice, treatment is not randomly assigned. Rather, factors such as prognosis, patient and provider preferences, insurance coverage, and out-of-pocket costs influence who gets what treatment. Thus, socio-demographic and clinical characteristics are typically not balanced between treated and untreated cohorts. External validity - the degree to which the results can be generalized to persons or settings outside the experimental situation - is generally less of a concern in observational studies, since the "experimental situation" is routine patients receiving routine care.

When independent variable manipulation and random assignment are beyond the control of the investigator, there are four other design parameters that can strengthen a study's internal validity: 1) cohort identification (incident vs. prevalent users), 2) control or "counterfactual" group, 3) pre-period measurement, and 4) post-period measurement.
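One common way to combine a comparison group with pre- and post-period measurement is a difference-in-differences estimate. The text above does not name this technique, so the sketch below is offered only as an illustration of how the design parameters fit together. The function name `diff_in_diff` and all outcome values are hypothetical, and the estimate is only interpretable if the two groups would have followed parallel trends in the absence of treatment.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical functional-status scores (higher is better)
treated_pre = [50, 52, 48, 50]
treated_post = [60, 62, 58, 60]
control_pre = [49, 51, 50, 50]    # comparison ("counterfactual") group
control_post = [54, 56, 55, 55]

did = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"Difference-in-differences estimate: {did:.1f}")
```

The comparison group's change stands in for what would have happened to the treated group without treatment. Without it, or without the pre-period measurement, the secular improvement seen in both groups would be wrongly attributed to the treatment.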


NF Marko & RJ Weil (2010) The role of observational investigations in comparative effectiveness research. Value in Health. 13(8): 989-997.

S Schneeweiss & J Avorn (2005) A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 58: 323-337.

E von Elm et al. (2007) The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med. 147(8): 573-577.

Other Links

STROBE Statement Website: STROBE stands for an international, collaborative initiative of epidemiologists, methodologists, statisticians, researchers and journal editors involved in the conduct and dissemination of observational studies, with the common aim of Strengthening the Reporting of Observational studies in Epidemiology.


Rigorous Quasi-Experimental Comparative Effectiveness Research Study Design by Professor Matthew Maciejewski from Duke University and the Center for Health Services Research at Durham VA Medical Center. This 60-min video was recorded during the Comparative Effectiveness Research with Population-Based Data conference in the Baker Institute at Rice University on July 13, 2012.


Director's Letter
by David Atkins, M.D., M.P.H., Director, HSR&D

Embracing "Big Data" and Data Science
by Stephan Fihn, M.D., M.P.H., VHA Office of Analytics and Business Intelligence, Washington, D.C.

"Big Data" Challenges for Health Services Research
by David Atkins, M.D., M.P.H., HSR&D, Washington, D.C.

Research Highlights

Using Big Data Meaningfully to Improve Quality and Safety at the Point of Care
by Hardeep Singh, M.D., M.P.H., Houston VA Center for Innovation in Quality, Effectiveness and Safety, Michael E. DeBakey VA Medical Center, Houston, Texas and Dean F. Sittig, Ph.D., University of Texas School of Biomedical Informatics, Houston, Texas

Opportunities for Big Data Research in VA
by Adi V. Gundlapalli, M.D., Ph.D., M.S., Salt Lake Informatics, Decision Enhancement, and Analytic Sciences Center (IDEAS 2.0), VA Salt Lake City Health Care System, Salt Lake City, Utah

Predictive Analytics for Population Surveillance and Personalized Medicine: The Example of Acute Kidney Injury
by Michael E. Matheny, M.D., M.S., M.P.H., Geriatric Research Education Clinical Center, VA Tennessee Valley Healthcare System, Nashville, Tennessee



The Center for Large Data Research and Data Sharing in Rehabilitation (CLDR) involves a consortium of investigators from the University of Texas Medical Branch, Cornell University, and the University of Michigan. The CLDR is funded by NIH - National Institute of Child Health and Human Development, through the National Center for Medical Rehabilitation Research, the National Institute of Neurological Disorders and Stroke, and the National Institute of Biomedical Imaging and Bioengineering (P2CHD065702).