

Education & Training  Other Links & Resources

Privacy & Regulations

  1. HIPAA: Security (HIPAA compliance) for Administrative Data sets.
  2. See Dr. Linda S. Elting's presentation on Security, Privacy, and Ethical Issues in Database Research.
  3. Data Use Agreements.


Analyzing Nested (Clustered) Data


Most large data sets that can be used for rehabilitation-related research contain data that are inherently 'nested' or 'clustered.' Persons who see the same provider, are admitted to the same hospital, or live in the same community share common characteristics, experiences, and environmental influences. As a result, individuals within a group or setting (commonly referred to as 'context') tend to be more similar to each other than those chosen at random from all groups in terms of both health determinants and health outcomes. This correlation (dependency) of observations violates the assumption of independence for regression analysis, leading to biased standard errors of parameter estimates. Thus, regardless of whether your specific research question includes factors from more than one level, it is necessary to account for the hierarchical nature of potentially correlated observations within these data sets.

There are two basic approaches to working with nested data: 1) adjust standard errors of individual-level predictors to account for the potential bias introduced by ignoring the nested structure of the data, or 2) model the structure of the data and partition the variance attributable to the different levels. A generalized estimating equation (GEE) can be used when the objective is simply adjusting the standard errors. Conversely, a multilevel model (hierarchical linear model [HLM] or hierarchical generalized linear model [HGLM] for numerical and categorical outcome variables, respectively) can accommodate either objective: controlling for or modeling the correlated observations, including repeated measures on the same subjects over time.
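To illustrate why clustering matters for standard errors, the sketch below estimates how similar patients within the same hospital are (the intraclass correlation, ICC) and the resulting variance inflation (the design effect). The ICC and design effect are not named in the text above; they are brought in here as one common way to quantify the dependency it describes. The function names and the hospital data are hypothetical, and the one-way ANOVA estimator assumes equal-sized groups.

```python
from statistics import mean


def icc_oneway(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    Assumes every group (e.g. hospital) has the same number of observations.
    """
    k = len(groups)          # number of groups
    m = len(groups[0])       # observations per group
    if any(len(g) != m for g in groups):
        raise ValueError("this sketch assumes equal-sized groups")

    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    ss_between = m * sum((gm - grand_mean) ** 2 for gm in group_means)
    ss_within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (m - 1))
    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)


def design_effect(icc, cluster_size):
    """Factor by which the variance of a mean is inflated by clustering."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical outcome scores for four patients in each of three hospitals.
hospitals = [
    [10, 12, 11, 13],
    [20, 22, 21, 23],
    [30, 32, 31, 33],
]

icc = icc_oneway(hospitals)
deff = design_effect(icc, cluster_size=4)
print(f"ICC = {icc:.3f}, design effect = {deff:.2f}")
```

In this made-up example nearly all of the variation lies between hospitals, so the ICC is close to 1 and the design effect is close to the cluster size. An analysis that treated the 12 patients as independent would therefore overstate its precision, which is the problem that GEE and multilevel models address.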


  • P Burton et al. (1998) Extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modeling. Statist Med. 17: 1261-1291.
  • H Goldstein et al. (2002) Tutorial in biostatistics: Multilevel modeling of medical data. Statist Med. 21: 3291-3315.
  • JA Hanley et al. (2003) Statistical analysis of correlated data using generalized estimating equations: an orientation. Am J Epidemiol. 157(4): 364-375.
  • JB Bingenheimer & SW Raudenbush (2004) Statistical and substantive inferences in public health: Issues in the application of multilevel models. Annu Rev Public Health. 25: 53-77.
  • J Merlo et al. (2005) A brief conceptual tutorial on multilevel analysis in social epidemiology: Investigating contextual phenomena in different groups of people. J Epidemiol Community Health. 59: 729-736.

Analyses Resources

  • Linear Models Video - View the video link for an introduction to hierarchical linear models (HLM) and hierarchical generalized linear models (HGLM). The 90-min video features Professor Ann O'Connell from The Ohio State University. It was recorded during an educational symposium on multilevel modeling at UTMB sponsored by the Center for Rehabilitation Research using Large Data sets in April 2011.


Design: Quasi-Experimental Research


Research involving administrative datasets or large national surveys typically lacks one or more of the three design criteria that define rigorous "experimental research" designs: manipulation, randomization, and control. While randomized controlled trials (RCTs) are the epitome of experimental research and remain the gold standard for inferring causation, methodology advances over the past 20 years have greatly increased our interest in and understanding of quasi-experimental or "observational" research. A major advantage of existing claims or survey data is that they reflect routine practice for large and representative populations, in contrast to the much smaller and often healthier patient populations recruited in clinical trials. In other words, these datasets capture the characteristics and experiences of everyday patients in everyday clinical settings. Moreover, these resources provide the only way to assess policy- or practice-related changes, the so-called "natural experiments."

The fundamental strength of RCTs is the primary criticism of quasi-experimental research: internal validity - the degree to which the relationship between the treatment and outcome is free from the effects of extraneous factors. In practice, treatment decisions are not randomly assigned. Rather, factors such as prognosis, patient and provider preferences, insurance coverage, and out-of-pocket costs influence who gets what treatment. Thus, socio-demographic and clinical characteristics are not balanced between treated and untreated cohorts. External validity - the degree to which the results can be generalized to persons or settings outside the experimental situation - is generally less of a concern in observational studies, since the experimental situation is routine patients receiving routine care.

When independent variable manipulation and random assignment are beyond the control of the investigator, there are four other design parameters that can strengthen a study's internal validity: 1) cohort identification (incident vs. prevalent users), 2) control or "counterfactual" group, 3) pre-period measurement, and 4) post-period measurement.
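As a rough illustration of how a comparison group and pre- and post-period measurements work together, the sketch below computes a difference-in-differences contrast. This estimator is not named in the text above; it is shown only as one familiar way to combine design parameters 2 through 4. The function name and all outcome values are hypothetical, and the contrast is interpretable only under the assumption that both groups would have followed the same trend without treatment.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical functional-status scores before and after a policy change.
treated_pre = [50, 52, 48]
treated_post = [60, 62, 58]
control_pre = [49, 51, 50]
control_post = [54, 56, 55]

did = difference_in_differences(treated_pre, treated_post, control_pre, control_post)
print(f"difference-in-differences estimate: {did}")
```

Here the treated group improves by 10 points and the comparison group by 5, so only 5 points of the treated group's improvement are attributed to the treatment. Without the comparison group, the full 10-point change would have been credited to it.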


NF Marko & RJ Weil (2010) The role of observational investigations in comparative effectiveness research. Value in Health. 13(8): 989-997.

S Schneeweiss & J Avorn (2005) A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 58: 323-337.

E von Elm et al. (2007) The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med. 147(8): 573-577.

Other Links

STROBE Statement Website: STROBE stands for an international, collaborative initiative of epidemiologists, methodologists, statisticians, researchers and journal editors involved in the conduct and dissemination of observational studies, with the common aim of Strengthening the Reporting of Observational studies in Epidemiology.


Rigorous Quasi-Experimental Comparative Effectiveness Research Study Design by Professor Matthew Maciejewski from Duke University and the Center for Health Services Research at Durham VA Medical Center. This 60-min video was recorded during the Comparative Effectiveness Research with Population-Based Data conference in the Baker Institute at Rice University on July 13, 2012.

Design: Selection Bias and Confounding


In observational studies, participants are not randomly assigned to intervention groups. In fact, individuals receiving a given treatment may be markedly different from those not receiving treatment. Covariates that are independently associated with both treatment and outcome variables are called confounders. Illness severity, for example, would be considered a confounding variable if it influences whether or not a patient receives a given treatment and is also associated with the outcome of interest. Important covariates may not be available in existing datasets. Ignoring group differences in important covariates, whether available or not, can lead to biased estimates of treatment effects. It is important to remember that random error (chance) leads to imprecise results, whereas systematic error (bias) leads to inaccurate results.
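The illness severity example can be made concrete with a small, entirely hypothetical data set in which sicker patients are both more likely to be treated and more likely to have the adverse outcome. The sketch compares a crude risk difference with one computed within severity strata and averaged by stratum size. The counts and function names are illustrative assumptions, not data from any study.

```python
# Hypothetical counts per severity stratum: (number of patients, number of events).
strata = {
    "mild":   {"treated": (20, 2),  "untreated": (80, 8)},
    "severe": {"treated": (80, 40), "untreated": (20, 10)},
}


def crude_risk_difference(strata):
    """Risk difference that ignores severity by pooling all patients."""
    t_n = sum(s["treated"][0] for s in strata.values())
    t_e = sum(s["treated"][1] for s in strata.values())
    u_n = sum(s["untreated"][0] for s in strata.values())
    u_e = sum(s["untreated"][1] for s in strata.values())
    return t_e / t_n - u_e / u_n


def stratified_risk_difference(strata):
    """Average of stratum-specific risk differences, weighted by stratum size."""
    total = 0
    weighted = 0.0
    for s in strata.values():
        (t_n, t_e), (u_n, u_e) = s["treated"], s["untreated"]
        size = t_n + u_n
        weighted += size * (t_e / t_n - u_e / u_n)
        total += size
    return weighted / total


crude = crude_risk_difference(strata)
adjusted = stratified_risk_difference(strata)
print(f"crude: {crude:.2f}, stratified: {adjusted:.2f}")
```

Within each severity level the treated and untreated risks are identical, so the stratified estimate is zero. The crude comparison, however, suggests a 24-percentage-point higher risk among the treated, purely because treated patients are sicker.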

Common approaches to control for group differences include stratified analyses, matching, or multivariable modeling using observed covariates, but these strategies are limited in the number of covariates that can be included, and none address unobserved covariates. Alternative techniques to deal with confounding include sensitivity, propensity score, or instrumental variable analyses.

Sensitivity analysis identifies what the strength and prevalence of an unmeasured confounder would have to be to alter the conclusion of the study. In other words, sensitivity analysis does not rule out the possibility that confounding exists; it describes the circumstances necessary for an unmeasured confounder to negate the observed effect of the treatment (or exposure) on the outcome.
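One simple way to explore this is sketched below, assuming a single binary unmeasured confounder whose effect on the outcome (risk ratio `gamma`) is the same in treated and untreated patients. Under that assumption, a standard external-adjustment formula divides the observed risk ratio by a bias factor that depends on the confounder's strength and its prevalence in each group. The observed risk ratio, the grid of values, and the function names are hypothetical choices for illustration.

```python
def bias_factor(gamma, p_treated, p_untreated):
    """Bias in an observed risk ratio caused by a binary unmeasured confounder.

    gamma: confounder-outcome risk ratio (strength).
    p_treated, p_untreated: confounder prevalence in each group.
    """
    return (p_treated * (gamma - 1) + 1) / (p_untreated * (gamma - 1) + 1)


def adjusted_risk_ratio(observed_rr, gamma, p_treated, p_untreated):
    """Risk ratio that would remain after removing the confounder's bias."""
    return observed_rr / bias_factor(gamma, p_treated, p_untreated)


def scenarios_that_negate(observed_rr, gammas, prevalence_pairs):
    """List (gamma, p_treated, p_untreated) combinations that remove the effect."""
    return [
        (gamma, p1, p0)
        for gamma in gammas
        for (p1, p0) in prevalence_pairs
        if adjusted_risk_ratio(observed_rr, gamma, p1, p0) <= 1.0
    ]


# Hypothetical observed effect and a grid of confounder scenarios.
observed_rr = 1.5
gammas = [1.5, 2.0, 3.0, 5.0]
prevalence_pairs = [(0.3, 0.1), (0.5, 0.1), (0.7, 0.1)]

negating = scenarios_that_negate(observed_rr, gammas, prevalence_pairs)
for gamma, p1, p0 in negating:
    print(f"gamma={gamma}, prevalence treated={p1}, untreated={p0}")
```

The output lists the circumstances under which the observed association would disappear. If only implausibly strong or implausibly imbalanced confounders appear in the list, the finding is relatively robust; if modest ones do, it is fragile.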

Propensity score analysis uses any and all observed covariates to determine the likelihood (conditional probability) that a person belongs to the treatment group. The propensity scores can then be used, through a variety of options, to balance observed covariates and thus, reduce observed confounding.
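A minimal sketch of the idea follows, using hypothetical patient records with a single observed covariate (illness severity). In practice the propensity score is usually estimated from many covariates with a model such as logistic regression; here, as a simplifying assumption, it is just the proportion treated at each severity level. Inverse probability weighting is used as one of the available options for applying the scores; matching and stratification are others. All names and counts are illustrative.

```python
from collections import defaultdict


def make_patients(severity, treated, n, events):
    """Build n hypothetical patient records, the first `events` having the outcome."""
    return [
        {"severity": severity, "treated": treated, "event": 1 if i < events else 0}
        for i in range(n)
    ]


patients = (
    make_patients("mild", 1, 20, 2)
    + make_patients("mild", 0, 80, 8)
    + make_patients("severe", 1, 80, 40)
    + make_patients("severe", 0, 20, 10)
)


def estimate_propensity(patients):
    """Probability of treatment given severity (proportion treated per level)."""
    counts = defaultdict(lambda: [0, 0])  # severity -> [n_treated, n_total]
    for p in patients:
        counts[p["severity"]][0] += p["treated"]
        counts[p["severity"]][1] += 1
    return {level: n_treated / n_total for level, (n_treated, n_total) in counts.items()}


def ipw_risk_difference(patients, propensity):
    """Risk difference after weighting each patient by 1 / P(own treatment status)."""
    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # arm -> [weighted events, weighted n]
    for p in patients:
        ps = propensity[p["severity"]]
        weight = 1 / ps if p["treated"] else 1 / (1 - ps)
        sums[p["treated"]][0] += weight * p["event"]
        sums[p["treated"]][1] += weight
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]


propensity = estimate_propensity(patients)
weighted_rd = ipw_risk_difference(patients, propensity)
print(propensity, round(weighted_rd, 3))
```

Weighting makes the severity mix of the treated and untreated groups the same, so the weighted risk difference reflects only what happens within severity levels. As the text notes, this balances observed covariates only; an unmeasured confounder would be untouched.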

Instrumental variable (IV) analysis involves identifying a variable (instrument) that is associated with treatment, but not directly associated with the outcome. Since all unmeasured factors are part of the error term, selection bias is (likely) present when the error term is correlated with both the outcome and the treatment variable. IV analysis involves 1) modeling treatment as a function of covariates and the instrument, and 2) using this information to 'break the link' with unobserved confounder(s). The unique feature of IV analysis is that it reduces confounding from both observed and unobserved factors.
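The logic can be sketched with a tiny simulated data set, assuming a binary instrument, no covariates, and a treatment effect that is the same for everyone. Under those assumptions the two-step procedure reduces to a simple ratio (often called the Wald estimator): the instrument's effect on the outcome divided by its effect on treatment. The data-generating rules, the true effect of 2, and the function names are all hypothetical.

```python
from statistics import mean


def simulate_rows(true_effect=2, confounder_effect=3):
    """One row per combination of instrument z and unmeasured confounder u.

    The confounder u drives both treatment and outcome but is NOT stored,
    mimicking a variable that is missing from the analyst's data set.
    """
    rows = []
    for z in (0, 1):
        for u in (0, 1):
            d = max(z, u)  # treated if encouraged by z or pushed by u
            y = true_effect * d + confounder_effect * u
            rows.append({"z": z, "d": d, "y": y})
    return rows


def naive_estimate(rows):
    """Treated-minus-untreated difference in mean outcome (confounded)."""
    treated = mean(r["y"] for r in rows if r["d"] == 1)
    untreated = mean(r["y"] for r in rows if r["d"] == 0)
    return treated - untreated


def wald_iv_estimate(rows):
    """Instrument's effect on outcome divided by its effect on treatment."""
    def mean_by_z(key, z):
        return mean(r[key] for r in rows if r["z"] == z)

    reduced_form = mean_by_z("y", 1) - mean_by_z("y", 0)
    first_stage = mean_by_z("d", 1) - mean_by_z("d", 0)
    return reduced_form / first_stage


rows = simulate_rows()
naive = naive_estimate(rows)
iv = wald_iv_estimate(rows)
print(f"naive: {naive}, IV: {iv}")
```

Because the instrument is unrelated to the unmeasured confounder by construction, the IV estimate recovers the simulated effect of 2, while the naive comparison is inflated to 4. With real data the key assumptions (the instrument affects the outcome only through treatment and is unrelated to unmeasured confounders) cannot be fully verified and must be argued on substantive grounds.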


MA Brookhart et al. (2010) Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf. 19: 537-554.

RB D'Agostino (1998) Tutorial in biostatistics: propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statist Med. 17: 2265-2281.

JP Leigh & M Schembri (2004) Instrumental variables technique: cigarette price provided better estimate of effects of smoking on SF-12. J Clin Epidemiol. 57(3): 284-293.

EP Martens et al. (2006) Instrumental variables: application and limitations. Epidemiology. 17: 260-267.

PR Rosenbaum (2005) Sensitivity analysis in observational studies. Encyclopedia of Statistics in Behavioral Science. Vol 4: 1809-1814.

MG Stineman et al. (2008) The effectiveness of inpatient rehabilitation in the acute postoperative phase of care after transtibial or transfemoral amputation: study of an integrated health care delivery system. Arch Phys Med Rehabil. 89: 1863-1872.

TA Stukel et al. (2007) Analysis of observational studies in the presence of treatment selection bias: Effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA. 297(3): 278-285.

Other Links - Video

The videos below cover analytic procedures for dealing with confounding and were recorded during the Comparative Effectiveness Research with Population-Based Data conference in the Baker Institute at Rice University on July 13, 2012.

Important Online Resources


Director's Letter
by David Atkins, M.D., M.P.H., Director, HSR&D

Embracing "Big Data" and Data Science
by Stephan Fihn, M.D., M.P.H., VHA Office of Analytics and Business Intelligence, Washington, D.C.

"Big Data" Challenges for Health Services Research
by David Atkins, M.D., M.P.H., HSR&D, Washington, D.C.

Research Highlights

Using Big Data Meaningfully to Improve Quality and Safety at the Point of Care
by Hardeep Singh, M.D., M.P.H., Houston VA Center for Innovation in Quality, Effectiveness and Safety, Michael E. DeBakey VA Medical Center, Houston, Texas and Dean F. Sittig, Ph.D., University of Texas School of Biomedical Informatics, Houston, Texas

Opportunities for Big Data Research in VA
by Adi V. Gundlapalli, M.D., Ph.D., M.S., Salt Lake Informatics, Decision Enhancement, and Analytic Sciences Center (IDEAS 2.0), VA Salt Lake City Health Care System, Salt Lake City, Utah

Predictive Analytics for Population Surveillance and Personalized Medicine: The Example of Acute Kidney Injury
by Michael E. Matheny, M.D., M.S., M.P.H., Geriatric Research Education Clinical Center, VA Tennessee Valley Healthcare System, Nashville, Tennessee



The CLDR is funded by the National Institutes of Health (grant# P2CHD065702). See About Us for more details.
