The Center for Health and Clinical Outcomes Research (H-COR) Works-in-Progress Seminar Series recently hosted Dr. Vibhuti Gupta, associate professor in the Department of Biostatistics and Data Science, for a session on multimodal AI for precision medicine. Dr. Gupta outlined the challenges of combining clinical, imaging, and genomic data into unified predictive models, and presented a framework he is building at UTMB using pre-trained foundation models.
Why Single-Modality Models Fall Short in Cancer Risk Prediction
Dr. Gupta opened by framing precision medicine as a multimodal problem. A patient's clinical journey generates many types of data at different time points, from pathology reports and imaging to genomic sequences and electronic health records. Most published AI models analyze one of these sources in isolation. That fragmented view, Dr. Gupta argued, misses the holistic picture clinicians need.
He illustrated this with his prior NIH-funded prostate cancer project, which integrated histopathology images, RNA sequencing data, and clinical variables for 401 patients from The Cancer Genome Atlas (TCGA). In preliminary testing, the multimodal model achieved higher recall and area under the curve (AUC) than any single-modality model.
But Dr. Gupta was candid about the limitations: the cohort came from a single public dataset, image processing alone required weeks of computation, and the model has not yet been validated on hospital-based patient data.
A Proposed Framework Using Foundation Models for Scalable Integration
Dr. Gupta proposed a framework that uses pre-trained foundation models to process each data type before combining them into a unified patient representation.
Figure: Simplified overview of Dr. Gupta's proposed multimodal AI framework. Each data source is processed through a domain-specific foundation model, producing embeddings that are fused into a combined patient representation for downstream prediction tasks.
In this approach, clinical text passes through language models, histopathology images through pathology-specific models, genomic data through omics models, and wearable sensor streams through time-series models. Each produces a standardized numeric embedding. Those embeddings are fused into a combined representation that can support tasks like risk stratification or treatment response prediction.
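The talk described this fusion step only at a conceptual level. As a rough illustration, the sketch below (in PyTorch) shows one plausible realization using simple concatenation fusion; the modality names, embedding sizes, and fusion strategy are assumptions made here for illustration, not details from the presentation.

    import torch
    import torch.nn as nn

    class LateFusionModel(nn.Module):
        """Fuse per-modality embeddings into one patient representation.

        All modality names and dimensions are illustrative; the upstream
        foundation models are assumed to emit fixed-size embeddings.
        """

        def __init__(self, embed_dims, fused_dim=256, n_classes=2):
            super().__init__()
            # One projection per modality maps each embedding to a shared size.
            self.projections = nn.ModuleDict(
                {name: nn.Linear(dim, fused_dim) for name, dim in embed_dims.items()}
            )
            # Concatenation fusion followed by a small prediction head.
            self.head = nn.Sequential(
                nn.Linear(fused_dim * len(embed_dims), fused_dim),
                nn.ReLU(),
                nn.Linear(fused_dim, n_classes),
            )

        def forward(self, embeddings):
            # embeddings: dict of modality name -> (batch, dim) tensor,
            # produced upstream by the pre-trained foundation models.
            parts = [proj(embeddings[name]) for name, proj in self.projections.items()]
            return self.head(torch.cat(parts, dim=-1))

    # Hypothetical embedding sizes for the four modalities named in the talk.
    dims = {"text": 768, "pathology": 1024, "omics": 512, "wearable": 128}
    model = LateFusionModel(dims)
    batch = {name: torch.randn(4, dim) for name, dim in dims.items()}
    logits = model(batch)  # (4, 2) scores for, e.g., binary risk stratification

Concatenation is the simplest fusion choice; attention-based or gated fusion would slot into the same structure by replacing the head's first layer.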
The key advantage is adaptability: new data types can be added without rebuilding the entire pipeline. And because the foundation models are pre-trained on large biomedical datasets, the computational cost of fine-tuning on local data drops significantly.
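Dr. Gupta did not spell out the fine-tuning setup, but one common way this cost reduction is realized is to freeze the pre-trained encoders and train only a small task-specific head. A minimal PyTorch sketch of that pattern, with a toy placeholder encoder and illustrative sizes standing in for any real foundation model:

    import torch
    import torch.nn as nn

    # Toy stand-in for a pre-trained biomedical foundation model; in
    # practice this would be loaded from a released checkpoint.
    encoder = nn.Sequential(nn.Linear(1024, 768), nn.ReLU(), nn.Linear(768, 768))
    encoder.requires_grad_(False)  # freeze: no gradients computed for these weights

    head = nn.Linear(768, 2)  # small task-specific layer, the only part trained

    # Only the head's parameters enter the optimizer, so local fine-tuning
    # updates, and stores optimizer state for, a tiny fraction of the model.
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

    frozen = sum(p.numel() for p in encoder.parameters())
    trained = sum(p.numel() for p in head.parameters())
    print(f"frozen: {frozen:,} params; trained: {trained:,} params")

Because the frozen encoder never enters the optimizer, no gradient or optimizer state is kept for its weights, which is where most of the training-time savings come from.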
Where Attendees Offered Direction
The Q&A surfaced practical questions that sharpened the presentation. One attendee asked how multimodal AI performance compares with the clinical staging methods physicians use today. Dr. Gupta acknowledged that published benchmarks against standard clinical workflows remain sparse in the prostate cancer literature.
Another participant raised the question of cost. Spatial biology and multi-omics data collection can run into hundreds of thousands of dollars, and that acquisition cost will have to figure into any future case for clinical deployment. Dr. Gupta agreed this belongs in later-stage evaluation but noted that the project is still in its prototype phase.
Several attendees also suggested looking at existing clinical foundation models for lessons on moving from prototype to clinical relevance.
The session closed with a direct request from Dr. Gupta for UTMB collaborators across disease areas: curating high-quality, harmonized multimodal data remains his biggest barrier, and cross-campus partnerships are essential to advance the framework from concept to clinical testing.
The H-COR Works-in-Progress Seminar Series is held monthly and open to researchers across UTMB. For more information, visit the H-COR website.