An External Validation Study on Two Pre-Trained Large Language Models for Multimodal Prognostication in Laryngeal and Hypopharyngeal Cancer: Integrating Clinical, Treatment, and Radiomic Data to Predict Survival Outcomes with Interpretable Reasoning

2025 Bioengineering 0 citations

Abstract

Background: Laryngeal and hypopharyngeal cancers (LHCs) exhibit heterogeneous outcomes after definitive radiotherapy (RT). Large language models (LLMs) may enhance prognostic stratification by integrating complex clinical and imaging data. This study validated two pre-trained LLMs, GPT-4o-2024-08-06 and Gemma-2-27b-it, for outcome prediction in LHC.

Methods: Ninety-two patients with non-metastatic LHC treated with definitive (chemo)radiotherapy at Linkou Chang Gung Memorial Hospital (2006–2013) were retrospectively analyzed. First-order and 3D radiomic features were extracted from intra- and peritumoral regions on pre- and mid-RT CT scans. LLMs were prompted with clinical variables, radiotherapy notes, and radiomic features to classify patients as high- or low-risk for death, recurrence, and distant metastasis. Model performance was assessed using sensitivity, specificity, AUC, Kaplan–Meier survival analysis, and McNemar tests.

Results: Integration of radiomic features significantly improved prognostic discrimination over clinical/RT plan data alone for both LLMs. For death prediction, pre-RT radiomics were the most predictive: GPT-4o achieved a peak AUC of 0.730 using intratumoral features, while Gemma-2-27b reached 0.736 using peritumoral features. For recurrence prediction, mid-RT peritumoral features yielded optimal performance (AUC = 0.703 for GPT-4o; AUC = 0.709 for Gemma-2-27b). Kaplan–Meier analyses confirmed statistically significant separation of risk groups: pre-RT intra- and peritumoral features for overall survival (for both GPT-4o and Gemma-2-27b, p < 0.05), and mid-RT peritumoral features for recurrence-free survival (p = 0.028 for GPT-4o; p = 0.017 for Gemma-2-27b). McNemar tests revealed no significant performance difference between the two LLMs when augmented with radiomics (all p > 0.05), indicating that the open-source model achieved comparable accuracy to its proprietary counterpart. Both models generated clinically coherent, patient-specific rationales explaining risk assignments, enhancing interpretability and clinical trust.

Conclusions: This external validation demonstrates that pre-trained LLMs can serve as accurate, interpretable, and multimodal prognostic engines for LHC. Pre-RT radiomic features are critical for predicting mortality and metastasis, while mid-RT peritumoral features uniquely inform recurrence risk. The comparable performance of the open-source Gemma-2-27b-it model suggests a scalable, cost-effective, and privacy-preserving pathway for the integration of LLM-based tools into precision radiation oncology workflows to enhance risk stratification and therapeutic personalization.
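
The evaluation measures named in the abstract (sensitivity, specificity, AUC, and the McNemar test) can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: that each model outputs a single binary high-/low-risk label, in which case the trapezoidal AUC reduces to the mean of sensitivity and specificity, and that the McNemar test is the exact two-sided binomial variant. The function names and the ten-patient toy data are hypothetical and are not taken from the study.

```python
from math import comb


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    y_true: 1 = event occurred (e.g. death), 0 = no event.
    y_pred: 1 = classified high-risk, 0 = classified low-risk.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def binary_auc(sensitivity, specificity):
    """AUC of a single-threshold classifier (trapezoidal rule)."""
    return (sensitivity + specificity) / 2


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar p-value comparing two classifiers.

    Only discordant patients count: those where exactly one of the
    two classifiers assigned the correct label.
    """
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b)
                 if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b)
                 if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the classifiers never disagree on correctness
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy cohort of ten patients (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_model_a = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
pred_model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

sens_a, spec_a = sensitivity_specificity(y_true, pred_model_a)
auc_a = binary_auc(sens_a, spec_a)
p_value = mcnemar_exact(y_true, pred_model_a, pred_model_b)
print(f"sensitivity={sens_a:.3f} specificity={spec_a:.3f} AUC={auc_a:.3f}")
print(f"McNemar exact p={p_value:.3f}")
```

Because the McNemar test uses only the patients on whom the two models disagree in correctness, a small cohort with few discordant pairs has limited power to detect a difference, which is worth keeping in mind when reading a non-significant result.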

Publication Info

Year: 2025
Type: article
Volume: 12
Issue: 12
Pages: 1345-1345
Citations: 0
Access: Closed

Citation Metrics

OpenAlex: 0
Influential: 0
CrossRef: 0

Cite This

Wing-Keen Yap, Shih-Chun Cheng, Chia-Hsin Lin et al. (2025). An External Validation Study on Two Pre-Trained Large Language Models for Multimodal Prognostication in Laryngeal and Hypopharyngeal Cancer: Integrating Clinical, Treatment, and Radiomic Data to Predict Survival Outcomes with Interpretable Reasoning. Bioengineering, 12(12), 1345-1345. https://doi.org/10.3390/bioengineering12121345

Identifiers

DOI
10.3390/bioengineering12121345
