MedFusion: A Unified Multimodal Framework for Visual Question Answering and Explainable Medical Recommendation

Satyajit Mahapatra; Jibitesh Mishra; Kumar Janardan Patra; Sanjit Kumar Dash; Aliazar Deneke Deferisha

doi:10.3844/jcssp.2026.1539.1551

Research Article Open Access

Satyajit Mahapatra¹, Jibitesh Mishra¹, Kumar Janardan Patra¹, Sanjit Kumar Dash¹ and Aliazar Deneke Deferisha²

¹ Schools of Computer Science, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
² Faculty of Computing and Software Engineering, Arba Minch University, Arba Minch, Ethiopia

Abstract

In clinical decision-making, the ability to ask visual questions about medical images and receive accurate, personalized, and interpretable recommendations can significantly enhance practitioner support systems. This paper presents MedFusion, a unified multimodal framework that integrates Visual Question Answering (VQA), personalized medical recommendation, and explainability within a single architecture. The proposed model employs co-attention–based visual–textual fusion augmented with retrieval-enhanced reasoning to improve answer grounding, while personalized recommendations are generated using a shared multimodal representation supported by GAN-guided feature augmentation. To enhance transparency, the framework provides attention-based heatmaps and natural-language rationales for both answers and recommendations. Extensive experiments on VQA-RAD, EHRXQA, and Med-RecX demonstrate that MedFusion outperforms state-of-the-art medical VQA and recommendation baselines, achieving a 7.4% improvement in VQA accuracy, reducing RMSE to 0.91, and improving human-rated interpretability to 4.5/5. Ablation studies confirm the effectiveness of retrieval augmentation, GAN-guided enhancement, and joint multi-task learning. These results indicate that MedFusion offers a robust and explainable decision-support solution, advancing the deployment of trustworthy, user-adaptive AI systems in real-world healthcare environments.
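The co-attention fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the formulation, so everything here is an assumption: the code uses a simplified, parameter-free variant in which image-region and question-token features are scored against each other, each side is summarized by its attention-weighted average, and the two summaries are concatenated. The function names, the max-pooling over the affinity matrix, and the toy feature vectors are hypothetical and are not taken from MedFusion.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Attention-weighted average of a list of vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def co_attention(regions, tokens):
    """Illustrative co-attention (assumed form, no learned weights).

    regions: list of image-region feature vectors
    tokens:  list of question-token feature vectors (same dimension)
    """
    scale = math.sqrt(len(regions[0]))
    # Affinity between every image region and every question token.
    affinity = [[dot(r, t) / scale for t in tokens] for r in regions]
    # Score each region by its best-matching token, and vice versa.
    region_scores = [max(row) for row in affinity]
    token_scores = [
        max(affinity[i][j] for i in range(len(regions)))
        for j in range(len(tokens))
    ]
    region_attn = softmax(region_scores)
    token_attn = softmax(token_scores)
    # Summarize each modality with its own attention weights, then concatenate.
    attended_image = weighted_sum(region_attn, regions)
    attended_question = weighted_sum(token_attn, tokens)
    fused = attended_image + attended_question
    return fused, region_attn, token_attn


# Toy inputs (hypothetical): three image regions and two question tokens.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tokens = [[1.0, 0.0], [0.9, 0.1]]
fused, region_attn, token_attn = co_attention(regions, tokens)
```

In a sketch like this, `region_attn` is the kind of per-region weight that could be rendered as an attention heatmap over the image. A practical model would add learned projections and would feed `fused` to the answer and recommendation heads.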

Journal of Computer Science

Volume 22 No. 5, 2026, 1539-1551

DOI: https://doi.org/10.3844/jcssp.2026.1539.1551

Submitted On: 2 August 2025
Published On: 20 May 2026

How to Cite: Mahapatra, S., Mishra, J., Patra, K. J., Dash, S. K. & Deferisha, A. D. (2026). MedFusion: A Unified Multimodal Framework for Visual Question Answering and Explainable Medical Recommendation. Journal of Computer Science, 22(5), 1539-1551. https://doi.org/10.3844/jcssp.2026.1539.1551

Copyright: © 2026 Satyajit Mahapatra, Jibitesh Mishra, Kumar Janardan Patra, Sanjit Kumar Dash and Aliazar Deneke Deferisha. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Multimodal Learning
VQA
Medical Recommendation
XAI
Co-Attention
Retrieval-Augmented Reasoning
CGAN
Healthcare Informatics