Bridging Vision and Language in Medical Imaging: Towards Robust Automated Report Generation from PET/CT Scans
Abstract
Vision-Language Models (VLMs) have achieved significant success in general-domain multimodal reasoning, yet their application to medical imaging remains limited, especially for specialized modalities such as PET/CT. In this study, we introduce the Vietnamese Positron Emission Tomography - Vision-Language Model (ViPET-VLM), a novel pipeline designed for medical report generation and visual question answering on PET/CT data. ViPET-VLM integrates a fusion module that combines morphological information from CT with functional signals from PET, forming a richer multimodal representation. To enhance clinical reliability, we propose a regularization mechanism with specialized loss functions that not only promote diagnostic accuracy but also guide the model's attention to critical regions of interest in the images. ViPET-VLM was evaluated on a comprehensive, expert-validated PET/CT dataset and demonstrated marked improvements over current state-of-the-art methods in both report generation and medical question answering. The model shows promise for improving the accuracy and clinical applicability of VLMs in medical imaging.
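The abstract names two technical components: a PET/CT fusion module and an attention-guiding regularization loss. The minimal PyTorch sketch below illustrates one plausible reading of each; the cross-attention fusion design, the KL-based attention regularizer, and all names and shapes (PETCTFusion, attention_guided_loss, roi_mask, lam) are illustrative assumptions, not the published ViPET-VLM implementation.

# A minimal sketch, assuming per-patch tokens from separate CT and PET encoders,
# a decoder producing report logits, and expert-annotated region-of-interest masks.
# None of these modules are the authors' actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PETCTFusion(nn.Module):
    """Fuse co-registered CT (morphological) and PET (functional) tokens (hypothetical design)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # PET tokens query the CT tokens so that functional uptake is
        # grounded in the underlying anatomy.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # concatenate-then-project fusion head
        self.norm = nn.LayerNorm(dim)

    def forward(self, pet_tokens: torch.Tensor, ct_tokens: torch.Tensor) -> torch.Tensor:
        # pet_tokens, ct_tokens: (batch, num_patches, dim)
        attended, _ = self.cross_attn(query=pet_tokens, key=ct_tokens, value=ct_tokens)
        fused = self.proj(torch.cat([pet_tokens, attended], dim=-1))
        return self.norm(fused)  # (batch, num_patches, dim) multimodal tokens

def attention_guided_loss(
    logits: torch.Tensor,    # (batch, seq_len, vocab) report-decoder outputs
    targets: torch.Tensor,   # (batch, seq_len) ground-truth report token ids
    attn_map: torch.Tensor,  # (batch, num_patches) model attention over image patches
    roi_mask: torch.Tensor,  # (batch, num_patches) expert ROI distribution (sums to 1)
    lam: float = 0.1,
) -> torch.Tensor:
    """Report cross-entropy plus a KL term pulling attention toward expert ROIs."""
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    # Penalize attention mass placed outside the clinician-annotated regions.
    kl = F.kl_div(torch.log(attn_map + 1e-8), roi_mask, reduction="batchmean")
    return ce + lam * kl

Under these assumptions, the KL term realizes the "guide the model's attention to critical regions" idea: lam trades off report fluency against fidelity to expert-marked regions, and setting lam = 0 recovers plain report-generation training.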
Keywords
PET/CT, Medical Report Generation, Vision-Language Models
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.