Early Detection of Diabetes Using Machine Learning

Anh Vu Tran1, Thi Hong Tham Pham1, Dan Vy Vu1, Quang Huy Hoang1, Thi Viet Huong Pham2
1 Ha Noi University of Science and Technology, Ha Noi, Vietnam
2 International School, Vietnam National University, Hanoi, Vietnam

Abstract

Early detection of diabetes mellitus is essential for mitigating severe complications such as renal failure and cardiovascular disease. This study presents an optimized machine learning framework for the early diagnosis of diabetes using the Pima Indians Diabetes dataset. To address data quality issues in this dataset, we implemented a preprocessing pipeline comprising median-based imputation of missing values, interquartile range (IQR) outlier removal, and the Synthetic Minority Over-sampling Technique (SMOTE) to correct class imbalance. We evaluated two ensemble classifiers, Random Forest (RF) and XGBoost, each tuned with Grid Search for systematic hyperparameter optimization. Experimental results across three scenarios showed that feature selection substantially affects predictive performance; specifically, retaining Insulin as a core feature while excluding Skin Thickness improved model stability. The Random Forest model achieved a peak accuracy of 96.61% with near-perfect recall, while XGBoost reached 95.76% accuracy. These results exceed those reported for several contemporary models and underscore the value of combining careful data preprocessing with ensemble learning in clinical diagnostics. The resulting framework offers healthcare providers a decision-support tool that can facilitate timely intervention and improved patient outcomes.
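
The three preprocessing steps named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The function names, the toy rows (two made-up columns standing in for Glucose and Insulin), the use of 0 as the missing-value placeholder, the 1.5 x IQR fence, and the neighbour count k=3 are all assumptions chosen for illustration; the SMOTE step is a simplified interpolation on unscaled features.

```python
import math
import random
import statistics


def impute_median(rows, cols, missing=0):
    """Replace `missing` placeholders in `cols` with the median of observed values."""
    out = [list(r) for r in rows]
    for c in cols:
        observed = [r[c] for r in rows if r[c] != missing]
        med = statistics.median(observed)
        for r in out:
            if r[c] == missing:
                r[c] = med
    return out


def iqr_filter(rows, labels, cols, k=1.5):
    """Drop rows with any value outside [Q1 - k*IQR, Q3 + k*IQR] in `cols`."""
    bounds = {}
    for c in cols:
        q1, _, q3 = statistics.quantiles([r[c] for r in rows], n=4)
        iqr = q3 - q1
        bounds[c] = (q1 - k * iqr, q3 + k * iqr)
    kept = [
        (r, y) for r, y in zip(rows, labels)
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in cols)
    ]
    return [r for r, _ in kept], [y for _, y in kept]


def smote(rows, labels, minority=1, k=3, seed=0):
    """SMOTE-style oversampling: interpolate between a minority row and one of
    its k nearest minority neighbours until both classes have equal counts."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_needed = (len(rows) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (m for m in minority_rows if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return rows + synthetic, labels + [minority] * len(synthetic)


# Toy data (made up): columns are [glucose, insulin]; 0 marks a missing value.
X = [[85, 94], [89, 0], [78, 88], [110, 100], [103, 83], [97, 0],
     [92, 90], [100, 846], [148, 130], [160, 0], [137, 120], [155, 140]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

X_imp = impute_median(X, cols=[0, 1])           # step 1: median imputation
X_f, y_f = iqr_filter(X_imp, y, cols=[0, 1])    # step 2: IQR outlier removal
X_bal, y_bal = smote(X_f, y_f, minority=1)      # step 3: class balancing

print(len(X), len(X_f), len(X_bal))
print(y_bal.count(0), y_bal.count(1))
```

In a library-based workflow these steps would more commonly be written with pandas, imbalanced-learn's `SMOTE`, and scikit-learn's `GridSearchCV` for the hyperparameter search; the plain-Python version is used here only to make each step explicit.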
