A Comparative Analysis of SMOTE and CSSF Techniques for Diabetes Classification Using Imbalanced Data
- 1 Institute of Visual Informatics, Universiti Kebangsaan Malaysia, Bangi Selangor, Malaysia
Abstract
Diabetes, a prevalent chronic metabolic disorder, poses a significant burden on healthcare systems worldwide. Accurate and timely diagnosis is crucial for effective management and for preventing complications. Machine learning is a promising solution but often struggles with class imbalance within datasets, particularly the underrepresentation of diabetic cases. To address this issue, we introduce Cluster-based Synthetic Sample Filtering (CSSF), a method that improves synthetic sample quality through clustering and filtering. Building upon the Synthetic Minority Over-sampling Technique (SMOTE), CSSF generates synthetic samples within clusters while eliminating noisy instances, thereby improving classification accuracy and reliability. Comparative analysis demonstrates CSSF's effectiveness in mitigating class imbalance. Initial models achieved 67% accuracy, which improved to 82% after SMOTE preprocessing. CSSF further raised accuracy to 90%. Notably, Support Vector Machines (SVM), neural networks (deep learning) and random forest achieved 92% accuracy after CSSF preprocessing. Decision tree and K-Nearest Neighbors (KNN) models also performed well after CSSF preprocessing. CSSF consistently outperformed SMOTE in precision, recall and F1-score. Recognizing the importance of ethical AI practices, this study also addresses ethical considerations and potential biases in machine learning for healthcare data analysis, promoting fairness, transparency and responsible AI use. This research underscores the need for ethical and effective approaches to class imbalance in diabetes classification.
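The oversample-then-filter idea described in the abstract can be illustrated with a small sketch. This is not the authors' CSSF implementation: the function names `smote_like` and `filter_synthetic`, the toy two-feature data, and the filtering rule (keep a synthetic point only if it lies closer to the minority-class centroid than to the majority-class centroid) are all assumptions made for illustration, and a single centroid per class stands in for whatever clustering step the paper actually uses.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Basic SMOTE-style interpolation (illustrative, not a library API).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def filter_synthetic(synthetic, minority, majority):
    """Hypothetical noise filter (an assumption, not the paper's rule):
    drop synthetic points that sit nearer the majority centroid."""
    c_min, c_maj = centroid(minority), centroid(majority)
    return [s for s in synthetic if math.dist(s, c_min) < math.dist(s, c_maj)]


# Toy data: 4 minority (diabetic) and 6 majority samples with two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3)]
majority = [(5.0, 5.0), (5.5, 4.8), (4.9, 5.2), (5.1, 5.1), (5.3, 4.9), (4.8, 5.0)]

synthetic = smote_like(minority, n_new=len(majority) - len(minority))
kept = filter_synthetic(synthetic, minority, majority)
balanced_minority = minority + kept
```

On well-separated toy data like this, every interpolated point stays inside the minority region and passes the filter. The filter only matters when the classes overlap and interpolation produces points in majority territory, which is the noisy case the abstract refers to.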
DOI: https://doi.org/10.3844/jcssp.2024.1146.1165
Copyright: © 2024 Bashar Hamad Aubaidan, Rabiah Abdul Kadir and Mohamad Taha Ijab. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Imbalanced Datasets
- SMOTE
- CSSF
- Synthetic Minority Over-Sampling Technique
- Cluster-Based Synthetic Sample Filtering
- Class Imbalance