Optimizing N-linked Glycosylation Site Prediction in Human Proteins with Ensemble Stacking and Cross-Validation

Mubina Malik; Jaimin Undavia

doi:10.3844/jcssp.2024.1753.1765

Research Article Open Access

Mubina Malik¹ and Jaimin Undavia¹

¹ Department of Computer Science and Applications, CMPICA, CHARUSAT, Charotar University of Science and Technology (CHARUSAT), CHARUSAT Campus, Changa, India

Abstract

Glycosylation is the most frequent post-translational modification of proteins in all domains of life, and it affects many biological activities. The most significant of these modifications is N-linked glycosylation, which is associated with various human diseases, including diabetes, cancer, inflammation, Alzheimer's disease, and atherosclerosis. This article illustrates how recent advances in biological knowledge are increasingly being addressed with computational methods. Moreover, identifying N-linked glycosylation sites helps in understanding human biological systems and the mechanism of glycosylation. Because experimental identification is time-consuming and costly, machine learning techniques have become very important for predicting N-linked glycosylation in human proteins. This article proposes an ensemble machine learning approach for N-linked glycosylation prediction that integrates updated, experimentally verified databases (UniProtKB, dbPTM, and nGlycositeAtlas) with an optimal window size of 21. MMseqs2 clustering with a threshold of 0.3 was used to eliminate duplicate and similar protein sequences during dataset preparation. A total of 9040 features were extracted using various descriptors, including sequence, structural, and physicochemical features. ANOVA F-score, chi-squared (CHI2), and mutual information were used as ensemble feature selection techniques; combining their results yielded 182 features for final model training. The model was then trained using cross-validation and ensemble stacking with four base classifiers: SVM, LR, XGBoost, and RF. The results show that ensemble stacking with cross-validation gives more reliable and promising predictions than the individual base classifiers, achieving an accuracy of 99.99%, precision of 99.98%, recall of 100%, AUC of 99.94%, MCC of 99.96%, and F-score of 99.99%.
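To make the window size of 21 concrete, the following is a minimal sketch of how fixed-length sequence windows might be cut around candidate sites. It is not the authors' code. It assumes that every asparagine (N) is treated as a candidate site and that positions beyond the sequence termini are padded with a placeholder symbol `X`; the abstract states neither detail, and the function name `extract_windows` is hypothetical.

```python
from typing import List, Tuple

WINDOW = 21   # window size reported in the abstract
PAD = "X"     # assumed padding symbol for positions beyond the termini


def extract_windows(sequence: str, window: int = WINDOW,
                    pad: str = PAD) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every asparagine in `sequence`.

    Assumption: each N is a candidate site; labelling a window as
    glycosylated or not would come from the annotated databases.
    """
    flank = window // 2  # 10 residues on each side when window is 21
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "N":
            # In the padded string this residue sits at index i + flank,
            # so the slice [i, i + window) is centred on it.
            windows.append((i + 1, padded[i:i + window]))
    return windows


if __name__ == "__main__":
    for position, fragment in extract_windows("MKNASTLLN"):
        print(position, fragment)
```

Each returned fragment has the candidate asparagine at its centre (index 10 of 21), so descriptors computed from it see ten residues of context on either side.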

Journal of Computer Science

Volume 20 No. 12, 2024, 1753-1765

DOI: https://doi.org/10.3844/jcssp.2024.1753.1765

Submitted On: 21 July 2024 Published On: 20 November 2024

How to Cite: Malik, M. & Undavia, J. (2024). Optimizing N-linked Glycosylation Site Prediction in Human Proteins with Ensemble Stacking and Cross-Validation. Journal of Computer Science, 20(12), 1753-1765. https://doi.org/10.3844/jcssp.2024.1753.1765

Copyright: © 2024 Mubina Malik and Jaimin Undavia. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Machine Learning
Ensemble Stacking
XGBoost
Random Forest
SVM
Cross Validation
Protein N-Linked Glycosylation