Predicting pathological response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer with two step feature selection and ensemble learning.

0 Người đánh giá. Xếp hạng trung bình 0

Tác giả: Yijing Chen, Ran Ge, You Guo, Chengnan Liu, Changshun Qian, Fangmin Shi, Hui Wang, Shuxin Yang

Ngôn ngữ: eng

Ký hiệu phân loại: 303.385 Prejudice

Thông tin xuất bản: England : Scientific reports , 2025

Mô tả vật lý:

Bộ sưu tập: NCBI

ID: 742582

Thêm vào giỏ Liên kết toàn văn

Patients with locally advanced rectal cancer (LARC) show substantial individual variability and a pronounced imbalance in response distribution to neoadjuvant chemoradiotherapy (nCRT), posing significant challenges to treatment response prediction. This study aims to identify effective predictive biomarkers and develop an ensemble learning-based prediction model to assess the response of LARC patients to nCRT. A two-step feature selection method was developed to identify predictive biomarkers by deriving stable reversal gene pairs through within-sample relative expression orderings (REOs) from LARC patients undergoing nCRT. Preliminary screening utilized four methods-MDFS, Boruta, MCFS, and VSOLassoBag-to form a candidate feature set. Secondary screening ranked these features by permutation importance, applying Incremental Feature Selection (IFS) with an Extreme Gradient Boosting (XGBoost) to determine final predictive gene pairs. The ensemble model BoostForest, combining boosting and bagging, served as the predictive framework, with SHAP employed for interpretability. Through two-step feature selection, the 32-gene pair signature (32-GPS) was established as the final predictive biomarker. In the test set, the model achieved an area under the precision-recall curve (AUPRC) of 0.983 and an accuracy of 0.988. In the validation cohort, the AUPRC was 0.785, with an accuracy of 0.898, indicating strong model performance. The study further demonstrated that BoostForest achieved superior overall performance compared to Random Forest, Support Vector Machine (SVM), and XGBoost. To evaluate the effectiveness of the 32-GPS, its performance was compared with two alternative feature sets: the lasso-gene pair signature (lasso-GPS), derived through lasso regression, and the 15-shared gene pair signature (15-SGPS), consisting of gene pairs identified by all four feature selection methods. The 32-GPS demonstrated superior performance in both comparisons. The two-step feature selection method identified robust predictive biomarkers, and BoostForest outperformed Random Forest, Support Vector Machine, and XGBoost in classification performance and predictive capability.

Tạo bộ sưu tập với mã QR