Distributed learning enables collaborative training of machine learning models without cross-institutional data sharing, thereby addressing privacy concerns. However, variability in local quality control can degrade model performance, while systematic human visual inspection is time-consuming and conflicts with the goal of keeping data inaccessible outside the acquisition centers. This work proposes a novel self-supervised method that fully automatically identifies and eliminates harmful data during distributed model training. Harmful data is defined as samples that, when included in training, increase the misdiagnosis rate. The method was tested for Parkinson's disease classification using neuroimaging data from 83 centers, with a small number of harmful samples injected by simulation. The proposed method reliably identified harmful images; centers contributing exclusively harmful data were easier to identify than single harmful images within otherwise good datasets. Although evaluated only on neuroimaging data, the presented method is application-agnostic and represents a step towards automated quality control in distributed learning.
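To make the operational definition of "harmful data" concrete, the minimal sketch below flags a training sample as harmful if its inclusion raises the error (misdiagnosis) rate on a held-out set. The leave-one-out retraining strategy, the logistic-regression classifier, and the function name `is_harmful` are illustrative assumptions only; they do not describe the self-supervised procedure proposed in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def is_harmful(candidate_idx, X_train, y_train, X_val, y_val):
    """Flag a training sample as harmful if including it increases the
    misdiagnosis (validation error) rate compared to leaving it out.

    Note: the leave-one-out retraining used here is an illustrative
    assumption, not the paper's self-supervised method.
    """
    mask = np.ones(len(X_train), dtype=bool)
    mask[candidate_idx] = False

    # Error rate when the candidate sample is excluded from training.
    clf_without = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
    err_without = 1.0 - clf_without.score(X_val, y_val)

    # Error rate when the candidate sample is included.
    clf_with = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    err_with = 1.0 - clf_with.score(X_val, y_val)

    # Harmful if inclusion makes held-out performance worse.
    return err_with > err_without
```

In a distributed setting this criterion could only be evaluated locally at each center, since samples never leave the acquisition sites; the sketch ignores that constraint and serves purely to illustrate the definition.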