Differentiating between GPT-generated and human-written feedback for radiology residents.

Authors: Andrew D Chung, Benjamin YM Kwan, Arsalan Rizwan, Nick Rogoza, Zier Zhou

Language: eng

Published: United States : Current Problems in Diagnostic Radiology, 2025

Collection: NCBI

ID: 489377

PURPOSE: Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about whether these models can generate informative comments matching those of human experts, and about the associated challenges. This study compares human-written feedback with GPT-3.5-generated feedback for radiology residents and examines how well raters can differentiate between the two sources.

METHODS: Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019-2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. Eleven of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with the actual comments. Two faculty raters and GPT-3.5 read each comment and predicted whether it was human-written or GPT-generated.

RESULTS: Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Agreement between the responses provided by GPT-3.5 and the human raters was low (κ = -0.237). Human raters were also more accurate (80.5%) at identifying actual and synthetic comments than GPT-3.5 (50%).

CONCLUSION: Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse at distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms that produce higher-quality feedback, supporting faculty development.
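The agreement figure reported in the results is a kappa statistic, which compares observed rater agreement against agreement expected by chance. As a minimal sketch of how such a value arises, the following computes Cohen's kappa for two raters labelling the same comments; the rater verdicts here are hypothetical illustrations, not data from the study.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' labels over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts on four comments ("human" vs "gpt"); not study data.
human_rater = ["human", "human", "gpt", "gpt"]
gpt_rater = ["human", "gpt", "gpt", "human"]
print(cohens_kappa(human_rater, gpt_rater))  # 0.0: agreement no better than chance
```

A kappa near zero means agreement is at chance level, and a negative kappa, as reported in the abstract, means the two raters agreed even less often than chance would predict.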

LIBRARY - HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY (HUTECH)

Tel: (028) 36225755 | Email: tt.thuvien@hutech.edu.vn
