Assessing the performance of ChatGPT and Bard/Gemini against radiologists for Prostate Imaging-Reporting and Data System classification based on prostate multiparametric MRI text reports.


Authors: Tristan Barrett, Iztok Caglic, Dimitri A Kessler, Yi-Hsin Kuo, Kang-Lung Lee, Nadeem Shaida

Language: eng

Classification number: 539.7214 Atomic and nuclear physics

Publication information: England: The British Journal of Radiology, 2025

Physical description:

Collection: NCBI

ID: 251897

OBJECTIVES: Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign Prostate Imaging-Reporting and Data System (PI-RADS) categories based on clinical text reports.

METHODS: One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by 2 uroradiologists, ChatGPT-3.5 (GPT-3.5), ChatGPT-4o mini (GPT-4), Bard, and Gemini. Original report classifications were considered definitive.

RESULTS: Out of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5. The two radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated versions of the LLMs increased to 83% (GPT-4) and 79% (Gemini). In low-suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94 and 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77 and 0.94, respectively), whereas for high-probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95 and 0.98, respectively) outperformed Bard and Gemini (F1: 0.71 and 0.87, respectively). Bard assigned a non-existent PI-RADS 6 category (a "hallucination") for 2 patients. Inter-reader agreements (κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini were 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively.

CONCLUSIONS: Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared to their predecessors.

ADVANCES IN KNOWLEDGE: This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.
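
The abstract reports accuracy, F1 scores for grouped categories, and κ agreement against the original reports. As a rough illustration of how such figures can be derived from paired labels, the following is a minimal sketch only. It assumes unweighted Cohen's κ and a one-vs-rest F1 on a grouped category; the abstract does not state which variants the authors used. The label lists are hypothetical toy data, not the study's data, and all function and variable names are illustrative.

```python
from collections import Counter

def accuracy(reference, predicted):
    """Fraction of reports where the reader matches the reference category."""
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)

def f1_for_group(reference, predicted, group):
    """One-vs-rest F1 for a set of categories treated as the positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r in group and p in group for r, p in pairs)
    fp = sum(r not in group and p in group for r, p in pairs)
    fn = sum(r in group and p not in group for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(reference)
    observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: product of marginal frequencies, summed over categories.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels (NOT the study's data): ten reports, one reader.
toy_reference = ["1-2", "1-2", "1-2", "1-2", "3", "4", "4", "5", "5", "5"]
toy_reader = ["1-2", "1-2", "1-2", "3", "3", "4", "5", "5", "5", "4"]

print("accuracy:", accuracy(toy_reference, toy_reader))
print("F1 (PI-RADS 1-2):", f1_for_group(toy_reference, toy_reader, {"1-2"}))
print("F1 (PI-RADS 4-5):", f1_for_group(toy_reference, toy_reader, {"4", "5"}))
print("kappa:", cohen_kappa(toy_reference, toy_reader))
```

In this toy case the reader confuses PI-RADS 4 and 5 twice, which lowers accuracy and κ but leaves the grouped PI-RADS 4-5 F1 untouched. That may help explain how a reader can score well on a grouped F1 while showing lower overall agreement.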
