Evaluating Large Language Model Performance to Support the Diagnosis and Management of Patients with Primary Immune Disorders.

Authors: Aaron T Chin, Daniel V DiGiacomo, Cullen Dutmer, Jocelyn R Farmer, Yingya Li, Mei-Sing Ong, Nicholas L Rider, Kirk Roberts, Guergana Savova

Language: eng

Classification number: 920.71 Men

Publication information: United States : The Journal of allergy and clinical immunology, 2025

Physical description:

Collection: NCBI

ID: 164981

BACKGROUND: Generative artificial intelligence (GAI) is transforming healthcare in a variety of ways; however, the present utility of GAI for supporting clinicians in rare diseases such as primary immune disorders (PI) is not well studied. Here we evaluate the ability of 6 state-of-the-art large language models (LLMs) to provide clinical guidance about PI. OBJECTIVE: We sought to quantitatively and qualitatively measure the utility of current, open-source LLMs for diagnosing and providing helpful clinical decision support about PI. METHODS: Five expert clinical immunologists provided 5 real-world (n=25), anonymized PI case vignettes via multi-turn prompting to 6 LLMs (OpenAI GPT-4o, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mistral-7B-Instruct-v0.3, Mistral-Large-Instruct-2407, and Mixtral-8x7B-Instruct-v0.1). We assessed the diagnostic accuracy of the LLMs and the quality of clinical reasoning using the Revised-IDEA (R-IDEA) score. Qualitative LLM assessment was made by immunologist narratives. RESULTS: Performance accuracy (>88%) and R-IDEA scores (>=8) were superior for 3 models (GPT-4o, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407), with GPT-4o achieving the highest diagnostic accuracy (96.2%). Conversely, the remaining 3 models fell below acceptable accuracy rates (near 60% or worse) and had poor R-IDEA scores (<=0.55), with Mistral-7B-Instruct-v0.3 attaining the worst diagnostic accuracy (42.3%). Compared with the 3 best-performing LLMs, the 3 worst-performing LLMs received a substantially lower median R-IDEA score (p<0.001). Interclass correlation coefficient for R-IDEA score assignments varied substantially by LLM, ranging from good to poor agreement, and did not appear to correlate with either diagnostic accuracy or the median R-IDEA score. Qualitatively, immunologists identified several themes (e.g., correctness, differential diagnosis appropriateness, relative conciseness of explanations) of relevance to PI. CONCLUSIONS: LLMs can support the diagnosis and management of PI; however, further tuning is needed to optimize LLMs for best practice recommendations.