BACKGROUND: Generative artificial intelligence (GAI) is transforming healthcare in a variety of ways; however, the present utility of GAI for supporting clinicians in rare diseases such as primary immune disorders (PI) is not well studied. Here we evaluate the ability of 6 state-of-the-art large language models (LLMs) to provide clinical guidance about PI.

OBJECTIVE: We sought to quantitatively and qualitatively measure the utility of current, open-source LLMs for diagnosing PI and providing helpful clinical decision support about PI.

METHODS: Five expert clinical immunologists each provided 5 real-world, anonymized PI case vignettes (n=25) via multi-turn prompting to 6 LLMs (OpenAI GPT-4o, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mistral-7B-Instruct-v0.3, Mistral-Large-Instruct-2407, and Mixtral-8x7B-Instruct-v0.1). We assessed the diagnostic accuracy of the LLMs and the quality of their clinical reasoning using the Revised-IDEA (R-IDEA) score. Qualitative LLM assessment was based on immunologist narratives.

RESULTS: Diagnostic accuracy (>88%) and R-IDEA scores (≥8) were superior for 3 models (GPT-4o, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407), with GPT-4o achieving the highest diagnostic accuracy (96.2%). Conversely, the remaining 3 models fell below acceptable accuracy, with rates near 60% or worse, and had poor R-IDEA scores (≤0.55); Mistral-7B-Instruct-v0.3 attained the worst diagnostic accuracy (42.3%). Compared with the 3 best-performing LLMs, the 3 worst-performing LLMs received a substantially lower median R-IDEA score (p<0.001). The intraclass correlation coefficient for R-IDEA score assignments varied substantially by LLM, ranging from good to poor agreement, and did not appear to correlate with either diagnostic accuracy or the median R-IDEA score. Qualitatively, immunologists identified several themes of relevance to PI (e.g., correctness, appropriateness of the differential diagnosis, and relative conciseness of explanations).

CONCLUSIONS: LLMs can support the diagnosis and management of PI; however, further tuning is needed to optimize LLMs for best practice recommendations.