Development and validation of the provider documentation summarization quality instrument for large language models.


Authors: Majid Afshar, Kyle Burton, Emma Croxford, Cris Ebby, Elliot First, Yanjun Gao, Jillian Gorski, Cherodeep Goswami, Matthew Kalscheur, Samy Khalil, Frank Liao, Brian Patterson, Nicholas Pellegrino, Marie Pisani, Tyler Rubeor, Miranda Schnier, Peter Stetson, Graham Wills, Karen Wong

Language: eng

Classification number: 338.9 Economic development and growth

Publication information: England : Journal of the American Medical Informatics Association : JAMIA, 2025

Physical description:

Collection: NCBI

ID: 744084

OBJECTIVES: As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.

MATERIALS AND METHODS: Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach's α for structural validity, inter-rater reliability (ICC and Krippendorff's α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.

RESULTS: Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = -0.200, P = .029) and Organized (ρ = -0.190, P = .037). The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P < .002).

DISCUSSION: The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.

CONCLUSIONS: The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.
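
For readers unfamiliar with the internal-consistency statistic reported in the abstract, the following is a minimal sketch of the standard Cronbach's α formula, α = k/(k-1) · (1 - Σ item variances / variance of total scores). It is not the authors' analysis code. The function name `cronbach_alpha` and the toy ratings are hypothetical and purely illustrative; they are not data from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of ratings.

    ratings: list of rows, one row per rated summary,
             each row holding the k item scores for that summary.
    """
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 items and 2 rated summaries")
    # Sample variance of each item (column) across summaries.
    item_vars = [variance(col) for col in zip(*ratings)]
    # Sample variance of the per-summary total score.
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 4 summaries scored on 3 items (not study data).
toy_ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
print(round(cronbach_alpha(toy_ratings), 3))
```

Values close to 1 indicate that the items move together across summaries. A real analysis would typically also report a confidence interval, as the abstract does, which this sketch does not compute.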
