Robust privacy amidst innovation with large language models through a critical assessment of the risks.

Authors: Yao-Shun Chuang, Yu-Chun Hsu, Xiaoqian Jiang, Noman Mohammed, Atiquer Rahman Sarkar

Language: eng

Classification number:

Publication information: England : Journal of the American Medical Informatics Association : JAMIA, 2025

Physical description:

Collection: NCBI

ID: 726276

OBJECTIVE: This study evaluates the integration of electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to enhance healthcare data management and patient care, focusing on using advanced language models to create secure, Health Insurance Portability and Accountability Act-compliant synthetic patient notes for global biomedical research.

MATERIALS AND METHODS: The study used de-identified and re-identified versions of the MIMIC III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Text generation employed templates and keyword extraction for contextually relevant notes, with One-shot generation for comparison. Privacy was assessed by analyzing protected health information (PHI) occurrence and co-occurrence, while utility was evaluated by training an ICD-9 coder using synthetic notes. Text quality was measured using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity metrics to compare synthetic notes with source notes for semantic similarity.

RESULTS: The analysis of PHI occurrence and text utility via the ICD-9 coding task showed that the keyword-based method had low risk and good performance. One-shot generation exhibited the highest PHI exposure and PHI co-occurrence, particularly in geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. Re-identified data consistently outperformed de-identified data.

DISCUSSION: Privacy analysis revealed a critical balance between data utility and privacy protection, influencing future data use and sharing.

CONCLUSION: This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.
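The text-quality comparison named in the methods (ROUGE and cosine similarity between a synthetic note and its source note) can be illustrated with a minimal sketch. The abstract does not state which ROUGE variant or which vector representation the authors used, so the code below assumes ROUGE-1 recall and bag-of-words count vectors purely for illustration. The function names and the two example sentences are hypothetical and are not taken from MIMIC III or from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference, candidate):
    """ROUGE-1 recall: share of reference unigrams (with multiplicity)
    that also appear in the candidate. Assumed variant, not the study's."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / total


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors.
    Assumed representation; the study may have used embeddings instead."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical source note and synthetic note, invented for this sketch.
source_note = "patient admitted with chest pain and shortness of breath"
synthetic_note = "patient presented with chest pain"

print(rouge1_recall(source_note, synthetic_note))
print(cosine_similarity(source_note, synthetic_note))
```

Both scores range from 0 to 1. Higher values mean the synthetic note reuses more of the source note's vocabulary, which tends to favor utility but can also signal that more of the original wording, including any PHI, has carried over.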
