De-identification of clinical notes with pseudo-labeling using regular expression rules and pre-trained BERT.


Authors: Jiyong An, Hyunyoung Baek, Jiyun Kim, Seunggeun Lee, Leonard Sunwoo, Sooyoung Yoo

Language: eng

Classification number: 328.3653 Specific topics of legislative bodies

Publication information: England: BMC Medical Informatics and Decision Making, 2025

Physical description:

Collection: NCBI

ID: 184464

BACKGROUND: De-identification of clinical notes is essential for using the rich information in unstructured text data in medical research. However, only limited work has been done on removing personal information from clinical notes in Korea.

METHODS: Our study used a comprehensive dataset stored in the Note table of the OMOP Common Data Model at Seoul National University Bundang Hospital. This dataset includes 11,181,617 radiology reports and 9,282,477 notes from various other departments (non-radiology reports). From this, 0.1% of the reports (11,182) were randomly selected for training and validation. We used two de-identification strategies to improve performance with limited annotated data. First, in a rule-based approach, we constructed regular expressions from the 1,112 notes annotated by domain experts. Second, using the regular expressions as a labeler, we applied a semi-supervised approach to fine-tune a pre-trained Korean BERT model on pseudo-labeled notes.

RESULTS: Validation was conducted using 342 radiology and 12 non-radiology notes labeled at the token level. On the radiology notes, our rule-based approach achieved 97.2% precision, 93.7% recall, and 96.2% F1 score. For the machine learning approach, KoBERT-NER fine-tuned with 32,000 automatically pseudo-labeled notes achieved 96.5% precision, 97.6% recall, and 97.1% F1 score.

CONCLUSION: Our results show that combining a rule-based approach with machine learning in a semi-supervised way can improve de-identification performance.
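
The pseudo-labeling and token-level evaluation described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the three patterns, the whitespace tokenization, the BIO tag scheme, and the names `PATTERNS`, `pseudo_label`, and `token_prf` are hypothetical, since the abstract does not give the study's regular expressions, tokenizer, or exact metric definitions.

```python
import re

# Hypothetical patterns, for illustration only. Digit lookarounds are used
# instead of \b so a match still ends cleanly when a Korean particle is
# attached directly to the number.
PATTERNS = {
    "DATE": re.compile(r"(?<!\d)\d{4}[-./]\d{1,2}[-./]\d{1,2}(?!\d)"),
    "PHONE": re.compile(r"(?<!\d)0\d{1,2}-\d{3,4}-\d{4}(?!\d)"),
    "ID": re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)"),
}


def pseudo_label(text):
    """Return (token, tag) pairs for whitespace tokens, tagged in BIO style."""
    # Collect character spans of every regex match.
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))

    tagged = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged if it overlaps a matched span.
            if tok.start() < end and tok.end() > start:
                prefix = "B" if tok.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((tok.group(), tag))
    return tagged


def token_prf(gold, pred):
    """Token-level precision, recall, and F1 over non-"O" tags.

    Assumed definition: a predicted tag counts as correct only when it is
    not "O" and equals the gold tag at the same position.
    """
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and p == g)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    note = "CT on 2021-03-15, call 031-787-1234."
    labeled = pseudo_label(note)
    print(labeled)
    gold = ["O", "O", "B-DATE", "O", "B-PHONE"]
    print(token_prf(gold, [tag for _, tag in labeled]))
```

In a pipeline of the kind the abstract describes, pairs such as those returned by `pseudo_label` would serve as training targets for fine-tuning a token-classification model, and a function like `token_prf` would be applied to the expert-labeled validation notes rather than to the pseudo-labels themselves.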