Comparative ranking of marginal confounding impact of natural language processing-derived versus structured features in pharmacoepidemiology.

Authors: Lily G Bessette, Thomas Deramus, Kueiyu Joshua Lin, Kerry Ngan, Joseph M Plasek, Theodore N Tsacogianis, Janick G Weberpals, Richard D Wyss, Jie Yang, Li Zhou

Language: eng

Classification number: 025.523 Cooperative information services

Publication information: United States: Computers in Biology and Medicine, 2025

Collection: NCBI

ID: 199490

OBJECTIVE: To explore the ability of natural language processing (NLP) methods to identify confounder information beyond what can be identified using claims codes alone for pharmacoepidemiology.
METHODS: We developed a retrospective cohort for high- vs low-dose proton pump inhibitors from linked Medicare claims (2008-2017) and electronic health record data for patients with a history of peptic ulcer disease or osteoarthritis. Clinical notes authored in the year before the first dispensing date were processed by off-the-shelf tools: bag-of-n-grams, latent Dirichlet allocation, a linguistics-focused tool, BERT sentence embeddings, BioBERT word embeddings, and GloVe word embeddings. Candidate features were ranked using the Bross formula, a simple way to rank the marginal confounding impact of binary features on estimated causal effects.
RESULTS: Among patients with peptic ulcer disease, the proportion of NLP-derived features in the Bross rankings of marginal confounding impact trended from 39% in the top 100 to 77% in the top 500 to 93% in the top 5000. More specifically, the top 25 confounders came largely from factors identified by domain experts and from structured fields, and the marginal impact of these confounders was stronger than that of the others. Features ranked 25 to 50 include those identified by a linguistics-focused tool and embeddings, whereas features ranked 50 to 100 include more embeddings and bag-of-n-grams. Beyond rank 100, the curve flattens, meaning that the marginal impact of the remaining potential confounders gets smaller. Similarly, among patients with osteoarthritis, NLP-derived features trended from 66% in the top 100 to 84% in the top 500 to 95% in the top 5000 when the outcome was gastrointestinal bleed, and from 47% in the top 100 to 81% in the top 500 to 94% in the top 5000 when the outcome was acute kidney injury. Similar trends were observed in the information gain data, though NLP-derived features had higher baselines.
CONCLUSIONS: NLP contributed to finding large numbers of features that can supplement claims data and prespecified variables to help provide additional confounder information. We found that unsupervised off-the-shelf NLP tools can scale to generate large numbers of features appropriate for high-dimensional proxy adjustment and pharmacoepidemiology use cases.
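ILLUSTRATION (not part of the published abstract): The ranking step described in METHODS and the top-k proportions reported in RESULTS can be sketched roughly as follows. This is a minimal sketch, assuming the Bross bias multiplier as it is commonly written in the high-dimensional propensity score literature, with the covariate-outcome relative risk inverted when it is below 1 and features ranked by the absolute log of the multiplier. The abstract does not state which variant the authors used. The function names, feature names, and all numbers below are hypothetical.

```python
import math

def bross_bias(p_c1, p_c0, rr_cd):
    """Bross bias multiplier for one binary covariate.

    p_c1  -- prevalence of the covariate among the exposed
    p_c0  -- prevalence of the covariate among the unexposed
    rr_cd -- relative risk linking the covariate to the outcome
    """
    if rr_cd < 1:
        # Assumed convention: invert protective associations.
        rr_cd = 1.0 / rr_cd
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)

def rank_features(features):
    """Sort (name, source, p_c1, p_c0, rr_cd) tuples by |log(bias)|, largest first."""
    return sorted(
        features,
        key=lambda f: abs(math.log(bross_bias(f[2], f[3], f[4]))),
        reverse=True,
    )

def nlp_share(ranked, k):
    """Fraction of the top-k ranked features whose source is 'nlp'."""
    top = ranked[:k]
    return sum(1 for f in top if f[1] == "nlp") / len(top)

# Hypothetical summary statistics, for illustration only.
features = [
    ("prior_gi_bleed", "structured", 0.30, 0.10, 2.0),
    ("ngram_melena", "nlp", 0.20, 0.15, 1.5),
    ("embedding_dim_17", "nlp", 0.50, 0.50, 3.0),
    ("nsaid_use", "structured", 0.10, 0.25, 0.5),
]

ranked = rank_features(features)
for name, source, p_c1, p_c0, rr_cd in ranked:
    print(f"{name:18s} {source:10s} bias={bross_bias(p_c1, p_c0, rr_cd):.3f}")
print("NLP share in top 2:", nlp_share(ranked, 2))
```

A feature that is equally prevalent in both exposure groups (here `embedding_dim_17`) gets a multiplier of 1 and falls to the bottom of the ranking regardless of how strongly it predicts the outcome, which is why the formula measures marginal confounding impact and not predictive strength.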