Behind the mask: Random and selective masking in transformer models applied to specialized social science texts.

Authors: Joan C Timoneda, Sebastián Vallejo Vera

Language: eng

Classification number: 413.1 Specialized dictionaries

Publication information: United States: PLoS One, 2025

Collection: NCBI

ID: 473404

Transformer models such as BERT and RoBERTa are increasingly popular in the social sciences to generate data through supervised text classification. These models can be further trained through Masked Language Modeling (MLM) to increase performance in specialized applications. MLM uses a default masking rate of 15 percent, and few works have investigated how different masking rates may affect performance. Importantly, there are no systematic tests on whether selectively masking certain words improves classifier accuracy. In this article, we further train a set of models to classify fake news around the coronavirus pandemic using 15, 25, 40, 60 and 80 percent random and selective masking. We find that a masking rate of 40 percent, both random and selective, improves within-category performance but has little impact on overall performance. This finding has important implications for scholars looking to build BERT and RoBERTa classifiers, especially those where one specific category is more relevant to their research.
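
To make the contrast between random and selective masking concrete, here is a minimal token-level sketch in plain Python. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how words are selected, so the `keywords` set, the example sentence, and the function names `random_mask` and `selective_mask` are all hypothetical. The sketch also leaves out the detail that BERT-style MLM replaces only part of the chosen tokens with the mask symbol.

```python
import random

MASK = "[MASK]"


def random_mask(tokens, rate, rng):
    """Mask round(rate * len(tokens)) positions chosen uniformly at random."""
    n_mask = round(rate * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def selective_mask(tokens, rate, keywords, rng):
    """Mask the same number of positions, drawing keyword positions first.

    Assumption: "selective" means prioritizing a hand-picked keyword set;
    the article's actual selection rule may differ.
    """
    n_mask = round(rate * len(tokens))
    preferred = [i for i, t in enumerate(tokens) if t.lower() in keywords]
    others = [i for i, t in enumerate(tokens) if t.lower() not in keywords]
    rng.shuffle(preferred)
    rng.shuffle(others)
    # Keyword positions fill the budget first; the rest is filled at random.
    positions = set((preferred + others)[:n_mask])
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


# Hypothetical example sentence (10 tokens) and keyword set.
tokens = "the new vaccine claim about the coronavirus spread quickly online".split()
keywords = {"vaccine", "coronavirus"}

rng = random.Random(0)  # fixed seed so the run is repeatable
for rate in (0.15, 0.25, 0.40, 0.60, 0.80):
    print(rate, " ".join(random_mask(tokens, rate, rng)))
print("selective 0.40:", " ".join(selective_mask(tokens, 0.40, keywords, rng)))
```

At a 40 percent rate on this 10-token sentence, both functions mask four tokens; the selective variant always includes the two keyword positions, while the random variant may or may not.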
