Primary bone tumors (PBTs) present significant diagnostic challenges because they are heterogeneous and can closely resemble bone infections. This study aimed to develop an ensemble deep learning framework that integrates multicenter radiographs and extensive clinical features to accurately differentiate PBTs from bone infections. We compared the ensemble model with four imaging models based solely on radiographs: EfficientNet B3, EfficientNet B4, Vision Transformer, and Swin Transformer. Patients were split into an external dataset (N = 423) and an internal dataset comprising training (N = 1044), test (N = 354), and validation (N = 171) sets. The ensemble model outperformed the imaging models, achieving areas under the curve (AUCs) of 0.948 and 0.963 on the internal and external sets, respectively, with accuracies of 0.881 and 0.895. Its performance surpassed that of junior and mid-level radiologists and was comparable to that of senior radiologists (accuracy: 83.6%). These findings underscore the potential of deep learning to enhance diagnostic precision for PBTs and bone infections. (Research Registration Unique Identifying Number (UIN): researchregistry10483; details are available at https://www.researchregistry.com/register-now#home/registrationdetails/6693845995ba110026aeb754/ .)
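To make the reported metrics concrete, the following minimal sketch shows one common way to combine an imaging model with a clinical-feature model and then score the result by AUC and accuracy. It is an illustration under stated assumptions, not the study's implementation: the abstract does not specify how the ensemble fuses its inputs, so simple probability averaging (soft voting) is assumed here, and the function names, the 0.5 decision threshold, and the toy labels and probabilities are all hypothetical.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average per-patient probabilities across models.

    ASSUMPTION: plain soft voting; the source does not state its fusion rule.
    """
    n_models = len(prob_sets)
    return [sum(per_patient) / n_models for per_patient in zip(*prob_sets)]


def accuracy(labels: Sequence[int], probs: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of patients classified correctly at a hypothetical threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """AUC as the probability that a random positive case is scored above
    a random negative case, with ties counted as one half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data only (hypothetical): 1 = PBT, 0 = bone infection.
labels = [1, 1, 0, 0, 1, 0]
imaging_probs = [0.9, 0.4, 0.3, 0.6, 0.8, 0.2]   # radiograph-only model
clinical_probs = [0.7, 0.8, 0.1, 0.2, 0.6, 0.5]  # clinical-feature model

fused_probs = soft_vote([imaging_probs, clinical_probs])

print("imaging AUC:", auc(labels, imaging_probs))
print("fused AUC:", auc(labels, fused_probs))
print("fused accuracy:", accuracy(labels, fused_probs))
```

In this toy example the fused scores separate the two classes better than the imaging scores alone, which is the kind of gain the ensemble comparison measures; the numbers themselves carry no clinical meaning.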