Introduction
Chat Generative Pre-trained Transformer (ChatGPT) has become widely recognized for its ability to generate text, synthesize complex information, and perform a variety of tasks without requiring human specialists for data collection. The latest iteration, ChatGPT-4, is a large multimodal model that accepts both text and image inputs, making it particularly promising for medical applications. However, its efficacy in analyzing radiographic images remains largely unexplored.

Aim
This study aims to (i) address the lack of data on the accuracy of ChatGPT in classifying fractures on radiographs as stable or unstable under the revised Arbeitsgemeinschaft für Osteosynthesefragen/Orthopaedic Trauma Association (AO/OTA) classification system, a task also performed by surgeons, and (ii) compare the agreement achieved by surgeons with that achieved by ChatGPT. We hypothesized that ChatGPT would achieve moderate agreement with orthopedic surgeons.

Materials and methods
Patients diagnosed with pertrochanteric fractures were retrospectively identified. Only patients with both preoperative two-directional plain radiographs and three-dimensional CT (3D-CT) images were eligible for enrollment. Two orthopedic surgeons (observers 1 and 2) and one resident (observer 3) each classified the fractures once, dichotomizing them into A1 (stable) or A2 (unstable) groups based on the AO/OTA classification using two-directional plain radiographs. For the ChatGPT analysis, all anteroposterior images were cropped to the fractured side, given file names that included sex and age, and uploaded to OpenAI's ChatGPT-4. Radiological evaluation prompts were designed to initiate ChatGPT's classification of the uploaded radiographs. A single observer (MN) determined the reference classification by examining 3D-CT images together with plain radiographs; this CT-based judgment of A1 (stable) versus A2 (unstable) served as the benchmark against which the radiograph-based results of the observers and ChatGPT were scored.
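The abstract does not reproduce the prompt wording or interface steps. As an illustration only, the Python sketch below shows how an equivalent image-plus-text query could be issued programmatically through the OpenAI API; the model name, prompt text, and the classify_radiograph helper are assumptions for illustration, not the study's actual protocol (the study uploaded images through the ChatGPT-4 interface).

# Minimal sketch of a programmatic image-plus-text query to a vision-capable
# GPT-4-class model via the OpenAI Python SDK. Prompt wording, model name,
# and file layout are illustrative assumptions, not the study's protocol.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_radiograph(image_path: str, sex: str, age: int) -> str:
    """Ask the model to dichotomize a pertrochanteric fracture as A1 or A2."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Patient: {sex}, {age} years. This is an "
                          "anteroposterior hip radiograph showing a "
                          "pertrochanteric fracture. Under the revised "
                          "AO/OTA classification, answer with exactly "
                          "'A1' (stable) or 'A2' (unstable).")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()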
Results
After exclusions, the cohort consisted of 29 males and 90 females, with a mean age of 87 years. The fractures were classified into A1 (stable) and A2 (unstable) groups based on CT imaging. The A1 group included 50 patients (13 males, 37 females; mean age: 86.2 ± 7.8 years), while the A2 group included 69 patients (16 males, 53 females; mean age: 87.0 ± 7.9 years). Kappa values for fracture classification on plain radiographs by the three observers and ChatGPT, compared with the CT-based gold standard, showed fair to moderate agreement: observer 1, 0.494 (95% CI: 0.337-0.650); observer 2, 0.390 (95% CI: 0.227-0.553); observer 3, 0.360 (95% CI: 0.198-0.521); and ChatGPT, 0.420 (95% CI: 0.255-0.585). ChatGPT demonstrated accuracy, sensitivity, specificity, and positive and negative predictive values comparable to those of the human observers, suggesting moderate reliability.
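For readers reproducing this type of analysis, the sketch below shows one standard way to compute Cohen's kappa with an approximate large-sample 95% CI, together with accuracy, sensitivity, specificity, and positive and negative predictive values against a CT-based reference. The label arrays are synthetic placeholders, not the study data, and the SE formula is the common large-sample approximation.

# Sketch: Cohen's kappa with an approximate 95% CI, plus accuracy,
# sensitivity, specificity, PPV, and NPV against a CT-based reference.
# The label arrays below are synthetic placeholders, not the study's data.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

rng = np.random.default_rng(0)
# Placeholder labels (1 = A2/unstable, 0 = A1/stable) for 119 patients.
ct_reference = rng.integers(0, 2, size=119)
rater = np.where(rng.random(119) < 0.8, ct_reference, 1 - ct_reference)

kappa = cohen_kappa_score(ct_reference, rater)

# Approximate large-sample standard error of kappa:
# SE = sqrt(p_o * (1 - p_o) / (n * (1 - p_e)^2))
n = len(ct_reference)
p_o = np.mean(ct_reference == rater)                   # observed agreement
marg_ref = np.bincount(ct_reference, minlength=2) / n  # marginal proportions
marg_rat = np.bincount(rater, minlength=2) / n
p_e = np.sum(marg_ref * marg_rat)                      # chance agreement
se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
ci = (kappa - 1.96 * se, kappa + 1.96 * se)

# Diagnostic metrics, treating A2 (unstable) as the positive class.
tn, fp, fn, tp = confusion_matrix(ct_reference, rater).ravel()
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

print(f"kappa={kappa:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
print(f"acc={accuracy:.3f}, sens={sensitivity:.3f}, spec={specificity:.3f}, "
      f"PPV={ppv:.3f}, NPV={npv:.3f}")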
Conclusion
This study demonstrates that ChatGPT can classify pertrochanteric fractures into A1 (stable) and A2 (unstable) groups under the revised AO/OTA classification system. Its moderate agreement with CT-based assessments (κ = 0.420) is comparable to the performance of orthopedic surgeons. Moreover, ChatGPT is straightforward to integrate into clinical workflows, as it requires minimal data collection for training.