The aim of this study was to evaluate GPT-4's ability to interpret photographs of oral mucosal diseases and to generate structured reports from free-text inputs, while exploring the role of prompt engineering in enhancing its performance. A prompt, developed through automatic prompt engineering combined with the expertise of oral physicians, was provided to GPT-4 to generate structured reports for cases of oral mucosal disease. The structured reports comprised 7 fine-grained items: "location", "shape", "number", "size", "clinical manifestation", "border of the lesion", and "diagnosis". A total of 120 cases were used for testing, divided into two datasets: a textbook dataset and an internet dataset. Oral physicians evaluated GPT-4's responses using confusion matrices, from which recall and accuracy were calculated. ANOVA and Wald χ² tests with Bonferroni correction were used for statistical analysis. The 120 included cases of oral mucosal disease comprised a textbook dataset (n = 60) and an internet dataset (n = 60). GPT-4 achieved higher recall on the textbook dataset than on the internet dataset (90.73% vs 89.12%, P = .462).
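For illustration, the sketch below shows one way recall and accuracy could be derived from per-item confusion matrices over the 7 report items; the class, field names, and counts here are assumptions made for this example, not the study's actual evaluation code or data.

```python
from dataclasses import dataclass

# The 7 fine-grained report items named in the study.
REPORT_ITEMS = [
    "location", "shape", "number", "size",
    "clinical manifestation", "border of the lesion", "diagnosis",
]

@dataclass
class ConfusionCounts:
    """Hypothetical per-item confusion-matrix counts."""
    tp: int  # item reported and correct
    fp: int  # item reported but incorrect
    fn: int  # item missing or not evaluable
    tn: int  # item correctly reported as absent

    def recall(self) -> float:
        # Recall = TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    def accuracy(self) -> float:
        # Accuracy = (TP + TN) / all classified instances
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total

# Hypothetical counts for a single item; not data from the study.
location = ConfusionCounts(tp=52, fp=3, fn=5, tn=0)
print(f"recall={location.recall():.2%}, accuracy={location.accuracy():.2%}")
```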