BACKGROUND: Distinguishing multiple non-small cell lung cancers (NSCLCs) as multiple primary lung cancers (MPLCs) or intrapulmonary metastases (IPMs) is critical but remains challenging. The aim of this study was to develop and validate machine learning (ML) models based on molecular features for estimating the probability of MPLC or IPM in patients presenting with multiple NSCLCs. METHODS: A total of 72 patients with multiple NSCLCs and 157 surgically resected tumor lesions, treated from January 2012 to January 2018 at two institutions, were included for developing and testing the models. Specifically, 46 patients with 103 tumors that were defined as definitive MPLC or IPM according to the International Association for the Study of Lung Cancer (IASLC) criteria were used to develop the models. They were split into training and validation sets using stratified random sampling and five-fold cross-validation. The developed models were tested in the other 26 patients, whose tumors were undetermined by traditional methods. Whole-exome sequencing (WES) was performed on all included tumor samples. Four molecular features were calculated to characterize tumor relatedness and served as model inputs: genetic divergence, shared mutation number, Pearson correlation coefficient, and early mutation number. Decision trees (DT), random forests (RF), and gradient boosting decision trees (GBDT) were employed, with performance assessed by area under the curve (AUC), accuracy, precision, recall, and F1 score in the validation set. Disease-free survival (DFS) was used to evaluate model performance in the test cohort. Clinical and genetic characteristics were then compared between the MPLC and IPM populations. RESULTS: All four molecular features differed significantly between MPLC and IPM patients in the development cohort. That is, MPLC exhibited higher genetic divergence and lower shared mutation number, Pearson correlation, and early mutation number than IPM (P<0.002). The DT, RF, and GBDT models were developed with these features and achieved mean AUCs of 0.94 [standard deviation (SD) 0.09], 1.00 (SD 0.00), and 1.00 (SD 0.00) in the validation set, respectively. The DT, RF, and GBDT models consistently classified the undetermined multiple NSCLCs as MPLC (n=15) and IPM (n=11). MPLC patients identified by the ML models had significantly longer DFS than IPM patients [hazard ratio =0.21; 95% confidence interval (CI): 0.04-1.0; P=0.04]. MPLC patients had a relatively higher prevalence of a family history of cancer in first-degree relatives, and more than half of these patients reported a family history of lung cancer. EGFR remained the most commonly mutated driver gene in both the MPLC and IPM populations. CONCLUSIONS: ML models based on molecular features effectively discriminate primary tumors from metastases in multiple NSCLCs, improving the accuracy of multiple NSCLC diagnosis and assisting clinical decision-making, particularly in challenging cases.
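As an illustration only, two of the four pairwise features named in METHODS (shared mutation number and Pearson correlation coefficient) could be computed along the following lines. This is a minimal sketch, not the study's pipeline. It assumes each tumor is represented as a mapping from a mutation identifier to its variant allele frequency (VAF), that the correlation is taken over the union of mutations with absent mutations scored as 0.0, and that the function names and example mutation labels are hypothetical. The abstract does not define genetic divergence or early mutation number precisely enough to sketch them, so they are omitted.

```python
from math import sqrt


def shared_mutation_count(vaf_a, vaf_b):
    """Count mutations detected in both tumors (assumed definition)."""
    return len(vaf_a.keys() & vaf_b.keys())


def pearson_vaf_correlation(vaf_a, vaf_b):
    """Pearson correlation of VAFs over the union of mutations.

    Assumption: a mutation absent from a tumor is scored with VAF 0.0.
    Returns 0.0 when the correlation is undefined (fewer than two
    mutations, or zero variance in either tumor).
    """
    keys = sorted(vaf_a.keys() | vaf_b.keys())
    n = len(keys)
    if n < 2:
        return 0.0
    xs = [vaf_a.get(k, 0.0) for k in keys]
    ys = [vaf_b.get(k, 0.0) for k in keys]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical tumor pair with fully overlapping mutations (IPM-like pattern).
related_a = {"EGFR:L858R": 0.40, "TP53:R273H": 0.30, "KRAS:G12C": 0.10}
related_b = {"EGFR:L858R": 0.40, "TP53:R273H": 0.30, "KRAS:G12C": 0.10}

# Hypothetical tumor pair with no overlap (MPLC-like pattern).
unrelated_a = {"EGFR:L858R": 0.40}
unrelated_b = {"KRAS:G12C": 0.40}
```

Under these assumptions, a pair with many shared mutations and a high correlation would point toward IPM, and a pair with few shared mutations and a low correlation toward MPLC, matching the direction of the differences reported in RESULTS.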