BACKGROUND: Most machine learning applications in assisted reproduction have focused on predicting the likelihood of pregnancy. In the present study, we aim to investigate which machine learning models are most effective in predicting the occurrence of a high proportion (> 30 %) of 3PN/MPN zygotes in individual IVF cycles.
METHODS: Eight machine learning algorithms were trained and compared, including AdaBoost and Gaussian NB. Data from IVF cycles carried out from September 2015 to September 2019 were used as the training set. Cycle data from October 2019 to June 2020 were used as the validation set to verify the trained model. Cycles with a 3PN/MPN zygote proportion higher than 30 % were classified as high 3PN/MPN zygote proportion cycles.
RESULTS: The AdaBoost algorithm performed best in both model construction and external validation. In both the training and validation sets, age, basal FSH, FSH and E2 levels on the day of gonadotrophin (GN) stimulation, and FSH and LH levels on the day of HCG were significantly higher in patients with 3PN/MPN > 30 % than in patients with 3PN/MPN ≤ 30 %. AFC, AMH, E2 level on HCG day, and total number of oocytes were lower in patients with 3PN/MPN > 30 % than in patients with 3PN/MPN ≤ 30 %. The top five predictors were the number of oocytes retrieved, age, male factor infertility, AFC, and total days of GN stimulation.
CONCLUSION: By applying a suitable machine learning algorithm, we can potentially predict the risk of a high proportion of 3PN/MPN zygotes in individual IVF cycles before insemination and avoid polyspermy by using ICSI as the fertilization method.
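
The labeling rule and the date-based split described in METHODS can be illustrated with a minimal Python sketch. It is not the study's code: the `Cycle` record, its field names, and the helper functions are hypothetical. The choice of denominator (all zygotes in the cycle) and the exact first and last days of each period are assumptions, since the abstract gives only months and does not define the denominator.

```python
from dataclasses import dataclass
from datetime import date

# From the abstract: a proportion strictly above 30 % counts as "high".
HIGH_THRESHOLD_PERCENT = 30

# Assumed day boundaries; the abstract gives only months and years.
TRAIN_START, TRAIN_END = date(2015, 9, 1), date(2019, 9, 30)
VALID_START, VALID_END = date(2019, 10, 1), date(2020, 6, 30)


@dataclass
class Cycle:
    """Hypothetical record for one IVF cycle."""
    cycle_date: date
    n_zygotes: int   # assumed denominator: all zygotes in the cycle
    n_3pn_mpn: int   # zygotes scored as 3PN or MPN


def is_high_3pn_mpn(cycle: Cycle) -> bool:
    """Return True when the 3PN/MPN proportion is strictly above 30 %."""
    if cycle.n_zygotes == 0:
        return False  # no zygotes: treated as not high (an assumption)
    # Integer arithmetic avoids floating-point issues exactly at 30 %.
    return cycle.n_3pn_mpn * 100 > HIGH_THRESHOLD_PERCENT * cycle.n_zygotes


def split_by_date(cycles):
    """Split cycles into training and validation sets by calendar period."""
    train = [c for c in cycles if TRAIN_START <= c.cycle_date <= TRAIN_END]
    valid = [c for c in cycles if VALID_START <= c.cycle_date <= VALID_END]
    return train, valid
```

Splitting by calendar period rather than at random means the validation cycles all postdate the training cycles, so the check resembles prospective use of the model.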