BACKGROUND AND OBJECTIVE: Diabetic retinopathy (DR) is a serious complication of diabetes that can cause blindness if it is not diagnosed in its early stages. Manual diagnosis by ophthalmologists is labor-intensive and time-consuming, particularly in overburdened healthcare systems, which highlights the need for automated, accurate, and personalized machine learning approaches to early DR detection and treatment. Although several deep learning models have been widely used for DR diagnosis, Vision Transformers (ViTs) have recently demonstrated superior image analysis capabilities by capturing long-range dependencies. In this work, we propose a hybrid model, ResViT FusionNet, to improve the accuracy of DR detection. METHODS: For multiclass classification of DR fundus images, we designed ResViT FusionNet, which integrates the robust feature extraction of Convolutional Neural Networks (CNNs), specifically ResNet50, with the long-range, image-wide context captured by ViTs. To balance the dataset and improve model generalization and robustness, we applied preprocessing and data augmentation techniques: rescaling of pixel values, horizontal flipping, rotation, and zooming. To make the model's predictions more transparent and interpretable, especially in clinical settings, we also employed Explainable AI (XAI) techniques. Local Interpretable Model-agnostic Explanations (LIME) was used to interpret individual predictions, and Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to generate heatmaps highlighting the regions of the fundus images that contributed most to the classification decisions. These visual explanations enhance trust in the model and help healthcare professionals understand the factors underlying its predictions.
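The four augmentation steps named under METHODS (rescaling, horizontal flipping, rotation, zooming) can be illustrated with a minimal, dependency-free sketch. Everything beyond the names of those four operations is an assumption made for illustration and is not taken from the paper: the grayscale nested-list image format, the flip probability, the rotation and zoom ranges, nearest-neighbour sampling, zero fill outside the image, and all function names. In practice these steps would more likely be done with an image library.

```python
import math
import random


def rescale(image, scale=255.0):
    """Map raw pixel values (assumed 0..255) into the range [0, 1]."""
    return [[px / scale for px in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def _resample(image, inverse_map):
    """Nearest-neighbour resampling about the image centre.

    inverse_map takes an output offset (dx, dy) from the centre and returns
    the source offset to read from. Out-of-bounds reads are filled with 0.
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            sx, sy = inverse_map(x - cx, y - cy)
            ix, iy = round(sx + cx), round(sy + cy)
            inside = 0 <= iy < h and 0 <= ix < w
            out_row.append(image[iy][ix] if inside else 0.0)
        out.append(out_row)
    return out


def rotate(image, degrees):
    """Rotate about the centre by the given angle in degrees."""
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return _resample(
        image,
        lambda dx, dy: (cos_a * dx + sin_a * dy, -sin_a * dx + cos_a * dy),
    )


def zoom(image, factor):
    """Zoom about the centre; factor > 1 zooms in, factor < 1 zooms out."""
    return _resample(image, lambda dx, dy: (dx / factor, dy / factor))


def augment(image, rng, max_rotation=20.0, zoom_range=0.2):
    """Apply the four steps once; all ranges here are hypothetical."""
    out = rescale(image)
    if rng.random() < 0.5:  # assumed flip probability
        out = horizontal_flip(out)
    out = rotate(out, rng.uniform(-max_rotation, max_rotation))
    out = zoom(out, rng.uniform(1.0 - zoom_range, 1.0 + zoom_range))
    return out


# Usage: a tiny 3x3 "image" and a seeded generator for reproducibility.
sample = [[0, 64, 128], [255, 32, 16], [8, 4, 2]]
augmented = augment(sample, random.Random(0))
```

Passing a seeded `random.Random` makes each augmented sample reproducible, which helps when comparing training runs.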
RESULTS: The experimental findings show that ResViT FusionNet outperforms leading CNNs and standard ViT models. The model achieved a precision of 0.9307, recall of 0.9300, F1 score of 0.9275, accuracy of 0.9301, Matthews correlation coefficient (MCC) of 0.8944, kappa of 0.8935, and Jaccard index of 0.8749. CONCLUSIONS: These findings demonstrate that ResViT FusionNet is a powerful tool for assisting ophthalmologists in making precise, personalized, and timely diagnostic decisions in the evaluation of DR. By combining the strengths of CNNs and ViTs and offering transparent insight into its decision-making process through XAI techniques such as LIME and Grad-CAM, the model enhances diagnostic accuracy and trustworthiness, making it an advanced solution for DR detection and classification and a valuable asset in clinical settings.
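For readers unfamiliar with the seven metrics reported under RESULTS, the sketch below shows one way to compute them from true and predicted class labels. The abstract does not say how the per-class scores were averaged, so support-weighted averaging is an assumption here, as is reading "kappa" as Cohen's kappa. The MCC uses the standard multiclass generalization. The function name and the toy labels are hypothetical and are not the paper's data.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Return a dict of multiclass metrics computed from label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    k, n = len(labels), len(y_true)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1

    true_counts = [sum(cm[i]) for i in range(k)]
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    correct = sum(cm[i][i] for i in range(k))

    def safe_div(a, b):
        return a / b if b else 0.0

    # Per-class scores, averaged with weights proportional to class support
    # (assumed averaging scheme).
    precision = recall = f1 = jaccard = 0.0
    for i in range(k):
        tp = cm[i][i]
        p = safe_div(tp, pred_counts[i])
        r = safe_div(tp, true_counts[i])
        weight = true_counts[i] / n
        precision += weight * p
        recall += weight * r
        f1 += weight * safe_div(2 * p * r, p + r)
        jaccard += weight * safe_div(tp, true_counts[i] + pred_counts[i] - tp)

    accuracy = correct / n

    # Cohen's kappa: agreement beyond what the class marginals predict.
    cross = sum(t * p for t, p in zip(true_counts, pred_counts))
    expected = cross / (n * n)
    kappa = safe_div(accuracy - expected, 1.0 - expected)

    # Multiclass Matthews correlation coefficient.
    denom = sqrt(
        (n * n - sum(p * p for p in pred_counts))
        * (n * n - sum(t * t for t in true_counts))
    )
    mcc = safe_div(correct * n - cross, denom)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
        "kappa": kappa,
        "jaccard": jaccard,
    }


# Usage with toy labels for three classes (not the paper's data).
report = classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```

Under support-weighted averaging, recall equals accuracy by construction, so the two values carry the same information in that setting.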