Current deep learning methods for diagnosing Alzheimer's disease (AD) typically rely on analyzing all or part of high-resolution 3D volumetric features, which demands expensive computational resources and powerful GPUs, particularly when multimodal data are used. In contrast, lightweight cortical surface representations offer a more efficient way to quantify AD-related changes across cortical regions, such as alterations in cortical structure, impaired glucose metabolism, and the deposition of pathological biomarkers like amyloid-β (Aβ) and tau. Despite these advantages, few studies have addressed AD diagnosis with multimodal surface-based data. This study pioneers a method that leverages multimodal, lightweight cortical surface features extracted from MRI and PET scans as an alternative to computationally intensive 3D volumetric features. Our model adopts a middle-fusion design with a cross-attention mechanism to efficiently integrate features from different modalities. Experimental evaluations on the ADNI series datasets, using T1-weighted MRI and ¹⁸F-fluorodeoxyglucose (FDG) PET, demonstrate that the proposed model outperforms volume-based methods in both early AD diagnosis accuracy and computational efficiency. The model's effectiveness is further validated on the combination of T1-weighted MRI, Aβ PET, and tau PET scans, where it likewise yields favorable results. Our findings highlight the potential of surface-based transformer models as a superior alternative to conventional volume-based approaches.
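To make the middle-fusion idea concrete, below is a minimal PyTorch sketch of cross-attention fusion between two modality-specific token streams (e.g., per-region cortical surface features from MRI and PET). The module name, dimensions, bidirectional attention pattern, pooling, and classifier head are illustrative assumptions for exposition, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Middle fusion: each modality's tokens attend to the other's."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # MRI tokens query PET tokens, and vice versa.
        self.mri_to_pet = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pet_to_mri = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mri = nn.LayerNorm(dim)
        self.norm_pet = nn.LayerNorm(dim)

    def forward(self, mri: torch.Tensor, pet: torch.Tensor) -> torch.Tensor:
        # mri, pet: (batch, num_regions, dim) surface-feature tokens.
        mri_fused, _ = self.mri_to_pet(query=mri, key=pet, value=pet)
        pet_fused, _ = self.pet_to_mri(query=pet, key=mri, value=mri)
        mri = self.norm_mri(mri + mri_fused)  # residual + layer norm
        pet = self.norm_pet(pet + pet_fused)
        # Pool each stream over regions and concatenate for a classifier.
        return torch.cat([mri.mean(dim=1), pet.mean(dim=1)], dim=-1)

# Hypothetical usage: fuse 64-region features from each modality and
# feed a 3-way diagnosis head (e.g., CN / MCI / AD).
fusion = CrossAttentionFusion(dim=256)
mri_feats = torch.randn(4, 64, 256)
pet_feats = torch.randn(4, 64, 256)
classifier = nn.Linear(2 * 256, 3)
logits = classifier(fusion(mri_feats, pet_feats))  # shape: (4, 3)
```

Fusing at this intermediate stage, rather than concatenating raw inputs (early fusion) or averaging per-modality predictions (late fusion), lets each modality's representation be conditioned on the other before classification, which is the usual motivation for cross-attention middle fusion.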