Food recognition from images is crucial for dietary management, enabling applications such as automated meal tracking and personalized nutrition planning. However, challenges persist: background noise disrupts intra-class consistency and inter-class distinction, and domain shifts arise from variations in capture angle, lighting, and image resolution. This study proposes a multi-stage convolutional neural network framework incorporating a boundary-aware module (BAM) for boundary region perception, deformable ROI pooling (DRP) for spatial feature refinement, a transformer encoder for capturing global contextual relationships, and a NetRVLAD module for robust feature aggregation. The framework achieved state-of-the-art performance on three benchmark datasets, with Top-1 accuracies of 99.80% on Food-5K, 99.17% on Food-101, and 85.87% on Food2K, significantly outperforming existing methods. It thus holds promise as a foundational tool for intelligent dietary management, offering robust and accurate recognition for real-world applications.
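To make the aggregation stage concrete, the sketch below illustrates NetRVLAD-style pooling (soft cluster assignment followed by a weighted sum of descriptors, without the cluster-center residual subtraction of classical NetVLAD) in plain NumPy. The function name, array shapes, and normalization order are illustrative assumptions for exposition, not the authors' actual implementation, which would be a trainable layer inside the network.

```python
import numpy as np

def netrvlad(features, cluster_weights, cluster_biases):
    """Illustrative NetRVLAD-style aggregation (assumed shapes, not the paper's code).

    features:        (N, D) local descriptors from the backbone
    cluster_weights: (D, K) learnable projection to K cluster logits
    cluster_biases:  (K,)   learnable per-cluster bias
    returns:         (K*D,) L2-normalized global descriptor
    """
    # Soft-assignment of each descriptor to K clusters via softmax
    logits = features @ cluster_weights + cluster_biases        # (N, K)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    assign = e / e.sum(axis=1, keepdims=True)                   # (N, K)

    # NetRVLAD: per-cluster weighted sum of raw descriptors
    # (no subtraction of cluster centers, unlike NetVLAD)
    vlad = assign.T @ features                                  # (K, D)

    # Intra-normalization per cluster, then global L2 normalization
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    vec = vlad.ravel()
    return vec / (np.linalg.norm(vec) + 1e-12)
```

In a trained model, `cluster_weights` and `cluster_biases` would be learned end-to-end; the fixed output dimensionality (K·D) regardless of the number of input descriptors is what makes this aggregation robust to varying spatial extents.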