BACKGROUND: Breast cancer is the most common cancer worldwide, and magnetic resonance imaging (MRI) is a highly sensitive technique for invasive cancer detection. When reviewing a breast MRI examination, radiologists rely on multimodal information: imaging data, but also information not present in the images, such as clinical data. Most machine learning (ML) approaches are not well suited to multimodal data. Attention-based architectures such as Transformers, however, are flexible and are therefore good candidates for integrating multimodal data.

PURPOSE: The aim of this study was to develop and evaluate a novel multimodal deep learning (DL) model combining ultrafast dynamic contrast-enhanced (UF-DCE) MRI images, lesion characteristics, and clinical information for breast lesion classification.

MATERIALS AND METHODS: From 2019 to 2023, UF-DCE breast images and radiology reports of 240 patients were retrospectively collected from a single clinical center and annotated. Imaging data consisted of volumes of interest (VOIs) extracted around segmented lesions. Non-imaging data comprised both clinical (categorical) and geometric (scalar) data. Clinical data were extracted from the annotated reports and associated with their corresponding lesions. We compared the diagnostic performance of traditional ML methods on non-imaging data, a DL model using imaging data only, and a novel Transformer-based architecture, the Multimodal Sieve Transformer with Vision Transformer encoder (MMST-V).

RESULTS: The final dataset included 987 lesions (280 benign lesions, 121 malignant lesions, and 586 benign lymph nodes) and 1081 reports. For classification with non-imaging data, scalar data had a greater influence on lesion classification performance (area under the receiver operating characteristic curve [AUROC] = 0.875 ± 0.042) than categorical data (AUROC = 0.680 ± 0.060). MMST-V achieved better performance (AUROC = 0.928 ± 0.027) than classification based on non-imaging data alone (AUROC = 0.900 ± 0.045) or imaging data alone (AUROC = 0.863 ± 0.025).

CONCLUSION: The proposed MMST-V is an adaptive approach that can account for the redundant information provided by the different modalities, and it demonstrated better performance than the unimodal methods. These results highlight that combining patient clinical data and detailed lesion information as additional clinical knowledge enhances the diagnostic performance of UF-DCE breast MRI.
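
The abstract does not describe MMST-V's internals, but the kind of fusion it names (ViT-encoded image tokens combined with categorical clinical and scalar geometric features in one attention-based model) can be illustrated with a generic multimodal Transformer classifier. The PyTorch sketch below is a hypothetical reconstruction under stated assumptions, not the authors' implementation: every module name, token layout, and dimension (patch_dim, d_model, the single CLS-token readout) is an illustrative choice, and positional embeddings for the image patches are omitted for brevity.

# Illustrative sketch only: the abstract does not specify MMST-V's
# architecture, so all modules, names, and dimensions below are
# hypothetical, showing one generic way to fuse the three modalities.
import torch
import torch.nn as nn

class MultimodalTransformerSketch(nn.Module):
    def __init__(self, n_categorical=10, n_categories=16, n_scalar=8,
                 patch_dim=512, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Project flattened VOI patches into token space (ViT-style).
        self.patch_proj = nn.Linear(patch_dim, d_model)
        # Learned embeddings for categorical clinical variables
        # (a single shared vocabulary here, for simplicity).
        self.cat_embed = nn.Embedding(n_categories, d_model)
        # Project scalar (geometric) lesion features into one token.
        self.scalar_proj = nn.Linear(n_scalar, d_model)
        # Learnable classification token, as in ViT/BERT.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # benign vs. malignant logit

    def forward(self, patches, categorical, scalars):
        # patches: (B, n_patches, patch_dim); categorical: (B, n_categorical)
        # integer codes; scalars: (B, n_scalar) continuous features.
        b = patches.size(0)
        tokens = torch.cat([
            self.cls_token.expand(b, -1, -1),   # one CLS token per sample
            self.patch_proj(patches),           # image tokens
            self.cat_embed(categorical),        # clinical tokens
            self.scalar_proj(scalars).unsqueeze(1),  # geometric token
        ], dim=1)
        encoded = self.encoder(tokens)          # joint attention over modalities
        return self.head(encoded[:, 0])         # classify from the CLS token

model = MultimodalTransformerSketch()
logit = model(torch.randn(2, 27, 512),
              torch.randint(0, 16, (2, 10)),
              torch.randn(2, 8))
print(logit.shape)  # torch.Size([2, 1])

Because all modalities share one encoder, attention can downweight tokens that duplicate information already carried by another modality, which is one plausible reading of the "adaptive to redundant information" property claimed in the conclusion.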