Spatially resolved transcriptomics (SRT) simultaneously measures the spatial location, histology images, and transcriptional profiles of cells or regions in undissociated tissues. Integrative analysis of multi-modal SRT data holds immense potential for understanding biological mechanisms. Here, we present MuCST, a flexible multi-modal contrastive learning approach for the integration of SRT data, which jointly performs denoising, heterogeneity elimination, and compatible feature learning. MuCST accurately identifies spatial domains and is applicable to diverse datasets and platforms. Overall, MuCST provides an alternative for integrative analysis of multi-modal SRT data (https://github.com/xkmaxidian/MuCST).