The Swin Transformer, with its window-based attention mechanism, demonstrates strong feature modeling capabilities. However, its fixed window size limits it on high-resolution feature maps, particularly when capturing the long-range dependencies required by magnetic resonance image reconstruction tasks. To overcome this, we propose a novel multi-modal hybrid window attention Transformer (MHWT) that introduces a retractable attention mechanism combined with a shape-alternating window design, expanding attention coverage while maintaining computational efficiency. Additionally, we employ a variable and shifted window attention strategy to model both local and global dependencies more flexibly. Improvements to the Transformer encoder, including adjustments to normalization and attention score computation, enhance training stability and reconstruction performance. Experimental results on multiple public datasets show that our method outperforms state-of-the-art approaches in both single-modal and multi-modal scenarios, demonstrating superior image reconstruction ability and adaptability. The code is publicly available at https://github.com/EnieHan/MHWT.
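The window mechanisms summarized above can be illustrated with a minimal NumPy sketch. This is not the MHWT implementation; the function names, window sizes, and feature-map dimensions are illustrative assumptions. It shows how a feature map is partitioned into non-overlapping windows (within which attention is computed), and how alternating window shapes (wide strips, tall strips) plus a cyclic shift let successive layers cover different spatial regions:

```python
import numpy as np

def window_partition(x, wh, ww):
    """Split an (H, W, C) feature map into non-overlapping (wh, ww) windows.

    Returns (num_windows, wh * ww, C); attention would be computed
    independently inside each window of wh * ww tokens.
    """
    H, W, C = x.shape
    x = x.reshape(H // wh, wh, W // ww, ww, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, wh * ww, C)

def cyclic_shift(x, sh, sw):
    """Cyclically shift the map so window boundaries move between layers,
    allowing cross-window information flow (as in shifted window attention)."""
    return np.roll(x, shift=(-sh, -sw), axis=(0, 1))

# Illustrative 8x8 feature map with 16 channels (not the paper's sizes).
H, W, C = 8, 8, 16
feat = np.random.rand(H, W, C)

# Alternating window shapes expand the directions attention can reach:
wide = window_partition(feat, 2, 8)                   # horizontal strip windows
tall = window_partition(feat, 8, 2)                   # vertical strip windows
square = window_partition(cyclic_shift(feat, 2, 2), 4, 4)  # shifted square windows
```

Each partition yields the same number of tokens per window here (16), so the per-window attention cost stays fixed while the spatial extent each layer can model varies.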