Multi-modal 3D object detection has become a key area of research in autonomous driving, and fusion is an essential factor affecting its performance. However, previous methods still fail to fuse LiDAR and RGB image features effectively, leaving the complementary depth and semantic texture information underexploited. In addition, existing methods may not adequately capture structural information when extracting Region of Interest (RoI) features. Structural information, which encompasses the position, size, and orientation of objects as well as their relative positions and spatial relationships, plays a crucial role in RoI features; its absence can lead to false or missed detections. To address these problems, we propose a multi-modal sensor fusion network, Bi-Att3DDet, which mainly consists of a Self-Attentive RoI Feature Extraction module (SARoIFE) and a Feature Bidirectional Interactive Fusion module (FBIF). Specifically, SARoIFE uses the self-attention mechanism to capture the relationships between different positions within RoI features, producing high-quality RoI features in preparation for the fusion stage. FBIF then performs bidirectional interaction between LiDAR and pseudo RoI features to make full use of their complementary information. Comprehensive experiments on the KITTI dataset show that our method achieves a 1.55% improvement at the hard difficulty level and a 0.19% improvement in mean Average Precision (mAP) on the test set.
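To make the two modules concrete, the following is a minimal sketch of how self-attentive RoI refinement and bidirectional cross-modal interaction could be realized with standard PyTorch attention layers. It is not the authors' implementation: the class names SARoIFESketch and FBIFSketch, the tensor shapes, channel sizes, and head counts are all illustrative assumptions.

```python
# Illustrative sketch only, not the paper's code. Shapes and hyperparameters are assumed.
import torch
import torch.nn as nn


class SARoIFESketch(nn.Module):
    """Self-attention over the grid positions inside each RoI feature."""

    def __init__(self, channels=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, roi_feats):
        # roi_feats: (num_rois, grid_points, channels)
        refined, _ = self.attn(roi_feats, roi_feats, roi_feats)
        return self.norm(roi_feats + refined)  # residual connection


class FBIFSketch(nn.Module):
    """Bidirectional cross-attention between LiDAR and pseudo RoI features."""

    def __init__(self, channels=128, heads=4):
        super().__init__()
        self.lidar_from_pseudo = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.pseudo_from_lidar = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, lidar_roi, pseudo_roi):
        # Each modality queries the other, then the enriched features are fused.
        l2p, _ = self.lidar_from_pseudo(lidar_roi, pseudo_roi, pseudo_roi)
        p2l, _ = self.pseudo_from_lidar(pseudo_roi, lidar_roi, lidar_roi)
        return self.fuse(torch.cat([lidar_roi + l2p, pseudo_roi + p2l], dim=-1))


if __name__ == "__main__":
    lidar = torch.randn(64, 216, 128)   # e.g. 64 RoIs with a 6x6x6 grid of points
    pseudo = torch.randn(64, 216, 128)
    sa = SARoIFESketch()
    fused = FBIFSketch()(sa(lidar), sa(pseudo))
    print(fused.shape)  # torch.Size([64, 216, 128])
```

The sketch reflects the abstract's description at a high level: self-attention relates positions within each RoI before fusion, and the bidirectional cross-attention lets each modality attend to the other so that depth and semantic texture cues are both exploited.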