This study introduces a new framework for fusing calibrated mobile mapping system (MMS) data with low-cost unmanned aerial vehicle (UAV) images to generate seamless, high-fidelity 3D urban maps. By leveraging the complementary strengths of the two technologies, the approach addresses the limitations of single-source mapping, such as occlusions in aerial top views and insufficient vertical detail in ground-level data. The proposed pipeline combines cloth simulation filtering for ground-point extraction from MMS data with deep-learning-based segmentation (U²-Net) for feature extraction from UAV images. To align the heterogeneous datasets, street-view MMS images are projected to a top-down viewpoint using inverse perspective mapping, and precise cross-view alignment is achieved with the LightGlue feature matcher. The spatial accuracy of the 3D model is improved by integrating the matched features as ground control points into a structure-from-motion (SfM) pipeline. Validation on data from the Yonsei University campus and the nearby urban area of Yeonhui-dong yielded notable accuracy gains, with a root mean square error of 0.131 m. This flexible and scalable method enhances 3D urban mapping capabilities and can benefit geospatial analysis, infrastructure monitoring, and urban planning.
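As a concrete illustration of the viewpoint-alignment step, the minimal sketch below shows how a street-view image could be warped to a top-down (bird's-eye) view via a planar homography, the core operation behind inverse perspective mapping. The use of OpenCV, the file names, and the corner coordinates are illustrative assumptions rather than the paper's implementation; in the actual pipeline the transform would be derived from the calibrated MMS camera geometry.

```python
# Minimal inverse perspective mapping (IPM) sketch using OpenCV.
# All paths and corner coordinates below are hypothetical placeholders.
import cv2
import numpy as np

street_view = cv2.imread("mms_street_view.jpg")  # hypothetical MMS street-view frame
h, w = street_view.shape[:2]

# Four points on the road plane in the street-view image (assumed values)
src_pts = np.float32([[w * 0.40, h * 0.60],   # far-left on the road
                      [w * 0.60, h * 0.60],   # far-right
                      [w * 0.95, h * 0.95],   # near-right
                      [w * 0.05, h * 0.95]])  # near-left

# Corresponding points in the desired top-down image
dst_w, dst_h = 500, 800
dst_pts = np.float32([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]])

# Planar homography mapping the road plane to a bird's-eye viewpoint
H = cv2.getPerspectiveTransform(src_pts, dst_pts)
top_down = cv2.warpPerspective(street_view, H, (dst_w, dst_h))

cv2.imwrite("mms_top_down.jpg", top_down)
```

In this sketch, the warped top-down image would then be passed, together with the UAV orthoimagery, to a cross-view feature matcher such as LightGlue to obtain the correspondences used as ground control points.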