Deep learning offers efficient solutions for drug-target interaction prediction, but current methods often fail to capture the full complexity of multi-modal data (i.e., sequences, graphs, and three-dimensional structures), limiting both performance and generalization. Here, we present UnitedDTA, a novel explainable deep learning framework that integrates multi-modal biomolecule data to improve binding-affinity prediction, especially for novel (unseen) drugs and targets. UnitedDTA automatically learns unified, discriminative representations from multi-modal data via contrastive learning and cross-attention mechanisms for cross-modality alignment and integration. Comparative results on multiple benchmark datasets show that UnitedDTA significantly outperforms state-of-the-art drug-target affinity prediction methods and generalizes better to unseen drug-target pairs. More importantly, unlike most "black-box" deep learning methods, our model offers improved interpretability, enabling us to directly infer the important substructures of drug-target complexes that influence binding activity and thereby providing insights into binding preferences. Moreover, by extending UnitedDTA to other downstream tasks (e.g., molecular property prediction), we show that the proposed multi-modal representation learning captures latent molecular representations closely associated with molecular properties, demonstrating broad application potential for advancing the drug discovery process.
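To make the alignment-and-integration idea concrete, the following is a minimal illustrative sketch (not the authors' implementation; all module names, dimensions, and hyperparameters are assumptions) of the two ingredients named above: a cross-attention block that fuses one modality's token embeddings with another's, and a symmetric InfoNCE-style contrastive loss that pulls matched drug-target pairs together across modalities.

```python
# Hypothetical sketch of cross-attention fusion + contrastive alignment.
# Dimensions, module names, and the pooling choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalBlock(nn.Module):
    """Fuse a 'query' modality (e.g. drug graph tokens) with a
    'context' modality (e.g. protein sequence tokens)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # Queries come from one modality, keys/values from the other.
        fused, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + fused)  # residual + layer norm

def contrastive_loss(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE: matched pairs lie on the diagonal of the
    cosine-similarity matrix between the two modality embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    labels = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Toy usage: a batch of 8 drug-target pairs with pre-encoded tokens
# (10 drug tokens, 20 target tokens, embedding dim 64).
drug_tokens = torch.randn(8, 10, 64)
target_tokens = torch.randn(8, 20, 64)
block = CrossModalBlock()
fused = block(drug_tokens, target_tokens)             # shape (8, 10, 64)
loss = contrastive_loss(fused.mean(dim=1), target_tokens.mean(dim=1))
```

In practice the token embeddings would come from modality-specific encoders (sequence, graph, and 3-D structure), and the contrastive and supervised affinity losses would be optimized jointly; this sketch only shows the alignment mechanics.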