Steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs), which are widely used in rehabilitation and disability assistance, can benefit from real-time emotion recognition to enhance human-machine interaction. However, the discriminative latent representations learned in SSVEP-BCIs may generalize in unintended directions, reducing the accuracy of emotional state detection. In this paper, we introduce a Valence-Arousal Disentangled Representation Learning (VADL) method, inspired by the classical two-dimensional valence-arousal model of emotion, to enhance the performance and generalization of emotion recognition in SSVEP-BCIs. VADL explicitly disentangles the latent variables encoding valence and arousal to improve accuracy, and employs a structured state space duality (SSD) model to extract global emotional features. Additionally, we propose a Multisubject Gradient Blending training strategy that individually tailors the learning pace of the reconstruction and discrimination tasks within VADL on the fly. To verify the feasibility of our method, we developed a comprehensive dataset of 23 subjects in which both emotional states and SSVEPs were effectively elicited. Experimental results indicate that VADL surpasses existing state-of-the-art benchmark algorithms.
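To make the two core ideas named above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it shows (a) a latent code split into separate valence and arousal parts, and (b) a loss that blends the reconstruction and discrimination objectives with weights a trainer could retune on the fly. All module names, layer sizes, and the simple MLP encoder are illustrative assumptions; the actual VADL model uses an SSD backbone, and the actual Multisubject Gradient Blending derives its weights per subject.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VADLSketch(nn.Module):
    """Illustrative stand-in for VADL: a shared encoder whose latent code
    is split into a valence half and an arousal half, plus a decoder for
    the reconstruction task. Layer sizes are arbitrary assumptions."""
    def __init__(self, n_channels=64, n_samples=512, d_latent=32, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_channels * n_samples, 256), nn.GELU(),
            nn.Linear(256, 2 * d_latent),  # first half: valence, second half: arousal
        )
        self.decoder = nn.Sequential(      # reconstruction path
            nn.Linear(2 * d_latent, 256), nn.GELU(),
            nn.Linear(256, n_channels * n_samples),
        )
        self.valence_head = nn.Linear(d_latent, n_classes)
        self.arousal_head = nn.Linear(d_latent, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        z_val, z_aro = z.chunk(2, dim=-1)  # disentangled latent variables
        x_hat = self.decoder(z).view_as(x)
        return x_hat, self.valence_head(z_val), self.arousal_head(z_aro)

def blended_loss(model, x, y_val, y_aro, w_rec, w_disc):
    """Blend reconstruction and discrimination losses. In the paper's
    strategy the weights are adapted during training; here they are
    simply passed in by the caller."""
    x_hat, logit_val, logit_aro = model(x)
    l_rec = F.mse_loss(x_hat, x)
    l_disc = F.cross_entropy(logit_val, y_val) + F.cross_entropy(logit_aro, y_aro)
    return w_rec * l_rec + w_disc * l_disc
```

As a usage note, one training step under this sketch would compute `blended_loss(model, x, y_val, y_aro, w_rec, w_disc)` and backpropagate it, with `(w_rec, w_disc)` periodically re-estimated so that neither the reconstruction nor the discrimination task dominates the shared encoder.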