Speech emotion recognition (SER) has long been a popular yet challenging task with broad applications in areas such as social media communication and medical diagnostics. Because SER datasets tend to be small in volume yet high in complexity, effectively integrating and modeling audio data remains a significant challenge in this field. To address this, we propose a model architecture that combines a fine-tuned Wav2vec2.0 with Neural Controlled Differential Equations (NCDEs): first, we use a fine-tuned Wav2vec2.0 model to extract rich contextual features; we then model the resulting high-dimensional time series with an NCDE classifier, setting the vector field to an MLP and updating the model's hidden state by solving the controlled differential equation. We conducted speech emotion recognition experiments on the IEMOCAP dataset, where our model achieves a weighted accuracy (WA) of 73.37% and an unweighted accuracy (UA) of 74.18%. The model also converges quickly, reaching good accuracy after just one epoch of training, and it is highly stable, with standard deviations of 0.45% in WA and 0.39% in UA.
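For concreteness, the following is a minimal sketch of such a Wav2vec2.0-plus-NCDE pipeline in PyTorch, assuming the `transformers` and `torchcde` libraries; the hidden size, MLP width, four-class output, and all module names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: Wav2vec2.0 features fed to an NCDE classifier with an MLP vector
# field. Hyperparameters and class names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torchcde
from transformers import Wav2Vec2Model

class CDEFunc(nn.Module):
    """MLP vector field f_theta: maps hidden state z to a matrix that the
    solver multiplies by dX/dt when integrating the controlled ODE."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, hidden_dim * input_dim),
            nn.Tanh(),
        )

    def forward(self, t, z):
        # Output shape (batch, hidden_dim, input_dim) as torchcde expects.
        return self.net(z).view(z.size(0), self.hidden_dim, self.input_dim)

class Wav2Vec2NCDE(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=64, num_classes=4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.initial = nn.Linear(feat_dim + 1, hidden_dim)  # +1 time channel
        self.func = CDEFunc(feat_dim + 1, hidden_dim)
        self.readout = nn.Linear(hidden_dim, num_classes)

    def forward(self, waveform):  # waveform: (batch, num_samples)
        feats = self.encoder(waveform).last_hidden_state     # (B, T, feat_dim)
        t = torch.linspace(0, 1, feats.size(1), device=feats.device)
        t = t.expand(feats.size(0), -1).unsqueeze(-1)
        path = torch.cat([t, feats], dim=-1)                 # time-augmented path
        # Interpolate the feature sequence into a continuous control path X(t).
        coeffs = torchcde.hermite_cubic_coefficients_with_backward_differences(path)
        X = torchcde.CubicSpline(coeffs)
        z0 = self.initial(X.evaluate(X.interval[0]))
        # Solve dz = f_theta(z) dX over the full interval; keep the final state.
        zT = torchcde.cdeint(X=X, func=self.func, z0=z0, t=X.interval)[:, -1]
        return self.readout(zT)                              # emotion logits
```

The time channel concatenated onto the features is the standard time-augmentation used with NCDEs, which makes the learned dynamics depend on position in the utterance as well as on the Wav2vec2.0 features.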