Emotions are expressed through many features, including facial displays, vocal intonation, and touch, and perceivers can often interpret emotional displays across these modalities with high accuracy. Here, we examine how emotion perception from faces and from voices relate to one another, probing individual differences in emotion recognition abilities across the visual and auditory modalities. We developed a novel emotion sorting task, in which participants freely grouped stimuli into perceived emotional categories without the need for pre-defined emotion labels. Participants completed two emotion sorting tasks, one using silent videos of facial expressions and the other using audio recordings of vocal expressions. We additionally manipulated emotional intensity, contrasting more subtle, lower-intensity emotion portrayals with higher-intensity ones. Participants' performance on the emotion sorting task was similar for face and voice stimuli. As expected, performance was lower for stimuli of low emotional intensity. Consistent with previous reports, task performance was positively correlated across the two modalities. Our findings suggest that emotion perception in the visual and auditory modalities may be underpinned by similar or shared processes, and highlight that emotion sorting tasks are powerful paradigms for investigating emotion recognition from voices, as well as cross-modal and multimodal emotion recognition.