Evaluating CNN, RNN, and Vision Transformer for Emotion Recognition: Strengths and Weaknesses
Date
2025
Author
Yushchenko, Artur
Smelyakov, Kirill
Chupryna, Anastasiya
Abstract
This paper examines three prominent deep learning architectures for emotion recognition, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Vision Transformers (ViTs), assessing their strengths and weaknesses under varying conditions. It discusses how each architecture captures spatial, temporal, or global features in emotional data, highlighting differences in feature extraction, representational capacity, and scalability. New solutions are proposed to enhance accuracy and adaptability, integrating design principles that address recognized challenges in real-world deployments. Insights are offered on aligning model selection with specific application demands, such as the nature of the input signals, the available computational resources, and the required real-time performance. While the comparative analysis remains broad to accommodate diverse use cases, it underscores the importance of carefully balancing accuracy against efficiency. The investigation concludes with recommendations on when each architecture is most advantageous, providing a flexible framework for researchers and practitioners navigating these trade-offs. These findings inform the development of adaptive emotion recognition systems that leverage state-of-the-art deep learning techniques across multiple contexts.
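The model-selection guidance summarized in the abstract (matching the architecture family to the input signal, compute budget, and latency needs) could be sketched as a simple decision heuristic. The function, argument names, and thresholds below are hypothetical illustrations of that idea, not the paper's actual procedure:

```python
def select_architecture(input_type: str, sequential: bool,
                        compute_budget: str) -> str:
    """Hypothetical heuristic mapping application demands to an
    architecture family, following the trade-offs in the abstract."""
    # Sequential signals (speech, physiological time series) call for
    # temporal modelling, which RNNs provide at modest cost.
    if sequential:
        return "RNN"
    # Static images: ViTs capture global context but typically demand
    # large datasets and compute, so CNNs are the efficient default.
    if input_type == "image":
        return "ViT" if compute_budget == "high" else "CNN"
    # Fallback for other non-sequential inputs.
    return "CNN"

print(select_architecture("audio", sequential=True, compute_budget="low"))    # RNN
print(select_architecture("image", sequential=False, compute_budget="high"))  # ViT
print(select_architecture("image", sequential=False, compute_budget="low"))   # CNN
```

In practice such a rule would be refined empirically, but it makes concrete the kind of flexible framework the paper proposes for navigating accuracy-versus-efficiency trade-offs.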
