Enhancing Facial Emotion Recognition on FER2013 Using Attention-based CNN and Sparsemax-Driven Class-Balanced Architectures

Suwartono, Christiany; Bata, Julius Victor Manuel; Airlangga, Gregorius

Enhancing Facial Emotion Recognition on FER2013 Using Attention-based CNN and Sparsemax-Driven Class-Balanced Architectures

Suwartono, Christiany and Bata, Julius Victor Manuel and Airlangga, Gregorius (2025) Enhancing Facial Emotion Recognition on FER2013 Using Attention-based CNN and Sparsemax-Driven Class-Balanced Architectures. Buletin Ilmiah Sarjana Teknik Elektro, 7 (4). pp. 944-954.

[thumbnail of 14510-Article Text-68754-1-10-20251203.pdf]

Text
14510-Article Text-68754-1-10-20251203.pdf - Published Version
Download (551kB)

Official URL: https://journal2.uad.ac.id/index.php/biste/article...

Abstract

Facial emotion recognition plays a critical role in various human–computer interaction applications, yet remains challenging due to class imbalance, label noise, and subtle inter-class visual similarities. The FER2013 dataset, containing seven emotion classes, is particularly difficult because of its low resolution and heavily skewed label distribution. This study presents a comparative investigation of advanced deep learning architectures against traditional machine-learning baselines on FER2013 to address these challenges and improve recognition performance. Two novel architectures are proposed. The first is an attention-based convolutional neural network (CNN) that integrates Mish activations and squeeze-and-excitation (SE) channel recalibration to enhance the discriminative capacity of intermediate features. The second, FastCNN-SE, is a refined extension designed for computational efficiency and minority-class robustness, incorporating Sparsemax activation, Poly-Focal loss, class-balanced reweighting, and MixUp augmentation. The research contribution is demonstrating how combining attention, sparse activations, and imbalance-aware learning improves FER performance under challenging real-world conditions. Both models were extensively evaluated: the attention-CNN under 10-fold cross-validation, achieving 0.6170 accuracy and 0.555 macro-F1, and FastCNN-SE on the held-out test set, achieving 0.5960 accuracy and 0.5138 macro-F1. These deep models significantly outperform PCA-based Logistic Regression, Linear SVC, and Random Forest baselines (≤0.37 accuracy and ≤0.29 macro-F1). We additionally justify the differing evaluation protocols by emphasizing cross-validation for architectural stability and held-out testing for generalization and note that FastCNN-SE contains ~3M parameters, enabling efficient inference. These findings demonstrate that architecture-level fusion of SE attention, Sparsemax, and Poly-Focal loss improves balanced emotion recognition, offering a strong foundation for future studies on efficient and robust affective-computing systems.

Item Type:	Article
Subjects:	T Technology > TK Electrical engineering. Electronics Nuclear engineering
Depositing User:	BISTE UAD
Date Deposited:	16 May 2026 16:36
Last Modified:	16 May 2026 16:36
URI:	https://alxiv.org/id/eprint/835

Actions (login required)

: View Item