## TL;DR
Affective computing aims to give AI systems a form of emotional intelligence: recognizing human emotions from voice, face, text, and physiological signals, and responding empathetically. With applications ranging from mental health monitoring to customer service and automotive safety (detecting driver stress), emotion-aware AI is moving from academic research into production deployment.
## Core Explanation
Emotion models fall into two families:

- Categorical: emotions are discrete classes (Ekman's six: happiness, sadness, anger, fear, disgust, surprise), usually with an added neutral class. Most classification benchmarks use this scheme.
- Dimensional: emotions vary continuously along valence (negative to positive), arousal (calm to excited), and dominance (submissive to in control). This captures nuanced states such as nostalgia or frustration that categorical models miss.

Emotion can be read from four main modalities:

- Facial expression: CNNs and Vision Transformers process face images or video and detect the Action Units (AUs) defined by the Facial Action Coding System (FACS). A typical pipeline runs landmark detection first, then expression classification.
- Speech: prosody (pitch, energy, speaking rate) and spectral features (MFCCs, spectrograms) are processed by CNNs, LSTMs, or Transformers. Cross-lingual speech emotion recognition is particularly challenging.
- Text: sentiment and emotion classification with fine-tuned transformers (BERT, RoBERTa, emotion-specific models).
- Physiological: EEG (brain activity), ECG (heart rate variability), and GSR (skin conductance). These signals are hard to control voluntarily, so they are less subject to social masking than face or voice.
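To make the difference between the categorical and dimensional models concrete, here is a minimal sketch that places each categorical label at a point in valence-arousal space and maps an arbitrary point back to its nearest label. The prototype coordinates, the `nearest_category` and `quadrant` helpers, and the "frustration" reading are all assumptions invented for illustration; they do not come from a published lexicon or from any dataset named here.

```python
import math

# Illustrative (valence, arousal) prototypes in [-1, 1]. These coordinates are
# assumptions made up for this sketch, not values from a published lexicon.
PROTOTYPES = {
    "happiness": (0.8, 0.5),
    "sadness": (-0.7, -0.5),
    "anger": (-0.6, 0.8),
    "fear": (-0.7, 0.6),
    "disgust": (-0.6, 0.2),
    "surprise": (0.2, 0.8),
    "neutral": (0.0, 0.0),
}


def nearest_category(valence, arousal):
    """Return the categorical label whose prototype is closest to the point."""
    point = (valence, arousal)
    return min(PROTOTYPES, key=lambda label: math.dist(PROTOTYPES[label], point))


def quadrant(valence, arousal):
    """Name the valence-arousal quadrant that a point falls in."""
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v} valence, {a} arousal"


if __name__ == "__main__":
    # Hypothetical "frustration" reading: negative valence, moderate arousal.
    frustration = (-0.5, 0.4)
    print(nearest_category(*frustration))
    print(quadrant(*frustration))
```

With these assumed prototypes, the frustration point is forced onto the nearest discrete label ("disgust"), even though it sits between several classes. That loss of information is the nuance the dimensional model is meant to keep.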
## Detailed Analysis
Multimodal fusion strategies:

- Early fusion: concatenate the features of all modalities before classification.
- Late fusion: classify each modality independently, then ensemble the predictions.
- Hybrid/cross-modal fusion: attention mechanisms learn which modality to trust when modalities conflict.

MemoCMT (Nature 2025) is a cross-modal transformer that processes speech and facial features simultaneously, learning to attend to the facial stream when speech is ambiguous and vice versa. It achieves 82% accuracy on IEMOCAP (4-class emotion), an 8% improvement over the best unimodal model. EmoVerse (ScienceDirect 2025) extends multimodal LLMs (LLaVA, GPT-4V) with affective reasoning: the model generates not just emotion labels but explanations ("The person appears sad because their speech rate slowed and they mentioned loss").

Applications:

- Mental health: detecting depression and anxiety from speech and text patterns.
- Education: detecting confusion and engagement in online learning.
- Automotive: monitoring driver emotion and stress for safety.
- Customer service: real-time agent coaching based on customer emotion.
- Social robotics: empathetic response generation.

A key challenge is cultural variability. A smile means different things across cultures, yet training data is overwhelmingly Western (IEMOCAP and RAVDESS are English-only). A 2025 IEEE survey calls for culturally diverse benchmarks.
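The fusion strategies described in this section can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the MemoCMT architecture: the function names, the four-class label order, and the example probabilities are hypothetical, and `confidence_weights` is a hand-written entropy heuristic that stands in for the attention weights a cross-modal transformer would learn.

```python
import math

# Hypothetical 4-class label order, chosen for this example only.
LABELS = ["angry", "happy", "neutral", "sad"]


def early_fusion(speech_features, face_features):
    """Early fusion: concatenate per-modality features into one input vector."""
    return list(speech_features) + list(face_features)


def late_fusion(probs_by_modality, weights=None):
    """Late fusion: weighted average of per-modality class probabilities."""
    names = list(probs_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # plain averaging
    total = sum(weights[name] for name in names)
    n_classes = len(probs_by_modality[names[0]])
    return [
        sum(weights[name] * probs_by_modality[name][k] for name in names) / total
        for k in range(n_classes)
    ]


def confidence_weights(probs_by_modality, eps=1e-6):
    """Heuristic stand-in for learned attention (an assumption of this sketch):
    a modality with a peaked, low-entropy distribution gets a larger weight."""
    weights = {}
    for name, probs in probs_by_modality.items():
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        max_entropy = math.log(len(probs))
        weights[name] = 1.0 - entropy / max_entropy + eps
    return weights


if __name__ == "__main__":
    # Made-up classifier outputs: speech is ambiguous, the face is confident.
    predictions = {
        "speech": [0.30, 0.25, 0.25, 0.20],
        "face": [0.05, 0.05, 0.10, 0.80],
    }
    equal = late_fusion(predictions)
    weighted = late_fusion(predictions, confidence_weights(predictions))
    print(LABELS[equal.index(max(equal))], equal)
    print(LABELS[weighted.index(max(weighted))], weighted)
```

In this example both fusion variants pick the same class, but the confidence-weighted version assigns it a higher probability because the ambiguous speech prediction is down-weighted. A learned cross-modal attention layer plays a similar role, with weights computed from the features themselves rather than from output entropy.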
## Further Reading
- IEMOCAP, RAVDESS, MELD Emotion Datasets
- Dimensional Emotion Model (Russell's Circumplex)
- OpenFace: Facial Action Unit Detection