TY - GEN
T1 - Multimodal Attentive Learning for Real-time Explainable Emotion Recognition in Conversations
AU - Arumugam, Balaji
AU - Bhattacharjee, Sreyasee Das
AU - Yuan, Junsong
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Human emotion recognition plays a pivotal role in building intelligent conversational agents that provide real-time automated support in various problem settings. Recent work has explored the temporal patterns in conversations to understand the content and context of conversations from a video clip, but it does not fully leverage the multimodal information (facial expressions of the participants, speech tone, content, and context of the discussion) and its temporal evolution. To address this, we propose a multimodal attentive learning framework that keeps track of the spatio-temporal states of the participants and their conversation dynamics. With a novel contrastive loss-based optimization framework, the proposed method shows promise in identifying the emotional state of the individual speaker in real time and can identify the top-k words in the conversation that influence emotion recognition. Its consistently superior performance over other state-of-the-art methods on two large-scale datasets, MELD and IEMOCAP, demonstrates the feasibility of our approach.
KW - Cross-modal Attention
KW - Emotion Recognition
KW - Explainable Decision Visualization
KW - Multimodal
KW - SpatioTemporal Feature Representation
UR - https://www.scopus.com/pages/publications/85142495017
U2 - 10.1109/ISCAS48785.2022.9938005
DO - 10.1109/ISCAS48785.2022.9938005
M3 - Conference contribution
AN - SCOPUS:85142495017
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
SP - 1210
EP - 1214
BT - IEEE International Symposium on Circuits and Systems, ISCAS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Symposium on Circuits and Systems, ISCAS 2022
Y2 - 27 May 2022 through 1 June 2022
ER -