Multimodal Sentiment Analysis Using Deep Multimodal Learning Structure
PRESENTER:
Habibeh Naderi Khorshidi
Dalhousie University
ABSTRACT:
The human brain recognizes the sentiment of an expressed opinion by
integrating multiple sources of information. Our perception of
sentiment comes not only from analyzing the verbal content but also
from the audio and visual cues of how that utterance is expressed.
A single source of information (e.g., text-based sentiment analysis)
may not be enough to detect and resolve ambiguity. However, the
textual, audio, and visual characteristics of a statement are
strongly related, and combining them can resolve ambiguity to some
extent.
In this research, we aim to understand the interaction patterns
between spoken words and visual gestures. Hence, we propose a
multimodal deep learning structure that automatically extracts
salient features from textual, acoustic, and visual data for
sentiment analysis. We use a convolutional neural network (CNN)
followed by an LSTM recurrent neural network (RNN) to extract
visual features, and two independent LSTM RNNs to extract textual
and acoustic features. We then search for an optimal configuration
that combines all features into a joint representation, which forms
our multimodal layer. Finally, above the multimodal layer, we add a
softmax decision layer to obtain the predictions, i.e., the
probability of the input example being positive or negative.
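
To make the described pipeline concrete, below is a minimal sketch of
the three-branch structure in PyTorch. The framework choice, layer
sizes, input dimensions (e.g., 300-dimensional word embeddings,
74-dimensional acoustic frames), and the simple concatenation fusion
are illustrative assumptions, not the presenter's exact configuration.

    # Minimal sketch of the CNN+LSTM visual branch, two LSTM branches,
    # a joint multimodal layer, and a softmax decision layer.
    # All dimensions and the concatenation fusion are assumptions.
    import torch
    import torch.nn as nn

    class MultimodalSentimentNet(nn.Module):
        def __init__(self, text_dim=300, audio_dim=74, hidden=128, num_classes=2):
            super().__init__()
            # Visual branch: a small CNN applied per frame, then an LSTM
            # over the sequence of frame embeddings.
            self.frame_cnn = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),          # -> (batch*frames, 32, 1, 1)
            )
            self.visual_lstm = nn.LSTM(32, hidden, batch_first=True)
            # Independent LSTMs for the textual (word embeddings) and
            # acoustic (frame-level audio features) streams.
            self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
            self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
            # Multimodal layer: joint representation of the three branches,
            # here a simple concatenation followed by a fully connected layer.
            self.fusion = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
            # Decision layer; softmax turns scores into class probabilities.
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, frames, words, audio):
            # frames: (B, T_v, 3, H, W); words: (B, T_t, text_dim);
            # audio: (B, T_a, audio_dim)
            b, t = frames.shape[:2]
            v = self.frame_cnn(frames.flatten(0, 1)).flatten(1)  # (B*T_v, 32)
            _, (v_h, _) = self.visual_lstm(v.view(b, t, -1))     # last hidden state
            _, (t_h, _) = self.text_lstm(words)
            _, (a_h, _) = self.audio_lstm(audio)
            joint = self.fusion(torch.cat([v_h[-1], t_h[-1], a_h[-1]], dim=1))
            return torch.softmax(self.classifier(joint), dim=1)  # P(neg), P(pos)

    # Example forward pass with toy tensor shapes.
    model = MultimodalSentimentNet()
    probs = model(torch.randn(2, 8, 3, 64, 64),  # 8 video frames per clip
                  torch.randn(2, 20, 300),       # 20 word embeddings
                  torch.randn(2, 50, 74))        # 50 acoustic feature frames
    print(probs.shape)                           # torch.Size([2, 2])

In this sketch each branch is summarized by the final LSTM hidden
state before fusion; the presentation discusses searching for an
optimal configuration of this joint representation.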