Graduate Student: 嚴敏誠 (Yen, Min-Cheng)
Thesis Title: 應用於連續情緒辨識之基於殘差影像的情緒特徵學習 / Emotion-Related Feature Learning From Residual Images for Continuous Emotion Recognition
Advisor: 許秋婷 (Hsu, Chiou-Ting)
Committee Members: 李祈均 (Lee, Chi-Chun), 陳永昇 (Chen, Yong-Sheng)
Degree: 碩士 (Master)
Department:
Year of Publication: 2017
Academic Year: 105 (2016–2017)
Language: English
Number of Pages: 32
Keywords (Chinese): 連續情緒辨識、特徵學習、殘差影像、捲積神經網路、長短期記憶
Keywords (English): Continuous Emotion Recognition, Feature Learning, Residual Images, Convolutional Neural Network, Long Short-Term Memory
Abstract (Chinese):
連續情緒辨識的目標是從聲音及影像序列辨識人類情緒。人類情緒可以由四種向度來描述,包含Arousal、Valence、Power以及Expectation。過去的情緒辨識方法大多利用人為定義特徵來描述人類情緒,然而這些人為定義的特徵與情緒的關聯性並不強。因此我們提出情緒相關特徵學習,透過學習到的特徵來反映情緒上的改變。另外,為了處理臉部外觀與情緒標註之間不一致的問題,我們利用殘差影像來表達影片中前後鄰近幀的外觀變化,因為我們認為前後鄰近幀的相對情緒變化是比較可靠的。在學習情緒特徵時,我們透過聯合排序及回歸損失函數來得到更具鑑別力的情緒特徵。然而頭部姿勢變化可能導致殘差影像中前後影片幀無法準確對齊,因此我們利用基於局部的臉部特徵點來解決對齊問題。最後,我們透過融合時間網路結合影像與聲音資訊,並透過LSTM學習情緒在長時間上的變化。實驗顯示本論文方法在AVEC 2012和RECOLA資料庫上與其他連續情緒辨識方法並駕齊驅。
Abstract (English):
Continuous emotion recognition aims to recognize human emotion from audio-visual sequences. Human emotion can be described along four dimensions: arousal, valence, power, and expectation. Previous methods mostly rely on hand-crafted features, which are not strongly related to emotion. We therefore propose to learn emotion-related features that reflect how a person's emotion changes. In addition, to handle the inconsistency between facial appearance and the dimensional labels, we use residual images, defined as the difference between two adjacent video frames, to capture the relative change in facial appearance, because we consider the relative emotional change between adjacent frames to be more reliable. When learning the emotion-related features, we propose a joint ranking and regression loss to obtain more discriminative features. However, misalignment between adjacent frames caused by head-pose variation degrades the effectiveness of residual images, so we use part-based facial landmarks to deal with this misalignment. Finally, we propose a fusion temporal network to combine visual and audio cues and to model long-term emotional evolution with an LSTM. Experiments show that our method achieves results comparable with previous work on the AVEC 2012 and RECOLA datasets.
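To make the ideas in the abstract concrete, the following is a minimal, hypothetical sketch in PyTorch: residual images computed as simple differences of already-aligned adjacent frames, a joint ranking-and-regression loss built as a weighted sum of a margin ranking term (on the direction of adjacent-frame label changes) and an MSE regression term, and a fusion temporal network that runs an LSTM over concatenated visual and audio features. The class names, feature dimensions, weight alpha, and margin are illustrative assumptions, not the thesis implementation.

```python
# A minimal, illustrative sketch (PyTorch), NOT the thesis implementation.
# Assumptions: frames are already face-aligned with facial landmarks; the
# residual image is a simple difference of adjacent frames; the joint loss
# is a weighted sum of a margin ranking term and an MSE regression term;
# all names, dimensions, and hyper-parameters (alpha, margin, ...) are
# hypothetical.

import torch
import torch.nn as nn


def residual_images(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) aligned face crops -> (T-1, C, H, W) residuals."""
    return frames[1:] - frames[:-1]


class JointRankingRegressionLoss(nn.Module):
    """Regression on absolute labels + ranking on adjacent-frame changes."""

    def __init__(self, alpha: float = 0.5, margin: float = 0.0):
        super().__init__()
        self.alpha = alpha
        self.rank = nn.MarginRankingLoss(margin=margin)
        self.mse = nn.MSELoss()

    def forward(self, pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        """pred, label: (T,) per-frame predictions and dimensional annotations."""
        reg = self.mse(pred, label)                    # match absolute values
        target = torch.sign(label[1:] - label[:-1])    # direction of change
        rank = self.rank(pred[1:], pred[:-1], target)  # preserve relative order
        return reg + self.alpha * rank


class FusionTemporalNetwork(nn.Module):
    """Combine per-frame visual and audio features and model long-term
    emotional evolution with an LSTM; feature dimensions are placeholders."""

    def __init__(self, vis_dim: int = 256, aud_dim: int = 88, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(vis_dim + aud_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one emotion dimension, e.g. arousal

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        """vis: (B, T, vis_dim), aud: (B, T, aud_dim) -> (B, T) predictions."""
        out, _ = self.lstm(torch.cat([vis, aud], dim=-1))
        return self.head(out).squeeze(-1)


if __name__ == "__main__":
    frames = torch.rand(16, 3, 96, 96)      # 16 aligned face frames
    residuals = residual_images(frames)      # 15 residual images for a CNN
    pred, label = torch.rand(16), torch.rand(16)
    loss = JointRankingRegressionLoss()(pred, label)
    model = FusionTemporalNetwork()
    y = model(torch.rand(2, 16, 256), torch.rand(2, 16, 88))  # (2, 16)
    print(loss.item(), y.shape)
```

In the thesis pipeline the residual images would feed a CNN that produces the per-frame visual features, and the per-frame audio features would come from a toolkit such as OpenSMILE; the sketch above only illustrates the data flow and the loss, not those components.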
References:
[1] C. Shan, S. Gong, and P. W. McOwan. Conditional mutual information based boosting for facial expression recognition. In Proc. of the British Machine Vision Conference (BMVC), 2005.
[2] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan. Peak-Piloted Deep Network for Facial Expression Recognition. In Proc. of the European Conference on Computer Vision (ECCV), 2016.
[3] I. J. Goodfellow, D. Erhan, P. L. Carrier, et al. Challenges in representation learning: A report on three machine learning contests. In Proc. of the International Conference on Neural Information Processing (ICONIP), 2013.
[4] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proc. of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010.
[5] J. Nicolle, V. Rapp, K. Bailly, L. Prevost, and M. Chetouani. Robust Continuous Prediction of Human Emotions using Multiscale Dynamic Cues. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
[6] H. Meng, N. Bianchi-Berthouze, Y. Deng, J. Cheng, and J. P. Cosmas. Time-Delay Neural Network for Continuous Emotional Dimension Prediction From Facial Expression Sequences. IEEE Transactions on Cybernetics, April 2016.
[7] H. Chen, J. Li, F. Zhang, Y. Li, and H. Wang. 3D model-based continuous emotion recognition. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep Face Recognition. In Proc. of the British Machine Vision Conference (BMVC), 2015.
[9] J. Kim, J. K. Lee, and K. M. Lee. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] K. Zhang, Y. Huang, Y. Du, and L. Wang. Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks. IEEE Transactions on Image Processing, March 2017.
[11] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
[12] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. of the 31st International Conference on Machine Learning, 2014.
[13] B. Schuller, M. Valstar, F. Eyben, R. Cowie, and M. Pantic. AVEC 2012 – The Continuous Audio/Visual Emotion Challenge. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
[14] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder. The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 2012.
[15] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic. AV+EC 2015 – The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
[16] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne. Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions. In Proc. of the 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2013.
[17] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental Face Alignment in the Wild. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[18] F. Eyben, M. Wöllmer, and B. Schuller. OpenSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proc. of the 18th ACM International Conference on Multimedia, 2010.
[19] D. Ozkan, S. Scherer, and L.P. Morency. Step-wise emotion recognition using concatenated-HMM. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
[20] C. Soladié, H. Salam, C. Pelachaud, N. Stoiber, and R. Séguier. A multimodal fuzzy inference system using a continuous facial expression representation for emotion detection. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
[21] M. Nicolaou, S. Zafeiriou, and M. Pantic. Correlated-spaces regression for learning continuous emotion dimensions. In Proc. of the 21st ACM International Conference on Multimedia, 2013.
[22] T. Baltrusaitis, N. Banda, and P. Robinson. Dimensional Affect Recognition using Continuous Conditional Random Fields. In Proc. of the 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2013.
[23] A. Savran, H. Cao, A. Nenkova, and R. Verma. Temporal Bayesian Fusion for Affect Sensing Combining Video, Audio, and Lexical Modalities. IEEE Transactions on Cybernetics, 2015.
[24] P. Cardinal, N. Dehak, A. Lameiras, J. Alam, and P. Boucher. ETS System for AV+EC 2015 Challenge. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
[25] L. Chao, J. Tao, M. Yang, Y. Li, and Z. Wen. Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
[26] S. Chen and Q. Jin. Multi-modal dimensional emotion recognition using recurrent neural networks. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
[27] M. Kächele, P. Thiam, G. Palm, F. Schwenker, and M. Schels. Ensemble methods for continuous affect recognition: Multi-modality, temporality, and challenges. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
[28] F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P. Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognition Letters, 2015.