
Student: Yen, Min-Cheng (嚴敏誠)
Thesis Title: Emotion-Related Feature Learning From Residual Images for Continuous Emotion Recognition (應用於連續情緒辨識之基於殘差影像的情緒特徵學習)
Advisor: Hsu, Chiou-Ting (許秋婷)
Committee Members: Lee, Chi-Chun (李祈均); Chen, Yong-Sheng (陳永昇)
Degree: Master
Department:
Year of Publication: 2017
Graduation Academic Year: 105 (ROC calendar)
Language: English
Number of Pages: 32
Keywords (Chinese): 連續情緒辨識、特徵學習、殘差影像、捲積神經網路、長短期記憶
Keywords (English): Continuous Emotion Recognition, Feature Learning, Residual Images, Convolutional Neural Network, Long Short-Term Memory
    Continuous emotion recognition aims to recognize human emotion from audio and video sequences. Human emotion can be described along four dimensions: arousal, valence, power, and expectation. Most previous emotion recognition methods use hand-crafted features to describe human emotion, but such hand-crafted features are not well suited to this task. We therefore propose emotion-related feature learning, so that the learned features reflect emotional change. In addition, to handle the inconsistency between facial appearance and emotion annotations, we use residual images to represent the appearance change between adjacent frames of a video, because we consider the relative emotional change between adjacent frames to be more reliable. When learning the emotion-related features, we use a joint ranking and regression loss to obtain more discriminative features. However, head-pose variation can prevent the adjacent frames underlying a residual image from being accurately aligned, so we use part-based facial landmarks to address this alignment problem. Finally, we combine visual and audio information through a fusion temporal network and use an LSTM to model emotional evolution over long time spans. Experiments show that the proposed method performs on par with other continuous emotion recognition methods on the AVEC 2012 and RECOLA datasets.
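
    The residual-image representation and the joint ranking and regression loss lend themselves to a short illustration. Below is a minimal PyTorch sketch; the framework choice, the tiny CNN, the margin, and the weighting factor lambda_rank are illustrative assumptions, not the thesis's actual implementation:

```python
# Minimal sketch (assumed PyTorch) of two ideas described above:
# (1) residual images as differences of adjacent aligned face frames, and
# (2) a joint ranking + regression loss over per-frame emotion predictions.
# The small CNN, margin, and lambda_rank are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def residual_images(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) aligned face crops -> (T-1, C, H, W) residuals."""
    return frames[1:] - frames[:-1]

class EmotionNet(nn.Module):
    """Small CNN that regresses one emotion dimension from a residual image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(64, 1)  # e.g. arousal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x).flatten(1)).squeeze(1)

def joint_ranking_regression_loss(pred, target, lambda_rank=1.0, margin=0.1):
    """MSE regression term plus a pairwise hinge ranking term that penalizes
    prediction pairs whose ordering disagrees with the label ordering."""
    reg = F.mse_loss(pred, target)
    diff_pred = pred.unsqueeze(1) - pred.unsqueeze(0)     # (N, N): pred[i] - pred[j]
    diff_tgt = target.unsqueeze(1) - target.unsqueeze(0)  # (N, N): label gaps
    mask = (diff_tgt > 0).float()                         # pairs with label[i] > label[j]
    rank = (F.relu(margin - diff_pred) * mask).sum() / mask.sum().clamp(min=1.0)
    return reg + lambda_rank * rank
```

    A training step under this sketch would compute residual_images over an aligned clip, run EmotionNet on each residual, and apply the joint loss against the per-frame labels.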


    Continuous emotion recognition aims to recognize human emotion from audio-visual sequences. Human emotion can be described along four dimensions: arousal, valence, power, and expectation. Previous work mostly uses hand-crafted features, which are not strongly related to emotion. We propose to learn emotion-related features that reflect how much a person's emotion changes. In addition, to handle the inconsistency between facial appearance and dimensional labels, we use residual images, defined as the difference between two adjacent video frames, to capture the relative change in facial appearance. When learning the emotion-related features, we propose a joint ranking and regression loss to obtain more discriminative features. However, misalignment of adjacent frames due to pose variation degrades the effectiveness of residual images; we use part-based facial landmarks to deal with this misalignment. Finally, we propose a fusion temporal network that combines visual and audio cues and models long-term emotional evolution with an LSTM. Experiments demonstrate that our method achieves results comparable to previous work on the AVEC 2012 and RECOLA datasets.
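
    The fusion temporal network is described here only at a high level. One plausible reading, sketched below under assumed feature dimensions (64-D visual and 88-D audio vectors, both hypothetical) and a single-layer LSTM, is feature-level concatenation followed by recurrent regression of the per-frame emotion value:

```python
# Minimal sketch (assumed PyTorch) of a fusion temporal network: per-frame
# visual and audio features are concatenated and fed to an LSTM, which
# outputs one continuous emotion value per frame. The feature dimensions,
# hidden size, and single LSTM layer are illustrative assumptions.
import torch
import torch.nn as nn

class FusionTemporalNet(nn.Module):
    def __init__(self, visual_dim=64, audio_dim=88, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(visual_dim + audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # one emotion dimension per step

    def forward(self, visual_seq, audio_seq):
        # visual_seq: (B, T, visual_dim); audio_seq: (B, T, audio_dim)
        fused = torch.cat([visual_seq, audio_seq], dim=-1)  # feature-level fusion
        out, _ = self.lstm(fused)                           # (B, T, hidden_dim)
        return self.head(out).squeeze(-1)                   # (B, T) trajectory

# Example: a batch of two 100-frame clips.
net = FusionTemporalNet()
pred = net(torch.randn(2, 100, 64), torch.randn(2, 100, 88))  # -> (2, 100)
```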

    Chinese Abstract I
    Abstract II
    1. Introduction 1
    2. Related Work 5
      2.1 State-of-the-Art Methods 5
        2.1.1 Multi-scale Dynamic Cues 5
        2.1.2 3D Shape Model 6
      2.2 Discussion and Motivation 7
    3. Proposed Method 9
      3.1 Emotion-Related Feature Learning from Residual Images 9
      3.2 Residual Appearance Temporal Network 12
      3.3 Part-based Facial Landmarks Temporal Network 13
      3.4 Fusion Temporal Network 14
    4. Experimental Results 16
      4.1 Datasets 16
      4.2 Implementation Details 16
      4.3 Evaluation Metric 17
      4.4 Feature Evaluation 17
        4.4.1 Emotion-Related Features 18
        4.4.2 Facial Landmarks 19
        4.4.3 Audio Features 21
      4.5 Results of Fusion Temporal Network 21
      4.6 Comparison with Existing Methods 22
      4.7 Discussion 24
        4.7.1 Limitations of the Emotion-Related Feature Learning 24
        4.7.2 Advantages of the Joint Ranking and Regression Loss 25
    5. Conclusion 28
    6. References 29

    [1] C. Shan, S. Gong, and P. W. McOwan. Conditional mutual information based boosting for facial expression recognition. In Proc. of the British Machine Vision Conference (BMVC), 2005.
    [2] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan. Peak-Piloted Deep Network for Facial Expression Recognition. In Proc. of the European Conference on Computer Vision (ECCV), 2016.
    [3] I. J. Goodfellow, D. Erhan, P. L. Carrier, et al. Challenges in representation learning: A report on three machine learning contests. In Proc. of the International Conference on Neural Information Processing (ICONIP), 2013.
    [4] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proc. of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010.
    [5] J. Nicolle, V. Rapp, K. Bailly, L. Prevost, and M. Chetouani. Robust Continuous Prediction of Human Emotions using Multiscale Dynamic Cues. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
    [6] H. Meng, N. Bianchi-Berthouze, Y. Deng, J. Cheng, and J. P. Cosmas. Time-Delay Neural Network for Continuous Emotional Dimension Prediction From Facial Expression Sequences. IEEE Transactions on Cybernetics, April 2016.
    [7] H. Chen, J. Li, F. Zhang, Y. Li, and H. Wang. 3D model-based continuous emotion recognition. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
    [8] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep Face Recognition. In Proc. of the British Machine Vision Conference (BMVC), 2015.
    [9] J. Kim, J. K. Lee and K. M. Lee. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
    [10] K. Zhang, Y. Huang, Y. Du, and L. Wang. Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks. IEEE Transactions on Image Processing, March 2017.
    [11] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
    [12] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. of the 31st International Conference on Machine Learning, 2014.
    [13] B. Schuller, M. Valstar, F. Eyben, R. Cowie, and M. Pantic. AVEC 2012 – The Continuous Audio/Visual Emotion Challenge. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
    [14] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder. The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 2012.
    [15] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic. AV+EC 2015 – The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
    [16] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne. Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions. In Proc. of the 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2013.
    [17] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental Face Alignment in the Wild. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
    [18] F. Eyben, M. Wöllmer, and B. Schuller. OpenSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proc. of the 18th ACM International Conference on Multimedia, 2010.
    [19] D. Ozkan, S. Scherer, and L.-P. Morency. Step-wise emotion recognition using concatenated-HMM. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
    [20] C. Soladié, H. Salam, C. Pelachaud, N. Stoiber, and R. Séguier. A multimodal fuzzy inference system using a continuous facial expression representation for emotion detection. In Proc. of the 14th ACM International Conference on Multimodal Interaction, 2012.
    [21] M. Nicolaou, S. Zafeiriou, and M. Pantic. Correlated-spaces regression for learning continuous emotion dimensions. In Proc. of the 21st ACM International Conference on Multimedia, 2013.
    [22] T. Baltrusaitis, N. Banda, and P. Robinson. Dimensional Affect Recognition using Continuous Conditional Random Fields. In Proc. of the 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2013.
    [23] A. Savran, H. Cao, A. Nenkova, and R. Verma. Temporal Bayesian Fusion for Affect Sensing Combining Video, Audio, and Lexical Modalities. IEEE Transactions on Cybernetics, 2015.
    [24] P. Cardinal, N. Dehak, A. Lameiras, J. Alam, and P. Boucher. ETS System for AV+EC 2015 Challenge. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
    [25] L. Chao, J. Tao, M. Yang, Y. Li, and Z. Wen. Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
    [26] S. Chen and Q. Jin. Multi-modal dimensional emotion recognition using recurrent neural networks. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
    [27] M. Kächele, P. Thiam, G. Palm, F. Schwenker, and M. Schels. Ensemble methods for continuous affect recognition: Multi-modality, temporality, and challenges. In Proc. of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015.
    [28] F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P. Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognition Letters, 2015.
