| Author: | 陳瑋晨 Chen, Wei-Chen |
|---|---|
| Thesis title: | 一個多模態連續情緒辨識系統與其應用於全域情感辨識之研究 (A Study on Automatic Multimodal Continuous Emotion Tracking and Its Application in Global Affect Recognition) |
| Advisor: | 李祈均 Lee, Chi-Chun |
| Committee members: | 冀泰石 Chi, Tai-shih; 曹昱 Tsao, Yu; 李宏毅 Lee, Hung-Yi |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2016 |
| Graduating academic year: | 105 |
| Language: | Chinese |
| Pages: | 41 |
| Keywords (Chinese): | Behavioral Signal Processing; Multimodal Emotion Recognition; Thin-slicing |
| Keywords (English): | Sequence to Sequence; Thin-slicing |
Human communication takes place through voice, body, and facial expression. It is highly complex and carries multimodal information, such as spoken content, intonation, gaze direction, gestures, and body movements, and these cues influence and interact with one another: the same utterance, delivered with a different intonation or different body movements, conveys a different meaning. When people express emotion, beyond the global affect there is, at a finer level, a continuously changing emotional state; every moment carries a different emotion and rich information, and both kinds of emotion, global affect and continuous emotion, are important. This thesis therefore applies behavioral signal processing to extract two behavioral feature modalities, body language and speech, systematically encodes the behavioral signal features over different time intervals, and performs continuous emotion tracking along three emotion dimensions (activation, valence, and dominance) with two machine learning algorithms: support vector regression and sequence-to-sequence learning. The experimental results show that enriching the features with temporal information effectively helps continuous emotion tracking and yields a significant improvement. We then apply the tracked continuous emotion to global affect recognition; the results show that, even though the prediction passes through this intermediate continuous-emotion encoding, it still effectively helps global affect recognition. Finally, we exploit the thin-slicing affect perception mechanism from human psychology to further raise the correlation of global affect recognition. We also observe an interesting phenomenon: the two machine learning algorithms yield similar continuous emotion tracking results, yet behave very differently for global affect.
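To make the time-interval encoding step concrete, the following is a minimal Python sketch of one plausible realization: frame-level multimodal features are summarized over fixed windows and a support vector regressor maps each window to a continuous emotion value. The feature dimensionality, window length, hop size, summary statistics, and SVR settings are illustrative assumptions, and the arrays are synthetic placeholders rather than the behavioral features used in the thesis.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Placeholder frame-level multimodal features (rows = frames,
# columns = concatenated audio + body-language descriptors).
rng = np.random.default_rng(0)
frames = rng.normal(size=(3000, 20))      # synthetic feature stream
ratings = rng.uniform(-1, 1, size=3000)   # synthetic continuous rating (e.g., activation)

def encode_windows(feats, labels, win=100, hop=50):
    """Encode each time window as mean/std statistics of its frames,
    paired with the mean continuous rating over that window."""
    X, y = [], []
    for start in range(0, len(feats) - win + 1, hop):
        seg = feats[start:start + win]
        X.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
        y.append(labels[start:start + win].mean())
    return np.asarray(X), np.asarray(y)

X, y = encode_windows(frames, ratings)
scaler = StandardScaler().fit(X)
svr = SVR(kernel="rbf", C=1.0)            # one regressor per emotion dimension
svr.fit(scaler.transform(X), y)
pred = svr.predict(scaler.transform(X))   # window-level continuous predictions
print(np.corrcoef(pred, y)[0, 1])         # correlation, a common tracking metric
```

In practice one regressor would be trained per dimension (activation, valence, dominance), and the window and hop lengths would be tuned as part of the time-encoding design.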
Human communication is inherently multimodal: voice, facial expressions, and body movements together convey a person's affect. At different moments of an interaction, people feel different emotions, and these expressions of emotion can be annotated as global affect or as continuous emotion. Continuous emotion is complex and highly informative, and both continuous and global emotion are important. Therefore, we apply a time-encoding framework to behavioral signals, namely body language and audio, to track continuous emotion along three dimensions: activation, valence, and dominance. For training and testing we use two machine learning algorithms, support vector regression and sequence-to-sequence learning. We then extend the framework in two directions: first, we focus on continuous emotion tracking; second, we use the tracked continuous emotion to predict global affect. Compared with previous studies, our system achieves better continuous emotion tracking performance, and the tracked continuous emotion also effectively helps global affect recognition. Interestingly, combining the continuous emotion tracking results with the human thin-slicing perception mechanism further improves the global affect correlation.
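As a rough illustration of how tracked continuous emotion can feed global affect recognition through a thin-slicing style selection, the sketch below keeps only a few short, highly expressive slices of a predicted trace and averages them into one global score. The slice length, the ranking criterion (mean absolute intensity), and the number of slices kept are illustrative assumptions, not the exact procedure used in the thesis.

```python
import numpy as np

def thin_slice_global(trace, slice_len=50, top_k=3):
    """Aggregate a continuous emotion trace into a single global score by
    keeping only a few short, highly expressive slices (thin-slicing idea)."""
    slices = [trace[i:i + slice_len]
              for i in range(0, len(trace) - slice_len + 1, slice_len)]
    ranked = sorted(slices, key=lambda s: np.abs(s).mean(), reverse=True)
    return float(np.mean(np.concatenate(ranked[:top_k])))

rng = np.random.default_rng(1)
continuous_pred = rng.uniform(-1, 1, size=600)   # placeholder tracker output
print(thin_slice_global(continuous_pred))        # estimated global affect value
```

A full system would apply such an aggregation to the tracker's per-dimension outputs and compare the resulting global estimates against session-level annotations.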