| Author: | 陳亘宇 Chen, Hsuan Yu |
|---|---|
| Thesis Title: | 透過結合fMRI大腦血氧濃度相依訊號以改善語音情緒辨識系統 Improving Categorical Emotion Recognition by Fusing Audio Features with Generated fMRI Brain Responses |
| Advisor: | 李祈均 Lee, Chi Chun |
| Committee Members: | 郭立威 Kuo, Li Wei; 劉弈汶 Liu, Yi Wen; 曹昱 Tsao, Yu |
| Degree: | Master (碩士) |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2016 |
| Graduation Academic Year: | 105 |
| Language: | Chinese |
| Number of Pages: | 36 |
| Chinese Keywords: | human behavioral signal processing, emotion recognition, emotional valence, functional magnetic resonance imaging, Gaussian mixture regression model |
| English Keywords: | behavioral signal processing (BSP), emotion recognition, valence, fMRI, Gaussian Mixture Regression |
Understanding how the human neuro-perceptual system decodes emotion in speech is an important research direction. Emotional activation (arousal) is relatively well expressed in speech, for example through a raised voice or speaking rate, and even without semantic content it can be recognized from basic acoustic features. Applying the same framework to the analysis of emotional valence is considerably harder: judging the valence of speech usually requires semantic content, and when semantics are absent it is difficult to capture valence with acoustic features alone. In this thesis, we investigate whether the brain's blood oxygenation level dependent (BOLD) signal, measured with fMRI, can help improve a speech-based emotion recognition system, particularly for valence differences when subjects are presented with vocal stimuli that carry no semantic content. We aim to use fMRI-derived features to strengthen the representation of valence in the speech signal and to examine which characteristics underlie this affective perception.
However, fMRI studies require substantial cost and time for data collection, and recruiting suitable, cooperative subjects who satisfy the experimental requirements is often difficult. This thesis therefore builds a statistical generative model, Gaussian mixture regression (GMR), that describes the joint relationship between speech features and fMRI-derived brain features; with this model we can simulate fMRI features when fMRI data are unavailable. Through a series of experiments we verify that the fMRI features generated by GMR can also provide a clear improvement to emotion recognition based on speech features.
Understanding the underlying neuro-perceptual mechanism of humans' ability to decode emotional content in vocal signals is an important research direction. However, it is well known that obtaining valence from speech features is much more difficult than obtaining arousal. Arousal can be accurately identified, and automatically recognized, from the speech signal even without context, whereas valence is much harder to recognize when the speech carries no context. In this work, we obtain fMRI-derived features from blood oxygen level-dependent (BOLD) signals recorded while subjects are exposed to various vocal emotion stimuli. We observe that using the fMRI-derived features to predict valence is beneficial to a speech-based emotion recognition system. Furthermore, because fMRI scanning is costly and time-consuming, we integrate audio features and fMRI-derived features to learn a joint representation using Gaussian mixture regression (GMR). Finally, the proposed framework demonstrates that we can obtain improved categorical emotion recognition using audio features fused with simulated vocal-induced fMRI-derived features generated by the GMR model.
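The GMR step described in the abstract can be sketched as follows: fit a single Gaussian mixture over concatenated audio and fMRI feature vectors, then, for a new audio feature vector, take the conditional expectation of the fMRI block given the audio block and fuse the result with the audio features. The code below is a minimal sketch assuming scikit-learn and SciPy; the feature dimensions, number of mixture components, and variable names are illustrative assumptions, not the thesis' actual configuration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(audio_feats, fmri_feats, n_components=4):
    """Fit one GMM over concatenated [audio, fMRI] feature vectors."""
    joint = np.hstack([audio_feats, fmri_feats])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=0).fit(joint)

def gmr_generate_fmri(gmm, audio_feats, audio_dim):
    """Standard GMR: return E[fMRI block | audio block] for each input vector."""
    preds = []
    for x in audio_feats:
        resp, cond_means = [], []
        for k in range(gmm.n_components):
            mu, cov = gmm.means_[k], gmm.covariances_[k]
            mu_x, mu_y = mu[:audio_dim], mu[audio_dim:]
            S_xx = cov[:audio_dim, :audio_dim]
            S_yx = cov[audio_dim:, :audio_dim]
            # Responsibility of component k given the observed audio features.
            resp.append(gmm.weights_[k] *
                        multivariate_normal.pdf(x, mean=mu_x, cov=S_xx))
            # Conditional mean of the fMRI block given the audio block.
            cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
        resp = np.asarray(resp)
        resp /= resp.sum()
        preds.append((resp[:, None] * np.asarray(cond_means)).sum(axis=0))
    return np.asarray(preds)

# Toy usage with synthetic stand-ins for real audio/fMRI feature matrices.
rng = np.random.default_rng(0)
audio_train = rng.standard_normal((300, 10))  # e.g. prosodic/spectral descriptors
fmri_train = rng.standard_normal((300, 20))   # e.g. ROI-level BOLD-derived features
gmm = fit_joint_gmm(audio_train, fmri_train)

audio_test = rng.standard_normal((5, 10))
fmri_generated = gmr_generate_fmri(gmm, audio_test, audio_dim=10)
fused = np.hstack([audio_test, fmri_generated])  # fused features for the emotion classifier
```

In this sketch the fused vector simply concatenates the observed audio features with the GMR-generated fMRI features, mirroring the fusion idea in the abstract; the downstream categorical emotion classifier is left out.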