Graduate Student: 廖育賢 (Liao, Yu-Hsien)
Thesis Title: 結合fMRI之迴旋積類神經網路多層次特徵用以改善語音情緒辨識系統 (Improving Audio-based Categorical Emotion Recognition System by Fusing Convolutional Neural Network Hierarchical Features from fMRI Scanning)
Advisor: 李祈均 (Lee, Chi-Chun)
Committee Members: 曹昱, 許秋婷, 郭立威
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2016
Graduation Academic Year: 105
Language: Chinese
Pages: 45
Keywords: fMRI, Behavior Signal Processing, Emotion Recognition, Convolutional Neural Network
In recent years, researchers in many fields have sought to use internal behavioral signals to improve emotion recognition from external behaviors in affective computing. The aim of this thesis is to improve a categorical emotion recognition system built on an external behavior, the speech signal, by fusing an internal behavior, the blood-oxygen-level-dependent (BOLD) signal recorded by fMRI. The main contribution is to replace the traditional multivariate pattern analysis (MVPA) feature extraction used in fMRI research with a convolutional neural network (CNN), treating the fMRI data from a computer-vision perspective. A CNN is an architecture derived from deep learning whose convolutions exploit local correlations to extract features locally, while passing the fMRI data through multiple linear and non-linear transformations into high-dimensional, hierarchical features.
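As a minimal sketch of the kind of CNN feature extractor described above (the abstract does not publish an architecture, so the layer sizes, the PyTorch framework, and the FMRIFeatureCNN name are all illustrative assumptions, not the thesis's actual network):

```python
import torch
import torch.nn as nn

class FMRIFeatureCNN(nn.Module):
    """Illustrative 3D CNN mapping an fMRI volume to hierarchical features."""

    def __init__(self, n_classes: int = 4):
        super().__init__()
        # Level 1: local, low-level features via small 3D convolutions.
        self.block1 = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2)
        )
        # Level 2: more abstract features built on level 1.
        self.block2 = nn.Sequential(
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2)
        )
        # Linear read-out into categorical emotion classes.
        self.classifier = nn.LazyLinear(n_classes)

    def forward(self, x):
        h1 = self.block1(x)          # hierarchical feature, level 1
        h2 = self.block2(h1)         # hierarchical feature, level 2
        logits = self.classifier(h2.flatten(1))
        return logits, (h1, h2)      # keep both levels for later fusion

# Example: a batch of one single-channel 64x64x36 volume.
volume = torch.randn(1, 1, 64, 64, 36)
logits, features = FMRIFeatureCNN()(volume)
```

Returning the intermediate activations alongside the logits is one simple way to expose the "multi-level" features the abstract refers to for downstream fusion.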
Furthermore, this thesis applies majority voting to the emotion predictions obtained from different subjects, discounting the divergent cognitive decisions of a few subjects to produce a group-level result. It also adopts a region-based CNN scheme, training a separate CNN on each lobe system to achieve still more localized feature extraction; this strengthens feature learning, improves recognition performance, and helps identify the brain regions that matter most for affective computing.
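A minimal sketch of the subject-level majority vote, assuming plain Python and one predicted label per subject (the tie-breaking convention below is an assumption; the abstract does not specify one):

```python
from collections import Counter

def majority_vote(per_subject_labels):
    """Return the label most subjects agree on for one stimulus.

    Ties fall to the label seen first, which is only one convention;
    the thesis's tie-breaking rule is not stated in the abstract.
    """
    return Counter(per_subject_labels).most_common(1)[0][0]

# Example: five subjects judge the same utterance.
print(majority_vote(["angry", "angry", "sad", "angry", "neutral"]))  # -> angry
```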
Finally, this thesis combines the CNN hierarchical features of the temporal, frontal, and parietal lobes with majority voting, effectively improving the fused audio-fMRI multimodal categorical emotion recognition system. The experiments reported here demonstrate the CNN's ability to extract features from fMRI and its contribution to raising the accuracy of the fused fMRI-audio multimodal emotion recognition system.
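A minimal sketch of the lobe-wise, region-based fusion described above, assuming NumPy arrays; fuse_lobe_features, lobe_masks, and lobe_cnns are hypothetical names, and a real pipeline would use per-lobe voxel masks from an anatomical atlas such as AAL and CNNs trained per lobe:

```python
import numpy as np

def fuse_lobe_features(volume, lobe_masks, lobe_cnns, audio_feat):
    """Concatenate audio features with per-lobe CNN features.

    lobe_masks: lobe name -> boolean mask selecting that lobe's voxels.
    lobe_cnns:  lobe name -> feature extractor trained on that lobe
                alone (the region-based CNN), returning a 1-D vector.
    """
    parts = [audio_feat]
    for lobe in ("Temporal", "Frontal", "Parietal"):
        masked = volume * lobe_masks[lobe]      # zero voxels outside the lobe
        parts.append(lobe_cnns[lobe](masked))   # per-lobe hierarchical feature
    return np.concatenate(parts)                # fused multimodal vector

# Toy usage with stand-in masks and extractors.
vol = np.random.rand(64, 64, 36)
masks = {l: np.ones_like(vol, dtype=bool) for l in ("Temporal", "Frontal", "Parietal")}
cnns = {l: (lambda v: v.mean(axis=(1, 2))) for l in ("Temporal", "Frontal", "Parietal")}
fused = fuse_lobe_features(vol, masks, cnns, np.zeros(8))
```

Feature-level concatenation followed by a single classifier is one common fusion choice; the abstract states only that the three lobes' features are combined with the audio features.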