
Graduate Student: Lu, Chih-Chuan (呂志娟)
Thesis Title: Observe Critical Data in Emotion Recognition Using a Speech Front-End Network Learned from Media Data In-the-Wild (利用多媒體資料建構的語音前端網路觀察情緒辨識重要資料)
Advisor: Lee, Chi-Chun (李祈均)
Committee Members: Tsao, Yu (曹昱); Hu, Min-Chun (胡敏君); Lai, Ying-Hui (賴穎暉)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108
Language: Chinese
Number of Pages: 41
Chinese Keywords: 語音情緒辨識、卷積神經網路、語音前端網路、初始化微調
English Keywords: speech emotion recognition, convolutional neural network, speech front-end network, initialization fine-tuning
    Thanks to deep learning, speech emotion recognition has made increasingly impressive progress in recent years. However, the complexity of emotion still makes emotion corpora hard to collect: speech emotion data are difficult to accumulate quickly and vary greatly across different contexts. Initialization followed by fine-tuning is a common remedy in deep learning, yet initializing purely from multimedia background data leaves too large a gap from speech emotion recognition; emotion-related guidance during initialization, or a more precise procedure during fine-tuning, is still needed. This thesis therefore proposes to use large amounts of readily available media data, together with proxy arousal and valence labels derived from their audio and text, to learn an initialization speech front-end network for speech emotion recognition; a sampling method oriented by this initialization network then assists fine-tuning to build the speech emotion recognition model on the target corpus. The results show that, with the aid of the speech front-end network and the sampling method, performance consistently surpasses random initialization by a clear margin.


    The rapid development of deep learning has brought clear benefits to speech emotion recognition (SER), yet the complexity of emotion still poses two problems: large-scale annotated data are difficult to obtain quickly, and the high variability across domains is hard to handle. The initialization and fine-tuning strategy is a common solution in deep learning research. However, initializing from abundant media data alone still leaves a large discrepancy from the SER problem, and introducing emotion guidance helps to bridge it. In this work, we propose to learn an initialization speech front-end network on large-scale media data collected in-the-wild, jointly with proxy arousal-valence labels that are multimodally derived from audio and text information, and then to build the SER prediction model by fine-tuning with the assistance of an initialization-oriented sampling method. The results show that combining the speech front-end network with the sampling method achieves better performance than random initialization.
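    To make this pipeline concrete, the following is a minimal PyTorch sketch of the initialization and fine-tuning idea. It is not the thesis's actual architecture, label-derivation rules, or sampling criterion: all module names, layer sizes, the placeholder tensors, and the entropy-based selection of "important" target samples are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFrontEnd(nn.Module):
    """CNN front-end over log-Mel patches shaped (batch, 1, n_mels, n_frames)."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, emb_dim, 3, padding=1), nn.BatchNorm2d(emb_dim), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, x):
        return self.conv(x).flatten(1)           # (batch, emb_dim)

class ProxyHead(nn.Module):
    """Pretraining head: predicts proxy arousal/valence classes on media data."""
    def __init__(self, front_end, n_bins=3, emb_dim=64):
        super().__init__()
        self.front_end = front_end
        self.arousal = nn.Linear(emb_dim, n_bins)
        self.valence = nn.Linear(emb_dim, n_bins)
    def forward(self, x):
        z = self.front_end(x)
        return self.arousal(z), self.valence(z)

class EmotionClassifier(nn.Module):
    """Target SER model: reuses the pretrained front-end plus a fresh head."""
    def __init__(self, front_end, n_emotions=4, emb_dim=64):
        super().__init__()
        self.front_end = front_end
        self.head = nn.Linear(emb_dim, n_emotions)
    def forward(self, x):
        return self.head(self.front_end(x))

ce = nn.CrossEntropyLoss()

# Stage 1: initialize the front-end on media data with proxy labels (one step shown).
front_end = SpeechFrontEnd()
pretrain_model = ProxyHead(front_end)
opt = torch.optim.Adam(pretrain_model.parameters(), lr=1e-3)
media_x = torch.randn(8, 1, 40, 100)             # placeholder log-Mel patches
proxy_arousal = torch.randint(0, 3, (8,))        # stand-in for rule-based acoustic proxy labels
proxy_valence = torch.randint(0, 3, (8,))        # stand-in for lexicon-based textual proxy labels
a_logits, v_logits = pretrain_model(media_x)
loss = ce(a_logits, proxy_arousal) + ce(v_logits, proxy_valence)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: score target-corpus utterances with the initialization network and keep a
# subset for fine-tuning. Prediction entropy here is only an illustrative stand-in
# for the thesis's initialization-oriented sampling method.
target_x = torch.randn(16, 1, 40, 100)
target_y = torch.randint(0, 4, (16,))
with torch.no_grad():
    a_probs = F.softmax(pretrain_model(target_x)[0], dim=1)
    entropy = -(a_probs * a_probs.log()).sum(dim=1)
keep = entropy.topk(8).indices                   # keep the most uncertain half

# Stage 3: fine-tune the SER model starting from the pretrained front-end weights.
ser_model = EmotionClassifier(front_end)         # front-end weights carry over
ft_opt = torch.optim.Adam(ser_model.parameters(), lr=1e-4)   # smaller LR for fine-tuning
ft_loss = ce(ser_model(target_x[keep]), target_y[keep])
ft_opt.zero_grad(); ft_loss.backward(); ft_opt.step()
```

    In the actual setup described by the table of contents, the proxy arousal labels come from a rule-based acoustic procedure and the proxy valence labels from a lexicon applied to the transcripts (Chapter 3), rather than from the placeholders used above.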

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1  Preface
      1.2  Motivation and Objectives
      1.3  Thesis Organization
    Chapter 2  Databases and Preprocessing
      2.1  Database Overview
        2.1.1  Background database: TED-LIUM
        2.1.2  Target database: IEMOCAP
      2.2  Data Preprocessing
        2.2.1  Speech data
        2.2.2  Label data
    Chapter 3  Methodology
      3.1  Proxy Labels
        3.1.1  Rule-based arousal labels
        3.1.2  Lexicon-based valence labels
      3.2  Neural Networks
        3.2.1  Deep Neural Network (DNN)
        3.2.2  Convolutional Neural Network (CNN)
      3.3  Initialization and Fine-tuning
      3.4  Speech Front-End Network Training and Application
        3.4.1  Initialization network
        3.4.2  Sampling method and fine-tuned network
    Chapter 4  Experimental Design and Result Analysis
      4.1  Experimental Design
      4.2  Experiment 1: Front-end network architecture
      4.3  Experiment 2: Fine-tuning on the target database with limited data
      4.4  Experiment 3: Sampling critical data
      4.5  Experiment 4: Different sampling parameters
      4.6  Analysis of Experimental Results
    Chapter 5  Conclusion and Future Work
    References

