
Student: 蕭善文 (Hsiao, Shan Wen)
Thesis title: 應用多任務與多模態融合技術於候用校長演講自動評分系統之建構
Toward Automatic Assessment of Pre-service Principals Oral Presentation using Multitask and Multimodal Fusion Technique
Advisor: 李祈均 (Lee, Chi Chun)
Committee members: 謝名娟, 孫民, 曹昱, 蔡明學
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2016
Graduation academic year: 105 (ROC calendar)
Language: Chinese
Number of pages: 45
Keywords (Chinese): 人類行為訊號處理, 教育研究, 口頭演講, 多模態訊號處理, 多任務學習
Keywords (English): behavioral signal processing, educational research, oral presentation, multimodal signal processing, multi-task learning
    Abstract (Chinese): Experts in many scientific fields have devoted themselves to developing computational models of human behavior, and this has become a highly forward-looking line of interdisciplinary research. This thesis is a collaboration with researchers at the National Academy for Educational Research (NAER) to develop an automatic scoring system for pre-service principals' impromptu speeches. The speech data come from the pre-service training (certification) program that NAER holds for prospective principals. We propose a short-time, dense feature computation method, combined with a session-level encoding algorithm, to characterize each principal's multimodal behavior over the course of a speech. We further design a behavioral computing framework with preliminary recognition capability and extend it experimentally in two directions. First, inspired by psychological studies of human internal decision-making and judgment mechanisms, we train classification models on cognitively extreme sample sets and use the computed confidence scores as the system's numeric ratings for test speeches. Second, given the successful application of multi-task learning in many research areas, we integrate training tasks drawn from the various dimensions of the certification program, exploiting the latent relatedness among tasks to extract more salient behavioral feature information and achieve better learning for the scoring system. All experiments and result analyses demonstrate the feasibility of applying this system to high-level, subjective attributes.
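
    The "train on cognitive extremes, score by confidence" step above can be made concrete with a minimal Python sketch. This is an illustration under stated assumptions, not the thesis's exact pipeline: it assumes a probabilistic SVM (scikit-learn's SVC) as the classifier, random data in place of real session-level behavior encodings, and arbitrary quartile cutoffs for what counts as "extreme".

        import numpy as np
        from sklearn.svm import SVC

        # Illustrative data: X holds session-level feature vectors, y_raw holds
        # continuous human ratings. Dimensions and thresholds are assumptions.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 32))           # 100 speeches, 32-dim encodings
        y_raw = rng.uniform(1.0, 5.0, size=100)  # expert ratings on a 1-5 scale

        # Train only on the extremes: clearly low- vs. clearly high-rated
        # speeches, mimicking the reliable binary judgments humans make.
        lo, hi = np.quantile(y_raw, [0.25, 0.75])
        extreme = (y_raw <= lo) | (y_raw >= hi)
        y_bin = (y_raw[extreme] >= hi).astype(int)
        clf = SVC(kernel="linear", probability=True).fit(X[extreme], y_bin)

        # Grade every speech (including the ambiguous middle) by the model's
        # confidence that it belongs to the high-quality class.
        predicted_scores = clf.predict_proba(X)[:, 1]
        print(predicted_scores[:5])

    The design choice mirrors the psychological motivation: raters agree most on clearly good and clearly poor speeches, so the classifier is trained only where labels are trustworthy, and the middle of the scale is filled in by its confidence.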


    Abstract (English): Developing computational models of human behavior for experts in many scientific fields has been at the forefront of interdisciplinary research. In this work, we collaborate with researchers from the National Academy for Educational Research (NAER) to develop an automatic scoring system for pre-service principals' impromptu speeches at the certification program. We propose dense unit-level feature extraction and session-level encoding methods to characterize the principals' multimodal behavior. Moreover, we extend the framework in two directions. First, inspired by psychological evidence on human decision-making mechanisms, we assign the confidence scores output by a classifier as the predicted scores for all speeches. Second, as multi-task learning has recently been applied successfully in many fields, we leverage other training tasks of the certification program to incorporate information from these additional targets and improve the performance of our scoring system. All experiments demonstrate that our framework is indeed capable of handling high-level and subjective attributes.
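
    The "session-level encoding" above turns a variable number of short-time feature frames into one fixed-length vector per speech. A minimal Python sketch of the bag-of-words variant follows (the thesis outline below also lists Fisher vector encoding); the k-means codebook, the codebook size of 64, and the random frames are illustrative assumptions.

        import numpy as np
        from sklearn.cluster import KMeans

        # Fake dense unit-level features: each session is a (frames x dims)
        # matrix, e.g. frame-wise acoustic descriptors. Sizes are arbitrary.
        rng = np.random.default_rng(1)
        sessions = [rng.normal(size=(int(rng.integers(200, 400)), 13))
                    for _ in range(10)]

        # 1. Learn a codebook from all frames pooled across sessions.
        codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
        codebook.fit(np.vstack(sessions))

        # 2. Encode each session as a normalized histogram of codeword
        #    counts: a fixed-length vector regardless of speech duration.
        def encode(frames):
            words = codebook.predict(frames)
            hist = np.bincount(words, minlength=64).astype(float)
            return hist / hist.sum()

        session_vectors = np.stack([encode(s) for s in sessions])
        print(session_vectors.shape)  # (10, 64)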

    Table of contents:
      Committee approval certificate
      Acknowledgements
      Abstract (Chinese)
      Abstract (English)
      Table of contents
      List of figures
      List of tables
      Chapter 1  Introduction
      Chapter 2  Database
        2.1  NAER pre-service principals' impromptu speeches
        2.2  Self-defined speech scoring items
        2.3  Principal training tasks in the pre-service (certification) program
      Chapter 3  Methodology
        3.1  Short-time dense feature extraction
          3.1.1  Audio modality
          3.1.2  Video modality
        3.2  Session-level encoding
          3.2.1  Bag-of-words model
          3.2.2  Fisher vector encoding
        3.3  Classifier-based continuous scoring
        3.4  Multi-task learning
          3.4.1  Joint feature learning
          3.4.2  Gram matrix combination
      Chapter 4  Experiment I: classification
        4.1  Experimental setup
        4.2  Results and analysis
      Chapter 5  Experiment II: scoring
        5.1  Experimental setup
        5.2  Results and analysis
      Chapter 6  Experiment III: multi-task learning
        6.1  Experimental setup
        6.2  Results and analysis
      Chapter 7  Conclusion
      References
      Appendix 1
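
    Section 3.4.1 of the outline above names "joint feature learning" as one of the multi-task techniques. A common formulation of that idea, and plausibly the one intended, couples tasks through an l2,1-norm penalty so that entire feature rows are kept or discarded jointly across tasks. The sketch below solves a squared-loss version by proximal gradient descent; the data, penalty weight, and step size are illustrative assumptions, not values from the thesis.

        import numpy as np

        def l21_prox(W, t):
            # Row-wise soft-thresholding: proximal operator of t * ||W||_{2,1}.
            norms = np.linalg.norm(W, axis=1, keepdims=True)
            return W * np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))

        def multitask_l21(Xs, ys, lam=0.05, step=0.1, iters=500):
            # W holds one weight column per task; rows correspond to features.
            d, T = Xs[0].shape[1], len(Xs)
            W = np.zeros((d, T))
            for _ in range(iters):
                G = np.zeros_like(W)
                for t, (X, y) in enumerate(zip(Xs, ys)):
                    G[:, t] = X.T @ (X @ W[:, t] - y) / len(y)  # squared-loss gradient
                W = l21_prox(W - step * G, step * lam)          # joint row shrinkage
            return W

        # Toy demo: three related scoring tasks sharing the same 5 relevant
        # features out of 20; the l2,1 penalty keeps those rows active together.
        rng = np.random.default_rng(2)
        Xs = [rng.normal(size=(60, 20)) for _ in range(3)]
        w_true = np.zeros(20)
        w_true[:5] = 1.0
        ys = [X @ w_true + 0.1 * rng.normal(size=60) for X in Xs]
        W = multitask_l21(Xs, ys)
        print(np.round(np.linalg.norm(W, axis=1), 2))  # rows past the first 5 shrink toward 0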


    Full text: not authorized for public access (campus network, off-campus network, and the National Central Library's Taiwan NDLTD system).