Graduate Student: 汪叔慧 Wang, Shu-Hui
Thesis Title: 應用於即時情感辨識之基於屬性的雙通道時間網絡 (AST-Net: An Attribute-based Siamese Temporal Network for Real-Time Emotion Recognition)
Advisor: 許秋婷 Hsu, Chiou-Ting
Committee Members: 李祈均 Lee, Chi-Chun; 陳永昇 Chen, Yong-Sheng
Degree: Master (碩士)
Department:
Year of Publication: 2017
Graduation Academic Year: 105
Language: English
Number of Pages: 32
Chinese Keywords: 情感辨識 (emotion recognition), 時間網絡 (temporal network), 卷積類神經網路 (convolutional neural network)
English Keywords: Temporal Network, Attribute Feature
Predicting continuous and spontaneous changes in facial emotions is an important research problem in computer vision, because understanding real-time and subtle emotional changes benefits many applications in human-computer interaction and healthcare monitoring. In this thesis, we focus on analyzing the temporal dynamics of two emotion dimensions, valence and arousal. We propose an Attribute-based Siamese Temporal Network, which consists of a discrete emotion CNN model and a Stacked-LSTM; together, these two models effectively combine spatial facial attribute information with long-term dynamics to support prediction. The discrete emotion CNN model extracts emotion-related features that are invariant to pose and identity variations, while the Stacked-LSTM learns the dynamic dependency of emotions along the temporal domain. Furthermore, to stabilize the training procedure and thereby obtain smoother and more reliable long-term predictions, we feed two temporally shifted video segments jointly into the Siamese (dual-stream) network architecture. Experimental results on AVEC2012 show that the proposed method not only predicts in real time (40.1 frames per second on average) but also achieves the best results to date on AVEC2012 when using the vision modality alone.
Predicting continuous facial emotions is essential to many applications in human-computer interaction. In this paper, we focus on predicting the two emotion dimensions, valence and arousal, to interpret dynamic yet subtle changes in facial emotions. We propose an Attribute-based Siamese Temporal Network (AST-Net), which includes a discrete emotion CNN model and a Stacked-LSTM, to incorporate both the spatial facial attributes and the long-term dynamics into the prediction. The discrete emotion CNN model aims to extract attribute-related but pose- and identity-invariant features, while the Stacked-LSTM characterizes the dynamic dependency along the temporal domain. Furthermore, in order to stabilize the training procedure and to derive a smoother and more reliable long-term prediction, we propose to jointly learn the model from two temporally shifted videos under the Siamese network architecture. Experimental results on the AVEC2012 dataset show that the proposed AST-Net not only processes in real time (40.1 frames per second) but also achieves state-of-the-art performance even when using the vision modality alone.
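The abstract outlines the AST-Net pipeline at a high level: a per-frame attribute CNN, a Stacked-LSTM over the frame features, and joint training on two temporally shifted clips with shared weights. The PyTorch sketch below is a rough illustration of that structure only; the convolutional backbone, layer sizes, loss, the consistency term on overlapping frames, and all names such as `ASTNetSketch` and `siamese_step` are assumptions made for this example and are not taken from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASTNetSketch(nn.Module):
    """Toy stand-in for the attribute CNN + Stacked-LSTM regressor.

    The backbone, feature/hidden sizes, and two-layer LSTM are illustrative
    assumptions, not the configuration reported in the thesis.
    """

    def __init__(self, feat_dim=256, hidden_dim=128, num_layers=2):
        super().__init__()
        # Per-frame feature extractor (stand-in for the discrete emotion CNN).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Stacked LSTM captures temporal dependency across the frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        # Frame-wise regression of the two emotion dimensions (valence, arousal).
        self.head = nn.Linear(hidden_dim, 2)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W) -> per-frame (batch, time, 2) predictions
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out)


def siamese_step(model, clip_a, clip_b, target_a, target_b, shift):
    """One shared-weight ("Siamese") training step on two shifted clips.

    clip_b is assumed to start `shift` frames after clip_a; the consistency
    term on the overlapping frames is an assumed formulation, since the
    abstract only states that the two clips are learned jointly.
    """
    pred_a, pred_b = model(clip_a), model(clip_b)
    loss = F.mse_loss(pred_a, target_a) + F.mse_loss(pred_b, target_b)
    # Encourage agreement on the frames the two shifted windows share.
    loss = loss + F.mse_loss(pred_a[:, shift:], pred_b[:, :-shift])
    return loss


if __name__ == "__main__":
    model = ASTNetSketch()
    clip_a = torch.randn(2, 16, 3, 96, 96)   # two 16-frame face clips
    clip_b = torch.randn(2, 16, 3, 96, 96)   # the same clips, shifted by 4 frames
    tgt_a, tgt_b = torch.randn(2, 16, 2), torch.randn(2, 16, 2)
    siamese_step(model, clip_a, clip_b, tgt_a, tgt_b, shift=4).backward()
```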