
Graduate Student: 汪叔慧 (Wang, Shu-Hui)
Thesis Title: AST-Net: An Attribute-based Siamese Temporal Network for Real-Time Emotion Recognition
(Chinese title: 應用於即時情感辨識之基於屬性的雙通道時間網絡)
Advisor: 許秋婷 (Hsu, Chiou-Ting)
Oral Defense Committee: 李祈均 (Lee, Chi-Chun), 陳永昇 (Chen, Yong-Sheng)
Degree: Master
Department:
Year of Publication: 2017
Graduation Academic Year: 105 (ROC calendar)
Language: English
Number of Pages: 32
Chinese Keywords: 情感辨識 (emotion recognition), 時間網絡 (temporal network), 卷積類神經網路 (convolutional neural network)
Foreign Keywords: Temporal Network, Attribute Feature
Predicting continuous and spontaneous changes in facial emotion is an important research topic in computer vision, because understanding instantaneous and subtle emotional changes benefits many applications in human-computer interaction and medical monitoring. In this thesis, we focus on analyzing the temporal dynamics of two emotion dimensions, valence and arousal. We propose an Attribute-based Siamese Temporal Network, which consists of a discrete emotion CNN model and a Stacked-LSTM; together, the two models effectively combine spatial facial-feature information with long-term temporal dynamics to support the prediction. The discrete emotion CNN model extracts emotion-related features that are unaffected by pose and identity variations, while the Stacked-LSTM learns the dynamic dependency of emotions along the temporal domain. Furthermore, to stabilize the training procedure and thereby obtain smoother and more reliable long-term predictions, we feed two temporally-shifted video segments into a Siamese (two-branch) network architecture. Experimental results on AVEC2012 show that the proposed method not only predicts in real time (on average 40.1 frames per second) but also achieves the best results to date on the AVEC2012 dataset when using only visual information.


Predicting continuous facial emotions is essential to many applications in human-computer interaction. In this paper, we focus on predicting the two emotion dimensions, valence and arousal, to interpret dynamic yet subtle changes in facial emotion. We propose an Attribute-based Siamese Temporal Network (AST-Net), which includes a discrete emotion CNN model and a Stacked-LSTM, to incorporate both the spatial facial attributes and the long-term dynamics into the prediction. The discrete emotion CNN model aims to extract attribute-related but pose- and identity-invariant features, while the Stacked-LSTM characterizes the dynamic dependency along the temporal domain. Furthermore, in order to stabilize the training procedure and to derive smoother and more reliable long-term predictions, we propose to jointly learn the model from two temporally-shifted videos under the Siamese network architecture. Experimental results on the AVEC2012 dataset show that the proposed AST-Net not only runs in real time (40.1 frames per second) but also achieves state-of-the-art performance even when using the vision modality alone.
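
To make the pipeline above concrete, the following is a minimal PyTorch sketch of how a per-frame CNN feeding a stacked LSTM could be trained in a Siamese fashion on two temporally-shifted clips. It is an illustration only: the toy CNN backbone, layer sizes, loss form, and the consistency weighting are assumptions for readability, not the network, hyper-parameters, or loss actually used in the thesis.

    # Illustrative sketch only; not the thesis implementation.
    import torch
    import torch.nn as nn

    class AttributeCNN(nn.Module):
        # Stand-in for the discrete-emotion CNN: maps each frame to a feature vector.
        def __init__(self, feat_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, feat_dim)

        def forward(self, x):                    # x: (B*T, 3, H, W)
            return self.fc(self.conv(x).flatten(1))

    class ASTNetSketch(nn.Module):
        # One branch: per-frame CNN features -> stacked LSTM -> (valence, arousal) per frame.
        def __init__(self, feat_dim=256, hidden=128, num_layers=2):
            super().__init__()
            self.cnn = AttributeCNN(feat_dim)
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers, batch_first=True)
            self.head = nn.Linear(hidden, 2)

        def forward(self, clip):                 # clip: (B, T, 3, H, W)
            b, t = clip.shape[:2]
            feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
            out, _ = self.lstm(feats)
            return self.head(out)                # (B, T, 2)

    def siamese_loss(model, clip_a, clip_b, target_a, target_b, shift, w=0.5):
        # clip_b starts `shift` frames after clip_a; both clips go through the same
        # model instance, so the two branches share weights (the Siamese idea).
        pred_a, pred_b = model(clip_a), model(clip_b)
        mse = nn.functional.mse_loss
        regression = mse(pred_a, target_a) + mse(pred_b, target_b)
        # Encourage the two branches to agree on the frames where the clips overlap.
        consistency = mse(pred_a[:, shift:], pred_b[:, :-shift])
        return regression + w * consistency

    # Example usage with random tensors: batch of 2 videos, 16 frames of 64x64 crops.
    model = ASTNetSketch()
    clip_a = torch.randn(2, 16, 3, 64, 64)
    clip_b = torch.randn(2, 16, 3, 64, 64)
    labels_a = torch.randn(2, 16, 2)
    labels_b = torch.randn(2, 16, 2)
    loss = siamese_loss(model, clip_a, clip_b, labels_a, labels_b, shift=4)
    loss.backward()

The point of the sketch is the weight sharing: because one model processes both shifted clips, the consistency term only constrains predictions on their overlapping frames, which is one plausible way to obtain the smoother long-term predictions described above.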

Table of Contents:
中文摘要 (Chinese Abstract)
Abstract
1. Introduction
2. Related Work
3. Proposed Method
   3.1 Pre-Processing
   3.2 Discrete Emotion CNN Model
   3.3 Stacked-LSTM
   3.4 Siamese Temporal Network and Loss Function
4. Experiments
   4.1 Datasets
   4.2 Evaluation Scheme
   4.3 Implementation Details
   4.4 Results
5. Conclusion
References

