
Student: Lin, Tzu-Ying (林子盈)
Thesis Title: BiodesNet: Discriminative Multi-Modal Deep Learning for RGB-D Gesture and Face Recognition (基於具鑑別力之多模組深度學習神經網路之RGB-D手勢及人臉辨識)
Advisor: Chiu, Ching-Te (邱瀞德)
Committee Members: Chang, Long-Wen (張隆紋); Van, Lan-Da (范倫達); Yang, Jar-Ferr (楊家輝)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Pages: 60
Keywords: gesture recognition, face recognition, deep convolutional neural network, multi-modality, RGB-D recognition, global descriptor
Abstract:

In recent years, the rapid development of depth sensors and their wide range of application scenarios have made depth imaging an important technology for face and gesture recognition. Depth images provide information about appearance and object shape, and because they are invariant to lighting variations, sensing can run day or night. In AR/VR applications, for example, where lighting changes are complex, depth information is especially valuable for gesture recognition. With depth information, recognition accuracy can be expected to improve, and depth also supports anti-spoofing, which strengthens security.

Although many studies have been conducted, most prior methods rely on hand-crafted features. Such features do not readily extend to different datasets, and feature-extraction schemes designed for color images cannot be applied directly to depth images. More recently, with the growth of deep convolutional neural networks, some methods treat the depth map as a fourth image channel, while others learn color and depth features separately and concatenate them only at the end of the network. Neither approach adequately exploits the individual characteristics of the color and depth modalities to strengthen the complementarity and correlation between them.
In this thesis, we propose a discriminative multi-modal deep learning framework that performs face and gesture recognition and identification on the color and depth images captured by 3D sensors. We first train on the color and depth modalities separately to obtain features for each. When the two feature streams are joined in the later stages of the network, we add our proposed discriminative and associative loss functions, which find discriminative features within each modality while exploiting the complementary relationship between the two modalities to improve recognition performance.
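To make the fusion stage concrete, here is a minimal TensorFlow sketch of late fusion with added discriminative and associative penalties (the experiments run on TensorFlow [32]). The branch architecture, layer sizes, the weight `lam`, and the exact form of both loss terms are illustrative assumptions for this summary, not the thesis's actual definitions: per-branch classifiers stand in for the discriminative term, and an embedding-alignment penalty stands in for the associative term.

```python
import tensorflow as tf

num_classes = 24  # e.g. ASL finger-spelling letters; an assumed value


def branch(name):
    """One modality-specific CNN branch (color or depth), trained separately
    first; layer sizes here are placeholders, not the thesis's architecture."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(128),  # modality-specific feature vector
    ], name=name)


rgb = tf.keras.Input(shape=(128, 128, 3), name="rgb")
depth = tf.keras.Input(shape=(128, 128, 1), name="depth")

f_rgb = branch("color")(rgb)
f_depth = branch("depth")(depth)

# Late fusion: concatenate the two modality features, then classify jointly.
fused = tf.keras.layers.Concatenate()([f_rgb, f_depth])
logits = tf.keras.layers.Dense(num_classes, name="cls_fused")(fused)

# Per-branch classifiers keep each modality's features discriminative on their own.
logits_rgb = tf.keras.layers.Dense(num_classes, name="cls_rgb")(f_rgb)
logits_depth = tf.keras.layers.Dense(num_classes, name="cls_depth")(f_depth)

model = tf.keras.Model([rgb, depth],
                       [logits, logits_rgb, logits_depth, f_rgb, f_depth])
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


def total_loss(y, outputs, lam=0.1):
    logits, logits_rgb, logits_depth, f1, f2 = outputs
    # Discriminative terms: fused plus per-modality classification losses.
    disc = ce(y, logits) + ce(y, logits_rgb) + ce(y, logits_depth)
    # Associative term: pull the two modalities' embeddings of the same
    # sample together so the branches learn correlated, complementary cues.
    f1n = tf.math.l2_normalize(f1, axis=1)
    f2n = tf.math.l2_normalize(f2, axis=1)
    assoc = tf.reduce_mean(tf.reduce_sum(tf.square(f1n - f2n), axis=1))
    return disc + lam * assoc
```

Pretraining the two branches separately and then fine-tuning the joint model under `total_loss` mirrors the two-stage procedure the abstract describes, though the real objective functions are given in Chapter 3.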
We conducted experiments on RGB-D datasets for both gesture and face recognition. Our multi-modal learning method achieves 97.8% classification accuracy for gesture recognition on the ASL Finger Spelling dataset and 99.7% accuracy for face recognition on the IIITD RGB-D dataset. In addition, we map a color face image and its corresponding depth image through the network into a feature vector and convert it into a global descriptor. With 256-bit global descriptors, the equal error rate is 5.663%. Extracting a global descriptor from one image takes only 0.83 seconds with GPU acceleration, and matching two 256-bit global descriptors takes only 22.7 microseconds without GPU acceleration.
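The verification numbers above imply a lightweight matching pipeline: binarize the learned feature vector into a 256-bit string, compare descriptors by Hamming distance, and sweep a distance threshold to estimate the equal error rate. Below is a minimal NumPy sketch of such a pipeline; the sign-based binarization and the threshold sweep are assumptions for illustration, since the record does not specify the thesis's exact binarization or EER procedure.

```python
import numpy as np


def to_descriptor(feature: np.ndarray) -> np.ndarray:
    """Binarize a 256-dim float embedding into a packed 256-bit descriptor.
    Sign thresholding is an assumed scheme; packbits yields 32 bytes."""
    return np.packbits((feature > 0).astype(np.uint8))


def hamming(d1: np.ndarray, d2: np.ndarray) -> int:
    """Hamming distance between two packed descriptors: XOR, then count set
    bits. This bit-level comparison is why matching takes only microseconds."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())


def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Approximate EER: sweep a Hamming threshold over 0..256 and take the
    point where false accept rate and false reject rate cross."""
    eer = 1.0
    for t in range(257):
        far = float(np.mean(impostor <= t))  # impostor pairs wrongly accepted
        frr = float(np.mean(genuine > t))    # genuine pairs wrongly rejected
        eer = min(eer, max(far, frr))
    return eer


# Usage with stand-in embeddings (in the thesis these come from the network).
rng = np.random.default_rng(0)
a = to_descriptor(rng.standard_normal(256))
b = to_descriptor(rng.standard_normal(256))
print(hamming(a, b))  # about 128 expected for independent random descriptors
```

Packing the 256 bits into 32 bytes keeps each descriptor tiny, which is consistent with the microsecond-scale CPU matching time reported above.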

1 Introduction
  1.1 Background and Motivation
  1.2 Goal and Contribution
  1.3 Thesis Organization
2 Related Works
  2.1 RGB-D Based Gesture and Face Recognition
  2.2 Multi-Modal Feature Fusion for RGB-D Based Recognition
  2.3 Extract Global Descriptors using CNN
3 BiodesNet: Discriminative and Complementary Multi-Modal Feature Fusion with CNN-based Learning Framework
  3.1 RGB-D Based Gesture Recognition
    3.1.1 Proposed Network Structure
    3.1.2 Objective Functions
  3.2 RGB-D Based Face Recognition and Verification with Descriptor
    3.2.1 Proposed Network Structure
    3.2.2 Objective Functions
4 Experimental Results
  4.1 RGB-D Based Gesture Recognition
    4.1.1 Implementation Details
    4.1.2 ASL Finger Spelling Dataset [1]
    4.1.3 Classification Accuracy and Comparison
  4.2 RGB-D Based Face Recognition and Verification with Descriptor
    4.2.1 Implementation Details
    4.2.2 Datasets
    4.2.3 Classification Accuracy and Comparison
    4.2.4 Evaluation Results for Face Verification with Descriptor
5 Conclusions

[1] N. Pugeault and R. Bowden, “Spelling it out: Real-time ASL fingerspelling recognition,” in IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, pp. 1114–1119.
[2] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[3] A. Wang, J. Lu, J. Cai, T.-J. Cham, and G. Wang, “Large-margin multi-modal deep learning for RGB-D object recognition,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1887–1898, 2015.
[4] J.-H. Huang and C.-T. Chiu, “Learning global descriptors using supervised deep convolutional neural networks for fingerprint recognition,” Bachelor’s Thesis, National Tsing Hua University, 2017.
[5] G. Goswami, M. Vatsa, and R. Singh, “RGB-D face recognition with texture and attribute features,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 10, pp. 1629–1640, 2014.
[6] G. Goswami, S. Bharadwaj, M. Vatsa, and R. Singh, “On RGB-D face recognition using Kinect,” in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2013, pp. 1–6.
[7] R. Min, N. Kose, and J.-L. Dugelay, “KinectFaceDB: A Kinect database for face recognition,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 11, pp. 1534–1548, 2014.
[8] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[9] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face description with local binary patterns: Application to face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[10] H. Zhu, J.-B. Weibel, and S. Lu, “Discriminative multi-modal feature fusion for RGBD indoor scene recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2969–2976.
[11] K. O. Rodriguez and G. C. Chavez, “Finger spelling recognition from RGB-D information using kernel descriptor,” in 26th SIBGRAPI Conference on Graphics, Patterns and Images, 2013, pp. 1–7.
[12] V. Vapnik, The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[13] H. Mahmud, M. K. Hasan, and M. Abdullah-Al-Tariq, “Hand gesture recognition using SIFT features on depth image,” 2016.
[14] T. Mantecón, C. R. del Blanco, F. Jaureguizar, and N. García, “Visual face recognition using bag of dense derivative depth patterns,” IEEE Signal Processing Letters, vol. 23, no. 6, pp. 771–775, 2016.
[15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
[16] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[20] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[21] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[22] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[23] Q. Gao, J. Liu, Z. Ju, Y. Li, T. Zhang, and L. Zhang, “Static hand gesture recognition with parallel CNNs for space human-robot interaction,” in International Conference on Intelligent Robotics and Applications. Springer, 2017, pp. 462–473.
[24] S.-Z. Li, B. Yu, W. Wu, S.-Z. Su, and R.-R. Ji, “Feature learning based on SAE-PCA network for human gesture recognition in RGBD images,” Neurocomputing, vol. 151, pp. 565–573, 2015.
[25] Y.-C. Lee, J. Chen, C. W. Tseng, and S.-H. Lai, “Accurate and robust face recognition from RGB-D images with a deep learning approach,” in BMVC, 2016.
[26] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, “Indoor semantic segmentation using depth information,” arXiv preprint arXiv:1301.3572, 2013.
[27] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust RGB-D object recognition,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 681–687.
[28] O. K. Oyedotun, G. G. Demisse, A. E. R. Shabayek, D. Aouada, and B. E. Ottersten, “Facial expression recognition via joint deep learning of RGB-depth map latent representations,” in ICCV Workshops, 2017, pp. 3161–3168.
[29] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[30] J. Schlosser, C. K. Chow, and Z. Kira, “Fusing LIDAR and images for pedestrian detection using convolutional neural networks,” in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 2198–2205.
[31] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[32] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
[33] T. Tieleman and G. Hinton, “RMSProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
[34] M. Ma, X. Xu, J. Wu, and M. Guo, “Design and analyze the structure based on deep belief network for gesture recognition,” in Tenth International Conference on Advanced Computational Intelligence (ICACI), 2018, pp. 40–44.
[35] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[36] K. W. Bowyer, K. Chang, and P. Flynn, “A survey of approaches to three-dimensional face recognition,” in Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 1, 2004, pp. 358–361.
[37] G. Goswami, M. Vatsa, and R. Singh, “Face recognition with RGB-D images using Kinect,” in Face Recognition Across the Imaging Spectrum. Springer, 2016, pp. 281–303.
[38] A. Chowdhury, S. Ghosh, R. Singh, and M. Vatsa, “RGB-D face recognition via learning-based reconstruction,” in IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2016, pp. 1–7.
[39] G. Goswami, R. Singh, M. Vatsa, and A. Majumdar, “Kernel group sparse representation based classifier for multimodal biometrics,” in International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2894–2901.
[40] J. Cui, H. Zhang, H. Han, S. Shan, and X. Chen, “Improving 2D face recognition via discriminative face depth estimation,” in Proc. ICB, 2018, pp. 1–8.
[41] H. Zhang, H. Han, J. Cui, S. Shan, and X. Chen, “RGB-D face recognition via deep complementary and common feature learning,” in 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 8–15.
[42] M. Ouloul, Z. Moutakki, K. Afdel, and A. Amghar, “An efficient face recognition using SIFT descriptor in RGB-D images,” International Journal of Electrical and Computer Engineering (IJECE), vol. 5, no. 6, pp. 1227–1233, 2015.
[43] S. Azzakhnini, L. Ballihi, and D. Aboutajdine, “A learned feature descriptor for efficient gender recognition using an RGB-D sensor,” in International Symposium on Signal, Image, Video and Communications (ISIVC), 2016, pp. 29–34.
[44] N. Ahmad, “Robust multimodal face recognition with pre-processed Kinect RGB-D images,” Journal of Engineering and Applied Sciences, vol. 36, no. 1, 2017.
[45] M. Aman, W. Shah, and B. J. Ihtesham-ul Islam, “Fusion of color and depth information for facial recognition using a multi perspective approach,” pp. 365–374, 2017.
