Graduate Student: 林子盈 Lin, Tzu-Ying
Thesis Title: 基於具鑑別力之多模組深度學習神經網路之RGB-D手勢及人臉辨識 / BiodesNet: Discriminative Multi-Modal Deep Learning for RGB-D Gesture and Face Recognition
Advisor: 邱瀞德 Chiu, Ching-Te
Committee Members: 張隆紋 Chang, Long-Wen; 范倫達 Van, Lan-Da; 楊家輝 Yang, Jar-Ferr
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 60
Keywords: gesture recognition, face recognition, deep convolutional neural network, multi-modality, RGB-D recognition, global descriptor
In recent years, the rapid development of depth sensors and their wide range of application scenarios have made optical depth sensing an important technology for face and gesture recognition. Besides providing information about appearance and object shape, depth images are insensitive to illumination and can therefore be captured day or night. In AR/VR applications, for example, where lighting changes are complex, depth information is especially valuable for gesture recognition. Exploiting depth information is expected to improve recognition accuracy and, through anti-spoofing, to enhance security.

Although many studies have been proposed, most methods rely on traditional hand-crafted features. The problem with this approach is that it does not extend easily to different datasets, and the way features are extracted from color images cannot be applied directly to depth images. More recently, with the development of deep convolutional neural networks, some methods feed the depth image into the network as a fourth image channel, while others learn features from the two modalities separately and simply concatenate them at the end of the network. However, these methods do not sufficiently exploit the individual characteristics of color and depth images to strengthen the complementarity and correlation between them.

In this thesis, we propose a discriminative multi-modal deep learning framework for recognizing and identifying faces and gestures from the color and depth images captured by a 3D sensor. We first train the color and depth streams separately to obtain features for each modality; then, where the two streams are joined in the later stage of the network, we add our discriminative and association loss functions, so that discriminative features are found for each modality while the complementary relationship between the two modalities is exploited to improve recognition.

We conducted experiments on RGB-D datasets for face and gesture recognition. The results show that our multi-modal learning method achieves 97.8% gesture recognition accuracy on the ASL Finger Spelling dataset and 99.7% face recognition accuracy on the IIITD dataset. In addition, we map a color face image and its corresponding depth image into a feature vector through the network and convert it into a global descriptor. With 256-bit global descriptors, the equal error rate reaches 5.663%; with GPU acceleration, extracting the global descriptor of one image takes only 0.83 seconds, and matching two 256-bit global descriptors takes only 22.7 microseconds.
In recent years, due to the rapid development of depth sensors and their wide range of application scenarios, depth imaging has become an important technology for face and gesture recognition. Depth data provides additional information about appearance and object shape, and it is invariant to lighting and color variations. In AR/VR applications, for example, depth information is particularly important for gesture recognition because of complex lighting changes. Hence, with depth information, both recognition performance and security can be expected to improve.
Although many studies have been conducted, most of them use hand-crafted features, which do not readily extend to different datasets and require extra effort to extract features from depth images. More recent methods based on deep convolutional neural networks either treat RGB-D input as undifferentiated four-channel data or learn features from color and depth separately and merge them only at the end of the network, and therefore cannot adequately exploit the complementary characteristics of the two modalities. The sketch after this paragraph illustrates these two baseline strategies.
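To make the two baseline fusion strategies concrete, the following minimal TensorFlow/Keras sketch contrasts early fusion, where depth is stacked as a fourth input channel, with late fusion, where separate RGB and depth streams are concatenated just before the classifier. The input size, filter counts, and the 24-class output are illustrative assumptions only and do not correspond to the networks used in this thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 24  # illustrative only; set to the actual number of classes

# Baseline 1 -- early fusion: depth is stacked as a fourth input channel.
rgbd_in = layers.Input(shape=(64, 64, 4))
x = layers.Conv2D(32, 3, activation="relu")(rgbd_in)
x = layers.GlobalAveragePooling2D()(x)
early_fusion = Model(rgbd_in, layers.Dense(NUM_CLASSES)(x))

# Baseline 2 -- late fusion: independent RGB and depth streams whose
# features are concatenated only at the very end of the network.
def stream(channels):
    inp = layers.Input(shape=(64, 64, channels))
    f = layers.Conv2D(32, 3, activation="relu")(inp)
    f = layers.GlobalAveragePooling2D()(f)
    return inp, f

rgb_in, rgb_feat = stream(3)
depth_in, depth_feat = stream(1)
fused = layers.Concatenate()([rgb_feat, depth_feat])
late_fusion = Model([rgb_in, depth_in], layers.Dense(NUM_CLASSES)(fused))
```

Neither baseline constrains how the two modalities relate to each other, which is the gap the proposed framework addresses.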
In this thesis, we propose a CNN-based multi-modal learning framework for RGB-D gesture and face recognition. After training the color and depth networks separately, we fuse their features and add our discriminative and association loss functions, which strengthen both the discriminability of each modality and the complementarity between the two modalities, thereby improving recognition performance.
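The abstract does not spell out the loss definitions, so the sketch below shows only one plausible combination for illustration: a cross-entropy loss on the fused features, a per-modality classification term standing in for the discriminative loss, and an embedding-distance term standing in for the association loss. The function name, the weights, and the exact forms of the terms are assumptions, not the thesis's definitions.

```python
import tensorflow as tf

def multimodal_loss(labels, rgb_emb, depth_emb,
                    rgb_logits, depth_logits, fused_logits,
                    disc_weight=0.1, assoc_weight=0.1):
    """Combined loss over fused, per-modality, and cross-modality terms (illustrative)."""
    ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # Main classification loss on the fused (concatenated) RGB-D features.
    fused_loss = ce(labels, fused_logits)
    # Stand-in for the discriminative term: each modality must also separate
    # the classes on its own, so neither stream degenerates into a passive one.
    disc_loss = ce(labels, rgb_logits) + ce(labels, depth_logits)
    # Stand-in for the association term: embeddings of the same sample from the
    # two modalities are pulled together to exploit their complementarity.
    assoc_loss = tf.reduce_mean(tf.reduce_sum(tf.square(rgb_emb - depth_emb), axis=1))
    return fused_loss + disc_weight * disc_loss + assoc_weight * assoc_loss
```

Weighting the auxiliary terms lightly keeps the fused classifier as the primary objective while still shaping the individual embeddings.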
We performed experiments on RGB-D gesture and face datasets. The results show that our multi-modal learning method achieves 97.8% and 99.7% classification accuracy on the ASL Finger Spelling and IIITD face RGB-D datasets, respectively. In addition, we map a color face image and its corresponding depth image into a feature vector and convert it into a global descriptor. The equal error rate is 5.663% with 256-bit global descriptors. Extracting a global descriptor from an image takes only 0.83 seconds with GPU acceleration, and matching two 256-bit global descriptors takes just 22.7 microseconds without GPU acceleration.
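As a minimal sketch of why matching packed 256-bit descriptors is so fast, the NumPy example below binarizes a 256-dimensional embedding and compares two descriptors with XOR plus a bit count. Thresholding each dimension at zero is an assumption made for illustration; the abstract does not specify the binarization rule used in the thesis.

```python
import numpy as np

def to_global_descriptor(embedding: np.ndarray) -> np.ndarray:
    """Binarize a 256-dim embedding into a 256-bit (32-byte) global descriptor."""
    bits = (embedding > 0).astype(np.uint8)  # 256 values in {0, 1}; zero threshold is an assumption
    return np.packbits(bits)                 # pack into 32 bytes = 256 bits

def hamming_distance(desc_a: np.ndarray, desc_b: np.ndarray) -> int:
    """Count differing bits between two packed 256-bit descriptors."""
    return int(np.unpackbits(np.bitwise_xor(desc_a, desc_b)).sum())

# Usage with random embeddings standing in for the network outputs.
emb_a, emb_b = np.random.randn(256), np.random.randn(256)
dist = hamming_distance(to_global_descriptor(emb_a), to_global_descriptor(emb_b))
```

Because the comparison reduces to a handful of bitwise operations on 32 bytes, it runs in microseconds on a CPU, which is consistent with the reported matching time.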