| Author: | 林久愛 (Lin, Jiou-Ai) |
|---|---|
| Title: | RGB-D Based Object Detection via Anchor Box with Multi-Reduced Region Proposal Network and Multi-Pooling (通過錨箱和多簡化區域提議網絡與多池化之 RGB-D 物件偵測) |
| Advisor: | 邱瀞德 (Chiu, Ching-Te) |
| Committee members: | 張隆紋 (Chang, Long-Wen); 蘇豐文 (Soo, Von-Wun) |
| Degree: | Master |
| Department: | Department of Computer Science, College of Electrical Engineering and Computer Science (電機資訊學院 資訊工程學系) |
| Year of publication: | 2019 |
| Academic year of graduation: | 108 (ROC calendar) |
| Language: | English |
| Pages: | 48 |
| Keywords (Chinese): | RGB-D 物件識別、區域候選網路、ROI 池化層、錨框 |
| Keywords (English): | RGB-D Object Detection, Region Proposal Network (RPN), Region of Interest (ROI) Pooling, anchor box |
In recent years, with the development of the automation industry, object detection technology has received increasing attention. Compared with semantic segmentation, object detection usually requires fewer resources, which makes it more practical. It can be applied to object grasping with robotic arms, road driving by autonomous vehicles, home care, and so on. Object detection technology can greatly reduce labor costs, does not suffer from visual fatigue, and can maintain concentration and a certain level of accuracy at all times. Consequently, as deep learning has flourished, many studies have applied deep learning to object detection and achieved very successful results. However, most of them take RGB images as input and rarely use depth information.
Owing to the rapid development of depth sensors and their wide range of application scenarios, optical depth image sensing has become more common. Besides providing information about appearance and object shape, depth images are unaffected by lighting and allow detection day and night. By exploiting depth information, we expect to improve recognition accuracy and safety. The results of many successful past works also show that color information is equally important. Therefore, in this thesis we use both color and depth images as input for object detection. This study is based on the real-time object detector built on Faster R-CNN proposed by Shih et al.; the goal is to extend it from using RGB images only into a fast and effective RGB-D based object detection architecture. Besides adding depth as an input, we also adjust the types of anchor boxes to improve objects with unsatisfactory detection results. We also discuss the impact of multiple RPNs and multiple ROI pooling layers.
We conducted experiments on the SUN RGB-D and NYU.v2 datasets. The results show that, on the SUN RGB-D dataset, our architecture with depth information added achieves 9.017% higher accuracy than the original architecture that uses RGB information only. On both the SUN RGB-D and NYU.v2 datasets, testing a pair of RGB-D images with GPU acceleration takes only 0.123 seconds while maintaining accuracy. After we adjusted the types of anchor boxes, accuracy on the SUN RGB-D dataset improved by 1.58%.
In recent years, with the development of the automation industry, object detection technology has received more and more attention. Compared with semantic segmentation, object detection requires fewer resources, so it is more practical. It can be applied to object grasping in robotic manipulation [1] [2] [3], road driving by autonomous cars [4] [5] [6], home care [7] [8] [9] [10], and so on. Object detection technology can save labor costs, does not suffer from visual fatigue, and can maintain concentration and accuracy at all times. Therefore, as deep learning has developed, more and more studies have applied deep learning to object detection and achieved very successful results.
However, most of them use RGB images as input and rarely use depth information.
Due to the rapid development of depth sensors and their wide range of application scenarios, optical depth image sensing has become more popular. Depth images not only provide information about the appearance and shape of objects; they are also unaffected by lighting and can support detection day and night. With depth information, we expect to improve detection accuracy and safety. The successful results of many previous works also show that color information is equally important. Therefore, in this thesis we use both RGB and depth images as input for object detection. This study is based on the Faster R-CNN-based real-time object detector proposed by Shih et al. [11], and our goal is to turn it into a fast, high-performance RGB-D based object detection architecture.
In addition to adding depth as an input, we also adjust the types of anchor boxes to improve objects with unsatisfactory detection results. We also discuss the impact of multiple RPNs and multiple ROI pooling layers.
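The abstract does not spell out the adjusted anchor configuration, but the idea of tuning anchor boxes can be illustrated with a minimal sketch. The NumPy snippet below enumerates anchors over a convolutional feature map for a configurable set of scales and aspect ratios; the stride, scales, and ratios shown are the common Faster R-CNN defaults [14] and are assumptions for illustration, not the values used in this thesis.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Enumerate anchor boxes (x1, y1, x2, y2) centred on every cell of a
    feat_h x feat_w feature map. One anchor is produced per (scale, ratio)
    pair, so each cell gets len(scales) * len(ratios) anchors."""
    # Anchor widths/heights: keep the area at (stride * scale)^2 while
    # varying the aspect ratio h / w = ratio.
    sizes = []
    for s in scales:
        side = stride * s
        for r in ratios:
            sizes.append((side / np.sqrt(r), side * np.sqrt(r)))  # (w, h)
    sizes = np.array(sizes)                                   # (A, 2)

    # Centres of every feature-map cell, mapped back to input-image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                              # each (H, W)
    centres = np.stack([cx, cy], axis=-1).reshape(-1, 1, 2)   # (H*W, 1, 2)

    half = sizes.reshape(1, -1, 2) / 2.0                      # (1, A, 2)
    boxes = np.concatenate([centres - half, centres + half], axis=-1)
    return boxes.reshape(-1, 4)                               # (H*W*A, 4)

# Example: a 38 x 50 feature map (stride 16 on a roughly 600 x 800 image).
anchors = generate_anchors(38, 50)
print(anchors.shape)  # (17100, 4) = 38 * 50 * 9 anchors
```

Changing this set, for example adding smaller scales or different ratios, changes which ground-truth boxes the RPN can match well; this is the kind of adjustment the thesis makes for object classes with unsatisfactory detection results.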
We performed experiments on the SUN RGB-D dataset [12] and the NYU.v2 dataset [13]. The results show that, after adding the depth path as input, the mAP of our architecture on the SUN RGB-D dataset is 9.017% higher than that of the original architecture using only RGB images as input. On both the SUN RGB-D and NYU.v2 datasets, testing a pair of RGB-D images with GPU acceleration takes only 0.123 seconds while maintaining accuracy. After adjusting the types of anchor boxes, accuracy on the SUN RGB-D dataset improves by 1.58%.
[1] G. Biegelbauer and M. Vincze, “Efficient 3d object detection by fitting superquadrics to range image data for robot’s object manipulation,” in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 1086–1091.
[2] S. Ekvall, D. Kragic, and P. Jensfelt, “Object detection and mapping for service robot tasks,” Robotica, vol. 25, no. 2, pp. 175–187, 2007.
[3] C. C. Kemp, A. Edsinger, and E. Torres-Jara, “Challenges for robot manipulation in human environments [grand challenges of robotics],” IEEE Robotics & Automation Magazine, vol. 14, no. 1, pp. 20–29, 2007.
[4] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving.” in CVPR Workshops, 2017, pp. 446–454.
[5] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, “Traffic-sign detection and classification in the wild,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2110–2118.
[6] W. Ouyang and X. Wang, “Joint deep learning for pedestrian detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2056–2063.
[7] B. Graf, U. Reiser, M. Hägele, K. Mauz, and P. Klein, “Robotic home assistant care-o-bot® 3-product vision and innovation platform,” in 2009 IEEE Workshop on Advanced Robotics and its Social Impacts. IEEE, 2009, pp. 139–144.
[8] A. G. Hauptmann, J. Gao, R. Yan, Y. Qi, J. Yang, and H. D. Wactlar, “Automated analysis of nursing home observations,” IEEE Pervasive Computing, vol. 3, no. 2, pp. 15–21, 2004.
[9] Y. Benezeth, H. Laurent, B. Emile, and C. Rosenberger, “Towards a sensor for detecting human presence and characterizing activity,” Energy and Buildings, vol. 43, no. 2-3, pp. 305–314, 2011.
[10] L.-H. Juang and M.-N. Wu, “Fall down detection under smart home system,” Journal of medical systems, vol. 39, no. 10, p. 107, 2015.
[11] K. Shih, C. Chiu, J. Lin, and Y. Bu, “Real-time object detection with reduced region proposal network via multi-feature concatenation,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
[12] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[13] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in ECCV, 2012.
[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[16] Q. Luo, H. Ma, Y. Wang, L. Tang, and R. Xiong, “3d-ssd: Learning hierarchical features from rgb-d images for amodal 3d object detection,” arXiv preprint arXiv:1711.00238, 2017.
[17] M. M. Rahman, Y. Tan, J. Xue, L. Shao, and K. Lu, “3d object detection: Learning 3d bounding boxes from scaled down 2d bounding boxes in rgb-d images,” Information Sciences, vol. 476, pp. 147–158, 2019.
[18] W. Zhiqiang and L. Jun, “A review of object detection based on convolutional neural network,” in Control Conference (CCC), 2017 36th Chinese. IEEE, 2017, pp. 11 104–11 109.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[20] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[21] Y. Yoon, G. N. DeSouza, and A. C. Kak, “Real-time tracking and pose estimation for industrial objects using geometric features,” in 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), vol. 3. IEEE, 2003, pp. 3473–3478.
[22] L. Armesto and J. Tornero, “Automation of industrial vehicles: A vision-based line tracking application,” in 2009 IEEE Conference on Emerging Technologies & Factory Automation. IEEE, 2009, pp. 1–7.
[23] C. Rennie, R. Shome, K. E. Bekris, and A. F. De Souza, “A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 1179–1185, 2016.
[24] T. Tsujimura and T. Yabuta, “Object detection by tactile sensing method employing force/torque information,” IEEE Transactions on robotics and Automation, vol. 5, no. 4, pp. 444–450, 1989.
[25] C.-C. Wang and C. Thorpe, “Simultaneous localization and mapping with detection and tracking of moving objects,” in Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), vol. 3. IEEE, 2002, pp. 2918–2924.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[27] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[29] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[30] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
[31] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European conference on computer vision. Springer, 2014, pp. 391–405.
[32] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 328–335.
[33] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
[34] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[35] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
[36] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of computer and system sciences, vol. 55, no. 1, pp. 119–139, 1997.
[37] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[38] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” arXiv preprint arXiv: 1702.08502, 2017.
[39] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, vol. 1, no. 2, 2017, pp. 2261–2269.
[40] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters—improve semantic segmentation by global convolutional network,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 1743–1751.
[41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[42] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[43] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, 1993, pp. 164–171.
[44] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[45] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
[46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[48] J. Lahoud and B. Ghanem, “2d-driven 3d object detection in rgb-d images,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4622–4630.
[49] D. Xu, D. Anguelov, and A. Jain, “Pointfusion: Deep sensor fusion for 3d bounding box estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 244–253.
[50] Z. Ren and E. B. Sudderth, “Three-dimensional object detection and layout prediction using clouds of oriented gradients,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1525–1533.