Graduate student: | 曾守曜 Tseng, Shou-Yao |
---|---|
Thesis title: | 用於物體交互感知的非局部注意區域 Non-local RoI for Cross-Object Perception |
Advisor: | 陳煥宗 Chen, Hwann-Tzong |
Oral defense committee: | 劉庭祿 Liu, Tyng-Luh; 許秋婷 Hsu, Chiou-Ting |
Degree: | Master |
Department: | |
Year of publication: | 2018 |
Academic year of graduation: | 107 |
Language: | English |
Number of pages: | 24 |
Keywords (Chinese): | 電腦視覺、深度學習、物體偵測、物體分割、區域性卷積神經網路、非局部性、物體關係 |
Keywords (English): | Computer Vision, Deep Learning, Object Detection, Instance Segmentation, R-CNN, Non-local, Object Relation |
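This thesis proposes a generic and flexible Non-local RoI Block, a module that can be seamlessly integrated into network models derived from region-based convolutional neural networks (R-CNN) and applied to the tasks those networks address. Conventional R-CNN models make the prediction for each region of interest independently. However, the relationships between objects in a scene provide additional useful information for object detection and instance segmentation. The Non-local RoI Block is a simple, computationally inexpensive, and effective module that exploits these relationships, so that each individual region of interest can attend to the features of other, non-local regions, which helps many perception tasks in computer vision. Experiments show that, on the COCO benchmark dataset, the module enables Faster/Mask R-CNN to achieve better results on object detection and instance segmentation, respectively.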
This thesis introduces the Non-local Region of Interest (RoI) Block, a generic and flexible module that can be seamlessly integrated into different generalized R-CNN models for various kinds of tasks. R-CNN treats RoIs independently and makes each prediction from the features of the individual object inside the bounding box proposal. However, the relations between objects may provide useful information for detection and segmentation. The proposed Non-local RoI Block gives each RoI access to the information from all RoIs, resulting in a simple, low-cost, yet effective module for many perception tasks in computer vision. Our experimental results show that the Non-local RoI Block improves the performance of Faster/Mask R-CNN for object detection and instance segmentation on the COCO benchmarks.
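The idea of giving each RoI access to the information from all RoIs can be illustrated with a minimal sketch. The record does not give the block's actual formulation, so everything below is an assumption made for illustration: the function name `non_local_roi_block` is hypothetical, the similarity is a plain scaled dot product followed by a softmax, and the learned embeddings a trained module would normally apply are omitted. It shows the general pattern of residual attention across RoI feature vectors, not the author's implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_roi_block(roi_feats):
    """Residual attention across RoIs (hypothetical simplification).

    roi_feats: list of N feature vectors (lists of D floats), one per RoI.
    Returns N vectors of length D: each RoI's own feature plus a
    similarity-weighted average of the features of all RoIs.
    """
    dim = len(roi_feats[0])
    scale = math.sqrt(dim)  # assumed scaling, not taken from the thesis
    out = []
    for query in roi_feats:
        # Scaled dot-product similarity between this RoI and every RoI.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in roi_feats
        ]
        weights = softmax(scores)
        # Aggregate the features of all RoIs, weighted by similarity.
        context = [
            sum(w * feat[d] for w, feat in zip(weights, roi_feats))
            for d in range(dim)
        ]
        # Residual connection keeps the original per-RoI feature.
        out.append([x + c for x, c in zip(query, context)])
    return out


# Three toy RoI features: the first two are similar, the third differs.
rois = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
updated = non_local_roi_block(rois)
```

Because the output keeps the same shape as the input (one vector per RoI), a block of this kind could in principle sit between RoI feature extraction and the prediction heads without changing either, which is consistent with the abstract's description of a module that plugs into existing R-CNN variants.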