Graduate Student: 蔡承祐 Tsai, Cheng-Yu
Thesis Title: 利用姿態擴增改良小樣本學習情境下基於結構之深度學習相機定位方法 (Few-Shot Deep Structure-based Camera Localization with Pose Augmentation)
Advisor: 賴尚宏 Lai, Shang-Hong
Committee Members: 許秋婷 Hsu, Chiu-Ting; 陳煥宗 Chen, Hwann-Tzong; 陳奕廷 Chen, Yi-Ting
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Publication Year: 2023
Academic Year: 111
Language: English
Pages: 43
Chinese Keywords: 深度學習、電腦視覺、相機定位、資料擴增、小樣本學習
English Keywords: Deep Learning, Computer Vision, Camera Localization, Data Augmentation, Few-Shot Learning
Camera localization is the problem of estimating the camera pose corresponding to a query image. Unlike traditional localization methods that rely on handcrafted descriptors and feature matching, deep learning-based methods use deep models for better generalization. Deep learning-based camera localization methods fall into two categories: image-based and structure-based. Previous work has shown that data augmentation can improve the performance of image-based methods, but data augmentation has not been studied for structure-based methods. In this thesis, we show that additional augmented image-pose pairs can further improve the performance of structure-based methods, especially in the few-shot setting.
We investigate different inpainting and rendering strategies and compare the benefits they bring to the augmented data. Furthermore, we propose a confidence-based sampling scheme that drastically reduces inference time and increases FPS while maintaining high accuracy (recall) and low median translation and rotation errors.
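To make the idea of augmented image-pose pairs concrete, the sketch below shows one simple way to generate new camera poses around a training pose. It is an illustrative sketch only, not the thesis's actual augmentation pipeline: `perturb_pose` and its `max_angle_deg`/`max_shift` parameters are hypothetical names, and the augmented image for each new pose would still have to be produced by rendering or inpainting.

```python
import numpy as np

def perturb_pose(R, t, max_angle_deg=10.0, max_shift=0.25, rng=None):
    """Return an augmented camera pose: apply a small random axis-angle
    rotation and a small random translation offset to (R, t)."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw a random unit rotation axis and a bounded rotation angle.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    # Rodrigues' formula: rotation matrix from the axis-angle vector.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    dR = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    # Bounded random translation offset.
    dt = rng.uniform(-max_shift, max_shift, size=3)
    return dR @ R, t + dt
```

Each call yields a nearby pose; repeating it over the few available training poses densifies the pose distribution, which is what makes the augmentation useful in the few-shot setting.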
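The confidence-based sampling idea above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: `confidence_sample` and the `(N, 5)` correspondence layout (`[u, v, X, Y, Z]`) are assumed names, but the principle is the same, since keeping only the top-k most confident 2D-3D correspondences shrinks the candidate set handed to the PnP/RANSAC solver, which is where the inference-time saving comes from.

```python
import numpy as np

def confidence_sample(coords, confidences, k=1024):
    """Keep the k most confident 2D-3D correspondences.
    coords: (N, 5) array of [u, v, X, Y, Z] rows; confidences: (N,)."""
    k = min(k, len(confidences))
    # argpartition finds the k largest confidences in O(N),
    # without fully sorting all N candidates.
    top = np.argpartition(-confidences, k - 1)[:k]
    return coords[top]
```

The filtered correspondences would then be passed to a pose solver (e.g. PnP inside a RANSAC loop); fewer candidates means fewer hypotheses to score, hence higher FPS at similar accuracy.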