
Author: 邱筑琪 (Chiu, Chu-Chi)
Thesis Title: 基於視覺轉換器與注意力監督的視覺里程計 (ViTVO: Vision Transformer based Visual Odometry with Attention Supervision)
Advisor: 李濬屹 (Lee, Chun-Yi)
Committee Members: 陳煥宗 (Chen, Hwann-Tzong), 邱維辰 (Chiu, Wei-Chen)
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 26
Chinese Keywords: 視覺里程計 (Visual Odometry), 注意力監督 (Attention Supervision), 視覺轉換器 (Vision Transformer)
Foreign Keywords: Visual Odometry, Attention Mechanism, Vision Transformer
This thesis proposes ViTVO, a visual odometry framework based on a Vision Transformer with attention supervision. ViTVO takes an optical flow image and a depth map as input and outputs the rotation and translation of the camera. Its architecture consists of three main parts: (1) an input feature extraction module that extracts useful information from the input images, (2) a Transformer encoder module that filters out dynamic information in the images, and (3) an attention supervision module that strengthens the Transformer's ability to distinguish dynamic from static information. Previous Transformer-based visual odometry architectures mostly take multiple consecutive frames as input and use the attention mechanism to filter out unimportant frames. This thesis analyzes the strengths and weaknesses of Vision Transformers for visual odometry with respect to the influence of dynamic objects and, to address the weaknesses, proposes attention supervision that filters out dynamic objects in the image, thereby improving the accuracy of the predicted camera rotation and translation. Given the same input, we compare ViTVO with other neural network architectures; the experimental results show that ViTVO outperforms other traditional methods and demonstrate that the proposed attention supervision improves the accuracy of camera motion prediction.
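To make the three-part architecture described above concrete, the following is a minimal PyTorch-style sketch, assuming an illustrative patch-embedding front end, a standard Transformer encoder, and a small MLP head. The class name ViTVOSketch, the module sizes, and the mean-pooled readout are assumptions for illustration, not the thesis' implementation.

    # A minimal sketch (assumptions, not the thesis' code) of the pipeline described
    # in the abstract: (1) an input feature extractor over a flow-depth pair,
    # (2) a Vision Transformer encoder, and (3) an MLP decoder that regresses the
    # camera rotation and translation.
    import torch
    import torch.nn as nn

    class ViTVOSketch(nn.Module):
        def __init__(self, img_size=224, patch=16, dim=256, depth=6, heads=8):
            super().__init__()
            # (1) input feature extractor: patch embedding of optical flow (2 ch) + depth (1 ch)
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            num_tokens = (img_size // patch) ** 2
            self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
            # (2) Transformer encoder, intended to suppress tokens dominated by dynamic objects
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            # (3) MLP decoder regressing a 6-DoF motion: 3 rotation + 3 translation parameters
            self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 128),
                                      nn.GELU(), nn.Linear(128, 6))

        def forward(self, flow, depth_map):
            x = torch.cat([flow, depth_map], dim=1)                # (B, 3, H, W)
            tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
            tokens = self.encoder(tokens)                          # (B, N, dim)
            return self.head(tokens.mean(dim=1))                   # (B, 6) camera motion

    # Example usage on random tensors
    model = ViTVOSketch()
    pose = model(torch.randn(1, 2, 224, 224), torch.randn(1, 1, 224, 224))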


In this thesis, we propose ViTVO, a Vision Transformer based visual odometry framework. The architecture of ViTVO consists of three primary parts: (1) an input feature extractor that extracts a feature embedding from the optical flow map and the depth map, (2) a Transformer encoder that filters out the noise introduced by dynamic features, and (3) an attention supervision module that enhances the ability to detect dynamic objects. Unlike previous works, ViTVO extracts features from a single flow-depth pair and analyzes the impact of moving objects on the visual odometry input. Furthermore, we employ an attention loss to discard the noise caused by moving objects. The experiments show that ViTVO, trained on a synthetic dataset and without any fine-tuning, generalizes to the Sintel dataset.
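As a rough illustration of how such an attention loss could be combined with an odometry loss, the sketch below penalizes attention mass assigned to dynamic-object patches. The exact loss forms, the source of the dynamic mask, and the weighting factor lam are assumptions for illustration rather than the definitions used in the thesis.

    # A hedged sketch of the attention supervision idea: the total objective combines
    # a pose (odometry) loss with an attention loss that discourages attending to
    # patches covered by moving objects. These formulas are illustrative assumptions,
    # not the thesis' exact definitions.
    import torch
    import torch.nn.functional as F

    def odometry_loss(pred_pose, gt_pose):
        # L1 distance on the 6-DoF vector; rotation and translation may be weighted separately
        return F.l1_loss(pred_pose, gt_pose)

    def attention_loss(attn, dynamic_mask):
        # attn:         (B, N) per-patch attention, assumed normalized to sum to 1 per sample
        # dynamic_mask: (B, N) 1 for patches dominated by moving objects, 0 otherwise
        return (attn * dynamic_mask).sum(dim=1).mean()

    def total_loss(pred_pose, gt_pose, attn, dynamic_mask, lam=0.1):
        # lam balances pose accuracy against suppressing attention on dynamic regions
        return odometry_loss(pred_pose, gt_pose) + lam * attention_loss(attn, dynamic_mask)

    # Example usage on random tensors
    B, N = 2, 196
    loss = total_loss(torch.randn(B, 6), torch.randn(B, 6),
                      torch.softmax(torch.randn(B, N), dim=1),
                      torch.randint(0, 2, (B, N)).float())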

Table of Contents:
Abstract (Chinese)
Abstract
Acknowledgements
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
  2.1 Traditional VO methods
  2.2 Learning based VO methods
  2.3 Transformer and its application in VO
Chapter 3 Methodology
  3.1 Overview
  3.2 The Components in ViTVO
    (1) Vision Transformer Encoder
    (2) MLP decoder
  3.3 Loss Function
    Odometry loss
    Attention loss
Chapter 4 Experimental Results
  4.1 Experimental Setup
    Baselines
    Datasets
    Implementation details
    Evaluation Metrics
  4.2 Quantitative Results
    Comparison of ViTVO and the Baselines
  4.3 Qualitative Results
    Examination of the Ability for Dealing with Noises through Saliency Map
    Evaluate on the Sintel Validation Set
  4.4 Ablation Analysis
    Ablation Study for the Effectiveness of the Attention Loss
    Input Noise Analysis
Chapter 5 Conclusion
Reference

