
Author: 陳澔威 (Chen, Hao-Wei)
Title: 像素預測為基礎搭配不確定性的視覺里程計 (Pixel-Wise Prediction based Visual Odometry via Uncertainty Estimation)
Advisor: 李濬屹 (Lee, Chun-Yi)
Oral defense committee: 周志遠 (Chou, Chi-Yuan); 蔡一民 (Tsai, Yi-Min)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2022
Academic year of graduation: 110 (2021-2022)
Language: Chinese
Number of pages: 41
Keywords (Chinese): 視覺里程計、不確定性預測、像素導向預測
Keywords (English): Visual Odometry, Uncertainty Estimation, Pixel-Wise Predictions
Chinese abstract (translated):

This thesis proposes pixel-wise prediction based visual odometry (PWVO), a design that produces dense per-pixel predictions of rotation and translation over the whole image from the observed inputs. PWVO employs an uncertainty estimation mechanism to identify regions of the observations that contain noise, and uses a selection mechanism to combine the pixel-wise predictions into refined rotation and translation estimates. To train PWVO in a comprehensive manner, we further design a workflow for generating synthetic training data. The experimental results show that PWVO delivers satisfactory results. In addition, we analyze the mechanisms designed in PWVO and validate their effectiveness, and we show that the uncertainty maps predicted by PWVO indeed capture the noise present in the observed inputs.


English abstract:

This paper introduces pixel-wise prediction based visual odometry (PWVO), which is a dense prediction task that evaluates the values of translation and rotation for every pixel in its input observations. PWVO employs uncertainty estimation to identify the noisy regions in the input observations, and adopts a selection mechanism to integrate pixel-wise predictions based on the estimated uncertainty maps to derive the final translation and rotation. In order to train PWVO in a comprehensive fashion, we further develop a data generation workflow for generating synthetic training data. The experimental results show that PWVO is able to deliver favorable results. In addition, our analyses validate the effectiveness of the designs adopted in PWVO, and demonstrate that the uncertainty maps estimated by PWVO are capable of capturing the noise in its input observations.
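As a rough illustration of how uncertainty maps can steer the integration of pixel-wise predictions, the sketch below fuses dense per-pixel motion predictions with a simple heuristic: it keeps only the most confident pixels for each channel and averages them with inverse-uncertainty weights. The function name fuse_pixelwise_predictions and the weighting scheme are hypothetical and are not taken from the thesis; PWVO's actual selection module may work quite differently.

import numpy as np

def fuse_pixelwise_predictions(pred, uncertainty, top_frac=0.1):
    """Fuse dense per-pixel motion predictions into one camera-motion estimate.

    pred:        (H, W, 6) array of per-pixel translation (3) and rotation (3) predictions.
    uncertainty: (H, W, 6) array of per-pixel estimated uncertainty (e.g. predicted variance).
    top_frac:    fraction of the most confident pixels to keep per channel.

    Returns a (6,) motion vector. This is an illustrative stand-in for the selection
    mechanism described in the abstract, not the exact module used in the thesis.
    """
    h, w, c = pred.shape
    flat_pred = pred.reshape(-1, c)        # (H*W, 6)
    flat_unc = uncertainty.reshape(-1, c)  # (H*W, 6)

    k = max(1, int(top_frac * h * w))
    fused = np.empty(c)
    for ch in range(c):
        # Keep the k pixels with the lowest estimated uncertainty for this channel,
        # so regions flagged as noisy (e.g. moving objects) contribute little.
        idx = np.argsort(flat_unc[:, ch])[:k]
        # Inverse-uncertainty weighting among the selected pixels.
        wgt = 1.0 / (flat_unc[idx, ch] + 1e-8)
        fused[ch] = np.sum(wgt * flat_pred[idx, ch]) / np.sum(wgt)
    return fused

# Example usage with random stand-in data:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.normal(size=(64, 64, 6))
    unc = rng.uniform(0.01, 1.0, size=(64, 64, 6))
    print(fuse_pixelwise_predictions(pred, unc))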

Table of contents:

摘要 (Chinese abstract) i
Abstract ii
1 Introduction 1
2 Related Work 5
3 Methodology 7
  3.1 Problem Formulation 7
  3.2 Overview of the PWVO Framework 8
  3.3 Encoding Stage 8
  3.4 Pixel-Wise Prediction Stage 9
    3.4.1 Distribution Learning 9
    3.4.2 Selection Module 10
  3.5 Refinement and Total Loss Adopted by PWVO 11
4 Data Generation 13
5 Experimental Results 15
  5.1 Experimental Setup 15
  5.2 Quantitative Results 16
    5.2.1 Comparison of PWVO and the Baselines 16
    5.2.2 Ablation Study for the Effectiveness of the Components in PWVO 17
    5.2.3 Importance of the Additional Reconstruction Loss L̂_F 17
  5.3 Qualitative Results 18
    5.3.1 Examination of the Ability for Dealing with Noises through Saliency Map 18
    5.3.2 Examination of the Uncertainty Map Estimated by PWVO 19
    5.3.3 Evaluation on the Sintel Validation Set 19
6 Limitations and Future Directions 21
7 Conclusion 23
8 Appendix 25
  8.1 List of Notations 25
  8.2 Background Material 25
    8.2.1 Optical Flow Estimation 25
  8.3 Implementation Details 26
    8.3.1 Loss Function 26
    8.3.2 Hyperparameters 28
  8.4 The Configurations of the Data Generation Workflow 28
  8.5 Additional Experimental Results 30
    8.5.1 The Impact of Object Noises on the Performance of VO 30
    8.5.2 The Influence of the Patch Size k on the Performance of PWVO 30
    8.5.3 Additional Qualitative Results 31
References 37
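The outline above lists a distribution learning step (Section 3.4.1) and a loss function for the uncertainty-aware predictions (Section 8.3.1). As general background only, per-pixel uncertainty estimates of this kind are commonly trained with a heteroscedastic Gaussian negative log-likelihood, as in Kendall and Gal's work on aleatoric uncertainty; the formulation below shows that standard loss and is an assumption, not necessarily the exact objective used by PWVO:

\[
\mathcal{L}_{\mathrm{uncert}}
  = \frac{1}{|\Omega|} \sum_{p \in \Omega}
    \left( \frac{\lVert y_p - \hat{y}_p \rVert^{2}}{2 \hat{\sigma}_p^{2}}
         + \frac{1}{2} \log \hat{\sigma}_p^{2} \right)
\]

where y_p is the ground-truth per-pixel motion target, ŷ_p the prediction, σ̂_p the predicted uncertainty at pixel p, and Ω the set of pixels. Pixels whose predictions are unreliable can lower the loss by reporting a large σ̂_p, which is what makes the learned uncertainty maps useful for identifying noisy regions.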

