
Author: 陳澔威 (Chen, Hao-Wei)
Title: 像素預測為基礎搭配不確定性的視覺里程計 (Pixel-Wise Prediction based Visual Odometry via Uncertainty Estimation)
Advisor: 李濬屹 (Lee, Chun-Yi)
Oral defense committee: 周志遠 (Chou, Chi-Yuan); 蔡一民 (Tsai, Yi-Min)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2022
Academic year of graduation: 110 (2021-2022)
Language: Chinese
Number of pages: 41
Keywords (Chinese): 視覺里程計、不確定性預測、像素導向預測
Keywords (English): Visual Odometry, Uncertainty Estimation, Pixel-Wise Predictions
Chinese abstract (translated):

This thesis proposes pixel-wise prediction based visual odometry (PWVO), a design that produces dense per-pixel predictions of rotation and translation over the whole image from the observed inputs. PWVO employs an uncertainty estimation mechanism to identify regions of the observations that contain noise, and uses a selection mechanism to combine the pixel-wise predictions into refined rotation and translation estimates. To train PWVO in a comprehensive manner, we further design a workflow for generating synthetic training data. The experimental results show that PWVO delivers satisfactory results. In addition, we analyze the mechanisms designed in PWVO and validate their effectiveness, and we show that the uncertainty maps predicted by PWVO indeed capture the noise present in the observed inputs.


English abstract:

This paper introduces pixel-wise prediction based visual odometry (PWVO), which is a dense prediction task that evaluates the values of translation and rotation for every pixel in its input observations. PWVO employs uncertainty estimation to identify the noisy regions in the input observations, and adopts a selection mechanism to integrate pixel-wise predictions based on the estimated uncertainty maps to derive the final translation and rotation. In order to train PWVO in a comprehensive fashion, we further develop a data generation workflow for generating synthetic training data. The experimental results show that PWVO is able to deliver favorable results. In addition, our analyses validate the effectiveness of the designs adopted in PWVO, and demonstrate that the uncertainty maps estimated by PWVO are capable of capturing the noise in its input observations.
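As a rough illustration of how uncertainty maps can steer the integration of pixel-wise predictions, the sketch below fuses dense per-pixel motion predictions with a simple heuristic: it keeps only the most confident pixels for each channel and averages them with inverse-uncertainty weights. The function name fuse_pixelwise_predictions and the weighting scheme are hypothetical and are not taken from the thesis; PWVO's actual selection module may work quite differently.

import numpy as np

def fuse_pixelwise_predictions(pred, uncertainty, top_frac=0.1):
    """Fuse dense per-pixel motion predictions into one camera-motion estimate.

    pred:        (H, W, 6) array of per-pixel translation (3) and rotation (3) predictions.
    uncertainty: (H, W, 6) array of per-pixel estimated uncertainty (e.g. predicted variance).
    top_frac:    fraction of the most confident pixels to keep per channel.

    Returns a (6,) motion vector. This is an illustrative stand-in for the selection
    mechanism described in the abstract, not the exact module used in the thesis.
    """
    h, w, c = pred.shape
    flat_pred = pred.reshape(-1, c)        # (H*W, 6)
    flat_unc = uncertainty.reshape(-1, c)  # (H*W, 6)

    k = max(1, int(top_frac * h * w))
    fused = np.empty(c)
    for ch in range(c):
        # Keep the k pixels with the lowest estimated uncertainty for this channel,
        # so regions flagged as noisy (e.g. moving objects) contribute little.
        idx = np.argsort(flat_unc[:, ch])[:k]
        # Inverse-uncertainty weighting among the selected pixels.
        wgt = 1.0 / (flat_unc[idx, ch] + 1e-8)
        fused[ch] = np.sum(wgt * flat_pred[idx, ch]) / np.sum(wgt)
    return fused

# Example usage with random stand-in data:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.normal(size=(64, 64, 6))
    unc = rng.uniform(0.01, 1.0, size=(64, 64, 6))
    print(fuse_pixelwise_predictions(pred, unc))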

Table of contents:

摘要 (Chinese abstract) i
Abstract ii
1 Introduction 1
2 Related Work 5
3 Methodology 7
  3.1 Problem Formulation 7
  3.2 Overview of the PWVO Framework 8
  3.3 Encoding Stage 8
  3.4 Pixel-Wise Prediction Stage 9
    3.4.1 Distribution Learning 9
    3.4.2 Selection Module 10
  3.5 Refinement and Total Loss Adopted by PWVO 11
4 Data Generation 13
5 Experimental Results 15
  5.1 Experimental Setup 15
  5.2 Quantitative Results 16
    5.2.1 Comparison of PWVO and the Baselines 16
    5.2.2 Ablation Study for the Effectiveness of the Components in PWVO 17
    5.2.3 Importance of the Additional Reconstruction Loss L̂_F 17
  5.3 Qualitative Results 18
    5.3.1 Examination of the Ability for Dealing with Noises through Saliency Map 18
    5.3.2 Examination of the Uncertainty Map Estimated by PWVO 19
    5.3.3 Evaluation on the Sintel Validation Set 19
6 Limitations and Future Directions 21
7 Conclusion 23
8 Appendix 25
  8.1 List of Notations 25
  8.2 Background Material 25
    8.2.1 Optical Flow Estimation 25
  8.3 Implementation Details 26
    8.3.1 Loss Function 26
    8.3.2 Hyperparameters 28
  8.4 The Configurations of the Data Generation Workflow 28
  8.5 Additional Experimental Results 30
    8.5.1 The Impact of Object Noises on the Performance of VO 30
    8.5.2 The Influence of the Patch Size k on the Performance of PWVO 30
    8.5.3 Additional Qualitative Results 31
References 37
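The outline above lists a distribution learning step (Section 3.4.1) and a loss function for the uncertainty-aware predictions (Section 8.3.1). As general background only, per-pixel uncertainty estimates of this kind are commonly trained with a heteroscedastic Gaussian negative log-likelihood, as in Kendall and Gal's work on aleatoric uncertainty; the formulation below shows that standard loss and is an assumption, not necessarily the exact objective used by PWVO:

\[
\mathcal{L}_{\mathrm{uncert}}
  = \frac{1}{|\Omega|} \sum_{p \in \Omega}
    \left( \frac{\lVert y_p - \hat{y}_p \rVert^{2}}{2 \hat{\sigma}_p^{2}}
         + \frac{1}{2} \log \hat{\sigma}_p^{2} \right)
\]

where y_p is the ground-truth per-pixel motion target, ŷ_p the prediction, σ̂_p the predicted uncertainty at pixel p, and Ω the set of pixels. Pixels whose predictions are unreliable can lower the loss by reporting a large σ̂_p, which is what makes the learned uncertainty maps useful for identifying noisy regions.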

