| Field | Value |
|---|---|
| Graduate Student | 蕭子謦 Hsiao, Tsu-Ching |
| Thesis Title | 基於特殊歐幾里得三維群之分數擴散模型解決六維物體姿態估計中的模糊性 (Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)) |
| Advisor | 李濬屹 Lee, Chun-Yi |
| Committee Members | 陳煥宗 Chen, Hwann-Tzong; 劉育綸 Liu, Yu-Lun |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science - Department of Computer Science |
| Year of Publication | 2023 |
| Academic Year | 111 |
| Language | English |
| Number of Pages | 45 |
| Keywords | Computer Vision, Object Pose Estimation, Diffusion Model, Lie Group |
Resolving the pose ambiguity caused by object symmetries or occlusions, and accurately predicting an object's 6D pose from a single RGB image, poses a significant challenge. To address it, we propose a novel score-based diffusion model on the Special Euclidean group SE(3). This is the first application of an SE(3)-based score diffusion model to the image domain for pose estimation. Experimental results show that the method excels at handling pose ambiguity and mitigating perspective-induced ambiguity, and also demonstrate the robustness of our proposed surrogate Stein score formulation on SE(3). This formulation not only improves the convergence of Langevin dynamics on SE(3) but also enhances the computational efficiency of the Stein score. We thereby develop a promising method for 6D object pose estimation.
Addressing accuracy limitations and pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge, particularly due to object symmetries or occlusions. In response, we introduce a novel score-based diffusion method applied to the SE(3) group, marking the first application of diffusion models to SE(3) within the image domain, specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity, mitigating perspective-induced ambiguity, and showcasing the robustness of our surrogate Stein score formulation on SE(3). This formulation not only improves the convergence of Langevin dynamics but also enhances computational efficiency. Thus, we pioneer a promising strategy for 6D object pose estimation.
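As a minimal sketch of the sampling idea the abstract refers to (not the thesis implementation): Langevin dynamics on SE(3) updates a pose by taking a score-plus-noise step in the Lie algebra se(3) and mapping it back to the group via the exponential map. The `score_fn` below stands in for the learned Stein score network and is hypothetical, and the translation part of the exponential map is simplified to first order.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def se3_exp(xi):
    """Map a se(3) tangent vector xi = [omega, v] to a 4x4 homogeneous pose.

    Simplification: the translation uses the first-order approximation
    V ~= I instead of the full left-Jacobian of SO(3).
    """
    omega, v = xi[:3], xi[3:]
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(omega).as_matrix()
    T[:3, 3] = v
    return T


def langevin_step(T, score_fn, step=1e-3, rng=None):
    """One Langevin update on SE(3).

    score_fn(T) returns a 6-vector score in the tangent space at T
    (a learned network in the thesis; any callable here). The update
    xi = step * score + sqrt(2 * step) * noise is applied on the right
    through the exponential map, keeping the iterate on the manifold.
    """
    if rng is None:
        rng = np.random.default_rng()
    s = score_fn(T)
    noise = rng.standard_normal(6)
    xi = step * s + np.sqrt(2.0 * step) * noise
    return T @ se3_exp(xi)
```

Because every update passes through the exponential map, each iterate remains a valid rigid-body transform, which is the practical reason for working on the group rather than on a flat parameterization.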