
Student: Tseng, Wei-Cheng (曾偉誠)
Thesis Title: CLA-NeRF: Category-Level Articulated Neural Radiance Field (類別層級之關節式形變類神經輻射場)
Advisor: Sun, Min (孫民)
Committee Members: Chen, Hwann-Tzong (陳煥宗); Wang, Yu-Chiang (王鈺強)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110
Language: English
Number of Pages: 30
Keywords: Deep Learning, Novel-View Synthesis, NeRF, Articulated Object, Computer Vision


    We propose CLA-NeRF -- a Category-Level Articulated Neural Radiance Field that can perform view synthesis, part segmentation, and articulated pose estimation. CLA-NeRF is trained at the object category level using no CAD models and no depth, but a set of RGB images with ground truth camera poses and part segments. During inference, it only takes a few RGB views (i.e., few-shot) of an unseen 3D object instance within the known category to infer the object part segmentation and the neural radiance field. Given an articulated pose as input, CLA-NeRF can perform articulation-aware volume rendering to generate the corresponding RGB image at any camera pose. Moreover, the articulated pose of an object can be estimated via inverse rendering. In our experiments, we evaluate the framework across five categories on both synthetic and real-world data. In all cases, our method shows realistic deformation results and accurate articulated pose estimation. We believe that both few-shot articulated object rendering and articulated pose estimation open doors for robots to perceive and interact with unseen articulated objects. Please see https://weichengtseng.github.io/project_website/icra22/index.html for qualitative results.
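
    The abstract describes two operations that a concrete example helps to ground: articulation-aware volume rendering conditioned on a joint angle, and articulated pose estimation by inverse rendering. Below is a minimal PyTorch-style sketch, not the thesis implementation; the names `field` (a radiance-field network returning per-point colour and density), `part_mask_fn` (a predicate marking which ray samples belong to the movable part), and the helper functions are hypothetical placeholders introduced only for illustration. The sketch bends samples on the movable part back into the canonical articulation state before querying the field, composites them with the standard NeRF quadrature, and recovers the joint angle by gradient descent on a photometric loss.

```python
# Illustrative sketch only, not the thesis code. `field(points) -> (rgb, sigma)`
# and `part_mask_fn(points) -> bool mask` are hypothetical placeholders.
import torch


def rotate_about_axis(points, angle, axis, pivot):
    """Rotate `points` by `angle` (radians) about `axis` through `pivot` (Rodrigues' formula)."""
    axis = axis / axis.norm()
    p = points - pivot
    cos, sin = torch.cos(angle), torch.sin(angle)
    p_rot = (p * cos
             + torch.cross(axis.expand_as(p), p, dim=-1) * sin
             + axis * (p @ axis).unsqueeze(-1) * (1.0 - cos))
    return p_rot + pivot


def render_ray(field, origin, direction, joint_angle, part_mask_fn,
               axis, pivot, n_samples=64, near=0.5, far=3.5):
    """Articulation-aware volume rendering of one ray (simplified, coarse sampling only)."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t.unsqueeze(-1) * direction        # sample points along the ray
    movable = part_mask_fn(pts)                       # samples on the articulated part
    canon = pts.clone()
    # Undo the articulation so the field is always queried in its canonical state.
    canon[movable] = rotate_about_axis(pts[movable], -joint_angle, axis, pivot)
    rgb, sigma = field(canon)                         # (n_samples, 3), (n_samples,)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                           # standard NeRF compositing weights
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)   # composited ray colour


def estimate_joint_angle(field, rays, target_rgb, part_mask_fn, axis, pivot, steps=200):
    """Inverse rendering: optimise the joint angle so rendered pixels match observed ones."""
    angle = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([angle], lr=1e-2)
    for _ in range(steps):
        pred = torch.stack([render_ray(field, o, d, angle, part_mask_fn, axis, pivot)
                            for o, d in rays])
        loss = ((pred - target_rgb) ** 2).mean()      # photometric loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return angle.detach()
```

    Under the same assumptions, `estimate_joint_angle(field, rays, target_rgb, part_mask_fn, axis, pivot)` returns the joint angle that best reproduces the observed pixels; because the rendering path is differentiable end to end, this sketch needs no separate pose network.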

    Table of Contents
    Acknowledgements
    摘要 (Chinese Abstract)
    Abstract
    1 Introduction
    2 Related Works
    2.1 Articulated 3D Shape Representations
    2.2 Articulated Object Pose Estimation
    2.3 Preliminaries: NeRF
    3 Method
    3.1 Category-Level Semantic NeRF
    3.2 Joint Attributes Estimation
    3.3 Articulation-aware Volume Rendering
    3.4 Articulated Pose Estimation
    4 Experiments
    4.1 Implementation Detail
    4.1.1 Rigid Body Transform
    4.1.2 Detail Experimental Setting
    4.2 Dataset
    4.2.1 Synthetic data
    4.2.2 Real-world data
    4.3 View Synthesis and Part Segmentation
    4.4 Articulated Pose Estimation
    4.5 Failure Cases
    5 Conclusion
    References
