
Graduate Student: 宋子宣 (Solarte, Bolivar Enrique)
Thesis Title: 多視角一致性用於自監督學習的房間佈局幾何
Multi-view Consistency for Self-training Room Layout Geometries
Advisor: 孫民 (Sun, Min)
Committee Members: 邱維辰 (Chiu, Wei-Chen), 彭文志 (Peng, Wen-Chih), 黃敬群 (Huang, Ching-Chun), 林嘉文 (Lin, Chia-Wen), 陳煥宗 (Chen, Hwann-Tzong)
Degree: Doctor (博士)
Department: 電機資訊學院 - 電機工程學系 (Department of Electrical Engineering)
Year of Publication: 2024
Academic Year of Graduation: 113
Language: English
Number of Pages: 114
Keywords (Chinese): 全景影像、半監督式學習、多視角幾何
Keywords (English): 360 images, semi-supervised learning, multi-view geometry
    多視角一致性是幾何感知任務中基礎且重要的約束,包括深度預測、語義分割以及場景理解。然而,將多視角一致性應用於如室內佈局等幾何結構仍存在許多尚未解決的挑戰。例如,單目室內佈局預測由於攝影機的高度未知,導致佈局尺度的不確定性,進而無法還原真實的重建比例。再者,當考慮多佈局預測的整合重建時,尺度模糊性亦使得整合更加複雜。因此,處理佈局尺度與攝影機姿勢中的單目尺度是一個關鍵的問題。另外,如何在無標註的情況下進行自我訓練以及評估模型性能對於真實世界應用意義非凡,然而在文獻中卻尚未得到解決。在本論文中,我們通過定義多視角佈局一致性(MLC),特別是從多張全景單目影像的角度,來解決這項挑戰。

    在第一章中,我們主要針對相對攝影機位姿估計任務,通過一種新穎的歸一化方法來利用全景影像的球面投影,以獲得更具數值穩定性的相機位姿解。

    在第二章中,我們提出了一種多視角全景佈局配準方法,以對齊多個佈局和估計的相機位姿,應用於平面圖估計任務。

    在第三章中,我們介紹了一種室內佈局估計的自我訓練方法,該方法利用已知相機位姿的多視角全景佈局預測來構建偽標籤。此解決方案使我們能在新的數據中自我訓練房間佈局模型且無需任何標籤註解。

    最後,在第四章中,我們提出了一種比第三章更加有效的射線投影偽標記方法,用於計算出更具有幾何意義的偽標籤,以自我訓練模型。該方法無需任何幾何假設(如曼哈頓世界或平面牆結構),解決了牆體遮擋和幾何不一致等重要問題,全程不使用任何人工標記註解。


    Multi-view consistency (MVC) is a fundamental constraint in geometry perception tasks, including depth estimation, semantic segmentation, and scene understanding. However, applying MVC to geometric structures such as room layouts presents several unresolved challenges. For instance, monocular room layout estimation is inherently scale-ambiguous because the camera height is unknown, which leaves the layout scale undetermined. Integrating multiple layout estimates from a monocular camera is further complicated by the scale ambiguity in monocular camera pose estimation. Consequently, handling both the layout scale and the monocular scale of camera poses remains an open challenge. Additionally, self-training room layout models and evaluating their performance without labeled annotations have significant implications for real-world applications, yet neither problem has been addressed in the literature. In this dissertation, we address these challenges by defining multi-view layout consistency (MLC), particularly across multiple 360-images captured in a monocular fashion.
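    As a rough illustration of this scale ambiguity (a minimal sketch, not taken from the dissertation): under a simple equirectangular floor-boundary model, every recovered wall position scales linearly with the assumed camera height, so two different height hypotheses yield the same layout up to scale.

```python
import numpy as np

# Minimal sketch (illustrative assumption, not the thesis model): a floor-boundary
# pixel at longitude phi and elevation angle theta below the horizon maps to a
# floor point at horizontal distance d = h / tan(theta), so the entire layout
# scales linearly with the assumed camera height h.
phi = np.deg2rad(30.0)        # horizontal direction of one boundary pixel
theta = np.deg2rad(20.0)      # angle below the horizon line

for h in (1.0, 1.6):          # two hypothetical camera heights (meters)
    d = h / np.tan(theta)     # distance from camera to the wall-floor corner
    corner = d * np.array([np.cos(phi), np.sin(phi)])
    print(f"h={h:.1f} m -> corner at {corner.round(2)} (same layout, different scale)")
```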

    In the first chapter, we address the task of relative camera pose estimation, exploiting the spherical projection of 360-images through a novel normalization that yields a more numerically stable camera pose solution.
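    For context, the sketch below shows the classical linear eight-point solve applied directly to unit bearing vectors obtained from the spherical projection; it is only the baseline formulation and deliberately omits the normalization proposed in this chapter.

```python
import numpy as np

# Baseline sketch (not the proposed method): linear eight-point solve on bearing
# vectors. b and b_prime are (N, 3) unit vectors from matched pixels of two
# 360-images, with N >= 8; the epipolar constraint is b'^T E b = 0 per match.
def eight_point_spherical(b, b_prime):
    # One linear equation per correspondence: kron(b'_i, b_i) . vec(E) = 0 (row-major vec)
    A = np.stack([np.kron(bp, bi) for bi, bp in zip(b, b_prime)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)                      # null-space vector -> 3x3 matrix
    # Project onto the essential-matrix manifold (two equal singular values, one zero)
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```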

    In the second chapter, we propose a multi-view 360-layout registration method that aligns multiple layout and camera pose estimates, targeting the task of floor plan estimation.
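    A hedged sketch of one way the unknown monocular translation scale could be recovered when aligning two layout estimates; the grid search below is my own simplification for illustration, not the registration algorithm of this chapter.

```python
import numpy as np

# Hedged sketch (my simplification): recover the translation scale between two
# monocular layout estimates. layout_ref, layout_src are (N, 2) floor-plane
# boundary points in each camera frame; R (2x2) and t_unit (2,) come from
# relative pose estimation with ||t|| = 1.
def recover_translation_scale(layout_ref, layout_src, R, t_unit,
                              scales=np.linspace(0.1, 10.0, 200)):
    def mean_nn_dist(pts_a, pts_b):
        d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
        return d.min(axis=1).mean()
    # Pick the scale that best overlaps the transformed source layout with the reference.
    costs = [mean_nn_dist(layout_src @ R.T + s * t_unit, layout_ref) for s in scales]
    return scales[int(np.argmin(costs))]
```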

    In the third chapter, we introduce a self-training approach for room layout estimation that leverages the multi-view 360-layout registration to construct pseudo-labels from noisy estimates. This solution allows us to self-train room layout models in a new data domain without using any labeled annotations.
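    As an illustrative sketch (the function names and the exact weighting are assumptions, not the thesis implementation), such pseudo-labels can be viewed as a column-wise consensus over boundaries re-projected into one reference panorama, with the multi-view disagreement serving as a per-column confidence weight.

```python
import numpy as np

# Hedged sketch of multi-view pseudo-labeling: several layout boundaries, already
# re-projected into one reference panorama, are aggregated column-wise.
def aggregate_boundaries(reprojected):             # (num_views, W) boundary v-coordinates
    pseudo_label = np.median(reprojected, axis=0)  # consensus boundary per image column
    spread = reprojected.std(axis=0)               # multi-view disagreement per column
    weight = 1.0 / (1.0 + spread)                  # down-weight inconsistent columns in the loss
    return pseudo_label, weight
```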

    Finally, in the fourth chapter, we propose a ray-casting pseudo-labeling approach that computes geometry-aware pseudo-labels for self-training room layout models. This method addresses important issues such as wall occlusions and inconsistent geometries without relying on any geometric assumption (e.g., the Manhattan World or planar-wall structures), and it does so entirely without human annotations.
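    A simplified sketch of the ray-casting idea, under my own assumptions about binning and voting: each ray cast from the reference camera keeps only the closest well-supported wall distance, which naturally discards points that belong to occluded walls seen from other views.

```python
import numpy as np

# Hedged sketch (my simplification of the ray-casting idea): cast one ray per
# azimuth bin from the reference camera and keep the nearest well-supported
# wall distance, ignoring far-away points that lie behind occluding walls.
def raycast_pseudo_label(points, num_rays=360, bin_size=0.05, min_votes=3):
    # points: (N, 2) registered floor-plane boundary points from all views,
    #         expressed in the reference camera frame.
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    radius = np.linalg.norm(points, axis=1)
    ray_idx = ((azimuth + np.pi) / (2 * np.pi) * num_rays).astype(int) % num_rays
    label = np.full(num_rays, np.nan)
    for r in range(num_rays):
        d = np.sort(radius[ray_idx == r])
        if d.size < min_votes:
            continue
        # Distance votes along the ray: take the nearest bin with enough support.
        hist, edges = np.histogram(d, bins=np.arange(0.0, d.max() + bin_size, bin_size))
        supported = np.nonzero(hist >= min_votes)[0]
        if supported.size:
            label[r] = edges[supported[0]] + bin_size / 2
    return label  # per-ray wall distance; NaN where no consistent evidence exists
```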

    Acknowledgements
    摘要
    Abstract
    Declaration
    Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Room-Layout Representation from 360-Images
      1.2  Multiview Layout Consistency (MLC)
      1.3  Challenges for Multiview Layout Consistency
    Chapter 2  Robust 360-8PA: Redesigning the normalized 8-point algorithm for 360-degree images
      2.1  Introduction
      2.2  Related Work
      2.3  Our Approach
        2.3.1  Spherical Projection and Bearing Vectors
        2.3.2  Epipolar Constraint and the Eight-Point Algorithm
        2.3.3  Spherical Normalization
        2.3.4  Normalization and Stability
        2.3.5  Non-linear Optimization over S and K
      2.4  Experiments
        2.4.1  Noise and Outlier Evaluation
        2.4.2  Experiments using large FoV images
        2.4.3  RANSAC Experiments
        2.4.4  Ablation Study
      2.5  Summary
    Chapter 3  360-DFPE: Leveraging Monocular 360-Layouts for Direct Floor Plan Estimation
      3.1  Introduction
      3.2  Related Work
        3.2.1  Single Room Layout Estimation
        3.2.2  Floor Plan Estimation
      3.3  Method
        3.3.1  Overview
        3.3.2  360-Layout Registration
        3.3.3  Relative Scale Recovery
        3.3.4  Room Identification
        3.3.5  Plane Estimation from Layout Geometry
        3.3.6  Room Shape Optimization
      3.4  Experiments
        3.4.2  Metrics
        3.4.3  Experimental Results
        3.4.4  Ablation Study
      3.5  Failure Cases and Limitations
      3.6  Summary
    Chapter 4  360-MLC: Multi-view Layout Consistency for Self-training and Hyper-parameter Tuning
      4.1  Introduction
      4.2  Related Work
      4.3  Our Approach
        4.3.1  Multi-view Layout Re-projection
        4.3.2  Weighted Boundary Consistency Loss
        4.3.3  Multi-view Layout Consistency Metric
      4.4  Experiments
        4.4.1  Experimental Setup
        4.4.2  Experimental Results
        4.4.3  Ablation Study
      4.5  Limitations
      4.6  Summary
    Chapter 5  Self-training Room Layout Estimation via Geometry-aware Ray-casting
      5.1  Introduction
      5.2  Related Work
        5.2.0.1  Room Layout Estimation
        5.2.0.2  Multi-view Layout
        5.2.0.3  Semi-Supervised and Self-training Layout Estimation
      5.3  Proposed Method
        5.3.1  Self-training Room Layout with Multi-view Layout Consistency
        5.3.2  Pseudo-labeling by Ray-casting
          5.3.2.1  Probability distribution on a ray
          5.3.2.2  Multi-cycle ray-casting for pseudo-labeling
        5.3.3  Weighted Distance Loss
      5.4  Experiments
        5.4.1  Experimental Setup
          5.4.1.1  Baseline and Model Backbones
          5.4.1.2  Datasets
          5.4.1.3  Evaluation Metrics
          5.4.1.4  Implementation Details
        5.4.2  Quantitative Results
          5.4.2.1  Evaluation using HorizonNet Backbone
          5.4.2.2  Evaluation using LGTNet Backbone
        5.4.3  Qualitative Results
          5.4.3.1  Qualitative Results on Panoramic Images
          5.4.3.2  Qualitative Pseudo-labels Results
          5.4.3.3  Qualitative Results on Real-world Data
        5.4.4  Ablation Study for Weighted Distance Loss Formulation
      5.5  Summary
    Chapter 6  Conclusions
    References

