| Student: | Solarte, Bolivar Enrique (宋子宣) |
|---|---|
| Thesis Title: | Multi-view Consistency for Self-training Room Layout Geometries (多視角一致性用於自監督學習的房間佈局幾何) |
| Advisor: | Sun, Min (孫民) |
| Committee Members: | Chiu, Wei-Chen (邱維辰); Peng, Wen-Chih (彭文志); Huang, Ching-Chun (黃敬群); Lin, Chia-Wen (林嘉文); Chen, Hwann-Tzong (陳煥宗) |
| Degree: | Doctor |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science (電機資訊學院 - 電機工程學系) |
| Year of Publication: | 2024 |
| Graduating Academic Year: | 113 |
| Language: | English |
| Pages: | 114 |
| Keywords (Chinese): | 全景影像 (panoramic images), 半監督式學習 (semi-supervised learning), 多視角幾何 (multi-view geometry) |
| Keywords (English): | 360 images, semi-supervised learning, multi-view geometry |
Multi-view consistency is a fundamental and important constraint in geometry perception tasks, including depth estimation, semantic segmentation, and scene understanding. However, applying multi-view consistency to geometric structures such as indoor room layouts still poses several unresolved challenges. For example, monocular room layout estimation suffers from an uncertain layout scale because the camera height is unknown, so the true scale of the reconstruction cannot be recovered. Moreover, when multiple layout estimates are integrated into a single reconstruction, this scale ambiguity further complicates the integration. Handling the layout scale together with the monocular scale of the camera poses is therefore a key problem. In addition, self-training and evaluating model performance without annotations is highly valuable for real-world applications, yet it has not been addressed in the literature. In this dissertation, we tackle these challenges by defining multi-view layout consistency (MLC), in particular from multiple monocular 360 panoramic images.
In the first chapter, we focus on the task of relative camera pose estimation, exploiting the spherical projection of 360 images through a novel normalization to obtain a more numerically stable camera pose solution.
In the second chapter, we propose a multi-view 360-layout registration method to align multiple layouts and estimated camera poses, applied to the task of floor plan estimation.
In the third chapter, we introduce a self-training approach for room layout estimation that constructs pseudo-labels from multi-view 360-layout predictions with known camera poses. This solution allows us to self-train room layout models on new data without any labeled annotations.
Finally, in the fourth chapter, we propose a ray-casting pseudo-labeling method that is more effective than the approach of the third chapter, computing more geometrically meaningful pseudo-labels for self-training. This method requires no geometric assumptions (such as the Manhattan World or planar-wall structures), addresses important issues such as wall occlusion and geometric inconsistency, and uses no human-labeled annotations throughout.
Multi-view consistency (MVC) is a fundamental constraint in geometry perception tasks, including depth estimation, semantic segmentation, and scene understanding. However, applying MVC to geometries such as room layouts presents several unresolved challenges. For instance, monocular room layout estimation is inherently scale-ambiguous because the camera height is unknown, which results in an uncertain layout scale. Integrating multiple layout estimates from a monocular camera is further complicated by the scale ambiguity of monocular camera pose estimation. Consequently, handling both the layout scales and the monocular scale of the camera poses remains an open challenge. Additionally, self-training room layout models and evaluating their performance without labeled annotations have significant implications for real-world applications, yet these problems have not been addressed in the literature. In this dissertation, we address these challenges by defining multi-view layout consistency (MLC), particularly from multiple 360 images captured in a monocular fashion.
In the first chapter, we address the task of relative camera pose estimation, exploiting the spherical projection of 360 images through a novel normalization that yields a more numerically stable camera pose solution.
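As a concrete illustration (not the dissertation's exact formulation), the sketch below maps equirectangular pixel coordinates to unit bearing vectors on the sphere and feeds the correspondences to a linear 8-point-style solver for the essential matrix. The axis convention and function names are assumptions, and the chapter's novel normalization is not reproduced here; only the standard spherical setup is shown.

```python
import numpy as np

def bearings_from_equirect(uv, width, height):
    """Map equirectangular pixels (u, v) to unit bearing vectors on the sphere.
    Axis convention (x right, y down, z forward) is an illustrative assumption."""
    u, v = uv[:, 0], uv[:, 1]
    lon = (u / width) * 2.0 * np.pi - np.pi        # longitude in [-pi, pi]
    lat = np.pi / 2.0 - (v / height) * np.pi       # latitude in [-pi/2, pi/2]
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=1)             # already unit norm

def essential_from_bearings(b1, b2):
    """Linear 8-point-style estimate of E from bearings with b2^T E b1 = 0
    (requires at least 8 correspondences)."""
    A = np.einsum("ni,nj->nij", b2, b1).reshape(len(b1), 9)   # one row per constraint
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)                       # null-space vector as vec(E)
    # Project onto the essential manifold: two equal singular values, third zero.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```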
In the second chapter, we propose a multi-view 360-layout registration method to align multiple layout and camera pose estimates, particularly for the task of floor plan estimation.
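The following is a minimal sketch of the registration idea under simplifying assumptions: each layout boundary is given as floor-plane points normalized to a unit camera height, the camera poses are already expressed in a common world frame, and a single known camera height resolves the per-view scale. The actual method estimates these scales rather than assuming them, so this is only an illustration of how aligned layouts would be brought into one frame.

```python
import numpy as np

def register_layouts(layouts_cam, poses, camera_height=1.0):
    """Bring per-view layout boundaries into a common world frame.

    layouts_cam   : list of (N, 2) floor-plane boundary points, each expressed in its
                    camera frame and normalized to a unit camera height (assumption).
    poses         : list of (R, t) world-from-camera rotations (3x3) and translations (3,).
    camera_height : assumed metric camera height used to resolve the layout scale.
    """
    registered = []
    for pts, (R, t) in zip(layouts_cam, poses):
        pts_metric = pts * camera_height                          # resolve per-view scale
        y = np.full(len(pts_metric), camera_height)               # floor plane below camera (y-down assumed)
        pts_3d = np.column_stack([pts_metric[:, 0], y, pts_metric[:, 1]])
        registered.append((R @ pts_3d.T).T + t)                   # transform into world frame
    return np.vstack(registered)
```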
In the third chapter, we introduce a self-training approach for room layout estimation that leverages the multi-view 360-layout registration to construct pseudo-labels from noisy estimates. This solution allows us to self-train room layout models in a new data domain without using any labeled annotations.
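A hedged sketch of how multi-view evidence can be turned into a pseudo-label: boundary points registered in the world frame are reprojected into a reference 360 view and aggregated per image column with a median, yielding a 1D boundary signal in the spirit of horizon-based layout representations. The projection conventions and the aggregation rule are illustrative, not the dissertation's exact procedure.

```python
import numpy as np

def pseudo_label_from_views(world_pts, R_ref, t_ref, width=1024, height=512):
    """Project registered boundary points from multiple views into a reference 360 view
    and aggregate them per image column into a 1D boundary pseudo-label."""
    pts_cam = (R_ref.T @ (world_pts - t_ref).T).T                 # world -> reference camera
    lon = np.arctan2(pts_cam[:, 0], pts_cam[:, 2])
    lat = np.arcsin(-pts_cam[:, 1] / np.linalg.norm(pts_cam, axis=1))
    u = ((lon + np.pi) / (2.0 * np.pi) * width).astype(int) % width
    v = (np.pi / 2.0 - lat) / np.pi * height

    label = np.full(width, np.nan)                                # NaN = no evidence for this column
    for col in range(width):
        hits = v[u == col]
        if hits.size:
            label[col] = np.median(hits)                          # robust per-column aggregation
    return label
```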
Finally, in the fourth chapter, we propose a ray-casting pseudo-labeling approach that estimates geometry-aware pseudo-labels for self-training room layout models, addressing important issues such as wall occlusion and inconsistent geometries without any geometric assumptions such as the Manhattan World or planar-wall structures, and without using any human-labeled annotations.
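To illustrate the ray-casting idea, the sketch below assumes the registered multi-view boundary evidence has already been projected onto the floor plane (x-z) of the target camera; each angular bin keeps its nearest point, so walls occluded from the target viewpoint do not leak into the pseudo-label. The angular binning and nearest-hit rule are simplifications of the dissertation's formulation.

```python
import numpy as np

def raycast_boundary(points_xz, num_rays=1024):
    """Ray-casting aggregation: for each horizontal direction from the camera center,
    keep the closest boundary point, so occluded walls do not corrupt the pseudo-label.

    points_xz : (N, 2) floor-plane points in the target camera frame (assumption).
    Returns per-ray distances; np.inf marks rays with no evidence.
    """
    angles = np.arctan2(points_xz[:, 0], points_xz[:, 1])          # ray direction of each point
    dists = np.linalg.norm(points_xz, axis=1)
    bins = ((angles + np.pi) / (2.0 * np.pi) * num_rays).astype(int) % num_rays

    boundary = np.full(num_rays, np.inf)
    for b, d in zip(bins, dists):
        boundary[b] = min(boundary[b], d)                          # nearest hit per ray
    return boundary
```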