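Self-Supervised 360° Room Layout Estimation (自監督式環景室內格局預測) | National Tsing Hua University Electronic Theses and Dissertations Repository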


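Graduate student:	丁浩文 Ting, Hao-Wen
Thesis title:	Self-Supervised 360° Room Layout Estimation (自監督式環景室內格局預測)
Advisor:	陳煥宗 Chen, Hwann-Tzong
Committee members:	孫民 Sun, Min; 邱維辰 Chiu, Wei-Chen
Degree:	Master
Department:	College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication:	2022
Academic year of graduation:	110 (ROC calendar)
Language:	English
Number of pages:	35
Keywords (translated from Chinese):	360° room layout estimation, self-supervised learning, differentiable rendering, multi-view 3D tasks, Manhattan world
Keywords (English):	360° room layout, self-supervised learning, differentiable rendering, multi-views 3D, Manhattan world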

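This thesis proposes the first self-supervised method for 360° room layout estimation, which can train a model without any labeled data. In depth estimation, a depth map typically represents the depth of every pixel and therefore places relatively strict constraints on reprojection; by contrast, room layout representations are usually sparse and topological, and this difference hinders the computation of image reprojection consistency in self-supervised learning. To solve this problem, the thesis proposes Differentiable Layout View Rendering, which, given the layout predicted for the target image and a given camera pose, samples pixels from the source image and renders them into the view of the target camera. Because each rendered pixel is differentiable with respect to the predicted layout, the layout estimation model can be trained by minimizing a reprojection loss. In addition, we introduce regularization losses that encourage Manhattan alignment, ceiling-floor alignment, cycle consistency, and layout stretch consistency, further improving the predictions. We conduct self-supervised training experiments on the ZillowIndoor and MatterportLayout datasets and show strong results when data are scarce. As the first self-supervised approach to 360° room layout estimation, this work offers a valuable solution for downstream applications such as real estate and travel simulation software.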

We present the first self-supervised method for training 360° room layout estimation models without any labeled data. Unlike per-pixel dense depth, which provides abundant correspondence constraints, a layout representation is sparse and topological, which hinders the use of self-supervised reprojection consistency on images. To address this issue, we propose Differentiable Layout View Rendering, which warps a source image to the target camera pose given the layout estimated from the target image. Because each rendered pixel is differentiable with respect to the estimated layout, the layout estimation model can be trained by minimizing a reprojection loss. In addition, we introduce regularization losses that encourage Manhattan alignment, ceiling-floor alignment, cycle consistency, and layout stretch consistency, which further improve the predictions. Finally, we present the first self-supervised results on the ZillowIndoor and MatterportLayout datasets. Our approach also shows promise in data-scarce scenarios and active learning, which would be of immediate value to real estate virtual tour software.
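
As a rough illustration of the reprojection geometry that layout-based view warping relies on, the Python sketch below maps a floor pixel in a target panorama to the matching pixel in a source panorama. It is not the thesis implementation. It assumes equirectangular panoramas, two cameras with the same orientation, a known camera height and relative translation, and it handles only the floor plane rather than a full layout with walls and ceiling. All function and parameter names are hypothetical.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray (x right, y forward, z up)."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi
    return (math.cos(lat) * math.sin(lon),
            math.cos(lat) * math.cos(lon),
            math.sin(lat))


def direction_to_pixel(x, y, z, width, height):
    """Inverse of pixel_to_direction for any non-zero vector (x, y, z)."""
    lon = math.atan2(x, y)
    lat = math.atan2(z, math.hypot(x, y))
    u = (lon + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - lat) / math.pi * height - 0.5
    return u, v


def floor_point(u, v, width, height, camera_height):
    """Intersect the ray through (u, v) with the floor plane z = -camera_height."""
    dx, dy, dz = pixel_to_direction(u, v, width, height)
    if dz >= 0.0:
        return None  # ray points at or above the horizon and never hits the floor
    t = camera_height / -dz
    return (t * dx, t * dy, t * dz)


def warp_floor_pixel(u, v, width, height, camera_height, source_offset):
    """Return the source-panorama pixel that sees the same floor point.

    source_offset is the source camera position expressed in the target
    camera frame (assumption: both cameras share the same orientation).
    """
    p = floor_point(u, v, width, height, camera_height)
    if p is None:
        return None
    ox, oy, oz = source_offset
    return direction_to_pixel(p[0] - ox, p[1] - oy, p[2] - oz, width, height)
```

Every step here is ordinary arithmetic on the plane parameters, so the same computation written in an automatic differentiation framework would let a reprojection loss propagate gradients back to the predicted geometry, which is the property the abstract relies on.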

List of Tables
List of Figures
Abstract (in Chinese)
Abstract
1 Introduction
2 Related Work
3 Approach
3.1 Preliminaries
3.2 Assumptions
3.3 Differentiable Layout View Rendering
3.4 Losses for Self Supervision
4 Experiments
4.1 Implementation Details
4.2 Datasets
4.3 Main Results
4.4 Ablation Experiments
5 Conclusion

A More Details
A.1 Details of ceiling height inference
A.2 Detailed results for fine-tuning
A.3 Detail of training loss weights
A.4 Qualitative results of ablation experiments
Bibliography

                                

