
Student: Suraj Dengale (蘇拉傑)
Thesis Title: Controlled Panorama Data Augmentation using Diffusion Models
(Chinese title: 使用擴散模型的有效室內全景資料擴增方法研究)
Advisors: Wu, Tsai-Fu (吳財福); Sun, Min (孫民)
Committee Members: Lee, Chun-Yi (李濬屹); Yang, Yuan-Fu (楊元福)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112
Language: English
Number of Pages: 38
Chinese Keywords: panoramic images, deep learning, semantic segmentation, layout estimation, data augmentation
Foreign Keywords: 360, deep-learning, semantic-segmentation, layout-estimation, diffusion-models
Abstract (Chinese): With the proliferation of affordable 360° cameras, applying deep learning algorithms to panoramic images has become essential in computer vision. However, real-world datasets for these tasks are limited. To improve deep learning models for (1) semantic segmentation and (2) indoor layout estimation in panoramic settings, we propose a novel ControlNet-based data augmentation technique. Traditional data augmentation methods, while effective on perspective images, distort panoramas because of their unique spatial properties and continuity requirements. Data augmentation is essential because it increases training diversity, reduces overfitting, and improves the generalization of deep learning models. We therefore propose a two-step approach that first uses PanoMixSwap to obtain accurate layouts and semantic masks, and then generates high-fidelity images with a ControlNet trained on a seg2image condition. By using the Structural Similarity Index (SSIM) as a loss function during inference, we ensure that the augmented images remain structurally consistent with real data. This approach not only addresses the limitations of traditional augmentation techniques but also significantly improves the performance of vision models in indoor panoramic environments, particularly for semantic segmentation and indoor layout estimation tasks.


Abstract (English): With the rise of affordable 360° cameras, deep learning algorithms for panoramas have become vital in computer vision. However, real-world datasets for these tasks are limited. To improve deep learning models for (1) semantic segmentation and (2) indoor layout estimation in panoramic settings, we propose a novel ControlNet-based data augmentation technique. Traditional data augmentation methods, which work well for perspective images, distort panoramic images due to their unique spatial properties and continuity requirements. Data augmentation is crucial as it enhances training diversity, reduces overfitting, and improves the generalization of deep learning models. Hence, we propose a two-step approach, called ControlMixSwap, which first utilizes PanoMixSwap to obtain accurate layouts and semantic masks, then generates high-fidelity images with a ControlNet trained on a seg2image condition. By using the Structural Similarity Index Measure (SSIM) as a loss function during inference, we ensure the augmented images maintain structural consistency with real-world data. This method not only addresses the limitations of traditional augmentation techniques but also significantly enhances the performance of vision models in indoor panorama environments, specifically for semantic segmentation and indoor layout estimation tasks.
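
The key quantitative ingredient in the abstract is SSIM [1], used here as an inference-time loss rather than only as an evaluation metric. As a concrete reference point, below is a minimal PyTorch sketch of the standard SSIM measure and the corresponding loss. The 11x11 Gaussian window, sigma = 1.5, and the K1 = 0.01 / K2 = 0.03 constants are the usual defaults and are assumptions, not necessarily the thesis's exact settings.

```python
import torch
import torch.nn.functional as F


def gaussian_window(window_size: int = 11, sigma: float = 1.5) -> torch.Tensor:
    """Normalized 2D Gaussian window of shape (1, 1, window_size, window_size)."""
    coords = torch.arange(window_size, dtype=torch.float32) - window_size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    window = g[:, None] * g[None, :]
    return window[None, None, :, :]


def ssim(x: torch.Tensor, y: torch.Tensor, data_range: float = 1.0) -> torch.Tensor:
    """Mean SSIM between two image batches of shape (B, C, H, W), values in [0, data_range]."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    channels = x.shape[1]
    window = gaussian_window().to(device=x.device, dtype=x.dtype)
    window = window.repeat(channels, 1, 1, 1)  # one window per channel (depthwise conv)
    pad = window.shape[-1] // 2  # keeps output size; the canonical formulation crops borders instead

    # Local means, variances, and covariance under the Gaussian window.
    mu_x = F.conv2d(x, window, padding=pad, groups=channels)
    mu_y = F.conv2d(y, window, padding=pad, groups=channels)
    var_x = F.conv2d(x * x, window, padding=pad, groups=channels) - mu_x ** 2
    var_y = F.conv2d(y * y, window, padding=pad, groups=channels) - mu_y ** 2
    cov_xy = F.conv2d(x * y, window, padding=pad, groups=channels) - mu_x * mu_y

    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return ssim_map.mean()


def ssim_loss(generated: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """0 when the two images are structurally identical; grows as their structure diverges."""
    return 1.0 - ssim(generated, reference)


if __name__ == "__main__":
    # Toy check on small random tensors.
    a = torch.rand(1, 3, 64, 128)
    print(ssim(a, a.clone()).item())                 # approximately 1.0 for identical images
    print(ssim_loss(a, torch.rand_like(a)).item())   # greater than 0 for unrelated images
```

In the SSIM-guided inference described by the thesis (Section 4.3), a loss of this form would be evaluated between the ControlNet output and a structurally matching reference image so that generation can be steered toward structural consistency; the exact guidance mechanism is detailed in the thesis itself.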

Table of Contents
Acknowledgements
Chinese Abstract (摘要)
Abstract
Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  Related Work
  2.1 Data Augmentation
  2.2 Diffusion for Conditional Image Generation
  2.3 Task Specific Image Generation
Chapter 3  Preliminary
  3.1 Diffusion Models
  3.2 ControlNet for Conditional Image Generation
  3.3 Evaluation Metrics for Generated Images
  3.4 PanoMixSwap for Data Augmentation
Chapter 4  Method
  4.1 PanoMixSwap for Layout Generation
  4.2 ControlNet Training
  4.3 SSIM Guided ControlNet Inference
Chapter 5  Experiments
  5.1 Experimental Setup
    5.1.1 Dataset
    5.1.2 Semantic Segmentation
    5.1.3 Layout Estimation
    5.1.4 SSIM Guided ControlNet
  5.2 Experimental Results
    5.2.1 Semantic Segmentation
    5.2.2 Layout Estimation
  5.3 Ablation Study
    5.3.1 Comparison with SOTA Augmentation
    5.3.2 Mixed Prompt Training
Chapter 6  Conclusion
References

    [1] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
    [2] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3813–3824, 2023.
    [3] Y.-C. Hsieh, C. Sun, S. Dengale, and M. Sun, “Panomixswap: Panorama mixing via structural swapping for indoor scene understanding,” British Machine Vision Conference (BMVC), 2023.
    [4] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 6840–6851, Curran Associates, Inc., 2020.
    [5] D. Li, J. Li, H. Le, G. Wang, S. Savarese, and S. C. Hoi, “LAVIS: A one-stop library for language-vision intelligence,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), (Toronto, Canada), pp. 31–41, Association for Computational Linguistics, July 2023.
    [6] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2d-3d-semantic data for indoor scene understanding,” arXiv preprint arXiv:1702.01105, 2017.
    [7] C. Sun, M. Sun, and H. Chen, “Hohonet: 360 indoor holistic understanding with latent horizontal features,” in CVPR, 2021.
    [8] C. Sun, C. Hsiao, M. Sun, and H. Chen, “Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 1047–1056, 2019.
    [9] Z. Jiang, Z. Xiang, J. Xu, and M. Zhao, “Lgt-net: Indoor panoramic room layout estimation with geometry-aware transformer network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1654–1663, 2022.
    [10] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “Panoformer: Panorama transformer for indoor 360° depth estimation,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I, pp. 195–211, Springer, 2022.
    [11] C. Zou, A. Colburn, Q. Shan, and D. Hoiem, “Layoutnet: Reconstructing the 3d room layout from a single rgb image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2051–2059, 2018.
    [12] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.
    [13] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, “Structured3d: A large photo-realistic dataset for structured 3d modeling,” in Proceedings of The European Conference on Computer Vision (ECCV), 2020.
    [14] Y. Zhang, S. Song, P. Tan, and J. Xiao, “Panocontext: A whole-room 3d context model for panoramic scene understanding,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 668–686, Springer, 2014.
    [15] D. Seichter, P. Langer, T. Wengefeld, B. Lewandowski, D. Hoechemer, and H.-M. Gross, “Efficient and robust semantic mapping for indoor environments,” 2022.
    [16] P. Schütt, M. Schwarz, and S. Behnke, “Semantic interaction in augmented reality environments for microsoft hololens,” 2019 European Conference on Mobile Robots (ECMR), pp. 1–6, 2019.
    [17] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
    [18] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2021.
    [19] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving latent diffusion models for high-resolution image synthesis,” in The Twelfth International Conference on Learning Representations, 2024.
    [20] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
    [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds.), vol. 25, Curran Associates, Inc., 2012.
    [22] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 113–123, 2019.
    [23] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 702–703, 2020.
    [24] L. Sixt, B. Wild, and T. Landgraf, “Rendergan: Generating realistic labeled data,” Frontiers in Robotics and AI, vol. 5, p. 66, 2018.
    [25] X. Zhu, Y. Liu, J. Li, T. Wan, and Z. Qin, “Emotion classification with data augmentation using generative adversarial networks,” in Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III 22, pp. 349–360, Springer, 2018.
    [26] J.-H. Kim, W. Choo, and H. O. Song, “Puzzle mix: Exploiting saliency and local statistics for optimal mixup,” in International Conference on Machine Learning, pp. 5275–5285, PMLR, 2020.
    [27] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 6023–6032, 2019.
    [28] H. Guo, Y. Mao, and R. Zhang, “Mixup as locally linear out-of-manifold regularization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3714–3722, 2019.
    [29] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, “Manifold mixup: Better representations by interpolating hidden states,” in International conference on machine learning, pp. 6438–6447, PMLR, 2019.
    [30] J. Yoo, N. Ahn, and K.-A. Sohn, “Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8375–8384, 2020.
    [31] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
    [32] Y. Chen, V. T. Hu, E. Gavves, T. Mensink, P. Mettes, P. Yang, and C. G. Snoek, “Pointmixup: Augmentation for point clouds,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp. 330–345, Springer, 2020.
    [33] A. Umam, C.-K. Yang, Y.-Y. Chuang, J.-H. Chuang, and Y.-Y. Lin, “Point mixswap: Attentional point cloud mixing via swapping matched structural divisions,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23– 27, 2022, Proceedings, Part XXIX, pp. 596–611, Springer, 2022.
    [34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds.), vol. 27, Curran Associates, Inc., 2014.
    [35] O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman, “Make-a-scene: Scene-based text-to-image generation with human priors,” in Computer Vision – ECCV 2022 (S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, eds.), (Cham), pp. 89–106, Springer Nature Switzerland, 2022.
    [36] W. Fan, Y.-C. Chen, D. Chen, Y. Cheng, L. Yuan, and Y.-C. F. Wang, “Frido: Feature pyramid diffusion for complex scene image synthesis,” in AAAI Conference on Artificial Intelligence, 2022.
    [37] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. Lee, “Gligen: Open-set grounded text-to-image generation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (Los Alamitos, CA, USA), pp. 22511–22521, IEEE Computer Society, June 2023.
    [38] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” in International Conference on Learning Representations, 2021.
    [39] H. Xue, Z. F. Huang, Q. Sun, L. Song, and W. Zhang, “Freestyle layout-to-image synthesis,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14256–14266, 2023.
    [40] B. Yang, Y. Luo, Z. Chen, G. Wang, X. Liang, and L. Lin, “Law-diffusion: Complex scene generation by diffusion with layouts,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), (Los Alamitos, CA, USA), pp. 22612–22622, IEEE Computer Society, October 2023.
    [41] O. Avrahami, T. Hayes, O. Gafni, S. Gupta, Y. Taigman, D. Parikh, D. Lischinski, O. Fried, and X. Yin, “Spatext: Spatio-textual representation for controllable image generation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18370–18380, 2023.
    [42] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in The Eleventh International Conference on Learning Representations, 2023.
    [43] C. Qin, S. Zhang, N. Yu, Y. Feng, X. Yang, Y. Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, S. Ermon, Y. Fu, and R. Xu, “Unicontrol: A unified diffusion model for controllable visual generation in the wild,” in Advances in Neural Information Processing Systems (A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds.), vol. 36, pp. 42961–42992, Curran Associates, Inc., 2023.
    [44] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22500–22510, 2023.
    [45] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y. K. Wong, “Uni-controlnet: All-in-one control to text-to-image diffusion models,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
    [46] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, “Synthetic data from diffusion models improves imagenet classification,” Transactions on Machine Learning Research, 2023.
    [47] B. Trabucco, K. Doherty, M. A. Gurinas, and R. Salakhutdinov, “Effective data augmentation with diffusion models,” in The Twelfth International Conference on Learning Representations, 2024.
    [48] H. Ye, J. Kuen, Q. Liu, Z. Lin, B. Price, and D. Xu, “Seggen: Supercharging segmentation models with text2mask and mask2img synthesis,” ArXiv, vol. abs/2311.03355, 2023.
    [49] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello, “Open-vocabulary panoptic segmentation with text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2955–2966, June 2023.
    [50] Q. H. Nguyen, T. Vu, A. Tran, and K. Nguyen, “Dataset diffusion: Diffusion-based synthetic dataset generation for pixel-level semantic segmentation,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
    [51] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems (D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, eds.), vol. 29, Curran Associates, Inc., 2016.
    [52] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Neural Information Processing Systems, 2017.
    [53] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying MMD GANs,” in International Conference on Learning Representations, 2018.
    [54] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
    [55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
    [56] A. Horé and D. Ziou, “Image quality metrics: Psnr vs. ssim,” in 2010 20th International Conference on Pattern Recognition, pp. 2366–2369, 2010.
    [57] C. Zou, J.-W. Su, C.-H. Peng, A. Colburn, Q. Shan, P. Wonka, H.-K. Chu, and D. Hoiem, “Manhattan room layout reconstruction from a single 360° image: A comparative study of state-of-the-art methods,” International Journal of Computer Vision, vol. 129, pp. 1410–1431, 2021.
    [58] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
    [59] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
